CIKB poster_2015

Philip Morris International R&D, Quai Jeanrenaud 5, 2000 Neuchâtel, Switzerland pmi.com, pmiscience.com

Agile Development and Fast Data Integration in a Large R&D Company

Antonio Castellon (1), Pavel Pospisil (2)

(1) blue-infinity, Geneva, Switzerland; (2) Philip Morris International R&D, Philip Morris Products S.A., Neuchâtel, Switzerland

Research and Development departments in large companies like Philip Morris International (PMI) deal with extremely large quantities of data, including primary scientific data, results from chemical, biological and toxicological assays, and data from clinical investigations. These data are recorded in different formats and originate from scientific and laboratory management systems, data warehouses and documents. The question is: how do we provide users with a convenient, single-point-of-entry overview of all results, and fast?

Here we present an approach to data integration using the best technologies and architectures currently available, including an OSGi architecture combined with Micro-services, document and graph-based databases, and web-based user interfaces. The first prototype was completed by just one scientist and one IT developer in less than 3 months.

The product is named ChemoInformatics KnowledgeBase (CIKB). CIKB is a chemocentric database that, within PMI, assembles data on the chemical constituents present in the aerosol of our products (e.g. conventional cigarettes, e-cigarettes) and associates them with both internal and publicly available scientific data. It is chemocentric because the chemical substance is the central node to which all other information is attached; the model can later be extended to center on other entities.
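The chemocentric idea can be illustrated with a minimal in-memory property graph. This is only a sketch: the poster does not describe the actual CIKB schema or database product, so the class `ChemoGraph`, the node labels, the relationship types and all values below are hypothetical placeholders.

```python
from collections import defaultdict


class ChemoGraph:
    """Tiny in-memory property graph; a stand-in for a real graph database."""

    def __init__(self):
        self.nodes = {}                 # node id -> {"label": ..., "props": {...}}
        self.edges = defaultdict(list)  # node id -> [(relationship type, other id)]

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = {"label": label, "props": props}

    def relate(self, source_id, rel_type, target_id):
        # Store the edge at both ends so it can be walked from either node.
        self.edges[source_id].append((rel_type, target_id))
        self.edges[target_id].append((rel_type, source_id))

    def overview(self, node_id):
        """Group everything attached to one node by the label of the neighbour."""
        grouped = defaultdict(list)
        for rel_type, other_id in self.edges[node_id]:
            other = self.nodes[other_id]
            grouped[other["label"]].append((rel_type, other["props"]))
        return dict(grouped)


graph = ChemoGraph()
# All identifiers and values are placeholders, not real CIKB data.
graph.add_node("chem:1", "Chemical", name="example constituent")
graph.add_node("prod:1", "Product", name="example product")
graph.add_node("assay:1", "InVitroAssay", source="internal", result="placeholder")
graph.add_node("list:1", "RegulatoryList", source="public", name="example list")

# Every data node hangs off the chemical node: that is the chemocentric hub.
graph.relate("chem:1", "MEASURED_IN", "prod:1")
graph.relate("chem:1", "TESTED_IN", "assay:1")
graph.relate("chem:1", "LISTED_ON", "list:1")

summary = graph.overview("chem:1")
print(sorted(summary))
```

Calling `overview` on a different node, for example `graph.overview("prod:1")`, gives a view centered on that node instead, which is one way to picture the later expansion to other centers.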

1. Concept 2. Architecture 3. Project Management

4. Conclusion

Pros

• Rapid development of new features

• Reduction in complexity with the use of separate services

• Highly flexible and adaptable environment enabling easy integration of evolving user requirements, using new languages and/or frameworks for each service

• Easy to redistribute the services to several servers according to computational and bandwidth requirements

• Easy deployment; only requires installation of new services and not the full application

• The graph database allows better adaptability and flexibility in the schema model than traditional RDBMSs

• Clean decoupling of the view and data layers through the AngularJS framework

• Easy redistribution of the work packages among future teams

Cons

• Requires a multidisciplinary developer or a group of developers with skills in different environments (server, languages, frameworks, etc.)

• Although graph databases allow flexible model generation, it is crucial to maintain regular discussions with scientists regarding the desired model, to define the correct data type (node/relationship/property/etc.)

• The continuous integration of diverse data types derived from different scientific disciplines requires an ongoing learning process
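The node/relationship/property decision mentioned in the Cons can be made concrete with a small comparison. This is a sketch under stated assumptions: the identifiers, the `TESTED_IN` relationship type and all values are hypothetical and do not come from the CIKB model.

```python
# Option A: the assay result is stored as a property on the chemical node.
# Simple, but each chemical can hold only one value under that property name.
chemicals_a = [
    {"id": "chem:1", "assay_result": "placeholder-1"},
    {"id": "chem:2", "assay_result": "placeholder-2"},
]

# Option B: the assay is a node of its own, and each result lives on the
# relationship that links a chemical to that assay.
relationships_b = [
    {"from": "chem:1", "type": "TESTED_IN", "to": "assay:1", "result": "placeholder-1"},
    {"from": "chem:2", "type": "TESTED_IN", "to": "assay:1", "result": "placeholder-2"},
    {"from": "chem:2", "type": "TESTED_IN", "to": "assay:2", "result": "placeholder-3"},
]


def chemicals_tested_in(assay_id, relationships):
    """Return the sorted ids of all chemicals linked to one assay node."""
    return sorted(
        rel["from"]
        for rel in relationships
        if rel["type"] == "TESTED_IN" and rel["to"] == assay_id
    )


def assays_for(chemical_id, relationships):
    """Return the sorted ids of all assays one chemical is linked to."""
    return sorted(
        rel["to"]
        for rel in relationships
        if rel["type"] == "TESTED_IN" and rel["from"] == chemical_id
    )


print(chemicals_tested_in("assay:1", relationships_b))
print(assays_for("chem:2", relationships_b))
```

Option B answers questions such as "which chemicals share an assay" directly, at the cost of a larger model. Which option fits depends on how scientists intend to query the data, which is why the modelling discussions matter.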

The architecture solution is an OSGi platform for core services, incorporating Micro-services for the remainder, in order to manage challenges (ambiguity, inconsistency, vagueness, incompleteness) arising from the complexity of the data models and the features required by the users. Furthermore, it was essential to be able to provide users with different interfaces with minimal effort.

Less is More

Following the rule that “less is more,” application development started with minimal user requirements based on common sense. Complex features were defined as ‘epics’ in the JIRA tracking system. Simple features were defined less formally in internal documents, email communications and meeting notes.

[Figure: the CIKB graph database integrates PMI internal data and publicly available data behind one comparative, graphical, user-friendly interface. Labels in the figure: PMI measured data, calculated data, chemical clustering, chemoinformatics, bioinformatics, properties calculation, in vitro toxicology, in vivo toxicology, system toxicology, analytical chemistry, aerosol physics, aerosol chemistry, product portfolio, flavor & sensory data, flavor properties, regulatory lists, toxicity data, chemical properties.]

To make communication between the scientific and IT worlds as efficient as possible, two important roles were defined for this project: the Scientific Leader and the IT Leader.

As can be seen, there is a high diversity of data types, including product names, toxicities and calculated data, to name but a few. Developing the correct integration strategy for a single unique repository requires multidisciplinary knowledge and significant effort.

The architecture employed is robust because each service is self-contained and depends on other services only for security validation. At the same time, it provides the flexibility to respond rapidly to user feedback by modifying features or adding technologies.
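The dependency rule described here can be sketched as follows. This is an illustration only, not the CIKB implementation: the classes `TokenValidator`, `Service`, `PropertyService` and `RegulatoryService`, the token, and the stored values are all hypothetical.

```python
class TokenValidator:
    """Stand-in for the shared security validation service."""

    def __init__(self, valid_tokens):
        self._valid_tokens = set(valid_tokens)

    def is_valid(self, token):
        return token in self._valid_tokens


class Service:
    """Base class: the validator is the only collaborator a service receives."""

    def __init__(self, validator):
        self._validator = validator

    def handle(self, token, query):
        # Security validation is the single cross-service dependency.
        if not self._validator.is_valid(token):
            raise PermissionError("invalid token")
        return self.answer(query)

    def answer(self, query):
        raise NotImplementedError


class PropertyService(Service):
    """Owns its own data; knows nothing about RegulatoryService."""

    def __init__(self, validator):
        super().__init__(validator)
        self._store = {"chem:1": {"property": "placeholder"}}

    def answer(self, query):
        return self._store.get(query, {})


class RegulatoryService(Service):
    """Owns its own data; knows nothing about PropertyService."""

    def __init__(self, validator):
        super().__init__(validator)
        self._lists = {"chem:1": ["example list"]}

    def answer(self, query):
        return self._lists.get(query, [])


validator = TokenValidator({"demo-token"})
properties = PropertyService(validator)
regulatory = RegulatoryService(validator)

print(properties.handle("demo-token", "chem:1"))
print(regulatory.handle("demo-token", "chem:1"))
```

Because neither service references the other, either one can be replaced, rewritten in another language, or moved to a different server without touching the rest, as long as it keeps using the same validation step.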

[Figure: project workflow across the scientific domain and the IT domain. Labels in the figure: Scientific User 1 ... Scientific User <n>, Business User, Scientific Leader, IT Leader, IT dev. 1 ... IT dev. <n>, JIRA, User req., User + System req. = Tasks, Tasks, Feedback + bugs.]

Scientific Leader

To cover the diverse range of scientific disciplines and to shield the IT developers from the complexity of the science, this role collects the needs and wishes of the scientists and prioritizes each requirement from a business-use point of view.

IT Leader

In order to evaluate the complexity for each new requirement, the IT Leader will analyze each request and propose solutions. Tasks are then created and distributed to developers in the form of work packages with due dates. Finally, the Scientific and IT leaders regularly check that the solutions meet business needs.

