Philip Morris International R&D, Quai Jeanrenaud 5, 2000 Neuchâtel, Switzerland pmi.com, pmiscience.com
Agile Development and Fast Data Integration in a Large R&D Company
Antonio Castellon¹, Pavel Pospisil²
¹blue-infinity, Geneva, Switzerland; ²Philip Morris International R&D, Philip Morris Products S.A., Neuchâtel, Switzerland
Research and Development departments in large companies like Philip Morris International (PMI) deal with extremely large quantities of data in a variety of formats, including primary scientific data, results from chemical, biological and toxicological assays, and data from clinical investigations. These data are recorded in different formats and come from scientific and laboratory management systems, data warehouses and documents. The question is: how do we provide users with a convenient, single-point-of-entry overview of all results, and do so quickly?
Here we present an approach to data integration that uses the best technologies and architectures currently available, including an OSGi architecture combined with Micro-services, document and graph-based databases, and web-based user interfaces. The first prototype was completed by just one scientist and one IT developer in less than 3 months.
The product is named ChemoInformatics KnowledgeBase (CIKB). CIKB is a chemocentric database that (concretely within PMI) assembles data for chemical constituents present in the aerosol of our products (e.g. conventional cigarettes, e-cigarettes) and associates them with both internal and publicly available scientific data. It is chemocentric because it is built around the most essential node of information, the chemical substance; however, it can later be expanded in any other-centric direction.
1. Concept 2. Architecture 3. Project Management 4. Conclusion
Pros
• Rapid development of new features
• Reduction in complexity with the use of separate services
• Highly flexible and adaptable environment enabling easy integration of evolving user requirements using new languages and/or frameworks from each service
• Easy to redistribute the services to several servers according to computational and bandwidth requirements
• Easy deployment; only requires installation of new services and not the full application
• The graph database allows better adaptability and flexibility in the schema model than traditional RDBMSs
• Clean decoupling of the view and data layers through the AngularJS framework
• Easy redistribution of the work packages into future teams
Cons
• Requires a multidisciplinary developer or a group of developers with skills in different environments (server, languages, frameworks, etc.)
• Although graph databases allow flexible model generation, it is crucial to maintain regular discussions with scientists regarding the desired model, to define the correct data type (node/relationship/property/etc.)
• The continuous integration of diverse data types derived from different scientific disciplines requires an ongoing learning process
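The modelling decision mentioned above (node, relationship or property) can be illustrated with a minimal in-memory sketch. It does not use CIKB's actual graph database or schema; the classes `Node`, `Relationship` and `Graph`, the label `MEASURED_IN`, and all identifiers and values are hypothetical placeholders chosen for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                      # kind of entity, e.g. "Chemical" or "Product"
    key: str                        # unique identifier within the graph
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    source: str                     # key of the start node
    kind: str                       # relationship type, e.g. "MEASURED_IN"
    target: str                     # key of the end node
    properties: dict = field(default_factory=dict)


class Graph:
    """Tiny property graph: nodes and typed, directed relationships."""

    def __init__(self):
        self.nodes = {}
        self.relationships = []

    def add_node(self, label, key, **properties):
        node = Node(label, key, dict(properties))
        self.nodes[key] = node
        return node

    def relate(self, source, kind, target, **properties):
        # Both ends must already exist, otherwise the model is inconsistent.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both nodes must exist before relating them")
        rel = Relationship(source, kind, target, dict(properties))
        self.relationships.append(rel)
        return rel

    def neighbours(self, key, kind=None):
        # Nodes reachable from `key` over outgoing relationships,
        # optionally restricted to one relationship type.
        return [
            self.nodes[r.target]
            for r in self.relationships
            if r.source == key and (kind is None or r.kind == kind)
        ]


# Hypothetical example: the chemical is the central node; a measured amount
# is stored as a property on the relationship rather than on either node.
graph = Graph()
graph.add_node("Chemical", "chem-001", name="example constituent")
graph.add_node("Product", "prod-A", name="example product")
graph.relate("chem-001", "MEASURED_IN", "prod-A", amount=1.0, unit="placeholder")
```

Storing the amount on the relationship keeps one chemical node per substance even when it is measured in several products; whether a given item should instead be a node or a node property is exactly the kind of question to settle with the scientists.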
The architecture is an OSGi platform for the core services, with Micro-services for the remainder, chosen to manage the challenges (ambiguity, inconsistency, vagueness, incompleteness) that arise from the complexity of the data models and the features required by the users. It was also essential to provide users with different interfaces with as little effort as possible.
Less is More
According to the rule that “less is more,” the application development started with minimal user requirements based upon common sense. The complex features were defined as ‘epics’ in the tracking system JIRA. The simple features were defined less formally within internal documents, email communications and meeting notes.
[Figure: the CIKB graph database feeding one comparative, graphical, user-friendly interface. Labels in the figure: PMI internal data, PMI measured data, calculated data, publicly available data, chemical clustering, chemoinformatics, bioinformatics, in vitro toxicology, in vivo toxicology, system toxicology, analytical chemistry, properties calculation, aerosol physics, aerosol chemistry, product portfolio, flavor & sensory data, flavor properties, regulatory lists, toxicity data, chemical properties.]
To be most efficient in communicating between the Scientific & Computer worlds, two important roles were defined for this project:
As can be seen, the data types are highly diverse, including product names, toxicities and calculated data, to name but a few. Developing the correct strategy for integrating them into a single repository requires a degree of multidisciplinary knowledge and significant effort.
The architecture employed is robust because each service is self-contained and depends on other services only for security validation. At the same time, it provides the flexibility to create rapid solutions in response to user feedback, with the ability to modify features or add technologies.
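As a rough illustration of a self-contained service whose only external dependency is security validation, consider the sketch below. It is not the CIKB implementation and uses neither OSGi nor a real Micro-service framework; `SecurityService`, `PropertyService`, the token values and the stored record are all hypothetical.

```python
class SecurityService:
    """Stand-in for the shared security validation service (hypothetical)."""

    def __init__(self, valid_tokens):
        self._valid_tokens = set(valid_tokens)

    def is_valid(self, token):
        return token in self._valid_tokens


class PropertyService:
    """Self-contained service: owns its data and logic.

    Its single dependency on another service is the security check.
    """

    def __init__(self, security, store):
        self._security = security   # the only external dependency
        self._store = dict(store)   # private data, not shared with other services

    def get_properties(self, token, chemical_id):
        if not self._security.is_valid(token):
            raise PermissionError("invalid token")
        # Unknown identifiers yield an empty record instead of an error.
        return self._store.get(chemical_id, {})


# Hypothetical wiring: the service can be replaced or redeployed on its own
# as long as it is handed something that offers is_valid(token).
security = SecurityService(valid_tokens={"token-abc"})
properties = PropertyService(security, {"chem-001": {"name": "example constituent"}})
record = properties.get_properties("token-abc", "chem-001")
```

Because the service receives the validator as a constructor argument, it can be rewritten in another language or framework, or moved to another server, without touching the remaining services.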
[Figure: project workflow between the scientific domain and the IT domain. Actors shown: Scientific Users 1 to n, Business User, Scientific Leader, IT Leader, IT developers 1 to n. Flow labels: user requirements; JIRA; IT Leader: user + system requirements = tasks; tasks; Scientific Leader: feedback + bugs.]
Scientific Leader
In order to cover the diverse range of scientific disciplines and to shield the IT developers from the scientific complexity, the Scientific Leader gathers the needs and wishes of the scientists and sets the priority of each requirement from a business-use point of view.
IT Leader
In order to evaluate the complexity for each new requirement, the IT Leader will analyze each request and propose solutions. Tasks are then created and distributed to developers in the form of work packages with due dates. Finally, the Scientific and IT leaders regularly check that the solutions meet business needs.