Redefining ETL Pipelines with Apache Technologies to Accelerate Decision Making for Clinical Trials
Eran Withana
www.comprehend.com
Clinical Trials – Lay of the land
Business and Technical Requirements
Technology Evaluation
High Level Architecture
Implementation
Managing Hardware
Deployments
Data Adapters: Implementation and Failure Modes
Distributed File System
Challenges
Future Work
Overview
Open Source
Member, PMC member, and committer at the Apache Software Foundation (ASF): Apache Axis2, Web Services, Synapse, Airavata
Education
PhD in Computer Science from Indiana University
Software engineer at Comprehend Systems
About me …
Clinical Trials – Lay of the land
[Chart: Number of Drugs in Development Worldwide (Source: CenterWatch Drugs in Clinical Trial Database 2014)]
Source: http://www.phrma.org/innovation/clinical-trials
Clinical Trials – Lay of the Land
Multiple Stakeholders: Study Managers, Program Managers, Monitors, Data Managers, Bio-statisticians, Executives, Medical Affairs, Regulatory, Vendors, CROs, CRAs
[Diagram: data flows from Sites, Labs, and Patients through Safety, EDC, PV Data, Excel, and Reports; the data is latent and fragmented]
Sponsor, Contract Research Organization (CRO), Sites and Investigators
For decades, clinical development
was primarily paper-based.
Various Software and Practices Used in Each Layer
[Diagram: vendor technologies such as Medidata, used by CROs and SIs]
Technologies
Clinical Trials with Centralized Monitoring
[Diagram: data from Sites, Labs, Patients, Safety, EDC, PV Data, and Excel flows into a Clinical Analytics & Collaboration layer for Clinical Operations]
● Consolidated ● Real-time ● Self-Service ● Mobile
Providing up-to-date answers
[Diagram: sources (EDC, CTMS, Safety, ePro, Other) feed Web, Ad-Hoc, Mobile, and Collab interfaces used by Executives, Medical Review, CRAs, Data Management, and Clinical Operations]
FDA, HIPAA compliance
Metadata/database structure synchronization: less frequent (once a day)
Data synchronization: more frequent (multiple times a day)
Ability to plug in various data sources: RAVE, MERGE, BioClinica, file imports, DB-to-DB syncs
Real-time event propagation: adverse events (AEs) need early identification
Business Requirements
Hardware agnostic for resiliency and better utilization
Repeatable deployments
Real-time processing and real-time events
Fault tolerance
In-flight and end-state metrics for alerting and monitoring
Flexible and pluggable adapter architecture
Time travel: audit trails, report generation
Technical Requirements
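The time-travel and audit-trail requirements can be sketched with a minimal append-only versioned store. This is an illustrative toy under assumed names (the `AuditedStore` class and its API are hypothetical), not how Comprehend stores clinical data:

```python
class AuditedStore:
    """Hypothetical append-only store: every write is retained with its
    timestamp, giving both an audit trail and 'as of' (time travel) reads."""

    def __init__(self):
        # key -> list of (timestamp, value), in write order
        self._versions = {}

    def put(self, key, value, ts):
        self._versions.setdefault(key, []).append((ts, value))

    def get_as_of(self, key, ts):
        """Return the latest value written at or before ts (time travel)."""
        result = None
        for write_ts, value in self._versions.get(key, []):
            if write_ts <= ts:
                result = value
            else:
                break
        return result

    def audit_trail(self, key):
        """Full ordered history of writes for a key (audit trail)."""
        return list(self._versions.get(key, []))

store = AuditedStore()
store.put("subject_42.status", "enrolled", ts=100)
store.put("subject_42.status", "withdrawn", ts=200)
print(store.get_as_of("subject_42.status", ts=150))  # enrolled
```

Because nothing is ever overwritten, re-running a report "as of" any past timestamp returns exactly what was known then.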
Events all the way
Shared event bus for multiple consumers
Use of language-agnostic data representations (via protobuf)
Automatic datacenter resource management (Mesos/Marathon/Docker)
Core Design Principles
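The first two principles can be illustrated with a minimal in-process sketch. Kafka plays the event-bus role in the actual pipeline; the `EventBus` class below is a hypothetical stand-in that only demonstrates the fan-out semantics (protobuf encoding is omitted):

```python
import queue
import threading

class EventBus:
    """Toy in-process stand-in for a shared event bus: each published
    event is fanned out to a private queue per consumer, so multiple
    consumers see the same event stream independently."""

    def __init__(self):
        self._subscribers = []
        self._lock = threading.Lock()

    def subscribe(self):
        q = queue.Queue()
        with self._lock:
            self._subscribers.append(q)
        return q

    def publish(self, event):
        with self._lock:
            subscribers = list(self._subscribers)
        for q in subscribers:
            q.put(event)

bus = EventBus()
seeder_events = bus.subscribe()   # one consumer
metrics_events = bus.subscribe()  # another, fully independent
bus.publish({"type": "schema_changed", "table": "adverse_events"})
print(seeder_events.get_nowait()["type"])  # schema_changed
```

A real bus adds durability, ordering guarantees, and consumer offsets, which is why Kafka was chosen over a homegrown mechanism.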
• Data processing: Apache Storm and Trident, Apache Spark and Spark Streaming, Samza, Summingbird, Scalding, Apache Falcon, Azkaban
• Coordination and configuration management: Apache Zookeeper, Redis, Apache Curator
• Event queue: Apache Kafka
• Scheduling: Chronos, Apache Mesos, Marathon, Apache Aurora
• Database synchronization: Liquibase, Flyway DB
• Data representations: Apache Thrift, protobuf, Avro
• Deployments: Ansible
• File management: Apache HDFS
• Monitoring and alerting: Graphite, StatsD
• Database: PostgreSQL, Apache Spark
• Resource isolation: LXC, Docker
Technologies Evaluated
Data Processing Technology Evaluation

Criteria                        | Storm + Trident | Spark + Streaming | Samza | Summingbird | Scalding | Falcon | Chronos | Aurora | Azkaban
DAG support                     | Y               | DAGScheduler      | Y     | Y           | Y        | Y      | Y       | N      | Y
DAG node resiliency             | Y               | Y                 | Y     | Y           | Y        | Y      | Y       | N      | Y
Event driven                    | Y               | Y                 | Y     | Y           | N        | N      | N       | N      | N
Timed execution                 | Y               | Y                 | Y     | Y           | Y        | Y      | Y       | Y      |
DAG extension                   | Y               | Y                 | Y     | Y           | Y        | Y      | Y       | Y      | Y
In-flight and end-state metrics | Y               | Y                 | Y     | Y           | Y        | Y      | Y       | Y      | Y
Hardware agnostic               | Y               | Y                 | Y     | Y           | Y        | Y      | Y       | Y      | Y
High Level Architecture
Bare metal boxes
Partitioned using LXC containers
Use of Mesos to allocate resources as needed for jobs
Managing Hardware
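For concreteness, a job running under Mesos via Marathon is described by an app definition like the fragment below. The app id, command, and resource numbers are hypothetical illustrations, not values from the talk:

```json
{
  "id": "/adapters/seeder",
  "cmd": "java -jar seeder.jar",
  "cpus": 0.5,
  "mem": 1024.0,
  "instances": 2
}
```

Marathon keeps the requested number of instances running, restarting them on other nodes when hardware fails, which is what makes the pipeline hardware agnostic.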
Ansible
Repeatable deployments
Password management
Inventory management (nodes, dev/staging/production)
Deployments
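A deployment of this shape can be sketched as a minimal Ansible play; the inventory group, role, and file names below are hypothetical, not from the talk:

```yaml
# Hypothetical play: group and role names are illustrative.
- hosts: mesos_agents
  become: true
  vars_files:
    - vault.yml        # secrets kept in an ansible-vault encrypted file
  roles:
    - docker
    - mesos_agent
```

Running the same play against dev, staging, or production inventories is what makes the deployments repeatable.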
Adapters – High Level
• Syncher is for DB structural changes: it creates a database schema from the source information, runs a generic database diff, and applies the changes to the target database
• Seeder is for data synchronization and uses the database schema created by Syncher
• Seeders get jobs from Syncher or a timed scheduler
Data Adapters
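The "generic database diff" step can be sketched as below, assuming schemas are modeled as plain `{table: {column: type}}` dicts. The function name and the dict model are hypothetical, and a real Syncher must also handle drops, type changes, constraints, and indexes:

```python
def diff_schemas(source, target):
    """Compare two schemas ({table: {column: sql_type}}) and return the
    DDL statements that would bring `target` in line with `source`."""
    statements = []
    for table, columns in source.items():
        if table not in target:
            cols = ", ".join(f"{c} {t}" for c, t in columns.items())
            statements.append(f"CREATE TABLE {table} ({cols})")
            continue
        for column, sql_type in columns.items():
            if column not in target[table]:
                statements.append(
                    f"ALTER TABLE {table} ADD COLUMN {column} {sql_type}")
    return statements

source = {"subjects": {"id": "integer", "status": "text"}}
target = {"subjects": {"id": "integer"}}
print(diff_schemas(source, target))
# ['ALTER TABLE subjects ADD COLUMN status text']
```

Because the diff works on a generic schema model, the same code serves every source adapter.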
• Coordination and configuration through Zookeeper: job configuration, connection information, distributed locking and counters
• Metric maintenance: last successful run
Data Adapters – Coordination and Configuration
Data Adapters - Implementation
Syncher
Connectivity to source/sink systems fails
• Retry the job N times and alert, if needed
Schema changes to the database fail midway
• Transaction rollback
Seeder
Connectivity to source/sink systems fails
• Retry the job N times and alert, if needed
Seeding fails midway
• Storm retries tuples
• Failing tuples are moved to an error queue
Table- and row-level failures
• Option to skip the tables/rows but send a report at the end
Effect on “live” tables during data synchronization
• Option to use transactions, or
• Use temporary tables and swap with the original upon completion
Failure Modes
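Two of these tactics, retrying N times and loading into a temporary table that is swapped in at the end, can be sketched like this. sqlite3 stands in for the real target database, and all names and the fixed two-column schema are hypothetical:

```python
import sqlite3
import time

def retry(job, attempts=3, delay=0.01):
    """Run a job, retrying up to `attempts` times before giving up."""
    for attempt in range(1, attempts + 1):
        try:
            return job()
        except Exception:
            if attempt == attempts:
                raise  # a real pipeline would alert here
            time.sleep(delay)

def swap_load(conn, table, rows):
    """Load into a temporary table, then swap it with the original, so
    readers never observe a half-loaded 'live' table."""
    tmp = f"{table}__tmp"
    conn.execute(f"DROP TABLE IF EXISTS {tmp}")
    conn.execute(f"CREATE TABLE {tmp} (id INTEGER, status TEXT)")
    conn.executemany(f"INSERT INTO {tmp} VALUES (?, ?)", rows)
    with conn:  # swap inside a single transaction
        conn.execute(f"DROP TABLE IF EXISTS {table}")
        conn.execute(f"ALTER TABLE {tmp} RENAME TO {table}")

conn = sqlite3.connect(":memory:")
swap_load(conn, "subjects", [(1, "enrolled"), (2, "screened")])
print(conn.execute("SELECT COUNT(*) FROM subjects").fetchone()[0])  # 2
```

The swap variant trades double the storage during the load for zero read downtime, which is why it is offered as an alternative to long-running transactions.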
Can bring in data from more data sources and more studies effectively
Run real-time reports on studies and configure alerts (future)
Can configure refreshes as needed by each use case
Can throttle input and output sources at the study/customer level
Ability to onboard new customers and deploy new studies with minimal human intervention
What Have We Gained
A generic framework which
• eases integration with new data sources: for each new source, implement a method to create a virtual schema and to get data for a given table
• scales and is fault tolerant
• has generic monitoring and alerting
• eases maintenance, since it is mostly generic code
• notifies of important events through messages
• runs on any hardware
What Have We Gained
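The per-source contract described above (a virtual schema plus a way to get rows for a table) can be sketched as a small interface; the class and method names here are hypothetical illustrations of the idea:

```python
from abc import ABC, abstractmethod

class SourceAdapter(ABC):
    """Hypothetical adapter contract: a new data source only needs to
    describe its schema and hand over rows; everything else is generic."""

    @abstractmethod
    def virtual_schema(self) -> dict:
        """Return {table: {column: sql_type}} for the source."""

    @abstractmethod
    def fetch_rows(self, table: str):
        """Yield rows for the given table."""

class InMemoryAdapter(SourceAdapter):
    """Trivial adapter over in-memory data, standing in for RAVE,
    BioClinica, file imports, etc."""

    def __init__(self, tables):
        self._tables = tables  # {table: (schema, rows)}

    def virtual_schema(self):
        return {t: schema for t, (schema, _) in self._tables.items()}

    def fetch_rows(self, table):
        yield from self._tables[table][1]

adapter = InMemoryAdapter({"labs": ({"id": "integer"}, [(1,), (2,)])})
print(list(adapter.fetch_rows("labs")))  # [(1,), (2,)]
```

Keeping the contract this narrow is what lets Syncher and Seeder stay generic across sources.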
Accessibility
Customers must be able to drop files securely (SFTP-like functionality)
Ability to access resources through URLs
Data storage
Scalability and Redundancy
Scale out by adding nodes
Resilience against loss of nodes and data centers; replication
Miscellaneous
Access control over reads/writes
Performance/usage/resource utilization monitoring
Distributed File System – Requirements
Two name nodes running in HA mode, co-located with two journal nodes
Third journal node on a separate machine
Data nodes on all bare metal nodes
Mounting HDFS with FUSE and enabling SFTP through OS-level features
Automatic failover through DNS and HAProxy
HDFS with High Availability Mode
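This layout maps onto standard hdfs-site.xml HA settings; a fragment of that shape might look like the following, where the nameservice id and host names are placeholders, not the real cluster's:

```xml
<!-- Placeholder nameservice id and host names, for illustration only -->
<property>
  <name>dfs.nameservices</name>
  <value>comprehend</value>
</property>
<property>
  <name>dfs.ha.namenodes.comprehend</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1:8485;jn2:8485;jn3:8485/comprehend</value>
</property>
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
```

The three journal-node addresses in the quorum correspond to the two co-located journal nodes plus the third on a separate machine.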
Regulatory requirements
Data encryption requirements for clinical data
Audit trails
Data quality
Source system constraints
Coordination between Synchers and Seeders
Distributed locks and counters
Automatic failover when a name node fails in HDFS
HDFS HA mode stores the active name node in ZK as a Java-serialized object, yikes!
Challenges
Time travel
Ability to go back in time and run reports at any given point in time
Trail of data
Containerization
In-memory query execution with Apache Spark
Future Work
Team
Thank You !!
Questions …