
Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Making and Execution for Clinical Trials

Eran Withana

www.comprehend.com

Overview

• Clinical Trials – Lay of the Land
• Business and Technical Requirements
• Technology Evaluation
• High Level Architecture
• Implementation
• Managing Hardware
• Deployments
• Data Adapters: Implementation and Failure Modes
• Distributed File System
• Challenges
• Future Work

About Me

Software engineer at Comprehend Systems

Open Source
• Member, PMC member, and committer of the ASF
• Apache Axis2, Web Services, Synapse, Airavata

Education
• PhD in Computer Science from Indiana University


Clinical Trials – Lay of the Land

[Figure: Number of Drugs in Development Worldwide]

(Source: CenterWatch Drugs in Clinical Trial Database 2014)

Source: http://www.phrma.org/innovation/clinical-trials

Clinical Trials – Lay of the Land

Multiple Stakeholders
• Study Managers
• Program Managers
• Monitors
• Data Managers
• Bio-statisticians
• Executives
• Medical Affairs
• Regulatory
• Vendors
• CROs
• CRAs

[Diagram labels: Sponsor; Contract Research Organization (CRO); Sites and Investigators; Sites; Labs; Patients; Safety; EDC; PV Data; Excel; Reports]

Data
● Latent
● Fragmented

For decades, clinical development was primarily paper-based.

Various Software and Practices Used in Each Layer

[Figure labels: medidata; CROs and SIs; Technologies]

Clinical Trials with Centralized Monitoring

[Diagram labels: Clinical Operations; Clinical Analytics & Collaboration; Data; Sites; Labs; Patients; Safety; EDC; PV Data; Excel]

● Consolidated
● Real-time
● Self-Service
● Mobile

Providing up-to-date answers

[Diagram labels]
Executives, Medical Review, CRAs, Data Management, Clinical Operations
EDC, CTMS, Safety, ePro, Other
Web, Ad-Hoc, Mobile, Collab


Business Requirements

• FDA and HIPAA compliance
• Metadata/database structure synchronization
  • Less frequent (once a day)
• Data synchronization
  • More frequent (multiple times a day)
• Ability to plug in various data sources
  • RAVE, MERGE, BioClinica, file imports, DB-to-DB syncs
• Real-time event propagation
  • Adverse events (AEs): the need for early identification

Technical Requirements

• Hardware agnostic, for resiliency and better utilization
• Repeatable deployments
• Real-time processing and real-time events
• Fault tolerance
• In-flight and end-state metrics for alerting and monitoring
• Flexible and pluggable adapter architecture
• Time travel
  • Audit trails
  • Report generation

Core Design Principles

• Events all the way
• Shared event bus for multiple consumers
• Language-agnostic data representations (via protobuf)
• Automatic datacenter resource management (Mesos/Marathon/Docker)
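To make "shared event bus for multiple consumers" concrete, here is a minimal sketch of an append-only log in which every consumer keeps its own read offset, so each one sees every event independently. It is an in-memory stand-in written for illustration, not the Kafka API and not the pipeline's actual code; the class name, the consumer names, and the event fields are all assumptions.

```python
from collections import defaultdict
from typing import Any, Dict, List


class EventBus:
    """In-memory stand-in for a shared event log (hypothetical, not the Kafka API)."""

    def __init__(self) -> None:
        self._log: List[Dict[str, Any]] = []               # append-only event log
        self._offsets: Dict[str, int] = defaultdict(int)   # next unread index per consumer

    def publish(self, event: Dict[str, Any]) -> int:
        """Append an event and return its position in the log."""
        self._log.append(event)
        return len(self._log) - 1

    def poll(self, consumer: str, max_events: int = 10) -> List[Dict[str, Any]]:
        """Return the next unread events for `consumer` and advance its offset."""
        start = self._offsets[consumer]
        batch = self._log[start:start + max_events]
        self._offsets[consumer] = start + len(batch)
        return batch


bus = EventBus()
# Event types and study identifiers below are made up for the example.
bus.publish({"type": "adverse_event", "study": "STUDY-1"})
bus.publish({"type": "schema_changed", "study": "STUDY-1"})

# Two independent consumers each receive every event, in publish order.
alerting_batch = bus.poll("alerting")
seeder_batch = bus.poll("seeder")
```

Because offsets are tracked per consumer, adding a new consumer later does not affect existing ones, which is the property that lets several downstream jobs share one bus.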


Technologies Evaluated

• Data processing: Apache Storm and Trident, Apache Spark and Spark Streaming, Samza, Summingbird, Scalding, Apache Falcon, Azkaban
• Coordination and configuration management: Apache ZooKeeper, Redis, Apache Curator
• Event queue: Apache Kafka
• Scheduling: Chronos, Apache Mesos, Marathon, Apache Aurora
• Database synchronization: Liquibase, Flyway DB
• Data representations: Apache Thrift, protobuf, Avro
• Deployments: Ansible
• File management: Apache HDFS
• Monitoring and alerting: Graphite, StatsD
• Database: PostgreSQL, Apache Spark
• Resource isolation: LXC, Docker

Data Processing Technology Evaluation

Criteria                        Storm+Trident  Spark+Streaming  Samza  Summingbird  Scalding  Falcon  Chronos  Aurora  Azkaban
DAG Support                     Y              DAGScheduler     Y      Y            Y         Y       Y        N       Y
DAG Nodes Resiliency            Y              Y                Y      Y            Y         Y       Y        N       Y
Event Driven                    Y              Y                Y      Y            N         N       N        N       N
DAG Extension                   Y              Y                Y      Y            Y         Y       Y        Y       Y
Inflight and end state metrics  Y              Y                Y      Y            Y         Y       Y        Y       Y
Hardware Agnostic               Y              Y                Y      Y            Y         Y       Y        Y       Y

Timed Execution: Y Y Y Y Y Y Y Y (eight of the nine values survive; column alignment unclear)


High Level Architecture


Managing Hardware

• Bare-metal boxes
• Partitioned using LXC containers
• Mesos allocates resources to jobs as needed


Deployments

Ansible
• Repeatable deployments
• Password management
• Inventory management (nodes, dev/staging/production)


Adapters – High Level

Data Adapters

• Syncher handles database structural changes
  • Creates a database schema from the source information
  • Runs a generic database diff and applies the changes to the target database
• Seeder handles data synchronization
  • Uses the database schema created by Syncher
• Seeders get jobs from Syncher or from a timed scheduler
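The Syncher's "generic database diff" step can be sketched roughly as follows. This is an illustration under stated assumptions, not the Syncher's actual implementation: the function name `diff_schema`, the dictionary representation of a schema, and the decision to emit only additive changes are all hypothetical. The deck also lists Liquibase and Flyway as evaluated tools; this sketch uses neither.

```python
from typing import Dict, List

# Assumed representation: {table name: {column name: column type}}
Schema = Dict[str, Dict[str, str]]


def diff_schema(source: Schema, target: Schema) -> List[str]:
    """Return DDL statements that bring `target` in line with `source`.

    Only additive or type-changing statements are produced; tables and
    columns missing from the source are left untouched in the target.
    """
    statements: List[str] = []
    for table, columns in sorted(source.items()):
        if table not in target:
            column_defs = ", ".join(f"{name} {ctype}" for name, ctype in columns.items())
            statements.append(f"CREATE TABLE {table} ({column_defs})")
            continue
        for name, ctype in columns.items():
            if name not in target[table]:
                statements.append(f"ALTER TABLE {table} ADD COLUMN {name} {ctype}")
            elif target[table][name] != ctype:
                # PostgreSQL-style syntax for changing a column type
                statements.append(f"ALTER TABLE {table} ALTER COLUMN {name} TYPE {ctype}")
    return statements


# Table and column names are made up for the example.
source_schema = {
    "subjects": {"id": "integer", "site": "text"},
    "visits": {"id": "integer"},
}
target_schema = {"subjects": {"id": "integer"}}
ddl = diff_schema(source_schema, target_schema)
```

Applying the resulting statements inside a single transaction would match the rollback behaviour described under Failure Modes.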

Data Adapters – Coordination and Configuration

Coordination and configuration through ZooKeeper
• Job configuration
• Connection information
• Distributed locking and counters
• Metric maintenance
• Last successful run


Data Adapters - Implementation

Failure Modes

Syncher
• Connectivity to source/sink systems fails
  • Retry the job N times and alert, if needed
• Schema changes to the database fail midway
  • Transaction rollback

Seeder
• Connectivity to source/sink systems fails
  • Retry the job N times and alert, if needed
• Seeding fails midway
  • Storm retries tuples
  • Failing tuples are moved to an error queue
• Table- and row-level failures
  • Option to skip the tables/rows but send a report at the end
• Effect on "live" tables during data synchronization
  • Option to use transactions, or
  • Use temporary tables and swap with the original upon completion
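Three of the strategies above (retry N times then alert, skip bad rows but report them, and load into a temporary table before swapping it with the live one) can be sketched as below. This is an assumption-laden illustration, not the pipeline's code: it uses SQLite from the Python standard library in place of the real target database, the function names and the two-column table layout are hypothetical, and Storm's tuple replay is not modelled.

```python
import sqlite3
from typing import Callable, Iterable, List, Tuple


def run_with_retries(job: Callable[[], object], max_attempts: int,
                     alert: Callable[[str], None]) -> object:
    """Run `job`; on ConnectionError retry up to `max_attempts`, then alert and re-raise."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except ConnectionError as exc:
            if attempt == max_attempts:
                alert(f"job failed after {attempt} attempts: {exc}")
                raise
    raise ValueError("max_attempts must be at least 1")


def seed_with_swap(conn: sqlite3.Connection, table: str,
                   rows: Iterable[tuple]) -> List[Tuple[tuple, str]]:
    """Load rows into a staging table, then swap it with the live table.

    Rows that violate a constraint are skipped and returned for an end-of-run
    report. Assumes `conn` is in autocommit mode (isolation_level=None) and
    that `table` already exists with the (id, value) layout used here.
    """
    staging, old = f"{table}_staging", f"{table}_old"
    skipped: List[Tuple[tuple, str]] = []
    conn.execute(f"DROP TABLE IF EXISTS {staging}")
    conn.execute(f"CREATE TABLE {staging} (id INTEGER PRIMARY KEY, value TEXT NOT NULL)")
    for row in rows:
        try:
            conn.execute(f"INSERT INTO {staging} (id, value) VALUES (?, ?)", row)
        except sqlite3.IntegrityError as exc:
            skipped.append((row, str(exc)))  # keep going, report later
    conn.execute("BEGIN")  # the swap is all-or-nothing
    try:
        conn.execute(f"ALTER TABLE {table} RENAME TO {old}")
        conn.execute(f"ALTER TABLE {staging} RENAME TO {table}")
        conn.execute(f"DROP TABLE {old}")
        conn.execute("COMMIT")
    except sqlite3.Error:
        conn.execute("ROLLBACK")
        raise
    return skipped
```

The live table stays readable with its old contents for the whole load; only the short rename transaction at the end touches it.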

What Have We Gained

• Can bring in data from more data sources and more studies effectively
• Run real-time reports on studies and configure alerts (future)
• Can configure refreshes as needed by each use case
• Can throttle input and output sources at the study/customer level
• Ability to onboard new customers and deploy new studies with minimal human intervention

What Have We Gained

A generic framework which
• eases integration with new data sources
  • For each new source, implement a method to create a virtual schema and a method to get data for a given table
• can scale and is fault tolerant
• has generic monitoring and alerting
• eases maintenance, since it is mostly generic code
• notifies of important events through messages
• runs on any hardware
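The two-method contract for new data sources (create a virtual schema, get data for a given table) might look like the sketch below. The interface name `SourceAdapter`, the method names, and the toy CSV-backed adapter are assumptions made for illustration; the deck does not show the framework's real interface.

```python
import csv
import io
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List, Tuple


class SourceAdapter(ABC):
    """Hypothetical contract for a data source: describe tables, then stream rows."""

    @abstractmethod
    def virtual_schema(self) -> Dict[str, List[Tuple[str, str]]]:
        """Return {table: [(column, type), ...]} for the source."""

    @abstractmethod
    def rows(self, table: str) -> Iterator[tuple]:
        """Yield the rows of `table` in virtual-schema column order."""


class FileImportAdapter(SourceAdapter):
    """Toy adapter over in-memory CSV text, standing in for a file import."""

    def __init__(self, files: Dict[str, str]) -> None:
        self._files = files  # table name -> CSV text with a header row

    def virtual_schema(self) -> Dict[str, List[Tuple[str, str]]]:
        schema = {}
        for table, text in self._files.items():
            header = next(csv.reader(io.StringIO(text)))
            # CSV carries no type information, so every column is treated as text.
            schema[table] = [(column, "text") for column in header]
        return schema

    def rows(self, table: str) -> Iterator[tuple]:
        reader = csv.reader(io.StringIO(self._files[table]))
        next(reader)  # skip the header row
        for record in reader:
            yield tuple(record)


# File contents are made up for the example.
adapter = FileImportAdapter({"labs": "subject,result\nS1,4.2\nS2,5.0\n"})
labs_schema = adapter.virtual_schema()
labs_rows = list(adapter.rows("labs"))
```

With this shape, generic code can drive any source the same way: the schema feeds the Syncher and the row stream feeds the Seeder.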


Distributed File System - Requirements

Accessibility
• Customers must be able to drop files securely (SFTP-like functionality)
• Ability to access resources through URLs

Data storage

Scalability and Redundancy
• Scale-out by adding nodes
• Resilience against loss of nodes, data centers and replication

Miscellaneous
• Access control over read/write
• Performance/usage/resource utilization monitoring

HDFS with High Availability Mode

• Two name nodes running in HA mode, co-located with two journal nodes
• Third journal node on a separate node
• Data nodes on all bare-metal nodes
• HDFS mounted with FUSE, with SFTP enabled through OS-level features
• Automatic failover through DNS and HAProxy


Challenges

• Regulatory requirements
  • Data encryption requirements for clinical data
  • Audit trails
• Data quality
• Source system constraints
• Coordination between Synchers and Seeders
  • Distributed locks and counters
• Automatic failover when a name node fails in HDFS
  • HDFS HA mode stores the active name node in ZK as a Java-serialized object, yikes!!


Future Work

• Time travel
  • Ability to go back in time and run reports as of any given point in time
  • Trail of data
• Containerization
• In-memory query execution with Apache Spark
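The deck lists time travel only as future work and gives no design. One common way to support "as of" reports is to keep every version of a value with its timestamp instead of overwriting it, as in the sketch below. The class name, the integer timestamps, and the status values are assumptions for illustration only, not the author's plan.

```python
import bisect
from typing import Any, List, Optional


class VersionedCell:
    """Append-only history of one value, queryable as of any point in time."""

    def __init__(self) -> None:
        self._times: List[int] = []   # timestamps, kept in non-decreasing order
        self._values: List[Any] = []  # value that took effect at the matching timestamp

    def record(self, timestamp: int, value: Any) -> None:
        """Append a new version; timestamps must not go backwards."""
        if self._times and timestamp < self._times[-1]:
            raise ValueError("timestamps must be non-decreasing")
        self._times.append(timestamp)
        self._values.append(value)

    def as_of(self, timestamp: int) -> Optional[Any]:
        """Return the value in effect at `timestamp`, or None if none existed yet."""
        index = bisect.bisect_right(self._times, timestamp)
        return self._values[index - 1] if index else None


# Hypothetical subject status history.
status = VersionedCell()
status.record(10, "enrolled")
status.record(20, "withdrawn")
```

Because nothing is overwritten, the same history doubles as the "trail of data" mentioned alongside time travel.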


Team

Thank You!!

Questions…