COMBINING HUMAN & MACHINE INTELLIGENCE TO SUCCESSFULLY INTEGRATE BIOMEDICAL DATA
TIMOTHY DANFORD | TAMR, INC.
THE DATA INTEGRATION PROBLEM
● flat files: every file has its own columns
● bioinformatics: every tool has its own file format
● graph data: RDF, OWL, “knowledge graphs”
● proprietary / legacy formats: SAS, DBF
● relational databases: inconsistent data models
Biomedical Data Integration is a Constantly Moving Target
THE BIOMEDICAL DATA INTEGRATION PROBLEM
● Fundamentally, many scientific analyses are tabular
  o rows are ‘entities’
  o columns are ‘attributes’
● graphs (paths) and hierarchies (part/whole) are other shapes
● tables emphasize independence of entities and attributes
Tabular Datasets are a Core Data Shape
THE BIOMEDICAL DATA INTEGRATION PROBLEM
● Column-oriented: Find the matching attributes
● Row-oriented: Discover duplicate entities
Data Integration Proceeds In Two Directions
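The two directions can be illustrated with a minimal sketch. It assumes column names are compared by string similarity and duplicates are found by exact match on normalized key fields. The function names, the `threshold` value, and the sample data are hypothetical and chosen only for illustration; real integration systems use much richer signals.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and keep only letters and digits, so 'Patient_ID' == 'patient id'."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def match_attributes(source_cols, target_cols, threshold=0.8):
    """Column-oriented: pair each source column with its most similar target column.

    The threshold is an illustrative assumption, not a recommended value.
    """
    matches = {}
    for src in source_cols:
        best, best_score = None, 0.0
        for tgt in target_cols:
            score = SequenceMatcher(None, normalize(src), normalize(tgt)).ratio()
            if score > best_score:
                best, best_score = tgt, score
        if best_score >= threshold:
            matches[src] = best
    return matches


def find_duplicates(rows, key_fields):
    """Row-oriented: group row indices whose normalized key fields are identical."""
    groups = {}
    for index, row in enumerate(rows):
        key = tuple(normalize(str(row[field])) for field in key_fields)
        groups.setdefault(key, []).append(index)
    return [indices for indices in groups.values() if len(indices) > 1]


if __name__ == "__main__":
    print(match_attributes(["Patient_ID", "DOB"], ["patient id", "birth_date"]))
    rows = [
        {"name": "Jane Doe", "dob": "1980-01-02"},
        {"name": "jane  doe", "dob": "1980-01-02"},
        {"name": "John Roe", "dob": "1975-05-05"},
    ]
    print(find_duplicates(rows, ["name", "dob"]))
```

Note that a name-only matcher misses "DOB" versus "birth_date", which is the kind of case where an expert's answer is needed.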
THE DATA INTEGRATION PROBLEM
● One solution: hire or train data curators who understand the subject area
● Benefits: accuracy
● Problems
  o Low bandwidth
  o Difficult to scale to larger problems
  o Recording decisions
  o Consistency between curators
Data Curation Teams Do Not Scale
THE DATA INTEGRATION PROBLEM
● Build an automated or rules-based system to perform data integration
● Benefits: scale
● Problems
  o Accuracy, edge-cases
  o Programmers do not scale
  o Out-of-band communication
  o Expensive to maintain
  o Brittle in the face of new data
Rule-based Integration Is Brittle
TAMR AUTOMATES DATA INTEGRATION
● Solution: combine learning rules with asking experts
● Modern machine learning techniques
  o semi-supervised learning
  o active learning
● Benefits
  o speed of an automated system
  o accuracy of human experts
  o auditability
  o responds well to changing requirements
Use Probabilistic Rules with Active Learning
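As a rough illustration of active learning, the sketch below uses uncertainty sampling: the system asks an expert about the candidate match it is least sure of, then refits its decision rule. This is a toy example and not Tamr's actual algorithm. The single similarity feature, the logistic scoring function, the `sharpness` parameter, and the `ask_expert` callback are all assumptions made for illustration.

```python
import math


def predict_proba(similarity, threshold, sharpness=10.0):
    """Probabilistic rule: probability that a candidate pair is a true match."""
    return 1.0 / (1.0 + math.exp(-sharpness * (similarity - threshold)))


def most_uncertain(unlabeled, threshold):
    """Uncertainty sampling: pick the candidate whose probability is closest to 0.5."""
    return min(unlabeled, key=lambda s: abs(predict_proba(s, threshold) - 0.5))


def refit_threshold(labeled, default):
    """Put the threshold midway between the best non-match and the worst match."""
    matches = [s for s, is_match in labeled if is_match]
    non_matches = [s for s, is_match in labeled if not is_match]
    if matches and non_matches:
        return (min(matches) + max(non_matches)) / 2
    return default  # not enough evidence yet, keep the current rule


def active_learn(similarities, ask_expert, budget, threshold=0.5):
    """Spend a limited number of expert questions where they are most informative."""
    unlabeled = list(similarities)
    labeled = []
    for _ in range(min(budget, len(unlabeled))):
        query = most_uncertain(unlabeled, threshold)
        unlabeled.remove(query)
        labeled.append((query, ask_expert(query)))  # recorded answers are auditable
        threshold = refit_threshold(labeled, default=threshold)
    return threshold, labeled


if __name__ == "__main__":
    # Stand-in for a human expert who considers pairs scoring 0.6 or more a match.
    expert = lambda similarity: similarity >= 0.6
    print(active_learn([0.1, 0.2, 0.4, 0.55, 0.7, 0.9], expert, budget=3))
```

The `labeled` list doubles as a record of every expert decision, which is one way the auditability benefit could be realized.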
TAMR AUTOMATES DATA INTEGRATION
● Build a unified schema and link it to source attributes
● Engage subject matter experts to answer questions
● Automate data transformation
● Eliminate redundant records with de-duplication
Tamr Combines Machine Learning and Expert Feedback
● 80% of clinical data today goes unused
● Clinical Data Warehouses capture legacy data
● Improved analytics = better trials, less $$
Advanced Analytics, Better Clinical Trials
TAMR BUILDS LASTING VALUE
SAS
Faster Regulatory Filings
Better Clinical Analytics
Data Mining for New Indications
CASE STUDY: CLINICAL STUDY DATA
● Clinical study data integration is motivated by a single schema: CDISC
  o mandated by FDA for data submission
  o common schema for clinical data warehouses
● Mostly performed by SAS scripting today
● Tamr learns attribute mapping and transformations using human feedback
An Example: Clinical Study Data Integration
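A minimal sketch of reusing human feedback for attribute mapping is shown below. It only memorizes expert-confirmed mappings and reapplies them to new records, which is far simpler than the statistical learning described in the slides. The `AttributeMapper` class and the source column names are hypothetical; `USUBJID` and `SEX` are variable names from the CDISC SDTM demographics domain.

```python
class AttributeMapper:
    """Stores expert-confirmed source-to-target mappings and reapplies them."""

    def __init__(self):
        # normalized source column name -> target variable name
        self.confirmed = {}

    @staticmethod
    def _normalize(name):
        """Ignore case, spaces, and punctuation when comparing column names."""
        return "".join(ch for ch in name.lower() if ch.isalnum())

    def record_feedback(self, source_name, target_name):
        """Remember an expert's answer so the same question is not asked twice."""
        self.confirmed[self._normalize(source_name)] = target_name

    def suggest(self, source_name):
        """Return the known target variable, or None if an expert must be asked."""
        return self.confirmed.get(self._normalize(source_name))

    def transform(self, record):
        """Rename known columns; return the columns that still need review."""
        mapped, needs_review = {}, []
        for column, value in record.items():
            target = self.suggest(column)
            if target is None:
                needs_review.append(column)
            else:
                mapped[target] = value
        return mapped, needs_review


if __name__ == "__main__":
    mapper = AttributeMapper()
    mapper.record_feedback("Subject ID", "USUBJID")
    mapper.record_feedback("gender", "SEX")
    print(mapper.transform({"subject_id": "001", "Gender": "F", "visit_wt": 70}))
```

Columns returned in `needs_review` are the ones that would be routed to a subject matter expert as new questions.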
Thank You