ThinkFast: Scaling Machine Learning to Modern Demands

Hristo Paskov

The Genomic Data Deluge

• Precision Medicine Initiative: sequence 1,000,000 genomes
  – $215 million in 2015
  – Pilot study
  – Outputs 10-50 GB/person

How do we analyze all of this data to drive progress?

Massive Data Sources

News

eCommerce

Bioinformatics

100K Genomes

Social Media

The Analysis Refinement Cycle

Data

Model: $\frac{1}{2}\|y - Xw\|_2^2 + \frac{\lambda}{2}\|w\|_2^2$

Solver: $x^{+} = x - \alpha M \nabla f(x)$

Model captures data nuance?

Solver exists, is fast enough?

Yes? No?
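As a rough illustration of the model and solver boxes in this cycle, the sketch below applies the update $x^{+} = x - \alpha M \nabla f(x)$ to the ridge objective, with the iterate $x$ playing the role of the weights $w$. The function names, the synthetic data, and the two choices of $M$ (inverse Hessian and identity) are assumptions made for this example; the slide does not say which preconditioner is used.

```python
import numpy as np

def ridge_objective(X, y, w, lam):
    """f(w) = 0.5 * ||y - Xw||_2^2 + 0.5 * lam * ||w||_2^2."""
    r = y - X @ w
    return 0.5 * r @ r + 0.5 * lam * w @ w

def ridge_gradient(X, y, w, lam):
    """Gradient of the ridge objective with respect to w."""
    return -X.T @ (y - X @ w) + lam * w

def preconditioned_descent(X, y, lam, M, alpha, n_steps):
    """Iterate w <- w - alpha * M @ grad f(w), starting from zero."""
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        w = w - alpha * M @ ridge_gradient(X, y, w, lam)
    return w

# Synthetic data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
y = X @ np.arange(1.0, 6.0) + 0.1 * rng.standard_normal(50)
lam = 1.0

# Assumption: M is the inverse Hessian, which turns the update into a Newton step.
H = X.T @ X + lam * np.eye(5)
w_newton = preconditioned_descent(X, y, lam, np.linalg.inv(H), alpha=1.0, n_steps=1)

# With M = I the same loop is plain gradient descent; the step size is 1 / L,
# where L is the largest eigenvalue of the Hessian.
L = np.linalg.eigvalsh(H).max()
w_gd = preconditioned_descent(X, y, lam, np.eye(5), alpha=1.0 / L, n_steps=2000)
```

Because the ridge objective is quadratic, the Newton variant reaches the minimizer in one step, while the identity preconditioner needs many cheap iterations. That trade-off is one way to read the "solver exists, is fast enough?" question.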

More Than Just Training Models 

• Regularization paths
• Model risk assessment
• Interpretability

[Figure: regularization path. Vertical axis: Model Coefficient; horizontal axis: Regularization Parameter]
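To make the idea of a regularization path concrete, here is a minimal sketch that computes ridge coefficients over a grid of regularization parameters. It assumes the ridge objective $\frac{1}{2}\|y - Xw\|_2^2 + \frac{\lambda}{2}\|w\|_2^2$; the function name `ridge_path`, the grid, and the data are hypothetical, and reusing a single SVD is one possible shortcut, not necessarily the approach taken in the talk.

```python
import numpy as np

def ridge_path(X, y, lambdas):
    """Return an array whose row k holds the ridge coefficients for lambdas[k].

    One thin SVD X = U diag(s) V^T serves every lambda, since
    w(lam) = V diag(s / (s^2 + lam)) U^T y.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Uty = U.T @ y
    return np.array([Vt.T @ (s / (s**2 + lam) * Uty) for lam in lambdas])

# Synthetic data (hypothetical, for illustration only).
rng = np.random.default_rng(1)
X = rng.standard_normal((30, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + 0.1 * rng.standard_normal(30)

lambdas = np.logspace(-2, 3, 6)
path = ridge_path(X, y, lambdas)   # shape: (len(lambdas), n_features)
```

Plotting each column of `path` against `lambdas` gives one curve per model coefficient, which is the kind of plot the slide shows.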

Brief History of Statistical Learning

Interpretability & Statistical Guarantees

Scalability

Ease of Use

Simple Models

Kernel Methods

Trees & Ensembles

Structured Regularization

Structured Regularization

Losses

• Regression
• Classification
• Ranking
• Motif Finding
• Matrix Factorization
• Feature Embedding
• Data Imputation

Regularizers

• Sparsity
• Spatial / Temporal / Manifold Structure
• Group Structure
• Hierarchical Structure
• Structured & Unstructured Multitask Learning
• …

$\min_{\beta \in \mathbb{R}^d} L(X\beta) + \lambda R(\beta)$
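One generic way to attack problems of the form $\min_{\beta} L(X\beta) + \lambda R(\beta)$ is proximal gradient descent, sketched below. This is a swapped-in textbook method chosen for illustration, not a description of the solver presented in the talk. The helper names (`proximal_gradient`, `soft_threshold`), the squared loss, and the choice $R = \|\cdot\|_1$ are all assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_gradient(X, grad_L, prox_R, lam, step, n_steps):
    """Minimize L(X @ beta) + lam * R(beta).

    grad_L(z) is the gradient of the loss L at z = X @ beta.
    prox_R(v, t) is the proximal operator of t * R evaluated at v.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        g = X.T @ grad_L(X @ beta)                  # chain rule through X
        beta = prox_R(beta - step * g, step * lam)  # gradient step, then prox
    return beta

# Synthetic sparse regression problem (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
beta_true = np.zeros(20)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + 0.01 * rng.standard_normal(100)

# Squared loss L(z) = 0.5 * ||y - z||^2 has gradient z - y.
# The step size 1 / ||X||_2^2 is a standard safe choice for this loss.
step = 1.0 / np.linalg.norm(X, 2) ** 2
beta_hat = proximal_gradient(X, lambda z: z - y, soft_threshold,
                             lam=5.0, step=step, n_steps=2000)
```

Swapping `grad_L` changes the loss and swapping `prox_R` changes the regularizer, which mirrors how the slide pairs a menu of losses with a menu of regularizers.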

The Lasso’s Combinatorial Side

[Figure: Model Coefficient plotted against $\lambda$, starting from 0, with curves labeled 1 to 4]

The Database Perspective

Feature & label storage

Data access operations

ML “Query Language”: $\min_{\beta \in \mathbb{R}^d} L(y - X\beta) + \lambda \|\beta\|_1$

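$\min_{\beta_1, \beta_2, \beta_3 \in \mathbb{R}^d} \sum_{t=1}^{3} \left[ L_t(y_t - X_t \beta_t) + \lambda_t R_t(\beta_t) \right] + \omega \left\| [\beta_1 \; \beta_2 \; \beta_3] \right\|_*$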
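The last term of this multitask objective is the nuclear norm (sum of singular values) of the matrix whose columns are the three task coefficient vectors. The sketch below shows how that norm and its proximal operator can be computed with an SVD; the function names and the small test matrix are hypothetical, and the slide does not state how the term is handled in practice.

```python
import numpy as np

def nuclear_norm(B):
    """Sum of the singular values of B."""
    return np.linalg.svd(B, compute_uv=False).sum()

def prox_nuclear(B, t):
    """Proximal operator of t * ||.||_*: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (U * np.maximum(s - t, 0.0)) @ Vt   # scales column k of U by the shrunk s[k]

# Columns stand in for beta_1, beta_2, beta_3 (hypothetical values, d = 4).
B = np.array([[3.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 0.5],
              [0.0, 0.0, 0.0]])

B_shrunk = prox_nuclear(B, t=1.0)
```

Thresholding the singular values zeroes out the weak directions, so the shrunk matrix has lower rank. In the multitask setting this pushes the task coefficient vectors toward a shared low-dimensional subspace.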

Feature, label and model storage

[Diagram: stored models labeled $M_1$, $M_2$, $M_3$]

Processing Memory

Mathematical Structure

Efficient Feature Storage

“Query Language” Optimization

• Static analysis

$\|y - Xw\|_2^2 + \|w\|_2^2$

$\|y - Xw\|_2^2 + \|w\|_1$

?

$\|y - Xw\|_2^2 + \frac{1}{2}\left(\|w\|_2^2 + \|w\|_1\right)$

?

$\varepsilon(y - Xw) + \frac{1}{2}\left(\|w\|_2^2 + \|w\|_1\right)$

• Runtime analysis

Some Bioinformatics Applications

• Personalized medicine, Memorial Sloan Kettering Cancer Center
  – 35% accuracy improvement over state-of-the-art

• Metagenomic binning and DNA quality assessment, Stanford School of Medicine
  – Previously unsolved problem

• Toxicogenomic analysis, Stanford University
  – Improved on state-of-the-art results

Upcoming

• Massive scale character level sentiment and text analysis on Amazon data
  – Billions of features, hours to solve a model
  – Efficient multitask learning

• Characterize the global limitations of learning word structure
  – Devise provably more efficient regularizers for uncovering structure
