ThinkFast: Scaling Machine Learning to Modern Demands Hristo Paskov


Page 1: ThinkFast: Scaling Machine Learning to Modern Demands

 ThinkFast: Scaling Machine Learning to Modern Demands

Hristo Paskov

Page 2: ThinkFast: Scaling Machine Learning to Modern Demands

The Genomic Data Deluge

• Precision Medicine Initiative: sequence 1,000,000 genomes
  – $215 million in 2015
  – Pilot study
  – Outputs 10-50 GB/person

How do we analyze all of this data to drive progress?
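To put the figures on this slide in perspective, the following is a back-of-the-envelope sketch of the total raw output. It assumes decimal units (1 PB = 1,000,000 GB) and that every genome falls within the quoted 10-50 GB range; the function and variable names are illustrative only.

```python
# Rough storage estimate from the slide's numbers.
# Assumption: decimal units, 1 PB = 1,000,000 GB.
GENOMES = 1_000_000
GB_PER_PERSON_LOW, GB_PER_PERSON_HIGH = 10, 50
GB_PER_PB = 1_000_000


def total_petabytes(genomes: int, gb_per_person: float) -> float:
    """Total raw output in petabytes for a cohort of the given size."""
    return genomes * gb_per_person / GB_PER_PB


low_pb = total_petabytes(GENOMES, GB_PER_PERSON_LOW)
high_pb = total_petabytes(GENOMES, GB_PER_PERSON_HIGH)
print(f"{low_pb:.0f}-{high_pb:.0f} PB")  # prints "10-50 PB"
```

Under these assumptions the full cohort produces tens of petabytes of raw output.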

Page 3: ThinkFast: Scaling Machine Learning to Modern Demands

Massive Data Sources

News

eCommerce

Bioinformatics

100K Genomes

Social Media

Page 4: ThinkFast: Scaling Machine Learning to Modern Demands

The Analysis Refinement Cycle

Data

Model: $\frac{1}{2}\|y - Xw\|_2^2 + \frac{\lambda}{2}\|w\|_2^2$

Solver: $x^{+} = x - \alpha M \nabla f(x)$

Model captures data nuance?

Solver exists, is fast enough?

Yes?

No?
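The model and solver on this slide can be sketched together in a few lines. This is a minimal illustration, not the talk's implementation: the data is synthetic, the iterate is called `w` (the slide's solver formula calls it x), and the two choices of the matrix M (the identity and the inverse Hessian) are assumptions picked to show how M changes the solver.

```python
import numpy as np


def ridge_objective(w, X, y, lam):
    """0.5 * ||y - Xw||_2^2 + 0.5 * lam * ||w||_2^2"""
    residual = y - X @ w
    return 0.5 * residual @ residual + 0.5 * lam * w @ w


def ridge_gradient(w, X, y, lam):
    """Gradient of the ridge objective with respect to w."""
    return X.T @ (X @ w - y) + lam * w


def preconditioned_step(w, grad, alpha, M):
    """One update w+ = w - alpha * M @ grad."""
    return w - alpha * (M @ grad)


rng = np.random.default_rng(0)  # synthetic data, for illustration only
X = rng.standard_normal((50, 5))
y = rng.standard_normal(50)
lam = 0.1
d = X.shape[1]

# M = I: plain gradient descent with step 1/L, where L is the largest
# eigenvalue of the Hessian X^T X + lam * I.
L = np.linalg.norm(X, 2) ** 2 + lam
w_gd = np.zeros(d)
for _ in range(500):
    w_gd = preconditioned_step(w_gd, ridge_gradient(w_gd, X, y, lam), 1.0 / L, np.eye(d))

# M = inverse Hessian: a single step with alpha = 1 lands on the minimizer,
# because the ridge objective is quadratic.
hessian_inv = np.linalg.inv(X.T @ X + lam * np.eye(d))
w0 = np.zeros(d)
w_newton = preconditioned_step(w0, ridge_gradient(w0, X, y, lam), 1.0, hessian_inv)
```

Both runs reach the same minimizer. The trade-off is many cheap steps against one expensive step, which is the kind of question "Solver exists, is fast enough?" raises.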

Page 5: ThinkFast: Scaling Machine Learning to Modern Demands

More Than Just Training Models 

• Regularization paths
• Model risk assessment
• Interpretability

[Figure: regularization path. y-axis: Model Coefficient; x-axis: Regularization Parameter]
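A regularization path is the set of fitted coefficients traced out as the regularization parameter varies. The sketch below assumes a ridge model and a log-spaced grid of parameter values, neither of which is specified on this slide; `ridge_path` and the synthetic data are illustrative only.

```python
import numpy as np


def ridge_path(X, y, lambdas):
    """Return an array of shape (len(lambdas), d): one coefficient vector per lambda."""
    d = X.shape[1]
    gram, xty = X.T @ X, X.T @ y  # computed once, reused for every lambda
    return np.array([np.linalg.solve(gram + lam * np.eye(d), xty) for lam in lambdas])


rng = np.random.default_rng(0)  # synthetic data, for illustration only
X = rng.standard_normal((40, 4))
y = X @ np.array([2.0, -1.0, 0.0, 0.5]) + 0.1 * rng.standard_normal(40)

lambdas = np.logspace(-2, 3, 20)
path = ridge_path(X, y, lambdas)
norms = np.linalg.norm(path, axis=1)  # coefficient norm shrinks as lambda grows
```

Each row of `path` is one trained model, so a path with 20 grid points costs 20 solves. That is why the slide stresses that the workload is more than training a single model.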

Page 6: ThinkFast: Scaling Machine Learning to Modern Demands

Brief History of Statistical Learning

Interpretability & Statistical Guarantees

Scalability

Ease of Use

Simple Models

Kernel Methods

Trees & Ensembles

Structured Regularization

Page 7: ThinkFast: Scaling Machine Learning to Modern Demands

Structured Regularization

Losses
• Regression
• Classification
• Ranking
• Motif Finding
• Matrix Factorization
• Feature Embedding
• Data Imputation

Regularizers
• Sparsity
• Spatial / Temporal / Manifold Structure
• Group Structure
• Hierarchical Structure
• Structured & Unstructured Multitask Learning
• …

$\min_{\beta \in \mathbb{R}^d} L(X\beta) + \lambda R(\beta)$
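The general form on this slide separates the loss L from the regularizer R, so the two can be swapped independently. The sketch below illustrates that with proximal gradient descent, a standard method chosen here as an assumption (the slides do not name a solver). The function names, the squared loss, and the two proximal operators are illustrative.

```python
import numpy as np


def proximal_gradient(X, y, loss_grad, prox, lam, step, iters=500):
    """Minimize L(X @ beta) + lam * R(beta) given dL/dz and the prox of R."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ loss_grad(X @ beta, y)          # chain rule through X @ beta
        beta = prox(beta - step * grad, step * lam)  # gradient step, then prox of R
    return beta


def squared_loss_grad(pred, y):
    """Gradient of 0.5 * ||pred - y||^2 with respect to pred."""
    return pred - y


def prox_l1(v, t):
    """Prox of t * ||.||_1 (soft thresholding): promotes sparsity."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def prox_l2_squared(v, t):
    """Prox of t * 0.5 * ||.||_2^2: uniform shrinkage."""
    return v / (1.0 + t)


rng = np.random.default_rng(0)  # synthetic data, for illustration only
X = rng.standard_normal((50, 5))
y = rng.standard_normal(50)
step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the loss gradient

beta_ridge = proximal_gradient(X, y, squared_loss_grad, prox_l2_squared, lam=0.5, step=step)
beta_lasso = proximal_gradient(X, y, squared_loss_grad, prox_l1, lam=5.0, step=step)
```

Changing the regularizer means changing only the `prox` argument. The same applies to `loss_grad` for a different loss.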

Page 8: ThinkFast: Scaling Machine Learning to Modern Demands

The Lasso’s Combinatorial Side

[Figure: lasso regularization path. Coefficient curves labeled 1-4; y-axis: Model Coefficient; x-axis: $\lambda$, starting at 0]
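The combinatorial aspect can be seen by tracking which coefficients are nonzero (the active set) as $\lambda$ changes. The sketch below uses cyclic coordinate descent, a standard lasso solver chosen here as an assumption. The data, the grid of $\lambda$ values, and all names are illustrative.

```python
import numpy as np


def soft_threshold(v, t):
    """Shrink v toward zero by t, returning exactly 0 when |v| <= t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def lasso_cd(X, y, lam, sweeps=200):
    """Cyclic coordinate descent for 0.5 * ||y - X b||_2^2 + lam * ||b||_1."""
    d = X.shape[1]
    beta = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)
    residual = y.copy()
    for _ in range(sweeps):
        for j in range(d):
            residual += X[:, j] * beta[j]   # remove coordinate j's contribution
            rho = X[:, j] @ residual        # correlation with the partial residual
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
            residual -= X[:, j] * beta[j]   # add the updated contribution back
    return beta


rng = np.random.default_rng(0)  # synthetic data, for illustration only
X = rng.standard_normal((60, 4))
y = X @ np.array([3.0, -2.0, 0.0, 1.0]) + 0.1 * rng.standard_normal(60)

lam_max = np.max(np.abs(X.T @ y))  # above this value every coefficient is zero
active_sets = {}
for scale in (1.01, 0.5, 0.1, 0.001):
    beta = lasso_cd(X, y, scale * lam_max)
    active_sets[scale] = np.flatnonzero(beta).tolist()
    print(f"lambda = {scale:5.3f} * lam_max   active set = {active_sets[scale]}")
```

Each distinct active set is a discrete choice among subsets of features, which is one way to read the "combinatorial side" of the title.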

Page 9: ThinkFast: Scaling Machine Learning to Modern Demands

The Database Perspective

Page 10: ThinkFast: Scaling Machine Learning to Modern Demands

The Database Perspective

Feature & label storage

Page 11: ThinkFast: Scaling Machine Learning to Modern Demands

The Database Perspective

Feature & label storage

Data access operations

Page 12: ThinkFast: Scaling Machine Learning to Modern Demands

The Database Perspective

Feature & label storage

Data access operations

ML “Query Language”

$\min_{\beta \in \mathbb{R}^d} L(y - X\beta) + \lambda\|\beta\|_1$

Page 13: ThinkFast: Scaling Machine Learning to Modern Demands

The Database Perspective

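$\min_{\beta_1, \beta_2, \beta_3 \in \mathbb{R}^d} \sum_{t=1}^{3} \left[ L_t(y_t - X_t\beta_t) + \lambda_t R_t(\beta_t) \right] + \omega \left\| [\beta_1\ \beta_2\ \beta_3] \right\|_*$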
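The last term of this objective is the nuclear norm (sum of singular values) of the matrix whose columns are the three task coefficient vectors. It is small when the tasks share low-rank structure. The sketch below computes that norm and its proximal operator, singular value thresholding. The example vectors, the noise level, and the threshold are assumptions made for illustration.

```python
import numpy as np


def nuclear_norm(B):
    """Sum of the singular values of B."""
    return np.linalg.svd(B, compute_uv=False).sum()


def singular_value_threshold(B, t):
    """Prox of t * ||.||_*: soft-threshold the singular values of B."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (U * np.maximum(s - t, 0.0)) @ Vt  # scales column k of U by the shrunk s[k]


# Three tasks whose coefficient vectors are multiples of one shared direction b,
# so the stacked matrix [beta_1 beta_2 beta_3] has rank 1.
b = np.array([1.0, -2.0, 0.5, 0.0])
B_shared = np.column_stack([b, 2 * b, -b])

rng = np.random.default_rng(0)
B_noisy = B_shared + 0.01 * rng.standard_normal(B_shared.shape)  # full rank after noise
B_low_rank = singular_value_threshold(B_noisy, 0.1)              # small singular values removed
```

Thresholding removes the small singular values introduced by the noise and leaves the shared direction. This is how the $\omega$ term couples the three otherwise separate problems.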

Page 14: ThinkFast: Scaling Machine Learning to Modern Demands

The Database Perspective

Feature, label and model storage

Data access operations

ML “Query Language”

$\min_{\beta \in \mathbb{R}^d} L(y - X\beta) + \lambda\|\beta\|_1$

[Diagram: stored models $M_1$, $M_2$, $M_3$]

Page 15: ThinkFast: Scaling Machine Learning to Modern Demands

The Database Perspective

$\min_{\beta \in \mathbb{R}^d} L(y - X\beta) + \lambda\|\beta\|_1$

[Diagram: stored models $M_1$, $M_2$, $M_3$]

Processing Memory

Mathematical Structure

Page 16: ThinkFast: Scaling Machine Learning to Modern Demands

Efficient Feature Storage

Page 17: ThinkFast: Scaling Machine Learning to Modern Demands

“Query Language” Optimization

• Static analysis

$\|y - Xw\|_2^2 + \|w\|_2^2$

$\|y - Xw\|_2^2 + \|w\|_1$

?

$\|y - Xw\|_2^2 + \frac{1}{2}\left(\|w\|_2^2 + \|w\|_1\right)$

Page 18: ThinkFast: Scaling Machine Learning to Modern Demands

“Query Language” Optimization

• Static analysis

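$\|y - Xw\|_2^2 + \|w\|_2^2$

$\|y - Xw\|_2^2 + \|w\|_1$

$\|y - Xw\|_2^2 + \frac{1}{2}\left(\|w\|_2^2 + \|w\|_1\right)$

?

$\varepsilon(y - Xw) + \frac{1}{2}\left(\|w\|_2^2 + \|w\|_1\right)$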
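One possible reading of static analysis here is that the form of the objective alone, before any data is touched, already determines which solver families apply. The sketch below is purely hypothetical: the `Objective` class, the property table, and the solver labels are assumptions for illustration and do not describe the system presented in the talk.

```python
from dataclasses import dataclass

# Hypothetical property table: is each penalty term differentiable everywhere?
SMOOTH_PENALTIES = {"l2_squared": True, "l1": False}


@dataclass(frozen=True)
class Objective:
    loss: str         # e.g. "squared"
    penalties: tuple  # e.g. ("l2_squared", "l1")


def choose_solver(obj: Objective) -> str:
    """Pick a solver family from the objective's form alone, without looking at data."""
    all_smooth = all(SMOOTH_PENALTIES[p] for p in obj.penalties)
    if obj.loss == "squared" and all_smooth:
        return "linear system solve"              # quadratic objective
    if obj.loss == "squared":
        return "proximal gradient or coordinate descent"
    return "general-purpose first-order method"   # unknown loss: assume little


ridge = Objective("squared", ("l2_squared",))
lasso = Objective("squared", ("l1",))
elastic_net = Objective("squared", ("l2_squared", "l1"))
other_loss = Objective("epsilon", ("l2_squared", "l1"))  # placeholder name for a different loss

for name, obj in [("ridge", ridge), ("lasso", lasso),
                  ("elastic net", elastic_net), ("other loss", other_loss)]:
    print(f"{name}: {choose_solver(obj)}")
```

In this toy version, swapping the penalty or the loss changes the answer without any numerical work, which is the sense in which the analysis is static.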

Page 19: ThinkFast: Scaling Machine Learning to Modern Demands

“Query Language” Optimization

• Static analysis
• Runtime analysis

Page 20: ThinkFast: Scaling Machine Learning to Modern Demands

Some Bioinformatics Applications

• Personalized medicine, Memorial Sloan Kettering Cancer Center
  – 35% accuracy improvement over state-of-the-art

• Metagenomic binning and DNA quality assessment, Stanford School of Medicine
  – Previously unsolved problem

• Toxicogenomic analysis, Stanford University
  – Improved on state-of-the-art results

Page 21: ThinkFast: Scaling Machine Learning to Modern Demands

Upcoming

• Massive-scale, character-level sentiment and text analysis on Amazon data
  – Billions of features, hours to solve a model
  – Efficient multitask learning

• Characterize the global limitations of learning word structure
  – Devise provably more efficient regularizers for uncovering structure