ThinkFast: Scaling Machine Learning to Modern Demands
Hristo Paskov
The Genomic Data Deluge
• Precision Medicine Initiative: sequence 1,000,000 genomes
  – $215 million in 2015
  – Pilot study
  – Outputs 10-50 GB/person
How do we analyze all of this data to drive progress?
Massive Data Sources
News
eCommerce
Bioinformatics
100K Genomes
Social Media
The Analysis Refinement Cycle
Data
Model: $\frac{1}{2}\|y - Xw\|_2^2 + \frac{\lambda}{2}\|w\|_2^2$
Solver: $x^{+} = x - \alpha M \nabla f(x)$
Model captures data nuance?
Solver exists, is fast enough?
Yes? / No?
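As a concrete instance of the cycle, the sketch below pairs the ridge model on this slide with a solver of the form $x^{+} = x - \alpha M \nabla f(x)$. The choice of a diagonal (Jacobi) preconditioner for $M$, the step-size rule, and all function names are assumptions made for illustration; the slide does not say which $M$ or $\alpha$ is used.

```python
import numpy as np

def ridge_objective(X, y, w, lam):
    """0.5 * ||y - Xw||_2^2 + 0.5 * lam * ||w||_2^2."""
    r = y - X @ w
    return 0.5 * (r @ r) + 0.5 * lam * (w @ w)

def ridge_gradient(X, y, w, lam):
    """Gradient of the ridge objective with respect to w."""
    return X.T @ (X @ w - y) + lam * w

def preconditioned_gd(X, y, lam, n_iter=200):
    """Iterate w+ = w - alpha * M * grad f(w) with a diagonal M (assumed choice)."""
    d = X.shape[1]
    # Assumption: M is the inverse of the Hessian diagonal (Jacobi preconditioner).
    M = 1.0 / (np.sum(X * X, axis=0) + lam)
    # Step size 1 / lambda_max of the preconditioned Hessian.
    # Forming H explicitly is fine for a sketch; at scale it would be estimated.
    H = X.T @ X + lam * np.eye(d)
    S = np.sqrt(M)
    alpha = 1.0 / np.linalg.eigvalsh(S[:, None] * H * S[None, :]).max()
    w = np.zeros(d)
    for _ in range(n_iter):
        w = w - alpha * M * ridge_gradient(X, y, w, lam)
    return w
```

With badly scaled columns in `X`, the diagonal `M` rescales each coordinate so that a single step size works for all of them, which is one reason a solver of this form can be "fast enough".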
More Than Just Training Models
• Regularization paths
• Model risk assessment
• Interpretability
[Figure: regularization path, model coefficient vs. regularization parameter]
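A regularization path means solving the same model for many values of the regularization parameter. For the ridge model, a minimal sketch (assuming the objective $\frac{1}{2}\|y - Xw\|_2^2 + \frac{\lambda}{2}\|w\|_2^2$; the function name is hypothetical) can reuse one SVD of $X$ for every $\lambda$:

```python
import numpy as np

def ridge_path(X, y, lambdas):
    """Ridge solutions for each lambda, sharing a single thin SVD of X.

    Uses w(lam) = V diag(s / (s^2 + lam)) U^T y, which equals
    (X^T X + lam I)^{-1} X^T y.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Uty = U.T @ y
    path = []
    for lam in lambdas:
        shrink = s / (s ** 2 + lam)   # per-singular-value shrinkage
        path.append(Vt.T @ (shrink * Uty))
    return np.array(path)             # shape: (len(lambdas), n_features)
```

Each row of the result is one point on the path; plotting a column against `lambdas` gives a coefficient-vs-parameter curve of the kind the figure shows.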
Brief History of Statistical Learning
Interpretability & Statistical Guarantees
Scalability
Ease of Use
Simple Models
Kernel Methods
Trees & Ensembles
Structured Regularization
Structured Regularization
Losses
Regression
Classification
Ranking
Motif Finding
Matrix Factorization
Feature Embedding
Data Imputation
…
Regularizers
Sparsity
Spatial / Temporal / Manifold Structure
Group Structure
Hierarchical Structure
Structured & Unstructured Multitask Learning
…
$\min_{\beta \in \mathbb{R}^d} \; L(X\beta) + \lambda R(\beta)$
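The template $L(X\beta) + \lambda R(\beta)$ separates the loss from the regularizer, so a generic solver can accept both as plug-ins. The sketch below uses proximal gradient descent, which is a standard method for this problem class but is not named on the slide; `grad_L`, `prox_R`, and the other names are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (the sparsity regularizer)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_gradient(X, grad_L, prox_R, lam, step, n_iter=500):
    """Minimize L(X beta) + lam * R(beta).

    grad_L(z): gradient of the loss L with respect to z = X beta.
    prox_R(v, t): proximal operator of t * R evaluated at v.
    step: step size, e.g. 1 / ||X||_2^2 for a squared-error loss.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        g = X.T @ grad_L(X @ beta)                # chain rule through X
        beta = prox_R(beta - step * g, step * lam)
    return beta
```

For example, a lasso fit would pass `grad_L = lambda z: z - y`, `prox_R = soft_threshold`, and `step = 1.0 / np.linalg.norm(X, 2) ** 2`; swapping in a different loss gradient or proximal operator changes the model without changing the solver loop.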
The Lasso’s Combinatorial Side
[Figure: lasso coefficient path, model coefficient vs. λ, with labels 1 to 4]
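One way to see the combinatorial side is to track which coefficients are nonzero as λ decreases: the active set changes only at discrete values of λ. The sketch below assumes the objective $\frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$ and uses warm-started coordinate descent, which is a common choice but not necessarily the method behind the figure; function names are hypothetical.

```python
import numpy as np

def lasso_cd(X, y, lam, beta0, n_sweeps=200):
    """Coordinate descent for 0.5 * ||y - X beta||_2^2 + lam * ||beta||_1."""
    beta = beta0.copy()
    col_sq = np.sum(X * X, axis=0)
    r = y - X @ beta                       # current residual
    for _ in range(n_sweeps):
        for j in range(X.shape[1]):
            if col_sq[j] == 0.0:
                continue
            r += X[:, j] * beta[j]         # remove coordinate j from the fit
            rho = X[:, j] @ r
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r -= X[:, j] * beta[j]         # add the updated coordinate back
    return beta

def lasso_path_supports(X, y, lambdas):
    """Return (lam, active_set) pairs, sweeping lambda from large to small."""
    beta = np.zeros(X.shape[1])
    supports = []
    for lam in sorted(lambdas, reverse=True):
        beta = lasso_cd(X, y, lam, beta)   # warm start from the previous lambda
        active = tuple(int(j) for j in np.flatnonzero(beta))
        supports.append((float(lam), active))
    return supports
```

For any λ at or above $\|X^\top y\|_\infty$ the active set is empty; below that, features enter (and may leave) one event at a time, which is the discrete structure underneath the continuous path.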
The Database Perspective
Feature & label storage
Data access operations
ML “Query Language”: $\min_{\beta \in \mathbb{R}^d} \; L(y - X\beta) + \lambda\|\beta\|_1$
The Database Perspective
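$\min_{\beta_1, \beta_2, \beta_3 \in \mathbb{R}^d} \; \sum_{t=1}^{3} \left[ L_t(y_t - X_t\beta_t) + \lambda_t R_t(\beta_t) \right] + \omega \left\| [\beta_1 \; \beta_2 \; \beta_3] \right\|_*$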
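The coupling term $\omega\|[\beta_1 \; \beta_2 \; \beta_3]\|_*$ is a nuclear norm on the matrix whose columns are the per-task coefficient vectors, which encourages the tasks to share a low-rank structure. A minimal sketch of the two pieces a proximal solver would need for that term follows; the function names are hypothetical and the slide does not state how the problem is actually solved.

```python
import numpy as np

def nuclear_norm(B):
    """Sum of the singular values of B."""
    return np.linalg.svd(B, compute_uv=False).sum()

def multitask_penalty(betas, omega):
    """omega * || [beta_1 ... beta_T] ||_* for a list of coefficient vectors."""
    return omega * nuclear_norm(np.column_stack(betas))

def prox_nuclear(B, t):
    """Proximal operator of t * ||.||_*: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (U * np.maximum(s - t, 0.0)) @ Vt
```

Thresholding sets small singular values to exactly zero, so repeated application inside a solver tends to lower the rank of the stacked coefficient matrix.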
The Database Perspective
Feature, label and model storage
Data access operations
ML “Query Language”: $\min_{\beta \in \mathbb{R}^d} \; L(y - X\beta) + \lambda\|\beta\|_1$
[Diagram labels: $M_1$, $M_2$, $M_3$]
The Database Perspective
Processing Memory
Mathematical Structure
Efficient Feature Storage
“Query Language” Optimization
• Static analysis
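$\|y - Xw\|_2^2 + \|w\|_2^2$
$\|y - Xw\|_2^2 + \|w\|_1$
$\|y - Xw\|_2^2 + \frac{1}{2}\left(\|w\|_2^2 + \|w\|_1\right)$
?
$\varepsilon(y - Xw) + \frac{1}{2}\left(\|w\|_2^2 + \|w\|_1\right)$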
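One possible reading of static analysis is that the system inspects the structure of the objective, before touching any data, and picks a solver accordingly. The sketch below is purely illustrative: the `Problem` type, the solver labels, and the dispatch rules are assumptions, not the system's actual design.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Problem:
    """Hypothetical description of an objective: loss(y - Xw) + l2*||w||_2^2 + l1*||w||_1."""
    loss: str          # "squared" or the name of some other loss
    l1: float = 0.0    # weight on ||w||_1
    l2: float = 0.0    # weight on ||w||_2^2

def choose_solver(p: Problem) -> str:
    """Pick a solver family from the objective's structure alone."""
    if p.loss == "squared":
        if p.l1 == 0.0:
            return "linear_system"        # ridge / least squares: closed form
        return "coordinate_descent"       # lasso and elastic net
    # Unrecognized loss: fall back to a generic method.
    return "proximal_gradient"
```

Under these assumed rules, the first three objectives on the slide map to specialized solvers, while the last one (a different loss with the same penalty) falls through to the generic path, which is the case the "?" appears to highlight.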
“Query Language” Optimization
• Static analysis
• Runtime analysis
Some Bioinformatics Applications
• Personalized medicine, Memorial Sloan Kettering Cancer Center
  – 35% accuracy improvement over state-of-the-art
• Metagenomic binning and DNA quality assessment, Stanford School of Medicine
  – Previously unsolved problem
• Toxicogenomic analysis, Stanford University
  – Improved on state-of-the-art results
Upcoming
• Massive-scale character-level sentiment and text analysis on Amazon data
  – Billions of features, hours to solve a model
  – Efficient multitask learning
• Characterize the global limitations of learning word structure
  – Devise provably more efficient regularizers for uncovering structure