ThinkFast: Scaling Machine Learning to Modern Demands

Hristo Paskov

The Genomic Data Deluge

• Precision Medicine Initiative: sequence 1,000,000 genomes
– $215 Million in 2015 – Pilot study
– Outputs 10-50 GB/person

How do we analyze all of this data to drive progress?

Massive Data Sources

News

eCommerce

Bioinformatics

100K Genomes

Social Media

The Analysis Refinement Cycle

⨂

Data

Model: $\frac{1}{2}\|y - Xw\|_2^2 + \frac{\lambda}{2}\|w\|_2^2$

Solver: $x^{+} = x - \alpha M \nabla f(x)$

Model captures data nuance?

Solver exists, is fast enough?

Yes?

No?
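To make the solver step concrete, here is a minimal sketch of the update $x^{+} = x - \alpha M \nabla f(x)$ applied to the ridge objective $\frac{1}{2}\|y - Xw\|_2^2 + \frac{\lambda}{2}\|w\|_2^2$. The data, the step size rule, and the two choices of $M$ (identity and inverse Hessian) are assumptions made for illustration; the slides do not say which $M$ is used.

```python
import numpy as np

def iterate(grad, x0, alpha, M, n_iter):
    """Repeat x <- x - alpha * M @ grad(x) for n_iter steps."""
    x = x0.copy()
    for _ in range(n_iter):
        x = x - alpha * (M @ grad(x))
    return x

# Synthetic data (illustrative only).
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
y = X @ rng.standard_normal(5) + 0.1 * rng.standard_normal(50)
lam = 1.0

# Gradient and Hessian of f(w) = 1/2 ||y - Xw||^2 + lam/2 ||w||^2.
H = X.T @ X + lam * np.eye(5)
grad = lambda w: H @ w - X.T @ y

# Reference solution from the normal equations.
w_closed = np.linalg.solve(H, X.T @ y)

# M = I: plain gradient descent with step 1 / (largest eigenvalue of H).
alpha = 1.0 / np.linalg.eigvalsh(H).max()
w_gd = iterate(grad, np.zeros(5), alpha, np.eye(5), n_iter=500)

# M = H^{-1}, alpha = 1: a Newton step, exact after one iteration on a quadratic.
w_newton = iterate(grad, np.zeros(5), 1.0, np.linalg.inv(H), n_iter=1)
```

Both runs should agree with `w_closed`; the choice of $M$ only changes how many iterations are needed, which is the "is the solver fast enough?" question in the cycle.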

More Than Just Training Models

• Regularization paths
• Model risk assessment
• Interpretability

[Figure: model coefficient vs. regularization parameter]
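A regularization path is the set of model coefficients obtained as the regularization parameter sweeps over a range. As a rough sketch, assuming the ridge objective $\frac{1}{2}\|y - Xw\|_2^2 + \frac{\lambda}{2}\|w\|_2^2$ and synthetic data, one SVD of $X$ can be reused for every $\lambda$. The function name `ridge_path` and the grid of values are hypothetical.

```python
import numpy as np

def ridge_path(X, y, lambdas):
    """Ridge coefficients for each lambda, reusing one SVD of X.

    With X = U diag(s) V^T, the minimizer is
    w(lam) = V diag(s / (s^2 + lam)) U^T y.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Uty = U.T @ y
    return np.array([Vt.T @ (s / (s**2 + lam) * Uty) for lam in lambdas])

# Synthetic data (illustrative only).
rng = np.random.default_rng(1)
X = rng.standard_normal((60, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.standard_normal(60)

lambdas = np.logspace(-2, 3, 20)   # increasing regularization
path = ridge_path(X, y, lambdas)   # one row of coefficients per lambda
norms = np.linalg.norm(path, axis=1)
```

Plotting each column of `path` against `lambdas` gives a coefficient-versus-parameter picture; the coefficient norm shrinks as $\lambda$ grows.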

Brief History of Statistical Learning

Interpretability & Statistical Guarantees

Scalability

Ease of Use

Simple Models

Kernel Methods

Trees & Ensembles

Structured Regularization

Structured Regularization

Losses

Regression

Classification

Ranking

Motif Finding

Matrix Factorization

Feature Embedding

Data Imputation

…

Regularizers

Sparsity

Spatial / Temporal / Manifold Structure

Group Structure

Hierarchical Structure

Structured & Unstructured Multitask Learning

…

$\min_{\beta \in \mathbb{R}^d} L(X\beta) + \lambda R(\beta)$
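One standard way to handle problems of the form $\min_{\beta} L(X\beta) + \lambda R(\beta)$ is proximal gradient descent, where the loss enters through its gradient and the regularizer through its proximal operator. The sketch below is not taken from the slides; it assumes a squared loss $L(z) = \frac{1}{2}\|z - y\|_2^2$ and $R = \|\cdot\|_1$, and the function names are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_gradient(X, loss_grad, prox, lam, step, n_iter=2000):
    """Minimize L(X beta) + lam * R(beta) given grad of L and prox of R."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ loss_grad(X @ beta)         # chain rule for L(X beta)
        beta = prox(beta - step * grad, step * lam)
    return beta

# Synthetic sparse regression problem (illustrative only).
rng = np.random.default_rng(2)
X = rng.standard_normal((40, 10))
beta_true = np.zeros(10)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + 0.05 * rng.standard_normal(40)

lam = 5.0
squared_loss_grad = lambda z: z - y              # gradient of 1/2 ||z - y||^2
step = 1.0 / np.linalg.norm(X, 2) ** 2           # 1 / Lipschitz constant
beta_hat = proximal_gradient(X, squared_loss_grad, soft_threshold, lam, step)

def objective(beta):
    return 0.5 * np.sum((X @ beta - y) ** 2) + lam * np.abs(beta).sum()
```

Swapping in a different `loss_grad` or `prox` changes the model without touching the loop, which is the appeal of writing the problem in this separated form.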

The Lasso’s Combinatorial Side

[Figure: model coefficient vs. λ, with plot labels 0 to 4]
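The combinatorial flavor of the lasso shows up in how the set of nonzero coefficients changes with $\lambda$. As a small illustration (an assumption-laden special case, not the slide's example): when $X$ has orthonormal columns, the minimizer of $\frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$ is a soft-thresholding of $X^\top y$, so coefficient $j$ enters the model exactly when $\lambda$ drops below $|X_j^\top y|$.

```python
import numpy as np

def lasso_orthonormal(X, y, lam):
    """Lasso solution when X has orthonormal columns (X^T X = I)."""
    z = X.T @ y
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Build a design with orthonormal columns (illustrative only).
rng = np.random.default_rng(3)
Q, _ = np.linalg.qr(rng.standard_normal((30, 4)))
y = Q @ np.array([4.0, -3.0, 2.0, 1.0])          # so Q^T y = [4, -3, 2, 1]

# The support changes only at the breakpoints |Q^T y|, sorted descending.
breakpoints = np.sort(np.abs(Q.T @ y))[::-1]

# Count nonzero coefficients for lambda values between the breakpoints.
lams = [4.5, 3.5, 2.5, 1.5, 0.5]
support_sizes = [int(np.count_nonzero(lasso_orthonormal(Q, y, lam))) for lam in lams]
```

Between breakpoints the support is fixed and the coefficients are linear in $\lambda$; for a general design the same piecewise structure holds, but finding the breakpoints is no longer a simple sort.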

The Database Perspective


Feature & label storage



Data access operations




ML “Query Language”

$\min_{\beta \in \mathbb{R}^d} L(y - X\beta) + \lambda \|\beta\|_1$


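$\min_{\beta_1, \beta_2, \beta_3 \in \mathbb{R}^d} \sum_{t=1}^{3} \left[ L_t(y_t - X_t \beta_t) + \lambda_t R_t(\beta_t) \right] + \omega \left\| [\beta_1 \; \beta_2 \; \beta_3] \right\|_*$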
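The coupling term in the multitask problem is the nuclear norm $\|\cdot\|_*$ of the matrix whose columns are $\beta_1, \beta_2, \beta_3$, that is, the sum of its singular values. A minimal sketch of that norm and its proximal operator (singular value thresholding) follows; the matrix `B` is made up for illustration and the function names are hypothetical.

```python
import numpy as np

def nuclear_norm(B):
    """Sum of the singular values of B."""
    return np.linalg.svd(B, compute_uv=False).sum()

def prox_nuclear(B, tau):
    """Proximal operator of tau * ||.||_*: shrink each singular value by tau."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

# Columns are the three task coefficient vectors beta_1, beta_2, beta_3 (d = 4).
# Chosen so the singular values are exactly 3, 0.5 and 0.2.
B = np.zeros((4, 3))
B[0, 0], B[1, 1], B[2, 2] = 3.0, 0.5, 0.2

B_shrunk = prox_nuclear(B, 1.0)   # singular values become 2, 0, 0
```

Penalizing the nuclear norm pushes the stacked coefficients toward low rank, so the tasks end up sharing a small number of common directions.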


Feature, label and model storage


ML “Query Language”

$\min_{\beta \in \mathbb{R}^d} L(y - X\beta) + \lambda \|\beta\|_1$

[Figure: stored models $M_1$, $M_2$, $M_3$]



Processing Memory

Mathematical Structure

Efficient Feature Storage

“Query Language” Optimization

• Static analysis

$\|y - Xw\|_2^2 + \|w\|_2^2$

$\|y - Xw\|_2^2 + \|w\|_1$

$\|y - Xw\|_2^2 + \frac{1}{2}\left(\|w\|_2^2 + \|w\|_1\right)$

?

$\varepsilon(y - Xw) + \frac{1}{2}\left(\|w\|_2^2 + \|w\|_1\right)$

• Runtime analysis
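The slides do not spell out what the static analysis does. As one hedged illustration of the kind of rewrite an analysis of the problem expression could find: the objective $\|y - Xw\|_2^2 + \frac{1}{2}(\|w\|_2^2 + \|w\|_1)$ equals a lasso-form objective on augmented data, because the squared $\ell_2$ penalty can be folded into the least-squares term. All names below are hypothetical.

```python
import numpy as np

def elastic_net_objective(w, X, y):
    """||y - Xw||^2 + 1/2 (||w||_2^2 + ||w||_1)."""
    r = y - X @ w
    return r @ r + 0.5 * (w @ w + np.abs(w).sum())

def augment(X, y, ridge_weight=0.5):
    """Stack sqrt(ridge_weight) * I under X and zeros under y."""
    d = X.shape[1]
    X_aug = np.vstack([X, np.sqrt(ridge_weight) * np.eye(d)])
    y_aug = np.concatenate([y, np.zeros(d)])
    return X_aug, y_aug

def lasso_form_objective(w, X_aug, y_aug, l1_weight=0.5):
    """||y_aug - X_aug w||^2 + l1_weight * ||w||_1."""
    r = y_aug - X_aug @ w
    return r @ r + l1_weight * np.abs(w).sum()

# Check the identity at a random point (illustrative data only).
rng = np.random.default_rng(4)
X = rng.standard_normal((20, 6))
y = rng.standard_normal(20)
w = rng.standard_normal(6)

X_aug, y_aug = augment(X, y)
original = elastic_net_objective(w, X, y)
rewritten = lasso_form_objective(w, X_aug, y_aug)
```

Since the two objectives agree for every `w`, a solver that already handles the lasso form could be reused on the augmented data without a new algorithm.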

Some Bioinformatics Applications

• Personalized medicine, Memorial Sloan Kettering Cancer Center
– 35% accuracy improvement over state-of-the-art

• Metagenomic binning and DNA quality assessment, Stanford School of Medicine
– Previously unsolved problem

• Toxicogenomic analysis, Stanford University
– Improved on state-of-the-art results

Upcoming

• Massive scale character level sentiment and text analysis on Amazon data
– Billions of features, hours to solve a model
– Efficient multitask learning

• Characterize the global limitations of learning word structure
– Devise provably more efficient regularizers for uncovering structure