Training, tuning, selecting, and serving machine learning models at scale
Peter Rudenko (@peter_rud)
Typical machine learning workflow:
● Input data
● ETL
● Preprocessing, feature engineering
● Data partitioning
● Model training (optimising model parameters)
● Model tuning (selecting the best hyperparameters)
● Prediction: low latency vs. batch
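The steps above can be sketched as a small scikit-learn pipeline (an illustrative sketch, not the talk's actual stack; the dataset and step names are made up):

```python
# Sketch of the workflow: partition the data, wrap preprocessing and
# the model into one pipeline, then evaluate batch predictions on the
# held-out split.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, random_state=0)

# Data partitioning
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("features", StandardScaler()),    # preprocessing / feature engineering
    ("model", LogisticRegression()),   # model training
])
pipe.fit(X_tr, y_tr)                   # optimise model parameters
print(pipe.score(X_te, y_te))          # batch prediction on test data
```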
Automatic Machine Learning:
● FBLearner Flow (Facebook)
● Deep Feature Synthesis: Towards Automating Data Science Endeavors (MIT)
● Datarobot.com
Data partitioning: input data → training data + test data.
Balanced vs. skewed target distribution.
The devil is in the details:
● Partitioning
● Leakage
● Sample size
http://blog.mrtz.org/2015/03/09/competition.html
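One common safeguard for a skewed target is stratified partitioning, which keeps the positive rate comparable in train and test splits; a minimal sketch with scikit-learn (the 3.5% rate is illustrative, echoing the event rate mentioned later in the deck):

```python
# With a ~3.5% positive rate, a naive random split can distort the
# class balance; stratified splitting preserves it in both partitions.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = (rng.random(10_000) < 0.035).astype(int)   # skewed target

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(y_tr.mean(), y_te.mean())   # both close to the overall ~3.5% rate
```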
C order ('C') lays out rows contiguously in memory; Fortran order ('F') lays out columns contiguously:
In [41]: import numpy
In [42]: ar2d = numpy.array([[1, 2, 3], [11, 12, 13], [10, 20, 40]], dtype='uint8', order='C')
In [43]: ' '.join(str(b) for b in ar2d.tobytes(order='A'))
Out[43]: '1 2 3 11 12 13 10 20 40'
In [44]: ar2df = numpy.array([[1, 2, 3], [11, 12, 13], [10, 20, 40]], dtype='uint8', order='F')
In [45]: ' '.join(str(b) for b in ar2df.tobytes(order='A'))
Out[45]: '1 11 10 2 12 20 3 13 40'
Big Data?
Criteo 1 TB dataset:
● ~46 GB/day
● ~180,000,000 events/day
● ~3.5% event rate
Raw data: 1 TB @ 3.5% events
Data: 189 GB in columnar Parquet format
Balanced classes: 70 GB (12 GB in Parquet)
Scalability! But at what COST?
“You can have a second computer once you’ve shown you know how to use the first one.” – Paul Barham
50 shades of machine learning
● Supervised / Unsupervised / Semi-supervised
● Classification / Regression / Sequence prediction / Structure prediction
● Reinforcement learning
● Time series forecasting
● Clustering / Dimensionality reduction
● Topic modeling
● Recommendation
● Online/Streaming ML
● Ranking
● Survival analysis
● Anomaly detection
Buzzword maker: REALTIME + BIG DATA + one or two of the items above = Profit
Model state (knowledge) vs hyperparameters
LEARNING = REPRESENTATION + EVALUATION + OPTIMIZATION
* Pedro Domingos, A few useful things to know about machine learning, 2012.
Evaluation = LossFunction(Prediction, True label)
Optimization: model parameters vs. hyperparameters
Combinatorial optimization:
● Greedy search
● Beam search
● Branch-and-bound

Continuous optimization:
● Unconstrained
  ○ Gradient descent
  ○ Conjugate gradient
  ○ Quasi-Newton methods
● Constrained
  ○ Linear programming
  ○ Quadratic programming
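A minimal sketch of unconstrained gradient descent on a least-squares loss L(w) = ||Xw − y||²/n (all numbers illustrative):

```python
# Gradient descent: repeatedly step against the gradient of the loss.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = np.zeros(3)
lr = 0.1
for _ in range(500):
    grad = 2 * X.T @ (X @ w - y) / len(y)  # dL/dw
    w -= lr * grad                         # step against the gradient

print(np.round(w, 2))  # recovers something close to true_w
```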
Hyperparameter optimization:
● Grid search
● Random search
● Bayesian optimization
● Tree of Parzen Estimators (TPE)
● Gradient-based optimization
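Random search, for example, samples configurations from distributions instead of enumerating a grid; a hedged sketch with scikit-learn's RandomizedSearchCV (the model choice and parameter ranges are illustrative):

```python
# Random search over gradient-boosting hyperparameters: each of the
# n_iter trials draws one configuration from the distributions below
# and scores it with 3-fold cross-validation.
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, random_state=0)

param_distributions = {
    "n_estimators": randint(10, 60),
    "max_depth": randint(2, 5),
    "learning_rate": uniform(0.01, 0.3),
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions,
    n_iter=8,                 # try 8 random configurations
    cv=3,
    scoring="neg_log_loss",
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```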
Distributed Machine Learning
Does the model fit in memory? Does the data fit in memory?
● Data and model fit in memory → single machine
● Data doesn't fit in memory → distributed data (HDFS, Spark)
● Neither data nor model fits → distributed data, distributed models
Distributed Machine Learning
Data parallelism: each worker trains on its own shard (Data 1 → Model 1, …, Data N → Model N).
Model parallelism: a single model is partitioned across workers.
Parameter servers:
http://parameterserver.org/
https://github.com/intel-machine-learning/DistML
http://www.dmtk.io/
https://petuum.github.io/bosen.html
Speed up distributed machine learning
● Approximate all the things
● Update asynchronously
● Early stopping
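Early stopping can be as simple as a patience counter on the validation loss; an illustrative sketch (the `step` and `val_loss` callables are hypothetical, not a real library API):

```python
# Stop training when the validation loss hasn't improved for
# `patience` consecutive checks.
def train_with_early_stopping(step, val_loss, max_steps=1000, patience=5):
    best, bad = float("inf"), 0
    for i in range(max_steps):
        step()                        # one training update
        loss = val_loss()
        if loss < best - 1e-6:        # meaningful improvement
            best, bad = loss, 0
        else:
            bad += 1
            if bad >= patience:       # no progress: stop early
                return i + 1, best
    return max_steps, best

# Simulated run: loss improves, then plateaus, so training stops early.
losses = iter([1.0, 0.5, 0.3, 0.31, 0.32, 0.33, 0.4])
print(train_with_early_stopping(lambda: None, lambda: next(losses),
                                max_steps=100, patience=3))  # (6, 0.3)
```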
“We draw inspiration from the high-level programming models of dataflow systems, and the low-level efficiency of parameter servers.”
— TensorFlow: A system for large-scale machine learning
A better model when time is the constraint
Cost-based optimization
Automating Model Search for Large Scale Machine Learning
Apache SystemML: Automatic Optimization
Algorithms specified in DML and PyDML are dynamically compiled and optimized based on data and cluster characteristics using rule-based and cost-based optimization techniques. The optimizer automatically generates hybrid runtime execution plans ranging from in-memory single-node execution to distributed computations on Spark or Hadoop. This ensures both efficiency and scalability. Automatic optimization reduces or eliminates the need to hand-tune distributed runtime execution plans and system configurations.
Ensembles:
● Bagging
● Boosting
● Blending
● Stacking
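Bagging, for instance, trains base learners on bootstrap resamples and majority-votes their predictions; a minimal hand-rolled sketch (scikit-learn's BaggingClassifier is the production version of the same idea; all sizes are illustrative):

```python
# Hand-rolled bagging: each tree sees a bootstrap resample of the
# training set; predictions are combined by majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
rng = np.random.default_rng(0)

models = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample
    models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

votes = np.mean([m.predict(X) for m in models], axis=0)  # average vote
pred = (votes > 0.5).astype(int)
print((pred == y).mean())
```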
Dark knowledge
http://www.ttic.edu/dl/dark14.pdf https://www.youtube.com/watch?v=EK61htlw8hY
Test time prediction
● Different environment
● Different hardware
● Different requirements
Types of model transferring:
1. Model serialization
   - Bound to a single language
   - Bound to a single version
2. Metadata + data (Spark 2.0; https://tensorflow.github.io/serving/)
3. PMML (http://dmg.org/pmml/v4-2-1/GeneralStructure.html)
4. PFA (http://dmg.org/pfa/index.html)
5. Code generation (h2o.ai)
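Option 1's limitations are easy to see with Python's pickle; an illustrative sketch (the model and data are made up):

```python
# Python-native pickling: the blob can only be restored from Python,
# with a compatible scikit-learn version available at load time. That
# is the "single language / single version" limitation above.
import pickle

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

blob = pickle.dumps(model)      # serialize the fitted model
restored = pickle.loads(blob)   # needs sklearn importable here
assert (restored.predict(X) == model.predict(X)).all()
```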
Papers & articles:
http://tullo.ch/articles/decision-tree-evaluation/
https://blog.acolyer.org/2016/02/29/machine-learning-the-high-interest-credit-card-of-technical-debt/
https://blog.acolyer.org/2016/03/01/ad-click-prediction-a-view-from-the-trenches/
Automating Model Search for Large Scale Machine Learning
Thanks! Q&A