Benchmarking Machine Learning Tools for Scalability, Speed and Accuracy
Szilárd Pafka, PhD
Chief Scientist, Epoch
H2O World Conference, Mountain View Nov 2015
Disclaimer:
I am not representing my employer (Epoch) in this talk
I can neither confirm nor deny whether Epoch is using any of the methods, tools, results etc. mentioned in this talk. The results presented in this talk should not be taken as any indication of whether Epoch uses these methods, tools, results etc. or not.
I usually use other people’s code [...] it is usually not “efficient” (from time budget perspective) to write my own algorithm [...] I can find open source code for what I want to do, and my time is much better spent doing research and feature engineering
-- Owen Zhang
http://blog.kaggle.com/2015/06/22/profiling-top-kagglers-owen-zhang-currently-1-in-the-world/
Data Size for Supervised Learning
# records: <10M | 10M-10B | >10B
Data Size for Non-Linear Supervised Learning
# records: <1M | 1M-100M | >100M
binary classification, 10M records
numeric & categorical features, non-sparse
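A minimal sketch of how such a training set might be loaded and prepared with pandas/scikit-learn-style one-hot encoding; the file name and column names below are illustrative assumptions, not something specified on the slide.

```python
# Hypothetical setup for the benchmark task: binary classification on ~10M rows
# with a mix of numeric and categorical features, kept non-sparse.
# "train-10m.csv" and the target column "dep_delayed_15min" are assumptions.
import pandas as pd

df = pd.read_csv("train-10m.csv")

y = (df["dep_delayed_15min"] == "Y").astype(int)             # binary target
X = pd.get_dummies(df.drop(columns=["dep_delayed_15min"]))   # one-hot encode categoricals (dense)

print(X.shape, y.mean())
```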
http://www.cs.cornell.edu/~alexn/papers/empirical.icml06.pdf
http://lowrank.net/nikos/pubs/empirical.pdf
- R packages 30%
- Python scikit-learn 40%
- Vowpal Wabbit 8%
- H2O 10%
- xgboost 8%
- Spark MLlib 6%
- a few others
EC2
Distributed computation generally is hard, because it adds an additional layer of complexity and [network] communication overhead. The ideal case is scaling linearly with the number of nodes; that’s rarely the case. Emerging evidence shows that very often, one big machine, or even a laptop, outperforms a cluster.
http://fastml.com/the-emperors-new-clothes-distributed-machine-learning/
n = 10K, 100K, 1M, 10M, 100M
Training time
RAM usage
AUC
CPU % by core
read data, pre-process, score test data
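For each tool and each training set size n, the timing and AUC measurements could be collected along these lines; the scikit-learn random forest is only one of the benchmarked tools, the split and tree count are assumptions, and RAM / per-core CPU usage would be tracked outside the script (e.g. with top/htop).

```python
# Sketch of measuring training time and test AUC for one tool (scikit-learn RF);
# X, y come from the data-preparation sketch above.
import time
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)   # 500 trees, all cores (assumed settings)

t0 = time.time()
clf.fit(X_train, y_train)
train_time = time.time() - t0                               # wall-clock training time

auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(f"train time: {train_time:.0f}s  AUC: {auc:.4f}")
```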
[Plots: AUC (accuracy) vs. training set size -- the linear model's accuracy tops off as data grows; more data with a better (non-linear) algorithm keeps improving; a random forest trained on 1% of the data beats the linear model trained on all of the data]
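The last comparison in the plot could be reproduced roughly as below; scikit-learn stands in for whichever tool is used, and the 1% sampling is only a sketch of the idea, not the talk's exact procedure.

```python
# Linear model on the full training set vs. random forest on a 1% sample;
# reuses X_train, X_test, y_train, y_test from the measurement sketch above.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
auc_linear_full = roc_auc_score(y_test, lr.predict_proba(X_test)[:, 1])

X_small = X_train.sample(frac=0.01, random_state=42)        # 1% of the rows
y_small = y_train.loc[X_small.index]
rf = RandomForestClassifier(n_estimators=500, n_jobs=-1).fit(X_small, y_small)
auc_rf_1pct = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])

print(auc_linear_full, auc_rf_1pct)
```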
10x
http://datascience.la/benchmarking-random-forest-implementations/#comment-53599
we will continue to run large [...] jobs to scan petabytes of [...] data to extract interesting features, but this paper explores the interesting possibility of switching over to a multi-core, shared-memory system for efficient execution on more refined datasets [...] e.g., machine learning
http://openproceedings.org/2014/conf/edbt/KumarGDL14.pdf
learn_rate = 0.1, max_depth = 6, n_trees = 300
learn_rate = 0.01, max_depth = 16, n_trees = 1000
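Expressed with xgboost's scikit-learn wrapper (one of the tools on the tool list; the slide does not tie these settings to a specific implementation), the two GBM configurations would look roughly like this.

```python
# The two GBM settings from the slide, mapped to xgboost parameter names
# (learn_rate -> learning_rate, n_trees -> n_estimators); the train/test
# split is reused from the earlier sketches.
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score

configs = {
    "lr=0.1, depth=6, 300 trees":    XGBClassifier(learning_rate=0.1,  max_depth=6,  n_estimators=300,  n_jobs=-1),
    "lr=0.01, depth=16, 1000 trees": XGBClassifier(learning_rate=0.01, max_depth=16, n_estimators=1000, n_jobs=-1),
}

for name, model in configs.items():
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(name, auc)
```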
Non-Linear Supervised Learning
# records: <1M | 1M-100M | >100M