Impact of hybrid optimization strategies on distributed machine learning algorithms

Prateek Gaur
Technische Universität Berlin, IT4BI

Thesis Advisors: Max Heimel and Christoph Boden
Thesis Supervisor: Dr. Volker Markl

Thesis defense, September 4, 2014

Motivation

Close enough is not good enough!!

Hadoop is slower than it has to be

Performance tuning and optimizations

Hadoop and its successors are still interesting

Computation is cheap

Large scale iterative tasks pose a major threat

Existing MR-based solutions trade off performance against accuracy

Outline

1 Introduction
    What is Large Scale Machine Learning?
    Contributions
    Iterations and MapReduce
    Why a new Technique?

2 Proposed Approach
    Existing Approaches
    Hybrid Training for Clustering

3 Evaluation
    Where do our datasets come from?
    A sample Usecase
    Results

4 Summary

Large Scale ML

Descriptive => Predictive Analytics

Training examples can’t fit on a single machine

100+ features present per example

Training data arrives continuously

Subsampling undesirable

Contributions

Proposed and tested a large-scale supervised and unsupervised ML technique using existing solutions

Can solve a variety of use cases by adjusting various properties

Simple architecture, using existing techniques

Platform agnostic, as it operates at the algorithmic level
Not comparing different computing platforms [1]
No custom distributed computing

Iterative MapReduce

What are Iterative Learning Algorithms?

PageRank, KMeans, Neural Networks

Shortcomings

High startup costs
Awkward state retention
Single Reducer Problem, Straggler Effect

MapReduce is not designed to run iterations efficiently

Existing Techniques: Global/Batch vs. Local/Online

When all you have is a hammer, then get rid of everything that’s not a nail! [J. Lin, Twitter]

Can’t give up on accuracy: YouTube recommendations
Can’t give up on speed: Twitter’s Trending topics

Proposed Technique: Hybrid

1 Examples divided uniformly across all the nodes

2 Single online pass

3 Centrally, take the average of the online results to get a warm start for batch optimization

4 Centralized batch step across the cluster
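The four steps above can be sketched in plain Python for logistic regression. This is a minimal single-process illustration under stated assumptions, not the thesis implementation: the function names (`sgd_pass`, `batch_gradient_descent`, `hybrid_train`, `make_data`), the learning rates, the round-robin partitioning and the synthetic data are all hypothetical choices made for the example.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(w, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def log_loss(w, data):
    # Average logistic loss, with probabilities clipped away from 0 and 1.
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = min(max(predict(w, x), eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)

def sgd_pass(data, dim, lr=0.1):
    # Step 2: a single online pass over one node's partition.
    w = [0.0] * dim
    for x, y in data:
        err = predict(w, x) - y
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

def batch_gradient_descent(data, w, lr=0.5, iterations=20):
    # Step 4: full-gradient iterations starting from the given weights.
    n = len(data)
    for _ in range(iterations):
        grad = [0.0] * len(w)
        for x, y in data:
            err = predict(w, x) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wi - lr * gj / n for wi, gj in zip(w, grad)]
    return w

def hybrid_train(data, dim, nodes=4, iterations=20):
    # Step 1: divide the examples uniformly across the (simulated) nodes.
    partitions = [data[i::nodes] for i in range(nodes)]
    # Step 2: one online pass per partition.
    local = [sgd_pass(p, dim) for p in partitions]
    # Step 3: average the local models to obtain a warm start.
    warm = [sum(ws[j] for ws in local) / nodes for j in range(dim)]
    # Step 4: batch optimization from the warm start.
    return warm, batch_gradient_descent(data, warm, iterations=iterations)

def make_data(n=400, seed=0):
    # Synthetic data (assumption): the label depends linearly on two features;
    # the first entry of each example is a constant bias feature.
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        data.append(([1.0, x1, x2], 1 if x1 + 2 * x2 > 0 else 0))
    return data
```

Calling `warm, final = hybrid_train(make_data(), dim=3)` returns both the averaged warm start and the weights after the batch step, so `log_loss` can be compared before and after the batch phase.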

Logistic Regression

Machine Learning as an optimization problem
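As one concrete reading of this slide title, logistic regression can be stated as minimizing an average loss over the training set. The formulation below is the standard textbook one and rests on assumptions: binary labels y_i in {0, 1} and a parameter vector w. The symbols h (hypothesis) and l (loss) follow the update rules on the Numerical Optimizations slide.

```latex
\min_{w} \; \frac{1}{n} \sum_{i=1}^{n} l\big(h(x_i; w),\, y_i\big),
\qquad
h(x; w) = \frac{1}{1 + e^{-w^{\top} x}},
\qquad
l(p, y) = -\,y \log p - (1 - y) \log (1 - p)
```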

Numerical Optimizations: Existing Solutions

Gradient Descent

\[ w^{(t+1)} = w^{(t)} - \gamma^{(t)} \frac{1}{n} \sum_{i=1}^{n} \nabla l\big(h(x_i; w^{(t)}), y_i\big) \]

batch learning

Stochastic Gradient Descent

\[ w^{(t+1)} = w^{(t)} - \gamma^{(t)} \nabla l\big(h(x; w^{(t)}), y\big) \]

online learning

Solves the iteration problem, but what about the Single Reducer problem?
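To make the Single Reducer question concrete, the sketch below writes one batch gradient step as a map phase (per-partition partial gradients) followed by a reduce phase (a single sum), next to a plain SGD step. It assumes a linear model with squared loss purely for brevity; the function names are hypothetical and nothing here is taken from the thesis code.

```python
from functools import reduce

def map_partial_gradient(partition, w):
    # Map side: each partition emits (gradient sum, example count).
    # Squared loss 0.5 * (h - y)^2 with linear h gives the gradient (h - y) * x.
    grad = [0.0] * len(w)
    for x, y in partition:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj
    return grad, len(partition)

def reduce_gradients(a, b):
    # Reduce side: all partial results meet in one place to be summed.
    return [ga + gb for ga, gb in zip(a[0], b[0])], a[1] + b[1]

def batch_step(partitions, w, lr):
    # One batch iteration: map over partitions, reduce to a single gradient.
    grad, n = reduce(
        reduce_gradients,
        (map_partial_gradient(p, w) for p in partitions),
    )
    return [wi - lr * gj / n for wi, gj in zip(w, grad)]

def sgd_step(w, x, y, lr):
    # One online update from a single example; no aggregation needed.
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [wi - lr * err * xi for wi, xi in zip(w, x)]
```

Every batch iteration funnels all partial gradients through `reduce_gradients`, which is where a single reducer becomes the bottleneck. The SGD step avoids the aggregation but only ever sees one example at a time.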

Ensembles

A set of classifiers can outperform a single classifier

Learn multiple alternative models for a single concept
Combine their decisions to reach the final decision

Some are sequential, not MR friendly

Boosting

Others rely on randomization

Bagging
Embarrassingly parallel
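A minimal bagging sketch shows why this kind of ensemble parallelizes so easily: each model is trained on its own bootstrap sample with no communication, and the models are combined only at prediction time by majority vote. The one-dimensional threshold "stump" used as the base learner and all function names are assumptions made for this illustration.

```python
import random
from collections import Counter

def train_stump(sample):
    # Base learner: choose the threshold on a 1-D feature that makes
    # the fewest training errors when predicting 1 for x > threshold.
    best_t, best_err = None, None
    for t, _ in sample:
        err = sum((x > t) != bool(y) for x, y in sample)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t

def bagging(data, n_models=5, seed=0):
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Each model sees an independent bootstrap sample, so the
        # training loop could run on separate nodes without coordination.
        sample = [rng.choice(data) for _ in data]
        models.append(train_stump(sample))
    return models

def predict(models, x):
    # Combine the individual decisions by majority vote.
    votes = Counter(int(x > t) for t in models)
    return votes.most_common(1)[0][0]
```

Boosting does not fit this pattern because each round reweights the examples based on the previous model's errors, which forces the rounds to run one after another.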

Classification: Hybrid Training Architecture

KMeans

Group n d-dimensional data points into k disjoint sets so as to minimize the within-cluster sum of squared distances to the cluster centers

Batch Learning: Lloyd’s KMeans

Online Learning: Streaming KMeans
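The contrast between the two learning modes can be sketched as follows. `lloyd_step` is one iteration of Lloyd's algorithm; `online_pass` uses the classic sequential k-means update as a simple stand-in for online learning and is not the Streaming KMeans algorithm named on the slide. All function names are hypothetical.

```python
def sqdist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def nearest(centroids, p):
    # Index of the centroid closest to point p.
    return min(range(len(centroids)), key=lambda j: sqdist(centroids[j], p))

def cost(points, centroids):
    # K-means objective: sum of squared distances to the nearest centroid.
    return sum(sqdist(centroids[nearest(centroids, p)], p) for p in points)

def lloyd_step(points, centroids):
    # Batch: assign every point, then recompute each centroid as the
    # mean of its assigned points (empty clusters keep their centroid).
    groups = [[] for _ in centroids]
    for p in points:
        groups[nearest(centroids, p)].append(p)
    return [
        [sum(c) / len(g) for c in zip(*g)] if g else list(old)
        for g, old in zip(groups, centroids)
    ]

def online_pass(points, centroids):
    # Online: a single pass that nudges the nearest centroid toward each
    # point, treating the initial centroid as one already-seen point.
    centroids = [list(c) for c in centroids]
    counts = [1] * len(centroids)
    for p in points:
        j = nearest(centroids, p)
        counts[j] += 1
        centroids[j] = [c + (x - c) / counts[j] for c, x in zip(centroids[j], p)]
    return centroids
```

Lloyd's step needs the full dataset for every iteration, while the online pass touches each point once and can run independently on each partition.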

Clustering: Hybrid Training

Datasets

Example Usecase: Named Entity Recognition

Identify persons from the ClueWeb dataset ᵃ

ᵃ J. Callan, M. Hoy, C. Yoo, and L. Zhao. ClueWeb09 data set, 2009.

Accuracy paradox -> F-score
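The accuracy paradox is easy to reproduce on an imbalanced toy example. The counts below are made up for illustration (5 person tokens among 100), not taken from the ClueWeb experiments, and the function names are hypothetical.

```python
def confusion(y_true, y_pred):
    # Counts of true/false positives and negatives for binary labels.
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    return tp, fp, fn, tn

def accuracy(y_true, y_pred):
    tp, _, _, tn = confusion(y_true, y_pred)
    return (tp + tn) / len(y_true)

def f_score(y_true, y_pred):
    # F1: harmonic mean of precision and recall (0 when undefined).
    tp, fp, fn, _ = confusion(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical data: 5 "person" tokens among 100.
y_true = [1] * 5 + [0] * 95
# A model that never predicts "person".
always_negative = [0] * 100
# A model that finds 4 of the 5 persons at the cost of 6 false alarms.
finds_most = [1] * 4 + [0] * 1 + [1] * 6 + [0] * 89
```

Here the always-negative model reaches an accuracy of 0.95 with an F-score of 0, while the model that actually finds persons has a lower accuracy (0.93) but an F-score of about 0.53.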

Overview of the Results

Tested for Quality and Performance

Online training

is faster than batch but suboptimal
performance and quality can be improved by ensembling, but only up to a limit

Hybrid training

achieves better performance than batch and online
achieves better quality faster
is insensitive to the size and complexity of the dataset and to the choice of quality metric
is sensitive to the choice of hyperparameters

Classification: Precision (Avg. 3 Runs, Ensemble=4, 50 Iterations)

Classification: Scaling

Classification: Running Time (Avg. 3 Runs, Ensemble=4, 50 Iterations)

Approx. 1 iteration required to break even

Clustering Quality (Avg. 3 Runs, 10 Iterations)

Other Works

Large Scale ML

Google’s Sibyl ¹: Improved Iterative
Different computation platforms [2]: Spark, Flink, GraphLab

Iterative extensions for Hadoop

HaLoop, Twister, PrIter [3]

Hybrid

Terascale learning ²

Twitter’s Summingbird ³: A Framework for Integrating Batch and Online MapReduce Computations

¹ T. Chandra, E. Ie, K. Goldman, T. L. Llinares, J. McFadden, F. Pereira, J. Redstone, T. Shaked, and Y. Singer. Sibyl: a system for large scale machine learning. Keynote I PowerPoint presentation, Jul. 28, 2010.

² A. Agarwal, O. Chapelle, M. Dudík, and J. Langford. A reliable effective terascale linear learning system.

³ O. Boykin, S. Ritchie, I. O’Connell, and J. Lin. Summingbird: A framework for integrating batch and online MapReduce computations.

Conclusion

Contributions:

Evaluated existing techniques: Online and Batch
Proposed a “hybrid” technique that offers the “best of both worlds”
Different scenarios to show its effectiveness
Identified the shortcomings of the proposed approach against the existing ones

Hybrid approach proves promising

re-uses existing code
platform agnostic

Github: https://github.com/gaurprateek/parallelml

Future Work

Test and contribute to Flink and extend to Spark

Semi-supervised learning and Ranking

Implementation-based optimizations

An extra knob to choose between online and batch

Acknowledgements

Thanks to:

My committee

My Advisors: Max Heimel and Christoph Boden

Dr. Volker Markl; academic supervisor

Dr. Ralf-Detlef Kutsche; program coordinator

Johannes Kirschnick; database queries

Bibliography

[1] A. Alexandrov, R. Bergmann, S. Ewen, J.-C. Freytag, F. Hueske, A. Heise, O. Kao, M. Leich, U. Leser, V. Markl et al., “The Stratosphere platform for big data analytics,” The VLDB Journal, pp. 1–26, 2014.

[2] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: cluster computing with working sets,” in Proceedings of the 2nd USENIX conference on Hot topics in cloud computing, 2010, pp. 10–10.

[3] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, and G. Fox, “Twister: a runtime for iterative MapReduce,” in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing. ACM, 2010, pp. 810–818.
