40
Does Python stand a chance in today’s world of data science? Radim Řehůřek

Does Python stand a chance in today's world of data science?

Embed Size (px)

Citation preview

Page 1: Does Python stand a chance in today's world of data science?

Does Python stand a chance in today’s world of data science?

Radim Řehůřek

Page 2: Does Python stand a chance in today's world of data science?

YES

Page 3: Does Python stand a chance in today's world of data science?
Page 4: Does Python stand a chance in today's world of data science?
Page 5: Does Python stand a chance in today's world of data science?
Page 6: Does Python stand a chance in today's world of data science?

RaRe Technologies Ltd.

Page 7: Does Python stand a chance in today's world of data science?

Python vs. rest

● performance?● deployment?● logging, debugging?● workflow, integration?

Page 8: Does Python stand a chance in today's world of data science?

SVD

Page 9: Does Python stand a chance in today's world of data science?

English Wikipedia

● ~3.5M docs● ~2G words● with 100K vocab, ~0.5G matrix non-zeros

○ very sparse● small-ish, but known & accessible and out -

of-core

Page 10: Does Python stand a chance in today's world of data science?
Page 11: Does Python stand a chance in today's world of data science?
Page 12: Does Python stand a chance in today's world of data science?
Page 13: Does Python stand a chance in today's world of data science?

Spark mllib

● top level Apache project, Scala● RDDs, Resilient Distributed Datasets● ~RAM caching + execution engine● latest Spark 1.3.0 + mllib● AWS EMR cluster (4x m3.xlarge)

Page 14: Does Python stand a chance in today's world of data science?

SVD @ mllib

Page 15: Does Python stand a chance in today's world of data science?
Page 16: Does Python stand a chance in today's world of data science?

Mahout SSVD

● the “scikit-learn” of Hadoop, Java● originally on MapReduce● now Mahout Samsara @ Spark, Scala● newest Mahout 0.10.0

Page 17: Does Python stand a chance in today's world of data science?

+ “local mode” eats up all disk, then fails

Page 18: Does Python stand a chance in today's world of data science?

Google’s word2vec

unsupervised ML● Berlin is to Germany as Paris is to …?● king - man + woman = queen● which word doesn’t fit? “dinner cereal

breakfast lunch”

http://radimrehurek.com/2014/02/word2vec-tutorial/#app

Page 19: Does Python stand a chance in today's world of data science?

Word2vec @ Wikipedia

Page 20: Does Python stand a chance in today's world of data science?

C vs. NumPy vs. optimized

(+pure Python: 120x slower than baseline)

Page 21: Does Python stand a chance in today's world of data science?

Single machine parallelization

C (1/2/4 workers): 1.0x / 1.9x / 3.2xgensim: 1.0x / 1.75x / 2.85x

Page 22: Does Python stand a chance in today's world of data science?

streaming (Python generator) for input

+ amazing Python ecosystem on either end!

Page 23: Does Python stand a chance in today's world of data science?
Page 24: Does Python stand a chance in today's world of data science?
Page 25: Does Python stand a chance in today's world of data science?

word2vec @ mllib

Page 26: Does Python stand a chance in today's world of data science?
Page 27: Does Python stand a chance in today's world of data science?

scaling down for Spark

=> if scaling linearly, Spark needs a cluster of ~12 machines to break even (vs. pySpark)

Page 28: Does Python stand a chance in today's world of data science?

Deeplearning4j

David Przybilla, Idio Ltd.https://github.com/idio/wiki2vec

Page 29: Does Python stand a chance in today's world of data science?

ANN libs @ Wikipedia

Page 30: Does Python stand a chance in today's world of data science?

“Do one thing and do it well.”Doug McIlroy

"Every program attempts to expand until it can read mail. Those programs which cannot so expand are replaced by ones which can."

Zawinski’s law of software development

Tools

Page 31: Does Python stand a chance in today's world of data science?

APIs & Abstractions

Page 32: Does Python stand a chance in today's world of data science?

Configuration, setup, deployment

Python 2 vs Python 3

… the real work!

Page 33: Does Python stand a chance in today's world of data science?

Java

Page 34: Does Python stand a chance in today's world of data science?

Complex pipelines● Python: Luigi (~Spotify), Pinball (~Pinterest)● Java: Apache Oozie, Azkaban...

Page 35: Does Python stand a chance in today's world of data science?

Logging

● UI, job trackers● tracebacks, continuous● configurable● human readable

Page 36: Does Python stand a chance in today's world of data science?

Navigating the tool landscape

Let it go; if it’s meant to be, it will come back.

Page 37: Does Python stand a chance in today's world of data science?

Take “progress” easy

Page 38: Does Python stand a chance in today's world of data science?

Summary

Python’s greatest differentiating factors:● +experienced full stack engineers● +pragmatic, mature tools● +HPC & scientific “baggage”● -meh deployment, orchestration, packaging● -not as much enterprise “baggage”

Page 39: Does Python stand a chance in today's world of data science?
Page 40: Does Python stand a chance in today's world of data science?

Radim Řehůřekhttp://rare-technologies.com(formerly radimrehurek.com)

@radimrehurek

Blog (mostly tech): http://radimrehurek.com/blog/