Predictive Models at Scale using Dumbo
Nikhil Ketkar
Motivation: Problem Space @ Indix
● 40k+ Brands
● 600k+ Sellers
● 700+ Million Products
● 7k+ Categories
● 10k+ Attributes
Developing Predictive Models
Unlabelled Data → Sample → Hand Label → Model → Predict → Data with Predicted Labels
Predictive Models at Scale
[Figure: HDFS alongside six copies of the Statistical Model]
The Two Giants
PyData Ecosystem (Model)
● Native: C/C++, Fortran
● NumPy
● SciPy, Pandas, Matplotlib
● scikit-learn, scikit-image, statsmodels
Hadoop Ecosystem (Predict)
● JVM
● Java/Scala
● HDFS, Hadoop MapReduce
● Cascading/Scalding
The Standard Options
● Port to Java/Scala and use as a library in the Mapper
  ○ Time consuming
  ○ Need to port parts of the PyData stack
  ○ Reduced velocity
  ○ Error prone
● Write a REST API/service for the model and call it from the Mapper
  ○ Slow due to network latency
  ○ Deployment is a nightmare
● Use Disco
Can we do better?
● Hadoop Streaming with Typedbytes Support
● Python Wrappers over Hadoop Streaming
  ○ Dumbo
  ○ MRJob
  ○ Hadoopy
  ○ Pydoop
Reference: http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/
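For context, plain Hadoop Streaming runs a script as the mapper, passes records to it line by line on standard input, and reads tab-separated key/value lines from its standard output. The sketch below shows that contract in Python. `predict_label` is a hypothetical stand-in for a trained model and `run_mapper` is an illustrative name, not part of any library.

```python
def predict_label(title):
    """Hypothetical stand-in for a trained model's predict call."""
    return "has_digits" if any(ch.isdigit() for ch in title) else "no_digits"


def run_mapper(lines, out):
    """Read one record per line and write tab-separated key/value pairs."""
    for line in lines:
        title = line.rstrip("\n")
        if not title:
            continue  # skip blank records
        out.write("%s\t%s\n" % (title, predict_label(title)))

# As a streaming script this would typically end with:
#     import sys
#     run_mapper(sys.stdin, sys.stdout)
```

Because everything travels as text lines, this style works for titles but not for raw binary payloads, which is where Typedbytes support comes in.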
Two Minute MapReduce Refresher
Reference: https://tarnbarford.net/journal/mapreduce-on-mongo
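As a refresher in code form, the three phases (map, shuffle, reduce) can be simulated in memory with a word count. This is only a single-process sketch of the programming model, not how Hadoop executes a job; all function names are illustrative.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every (key, value) record."""
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            yield out_key, out_value


def shuffle(pairs):
    """Group mapper output by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    for key, values in groups:
        for result in reducer(key, values):
            yield result


def word_mapper(_, line):
    for word in line.lower().split():
        yield word, 1


def sum_reducer(word, counts):
    yield word, sum(counts)


def run_job(records, mapper, reducer):
    return list(reduce_phase(shuffle(map_phase(records, mapper)), reducer))
```

Calling `run_job([(0, "a b a"), (1, "b c")], word_mapper, sum_reducer)` groups the ones emitted per word and sums them per key.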
Sample Problem: Extract MPN (Manufacturer Part Number) from Product Titles
● 0.5 Billion Product Titles
● Many contain MPNs
● Humans can detect MPNs
● Can a model do the same?
● Use CRF (Conditional Random Field) on Full Title
● Use RF (Random Forest) on Tokens
Moen CSIMC000BN Brushed Nickel Decorative Mirror Frame Corner Rosette from Mirrorscapes 000 Series Set of 4
Rohl A3608/6.5LPAPC 2 Polished Chrome Country Kitchen Low Lead Bar Faucet with Porcelain Lever Handle
Newport Brass 3 447/ORB Oil Rubbed Bronze Hand Relieved Diverter / Volume Control Handle from the Metropole Collection
Bosch HCFC2044B 1/4" SDS Plus X5L with Optimized Flute Surface Pack of 25
Sterling 7214120 Ensemble 0" x 30" Shower Receptor with Right hand Drain Pack 6
U12 23252 KUB QUATRON INDX DRILL
MPNs in Product Titles
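To make the "RF on Tokens" idea concrete, each title can be split into tokens and each token described by a few shape features that a token-level classifier could consume. The feature set below is an assumption chosen for illustration, not the one used in the talk, and `token_features` and `title_to_rows` are hypothetical names.

```python
import re


def token_features(token):
    """Return simple shape features for one title token (illustrative only)."""
    digits = sum(ch.isdigit() for ch in token)
    return {
        "length": len(token),
        "has_digit": digits > 0,
        "has_alpha": any(ch.isalpha() for ch in token),
        "all_upper": token.isupper(),
        "digit_ratio": float(digits) / len(token),
        "has_separator": bool(re.search(r"[/.\-]", token)),
    }


def title_to_rows(title):
    """Split a title on whitespace and pair each token with its features."""
    return [(tok, token_features(tok)) for tok in title.split()]
```

Tokens such as `CSIMC000BN` or `A3608/6.5LPAPC` mix uppercase letters, digits and separators, which is the kind of signal a token-level model can pick up on.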
Code Walkthrough
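The code from these slides did not survive in the text, so what follows is only a sketch of what a map-only prediction job might look like, not the author's code. It assumes Dumbo's `dumbo.run` entry point and its support for mapper classes, where the object is built once per task and called per record. `load_model` and its letter-and-digit rule are hypothetical placeholders for loading and applying a trained model.

```python
def load_model():
    """Hypothetical loader; a real job would unpickle a trained model here."""
    def predict(title):
        # Placeholder rule standing in for model.predict: keep tokens that
        # contain both letters and digits.
        return [tok for tok in title.split()
                if any(c.isdigit() for c in tok)
                and any(c.isalpha() for c in tok)]
    return predict


class PredictMapper(object):
    """Map-only job: load the model once, then predict per record."""

    def __init__(self):
        self.predict = load_model()

    def __call__(self, key, value):
        # value is assumed to be one product title
        yield value, self.predict(value)

# Assumed entry point when run under Dumbo (requires the dumbo package):
#     import dumbo
#     dumbo.run(PredictMapper)
```

Loading the model in the constructor rather than per record keeps the per-title cost down to the prediction itself.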
Important Learnings
● Dumbo is fairly stable, mature and ready for production
● Gets the two giants working together!
● Found just one issue over 6 months of usage (patch submitted)
● Support for Typedbytes is critical when making predictions over binary data (images etc.)
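The last point can be illustrated without Hadoop: line-oriented text records break as soon as a payload contains newline bytes, while a length-prefixed binary framing does not. The sketch below uses a simplified length-prefixed record in the spirit of Typedbytes. It is not the typedbytes library's API, and `write_record` and `read_records` are illustrative names.

```python
import io
import struct


def write_record(stream, payload):
    """Write a 4-byte big-endian length followed by the raw bytes."""
    stream.write(struct.pack(">i", len(payload)))
    stream.write(payload)


def read_records(stream):
    """Yield payloads until the stream is exhausted."""
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return
        (length,) = struct.unpack(">i", header)
        yield stream.read(length)


# Image-like bytes that happen to contain tabs and newlines.
image_like = b"\x89PNG\r\n\x1a\n\tpixels\n"

# Treating the bytes as text lines cuts the payload into several pieces.
line_pieces = image_like.split(b"\n")

# Length-prefixed framing returns the payload intact.
buffer = io.BytesIO()
write_record(buffer, image_like)
buffer.seek(0)
recovered = list(read_records(buffer))
```

Here `line_pieces` holds fragments of the payload, whereas `recovered` holds the original bytes as a single record.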