Stanford CS 246H Winter ‘14
Stanford CS 246H: Mining Massive Data Sets Hadoop Lab
Machine Learning & Hadoop
Peanut Butter and Chocolate?
• The Promise of Big Data™
  • Sounds great, but how?
• Hadoop talent pool is small
• ML talent pool is tiny
• Tools and toolkits starting to appear
  • Mahout, Oryx, Alpine, Ayasdi, Skytree, etc.
• Summary: Hadoop is hard, and ML is hard
  1. Lots of people/companies are trying to make it easy
  2. Don’t believe anyone who tells you they make it easy
Hadoop & ML: A Brief History
• 2005 – Taste project started on SourceForge
• 2007 – Mahout project started at Apache
• 2008 – Taste donated to Mahout
• … time passes …
• 2012 – Myrrix is launched
• 2013 – Cloudera ML project started on GitHub
• Late 2013 – Oryx project started on GitHub
Hadoop ML Family Tree
[Figure: family tree linking Lucene, Andrew Ng, Taste, Mahout, Myrrix, Cloudera ML, and Oryx]
Apache Mahout
What is Mahout?
• “Scalable machine learning”
  • not just Hadoop-oriented machine learning
  • not entirely, that is. Just mostly.
• Components
  • math library
  • clustering
  • classification
  • decompositions
  • recommendations
©MapR Technologies 2013
Mahout Math
• Goals are
  • basic linear algebra,
  • and statistical sampling,
  • and good clustering,
  • decent speed,
  • extensibility,
  • especially for sparse data
• But not
  • totally badass speed
  • comprehensive set of algorithms
  • optimization, root finders, quadrature
Caveat Emptor
• Mahout is a toolkit
• There is a command line interface
  • You can’t always use it
  • Very often end up writing code
• Documentation is… ahem… scant
  • Best reference is Mahout in Action
• Varying levels of maturity
• Varying levels of Hadoop support
Matrices and Vectors
• At the core:
  • DenseVector, RandomAccessSparseVector
  • DenseMatrix, SparseRowMatrix
• Highly composable API
• Important ideas:
  • view*, assign and aggregate
  • iteration

    m.viewDiagonal().assign(v)
Assign? View?
• Why assign?
  • Copying is the major cost for naïve matrix packages
  • In-place operations critical to reasonable performance
  • Many kinds of updates required, so functional style very helpful
• Why view?
  • In-place operations often required for blocks, rows, columns or diagonals
  • With views, we need #assign + #views methods
  • Without views, we need #assign x #views methods
• Synergies
  • With both views and assign, many loops become single line
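The views-plus-assign idea can be sketched without Mahout itself. The following is a minimal illustration in plain Java, not Mahout's implementation: `Vec`, `viewRow`, and `viewDiagonal` are hypothetical stand-ins for Mahout's `Vector` and its view methods, backed by an ordinary `double[][]`. The assign methods are written once on the interface and work on every view, which is the "#assign + #views" point made above.

```java
import java.util.Arrays;
import java.util.function.DoubleUnaryOperator;

public class ViewAssignSketch {

    /** Hypothetical stand-in for a mutable vector; this is not Mahout's Vector. */
    public interface Vec {
        int size();
        double get(int i);
        void set(int i, double value);

        /** In place: set every element to a constant. */
        default Vec assign(double value) {
            for (int i = 0; i < size(); i++) {
                set(i, value);
            }
            return this;
        }

        /** In place: replace every element x with f(x). */
        default Vec assign(DoubleUnaryOperator f) {
            for (int i = 0; i < size(); i++) {
                set(i, f.applyAsDouble(get(i)));
            }
            return this;
        }

        /** Sum of all elements. */
        default double zSum() {
            double sum = 0;
            for (int i = 0; i < size(); i++) {
                sum += get(i);
            }
            return sum;
        }
    }

    /** A view of one row; writes go straight through to the matrix. */
    public static Vec viewRow(double[][] m, int row) {
        return new Vec() {
            public int size() { return m[row].length; }
            public double get(int i) { return m[row][i]; }
            public void set(int i, double value) { m[row][i] = value; }
        };
    }

    /** A view of the main diagonal; no data is copied. */
    public static Vec viewDiagonal(double[][] m) {
        return new Vec() {
            public int size() { return Math.min(m.length, m[0].length); }
            public double get(int i) { return m[i][i]; }
            public void set(int i, double value) { m[i][i] = value; }
        };
    }

    public static void main(String[] args) {
        double[][] m = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
        System.out.println(viewDiagonal(m).zSum());   // trace: 1 + 5 + 9
        viewDiagonal(m).assign(0);                    // zero the diagonal in place
        viewRow(m, 1).assign(x -> 2 * x);             // double row 1 in place
        System.out.println(Arrays.deepToString(m));
    }
}
```

Adding a new kind of view (a column, a block) needs only three small methods, and every existing assign variant applies to it immediately.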
Assign
• Matrices
    Matrix assign(double value);
    Matrix assign(double[][] values);
    Matrix assign(Matrix other);
    Matrix assign(DoubleFunction f);
    Matrix assign(Matrix other, DoubleDoubleFunction f);
• Vectors
    Vector assign(double value);
    Vector assign(double[] values);
    Vector assign(Vector other);
    Vector assign(DoubleFunction f);
    Vector assign(Vector other, DoubleDoubleFunction f);
    Vector assign(DoubleDoubleFunction f, double y);
Views
• Matrices
    Matrix viewPart(int[] offset, int[] size);
    Matrix viewPart(int row, int rlen, int col, int clen);
    Vector viewRow(int row);
    Vector viewColumn(int column);
    Vector viewDiagonal();
• Vectors
    Vector viewPart(int offset, int length);
Aggregates
• Matrices
    double zSum();
    Vector aggregateRows(VectorFunction f);
    Vector aggregateColumns(VectorFunction f);
    double aggregate(DoubleDoubleFunction combiner,
                     DoubleFunction mapper);
• Vectors
    double zSum();
    double aggregate(
        DoubleDoubleFunction reduce, DoubleFunction map);
    double aggregate(Vector other,
        DoubleDoubleFunction aggregator,
        DoubleDoubleFunction combiner);
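The two-argument aggregate reads as a map followed by a fold. A minimal sketch in plain Java, assuming the result is reduce(… reduce(map(x0), map(x1)) …, map(xn)); the class and method here are illustrative, with `java.util.function` types standing in for Mahout's `DoubleFunction` and `DoubleDoubleFunction`, and the empty-vector behavior is a choice made for this sketch only.

```java
import java.util.function.DoubleBinaryOperator;
import java.util.function.DoubleUnaryOperator;

public class AggregateSketch {

    /** Applies map to each element, then folds the results left to right with reduce. */
    public static double aggregate(double[] v,
                                   DoubleBinaryOperator reduce,
                                   DoubleUnaryOperator map) {
        if (v.length == 0) {
            return 0;  // assumption of this sketch: an empty vector aggregates to 0
        }
        double result = map.applyAsDouble(v[0]);
        for (int i = 1; i < v.length; i++) {
            result = reduce.applyAsDouble(result, map.applyAsDouble(v[i]));
        }
        return result;
    }

    public static void main(String[] args) {
        double[] v = {3, -4, 0};
        // Sum of squares: map squares each element, reduce adds them up.
        double sumOfSquares = aggregate(v, Double::sum, x -> x * x);
        // Largest absolute value: map takes |x|, reduce keeps the maximum.
        double maxAbs = aggregate(v, Math::max, Math::abs);
        System.out.println(Math.sqrt(sumOfSquares));  // Euclidean norm
        System.out.println(maxAbs);                   // infinity norm
    }
}
```

Several common norms and statistics fall out of the same call by swapping the two functions, which is why a single aggregate method can replace many special-purpose ones.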
Predefined Functions
• Many handy functions
    ABS        LOG2
    ACOS       NEGATE
    ASIN       RINT
    ATAN       SIGN
    CEIL       SIN
    COS        SQRT
    EXP        SQUARE
    FLOOR      SIGMOID
    IDENTITY   SIGMOIDGRADIENT
    INV        TAN
    LOGARITHM
Examples
• A = α
    double alpha;
    a.assign(alpha);
• A = αB + β
    a.assign(b, Functions.chain(
        Functions.plus(beta),
        Functions.mult(alpha)));
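Reading the second example from the inside out: chain(g, h) is understood here to build x ↦ g(h(x)), so plus(beta) chained after mult(alpha) yields αx + β. The sketch below reproduces that composition with `java.util.function` rather than Mahout's `Functions` class. `chain`, `plus`, `mult`, and `scaleAndShift` are local re-implementations written for illustration; the argument order of the real library should be checked against its Javadoc.

```java
import java.util.Arrays;
import java.util.function.DoubleUnaryOperator;

public class ChainSketch {

    /** x -> x + b */
    public static DoubleUnaryOperator plus(double b) {
        return x -> x + b;
    }

    /** x -> x * a */
    public static DoubleUnaryOperator mult(double a) {
        return x -> x * a;
    }

    /** chain(g, h) returns x -> g(h(x)): h is applied first, then g. */
    public static DoubleUnaryOperator chain(DoubleUnaryOperator g, DoubleUnaryOperator h) {
        return g.compose(h);
    }

    /** Computes alpha * b[i] + beta for every element, into a new array. */
    public static double[] scaleAndShift(double[] b, double alpha, double beta) {
        return Arrays.stream(b).map(chain(plus(beta), mult(alpha))).toArray();
    }

    public static void main(String[] args) {
        double[] b = {1, 2, 3};
        // alpha = 2, beta = 1: each element becomes 2 * x + 1
        System.out.println(Arrays.toString(scaleAndShift(b, 2, 1)));
    }
}
```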
Sparse Optimizations
• DoubleDoubleFunction abstract properties
    public boolean isLikeRightPlus();
    public boolean isLikeLeftMult();
    public boolean isLikeRightMult();
    public boolean isLikeMult();
    public boolean isCommutative();
    public boolean isAssociative();
    public boolean isAssociativeAndCommutative();
    public boolean isDensifying();
• And Vector properties
    public boolean isDense();
    public boolean isSequentialAccess();
    public double getLookupCost();
    public double getIteratorAdvanceCost();
    public boolean isAddConstantTime();
Examples
• The trace of a matrix
    m.viewDiagonal().zSum()
• Set diagonal to zero
    m.viewDiagonal().assign(0)
• Set diagonal to negative of row sums excluding the diagonal
    Vector diag = m.viewDiagonal().assign(0);
    diag.assign(m.rowSums().assign(Functions.NEGATE));
Iteration
• Matrices are Iterable in Mahout
    // compute both row and column sums in one pass
    for (MatrixSlice row : m) {
      rSums.set(row.index(), row.zSum());
      cSums.assign(row, Functions.PLUS);
    }
• Vectors are densely or sparsely iterable
    // entropy is -sum(p * log p), so each term is subtracted
    double entropy = 0;
    for (Vector.Element e : v.iterateNonZero()) {
      entropy -= e.get() * Math.log(e.get());
    }
Random Sampling
• Samples from some type
    public interface Sampler<T> {
      T sample();
    }

    public abstract class AbstractSamplerFunction
        extends DoubleFunction
        implements Sampler<Double>
• Lots of kinds
    ChineseRestaurant   Missing       Normal
    Empirical           Multinomial   PoissonSampler
    IndianBuffet        MultiNormal   Sampler
Mahout Math Summary
• Matrices, Vectors
  • views
  • in-place assignment
  • aggregations
  • iterations
• Functions
  • lots built-in
  • cooperate with sparse vector optimizations
• Sampling
  • abstract samplers
  • samplers as functions
• Other stuff … clustering, SVD
Other Stuff
• Matrix Decomposition
• Classification
• Clustering
• Recommendations
Focus: Machine Learning
[Figure: Mahout component stack: Applications; Examples; Recommenders, Clustering, Classification, Freq. Pattern Mining, Genetic; Math (Vectors/Matrices/SVD); Utilities (Lucene/Vectorizer); Collections (primitives); Apache Hadoop]
See http://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
©Lucid Imagination 2010
Prepare Data from Raw content
• Data Sources:
  • Lucene integration
    • bin/mahout lucenevector …
  • Document Vectorizer
    • bin/mahout seqdirectory …
    • bin/mahout seq2sparse …
  • Programmatically
    • See the Utils module in Mahout
  • Database
  • File system
Recommendations
• Extensive framework for collaborative filtering
• Recommenders
  • User based, Item based, ALS, SlopeOne, SVD, others
• Online and Offline support
  • Offline can utilize Hadoop
• Many different Similarity measures
  • Cosine, LLR, Tanimoto, Pearson, others
Clustering
• Document level
  • Group documents based on a notion of similarity
  • K-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift
  • Distance Measures
    • Manhattan, Euclidean, other
• Topic Modeling
  • Cluster words across documents to identify topics
  • Latent Dirichlet Allocation
Categorization
• Place new items into predefined categories:
  • Sports, politics, entertainment
• Mahout has several implementations
  • Naïve Bayes
  • Complementary Naïve Bayes
  • Decision Forests
  • Logistic Regression (SGD)
Freq. Pattern Mining
• Identify frequently co-occurring items
• Useful for:
  • Query Recommendations
    • Apple -> iPhone, orange, OS X
  • Related product placement
    • “Beer and Diapers”
  • Spam Detection
    • Yahoo: http://www.slideshare.net/hadoopusergroup/mail-antispam
http://www.amazon.com
Evolutionary
• Map-Reduce ready fitness functions for genetic programming
• Integration with Watchmaker
  • http://watchmaker.uncommons.org/index.php
• Problems solved:
  • Traveling salesman
  • Class discovery
  • Many others
Singular Value Decomposition
• Reduces a big matrix into a much smaller matrix by amplifying the important parts while removing/reducing the less important parts
• Mahout has fully distributed Lanczos implementation
    <MAHOUT_HOME>/bin/mahout svd \
      -Dmapred.input.dir=path/to/corpus \
      --tempDir path/for/svd-output \
      --rank 300 \
      --numColumns <numcols> \
      --numRows <num rows in the input>
    <MAHOUT_HOME>/bin/mahout cleansvd \
      --eigenInput path/for/svd-output \
      --corpusInput path/to/corpus \
      --output path/for/cleanOutput \
      --maxError 0.1 \
      --minEigenvalue 10.0
• https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
How to: Command Line
• Most algorithms have a Driver program
  • Shell script in $MAHOUT_HOME/bin helps with most tasks
• Prepare the Data
  • Different algorithms require different setup
• Run the algorithm
  • Single Node
  • Hadoop
• Print out the results
  • Several helper classes:
    • LDAPrintTopics, ClusterDumper, etc.
Ugly Demo II - Prep
• Data Set: Reuters
  • http://www.daviddlewis.com/resources/testcollections/reuters21578/
  • Convert to Text via http://www.lucenebootcamp.com/lucene-boot-camp-preclass-training/
• Convert to Sequence File:
    bin/mahout seqdirectory --input <PATH> --output <PATH> --charset UTF-8
• Convert to Sparse Vector:
    bin/mahout seq2sparse \
      --input <PATH>/content/reuters/seqfiles/ \
      --norm 2 --weight TF \
      --output <PATH>/content/reuters/seqfiles-TF/ \
      --minDF 5 --maxDFPercent 90
Ugly Demo II: Topic Modeling
• Latent Dirichlet Allocation
    ./mahout lda \
      --input <PATH>/content/reuters/seqfiles-TF/vectors/ \
      --output <PATH>/content/reuters/seqfiles-TF/lda-output \
      --numWords 34000 --numTopics 10
    ./mahout org.apache.mahout.clustering.lda.LDAPrintTopics \
      --input <PATH>/content/reuters/seqfiles-TF/lda-output/state-19 \
      --dict <PATH>/content/reuters/seqfiles-TF/dictionary.file-0 \
      --words 10 \
      --output <PATH>/content/reuters/seqfiles-TF/lda-output/topics \
      --dictionaryType sequencefile
• Good feature reduction (stopword removal) required
Ugly Demo III: Clustering
• K-Means
  • Same Prep as UD II, except use TFIDF weight
    ./mahout kmeans \
      --input <PATH>/content/reuters/seqfiles-TFIDF/vectors/part-00000 \
      --k 15 \
      --output <PATH>/content/reuters/seqfiles-TFIDF/output-kmeans \
      --clusters <PATH>/content/reuters/seqfiles-TFIDF/output-kmeans/clusters
• Print out the clusters:
    ./mahout clusterdump \
      --seqFileDir <PATH>/content/reuters/seqfiles-TFIDF/output-kmeans/clusters-15/ \
      --pointsDir <PATH>/content/reuters/seqfiles-TFIDF/output-kmeans/points/ \
      --dictionary <PATH>/content/reuters/seqfiles-TFIDF/dictionary.file-0 \
      --dictionaryType sequencefile \
      --substring 20
Ugly Demo IV: Frequent Pattern Mining
• Data: http://fimi.cs.helsinki.fi/data/
• ./mahout fpg -i <PATH>/content/freqitemset/accidents.dat -o patterns -k 50 -method mapreduce -g 10 -regex [\ ]
• ./mahout seqdump --seqFile patterns/fpgrowth/part-r-00000
Cloudera ML
Cloudera ML
• Collection of Java libraries and command-line tools
• Goal: make data scientists more productive with CDH
  • Exploratory data analysis
  • Data preparation
  • Model fitting
  • Model evaluation
• Apache 2.0 licensed
• Developed on GitHub
  • http://github.com/cloudera/ml
Cloudera ML: Building Blocks
• Apache Hadoop
  • Scalable data storage (HDFS) and processing (MapReduce)
• Apache Hive
  • Metadata for structured data in HDFS
• Apache Crunch
  • Easy MapReduce pipelines
• Apache Mahout
  • Vector interface
• Apache Avro
  • Serialization format
Cloudera ML Workflow: Clustering
Cloudera ML: summary
• client/bin/ml summary
    --input-paths kddcup.data_10_percent      (HDFS)
    --format text
    --header-file examples/kdd99/header.csv   (local FS)
    --summary-file examples/kdd99/s.json      (local FS)
[Figure: step 1, summary. Reads kddcup.data_10_percent (HDFS) and header.csv (local FS); writes s.json (local FS).]
• s.json
  • Categorical features: histogram
  • Numerical features: distribution summary
Cloudera ML: normalize
• client/bin/ml normalize
    --input-paths kddcup.data_10_percent   (HDFS)
    --format text
    --summary-file examples/kdd99/s.json   (local FS)
    --transform Z
    --output-path kdd99                    (HDFS)
    --output-type avro
    --id-column category
    --compress
[Figure: step 2, normalize. Reads kddcup.data_10_percent (HDFS) and s.json (local FS); writes kdd99/ (HDFS).]
• kdd99/part-m-0000[0|1].avro
  • Examples (rows)
    • Part 0: 442,454 vectors
    • Part 1: 51,567 vectors
    • Total: 494,021 vectors
  • Features (columns)
    • Before: 41 fields
    • After: 143 fields
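The slides do not define `--transform Z`. A minimal sketch, assuming it is the usual standard score z = (x − mean) / stddev computed per numeric column from summary statistics: the class and method names are hypothetical, and the choice of the population (rather than sample) standard deviation is an assumption of this sketch, not something the slides state.

```java
import java.util.Arrays;

public class ZTransformSketch {

    /**
     * Returns (x - mean) / stddev for every value in the column,
     * using the population standard deviation (divide by n).
     * A constant column maps to all zeros to avoid dividing by zero.
     */
    public static double[] zTransform(double[] column) {
        int n = column.length;
        double mean = 0;
        for (double x : column) {
            mean += x;
        }
        mean /= n;

        double sumSq = 0;
        for (double x : column) {
            sumSq += (x - mean) * (x - mean);
        }
        double std = Math.sqrt(sumSq / n);

        double[] z = new double[n];
        for (int i = 0; i < n; i++) {
            z[i] = (std == 0) ? 0 : (column[i] - mean) / std;
        }
        return z;
    }

    public static void main(String[] args) {
        // Mean is 5 and population standard deviation is 2 for this column.
        double[] column = {2, 4, 4, 4, 5, 5, 7, 9};
        System.out.println(Arrays.toString(zTransform(column)));
    }
}
```

After this transform every numeric feature has mean 0 and unit variance, which keeps features with large raw ranges from dominating the squared distances used by k-means.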
Cloudera ML: ksketch
• client/bin/ml ksketch
    --input-paths kdd99         (HDFS)
    --format avro
    --points-per-iteration 500
    --output-file wc.avro       (local FS)
    --seed 1729
    --iterations 5
    --cross-folds 2
[Figure: step 3, ksketch. Reads kdd99/ (HDFS); writes wc.avro (local FS).]
• wc.avro
  • Examples (rows)
    • 2 “folds” of 2501 examples
    • 1 initial example
    • 500 examples from each iteration (5 iterations)
    • Each example has an associated weight
  • Features (columns)
    • 143 features (still)
Cloudera ML: kmeans
• client/bin/ml kmeans
    --input-file wc.avro                   (local FS)
    --centers-file centers.avro            (local FS)
    --seed 19
    --clusters 1,10,25,35,45
    --best-of 2
    --num-threads 4
    --eval-stats-file kmeans_stats.csv     (local FS)
[Figure: step 4, kmeans. Reads wc.avro (local FS); writes centers.avro and kmeans_stats.csv (local FS).]
• centers.avro
  • 1 row for each run of k-means++
  • 9 total runs: 1 for k=1, 2 each for k=10, 25, 35, and 45
• kmeans_stats.csv
  • Clustering quality scores
Cloudera ML: kassign
• client/bin/ml kassign
    --input-paths kdd99           (HDFS)
    --format avro
    --centers-file centers.avro   (local FS)
    --center-ids 4
    --output-path assigned        (HDFS)
    --output-type csv
[Figure: step 5, kassign. Reads kdd99/ (HDFS) and centers.avro (local FS); writes assigned/ (HDFS).]
• assigned/part-m-0000[0|1]
  • Rows
    • Part 0: 442,454
    • Part 1: 51,567
    • Total: 494,021
  • Columns
    • Point ID (normal/attack type, in this case)
    • Index in centers.avro
    • Assigned cluster ID
    • Squared distance to nearest cluster
Cloudera ML: sample
• client/bin/ml sample
    --input-paths assigned                            (HDFS)
    --format text
    --header-file examples/kdd99/kassign_header.csv   (local FS)
    --weight-field squared_distance
    --group-fields clustering_id,closest_center_id
    --output-type csv
    --size 20
    --output-path extremal                            (HDFS)
[Figure: step 6, sample. Reads assigned/ (HDFS) and kassign_header.csv (local FS); writes extremal/ (HDFS).]
• extremal/part-r-00000
  • Rows
    • Up to 20 examples from each cluster
    • Examples that are furthest from the center of the cluster
  • Columns
    • Point ID (normal/attack type, in this case)
    • Index in centers.avro
    • Assigned cluster ID
    • Squared distance to nearest cluster
Oryx
2014: Lab to Factory
Data Science Will Be Operational Analytics
I Built A Model. Now What?
[Figure: cycle of Build Model, Query Model, Collect Input, Repeat]
I Built A Model On Hadoop. Now What?
[Figure: cycle of Build Model, Query Model, Collect Input, Repeat, with the steps marked “?”]
Example: Oryx
[Image: www.mwttl.com/wp-content/uploads/2013/11/IMG_5446_edited-2_mwttl.jpg]
Gaps to fill, and Goals
• Model Building
  • Large-scale
  • Continuous
  • Apache Hadoop™-based
  • Few, good algorithms
• Model Serving
  • Real-time query
  • Real-time update
• Algorithms
  • Parallelizable
  • Updateable
  • Works on diverse input
• Interoperable
  • PMML model format
  • Simple REST API
  • Open source
Large-Scale or Real-Time?
Large-Scale / Offline / Batch
vs
Real-Time / Online / Streaming
Why Don’t We Have Both?
λ!
Lambda Architecture
• Batch, Stream Processing are different
• Tackle separately in 2+ Layers
• Batch Layer: offline, asynchronous
• Serving / Speed Layer: real-time, incremental, approximate
jameskinley.tumblr.com/post/37398560534/the-lambda-architecture-principles-for-architecting
… λ?
[Figure: Batch layer and Serving/Speed layer]
Two Layers
• Computation Layer
  • Java-based server process
  • Client of Hadoop 2.x
  • Periodically builds “generation” from recent data and past model
  • Baby-sits MapReduce* jobs (or, locally in-core)
  • Publishes models
• Serving Layer
  • Apache Tomcat™-based server process
  • Consumes models from HDFS (or local FS)
  • Serves queries from model in memory
  • Updates from new input
  • Also writes input to HDFS
  • Replicas for scale
* Apache Spark later
Collaborative Filtering : ALS
• Alternating Least Squares
• Latent-factor model
• Accepts implicit or explicit feedback
• Real-time update via fold-in of input
• No cold-start
• Parallelizable
[Figure: factor matrices X and Yᵀ]
Clustering : k-means++
• Well-known and understood
• Parallelizable
• Clusters updateable
cwiki.apache.org/confluence/display/MAHOUT/K-Means+Clustering
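For reference, the seeding step that gives k-means++ its name can be sketched as follows. This is the standard sequential procedure as commonly described (first center uniformly at random, each later center with probability proportional to its squared distance from the nearest center chosen so far); it is not Oryx or Cloudera ML code, the slides do not show how either project parallelizes it, and the class and method names are hypothetical.

```java
import java.util.Random;

public class KMeansPlusPlusSketch {

    /** Squared Euclidean distance between two points of equal dimension. */
    public static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Chooses k initial centers from points using D^2-weighted sampling. */
    public static double[][] chooseCenters(double[][] points, int k, Random rng) {
        int n = points.length;
        double[][] centers = new double[k][];

        // First center: uniform over all points.
        centers[0] = points[rng.nextInt(n)];

        // d2[i] = squared distance from point i to its nearest chosen center.
        double[] d2 = new double[n];
        for (int i = 0; i < n; i++) {
            d2[i] = squaredDistance(points[i], centers[0]);
        }

        for (int c = 1; c < k; c++) {
            double total = 0;
            for (double d : d2) {
                total += d;
            }
            // Pick index i with probability d2[i] / total via a cumulative scan.
            double r = rng.nextDouble() * total;
            int chosen = n - 1;  // fallback if rounding leaves r >= total
            double cumulative = 0;
            for (int i = 0; i < n; i++) {
                cumulative += d2[i];
                if (r < cumulative) {
                    chosen = i;
                    break;
                }
            }
            centers[c] = points[chosen];

            // Update nearest-center distances with the new center.
            for (int i = 0; i < n; i++) {
                d2[i] = Math.min(d2[i], squaredDistance(points[i], centers[c]));
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        double[][] centers = chooseCenters(points, 2, new Random(42));
        for (double[] center : centers) {
            System.out.println(center[0] + ", " + center[1]);
        }
    }
}
```

Because a point that is already a center has squared distance zero, it is essentially never drawn again, so the initial centers tend to spread across the data before the usual k-means iterations begin.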
Classification / Regression : RDF
• Random Decision Forests
• Ensemble method
• Numeric, categorical features and target
• Very parallel
• Nodes updateable
• Works well on many problems
[Figure: example decision tree with tests age > 30, female?, and income > 20000, and Yes/No branches]
PMML
• Predictive Modeling Markup Language
• XML-based format for predictive models
• Standardized by Data Mining Group (www.dmg.org)
• Wide tool support

    <PMML xmlns="http://www.dmg.org/PMML-4_1"
          version="4.1">
      <Header copyright="www.dmg.org"/>
      <DataDictionary numberOfFields="5">
        <DataField name="temperature"
                   optype="continuous"
                   dataType="double"/>
        …
      </DataDictionary>
      <TreeModel modelName="golfing"
                 functionName="classification">
        <MiningSchema>
          <MiningField name="temperature"/>
          …
        </MiningSchema>
        <Node score="will play">
          <Node score="will play">
            <SimplePredicate field="outlook"
                             operator="equal"
                             value="sunny"/>
            …
          </Node>
        </Node>
      </TreeModel>
    </PMML>

www.dmg.org/v4-1/TreeModel.html
HTTP REST API
• Convention for RPC-like request / response
• HTTP verbs, transport
  • GET : query
  • POST : add input
• Easy from browser, CLI, Java, Python, Scala, etc.

    GET /recommend/jwills

    HTTP/1.1 200 OK
    Content-Type: text/plain

    "Ray LaMontagne",0.951
    "Fleet Foxes",0.7905
    "The National",0.688
    "Shearwater",0.3017
Wish List
• Revamp workflow
  • Spark / Crunch-like API, not raw M/R
• De-emphasize model building
  • Well-solved
  • Bring your own
• More component-ized
  • Less black-box service
• Emphasize integration
  • PMML, etc.
• “Pull” options
  • Kafka?
  • Hive / Impala ?
Open Source
github.com/cloudera/oryx
100% Apache License 2.0