Lecture 7 - CS 246H

Stanford CS 246H Winter ‘14

Stanford CS 246H: Mining Massive Data Sets Hadoop Lab


Machine Learning & Hadoop


Peanut Butter and Chocolate?

•  The Promise of Big Data™
    •  Sounds great, but how?

•  Hadoop talent pool is small
•  ML talent pool is tiny

•  Tools and toolkits starting to appear
    •  Mahout, Oryx, Alpine, Ayasdi, Skytree, etc.

•  Summary: Hadoop is hard, and ML is hard
    1.  Lots of people/companies are trying to make it easy
    2.  Don’t believe anyone who tells you they make it easy


Hadoop & ML: A Brief History

•  2005 – Taste project started on SourceForge
•  2007 – Mahout project started at Apache
•  2008 – Taste donated to Mahout
•  … time passes …
•  2012 – Myrrix is launched
•  2013 – Cloudera ML project started on GitHub
•  Late 2013 – Oryx project started on GitHub


Hadoop ML Family Tree

[Figure: family tree diagram with nodes Lucene, Andrew Ng, Taste, Mahout, Myrrix, Cloudera ML, and Oryx]


Apache Mahout


What is Mahout?

•  “Scalable machine learning”
    •  not just Hadoop-oriented machine learning
    •  not entirely, that is. Just mostly.

•  Components
    •  math library
    •  clustering
    •  classification
    •  decompositions
    •  recommendations

©MapR Technologies 2013


Mahout Math

•  Goals are
    •  basic linear algebra,
    •  and statistical sampling,
    •  and good clustering,
    •  decent speed,
    •  extensibility,
    •  especially for sparse data

•  But not
    •  totally badass speed
    •  comprehensive set of algorithms
    •  optimization, root finders, quadrature



Caveat Emptor

•  Mahout is a toolkit
    •  There is a command line interface
    •  You can’t always use it
    •  Very often end up writing code

•  Documentation is… ahem… scant
    •  Best reference is Mahout in Action

•  Varying levels of maturity
•  Varying levels of Hadoop support


Matrices and Vectors

•  At the core:
    •  DenseVector, RandomAccessSparseVector
    •  DenseMatrix, SparseRowMatrix

•  Highly composable API

•  Important ideas:
    •  view*, assign and aggregate
    •  iteration

m.viewDiagonal().assign(v)



Assign? View?

•  Why assign?
    •  Copying is the major cost for naïve matrix packages
    •  In-place operations critical to reasonable performance
    •  Many kinds of updates required, so functional style very helpful

•  Why view?
    •  In-place operations often required for blocks, rows, columns or diagonals
    •  With views, we need #assign + #views methods
    •  Without views, we need #assign × #views methods

•  Synergies
    •  With both views and assign, many loops become single line
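To make the "#assign + #views" point concrete, here is a minimal plain-Java sketch of a diagonal view with one in-place assign method. The names `ViewAssignSketch` and `DiagonalView` are hypothetical stand-ins, not Mahout classes, and the sketch assumes a square matrix stored as a `double[][]`.

```java
import java.util.function.DoubleUnaryOperator;

/** Hypothetical stand-in for a matrix with a diagonal view; not Mahout's API. */
final class ViewAssignSketch {
    /** A writable window onto the diagonal of a square array; no data is copied. */
    static final class DiagonalView {
        private final double[][] data;

        DiagonalView(double[][] data) {
            this.data = data;
        }

        /** In-place update: replaces each diagonal entry x with f(x). */
        DiagonalView assign(DoubleUnaryOperator f) {
            for (int i = 0; i < data.length; i++) {
                data[i][i] = f.applyAsDouble(data[i][i]);
            }
            return this;
        }

        /** Sum of the viewed entries (the trace, for a diagonal view). */
        double zSum() {
            double sum = 0;
            for (int i = 0; i < data.length; i++) {
                sum += data[i][i];
            }
            return sum;
        }
    }

    public static void main(String[] args) {
        double[][] m = {{1, 2}, {3, 4}};
        DiagonalView diag = new DiagonalView(m);
        System.out.println(diag.zSum());   // trace of m
        diag.assign(x -> 0.0);             // zero the diagonal in place
        System.out.println(m[0][0] + " " + m[1][1]);
    }
}
```

Because the view writes through to the backing array, a single `assign` implementation serves every kind of view, which is the reason the method count grows as a sum rather than a product.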



Assign

•  Matrices

Matrix assign(double value);
Matrix assign(double[][] values);
Matrix assign(Matrix other);
Matrix assign(DoubleFunction f);
Matrix assign(Matrix other, DoubleDoubleFunction f);

•  Vectors

Vector assign(double value);
Vector assign(double[] values);
Vector assign(Vector other);
Vector assign(DoubleFunction f);
Vector assign(Vector other, DoubleDoubleFunction f);
Vector assign(DoubleDoubleFunction f, double y);



Views

•  Matrices

Matrix viewPart(int[] offset, int[] size);
Matrix viewPart(int row, int rlen, int col, int clen);
Vector viewRow(int row);
Vector viewColumn(int column);
Vector viewDiagonal();

•  Vectors

Vector viewPart(int offset, int length);



Aggregates

•  Matrices

double zSum();
Vector aggregateRows(VectorFunction f);
Vector aggregateColumns(VectorFunction f);
double aggregate(DoubleDoubleFunction combiner,
                 DoubleFunction mapper);

•  Vectors

double zSum();
double aggregate(DoubleDoubleFunction reduce,
                 DoubleFunction map);
double aggregate(Vector other,
                 DoubleDoubleFunction aggregator,
                 DoubleDoubleFunction combiner);



Predefined Functions

•  Many handy functions

ABS         LOG2
ACOS        NEGATE
ASIN        RINT
ATAN        SIGN
CEIL        SIN
COS         SQRT
EXP         SQUARE
FLOOR       SIGMOID
IDENTITY    SIGMOIDGRADIENT
INV         TAN
LOGARITHM



Examples

A = α

double alpha;
a.assign(alpha);

A = αB + β

a.assign(b, Functions.chain(
    Functions.plus(beta),
    Functions.mult(alpha)));



Sparse Optimizations

•  DoubleDoubleFunction abstract properties

public boolean isLikeRightPlus();
public boolean isLikeLeftMult();
public boolean isLikeRightMult();
public boolean isLikeMult();
public boolean isCommutative();
public boolean isAssociative();
public boolean isAssociativeAndCommutative();
public boolean isDensifying();

•  And Vector properties

public boolean isDense();
public boolean isSequentialAccess();
public double getLookupCost();
public double getIteratorAdvanceCost();
public boolean isAddConstantTime();



Examples

•  The trace of a matrix

m.viewDiagonal().zSum()

•  Set diagonal to zero

m.viewDiagonal().assign(0)

•  Set diagonal to negative of row sums excluding the diagonal

Vector diag = m.viewDiagonal().assign(0);
diag.assign(m.rowSums().assign(Functions.NEGATE));



Iteration

•  Matrices are Iterable in Mahout

// compute both row and columns sums in one pass
for (MatrixSlice row: m) {
  rSums.set(row.index(), row.zSum());
  cSums.assign(row, Functions.PLUS);
}

•  Vectors are densely or sparsely iterable

double entropy = 0;
for (Vector.Element e: v.iterateNonZero()) {
  entropy += e.get() * Math.log(e.get());
}
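The benefit of sparse iteration can be sketched without Mahout. The example below is an illustration only: it models a sparse vector as a `TreeMap` from index to value (an assumption of this sketch, not how Mahout stores vectors) and visits only the stored entries. Skipping zeros is safe here because p · ln p tends to 0 as p tends to 0. Note that the loop accumulates Σ p ln p, so the Shannon entropy is the negation of the result.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sparse-vector stand-in; not Mahout's Vector API. */
final class SparseIterationSketch {
    /** Sum of p * ln(p) over stored (non-zero) entries only. */
    static double sumPLogP(Map<Integer, Double> sparse) {
        double total = 0;
        for (Map.Entry<Integer, Double> e : sparse.entrySet()) {
            double p = e.getValue();
            total += p * Math.log(p);
        }
        return total;
    }

    public static void main(String[] args) {
        Map<Integer, Double> v = new TreeMap<>();
        v.put(3, 0.5);
        v.put(1_000_000, 0.5);   // large logical size, but only two entries are visited
        System.out.println(-sumPLogP(v));   // Shannon entropy in nats
    }
}
```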



Random Sampling

•  Samples from some type

public interface Sampler<T> {
  T sample();
}

public abstract class AbstractSamplerFunction
    extends DoubleFunction
    implements Sampler<Double>

•  Lots of kinds

ChineseRestaurant   Missing       Normal
Empirical           Multinomial   PoissonSampler
IndianBuffet        MultiNormal   Sampler
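The "samplers as functions" idea can be sketched in plain Java as follows. `SamplerSketch` and `NormalSamplerSketch` are hypothetical names for this illustration, and `DoubleUnaryOperator` stands in for Mahout's `DoubleFunction`. The function form ignores its argument, so assign-style code could fill a vector with fresh samples.

```java
import java.util.Random;
import java.util.function.DoubleUnaryOperator;

/** Hypothetical analogue of the Sampler<T> interface shown above. */
interface SamplerSketch<T> {
    T sample();
}

/** Hypothetical normal sampler usable either as a sampler or as a function. */
final class NormalSamplerSketch implements SamplerSketch<Double>, DoubleUnaryOperator {
    private final Random rng;
    private final double mean;
    private final double sd;

    NormalSamplerSketch(double mean, double sd, long seed) {
        this.mean = mean;
        this.sd = sd;
        this.rng = new Random(seed);
    }

    @Override
    public Double sample() {
        return mean + sd * rng.nextGaussian();
    }

    /** Ignores its argument and returns a new sample. */
    @Override
    public double applyAsDouble(double ignored) {
        return sample();
    }
}
```

For example, `java.util.stream.DoubleStream.of(0, 0, 0).map(new NormalSamplerSketch(0, 1, 42L))` would replace each element with a draw from the standard normal distribution.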



Mahout Math Summary

•  Matrices, Vectors
    •  views
    •  in-place assignment
    •  aggregations
    •  iterations

•  Functions
    •  lots built-in
    •  cooperate with sparse vector optimizations

•  Sampling
    •  abstract samplers
    •  samplers as functions

•  Other stuff … clustering, SVD



Other Stuff

•  Matrix Decomposition
•  Classification
•  Clustering
•  Recommendations


Focus: Machine Learning

[Figure: Mahout component diagram with boxes Applications; Examples; Recommenders; Clustering; Classification; Freq. Pattern Mining; Genetic; Math (Vectors/Matrices/SVD); Utilities (Lucene/Vectorizer); Collections (primitives); Apache Hadoop]

See http://cwiki.apache.org/confluence/display/MAHOUT/Algorithms

©Lucid Imagination 2010


Prepare Data from Raw content

•  Data Sources:
    •  Lucene integration
        •  bin/mahout lucenevector …
    •  Document Vectorizer
        •  bin/mahout seqdirectory …
        •  bin/mahout seq2sparse …
    •  Programmatically
        •  See the Utils module in Mahout
    •  Database
    •  File system



Recommendations

•  Extensive framework for collaborative filtering
•  Recommenders
    •  User based, Item based, ALS, SlopeOne, SVD, others

•  Online and Offline support
    •  Offline can utilize Hadoop

•  Many different Similarity measures
    •  Cosine, LLR, Tanimoto, Pearson, others
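Two of the similarity measures named above are simple enough to sketch directly. The code below uses the textbook definitions of cosine similarity and the Tanimoto (Jaccard) coefficient; it is not Mahout's implementation, and the class and method names are chosen for this illustration.

```java
import java.util.HashSet;
import java.util.Set;

/** Textbook similarity measures; illustrative only, not Mahout's classes. */
final class SimilaritySketch {
    /** Cosine similarity of two equal-length, non-zero rating vectors. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    /** Tanimoto coefficient: |A ∩ B| / |A ∪ B| over the items each user touched. */
    static double tanimoto(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        int union = a.size() + b.size() - inter.size();
        return union == 0 ? 0.0 : (double) inter.size() / union;
    }
}
```

Cosine uses the rating values, while Tanimoto only looks at which items are present, so Tanimoto is a natural fit when the input is implicit feedback without ratings.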



Clustering

•  Document level
    •  Group documents based on a notion of similarity
    •  K-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift
    •  Distance Measures
        •  Manhattan, Euclidean, other

•  Topic Modeling
    •  Cluster words across documents to identify topics
    •  Latent Dirichlet Allocation



Categorization

•  Place new items into predefined categories:
    •  Sports, politics, entertainment

•  Mahout has several implementations
    •  Naïve Bayes
    •  Complementary Naïve Bayes
    •  Decision Forests
    •  Logistic Regression (SGD)



Freq. Pattern Mining

•  Identify frequently co-occurrent items

•  Useful for:
    •  Query Recommendations
        •  Apple -> iPhone, orange, OS X
    •  Related product placement
        •  “Beer and Diapers”
    •  Spam Detection
        •  Yahoo: http://www.slideshare.net/hadoopusergroup/mail-antispam

http://www.amazon.com



Evolutionary

•  Map-Reduce ready fitness functions for genetic programming

•  Integration with Watchmaker
    •  http://watchmaker.uncommons.org/index.php

•  Problems solved:
    •  Traveling salesman
    •  Class discovery
    •  Many others



Singular Value Decomposition

•  Reduces a big matrix into a much smaller matrix by amplifying the important parts while removing/reducing the less important parts

•  Mahout has fully distributed Lanczos implementation

<MAHOUT_HOME>/bin/mahout svd \
    -Dmapred.input.dir=path/to/corpus \
    --tempDir path/for/svd-output \
    --rank 300 \
    --numColumns <numcols> \
    --numRows <num rows in the input>

<MAHOUT_HOME>/bin/mahout cleansvd \
    --eigenInput path/for/svd-output \
    --corpusInput path/to/corpus \
    --output path/for/cleanOutput \
    --maxError 0.1 \
    --minEigenvalue 10.0

•  https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
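In standard notation, which the slides do not spell out, "amplifying the important parts" corresponds to keeping only the k largest singular values. The symbols below are this note's choice; k presumably corresponds to the `--rank` option (300 in the command above).

```latex
A \approx U_k \Sigma_k V_k^{\top},
\qquad
\Sigma_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k),
\qquad
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_k \ge 0
```

For an m × n input matrix A, U_k is m × k and V_k is n × k, so each row of A can be represented by k numbers instead of n.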



How to: Command Line

•  Most algorithms have a Driver program
    •  Shell script in $MAHOUT_HOME/bin helps with most tasks

•  Prepare the Data
    •  Different algorithms require different setup

•  Run the algorithm
    •  Single Node
    •  Hadoop

•  Print out the results
    •  Several helper classes:
        •  LDAPrintTopics, ClusterDumper, etc.



Ugly Demo II - Prep

•  Data Set: Reuters
    •  http://www.daviddlewis.com/resources/testcollections/reuters21578/

•  Convert to Text via http://www.lucenebootcamp.com/lucene-boot-camp-preclass-training/

•  Convert to Sequence File:

bin/mahout seqdirectory \
    --input <PATH> \
    --output <PATH> \
    --charset UTF-8

•  Convert to Sparse Vector:

bin/mahout seq2sparse \
    --input <PATH>/content/reuters/seqfiles/ \
    --norm 2 \
    --weight TF \
    --output <PATH>/content/reuters/seqfiles-TF/ \
    --minDF 5 \
    --maxDFPercent 90



Ugly Demo II: Topic Modeling

•  Latent Dirichlet Allocation

./mahout lda \
    --input <PATH>/content/reuters/seqfiles-TF/vectors/ \
    --output <PATH>/content/reuters/seqfiles-TF/lda-output \
    --numWords 34000 \
    --numTopics 10

./mahout org.apache.mahout.clustering.lda.LDAPrintTopics \
    --input <PATH>/content/reuters/seqfiles-TF/lda-output/state-19 \
    --dict <PATH>/content/reuters/seqfiles-TF/dictionary.file-0 \
    --words 10 \
    --output <PATH>/content/reuters/seqfiles-TF/lda-output/topics \
    --dictionaryType sequencefile

•  Good feature reduction (stopword removal) required



Ugly Demo III: Clustering

•  K-Means
    •  Same Prep as UD II, except use TFIDF weight

./mahout kmeans \
    --input <PATH>/content/reuters/seqfiles-TFIDF/vectors/part-00000 \
    --k 15 \
    --output <PATH>/content/reuters/seqfiles-TFIDF/output-kmeans \
    --clusters <PATH>/content/reuters/seqfiles-TFIDF/output-kmeans/clusters

•  Print out the clusters:

./mahout clusterdump \
    --seqFileDir <PATH>/content/reuters/seqfiles-TFIDF/output-kmeans/clusters-15/ \
    --pointsDir <PATH>/content/reuters/seqfiles-TFIDF/output-kmeans/points/ \
    --dictionary <PATH>/content/reuters/seqfiles-TFIDF/dictionary.file-0 \
    --dictionaryType sequencefile \
    --substring 20



Ugly Demo IV: Frequent Pattern Mining

•  Data: http://fimi.cs.helsinki.fi/data/

•  ./mahout fpg -i <PATH>/content/freqitemset/accidents.dat -o patterns -k 50 -method mapreduce -g 10 -regex [\ ]

•  ./mahout seqdump --seqFile patterns/fpgrowth/part-r-00000



Cloudera ML


Cloudera ML

•  Collection of Java libraries and command-line tools
•  Goal: make data scientists more productive with CDH
    •  Exploratory data analysis
    •  Data preparation
    •  Model fitting
    •  Model evaluation

•  Apache 2.0 licensed
•  Developed on GitHub
    •  http://github.com/cloudera/ml


Cloudera ML: Building Blocks

•  Apache Hadoop
    •  Scalable data storage (HDFS) and processing (MapReduce)

•  Apache Hive
    •  Metadata for structured data in HDFS

•  Apache Crunch
    •  Easy MapReduce pipelines

•  Apache Mahout
    •  Vector interface

•  Apache Avro
    •  Serialization format


Cloudera ML Workflow: Clustering


Cloudera ML: summary

•  client/bin/ml summary
    --input-paths kddcup.data_10_percent   (HDFS)
    --format text
    --header-file examples/kdd99/header.csv   (local FS)
    --summary-file examples/kdd99/s.json   (local FS)

[Figure: step 1, summary. Inputs: kddcup.data_10_percent (HDFS) and header.csv (local FS). Output: s.json (local FS)]

•  s.json
    •  Categorical features: histogram
    •  Numerical features: distribution summary


Cloudera ML: normalize

•  client/bin/ml normalize
    --input-paths kddcup.data_10_percent   (HDFS)
    --format text
    --summary-file examples/kdd99/s.json   (local FS)
    --transform Z
    --output-path kdd99   (HDFS)
    --output-type avro
    --id-column category
    --compress
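Assuming that `--transform Z` refers to the usual z-score standardization (the slides do not say so explicitly), each numeric feature would be rescaled to zero mean and unit variance. A minimal sketch for a single column follows; in the real tool the mean and standard deviation would presumably come from the summary file rather than being recomputed, and `ZScoreSketch` is a hypothetical name.

```java
/** Z-score standardization of one numeric column; illustrative only. */
final class ZScoreSketch {
    /** Returns (x - mean) / sd per value, using the population standard deviation. */
    static double[] standardize(double[] xs) {
        double mean = 0;
        for (double x : xs) {
            mean += x;
        }
        mean /= xs.length;

        double var = 0;
        for (double x : xs) {
            var += (x - mean) * (x - mean);
        }
        double sd = Math.sqrt(var / xs.length);

        double[] out = new double[xs.length];
        for (int i = 0; i < xs.length; i++) {
            // A constant column has sd == 0; map it to zeros instead of dividing by zero.
            out[i] = sd == 0 ? 0.0 : (xs[i] - mean) / sd;
        }
        return out;
    }
}
```

Standardizing matters for k-means because squared Euclidean distance would otherwise be dominated by whichever feature has the largest numeric range.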

[Figure: step 2, normalize. Inputs: kddcup.data_10_percent (HDFS) and s.json (local FS). Output: kdd99/ (HDFS)]

•  kdd99/part-m-0000[0|1].avro
    •  Examples (rows)
        •  Part 0: 442,454 vectors
        •  Part 1: 51,567 vectors
        •  Total: 494,021 vectors
    •  Features (columns)
        •  Before: 41 fields
        •  After: 143 fields


Cloudera ML: ksketch

•  client/bin/ml ksketch
    --input-paths kdd99   (HDFS)
    --format avro
    --points-per-iteration 500
    --output-file wc.avro   (local FS)
    --seed 1729
    --iterations 5
    --cross-folds 2

[Figure: step 3, ksketch. Input: kdd99/ (HDFS). Output: wc.avro (local FS)]

•  wc.avro
    •  Examples (rows)
        •  2 “folds” of 2501 examples
        •  1 initial example
        •  500 examples from each iteration (5 iterations)
        •  Each example has an associated weight
    •  Features (columns)
        •  143 features (still)
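The fold size follows directly from the ksketch options shown earlier (`--points-per-iteration 500`, `--iterations 5`), plus the single initial example:

```latex
\underbrace{1}_{\text{initial example}}
+ \underbrace{5}_{\text{iterations}} \times \underbrace{500}_{\text{points per iteration}}
= 2501 \ \text{examples per fold}
```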


Cloudera ML: kmeans

•  client/bin/ml kmeans
    --input-file wc.avro   (local FS)
    --centers-file centers.avro   (local FS)
    --seed 19
    --clusters 1,10,25,35,45
    --best-of 2
    --num-threads 4
    --eval-stats-file kmeans_stats.csv   (local FS)

[Figure: step 4, kmeans. Input: wc.avro (local FS). Outputs: centers.avro and kmeans_stats.csv (local FS)]

•  centers.avro
    •  1 row for each run of k-means++
    •  9 total runs: 1 for k=1, 2 each for k=10, 25, 35, and 45

•  kmeans_stats.csv
    •  Clustering quality scores
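For readers unfamiliar with k-means++, its distinguishing step is the seeding: the first center is drawn uniformly, and each later center is drawn with probability proportional to its squared distance from the nearest center chosen so far. The sketch below shows that standard seeding rule only. It is not Cloudera ML's code, it ignores the per-example weights that wc.avro carries, and `KMeansPlusPlusSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

/** Standard k-means++ seeding on unweighted points; illustrative only. */
final class KMeansPlusPlusSketch {
    static double squaredDistance(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            s += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return s;
    }

    /** Picks up to k initial centers: the first uniformly, the rest proportional to D(x)^2. */
    static List<double[]> seed(double[][] points, int k, Random rng) {
        List<double[]> centers = new ArrayList<>();
        centers.add(points[rng.nextInt(points.length)]);

        // d2[i] holds the squared distance from point i to its nearest chosen center.
        double[] d2 = new double[points.length];
        Arrays.fill(d2, Double.POSITIVE_INFINITY);

        while (centers.size() < k) {
            double[] newest = centers.get(centers.size() - 1);
            double total = 0;
            for (int i = 0; i < points.length; i++) {
                d2[i] = Math.min(d2[i], squaredDistance(points[i], newest));
                total += d2[i];
            }
            if (total == 0) {
                break;   // every point coincides with an existing center
            }
            double r = rng.nextDouble() * total;
            int chosen = -1;
            for (int i = 0; i < points.length; i++) {
                if (d2[i] == 0) {
                    continue;   // already covered by a center
                }
                chosen = i;     // also serves as a fallback against rounding error
                r -= d2[i];
                if (r < 0) {
                    break;
                }
            }
            centers.add(points[chosen]);
        }
        return centers;
    }
}
```

With `--clusters 1,10,25,35,45 --best-of 2`, a seeding step like this would run once per k-means++ run, which matches the 9 runs listed above if k=1 is run only once.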


Cloudera ML: kassign

•  client/bin/ml kassign
    --input-paths kdd99   (HDFS)
    --format avro
    --centers-file centers.avro   (local FS)
    --center-ids 4
    --output-path assigned   (HDFS)
    --output-type csv

[Figure: step 5, kassign. Inputs: kdd99/ (HDFS) and centers.avro (local FS). Output: assigned/ (HDFS)]

•  assigned/part-m-0000[0|1]
    •  Rows
        •  Part 0: 442,454
        •  Part 1: 51,567
        •  Total: 494,021
    •  Columns
        •  Point ID (normal/attack type, in this case)
        •  Index in centers.avro
        •  Assigned cluster ID
        •  Squared distance to nearest cluster
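The last two output columns can be understood through the basic assignment rule: each point goes to the center with the smallest squared Euclidean distance. The following is a minimal sketch of that rule under that assumption, with hypothetical class names; it is not the kassign implementation.

```java
/** Nearest-center assignment by squared Euclidean distance; illustrative only. */
final class NearestCenterSketch {
    /** Result pair mirroring the "assigned cluster ID" and "squared distance" columns. */
    static final class Assignment {
        final int clusterId;
        final double squaredDistance;

        Assignment(int clusterId, double squaredDistance) {
            this.clusterId = clusterId;
            this.squaredDistance = squaredDistance;
        }
    }

    static Assignment assign(double[] point, double[][] centers) {
        int best = -1;
        double bestD2 = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centers.length; c++) {
            double d2 = 0;
            for (int i = 0; i < point.length; i++) {
                double diff = point[i] - centers[c][i];
                d2 += diff * diff;
            }
            if (d2 < bestD2) {   // ties keep the lower-numbered center
                bestD2 = d2;
                best = c;
            }
        }
        return new Assignment(best, bestD2);
    }
}
```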


Cloudera ML: sample

•  client/bin/ml sample
    --input-paths assigned   (HDFS)
    --format text
    --header-file examples/kdd99/kassign_header.csv   (local FS)
    --weight-field squared_distance
    --group-fields clustering_id,closest_center_id
    --output-type csv
    --size 20
    --output-path extremal   (HDFS)

[Figure: step 6, sample. Inputs: assigned/ (HDFS) and kassign_header.csv (local FS). Output: extremal/ (HDFS)]

•  extremal/part-r-00000
    •  Rows
        •  Up to 20 examples from each cluster
        •  Examples that are furthest from the center of the cluster
    •  Columns
        •  Point ID (normal/attack type, in this case)
        •  Index in centers.avro
        •  Assigned cluster ID
        •  Squared distance to nearest cluster
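One simple way to picture the result described above is a per-cluster "top N by squared distance" selection. The sketch below is a deterministic simplification based only on the slide's description; the actual `sample` command takes a `--weight-field` and may well use weighted random sampling instead. `ExtremalSketch` and `Row` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Per-cluster selection of the points furthest from their center; illustrative only. */
final class ExtremalSketch {
    static final class Row {
        final String pointId;
        final int clusterId;
        final double squaredDistance;

        Row(String pointId, int clusterId, double squaredDistance) {
            this.pointId = pointId;
            this.clusterId = clusterId;
            this.squaredDistance = squaredDistance;
        }
    }

    /** Keeps, for each cluster, up to `size` rows with the largest squared distance. */
    static Map<Integer, List<Row>> furthest(List<Row> rows, int size) {
        Map<Integer, List<Row>> byCluster = new TreeMap<>();
        for (Row r : rows) {
            byCluster.computeIfAbsent(r.clusterId, k -> new ArrayList<>()).add(r);
        }
        for (Map.Entry<Integer, List<Row>> e : byCluster.entrySet()) {
            List<Row> group = e.getValue();
            group.sort(Comparator.comparingDouble((Row r) -> r.squaredDistance).reversed());
            e.setValue(new ArrayList<>(group.subList(0, Math.min(size, group.size()))));
        }
        return byCluster;
    }
}
```

Points far from every center are natural candidates for manual inspection, which fits the intrusion-detection flavour of the KDD Cup 99 data used in this walkthrough.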


Oryx


2014: Lab to Factory



Data Science Will Be Operational Analytics


I Built A Model. Now What?

[Figure: cycle of Build Model, Query Model, Collect Input, Repeat]


I Built A Model On Hadoop. Now What?

[Figure: cycle of Build Model, Query Model, Collect Input, Repeat, annotated with three question marks]


Example: Oryx


www.mwttl.com/wp-content/uploads/2013/11/IMG_5446_edited-2_mwttl.jpg


Gaps to fill, and Goals

•  Model Building
    •  Large-scale
    •  Continuous
    •  Apache Hadoop™-based
    •  Few, good algorithms

•  Model Serving
    •  Real-time query
    •  Real-time update

•  Algorithms
    •  Parallelizable
    •  Updateable
    •  Works on diverse input

•  Interoperable
    •  PMML model format
    •  Simple REST API
    •  Open source


Large-Scale or Real-Time?

Large-Scale, Offline, Batch

vs

Real-Time, Online, Streaming

Why Don’t We Have Both?

λ!


Lambda Architecture


•  Batch, Stream Processing are different

•  Tackle separately in 2+ Layers

•  Batch Layer: offline, asynchronous

•  Serving / Speed Layer: real-time, incremental, approximate

jameskinley.tumblr.com/post/37398560534/the-lambda-architecture-principles-for-architecting

… λ?


Batch

Serving/Speed


Two Layers

•  Computation Layer
    •  Java-based server process
    •  Client of Hadoop 2.x
    •  Periodically builds “generation” from recent data and past model
    •  Baby-sits MapReduce* jobs (or, locally in-core)
    •  Publishes models

•  Serving Layer
    •  Apache Tomcat™-based server process
    •  Consumes models from HDFS (or local FS)
    •  Serves queries from model in memory
    •  Updates from new input
    •  Also writes input to HDFS
    •  Replicas for scale

* Apache Spark later


Collaborative Filtering : ALS

•  Alternating Least Squares
    •  Latent-factor model
    •  Accepts implicit or explicit feedback
    •  Real-time update via fold-in of input
    •  No cold-start
    •  Parallelizable

[Figure: matrix factorization into factor matrices X and Yᵀ]


Clustering : k-means++

•  Well-known and understood
•  Parallelizable
•  Clusters updateable

cwiki.apache.org/confluence/display/MAHOUT/K-Means+Clustering


Classification / Regression : RDF

•  Random Decision Forests
    •  Ensemble method
    •  Numeric, categorical features and target
    •  Very parallel
    •  Nodes updateable
    •  Works well on many problems

[Figure: example decision tree with tests age > 30, female?, and income > 20000, and Yes/No outcomes]


PMML

•  Predictive Modeling Markup Language
•  XML-based format for predictive models
•  Standardized by Data Mining Group (www.dmg.org)
•  Wide tool support

<PMML xmlns="http://www.dmg.org/PMML-4_1"
      version="4.1">
  <Header copyright="www.dmg.org"/>
  <DataDictionary numberOfFields="5">
    <DataField name="temperature"
               optype="continuous"
               dataType="double"/>
    …
  </DataDictionary>
  <TreeModel modelName="golfing"
             functionName="classification">
    <MiningSchema>
      <MiningField name="temperature"/>
      …
    </MiningSchema>
    <Node score="will play">
      <Node score="will play">
        <SimplePredicate field="outlook"
                         operator="equal"
                         value="sunny"/>
        …
      </Node>
    </Node>
  </TreeModel>
</PMML>

www.dmg.org/v4-1/TreeModel.html
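Because PMML is plain XML, any standard XML parser can read it. As a small illustration of the "wide tool support" point, the sketch below uses only the JDK's built-in DOM parser to pull the `modelName` attribute out of a document shaped like the one above. `PmmlReadSketch` is a hypothetical name, and a real consumer would more likely use a dedicated PMML library.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Reads one attribute from a PMML string with the JDK DOM parser; illustrative only. */
final class PmmlReadSketch {
    /** Returns the modelName of the first TreeModel element, or null if there is none. */
    static String treeModelName(String pmml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(pmml.getBytes(StandardCharsets.UTF_8)));
            NodeList models = doc.getElementsByTagName("TreeModel");
            if (models.getLength() == 0) {
                return null;
            }
            return ((Element) models.item(0)).getAttribute("modelName");
        } catch (Exception e) {
            throw new IllegalArgumentException("not parseable as XML", e);
        }
    }
}
```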


HTTP REST API

•  Convention for RPC-like request / response
•  HTTP verbs, transport
    •  GET : query
    •  POST : add input
•  Easy from browser, CLI, Java, Python, Scala, etc.

GET /recommend/jwills

HTTP/1.1 200 OK
Content-Type: text/plain

"Ray LaMontagne",0.951
"Fleet Foxes",0.7905
"The National",0.688
"Shearwater",0.3017
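The plain-text response is easy to consume from client code. The sketch below parses a body in the format shown above into an ordered map from item to score. It assumes every non-empty line is well-formed (`"name",score`) and leaves the HTTP call itself out; `RecommendResponseSketch` is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Parses a "name",score response body; illustrative only. */
final class RecommendResponseSketch {
    /** Returns item -> score in response order; assumes well-formed lines. */
    static Map<String, Double> parse(String body) {
        Map<String, Double> out = new LinkedHashMap<>();
        for (String line : body.split("\\R")) {
            line = line.trim();
            if (line.isEmpty()) {
                continue;
            }
            // Split on the last comma so that commas inside the quoted name survive.
            int comma = line.lastIndexOf(',');
            String name = line.substring(0, comma).replaceAll("^\"|\"$", "");
            out.put(name, Double.parseDouble(line.substring(comma + 1)));
        }
        return out;
    }
}
```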


Wish List

•  Revamp workflow
    •  Spark / Crunch-like API, not raw M/R

•  De-emphasize model building
    •  Well-solved
    •  Bring your own

•  More component-ized
    •  Less black-box service
    •  Emphasize integration
        •  PMML, etc.

•  “Pull” options
    •  Kafka?
    •  Hive / Impala?


Open Source

github.com/cloudera/oryx

100% Apache License 2.0