Lecture 7 - CS 246h

Stanford CS 246H Winter ’14

Stanford CS 246H: Mining Massive Data Sets Hadoop Lab

Machine Learning & Hadoop

Peanut Butter and Chocolate?

• The Promise of Big Data™
  • Sounds great, but how?
• Hadoop talent pool is small
• ML talent pool is tiny
• Tools and toolkits starting to appear
  • Mahout, Oryx, Alpine, Ayasdi, Skytree, etc.
• Summary: Hadoop is hard, and ML is hard
  1. Lots of people/companies are trying to make it easy
  2. Don’t believe anyone who tells you they make it easy

Hadoop & ML: A Brief History

• 2005 – Taste project started on SourceForge
• 2007 – Mahout project started at Apache
• 2008 – Taste donated to Mahout
• … time passes …
• 2012 – Myrrix is launched
• 2013 – Cloudera ML project started on GitHub
• Late 2013 – Oryx project started on GitHub

Hadoop ML Family Tree

[Figure: family tree of Hadoop ML projects. Nodes: Taste, Mahout, Myrrix, Cloudera ML, Oryx, Lucene, Andrew Ng.]

Apache Mahout

What is Mahout?

• “Scalable machine learning”
  • not just Hadoop-oriented machine learning
  • not entirely, that is. Just mostly.
• Components
  • math library
  • clustering
  • classification
  • decompositions
  • recommendations

©MapR  Technologies  2013  

Mahout Math

• Goals are
  • basic linear algebra,
  • and statistical sampling,
  • and good clustering,
  • decent speed,
  • extensibility,
  • especially for sparse data
• But not
  • totally badass speed
  • comprehensive set of algorithms
  • optimization, root finders, quadrature

Caveat Emptor

• Mahout is a toolkit
• There is a command line interface
  • You can’t always use it
  • Very often you end up writing code
• Documentation is… ahem… scant
  • Best reference is Mahout in Action
• Varying levels of maturity
• Varying levels of Hadoop support

Matrices and Vectors

• At the core:
  • DenseVector, RandomAccessSparseVector
  • DenseMatrix, SparseRowMatrix
• Highly composable API
• Important ideas:
  • view*, assign and aggregate
  • iteration

    m.viewDiagonal().assign(v)

Assign? View?

• Why assign?
  • Copying is the major cost for naïve matrix packages
  • In-place operations critical to reasonable performance
  • Many kinds of updates required, so functional style very helpful
• Why view?
  • In-place operations often required for blocks, rows, columns or diagonals
  • With views, we need #assign + #views methods
  • Without views, we need #assign × #views methods
• Synergies
  • With both views and assign, many loops become single line
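The counting argument above can be made concrete with a small stand-alone sketch. It does not use Mahout: `VecView`, `diagonal`, `row`, `assign`, and `sum` are hypothetical names chosen for illustration. Each view only maps an index onto a cell of the backing matrix, so an update or reduction written once against the view interface works for rows and diagonals alike, with no copying.

```java
import java.util.function.DoubleUnaryOperator;

// Stand-alone illustration of the view + assign idea; not Mahout code.
// All names here (VecView, diagonal, row, assign, sum) are hypothetical.
final class ViewAssignSketch {

    /** A writable window onto cells of a backing matrix; no data is copied. */
    interface VecView {
        int size();
        double get(int i);
        void set(int i, double value);

        /** In-place functional update: x[i] = f(x[i]) for every element. */
        default VecView assign(DoubleUnaryOperator f) {
            for (int i = 0; i < size(); i++) {
                set(i, f.applyAsDouble(get(i)));
            }
            return this;
        }

        /** Sum of all elements seen through the view. */
        default double sum() {
            double total = 0.0;
            for (int i = 0; i < size(); i++) {
                total += get(i);
            }
            return total;
        }
    }

    /** View of the main diagonal of a non-empty matrix m. */
    static VecView diagonal(double[][] m) {
        return new VecView() {
            public int size() { return Math.min(m.length, m[0].length); }
            public double get(int i) { return m[i][i]; }
            public void set(int i, double v) { m[i][i] = v; }
        };
    }

    /** View of row r of m. */
    static VecView row(double[][] m, int r) {
        return new VecView() {
            public int size() { return m[r].length; }
            public double get(int i) { return m[r][i]; }
            public void set(int i, double v) { m[r][i] = v; }
        };
    }

    public static void main(String[] args) {
        double[][] m = {{1, 2}, {3, 4}};
        System.out.println(diagonal(m).sum());   // trace of m
        diagonal(m).assign(x -> 0.0);            // zero the diagonal in place
        row(m, 1).assign(x -> 2 * x);            // scale the second row in place
        System.out.println(m[1][0]);
    }
}
```

Adding a column view would need only one more small factory method; `assign` and `sum` come for free, which is the "#assign + #views" point.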

Assign

• Matrices

    Matrix assign(double value);
    Matrix assign(double[][] values);
    Matrix assign(Matrix other);
    Matrix assign(DoubleFunction f);
    Matrix assign(Matrix other, DoubleDoubleFunction f);

• Vectors

    Vector assign(double value);
    Vector assign(double[] values);
    Vector assign(Vector other);
    Vector assign(DoubleFunction f);
    Vector assign(Vector other, DoubleDoubleFunction f);
    Vector assign(DoubleDoubleFunction f, double y);

Views

• Matrices

    Matrix viewPart(int[] offset, int[] size);
    Matrix viewPart(int row, int rlen, int col, int clen);
    Vector viewRow(int row);
    Vector viewColumn(int column);
    Vector viewDiagonal();

• Vectors

    Vector viewPart(int offset, int length);

Aggregates

• Matrices

    double zSum();
    Vector aggregateRows(VectorFunction f);
    Vector aggregateColumns(VectorFunction f);
    double aggregate(DoubleDoubleFunction combiner,
                     DoubleFunction mapper);

• Vectors

    double zSum();
    double aggregate(
        DoubleDoubleFunction reduce, DoubleFunction map);
    double aggregate(Vector other,
        DoubleDoubleFunction aggregator,
        DoubleDoubleFunction combiner);

Predefined Functions

• Many handy functions

    ABS        LOG2
    ACOS       NEGATE
    ASIN       RINT
    ATAN       SIGN
    CEIL       SIN
    COS        SQRT
    EXP        SQUARE
    FLOOR      SIGMOID
    IDENTITY   SIGMOIDGRADIENT
    INV        TAN
    LOGARITHM

Examples

A = α

    double alpha;
    a.assign(alpha);

A = αB + β

    a.assign(b, Functions.chain(
        Functions.plus(beta),
        Functions.mult(alpha)));

Sparse Optimizations

• DoubleDoubleFunction abstract properties

    public boolean isLikeRightPlus();
    public boolean isLikeLeftMult();
    public boolean isLikeRightMult();
    public boolean isLikeMult();
    public boolean isCommutative();
    public boolean isAssociative();
    public boolean isAssociativeAndCommutative();
    public boolean isDensifying();

• And Vector properties

    public boolean isDense();
    public boolean isSequentialAccess();
    public double getLookupCost();
    public double getIteratorAdvanceCost();
    public boolean isAddConstantTime();

Examples

• The trace of a matrix

    m.viewDiagonal().zSum()

• Set diagonal to zero

    m.viewDiagonal().assign(0)

• Set diagonal to negative of row sums excluding the diagonal

    Vector diag = m.viewDiagonal().assign(0);
    diag.assign(m.rowSums().assign(Functions.NEGATE));

Iteration

• Matrices are Iterable in Mahout

    // compute both row and column sums in one pass
    for (MatrixSlice row : m) {
      rSums.set(row.index(), row.zSum());
      cSums.assign(row, Functions.PLUS);
    }

• Vectors are densely or sparsely iterable

    // entropy is -sum(p * log(p)); zero entries contribute nothing
    double entropy = 0;
    for (Vector.Element e : v.iterateNonZero()) {
      entropy -= e.get() * Math.log(e.get());
    }
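Sparse iteration is safe for an entropy computation because p·log(p) tends to 0 as p tends to 0, so skipped zeros contribute nothing. A minimal stand-alone sketch of that idea follows. It is not Mahout code, and modeling the sparse vector as a map from index to non-zero value is an assumption made only for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

// Stand-alone illustration of sparse iteration; not Mahout code.
// The sparse vector is modeled as a map from index to non-zero value (an assumption).
final class SparseEntropySketch {

    /** Shannon entropy in nats, -sum p*ln(p), visiting only the stored entries. */
    static double entropy(Map<Integer, Double> nonZeros) {
        double h = 0.0;
        for (double p : nonZeros.values()) {
            if (p > 0.0) {               // guard against explicitly stored zeros
                h -= p * Math.log(p);
            }
        }
        return h;
    }

    public static void main(String[] args) {
        // A distribution over 1,000,000 outcomes with only two non-zero probabilities.
        Map<Integer, Double> v = new TreeMap<>();
        v.put(3, 0.5);
        v.put(999_999, 0.5);
        System.out.println(entropy(v));  // a fair two-way split gives ln 2
    }
}
```

The loop touches two entries rather than a million, which is the practical reason to prefer non-zero iteration on sparse data.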

Random Sampling

• Samples from some type

    public interface Sampler<T> {
      T sample();
    }

    public abstract class AbstractSamplerFunction
        extends DoubleFunction
        implements Sampler<Double>

• Lots of kinds

    ChineseRestaurant   Missing       Normal
    Empirical           Multinomial   PoissonSampler
    IndianBuffet        MultiNormal   Sampler
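To illustrate "samplers as functions", here is a stand-alone sketch that does not depend on Mahout. The `Sampler<T>` interface mirrors the one on the slide; `NormalSketch`, its constructor parameters, and the use of `java.util.function.DoubleUnaryOperator` in place of Mahout's `DoubleFunction` are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.function.DoubleUnaryOperator;

// Stand-alone illustration; not Mahout code. Sampler mirrors the slide's interface;
// NormalSketch and its constructor parameters are hypothetical.
final class SamplerSketch {

    interface Sampler<T> {
        T sample();
    }

    /** Draws from a normal distribution and also acts as a function of one double. */
    static final class NormalSketch implements Sampler<Double>, DoubleUnaryOperator {
        private final double mean;
        private final double sd;
        private final Random rng;

        NormalSketch(double mean, double sd, long seed) {
            this.mean = mean;
            this.sd = sd;
            this.rng = new Random(seed);
        }

        @Override
        public Double sample() {
            return mean + sd * rng.nextGaussian();
        }

        /** Function form: ignores its input and returns a fresh sample. */
        @Override
        public double applyAsDouble(double ignored) {
            return sample();
        }
    }

    public static void main(String[] args) {
        NormalSketch normal = new NormalSketch(10.0, 2.0, 42L);
        double[] v = new double[5];
        for (int i = 0; i < v.length; i++) {
            v[i] = normal.applyAsDouble(v[i]);   // analogous to an in-place assign with a sampler
        }
        System.out.println(Arrays.toString(v));
    }
}
```

Because the sampler is also a function, it can be handed to any code that applies a function element by element, which is how a vector can be filled with random draws in one call.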

Mahout Math Summary

• Matrices, Vectors
  • views
  • in-place assignment
  • aggregations
  • iterations
• Functions
  • lots built-in
  • cooperate with sparse vector optimizations
• Sampling
  • abstract samplers
  • samplers as functions
• Other stuff … clustering, SVD

Other Stuff

• Matrix Decomposition
• Classification
• Clustering
• Recommendations

Focus: Machine Learning

[Figure: Mahout component diagram. Labels: Applications; Examples; Recommenders; Clustering; Classification; Freq. Pattern Mining; Genetic; Math (Vectors/Matrices/SVD); Utilities (Lucene/Vectorizer); Collections (primitives); Apache Hadoop.]

See http://cwiki.apache.org/confluence/display/MAHOUT/Algorithms

©Lucid  ImaginaKon  2010  

Prepare Data from Raw content

• Data Sources:
  • Lucene integration
    • bin/mahout lucenevector …
  • Document Vectorizer
    • bin/mahout seqdirectory …
    • bin/mahout seq2sparse …
  • Programmatically
    • See the Utils module in Mahout
  • Database
  • File system

Recommendations

• Extensive framework for collaborative filtering
• Recommenders
  • User based, Item based, ALS, SlopeOne, SVD, others
• Online and Offline support
  • Offline can utilize Hadoop
• Many different Similarity measures
  • Cosine, LLR, Tanimoto, Pearson, others

Clustering

• Document level
  • Group documents based on a notion of similarity
  • K-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift
  • Distance Measures
    • Manhattan, Euclidean, other
• Topic Modeling
  • Cluster words across documents to identify topics
  • Latent Dirichlet Allocation

Categorization

• Place new items into predefined categories:
  • Sports, politics, entertainment
• Mahout has several implementations
  • Naïve Bayes
  • Complementary Naïve Bayes
  • Decision Forests
  • Logistic Regression (SGD)

Freq. Pattern Mining

• Identify frequently co-occurrent items
• Useful for:
  • Query Recommendations
    • Apple -> iPhone, orange, OS X
  • Related product placement
    • “Beer and Diapers”
  • Spam Detection
    • Yahoo: http://www.slideshare.net/hadoopusergroup/mail-antispam

[Image source: http://www.amazon.com]

Evolutionary

• Map-Reduce ready fitness functions for genetic programming
• Integration with Watchmaker
  • http://watchmaker.uncommons.org/index.php
• Problems solved:
  • Traveling salesman
  • Class discovery
  • Many others

Singular Value Decomposition

• Reduces a big matrix into a much smaller matrix by amplifying the important parts while removing/reducing the less important parts
• Mahout has a fully distributed Lanczos implementation

    <MAHOUT_HOME>/bin/mahout svd \
      -Dmapred.input.dir=path/to/corpus \
      --tempDir path/for/svd-output \
      --rank 300 \
      --numColumns <numcols> \
      --numRows <num rows in the input>

    <MAHOUT_HOME>/bin/mahout cleansvd \
      --eigenInput path/for/svd-output \
      --corpusInput path/to/corpus \
      --output path/for/cleanOutput \
      --maxError 0.1 \
      --minEigenvalue 10.0

• https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction

How to: Command Line

• Most algorithms have a Driver program
  • Shell script in $MAHOUT_HOME/bin helps with most tasks
• Prepare the Data
  • Different algorithms require different setup
• Run the algorithm
  • Single Node
  • Hadoop
• Print out the results
  • Several helper classes:
    • LDAPrintTopics, ClusterDumper, etc.

Ugly Demo II - Prep

• Data Set: Reuters
  • http://www.daviddlewis.com/resources/testcollections/reuters21578/
  • Convert to Text via http://www.lucenebootcamp.com/lucene-boot-camp-preclass-training/
• Convert to Sequence File:

    bin/mahout seqdirectory --input <PATH> --output <PATH> --charset UTF-8

• Convert to Sparse Vector:

    bin/mahout seq2sparse \
      --input <PATH>/content/reuters/seqfiles/ \
      --norm 2 --weight TF \
      --output <PATH>/content/reuters/seqfiles-TF/ \
      --minDF 5 --maxDFPercent 90

Ugly Demo II: Topic Modeling

• Latent Dirichlet Allocation

    ./mahout lda \
      --input <PATH>/content/reuters/seqfiles-TF/vectors/ \
      --output <PATH>/content/reuters/seqfiles-TF/lda-output \
      --numWords 34000 --numTopics 10

    ./mahout org.apache.mahout.clustering.lda.LDAPrintTopics \
      --input <PATH>/content/reuters/seqfiles-TF/lda-output/state-19 \
      --dict <PATH>/content/reuters/seqfiles-TF/dictionary.file-0 \
      --words 10 \
      --output <PATH>/content/reuters/seqfiles-TF/lda-output/topics \
      --dictionaryType sequencefile

• Good feature reduction (stopword removal) required

Ugly Demo III: Clustering

• K-Means
  • Same Prep as UD II, except use TFIDF weight

    ./mahout kmeans \
      --input <PATH>/content/reuters/seqfiles-TFIDF/vectors/part-00000 \
      --k 15 \
      --output <PATH>/content/reuters/seqfiles-TFIDF/output-kmeans \
      --clusters <PATH>/content/reuters/seqfiles-TFIDF/output-kmeans/clusters

• Print out the clusters:

    ./mahout clusterdump \
      --seqFileDir <PATH>/content/reuters/seqfiles-TFIDF/output-kmeans/clusters-15/ \
      --pointsDir <PATH>/content/reuters/seqfiles-TFIDF/output-kmeans/points/ \
      --dictionary <PATH>/content/reuters/seqfiles-TFIDF/dictionary.file-0 \
      --dictionaryType sequencefile \
      --substring 20

Ugly Demo IV: Frequent Pattern Mining

• Data: http://fimi.cs.helsinki.fi/data/

    ./mahout fpg -i <PATH>/content/freqitemset/accidents.dat -o patterns -k 50 -method mapreduce -g 10 -regex [\ ]

    ./mahout seqdump --seqFile patterns/fpgrowth/part-r-00000

Cloudera ML

• Collection of Java libraries and command-line tools
• Goal: make data scientists more productive with CDH
  • Exploratory data analysis
  • Data preparation
  • Model fitting
  • Model evaluation
• Apache 2.0 licensed
• Developed on GitHub
  • http://github.com/cloudera/ml

Cloudera ML: Building Blocks

• Apache Hadoop
  • Scalable data storage (HDFS) and processing (MapReduce)
• Apache Hive
  • Metadata for structured data in HDFS
• Apache Crunch
  • Easy MapReduce pipelines
• Apache Mahout
  • Vector interface
• Apache Avro
  • Serialization format

Cloudera ML Workflow: Clustering

Cloudera ML: summary

    client/bin/ml summary
      --input-paths kddcup.data_10_percent      (HDFS)
      --format text
      --header-file examples/kdd99/header.csv   (local FS)
      --summary-file examples/kdd99/s.json      (local FS)

[Figure: data flow across HDFS and the local FS for step 1, summary. Files shown: kddcup.data_10_percent, header.csv; output added: s.json.]

• s.json
  • Categorical features: histogram
  • Numerical features: distribution summary

Cloudera ML: normalize

    client/bin/ml normalize
      --input-paths kddcup.data_10_percent    (HDFS)
      --format text
      --summary-file examples/kdd99/s.json    (local FS)
      --transform Z
      --output-path kdd99                     (HDFS)
      --output-type avro
      --id-column category
      --compress

[Figure: data flow across HDFS and the local FS for step 2, normalize. Files shown: kddcup.data_10_percent, header.csv, s.json; output added: kdd99/.]

• kdd99/part-m-0000[0|1].avro
  • Examples (rows)
    • Part 0: 442,454 vectors
    • Part 1: 51,567 vectors
    • Total: 494,021 vectors
  • Features (columns)
    • Before: 41 fields
    • After: 143 fields
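As a rough illustration of what a Z transform does to one numeric column, the sketch below standardizes values to zero mean and unit variance. It assumes that `--transform Z` denotes the usual z-score, which the slides do not spell out, and it is not Cloudera ML code; `ZScoreSketch` and `standardize` are hypothetical names. In the workflow above the mean and deviation would presumably come from the summary file rather than be recomputed.

```java
import java.util.Arrays;

// Stand-alone illustration of a z-score transform; not Cloudera ML code.
final class ZScoreSketch {

    /** Returns (x - mean) / sd for each value, using the population standard deviation. */
    static double[] standardize(double[] column) {
        int n = column.length;
        double mean = 0.0;
        for (double x : column) {
            mean += x;
        }
        mean /= n;

        double var = 0.0;
        for (double x : column) {
            var += (x - mean) * (x - mean);
        }
        double sd = Math.sqrt(var / n);

        double[] z = new double[n];
        for (int i = 0; i < n; i++) {
            // A constant column has sd == 0; map it to zeros rather than divide by zero.
            z[i] = sd == 0.0 ? 0.0 : (column[i] - mean) / sd;
        }
        return z;
    }

    public static void main(String[] args) {
        double[] z = standardize(new double[] {2.0, 4.0, 6.0});
        System.out.println(Arrays.toString(z));
    }
}
```

Putting every numeric feature on the same scale matters for the later k-means steps, since squared distances would otherwise be dominated by the features with the largest raw ranges.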

Cloudera ML: ksketch

    client/bin/ml ksketch
      --input-paths kdd99            (HDFS)
      --format avro
      --points-per-iteration 500
      --output-file wc.avro          (local FS)
      --seed 1729
      --iterations 5
      --cross-folds 2

[Figure: data flow across HDFS and the local FS for step 3, ksketch. Files shown: kddcup.data_10_percent, header.csv, s.json, kdd99/; output added: wc.avro.]

• wc.avro
  • Examples (rows)
    • 2 “folds” of 2501 examples
    • 1 initial example
    • 500 examples from each iteration (5 iterations)
    • Each example has an associated weight
  • Features (columns)
    • 143 features (still)
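The fold size follows directly from the ksketch flags (`--iterations 5`, `--points-per-iteration 500`) plus the single initial example:

```latex
\[
  \underbrace{1}_{\text{initial example}}
  + \underbrace{5}_{\text{iterations}} \times \underbrace{500}_{\text{points per iteration}}
  = 2501 \ \text{examples per fold},
  \qquad
  2 \ \text{folds} \times 2501 = 5002 \ \text{examples in total}.
\]
```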

Cloudera ML: kmeans

    client/bin/ml kmeans
      --input-file wc.avro                  (local FS)
      --centers-file centers.avro           (local FS)
      --seed 19
      --clusters 1,10,25,35,45
      --best-of 2
      --num-threads 4
      --eval-stats-file kmeans_stats.csv    (local FS)

[Figure: data flow across HDFS and the local FS for step 4, kmeans. Files shown: kddcup.data_10_percent, header.csv, s.json, kdd99/, wc.avro; outputs added: kmeans_stats.csv, centers.avro.]

• centers.avro
  • 1 row for each run of k-means++
  • 9 total runs: 1 for k=1, 2 each for k=10, 25, 35, and 45
• kmeans_stats.csv
  • Clustering quality scores
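The run count is consistent with the flags `--clusters 1,10,25,35,45` and `--best-of 2`, given that k=1 is run only once as stated above:

```latex
\[
  \underbrace{1}_{k = 1}
  + \underbrace{4}_{k \in \{10,\, 25,\, 35,\, 45\}} \times \underbrace{2}_{\text{best-of}}
  = 9 \ \text{runs}.
\]
```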

Cloudera ML: kassign

    client/bin/ml kassign
      --input-paths kdd99            (HDFS)
      --format avro
      --centers-file centers.avro    (local FS)
      --center-ids 4
      --output-path assigned         (HDFS)
      --output-type csv

[Figure: data flow across HDFS and the local FS for step 5, kassign. Files shown: kddcup.data_10_percent, header.csv, s.json, kdd99/, wc.avro, centers.avro; output added: assigned/.]

• assigned/part-m-0000[0|1]
  • Rows
    • Part 0: 442,454
    • Part 1: 51,567
    • Total: 494,021
  • Columns
    • Point ID (normal/attack type, in this case)
    • Index in centers.avro
    • Assigned cluster ID
    • Squared distance to nearest cluster

Cloudera ML: sample

    client/bin/ml sample
      --input-paths assigned                              (HDFS)
      --format text
      --header-file examples/kdd99/kassign_header.csv     (local FS)
      --weight-field squared_distance
      --group-fields clustering_id,closest_center_id
      --output-type csv
      --size 20
      --output-path extremal                              (HDFS)

[Figure: data flow across HDFS and the local FS for step 6, sample. Files shown: kddcup.data_10_percent, header.csv, s.json, kdd99/, wc.avro, centers.avro, assigned/, kassign_header.csv; output added: extremal/.]

• extremal/part-r-00000
  • Rows
    • Up to 20 examples from each cluster
    • Examples that are furthest from the center of the cluster
  • Columns
    • Point ID (normal/attack type, in this case)
    • Index in centers.avro
    • Assigned cluster ID
    • Squared distance to nearest cluster

Oryx

2014: Lab to Factory

Data Science Will Be Operational Analytics

I Built A Model. Now What?

[Figure: cycle of Collect Input, Build Model, Query Model, Repeat.]

I Built A Model On Hadoop. Now What?

[Figure: the same cycle of Collect Input, Build Model, Query Model, Repeat, annotated with “? ? ?”.]

Example: Oryx

[Image source: www.mwttl.com/wp-content/uploads/2013/11/IMG_5446_edited-2_mwttl.jpg]

Gaps to fill, and Goals

• Model Building
  • Large-scale
  • Continuous
  • Apache Hadoop™-based
  • Few, good algorithms
• Model Serving
  • Real-time query
  • Real-time update
• Algorithms
  • Parallelizable
  • Updateable
  • Works on diverse input
• Interoperable
  • PMML model format
  • Simple REST API
  • Open source

Large-Scale or Real-Time?

Large-Scale Offline Batch vs Real-Time Online Streaming

Why Don’t We Have Both?

λ!

Lambda Architecture

… λ?

• Batch, Stream Processing are different
• Tackle separately in 2+ Layers
• Batch Layer: offline, asynchronous
• Serving / Speed Layer: real-time, incremental, approximate

jameskinley.tumblr.com/post/37398560534/the-lambda-architecture-principles-for-architecting

[Figure: architecture diagram with the Batch and Serving/Speed layers labeled.]

Two Layers

• Computation Layer
  • Java-based server process
  • Client of Hadoop 2.x
  • Periodically builds “generation” from recent data and past model
  • Baby-sits MapReduce* jobs (or, locally in-core)
  • Publishes models
• Serving Layer
  • Apache Tomcat™-based server process
  • Consumes models from HDFS (or local FS)
  • Serves queries from model in memory
  • Updates from new input
  • Also writes input to HDFS
  • Replicas for scale

* Apache Spark later

Collaborative Filtering: ALS

• Alternating Least Squares
• Latent-factor model
• Accepts implicit or explicit feedback
• Real-time update via fold-in of input
• No cold-start
• Parallelizable

[Figure: matrix factorization diagram with factors labeled X and Yᵀ.]

Clustering: k-means++

• Well-known and understood
• Parallelizable
• Clusters updateable

[Image source: cwiki.apache.org/confluence/display/MAHOUT/K-Means+Clustering]

Classification / Regression: RDF

• Random Decision Forests
• Ensemble method
• Numeric, categorical features and target
• Very parallel
• Nodes updateable
• Works well on many problems

[Figure: example decision tree with the tests “age > 30”, “female?”, and “income > 20000” and Yes/No outcomes.]

PMML

• Predictive Modeling Markup Language
• XML-based format for predictive models
• Standardized by Data Mining Group (www.dmg.org)
• Wide tool support

    <PMML xmlns="http://www.dmg.org/PMML-4_1"
          version="4.1">
      <Header copyright="www.dmg.org"/>
      <DataDictionary numberOfFields="5">
        <DataField name="temperature"
                   optype="continuous"
                   dataType="double"/>
        …
      </DataDictionary>
      <TreeModel modelName="golfing"
                 functionName="classification">
        <MiningSchema>
          <MiningField name="temperature"/>
          …
        </MiningSchema>
        <Node score="will play">
          <Node score="will play">
            <SimplePredicate field="outlook"
                             operator="equal"
                             value="sunny"/>
            …
          </Node>
        </Node>
      </TreeModel>
    </PMML>

www.dmg.org/v4-1/TreeModel.html

HTTP REST API

• Convention for RPC-like request / response
• HTTP verbs, transport
  • GET : query
  • POST : add input
• Easy from browser, CLI, Java, Python, Scala, etc.

    GET /recommend/jwills

    HTTP/1.1 200 OK
    Content-Type: text/plain

    "Ray LaMontagne",0.951
    "Fleet Foxes",0.7905
    "The National",0.688
    "Shearwater",0.3017
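Because the response is plain text, a client needs very little code to consume it. The sketch below parses a body shaped like the example above: one line per item, a quoted ID, a comma, then a numeric score. It is not Oryx client code; `RecommendResponseSketch`, `Scored`, and `parse` are hypothetical names, and the line format is inferred only from the sample response.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-alone illustration; not Oryx client code. The line format is inferred
// from the sample response: "<quoted id>",<score>, one item per line.
final class RecommendResponseSketch {

    static final class Scored {
        final String id;
        final double score;

        Scored(String id, double score) {
            this.id = id;
            this.score = score;
        }
    }

    static List<Scored> parse(String body) {
        List<Scored> out = new ArrayList<>();
        for (String raw : body.split("\n")) {
            String line = raw.trim();
            int comma = line.lastIndexOf(',');   // use the last comma: IDs may contain commas
            if (line.isEmpty() || comma < 0) {
                continue;                        // skip blank or malformed lines
            }
            String id = line.substring(0, comma).trim();
            if (id.length() >= 2 && id.startsWith("\"") && id.endsWith("\"")) {
                id = id.substring(1, id.length() - 1);   // strip surrounding quotes
            }
            double score = Double.parseDouble(line.substring(comma + 1).trim());
            out.add(new Scored(id, score));
        }
        return out;
    }

    public static void main(String[] args) {
        String body = "\"Ray LaMontagne\",0.951\n\"Fleet Foxes\",0.7905\n";
        for (Scored s : parse(body)) {
            System.out.println(s.id + " -> " + s.score);
        }
    }
}
```

The body string would normally come from an HTTP GET against the serving layer; that call is left out here so the sketch stays self-contained.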

 

Wish List

• Revamp workflow
  • Spark / Crunch-like API, not raw M/R
• De-emphasize model building
  • Well-solved
  • Bring your own
• More component-ized
  • Less black-box service
• Emphasize integration
  • PMML, etc.
• “Pull” options
  • Kafka?
  • Hive / Impala?

Open Source

github.com/cloudera/oryx

100% Apache License 2.0
