Lecture 7 - CS 246h

Stanford CS 246H Winter ‘14

Stanford CS 246H: Mining Massive Data Sets Hadoop Lab

Machine Learning & Hadoop

Peanut Butter and Chocolate?

•  The Promise of Big Data™
    •  Sounds great, but how?

•  Hadoop talent pool is small
•  ML talent pool is tiny

•  Tools and toolkits starting to appear
    •  Mahout, Oryx, Alpine, Ayasdi, Skytree, etc.

•  Summary: Hadoop is hard, and ML is hard
    1.  Lots of people/companies are trying to make it easy
    2.  Don’t believe anyone who tells you they make it easy

Hadoop & ML: A Brief History

•  2005 – Taste project started on SourceForge
•  2007 – Mahout project started at Apache
•  2008 – Taste donated to Mahout
•  … time passes …
•  2012 – Myrrix is launched
•  2013 – Cloudera ML project started on GitHub
•  Late 2013 – Oryx project started on GitHub

Hadoop ML Family Tree

[Figure: family tree relating Lucene, Andrew Ng, Mahout, Myrrix, and Cloudera ML.]

Apache Mahout

What is Mahout?

•  “Scalable machine learning”
    •  not just Hadoop-oriented machine learning
    •  not entirely, that is. Just mostly.

•  Components
    •  math library
    •  clustering
    •  classification
    •  decompositions
    •  recommendations

©MapR Technologies 2013

Mahout Math

•  Goals are
    •  basic linear algebra,
    •  and statistical sampling,
    •  and good clustering,
    •  decent speed,
    •  extensibility,
    •  especially for sparse data

•  But not
    •  totally badass speed
    •  comprehensive set of algorithms
    •  optimization, root finders, quadrature

Caveat Emptor

•  Mahout is a toolkit
    •  There is a command line interface
    •  You can’t always use it
    •  Very often end up writing code

•  Documentation is… ahem… scant
    •  Best reference is Mahout in Action

•  Varying levels of maturity
•  Varying levels of Hadoop support

Matrices and Vectors

•  At the core:
    •  DenseVector, RandomAccessSparseVector
    •  DenseMatrix, SparseRowMatrix

•  Highly composable API

•  Important ideas:
    •  view*, assign and aggregate
    •  iteration

    m.viewDiagonal().assign(v)

Assign? View?

•  Why assign?
    •  Copying is the major cost for naïve matrix packages
    •  In-place operations are critical to reasonable performance
    •  Many kinds of updates are required, so a functional style is very helpful

•  Why view?
    •  In-place operations are often required for blocks, rows, columns or diagonals
    •  With views, we need #assign + #views methods
    •  Without views, we need #assign × #views methods

•  Synergies
    •  With both views and assign, many loops become a single line
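The "#assign + #views" point can be illustrated with a small plain-Java sketch. It does not use Mahout: `Vec`, `TinyMatrix` and their methods are hypothetical stand-ins that only borrow the method names from the slides. `assign` is written once on the vector abstraction, each view is written once on the matrix, and any combination of the two works without copying.

```java
import java.util.Arrays;
import java.util.function.DoubleUnaryOperator;

class ViewAssignSketch {
    /** Hypothetical minimal vector abstraction; NOT the Mahout Vector interface. */
    interface Vec {
        int size();
        double get(int i);
        void set(int i, double value);

        /** Written once; works on every view because views write through. */
        default Vec assign(double value) {
            for (int i = 0; i < size(); i++) set(i, value);
            return this;
        }

        default Vec assign(DoubleUnaryOperator f) {
            for (int i = 0; i < size(); i++) set(i, f.applyAsDouble(get(i)));
            return this;
        }

        default double zSum() {
            double sum = 0;
            for (int i = 0; i < size(); i++) sum += get(i);
            return sum;
        }
    }

    /** Hypothetical dense matrix whose views share the backing array (no copies). */
    static class TinyMatrix {
        final double[][] data;

        TinyMatrix(double[][] data) { this.data = data; }

        Vec viewRow(int r) {
            return new Vec() {
                public int size() { return data[r].length; }
                public double get(int i) { return data[r][i]; }
                public void set(int i, double v) { data[r][i] = v; }
            };
        }

        Vec viewDiagonal() {
            return new Vec() {
                public int size() { return Math.min(data.length, data[0].length); }
                public double get(int i) { return data[i][i]; }
                public void set(int i, double v) { data[i][i] = v; }
            };
        }
    }

    public static void main(String[] args) {
        TinyMatrix m = new TinyMatrix(new double[][] {{1, 2}, {3, 4}});
        System.out.println(m.viewDiagonal().zSum());      // trace of the matrix
        m.viewDiagonal().assign(0);                       // zero the diagonal in place
        m.viewRow(0).assign(x -> x * 10);                 // scale one row in place
        System.out.println(Arrays.deepToString(m.data));
    }
}
```

Adding a new view (say, a column view) requires one new method, and it immediately supports every existing `assign` variant.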

Assign

•  Matrices

    Matrix assign(double value);
    Matrix assign(double[][] values);
    Matrix assign(Matrix other);
    Matrix assign(DoubleFunction f);
    Matrix assign(Matrix other, DoubleDoubleFunction f);

•  Vectors

    Vector assign(double value);
    Vector assign(double[] values);
    Vector assign(Vector other);
    Vector assign(DoubleFunction f);
    Vector assign(Vector other, DoubleDoubleFunction f);
    Vector assign(DoubleDoubleFunction f, double y);

Views

•  Matrices

    Matrix viewPart(int[] offset, int[] size);
    Matrix viewPart(int row, int rlen, int col, int clen);
    Vector viewRow(int row);
    Vector viewColumn(int column);
    Vector viewDiagonal();

•  Vectors

    Vector viewPart(int offset, int length);

Aggregates

•  Matrices

    double zSum();
    Vector aggregateRows(VectorFunction f);
    Vector aggregateColumns(VectorFunction f);
    double aggregate(DoubleDoubleFunction combiner,
                     DoubleFunction mapper);

•  Vectors

    double zSum();
    double aggregate(
        DoubleDoubleFunction reduce, DoubleFunction map);
    double aggregate(Vector other,
        DoubleDoubleFunction aggregator,
        DoubleDoubleFunction combiner);

Predefined Functions

•  Many handy functions

    ABS         LOG2
    ACOS        NEGATE
    ASIN        RINT
    ATAN        SIGN
    CEIL        SIN
    COS         SQRT
    EXP         SQUARE
    FLOOR       SIGMOID
    IDENTITY    SIGMOIDGRADIENT
    INV         TAN
    LOGARITHM

Examples

    double alpha;
    a.assign(alpha);

    a.assign(b, Functions.chain(
        Functions.plus(beta), Functions.mult(alpha)));

    A = αB + β
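To make the composition order concrete, here is a plain-Java sketch of the same update using `java.util.function` rather than Mahout's `Functions` class. The helpers `plus`, `mult`, `chain` and `assign` are hypothetical stand-ins, and the sketch assumes that `chain(f, g)` means "apply g first, then f".

```java
import java.util.Arrays;
import java.util.function.DoubleUnaryOperator;

class ChainSketch {
    // Hypothetical stand-ins for plus / mult; not the Mahout implementations.
    static DoubleUnaryOperator plus(double b) { return x -> x + b; }
    static DoubleUnaryOperator mult(double a) { return x -> x * a; }

    // Assumption: chain(f, g) computes f(g(x)).
    static DoubleUnaryOperator chain(DoubleUnaryOperator f, DoubleUnaryOperator g) {
        return f.compose(g);
    }

    // In-place update a[i] = f(b[i]); returns a so calls can be chained.
    static double[] assign(double[] a, double[] b, DoubleUnaryOperator f) {
        for (int i = 0; i < a.length; i++) a[i] = f.applyAsDouble(b[i]);
        return a;
    }

    public static void main(String[] args) {
        double alpha = 2.0, beta = 1.0;
        double[] a = new double[3];
        double[] b = {1.0, 2.0, 3.0};
        assign(a, b, chain(plus(beta), mult(alpha)));   // a = alpha * b + beta
        System.out.println(Arrays.toString(a));         // [3.0, 5.0, 7.0]
    }
}
```

Swapping the two arguments to `chain` would instead compute α(B + β), so the order matters.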

Sparse Optimizations

•  DoubleDoubleFunction abstract properties

    public boolean isLikeRightPlus();
    public boolean isLikeLeftMult();
    public boolean isLikeRightMult();
    public boolean isLikeMult();
    public boolean isCommutative();
    public boolean isAssociative();
    public boolean isAssociativeAndCommutative();
    public boolean isDensifying();

•  And Vector properties

    public boolean isDense();
    public boolean isSequentialAccess();
    public double getLookupCost();
    public double getIteratorAdvanceCost();
    public boolean isAddConstantTime();

Examples

•  The trace of a matrix

    m.viewDiagonal().zSum()

•  Set diagonal to zero

    m.viewDiagonal().assign(0)

•  Set diagonal to negative of row sums excluding the diagonal

    Vector diag = m.viewDiagonal().assign(0);
    diag.assign(m.rowSums().assign(Functions.NEGATE));

Iteration

•  Matrices are Iterable in Mahout

    // compute both row and column sums in one pass
    for (MatrixSlice row : m) {
      rSums.set(row.index(), row.zSum());
      cSums.assign(row, Functions.PLUS);
    }

•  Vectors are densely or sparsely iterable

    // Shannon entropy: -sum of p * log(p) over the nonzero entries
    double entropy = 0;
    for (Vector.Element e : v.iterateNonZero()) {
      entropy -= e.get() * Math.log(e.get());
    }
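The benefit of sparse iteration is that the loop cost depends on the number of stored entries, not on the nominal vector length. The plain-Java sketch below makes that visible; it models a sparse vector as a hypothetical index-to-value map (not a Mahout class) and computes Shannon entropy with the conventional minus sign.

```java
import java.util.Map;
import java.util.TreeMap;

class SparseEntropySketch {
    /** Shannon entropy (natural log) over the stored nonzero entries only. */
    static double entropy(Map<Integer, Double> nonZeros) {
        double h = 0;
        for (double p : nonZeros.values()) {
            if (p > 0) h -= p * Math.log(p);   // zero entries contribute nothing
        }
        return h;
    }

    public static void main(String[] args) {
        // A vector of nominal length 1,000,000 with only two stored entries.
        Map<Integer, Double> v = new TreeMap<>();
        v.put(17, 0.5);
        v.put(999_999, 0.5);
        System.out.println(entropy(v));   // two iterations, regardless of length
    }
}
```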

Random Sampling

•  Samples from some type

    public interface Sampler<T> {
      T sample();
    }

    public abstract class AbstractSamplerFunction
        extends DoubleFunction
        implements Sampler<Double>

•  Lots of kinds

    ChineseRestaurant    Missing        Normal
    Empirical            Multinomial    PoissonSampler
    IndianBuffet         MultiNormal    Sampler
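The "samplers as functions" idea can be sketched in plain Java. The `NormalSampler` below is a hypothetical stand-in built on `java.util.Random`, not Mahout's class; it assumes that a sampler used as a function simply ignores its argument and returns a fresh sample, so it can be passed to any assign-style routine.

```java
import java.util.Random;
import java.util.function.DoubleUnaryOperator;

class SamplerSketch {
    interface Sampler<T> {
        T sample();
    }

    /** Hypothetical normal sampler that doubles as a unary function. */
    static class NormalSampler implements Sampler<Double>, DoubleUnaryOperator {
        private final double mean;
        private final double sd;
        private final Random rng;

        NormalSampler(double mean, double sd, long seed) {
            this.mean = mean;
            this.sd = sd;
            this.rng = new Random(seed);   // seeded for repeatable runs
        }

        public Double sample() {
            return mean + sd * rng.nextGaussian();
        }

        // Assumption: as a function, the input value is ignored.
        public double applyAsDouble(double ignored) {
            return sample();
        }
    }

    /** Fills v in place with f(v[i]); with a sampler, this randomizes the vector. */
    static double[] fill(double[] v, DoubleUnaryOperator f) {
        for (int i = 0; i < v.length; i++) v[i] = f.applyAsDouble(v[i]);
        return v;
    }

    public static void main(String[] args) {
        double[] v = fill(new double[5], new NormalSampler(0.0, 1.0, 42L));
        for (double x : v) System.out.println(x);
    }
}
```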

Mahout Math Summary

•  Matrices, Vectors
    •  views
    •  in-place assignment
    •  aggregations
    •  iterations

•  Functions
    •  lots built-in
    •  cooperate with sparse vector optimizations

•  Sampling
    •  abstract samplers
    •  samplers as functions

•  Other stuff … clustering, SVD

Other Stuff

•  Matrix Decomposition
•  Classification
•  Clustering
•  Recommendations

Focus: Machine Learning

[Figure: Mahout component stack. Labels: Applications; Examples; Genetic; Freq. Pattern Mining; Classification; Clustering; Recommenders; Utilities (Lucene/Vectorizer); Collections (primitives); Math (Vectors/Matrices/SVD); Apache Hadoop.]

See http://cwiki.apache.org/confluence/display/MAHOUT/Algorithms

©Lucid Imagination 2010

Prepare Data from Raw content

•  Data Sources:
    •  Lucene integration
        •  bin/mahout lucenevector …
    •  Document Vectorizer
        •  bin/mahout seqdirectory …
        •  bin/mahout seq2sparse …
    •  Programmatically
        •  See the Utils module in Mahout
    •  Database
    •  File system

Recommendations

•  Extensive framework for collaborative filtering

•  Recommenders
    •  User based, Item based, ALS, SlopeOne, SVD, others

•  Online and Offline support
    •  Offline can utilize Hadoop

•  Many different Similarity measures
    •  Cosine, LLR, Tanimoto, Pearson, others
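Two of the similarity measures named above are easy to state directly. The sketch below is plain Java written from the textbook definitions, not Mahout's similarity classes; the class and method names are hypothetical. Cosine compares rating vectors by angle, while Tanimoto compares the sets of items two users have touched.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class SimilaritySketch {
    /** Cosine similarity: dot(a, b) / (|a| * |b|). Assumes equal lengths, nonzero norms. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Tanimoto (Jaccard) coefficient on item sets: |A ∩ B| / |A ∪ B|. */
    static double tanimoto(Set<Integer> a, Set<Integer> b) {
        Set<Integer> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int union = a.size() + b.size() - intersection.size();
        return union == 0 ? 0.0 : (double) intersection.size() / union;
    }

    public static void main(String[] args) {
        System.out.println(cosine(new double[] {1, 0}, new double[] {0, 1}));
        Set<Integer> u1 = new HashSet<>(Arrays.asList(1, 2, 3));
        Set<Integer> u2 = new HashSet<>(Arrays.asList(2, 3, 4));
        System.out.println(tanimoto(u1, u2));   // 2 shared items out of 4 distinct
    }
}
```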

Clustering

•  Document level
    •  Group documents based on a notion of similarity
    •  K-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift
    •  Distance Measures
        •  Manhattan, Euclidean, other

•  Topic Modeling
    •  Cluster words across documents to identify topics
    •  Latent Dirichlet Allocation
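The distance measures mentioned above drive the assignment step of k-means style clustering: each point goes to the centroid at the smallest distance. A plain-Java sketch follows, with hypothetical names and no Mahout dependency; it uses Euclidean distance for the assignment, but either measure could be plugged in.

```java
class DistanceSketch {
    /** Manhattan (L1) distance: sum of absolute coordinate differences. */
    static double manhattan(double[] a, double[] b) {
        double d = 0;
        for (int i = 0; i < a.length; i++) d += Math.abs(a[i] - b[i]);
        return d;
    }

    /** Euclidean (L2) distance: square root of the summed squared differences. */
    static double euclidean(double[] a, double[] b) {
        double d = 0;
        for (int i = 0; i < a.length; i++) d += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(d);
    }

    /** Index of the centroid closest to the point, by Euclidean distance. */
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (euclidean(point, centroids[c]) < euclidean(point, centroids[best])) best = c;
        }
        return best;
    }

    public static void main(String[] args) {
        double[] origin = {0, 0};
        double[] p = {3, 4};
        System.out.println(manhattan(origin, p));   // 7.0
        System.out.println(euclidean(origin, p));   // 5.0
        System.out.println(nearest(new double[] {1, 1}, new double[][] {{0, 0}, {10, 10}}));
    }
}
```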

Categorization

•  Place new items into predefined categories:
    •  Sports, politics, entertainment

•  Mahout has several implementations
    •  Naïve Bayes
    •  Complementary Naïve Bayes
    •  Decision Forests
    •  Logistic Regression (SGD)

Freq. Pattern Mining

•  Identify frequently co-occurring items

•  Useful for:
    •  Query Recommendations
        •  Apple -> iPhone, orange, OS X
    •  Related product placement
        •  “Beer and Diapers”
    •  Spam Detection
        •  Yahoo: http://www.slideshare.net/hadoopusergroup/mail-antispam

http://www.amazon.com

Evolutionary

•  Map-Reduce ready fitness functions for genetic programming

•  Integration with Watchmaker
    •  http://watchmaker.uncommons.org/index.php

•  Problems solved:
    •  Traveling salesman
    •  Class discovery
    •  Many others

Singular Value Decomposition

•  Reduces a big matrix into a much smaller matrix by amplifying the important parts while removing/reducing the less important parts

•  Mahout has a fully distributed Lanczos implementation

    <MAHOUT_HOME>/bin/mahout svd \
        -Dmapred.input.dir=path/to/corpus \
        --tempDir path/for/svd-output \
        --rank 300 \
        --numColumns <numcols> \
        --numRows <num rows in the input>

    <MAHOUT_HOME>/bin/mahout cleansvd \
        --eigenInput path/for/svd-output \
        --corpusInput path/to/corpus \
        --output path/for/cleanOutput \
        --maxError 0.1 \
        --minEigenvalue 10.0

•  https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction

How to: Command Line

•  Most algorithms have a Driver program
    •  Shell script in $MAHOUT_HOME/bin helps with most tasks

•  Prepare the Data
    •  Different algorithms require different setup

•  Run the algorithm
    •  Single Node
    •  Hadoop

•  Print out the results
    •  Several helper classes:
        •  LDAPrintTopics, ClusterDumper, etc.

Ugly Demo II - Prep

•  Data Set: Reuters
    •  http://www.daviddlewis.com/resources/testcollections/reuters21578/

•  Convert to Text via http://www.lucenebootcamp.com/lucene-boot-camp-preclass-training/

•  Convert to Sequence File:

    bin/mahout seqdirectory --input <PATH> --output <PATH> --charset UTF-8

•  Convert to Sparse Vector:

    bin/mahout seq2sparse \
        --input <PATH>/content/reuters/seqfiles/ \
        --norm 2 --weight TF \
        --output <PATH>/content/reuters/seqfiles-TF/ \
        --minDF 5 --maxDFPercent 90

Ugly Demo II: Topic Modeling

•  Latent Dirichlet Allocation

    ./mahout lda \
        --input <PATH>/content/reuters/seqfiles-TF/vectors/ \
        --output <PATH>/content/reuters/seqfiles-TF/lda-output \
        --numWords 34000 --numTopics 10

    ./mahout org.apache.mahout.clustering.lda.LDAPrintTopics \
        --input <PATH>/content/reuters/seqfiles-TF/lda-output/state-19 \
        --dict <PATH>/content/reuters/seqfiles-TF/dictionary.file-0 \
        --words 10 \
        --output <PATH>/content/reuters/seqfiles-TF/lda-output/topics \
        --dictionaryType sequencefile

•  Good feature reduction (stopword removal) required

Ugly Demo III: Clustering

•  K-Means
    •  Same Prep as UD II, except use TFIDF weight

    ./mahout kmeans \
        --input <PATH>/content/reuters/seqfiles-TFIDF/vectors/part-00000 \
        --k 15 \
        --output <PATH>/content/reuters/seqfiles-TFIDF/output-kmeans \
        --clusters <PATH>/content/reuters/seqfiles-TFIDF/output-kmeans/clusters

•  Print out the clusters:

    ./mahout clusterdump \
        --seqFileDir <PATH>/content/reuters/seqfiles-TFIDF/output-kmeans/clusters-15/ \
        --pointsDir <PATH>/content/reuters/seqfiles-TFIDF/output-kmeans/points/ \
        --dictionary <PATH>/content/reuters/seqfiles-TFIDF/dictionary.file-0 \
        --dictionaryType sequencefile \
        --substring 20

Ugly Demo IV: Frequent Pattern Mining

•  Data: http://fimi.cs.helsinki.fi/data/

    ./mahout fpg -i <PATH>/content/freqitemset/accidents.dat -o patterns -k 50 -method mapreduce -g 10 -regex [\ ]

    ./mahout seqdump --seqFile patterns/fpgrowth/part-r-00000

Cloudera ML

•  Collection of Java libraries and command-line tools

•  Goal: make data scientists more productive with CDH
    •  Exploratory data analysis
    •  Data preparation
    •  Model fitting
    •  Model evaluation

•  Apache 2.0 licensed

•  Developed on GitHub
    •  http://github.com/cloudera/ml

Cloudera ML: Building Blocks

•  Apache Hadoop
    •  Scalable data storage (HDFS) and processing (MapReduce)

•  Apache Hive
    •  Metadata for structured data in HDFS

•  Apache Crunch
    •  Easy MapReduce pipelines

•  Apache Mahout
    •  Vector interface

•  Apache Avro
    •  Serialization format


Cloudera ML Workflow: Clustering

Cloudera ML: summary

•  client/bin/ml summary
    --input-paths kddcup.data_10_percent (HDFS)
    --format text
    --header-file examples/kdd99/header.csv (local FS)
    --summary-file examples/kdd99/s.json (local FS)

[Figure: workflow step 1 (summary), file layout before and after. Inputs: kddcup.data_10_percent, header.csv. New output on the local FS: s.json.]

•  s.json
    •  Categorical features: histogram
    •  Numerical features: distribution summary

Cloudera ML: normalize

•  client/bin/ml normalize
    --input-paths kddcup.data_10_percent (HDFS)
    --format text
    --summary-file examples/kdd99/s.json (local FS)
    --transform Z
    --output-path kdd99 (HDFS)
    --output-type avro
    --id-column category
    --compress

[Figure: workflow step 2 (normalize), file layout before and after. Existing files: header.csv, s.json. New output: kdd99/.]

•  kdd99/part-m-0000[0|1].avro
    •  Examples (rows)
        •  Part 0: 442,454 vectors
        •  Part 1: 51,567 vectors
        •  Total: 494,021 vectors
    •  Features (columns)
        •  Before: 41 fields
        •  After: 143 fields
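For numerical features, the `--transform Z` option above presumably refers to the usual z-score, where each value is shifted by the feature mean and scaled by the feature standard deviation. The plain-Java sketch below shows that computation under two assumptions that are not confirmed by the slides: the population standard deviation is used, and a constant feature maps to all zeros. It is not the Cloudera ML implementation.

```java
import java.util.Arrays;

class ZScoreSketch {
    /** Returns (x[i] - mean) / sd for each value; assumes population sd. */
    static double[] zTransform(double[] x) {
        double mean = 0;
        for (double v : x) mean += v;
        mean /= x.length;

        double variance = 0;
        for (double v : x) variance += (v - mean) * (v - mean);
        variance /= x.length;
        double sd = Math.sqrt(variance);

        double[] z = new double[x.length];
        if (sd == 0) return z;   // constant feature: leave all zeros (assumption)
        for (int i = 0; i < x.length; i++) z[i] = (x[i] - mean) / sd;
        return z;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(zTransform(new double[] {1, 2, 3})));
    }
}
```

In the workflow above the mean and standard deviation would come from the summary file produced in step 1 rather than being recomputed per call.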

Cloudera ML: ksketch

•  client/bin/ml ksketch
    --input-paths kdd99 (HDFS)
    --format avro
    --points-per-iteration 500
    --output-file wc.avro (local FS)
    --seed 1729
    --iterations 5
    --cross-folds 2

[Figure: workflow step 3 (ksketch), file layout before and after. Existing files: header.csv, s.json, kdd99/. New output on the local FS: wc.avro.]

•  wc.avro
    •  Examples (rows)
        •  2 “folds” of 2501 examples
        •  1 initial example
        •  500 examples from each iteration (5 iterations)
        •  Each example has an associated weight
    •  Features (columns)
        •  143 features (still)
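The 2501 figure follows directly from the command-line parameters: one initial example plus 500 examples for each of the 5 iterations. A tiny Java check of that arithmetic, with hypothetical method and parameter names:

```java
class SketchSizeSketch {
    /** One initial example plus pointsPerIteration examples per iteration. */
    static int examplesPerFold(int pointsPerIteration, int iterations) {
        return 1 + pointsPerIteration * iterations;
    }

    public static void main(String[] args) {
        int perFold = examplesPerFold(500, 5);
        int folds = 2;
        System.out.println(perFold);           // 2501
        System.out.println(perFold * folds);   // 5002 examples across both folds
    }
}
```

Either way, the sketch is several orders of magnitude smaller than the 494,021 input vectors, which is what allows the next step to run on a single machine.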

Cloudera ML: kmeans

•  client/bin/ml kmeans
    --input-file wc.avro (local FS)
    --centers-file centers.avro (local FS)
    --seed 19
    --clusters 1,10,25,35,45
    --best-of 2
    --num-threads 4
    --eval-stats-file kmeans_stats.csv (local FS)

[Figure: workflow step 4 (kmeans), file layout before and after. Existing files: header.csv, s.json, kdd99/, wc.avro. New outputs on the local FS: kmeans_stats.csv, centers.avro.]

•  centers.avro
    •  1 row for each run of k-means++
    •  9 total runs: 1 for k=1, 2 each for k=10, 25, 35, and 45

•  kmeans_stats.csv
    •  Clustering quality scores

Cloudera ML: kassign

•  client/bin/ml kassign
    --input-paths kdd99 (HDFS)
    --format avro
    --centers-file centers.avro (local FS)
    --center-ids 4
    --output-path assigned (HDFS)
    --output-type csv

[Figure: workflow step 5 (kassign), file layout before and after. Existing files: header.csv, s.json, kdd99/, wc.avro, centers.avro. New output: assigned/.]

•  assigned/part-m-0000[0|1]
    •  Rows
        •  Part 0: 442,454
        •  Part 1: 51,567
        •  Total: 494,021
    •  Columns
        •  Point ID (normal/attack type, in this case)
        •  Index in centers.avro
        •  Assigned cluster ID
        •  Squared distance to nearest cluster

Cloudera ML: sample

•  client/bin/ml sample
    --input-paths assigned (HDFS)
    --format text
    --header-file examples/kdd99/kassign_header.csv (local FS)
    --weight-field squared_distance
    --group-fields clustering_id,closest_center_id
    --output-type csv
    --size 20
    --output-path extremal (HDFS)

[Figure: workflow step 6 (sample), file layout before and after. Existing files: header.csv, s.json, kdd99/, assigned/, kassign_header.csv. New output: extremal/.]

•  extremal/part-r-00000
    •  Rows
        •  Up to 20 examples from each cluster
        •  Examples that are furthest from the center of the cluster
    •  Columns
        •  Point ID (normal/attack type, in this case)
        •  Index in centers.avro
        •  Assigned cluster ID
        •  Squared distance to nearest cluster
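The shape of this output can be sketched in plain Java: group the assigned rows by cluster and keep at most N rows per group, preferring the largest squared distance. This is a deterministic top-N stand-in written for illustration; the actual `sample` command takes a weight field and may well use weighted random sampling instead. The `Row` type and its field names are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class ExtremalSampleSketch {
    /** Hypothetical row of the assigned/ output. */
    static class Row {
        final String pointId;
        final int clusterId;
        final double squaredDistance;

        Row(String pointId, int clusterId, double squaredDistance) {
            this.pointId = pointId;
            this.clusterId = clusterId;
            this.squaredDistance = squaredDistance;
        }
    }

    /** Keeps up to `size` rows per cluster, farthest from the center first. */
    static Map<Integer, List<Row>> farthestPerCluster(List<Row> rows, int size) {
        Map<Integer, List<Row>> byCluster = rows.stream()
            .collect(Collectors.groupingBy(r -> r.clusterId, TreeMap::new, Collectors.toList()));
        Map<Integer, List<Row>> result = new TreeMap<>();
        for (Map.Entry<Integer, List<Row>> e : byCluster.entrySet()) {
            result.put(e.getKey(), e.getValue().stream()
                .sorted(Comparator.comparingDouble((Row r) -> r.squaredDistance).reversed())
                .limit(size)
                .collect(Collectors.toList()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
            new Row("normal", 0, 1.0), new Row("smurf", 0, 9.0),
            new Row("neptune", 0, 4.0), new Row("normal", 1, 2.0));
        farthestPerCluster(rows, 2).forEach((cluster, picked) ->
            picked.forEach(r -> System.out.println(cluster + "," + r.pointId + "," + r.squaredDistance)));
    }
}
```

Looking at the points farthest from each center is a quick way to spot outliers and to judge whether a cluster is mixing unlike examples.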

2014: Lab to Factory

Data Science Will Be Operational Analytics

I Built A Model. Now What?

[Figure: cycle of Build Model, Query Model, Collect Input; Repeat.]

I Built A Model On Hadoop. Now What?

[Figure: the same cycle of Build Model, Query Model, Collect Input; Repeat.]

Example: Oryx

www.mwttl.com/wp-content/uploads/2013/11/IMG_5446_edited-2_mwttl.jpg

Gaps to fill, and Goals

•  Model Building
    •  Large-scale
    •  Continuous
    •  Apache Hadoop™-based
    •  Few, good algorithms

•  Model Serving
    •  Real-time query
    •  Real-time update

•  Algorithms
    •  Parallelizable
    •  Updateable
    •  Works on diverse input

•  Interoperable
    •  PMML model format
    •  Simple REST API
    •  Open source

Large-Scale or Real-Time?

•  Large-Scale: Offline, Batch
•  Real-Time: Online, Streaming

Why Don’t We Have Both?

Lambda Architecture

•  Batch, Stream Processing are different

•  Tackle separately in 2+ Layers

•  Batch Layer: offline, asynchronous

•  Serving / Speed Layer: real-time, incremental, approximate

jameskinley.tumblr.com/post/37398560534/the-lambda-architecture-principles-for-architecting

… λ?

Serving/Speed

Two Layers

•  Computation Layer
    •  Java-based server process
    •  Client of Hadoop 2.x
    •  Periodically builds “generation” from recent data and past model
    •  Baby-sits MapReduce* jobs (or, locally in-core)
    •  Publishes models

•  Serving Layer
    •  Apache Tomcat™-based server process
    •  Consumes models from HDFS (or local FS)
    •  Serves queries from model in memory
    •  Updates from new input
    •  Also writes input to HDFS
    •  Replicas for scale

* Apache Spark later

Collaborative Filtering : ALS

•  Alternating Least Squares
•  Latent-factor model
•  Accepts implicit or explicit feedback
•  Real-time update via fold-in of input
•  No cold-start
•  Parallelizable

Clustering : k-means++

•  Well-known and understood
•  Parallelizable
•  Clusters updateable

cwiki.apache.org/confluence/display/MAHOUT/K-Means+Clustering

Classification / Regression : RDF

•  Random Decision Forests
•  Ensemble method
•  Numeric, categorical features and target
•  Very parallel
•  Nodes updateable
•  Works well on many problems

[Figure: example decision tree with the tests "age > 30", "female?" and "income > 20000", and Yes/No branches.]

PMML

•  Predictive Model Markup Language
•  XML-based format for predictive models
•  Standardized by Data Mining Group (www.dmg.org)
•  Wide tool support

    <PMML xmlns="http://www.dmg.org/PMML-4_1"
          version="4.1">
      <Header copyright="www.dmg.org"/>
      <DataDictionary numberOfFields="5">
        <DataField name="temperature"
                   optype="continuous"
                   dataType="double"/>
        …
      </DataDictionary>
      <TreeModel modelName="golfing"
                 functionName="classification">
        <MiningSchema>
          <MiningField name="temperature"/>
          …
        </MiningSchema>
        <Node score="will play">
          <Node score="will play">
            <SimplePredicate field="outlook"
                             operator="equal"
                             value="sunny"/>
            …
          </Node>
        </Node>
      </TreeModel>
    </PMML>

www.dmg.org/v4-1/TreeModel.html

HTTP REST API

•  Convention for RPC-like request / response

•  HTTP verbs, transport
    •  GET : query
    •  POST : add input

•  Easy from browser, CLI, Java, Python, Scala, etc.

    GET /recommend/jwills

    HTTP/1.1 200 OK
    Content-Type: text/plain

    "Ray LaMontagne",0.951
    "Fleet Foxes",0.7905
    "The National",0.688
    "Shearwater",0.3017
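Because the response body is plain text with one quoted item and score per line, a client needs very little code to consume it. The plain-Java sketch below parses a body of that shape into an ordered map. It assumes item names contain no escaped quotes and that the score follows the last comma; the class and method names are hypothetical, and the HTTP call itself is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class RecommendResponseSketch {
    /** Parses lines of the form "item",score into an insertion-ordered map. */
    static Map<String, Double> parse(String body) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (String line : body.split("\\R")) {          // \R matches any line break
            line = line.trim();
            if (line.isEmpty()) continue;
            int comma = line.lastIndexOf(',');           // score follows the last comma
            String item = line.substring(0, comma).replaceAll("^\"|\"$", "");
            scores.put(item, Double.parseDouble(line.substring(comma + 1)));
        }
        return scores;
    }

    public static void main(String[] args) {
        String body = "\"Ray LaMontagne\",0.951\n\"Fleet Foxes\",0.7905\n";
        parse(body).forEach((item, score) -> System.out.println(item + " -> " + score));
    }
}
```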

Wish List

•  Revamp workflow
    •  Spark / Crunch-like API, not raw M/R

•  De-emphasize model building
    •  Well-solved
    •  Bring your own

•  More component-ized
    •  Less black-box service

•  Emphasize integration
    •  PMML, etc.

•  “Pull” options
    •  Kafka?
    •  Hive / Impala?

Open Source

github.com/cloudera/oryx

100% Apache License 2.0
