Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook




Page 1: Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Scalable Machine Learning

CMSC 491: Hadoop-Based Distributed Computing

Spring 2015
Adam Shook

Page 2: Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

But What is Machine Learning?

• “Machine Learning is programming computers to optimize a performance criterion using example data or past experience”

• Given a data set X, can we effectively predict Y by optimizing Z?

Intro. to Machine Learning by E. Alpaydin

Page 3: Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Supervised vs. Unsupervised

• Supervised: algorithms trained on labeled examples
– I know these images are of cats and these are of dogs; tell me if this image is a cat or a dog
• Unsupervised: algorithms trained on unlabeled examples
– Group these images together by similarity, i.e. some kind of distance function

Page 4: Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Use Cases

• Collaborative Filtering
– Takes users' behavior and from that tries to find items users might like
• Clustering
– Takes things and puts them into groups of related things
• Classification
– Learns from existing categories what things in a category look like, and assigns unlabeled things to the (hopefully) correct category
• Frequent Itemset Mining
– Analyzes items in groups and identifies which items frequently appear together

Page 5: Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Clustering

• Dirichlet Process Clustering
– Bayesian mixture modeling
• K-Means Clustering
– Partition n observations into k clusters
• Fuzzy K-Means
– Soft clusters where a point can be in more than one cluster
• Hierarchical Clustering
– Hierarchy of clusters, built bottom-up or top-down
• Canopy Clustering
– Preprocess data before K-Means or Hierarchical Clustering

Page 6: Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

More Clustering

• Latent Dirichlet Allocation
– Cluster words into topics and documents into mixtures of topics
• Mean Shift Clustering
– Find modes or clusters in 2-dimensional space, where the number of clusters is unknown
• MinHash Clustering
– Quickly estimate similarity between two data sets
• Spectral Clustering
– Cluster points using eigenvectors of matrices derived from the data

Page 7: Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Collaborative Filtering

• Distributed Item-based Collaborative Filtering
– Estimates a user's preference for one item by looking at preferences for similar items
• Collaborative Filtering using Parallel Matrix Factorization
– Among the items a user has not yet seen, predict which ones the user might prefer

Page 8: Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Classification

• Bayesian
– Classify objects into binary categories
• Random Forests
– Method for classification and regression by constructing a multitude of decision trees

[Figure: sample images labeled "Dog" and "Cat"]

Page 9: Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Frequent Itemset Mining

• Parallel FP-Growth Algorithm
– Analyzes items in a group and then identifies which items frequently appear together

Page 10: Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Algorithm Examples

• K-Means Clustering
– Using Mahout
• Alternating Least Squares (Recommender)
– Using Spark MLlib

Page 11: Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

APACHE MAHOUT

ma·hout \mə-ˈhau̇t\ - noun - a keeper and driver of an elephant

Page 12: Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Overview

• Goal: build a scalable machine learning library, in both data volume and processing
• Began in 2008 as a subproject of Apache Lucene, then became a top-level Apache project in 2010
• No longer accepting Java MapReduce implementations, in favor of Spark MLlib
• Addresses issues commonly found in ML libraries, which often:
– Lack community, scalability, documentation/examples, or Apache licensing
– Are not well-tested
– Are research-oriented
– Are not built on existing production-quality projects
• Mahout, by contrast, has an active community

Page 13: Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Technical Requirements

• Linux
• Java 1.6 or greater
• Maven
• Hadoop
– Although not all algorithms are implemented to work on Hadoop clusters

Page 14: Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Building Mahout for Hadoop 2

• Check out Mahout trunk with git:

$ git clone https://github.com/apache/mahout.git

• Build with Maven, giving it the proper Hadoop and HBase versions:

$ cd mahout
$ mvn install -DskipTests \
    -Dhadoop2 -Dhadoop2.version=2.6.0 \
    -Dhbase.version=1.0.0
$ cd ..
$ mv mahout /usr/share/491s15

# Edit .bashrc/.bash_profile to add a $MAHOUT_HOME variable,
# add $MAHOUT_HOME/bin to the PATH, and
# export HADOOP_CONF_DIR=/usr/share/491s15/hadoop/etc/hadoop

Page 15: Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

K-Means Clustering

[Figure: data points and initial centroids c1, c2, c3]

Page 16: Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

K-Means Clustering

[Figure: each point assigned to the nearest of c1, c2, c3]

Page 17: Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

K-Means Clustering

[Figure: another K-Means iteration with centroids c1, c2, c3]

Page 18: Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

K-Means Clustering

[Figure: old and new positions of centroids c1, c2, c3 as points are reassigned]

Page 19: Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

K-Means Clustering

[Figure: final clusters around centroids c1, c2, c3]
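The loop the figures above step through is small enough to sketch directly. A toy one-dimensional version in plain Java (an illustration only, not Mahout's implementation; the points and seed centroids are made up):

import java.util.Arrays;

public class KMeansSketch {
    public static void main(String[] args) {
        double[] points = {1.0, 1.5, 2.0, 8.0, 8.5, 9.0, 15.0, 15.5};
        double[] centroids = {1.0, 8.0, 15.0};  // k = 3, seeded by hand

        for (int iter = 0; iter < 10; iter++) {
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];
            // Assignment step: each point joins its nearest centroid
            for (double p : points) {
                int best = 0;
                for (int c = 1; c < centroids.length; c++) {
                    if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best])) {
                        best = c;
                    }
                }
                sum[best] += p;
                count[best]++;
            }
            // Update step: each centroid moves to the mean of its points
            for (int c = 0; c < centroids.length; c++) {
                if (count[c] > 0) {
                    centroids[c] = sum[c] / count[c];
                }
            }
        }
        System.out.println(Arrays.toString(centroids));
    }
}

Mahout runs the same assign/update loop as MapReduce jobs over vectors in HDFS, stopping after a fixed number of iterations or when the centroids move less than a convergence delta.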

Page 20: Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

K-Means Clustering Example

• Let's cluster the Reuters data set together
– A bunch (21,578 to be exact) of hand-classified news articles from that greatest of years, 1987
• Steps!
1. Generate sequence files from the data
2. Generate vectors from the sequence files
3. Run k-means

Page 21: Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

K-Means Clustering
Convert the dataset into a SequenceFile

• Download and extract the SGML files:

$ wget http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz
$ mkdir reuters-sgm
$ tar -xf reuters21578.tar.gz -C reuters-sgm/

• Extract content from SGML to text files:

$ mahout org.apache.lucene.benchmark.utils.ExtractReuters \
    reuters-sgm/ reuters-out/
$ hdfs dfs -put reuters-out .  # Takes a while...

• Use the seqdirectory tool to convert the text files into a Hadoop SequenceFile:

$ mahout seqdirectory -i reuters-out \
    -o reuters-out-seqdir -c UTF-8 -chunk 5

Page 22: Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Tangent: Writing to SequenceFiles

// Say you have some documents array
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path path = new Path("testdata/part-00000");

// Key and value are both Text: (document id, document content)
SequenceFile.Writer writer = new SequenceFile.Writer(fs,
    conf, path, Text.class, Text.class);
for (int i = 0; i < MAX_DOCS; ++i) {
    writer.append(new Text(documents[i].getId()),
                  new Text(documents[i].getContent()));
}
writer.close();
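Reading the pairs back uses the matching reader API (same pre-2.x constructor style; fs, path, and conf as defined above):

// Sketch: iterate over the (id, content) pairs written above
SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
Text key = new Text();
Text value = new Text();
while (reader.next(key, value)) {
    System.out.println(key);
}
reader.close();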

Page 23: Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Original File

$ cat reut2-000.sgm-30.txt
26-FEB-1987 15:43:14.36

U.S. TAX WRITERS SEEK ESTATE TAX CURBS, RAISING 6.7 BILLION DLRS THRU 1991

Page 24: Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Now, in Sequence File

Key:    /reut2-000.sgm-30.txt
Value*: 26-FEB-1987 15:43:14.36
        U.S. TAX WRITERS SEEK ESTATE TAX CURBS, RAISING 6.7 BILLION DLRS THRU 1991

* Value contains newline characters

Page 25: Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

K-Means Clustering
Generate Vectors from Sequence Files

• Steps
1. Compute dictionary
2. Assign integers for words
3. Compute feature weights
4. Create a vector for each document using the word-to-integer mapping and the feature weights

• Or simply run $ mahout seq2sparse:

$ mahout seq2sparse \
    -i reuters-out-seqdir/ \
    -o reuters-out-seqdir-sparse-kmeans
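Step 3 ("compute feature weights") defaults to tf-idf weighting, which is why the kmeans command later reads from the tfidf-vectors/ directory. As a reminder (the standard definition, not spelled out on the slide), the weight of term t in document d is:

w_{t,d} = tf_{t,d} \cdot \log(N / df_t)

where tf_{t,d} is how often t occurs in d, df_t is the number of documents containing t, and N is the total number of documents.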

Page 26: Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Document to Integers to Vector

26-FEB-1987 15:43:14.36

U.S. TAX WRITERS SEEK ESTATE TAX CURBS, RAISING 6.7 BILLION DLRS THRU 1991

Dictionary (term → integer ID; the dictionary is sorted, so IDs increase with the terms):

14.36 → 2737, 15 → 2962, 1991 → 3960, 26 → 5405, 43 → 8361,
6.7 → 10882, billion → 15528, curbs → 19078, dlrs → 20362,
estate → 21578, feb → 22224, raising → 33629, seek → 35909,
tax → 38507, u.s → 39687, writers → 41511

Resulting vector (tax appears twice in the headline, hence 38507:2.0):

{3960:1.0,21578:1.0,33629:1.0,41511:1.0,8361:1.0,10882:1.0,5405:1.0,22224:1.0,15528:1.0,38507:2.0,39687:1.0,2737:1.0,35909:1.0,2962:1.0,19078:1.0,20362:1.0}

One document of many!

Page 27: Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

After seq2sparse

Key:   /reut2-000.sgm-30.txt
Value: {3960:1.0,21578:1.0,33629:1.0,41511:1.0,8361:1.0,10882:1.0,5405:1.0,22224:1.0,15528:1.0,38507:2.0,39687:1.0,2737:1.0,35909:1.0,2962:1.0,19078:1.0,20362:1.0}

Page 28: Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

K-Means Clustering
Run the kmeans program

$ mahout kmeans \
    -i reuters-out-seqdir-sparse-kmeans/tfidf-vectors/ \
    -c reuters-kmeans-clusters \
    -o reuters-kmeans \
    -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
    -cd 0.1 -x 10 -k 20

• Key parameters:
– dm: Distance measure
– cd: Convergence delta
– x: Maximum number of iterations
– k: Number of clusters

Page 29: Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Inspect clusters

$ bin/mahout clusterdump \
    -i reuters-kmeans/clusters-*-final \
    -d reuters-out-seqdir-sparse-kmeans/dictionary.file-0 \
    -dt sequencefile -b 100 -n 10

Sample output (truncated):

:{"identifier":"VL-316","r":[{"00":0.497},{"00.14":0.408},{"00.18":0.408},{"00.56...
Top Terms:
president => 3.4944214993103375
chief => 3.3234287659025012
executive => 3.16472187060367
officer => 3.143776322498974
chairman => 2.5400053276308587
vice => 1.9913627557428164
named => 1.9851312619198411
said => 1.9030630459350324
company => 1.782354193948521
names => 1.4052995438811444

Page 30: Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

FAQs

• How to get rid of useless words?
– Increase minSupport and/or decrease dfPercent
– Use StopwordsAnalyzer
• How to see document-to-cluster assignments?
– Run the clustering process with the -cl option at the end of centroid generation
• How to choose an appropriate weighting?
– If it's long text, go with tf-idf; use normalization if documents differ in length
• How to run this on a cluster?
– Set HADOOP_CONF_DIR to point to your Hadoop cluster's conf directory
• How to scale?
– Use a small value of k to partially cluster the data, then do full clustering on each cluster

Page 31: Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

FAQs

• How to choose k?
– Figure it out from the data you have, by trial and error
– Or use Canopy Clustering and a distance threshold to figure it out
– Or use Spectral Clustering
• How to improve the similarity measurement?
– Not all features are equal
– A small weight difference on certain feature types can create a large semantic difference
– Use WeightedDistanceMeasure
– Or write a custom DistanceMeasure

Page 32: Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Page 33: Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Recommendations

• Help users find items they might like based on historical preferences

Based on example by Sebastian Schelter in “Distributed Itembased Collaborative Filtering with Apache Mahout”

Page 34: Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Recommendations

        Matrix   Alien   Inception
Alice     5        1         4
Bob       ?        2         5
Peter     4        3         2

Page 35: Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Recommendations

• Algorithm
– Neighborhood-based approach
– Works by finding similarly rated items in the user-item matrix (e.g. cosine, Pearson correlation, Tanimoto coefficient)
– Estimates a user's preference towards an item by looking at his/her preferences towards similar items

Page 36: Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Recommendations

• Prediction: estimate Bob's preference towards "The Matrix"
1. Look at all items that
– a) are similar to "The Matrix"
– b) have been rated by Bob
=> "Alien", "Inception"
2. Estimate the unknown preference with a weighted sum (see the formula below)
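The weighted sum in step 2 is the standard item-based estimate (notation added here, with j ranging over the items the user has already rated):

\hat{r}_{u,i} = \frac{\sum_{j} s_{i,j} \, r_{u,j}}{\sum_{j} |s_{i,j}|}

The next slides compute the item-item similarities s_{i,j} with two MapReduce passes and then plug them in for Bob and "The Matrix".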

Page 37: Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Recommendations

• MapReduce phase 1
– Map: make the user the key

(Alice, Matrix, 5)     →  Alice → (Matrix, 5)
(Alice, Alien, 1)      →  Alice → (Alien, 1)
(Alice, Inception, 4)  →  Alice → (Inception, 4)
(Bob, Alien, 2)        →  Bob → (Alien, 2)
(Bob, Inception, 5)    →  Bob → (Inception, 5)
(Peter, Matrix, 4)     →  Peter → (Matrix, 4)
(Peter, Alien, 3)      →  Peter → (Alien, 3)
(Peter, Inception, 2)  →  Peter → (Inception, 2)

Page 38: Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Recommendations

• MapReduce phase 1
– Reduce: create an inverted index

Alice → (Matrix, 5) (Alien, 1) (Inception, 4)
Bob → (Alien, 2) (Inception, 5)
Peter → (Matrix, 4) (Alien, 3) (Inception, 2)
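For concreteness, phase 1 fits in one small Hadoop job. This is an illustrative sketch only (my own class and method names, assuming CSV input lines of the form user,item,rating), not Mahout's actual implementation:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class Phase1 {
    // Map: (user, item, rating) -> key the record by user
    public static class UserKeyMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] f = line.toString().split(",");  // user,item,rating
            ctx.write(new Text(f[0]), new Text("(" + f[1] + ", " + f[2] + ")"));
        }
    }

    // Reduce: concatenate a user's ratings into one inverted-index record
    public static class InvertedIndexReducer
            extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text user, Iterable<Text> ratings, Context ctx)
                throws IOException, InterruptedException {
            StringBuilder sb = new StringBuilder();
            for (Text r : ratings) {
                sb.append(r).append(' ');
            }
            ctx.write(user, new Text(sb.toString().trim()));
        }
    }
}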

Page 39: Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Recommendations

• MapReduce phase 2
– Map: isolate all co-occurred ratings (all cases where a user rated both items)

Alice → (Matrix, 5) (Alien, 1) (Inception, 4):
  Matrix, Alien (5, 1)
  Matrix, Inception (5, 4)
  Alien, Inception (1, 4)
Bob → (Alien, 2) (Inception, 5):
  Alien, Inception (2, 5)
Peter → (Matrix, 4) (Alien, 3) (Inception, 2):
  Matrix, Alien (4, 3)
  Matrix, Inception (4, 2)
  Alien, Inception (3, 2)

Page 40: Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Recommendations

• MapReduce phase 2
– Reduce: compute similarities per item pair

Matrix, Alien (5, 1) (4, 3)            →  Matrix, Alien (-0.47)
Matrix, Inception (4, 2) (5, 4)        →  Matrix, Inception (0.47)
Alien, Inception (1, 4) (2, 5) (3, 2)  →  Alien, Inception (-0.63)

Page 41: Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Recommendations

• Calculate the weighted sum, using the similarities to "The Matrix" as weights and Bob's known ratings (Alien: 2, Inception: 5):

(-0.47*2 + 0.47*5) / (0.47 + 0.47) = 1.5

Page 42: Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Recommendations

        Matrix   Alien   Inception
Alice     5        1         4
Bob      1.5       2         5
Peter     4        3         2

(Bob's previously unknown rating for "The Matrix" is filled in with the 1.5 estimate.)

Page 43: Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Implementation in Spark

• Alternating Least Squares (ALS)
• Accepts (user, product, rating) tuples as training data
• Accepts (user, product) tuples and predicts their ratings
• Example: https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html
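A minimal Java sketch in the spirit of the linked guide. The numeric IDs are made up here (MLlib wants integers, so Alice/Bob/Peter become 1/2/3 and Matrix/Alien/Inception become 101/102/103):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.recommendation.ALS;
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
import org.apache.spark.mllib.recommendation.Rating;

public class AlsExample {
    public static void main(String[] args) {
        JavaSparkContext sc =
            new JavaSparkContext(new SparkConf().setAppName("ALSExample"));

        // (user, product, rating) tuples -- the Alice/Bob/Peter matrix
        JavaRDD<Rating> ratings = sc.parallelize(Arrays.asList(
            new Rating(1, 101, 5.0), new Rating(1, 102, 1.0), new Rating(1, 103, 4.0),
            new Rating(2, 102, 2.0), new Rating(2, 103, 5.0),
            new Rating(3, 101, 4.0), new Rating(3, 102, 3.0), new Rating(3, 103, 2.0)));

        // Train a matrix factorization model with ALS
        int rank = 10;           // number of latent factors
        int numIterations = 10;
        double lambda = 0.01;    // regularization
        MatrixFactorizationModel model =
            ALS.train(ratings.rdd(), rank, numIterations, lambda);

        // (user, product) -> predicted rating: Bob's rating for The Matrix
        System.out.println(model.predict(2, 101));

        sc.stop();
    }
}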

Page 44: Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Implementations in Mahout

• ItemSimilarityJob
– Computes all item similarities
– Various configuration options:
• Similarity measure to use (cosine, Pearson correlation, etc.)
• Maximum number of similar items per item
• Maximum number of co-occurrences to consider
– Input: CSV file (userID, itemID, value)
– Output: pairs of itemIDs with associated similarity
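From the command line this looks roughly as follows (flag names as I recall them from the 0.9-era CLI; treat them as assumptions and confirm with mahout itemsimilarity --help):

$ mahout itemsimilarity \
    --input ratings.csv \
    --output item-similarities \
    --similarityClassname SIMILARITY_COSINE \
    --maxSimilaritiesPerItem 25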

Page 45: Scalable Machine Learning CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Implementations in Mahout

• RecommenderJob
– Distributed item-based recommender
– Various configuration options:
• Similarity measure to use
• Number of recommendations per user
• Filter out some users or items
– Input: CSV file (userID, itemID, value)
– Output: userIDs with recommended itemIDs and their scores
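The driver is exposed on the command line as recommenditembased; a rough invocation (again, hedged flag names; confirm with --help):

$ mahout recommenditembased \
    --input ratings.csv \
    --output recommendations \
    --similarityClassname SIMILARITY_COSINE \
    --numRecommendations 10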