Spark MLlib - Training Material

Spark MLlib

• ExperienceVpon Data EngineerTWM, Keywear, Nielsen

• Bryan’s notes for data analysishttp://bryannotes.blogspot.tw

• Spark.TW

• Linikedinhttps://tw.linkedin.com/pub/bryan-yang/7b/763/a79

ABOUT ME

http://bryannotes.blogspot.tw/

https://tw.linkedin.com/pub/bryan-yang/7b/763/a79

Agenda• Introduction to Machine Learning• Why MLlib• Basic Statistic• K-means• Logistic Regression• ALS• Demo

Introduction to Machine Learning

Definition

• Machine learning is a study of computer algorithms that improve automatically through experience.

Terminology• Observation

- The basic unit of data. • Features

- Features is that can be used to describe each observation in a quantitative manner.

• Feature vector- is an n-dimensional vector of numerical features that represents some object.

• Training/ Testing/ Evaluation set- Set of data to discover potential predictive relationships

Learning(Training)

• Features:1. Color: Red/ Green2. Type: Fruit3. Shape: nearly Circleetc…

Learning(Training)ID Color Type Shape is apple

(Label)

1 Red Fruit Cirle Y

2 Red Fruit Cirle Y

3 Black Logo nearly Cirle N

4 Blue N/A Cirle N

Categories of Machine Learning

• Classification: predict class from observations.

• Clustering: group observations into meaningful groups.

• Regression: predict value from observations.

Cluster the observations with no Labels

https://en.wikipedia.org/wiki/Cluster_analysis

Cut the observations

http://stats.stackexchange.com/questions/159957/how-to-do-one-vs-one-classification-for-logistic-regression

Find a model to describe the observations

https://commons.wikimedia.org/wiki/File:Linear_regression.svg

Machine Learning Pipelines

http://www.nltk.org/book/ch06.html

What is MLlib

What is MLlib• MLlib is an Apache Spark component focusing

on machine learning:

• MLlib is Spark’s core ML library

• Developed by MLbase team in AMPLab

• 80+ contributions from various organization

• Support Scala, Python, and Java APIs

Algorithms in MLlib• Statistics: Description, correlation

• Clustering: k-means

• Collaborative filtering: ALS

• Classification: SVMs, naive Bayes, decision tree.

• Regression: linear regression, logistic regression

• Dimensionality: SVD, PCA

Why Mllib

• Scalability

• Performance

• user-friendly documentation and APIs

• Cost of maintenance

Performance

Data Type

• Dense vector

• Sparse vector

• Labeled point

Dense & Sparse

• Raw Data:

IDA

B C D E F

1 1 0 0 0 0 32 0 1 0 1 0 23 1 1 1 0 1 1

Dense vs Sparse

• Training Set:- number of example: 12 million- number of features: 500- sparsity: 10%

• Not only save storage, but also received a 4x speed up

Dense Sparse

Storge 47GB 7GB

Time 240s 58s

Labeled Point• Dummy variable (1,0)

• Categorical variable (0, 1, 2, …)from pyspark.mllib.linalg import SparseVectorfrom pyspark.mllib.regression import LabeledPoint

# Create a labeled point with a positive label and a dense feature vector.pos = LabeledPoint(1.0, [1.0, 0.0, 3.0])

# Create a labeled point with a negative label and a sparse feature vector.neg = LabeledPoint(0.0, SparseVector(3, [0, 2], [1.0, 3.0]))

DEMO

Descriptive Statistics• Supported function:

- count- max- min- mean- variance…

• Supported data types- Dense- Sparse- Labeled Point

Examplefrom pyspark.mllib.stat import Statisticsfrom pyspark.mllib.linalg import Vectorsimport numpy as np

## example data（ 2 x 2 matrix at least）data= np.array([[1.0,2.0,3.0,4.0,5.0],[1.0,2.0,3.0,4.0,5.0]])

## to RDDdistData = sc.parallelize(data)

## Compute Statistic Valuesummary = Statistics.colStats(distData)print "Duration Statistics:"print " Mean: {}".format(round(summary.mean()[0],3))print " St. deviation: {}".format(round(sqrt(summary.variance()[0]),3))print " Max value: {}".format(round(summary.max()[0],3))print " Min value: {}".format(round(summary.min()[0],3))print " Total value count: {}".format(summary.count())print " Number of non-zero values: {}".format(summary.numNonzeros()[0])

DEMO

Quiz 1

• Which statement is wrong purpose of descriptive statistic？1. Find the extreme value of the data.2. Understanding the data properties preliminary. 3. Describe the full properties of the data.

Correlation

https://commons.wikimedia.org/wiki/File:Linear_Correlation_Examples

https://commons.wikimedia.org/wiki/File:Linear_Correlation_Examples

Example• Data Set

from pyspark.mllib.stat import Statistics

## 選擇統計方法corrType='pearson'

## 建立測試資料及group1 = sc.parallelize([1,3,4,3,7,5,6,8])group2 = sc.parallelize([1,2,3,4,5,6,7,8])corr = Statistics.corr(group1,group2, corrType)print corr# 0.87

A B C D E F G H

Case1 1 3 4 3 7 5 6 8

Case2 1 2 3 4 5 6 7 8

Shape is important

http://2012books.lardbucket.org/books/beginning-psychology/

http://2012books.lardbucket.org/books/beginning-psychology/

DEMO

Quiz 2

• Which situation below can not use pearson correlation?1. When two variables have no correlation2. When two variables have correlation of indices3. When count of data rows > 304. When variables are normal distribution.

Clustering ModelK-Means

K-means

• K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.

Clustering with K-Means






Summary

https://en.wikipedia.org/wiki/K-means_clustering

K-Means: Pythonfrom pyspark.mllib.clustering import KMeans, KMeansModelfrom numpy import arrayfrom math import sqrt

# Load and parse the datadata = sc.textFile("data/mllib/kmeans_data.txt")parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

# Build the model (cluster the data)clusters = KMeans.train(parsedData, 2, maxIterations=10, runs=10, initializationMode="random")

# Evaluate clustering by computing Within Set Sum of Squared Errorsdef error(point): center = clusters.centers[clusters.predict(point)] return sqrt(sum([x**2 for x in (point - center)]))

WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)print("Within Set Sum of Squared Error = " + str(WSSSE))

DEMO

• Which data is not suitable for K-means analysis?

Quiz 3

Classification ModelLogistic Regression

linear regression

https://commons.wikimedia.org/wiki/File:Linear_regression.svg

When outcome is only 1/0

http://www.slideshare.net/21_venkat/logistic-regression-17406472

http://www.slideshare.net/21_venkat/logistic-regression-17406472

Hypotheses function• hypotheses:

• sigmoid function:

• Maximum Likelihood estimation

Cost Function

if y = 0if y = 1 regularized

Sample Codefrom pyspark.mllib.classification import LogisticRegressionWithSGDfrom pyspark.mllib.regression import LabeledPointfrom numpy import array

# Load and parse the datadef parsePoint(line): values = [float(x) for x in line.split(' ')] return LabeledPoint(values[0], values[1:])

data = sc.textFile("data/mllib/sample_svm_data.txt")parsedData = data.map(parsePoint)

# Build the modelmodel = LogisticRegressionWithSGD.train(parsedData)

# Evaluating the model on training datalabelsAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(parsedData.count())print("Training Error = " + str(trainErr))

Quiz 4

• What situation is not suitable for logistic regression?1. data with multiple label point2. data with no label3. data with more the 1,000 features4. data with dummy features

DEMO

Recommendation ModelAlternating Least Squares

Definition• Collaborative Filtering(CF) is a subset of

algorithms that exploit other users and items along with their ratings(selection, purchase information could be also used) and target user history to recommend an item that target user does not have ratings for.

• Fundamental assumption behind this approach is that other users preference over the items could be used recommending an item to the user who did not see the item or purchase before.

Definition

• CF differs itself from content-based methods in the sense that user or the item itself does not play a role in recommeniation but rather how(rating) and which users(user) rated a particular item. (Preference of users is shared through a group of users who selected, purchased or rate items similarly)

Recommendation

Recommendation

Recommendation

Recommendation

ALS: Pythonfrom pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

# Load and parse the datadata = sc.textFile("data/mllib/als/test.data")ratings = data.map(lambda l: l.split(',')).map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))

# Build the recommendation model using Alternating Least Squaresrank = 10numIterations = 20model = ALS.train(ratings, rank, numIterations)

# Evaluate the model on training datatestdata = ratings.map(lambda p: (p[0], p[1]))predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean()print("Mean Squared Error = " + str(MSE))

DEMO

Quiz 5

• Search for other recommendation algorithms, point out the difference with ALS. - Item Base- User Base- Content Base

Reference• Machine Learning Library (MLlib) Guide

http://spark.apache.org/docs/1.4.1/mllib-guide.html

• MLlib: Spark's Machine Learning Libraryhttp://www.slideshare.net/jeykottalam/mllib

• Recent Developments in Spark MLlib and Beyondhttp://www.slideshare.net/Hadoop_Summit

• Introduction to Machine Learninghttp://www.slideshare.net/rahuldausa/introduction-to-machine-learning

http://spark.apache.org/docs/1.4.1/mllib-guide.html

http://www.slideshare.net/jeykottalam/mllib

http://www.slideshare.net/Hadoop_Summit

http://www.slideshare.net/rahuldausa/introduction-to-machine-learning

http://www.slideshare.net/rahuldausa/introduction-to-machine-learning

Data & Analytics

Spark MLlib - Training Material