Upload
bryan-yang
View
556
Download
8
Embed Size (px)
Citation preview
Spark MLlib
• ExperienceVpon Data EngineerTWM, Keywear, Nielsen
• Bryan’s notes for data analysishttp://bryannotes.blogspot.tw
• Spark.TW
• Linikedinhttps://tw.linkedin.com/pub/bryan-yang/7b/763/a79
ABOUT ME
Agenda• Introduction to Machine Learning• Why MLlib• Basic Statistic• K-means• Logistic Regression• ALS• Demo
Introduction to Machine Learning
Definition
• Machine learning is a study of computer algorithms that improve automatically through experience.
Terminology• Observation
- The basic unit of data. • Features
- Features is that can be used to describe each observation in a quantitative manner.
• Feature vector- is an n-dimensional vector of numerical features that represents some object.
• Training/ Testing/ Evaluation set- Set of data to discover potential predictive relationships
Learning(Training)
• Features:1. Color: Red/ Green2. Type: Fruit3. Shape: nearly Circleetc…
Learning(Training)ID Color Type Shape is apple
(Label)
1 Red Fruit Cirle Y
2 Red Fruit Cirle Y
3 Black Logo nearly Cirle N
4 Blue N/A Cirle N
Categories of Machine Learning
• Classification: predict class from observations.
• Clustering: group observations into meaningful groups.
• Regression: predict value from observations.
Cluster the observations with no Labels
https://en.wikipedia.org/wiki/Cluster_analysis
Cut the observations
http://stats.stackexchange.com/questions/159957/how-to-do-one-vs-one-classification-for-logistic-regression
Find a model to describe the observations
https://commons.wikimedia.org/wiki/File:Linear_regression.svg
Machine Learning Pipelines
http://www.nltk.org/book/ch06.html
What is MLlib
What is MLlib• MLlib is an Apache Spark component focusing
on machine learning:
• MLlib is Spark’s core ML library
• Developed by MLbase team in AMPLab
• 80+ contributions from various organization
• Support Scala, Python, and Java APIs
Algorithms in MLlib• Statistics: Description, correlation
• Clustering: k-means
• Collaborative filtering: ALS
• Classification: SVMs, naive Bayes, decision tree.
• Regression: linear regression, logistic regression
• Dimensionality: SVD, PCA
Why Mllib
• Scalability
• Performance
• user-friendly documentation and APIs
• Cost of maintenance
Performance
Data Type
• Dense vector
• Sparse vector
• Labeled point
Dense & Sparse
• Raw Data:
IDA
B C D E F
1 1 0 0 0 0 32 0 1 0 1 0 23 1 1 1 0 1 1
Dense vs Sparse
• Training Set:- number of example: 12 million- number of features: 500- sparsity: 10%
• Not only save storage, but also received a 4x speed up
Dense Sparse
Storge 47GB 7GB
Time 240s 58s
Labeled Point• Dummy variable (1,0)
• Categorical variable (0, 1, 2, …)from pyspark.mllib.linalg import SparseVectorfrom pyspark.mllib.regression import LabeledPoint
# Create a labeled point with a positive label and a dense feature vector.pos = LabeledPoint(1.0, [1.0, 0.0, 3.0])
# Create a labeled point with a negative label and a sparse feature vector.neg = LabeledPoint(0.0, SparseVector(3, [0, 2], [1.0, 3.0]))
DEMO
Descriptive Statistics• Supported function:
- count- max- min- mean- variance…
• Supported data types- Dense- Sparse- Labeled Point
Examplefrom pyspark.mllib.stat import Statisticsfrom pyspark.mllib.linalg import Vectorsimport numpy as np
## example data( 2 x 2 matrix at least)data= np.array([[1.0,2.0,3.0,4.0,5.0],[1.0,2.0,3.0,4.0,5.0]])
## to RDDdistData = sc.parallelize(data)
## Compute Statistic Valuesummary = Statistics.colStats(distData)print "Duration Statistics:"print " Mean: {}".format(round(summary.mean()[0],3))print " St. deviation: {}".format(round(sqrt(summary.variance()[0]),3))print " Max value: {}".format(round(summary.max()[0],3))print " Min value: {}".format(round(summary.min()[0],3))print " Total value count: {}".format(summary.count())print " Number of non-zero values: {}".format(summary.numNonzeros()[0])
DEMO
Quiz 1
• Which statement is wrong purpose of descriptive statistic?1. Find the extreme value of the data.2. Understanding the data properties preliminary. 3. Describe the full properties of the data.
Correlation
https://commons.wikimedia.org/wiki/File:Linear_Correlation_Examples
Example• Data Set
from pyspark.mllib.stat import Statistics
## 選擇統計方法corrType='pearson'
## 建立測試資料及group1 = sc.parallelize([1,3,4,3,7,5,6,8])group2 = sc.parallelize([1,2,3,4,5,6,7,8])corr = Statistics.corr(group1,group2, corrType)print corr# 0.87
A B C D E F G H
Case1 1 3 4 3 7 5 6 8
Case2 1 2 3 4 5 6 7 8
Shape is important
http://2012books.lardbucket.org/books/beginning-psychology/
DEMO
Quiz 2
• Which situation below can not use pearson correlation?1. When two variables have no correlation2. When two variables have correlation of indices3. When count of data rows > 304. When variables are normal distribution.
Clustering ModelK-Means
K-means
• K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.
Clustering with K-Means
Clustering with K-Means
Clustering with K-Means
Clustering with K-Means
Clustering with K-Means
Clustering with K-Means
Summary
https://en.wikipedia.org/wiki/K-means_clustering
K-Means: Pythonfrom pyspark.mllib.clustering import KMeans, KMeansModelfrom numpy import arrayfrom math import sqrt
# Load and parse the datadata = sc.textFile("data/mllib/kmeans_data.txt")parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))
# Build the model (cluster the data)clusters = KMeans.train(parsedData, 2, maxIterations=10, runs=10, initializationMode="random")
# Evaluate clustering by computing Within Set Sum of Squared Errorsdef error(point): center = clusters.centers[clusters.predict(point)] return sqrt(sum([x**2 for x in (point - center)]))
WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)print("Within Set Sum of Squared Error = " + str(WSSSE))
DEMO
• Which data is not suitable for K-means analysis?
Quiz 3
Classification ModelLogistic Regression
linear regression
https://commons.wikimedia.org/wiki/File:Linear_regression.svg
When outcome is only 1/0
http://www.slideshare.net/21_venkat/logistic-regression-17406472
Hypotheses function• hypotheses:
• sigmoid function:
• Maximum Likelihood estimation
Cost Function
if y = 0if y = 1 regularized
Sample Codefrom pyspark.mllib.classification import LogisticRegressionWithSGDfrom pyspark.mllib.regression import LabeledPointfrom numpy import array
# Load and parse the datadef parsePoint(line): values = [float(x) for x in line.split(' ')] return LabeledPoint(values[0], values[1:])
data = sc.textFile("data/mllib/sample_svm_data.txt")parsedData = data.map(parsePoint)
# Build the modelmodel = LogisticRegressionWithSGD.train(parsedData)
# Evaluating the model on training datalabelsAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(parsedData.count())print("Training Error = " + str(trainErr))
Quiz 4
• What situation is not suitable for logistic regression?1. data with multiple label point2. data with no label3. data with more the 1,000 features4. data with dummy features
DEMO
Recommendation ModelAlternating Least Squares
Definition• Collaborative Filtering(CF) is a subset of
algorithms that exploit other users and items along with their ratings(selection, purchase information could be also used) and target user history to recommend an item that target user does not have ratings for.
• Fundamental assumption behind this approach is that other users preference over the items could be used recommending an item to the user who did not see the item or purchase before.
Definition
• CF differs itself from content-based methods in the sense that user or the item itself does not play a role in recommeniation but rather how(rating) and which users(user) rated a particular item. (Preference of users is shared through a group of users who selected, purchased or rate items similarly)
Recommendation
Recommendation
Recommendation
Recommendation
ALS: Pythonfrom pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
# Load and parse the datadata = sc.textFile("data/mllib/als/test.data")ratings = data.map(lambda l: l.split(',')).map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))
# Build the recommendation model using Alternating Least Squaresrank = 10numIterations = 20model = ALS.train(ratings, rank, numIterations)
# Evaluate the model on training datatestdata = ratings.map(lambda p: (p[0], p[1]))predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean()print("Mean Squared Error = " + str(MSE))
DEMO
Quiz 5
• Search for other recommendation algorithms, point out the difference with ALS. - Item Base- User Base- Content Base
Reference• Machine Learning Library (MLlib) Guide
http://spark.apache.org/docs/1.4.1/mllib-guide.html
• MLlib: Spark's Machine Learning Libraryhttp://www.slideshare.net/jeykottalam/mllib
• Recent Developments in Spark MLlib and Beyondhttp://www.slideshare.net/Hadoop_Summit
• Introduction to Machine Learninghttp://www.slideshare.net/rahuldausa/introduction-to-machine-learning