Hi! I’m Koby
▣ Data Scientist at Equancy
□ Previously: Kpler, Engie
▣ Python Dev
□ scikit-learn / pandas / Jupyter
□ Sometimes I use R
▣ I used Hadoop before for data pipelines
▣ My first project doing distributed ML!
Hello, my name is Hervé!
▣ Equancy Partner & Chief Scientist
▣ In charge of Data Technologies
□ Data Engineering
□ Data Science
□ Innovating with data
▣ PhD in Machine Learning many years ago
Recommenders: What for?
▣ Only one occasion to interact with customers
□ Which marketing message to choose?
▣ Personalized User Experience
□ Improved Experience!
▣ No information overload
□ ~230,000 Products
Three different recommendation systems
▣ Homepage / Product Page / Cart
▣ Collaborative Filtering (Unsupervised Learning)
▣ Frequently Bought-Together Prediction (Supervised Learning)
▣ Content-Based Filtering (Correlation Maximization)
Business Inputs
▣ The score should be based on three factors:
□ Interaction type - a purchase is more important than a product view
□ Time (decay) - a product purchased recently will have more impact than a product purchased in the distant past
□ Season - a product purchased during another season will have less impact
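A minimal sketch of how the three factors above could be combined into a single interaction score. Only the three factors come from the slide; the weights, the 90-day half-life, the off-season penalty, the two-season calendar and all function names are hypothetical, since the deck does not give the actual formula.

```python
from datetime import date

# Hypothetical weights: a purchase counts more than a product view.
INTERACTION_WEIGHTS = {"view": 1.0, "purchase": 5.0}
HALF_LIFE_DAYS = 90.0     # assumed: the score halves every 90 days
OFF_SEASON_FACTOR = 0.5   # assumed penalty for an interaction in another season


def season_of(day):
    """Map a date to a coarse season label (assumed two-season calendar)."""
    return "summer" if 4 <= day.month <= 9 else "winter"


def interaction_score(kind, when, today):
    """Combine interaction type, time decay and season into one score."""
    age_days = (today - when).days
    decay = 0.5 ** (age_days / HALF_LIFE_DAYS)
    season = 1.0 if season_of(when) == season_of(today) else OFF_SEASON_FACTOR
    return INTERACTION_WEIGHTS[kind] * decay * season


today = date(2016, 6, 1)
recent_purchase = interaction_score("purchase", date(2016, 5, 1), today)
old_purchase = interaction_score("purchase", date(2015, 12, 1), today)
recent_view = interaction_score("view", date(2016, 5, 1), today)
```

With these assumed values, a recent purchase scores higher than both an old off-season purchase and a recent view, which is the ordering the business inputs ask for.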
Business Rules
▣ The following items should be filtered out:
□ Purchased recently, or very similar to a recent purchase
□ Not in the current season
□ Not matching the user’s gender
□ Not in stock
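The filtering rules above can be sketched as a simple post-processing step on the candidate list. The `Product` record, its field names and the "unisex" value are assumptions made for illustration. The "very similar" rule is left out because it needs a similarity measure that the slide does not describe.

```python
from collections import namedtuple

# Hypothetical product record; the field names are assumptions.
Product = namedtuple("Product", "product_id season gender in_stock")


def apply_business_rules(candidates, recently_purchased, current_season, user_gender):
    """Keep only the candidates that pass every filtering rule."""
    kept = []
    for product in candidates:
        if product.product_id in recently_purchased:
            continue  # purchased recently
        if product.season != current_season:
            continue  # not in the current season
        if product.gender not in (user_gender, "unisex"):
            continue  # not the user's gender
        if not product.in_stock:
            continue  # not in stock
        kept.append(product)
    return kept


candidates = [
    Product(1, "summer", "f", True),
    Product(2, "winter", "f", True),       # wrong season
    Product(3, "summer", "m", True),       # wrong gender
    Product(4, "summer", "f", False),      # out of stock
    Product(5, "summer", "unisex", True),
    Product(6, "summer", "f", True),       # purchased recently
]
kept = apply_business_rules(candidates, recently_purchased={6},
                            current_season="summer", user_gender="f")
kept_ids = [p.product_id for p in kept]
```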
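[Figure: Users × Items interaction matrix; "?" marks a missing interaction]
? ? ? 1 ? ?
5 1 3 ? ? ?
? ? ? 1 ? ?
? ? ? ? ? ?
1 1 ? ? ? ?
? ? 3 ? ? 1
? ? 1 ? 5 ?
? 5 ? 3 ? ?
Interaction score values shown: 1, 3, 5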
▣ Predict missing interactions
Matrix Factorization
[Figure: the sparse Users × Items interaction matrix is approximated by the product of a Users × Latent Factors matrix and a Latent Factors × Items matrix; the unknown factor entries ("?") are estimated during training]
Training
Matrix Factorization
▣ Input:
□ Sparse representation of the matrix (tuples)
□ Each tuple represents an interaction score between a user and a product
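As an illustration of the sparse input, the sketch below turns the 8 × 6 example matrix from the earlier slide into (user, product, score) tuples. Using row and column indices as user and product IDs is an assumption made for the example.

```python
# The example matrix; None stands for a missing interaction ("?").
# Rows are users, columns are products.
matrix = [
    [None, None, None, 1, None, None],
    [5, 1, 3, None, None, None],
    [None, None, None, 1, None, None],
    [None, None, None, None, None, None],
    [1, 1, None, None, None, None],
    [None, None, 3, None, None, 1],
    [None, None, 1, None, 5, None],
    [None, 5, None, 3, None, None],
]


def to_tuples(rows):
    """Keep only the observed cells as (user, product, score) tuples."""
    return [
        (user, product, float(score))
        for user, row in enumerate(rows)
        for product, score in enumerate(row)
        if score is not None
    ]


tuples = to_tuples(matrix)
```

Only 13 of the 48 cells are stored, and a user with no interactions (row 3) simply does not appear in the input.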
Training
Matrix Factorization
▣ Output:
□ User Features: mapping users to latent features
□ Product Features: mapping products to latent features
□ Estimation of interaction scores
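To make the three outputs concrete, here is a toy alternating least squares loop with a single latent factor, in plain Python. It is only a sketch of the idea: Spark's implementation is distributed, uses `rank` factors per user and product, and solves a small linear system instead of a scalar division. The function name, the regularization value and the sample data are assumptions.

```python
def als_rank1(observed, n_users, n_items, iterations=50, reg=0.001):
    """Alternating least squares with one latent factor per user and item.

    observed: list of (user, item, score) tuples.
    Returns (user_factors, item_factors) as lists of floats.
    """
    u = [1.0] * n_users
    v = [1.0] * n_items
    for _ in range(iterations):
        # Fix the item factors and solve each user factor in closed form.
        for a in range(n_users):
            num = sum(s * v[i] for (x, i, s) in observed if x == a)
            den = sum(v[i] ** 2 for (x, i, s) in observed if x == a) + reg
            u[a] = num / den
        # Fix the user factors and solve each item factor in closed form.
        for b in range(n_items):
            num = sum(s * u[x] for (x, i, s) in observed if i == b)
            den = sum(u[x] ** 2 for (x, i, s) in observed if i == b) + reg
            v[b] = num / den
    return u, v


# Scores follow the pattern [1, 2] x [1, 2, 3]; the cell (user 1, item 2) is missing.
observed = [(0, 0, 1.0), (0, 1, 2.0), (0, 2, 3.0), (1, 0, 2.0), (1, 1, 4.0)]
user_factors, item_factors = als_rank1(observed, n_users=2, n_items=3)

# Estimated interaction score for the missing cell: product of the two factors.
predicted = user_factors[1] * item_factors[2]
```

The missing score is estimated as the product of the user factor and the item factor, which lands close to 6 here because the observed scores follow an exact rank-1 pattern.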
Implicit Collaborative Filtering
▣ Difficulties:
□ How to interpret missing relations between users and products?
If a user didn’t click on an item, does it mean that the user doesn’t like it?
Maybe they just haven’t seen it yet?
□ What values should we use for missing relations?
Should we replace them with 0?
Should we replace them with the mean/median?
▣ Methods for explicit feedback (e.g. product ratings) can’t be applied to our case!
▣ Spark MLlib has a special CF implementation for the implicit feedback case, based on the paper “Collaborative Filtering for Implicit Feedback Datasets” (Hu, Koren, Volinsky)
(Google the title to read it for free on the author’s page)
▣ The general idea is to use a confidence level that lets us tune what a lack of feedback means for our application
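A small sketch of the confidence idea, following the formulation in the paper that Spark's implicit ALS is based on (Hu, Koren and Volinsky, "Collaborative Filtering for Implicit Feedback Datasets"): each raw interaction value r becomes a binary preference plus a confidence 1 + alpha * r. The function names and the alpha value below are assumptions chosen for illustration.

```python
def preference(r):
    """Binary preference: 1 if any interaction was observed, else 0."""
    return 1.0 if r > 0 else 0.0


def confidence(r, alpha):
    """Confidence in that preference; grows linearly with the interaction value."""
    return 1.0 + alpha * r


def weighted_squared_error(r, predicted, alpha):
    """Per-cell loss term: confidence times squared preference error."""
    return confidence(r, alpha) * (preference(r) - predicted) ** 2


alpha = 40.0  # assumed value; to be tuned on validation data
for r in (0.0, 1.0, 5.0):
    print(r, preference(r), confidence(r, alpha))
```

A missing interaction is therefore treated as a weak "0" (confidence 1) rather than a firm dislike, while a strong interaction is a "1" that the model is heavily penalized for missing.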
Training
def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int, lambda: Double, blocks: Int, alpha: Double, seed: Long): MatrixFactorizationModel

Train a matrix factorization model given an RDD of 'implicit preferences' given by users to some products, in the form of (userID, productID, preference) pairs. We approximate the ratings matrix as the product of two lower-rank matrices of a given rank (number of features). To solve for these features, we run a given number of iterations of ALS. This is done using a level of parallelism given by blocks.

ratings: RDD of (userID, productID, rating) pairs
rank: number of features to use
iterations: number of iterations of ALS (recommended: 10-20)
lambda: regularization factor (recommended: 0.01)
blocks: level of parallelism to split computation into
alpha: confidence parameter
seed: random seed
ALS for Python
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

# Load and parse the data
data = sc.textFile("data/mllib/als/test.data")
ratings = data.map(lambda l: l.split(','))\
    .map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))

# Build the recommendation model using Alternating Least Squares
rank = 10
numIterations = 10
model = ALS.train(ratings, rank, numIterations)

# Evaluate the model on training data
testdata = ratings.map(lambda p: (p[0], p[1]))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean()
print("Mean Squared Error = " + str(MSE))

# Save and load model
model.save(sc, "target/tmp/myCollaborativeFilter")
sameModel = MatrixFactorizationModel.load(sc, "target/tmp/myCollaborativeFilter")
ALS for Scala
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.mllib.recommendation.Rating

// Load and parse the data
val data = sc.textFile("data/mllib/als/test.data")
val ratings = data.map(_.split(',') match { case Array(user, item, rate) =>
  Rating(user.toInt, item.toInt, rate.toDouble)
})

// Build the recommendation model using ALS
val rank = 10
val numIterations = 10
val model = ALS.train(ratings, rank, numIterations, 0.01)

// Evaluate the model on rating data
val usersProducts = ratings.map { case Rating(user, product, rate) =>
  (user, product)
}
val predictions =
  model.predict(usersProducts).map { case Rating(user, product, rate) =>
    ((user, product), rate)
  }
val ratesAndPreds = ratings.map { case Rating(user, product, rate) =>
  ((user, product), rate)
}.join(predictions)
val MSE = ratesAndPreds.map { case ((user, product), (r1, r2)) =>
  val err = (r1 - r2)
  err * err
}.mean()
println("Mean Squared Error = " + MSE)

// Save and load model
model.save(sc, "target/tmp/myCollaborativeFilter")
val sameModel = MatrixFactorizationModel.load(sc, "target/tmp/myCollaborativeFilter")
Measuring prediction Performance
▣ To select good parameters for our model, we designed a validation benchmark
▣ We based it on a relatively small dataset so that we could run a large number of tests
▣ We chose to measure and minimize the RMSE*:
□ used by default in ALS
□ punishes big errors
□ error is on the scale of the rating unit
□ common metric for CF
* RMSE - Root Mean Square Error
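A short worked example of why RMSE "punishes big errors". The function and the sample values are made up for illustration: both prediction sets have the same total absolute error (4), but the one that concentrates it in a single prediction gets twice the RMSE.

```python
import math


def rmse(actual, predicted):
    """Root Mean Square Error between two equally long lists of scores."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


actual = [5.0, 3.0, 1.0, 1.0]
four_small_errors = [4.0, 2.0, 2.0, 2.0]  # every prediction is off by 1
one_big_error = [1.0, 3.0, 1.0, 1.0]      # a single prediction is off by 4

print(rmse(actual, four_small_errors))  # 1.0
print(rmse(actual, one_big_error))      # 2.0
```

The result is expressed in the same unit as the scores themselves, which is the "scale of the rating unit" property listed above.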
Measuring prediction Performance
import math
from pyspark.mllib.recommendation import ALS

# split the dataset randomly into a training set and a validation set
training_RDD, validation_RDD = small_ratings_data.randomSplit([0.7, 0.3], seed=0)
# keep only the (user, product) pairs: these are the inputs to predictAll
validation_for_predict_RDD = validation_RDD.map(lambda x: (x[0], x[1]))

ranks = [4, 8, 12]
errors = []

# measure the error on the validation set for each candidate rank
min_error = float('inf')
best_rank = -1
for rank in ranks:
    # the keyword is lambda_ because lambda is a reserved word in Python
    model = ALS.train(training_RDD, rank, iterations=10, lambda_=0.1, seed=0)
    predictions = model.predictAll(validation_for_predict_RDD).map(lambda r: ((r[0], r[1]), r[2]))
    rates_and_preds = validation_RDD.map(lambda r: ((int(r[0]), int(r[1])), float(r[2]))).join(predictions)
    error = math.sqrt(rates_and_preds.map(lambda r: (r[1][0] - r[1][1])**2).mean())
    errors.append(error)
    print('For rank %s the RMSE is %s' % (rank, error))
    if error < min_error:
        min_error = error
        best_rank = rank

print('The best model was trained with rank %s' % best_rank)
For rank 4 the RMSE is 0.963681878574
For rank 8 the RMSE is 0.96250475933
For rank 12 the RMSE is 0.971647563632
The best model was trained with rank 8
Deployment
▣ Training a model is actually pretty fast
▣ Deploying is slow
□ We decided that all users will get a top-n recommendation
□ This recommendation is stored in a DB
▣ We need to make a fresh recommendation for every user - there are 4 million users. In Python:
def recommendProducts(self, user, num):
    """
    Recommends the top "num" number of products for a given user and returns a list
    of Rating objects sorted by the predicted rating in descending order.
    """
    return list(self.call("recommendProducts", user, num))
▣ This call took around 20 ms - pretty quick
□ Calling this function 4M times (4,000,000 × 20 ms ≈ 22 hours) = about 1 day
Deployment
▣ Solution: extract the User / Product features from the model, then apply the matrix multiplication and the sorting directly, on batches of users:
users_rdd = model.userFeatures()
products_rdd = model.productFeatures()
…
from joblib import Parallel, delayed
Parallel(n_jobs=cores, verbose=1000)(
    delayed(prepare_recommendation)(user_features_batch, gender)
    for user_features_batch in nested_user_features)
...
user_features_batch.dot(product_features_T)
▣ This was about 10 times faster than calling recommendProducts
▣ Starting from Spark 1.6, recommendProductsForUsers is implemented for Python
□ This is where Scala has an advantage over Python!
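The batch approach above can be sketched without Spark as follows. This is a plain-Python stand-in, not the authors' `prepare_recommendation` (which the deck does not show): it computes each score as a dot product and keeps the top n with `heapq`, where the original code uses a vectorized `dot` on whole batches. The input shape, (id, feature vector) pairs, mirrors what `userFeatures()` and `productFeatures()` return. The function names and sample vectors are assumptions.

```python
import heapq


def top_n_for_batch(user_batch, product_features, n):
    """Score every product for every user in the batch and keep the n best.

    user_batch: list of (user_id, feature_vector) pairs.
    product_features: list of (product_id, feature_vector) pairs.
    Returns {user_id: [(product_id, score), ...]} sorted by descending score.
    """
    recommendations = {}
    for user_id, u in user_batch:
        scored = (
            (product_id, sum(a * b for a, b in zip(u, p)))
            for product_id, p in product_features
        )
        recommendations[user_id] = heapq.nlargest(n, scored, key=lambda pair: pair[1])
    return recommendations


def batches(items, size):
    """Split a list into consecutive batches of at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


users = [(1, [1.0, 0.0]), (2, [0.0, 1.0])]
products = [(10, [0.9, 0.1]), (11, [0.2, 0.8]), (12, [0.5, 0.5])]

result = {}
for batch in batches(users, size=1):
    result.update(top_n_for_batch(batch, products, n=2))
```

Each batch is independent of the others, which is what makes it possible to hand batches to parallel workers as the joblib call above does.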
Domain-specific discussion
▣ Pros
□ Helps us find non-obvious relations between users and products
□ High diversity and coverage of the item catalogue
□ Using an unsupervised method, we project to a low-dimensional space:
Latent Factor 1 = 20% red boots + 30% green sneakers + …
Latent Factor 2 = 15% adidas sneakers + 35% comfy boots + ...
➔ Embodies “deep” preferences (fashion, style, ...)
▣ Cons
□ Unpredictable results:
e.g. the user never shopped for red boots - why are they recommended?
□ Can be perceived as an intrusion into users’ privacy (a machine analysing “deep” human desires…)
Machine Learning / Big Data discussion
▣ Pros
□ Training the model is quick thanks to the low dimensionality of the latent features
□ Linear model with a closed-form solution (“easy!”)
□ No cold-start problem (vs. User-based CF)
□ Training is parallelizable: Hadoop-friendly
▣ Cons
□ Computationally heavy compared to Content-Based approaches
□ Unable to fit non-linear relations (polynomial tricks can’t be applied)
Thank You!
www.equancy.com
47 rue de Chaillot - 75116 Paris
Koby Karp, Hervé Mignot