Recommendations with Spark

Hi! I’m Koby


▣ Data Scientist at Equancy

□ Previously: Kpler, Engie

▣ Python Dev

□ scikit-learn / pandas / Jupyter

□ Sometimes I use R

▣ I used Hadoop before for data pipelines

▣ My first project doing distributed ML!

Hello, my name is Hervé!


▣ Equancy Partner & Chief Scientist

▣ In charge of Data Technologies

□ Data Engineering

□ Data Science

□ Innovating with data

▣ PhD in Machine Learning many years ago


Recommender Systems

Recommenders: What for?


▣ Only one occasion to interact with customers

□ Which marketing message to choose?

▣ Personalized User Experience

□ Improved Experience!

▣ No information overload

□ ~230,000 Products

Why does personalization matter? Because no personalization is ugly...


Recommendation algorithms


Three different recommendation systems


Homepage Product Page Cart

Collaborative Filtering (Unsupervised Learning)

Frequently Bought-Together Prediction (Supervised Learning)

Content-Based Filtering (Correlation Maximization)

Business Rules

Business Inputs

▣ Score should be based on three factors:

□ Interaction type - a purchase is more important than a product view

□ Time (decay) - a product purchased in recent history will have more impact than a product purchased in the distant past

□ Season - a product purchased during another season will have less impact
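One way to combine the three factors into a single interaction score is to multiply a per-type weight by a time-decay term and a season factor. This is only a sketch: the weights, the 90-day half-life, the off-season factor and the function name `interaction_score` are illustrative assumptions, not values from the talk.

```python
# All constants below are illustrative assumptions, not values from the talk.
INTERACTION_WEIGHT = {"view": 1.0, "purchase": 5.0}  # a purchase counts more than a view
HALF_LIFE_DAYS = 90.0       # assumed: an interaction loses half its weight every 90 days
OFF_SEASON_FACTOR = 0.5     # assumed penalty for interactions made in another season


def interaction_score(kind, age_days, same_season):
    """Combine interaction type, time decay and season into one score."""
    decay = 0.5 ** (age_days / HALF_LIFE_DAYS)            # exponential time decay
    season = 1.0 if same_season else OFF_SEASON_FACTOR    # season factor
    return INTERACTION_WEIGHT[kind] * decay * season
```

With these assumed constants, a purchase made today in the current season scores 5.0, and the same purchase made 90 days ago scores 2.5.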

Business Rules

▣ The following items should be filtered out:

□ Purchased recently, or very similar to a recent purchase

□ Not in the current season

□ Not for the user’s gender

□ Not in stock
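These filters can be applied as a simple predicate over candidate products before the top-n list is built. The sketch below assumes hypothetical dictionary fields (`id`, `seasons`, `gender`, `stock`) and a `"unisex"` value that are not in the talk, and it leaves out the "very similar" check, which would need a similarity measure.

```python
def keep_candidate(product, user, current_season, recently_purchased_ids):
    """Return True when a candidate product passes the business filters.

    The field names used here are hypothetical; adapt them to the real catalogue.
    """
    return (
        product["id"] not in recently_purchased_ids           # not purchased recently
        and current_season in product["seasons"]              # in the current season
        and product["gender"] in (user["gender"], "unisex")   # fits the user's gender
        and product["stock"] > 0                              # in stock
    )
```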

Collaborative Filtering


Matrix Factorization


Matrix Factorization

▣ Input:

□ Sparse representation of matrix (tuples)

□ Representation of an interaction score between user and product


Matrix Factorization

▣ Output:

□ User Features: mapping users to latent features

□ Product Features: mapping products to latent features

□ Estimation of interaction scores
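To make the input and output concrete, here is a toy sketch in plain Python. The tuples and the rank-2 feature vectors are made up for illustration (a trained model would learn the vectors); the estimated score for a user/product pair is the dot product of their latent feature vectors.

```python
# Sparse input: (user_id, product_id, score) tuples; only observed pairs are stored.
interactions = [(1, 10, 5.0), (1, 11, 1.0), (2, 10, 4.0)]

# Made-up latent feature vectors of rank 2 (a trained model would learn these).
user_features = {1: [1.0, 2.0], 2: [0.5, 1.5]}
product_features = {10: [1.0, 2.0], 11: [1.0, 0.0]}


def estimate_score(user_id, product_id):
    """Estimated interaction score = dot product of the two latent vectors."""
    u = user_features[user_id]
    p = product_features[product_id]
    return sum(a * b for a, b in zip(u, p))
```

Note that `estimate_score(2, 11)` returns a value even though user 2 never interacted with product 11: filling in unobserved pairs is what makes the factorization usable for recommendation.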









Alternating Least Squares (ALS)

Implicit Collaborative Filtering

Implicit Collaborative Filtering

▣ Difficulties:

□ How to interpret missing relations between users and products? If a user didn’t click on an item, does it mean that the user doesn’t like it? Maybe they just haven’t seen it yet?

□ What values should we use for missing relations? Should we replace them with 0? With the mean/median?

▣ Methods for explicit feedback (i.e. product rating) can’t be applied to our case!

▣ Spark MLlib has a special CF implementation for the implicit feedback case, based on the paper “Collaborative Filtering for Implicit Feedback Datasets” (Hu, Koren and Volinsky)

▣ The general idea is to use a confidence level that lets us tune what a lack of feedback means for our application
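The confidence idea can be sketched as follows, using the formulation from Hu, Koren and Volinsky's paper on implicit feedback, which the Spark documentation cites: the raw score r is turned into a binary preference, and a confidence of 1 + alpha * r weights the squared error. The function names are hypothetical and this is not Spark's internal code.

```python
def preference(r):
    """Binary preference: 1 if any interaction was observed, else 0."""
    return 1.0 if r > 0 else 0.0


def confidence(r, alpha):
    """Confidence grows linearly with the observed score; missing pairs get 1."""
    return 1.0 + alpha * r


def weighted_squared_error(r, predicted, alpha):
    """Contribution of one user/product pair to the implicit-feedback loss."""
    return confidence(r, alpha) * (preference(r) - predicted) ** 2
```

A missing relation is therefore treated as a weak "0" rather than a firm dislike, and alpha controls how much more observed interactions count.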

Implicit Collaborative Filtering

(Google the title to read it for free on the author’s page)

Implementation in Spark


def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int, lambda: Double, blocks: Int, alpha: Double, seed: Long): MatrixFactorizationModel

Train a matrix factorization model given an RDD of 'implicit preferences' given by users to some products, in the form of (userID, productID, preference) pairs. We approximate the ratings matrix as the product of two lower-rank matrices of a given rank (number of features). To solve for these features, we run a given number of iterations of ALS. This is done using a level of parallelism given by blocks.

ratings: RDD of (userID, productID, rating) pairs
rank: number of features to use
iterations: number of iterations of ALS (recommended: 10-20)
lambda: regularization factor (recommended: 0.01)
blocks: level of parallelism to split computation into
alpha: confidence parameter
seed: random seed
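From Python, the same call is available as `ALS.trainImplicit` in `pyspark.mllib.recommendation`. A minimal sketch, assuming a running SparkContext and an RDD of `Rating` objects: the wrapper name `train_implicit` and its default values are illustrative, and note that the regularization keyword is `lambda_` because `lambda` is a reserved word in Python.

```python
def train_implicit(ratings_rdd, rank=10, iterations=10, lambda_=0.01, alpha=0.01, seed=0):
    """Thin wrapper around pyspark's ALS.trainImplicit.

    Needs pyspark and a live SparkContext; ratings_rdd is an RDD of
    Rating(user, product, score) objects or equivalent tuples.
    """
    # Imported lazily so this module can be loaded without pyspark installed.
    from pyspark.mllib.recommendation import ALS

    return ALS.trainImplicit(ratings_rdd, rank, iterations=iterations,
                             lambda_=lambda_, alpha=alpha, seed=seed)
```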

from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

# Load and parse the data
data = sc.textFile("data/mllib/als/test.data")
ratings = data.map(lambda l: l.split(','))\
    .map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))

# Build the recommendation model using Alternating Least Squares
rank = 10
numIterations = 10
model = ALS.train(ratings, rank, numIterations)

# Evaluate the model on training data
testdata = ratings.map(lambda p: (p[0], p[1]))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean()
print("Mean Squared Error = " + str(MSE))

# Save and load model
model.save(sc, "target/tmp/myCollaborativeFilter")
sameModel = MatrixFactorizationModel.load(sc, "target/tmp/myCollaborativeFilter")

ALS for Python

import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.mllib.recommendation.Rating

// Load and parse the data
val data = sc.textFile("data/mllib/als/test.data")
val ratings = data.map(_.split(',') match { case Array(user, item, rate) =>
  Rating(user.toInt, item.toInt, rate.toDouble)
})

// Build the recommendation model using ALS
val rank = 10
val numIterations = 10
val model = ALS.train(ratings, rank, numIterations, 0.01)

// Evaluate the model on rating data
val usersProducts = ratings.map { case Rating(user, product, rate) =>
  (user, product)
}
val predictions =
  model.predict(usersProducts).map { case Rating(user, product, rate) =>
    ((user, product), rate)
  }
// Join actual and predicted ratings on the (user, product) key
val ratesAndPreds = ratings.map { case Rating(user, product, rate) =>
  ((user, product), rate)
}.join(predictions)
val MSE = ratesAndPreds.map { case ((user, product), (r1, r2)) =>
  val err = (r1 - r2)
  err * err
}.mean()
println("Mean Squared Error = " + MSE)

// Save and load model
model.save(sc, "target/tmp/myCollaborativeFilter")
val sameModel = MatrixFactorizationModel.load(sc, "target/tmp/myCollaborativeFilter")

ALS for Scala

Validation and Parameter Tuning

Measuring prediction Performance

▣ In order to select good parameters for our model, we designed a validation benchmark

▣ We based it on a relatively small dataset to be able to run a significant number of tests

▣ We chose to measure and minimize the RMSE*:

□ used by default in ALS

□ punishes big errors

□ error is on the scale of the rating unit

□ common metric for CF

* RMSE - Root Mean Square Error
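As a small plain-Python illustration of the metric (the helper names `rmse` and `mae` are ours, and the numbers are made up): RMSE is the square root of the mean squared difference between actual and predicted scores. Comparing it with the mean absolute error shows what "punishes big errors" means.

```python
import math


def rmse(actual, predicted):
    """Root Mean Square Error between two equally long lists of scores."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error, shown only for comparison."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


actual = [0.0, 0.0, 0.0, 0.0]
many_small_errors = [1.0, 1.0, 1.0, 1.0]   # four errors of size 1
one_big_error = [0.0, 0.0, 0.0, 2.0]       # a single error of size 2
```

Both predictions have an RMSE of 1.0, but the single big error has a mean absolute error of only 0.5: under RMSE, one error of size 2 weighs as much as four errors of size 1.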

Measuring prediction Performance

import math

from pyspark.mllib.recommendation import ALS

# small_ratings_data: an RDD of (user, product, rating) tuples loaded beforehand
# Split the dataset randomly into a training set and a validation set
training_RDD, validation_RDD = small_ratings_data.randomSplit([0.7, 0.3], seed=0)

# Keep only the (user, product) pairs to predict on
validation_for_predict_RDD = validation_RDD.map(lambda x: (x[0], x[1]))

ranks = [4, 8, 12]
errors = []

# Measure the error on the validation set for each rank
min_error = float('inf')
best_rank = -1

for rank in ranks:
    # lambda is a reserved word in Python, so the keyword argument is lambda_
    model = ALS.train(training_RDD, rank, seed=0, iterations=10, lambda_=0.1)
    predictions = model.predictAll(validation_for_predict_RDD).map(lambda r: ((r[0], r[1]), r[2]))
    rates_and_preds = validation_RDD.map(lambda r: ((int(r[0]), int(r[1])), float(r[2]))).join(predictions)
    error = math.sqrt(rates_and_preds.map(lambda r: (r[1][0] - r[1][1])**2).mean())
    errors.append(error)
    print('For rank %s the RMSE is %s' % (rank, error))
    if error < min_error:
        min_error = error
        best_rank = rank

print('The best model was trained with rank %s' % best_rank)

For rank 4 the RMSE is 0.963681878574

For rank 8 the RMSE is 0.96250475933

For rank 12 the RMSE is 0.971647563632

The best model was trained with rank 8



▣ Training a model is actually pretty fast

▣ Deploying is slow

□ We decided that all users will get a top-n recommendation

□ This recommendation is stored in a DB

▣ We need to make a fresh recommendation for every user - there are 4 million users. In Python:

def recommendProducts(self, user, num):
    """
    Recommends the top "num" number of products for a given user and returns a list
    of Rating objects sorted by the predicted rating in descending order.
    """
    return list(self.call("recommendProducts", user, num))

▣ This call took around 20 ms - pretty quick

□ Calling this function 4M times = about 1 day
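The "1 day" figure follows directly from the two numbers on the slide, assuming the calls run one after another:

```python
users = 4 * 10**6        # 4 million users (from the slide)
ms_per_call = 20         # around 20 ms per recommendProducts call (from the slide)

total_ms = users * ms_per_call       # 80,000,000 ms
total_hours = total_ms / 1000 / 3600 # about 22.2 hours, i.e. roughly one day
```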


▣ I wasn’t the only one that needed this feature ...


▣ Solution: extract the user / product features, then apply the matrix multiplication and the sorting directly on the RDD, in batches:

users_rdd = model.userFeatures()
products_rdd = model.productFeatures()

from joblib import Parallel, delayed

Parallel(n_jobs=cores, verbose=1000)(
    delayed(prepare_recommendation)(user_features_batch, gender)
    for user_features_batch in nested_user_features)



This was about 10 times faster than calling recommendProducts
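The batch idea can be sketched in plain Python. The talk does not show its `prepare_recommendation` helper, so the function below (`top_n_for_batch`) is a hypothetical stand-in: it takes a batch of (id, features) pairs, such as those collected from `userFeatures()` and `productFeatures()`, scores every product for every user with a dot product, and keeps the n best. A real implementation would likely use NumPy matrix multiplication and apply the business filters as well.

```python
import heapq


def top_n_for_batch(user_batch, product_features, n):
    """Score every product for each user in the batch and keep the n best.

    user_batch and product_features are lists of (id, feature_vector) pairs.
    Returns {user_id: [product_id, ...]} ordered by descending score.
    """
    recommendations = {}
    for user_id, u in user_batch:
        # One dot product per product; a generator avoids building the full list.
        scored = ((sum(a * b for a, b in zip(u, p)), product_id)
                  for product_id, p in product_features)
        best = heapq.nlargest(n, scored)
        recommendations[user_id] = [product_id for _, product_id in best]
    return recommendations
```

Each batch is independent of the others, which is what makes it possible to hand batches to separate worker processes.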

▣ Starting from Spark 1.6, recommendProductsForUsers is implemented for Python

□ This is where Scala has an advantage over Python!

Discussing Collaborative Filtering

Domain-specific discussion

▣ Pros

□ Helps us to find non-obvious relations between users and products

□ High diversity and coverage of item catalogue

□ Using an unsupervised method we project to a low-dimensional space:

Latent Factor 1 = 20% red boots + 30% green sneakers + …

Latent Factor 2 = 15% adidas sneakers + 35% comfy boots + ...

➔ Embodies “deep” preferences (fashion, style, ...)

▣ Cons

□ Unpredictable results:

e.g. user never shopped for red boots - why is it recommended?

□ Can be interpreted as an intrusion into the users’ privacy (a machine analysing “deep” human desires…)

Machine Learning / Big Data discussion

▣ Pros

□ Training of the model is quick thanks to the latent feature low dimensionality

□ Linear model with a closed-form solution (“easy!”)

□ No cold-start problem (vs. User-based CF)

□ Training is parallelizable: Hadoop Friendly

▣ Cons

□ Computationally heavy compared to Content-Based approaches

□ Unable to fit non-linear relations (polynomial tricks can’t be applied)
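To illustrate the "closed-form solution" point, here is a toy ALS restricted to a single latent feature (rank 1), in plain Python. With the product values held fixed, each user value has an exact ridge-regression solution, and vice versa, so the algorithm just alternates the two updates. The function names and the rank-1 restriction are simplifications of ours; Spark's implementation handles higher ranks and distributes the work.

```python
def solve_side(ratings, fixed, lam, by_user):
    """Closed-form ridge update of one side while the other is held fixed.

    For rank 1: value = sum(f * r) / (sum(f * f) + lam), where f is the fixed
    value on the other side of each observed (user, item, r) tuple.
    """
    num, den = {}, {}
    for user, item, r in ratings:
        key, other = (user, item) if by_user else (item, user)
        f = fixed[other]
        num[key] = num.get(key, 0.0) + f * r
        den[key] = den.get(key, 0.0) + f * f
    return {k: num[k] / (den[k] + lam) for k in num}


def als_rank1(ratings, iterations=10, lam=0.1):
    """Alternate the two closed-form updates, starting from all-ones item values."""
    items = {item: 1.0 for _, item, _ in ratings}
    users = {}
    for _ in range(iterations):
        users = solve_side(ratings, items, lam, by_user=True)
        items = solve_side(ratings, users, lam, by_user=False)
    return users, items
```

Every per-user (or per-item) update only needs that user's (or item's) own ratings, which is why training parallelizes well.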

Guess what?


Thank You!


47 rue de Chaillot - 75116 Paris

Koby Karp
Hervé Mignot

