CSE 291: Trends in Recommender Systems and Human Behavioral Modeling Week 10 project presentations


Page 1: Week 10 project presentations

CSE 291: Trends in Recommender Systems and Human Behavioral Modeling

Week 10 project presentations

Page 2: Week 10 project presentations

Neural Rating Regression with Abstractive Tips Generation for Recommendation

Balasubramaniam Srinivasan, Nitin Kalra, Prem Nagarajan

Page 3: Week 10 project presentations

Problem Statement

Given a user and an item, simultaneously predict a precise rating and generate tips.

Page 4: Week 10 project presentations

Dataset: Amazon

Rating regression categories: Electronics, Movies, Books (size ~ 1 GB each)

Multi-task learning: Pet Supplies, Arts & Crafts, Cell Phone accessories

Category       #Users     #Items     #Reviews
Books          603,668    367,982    8,887,781
Electronics    192,403    63,001     1,684,779
Movies & TV    123,960    50,052     1,697,533

Page 5: Week 10 project presentations

Architecture

Page 6: Week 10 project presentations

Baseline model
● Deep Learning based framework named NRT (Neural Rating and Tips generation)
● A multi-layer perceptron maps the user and item latent factors to a rating
● A Gated Recurrent Unit (GRU) translates the user and item latent factors into tips
● Uses a beam search algorithm to generate tips from the trained model
● A multi-task learning framework integrates rating prediction and tips generation via the joint objective function

Page 7: Week 10 project presentations

Evaluation metrics
● For the rating prediction task:
  ○ Mean Absolute Error (MAE)
  ○ Root Mean Square Error (RMSE)
● For the tip generation task:
  ○ ROUGE-N score
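The MAE and RMSE formulas appeared as images in the slides; the standard definitions consistent with these bullets (with r̂ the predicted rating, r the ground-truth rating, and N test pairs) are:

```latex
\mathrm{MAE} = \frac{1}{N}\sum_{u,i}\left|\hat{r}_{u,i}-r_{u,i}\right|,
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{u,i}\left(\hat{r}_{u,i}-r_{u,i}\right)^{2}}
```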

Page 8: Week 10 project presentations

Extension 1
● Effect of the following metadata on the ratings:
  a. Also viewed
  b. Also bought
  c. Bought together
● Modelling the new features as graphs
● Learning from the node2vec representation of the nodes (see the sketch below)
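Since the node2vec step is only named on the slide, here is a minimal sketch of one way it could look, assuming item metadata in a simple dict format and using uniform random walks fed to word2vec (the special case of node2vec with p = q = 1); all names and the toy data are illustrative, not the presenters' code:

```python
# Build an item graph from "also viewed" / "also bought" metadata and learn
# node embeddings from uniform random walks (DeepWalk-style).
import random
import networkx as nx
from gensim.models import Word2Vec

def build_item_graph(metadata):
    """metadata: dict item_id -> {"also_viewed": [...], "also_bought": [...]} (assumed format)."""
    g = nx.Graph()
    for item, related in metadata.items():
        for other in related.get("also_viewed", []) + related.get("also_bought", []):
            g.add_edge(item, other)
    return g

def random_walks(g, num_walks=10, walk_length=20):
    walks = []
    for _ in range(num_walks):
        for node in g.nodes():
            walk = [node]
            while len(walk) < walk_length:
                neighbors = list(g.neighbors(walk[-1]))
                if not neighbors:
                    break
                walk.append(random.choice(neighbors))
            walks.append([str(n) for n in walk])
    return walks

g = build_item_graph({"i1": {"also_viewed": ["i2"], "also_bought": ["i3"]},
                      "i2": {"also_viewed": ["i3"]}, "i3": {}})
model = Word2Vec(random_walks(g), vector_size=128, window=5, min_count=0, sg=1)
item_vec = model.wv["i1"]  # 128-d item embedding, usable as an extra feature for the rating model
```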

Page 9: Week 10 project presentations

Extension 2
● Using the factoid answers dataset to improve rating prediction and tip generation
● Contains question and answer data from Amazon

Page 10: Week 10 project presentations

Results for rating prediction

                          Books *(Sampled Down)     Electronics        Movies & TV
                          MAE        RMSE           MAE      RMSE      MAE      RMSE
Baseline model (NRT)      --         --             0.805    1.060     0.885    1.130
NRT + Also viewed (128)   *          *              0.794    1.039     0.921    1.126
NRT + Also bought (128)   *          *              0.802    1.052     0.905    1.119

Page 11: Week 10 project presentations

Results for tip generation

Pet Supplies:
             ROUGE-1                ROUGE-2              ROUGE-L
             F1      P      R       F1    P     R        F1     P      R
NRT          27.98   22.11  45.67   2.03  1.68  3.63     22.84  21.66  45.67
NRT + Q/A    28.31   22.32  46.09   2.13  1.56  4.27     23.09  21.83  45.88

Arts and Crafts:
             ROUGE-1                ROUGE-2              ROUGE-L
             F1      P      R       F1    P     R        F1     P      R
NRT          30.85   26.06  44.90   1.21  1.12  1.88     26.33  25.59  44.90
NRT + Q/A    31.14   23.33  55.84   0.86  0.79  1.13     28.72  26.98  55.84

Page 12: Week 10 project presentations

Results for tip generation

Cell Phone Accessories:
             ROUGE-1                ROUGE-2              ROUGE-L
             F1      P      R       F1    P     R        F1     P      R
NRT          28.25   19.82  60.21   1.02  0.64  2.68     20.86  19.40  60.21
NRT + Q/A    29.08   25.73  43.27   0.45  0.34  0.71     23.22  22.09  43.14

Page 13: Week 10 project presentations

Results for rating prediction (multi-task categories)

Pet Supplies:
             MAE     RMSE
NRT          0.712   0.822
NRT + Q/A    0.706   0.784

Arts and Crafts:
             MAE     RMSE
NRT          0.543   0.9310
NRT + Q/A    0.543   1.087

Page 14: Week 10 project presentations

Results for tip generation

NRT

NRT + Q/A

Cell PhoneAccessories:

MAE RMSE

0.539 0.487

0.493 0.477

Page 15: Week 10 project presentations

Limitations
● Large datasets
● Model is compute intensive
● Extensions are compute intensive

Page 16: Week 10 project presentations

Work in Progress

● Analyze the importance of time or season on product ratings and reviews
  ○ Capturing user and item state
● Books dataset sampling

Page 17: Week 10 project presentations

Thank you!

Page 18: Week 10 project presentations
Page 19: Week 10 project presentations

Extension to Neural Collaborative Filtering

Wen Liang, Zeng Fan

Page 20: Week 10 project presentations

Original Paper

Presents the GMF (Generalized Matrix Factorization) and NeuMF models.

Page 21: Week 10 project presentations

Motivations

Use user and item attributes in the dataset

Tackle the sparsity issue

Page 22: Week 10 project presentations

Dataset
● MovieLens
  ○ User-movie ratings
  ○ User information: gender, age, occupation
  ○ Movie information: genre (e.g. adventure, comedy, etc.)
● Pinterest
  ○ User-item pairs
  ○ Number of each user’s pins
  ○ User’s category

Page 23: Week 10 project presentations

Evaluation and Metrics
● Evaluation
  ○ Leave-one-out evaluation: for each user, hold out one user-item interaction for the test set
● Metrics
  ○ Hit Ratio@10
  ○ Normalized Discounted Cumulative Gain (NDCG)@10
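A minimal sketch of this leave-one-out protocol, assuming (as in the NCF paper) that the held-out positive item is ranked against a set of sampled negatives; names and the toy scores are illustrative:

```python
# Compute HR@10 and NDCG@10 for one user from a ranked candidate list containing
# the single held-out positive item plus sampled negatives.
import math

def hit_ratio_at_k(ranked_items, held_out_item, k=10):
    return 1.0 if held_out_item in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, held_out_item, k=10):
    # With a single relevant item, NDCG reduces to 1 / log2(rank + 2) (0-indexed rank).
    if held_out_item in ranked_items[:k]:
        rank = ranked_items.index(held_out_item)
        return 1.0 / math.log2(rank + 2)
    return 0.0

# scores: item -> predicted score for one user; higher is better.
scores = {"held_out": 0.93, "neg_1": 0.95, "neg_2": 0.40, "neg_3": 0.10}
ranked = sorted(scores, key=scores.get, reverse=True)
print(hit_ratio_at_k(ranked, "held_out"), ndcg_at_k(ranked, "held_out"))
```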

Page 24: Week 10 project presentations

Revisit the NeuMF Model
NeuMF combines GMF and MLP to better capture implicit user-item relationships.

Using only the GMF model is efficient and costs little performance.

Page 25: Week 10 project presentations

Attribute-aware deep CF model
● An extension of the NeuMF model
● Social network based
● Adds a pooling layer above the embedding layer

Wang et al. (2017), Item Silk Road: Recommending Items from Information Domains to Social Users

Page 26: Week 10 project presentations

Proposed Model
● Use a shared user embedding to address the cold-start problem
● Use a weight to balance the element-wise products between pairs of user, item, and attribute vectors

Page 27: Week 10 project presentations

Results

Page 28: Week 10 project presentations

Hit Ratio@10

Page 29: Week 10 project presentations

Normalized Discounted Cumulative Gain (NDCG@10)

Page 30: Week 10 project presentations

Training Loss vs. Epochs

Page 31: Week 10 project presentations

Questions

Page 32: Week 10 project presentations

Final Project: A Synthetic Approach for Recommendation

Yan Cheng, Moyuan Huang

Page 33: Week 10 project presentations

Overview

1. Objective: predict customer ratings for businesses

2. Metric: root mean square error

3. Dataset: subset of Yelp

4. Models

Page 34: Week 10 project presentations

Dataset

1. Yelp dataset

2. Select 5,000 data points for simplicity

3. To avoid sparsity in the recommendation matrix, we work with users who have more than 30 reviews

Page 35: Week 10 project presentations

Dataset

1. Ratings: business_id, business_stars, user_id, and user_average_stars

2. Relations: user_id and friend_ids

3. Reviews: business_id, user_id, rating, and review_text

Page 36: Week 10 project presentations

Model Overview

1. Basic Model
   a. Mean estimation
   b. Matrix Factorization

2. MF with latent factors

3. Topic MF
   a. original version
   b. modified version

4. Social MF
   a. Friend relation
   b. Social popularity
   c. User similarity

Page 37: Week 10 project presentations

Model Overview

● Basic Model
  a. Mean estimation:
     rating = mean(ratings) + [mean(user) - mean(ratings)] + [mean(business) - mean(ratings)]
  b. Matrix Factorization: sklearn.decomposition.NMF
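A small sketch of the two basic models named above, using illustrative data; note that NMF here treats missing ratings as zeros, which is a simplification of the presenters' pipeline:

```python
# (a) mean/bias estimate and (b) sklearn NMF on a tiny user-business rating matrix.
import numpy as np
from sklearn.decomposition import NMF

R = np.array([[5, 3, 0],      # rows = users, cols = businesses, 0 = missing
              [4, 0, 1],
              [1, 1, 5]], dtype=float)
mask = R > 0

# (a) Mean estimation: global mean plus user and business offsets.
global_mean = R[mask].mean()
user_mean = np.array([row[row > 0].mean() for row in R])
biz_mean = np.array([col[col > 0].mean() for col in R.T])
pred_mean = global_mean + (user_mean[:, None] - global_mean) + (biz_mean[None, :] - global_mean)

# (b) Matrix factorization with sklearn.decomposition.NMF (missing entries treated as 0 here).
nmf = NMF(n_components=2, init="random", random_state=0, max_iter=500)
W = nmf.fit_transform(R)          # user factors
H = nmf.components_               # business factors
pred_nmf = W @ H

rmse = np.sqrt(((pred_nmf[mask] - R[mask]) ** 2).mean())
print(rmse)
```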

Page 38: Week 10 project presentations

Model Overview

● MF with latent factors

Page 39: Week 10 project presentations

Model Overview

● Topic MF (incorporating reviews)
  a. Input: term frequencies (tf) for each review
  b. LDA
  c. Output: vector of topic distributions
● Different implementations
  a. original version did not work out
  b. modified version
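A minimal sketch of the LDA step (term frequencies in, per-review topic distributions out), using scikit-learn and toy reviews; the presenters' actual pipeline may differ:

```python
# Term-frequency vectors per review -> LDA -> topic-distribution vectors,
# which the Topic MF variant can feed into the factorization.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

reviews = ["great food and friendly staff",
           "terrible service, long wait",
           "friendly staff, great service"]

tf = CountVectorizer(stop_words="english").fit_transform(reviews)    # (a) tf for each review
lda = LatentDirichletAllocation(n_components=2, random_state=0)       # (b) LDA
topic_dist = lda.fit_transform(tf)                                    # (c) topic distribution per review
print(topic_dist)   # one vector per review; rows sum to 1
```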

Page 40: Week 10 project presentations

Model Overview

● Topic MF (incorporating reviews)
  a. original version
  b. modified version

Page 41: Week 10 project presentations

Model Overview

● Social MF (social relationship information, WIP)
  a. Friend relation
  b. Social popularity
  c. User similarity

Page 42: Week 10 project presentations

Result

1. Basic Model
   a. Mean estimation: 0.804
   b. Matrix Factorization: 0.800

2. MF with latent factors: ?

3. Topic MF
   a. original version: 0.907
   b. modified version: 0.794

4. Social MF
   a. Relation: 0.796
   b. Popularity: 0.773
   c. User similarity: 0.804

Page 43: Week 10 project presentations

WIP

1. Use word representations instead of bag-of-words

2. Combine the social MF variants

3. Compare performance across models

4. Explain the results

Page 44: Week 10 project presentations

Dynamic Recurrent Network for Next-Basket Recommendation with Attention

TEAM MEMBERS :

KRITI AGGARWAL, SUDHANSHU BAHETY, DIGVIJAY KARAMCHANDANI

Page 45: Week 10 project presentations

Original Paper: A Dynamic Recurrent Model for Next Basket Recommendation (DREAM)

▶ Original DREAM model proposes a dynamic recurrent basket model based on RNN for next basket recommendation

▶ Merges the current items in a user's basket and global sequential basket features, using an RNN (LSTM), into a recurrent, dynamic user representation.

▶ It shows that a nonlinear operation (max-pooling) for learning the basket representation does well at capturing elaborate interactions among multiple factors of items (i.e., item embeddings are learned as part of the network using a feed-forward network).

▶ Extensive experiments on two public datasets (T-mall and Ta-Feng) demonstrated the effectiveness of the proposed model.

Page 46: Week 10 project presentations

Original Network Architecture

Page 47: Week 10 project presentations

Extension 1: Implementing and Adapting the DREAM Model to Instacart Dataset

▶ We used the Instacart Market Analysis dataset as the original datasets were not available to us.

▶ This dataset was chosen because our literature review found it to be the closest to the original datasets.

▶ We needed to communicate with the authors to clarify certain parts of the paper.

▶ Implemented the original DREAM model in PyTorch.
▶ Dataset description:

▶ Anonymized; contains a sample of over 3 million grocery orders from more than 200,000 Instacart users.

▶ Each user has between 4 and 100 orders, with the sequence of products purchased in each order.

Page 48: Week 10 project presentations

Extension 2: Adding Attention to the DREAM MODEL

▶ We took the idea of adding attention from the ICLR 2015 paper “NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE”.

At time t:
▶ Y_t is the representation of the input basket.
▶ s_{t-1} is the hidden representation of the user (the LSTM hidden state).
▶ (+): we add attention for a weighted focus on previous user hidden states.
▶ After attention we get a context vector C_i, the same size as s_{t-1}.
▶ C_i represents the weighted sum of previous user representations, so that the model attends to the most important user hidden factors.

Page 49: Week 10 project presentations

Adding Attention to the DREAM MODEL

▶ Drawing parallels from attention used successfully in Seq2Seq models, we wanted the LSTM to take input the most important parts of the last few user hidden representations.

▶ The hidden representations of the users are captured at each time step t.

▶ The attention is based on an alignment score, i.e., how correlated the current input is to each of the previous baskets in a window.

▶ We have a hyper-parameter k, which decides the appropriate window size over previous baskets (see the sketch below).
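A sketch of what this attention step could look like in PyTorch, assuming dot-product alignment scores over the last k user hidden states; shapes and names are illustrative, not the presenters' code:

```python
# Score the current basket representation against the last k user hidden states,
# then form a context vector as their weighted sum.
import torch
import torch.nn.functional as F

def attention_context(basket_repr, prev_hidden, k=5):
    """basket_repr: (batch, d); prev_hidden: (batch, T, d) previous user hidden states."""
    window = prev_hidden[:, -k:, :]                        # keep only the last k states
    scores = torch.bmm(window, basket_repr.unsqueeze(2))   # (batch, k, 1) dot-product alignment
    weights = F.softmax(scores.squeeze(2), dim=1)          # (batch, k) attention weights
    context = torch.bmm(weights.unsqueeze(1), window)      # (batch, 1, d) weighted sum
    return context.squeeze(1)                              # (batch, d), same size as a hidden state

y_t = torch.randn(8, 32)          # current basket representation
s_prev = torch.randn(8, 12, 32)   # user hidden states from earlier time steps
c_t = attention_context(y_t, s_prev, k=5)   # fed to the LSTM together with y_t
```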

Page 50: Week 10 project presentations

RESULTS

▶ Due to computational limitations, we sampled 10% of the dataset.

▶ We ran the DREAM model on 32,000 users and 44,440 unique items.

▶ Padding was performed differently on each batch

▶ Runtime: 500s for each epoch.

▶ The model saturated after about 10 epochs.

▶ Final Results :

            F1@k     NDCG     Precision@k   Recall@k
Baseline    0.0548   1.2688   0.2822        0.0367
Our Model   0.0493   1.2377   0.2767        0.0303

Page 51: Week 10 project presentations

Key Takeaways and Future Work
▶ Attention did not help in our case.

▶ Learning item embeddings as part of the network, with max pooling as a nonlinear operation, does well at capturing elaborate interactions among multiple factors of items.

▶ Padding specific to each basket performs better than having the same pad length.

Page 52: Week 10 project presentations

Questions ?

Page 53: Week 10 project presentations

Extensions on Generating and Personalizing Bundle Recommendations on Steam

Yiwen Gong, Siyu Jiang, Kuang-Hsuan Lee

Page 54: Week 10 project presentations

Objectives
1. Predict the preference rating of items/bundles given the user
2. Recommend bundles to the given user according to their preference
3. Generate new personalized bundles

Page 55: Week 10 project presentations

Original paper: Architecture

[Architecture diagram: user-item data feeds an Item BPR model and user-bundle data feeds a Bundle BPR model (parameters B_i, P_u, Q_i); candidate items are added to an initial bundle to form candidate bundles, which are compared to produce the recommended bundle.]

Page 56: Week 10 project presentations

Weaknesses of the Original Paper
● Naive model; more features could be considered given the data
  ○ Game genre
  ○ Bundle discount rate
● Unstable results; model AUC varies from 0.63 to 0.88
● Users share a preference for bundle diversity
● Flaws in bundle generation
  ○ Generates new bundles consisting only of items from existing bundles
  ○ Tends to include popular items in bundles (to increase profit, common bundles usually carry a small number of unpopular items)

Page 57: Week 10 project presentations

Original Bundle Ranking

● 2-step BPR

○ Item BPR

○ Bundle BPR

Page 58: Week 10 project presentations

Extended Bundle Ranking

● 2-step BPR

○ Item BPR

○ Bundle BPR

Page 59: Week 10 project presentations

Extended Bundle Ranking: discount effect

Bundle discount: [0, 1] → tan(x), x: [-π/2, π/2] → sigmoid(x): [0, 1] → Discount effect: [0, T_u]

Discount      Increasing utility
0% - 10%      high
40% - 50%     few
90% - 100%    high
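One plausible reading of the mapping chain above (an assumption, since the formula itself was an image): the discount d in [0, 1] is rescaled to [-π/2, π/2], passed through tan and then a sigmoid, and scaled by a per-user parameter T_u:

```latex
\mathrm{effect}(d) = T_u \cdot \sigma\!\left(\tan\left(\pi\left(d - \tfrac{1}{2}\right)\right)\right),
\qquad d \in [0,1],\quad \sigma(x) = \frac{1}{1+e^{-x}}
```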

Page 60: Week 10 project presentations

Original Bundle Generation: Method and Results
● Recommends bundles with items a user has already bought
● For 100 users, only 7 of the items are new to everyone
● AUC is not always right!
● People can buy the popular items without a recommendation system; one goal of the system is to activate the user's demand

Page 61: Week 10 project presentations

Extended Bundle Generation
● The Item BPR model has learnt information about items outside of all existing bundles, so our bundle generation is able to generate bundles with items new to existing bundles
● Ensure generated bundles consist only of items a user never bought
● Tends to include unpopular items for profit considerations

Page 62: Week 10 project presentations

New Bundle Generation Algorithm (see the sketch below)
1. Assign a picking probability to each item based on its popularity; less popular items get higher probability.
2. Initialize a bundle with a random size from [Average Bundle Size - s, Average Bundle Size + s]. Items are chosen according to their assigned probabilities P_u,i.
3. Generate a candidate set. Choose half of the items in the set from items not bought; choose the other half from all items using P_u,i.
4. Generate new bundles by adding, deleting, or replacing items in the initial bundle using items from the candidate set.
5. Choose the bundle with the largest X_u,b as the new bundle.
6. Repeat steps 3 to 5 until convergence.
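A compact sketch of this generation loop as a simplified local search; `score_bundle` stands in for the learned X_u,b, the popularity-weighted sampling stands in for P_u,i, and all helper names and the toy demo are illustrative:

```python
import random

def generate_bundle(items, popularity, bought, score_bundle, avg_size=4, s=1, iters=20):
    # Step 1: less popular items get higher picking probability.
    inv = {i: 1.0 / (1 + popularity[i]) for i in items}
    total = sum(inv.values())
    p_pick = {i: w / total for i, w in inv.items()}

    def sample(pool, n):
        pool = list(pool)
        return random.choices(pool, weights=[p_pick[i] for i in pool], k=min(n, len(pool)))

    # Step 2: initial bundle with a random size in [avg_size - s, avg_size + s].
    size = random.randint(avg_size - s, avg_size + s)
    bundle = set(sample(items, size))

    for _ in range(iters):                                    # Steps 3-6
        not_bought = [i for i in items if i not in bought]
        candidates = set(sample(not_bought, size)) | set(sample(items, size))
        best = bundle
        for c in candidates:                                  # add / replace moves only (simplified)
            for trial in (bundle | {c}, (bundle - {random.choice(list(bundle))}) | {c}):
                if score_bundle(trial) > score_bundle(best):
                    best = trial
        if best == bundle:                                    # no improving move: converged
            break
        bundle = best
    return bundle

items = list(range(10))
pop = {i: i for i in items}   # item 9 most popular in this toy example
print(generate_bundle(items, pop, bought={0, 1},
                      score_bundle=lambda b: -sum(pop[i] for i in b)))
```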

Page 63: Week 10 project presentations

Results - Bundle Ranking

Page 64: Week 10 project presentations

Results - Bundle Generation

Page 65: Week 10 project presentations

t-SNE embedding of latent representations

Page 66: Week 10 project presentations

Conclusion
● We propose several extensions to the original bundle recommendation method.
● Our method achieves a large improvement in BPR ranking results over the original method.
● Our method achieves better and more reasonable bundle generation for specific users.

Page 67: Week 10 project presentations

Limitations & Future Work
● About 97% of users bought fewer than 10 bundles. If a user only bought a few items or bundles, it is hard to estimate the user's sensitivity to bundle prices.
● The current model generates bundles based on the preferences of users. However, without knowing the commercial information (cost, etc.), it is hard to generate bundles that are beneficial for game distributors.

Page 68: Week 10 project presentations
Page 69: Week 10 project presentations

TransRec: Smarter Translation Vectors

Rajiv Pasricha

Page 70: Week 10 project presentations

Original Paper

Translation-based Recommendation, by Ruining He, Wang-Cheng Kang, and Julian McAuley

● Sequential model for recommendation
  ○ Embed users and items into a low-dimensional "translation space"
  ○ Each user travels along their personalized trajectory of item interactions

Page 71: Week 10 project presentations

The TransRec Model

● Probability of next item j given user u and previous item i
● β_j = item bias (captures overall item popularity)
● d = distance function (e.g. L1 or L2)
● γ_i = previous item factors, γ_j = next item factors
● T_u = user translation vector
● Φ, Ψ = transition space and subspace; restricting the factors helps regularization (TransRec: L2 ball)
● Trained using a sequential BPR loss with SGD
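The probability expression itself appeared as an image; a reconstruction consistent with the symbols listed above (and with the original TransRec paper) is:

```latex
\mathrm{Prob}(j \mid u, i) \;\propto\; \beta_j - d\left(\gamma_i + T_u,\; \gamma_j\right),
\qquad \gamma_i + T_u \in \Phi,\;\; \gamma_j \in \Psi \subseteq \Phi
```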

Page 72: Week 10 project presentations

Datasets and Evaluation

Evaluation: AUC

Page 73: Week 10 project presentations

Extensions: Personalization

● Personalized translation vector
  ○ Model "typical" sequences of items that are common across users
  ○ AUC on the Amazon Video Games dataset: 0.7610 → 0.7633

Page 74: Week 10 project presentations

Extensions: Temporal Dynamics

● Time Delta model
  ○ Incorporate the time delay between interactions
  ○ Interactions that are farther apart can have larger translations between them
  ○ Amazon Video Games dataset: 0.7610 → 0.7544
● Personalized Time Delta model
  ○ Add a user-specific scaling factor to the above time deltas
  ○ Learn the scaling factor from the data
  ○ Amazon Video Games dataset: 0.7610 → 0.7570

Page 75: Week 10 project presentations

Extensions: Extra Translation Vector

● Introduce separate user offsets for short-term and long-term interactions
  ○ Learn two translation vectors per user, with a threshold at time delay = 6 months
  ○ Allow users to exhibit different tendencies based on temporal data
  ○ If the delay is under 6 months, the short-term vector is used; if over 6 months, the long-term vector is used
  ○ Amazon Video Games dataset: 0.7610 → 0.7646

Page 76: Week 10 project presentations

Extensions: Nonlinear Translation Vectors

● Use a neural network to model more complex translation relationships
● (1) Model a nonlinear relationship between the previous item and the user translation vector
  ○ Amazon Video Games dataset: 0.7610 → 0.7661
● (2) Directly estimate the probability of transitioning to the next item
  ○ Amazon Video Games dataset: 0.7610 → 0.7552

Page 77: Week 10 project presentations

Extensions: Nonlinear Temporal Models

● Add temporal information to the nonlinear neural network models
● Add the delta between the previous and next interaction times
  ○ Neural net translation vector model: Amazon Video Games dataset: 0.7610 → 0.7661 → 0.7665
  ○ Neural net distance model: Amazon Video Games dataset: 0.7610 → 0.7552 → 0.7629
● Add the raw previous and next interaction times
  ○ Neural net translation vector model: Amazon Video Games dataset: 0.7610 → 0.7661 → 0.7662
  ○ Neural net distance model: Amazon Video Games dataset: 0.7610 → 0.7552 → 0.7661

Page 78: Week 10 project presentations

Visualization

● “Transition space” learned by the model (when k = 2)

Training sequence of items for one user in the dataset

Page 79: Week 10 project presentations

Visualization (without normalization)

● “Transition space” learned by the model (when k = 2)

Training sequence of items for one user in the dataset

Page 80: Week 10 project presentations

Discussion and Future Work

● Adding nonlinear translation vectors helps the model learn more complex relationships between items.

● Adding temporal information helps when integrated with nonlinear models.

● It will be helpful to also compare results using different evaluation metrics in addition to AUC, e.g. Hit@50

● Additional visualizations; come up with a model that more clearly arranges items sequentially in the transition space.

Page 81: Week 10 project presentations

Questions?

Page 82: Week 10 project presentations

Extensions to Personalized Ranking Metric Embedding

(PRME)

- Shreyas Udupa Balekudru

Page 83: Week 10 project presentations

Problem Statement

Next New POI Recommendation problem: new POIs with respect to the user's current location are to be recommended.

Input: User ID, Current POI, Physical Location (Latitude and Longitude), Check-In Time
Output: Recommended POI

Page 84: Week 10 project presentations

Dataset

Foursquare check-ins
Check-ins in Singapore between 08/2010 and 07/2011
Number of check-ins = 151,589
Number of users = 2,321
Number of POIs = 5,596

Page 85: Week 10 project presentations

Training for PRME

Sequential transition space and user preference space (weights of the metric spaces parameterized)
Stochastic Gradient Descent
Model parameters initialized with a normal distribution
If the check-in time difference is greater than a threshold, only parameters in the user preference space are updated.

Page 86: Week 10 project presentations

Hyperparameters Used

K = 20
Number of iterations = 1000
Alpha = 0.2
Learning rate = 0.005
Regularization factor = 0.03

Page 87: Week 10 project presentations

Incorporating Distance (PRMEG)

Include geographical distance as a multiplicative factor in the distance metric.
Users prefer visiting nearer POIs over farther POIs.

Page 88: Week 10 project presentations

Issues Faced

Units for distance are not specified in the original paper.
Training time for higher values of k is prohibitive (training algorithm complexity: O(IK|C|)).
Dates in the data are specified with an ID. It is unclear if consecutive IDs represent consecutive days.

Page 89: Week 10 project presentations

Evaluation Metric
Mean Reciprocal Rank (MRR)

where Q is the number of queries and Rank_i is the rank of the next POI in the test set, compared against 20 randomly sampled 'negative' POIs from the dataset.
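The formula itself was an image; the standard MRR definition matching this description is:

```latex
\mathrm{MRR} = \frac{1}{|Q|}\sum_{i=1}^{|Q|}\frac{1}{\mathrm{Rank}_i}
```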

Page 90: Week 10 project presentations

Results

The higher the value of k, the better the results.
Not all functions of the geographical distance lead to improved performance.

Page 91: Week 10 project presentations

Visualization

Page 92: Week 10 project presentations

Work in Progress

Re-evaluating the PRMEG results from the paper as a sanity check.
Interpretation of the metric embedding visualization.
Evaluating a PRMEG-like approach for new product recommendation using rating as a distance metric.

Page 93: Week 10 project presentations

Questions?

Page 94: Week 10 project presentations

Wednesday Presentations

Page 95: Week 10 project presentations

Personalized Next Song Recommendation

Kiran Kannar, Rahul Dubey
Dec 06, 2017

Page 96: Week 10 project presentations

Problem Statement

Given a user's song listening history, provide personalized next-song recommendation using metric embeddings.

[Figure: example listening sequence s1-s4 with "Viva La Vida" (Coldplay), "Just The Way You Are" (Bruno Mars), "Firework" (Katy Perry), and an unknown next song to predict]

Page 97: Week 10 project presentations

Datasets

Measure                Now Playing   30 Music
# sessions             9,288         100,000
# users                1,032         7,146
# tracks               76,652        694,817
Avg. # sessions/user   9             9.33

Page 98: Week 10 project presentations

PRME Model

Transition Probability

MAP

Gradient update equation

Page 99: Week 10 project presentations

PRME-Au: Personalizing alpha

- Non-convex problem
- Use an alternating minimization technique
- Empirical results showed random normal clipping works better than sigmoid/tanh or 0/1 clipping
- Best initialization: initialize to the global alpha of PRME!
- Bounding the tradeoff works better than an unbounded tradeoff

Page 100: Week 10 project presentations

PRME Social
Similarity Score (asymmetric)

MAP

Gradient update equation

Page 101: Week 10 project presentations

Results

Page 102: Week 10 project presentations

AUC vs iterations

Now Playing 30 Music

Page 103: Week 10 project presentations

Metrics vs. Dimensions

MRR Hit Rate

Page 104: Week 10 project presentations

Visualizing songs in sequence space

Page 105: Week 10 project presentations
Page 106: Week 10 project presentations
Page 107: Week 10 project presentations
Page 108: Week 10 project presentations
Page 109: Week 10 project presentations

Alpha_U statistics - I: (30M dataset)

Median: 0.1964205

Mean: 0.216677464557

Standard deviation: 0.146711865691

Minimum value: 9.9e-05

Maximum value: 0.931338

Page 110: Week 10 project presentations

Alpha_U statistics - II: (30M dataset)

Page 111: Week 10 project presentations

Alpha_U statistics - III: (30M dataset)

Page 112: Week 10 project presentations

Thank you!

Page 113: Week 10 project presentations

FashionGAN: A generative model for fashion recommendation

By Vignesh Gokul

Page 114: Week 10 project presentations

Base paper
● Learning Visual Clothing Style with Heterogeneous Dyadic Co-occurrences (Andreas Veit, Balazs Kovacs, Sean Bell, Julian McAuley, Kavita Bala and Serge Belongie)
● The paper implements a Siamese CNN with strategic sampling to learn an embedding space for all items, and uses these embeddings to build a better item recommender system

Page 115: Week 10 project presentations

Siamese CNN Architecture

Page 116: Week 10 project presentations

FashionGAN
● A generative model which outputs a compatible image given an input image
● Conditioned on the input image
● Related Work:
  ○ Image-to-Image Translation with Conditional Adversarial Networks

Page 117: Week 10 project presentations

Image to Image Translation with CGANs

Page 118: Week 10 project presentations

FashionGAN

Page 119: Week 10 project presentations

Siamese GAN

Figure: Architecture of the Generator

Page 120: Week 10 project presentations

Siamese GAN(Results)

Page 121: Week 10 project presentations

Evaluation

● Inception score
● Opposite SSIM
● Can improve the Siamese GAN by using variational encoders.

Model                Inception score   Opposite SSIM
Image to Image GAN   3.2733448         0.64017811
Siamese GAN          1.9960622         0.6160975

Page 122: Week 10 project presentations

Another Extension (Work in progress)
● Use deep supervision to improve Siamese CNNs

Page 123: Week 10 project presentations

Questions?

Page 124: Week 10 project presentations

TransNets: Learning to transform for Recommendation

By: Akanksha Grover, Dhruv Sharma, Rishab Gulati

Page 125: Week 10 project presentations

TransNets: Using Review Text for rating prediction

● TransNets represents users and items using the past reviews given by/to them

● Learns a latent representation of the prospective review using the interactions between <User, Item>

● Optimizes the MSE of the ratings produced

● Models the interaction between a user and an item using only the reviews

● TransNet-Ext models the interaction using both the user-item latent vector and the review as input

Page 126: Week 10 project presentations

Dataset and Code
● Original model uses the Yelp 2016 dataset: https://www.yelp.com/dataset_challenge

● We ran the model on Yelp 2017 dataset.

● Data Statistics:

○ 4,700,000 reviews

○ 156,000 businesses

○ 1,100,000 users

● This data is larger than the original data the model was run on when the paper was published.

● Original code taken from: https://github.com/rosecatherinek/TransNets

● Modifications have been done to the above code for extensions

Page 127: Week 10 project presentations

Train, Test and Validation Epochs

● We divided the entire dataset including all reviews into 3 parts - train, test and validation

sets randomly.

● Number of datapoints:

○ Train-3,789,517

○ Test-473,689

○ Validation-473,689

● Due to computational reasons, we limited our experiments to a single training epoch.

● We ran the original model and our modified model for one epoch and compared the MSE

value.

Page 128: Week 10 project presentations

Original Model
● A Target Network that processes the target review rev_AB
● A Source Network that processes the texts of the (user_A, item_B) pair, which do not include the joint review rev_AB
● The original model concatenates all user reviews and all item reviews (except the common review) for the user and the item, up to a length of 1000 words

Page 129: Week 10 project presentations

TransNet Original

1. Running time: 9 hours per epoch

2. Review length = 800

3. Embeddings trained on the top 50K frequent words in Yelp 2017

4. Test MSE: 1.81

Page 130: Week 10 project presentations

Extension 1:

Issues:
● The original model concatenates all reviews into a single composite review
● It does not consider variation in review text across reviews of different ratings
● Requires a large matrix of word embeddings for each user/item/review

Proposed Solution:
● Each column should represent a review's embedding
● Reviews are sorted by rating to allow the CNN to learn the variation via spatial correlation in the matrix
● Requires only a matrix of K x (embedding size)

Page 131: Week 10 project presentations

Extension 1:

Each column is the latent representation of a user review/item review.

Review Embeddings

A set of k user/item reviews sorted by rating in increasing order

Page 132: Week 10 project presentations

Experiments
We sampled the reviews for each user/item using the following methods:

● For all the methods we fixed a threshold (K), i.e. the number of user/item reviews to be sampled. We took values of k=10 and k=20.

● Review embeddings were made by summation of the word embeddings of all the words in a review in the first three experiments.

● Review embeddings were learnt from a separate DeepCoNN network in the last experiment.

● We did a total of 4 experiments for this extension.

● One training epoch takes about 3-4 hours for each experiment.

Page 133: Week 10 project presentations

Experiments
1. Sample K Reviews using User/Item Reviews + Global Sampling
  ○ We randomly sampled 'k' user/item reviews and sorted them in increasing order of review rating.
  ○ If the user or item had fewer than 'k' reviews, we sampled reviews from a global set of all reviews.

MSE: 1.923229

Possible issue: sampling from a global set of reviews might not be relevant for the <user, item> pair.

Page 134: Week 10 project presentations

Experiments
2. User/Item Reviews + Corresponding Item/User Reviews + Global Sampling
  ○ We randomly sampled 'k' user/item reviews and sorted them in increasing order of review rating.
  ○ If the user or item had fewer than 'k' reviews, we sampled reviews of the corresponding item/user from the training data. If the reviews were still fewer than 'k', we sampled the rest from a global set of all reviews.

MSE: 1.858533

Possible issue: most users and items have very few reviews, so most training samples are for cold-start users/items.

Page 135: Week 10 project presentations

Experiments
3. Filtered Training Data + User/Item Reviews + Corresponding Item/User Reviews + Global Sampling
  ○ The sampling process was kept the same as (2), but we filtered the training set to keep only those users/items which had at least 'k' reviews.

MSE: 1.886834

Possible issue: training on data with no cold-start <user, item> pairs did not generalize well to the test set.

Page 136: Week 10 project presentations

Experiments
4. Generate Review Embeddings using DeepCoNN
  ○ The sampling process was kept the same as (2).
  ○ Review embeddings were generated by a separate DeepCoNN network trained on a small sample of the training set with equal representation of each rating.
  ○ Acts as a step to generate pretrained "review embeddings".
  ○ Tried the review embeddings only with Experiment 2.

MSE: 1.730112

Performs the best among all experiments and the baseline.

Page 137: Week 10 project presentations

Results for Extension 1

MSE
Baseline              1.813865
Experiment 1 (k=10)   1.923229
Experiment 2 (k=10)   1.858533
Experiment 2 (k=20)   1.891809
Experiment 3 (k=10)   1.886834
Experiment 4 (k=10)   1.730112

                             Baseline    Extension (k=10)   Extension (k=20)
Number of input parameters   153,600     52,480             53,760
Hours to train per epoch     ~9.5 hrs    ~4 hrs             ~5 hrs

Page 138: Week 10 project presentations

Extension 2
● We show that TransNet can be used for tips generation on the Yelp dataset
● Inspired by the paper "Neural Rating Regression with Abstractive Tips Generation for Recommendation", Piji Li et al.
● The review latent representation learned from TransNets' transform layer is used as a context vector to generate tips

Page 139: Week 10 project presentations

Methodology(1/3)

● The original Yelp dataset has about 400,000 data points that have both reviews and tips. We take the top 50,000 training data points.

● Train the entire TransNet for just 1 epoch.

● Transfer the output of the Transform Layer for each data point in the train set as well as the test set to the RNN.

Page 140: Week 10 project presentations

Methodology (2/3)
● Sequence length = 3
● GloVe embeddings of the most common 50,000 words in reviews.
● Added an <UNK> word to represent all words not in the vocabulary.
● Add embeddings of 2 <UNK> vectors and the embedding of the first word of the tip -> 50-dim vector.
● Concatenated with the 64-dimension vector from TransNet that represents the corresponding review.
● The concatenated vector is fed as the input at each step (see the sketch below).
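A minimal sketch of the decoder described above, assuming the dimensions from the slides (a 64-d TransNet review vector concatenated with a 50-d word embedding at every step of a recurrent decoder); module and variable names are illustrative, not the presenters' code:

```python
import torch
import torch.nn as nn

class TipDecoder(nn.Module):
    def __init__(self, vocab_size, word_dim=50, context_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.gru = nn.GRU(word_dim + context_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_ids, context):
        # word_ids: (batch, seq_len); context: (batch, 64) TransNet review vector
        words = self.embed(word_ids)                                  # (batch, seq_len, 50)
        ctx = context.unsqueeze(1).expand(-1, words.size(1), -1)      # repeat the context per step
        hidden, _ = self.gru(torch.cat([words, ctx], dim=2))
        return self.out(hidden)                                       # next-word logits per step

decoder = TipDecoder(vocab_size=50_000)
logits = decoder(torch.randint(0, 50_000, (4, 3)), torch.randn(4, 64))  # sequence length 3, as on the slide
```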

Page 141: Week 10 project presentations

Methodology (3/3)

● We train for 500 epochs, for about 6 hrs.

● For test data, we used 2,000 data points from the original 121k data points.

● At test time, we concatenate the 64-dimensional review vector from TransNet with the 50-dimensional representation of <UNK>.

● This generates the first word of the tip; we then use the embedding of each generated word at each time step, concatenated with the vector from TransNet.

● We sample words based on the output probability.

Page 142: Week 10 project presentations

Generated Tips

made place to beer . . . this amazing i i

they great place my to sushi onion spot in dog .

tea pizza . . week ' chicken items place very service

sale , great . this worth to cardio were this don

lots spot hair some is they be but no amazing tim

Page 143: Week 10 project presentations

Baseline and Evaluation
● We use LexRank as our baseline.
● LexRank produces a summary of the whole review.
● We calculate ROUGE-1 and ROUGE-2 as our evaluation measures.

           LexRank                                TransNet + RNN
Score      Precision   Recall      F-1 Score      Precision   Recall      F-1 Score
ROUGE-1    0.0694001   0.0451601   0.0456172      0.0242294   0.0292329   0.0239379
ROUGE-2    0.0025476   0.0013569   0.0014726      0.0003033   0.0003055   0.0002377

Page 144: Week 10 project presentations

Conclusion & Future Work

1. Pre-trained review embeddings gave the highest boost

2. We are still not very sure about the best way to sample the K reviews of the user/item, and we want to investigate further how review embeddings change our results

3. There is no analysis on robustness to temporal change

4. The absolute value of MSE is very high

5. Future Work:

a. Combine temporal signals and use global, user and item biases

b. Extend the Transnet model to use implicit feedback and ranking prediction

c. Evaluate Tip Generation against more baselines

Page 145: Week 10 project presentations

Questions

Pages 146-162: Week 10 project presentations (figure-only slides, no extracted text)

Neural Collaborative Filtering [He et al. 2017] Extensions

Kulshreshth Dhiman, Sai Kolasani

Page 163: Week 10 project presentations

Overview

● Extensions to Neural Collaborative Filtering [He et al. 2017]
● Extensions
  1. Pairwise Ranking
  2. Cold start
  3. Experiments with Architecture
● Dataset
  ○ MovieLens 20M user-item interactions (converted to implicit feedback)
  ○ Movie features from IMDB

Page 164: Week 10 project presentations

Model
GMF:
● Uses the inner product of the user and item representations in the latent space.

MLP:
● A multi-layer perceptron network using the concatenation of the user and item latent representations as an input feature.

NeuMF:
● Combines GMF and MLP into a single deep network for better accuracy.

Page 165: Week 10 project presentations

Pairwise Ranking Model
GMF pairwise model (shared weights)

● The pairwise networks use shared weights and a shared user embedding.
● The objective function is modified to maximize the difference between the score of a preferred item and that of a non-preferred item.
● During evaluation we tried two approaches:
  ○ Take the output from the final sigmoid layer (calculating the sigmoid of the difference of scores)
  ○ Take the sigmoid of the output from the linear dense layer just before the sigmoid (this works better)
● Similarly designed for the other models

Page 166: Week 10 project presentations

Pairwise Ranking algorithm

● We can find the position of the positive item in the ranking efficiently using batch prediction (see the sketch below).
  ○ We compute the pairwise comparison of the positive item with all the negative items (N) in a single batch and count the number of times the positive item is preferred (k).
  ○ The rank of the positive item is then N - k.
● If an exact ranking of K items is needed, a heap-based algorithm can be used.
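A small sketch of this batch-prediction trick with a toy dot-product scorer; names and data are illustrative:

```python
import numpy as np

def rank_of_positive(score_fn, user, pos_item, neg_items):
    """score_fn(user, items) -> array of scores; higher means preferred."""
    pos_score = score_fn(user, np.array([pos_item]))[0]
    neg_scores = score_fn(user, np.array(neg_items))       # one batched prediction over all negatives
    k = int((pos_score > neg_scores).sum())                # number of pairwise wins for the positive item
    return len(neg_items) - k                              # 0 = ranked above every negative

# Toy scorer: dot product of user and item embeddings.
rng = np.random.default_rng(0)
U, V = rng.normal(size=(10, 8)), rng.normal(size=(100, 8))
score_fn = lambda u, items: V[items] @ U[u]
print(rank_of_positive(score_fn, user=3, pos_item=7, neg_items=list(range(10, 40))))
```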

Page 167: Week 10 project presentations

Cold Start ModelGMF Cold start model

Page 168: Week 10 project presentations

Cold Start ModelMLP Cold start model

Page 169: Week 10 project presentations

Cold Start ModelNeuMF Cold start model

Page 170: Week 10 project presentations

Dataset
● MovieLens 20M* dataset with user movie ratings from 2012 to 2015, converted to implicit feedback data.
● Sampling
  ○ Movies released after 1990
  ○ At least 20 items per user
  ○ Randomly sampled 7,000 users

*https://grouplens.org/datasets/movielens/20m/

Data Statistics

#users           7,000
#items           8,491
#ratings         724K
Sparsity         98.78%
Items per user   Min: 20, Max: 1,980, Median: 59
Users per item   Min: 1, Max: 3,609, Median: 12

Page 171: Week 10 project presentations

Item Features

● Collected item features using http://www.theimdbapi.org/ API

● From the data extracted from IMDB we used the following features (sketched below):
  ○ Year of release: binned years (5-year bins), then one-hot encoding
  ○ Genre: many-hot encoding over 24 different genres
  ○ Text features: we used the gensim doc2vec library to learn vector representations of the storylines of the movies in the training set
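A sketch of the three feature types listed above, with toy data and illustrative names (gensim 4.x Doc2Vec syntax assumed; the slide uses 24 genres, trimmed to 3 here):

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

GENRES = ["Action", "Comedy", "Drama"]
movies = [{"id": "m1", "year": 1994, "genres": ["Comedy", "Drama"],
           "storyline": "two strangers meet on a train and fall in love"}]

def year_one_hot(year, start=1990, end=2015, width=5):
    vec = np.zeros((end - start) // width)          # one slot per 5-year bin
    vec[min((year - start) // width, len(vec) - 1)] = 1.0
    return vec

def genre_many_hot(genres):
    return np.array([1.0 if g in genres else 0.0 for g in GENRES])

docs = [TaggedDocument(m["storyline"].split(), [m["id"]]) for m in movies]
d2v = Doc2Vec(docs, vector_size=16, min_count=1, epochs=20)   # storyline embeddings

m = movies[0]
item_features = np.concatenate([year_one_hot(m["year"]),
                                genre_many_hot(m["genres"]),
                                d2v.dv[m["id"]]])             # final concatenated item feature vector
print(item_features.shape)
```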

Page 172: Week 10 project presentations

Train-Test set
● Randomly sampled 10% of items as cold-start items; the rest go in the training set
● Test set: latest 2 positive user-item pairs per user
● Test set - Cold-start (completely new items): no user-item pairs in the train set
● Test set - Pseudo cold-start (relatively new items): 10 positive user-item pairs in the train set, the rest in the test set
● Train set: #negatives per positive user-item pair = 4

Page 173: Week 10 project presentations

Evaluation

● For evaluating the performance of the model we use:
  ○ HR: Hit Rate@10
  ○ NDCG: Normalized Discounted Cumulative Gain (NDCG@10), computed as log 2 / log(rank + 2)
  ○ AUC
● Randomly sampled 99 negative user-item pairs (not in the train set) and ranked the positive item among the negative items

Page 174: Week 10 project presentations

Results - Base

Page 175: Week 10 project presentations

Results - Base

Page 176: Week 10 project presentations

Results - Pairwise

Pointwise performed better than pairwise

Page 177: Week 10 project presentations

Results - Cold start
● Cold start models outperformed base models
● Item features improved performance over the general test set

Page 178: Week 10 project presentations

Results - Cold start
● Cold start models performed better than base models
● MLP had higher hit rates

Page 179: Week 10 project presentations

Results - Cold start
● MLP models had relatively high hit rates

Page 180: Week 10 project presentations

Results - Cold start
● NeuMF cold start model had the highest AUC

Page 181: Week 10 project presentations

Architecture Experiments
● NeuMF: shared embeddings for the GMF and MLP models
  ○ GMF and MLP learn different latent spaces
● GMF: add a dense layer after the MF layer (dim = latent_size/2)

           HR       NDCG     AUC
Separate   0.8039   0.5021   0.9288
Shared     0.7962   0.4975   0.9261

HR@10
PF    GMF with dense layer   Base GMF
8     0.7981                 0.7986
16    0.7999                 0.8046
32    0.7999                 0.8046

Page 182: Week 10 project presentations

Conclusion

● Pointwise ranking model works better than pairwise ranking model

● Item features like storyline, genre, and year improve the hit rates for cold-start as well as non-cold-start items

● GMF had higher hit rates for non-cold start items and MLP had higher hit-rates for cold-start items

Page 183: Week 10 project presentations

Questions? Thank you!

Page 184: Week 10 project presentations

Jointly Modeling Aspects, Ratings and Sentiments for Movie Recommendation (JMARS)

Presented By: Rishabh Misra, Tushar Bansal

Page 185: Week 10 project presentations

Problem Statement

● Motivation: Uncovering aspects and sentiments from reviews could provide a better understanding of users, movies (items), and the process involved in generating ratings.

● Approach: Capture the interest distribution of users and the content distribution for movies and provide a link between interest and relevance on a per-aspect basis. Authors also differentiate between positive and negative sentiments on a per-aspect basis. This all leads to better rating prediction.

Page 186: Week 10 project presentations

Model

Page 187: Week 10 project presentations

Algorithm

● Objective:

● EM Algorithm
  ● E-Step: sample {y, z, s} for each word from the current distribution
  ● M-Step:
    ○ Fix the sampled {y, z, s} for each word
    ○ Optimize the other parameters using L-BFGS.

Page 188: Week 10 project presentations

Data

Original Paper: IMDB dataset
● 54,671 Users | 22,380 Movies | 348,415 Reviews

Our Implementation:
● Amazon Clothing Category Dataset
  ○ 1,981 Users | 1,962 Items | 11,935 Reviews
● Amazon Instant Video Dataset
  ○ 2,000 Users | 1,643 Items | 14,355 Reviews
● We opted for small datasets because JMARS inference on a large number of reviews is computationally expensive and time intensive, and we spent most of our time implementing the original method.

Page 189: Week 10 project presentations

Extension
● Add temporal dynamics to user latent factors, biases, and interest distribution.
● Idea borrowed from Collaborative Filtering with Temporal Dynamics (Koren, 2009)
● This formulation doesn't lead to a significant increase in parameters.

Page 190: Week 10 project presentations

Quantitative Results

Amazon Clothing Data
                     Without Temporal Dynamics   With Temporal Dynamics   Improvement
Baseline             1.1505                      1.1420                   0.74%
JMARS (A=6; K=5)     1.1251                      1.1152                   0.88%
JMARS (A=12; K=5)    1.1244                      1.1150                   0.84%

Amazon Video Data
                     Without Temporal Dynamics   With Temporal Dynamics   Improvement
Baseline             1.1269                      1.1170                   0.88%
JMARS (A=6; K=5)     1.0945                      1.0843                   0.93%

Baseline: JMARS without language models (i.e. a simple latent factor model).
Evaluation Metric: MSE

Page 191: Week 10 project presentations

Qualitative Results

● Background Words
  ○ Price, Product, Picture, Fit, Wear, Quality, Purchase, Material
● General Sentiment Words
  ○ Positive
    ■ Comfort, Nice, Well, Love, Buy, Good, Great, Pretty
  ○ Negative
    ■ Problem, Waste, Flaw, Review, Nothing, Worst
● Aspect Words
  ○ Material/Color
    ■ Color, Material, Elastic, Light, Care, Weather
  ○ Size/Fit
    ■ Tight, Wear, Comfort, 8/10 (shoe sizes), Inch

Page 192: Week 10 project presentations

Qualitative Results

● Aspect Sentiment Words
  ○ Material/Color
    ■ Great, Design, Soft, Quality, Durable, Cheap
  ○ Size/Fit
    ■ Shrink, True, Doesn’t/Don’t, Small, Thick
● Item-Specific Words
  ○ Item 1
    ■ Bag, Compartment, Pocket, Purse
  ○ Item 2
    ■ Shoe, Clarks, Merrell, Timberland

Page 193: Week 10 project presentations

Temporal Effect
Interest distribution change for the aspect material/color.

Date: 06/11/2013

My hubby is hard on his shoes, so I like to find him good ones at a reduced price, such as these. He likes the fit and feel of New Balance, so these will be his next pair when his current ones are too tattered to wear anymore. Good grippy sole for our rocky western trails, and decent laces that shouldn’t break with his hard use.

Date: 04/02/2014

Thanks to another reviewer I got the green ones instead of the raspberry. The green insoles have just the right arch support for my plantar fasciitis-ridden feet. I am glad to have them in my everyday Merrell slip on shoes. These insoles are not too soft, but soft enough, and after just one day of wear I don't notice them at all, which is perfect. Based on the last pair (I had the raspberry) I expect about a year from these, but will happily accept a longer wear time from them.

Page 194: Week 10 project presentations

Conclusion and Future Work

● The extension did improve on the current model, but only by a small amount.

● The reasons for only a small improvement could be:
  ○ The dataset we use is relatively small (because of limited resources), with few reviews per user, so the temporal dynamics might not be learned properly.
  ○ The linear time function might not be the best way to capture the temporal dynamics across different aspects. Other options like binning might work better.

● Add hierarchical structuring to the language models.

Page 195: Week 10 project presentations

Questions?

Page 196: Week 10 project presentations

TransNets++: Learning to Translate Better by Accounting for Higher Order Interactions

Sejal Shah, Siddharth Dinesh

Page 197: Week 10 project presentations

Goal

What effect does the inclusion of higher order interactions have on a complex feature extraction mechanism such as TransNets?

Motivation
Neural networks are predominantly used for preprocessing of data in recommender systems

Neural factorization machines have not been evaluated in settings where the features are neurally extracted

Page 198: Week 10 project presentations

TransNets

Page 199: Week 10 project presentations

Factorization Machines

Neural Factorization Machine Plain Old Factorization Machine

Page 200: Week 10 project presentations

Implementation of the paper
1. Data: Yelp Dataset 2017
   a. 4.7 million reviews
   b. The TransNets paper uses only 4.1 million reviews; the filtering criteria are unclear
2. Results
   a. Our implementation resulted in an MSE of 1.7559 (random epochs, filtered reviews)
   b. Used the result from the TransNet implementation as our baseline

Page 201: Week 10 project presentations

Extension 1: L2 Loss

● TransNets optimizes the Factorization Machine using L1 loss.
● We report MSE, so it makes sense to optimize L2 loss directly.

Page 202: Week 10 project presentations

Extension 2: Batch Normalization
● Batch Normalization is new-age alchemy to induce faster convergence of SGD

Page 203: Week 10 project presentations

Extension 3: Neural Factorization Machines
● Added neural layers to the factorization machine
● Experimented with 0, 1, and 2 hidden layers.

Page 204: Week 10 project presentations

Conclusions
● The number of training epochs is important when comparing results
● The creation of training epoch batches results in variance in the MSE of TransNet predictions, as TransNet only considers 1000 words from the reviews
● The Neural Factorization Machine only slightly improves predictions when the input is already constructed using non-linear transformations
● Would NFM improve rating prediction if one-hot embeddings of users and items were also served as input to the Factorization Machine?
  ○ Like in TransNet-Ext
● How much do these results depend on the dataset?
  ○ Confirm the lack of improvement from NFM using another dataset:
    ■ Google Local
    ■ Amazon Reviews

Page 205: Week 10 project presentations

Questions?

Page 206: Week 10 project presentations

Efficient Bayesian Methods for Graph-based Recommendation Systems

Aditi Mavalankar, Stephanie Chen and Ajitesh Gupta

Page 207: Week 10 project presentations

Original Model Overview

● Authors proposed a fast graph based method for general purpose recommendation

● It scores all items available on a 3-step path from the user in order to provide new recommendations.

● Scoring is done by making use of probability distributions based on the item ratings

[Diagram: a 3-step path from the Target User (1) to Item 1, (2) to User X, (3) to Item 2, a potential recommendation]

Page 208: Week 10 project presentations

Original Model Overview - Reliability of Item

● Binary random variable Y_j = 0 for a negative assessment, 1 for a positive assessment
● P(Y_j = 1) = θ_j ~ reliability of the item, modelled with a Beta distribution
  ∴ P(θ_j | Ratings) ~ Beta(R+, R-) (conjugate distributions)
● R+ = number of positive ratings
● R- = number of negative ratings

Page 209: Week 10 project presentations

Original Model Overview - Scoring Functions

● Posterior Inequality Scoring (PIS) - Probability of the reliability of candidate item x being greater than the reliability of item v in the user history.

● Posterior Prediction Scoring (PPS) - Probability of both v and x receiving positive assessments where we assume that Yv and Yx are independent.

● Posterior Odds Ratio Scoring (PORS) - How large the odds of x receiving a positive assessment is when compared to the odds of v receiving a positive assessment

[Diagram: 3-step path Target User → Item V → User W → Item X (potential recommendation)]
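A small sketch of how these scores could be computed from the Beta posteriors above; this is one plausible reading (PIS via Monte Carlo, PPS as the product of posterior means under independence), not necessarily the paper's exact closed forms:

```python
import numpy as np
from scipy.stats import beta

def pis(r_pos_v, r_neg_v, r_pos_x, r_neg_x, n=100_000, rng=np.random.default_rng(0)):
    """PIS ~ P(theta_x > theta_v) under independent Beta posteriors, by Monte Carlo."""
    theta_v = beta.rvs(r_pos_v, r_neg_v, size=n, random_state=rng)
    theta_x = beta.rvs(r_pos_x, r_neg_x, size=n, random_state=rng)
    return (theta_x > theta_v).mean()

def pps(r_pos_v, r_neg_v, r_pos_x, r_neg_x):
    """PPS ~ P(Y_v = 1, Y_x = 1) assuming independence: product of posterior means."""
    return beta.mean(r_pos_v, r_neg_v) * beta.mean(r_pos_x, r_neg_x)

# Item v in the user's history: 40 positive / 10 negative ratings; candidate x: 15 / 2.
print(pis(40, 10, 15, 2), pps(40, 10, 15, 2))
```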

Page 210: Week 10 project presentations

Positives of the original model

● Existing approaches often use random walks
  ○ Large number of transition matrices to be stored
  ○ Large matrix multiplication operations
  ○ Large number of simulations to converge in some cases
● No matrix multiplications ⇒ 1-2 orders of magnitude faster
● No large matrices to store ⇒ much lower space complexity

Page 211: Week 10 project presentations

Negatives of original model - Motivation for extensions

● It does not involve user information in the process of recommendation
  ○ Binary interactions: how similar are users to each other?
  ○ Unary information: how experienced is the user? How many items have they rated before?
● It also does not involve the binary interactions between items
  ○ How similar are two items?

Page 212: Week 10 project presentations

Extension 1 - User reliability score

● Users that give ratings to more items are more significant.
● We generate a reliability score for each user, and multiply each item's PIS/PPS/PORS score by it to determine whether an item ought to be recommended.

[Diagram: Target User → Item V → User W → Item X; Modified_score(I_x) = Rel(U_w) * Score(I_x)]

Page 213: Week 10 project presentations

Extension 2 - User similarity score

[Worked example: users U1 and U2 are connected to items I1-I4; per-item agreement between the two users is computed over the common items, and similarity_score = similarity / total common items = 2 / 4 = 0.5]
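A tiny sketch matching the worked example above, assuming the similarity counts the common items on which both users gave the same assessment, normalized by the number of common items (an assumed reading of the figure):

```python
def user_similarity(ratings_u, ratings_v):
    """ratings_*: dict item -> 1 (positive) or 0 (negative) assessment."""
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0
    agree = sum(1 for i in common if ratings_u[i] == ratings_v[i])
    return agree / len(common)

u1 = {"I1": 1, "I2": 1, "I3": 0, "I4": 1}
u2 = {"I1": 1, "I2": 0, "I3": 1, "I4": 1}
print(user_similarity(u1, u2))   # 2 agreements out of 4 common items -> 0.5
```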

Page 214: Week 10 project presentations

Extension 2 - User similarity score

Page 215: Week 10 project presentations

Extension 3 - Item similarity score

Page 216: Week 10 project presentations

User similarity and item similarity

User Similarity Heatmap Item Similarity Heatmap

Page 217: Week 10 project presentations

Mean Average Precision

Page 218: Week 10 project presentations

Mean Reciprocal Rank

Page 219: Week 10 project presentations

Precision@5

Page 220: Week 10 project presentations

Precision@10

Page 221: Week 10 project presentations

Normalized Discounted Cumulative Gain@5

Page 222: Week 10 project presentations

Normalized Discounted Cumulative Gain@10

Page 223: Week 10 project presentations

Results on ML-100k

Method MAP MRR P@5 P@10 NDCG@5 NDCG@10

PIS 0.1459 0.4173 0.2049 0.1654 0.2482 0.2310

PIS_USS 0.1472 0.4209 0.2049 0.1667 0.2486 0.2319

PIS_ISS 0.1476 0.4266 0.2023 0.1653 0.2464 0.2307

PIS_USS_ISS 0.1479 0.4264 0.2023 0.1657 0.2465 0.2309

PPS 0.1531 0.4213 0.2102 0.1724 0.2546 0.2410

PPS_USS 0.1546 0.4235 0.2106 0.1743 0.2554 0.2432

PPS_ISS 0.1546 0.4304 0.2095 0.1727 0.2544 0.2412

PPS_USS_ISS 0.1558 0.4317 0.2076 0.1735 0.2534 0.2415

PORS 0.1147 0.2949 0.1525 0.1330 0.1694 0.1643

PORS_USS 0.1149 0.2931 0.1529 0.1326 0.1693 0.1639

PORS_ISS 0.1188 0.3054 0.1540 0.1372 0.1737 0.1704

PORS_USS_ISS 0.1195 0.3038 0.1559 0.1384 0.1751 0.1716

Page 224: Week 10 project presentations

Conclusion

● User-user similarity is observed to be more useful than item-item similarity.
● Introducing either kind of similarity improves the quality of recommendations.
● The user reliability score proves to be too naive, and hence provides no improvement.
● PPS still remains the top performer among the scoring techniques.
● Since the results are consistent on FilmTrust as well as ML-100k, it is safe to say that similar results will be exhibited on the other 5 datasets used in the original paper.

● Future work: Different algorithms to calculate user and item similarities

Page 225: Week 10 project presentations

THANK YOU!