COBAFI: COLLABORATIVE BAYESIAN FILTERING Alex Beutel Joint work with Kenton Murray, Christos Faloutsos, Alex Smola April 9, 2014 – Seoul, South Korea


  • Slide 1

    COBAFI: COLLABORATIVE BAYESIAN FILTERING
    Alex Beutel
    Joint work with Kenton Murray, Christos Faloutsos, Alex Smola
    April 9, 2014, Seoul, South Korea

  • Slide 2

    Online Recommendation
    [Figure: users-by-movies rating matrix with example ratings]

  • Slide 3

    Online Rating Models

  • Slide 4

    Normal Collaborative Filtering: Fit a Gaussian - Minimize the error
    Reality: Minimizing error isn't good enough - Understanding the shape matters!

  • Slide 5

    Online Rating Models
    Normal Collaborative Filtering: Fit a Gaussian - Minimize the error
    Our Model

  • Slide 6

    Our Goals and Challenges
    Given: A matrix of user ratings
    Find: A model that best fits and predicts user preferences
    Goals:
      G1. Fit the recommender distribution
      G2. Understand users who rate few items
      G3. Detect abnormal spam behavior

  • Slide 7

    OUTLINE: 1. Background 2. Model Formulation 3. Inference 4. Catching Spam 5. Experiments

  • Slide 8

    Collaborative Filtering [Background]
    [Figure: ratings matrix X (users by movies) factored into U and V, with genres as the shared dimension]

  • Slide 9

    Matrix Factorization [Background]
    [Figure: X (users by movies) factored into U and V over genres]

  • Slide 10

    Bayesian Probabilistic Matrix Factorization (Salakhutdinov & Mnih, ICML 2008) [Background]

  • Slide 11

    OUTLINE: 1. Background 2. Our Model 3. Inference 4. Catching Spam 5. Experiments

  • Slide 12

    Our Model
    Use user preferences to predict ratings
    Cluster users (& items)
    Share preferences within clusters

  • Slide 13

    The Recommender Distribution
    First introduced by Tan et al., 2013
    Equation terms labeled: normalization, linear, quadratic
    [Figure: distribution shapes with parameter 1 fixed at 0 and parameter 2 varied (-1.0 and 0.4)]

  • Slide 14

    The Recommender Distribution
    Example user vector u_i: 0.3, 0.4, 0.3, 0.2, -0.7, 0.4, 0.3, 0.8, 0.4
    Entries are labeled: genre preferences, general leaning, how polarized
    Goal 1: Fit the recommender distribution

  • Slide 15

    Understanding varying preferences
    [Figure: example ratings from different users]

  • Slide 16

    Resulting Co-clustering
    [Figure: co-clustered U and V]

  • Slide 17

    Finding User Preferences
    Goal 2: Understand users who rate few items

  • Slide 18

    Chinese Restaurant Process
    [Figure: three clusters, labeled 1 to 3]

  • Slide 19

    OUTLINE: 1. Background 2. Our Model 3. Inference 4. Catching Spam 5. Experiments

  • Slide 20

    Gibbs Sampling - Clusters [Details]
    Probability of picking a cluster =
      probability of a cluster based on size (CRP)
      × probability u_i would come from the cluster

  • Slide 21

    Sampling user parameters [Details]
    Probability of user preferences u_i =
      probability of preferences u_i given cluster parameters
      × probability of predicting ratings r_{i,j} using new preferences
    Recommender distribution is non-conjugate. Can't sample directly!

  • Slide 22

    OUTLINE: 1. Background 2. Our Model 3. Inference 4. Catching Spam 5. Experiments

  • Slide 23

    Review Spam and Fraud
    [Figure: accounts giving runs of all-1 and all-5 ratings]
    Image from http://sinovera.deviantart.com/art/Cute-Devil-117932337

  • Slide 24

    Clustering Fraudsters
    [Figure: clusters 1 to 3, showing a previous real cluster and a new spam cluster]

  • Slide 25

    Clustering Fraudsters
    Too much spam gets separated into a fraud cluster.
    Trying to hide just means (a) very little spam or (b) camouflage that reinforces realistic reviews.

  • Slide 26

    Clustering Fraudsters
    [Figure: clusters 1 to 5, labeled naive spammers, spam + noise, hijacked accounts]
    Goal 3: Detect abnormal spam behavior

  • Slide 27

    OUTLINE: 1. Background 2. Our Model 3. Inference 4. Catching Spam 5. Experiments

  • Slide 28

    Does it work?
    Better Fit

  • Slide 29

    Catching Naive Spammers
    Injection
    83% are clustered together

  • Slide 30

    Clustered Hijacked Accounts
    Injection
    Clustered hijacked accounts
    Clustered attacked movies

  • Slide 31

    Real world clusters

  • Slide 32

    Shape of real world data

  • Slide 33

    Shape of Netflix reviews

    Most Gaussian              Most skewed
    The Rookie                 The O.C. Season 2
    The Fan                    Samurai X: Trust and Betrayal
    Cadet Kelly                Aqua Teen Hunger Force: Vol. 2
    Money Train                Sealab 2001: Season 1
    Alice Doesn't Live Here    Aqua Teen Hunger Force: Vol. 2
    Sea of Love                Gilmore Girls: Season 3
    Boiling Point              Felicity: Season 4
    True Believer              The O.C. Season 1
    Stakeout                   The Shield Season 3
    The Package                Queer as Folk Season 4

  • Slide 34

    Shape of Amazon Clothing reviews
    Amazon Clothing Most Skewed Reviews:
      Bra Disc Nipple Covers
      Vanity Fair Women's String Bikini Panty
      Lee Men's Relaxed Fit Tapered Jean
      Carhartt Men's Dungaree Jean
      Wrangler Men's Cowboy Cut Slim Fit Jean
    Nearly all are heavily polarized!

  • Slide 35

    Shape of Amazon Electronics reviews
    Amazon Electronics Most Skewed Reviews:
      Sony CD-R 50 Pack Spindle
      Olympus Stylus Epic Zoom Camera
      Sony AC Adapter Laptop Charger
      Apricorn Hard Drive Upgrade Kit
      Corsair 1GB Desktop Memory
    Nearly all are heavily polarized!

  • Slide 36

    Shape of BeerAdvocate reviews
    BeerAdvocate Most Gaussian Reviews:
      Weizenbock (Sierra Nevada)
      Ovila Abbey Saison (Sierra Nevada)
      Stoudts Abbey Double Ale
      Stoudts Fat Dog Stout
      Juniper Black Ale
    Nearly all are Gaussian!

  • Slide 37

    Hypotheses on shape of data
    Hard to evaluate beyond binary
    Selection bias: only committed viewers watch Season 4 of a TV series
    Hard to compare value across very different items:
      Lots of beers and movies to compare
      Fewer TV shows
      Even fewer jeans or hard drives

  • Slide 38

    Key Points
    Modeling: Fit real data with flexible recommender distribution
    Prediction: Predict user preferences
    Anomaly Detection: When does a user not match the normal model?

  • Slide 39

    Questions?
    Alex Beutel
    [email protected]
    http://alexbeutel.com

  • Slide 40

    Sampling Cluster Parameters
    [Figure: graphical model linking users u_5 and u_6 to cluster a, with cluster hyperparameters (including W) and the priors on them]

  • Slide 41

    Gibbs Sampling - Clusters [Details]
    Probability of a cluster (CRP)
    × probability u_i would be sampled from cluster a

  • Slide 42

    Sampling user parameters [Details]
    Probability of u_i given cluster parameters
    × probability of predicting ratings r_{i,j}
    Recommender distribution is non-conjugate. Can't sample directly!
    Use a Laplace approximation and perform Metropolis-Hastings sampling.

  • Slide 43

    Sampling user parameters [Details]
    Use a candidate normal distribution: mean at the mode of p(u_i), variance from the variance of p(u_i)
    Sample a new candidate
    Metropolis-Hastings sampling: keep the new sample with probability [equation not recoverable]

  • Slide 44

    Sampling Cluster Parameters [Details]
    Priors
    Users/Items in the cluster

  • Slide 45

    Inferring Hyperparameters [Details]
    Solved directly, no sampling needed!
    Prior hidden as additional cluster

  • Slide 46

    Does Metropolis-Hastings work?
    Have to use a non-standard sampling procedure:
      99.12% acceptance rate for Amazon Electronics
      77.77% acceptance rate for Netflix 24k

  • Slide 47

    Does it work?

                           Uniform    BPMF      CoBaFi (us)
    Netflix (24k users)    1.6904     1.2525    1.1827
    BeerAdvocate           2.1972     1.9855    1.6741

    Compare on Predictive Probability (PP) to see how well our model fits the data

  • Slide 48

    Handling Spammers

    Random naive spammers in Amazon Electronics dataset:

              PP Before    PP After
    BPMF      1.7047       1.8146
    CoBaFi    1.0549       1.7042

    Random hijacked accounts in Netflix 24k dataset:

              PP Before    PP After
    BPMF      1.2375       1.3057
    CoBaFi    0.9670       1.2935

  • Slide 49

    Clustered Naive Spammers
    83% are clustered together

  • Slide 50

    Clustered Hijacked Accounts
    Clustered hijacked accounts
    Clustered attacked movies
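
The sampling step on Slides 42 and 43 (a Laplace approximation used as the candidate distribution for Metropolis-Hastings) can be illustrated with a small sketch. This is not the authors' implementation. It assumes a simplified one-dimensional model: ratings 1 to 5 are centered at 3, the rating likelihood is proportional to exp(theta1*c + theta2*c^2), only the linear parameter theta1 is sampled, and the cluster contributes a normal prior with mean mu and standard deviation sigma. All function names, parameter names, and default values are hypothetical.

```python
import math
import random

CENTERED = (-2, -1, 0, 1, 2)  # ratings 1..5 shifted by 3 (an assumption of this sketch)

def moments(theta1, theta2):
    """Return (log Z, mean, variance) of c under p(c) proportional to exp(theta1*c + theta2*c*c)."""
    weights = [math.exp(theta1 * c + theta2 * c * c) for c in CENTERED]
    z = sum(weights)
    mean = sum(w * c for w, c in zip(weights, CENTERED)) / z
    second = sum(w * c * c for w, c in zip(weights, CENTERED)) / z
    return math.log(z), mean, second - mean * mean

def log_posterior(theta1, ratings, theta2, mu, sigma):
    """Unnormalized log posterior: rating likelihood plus a normal (cluster) prior."""
    log_z, _, _ = moments(theta1, theta2)
    cs = [r - 3 for r in ratings]
    log_lik = sum(theta1 * c + theta2 * c * c for c in cs) - len(cs) * log_z
    return log_lik - 0.5 * ((theta1 - mu) / sigma) ** 2

def laplace_fit(ratings, theta2, mu, sigma, steps=50):
    """Newton's method for the posterior mode; variance is the negative inverse curvature."""
    n = len(ratings)
    total = sum(r - 3 for r in ratings)
    theta1 = mu
    for _ in range(steps):
        _, mean, var = moments(theta1, theta2)
        grad = total - n * mean - (theta1 - mu) / sigma ** 2
        hess = -n * var - 1.0 / sigma ** 2  # always negative: the log posterior is concave
        theta1 -= grad / hess
    _, _, var = moments(theta1, theta2)
    return theta1, 1.0 / (n * var + 1.0 / sigma ** 2)

def log_normal_pdf(x, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

def mh_sample(ratings, theta2=-0.5, mu=0.0, sigma=1.0, n_samples=2000, seed=0):
    """Independence Metropolis-Hastings with the Laplace approximation as proposal."""
    rng = random.Random(seed)
    mode, var = laplace_fit(ratings, theta2, mu, sigma)
    current = mode
    samples, accepted = [], 0
    for _ in range(n_samples):
        proposal = rng.gauss(mode, math.sqrt(var))
        # Log acceptance ratio: target(new) * q(old) / (target(old) * q(new)).
        log_ratio = (
            log_posterior(proposal, ratings, theta2, mu, sigma)
            - log_normal_pdf(proposal, mode, var)
            - log_posterior(current, ratings, theta2, mu, sigma)
            + log_normal_pdf(current, mode, var)
        )
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            current = proposal
            accepted += 1
        samples.append(current)
    return samples, accepted / n_samples, mode

if __name__ == "__main__":
    samples, rate, mode = mh_sample([5, 5, 4, 5, 1])
    print("mode:", mode, "acceptance rate:", rate)
```

Because the proposal is fitted to the posterior itself, most candidates are accepted when the posterior is close to Gaussian, which is in line with the high acceptance rates reported on Slide 46. The full model samples a vector of user preferences rather than a single scalar.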