Jonathan Stray, Columbia University, Fall 2015
Syllabus at http://www.compjournalism.com/?p=133

Frontiers of Computational Journalism

Columbia Journalism School

Week 6: Hybrid Filtering

October 16, 2015

Filtering Comments

Thousands of comments: which are the "good" ones?

Comment voting

Problem: putting the comments with the most votes at the top doesn't work. Why?

Reddit Comment Ranking (old)

Upvotes minus downvotes, plus time decay.
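The old scheme can be sketched as net votes plus a time bonus, so newer comments start higher and older ones fade. The constants below (log base 10, the 45000-second divisor, the December 2005 epoch) follow reddit's widely published "hot" formula for links; they are used here purely to illustrate vote-plus-decay scoring, not as the exact comment formula.

```python
import math
from datetime import datetime, timezone

def hot_score(ups, downs, submitted_at,
              epoch=datetime(2005, 12, 8, tzinfo=timezone.utc)):
    """Net votes plus a time bonus. Sketch of the vote-plus-decay idea;
    constants are approximately reddit's published link ranking."""
    score = ups - downs
    # Diminishing returns: the first 10 votes count like the next 100.
    order = math.log10(max(abs(score), 1))
    sign = 1 if score > 0 else -1 if score < 0 else 0
    seconds = (submitted_at - epoch).total_seconds()
    return round(sign * order + seconds / 45000, 7)
```

Because the time term only grows, older items decay *relative* to newer ones rather than in absolute score.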

Reddit Comment Ranking (new)

Hypothetically, suppose all N users voted on the comment, and v of them up-voted. Then we could sort by the true proportion of upvotes, p = v/N.

N = 16, v = 11, p = 11/16 = 0.6875

Reddit Comment Ranking

Actually, only n of the N users vote, giving an observed proportion p' = v'/n.

n = 3, v' = 1, p' = 1/3 ≈ 0.333

Reddit Comment Ranking

With limited sampling, the observed proportion can rank comments wrongly when we don't have enough data:

observed p' = 0.333, true p = 0.6875
observed p' = 0.75, true p = 0.1875

Random error in sampling

If we observe a proportion p' of upvotes from n random users, what is the distribution of the true proportion p?

[Plot: distribution of p' when p = 0.5]

Confidence interval

Given an observed p', an interval inside which the true p lies with probability α.

Rank comments by the lower bound of the confidence interval

p' = observed proportion of upvotes
n = number of people who voted
z_α = how certain we want to be before we assume that p' is "close" to the true p

Analytic solution for the confidence interval, known as the "Wilson score"
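The Wilson score lower bound can be computed directly. A minimal implementation, assuming the standard closed form (z = 1.96 for 95% confidence):

```python
import math

def wilson_lower_bound(upvotes, n, z=1.96):
    """Lower bound of the Wilson score interval for the true upvote
    proportion p, given `upvotes` out of `n` total votes.
    z = 1.96 corresponds to 95% confidence."""
    if n == 0:
        return 0.0
    p = upvotes / n
    denom = 1 + z * z / n
    center = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - margin) / denom
```

With the slide's numbers, 1 of 3 upvotes gets a much lower bound than 11 of 16, so the sparsely-voted comment ranks below the well-sampled one even though more data might later redeem it.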

User-item matrix

Stores a "rating" of each item by each user. Could also be a binary variable recording whether the user clicked, liked, starred, shared, purchased...

User-item matrix

• No content analysis: we know nothing about what is "in" each item.
• Typically very sparse – a user hasn't watched even 1% of all movies.
• The filtering problem is guessing the "unknown" entries in the matrix. High guessed values are things the user would want to see.

Filtering process

How to guess an unknown rating?

Basic idea: suggest "similar" items. Similar items are rated in a similar way by many different users. Remember, a "rating" could be a click, a like, a purchase.

• "Users who bought A also bought B..."
• "Users who clicked A also clicked B..."
• "Users who shared A also shared B..."

Similar items

Item similarity

Cosine similarity!

Other distance measures

"Adjusted cosine similarity" subtracts the average rating for each user, to compensate for general enthusiasm ("most movies suck" vs. "most movies are great").

Generating a recommendation

Predict the unknown rating as a weighted average of the user's item ratings, weighted by each item's similarity to the target item.
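The weighted average itself is one line of arithmetic. A sketch, where both inputs are hypothetical and the similarities would come from the item-similarity step:

```python
def predict_rating(user_ratings, similarities):
    """Predict a user's rating of a target item as the similarity-
    weighted average of their ratings of other items.

    user_ratings: {item: rating} for items the user has rated
    similarities: {item: sim(target, item)} for those same items
    """
    num = sum(similarities[i] * r for i, r in user_ratings.items())
    den = sum(abs(similarities[i]) for i in user_ratings)
    return num / den if den else 0.0

# e.g. the user rated A=5 and B=4; the target item C has
# sim(C, A) = 0.2 and sim(C, B) = 0.8:
# predicted = (0.2*5 + 0.8*4) / (0.2 + 0.8) = 4.2
```

Dividing by the sum of absolute similarities keeps the prediction on the original rating scale.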

Matrix factorization recommender

Matrix factorization plate model

[Plate diagram: for each of i users, a topic vector u with variation λu; for each of j items, a topic vector v with variation λv; the observed user rating r of an item is generated from u and v.]
Combining collaborative filtering and topic modeling

Content modeling - LDA

[Plate diagram: K topics, each a distribution of words (with a word concentration parameter); for each of D docs, a mixture of topics in the doc (with a topic concentration parameter); for each of the N words in a doc, a topic for the word and the observed word in the doc.]

[Plate diagram, collaborative topic model: the LDA content component (K topics, topics in doc, a topic for each word, the words in the doc) is combined with a collaborative component: topics for each user with per-user variation, a weight for user selections, and the user's rating of a doc generated from both the doc's topics (content) and the user's topics (collaborative).]

Collaborative Topic Modeling

[Results plot: recommendation performance using content only vs. content + social data.]

Different Filtering Systems

Content: Newsblaster analyzes the topics in the documents. No concept of users.
Social: What I see on Twitter is determined by who I follow. Reddit comments are filtered using votes as input. Amazon's "people who bought X also bought Y" uses no content analysis.
Hybrid: Recommend based on both content and user behavior.

Item Content: text analysis, topic modeling, clustering...
My Data: who I follow; what I've read/liked
Other Users' Data: social network structure; other users' likes

How to evaluate/optimize?

• Netflix: try to predict the rating the user gives a movie after watching it.
• Amazon: sell more stuff.
• Google web search: human raters; A/B test every change.
• Does the user understand how the filter works?
• Can they configure it as desired?
• Can they correctly predict what they will and won't see?

How to evaluate/optimize?

• Can it be gamed? Spam, "user-generated censorship," etc.

"During the 2012 election, the ~2000 members of an anti-Ron Paul subreddit discovered that anything they posted, anywhere on reddit, was being rapidly, repeatedly downvoted. They created a diagnostic subreddit and began posting otherwise meaningless text to verify this otherwise odd behavior."

Filter design problem

Formally, given

U = user preferences, history, characteristics
S = current story
{P} = results of the function on previous stories
{B} = background world knowledge (other users?)

define

r(S, U, {P}, {B}) in [0...1], the relevance of story S to user U
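Any concrete filter in this lecture is one choice of that function. A hypothetical hybrid instance, blending a content score (topic overlap with what the user has read) with a social score (fraction of followed users who shared the story); all field names and weights are invented for illustration:

```python
def relevance(story, user, previous, background,
              w_content=0.5, w_social=0.5):
    """Toy r(S, U, {P}, {B}): weighted blend of content and social
    signals. `previous` ({P}) is unused in this sketch."""
    # Content: what fraction of the story's topics has this user read about?
    content = (len(story["topics"] & user["read_topics"])
               / max(len(story["topics"]), 1))
    # Social: what fraction of the people I follow shared this story?
    sharers = background["shared_by"].get(story["id"], set())
    social = len(sharers & user["follows"]) / max(len(user["follows"]), 1)
    return w_content * content + w_social * social
```

Swapping the weights, or replacing either term with a vote score or a matrix-factorization prediction, gives the other filter families discussed above.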

Filter design problem, restated

When should a user see a story? Aspects of this question:

• normative – personal: what I want; societal: emergent group effects
• UI: how do I tell the computer what I want?
• technical: constrained by algorithmic possibility
• economic: cheap enough to deploy widely

How to evaluate/optimize?

Does it improve the user's life?
