Jonathan Stray, Columbia University, Fall 2015
Syllabus at http://www.compjournalism.com/?p=133

Frontiers of Computational Journalism

Columbia Journalism School

Week 6: Hybrid Filtering

October 16, 2015

Filtering Comments

Thousands of comments: which are the "good" ones?

Comment voting

Problem: putting the comments with the most votes at the top doesn't work. Why?

Reddit Comment Ranking (old)

Upvotes minus downvotes, plus time decay.
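The old scheme can be sketched as net votes plus a time bonus, so newer comments start higher and older ones fade. The constants below (log base 10, the 45000-second divisor, the December 2005 epoch) follow reddit's widely published "hot" formula for links; they are used here purely to illustrate vote-plus-decay scoring, not as the exact comment formula.

```python
import math
from datetime import datetime, timezone

def hot_score(ups, downs, submitted_at,
              epoch=datetime(2005, 12, 8, tzinfo=timezone.utc)):
    """Net votes plus a time bonus. Sketch of the vote-plus-decay idea;
    constants are approximately reddit's published link ranking."""
    score = ups - downs
    # Diminishing returns: the first 10 votes count like the next 100.
    order = math.log10(max(abs(score), 1))
    sign = 1 if score > 0 else -1 if score < 0 else 0
    seconds = (submitted_at - epoch).total_seconds()
    return round(sign * order + seconds / 45000, 7)
```

Because the time term only grows, older items decay *relative* to newer ones rather than in absolute score.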

Reddit Comment Ranking (new)

Hypothetically, suppose all N users voted on the comment, and v of them up-voted. Then we could sort by the true proportion of upvotes, p = v/N.

N = 16, v = 11, p = 11/16 = 0.6875

Reddit Comment Ranking

Actually, only n of the N users vote, giving an observed proportion p' = v'/n.

n = 3, v' = 1, p' = 1/3 ≈ 0.333

Reddit Comment Ranking

With limited sampling, the observed proportion can rank comments wrongly when we don't have enough data:

observed p' = 0.333, true p = 0.6875
observed p' = 0.75, true p = 0.1875

Random error in sampling

If we observe a proportion p' of upvotes from n random users, what is the distribution of the true proportion p?

[Plot: distribution of p' when p = 0.5]

Confidence interval

Given an observed p', an interval inside which the true p lies with probability α.

Rank comments by the lower bound of the confidence interval

p' = observed proportion of upvotes
n = number of people who voted
z_α = how certain we want to be before we assume that p' is "close" to the true p

Analytic solution for the confidence interval, known as the "Wilson score"
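The Wilson score lower bound can be computed directly. A minimal implementation, assuming the standard closed form (z = 1.96 for 95% confidence):

```python
import math

def wilson_lower_bound(upvotes, n, z=1.96):
    """Lower bound of the Wilson score interval for the true upvote
    proportion p, given `upvotes` out of `n` total votes.
    z = 1.96 corresponds to 95% confidence."""
    if n == 0:
        return 0.0
    p = upvotes / n
    denom = 1 + z * z / n
    center = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - margin) / denom
```

With the slide's numbers, 1 of 3 upvotes gets a much lower bound than 11 of 16, so the sparsely-voted comment ranks below the well-sampled one even though more data might later redeem it.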

User-item matrix

Stores a "rating" of each item by each user. Could also be a binary variable recording whether the user clicked, liked, starred, shared, purchased...

User-item matrix

• No content analysis: we know nothing about what is "in" each item.
• Typically very sparse – a user hasn't watched even 1% of all movies.
• The filtering problem is guessing the "unknown" entries in the matrix. High guessed values are things the user would want to see.

Filtering process

How to guess an unknown rating?

Basic idea: suggest "similar" items. Similar items are rated in a similar way by many different users. Remember, a "rating" could be a click, a like, a purchase.

• "Users who bought A also bought B..."
• "Users who clicked A also clicked B..."
• "Users who shared A also shared B..."

Similar items

Item similarity

Cosine similarity!

Other distance measures

"Adjusted cosine similarity" subtracts the average rating for each user, to compensate for general enthusiasm ("most movies suck" vs. "most movies are great").

Generating a recommendation

Predict the unknown rating as a weighted average of the user's item ratings, weighted by each item's similarity to the target item.
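The weighted average itself is one line of arithmetic. A sketch, where both inputs are hypothetical and the similarities would come from the item-similarity step:

```python
def predict_rating(user_ratings, similarities):
    """Predict a user's rating of a target item as the similarity-
    weighted average of their ratings of other items.

    user_ratings: {item: rating} for items the user has rated
    similarities: {item: sim(target, item)} for those same items
    """
    num = sum(similarities[i] * r for i, r in user_ratings.items())
    den = sum(abs(similarities[i]) for i in user_ratings)
    return num / den if den else 0.0

# e.g. the user rated A=5 and B=4; the target item C has
# sim(C, A) = 0.2 and sim(C, B) = 0.8:
# predicted = (0.2*5 + 0.8*4) / (0.2 + 0.8) = 4.2
```

Dividing by the sum of absolute similarities keeps the prediction on the original rating scale.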

Matrix factorization recommender

Matrix factorization plate model

[Plate diagram: for each of i users, a topic vector u with variation λu; for each of j items, a topic vector v with variation λv; the observed user rating r of an item is generated from u and v.]
Combining collaborative filtering and topic modeling

Content modeling - LDA

[Plate diagram: K topics, each a distribution of words (with a word concentration parameter); for each of D docs, a mixture of topics in the doc (with a topic concentration parameter); for each of the N words in a doc, a topic for the word and the observed word in the doc.]

[Plate diagram, collaborative topic model: the LDA content component (K topics, topics in doc, a topic for each word, the words in the doc) is combined with a collaborative component: topics for each user with per-user variation, a weight for user selections, and the user's rating of a doc generated from both the doc's topics (content) and the user's topics (collaborative).]

Collaborative Topic Modeling

[Results plot: recommendation performance using content only vs. content + social data.]

Different Filtering Systems

Content: Newsblaster analyzes the topics in the documents. No concept of users.
Social: What I see on Twitter is determined by who I follow. Reddit comments are filtered using votes as input. Amazon's "people who bought X also bought Y" uses no content analysis.
Hybrid: Recommend based on both content and user behavior.

Item Content: text analysis, topic modeling, clustering...
My Data: who I follow; what I've read/liked
Other Users' Data: social network structure; other users' likes

How to evaluate/optimize?

• Netflix: try to predict the rating the user gives a movie after watching it.
• Amazon: sell more stuff.
• Google web search: human raters; A/B test every change.
• Does the user understand how the filter works?
• Can they configure it as desired?
• Can they correctly predict what they will and won't see?

How to evaluate/optimize?

• Can it be gamed? Spam, "user-generated censorship," etc.

"During the 2012 election, the ~2000 members of an anti-Ron Paul subreddit discovered that anything they posted, anywhere on reddit, was being rapidly, repeatedly downvoted. They created a diagnostic subreddit and began posting otherwise meaningless text to verify this otherwise odd behavior."

Filter design problem

Formally, given

U = user preferences, history, characteristics
S = current story
{P} = results of the function on previous stories
{B} = background world knowledge (other users?)

define

r(S, U, {P}, {B}) in [0...1], the relevance of story S to user U
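Any concrete filter in this lecture is one choice of that function. A hypothetical hybrid instance, blending a content score (topic overlap with what the user has read) with a social score (fraction of followed users who shared the story); all field names and weights are invented for illustration:

```python
def relevance(story, user, previous, background,
              w_content=0.5, w_social=0.5):
    """Toy r(S, U, {P}, {B}): weighted blend of content and social
    signals. `previous` ({P}) is unused in this sketch."""
    # Content: what fraction of the story's topics has this user read about?
    content = (len(story["topics"] & user["read_topics"])
               / max(len(story["topics"]), 1))
    # Social: what fraction of the people I follow shared this story?
    sharers = background["shared_by"].get(story["id"], set())
    social = len(sharers & user["follows"]) / max(len(user["follows"]), 1)
    return w_content * content + w_social * social
```

Swapping the weights, or replacing either term with a vote score or a matrix-factorization prediction, gives the other filter families discussed above.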

Filter design problem, restated

When should a user see a story? Aspects of this question:

• normative – personal: what I want; societal: emergent group effects
• UI: how do I tell the computer what I want?
• technical: constrained by algorithmic possibility
• economic: cheap enough to deploy widely

How to evaluate/optimize?

Does it improve the user's life?
