Nilesh Bansal and Nick Koudas WebDB 2007 SEARCHING THE BLOGOSPHERE Nilesh Bansal Nick Koudas University of Toronto

Page 1:

Nilesh Bansal and Nick Koudas

WebDB 2007

SEARCHING THE BLOGOSPHERE

Nilesh Bansal

Nick Koudas

University of Toronto

Page 2:

BLOGOSPHERE

Page 4:

67M KNOWN BLOGS

100K NEW EVERY DAY

DOUBLING EVERY 200 DAYS

Page 5:

WHAT ARE THEY WRITING ABOUT?

PERSONAL LIFE

PRODUCT REVIEWS

POLITICS

TECHNOLOGY

TOURISM

SPORTS

ENTERTAINMENT

Page 6:

WHY SHOULD WE CARE?

Page 7:

HUGE DATA REPOSITORY

WILL CONTINUE TO GROW

EXTRACT PUBLIC OPINION

VALUABLE INSIGHTS

Page 8:

KEY INSIGHTS

MARKET RESEARCH

PUBLIC RELATIONS STRATEGIES

CUSTOMER OPINION TRACKING

Page 9:

CHALLENGES AND OPPORTUNITIES

Page 10:

HUGE AMOUNTS OF UNSTRUCTURED TEXT

Page 12:

MACHINE CREATED WEBLOGS

MORE THAN HALF OF BLOGSPOT IS SPAM

33% OF WEBSPAM HOSTED AT BLOGSPOT

Page 13:

TEMPORAL DIMENSION

Page 14:

GEOGRAPHICAL ASSOCIATION

Page 15:

CONVERSATION

Page 16:

Gruhl et al., The Predictive Power of Online Chatter, KDD 2005

Kumar et al., On the Bursty Evolution of Blogspace, WWW 2003

Chi et al., Eigen-trend: trend analysis in the blogosphere based on singular value decompositions, CIKM 2006

Mishne et al., MoodViews: Tool for Blog Mood Analysis, AAAI-CAAW 2006

Mei et al., Topic sentiment mixture: modeling facets and opinions in weblogs, WWW 2007

Page 17:

BLOGSCOPE

Page 19:

CRAWLER RUNNING 24x7

TRACKING 9M BLOGS

INDEXING 70M ARTICLES

AGGREGATION AND PREPROCESSING

INTERACTIVE SEARCH AND ANALYSIS

Page 20:

ANY STREAMING TEXT SOURCE

NEWS

MAILING LISTS

FORUMS

SOCIAL MEDIA

Page 21:

www.blogscope.net

HotKeywords

Page 22:

RelatedTerms

PopularityCurve

SearchResults

GeoSearch

Page 23:

Hawaii Earthquake

Taiwan Undersea Earthquake

Sumatra Earthquake

Page 24:

December 15 2006

March 06 2007

Page 25:

IPHONE ON JAN 09 2007

Page 26:

Curves are usually correlated, except at one point

Page 27:

TECHNIQUES

Page 28:

CRAWLS RSS FEEDS

250 THOUSAND NEW POSTS DAILY

PING SERVER: WEBLOGS.COM
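
Crawling a feed comes down to fetching its XML and pulling out each post. The following is a minimal sketch in Python, assuming RSS 2.0 input and using only the standard library. The sample feed, the `parse_rss` helper, and the output field names are illustrative assumptions, not BlogScope's actual crawler.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed, inlined so the example runs without network access.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First post</title>
      <link>http://example.com/1</link>
      <pubDate>Tue, 09 Jan 2007 10:00:00 GMT</pubDate>
      <description>Thoughts on the keynote</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/2</link>
      <pubDate>Wed, 10 Jan 2007 09:30:00 GMT</pubDate>
      <description>More thoughts</description>
    </item>
  </channel>
</rss>"""


def parse_rss(xml_text):
    """Return one dict per <item> element in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    posts = []
    for item in root.iter("item"):
        posts.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
            "text": item.findtext("description", default=""),
        })
    return posts


posts = parse_rss(SAMPLE_FEED)
for post in posts:
    print(post["published"], post["title"], post["link"])
```

In a live crawler the XML would presumably arrive from an HTTP fetch, with a ping server's update notifications used to decide which feeds to fetch next; that scheduling logic is left out here.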

Page 29:

[Wang et al.] Spam Double-Funnel: Connecting Web Spammers with Advertisers, WWW 2007

[Gyöngyi et al.] Combating Web Spam with TrustRank, VLDB 2004

[Kolari et al.] Detecting Spam Blogs: A Machine Learning Approach, AAAI 2006

LINK-BASED ANALYSIS IS NOT EFFECTIVE

SPAMMERS ARE INTELLIGENT

WE USE HEURISTICS

ONGOING BATTLE

Page 30:

INTERACTIVE APPLICATION

TWO SECOND RESPONSE TIME

HUGE AMOUNTS OF DATA

SEVEN THOUSAND UNIQUE IP ADDRESSES DAILY

SCALABILITY

Page 32:

BURST DETECTION

[Kleinberg] Bursty and Hierarchical Structure in Streams, DMKD 2003

[Fung et al.] Parameter Free Bursty Events Detection in Text Streams, VLDB 2005

Page 33:

POPULARITY = BASE + ZERO MEAN GAUSSIAN

BURST = STATISTICAL OUTLIER

$$x \sim \mu + \mathcal{N}(0, \sigma^2)$$

$$x > \mu + 2\sigma$$
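
The outlier rule on this slide can be sketched in a few lines of Python. The sketch assumes that the base popularity and the noise level are estimated as the mean and standard deviation of a keyword's daily counts, and that a day counts as a burst when it exceeds the mean by more than two standard deviations. The function name `find_bursts`, the multiplier `k`, and the sample counts are illustrative assumptions.

```python
from statistics import mean, pstdev


def find_bursts(counts, k=2.0):
    """Return indices of days whose count exceeds mean + k * stddev."""
    mu = mean(counts)          # estimate of the base popularity
    sigma = pstdev(counts)     # estimate of the Gaussian noise level
    threshold = mu + k * sigma
    return [day for day, x in enumerate(counts) if x > threshold]


# Hypothetical daily mention counts for one keyword; day 7 is a spike.
daily_counts = [100, 104, 98, 101, 97, 103, 99, 400, 102, 96]
burst_days = find_bursts(daily_counts)
print(burst_days)
```

Because the spike itself inflates both estimates, a production system would likely fit the mean and deviation on a trailing window that excludes the day being tested; this sketch keeps everything in one pass for brevity.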

Page 34:

IDENTIFYING RELATED TERMS

Page 35:

COLLOCATIONS

POINTWISE MUTUAL INFORMATION

EXPENSIVE

[Ott and Longnecker] An Introduction to Statistical Methods and Data Analysis

[Manning and Schütze] Foundations of Statistical Natural Language Processing

[Church and Hanks] Word Association Norms, Mutual Information, and Lexicography, ACL 1989

$$\mathrm{score}(a, b) = \frac{P(a \in D \mid b \in D)}{P(a \in D)} = \frac{P(b \in D \mid a \in D)}{P(b \in D)}$$
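
The co-occurrence score on this slide compares how often term a appears in documents that contain b with how often a appears overall. A minimal Python sketch follows, assuming each document is represented as a set of terms; the function name `cooccurrence_score` and the toy corpus are illustrative assumptions. Taking the logarithm of the returned ratio gives pointwise mutual information.

```python
def cooccurrence_score(a, b, docs):
    """Return P(a in D | b in D) / P(a in D) over a list of term sets."""
    n = len(docs)
    n_a = sum(1 for d in docs if a in d)
    n_b = sum(1 for d in docs if b in d)
    n_ab = sum(1 for d in docs if a in d and b in d)
    if n_a == 0 or n_b == 0:
        return 0.0  # a term that never occurs has no measurable association
    p_a = n_a / n
    p_a_given_b = n_ab / n_b
    return p_a_given_b / p_a


# Hypothetical four-document corpus.
docs = [
    {"iphone", "apple"},
    {"iphone", "apple", "jobs"},
    {"apple", "pie"},
    {"earthquake", "hawaii"},
]
print(cooccurrence_score("iphone", "apple", docs))
print(cooccurrence_score("iphone", "hawaii", docs))
```

Each call scans the whole collection, which illustrates why computing the score exactly for every candidate term is expensive at blogosphere scale.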

Page 36:

FAST COMPUTATION OF RELATED TERMS

RANDOM SAMPLE

MUTUAL INFORMATION IN EXPECTATION

USE TF WITH PRECOMPUTED IDF

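$$s(t, q) = \frac{P(q \in d \wedge t \in d)}{P(t \in d)\, P(q \in d)}$$

$$s(t, q) \propto \frac{|\{d \mid d \in D_q,\ t \in d\}| \cdot |D|}{|\{d \mid t \in d\}|}$$

Here $D_q$ denotes the documents matching $q$ (or a random sample of them) and $D$ the full collection; the factor $1/|D_q|$ is constant for a fixed $q$ and does not affect the ranking.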
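
One way to read the shortcut on this slide: rather than scanning the whole collection, draw a random sample from the documents that match the query, count how many sampled documents contain each term, and weight that count by a precomputed inverse document frequency. The Python sketch below follows that reading. The function name `related_terms`, its parameters, and the toy frequencies are illustrative assumptions, not BlogScope's implementation.

```python
import random
from collections import Counter


def related_terms(result_docs, doc_freq, total_docs, exclude=(),
                  sample_size=100, top_k=3, seed=0):
    """Rank terms by (sampled co-occurrence count) * total_docs / doc_freq."""
    rng = random.Random(seed)
    if len(result_docs) > sample_size:
        sample = rng.sample(result_docs, sample_size)
    else:
        sample = list(result_docs)

    # Count each term once per sampled document.
    counts = Counter()
    for doc in sample:
        counts.update(set(doc))

    scores = {}
    for term, count in counts.items():
        if term in exclude or doc_freq.get(term, 0) == 0:
            continue
        scores[term] = count * total_docs / doc_freq[term]
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


# Hypothetical documents matching the query "iphone", as token lists.
results = [
    ["iphone", "apple", "the"],
    ["iphone", "jobs", "the"],
    ["iphone", "apple", "keynote"],
]
# Hypothetical precomputed document frequencies over 1000 documents.
doc_freq = {"iphone": 30, "apple": 50, "the": 1000, "jobs": 200, "keynote": 10}
top = related_terms(results, doc_freq, total_docs=1000, exclude={"iphone"})
print(top)
```

Common words such as "the" appear in many sampled documents but are pushed down by their high document frequency, which is the effect the inverse-document-frequency weighting is meant to achieve.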

Page 37:

COMPUTING HOT KEYWORDS

Page 38:

POPULAR DOES NOT MEAN HOT

INTERESTING = SURPRISING

MIXTURE OF DIFFERENT SCORING FUNCTIONS

DEVIATION FROM EXPECTED
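
The slide does not spell out the scoring functions, so the following Python sketch shows only one plausible "deviation from expected" score: today's count minus the historical mean, scaled by the historical spread. The function name `hot_keywords`, the additive smoothing constant of 1.0, and the sample data are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def hot_keywords(history, today, top_k=3):
    """Rank keywords by how far today's count exceeds its historical norm."""
    scores = {}
    for term, count in today.items():
        past = history.get(term, [])
        expected = mean(past) if past else 0.0
        spread = pstdev(past) if len(past) > 1 else 0.0
        # The +1.0 is an assumed smoothing term that avoids division by zero.
        scores[term] = (count - expected) / (spread + 1.0)
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


# Hypothetical daily counts for the previous four days and for today.
history = {
    "the": [1000, 1010, 990, 1005],
    "iphone": [20, 25, 18, 22],
    "earthquake": [5, 4, 6, 5],
}
today = {"the": 1002, "iphone": 400, "earthquake": 7}
hot = hot_keywords(history, today)
print(hot)
```

The word "the" is by far the most popular term in this toy data yet ranks last, because its count is exactly what its history predicts; "iphone" ranks first because its count is far above expectation.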

Page 39:

INTELLIGENT ALERT SERVICE

BURST SYNOPSIS

AUTHORITATIVE RANKING

Page 40:

Nilesh Bansal, Fei Chiang, Nick Koudas, Frank Wm. Tompa, Seeking Stable Clusters in the Blogosphere, to appear in VLDB 2007.

Nilesh Bansal, Nick Koudas, BlogScope: System for Online Analysis of High Volume Text Streams, to appear in VLDB 2007 (Demonstration Proposal).

JUST THE BEGINNING

Page 41:

Source: xkcd.com

THANK YOU. QUESTIONS?