22
Blog Track Open Task: Spam Blog Detection Tim Finin http://ebiquity.umbc.edu/paper/ html/id/318/ Pranam Kolari, Akshay Java, Tim Finin, Anupam Joshi, Justin Martineau University of Maryland, Baltimore County NIST Blog Pre-Track, 14 Nov 2006 James Mayfield Johns Hopkins University Applied Physics Laboratory

Blog Track Open Task: Spam Blog Detection Tim Finin Pranam Kolari, Akshay Java, Tim Finin, Anupam Joshi, Justin

Embed Size (px)

Citation preview

Page 1: Blog Track Open Task: Spam Blog Detection Tim Finin  Pranam Kolari, Akshay Java, Tim Finin, Anupam Joshi, Justin

Blog Track Open Task: Spam Blog Detection

Tim Finin

http://ebiquity.umbc.edu/paper/html/id/318/

Pranam Kolari, Akshay Java, Tim Finin, Anupam Joshi, Justin Martineau

University of Maryland, Baltimore County

NIST Blog Pre-Track, 14 Nov 2006

James MayfieldJohns Hopkins University Applied Physics Laboratory

Page 2: Blog Track Open Task: Spam Blog Detection Tim Finin  Pranam Kolari, Akshay Java, Tim Finin, Anupam Joshi, Justin

Blogosphere Reputation at Stake!

Page 3: Blog Track Open Task: Spam Blog Detection Tim Finin  Pranam Kolari, Akshay Java, Tim Finin, Anupam Joshi, Justin

Spam in the Blogosphere

• Types: comment spam, ping spam, spam blogs• Akismet: “87% of all comments are spam”• 75% of update pings are spam (ebiquity 2005)• 20% of indexed blogs by popular blog search

engines is spam (Umbria 2006, ebiquity 2005)• “Spam blogs, sometimes referred to by the

neologism splogs, are weblog sites which the author uses only for promoting affiliated websites” 1

• “Spings, or ping spam, are pings that are sent from spam blogs” 1

1Wikipedia

Page 4: Blog Track Open Task: Spam Blog Detection Tim Finin  Pranam Kolari, Akshay Java, Tim Finin, Anupam Joshi, Justin

Why a problem?

• Blogosphere increasingly important segment of Web; ~12 hours from post to Google index

• Splog content provides no additional value• Splog content is often plagiarized• Splogs demote value of authentic content• Splogs steal advertising (referral) revenue

from authentic content producers• Splogs stress the blogosphere infrastructure• Splogs can skew Blog Analytics, as was

observed in TREC Blog Track 2006

Page 5: Blog Track Open Task: Spam Blog Detection Tim Finin  Pranam Kolari, Akshay Java, Tim Finin, Anupam Joshi, Justin

Nature of Splogs in TREC 2006• Around 83K identifiable blog home-pages

in the collection, with 3.2M permalinks• 81K blogs could be processed • We use splog detection models developed

on blog home-pages; 87% accuracy• We identified 13,542 splogs• Blacklisted 543K permalinks from these

splogs• ~16% of the entire collection• ~17% splog posts injected into TREC

dataset1 1

The TREC Blog06 Collection: Creating and Analyzing a Blog Test Collection – C. Macdonald, I. Ounis

Page 6: Blog Track Open Task: Spam Blog Detection Tim Finin  Pranam Kolari, Akshay Java, Tim Finin, Anupam Joshi, Justin

Impact of Splogs in TREC Queries

Distribution of Splogs that appear TREC queries

0

20

40

60

80

100

120

5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100

Top 100 Search Results ranked using TFIDF Scoring

Num

ber

of Splo

gs

851

852

853

854

855

856

857

858

859

860

861

862

863

864

865

866

867

868

869

870

871

872

873

874

875

876

877

878

879

880

881

882

883

884

885

886

887

888

889

890

891

892

893

894

895

896

897

898

American Idol

CholesterolHybrid Cars

Page 7: Blog Track Open Task: Spam Blog Detection Tim Finin  Pranam Kolari, Akshay Java, Tim Finin, Anupam Joshi, Justin

Splog Detection Task Proposal

• Motivation– Detecting and eliminating spam is an

essential requirement for any blog analysis– Splog detection has characteristics that

set it appart from e-mail and web spam detection

• Constraint– Simulate how blog search systems operate

• Task Statement– Is an input permalink (post) spam?

Page 8: Blog Track Open Task: Spam Blog Detection Tim Finin  Pranam Kolari, Akshay Java, Tim Finin, Anupam Joshi, Justin

Relation to E-mail Spam Detection• TREC has an E-mail Spam

Classification Task• Similar in

– Fast online spam detection

• Different in– Nature of spamming: links, RSS feeds,

web graph, metadata– Users targeted indirectly through

search engines, e.g. “N1ST” not relevant for “NIST” query

Page 9: Blog Track Open Task: Spam Blog Detection Tim Finin  Pranam Kolari, Akshay Java, Tim Finin, Anupam Joshi, Justin

Relation to Web Spam Detection•TREC does not have a web spam track•Similar in

–Spamming web link structure

•Different in–Coverage: Blog Analytics Engines don’t look

beyond blogosphere–Speed of detection is important, 150K

posts/hour–Presence of structured text through RSS

feeds presents new opportunities, and challenges

Page 10: Blog Track Open Task: Spam Blog Detection Tim Finin  Pranam Kolari, Akshay Java, Tim Finin, Anupam Joshi, Justin

Difficulty•We have been experimenting with multiple

approaches starting mid 2005•Data: http://ebiquity.umbc.edu/resource/html/id/212

Page 11: Blog Track Open Task: Spam Blog Detection Tim Finin  Pranam Kolari, Akshay Java, Tim Finin, Anupam Joshi, Justin

Difficulty

•Evolving spamming techniques and splog creation genres

•Most basic technique spam techniques– Generate content by stuffing key dictionary words– Generate link to affiliates, through link dumps on

blogrolls, linkrolls or after post content•Evolving spam techniques

– Scrape contextually similar content to generate posts

– RSS hijacking– Aggregation software, e.g. Planet X– Intersperse links randomly– Make link placement meaningful– Add spam comments and then ping. Repeat.

Page 12: Blog Track Open Task: Spam Blog Detection Tim Finin  Pranam Kolari, Akshay Java, Tim Finin, Anupam Joshi, Justin
Page 13: Blog Track Open Task: Spam Blog Detection Tim Finin  Pranam Kolari, Akshay Java, Tim Finin, Anupam Joshi, Justin

Task Details - Dataset Creation• Similar to TREC Blog 2006, a

collection of feeds, blog home-pages and permalinks

• View dataset D as two sets – Dbase , Dtest

• Dbase to span (n-x) days, and Dtest to span the rest of x days for x≤1

• D could collected as a combination of– D as collected in 2006– Sample a subset of pings from a ping

server over the period that D is collected

Page 14: Blog Track Open Task: Spam Blog Detection Tim Finin  Pranam Kolari, Akshay Java, Tim Finin, Anupam Joshi, Justin

Task Details - Assessment

•Assessors classify spam post into one or more classes based on the kind of spam this post, or the blog hosting it features– Non-blog– Keyword-stuffed– Post-stitching– Post-plagiarism– Post-weaving– Blog/link-roll spam

•Each assessment typically takes 1-2 minutes•Detailed assessment will enable participants

to identify classes they handle well and where they can improve

Page 15: Blog Track Open Task: Spam Blog Detection Tim Finin  Pranam Kolari, Akshay Java, Tim Finin, Anupam Joshi, Justin

Non-Blog ping at weblogs.com• No RSS Feeds• No Dated Entry, no comments• Possibly plagiarized content

Page 16: Blog Track Open Task: Spam Blog Detection Tim Finin  Pranam Kolari, Akshay Java, Tim Finin, Anupam Joshi, Justin

Keyword Stuffed Blog• ‘coupon codes’, ‘casino’

Page 17: Blog Track Open Task: Spam Blog Detection Tim Finin  Pranam Kolari, Akshay Java, Tim Finin, Anupam Joshi, Justin

Post Stitching• Excerpts scraped from other sources

Page 18: Blog Track Open Task: Spam Blog Detection Tim Finin  Pranam Kolari, Akshay Java, Tim Finin, Anupam Joshi, Justin

Post Weaving• Spam Links contextually placed in post

Page 19: Blog Track Open Task: Spam Blog Detection Tim Finin  Pranam Kolari, Akshay Java, Tim Finin, Anupam Joshi, Justin

Link-roll spam• With fully plagiarized text

Page 20: Blog Track Open Task: Spam Blog Detection Tim Finin  Pranam Kolari, Akshay Java, Tim Finin, Anupam Joshi, Justin

Evaluation

• Dbase distributed first, Dtest subsequently with 50 independent sets of permalinks

• Dbase, Dtest division will mimic how blog search engines operate– Build models to detect splogs – using individual

posts, feeds or blog homepages of what is seen– Detect spam in an incoming stream of new

blog postings

• Teams will be judged by how well they detect “spamminess” for new posts

Page 21: Blog Track Open Task: Spam Blog Detection Tim Finin  Pranam Kolari, Akshay Java, Tim Finin, Anupam Joshi, Justin

Input/Output<set>

<num>...</num>

<test>

<permalink>

<url>...</url>

<homepage>...</homepage>

<feed>...</feed>

<when>... </when>

</permalink>

<permalink>

...

</permalink>

...

</test>

</set>

{set Q0 docno rank prob runtag}

Individual set of test input.1 or y such sets can be used, with each set biased to a specific splog genre, blog Publishing host or TLD

Each permalink to bejudged by participants

Output format

Page 22: Blog Track Open Task: Spam Blog Detection Tim Finin  Pranam Kolari, Akshay Java, Tim Finin, Anupam Joshi, Justin

Summary

• Spam Blogs present a major challenge to the quality of blog mining/analytics

• Splog Detection is different from spam in other communication platforms

• Development of TREC Task will help furthering state of the art

• Task requirements can be easily aligned with existing task of opinion identification