Transcript
Page 1: INFSCI 2480 RSS Feeds Document Filteringpeterb/2480-122/RSSFeeds_DocFiltering.pdf · Feeds! Feed = “A document (often XML-based) which contain content items, often summaries of

2/7/11


INFSCI 2480

RSS Feeds

Document Filtering

Yi-ling Lin

02/02/2011

Feed? RSS? Atom?

  RSS = Rich Site Summary

  RSS = RDF (Resource Description Framework) Site Summary

  RSS = Really Simple Syndication

  Atom


Feeds

  Feed = “A document (often XML-based) which contain content items, often summaries of stories or weblog posts with web links to longer versions”

  Feed > RSS, Atom (“feed” is the general term; RSS and Atom are specific feed formats)

Feeds

  [Figure labels: RSS 2.0, RSS 0.92, RSS 0.91, RSS 1.0, Atom]

RSS Versions

  Version distribution collected by an RSS search engine (Feb 2010)

  2.0 > 1.0 > 0.91 > 0.92

  http://www.syndic8.com/stats.php?Section=rss#tabtable


Comparison of RSS versions

Feature | RSS 0.91 | RSS 0.92 | RSS 2.0

Categories on channel or item | X | O | O

Elements on the channel: language, copyright, docs, lastBuildDate, managingEditor, pubDate, rating, skipDays, skipHours, generator, ttl | X | X | O

Item enclosures | X | O | O

Elements on items: authors, comments, pubDate | X | X | O

Item count limitation | 15 | X | X

Notes | Channel-level metadata only | Allows both channel and item metadata | Modularized

(O = yes, X = no)

Revealing RSS in Web pages


RSS content Structure

  RSS 0.90 to 2.0 family

  XML

  <channel> & <item> parts

    Feed information (channel)

    Each article content (item)

  Additional features with higher versions: 0.90 to 2.0

  RSS 1.0 & Atom are in different formats!
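As a rough illustration of the <channel> / <item> split, the sketch below reads a small hand-written RSS 2.0 document with Python's standard xml.etree.ElementTree module. The feed content, the example.org URLs, and the variable names are made up for illustration and do not come from the slides.

```python
import xml.etree.ElementTree as ET

# A tiny, hand-written RSS 2.0 document (illustrative content only).
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <link>http://example.org/</link>
    <description>A made-up feed for illustration</description>
    <item>
      <title>First post</title>
      <link>http://example.org/1</link>
      <description>Summary of the first post</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.org/2</link>
      <description>Summary of the second post</description>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(SAMPLE_RSS)

# Feed-level information lives directly under <channel>.
channel = root.find("channel")
feed_title = channel.findtext("title")

# Each article is one <item> inside the channel.
item_titles = [item.findtext("title") for item in channel.findall("item")]

print(feed_title)
print(item_titles)
```

This layout applies to the RSS 0.9x/2.0 family only; RSS 1.0 and Atom documents use XML namespaces and a different element structure, so the same paths would not match them.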

RSS 0.92


RSS 2.0

RSS 1.0 “uses RDF” http://www.w3.org/RDF/


ATOM

In more detail...

  Specifications

    RSS 0.91: http://www.rssboard.org/rss-0-9-1-netscape

    RSS 2.0: http://cyber.law.harvard.edu/rss/rss.html


Parsing RSS Feeds

  Problem: extract text from the RSS structure

  They are XML

  Parsers

    SAX

    DOM

    Out-of-the-box parser

SAX and DOM

  SAX (Simple API for XML): serial access parser

    Stream of XML data goes in

    Event-driven parsing

  DOM (Document Object Model)

    Uses the hierarchical (tree) structure of the document for parsing


SAX Example
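The code shown on this slide did not survive in the transcript. As a stand-in, here is a minimal sketch of event-driven parsing with Python's standard xml.sax module. The handler class ItemTitleHandler and the sample feed are hypothetical names and data chosen for illustration, not the original example.

```python
import xml.sax


class ItemTitleHandler(xml.sax.ContentHandler):
    """Collects the text of every <title> that appears inside an <item>."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_item = False
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        # Called when the parser meets an opening tag.
        if name == "item":
            self._in_item = True
        elif name == "title" and self._in_item:
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        # Called when the parser meets a closing tag.
        if name == "title" and self._in_title:
            self.titles.append("".join(self._buffer).strip())
            self._in_title = False
        elif name == "item":
            self._in_item = False


# Made-up feed: the channel title is skipped, item titles are kept.
SAX_SAMPLE = b"""<rss version="2.0"><channel><title>Example Feed</title>
<item><title>First post</title></item>
<item><title>Second post</title></item>
</channel></rss>"""

handler = ItemTitleHandler()
xml.sax.parseString(SAX_SAMPLE, handler)
print(handler.titles)
```

The parser never builds the whole document in memory; the handler only keeps the small amount of state it needs, which is the main reason to choose SAX for large feeds.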

DOM Example
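The code shown on this slide did not survive in the transcript. As a stand-in, here is a minimal sketch of tree-based parsing with Python's standard xml.dom.minidom module. The helper text_of and the sample feed are hypothetical, chosen for illustration only.

```python
from xml.dom import minidom

# Made-up feed used only for this illustration.
DOM_SAMPLE = """<rss version="2.0"><channel><title>Example Feed</title>
<item><title>First post</title><link>http://example.org/1</link></item>
<item><title>Second post</title><link>http://example.org/2</link></item>
</channel></rss>"""


def text_of(parent, tag):
    """Return the text of the first <tag> element under parent, or ''."""
    nodes = parent.getElementsByTagName(tag)
    if not nodes or nodes[0].firstChild is None:
        return ""
    return nodes[0].firstChild.data.strip()


# The whole document is loaded into a tree first, then navigated.
document = minidom.parseString(DOM_SAMPLE)
entries = [
    (text_of(item, "title"), text_of(item, "link"))
    for item in document.getElementsByTagName("item")
]
print(entries)
```

Compared with SAX, the code is shorter because the tree can be queried in any order, at the cost of holding the full document in memory.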


Ready-made Parser

  Universal Feed Parser <http://www.feedparser.org>

Universal Feed Parser


Core Attributes

  Follows RSS/Atom syntax, with normalization

  However, not always: one attribute can be read from several source elements

  updated

    /atom10:feed/atom10:updated

    /atom03:feed/atom03:modified

    /rss/channel/pubDate

    /rss/channel/dc:date

    /rdf:RDF/rdf:channel/dc:date

    /rdf:RDF/rdf:channel/dcterms:modified
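To make the idea of normalization concrete, the sketch below looks up an "updated" value by trying the paths listed above in order, using only the standard library. It is not how Universal Feed Parser is implemented; the function find_updated and the table UPDATED_PATHS are hypothetical. One assumption to note: the slide writes rdf:channel, while this sketch looks for the channel element in the RSS 1.0 namespace (http://purl.org/rss/1.0/).

```python
import xml.etree.ElementTree as ET

# Namespace prefixes used in the paths (prefix names are arbitrary labels).
NS = {
    "atom10": "http://www.w3.org/2005/Atom",
    "atom03": "http://purl.org/atom/ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rss10": "http://purl.org/rss/1.0/",  # assumption, see lead-in
}

# (expected root tag, path below the root), tried in this order.
UPDATED_PATHS = [
    ("{%s}feed" % NS["atom10"], "atom10:updated"),
    ("{%s}feed" % NS["atom03"], "atom03:modified"),
    ("rss", "channel/pubDate"),
    ("rss", "channel/dc:date"),
    ("{%s}RDF" % NS["rdf"], "rss10:channel/dc:date"),
    ("{%s}RDF" % NS["rdf"], "rss10:channel/dcterms:modified"),
]


def find_updated(xml_text):
    """Return the first matching 'updated' string, or None if none match."""
    root = ET.fromstring(xml_text)
    for root_tag, path in UPDATED_PATHS:
        if root.tag != root_tag:
            continue
        value = root.findtext(path, namespaces=NS)
        if value:
            return value.strip()
    return None


# Two made-up feeds in different formats give one uniform attribute.
rss_doc = ('<rss version="2.0"><channel>'
           '<pubDate>Wed, 02 Feb 2011 10:00:00 GMT</pubDate>'
           '</channel></rss>')
atom_doc = ('<feed xmlns="http://www.w3.org/2005/Atom">'
            '<updated>2011-02-02T10:00:00Z</updated></feed>')

print(find_updated(rss_doc))
print(find_updated(atom_doc))
```

The returned values are still raw strings in different date formats, which is where the date parsing feature of a ready-made parser becomes useful.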

Advanced features

  Date parsing

  HTML sanitization

  Content normalization

  Namespace handling

  and more...


Document classification

Probability Calculation

  Pr(word|classification)

  Ex. Pr(“drug”|spam) = 80 spam docs containing “drug” / 100 spam docs in total = 0.8
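The calculation above can be sketched in a few lines of Python. The function name word_probability and the toy documents are made up for illustration; each document is represented simply as a set of words.

```python
def word_probability(word, documents):
    """Fraction of documents (each a set of words) that contain word."""
    if not documents:
        return 0.0
    containing = sum(1 for doc in documents if word in doc)
    return containing / len(documents)


# 100 made-up spam documents: 80 mention "drug", 20 do not.
spam_docs = [{"cheap", "drug"}] * 80 + [{"hello", "meeting"}] * 20

pr_drug_given_spam = word_probability("drug", spam_docs)
print(pr_drug_given_spam)  # 0.8
```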


Weighted Probability

  Doc1[… money …](s), Doc2[… money …](s), Doc3[… money …](s), Doc4[……](s), Doc5[……](ns)   (s = spam, ns = no-spam)

  Pr(“money”|spam) = 3/4 = 0.75

  Pr(“money”|no-spam) = 0/1 = 0

  Pr = 0.5 (we don’t know) may be better than Pr = 0 (never)

  Ex. After finding one spam instance
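The slide does not give the weighting formula, so the sketch below uses one common formulation as an assumption: start from an assumed probability of 0.5 with a weight of 1, and blend it with the observed probability according to how many documents contained the word. The names weighted_probability, basic_probability, and training are hypothetical.

```python
def weighted_probability(basic_prob, times_seen, assumed_prob=0.5, weight=1.0):
    """Blend an assumed prior with the observed probability.

    Assumed formula (not taken from the slides):
    (weight * assumed_prob + times_seen * basic_prob) / (weight + times_seen)
    """
    return (weight * assumed_prob + times_seen * basic_prob) / (weight + times_seen)


# The five documents from the slide: (words, label).
training = [
    ({"money"}, "spam"),
    ({"money"}, "spam"),
    ({"money"}, "spam"),
    (set(), "spam"),
    (set(), "no-spam"),
]


def basic_probability(word, category):
    """Observed Pr(word | category) from the training list."""
    docs = [words for words, label in training if label == category]
    return sum(1 for words in docs if word in words) / len(docs)


# Number of documents, in any category, that contain "money".
times_seen = sum(1 for words, _ in training if "money" in words)

pr_spam = weighted_probability(basic_probability("money", "spam"), times_seen)
pr_no_spam = weighted_probability(basic_probability("money", "no-spam"), times_seen)
print(pr_spam, pr_no_spam)  # 0.6875 0.125
```

Under this assumed formula, a word that has never been seen keeps the neutral value 0.5, and Pr(“money”|no-spam) becomes small but non-zero instead of exactly 0.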

Naive Bayesian Classifier

  Goal = Pr(Category|Document)

  Ex. Pr(Spam|Doc1) = 0.001, Pr(No-spam|Doc1) = 0.5 → Doc1 = No-spam

  What we have = Pr(Feature|Category)

  Process = Pr(Feature|Category) → Pr(Document|Category) → Pr(Category|Document)


Pr(Document|Category)

  Pr(Document|Category) = Pr(Feature1|Cat) * Pr(Feature2|Cat) * Pr(Feature3|Cat) * … * Pr(FeatureN|Cat)

  Pr(A ^ B) = Pr(A) * Pr(B)

  Assumption: A and B are independent of each other

  Not true in practice: e.g. “social” vs. “Web”, “social” vs. “Probability”

  But still useful
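Under the independence assumption, Pr(Document|Category) is just a product over the document's features. A minimal sketch follows; the function name document_probability and the per-feature probabilities are made up for illustration.

```python
from math import prod


def document_probability(features, feature_probs):
    """Multiply Pr(feature | category) over the features of one document."""
    return prod(feature_probs[f] for f in features)


# Made-up Pr(feature | spam) values for illustration.
spam_feature_probs = {"money": 0.75, "drug": 0.8, "meeting": 0.1}

pr_doc_given_spam = document_probability(["money", "drug"], spam_feature_probs)
print(pr_doc_given_spam)
```

With many features the product becomes very small, which is why implementations often sum logarithms instead; that variation is not shown on the slides.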

Pr(Category|Document)

  Pr(A|B) = Pr(B|A) * Pr(A) / Pr(B)

    Thomas Bayes

  Pr(Category|Document) = Pr(Document|Category) * Pr(Category) / Pr(Document)

    Pr(Category) = # of docs in Cat / total # of docs

    Pr(Document) = Constant (the same for every category, so it can be ignored when comparing categories)
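A minimal sketch of this step, assuming Pr(Document) is dropped because it is the same for every category. The function name category_score, the per-feature probabilities, and the document counts are all made up for illustration.

```python
from math import prod


def category_score(features, feature_probs, docs_in_category, total_docs):
    """Pr(Document|Category) * Pr(Category), leaving out the constant Pr(Document)."""
    pr_document_given_category = prod(feature_probs[f] for f in features)
    pr_category = docs_in_category / total_docs
    return pr_document_given_category * pr_category


# Made-up per-feature probabilities and document counts.
spam_probs = {"money": 0.75, "drug": 0.8}
no_spam_probs = {"money": 0.1, "drug": 0.05}
document = ["money", "drug"]

score_spam = category_score(document, spam_probs,
                            docs_in_category=40, total_docs=100)
score_no_spam = category_score(document, no_spam_probs,
                               docs_in_category=60, total_docs=100)

# The scores are only meaningful relative to each other.
print(score_spam > score_no_spam)  # True
```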


Choosing a Category

  Take the one with the highest probability

  What if Pr(Spam|Doc) = 0.000001 and Pr(No-spam|Doc) = 0.0000005?

  Answer may be “Not sure”

Choosing a Category

  Thresholding

  If Pr(Spam|Doc) > 3 * Pr(No-spam|Doc),

  then spam

  → which is more reasonable
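The thresholding rule can be sketched as follows. The function name classify, the thresholds dictionary, and the "unknown" default are hypothetical; the factor 3 for spam is the one from the slide.

```python
def classify(scores, thresholds, default="unknown"):
    """Pick the best category only if it beats every other by its threshold.

    scores:     {category: Pr(category | document) or a proportional score}
    thresholds: {category: factor}; categories not listed use 1.0
    """
    best = max(scores, key=scores.get)
    factor = thresholds.get(best, 1.0)
    for category, score in scores.items():
        if category == best:
            continue
        # The winner must exceed factor * every competing score.
        if score * factor > scores[best]:
            return default
    return best


thresholds = {"spam": 3.0}

# The close call from the slide: spam wins, but not by a factor of 3.
close_call = classify({"spam": 0.000001, "no-spam": 0.0000005}, thresholds)

# A clearer case: spam is more than 3 times as likely.
clear_spam = classify({"spam": 0.000002, "no-spam": 0.0000005}, thresholds)

print(close_call, clear_spam)  # unknown spam
```

Setting a higher factor for spam than for no-spam reflects that wrongly filtering a good document is usually worse than letting a spam document through.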


Persisting Trained Classifier

  Classifier in Python

    Dictionaries in memory: fc, cc

    Disappear after quitting the Python interpreter

  Should be saved to disk

    MySQL: client/server RDBMS

    SQLite: file-based RDBMS
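A minimal sketch of the SQLite option using Python's standard sqlite3 module. The slides only name the dictionaries fc and cc; their layout here (fc[feature][category] and cc[category] holding counts), the table names, and the helper functions save_counts and load_counts are assumptions made for illustration.

```python
import os
import sqlite3
import tempfile

# Assumed layout: fc[feature][category] -> count, cc[category] -> count.
fc = {"money": {"spam": 3, "no-spam": 0}}
cc = {"spam": 4, "no-spam": 1}


def save_counts(db_path, fc, cc):
    """Write both dictionaries to two small tables."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("CREATE TABLE IF NOT EXISTS fc "
                     "(feature TEXT, category TEXT, n INTEGER, "
                     "PRIMARY KEY (feature, category))")
        conn.execute("CREATE TABLE IF NOT EXISTS cc "
                     "(category TEXT PRIMARY KEY, n INTEGER)")
        conn.executemany(
            "INSERT OR REPLACE INTO fc VALUES (?, ?, ?)",
            [(f, c, n) for f, cats in fc.items() for c, n in cats.items()])
        conn.executemany("INSERT OR REPLACE INTO cc VALUES (?, ?)",
                         list(cc.items()))
        conn.commit()
    finally:
        conn.close()


def load_counts(db_path):
    """Rebuild the two dictionaries from the tables."""
    conn = sqlite3.connect(db_path)
    try:
        loaded_fc = {}
        for feature, category, n in conn.execute(
                "SELECT feature, category, n FROM fc"):
            loaded_fc.setdefault(feature, {})[category] = n
        loaded_cc = dict(conn.execute("SELECT category, n FROM cc"))
    finally:
        conn.close()
    return loaded_fc, loaded_cc


# Store the database in a temporary directory for this demonstration.
db_path = os.path.join(tempfile.mkdtemp(), "classifier.db")
save_counts(db_path, fc, cc)
loaded_fc, loaded_cc = load_counts(db_path)
print(loaded_fc, loaded_cc)
```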

Persisting Trained Classifier

  Python shelve

    Put/get any (picklable) Python object into/from disk files
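A minimal sketch of the shelve option using Python's standard shelve module. The contents of fc and cc and the file name "classifier" are made up for illustration; the slides only name the two dictionaries.

```python
import os
import shelve
import tempfile

# Assumed layout: fc[feature][category] -> count, cc[category] -> count.
fc = {"money": {"spam": 3, "no-spam": 0}}
cc = {"spam": 4, "no-spam": 1}

# Store the shelf in a temporary directory for this demonstration.
shelf_path = os.path.join(tempfile.mkdtemp(), "classifier")

# Save: a shelf behaves like a dictionary that lives on disk.
with shelve.open(shelf_path) as db:
    db["fc"] = fc
    db["cc"] = cc

# Load: reopen the shelf later (for example in a new interpreter session).
with shelve.open(shelf_path) as db:
    restored_fc = db["fc"]
    restored_cc = db["cc"]

print(restored_fc, restored_cc)
```

Values are pickled when stored, so changes made to restored_fc after loading are not written back unless the key is assigned again.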


Alternative Methods

  Supervised learning methods

    Neural network

    Support Vector Machine

    Decision Tree

  Software packages

    Weka, R, SPSS Clementine, etc.

Weka Example

  Example Data

  Weather condition → To play or not to play?

  4 attributes, 1 class variable


Weka Example

Weka Example


Weka Example

