INFSCI 2480 RSS Feeds Document peterb/2480-122/RSSFeeds_ آ  Feeds! Feed = “A document

  • View
    0

  • Download
    0

Embed Size (px)

Text of INFSCI 2480 RSS Feeds Document peterb/2480-122/RSSFeeds_ آ  Feeds! Feed = “A document

  • 2/7/11


    1


    INFSCI 2480! RSS Feeds !

    Document Filtering! Yi-ling Lin!

    02/02/2011!

    Feed? RSS? Atom? !

      RSS = Rich Site Summary!

      RSS = RDF (Resource Description Framework) Site Summary!

      RSS = Really Simple Syndicate!

      ATOM!

  • 2/7/11


    2


    Feeds!

      Feed = “A document (often XML-based) which contain content items, often summaries of stories or weblog posts with web links to longer versions”!

      Feed > RSS, Atom!

    Feeds!

    RSS 2.0!

    RSS 0.92!

    RSS 0.91!

    RSS 1.0!

    Atom!

    RSS Versions!

      Version distribution collected by an RSS search engine (Feb 2010)!

      2.0 > 1.0 > 0.91 > 0.92!

      http://www.syndic8.com/stats.php?

    Section=rss#tabtable!

  • 2/7/11


    3


    Comparison of RSS versions!

    RSS 0.91 RSS 0.92 RSS 2.0 Categories on channel or item X O O Elements on the channel : language, copyright, docs, lastBuildDate, managingEditor, pubDate, rating, skipDays, skipHours, generator, ttl

    X X O

    Item enclosures X O O

    Elements on items: authors, comments, pubDate X X O

    Item count limitation 15 X X

    Notes Channel-level metadata only

    Allows both channel and

    item metadata Modularized

    Revealing RSS in Web pages!

  • 2/7/11


    4


    RSS content Structure!

      RSS 0.90 to 2.0 family!

      XML!

      & parts!

    Feed information (channel)! Each article content (item)!

      Additional features with higher versions — 0.90 to 2.0!

      RSS 1.0 & Atom are in different formats!!

    RSS 0.92

  • 2/7/11


    5


    RSS 2.0

    RSS 1.0 “uses RDF” http://www.w3.org/RDF/

  • 2/7/11


    6


    ATOM

    In more detail...!

      Specifications!

      RSS 0.91: http://www.rssboard.org/rss-0-9-1-netscape!

      RSS 2.0: http://cyber.law.harvard.edu/rss/rss.html!

  • 2/7/11


    7


    Parsing RSS Feeds!

      Problem — extract texts from RSS structure!

      They are XML!

      Parsers!

      SAX!

      DOM!

      Out-of-box parser !

    SAX and DOM!

      SAX (Simple API for XML) — serial access parser!

      Stream of XML data goes in!

      Event-driven parsing!

      DOM (Document Object Model)!

      Use hierarchical structure for parsing!

  • 2/7/11


    8


    SAX Example!

    DOM Example!

  • 2/7/11


    9


    Ready-made Parser!

      Universal Feed Parser !

    Universal Feedparser!

  • 2/7/11


    10


    Core Attributes!

      Follows RSS/ATOM syntax normalization!

      However, not always!

      updated!

      /atom10:feed/atom10:updated!

      /atom03:feed/atom03:modified!

      /rss/channel/pubDate!

      /rss/channel/dc:date!

      /rdf:RDF/rdf:channel/dc:date!

      /rdf:RDF/rdf:channel/dcterms:modified!

    Advanced features!

      Date parsing!

      HTML sanitization!

      Content normalization!

      Namespace handling!

      and more...!

  • 2/7/11


    11


    Document classification!

    Probability Calculation!

      Pr(word|classification)!

      Ex. Pr(“drug”|spam) = 80 docs / total 100 spam docs = 0.8!

  • 2/7/11


    12


    Weighted Probability!

      Doc1[… money …](s), Doc2[ … money …](s), Doc3[ … money …](s), Doc4[……](s), Doc5[……](ns)!

      Pr(“money”|spam) = 3/4 = 0.75!

      Pr(“money”|no-spam) = 0/1 = 0!

      Pr = 0.5 (we don’t know) may be better than Pr = 0 (never)!

      Ex. After finding one spam instance!

    Naive Bayesian Classifier!

      Goal = Pr(Category|Document) !

      Ex. Pr(Spam|Doc1) = 0.001, Pr(No-spam|Doc1) = 0.5 →

    Doc1 = No-pam!

      What we have is? = Pr(Feature|Category)!

      Process = Pr(Feature|Category) → Pr(Document|

    Category) → Pr(Category|Document)!

  • 2/7/11


    13


    Pr(Document|Category) !

      Pr(Document|Category) = Pr(Feature1|Cat) * Pr(Feature2| Cat) * Pr(Feature3|Cat) … Pr(FeatureN|Cat) !

      Pr(A ^ B) = Pr(A) * Pr(B)!

      Assumption — A and B are independent from each other!

      Not true — social vs. Web, social vs. Probability!

      But still useful!

    Pr(Category|Document)!

      Pr(A|B) = Pr(B|A) * Pr(A) / Pr(B)!

      Thomas Bayes!

      Pr(Category|Document) !

    = Pr(Document|Category) * Pr(Category) / Pr(Document) !

      Pr(Category) = # of docs in Cat / total # of docs!

      Pr(Document) = Constant!

  • 2/7/11


    14


    Choosing a Category!

      Take one with the highest probability!

      What if, Pr(Spam|Doc) = 0.000001, Pr(No-spam|Doc) =

    0.0000005!

      Answer may be “Not sure”!

    Choosing a Category!

      Thresholding!

      If Pr(Spam|Doc) > 3 * Pr(No-spam|Doc),!

      Then spam!

      → which is more reasonable!

  • 2/7/11


    15


    Persisting Trained Classifier!

      Classifier in python,!

      Dictionaries in memory — fc, cc!

      Disappears after quitting from Python interpreter!

      Should be saved to disc!

      MySQL — client/server RDBMS!

      SQLite — file-based RDBMS!

    Persisting Trained Classifier!

      Python shelve!

      Put/Get any

    Python object into disk files!

  • 2/7/11


    16


    Alternative Methods!

      Supervised learning methods!

      Neural network!

      Support Vector Machine!

      Decision Tree!

      Software packages!

      Weka, R, SPSS Clementine, etc!

    Weka Example!

      Example Data!

      Weather condition !

    → To play or not to play?!

      4 attributes, 1 class

    variable!

  • 2/7/11


    17


    Weka Example!

    Weka Example!

  • 2/7/11


    18


    Weka Example!