Internals of an Aggregated Web News Feed
newsfeed.ijs.si
Mitja Trampuš and Blaž Novak
AI Lab, Jozef Stefan Institute

[Figure: pipeline overview. Stages: Monitor. Download. | Clean. Enrich. | Expose. Use. Label in the diagram: txt]

Monitor. Download.

• Sources: RSS, Google News, private feeds
  – 150 000 feeds
  – 15 000 publishers

• Sources of sources:
  – Bootstrap from public listings
  – Parse news articles for <link> entries
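
The "parse news articles for <link> entries" step can be illustrated with a minimal sketch. It assumes that feeds are advertised through `<link rel="alternate">` tags with an RSS or Atom MIME type; the names `FeedLinkParser` and `discover_feeds` are hypothetical and not taken from the newsfeed code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types treated as feeds (an assumption of this sketch).
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collects feed URLs advertised through <link rel="alternate"> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = {k: (v or "") for k, v in attrs}
        rel = a.get("rel", "").lower().split()
        is_feed = a.get("type", "").lower() in FEED_TYPES
        if "alternate" in rel and is_feed and a.get("href"):
            # Resolve relative hrefs against the article URL.
            self.found.append(urljoin(self.base_url, a["href"]))


def discover_feeds(html, base_url):
    """Return the feed URLs advertised in one downloaded article page."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.found
```

Each URL returned this way would become a candidate feed for the monitor, next to the feeds bootstrapped from public listings.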

Monitor. Download.

• Quality management:
  – Punish technical errors
  – Adjustable crawl time

• Discovery delay for articles: 3 hours
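
One way to combine "punish technical errors" with an "adjustable crawl time" is a per-feed crawl interval that grows after failures and shrinks when a crawl finds new articles. The sketch below is only an illustration: the bounds, the multipliers and the names `FeedState` and `update_after_crawl` are assumptions, not the policy used by the service.

```python
from dataclasses import dataclass

MIN_INTERVAL = 5 * 60       # seconds; assumed lower bound
MAX_INTERVAL = 24 * 3600    # seconds; assumed upper bound


@dataclass
class FeedState:
    url: str
    interval: float = 3600.0   # seconds until the next crawl
    error_count: int = 0       # consecutive technical errors


def update_after_crawl(feed, ok, new_articles=0):
    """Adjust the crawl interval of `feed` in place after one crawl attempt."""
    if not ok:
        # Technical error (timeout, HTTP 5xx, unparsable feed): back off.
        feed.error_count += 1
        feed.interval *= 2.0
    else:
        feed.error_count = 0
        if new_articles > 0:
            feed.interval *= 0.5   # active feed: visit more often
        else:
            feed.interval *= 1.5   # quiet feed: visit less often
    feed.interval = max(MIN_INTERVAL, min(MAX_INTERVAL, feed.interval))
    return feed.interval
```

The lower bound on the interval caps how quickly a new article can be discovered, which is the trade-off behind the discovery delay quoted above.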

Clean. 1/2

• Methods in published papers work great
  – If evaluated on 10 sites

• Heuristic: Find the first block-level HTML element with lots of <p>aragraphs
  – failing that, a <td> or <div> with lots of text
  – avoid elements with lots of markup
  – site-independent

• Support for rNews/Schema.org
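
The heuristic can be sketched roughly as follows, using only the Python standard library. This is a simplified reading of the slide, not the production code: the thresholds (`min_paragraphs`, `min_chars`, `max_tag_ratio`), the set of tags treated as block-level and all names are assumptions.

```python
from html.parser import HTMLParser

VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
BLOCK = {"article", "section", "main", "body", "div", "td"}  # assumed set
SKIP = {"script", "style"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.items = []  # child Nodes and text strings, in document order

    def text(self):
        parts = [i if isinstance(i, str) else i.text() for i in self.items]
        return " ".join(p for p in parts if p)

    def walk(self):
        yield self
        for item in self.items:
            if isinstance(item, Node):
                yield from item.walk()


class TreeBuilder(HTMLParser):
    """Builds a very forgiving element tree from possibly messy HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        if tag == "p" and self.cur.tag == "p":
            self.cur = self.cur.parent  # implicitly close an unclosed <p>
        node = Node(tag, self.cur)
        self.cur.items.append(node)
        if tag not in VOID:
            self.cur = node

    def handle_endtag(self, tag):
        node = self.cur
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.cur = node.parent

    def handle_data(self, data):
        if self.cur.tag not in SKIP and data.strip():
            self.cur.items.append(data.strip())


def extract_body(html, min_paragraphs=3, min_chars=200, max_tag_ratio=0.02):
    builder = TreeBuilder()
    builder.feed(html)
    nodes = list(builder.root.walk())
    # Rule 1: first block-level element with many direct <p> children.
    for n in nodes:
        if n.tag in BLOCK:
            paras = [c for c in n.items if isinstance(c, Node) and c.tag == "p"]
            if len(paras) >= min_paragraphs:
                return "\n\n".join(p.text() for p in paras if p.text())
    # Rule 2: the <td>/<div> with the most text and little markup.
    best, best_len = None, 0
    for n in nodes:
        if n.tag in ("td", "div"):
            text = n.text()
            tags = sum(1 for _ in n.walk()) - 1  # descendant elements
            if (len(text) >= min_chars and len(text) > best_len
                    and tags / len(text) <= max_tag_ratio):
                best, best_len = text, len(text)
    return best  # None: the page appears to have no article content
```

Returning `None` instead of a best guess is a deliberate choice in this sketch, since pages with no content are one of the pitfalls an extractor has to handle.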

Clean. 2/2

• Pitfalls
  – Pages with no content
  – Comments
  – Copyright notices

• Evaluation
  – 150 sites, one page per site
    • include content-less pages
  – 95% precision, 95% recall
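
The slides do not say at which granularity precision and recall were measured. As an illustration only, the sketch below assumes a token-level comparison between the extracted text and a hand-labelled reference, and assumes that extracting nothing from a content-less page counts as a perfect result.

```python
from collections import Counter


def precision_recall(extracted, gold):
    """Token-level precision and recall of extracted text against a reference."""
    ext, ref = Counter(extracted.split()), Counter(gold.split())
    if not ext and not ref:
        return 1.0, 1.0  # content-less page, nothing extracted (assumed perfect)
    overlap = sum((ext & ref).values())  # multiset intersection
    precision = overlap / sum(ext.values()) if ext else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    return precision, recall
```

Under this convention, any text extracted from a content-less page (for example a copyright notice) scores zero on both measures.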

Enrich. 1/2

• Language detection:
  – 50 common languages: Chromium CLD
  – Long tail: Naive Bayes on character trigrams

• Language stats:
  – English 52%, German 7%, Spanish 7%, French 4%, Russian 3%, ..., Chinese 1%, Slovene 0.2%
  – 40 languages with >100 articles daily
  – 99% accuracy
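
For the long-tail languages, a Naive Bayes classifier over character trigrams can be sketched in a few lines. This shows the general technique only; the padding, the add-alpha smoothing and the class name `TrigramNB` are assumptions, and the Chromium CLD part is not covered.

```python
import math
from collections import Counter


def trigrams(text):
    """Character trigrams of the lowercased text, padded with spaces."""
    padded = f"  {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


class TrigramNB:
    def __init__(self, alpha=1.0):
        self.alpha = alpha       # add-alpha smoothing constant (assumed)
        self.counts = {}         # language -> Counter of trigram counts
        self.totals = {}         # language -> total number of trigrams
        self.docs = Counter()    # language -> number of training texts
        self.vocab = set()       # all trigrams seen during training

    def train(self, language, text):
        grams = trigrams(text)
        self.counts.setdefault(language, Counter()).update(grams)
        self.totals[language] = self.totals.get(language, 0) + len(grams)
        self.docs[language] += 1
        self.vocab.update(grams)

    def predict(self, text):
        """Return the most probable language, or None if untrained."""
        grams = trigrams(text)
        n_docs = sum(self.docs.values())
        v = len(self.vocab) + 1  # +1 reserves mass for unseen trigrams
        best, best_score = None, -math.inf
        for lang, counts in self.counts.items():
            score = math.log(self.docs[lang] / n_docs)  # log prior
            denom = self.totals[lang] + self.alpha * v
            for g in grams:
                score += math.log((counts[g] + self.alpha) / denom)
            if score > best_score:
                best, best_score = lang, score
        return best
```

A usage example with toy training data:

```python
nb = TrigramNB()
nb.train("en", "the quick brown fox jumps over the lazy dog and the cat")
nb.train("sl", "hitra rjava lisica skoči čez lenega psa in mačko")
print(nb.predict("the dog and the fox"))
```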

Enrich. 2/2

• enrycher.ijs.si
  – DMOZ categorization
  – Named entity detection, resolution
  – (Sentiment)
  – (Deep parsing)
  – English, Slovene, more languages coming

• Geo-tagging
  – Publisher (WHOIS, public listings)
  – Content (named entities)

Expose. Use.

• XML, gzip filesystem cache
• HTTP service (polling)
• Command-line client

• Live demo, API: http://newsfeed.ijs.si/
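
A consumer of the gzipped XML could look roughly like the sketch below. The slides do not show the actual schema, so the element and attribute names (`article`, `title`, `lang`) and the function `read_articles` are purely hypothetical; only the gzip-plus-XML combination comes from the slide.

```python
import gzip
import xml.etree.ElementTree as ET


def read_articles(gz_bytes):
    """Yield (title, language) pairs from one gzip-compressed XML batch.

    The element and attribute names are placeholders, not the real schema.
    """
    root = ET.fromstring(gzip.decompress(gz_bytes))
    for art in root.iter("article"):
        yield art.findtext("title", default=""), art.get("lang", "")


# Usage with a locally built sample instead of a real download:
sample = gzip.compress(
    b'<articles><article lang="en"><title>Hello</title></article></articles>'
)
print(list(read_articles(sample)))
```

A polling client would call such a function on each newly fetched batch and remember which batch it last processed.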

Technology.

• Data volume: 100 000 articles/day
  Peak throughput: 10 articles/second

• One machine for semantic processing
  One machine for everything else

• Processing: Python, (Java, C++)
  Infrastructure: PostgreSQL, zeromq
  – Downloaders communicate through the DB
  – Processing strictly sequential, service-oriented
    • Each service: In case of errors, pass through
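
The "strictly sequential, pass through on errors" policy can be sketched as a loop over services, where a failing service leaves the article as it was and the next service still runs. This in-process version is an illustration only: the real services communicate over zeromq and the database, and the name `run_pipeline` and the dict-based article are assumptions.

```python
import logging


def run_pipeline(article, services):
    """Apply services in order; a failing service leaves the article unchanged."""
    for service in services:
        try:
            # Work on a copy so a failure leaves no partial edits behind.
            article = service(dict(article))
        except Exception:
            name = getattr(service, "__name__", repr(service))
            logging.exception("service %s failed; passing article through", name)
    return article
```

With this policy a broken enrichment step costs some annotations on the affected articles but never blocks the feed.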

The Bright Future.

• Feed quality management

• Increase the number of sources– Non-western in particular

• Compute news clusters