Internals of an Aggregated Web News Feed
newsfeed.ijs.si
Mitja Trampuš and Blaž Novak
AI Lab, Jozef Stefan Institute
Monitor. Download. → Clean. Enrich. → Expose. Use.
Monitor. Download.
• Sources: RSS, Google News, private feeds
  – 150 000 feeds
  – 15 000 publishers
• Sources of sources:
  – Bootstrap from public listings
  – Parse news articles for <link> entries
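Discovering new feeds from <link> entries can be sketched as follows. This is a minimal illustration, not the system's actual code: the names FeedLinkParser and discover_feeds are hypothetical, and the sketch assumes feeds are advertised the usual way, through <link rel="alternate"> tags with an RSS or Atom MIME type.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types that conventionally mark a feed in a <link rel="alternate"> tag.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collects feed URLs advertised by <link> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = dict(attrs)
        rel = (a.get("rel") or "").lower().split()
        mime = (a.get("type") or "").lower().strip()
        href = a.get("href")
        if "alternate" in rel and mime in FEED_TYPES and href:
            # Resolve relative hrefs against the article URL.
            self.feeds.append(urljoin(self.base_url, href))


def discover_feeds(html, base_url):
    """Return absolute URLs of all feeds advertised in an article page."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds
```

Each discovered URL would then become a candidate for the list of monitored feeds.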
Monitor. Download.
• Quality management:
  – Punish technical errors
  – Adjustable crawl time
• Discovery delay for articles: 3 hours
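One way to combine error punishment with an adjustable crawl time is a per-feed polling interval that backs off on failures and tightens when a feed is active. The sketch below is an assumption about how this could look; FeedState, update_after_poll, and all constants are hypothetical and not taken from the system.

```python
from dataclasses import dataclass

MIN_INTERVAL = 5 * 60         # seconds; hypothetical lower bound
MAX_INTERVAL = 24 * 60 * 60   # seconds; hypothetical upper bound


@dataclass
class FeedState:
    interval: float = 60 * 60   # seconds until the next poll
    error_count: int = 0


def update_after_poll(state, ok, new_articles):
    """Adjust a feed's polling interval after one download attempt."""
    if not ok:
        # Punish technical errors (timeouts, HTTP 5xx, broken XML): back off.
        state.error_count += 1
        state.interval *= 2
    else:
        state.error_count = 0
        if new_articles > 0:
            state.interval *= 0.5   # active feed: poll more often
        else:
            state.interval *= 1.5   # quiet feed: poll less often
    # Keep the interval within the allowed range.
    state.interval = max(MIN_INTERVAL, min(MAX_INTERVAL, state.interval))
    return state
```

With a scheme like this, busy feeds settle at short intervals, while dead or broken feeds drift towards the upper bound and consume few requests.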
Clean. 1/2
• Methods in published papers work great
  – If evaluated on 10 sites
• Heuristic: Find the first block-level HTML element with lots of <p>aragraphs
  – failing that, a <td> or <div> with lots of text
  – avoid elements with lots of markup
  – site-independent
• Support for rNews/Schema.org
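The first rule of the heuristic can be sketched with the standard-library HTML parser. This is a simplified illustration only: it implements just the "first block with lots of <p>" rule, leaving out the <td>/<div> fallback and the markup-density check. The CONTAINERS set, the MIN_PARAGRAPHS threshold, and all names are assumptions.

```python
from html.parser import HTMLParser

CONTAINERS = {"body", "div", "td", "article", "section"}  # assumed block set
MIN_PARAGRAPHS = 3   # hypothetical threshold for "lots of <p>"


class _Block:
    def __init__(self, tag):
        self.tag = tag
        self.paragraphs = []   # text of <p> elements credited to this block


class ParagraphCollector(HTMLParser):
    """Credits every <p> to the innermost open container element."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # all containers, in document order
        self.stack = []    # currently open containers
        self.in_p = False

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINERS:
            block = _Block(tag)
            self.blocks.append(block)
            self.stack.append(block)
            self.in_p = False
        elif tag == "p" and self.stack:
            self.stack[-1].paragraphs.append("")
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False
        elif tag in CONTAINERS:
            self.in_p = False
            # Pop back to the nearest open container with this tag, if any.
            for i in range(len(self.stack) - 1, -1, -1):
                if self.stack[i].tag == tag:
                    del self.stack[i:]
                    break

    def handle_data(self, data):
        if self.in_p and self.stack:
            self.stack[-1].paragraphs[-1] += data


def extract_text(html):
    """Return the text of the first block with enough paragraphs, or None."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    for block in parser.blocks:
        paragraphs = [" ".join(p.split()) for p in block.paragraphs]
        paragraphs = [p for p in paragraphs if p]
        if len(paragraphs) >= MIN_PARAGRAPHS:
            return "\n".join(paragraphs)
    return None
```

Returning None for pages without a paragraph-rich block lets the caller treat them as content-less rather than emit navigation or footer text.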
Clean. 2/2
• Pitfalls
  – Pages with no content
  – Comments
  – Copyright notices
• Evaluation
  – 150 sites, one page per site
    • include content-less pages
  – 95% precision, 95% recall
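The slides do not say at what granularity precision and recall were measured. One plausible setup, shown purely as an assumption, compares the extracted text with a hand-made reference at the token level. The function name and the convention for empty texts (needed for content-less pages) are hypothetical.

```python
from collections import Counter


def precision_recall(extracted, reference):
    """Token-level precision and recall of extracted text vs. a reference."""
    ext = Counter(extracted.split())
    ref = Counter(reference.split())
    # Multiset intersection: tokens present in both, counted with multiplicity.
    overlap = sum((ext & ref).values())
    n_ext = sum(ext.values())
    n_ref = sum(ref.values())
    # Assumed convention: extracting nothing yields perfect precision, and an
    # empty reference (a content-less page) yields perfect recall.
    precision = overlap / n_ext if n_ext else 1.0
    recall = overlap / n_ref if n_ref else 1.0
    return precision, recall
```

Under this convention, extracting a comment thread or a copyright notice from a content-less page lowers precision to 0, so both pitfalls show up in the score.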
Enrich. 1/2
• Language detection:
  – 50 common languages: Chromium CLD
  – Long tail: Naive Bayes on character trigrams
• Language stats:
  – English 52%, German 7%, Spanish 7%, French 4%, Russian 3%, ..., Chinese 1%, Slovene 0.2%
  – 40 languages with >100 articles daily
  – 99% accuracy
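The long-tail classifier, Naive Bayes on character trigrams, can be sketched in a few lines. This toy version assumes uniform language priors and add-one smoothing; the class name TrigramNB and its methods are hypothetical, and a real model would be trained on far more text per language.

```python
import math
from collections import Counter, defaultdict


def trigrams(text):
    """Overlapping character trigrams of the lowercased, space-padded text."""
    padded = f"  {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


class TrigramNB:
    """Multinomial Naive Bayes over character trigrams (toy sketch)."""

    def __init__(self):
        self.counts = defaultdict(Counter)   # language -> trigram counts
        self.totals = Counter()              # language -> total trigrams seen
        self.vocab = set()                   # all trigrams seen in training

    def train(self, lang, text):
        grams = trigrams(text)
        self.counts[lang].update(grams)
        self.totals[lang] += len(grams)
        self.vocab.update(grams)

    def classify(self, text):
        grams = trigrams(text)
        v = len(self.vocab)
        best_lang, best_score = None, -math.inf
        for lang, counts in self.counts.items():
            # Log-likelihood with add-one smoothing; uniform prior assumed.
            score = sum(
                math.log((counts[g] + 1) / (self.totals[lang] + v))
                for g in grams
            )
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang
```

Usage, with two tiny training samples:

```python
nb = TrigramNB()
nb.train("en", "the quick brown fox jumps over the lazy dog and reads the daily news about the world")
nb.train("sl", "hitra rjava lisica skoči čez lenega psa in bere dnevne novice o dogajanju po svetu")
print(nb.classify("news of the world today"))
print(nb.classify("dnevne novice po svetu"))
```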
Enrich. 2/2
• enrycher.ijs.si
  – DMOZ categorization
  – Named entity detection, resolution
  – (Sentiment)
  – (Deep parsing)
  – English, Slovene, more languages coming
• Geo-tagging
  – Publisher (WHOIS, public listings)
  – Content (named entities)
Expose. Use.
• XML, gzip filesystem cache
• HTTP service (polling)
• Command-line client
• Live demo, API: http://newsfeed.ijs.si/
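A consumer of the gzipped XML output might decode a batch as sketched below. The slides do not describe the XML schema, so the element and attribute names here (article, id, title) are purely hypothetical placeholders, as is the function name parse_batch.

```python
import gzip
import xml.etree.ElementTree as ET


def parse_batch(gz_bytes):
    """Decompress one gzipped XML batch and return a list of article dicts.

    The <article id="..."><title>...</title></article> layout is an
    assumption made for illustration, not the service's actual schema.
    """
    root = ET.fromstring(gzip.decompress(gz_bytes))
    return [
        {"id": a.get("id"), "title": a.findtext("title", default="")}
        for a in root.iter("article")
    ]
```

The same function would work on a file read from the filesystem cache and on a response body fetched by polling the HTTP service.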
Technology.
• Data volume: 100 000 articles/day
  Peak throughput: 10 articles/second
• One machine for semantic processing
  One machine for everything else
• Processing: Python, (Java, C++)
  Infrastructure: PostgreSQL, zeromq
  – Downloaders communicate through the DB
  – Processing strictly sequential, service-oriented
    • Each service: In case of errors, pass through
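The "strictly sequential, pass through on errors" design can be illustrated in-process. In this sketch each service is a plain function from article dict to article dict; the real system connects services over zeromq, which is omitted here. The name run_pipeline and the example services are hypothetical.

```python
import logging

logger = logging.getLogger("pipeline")


def run_pipeline(article, services):
    """Run the services in order; a failing service is skipped."""
    for service in services:
        try:
            # Hand over a copy so a failing service cannot leave the
            # article half-modified (shallow copy only).
            result = service(dict(article))
        except Exception:
            logger.exception("service %s failed, passing article through",
                             getattr(service, "__name__", service))
            continue
        article = result
    return article


# Hypothetical example services.
def detect_language(article):
    article["lang"] = "en"
    return article


def broken_service(article):
    raise RuntimeError("simulated failure")


def categorize(article):
    article["category"] = "news"
    return article
```

With this arrangement a crashing enrichment step costs only its own annotations; the article still reaches the later services and the output.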
The Bright Future.
• Feed quality management
• Increase the number of sources– Non-western in particular
• Compute news clusters