JSI News Crawler Blaz Novak, Mitja Trampus, Blaz Fortuna, Marko Grobelnik JSI

Page 1: JSI News Crawler

JSI News Crawler

Blaz Novak, Mitja Trampus, Blaz Fortuna, Marko Grobelnik

JSI

Page 2: JSI News Crawler

JSI News Crawler

• The goal is to collect most of the world's news articles, including relevant blog posts
• Why collect data?
  • To be independent of commercial data providers, since commercial data providers (like Spinn3r, GNIP, DataSift) are expensive and not flexible in terms of data sources and additional services
  • To provide a data stream free of charge for research
• What data is available?
  • Database dumps
  • Articles annotated with Enrycher metadata
  • Clusters of similar articles
  • Real-time feed

Page 3: JSI News Crawler

Architecture

Diagram components:
• Open Web
• JSI Crawler
• Database of Collected Articles
• Web Service API
• ArchiveExplorer
• Control Panel
• Enrycher
• Real-Time Analytics
• Developers
• XML/RDF

Content in form:
• Clean text
• Linguistics
• Social Graph
• LOD Links
• Time
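
As a rough illustration of the content forms named on this slide (clean text, linguistics, social graph, LOD links, time), the sketch below models one article as a record and serializes it to XML. The class name, field names, and XML element names are all assumptions made for this example; the slides do not show the crawler's actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
import xml.etree.ElementTree as ET


@dataclass
class Article:
    """Hypothetical record mirroring the content forms listed on the slide."""
    clean_text: str                                   # "Clean text"
    time: datetime                                    # "Time"
    linguistics: dict = field(default_factory=dict)   # "Linguistics", e.g. {"lang": "en"}
    social_graph: list = field(default_factory=list)  # "Social Graph", as (from, to) pairs
    lod_links: list = field(default_factory=list)     # "LOD Links", Linked Open Data URIs


def article_to_xml(article: Article) -> str:
    """Serialize an Article to an XML string (element names are invented here)."""
    root = ET.Element("article", attrib={"time": article.time.isoformat()})
    ET.SubElement(root, "clean-text").text = article.clean_text
    ling = ET.SubElement(root, "linguistics")
    for key, value in article.linguistics.items():
        ET.SubElement(ling, "feature", attrib={"name": key}).text = str(value)
    graph = ET.SubElement(root, "social-graph")
    for source, target in article.social_graph:
        ET.SubElement(graph, "edge", attrib={"from": source, "to": target})
    links = ET.SubElement(root, "lod-links")
    for uri in article.lod_links:
        ET.SubElement(links, "link", attrib={"href": uri})
    return ET.tostring(root, encoding="unicode")


# Illustrative values only; not taken from the real stream.
example = Article(
    clean_text="Example headline and body.",
    time=datetime(2012, 1, 1, tzinfo=timezone.utc),
    linguistics={"lang": "en"},
    social_graph=[("Alice", "Bob")],
    lod_links=["http://dbpedia.org/resource/Slovenia"],
)
xml_doc = article_to_xml(example)
```

A flat record like this keeps each content form independently optional, which would suit a stream where some annotations arrive later than the clean text.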

Page 4: JSI News Crawler

Current statistics

• Data sources: ~110,000 unique websites
• Stream size: ~192,000 articles/day
• ~150 distinct languages
  • good coverage of minority languages
• Current archive of ~35,000,000 articles
• Clear-text and language identification available
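
To give a sense of scale, the figures on this slide can be turned into a few derived rates. This is only back-of-the-envelope arithmetic on the rounded numbers above; the variable names are chosen for this example, and the last figure assumes a constant ingest rate, which the slides do not claim.

```python
# Approximate figures quoted on the slide.
SOURCES = 110_000            # unique websites
ARTICLES_PER_DAY = 192_000   # stream size
ARCHIVE_SIZE = 35_000_000    # articles currently archived

SECONDS_PER_DAY = 24 * 60 * 60

# Average ingest rate of the stream: about 2.2 articles per second.
articles_per_second = ARTICLES_PER_DAY / SECONDS_PER_DAY

# Average output per source: about 1.75 articles per website per day.
articles_per_source_per_day = ARTICLES_PER_DAY / SOURCES

# Days of streaming at today's rate that the archive corresponds to: about 182.
# The real archive grew at varying rates, so this is only an order of magnitude.
archive_in_days_of_stream = ARCHIVE_SIZE / ARTICLES_PER_DAY

print(f"{articles_per_second:.2f} articles/s")
print(f"{articles_per_source_per_day:.2f} articles/source/day")
print(f"{archive_in_days_of_stream:.0f} days of stream at the current rate")
```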

Page 5: JSI News Crawler

Sample Article from the stream

Page 6: JSI News Crawler

Download volume, yearly scale (2010)

Today's download volume, after adding 3k new sources + 1 week of backlog

Average and maximum number of story articles in a cluster (today)

Control Panel

Page 7: JSI News Crawler

Plans

• In the first half of 2012, the plan is to release the service for public use
• …in the future, additional semantic annotation services will be added to provide additional value to the streamed data