Georgi Kobilarov, Chris Bizer, Christian Becker
Freie Universität Berlin
Hello again
Georgi Kobilarov
Researcher at Freie Universität Berlin
DBpedia Development Lead
Agenda
Status Quo
Technical Overview
Challenges
Outlook
How to extract Wikipedia data and how to not do it
Lessons learned
Title
Description
Languages
Web Links
Categorization
Domain-specific Data
Images
Infoboxes
<http://dbpedia.org/resource/Hewlett-Packard>
rdfs:label “Hewlett-Packard”
p:foundation dbpedia:Palo_Alto
p:keypeople dbpedia:Bill_Hewlett
p:keypeople dbpedia:David_Packard
p:keypeople dbpedia:Mark_V._Hurd
p:industry dbpedia:Computer_Systems
p:industry dbpedia:Computer_software
p:revenue 104300000000 $
p:netincome 7300000000 $
p:employees 156000
p:slogan “Invent”
Problems
Poor Abstract extraction
Property synonyms
Redirects
Missing class hierarchy
Range validation
Property Synonyms
Redirects
Florida located_in USA
California located_in United_States
USA redirects_to United_States
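One way to handle the redirect problem is to map every resource name to its canonical form before triples are emitted. The following is a minimal Python sketch, assuming redirects are available as a plain dictionary; the function names `resolve` and `canonicalise` are illustrative and not part of the DBpedia framework.

```python
def resolve(name, redirects):
    """Follow a redirect chain to its end, stopping if a cycle is detected."""
    seen = set()
    while name in redirects and name not in seen:
        seen.add(name)
        name = redirects[name]
    return name


def canonicalise(triples, redirects):
    """Rewrite subject and object of every triple to the redirect target."""
    return [
        (resolve(s, redirects), p, resolve(o, redirects))
        for s, p, o in triples
    ]


# Data taken from the slide above.
redirects = {"USA": "United_States"}
triples = [
    ("Florida", "located_in", "USA"),
    ("California", "located_in", "United_States"),
]
canonical = canonicalise(triples, redirects)
```

After this step both Florida and California point at the same resource, so a query for things located in United_States finds both.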
Class Hierarchy
„Select all PEOPLE born in …“
Range Validation
dbpedia:Google
keyperson Eric Schmidt
keyperson Sergey Brin
keyperson Larry Page
keyperson CEO
keyperson Chairman
Range Validation
Technical Overview
And how does it work?
Extraction Framework (and a lot of regular expressions)
Extraction Framework
Open Source http://dbpedia.svn.sourceforge.net
implemented in PHP
Extraction Framework
Data Input (PageCollections)
DatabaseWikipedia
LiveWikipedia
Extraction Framework
Data Processing (Extractors)
InfoboxExtractor
LabelExtractor
CategoryExtractor
RedirectExtractor
GeoExtractor
Extraction Framework
Data Output (Destinations)
SimpleDumpDestination (stdout)
NTripleDumpDestination
Extraction Framework
Tie things together
Extraction Manager
Extraction Jobs
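The four building blocks above (page collections, extractors, destinations, and a manager running jobs) can be illustrated with a small sketch. The real framework is written in PHP; this is a Python approximation, and every class and method name in it (`InMemoryPageCollection`, `get_source`, `extract`, `accept`, `run`) is an assumption chosen for illustration, not the framework's actual API.

```python
import io

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
RESOURCE_BASE = "http://dbpedia.org/resource/"


class InMemoryPageCollection:
    """Data input: returns the wiki source for a page title."""

    def __init__(self, pages):
        self._pages = pages

    def get_source(self, title):
        return self._pages[title]


class LabelExtractor:
    """Data processing: derives a label triple from the page title."""

    def extract(self, title, source):
        return [(RESOURCE_BASE + title, RDFS_LABEL, title.replace("_", " "))]


class NTripleDestination:
    """Data output: writes triples with literal objects as N-Triples lines."""

    def __init__(self, stream):
        self._stream = stream

    def accept(self, triple):
        s, p, o = triple
        literal = o.replace("\\", "\\\\").replace('"', '\\"')
        self._stream.write(f'<{s}> <{p}> "{literal}" .\n')


class ExtractionJob:
    """Bundles one input, a list of extractors, one output and the page titles."""

    def __init__(self, collection, extractors, destination, titles):
        self.collection = collection
        self.extractors = extractors
        self.destination = destination
        self.titles = titles


class ExtractionManager:
    """Ties things together: runs every extractor of a job over every page."""

    def run(self, job):
        for title in job.titles:
            source = job.collection.get_source(title)
            for extractor in job.extractors:
                for triple in extractor.extract(title, source):
                    job.destination.accept(triple)


out = io.StringIO()
job = ExtractionJob(
    InMemoryPageCollection({"Bristol": "placeholder wiki source"}),
    [LabelExtractor()],
    NTripleDestination(out),
    ["Bristol"],
)
ExtractionManager().run(job)
```

Keeping input, processing and output behind separate interfaces means a new extractor or a different destination can be added without touching the others.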
DBpedia Dataset
Provided as RDF Dumps
Updated every 3 months
Hosted by OpenLink Software
Available as Linked Data
SPARQL Endpoint
http://dbpedia.org/sparql
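As a rough illustration of how a client might talk to this endpoint, the sketch below builds an HTTP GET request following the standard SPARQL protocol (the query goes into a `query` URL parameter). The helper name `build_sparql_request` is made up for this example, and the request is only constructed, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://dbpedia.org/sparql"


def build_sparql_request(query, endpoint=ENDPOINT):
    """Build (but do not send) a SPARQL protocol GET request."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the standard SPARQL JSON results format.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


query = """
SELECT ?label WHERE {
  <http://dbpedia.org/resource/Hewlett-Packard>
      <http://www.w3.org/2000/01/rdf-schema#label> ?label .
}
"""
request = build_sparql_request(query)
# To execute it, pass `request` to urllib.request.urlopen and parse the JSON body.
```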
Linked Data
Use URIs as names for things.
Use HTTP URIs so that people can look up those names.
When someone looks up a URI, provide useful information.
Include links to other URIs, so that they can discover more things.
HTTP URIs
Information Resources
http://dbpedia.org/page/Bristol
HTTP GET -> 200 OK
Non-Information Resources
http://dbpedia.org/resource/Bristol
HTTP GET -> 303 See Other
http://dbpedia.org/page/Bristol or http://dbpedia.org/data/Bristol
-> 200 OK
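The 303 behaviour can be sketched as a small function that decides what a server would answer for a given URI. This is a simplified model and not DBpedia's server code: it assumes that a client asking for `application/rdf+xml` is sent to the `/data/` document and everyone else to the `/page/` document, and it ignores q-values in the Accept header.

```python
BASE = "http://dbpedia.org"


def dereference(uri, accept="text/html"):
    """Return (status, location) for a URI under the scheme shown above."""
    resource_prefix = BASE + "/resource/"
    if uri.startswith(resource_prefix):
        # Non-information resource: redirect to a document about it.
        name = uri[len(resource_prefix):]
        target = "data" if "application/rdf+xml" in accept else "page"
        return 303, f"{BASE}/{target}/{name}"
    if uri.startswith((BASE + "/page/", BASE + "/data/")):
        # Information resource: served directly.
        return 200, None
    return 404, None


html_answer = dereference("http://dbpedia.org/resource/Bristol")
rdf_answer = dereference("http://dbpedia.org/resource/Bristol", "application/rdf+xml")
```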
How to get started
Documentation http://wiki.dbpedia.org/Documentation
Source Code
start.php
Next Tasks
Improve Extractors
Cleaner Abstracts
Include Redirects into Extraction Process
Fix more Extraction Bugs http://sourceforge.net/projects/dbpedia/
Provide Live Update Service
Infobox Extraction
One script to rule them all
Not sufficient
Next Challenges
Next challenges
Higher Data Quality + Ontologies
Consistency Checks
Augmentation
Live Updates
Live Updates
Wikipedia Update Stream
Extraction Cluster
Named Graphs
Augmentation
Enrich DBpedia with data from:
1. other languages
2. external datasets
Consistency Checks
German Wikipedia says Berlin's population is X
Italian Wikipedia says it's Y
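A consistency check of this kind could compare the values that different language editions report for the same property and flag those that disagree. The sketch below is one possible approach under simple assumptions: values are numeric, and a relative tolerance decides what counts as a disagreement. The numbers are placeholders, not real figures.

```python
def find_conflicts(claims, tolerance=0.01):
    """Return the (subject, property) pairs whose per-source values disagree.

    `claims` maps (subject, property) to a dict of {source: numeric value}.
    Values differing by more than `tolerance` (relative) count as a conflict.
    """
    conflicts = {}
    for key, by_source in claims.items():
        values = list(by_source.values())
        low, high = min(values), max(values)
        if high - low > tolerance * max(abs(high), abs(low)):
            conflicts[key] = by_source
    return conflicts


# Placeholder numbers standing in for X and Y on the slide.
claims = {
    ("Berlin", "population"): {"de": 1000, "it": 1100},
    ("Berlin", "some_other_property"): {"de": 50, "it": 50},
}
conflicts = find_conflicts(claims)
```

Detecting the conflict is the easy part; deciding which value is right is where the next slide's point about needing humans comes in.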
Data Quality
We need humans
The Vision
Semantic Web
Users shouldn’t care
Semantic Web
Users shouldn’t have to care (del.icio.us lesson)
DBpedia Extraction
Wikipedia -> DBpedia Extraction Framework -> Triple Store
Freebase Extraction
Wikipedia -> Extraction -> Metaweb Graph Store
What is the Wikipedia for Data?
Wikipedia is the Wikipedia for Data
Crowd Sourced Extraction
Where's the user benefit?
Users
Mashup Developer
Benefit
Outlook
Infobox Extraction
We need a new approach
Break it down into smaller pieces
Step 1: Create an ontology
Five domains:
people, places, organisations, events, works
People
Actor, Athlete, Journalist, MusicalArtist, Politician, Scientist, Writer
Places
Airport, City, Country, Island, Mountain, River
Organisations
Band, Company, Educational Institution, Radio Station, Sports Team
Event
Convention, Military Conflict, Music Event, Sport Event
Work
Book, Broadcast, Film, Software, Television
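With such an ontology in place, a query like "select all people" becomes a walk up the class hierarchy. The sketch below shows the idea with a handful of the classes listed above. The singular top-level names (`Person`, `Place`) and the resource names are assumptions made for this example.

```python
# Child class -> parent class, for a few of the classes listed above.
SUBCLASS_OF = {
    "Actor": "Person",
    "Athlete": "Person",
    "Writer": "Person",
    "Cricketer": "Athlete",
    "City": "Place",
    "River": "Place",
}


def is_a(cls, ancestor):
    """True if `cls` equals `ancestor` or is a direct or indirect subclass of it."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def instances_of(ancestor, types):
    """All resources whose class falls under `ancestor`, sorted by name."""
    return sorted(r for r, cls in types.items() if is_a(cls, ancestor))


# Hypothetical resources and their most specific class.
types = {"Resource_A": "Cricketer", "Resource_B": "Writer", "Resource_C": "City"}
people = instances_of("Person", types)
```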
Step 2: Template Mapping
Infobox Cricketer
Infobox Historic Cricketer
Infobox Recent Cricketer
Infobox Old Cricketer
Infobox Cricketer Biography
=> Class Cricketer (Athlete)
Step 2: Template Mapping
Class TV Episode (Work)
Wikipedia Templates:
Television Episode
UK Office Episode
Simpsons Episode
DoctorWhoBox
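At its simplest, the template mapping is a lookup table from template names to ontology classes. A minimal sketch using the templates named on the two slides above. The normalisation rule (case-insensitive, underscores treated as spaces) is an assumption about how template names might vary, not something the slides specify.

```python
TEMPLATE_TO_CLASS = {
    "Infobox Cricketer": "Cricketer",
    "Infobox Historic Cricketer": "Cricketer",
    "Infobox Recent Cricketer": "Cricketer",
    "Infobox Old Cricketer": "Cricketer",
    "Infobox Cricketer Biography": "Cricketer",
    "Television Episode": "TV Episode",
    "UK Office Episode": "TV Episode",
    "Simpsons Episode": "TV Episode",
    "DoctorWhoBox": "TV Episode",
}


def normalise(name):
    """Lower-case, treat underscores as spaces, collapse repeated whitespace."""
    return " ".join(name.replace("_", " ").split()).lower()


_LOOKUP = {normalise(k): v for k, v in TEMPLATE_TO_CLASS.items()}


def class_for_template(name):
    """Return the mapped class for a template name, or None if unmapped."""
    return _LOOKUP.get(normalise(name))


mapped = class_for_template("infobox_cricketer")
```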
Step 3: Parsers
Handle Templates Values specifically
Example: Property splitting
Person born „1.1.1980, [[Berlin]]“
=> split to
birthplace Berlin
birthdate 1980-01-01
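The property-splitting example can be written as a small parser. This sketch assumes the date is given in day.month.year order and that the first wiki link in the value is the birthplace; the function name `split_born` is invented for illustration.

```python
import re
from datetime import date

DATE_RE = re.compile(r"(\d{1,2})\.(\d{1,2})\.(\d{4})")
# Matches [[Target]] and [[Target|display text]], capturing the target.
LINK_RE = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def split_born(value):
    """Split a raw 'born' value into birthdate and birthplace where present."""
    result = {}
    date_match = DATE_RE.search(value)
    if date_match:
        day, month, year = map(int, date_match.groups())
        # date() raises ValueError for impossible dates such as 31.2.1980.
        result["birthdate"] = date(year, month, day).isoformat()
    link_match = LINK_RE.search(value)
    if link_match:
        result["birthplace"] = link_match.group(1).strip()
    return result


born = split_born("1.1.1980, [[Berlin]]")
```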
Step 3: Parsers
Example: Class Rules
MusicalArtist
If property „currentMembers“ is set => Group
Otherwise => Person
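The class rule above amounts to a one-line decision. A sketch, assuming infobox properties arrive as a dictionary and that "set" means present and non-empty:

```python
def classify_musical_artist(properties):
    """Apply the rule above: a non-empty currentMembers value means a group."""
    return "Group" if properties.get("currentMembers") else "Person"


band = classify_musical_artist({"currentMembers": "[[Member A]], [[Member B]]"})
solo = classify_musical_artist({"genre": "Rock"})
```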
Step 3: Parsers
Example: Range Validation
Google keypeople
„[[Eric Schmidt]] ([[CEO]], [[Chairman]]), [[Sergey Brin]], [[Larry Page]]“
Company#keyperson range Person#Class
Google keyperson Eric Schmidt, Sergey Brin, Larry Page
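Range validation can be sketched as a filter over the links found in a property value: only links whose target has the class the property expects are kept. The lookup tables here are hypothetical stand-ins for the ontology and for already-extracted type information.

```python
import re

# Matches [[Target]] and [[Target|display text]], capturing the target.
LINK_RE = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")

# Hypothetical lookups: expected range per property, known class per resource.
PROPERTY_RANGE = {"keyperson": "Person"}
RESOURCE_CLASS = {
    "Eric Schmidt": "Person",
    "Sergey Brin": "Person",
    "Larry Page": "Person",
    # CEO and Chairman are articles about roles, so they have no Person type.
}


def extract_valid_links(prop, raw_value):
    """Keep only the linked resources whose class matches the property's range."""
    expected = PROPERTY_RANGE[prop]
    links = [m.group(1).strip() for m in LINK_RE.finditer(raw_value)]
    return [link for link in links if RESOURCE_CLASS.get(link) == expected]


raw = "[[Eric Schmidt]] ([[CEO]], [[Chairman]]), [[Sergey Brin]], [[Larry Page]]"
key_people = extract_valid_links("keyperson", raw)
```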
Step 4: Crowd Source it
Step 4: Crowd Source it
Linking Framework
Interlinking Framework
Interlinking Framework
„Apple“
Apple
Microsoft
Apple
Orange
Pear
Orange
Vodafone
T-Mobile
Context
Similarity
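The Apple and Orange example suggests picking the meaning of an ambiguous label by comparing the terms around it with what is known about each candidate. The sketch below uses Jaccard similarity for that comparison. Both the choice of measure and the candidate profiles are assumptions for illustration; the slides do not say which similarity the interlinking framework uses.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def disambiguate(context, candidates):
    """Return the candidate whose profile is most similar to the context."""
    return max(candidates, key=lambda name: jaccard(context, candidates[name]))


# Hypothetical profiles: terms known to be related to each candidate meaning.
apple_candidates = {
    "Apple (company)": {"Microsoft", "Computer", "Software"},
    "Apple (fruit)": {"Orange", "Pear", "Fruit"},
}

in_tech_list = disambiguate({"Microsoft"}, apple_candidates)
in_fruit_list = disambiguate({"Orange", "Pear"}, apple_candidates)
```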
Linking: The Future
Hosted Webservice for Linked Data publishers
Summary
http://dbpedia.org
Georgi Kobilarov
Freie Universität Berlin