Georgi Kobilarov, Chris Bizer, Christian Becker, Freie Universität Berlin

DBpedia Framework - BBC Talk

Page 1: DBpedia Framework - BBC Talk

Georgi Kobilarov, Chris Bizer, Christian Becker

Freie Universität Berlin

Page 2: DBpedia Framework - BBC Talk

Hello again

Georgi Kobilarov

Researcher at Freie Universität Berlin

DBpedia Development Lead

Page 3: DBpedia Framework - BBC Talk

Agenda

Status Quo

Technical Overview

Challenges

Outlook

Page 4: DBpedia Framework - BBC Talk

How to extract Wikipedia data and how to not do it

Page 5: DBpedia Framework - BBC Talk

Lessons learned

Page 6: DBpedia Framework - BBC Talk

Title

Description

Languages

Web Links

Categorization

Domain-specific Data

Images

Infoboxes

Page 7: DBpedia Framework - BBC Talk
Page 8: DBpedia Framework - BBC Talk

<http://dbpedia.org/resource/Hewlett-Packard>

rdfs:label "Hewlett-Packard"

p:foundation dbpedia:Palo_Alto

p:keypeople dbpedia:Bill_Hewlett
p:keypeople dbpedia:David_Packard
p:keypeople dbpedia:Mark_V._Hurd

p:industry dbpedia:Computer_Systems
p:industry dbpedia:Computer_software

p:revenue 104300000000 $

p:netincome 7300000000 $

p:employees 156000

p:slogan "Invent"

Page 9: DBpedia Framework - BBC Talk

Problems

Poor Abstract extraction

Property synonyms

Redirects

Missing class hierarchy

Range validation

Page 10: DBpedia Framework - BBC Talk

Property Synonyms

Page 11: DBpedia Framework - BBC Talk

Redirects

Florida located_in USA

California located_in United_States

USA redirects_to United_States
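
The redirect problem on this slide can be sketched as follows: before triples are stored, every subject and object is rewritten to its canonical page name by following redirects. This is only a minimal illustration in Python (the framework itself is written in PHP), and the names `resolve` and `canonicalize` are hypothetical, not part of the DBpedia code.

```python
def resolve(name, redirects):
    """Follow redirects_to links until a canonical name is reached.

    A set of visited names guards against redirect cycles.
    """
    seen = set()
    while name in redirects and name not in seen:
        seen.add(name)
        name = redirects[name]
    return name


def canonicalize(triples, redirects):
    """Rewrite subjects and objects so every triple uses canonical names."""
    return [
        (resolve(s, redirects), p, resolve(o, redirects))
        for s, p, o in triples
    ]


# The three statements from the slide.
redirects = {"USA": "United_States"}
triples = [
    ("Florida", "located_in", "USA"),
    ("California", "located_in", "United_States"),
]
canonical = canonicalize(triples, redirects)
```

After canonicalization both Florida and California point at `United_States`, so a single query for that object finds both.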

Page 12: DBpedia Framework - BBC Talk

Class Hierarchy

„Select all PEOPLE born in …“

Page 13: DBpedia Framework - BBC Talk

Range Validation

dbpedia:Google

keyperson Eric Schmidt
keyperson Sergey Brin
keyperson Larry Page
keyperson CEO
keyperson Chairman

Page 14: DBpedia Framework - BBC Talk

Range Validation

Page 15: DBpedia Framework - BBC Talk

Technical Overview

Page 16: DBpedia Framework - BBC Talk

And how does it work?

Extraction Framework (and a lot of regular expressions)

Page 17: DBpedia Framework - BBC Talk

Extraction Framework

Open Source http://dbpedia.svn.sourceforge.net

implemented in PHP

Page 18: DBpedia Framework - BBC Talk

Extraction Framework

Data Input (PageCollections)

DatabaseWikipedia
LiveWikipedia

Page 19: DBpedia Framework - BBC Talk

Extraction Framework

Data Processing (Extractors)

InfoboxExtractor
LabelExtractor
CategoryExtractor
RedirectExtractor
GeoExtractor

Page 20: DBpedia Framework - BBC Talk

Extraction Framework

Data Output (Destinations)

SimpleDumpDestination (stdout)
NTripleDumpDestination

Page 21: DBpedia Framework - BBC Talk

Extraction Framework

Tie things together

Extraction Manager
Extraction Jobs
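
How the pieces on the last four slides fit together can be sketched roughly like this. The real framework is implemented in PHP; the Python below is only an illustration of the pattern (page collection in, extractors in the middle, destination out, a manager running jobs). Every class and method name here, including `DictPageCollection`, `ListDestination`, `get_source`, `extract` and `accept`, is an assumption made for the sketch, not the framework's actual API.

```python
from dataclasses import dataclass


class DictPageCollection:
    """Hypothetical data input: maps page titles to wiki source text."""

    def __init__(self, pages):
        self.pages = pages

    def get_source(self, title):
        return self.pages[title]


class LabelExtractor:
    """Data processing: emits one rdfs:label triple derived from the title."""

    def extract(self, title, source):
        return [(title, "rdfs:label", title.replace("_", " "))]


class ListDestination:
    """Hypothetical data output: collects triples in memory."""

    def __init__(self):
        self.triples = []

    def accept(self, triples):
        self.triples.extend(triples)


@dataclass
class ExtractionJob:
    """Bundles one input, the pages to process, extractors and one output."""
    collection: object
    titles: list
    extractors: list
    destination: object


class ExtractionManager:
    """Runs every extractor of a job over every page of that job."""

    def execute(self, job):
        for title in job.titles:
            source = job.collection.get_source(title)
            for extractor in job.extractors:
                job.destination.accept(extractor.extract(title, source))


destination = ListDestination()
job = ExtractionJob(
    collection=DictPageCollection({"Palo_Alto": "(wiki source omitted)"}),
    titles=["Palo_Alto"],
    extractors=[LabelExtractor()],
    destination=destination,
)
ExtractionManager().execute(job)
```

Swapping the input (database dump or live Wikipedia) or the output (stdout or N-Triples dump) then only means passing a different object into the job.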

Page 22: DBpedia Framework - BBC Talk

DBpedia Dataset

Provided as RDF Dumps

Updated every 3 months

Hosted by OpenLink Software

Available as Linked Data

Page 23: DBpedia Framework - BBC Talk

SPARQL Endpoint

http://dbpedia.org/sparql
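
As a rough sketch of how a client might address the endpoint: the SPARQL protocol passes the query text in a `query` URL parameter, so a request URL can be built with the standard library alone. The helper name `build_query_url` is hypothetical, and the example assumes that the `p:` prefix from the Hewlett-Packard slide expands to `http://dbpedia.org/property/`. No request is sent here.

```python
import urllib.parse

ENDPOINT = "http://dbpedia.org/sparql"


def build_query_url(endpoint, query):
    """Return a GET URL carrying the SPARQL query in the 'query' parameter."""
    return endpoint + "?" + urllib.parse.urlencode({"query": query})


# Ask for the key people of Hewlett-Packard (property URI is an assumption).
query = (
    "SELECT ?person WHERE { "
    "<http://dbpedia.org/resource/Hewlett-Packard> "
    "<http://dbpedia.org/property/keypeople> ?person }"
)
url = build_query_url(ENDPOINT, query)
```

The resulting URL can be opened with any HTTP client; the response format is typically chosen through the `Accept` header.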

Page 24: DBpedia Framework - BBC Talk

Linked Data

Use URIs as names for things.
Use HTTP URIs so that people can look up those names.
When someone looks up a URI, provide useful information.
Include links to other URIs, so that they can discover more things.

Page 25: DBpedia Framework - BBC Talk

HTTP URIs

Information Resources

http://dbpedia.org/page/Bristol

HTTP GET -> 200 OK

Non-Information Resources

http://dbpedia.org/resource/Bristol

HTTP GET -> 303 See Other
http://dbpedia.org/page/Bristol
http://dbpedia.org/data/Bristol

-> 200 OK
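
The 303 behaviour above can be sketched as a small function that picks the redirect target for a non-information resource. This is a simplified illustration, not the server's actual logic: it assumes that a request whose `Accept` header mentions `application/rdf+xml` is sent to the `/data/` document and everything else to the `/page/` document, and it ignores quality values. The function name `see_other_location` is hypothetical.

```python
RESOURCE_PREFIX = "http://dbpedia.org/resource/"


def see_other_location(uri, accept):
    """Return the Location a 303 See Other response would point to."""
    if not uri.startswith(RESOURCE_PREFIX):
        raise ValueError("not a resource URI: " + uri)
    name = uri[len(RESOURCE_PREFIX):]
    # Assumption: RDF clients are identified by this media type only.
    kind = "data" if "application/rdf+xml" in accept else "page"
    return "http://dbpedia.org/" + kind + "/" + name


html_target = see_other_location("http://dbpedia.org/resource/Bristol", "text/html")
rdf_target = see_other_location(
    "http://dbpedia.org/resource/Bristol", "application/rdf+xml"
)
```

A browser therefore ends up on the human-readable page, while an RDF client is sent to the data document for the same thing.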

Page 26: DBpedia Framework - BBC Talk

How to get started

Documentation http://wiki.dbpedia.org/Documentation

Source Code
start.php

Page 27: DBpedia Framework - BBC Talk

Next Tasks

Improve Extractors

Cleaner Abstracts
Include Redirects into Extraction Process

Fix more Extraction Bugs http://sourceforge.net/projects/dbpedia/

Provide Live Update Service

Page 28: DBpedia Framework - BBC Talk

Infobox Extraction

One script to rule them all

Not sufficient

Page 29: DBpedia Framework - BBC Talk

Next Challenges

Page 30: DBpedia Framework - BBC Talk

Next challenges

Higher Data Quality + Ontologies

Consistency Checks

Augmentation

Live Updates

Page 31: DBpedia Framework - BBC Talk

Live Updates

Wikipedia Update Stream

Extraction Cluster

Named Graphs

Page 32: DBpedia Framework - BBC Talk

Augmentation

Enrich DBpedia with data from:

1. other languages

2. external datasets

Page 33: DBpedia Framework - BBC Talk

Consistency Checks

German Wikipedia says Berlin's population is X

Italian Wikipedia says it's Y
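
One way such a consistency check could look, as a minimal sketch: collect the value each language edition states for the same subject and property, and report the facts where the editions disagree. The function name `find_conflicts` and the tuple layout are assumptions for illustration; X and Y stand in for the unspecified values from the slide.

```python
def find_conflicts(claims):
    """Group claims by (subject, property) and keep those with differing values.

    claims: iterable of (language, subject, property, value) tuples.
    Returns {(subject, property): {language: value}} for conflicting facts.
    """
    by_fact = {}
    for language, subject, prop, value in claims:
        by_fact.setdefault((subject, prop), {})[language] = value
    return {
        fact: values
        for fact, values in by_fact.items()
        if len(set(values.values())) > 1
    }


claims = [
    ("de", "Berlin", "population", "X"),
    ("it", "Berlin", "population", "Y"),
    ("de", "Berlin", "country", "Germany"),
    ("it", "Berlin", "country", "Germany"),
]
conflicts = find_conflicts(claims)
```

Deciding which of the conflicting values is right is a separate problem; the sketch only finds the disagreements.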

Page 34: DBpedia Framework - BBC Talk

Data Quality

We need humans

Page 35: DBpedia Framework - BBC Talk

The Vision

Page 36: DBpedia Framework - BBC Talk

Semantic Web

Users shouldn’t care

Page 37: DBpedia Framework - BBC Talk

Semantic Web

Users shouldn’t have to care (del.icio.us lesson)

Page 38: DBpedia Framework - BBC Talk

DBpedia Extraction

Wikipedia -> DBpedia Extraction Framework -> Triple Store

Page 39: DBpedia Framework - BBC Talk

Freebase Extraction

Wikipedia -> Extraction -> Metaweb Graph Store

Page 40: DBpedia Framework - BBC Talk

What is the Wikipedia for Data?

Page 41: DBpedia Framework - BBC Talk

Wikipedia is the Wikipedia for Data

Page 42: DBpedia Framework - BBC Talk
Page 43: DBpedia Framework - BBC Talk

Crowd Sourced Extraction

Where's the user benefit?

Page 44: DBpedia Framework - BBC Talk

Users

Mashup Developer

Page 45: DBpedia Framework - BBC Talk

Benefit

Page 46: DBpedia Framework - BBC Talk

Outlook

Page 47: DBpedia Framework - BBC Talk

Infobox Extraction

We need a new approach

Break it down into smaller pieces

Page 48: DBpedia Framework - BBC Talk

Step 1: Create an ontology

Five domains:

people, places, organisations, events, works

Page 49: DBpedia Framework - BBC Talk

People

Actor
Athlete
Journalist
MusicalArtist
Politician
Scientist
Writer

Page 50: DBpedia Framework - BBC Talk

Places

Airport
City
Country
Island
Mountain
River

Page 51: DBpedia Framework - BBC Talk

Organisations

Band
Company
Educational Institution
Radio Station
Sports Team

Page 52: DBpedia Framework - BBC Talk

Event

Convention
Military Conflict
Music Event
Sport Event

Page 53: DBpedia Framework - BBC Talk

Work

Book
Broadcast
Film
Software
Television

Page 54: DBpedia Framework - BBC Talk

Step 2: Template Mapping

Infobox Cricketer
Infobox Historic Cricketer
Infobox Recent Cricketer
Infobox Old Cricketer
Infobox Cricketer Biography

=> Class Cricketer (Athlete)

Page 55: DBpedia Framework - BBC Talk

Step 2: Template Mapping

Class TV Episode (Work)

Wikipedia Templates:

Television Episode
UK Office Episode
Simpsons Episode
DoctorWhoBox
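
The two template-mapping slides can be sketched as a pair of lookup tables: one from Wikipedia template name to ontology class, one from class to parent class. The table and function names are hypothetical, and the `Athlete` to `Person` link is an assumption based on Athlete being listed under the people domain.

```python
# Many templates map to one class.
TEMPLATE_TO_CLASS = {
    "Infobox Cricketer": "Cricketer",
    "Infobox Historic Cricketer": "Cricketer",
    "Infobox Recent Cricketer": "Cricketer",
    "Infobox Old Cricketer": "Cricketer",
    "Infobox Cricketer Biography": "Cricketer",
    "Television Episode": "TV Episode",
    "UK Office Episode": "TV Episode",
    "Simpsons Episode": "TV Episode",
    "DoctorWhoBox": "TV Episode",
}

# Class hierarchy; the Athlete -> Person link is an assumption.
SUPERCLASS = {
    "Cricketer": "Athlete",
    "Athlete": "Person",
    "TV Episode": "Work",
}


def classes_for(template):
    """Return the mapped class followed by all its ancestors, or [] if unmapped."""
    cls = TEMPLATE_TO_CLASS.get(template)
    if cls is None:
        return []
    chain = [cls]
    while chain[-1] in SUPERCLASS:
        chain.append(SUPERCLASS[chain[-1]])
    return chain


cricketer_classes = classes_for("Infobox Old Cricketer")
episode_classes = classes_for("DoctorWhoBox")
```

With the ancestors attached to every resource, a query such as "select all people born in ..." also finds cricketers without knowing any template names.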

Page 56: DBpedia Framework - BBC Talk

Step 3: Parsers

Handle Template Values specifically

Example: Property splitting

Person born „1.1.1980, [[Berlin]]“

=> split to
birthplace Berlin
birthdate 1980-01-01
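
A minimal sketch of such a property-splitting parser, assuming the date is written as day.month.year and the place is the first wiki link in the value. The function name `split_born` and both regular expressions are illustrative only; real infobox values vary far more than this handles.

```python
import re
from datetime import date

# day.month.year, e.g. 1.1.1980 (assumed order)
DATE_RE = re.compile(r"(\d{1,2})\.(\d{1,2})\.(\d{4})")
# [[Target]] or [[Target|label]]; captures the link target
LINK_RE = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def split_born(value):
    """Split a combined 'born' value into birthdate and birthplace."""
    result = {}
    date_match = DATE_RE.search(value)
    if date_match:
        day, month, year = map(int, date_match.groups())
        # date() raises ValueError for impossible dates such as 31.2.1980
        result["birthdate"] = date(year, month, day).isoformat()
    link_match = LINK_RE.search(value)
    if link_match:
        result["birthplace"] = link_match.group(1).strip().replace(" ", "_")
    return result


born = split_born("1.1.1980, [[Berlin]]")
```

Either part may be missing from the input, in which case the corresponding key is simply left out of the result.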

Page 57: DBpedia Framework - BBC Talk

Step 3: Parsers

Example: Class Rules

MusicalArtist

If property „currentMembers“ is set => Group

Otherwise => Person
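
The class rule above amounts to a one-line decision. In this sketch, "is set" is interpreted as "present and non-empty", which is an assumption; the function name is hypothetical.

```python
def classify_musical_artist(properties):
    """Return 'Group' if currentMembers is set (non-empty), else 'Person'."""
    return "Group" if properties.get("currentMembers") else "Person"


band = classify_musical_artist({"currentMembers": "[[A]], [[B]]"})
solo = classify_musical_artist({"born": "1.1.1980, [[Berlin]]"})
```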

Page 58: DBpedia Framework - BBC Talk

Step 3: Parsers

Example: Range Validation

Google keypeople

„[[Eric Schmidt]] ([[CEO]], [[Chairman]]), [[Sergey Brin]], [[Larry Page]]“

Company#keyperson range Person#Class

Google keyperson
Eric Schmidt
Sergey Brin
Larry Page
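
Range validation as shown here can be sketched as a filter: extract every link target from the raw value, then keep only those whose known class matches the property's range. The `types` table stands in for whatever class information the ontology and template mapping provide; the class label `Title` for CEO and Chairman is made up for the example, and `validate_range` is a hypothetical name.

```python
import re

# [[Target]] or [[Target|label]]; captures the link target
LINK_RE = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def validate_range(value, types, required_class):
    """Keep only link targets whose classes include the required range class."""
    targets = [target.strip() for target in LINK_RE.findall(value)]
    return [t for t in targets if required_class in types.get(t, set())]


# Illustrative class assignments; 'Title' is an invented label.
types = {
    "Eric Schmidt": {"Person"},
    "Sergey Brin": {"Person"},
    "Larry Page": {"Person"},
    "CEO": {"Title"},
    "Chairman": {"Title"},
}
raw = "[[Eric Schmidt]] ([[CEO]], [[Chairman]]), [[Sergey Brin]], [[Larry Page]]"
key_people = validate_range(raw, types, "Person")
```

Targets with no known class are dropped as well, which is the conservative choice; a real system might instead keep them and flag them for review.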

Page 59: DBpedia Framework - BBC Talk

Step 4: Crowd Source it

Page 60: DBpedia Framework - BBC Talk

Step 4: Crowd Source it

Page 61: DBpedia Framework - BBC Talk

Linking Framework

Page 62: DBpedia Framework - BBC Talk

Interlinking Framework

Page 63: DBpedia Framework - BBC Talk

Interlinking Framework

Page 64: DBpedia Framework - BBC Talk

„Apple“

Page 65: DBpedia Framework - BBC Talk

Apple

Google

Microsoft

Page 66: DBpedia Framework - BBC Talk

Apple

Orange

Pear

Page 67: DBpedia Framework - BBC Talk

Orange

Vodafone

T-Mobile

Page 68: DBpedia Framework - BBC Talk

Context

Similarity

Page 69: DBpedia Framework - BBC Talk

Linking: The Future

Hosted Webservice for Linked Data publishers

Page 70: DBpedia Framework - BBC Talk

Summary

Page 71: DBpedia Framework - BBC Talk

http://dbpedia.org

Georgi Kobilarov
Freie Universität Berlin