KD2R: a Key Discovery method for semantic Reference Reconciliation
Danai Symeonidou, Nathalie Pernelle and Fatiha Saïs
LRI (University Paris-Sud)
WOD’2013, June 3rd
Danai Symeonidou, WOD’2013 2
Data Linking
• More and more heterogeneous RDF sources
• Links can be asserted between them
▫ sameAs is one of the most important types of links: it combines information given in different data sources
▫ LOD: the number of already existing links is very small
• How to create links automatically?
Figure: Linked Open Data cloud
Data Linking Problem
Three person descriptions spread over Dataset1 and Dataset2:
P1: FirstName: George, LastName: Thomson, SSN: 011223456, Job: Artist
P2: FirstName: George, LastName: Thomson, SSN: 444223456, Job: Professor
P3: FirstName: George, LastName: Thomson, SSN: 011223456, Age: 45
Figure: the slide builds in three steps, drawing first one and then two SameAs links between the descriptions.
Data Linking with or without key constraints
• No knowledge given about the properties: all the properties have the same importance.
• Knowledge given by an expert:
▫ Specific expert rules [Arasu et al.’09, Low et al.’01, Volz et al.’09 (Silk)]
Example: max(jaro(phone-number; phone-number); jaro-winkler(SSN; SSN)) > 0.88
▫ Key constraints [Saïs, Pernelle and Rousset’09]
Example: hasKey(Museum (museumName) (museumAddress))
• OWL2 Key for a class expression: a combination of (inverse) properties which uniquely identifies an entity
▫ hasKey( CE ( OPE1 ... OPEm ) ( DPE1 ... DPEn ) )
Example: hasKey(Museum (museumName) (museumAddress)) expresses:
Museum(x1) ∧ Museum(x2) ∧ museumName(x1, y) ∧ museumName(x2, y) ∧ museumAddress(x1, w) ∧ museumAddress(x2, w) ⇒ sameAs(x1, x2)
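The rule expressed by the hasKey example can be sketched in Python. This is only an illustration: the function `link_by_key`, the instance ids and the museum values are hypothetical and do not come from the talk. Property values are stored as sets so that multivalued properties are handled, and two instances are linked when they share at least one value for every key property.

```python
from itertools import product

def link_by_key(source1, source2, key_properties):
    """Return (id1, id2) pairs sharing a value for every key property."""
    links = []
    for (id1, d1), (id2, d2) in product(source1.items(), source2.items()):
        # A missing property yields an empty set, so it never triggers a link.
        if all(d1.get(p, set()) & d2.get(p, set()) for p in key_properties):
            links.append((id1, id2))
    return links

# Hypothetical museum descriptions (illustrative values only).
source1 = {
    "m1": {"museumName": {"Louvre"}, "museumAddress": {"Rue de Rivoli, Paris"}},
}
source2 = {
    "m2": {"museumName": {"Louvre"}, "museumAddress": {"Rue de Rivoli, Paris"}},
    "m3": {"museumName": {"Louvre"}, "museumAddress": {"Lens"}},
}

links = link_by_key(source1, source2, ["museumName", "museumAddress"])
print(links)
```

In this toy data only m1 and m2 agree on both the name and the address, so m3 is not linked even though it has the same name.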
Key Discovery Problem
• Problem: when data sources contain numerous data and/or complex ontologies
▫ Some keys are not obvious to find.
▫ Erroneous keys can be given by the expert.
• Aim: automatic discovery of a complete set of keys from data
• Naïve automatic way to discover keys: examine all the possible combinations of properties
▫ Example: given an instance described by 15 properties, the number of candidate keys is 2^15 - 1 = 32767
▫ For each candidate key we have to scan all the instances of the data
• Objective: find keys efficiently by:
▫ Reducing the combinations
▫ Partially scanning the data
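The naïve strategy can be sketched as follows. This is a minimal illustration, not the KD2R algorithm: it assumes complete, single-valued descriptions (no open world), and the function names and toy rows are hypothetical. It enumerates every non-empty combination of properties and scans all instances for each one, which is exactly the cost the objective above aims to avoid.

```python
from itertools import combinations

def candidate_keys(properties):
    """Yield every non-empty combination of properties (2^n - 1 of them)."""
    for size in range(1, len(properties) + 1):
        for combo in combinations(properties, size):
            yield combo

def is_key(instances, combo):
    """Full scan: combo is a key if no two instances agree on all of it."""
    seen = set()
    for description in instances:
        signature = tuple(description[p] for p in combo)
        if signature in seen:
            return False
        seen.add(signature)
    return True

def naive_keys(instances, properties):
    return [c for c in candidate_keys(properties) if is_key(instances, c)]

# Number of candidates for 15 properties, as in the example above.
n_candidates = sum(1 for _ in candidate_keys(range(15)))

# Hypothetical toy rows, assumed complete and single-valued.
rows = [
    {"lastName": "Tompson", "firstName": "Manuel"},
    {"lastName": "Tompson", "firstName": "Maria"},
    {"lastName": "David", "firstName": "George"},
]
keys = naive_keys(rows, ["lastName", "firstName"])
print(n_candidates, keys)
```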
Key Discovery Problem
• RDF data sources (conforming to an OWL 2 ontology)
• Mappings between classes and properties of the different ontologies
• Open world assumption (incomplete data) and multivalued properties may exist

id   lastName   firstName   hasFriend
i1   Tompson    Manuel      i2, i3
i2   Tompson    Maria
i3   David      George      i2, i4
i4   Solgar     Michel

How to discover keys when we do not know if:
i1 =?= i2 =?= i3 =?= i4
hasFriend(i1, i4), hasFriend(i2, i3) ... ??
firstName(i1, Elodie) ... ?
Key Discovery Problem: Assumptions
• Unique Name Assumption (UNA): two different URIs refer to distinct entities (data sources generated from relational databases, Yago): i1 <> i2 <> i3 <> i4
• Two literals that are syntactically different are semantically different
▫ (e.g. “Napoleon Bonaparte” <> “Napoleon”)
Key Discovery: Heuristics
• Heuristic 1 - Pessimistic:
▫ Not instantiated property → all the values are possible
Example: hasFriend(i2, i3), hasFriend(i4, i2) are possible.
▫ Instantiated property → only given values are considered
Example: not hasFriend(i1, i4)
Non keys: {lastName}, {hasFriend}
Keys: {firstName}, {lastName, firstName}, {firstName, hasFriend}
Undetermined keys: {hasFriend, lastName}
Key Discovery: Heuristics
• Heuristic 1 - Optimistic:
▫ Not instantiated property → the value is not one of the already existing ones
Example: not hasFriend(i2, i3), not hasFriend(i2, i1), not hasFriend(i2, i4).
▫ Instantiated property → only given values are considered
Example: not hasFriend(i1, i4)
Non keys: {lastName}, {hasFriend}
Keys: {firstName}, {lastName, firstName}, {firstName, hasFriend}, {hasFriend, lastName}
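The two heuristics can be checked against the four instances i1 to i4 with a short Python sketch. It is an illustration under stated assumptions, not the KD2R implementation: the function `classify` is hypothetical, a missing dictionary entry stands for a non-instantiated property, and under the optimistic heuristic two unknown values are assumed never to coincide.

```python
from itertools import combinations

# Instances i1..i4 from the talk; a missing entry means "not instantiated".
DATA = {
    "i1": {"lastName": {"Tompson"}, "firstName": {"Manuel"}, "hasFriend": {"i2", "i3"}},
    "i2": {"lastName": {"Tompson"}, "firstName": {"Maria"}},
    "i3": {"lastName": {"David"}, "firstName": {"George"}, "hasFriend": {"i2", "i4"}},
    "i4": {"lastName": {"Solgar"}, "firstName": {"Michel"}},
}

def classify(props, data, optimistic):
    """Return 'non key', 'undetermined' or 'key' for a set of properties."""
    status = "key"
    for a, b in combinations(sorted(data), 2):
        surely_equal = True    # the pair shares a known value on every property
        possibly_equal = True  # the pair could still share a value on every property
        for p in props:
            va, vb = data[a].get(p), data[b].get(p)
            if va is None or vb is None:
                surely_equal = False
                if optimistic:
                    # Assumption: an unknown value differs from every other value.
                    possibly_equal = False
            elif not (va & vb):
                surely_equal = possibly_equal = False
        if surely_equal:
            return "non key"
        if possibly_equal:
            status = "undetermined"
    return status

print(classify(["hasFriend", "lastName"], DATA, optimistic=False))
print(classify(["hasFriend", "lastName"], DATA, optimistic=True))
```

The pair (i1, i2) drives the difference: both share the last name Tompson, and i2 has no known friend, so the pessimistic reading cannot rule out a shared friend while the optimistic one does.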
KD2R approach
• Topological sort of the classes (subsumption)
• Key Finder
▫ Discover non keys
Ex: {lastName}, {hasFriend}
▫ Derive keys using non keys
Ex: {firstName}, {lastName, firstName}, {firstName, hasFriend}, {hasFriend, lastName}
• Key Merge
▫ Cartesian product of minimal key sets in S1, S2
Ex: Ks1 = {firstName}, Ks2 = {hasFriend}, Ks1-s2 = {firstName, hasFriend}
Technical report available: https://www.lri.fr/~bibli/Rapports-internes/2013/RR1559.pdf
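The two steps above can be sketched in Python. This is one plausible reading, not necessarily the procedure of the technical report: `derive_keys` assumes a key must contain at least one property outside every non key, and `merge_keys` takes the Cartesian product of the two key sets. All function names are hypothetical, and only minimal keys are returned, so supersets such as {lastName, firstName} listed on the slide do not appear in the output.

```python
from itertools import product

def minimal_sets(sets):
    """Keep only the sets that have no proper subset in the collection."""
    sets = set(sets)
    return {s for s in sets if not any(t < s for t in sets)}

def derive_keys(properties, non_keys):
    """A key must pick at least one property from the complement of each non key."""
    complements = [frozenset(properties) - frozenset(nk) for nk in non_keys]
    candidates = {frozenset(choice) for choice in product(*complements)}
    return minimal_sets(candidates)

def merge_keys(keys1, keys2):
    """Cartesian product of two key sets: union every pair, keep the minimal ones."""
    return minimal_sets({k1 | k2 for k1, k2 in product(keys1, keys2)})

properties = {"lastName", "firstName", "hasFriend"}
keys = derive_keys(properties, [{"lastName"}, {"hasFriend"}])
merged = merge_keys({frozenset({"firstName"})}, {frozenset({"hasFriend"})})
print(keys, merged)
```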
KD2R approach: Key Finder
• Computation of maximal non keys and undetermined keys
▫ Represent data in a prefix-tree (a compact representation of the data of one class)
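As a rough illustration of why a prefix-tree is compact, the sketch below builds a generic trie in which each level corresponds to one property and instances with the same leading values share a path. The exact structure used by KD2R is described in the technical report and may differ. The `Node` class, the `build_prefix_tree` function and the single-valued toy descriptions are assumptions made for this example.

```python
class Node:
    def __init__(self):
        self.children = {}   # property value -> child Node
        self.instances = []  # ids of the instances whose path ends at this node

def build_prefix_tree(instances, properties):
    """Insert each description along a fixed order of properties."""
    root = Node()
    for instance_id, description in instances.items():
        node = root
        for prop in properties:
            value = description.get(prop)  # None marks a non-instantiated property
            node = node.children.setdefault(value, Node())
        node.instances.append(instance_id)
    return root

# Hypothetical single-valued descriptions.
people = {
    "i1": {"lastName": "Tompson", "firstName": "Manuel"},
    "i2": {"lastName": "Tompson", "firstName": "Maria"},
    "i3": {"lastName": "David", "firstName": "George"},
}
tree = build_prefix_tree(people, ["lastName", "firstName"])
print(sorted(tree.children))
```

Here i1 and i2 share the node for the last name Tompson, so the value is stored once and the tree branches only at the first name.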
Validation of approach
• Datasets where KD2R has been tested:

Dataset            RDF file       #instances   Optimistic   Pessimistic
OAEI Restaurants   Restaurant1    339          Yes          Yes
OAEI Restaurants   Restaurant2    1390         Yes          Yes
OAEI Persons       Person11       1000         Yes          Yes
OAEI Persons       Person12       1000         Yes          Yes
OAEI Persons       Person21       1200         Yes          Yes
DBpedia            Person         763644       Yes          No
DBpedia            NaturalPlace   78400        Yes          No
DBpedia            BodyOfWater    34008        Yes          No
DBpedia            Lake           33348        Yes          No
googleFusion       G_Restaurant   372813       Yes          Yes
ChefMoz            C_Restaurant   1047         Yes          Yes

DBpedia dataset: properties instantiated in at least 80% of the data.
Demo
• Ontologies
▫ Data conforming to one ontology
• RDF data
▫ DBpedia NaturalPlace dataset (78400 instances)
▫ OAEI Person dataset (2000 instances)
• Data linking
▫ Link data using LN2R
▫ Measure quality of linking using: recall, precision, f-measure
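The three quality measures follow their standard definitions over sets of links. The sketch below is illustrative only: the function `linking_quality` and the link pairs are hypothetical, and no result from the demo is reproduced.

```python
def linking_quality(found_links, reference_links):
    """Return (precision, recall, f_measure) of found links against a reference set."""
    found, reference = set(found_links), set(reference_links)
    correct = len(found & reference)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(reference) if reference else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Hypothetical sameAs links: 2 of the 3 found links are in the 4-link reference.
found = {("a1", "b1"), ("a2", "b2"), ("a3", "b9")}
reference = {("a1", "b1"), ("a2", "b2"), ("a4", "b4"), ("a5", "b5")}
precision, recall, f_measure = linking_quality(found, reference)
print(precision, recall, f_measure)
```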
QUESTIONS???
THANK YOU!!!