SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data
Samur Araujo, Duc Thanh Tran, Arjen de Vries, Jan Hidders, Daniel Schwabe
Delft University of Technology, WebDB 2012
[Illustration: "Me", "You", and an "Apple"; the reference is ambiguous.]
My Apple: Round Shape, Green Color, Eatable
Your Apple: Spherical Shape, Red Color, Eatable
My Apple: Shape, Color, Eatable
Your Apple: Shape, Color, Eatable
Fruit
My Apple: Round Shape, Green Color, Eatable
Your Apple: Spherical Shape, Red Color, Eatable
Instance Matching
Source
Target
“Instance matching uses a direct comparison paradigm”.
Source
Target
Is your Apple like my Apple?
Hmm... Maybe!
Homogeneous data and schema.
The source and target descriptions overlap.
[Diagram: Source and Target overlap]
Syntactic Overlap
Population = TotalPopulation
Semantic Overlap
Population = Num_Inhabitants
Web of Data: heterogeneous data and schema
No or limited overlap between schemas.
[Diagram: Source and Target]
Instances do not properly instantiate the schema.
[Diagram: Source and Target]
Apple
Nutritional Information
Botanical Information
“Direct comparison paradigm does not apply”.
[Diagram: Source and Target]
[Illustration: "Me" and "You" with Apple, Orange, and Pineapple; class labels shown: Food, Eatable.]
Source
My Apple Your Apple
My Orange Your Orange
My Pineapple Your Pineapple
“We use a class-based disambiguation paradigm when there is no overlap between schemas.”
Instance Matching with SERIMI
Source
Target
Instance Representation
An instance is described by (instance, predicate, value) triples:
(Apple1, shape, Round)
(Apple1, title, Apple)
(Apple1, color, Red)
(Apple1, category, Eatable)
P(X) = {p | (s, p, o) ∈ IR(G, X) ∧ s ∈ X}
D(X) = {o | (s, p, o) ∈ IR(G, X) ∧ s ∈ X ∧ o ∈ L}
O(X) = {o | (s, p, o) ∈ IR(G, X) ∧ s ∈ X ∧ o ∈ U}
T(X) = {(p, o) | (s, p, o) ∈ IR(G, X) ∧ s ∈ X}
For X = {Apple1}:
P(X) = {shape, title, color, category}
D(X) = {Round, Apple, Red, Eatable}
O(X) = {}
T(X) = {(shape, Round), (title, Apple), (color, Red), (category, Eatable)}
Each instance h_i is represented as [P(h_i), D(h_i), O(h_i), T(h_i)].
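The four feature sets can be sketched in a few lines of Python. This is an illustration only, under stated assumptions: the triple list stands in for IR(G, X), L is read as the set of literals and U as the set of URIs (consistent with O(X) being empty in the Apple1 example), and `is_uri` and `features` are hypothetical helper names, not part of SERIMI.

```python
from typing import NamedTuple


class Features(NamedTuple):
    P: frozenset  # predicates
    D: frozenset  # literal values (o in L)
    O: frozenset  # URI values (o in U)
    T: frozenset  # (predicate, value) pairs


def is_uri(value: str) -> bool:
    """Hypothetical membership test for U: treat http(s) strings as URIs."""
    return value.startswith(("http://", "https://"))


def features(triples, X):
    """Compute [P(X), D(X), O(X), T(X)] from (s, p, o) triples with s in X."""
    rows = [(p, o) for s, p, o in triples if s in X]
    return Features(
        P=frozenset(p for p, _ in rows),
        D=frozenset(o for _, o in rows if not is_uri(o)),
        O=frozenset(o for _, o in rows if is_uri(o)),
        T=frozenset(rows),
    )


# The Apple1 example from the slide.
triples = [
    ("Apple1", "shape", "Round"),
    ("Apple1", "title", "Apple"),
    ("Apple1", "color", "Red"),
    ("Apple1", "category", "Eatable"),
]
f = features(triples, {"Apple1"})
print(sorted(f.P))  # ['category', 'color', 'shape', 'title']
print(sorted(f.O))  # []
```

Passing a larger set X gives the union of the members' features, which is how a whole group of instances can be described by the same four sets.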
Step 1: Cluster the source
Source clusters: Cars, Fruits, Companies
Step 2: Blocking Key Selection
[Diagram: Key Selection assigns a key to each group of source instances, e.g. Title.]
Step 3: Pseudo-Homonyms Builder
Keys: Title=apple, Title=orange, Title=pineapple
[Diagram: the Pseudo-Homonyms Builder takes the source instances and the Target and produces pseudo-homonym sets in the Target, e.g. everything called Apple.]
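A minimal sketch of Step 3, assuming instances are plain dictionaries and that a target instance belongs to a pseudo-homonym set when any of its literal values equals the source key value, ignoring case. The matching rule and the name `build_pseudo_homonyms` are assumptions made for illustration; the slides do not specify how the label search is done.

```python
from collections import defaultdict


def build_pseudo_homonyms(source, target, key="title"):
    """Map each source instance to the target instances sharing its key label.

    source / target: dict of instance id -> dict of predicate -> literal value.
    Matching is case-insensitive equality against any target literal
    (an assumption of this sketch).
    """
    index = defaultdict(set)
    for tid, props in target.items():
        for value in props.values():
            index[value.lower()].add(tid)
    return {
        sid: index.get(props[key].lower(), set())
        for sid, props in source.items()
        if key in props
    }


source = {"s1": {"title": "apple"}, "s2": {"title": "orange"}}
target = {
    "t1": {"name": "Apple", "type": "Fruit"},
    "t2": {"label": "Apple", "type": "Company"},
    "t3": {"name": "Orange", "type": "Fruit"},
}
ph = build_pseudo_homonyms(source, target)
print(sorted(ph["s1"]))  # ['t1', 't2']  (everything called Apple)
```

The set for "apple" mixes a fruit and a company, which is exactly the ambiguity that Step 4 has to resolve.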
Step 4: Class-based disambiguation
[Diagram: the Class-based Disambiguator operates on the source instances, the Target, and the pseudo-homonym sets.]
Pseudo-homonym sets and their instances:
H1: h11, h12, h13, h14
H2: h21, h22
H3: h31, h32, h33
Each instance, e.g. h11, is represented as [P(h11), D(h11), O(h11), T(h11)].
\[
\mathrm{SetSim}(A, B) = |A \cap B| - \left( \frac{|A - B| + |B - A|}{2\,|A \cup B|} \right)
\]
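As a worked example, SetSim can be written directly with Python set operations. The sketch reads the slide's formula as the intersection size minus the parenthesised fraction of the two set differences over twice the union size; that reading, and the choice to return 0 for two empty sets, are assumptions. Under this reading the value is not bounded by 1, and the slides do not show how the final scores are scaled.

```python
def set_sim(a: set, b: set) -> float:
    """SetSim(A, B) = |A ∩ B| - (|A - B| + |B - A|) / (2 |A ∪ B|)."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets contribute nothing
    return len(a & b) - (len(a - b) + len(b - a)) / (2 * len(union))


a = {"shape", "color", "category"}
b = {"shape", "color", "title", "price"}

# |A ∩ B| = 2, |A - B| = 1, |B - A| = 2, |A ∪ B| = 5, so 2 - 3/10
score = set_sim(a, b)
identical = set_sim(a, a)  # no differences, so the score is |A| = 3
print(score, identical)
```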
For h11 in H1, the predicate sets are compared against the other pseudo-homonym sets:
SetSim(P(h11), P(H2)) + SetSim(P(h11), P(H3))
using the representation [P(h11), D(h11), O(h11), T(h11)].
\[
\mathrm{RDS}(A, B) = \mathrm{SetSim}(P(A), P(B)) + \mathrm{SetSim}(D(A), D(B)) + \mathrm{SetSim}(O(A), O(B)) + \mathrm{SetSim}(T(A), T(B))
\]
Score of each instance:
H1: h11 = 0.98, h12 = 0.32, h13 = 0.32, h14 = 0.76
H2: h21 = 0.95, h22 = 0.53
H3: h31 = 0.94, h32 = 0.91, h33 = 0.87
\[
\mathrm{URDS}(t, PH(S)^{-}) = \sum_{PH(s') \in PH(S)^{-}} \frac{\mathrm{RDS}(\{t\}, PH(s'))}{|PH(S')|}
\]
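The scoring can be sketched end to end as follows. Several points are assumptions rather than facts from the slides: SetSim is read as intersection size minus the fraction of differences over twice the union; PH(S)⁻ is read as the pseudo-homonym sets other than the one containing t (as in the h11 versus H2 and H3 slide); the denominator is read as the size of the set being compared against; and the features of a set are the union of its members' features. Each instance is modelled as a frozenset of hypothetical (predicate, value, is_uri) facts.

```python
def set_sim(a, b):
    """Assumed reading: |A ∩ B| - (|A - B| + |B - A|) / (2 |A ∪ B|)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) - (len(a - b) + len(b - a)) / (2 * len(union))


def features(instances):
    """[P, D, O, T] of a group of instances (union over its members)."""
    facts = set().union(*instances)
    return (
        {p for p, _, _ in facts},                # P: predicates
        {v for _, v, uri in facts if not uri},   # D: literal values
        {v for _, v, uri in facts if uri},       # O: URI values
        {(p, v) for p, v, _ in facts},           # T: (predicate, value) pairs
    )


def rds(a, b):
    """RDS(A, B): sum of SetSim over the four feature sets."""
    return sum(set_sim(x, y) for x, y in zip(features(a), features(b)))


def urds(t, other_sets):
    """Sum of RDS({t}, PH(s')) / |PH(s')| over the other pseudo-homonym sets."""
    return sum(rds([t], ph) / len(ph) for ph in other_sets if ph)


def fruit(name, color):
    return frozenset({("type", "Fruit", False), ("title", name, False),
                      ("color", color, False)})


company = frozenset({("type", "Company", False), ("title", "Apple", False),
                     ("ceo", "http://example.org/ceo", True)})

h1 = [fruit("Apple", "Red"), company]   # everything called Apple
h2 = [fruit("Orange", "Orange")]
h3 = [fruit("Pineapple", "Yellow")]

scores = [urds(t, [h2, h3]) for t in h1]
print(scores)  # the fruit candidate scores higher than the company
```

The candidate that shares predicates and values with the other sets (the fruit) outscores the one that does not (the company), which is the intuition behind class-based disambiguation.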
Best score per set: H1 = 0.98, H2 = 0.95, H3 = 0.94
Selection: Top-K or threshold
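The two selection strategies can be illustrated with the scores shown on the slides. This is a sketch: the function names are hypothetical, and applying the threshold directly to the raw score is an assumption, since the slides do not define δ precisely.

```python
def select_top_k(scores, k=1):
    """Keep the k best-scoring candidates of one pseudo-homonym set."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def select_by_threshold(scores, delta):
    """Keep every candidate whose score is at least delta."""
    return [c for c, s in scores.items() if s >= delta]


# Scores per pseudo-homonym set, as shown on the slides.
sets = {
    "H1": {"h11": 0.98, "h12": 0.32, "h13": 0.32, "h14": 0.76},
    "H2": {"h21": 0.95, "h22": 0.53},
    "H3": {"h31": 0.94, "h32": 0.91, "h33": 0.87},
}

top1 = {name: select_top_k(s, 1) for name, s in sets.items()}
above = {name: select_by_threshold(s, 0.90) for name, s in sets.items()}
print(top1)   # {'H1': ['h11'], 'H2': ['h21'], 'H3': ['h31']}
print(above)  # {'H1': ['h11'], 'H2': ['h21'], 'H3': ['h31', 'h32']}
```

Top-1 always returns exactly one match per set, while a threshold can return several (as for H3) or none, which trades precision against recall.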
Experiment
• Ontology Alignment Evaluation Initiative (OAEI 2010)
• Collections: the life science (LS) collection (DBpedia, Sider, Drugbank, LinkedCT, Dailymed, TCM, and Diseasome) and the Person-Restaurant (PR) collection
• 20 gigabytes of data, millions of triples
• We compared SERIMI to ObjectCoref and RiMOM
• Metrics: precision, recall, and F1
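For reference, the three metrics can be computed from a set of predicted matches and a reference alignment as sketched below. The pair identifiers are made up for illustration and are not taken from the OAEI data.

```python
def precision_recall_f1(predicted, reference):
    """Precision, recall, and F1 of predicted (source, target) match pairs."""
    predicted, reference = set(predicted), set(reference)
    tp = len(predicted & reference)  # correctly predicted pairs
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


# Hypothetical matches: 2 of 3 predictions are correct, 2 of 4 reference pairs found.
predicted = {("s1", "t1"), ("s2", "t3"), ("s3", "t9")}
reference = {("s1", "t1"), ("s2", "t3"), ("s3", "t4"), ("s4", "t5")}

p, r, f1 = precision_recall_f1(predicted, reference)
print(round(p, 3), round(r, 3), round(f1, 3))
```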
Results
Results for Top-K
[Bar chart: F1 (0.00 to 1.00) per dataset pair (Sider-Daily., Sider-Drug., Drug.-Sider, P11-P12) for Top-1, Top-2, Top-5, and Top-10.]
Results for δ threshold
[Bar chart: F1 (0.00 to 1.00) per dataset pair (Sider-Daily., Sider-Drug., Drug.-Sider, P11-P12) for δ >= δm, δ = 1.0, δ >= 0.95, δ >= 0.90, and δ >= 0.85.]
Conclusion
• SERIMI is a complementary approach to instance matching tools based on direct matching.
• SERIMI is recommended for heterogeneous data where there is no overlap between schemas.
• It is recommended for multi-class disambiguation.
THANK YOU!
• Samur Araujo