SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data

Samur Araujo, Duc Thanh Tran, Arjen de Vries, Jan Hidders, Daniel Schwabe

Delft University of Technology, WebDB 2012


Page 2

Me You

Page 3

Apple Me

Page 4

You

Page 5

You

?

Ambiguous

Page 6

Me

Page 7

Me

Page 8

Me

Page 9

You

Page 10

Round Shape

Green Color

Eatable

Spherical Shape

Red Color

Eatable

My Apple Your Apple

Page 11

Shape

Color

Eatable

Shape

Color

Eatable

My Apple Your Apple

Page 12

Fruit

Round Shape

Green Color

Eatable

Spherical Shape

Red Color

Eatable

My Apple Your Apple

Page 13

My Apple Your Apple

Page 14

Instance Matching

Source

Target

Page 15

“Instance matching uses a direct comparison paradigm”.

Page 16

Page 17

Source

Target

Is your Apple like my Apple?

Hmm... Maybe!

Page 18

Homogeneous data and schema.

Page 19

The source and target descriptions overlap.

Target Source

Page 20

Syntactic Overlap

Population = TotalPopulation

Page 21

Semantic Overlap

Population = Num_Inhabitants

Page 22

Web of Data: heterogeneous data and schema

Page 23

No or limited overlap between schemas

Target Source

Page 24

Instances do not properly instantiate the schema.

Target Source

Page 25

Apple

Nutritional Information

Botanical Information

Page 26

“The direct comparison paradigm does not apply.”

Target Source

Page 27

Apple Me

Page 28

Me

Apple Orange Pineapple

Page 29

You

Apple

Me Orange

Pineapple

Page 30

You

Page 31

Food

Page 32

Page 33

Food Eatable

Page 34

Source

Page 35

My Apple Your Apple

My Orange Your Orange

My Pineapple Your Pineapple

Page 36

Page 37

“We use a class-based disambiguation paradigm …”

Page 38

“We use a class-based disambiguation paradigm …” “… when there is no overlap between schemas.”

Page 39

Instance Matching with SERIMI

Source

Target

Page 40

Instance Representation

Page 41

Instance Representation

(Instance, Predicate, Value)

Page 42

Instance Representation

(Apple1, shape, Round)
(Apple1, title, Apple)
(Apple1, color, Red)
(Apple1, category, Eatable)

Page 43

Instance Representation

(Apple1, shape, Round)
(Apple1, title, Apple)
(Apple1, color, Red)
(Apple1, category, Eatable)

P(X) = {p | (s, p, o) ∈ IR(G, X) ∧ s ∈ X},

D(X) = {o | (s, p, o) ∈ IR(G, X) ∧ s ∈ X ∧ o ∈ L},

O(X) = {o | (s, p, o) ∈ IR(G, X) ∧ s ∈ X ∧ o ∈ U},

T(X) = {(p, o) | (s, p, o) ∈ IR(G, X) ∧ s ∈ X}.

Page 44

Instance Representation

(Apple1, shape, Round)
(Apple1, title, Apple)
(Apple1, color, Red)
(Apple1, category, Eatable)

P(X) = {shape, title, color, category},

D(X) = {Round, Apple, Red, Eatable},

O(X) = {},

T(X) = {(shape, Round), (title, Apple), (color, Red), (category, Eatable)}.
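The four sets can be computed directly from the triples. The following is a minimal sketch, not the authors' implementation: the `Triple` type, the `is_uri` test used to separate URIs (U) from literals (L), and the function name `represent` are all assumptions made for illustration.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    s: str  # subject (the instance)
    p: str  # predicate
    o: str  # object (a literal or a URI)


def is_uri(value: str) -> bool:
    # Assumption: URIs (the set U) are recognizable by their scheme;
    # every other object value is treated as a literal (the set L).
    return value.startswith(("http://", "https://"))


def represent(graph, instances):
    """Return (P, D, O, T) for the given set of instance identifiers."""
    rows = [t for t in graph if t.s in instances]
    predicates = {t.p for t in rows}                   # P(X)
    literals = {t.o for t in rows if not is_uri(t.o)}  # D(X)
    uris = {t.o for t in rows if is_uri(t.o)}          # O(X)
    pairs = {(t.p, t.o) for t in rows}                 # T(X)
    return predicates, literals, uris, pairs


graph = [
    Triple("Apple1", "shape", "Round"),
    Triple("Apple1", "title", "Apple"),
    Triple("Apple1", "color", "Red"),
    Triple("Apple1", "category", "Eatable"),
]
P, D, O, T = represent(graph, {"Apple1"})
```

For the Apple1 example, O(X) stays empty because none of the object values is a URI.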

Page 45

Instance Representation

[P(hi), D(hi), O(hi), T(hi)]

Page 46

Step 1: Cluster the source

Source

Cars Fruits Companies

Page 47

Step 2: Blocking Key Selection

Key Selection

Source instances

Page 48

Step 2: Blocking Key Selection

Key Selection

key key key

Source instances

Page 49

Step 2: Blocking Key Selection

Key Selection

key key key

Source instances

e.g. Title

Page 50

Step 3: Pseudo-Homonyms Builder

Title=apple

Title=orange

Title=pineapple

Pseudo-Homonyms Builder

Target

Page 51

Step 3: Pseudo-Homonyms Builder

Pseudo-Homonyms Builder

Target

Source instances

Target

Pseudo-homonym sets

Everything called Apple
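As a rough illustration of this step, the sketch below collects, for each source instance, every target instance whose label equals the source key value ("everything called Apple"). The function name, the dictionaries, and the case-insensitive exact match are assumptions; the slides do not say how the target is searched.

```python
from collections import defaultdict


def build_pseudo_homonyms(source_keys, target_labels):
    """Map each source instance to the set of target instances sharing its key value."""
    # Index target instances by normalized label for direct lookup.
    index = defaultdict(set)
    for target_id, label in target_labels.items():
        index[label.casefold()].add(target_id)
    # Assumption: a candidate matches when its label equals the key, ignoring case.
    return {
        source_id: set(index.get(key.casefold(), set()))
        for source_id, key in source_keys.items()
    }


# Hypothetical data: key values come from the selected key (e.g. Title).
source_keys = {"s1": "apple", "s2": "orange"}
target_labels = {"t1": "Apple", "t2": "Apple", "t3": "Orange", "t4": "Pear"}
pseudo_homonyms = build_pseudo_homonyms(source_keys, target_labels)
```

Here `s1` receives two candidates (`t1` and `t2`), which is the ambiguity that Step 4 has to resolve.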

Page 52

Step 4: Class-based disambiguation

Target

Pseudo-homonym sets

Disambiguation

Class-based Disambiguator

Page 53

Step 4: Class-based disambiguation

Target

Pseudo-homonym sets

Disambiguation

Class-based Disambiguator

Source instances

Target

Pseudo-homonym sets

Page 54

Step 4: Class-based disambiguation

Page 55

Step 4: Class-based disambiguation

Page 56

Step 4: Class-based disambiguation

Pseudo-homonym sets and their instances:
H1: h11, h12, h13, h14
H2: h21, h22
H3: h31, h32, h33

Page 57

Step 4: Class-based disambiguation

Pseudo-homonym sets and their instances:
H1: h11, h12, h13, h14
H2: h21, h22
H3: h31, h32, h33

[P(h11), D(h11), O(h11), T(h11)]

Page 58

Instance Representation

(Apple1, shape, Round)
(Apple1, title, Apple)
(Apple1, color, Red)
(Apple1, category, Eatable)

P(X) = {shape, title, color, category},

D(X) = {Round, Apple, Red, Eatable},

O(X) = {},

T(X) = {(shape, Round), (title, Apple), (color, Red), (category, Eatable)}.

Page 59

Step 4: Class-based disambiguation

Pseudo-homonym sets and their instances:
H1: h11, h12, h13, h14
H2: h21, h22
H3: h31, h32, h33

Score of each instance:
H1: h11 = 0.98, h12 = 0.32, h13 = 0.32, h14 = 0.76
H2: h21 = 0.95, h22 = 0.53
H3: h31 = 0.94, h32 = 0.91, h33 = 0.87

Page 60

Step 4: Class-based disambiguation

Pseudo-homonym sets and their instances:
H1: h11, h12, h13, h14
H2: h21, h22
H3: h31, h32, h33

$$\mathrm{SetSim}(A,B) = |A \cap B| - \left( \frac{|A - B| + |B - A|}{2\,|A \cup B|} \right)$$
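A small sketch of SetSim may help here. It assumes the formula reads as the intersection size minus the symmetric difference size divided by twice the union size; that grouping is a reading of the slide, and the handling of two empty sets is an added assumption.

```python
def set_sim(a: set, b: set) -> float:
    """SetSim(A, B) = |A ∩ B| - (|A - B| + |B - A|) / (2 |A ∪ B|)."""
    union = a | b
    if not union:
        # Assumption: two empty sets contribute nothing to the score.
        return 0.0
    return len(a & b) - (len(a - b) + len(b - a)) / (2 * len(union))


# Predicate sets of two hypothetical instances.
p_a = {"shape", "title", "color", "category"}
p_b = {"shape", "title", "color"}
score = set_sim(p_a, p_b)  # 3 shared, 1 unshared, union of 4
```

Shared elements raise the score by one each, while unshared elements subtract at most 0.5 in total, so overlap dominates the result.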

Page 61

Step 4: Class-based disambiguation

Pseudo-homonym sets and their instances:
H1: h11, h12, h13, h14
H2: h21, h22
H3: h31, h32, h33

SetSim(P(h11), P(H2)) + SetSim(P(h11), P(H3))

[P(h11), D(h11), O(h11), T(h11)]

Page 62

Step 4: Class-based disambiguation

RDS(A, B) = SetSim(P(A), P(B)) + SetSim(D(A), D(B)) + SetSim(O(A), O(B)) + SetSim(T(A), T(B))
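RDS can be sketched as the sum of SetSim over the four parts of the instance representation. In this sketch a representation is a hypothetical 4-tuple (P, D, O, T) of Python sets, and `set_sim` follows one reading of the SetSim slide; both are assumptions.

```python
def set_sim(a: set, b: set) -> float:
    # Assumed reading: |A ∩ B| - (|A - B| + |B - A|) / (2 |A ∪ B|).
    union = a | b
    if not union:
        return 0.0  # assumption: empty sets contribute nothing
    return len(a & b) - (len(a - b) + len(b - a)) / (2 * len(union))


def rds(rep_a, rep_b) -> float:
    """Sum SetSim over the P, D, O and T components of two representations."""
    return sum(set_sim(x, y) for x, y in zip(rep_a, rep_b))


# Hypothetical representations (P, D, O, T) of two apples.
my_apple = (
    {"shape", "color"},
    {"Round", "Red"},
    set(),
    {("shape", "Round"), ("color", "Red")},
)
your_apple = (
    {"shape", "color"},
    {"Round", "Green"},
    set(),
    {("shape", "Round"), ("color", "Green")},
)
similarity = rds(my_apple, your_apple)
```

The two apples share all predicates but differ in one value, so the P component contributes most of the score.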

Page 63

Step 4: Class-based disambiguation

Pseudo-homonym sets and their instances:
H1: h11 (0.98), h12, h13, h14
H2: h21, h22
H3: h31, h32, h33

Page 64

Step 4: Class-based disambiguation

Scores per pseudo-homonym set:
H1: 0.98, 0.32, 0.32, 0.76
H2: 0.95, 0.53
H3: 0.94, 0.91, 0.87

Page 65

Step 4: Class-based disambiguation

$$\mathrm{URDS}(t, PH(S)^{-}) = \sum_{PH(s') \in PH(S)^{-}} \frac{\mathrm{RDS}(\{t\}, PH(s'))}{|PH(S')|}$$
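The sketch below shows one way to read URDS: a candidate t is scored by its average RDS against the other pseudo-homonym sets. Treating the denominator as the number of other sets is an assumption, as are the 4-tuple representation (P, D, O, T), the reading of SetSim, and all data.

```python
def set_sim(a: set, b: set) -> float:
    # Assumed reading: |A ∩ B| - (|A - B| + |B - A|) / (2 |A ∪ B|).
    union = a | b
    if not union:
        return 0.0
    return len(a & b) - (len(a - b) + len(b - a)) / (2 * len(union))


def rds(rep_a, rep_b) -> float:
    return sum(set_sim(x, y) for x, y in zip(rep_a, rep_b))


def urds(t_rep, other_reps) -> float:
    """Average RDS between candidate t and every other pseudo-homonym set."""
    if not other_reps:
        return 0.0  # assumption: nothing to compare against
    return sum(rds(t_rep, rep) for rep in other_reps) / len(other_reps)


# Two candidates called "Apple": a fruit and a company (hypothetical data).
apple_fruit = ({"title", "color"}, {"Apple", "Red"}, set(),
               {("title", "Apple"), ("color", "Red")})
apple_company = ({"title", "founded"}, {"Apple", "1976"}, set(),
                 {("title", "Apple"), ("founded", "1976")})
# Representations of the other pseudo-homonym sets (oranges, pineapples).
others = [
    ({"title", "color"}, {"Orange"}, set(),
     {("title", "Orange"), ("color", "Orange")}),
    ({"title", "color"}, {"Pineapple", "Yellow"}, set(),
     {("title", "Pineapple"), ("color", "Yellow")}),
]
fruit_score = urds(apple_fruit, others)
company_score = urds(apple_company, others)
```

In this toy data the fruit candidate scores higher than the company candidate because it shares its predicates with the oranges and pineapples, which is the intuition behind class-based disambiguation.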

Page 66

Step 4: Class-based disambiguation

0.98 0.95 0.94

H1 H2 H3

TOP-K or Threshold
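The final selection can be sketched in two ways: keep the k best-scoring candidates of a pseudo-homonym set, or keep every candidate whose score reaches a threshold δ. The function names are hypothetical, the assignment of scores to h11..h14 is illustrative, and applying δ to the raw score is an assumption.

```python
def top_k(scores: dict, k: int) -> list:
    """Return the k candidates with the highest scores, best first."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


def above_threshold(scores: dict, delta: float) -> list:
    """Return every candidate whose score is at least delta."""
    return [candidate for candidate, score in scores.items() if score >= delta]


# Illustrative scores for the candidates of one pseudo-homonym set.
h1_scores = {"h11": 0.98, "h12": 0.32, "h13": 0.32, "h14": 0.76}
best = top_k(h1_scores, 1)
kept = above_threshold(h1_scores, 0.75)
```

Top-K always returns a fixed number of matches per source instance, whereas a threshold can return none or several.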

Page 67

Step 4: Class-based disambiguation

Page 68

Step 4: Class-based disambiguation

Disambiguation

Class-based Disambiguator

Source instances

Target

Pseudo-homonym sets

Page 69

Experiment

• Ontology Alignment Evaluation Initiative (OAEI 2010)

• Collections: the life science (LS) collection (DBPedia, Sider, Drugbank, LinkedCT, Dailymed, TCM, and Diseasome) and the Person-Restaurant (PR) collection.

• 20 gigabytes of data, millions of triples.

• We compared SERIMI to ObjectCoref and RiMOM

• Precision, Recall and F1
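For reference, the three measures can be computed from a set of predicted matches and a set of reference matches. This is the standard definition, not code from the evaluation; the match pairs are made up.

```python
def precision_recall_f1(predicted: set, gold: set):
    """Compute precision, recall and F1 for sets of (source, target) matches."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical matches: two of three predictions are in the reference alignment.
predicted = {("s1", "t1"), ("s2", "t3"), ("s3", "t9")}
gold = {("s1", "t1"), ("s2", "t3"), ("s4", "t4")}
precision, recall, f1 = precision_recall_f1(predicted, gold)
```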

Page 70

Results

Page 71

Results

Page 72

Results

Page 73

Page 74

Results

Page 75

Step 4: Class-based disambiguation

0.98 0.95 0.94

H1 H2 H3

TOP-K or Threshold

Page 76

Results for Top-K

[Bar chart: F1 (0.00 to 1.00) per dataset pair (Sider-Daily., Sider-Drug., Drug.-Sider, P11-P12) for Top-1, Top-2, Top-5, and Top-10.]

Page 77

Results for δ threshold

[Bar chart: F1 (0.00 to 1.00) per dataset pair (Sider-Daily., Sider-Drug., Drug.-Sider, P11-P12) for δ >= δm, δ = 1.0, δ >= 0.95, δ >= 0.90, and δ >= 0.85.]

Page 78

Conclusion

• SERIMI is a complementary approach to direct-match-based instance matching tools.

• SERIMI is recommended for heterogeneous data where there is no overlap between schemas.

• It is recommended for multi-class disambiguation.

Page 79

THANK YOU!

• Samur Araujo [email protected]
