27
2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching Yee Fan Tan and Min-Yen Kan School of Computing National University of Singapore {tanyeefa,kanmy}@comp.nus.edu.sg

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) Hierarchical Cost-sensitive Web Resource Acquisition

Embed Size (px)

DESCRIPTION

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) 3 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching Record Matching using Web Resources KDD PAKDD DKMD KDID Knowledge Discovery and Data Mining Pacific-Asia Conference on Knowledge Discovery and Data Mining Research Issues on Data Mining and Knowledge Discovery Knowledge Discovery in Inductive Databases ? Record linkage of short forms to long forms Conference or workshop websites, etc. Disambiguation of author names in publication lists Publication pages of authors and institutions, etc. Cooccurrence counts Hit counts Webpage similarity URL information Snippet similarity AB WI-IAT Web Intelligence

Citation preview

Page 1: 2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) Hierarchical Cost-sensitive Web Resource Acquisition

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT)

Hierarchical Cost-sensitiveWeb Resource Acquisition

for Record Matching

Yee Fan Tan and Min-Yen Kan

School of ComputingNational University of Singapore

{tanyeefa,kanmy}@comp.nus.edu.sg

Page 2: 2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) Hierarchical Cost-sensitive Web Resource Acquisition

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT)

2 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching

Contents• Background and Motivation

• Record matching using Web resources• Issues with existing work

• Cost-sensitive Record Acquisition Framework• Algorithm for Record Matching Problems• Evaluation• Conclusion

Page 3: 2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) Hierarchical Cost-sensitive Web Resource Acquisition

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT)

3 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching

Record Matching using Web Resources

KDDPAKDDDKMDKDID

Knowledge Discovery and Data MiningPacific-Asia Conference on Knowledge Discovery and Data MiningResearch Issues on Data Mining and Knowledge DiscoveryKnowledge Discovery in Inductive Databases

?

Record linkage of short forms to long forms

Conference or workshop websites, etc.

Disambiguation of author names in publication listsPublication pages of authors and institutions, etc.

Cooccurrence countsHit countsWebpage similarity

URL informationSnippet similarity

A B

WI-IATWeb Intelligence

Page 4: 2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) Hierarchical Cost-sensitive Web Resource Acquisition

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT)

4 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching

Record Matching using Web Resources• Typical scheme for comparing records a and b

• Query search engine with queries of the form:a, b, and/or a b

(Optionally, append additional terms or tokens)• Compute similarity metric(s) and/or construct test instance

• Examples (IR, NLP, IE, data mining, linkage, etc.)• Mihalcea and Moldovan (1999), Cimiano et al. (2005),

Tan et al. (2006), Bollegala et al. (2007),Elmacioglu et al., (2007), Oh and Isahara (2008),Kalashnikov et al. (2008)

Page 5: 2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) Hierarchical Cost-sensitive Web Resource Acquisition

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT)

5 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching

Cost of Acquiring Web Resources• Acquiring web resources are slow and may incur

other access costs• Google SOAP Search takes a day or more to query 1000

records individually, even longer for pairwise queries• For large datasets, not feasible to query and

download everything• Must acquire only selected web resources

Page 6: 2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) Hierarchical Cost-sensitive Web Resource Acquisition

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT)

6 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching

• Basically ignored in existing work• Including work on cost-sensitive attribute value acquisition,

e.g., Ling et al. (2006), Saar-Tsechansky et al. (2009)

• Same resource may be used to extract different types of attribute values

• Different instances can share attribute values

Hierarchical Dependencies

search(a)

webpage(a)

term-freq(a) hitcount(a)

Page 7: 2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) Hierarchical Cost-sensitive Web Resource Acquisition

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT)

7 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching

Contents• Background and Motivation• Cost-sensitive Record Acquisition Framework

• Resource dependency graph• Resource acquisition problem• Observations and challenges for record matching

problems• Algorithm for Record Matching Problems• Evaluation• Conclusion

Page 8: 2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) Hierarchical Cost-sensitive Web Resource Acquisition

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT)

8 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching

Resource Dependency Graph

• Directed acyclic graph G = (V, E)• Vertex types T with |T| = 7

Page 9: 2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) Hierarchical Cost-sensitive Web Resource Acquisition

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT)

9 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching

• Feasible vertex set V’ V• V’ can be acquired as-is without violating acquisition

dependencies

Resource Acquisition Graph

V’ V’

Page 10: 2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) Hierarchical Cost-sensitive Web Resource Acquisition

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT)

10 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching

Cost and Benefit FunctionsAcquisition cost function• search(.) cost 10 each• webpage(.) cost 100 each• All other cost 1 each

Benefit function• Positive benefit for vertices

that are attribute values in test instances

• Zero benefit otherwise

Zero benefit

Positive benefit

10 10 10

1 1 1

1

Page 11: 2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) Hierarchical Cost-sensitive Web Resource Acquisition

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT)

11 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching

Resource Acquisition Problem• Given

• Resource dependency graph G = (V, E)• Acquisition cost function cost: V → R+ {0}• Benefit function benefit: F(G) → R• Vertex type function type: V → T• Budget budget

• Acquire feasible vertex set V’ with cost(V’) ≤ budgetto maximize obj(V’) = benefit(V’) – cost(V’)

• Can be applied on many problems and variants

Page 12: 2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) Hierarchical Cost-sensitive Web Resource Acquisition

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT)

12 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching

Observations for Record Matching• Depth of 2 to 4 is typical• Usually possible to flatten graph to two levels

• Root level: expensive search engine results andweb page downloads

• Leaf level: cheap attribute values

Page 13: 2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) Hierarchical Cost-sensitive Web Resource Acquisition

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT)

13 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching

Challenges for Record Matching• Heuristic search

required• Not feasible to enumerate

all possible V’ to acquire

• Many local maxima• V’ = φ is a local maxima• Easy for states to be

revisited

• No guidance as to which root and child vertices to acquire• Each search(.) has the

same objective value• Vertex out-degree can be

more than max(|A|, |B|)

Page 14: 2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) Hierarchical Cost-sensitive Web Resource Acquisition

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT)

14 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching

Contents• Background and Motivation• Cost-sensitive Record Acquisition Framework• Algorithm for Record Matching Problems

• Application of Tabu search• More intelligent legal moves• Propagation of benefit values

• Evaluation• Conclusion

Page 15: 2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) Hierarchical Cost-sensitive Web Resource Acquisition

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT)

15 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching

Tabu Search (Glover, 1990)Simple Hill Climbing• Set V’ = φ• While termination condition

not satisfied• Let V* be neighbouring

state of V’ with best obj(V*) generated by legal moves

• Set V’ = V*

Simple Tabu Search• Set V’ = φ• Initialize empty Tabu list• While termination condition

not satisfied• Let V* be neighbouring

state of V’ with best obj(V*) generated by legal moves not in Tabu list

• Add move(s) that undo effect of selected move into Tabu list, and keep them in Tabu list for k iterations

• Set V’ = V*

Tabu list = List of prohibited moves1. Avoids getting stuck in local maxima2. Better exploration of state space

Page 16: 2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) Hierarchical Cost-sensitive Web Resource Acquisition

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT)

16 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching

Let D(V’) be the set ofchildren from any vertex in V’ that are not in V’

If V’ is empty, then set D(V’) = V

• Add(v): Add v from D(V’) and its ancestors into V’

• Remove(v): Remove v and its descendants from V’

• AddType(t): Add a maximal set of vertices (within acquisition budget) of type t from D(V’) into V’

Legal Moves

V’

D(V’)

Page 17: 2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) Hierarchical Cost-sensitive Web Resource Acquisition

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT)

17 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching

Propagated Benefit

v

u

v

where

Benefit propagation from leaf to root: vertices → edges → vertices → edges → …

Page 18: 2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) Hierarchical Cost-sensitive Web Resource Acquisition

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT)

18 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching

• Current state V’ and we consider adding V’’• Surrogate benefit function

• Surrogate objective function• surr-obj(V’, V’’) = surr-benefit(V’, V’’) – cost(V’’)

Surrogate Benefit and Objective

V’

V’’

Page 19: 2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) Hierarchical Cost-sensitive Web Resource Acquisition

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT)

19 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching

Benefit Function for SVM• We use SVM to classify test instances• Apply cost-sensitive attribute value acquisition work

in Tan and Kan (2010) to compute benefit function

Page 20: 2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) Hierarchical Cost-sensitive Web Resource Acquisition

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT)

20 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching

Contents• Background and Motivation• Cost-sensitive Record Acquisition Framework• Algorithm for Record Matching Problems• Evaluation

• Experimental setup• Results for linking short forms to long forms• Results for author name disambiguation

• Conclusion

Page 21: 2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) Hierarchical Cost-sensitive Web Resource Acquisition

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT)

21 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching

Experimental Setup Algorithms• Baselines

• Random• Least cost• Best benefit• Best cost-benefit ratio• Best type

• Our algorithm• Manual (with access to

solution set)

Evaluation Methodology• Start with no vertices

acquired• For each iteration

• Run acquisition algorithm with some budget

• Record: total acquisition cost, total misclassification cost

• Run for some iterations

Page 22: 2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) Hierarchical Cost-sensitive Web Resource Acquisition

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT)

22 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching

Datasets• Linking short forms to long

forms• SL-GENOMES: Human

genomes• SL-DBLP: Publication

venues

• Disambiguating author names in publication lists• AUTHOR-DBLP: 352

ambiguous author names

Page 23: 2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) Hierarchical Cost-sensitive Web Resource Acquisition

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT)

23 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching

Results• SL-GENOMES

Average improvement of difference with Manual over second best algorithm: 49%

Page 24: 2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) Hierarchical Cost-sensitive Web Resource Acquisition

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT)

24 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching

Results• AUTHOR-DBLP

Average improvement of difference with Manual over second best algorithm: 74%

Page 25: 2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) Hierarchical Cost-sensitive Web Resource Acquisition

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT)

25 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching

Search Engine Queries• Saved 90% of acquisitions of search(.) vertices for

test instances correctly classified by our algorithm• Mainly acquired search(a) or search(b) vertices, which

services many test instances• Typically, no need to acquire search(a b) for correct

classification

Page 26: 2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) Hierarchical Cost-sensitive Web Resource Acquisition

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT)

26 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching

Contents• Background and Motivation• Cost-sensitive Record Acquisition Framework• Algorithm for Record Matching Problems• Evaluation• Conclusion

• Contributions

Page 27: 2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT) Hierarchical Cost-sensitive Web Resource Acquisition

2010 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT)

27 of 27Yee Fan Tan and Min-Yen Kan - Hierarchical Cost-sensitive Web Resource Acquisition for Record Matching

Contributions

• Model using resource dependency graph• Versatile, applicable to many problems• Acquisition algorithm for record matching problems• Effective on record matching problems of different

domains

Thank you for your attention!

Framework for cost-sensitive acquisitions ofweb resources with hierarchical dependencies