Mayank Kejriwalkejriwalresearch.azurewebsites.net/pdf/adaptive-candidate-generatio… · Mayank...

Preview:

Citation preview

Mayank KejriwalInformation Sciences Institute/USC

kejriwal@isi.edu

http://usc-isi-i2.github.io/kejriwal/

Given one or more attribute-rich graphs, a training set of linked node pairs, how do we avoid evaluating allnode pairs (O|V|2)?

Blocks 1 2 3 4

5

Apply blocking keye.g. Tokens(LastName)

Generate candidate set (7 pairs), apply similarity function on each pair

?

?

?

?

?

?

?

Dataset 1

Dataset 2

‘Exhaustive’ set: 4 X 6=24 pairs

Idea: Candidate Generation via blocking

Even better…learn candidate generation function• Doing it efficiently without losing (much) expressive power:

Disjunctive Normal Form (DNF) blocking keys

• Example:• CharTriGrams(Last_Name) U (Numbers(Address) X Last4Chars(SSN))

• Use functional elements like CharTriGrams to construct complex blocking keys

• Optimal search is NP-Complete, use greedy approximation with guarantees

Some resultsDNF blocking for RDF Attribute Clustering (AC)

Name Recall Reduction FMeasure Recall Reduction FMeasure

Persons 1 100 99.75 99.88 100 98.86 99.43

Persons 2 99.00 99.79 99.39 99.75 99.02 99.38

Restaurants 100 99.73 99.87 100 95.57 99.79

Eprints-Rexa 98.16 99.28 98.72 99.60 99.37 99.48

IM-Similarity 100 98.14 99.06 100 62.79 77.14

IIMB-059 99.76 93.35 96.45 97.33 73.09 83.49

IIMB-062 47.73 98.11 64.22 77.27 90.80 83.49

Libraries 97.96 99.99 98.96 99.99 99.87 99.93

Parks 95.96 94.41 95.18 99.07 88.27 93.36

Video Game 98.73 99.96 99.34 99.72 99.85 99.79

Average 93.73 98.25 95.11 97.27 91.15 93.53