Synergy of Database Techniques and Machine Learning Models for String Similarity Search and Join CIKM 2019 Tutorial Jiaheng Lu, University of Helsinki Chunbin Lin, Amazon AWS Jin Wang, University of California Los Angeles Chen Li, University of California Irvine


Synergy of Database Techniques and Machine Learning Models for String Similarity Search and Join

CIKM 2019 Tutorial

Jiaheng Lu, University of Helsinki
Chunbin Lin, Amazon AWS
Jin Wang, University of California Los Angeles
Chen Li, University of California Irvine


Outline

Motivation and Background

History and Classification

Database Techniques

Machine Learning Models

Open Challenges and Discussion


String data is ubiquitous

A string is a sequence of characters.

Examples of string data:

• Product catalogs

• DNA sequence data

• Text description

• Customer relationship management data


Example: approximate string processing for a movie database

Star              Title                  Year  Genre
Keanu Reeves      The Matrix             1999  Sci-Fi
Samuel Jackson    Iron Man               2008  Sci-Fi
Schwarzenegger    Terminator: Dark Fate  2019  Sci-Fi
Samuel Jackson    The Man                2006  Crime

Find movies starring “Schwarzeneger” (missing one ‘g’)


Data may not be clean

Data integration and cleaning:

Relation R (Star): Keanu Reeves; Samuel Jackson; Schwarzenegger

Relation S (Star): Keanu Reeves; Samuel L. Jackson; Schwarzenegger


Problem definition: approximate string searches

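Query q: Schwarrzenger

Collection of strings s (Star): Schwarzenger; Samuel Jackson; Keanu Reeves

Search

Output: strings s that satisfy Dist(q,s) ≤ δ for a distance function such as edit distance, or Sim(q,s) ≥ δ for a similarity function such as Jaccard or cosine

Sim functions: edit distance, Jaccard coefficient and cosine similarity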


Problem definition: approximate string joins

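Collection of strings s (Star): Schwarzenger; Samuel Jackson; Keanu Reeves

Collection of strings t (Star): Schwarzengger; Samuel Jackson; Keanu Reeves

Join

Output: pairs of strings s and t that satisfy Dist(s,t) ≤ δ for a distance function such as edit distance, or Sim(s,t) ≥ δ for a similarity function such as Jaccard or cosine

Sim functions: edit distance, Jaccard coefficient and cosine similarity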


Edit distance

A widely used metric to define string similarity

Ed(s1,s2)= minimum # of operations (insertion, deletion, substitution) to change s1 to s2.

Example:
s1: Tom Hanks
s2: Ton Hank
ed(s1,s2) = 2
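The definition above can be checked with the standard dynamic-programming recurrence. This is a minimal sketch, not code from the tutorial; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance via dynamic programming, keeping one row at a time."""
    prev = list(range(len(s2) + 1))  # distance from "" to each prefix of s2
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # distance from s1[:i] to ""
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,               # delete c1
                curr[j - 1] + 1,           # insert c2
                prev[j - 1] + (c1 != c2),  # substitute (free when characters match)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2: substitute m with n, delete s
```

The table has (|s1|+1) × (|s2|+1) cells, so verification of a single pair costs time proportional to the product of the two lengths.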


Jaccard Coefficient

⮚ Jaccard(X,Y) = |X∩Y| / |X∪Y|

⮚ q-grams

For example, the 2-grams of “universal”:

(un),(ni),(iv),(ve),(er),(rs),(sa),(al)


Example:
s1: Tom Hanks
s2: Ton Hank

3-Gram(s1): {Tom, om_, m_H, _Ha, Han, ank, nks}
3-Gram(s2): {Ton, on_, n_H, _Ha, Han, ank}

Jaccard(s1,s2) = 3/10 = 0.3
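The example can be reproduced with a few lines of code. This is a minimal sketch; the helper names `qgrams` and `jaccard` are illustrative, and writing the space as `_` simply mirrors the notation on the slide.

```python
def qgrams(s: str, q: int) -> set:
    """Set of overlapping q-grams of s, with spaces written as '_' as on the slide."""
    s = s.replace(" ", "_")
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard(x: set, y: set) -> float:
    """|X ∩ Y| / |X ∪ Y|; two empty sets are treated as identical (an assumption)."""
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)


g1 = qgrams("Tom Hanks", 3)
g2 = qgrams("Ton Hank", 3)
print(sorted(g1 & g2))   # the three shared 3-grams
print(jaccard(g1, g2))   # 3 shared / 10 distinct
```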


Cosine similarity

$$\mathrm{cosine}(A, B) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$

Example:
s1: Tom Hanks
s2: Ton Hank

3-Gram(s1): {Tom, om_, m_H, _Ha, Han, ank, nks}
3-Gram(s2): {Ton, on_, n_H, _Ha, Han, ank}

Cosine(s1,s2) = 3/sqrt(6×7) = 0.46
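As a sketch of how the cosine formula applies to q-grams, each string can be turned into a vector of q-gram counts. The names `qgram_counts` and `cosine` are illustrative and not taken from the tutorial.

```python
from collections import Counter
from math import sqrt


def qgram_counts(s: str, q: int) -> Counter:
    """Multiset of overlapping q-grams, with spaces written as '_'."""
    s = s.replace(" ", "_")
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Dot product of the two count vectors divided by the product of their norms."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


print(round(cosine(qgram_counts("Tom Hanks", 3), qgram_counts("Ton Hank", 3)), 2))
```

Because every 3-gram in this example occurs once, the vectors are binary and the result equals 3/sqrt(7×6).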


Synonym based similarity

Synonyms:
CIKM = ACM International Conference on Information and Knowledge Management
China = CN

Example:
S1 = “ACM International Conference on Information and Knowledge Management China”
S2 = “CIKM 2019 CN”

How to use the existing synonyms?


String Similarity Measures (Full-expansion)

Strings
S1 = “ACM International Conference on Information and Knowledge Management China”
S2 = “CIKM 2019 CN”

Synonyms
CIKM = ACM International Conference on Information and Knowledge Management
China = CN

Expanding using all synonyms
S1’ = “ACM International Conference on Information and Knowledge Management China CIKM CN”
S2’ = “CIKM 2019 CN ACM International Conference on Information and Knowledge Management China”

Jaccard(S1’,S2’) = 11/12 = 0.92
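The expansion step can be sketched as follows. This is a simplified illustration under stated assumptions: rules are treated as bidirectional pairs, matching is case-sensitive on whole whitespace-separated tokens, and the names `contains_phrase` and `full_expansion` are hypothetical.

```python
def contains_phrase(text: str, phrase: str) -> bool:
    """True if phrase occurs in text as a run of whole whitespace-separated tokens."""
    normalized = " ".join(text.split())
    return f" {phrase} " in f" {normalized} "


def full_expansion(text: str, rules) -> set:
    """Token set of text, extended with the other side of every applicable rule."""
    tokens = set(text.split())
    for left, right in rules:
        if contains_phrase(text, left):
            tokens |= set(right.split())
        if contains_phrase(text, right):
            tokens |= set(left.split())
    return tokens


rules = [
    ("CIKM", "ACM International Conference on Information and Knowledge Management"),
    ("China", "CN"),
]
s1 = "ACM International Conference on Information and Knowledge Management China"
s2 = "CIKM 2019 CN"
e1, e2 = full_expansion(s1, rules), full_expansion(s2, rules)
print(len(e1 & e2), len(e1 | e2))  # sizes of the intersection and the union
```

Only the token “2019” is left unmatched, which gives the 11/12 on the slide.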


Scope of this tutorial

⮚ Database techniques
❖ String similarity measure
❖ String similarity search
❖ String similarity join

⮚ Machine learning techniques
❖ Traditional machine learning based approaches (e.g. SVM)
❖ Deep learning based techniques
❖ End-to-end systems
❖ Related topics in natural language processing


Timeline of string similarity processing mechanism


Two parts

Part 1: String similarity searches and joins in databases

Part 2: String similarity searches and joins with machine learning


Part 1

String similarity searches and joins in databases


Outline

Motivation & Problem Definition

String Similarity Measures

String Similarity Search

String Similarity Join

Conclusion


Motivation & Problem Definition

Source 1
Name: Victor D. Vianu
Affiliation: UCSD
Address: 9500 Gilman Dr CA

Source 2
Name: Victor Viannu
Address: 9500 Gilman Drive
Institution: University of California San Diego

Annotations in the figure: typo; different representation

Inconsistent data stored in different sources

Typos (Vianu vs. Viannu)

Synonyms (UCSD vs. University of California San Diego)

Taxonomy (Coffee vs. Latte)

……


Motivation & Problem Definition

String similarity search

Input: a query string s and a collection of strings t1, t2, …, tm

Output: pairs (s, tj) such that s and tj are similar


Motivation & Problem Definition

String similarity join

Input: two collections of strings s1, s2, …, sn and t1, t2, …, tm

Output: pairs (si, tj) such that si and tj are similar


Motivation & Problem Definition

Basic operation: compute the similarity for two strings

s = “VLDB”, t = “PVLDB”

s = “UW”, t = “University of Washington”

s = “UW”, t = “University of Waterloo”

s = “cafe”, t = “coffee shop”

s = “Vianu”, t = “Viannu”

s = “Papakonstantinou”, t = “Papaconstantnou”

Are they similar or not?

Challenge: accuracy


Motivation & Problem Definition

Naïve method: compute the similarity for every pair of strings

String similarity join: compare every si in s1, s2, …, sn with every tj in t1, t2, …, tm

String similarity search: compare s with every tj in t1, t2, …, tm

Bad performance!

Challenge: performance


String Similarity Joins

Filter-and-Verification Framework

Filter: prune dissimilar string pairs, generate candidates

Verification: compute similarity scores for candidates

[Figure: all pairs between s1–s5 and t1–t5 → Filtering → Candidates → Verification → Results]
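The framework can be sketched for a Jaccard join over word tokens. The two filters used here (an inverted index on shared tokens and a size-ratio check) are common textbook choices picked for illustration, not necessarily the filters the tutorial goes on to present; `similarity_join` and the sample data are hypothetical.

```python
from collections import defaultdict


def jaccard(x: set, y: set) -> float:
    return len(x & y) / len(x | y) if (x or y) else 1.0


def similarity_join(S, T, delta):
    """All (i, j, sim) with Jaccard(S[i], T[j]) >= delta, assuming 0 < delta <= 1."""
    s_sets = [set(s.split()) for s in S]
    t_sets = [set(t.split()) for t in T]

    # Inverted index: token -> positions of the strings in T that contain it.
    index = defaultdict(list)
    for j, tokens in enumerate(t_sets):
        for tok in tokens:
            index[tok].append(j)

    results = []
    for i, x in enumerate(s_sets):
        # Filter 1: a pair with Jaccard > 0 must share at least one token.
        candidates = {j for tok in x for j in index.get(tok, ())}
        for j in sorted(candidates):
            y = t_sets[j]
            # Filter 2: Jaccard(x, y) can never exceed min(|x|,|y|) / max(|x|,|y|).
            small, large = sorted((len(x), len(y)))
            if small / large < delta:
                continue
            # Verification: compute the exact score only for surviving candidates.
            sim = jaccard(x, y)
            if sim >= delta:
                results.append((i, j, sim))
    return results


S = ["University of California San Diego", "Keanu Reeves"]
T = ["University of California Los Angeles", "Keanu Reeves", "UCSD"]
print(similarity_join(S, T, 0.4))
```

Both filters are safe in the sense that they never discard a pair whose true score reaches the threshold, so the output matches the naïve all-pairs method.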


Outline

Motivation & Problem Definition

String Similarity Measures

String Similarity Search

String Similarity Join

Conclusion


String Similarity Measures

Syntactic Similarity
  Token-based similarity: Jaccard, Dice, Cosine, ……
  Character-based similarity: Edit Distance, Edit Similarity, Hamming Distance, ……

Semantic Similarity
  Synonym-based similarity: JaccT [ICDE/ArasuCK08], Full Expansion [SIGMOD/LuLWLW13], Selective Expansion [SIGMOD/LuLWLW13], Pkduck [PVLDB/TaoDS17], JACCAR [EDBT/WangLLZ19]
  Taxonomy-based similarity: K-Join [ICDE/ShangLLF17], GTS [CIKM/XuL18]

Hybrid Similarity: Fast-Join [ICDE/WangLF11], Silkmoth [PVLDB/DengKMS17], MF-Join [ICDE/WangLZ19]

USIM (Unified String Similarity) [PVLDB/XuL19]


String Similarity Measures

Token-based similarity: Jaccard, Dice, Cosine, etc.

Convert a string into tokens

s = “University of California San Diego”

t = “University of California Los Angeles”

s’ = {University, of, California, San, Diego}

t’ = {University, of, California, Los, Angeles}

Overlap-based similarity, higher score means more similar
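The three measures named on this slide can be compared on the two token sets. The slide does not spell out the formulas, so the set-based definitions below (Dice as 2|X∩Y|/(|X|+|Y|), cosine as |X∩Y|/sqrt(|X|·|Y|)) are the commonly used ones and should be read as an assumption.

```python
from math import sqrt


def token_similarities(s: str, t: str) -> dict:
    """Jaccard, Dice and set-based cosine over whitespace-separated tokens."""
    x, y = set(s.split()), set(t.split())
    if not x or not y:
        # Assumption: an empty string has similarity 0 to everything.
        return {"jaccard": 0.0, "dice": 0.0, "cosine": 0.0}
    overlap = len(x & y)
    return {
        "jaccard": overlap / len(x | y),
        "dice": 2 * overlap / (len(x) + len(y)),
        "cosine": overlap / sqrt(len(x) * len(y)),
    }


scores = token_similarities(
    "University of California San Diego",
    "University of California Los Angeles",
)
print(scores)  # three shared tokens out of five on each side
```

All three scores are driven by the same overlap count and differ only in how they normalize it.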


Token-based similarity: Jaccard, Dice, Cosine, etc.

Convert a string into q-grams

2-gram set of “AMAZON” is {“AM”, “MA”, “AZ”, “ZO”, “ON”}

s = “AMAZON”

t = “AMACON”

s’ = {AM, MA, AZ, ZO, ON}

t’ = {AM, MA, AC, CO, ON}


Character-based similarity: Edit distance, Edit Similarity, etc.

Edit distance ED: the minimum number of single-character edits (insertions, deletions or substitutions) required to change one string into the other.

ED(“me”, “my”) = 1

ED(“starbucks”, “starbukk”) = 2

starbucks → starbukks (1st edit: substitute c with k) → starbukk (2nd edit: delete s)

Edit Similarity EDS:

Edit-distance based similarity: for edit distance, a lower score means more similar

String Similarity Measures


Hybrid similarity: Fuzzy-Overlap, Fuzzy-Jaccard, Fuzzy-Dice, Fuzzy-Cosine

1. Tokenize each string as a set of tokens.

2. Allow fuzzy matching between tokens.

s = “International Conference on Inforomation and Knowledges Management”

t = “Internaitional Conferrence on Information and Knowledge Management”

String Similarity Measures

Jaccard(s, t) = 3/11 ≈ 0.273 (only “on”, “and” and “Management” match exactly)


Hybrid similarity: Fuzzy-Overlap, Fuzzy-Jaccard, Fuzzy-Dice, Fuzzy-Cosine

1. Tokenize each string as a set of tokens.

2. Allow fuzzy matching between tokens.

s = “International Conference on Inforomation and Knowledges Management”

t = “Internaitional Conferrence on Information and Knowledge Management”

String Similarity Measures

Fuzzy-Jaccard(s, t) = 7/7 = 1.0

E.g., allow edit distance less than 2 between tokens


Hybrid similarity: Fuzzy-Overlap, Fuzzy-Jaccard, Fuzzy-Dice, Fuzzy-Cosine

1. Tokenize each string as a set of tokens.

2. Allow fuzzy matching between tokens.

s = “International Conference on Inforomation and Knowledges Management”

t = “on Information and Knowledge Management Internaitional Conferrence”

String Similarity Measures

Fuzzy-Jaccard(s, t) = 7/7 = 1.0

E.g., allow edit distance less than 2 between tokens


Hybrid similarity: Fuzzy-Overlap, Fuzzy-Jaccard, Fuzzy-Dice, Fuzzy-Cosine

1. Tokenize each string as a set of tokens.

2. Allow fuzzy matching between tokens.

Weighted bigraph

[Figure: bigraph between tokens s1, s2, s3, …, sn-1, sn and tokens t1, t2, …, tm-1, tm, with edge weights such as w1,1, w2,1, w1,m, w2,m-1, wn-1,m, wn,1]

Build a bigraph (bipartite graph) to model the two token sets

Node: one node per token

Edge: added if the character-based similarity of two tokens exceeds a threshold

Weight: the character-based similarity

Fuzzy-overlap = maximum weight matching
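The bigraph construction can be sketched for short strings. Several choices here are assumptions made for illustration: the matching is found by exhaustive search rather than a dedicated matching algorithm, `unit_weight` gives every allowed edge weight 1 (which reproduces the 7/7 of the earlier example, whereas the slide uses the character-based similarity as the weight), and Fuzzy-Jaccard is taken to be O / (|s| + |t| − O) with O the fuzzy overlap.

```python
from functools import lru_cache


def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def unit_weight(a: str, b: str) -> float:
    """Hypothetical edge weight: 1 if edit distance < 2, else 0 (no edge)."""
    return 1.0 if edit_distance(a, b) < 2 else 0.0


def fuzzy_overlap(s_tokens, t_tokens, weight=unit_weight) -> float:
    """Maximum weight matching in the token bigraph (exhaustive; short inputs only)."""
    s_tokens, t_tokens = tuple(s_tokens), tuple(t_tokens)

    @lru_cache(maxsize=None)
    def best(i: int, used: int) -> float:
        if i == len(s_tokens):
            return 0.0
        total = best(i + 1, used)  # option: leave s_tokens[i] unmatched
        for j, tok in enumerate(t_tokens):
            if (used >> j) & 1:
                continue  # t_tokens[j] is already matched
            w = weight(s_tokens[i], tok)
            if w > 0:  # an edge exists only for a positive weight
                total = max(total, w + best(i + 1, used | (1 << j)))
        return total

    return best(0, 0)


def fuzzy_jaccard(s: str, t: str, weight=unit_weight) -> float:
    x, y = s.split(), t.split()
    if not x and not y:
        return 1.0
    o = fuzzy_overlap(x, y, weight)
    return o / (len(x) + len(y) - o)


s = "International Conference on Inforomation and Knowledges Management"
t = "on Information and Knowledge Management Internaitional Conferrence"
print(fuzzy_jaccard(s, t))
```

Since the matching works on sets of tokens, reordering the tokens of t does not change the score.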


String Similarity Measures

Hybrid similarity: Fuzzy-Overlap, Fuzzy-Jaccard, Fuzzy-Dice, Fuzzy-Cosine

1. Tokenize each string as a set of tokens.

2. Allow fuzzy matching between tokens.

Fast-Join [ICDE/WangLF11]

Token level: token-based similarity

Element level: character-based similarity

[Figure: weighted bigraph between tokens s1, …, sn and t1, …, tm, annotated with the token level and the element level]


String Similarity Measures

Hybrid similarity: Fuzzy-Overlap, Fuzzy-Jaccard, Fuzzy-Dice, Fuzzy-Cosine

1. Tokenize each string as a set of tokens.

2. Allow fuzzy matching between tokens.

Silkmoth [PVLDB/DengKMS17], MF-Join [ICDE/WangLZ19]

Token level: token-based similarity

Element level: different kinds of similarities

[Figure: weighted bigraph between tokens s1, …, sn and t1, …, tm, annotated with the token level and the element level]


String Similarity Measures

Syntactic similarity

Token-based similarity: Jaccard, Dice, Cosine, etc.

Character-based similarity: Edit distance

Hybrid similarity: Fuzzy-Jaccard, Fuzzy-Dice, Fuzzy-Cosine

Missing cases

s = “University of California San Diego Computer Science”

t = “UCSD CS”

Jaccard(s, t) = Dice(s, t) = Cosine(s, t) = 0

Edit-distance(s, t) = 44

Fuzzy-Jaccard(s, t) = Fuzzy-Dice(s, t) = Fuzzy-Cosine(s, t) = 0

s and t are not similar at all


String Similarity Measures

Using synonyms to enhance string similarity measures

Source of synonyms:

Obtained by machine learning models

Provided by domain experts

Extracted from knowledge bases

Example synonyms

AWS = Amazon Web Services

CIKM = ACM International Conference on Information and Knowledge Management

UW = University of Washington

UW = University of Wisconsin

UW = University of Waterloo


String Similarity Measures

JaccT [ICDE/ArasuCK08]

Transformation-based operation

Enumerate all the transformed strings, pick the pair with the largest score

Strings

s = “VLDB Conf”

t = “Large Database Conference”

Synonyms

VLDB → Very Large Database

Conf → Conference

Intermediate Results

s1 = “VLDB Conf”

s2 = “Very Large Database Conf”

s3 = “VLDB Conference”

s4 = “Very Large Database Conference”

t1 = “Large Database Conference”

t2 = “Large Database Conf”

SIM(s, t) = Jaccard(s4, t1) = 3/4 = 0.75
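The enumeration on this slide can be sketched as a small closure computation. This is an illustration under assumptions, not the algorithm of the cited paper: rules are applied in both directions (which is what yields “Large Database Conf” for t), matching is on whole tokens separated by single spaces, and the names `variants` and `jacct_similarity` as well as the `limit` safeguard are hypothetical.

```python
from itertools import product


def jaccard(x: set, y: set) -> float:
    return len(x & y) / len(x | y) if (x or y) else 1.0


def variants(text: str, rules, limit: int = 1000) -> set:
    """All strings reachable from text by rewriting one rule side into the other."""
    seen = {text}
    frontier = [text]
    while frontier and len(seen) < limit:  # limit guards against runaway rule sets
        current = frontier.pop()
        padded = f" {current} "
        for left, right in rules:
            for src, dst in ((left, right), (right, left)):
                if f" {src} " in padded:
                    new = padded.replace(f" {src} ", f" {dst} ", 1).strip()
                    if new not in seen:
                        seen.add(new)
                        frontier.append(new)
    return seen


def jacct_similarity(s: str, t: str, rules) -> float:
    """Largest Jaccard score over all pairs of transformed strings."""
    return max(
        jaccard(set(a.split()), set(b.split()))
        for a, b in product(variants(s, rules), variants(t, rules))
    )


rules = [("VLDB", "Very Large Database"), ("Conf", "Conference")]
print(sorted(variants("VLDB Conf", rules)))
print(jacct_similarity("VLDB Conf", "Large Database Conference", rules))
```

The number of transformed strings can grow exponentially with the number of applicable rules, which is the cost of enumerating both sides.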


String Similarity Measures

PKduck [PVLDB/TaoDS17]

Transformation-based operation

Apply to one side each time

Strings

s = “VLDB Conf”

t = “Large Database Conference”

Synonyms

VLDB → Very Large Database

Conf → Conference

Intermediate Results

s1 = “VLDB Conf”

s2 = “Very Large Database Conf”

s3 = “VLDB Conference”

s4 = “Very Large Database Conference”

t = “Large Database Conference”

SIM(s, t) = Jaccard(s4, t) = 3/4 = 0.75


String Similarity Measures

JACCAR [EDBT/WangLLZ19]

Transformation-based operation

Apply to only one side

s = “VLDB Conf”

Document

“…… Important dates for contributions to the 35th international very large database conference held in USA ….”

Synonyms

VLDB → Very Large Database

Conf → Conference

Intermediate Results

s1 = “VLDB Conf”

s2 = “Very Large Database Conf”

s3 = “VLDB Conference”

s4 = “Very Large Database Conference”

SIM(s, t) = Jaccard(s4, t) = 4/4 = 1.0, where t is the substring “very large database conference” of the document


String Similarity Measures

Full-Expansion [SIGMOD/LuLWLW13]

Expansion-based operation

Apply all the applicable rules

Strings

s = “ACM's Special Interest Group on Management Of Data NY USA”

t = “SIGMOD New York United States of America”

Synonyms

ACM’s → Association for Computing Machinery’s

SIGMOD → ACM's Special Interest Group on Management Of Data

SIGMOD → International Conference on Management of Data

NY → New York

USA → United States of America

Intermediate Results

s’ = “ACM's Special Interest Group on Management Of Data SIGMOD NY New York USA United States America Association for Computing Machinery’s”

t’ = “ACM's Special Interest Group on Management of Data SIGMOD NY New York USA United States America International Conference”

SIM(s, t) = Jaccard(s’, t’) = 16/22 ≈ 0.73


String Similarity Measures

Selective-Expansion [SIGMOD/LuLWLW13]

Expansion-based operation

Apply only good applicable rules

Strings

s = “ACM's Special Interest Group on Management Of Data NY USA”

t = “SIGMOD New York United States of America”

Synonyms

ACM’s → Association for Computing Machinery’s

SIGMOD → ACM's Special Interest Group on Management Of Data

SIGMOD → International Conference on Management of Data

NY → New York

USA → United States of America

Intermediate Results

s’ = “ACM's Special Interest Group on Management Of Data SIGMOD NY New York USA United States America”

t’ = “ACM's Special Interest Group on Management of Data SIGMOD NY New York USA United States America”

SIM(s, t) = Jaccard(s’, t’) = 16/16 = 1.0


String Similarity Measures

GTS [CIKM/XuL18]: using taxonomy to enhance string similarity

[Figure: hierarchical taxonomy built from Wikipedia categories; node labels include Italy, food, restaurants, Turin, coffee, coffee drinks, latte, espresso, type of restaurants, bar, coffeehouse, Arteries of turin, Via nizza]

s = “American latte”

t = “American espresso”


String Similarity Measures

GTS [CIKM/XuL18]: using taxonomy to enhance string similarity

Node similarity:

Set similarity:


USIM [PVLDB/XuL19]

String Similarity Measures

Hierarchical taxonomy + Synonyms + Syntactic

Hierarchical taxonomy (Wikipedia categories): food, coffee, coffee drinks, latte, espresso

Synonyms

VLDB = Very Large Databases

USA = American

Info = Information

CS = Computer Science

UW = University of Washington

UW = University of Wisconsin

UW = University of Waterloo

Syntactic

ED (string, strng) = 1

ED (database, databases) = 1

ED (Univeristy, University) = 2

ED (cup, cups) = 1


USIM [PVLDB/XuL19]

String Similarity Measures

Wikipedia categories

food

coffee

coffee

drinks

latte espresso

Hierarchical taxonomy Synonyms

VLDB = Very Large Databases

USA = American

Info = Information

CS = Computer Science

UW = University of Washington

UW = University of Wisconsin

UW = University of Waterloo

Syntactic

ED (string, strng) = 1

ED (database, databases) = 1

ED (Univeristy, University) = 2

ED (cold, cool) = 2

+ +

s = “USA cold latte”

t = “American cool espresso”

when strings are segmented
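The ED values on this slide can be checked with the textbook dynamic-programming (Levenshtein) edit distance. The following is a plain illustrative sketch; the function name `edit_distance` is chosen here and does not come from the tutorial.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


for s, t in [("string", "strng"), ("database", "databases"),
             ("Univeristy", "University"), ("cup", "cups"), ("cold", "cool")]:
    print(s, t, edit_distance(s, t))
```

The transposed letters in "Univeristy" cost two operations because plain edit distance has no swap operation.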


USIM [PVLDB/XuL19]

String Similarity Measures

Simple case: when the strings are already segmented, the maximum similarity can be obtained directly.

Example: three kinds of inconsistencies in the figure are captured.


USIM [PVLDB/XuL19]

String Similarity Measures

However, raw strings are not segmented. To make things even harder, a pair of strings may have more than one plan of segmentation:

How can we know that P1 is the best?

Outline

• Motivation & Problem Definition
• String Similarity Measures
• String Similarity Search
• String Similarity Join
• Conclusion

String similarity Search

Given a string s and a collection of strings t1, t2, …, tm, find the pairs (s, tj) such that s and tj are similar (s ~ tj).

Threshold-based string similarity search
• Input: a set of strings S, a query q, a threshold value θ
• Output: a set of string pairs (s, q) where SIM(s, q) > θ

Top-k based string similarity search
• Input: a set of strings S, a query q, an integer k
• Output: k string pairs (s, q) with the highest similarities

String Similarity Search

Threshold-based Search
• Syntactic Similarity, Token-based Similarity: DivideSkip [ICDE/LiLL08]
• Syntactic Similarity, Character-based Similarity: V-Gram [VLDB/LiWY07], HS-Topk [ICDE/WangLDZF15], DivideSkip [ICDE/LiLL08]
• Semantic Similarity: SI-Search [TODS/LuLWLX15]

Top-k Search
• Syntactic Similarity (Token-based Similarity, Character-based Similarity): HS-Topk [ICDE/WangLDZF15], TopkSearch [ICDE/DengLFL13], AppGram [PVLDB/WangDTZ13]

String similarity Search

Filter-and-Verification framework

Count filter: if two strings are similar, they must have at least T common tokens (token-based similarity) or q-grams (character-based similarity).

Example:
s = “Information and Knowledge Management”
t = “Management of Knowledge Base”
Threshold = 0.6

$T = \frac{0.6}{1 + 0.6} \times (4 + 4) = 3$, so the two strings need to have 3 common tokens.

Only 2 common tokens, so this pair is not an answer (Jaccard = 2/6 ≈ 0.33 < 0.6).
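The count-filter arithmetic in this example can be reproduced in a few lines. This is a minimal sketch for the Jaccard case only: it uses the bound that Jaccard(s, t) ≥ θ implies |s ∩ t| ≥ θ / (1 + θ) · (|s| + |t|), and the helper names `min_common_tokens` and `jaccard` are chosen here for illustration.

```python
import math


def min_common_tokens(theta, size_s, size_t):
    """Count-filter bound T for Jaccard similarity."""
    # The small epsilon guards against floating-point noise before rounding up.
    return math.ceil(theta / (1 + theta) * (size_s + size_t) - 1e-9)


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    return len(a & b) / len(a | b)


s = set("Information and Knowledge Management".split())
t = set("Management of Knowledge Base".split())

T = min_common_tokens(0.6, len(s), len(t))
common = len(s & t)
print(T, common, round(jaccard(s, t), 2))  # the pair is pruned because common < T
```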

String similarity Search

Index: inverted list with q-gram as key and strings as values

Strings:
s1 = “abcd”
s2 = “ab”
s3 = “bcde”

Create 2-grams:
s1 = {ab, bc, cd}
s2 = {ab}
s3 = {bc, cd, de}

Create index:
ab → s1, s2
bc → s1, s3
cd → s1, s3
de → s3

For a query Q, compute the T
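The index construction on this slide can be sketched as follows. This is an illustrative Python version, assuming unpadded 2-grams as in the example; `qgrams` and `build_index` are names chosen here.

```python
from collections import defaultdict


def qgrams(s, q=2):
    """All overlapping substrings of length q (no padding)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(strings, q=2):
    """Inverted index: q-gram -> list of ids of the strings containing it."""
    index = defaultdict(list)
    for sid, s in strings.items():
        for gram in sorted(set(qgrams(s, q))):  # each string listed once per gram
            index[gram].append(sid)
    return dict(index)


strings = {"s1": "abcd", "s2": "ab", "s3": "bcde"}
index = build_index(strings)
for gram, ids in sorted(index.items()):
    print(gram, ids)
```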

String similarity Search

Q = abc, Q’ = {ab, bc}

Inverted lists probed by the query: ab → s1, s2; bc → s1, s3

Expensive operation: merging the inverted lists of the query grams to count how many grams each string shares with the query. ScanCount, MergeSkip and DivideSkip improve this step.

String similarity Search

ScanCount
• Maintain an |S|-length array of counters with initial value 0.
• For each record in an inverted list, increase its count by 1.
• Strings whose counts reach the threshold are candidates.
• Needs to process all inverted lists containing tokens in the query.

MergeSkip
• Sort string IDs in each inverted list.
• Maintain a heap and increase counts for popped strings.
• Prunes irrelevant records.

DivideSkip [ICDE/LiLL08]
• Improves MergeSkip by sorting and grouping strings.
• Avoids scanning long inverted lists.
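The two basic list-merging strategies can be sketched on the toy index from the earlier slide (s1, s2, s3 become ids 0, 1, 2). This is a simplified illustration, not the published algorithms: `scan_count` follows the counter-array description, while `heap_merge` shows only the heap-based merging that MergeSkip builds on and omits its skipping optimisation. All function names are chosen here.

```python
import heapq
from collections import defaultdict


def build_index(strings, q=2):
    """Inverted index: q-gram -> sorted list of integer string ids."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for gram in {s[i:i + q] for i in range(len(s) - q + 1)}:
            index[gram].append(sid)   # ids arrive in increasing order
    return index


def scan_count(index, query_grams, n_strings, T):
    """One counter per string; scan every list of every query gram."""
    counts = [0] * n_strings
    for gram in query_grams:
        for sid in index.get(gram, []):
            counts[sid] += 1
    return [sid for sid, c in enumerate(counts) if c >= T]


def heap_merge(index, query_grams, T):
    """Merge the sorted lists with a heap and count equal ids as they pop."""
    lists = [index[g] for g in query_grams if g in index]
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists)]
    heapq.heapify(heap)
    result = []
    while heap:
        top, count = heap[0][0], 0
        while heap and heap[0][0] == top:
            _, i, pos = heapq.heappop(heap)
            count += 1
            if pos + 1 < len(lists[i]):   # advance the list we popped from
                heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        if count >= T:
            result.append(top)
    return result


strings = ["abcd", "ab", "bcde"]
index = build_index(strings)
query_grams = ["ab", "bc"]            # Q = abc
print(scan_count(index, query_grams, len(strings), T=2))
print(heap_merge(index, query_grams, T=2))
```

Both functions return the same candidate ids; they differ only in how much of each inverted list they have to touch.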

String similarity Search

V-Gram [VLDB/LiWY07]

Observation: some grams are very frequent while others are infrequent, so fixed-length grams may be inefficient.

Solution: use variable-length grams to avoid generating very frequent grams.


String similarity Search

SI-Search [TODS/LuLWLX15]
• Query: build a QP-tree
• Strings: build an SI-tree
• Filters: length filter and prefix filter


String similarity Search

Threshold-based search → top-k search

Answering a top-k query by iteratively tuning the threshold of a threshold-based search performs badly, and good thresholds are hard to estimate.

Efficient way: use a priority queue so that the strings most likely to be in the final results are always evaluated first.
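The priority-queue idea can be illustrated with a small sketch. It is not HS-Topk, TopkSearch or AppGram: it assumes edit distance as the measure, orders candidates by the length difference (a valid lower bound on edit distance), and stops once that bound cannot beat the current k-th best result. The names `topk_search` and `edit_distance` are chosen here.

```python
import heapq


def edit_distance(a, b):
    """Textbook dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def topk_search(strings, query, k):
    """Return the k strings closest to query as (distance, string) pairs."""
    # Min-heap of candidates keyed by the lower bound |len(s) - len(query)|.
    frontier = [(abs(len(s) - len(query)), s) for s in strings]
    heapq.heapify(frontier)
    best = []                                  # max-heap via negated distances
    while frontier:
        bound, s = heapq.heappop(frontier)
        if len(best) == k and bound >= -best[0][0]:
            break                              # no remaining string can improve the top-k
        d = edit_distance(s, query)
        if len(best) < k:
            heapq.heappush(best, (-d, s))
        elif d < -best[0][0]:
            heapq.heapreplace(best, (-d, s))
    return sorted((-neg_d, s) for neg_d, s in best)


data = ["database", "databases", "data", "string", "strng"]
print(topk_search(data, "databas", k=2))
```

No threshold has to be guessed: the k-th best distance found so far plays the role of a threshold that tightens as the search proceeds.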

String similarity Search

HS-Topk [ICDE/WangLDZF15]

Creates a hierarchical segment tree index, the HS-Tree:
-- First groups the strings by length.
-- Then constructs a complete binary tree for each group of strings.

Proposes two pruning strategies:
-- greedy-match strategy: prune the strings with consecutive errors
-- batch-pruning-based strategy: prune the strings by computing upper and lower bounds

Limitation: inefficient filter step for large thresholds

String similarity Search

TopkSearch [ICDE/DengLFL13]

Improves the traditional dynamic-programming algorithm for calculating edit distance to avoid a large number of edit-distance computations.

Limitation: inefficient for long strings

String similarity Search

AppGram [PVLDB/WangDTZ13]

Uses approximate q-gram matchings.

Uses a queue to prune strings with small distances.

Uses two filtering strategies to prune others:
-- the CA strategy and the f-queue strategy

Uses a max heap to maintain the top-k strings most similar to the query.

Limitation: high space complexity

Outline

• Motivation & Problem Definition
• String Similarity Measures
• String Similarity Search
• String Similarity Join

String Similarity Join

Given two collections of strings s1, s2, …, sn and t1, t2, …, tm, find the pairs (si, tj) such that si and tj are similar (si ~ tj).

Threshold-based string similarity join
• Input: a set of strings S, a set of strings T, a threshold value θ
• Output: a set of string pairs (s, t) where SIM(s, t) > θ

Top-k based string similarity join
• Input: a set of strings S, a set of strings T, an integer k
• Output: k string pairs (s, t) with the highest similarities

String Similarity Join

Threshold-based Join
• Syntactic Similarity, Token-based Similarity: GramCount [VLDB/GravanoIJKMS01], PPJoin [WWW/XiaoWLY08], PartEnum [VLDB/ArasuGK06]
• Syntactic Similarity, Character-based Similarity: GramCount [VLDB/GravanoIJKMS01], PassJoin [PVLDB/LiDWF11], Trie-Join [PVLDB/WangLF10], Qchunk [SIGMOD/QinWLXL11], ED-Join [PVLDB/XiaoWL08]
• Semantic Similarity, Token-based Similarity: JaccT [ICDE/ArasuCK08], SI-Join [SIGMOD/LuLWLW13], Pkduck [PVLDB/TaoDS17], AP-Join [CIKM/XuL18], USIM [PVLDB/XuL19]

Top-k Join
• Syntactic Similarity, Token-based Similarity: Topk-Join [ICDE/XiaoWLS09]
• Syntactic Similarity, Character-based Similarity: Bed-Join [SIGMOD/ZhangHOS10]

Prefix filter

String Similarity Join

1. Define a global order
2. Order tokens based on the global order
3. Select the first T tokens as the signature
4. Check whether the signatures overlap

Prefix filter

String Similarity Join

Example (Threshold = 0.8):

1. Global ordering: {a b c d e f g h i j k l}
   S1 = “c, k, e, a, f”   S2 = “d, b, f, e, k”
2. Order the strings:
   S1’ = “a, c, e, f, k”   S2’ = “b, d, e, f, k”
3. Get signatures of size (1 − 0.8) × 5 + 1 = 2:
   Sig(s1) = “a, c”   Sig(s2) = “b, d”
4. No overlap, so Jacc(s1, s2) < 0.8
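The four steps of this example can be sketched in Python. This is a minimal illustration for Jaccard, assuming the standard prefix length |s| − ⌈θ·|s|⌉ + 1 (which gives 2 here, matching the slide); `prefix_signature` and `may_be_similar` are names chosen for this sketch.

```python
import math

GLOBAL_ORDER = "abcdefghijkl"
RANK = {tok: i for i, tok in enumerate(GLOBAL_ORDER)}   # step 1: global order


def prefix_signature(tokens, theta):
    """Steps 2 and 3: order the tokens, then keep the prefix as signature."""
    ordered = sorted(set(tokens), key=RANK.__getitem__)
    n = len(ordered)
    # Epsilon guards against floating-point noise in theta * n.
    prefix_len = n - math.ceil(theta * n - 1e-9) + 1
    return ordered[:prefix_len]


def may_be_similar(s, t, theta):
    """Step 4: the pair survives only if the two signatures overlap."""
    return bool(set(prefix_signature(s, theta)) & set(prefix_signature(t, theta)))


s1 = ["c", "k", "e", "a", "f"]
s2 = ["d", "b", "f", "e", "k"]
print(prefix_signature(s1, 0.8), prefix_signature(s2, 0.8))
print(may_be_similar(s1, s2, 0.8))   # False means the pair is pruned
```

The filter only prunes: a pair whose signatures do overlap still has to be verified by computing the actual similarity.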

Length filter

String Similarity Join

If two strings are similar, their length difference cannot be large.

Example (Threshold = 0.8):
s = “Database Conference”, |s| = 2
r = “International Conference on Information and Knowledge Management”, |r| = 7

|r| > |s| / 0.8 = 2.5, so Jaccard(s, r) < 0.8
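The length check in this example follows from Jaccard(s, r) ≤ min(|s|, |r|) / max(|s|, |r|). A minimal sketch, with the function name `passes_length_filter` chosen here:

```python
def passes_length_filter(size_s, size_r, theta):
    """True if a pair with these token counts can still reach Jaccard >= theta."""
    small, large = sorted((size_s, size_r))
    # Equivalent to large <= small / theta; epsilon absorbs floating-point noise.
    return theta * large <= small + 1e-9


s = "Database Conference".split()
r = "International Conference on Information and Knowledge Management".split()
print(len(s), len(r), passes_length_filter(len(s), len(r), 0.8))
```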


String Similarity Join

GramCount [VLDB/GravanoIJKMS01]

Use length-filter and count-filter

Support token-based similarity and character-based similarity

String Similarity Join

ED-Join [PVLDB/XiaoWL08]

Uses prefix filtering

Supports only edit distance

Two further optimizations: Position Filtering and Content Filtering

Position Filtering: remove unnecessary signatures produced by the prefix filter.
Example: Signatures(“sigmod”) = {“si”, “gm”, “od”}. The last signature is unnecessary because it takes at least τ + 1 edit operations to destroy both “si” and “gm”.

Content Filtering: use the L1 distance as a bound on the edit distance. The edit distance cannot be smaller than half of the L1 distance.
Example: s = “sigmod”, t = “sigkdd”. L1(s, t) / 2 = 4/2 = 2, so ED(s, t) ≥ 2.
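The content-filter bound in the “sigmod” / “sigkdd” example can be reproduced as follows. This sketch applies the bound to whole strings over character-frequency vectors, which is an assumption made for illustration; the function names are chosen here.

```python
from collections import Counter


def l1_distance(s, t):
    """L1 distance between the character-frequency vectors of s and t."""
    cs, ct = Counter(s), Counter(t)
    return sum(abs(cs[c] - ct[c]) for c in cs.keys() | ct.keys())


def edit_distance_lower_bound(s, t):
    """One edit operation changes the L1 distance by at most 2."""
    return (l1_distance(s, t) + 1) // 2   # ceiling of L1 / 2


s, t = "sigmod", "sigkdd"
print(l1_distance(s, t), edit_distance_lower_bound(s, t))
```

A pair can be discarded without running the dynamic-programming verification whenever this lower bound already exceeds the edit-distance threshold τ.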

String Similarity Join

Qchunk [SIGMOD/QinWLXL11]

Uses prefix filtering

Supports only edit distance

Two types of signatures: q-grams and q-chunks. The q-chunks are the q-grams with starting positions at i∗q+1 (0 ≤ i ≤ (l−1)/q), where l is the string length.

Strategy 1: index q-grams of r and use q-chunks of s to generate candidates.

Strategy 2: index q-chunks of r and use q-grams of s to generate candidates.
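The difference between the two signature types is easy to see in code. The sketch below is illustrative only: padding the last chunk with a `$` character is an assumption made here so that every chunk has length q, and the function names are hypothetical.

```python
def qgrams(s, q):
    """All overlapping substrings of length q."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def qchunks(s, q, pad="$"):
    """Non-overlapping substrings starting at 1-based positions i*q + 1."""
    padded = s + pad * (q - 1)            # assumed padding for the last chunk
    return [padded[i:i + q] for i in range(0, len(s), q)]


print(qgrams("sigmod", 2))
print(qchunks("sigmod", 2))
print(qchunks("sigmo", 2))
```

A string of length l has about l q-grams but only about l/q q-chunks, which is why indexing one type and probing with the other gives an asymmetric signature scheme.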

String Similarity Join

PPJoin [WWW/XiaoWLY08]

Supports Jaccard, Cosine, Dice

Two further optimizations: Position Filtering and Suffix Filtering

Position Filtering: enhance the prefix filter by computing upper bounds.
Example (Threshold = 0.8): Sig(s) = {B,C,D,E,F}, Sig(t) = {A,B,C,D,F}
• lower bound of the union size: 3 + 3 = 6
• upper bound of the intersection size: 1 + 3 = 4
• upper bound of the Jaccard is 4/6 < 0.8

Suffix Filtering: probe tokens in the suffix to estimate a tighter upper bound.
Example:
{A,B,D,E,?,?,?,?,?,?,Q,?,?,?,?,?,?,?}
{A,C,D,E,?,?,?,?,Q,?,?,?,?,?,?,?,?,?}
• lower bound of the union size: 5 + 1 + 6 + 9 = 21
• upper bound of the intersection size: 3 + 1 + 4 + 7 = 15
• upper bound of the Jaccard is 15/21 < 0.8


String Similarity Join

JaccT[ICDE/ArasuCK08]

Use synonym rules

Transformation-based operation

Filtering: Prefix filter based signature

Verification: Enumerate all possible transformed strings, then apply Jaccard


String Similarity Join

SI-Join [SIGMOD/LuLWLW13]

Use synonym rules

Expansion-based operation

Filtering: length filter and prefix filter

Propose signature size estimation to choose good signatures

Verification: Use Full-Expansion and Selective-Expansion


String Similarity Join

SI-Join [SIGMOD/LuLWLW13]

Filtering (prefix filter, length filter) → candidates → Verification (full expansion, selective expansion)


String Similarity Join

AP-Join [CIKM/XuL18]

Use taxonomy rules

Transformation-based operation

Filtering: length filter and prefix filter

Propose an approximate segmentation algorithm to segment strings

Verification: Use GTS similarity measure


String Similarity Join

USIM [PVLDB/XuL19]

Use synonym and taxonomy rules

Transformation-based operation

Filtering: length filter and prefix filter

Propose an approximate segmentation algorithm to segment strings

Verification: Use USIM similarity measure

String Similarity Join

USIM [PVLDB/XuL19]

1. Generate pebbles for all similarities (taxonomy, synonym, Jaccard);
2. Select signatures for prefix filtering;
   • Extension: allowing more than one overlap.
3. Perform filtering by finding pairs having enough overlaps as candidates;
4. Verify the unified similarity for each candidate.

Pipeline: Pebbles → (prefix selection) → Prefix Signature → (prefix filtering) → Candidates → (calculate sim.) → Results


String Similarity Join

Topk-Join [ICDE/XiaoWLS09]

Extends prefix filtering

Assigns each token a weight: the largest possible similarity

Uses a priority queue to store the current top-k candidates

Example: the tokens of “FCS, Journal, Computer, Science” have weights 1, 0.75, 0.5, 0.25.
For a string “… Journal …”, the maximum similarity is 0.75.


String Similarity Join

Bed-Join [SIGMOD/ZhangHOS10]

Use B+ tree as index

Support only edit distance

Use several mapping functions to transform strings to integer values

1. Dictionary order

2. Gram counting order

3. Gram location order

Distributed String Similarity Join

Table header labels: Algorithm; # runs of reading original data; Filtering Phase; Verification Phase; avoid duplicates; apply filters; load balancing; predictable size in Reduce; candidates generation; avoid reading original data; MapReduce

RIDPairsPPJoin [SIGMOD/VernicaCL10]: ≥3 | × | √ | √ | √ | × | common prefix | ×
V-Smart-Join [PVLDB/MetwallyF12]: ≥3 | √ | × | √ | × | × | enumerate pairs in inverted lists | √
MassJoin [ICDE/DengLHWF14]: ≥3 | × | √ | √ | × | × | enumerate pairs in inverted lists | ×
FS-Join [ICDE/RongLSWLD17]: 2 | √ | √ | √ | √ | √ | common prefix + segmentation | √

Distributed String Similarity Join

FS-Join [ICDE/RongLSWLD17]

[Figure: FS-Join framework. String collection on HDFS → Ordering (MapReduce: global order calculator, producing the global order) → Filtering (MapReduce: data partitioner and candidate generator over the ordered token sets, producing candidates) → Verification (MapReduce: candidate aggregator) → Results (similar pairs) on HDFS.]

Distributed String Similarity Join

FS-Join [ICDE/RongLSWLD17]

Global Ordering: A B C D E F G H I J K
Pivots: { C, F, I }

(a) Original sets
S1 = {D, E, F, K}
S2 = {B, C, I, J, K}
S3 = {A, B, H, I}
S4 = {A, C, D, F, J}
S5 = {D, E, G, I}
S6 = {B, G, H, J, K}

(b) Re-ordered sets
S1 = {D, K, E, F}
S2 = {I, C, J, B, K}
S3 = {H, B, I, A}
S4 = {A, D, C, J, F}
S5 = {G, I, D, E}
S6 = {H, J, B, G, K}

(c) Partitioning based on pivots {C, F, I}. Each row lists the segments of one set; each column is a fragment.

        Fragment 1   Fragment 2   Fragment 3   Fragment 4
Seg1    {}           {D, E, F}    {}           {K}
Seg2    {B, C}       {}           {I}          {J, K}
Seg3    {A, B}       {}           {H, I}       {}
Seg4    {A, C}       {D, F}       {}           {J}
Seg5    {}           {D, E}       {G, I}       {}
Seg6    {B}          {}           {G, H}       {J, K}
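The partitioning in (c) can be reproduced with a short sketch. It assumes the alphabetical global ordering shown on the slide and treats each pivot as the last token of its fragment; the function name `partition` is chosen here and is not taken from the FS-Join paper.

```python
from bisect import bisect_left

PIVOTS = ["C", "F", "I"]   # fragment boundaries in the global (alphabetical) order


def partition(tokens, pivots=PIVOTS):
    """Split a token set into len(pivots) + 1 segments, one per fragment."""
    segments = [[] for _ in range(len(pivots) + 1)]
    for tok in sorted(tokens):
        # bisect_left puts a token equal to a pivot into that pivot's fragment.
        segments[bisect_left(pivots, tok)].append(tok)
    return segments


SETS = {
    "S1": {"D", "E", "F", "K"},
    "S2": {"B", "C", "I", "J", "K"},
    "S3": {"A", "B", "H", "I"},
    "S4": {"A", "C", "D", "F", "J"},
    "S5": {"D", "E", "G", "I"},
    "S6": {"B", "G", "H", "J", "K"},
}
for sid, tokens in SETS.items():
    print(sid, partition(tokens))
```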


Vertical Partitioning

There are two problems that should be considered:

1. How to select the pivots?

❖ Random Selection (Random)

❖ Even Interval (Even-Interval)

❖ Even Token Frequency (Even-TF)

❑ Generate fragments with the same number of tokens

2. How many pivots should be selected?
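One possible reading of the Even-TF strategy is sketched below: walk the global order, accumulate token frequencies, and cut a fragment each time the running total passes the next equal share. This is an interpretation for illustration, not the procedure from the FS-Join paper, and the name `even_tf_pivots` is hypothetical. On the six example sets used in this tutorial it happens to return the pivots {C, F, I}.

```python
from collections import Counter


def even_tf_pivots(sets, global_order, n_pivots):
    """Pick pivots so that fragments hold roughly equal numbers of token occurrences."""
    freq = Counter(tok for s in sets for tok in s)
    total = sum(freq[tok] for tok in global_order)
    pivots, running, next_cut = [], 0, 1
    for tok in global_order:
        running += freq[tok]
        # Integer form of: running >= next_cut * total / (n_pivots + 1)
        if next_cut <= n_pivots and running * (n_pivots + 1) >= next_cut * total:
            pivots.append(tok)
            next_cut += 1
    return pivots


SETS = [
    {"D", "E", "F", "K"},
    {"B", "C", "I", "J", "K"},
    {"A", "B", "H", "I"},
    {"A", "C", "D", "F", "J"},
    {"D", "E", "G", "I"},
    {"B", "G", "H", "J", "K"},
]
print(even_tf_pivots(SETS, "ABCDEFGHIJK", n_pivots=3))
```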

Computation Framework of FS-Join

Input sets:
S1 = {D, E, F, K}
S2 = {B, C, I, J, K}
S3 = {A, B, H, I}
S4 = {A, C, D, F, J}
S5 = {D, E, G, I}
S6 = {B, G, H, J, K}

Filter phase: generate candidate string pairs
• map: using the global order, split each string into segments and emit (Key, Value) = (fragment id, segment), e.g. (1, Seg1), (2, Seg1), (3, Seg1), (4, Seg1) for the segments of S1.
• group by keys: all segments of the same fragment are sent to the same reducer, e.g. (1, Seg1), (1, Seg2), …, (1, Seg6).
• reduce: for each fragment, emit (Pair, # of common tokens):
  fragment 1: (S2, S3), 1  (S2, S4), 1  (S2, S6), 1  (S3, S4), 1  (S3, S6), 1
  fragment 2: (S1, S4), 2  (S1, S5), 2  (S4, S5), 1
  fragment 3: (S2, S3), 1  (S2, S5), 1  (S3, S5), 1  (S3, S6), 1  (S5, S6), 1
  fragment 4: (S1, S2), 1  (S1, S6), 1  (S2, S4), 1  (S2, S6), 2  (S4, S6), 1

Verification phase: produce final similarity join results
• map: emit (Key, Value) = (pair, # of common tokens obtained from the first reduce).
• group by keys: the partial counts of the same pair are brought together, e.g. (S2, S3), 1 and (S2, S3), 1; (S3, S6), 1 and (S3, S6), 1; (S2, S4), 1 and (S2, S4), 1; (S2, S6), 1 and (S2, S6), 2.
• reduce: aggregate the partial counts of each pair and output the results.
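The two phases can be simulated in memory on the six example sets. This is a simplified sketch, not the FS-Join implementation: it assumes the alphabetical global order with pivots {C, F, I}, uses Jaccard for verification, omits the segment-level filters, and replaces the MapReduce shuffle with ordinary dictionaries. All function names are chosen here.

```python
from bisect import bisect_left
from collections import Counter, defaultdict
from itertools import combinations

PIVOTS = ["C", "F", "I"]
SETS = {
    "S1": {"D", "E", "F", "K"},
    "S2": {"B", "C", "I", "J", "K"},
    "S3": {"A", "B", "H", "I"},
    "S4": {"A", "C", "D", "F", "J"},
    "S5": {"D", "E", "G", "I"},
    "S6": {"B", "G", "H", "J", "K"},
}


def filter_phase(sets):
    """Map: emit (fragment id, segment). Reduce: common tokens per pair and fragment."""
    by_fragment = defaultdict(list)
    for sid, tokens in sets.items():
        segments = defaultdict(set)
        for tok in tokens:
            segments[bisect_left(PIVOTS, tok)].add(tok)
        for frag, seg in segments.items():          # empty segments are not emitted
            by_fragment[frag].append((sid, seg))
    partial = []
    for frag, members in by_fragment.items():
        members.sort(key=lambda m: m[0])            # pairs come out as (smaller id, larger id)
        for (a, seg_a), (b, seg_b) in combinations(members, 2):
            common = len(seg_a & seg_b)
            if common:
                partial.append(((a, b), common))
    return partial


def verification_phase(partial, sets, theta):
    """Group partial counts by pair, sum them, and keep pairs with Jaccard >= theta."""
    totals = Counter()
    for pair, common in partial:
        totals[pair] += common
    results = {}
    for (a, b), common in totals.items():
        jac = common / (len(sets[a]) + len(sets[b]) - common)
        if jac >= theta:
            results[(a, b)] = jac
    return totals, results


partial = filter_phase(SETS)
totals, results = verification_phase(partial, SETS, theta=0.4)
print(len(partial), totals[("S2", "S6")], results)
```

The threshold 0.4 is picked only so that the toy data produces a non-empty result. Because the fragments are disjoint, every common token of a pair is counted in exactly one fragment, so the sum of the partial counts equals the true intersection size.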


Filtering Methods

⮚ String Length Filtering (StrL-Filter)

⮚ Segment Length Filtering (SegL-Filter)

⮚ Segment Intersection Filtering (SegI-Filter)

⮚ Segment Difference Filtering (SegD-Filter)

Comparison with Existing Methods

[Figure: two plots of execution time (sec, ×10^2) against threshold (0.75, 0.8, 0.85, 0.9). One plot compares RidPairsPPJoin and FS-Join; the other compares Online-Aggregation, Merge, Merge+Light, RidPairsPPJoin and FS-Join.]

Page 97: Synergy of Database Techniques and Machine Learning ...CIKM 2019: Synergy of Database Techniques and Machine Learning Models for String Similarity Search and Join 27 Token-based similarity:

CIKM 2019: Synergy of Database Techniques and Machine Learning Models for String Similarity Search and Join 97

References

[VLDB/GravanoIJKMS01] Luis Gravano, Panagiotis G. Ipeirotis, H. V. Jagadish, Nick Koudas, S. Muthukrishnan, Divesh Srivastava: Approximate String Joins in a Database (Almost) for Free. VLDB 2001: 491-500

[VLDB/ArasuGK06] Arvind Arasu, Venkatesh Ganti, Raghav Kaushik: Efficient Exact Set-Similarity Joins. VLDB 2006: 918-929

[VLDB/LiWY07] Chen Li, Bin Wang, Xiaochun Yang: VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams. VLDB 2007: 303-314

[ICDE/ArasuCK08] Arvind Arasu, Surajit Chaudhuri, Raghav Kaushik: Transformation-based Framework for Record Matching. ICDE 2008: 40-49

[ICDE/LiLL08] Chen Li, Jiaheng Lu, Yiming Lu: Efficient Merging and Filtering Algorithms for Approximate String Searches. ICDE 2008: 257-266

[WWW/XiaoWLY08] Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu: Efficient similarity joins for near duplicate detection. WWW 2008: 131-140

[PVLDB/XiaoWL08] Chuan Xiao, Wei Wang, Xuemin Lin: Ed-Join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB 1(1): 933-944 (2008)

[ICDE/XiaoWLS09] Chuan Xiao, Wei Wang, Xuemin Lin, Haichuan Shang: Top-k Set Similarity Joins. ICDE 2009: 916-927

[SIGMOD/ZhangHOS10] Zhenjie Zhang, Marios Hadjieleftheriou, Beng Chin Ooi, Divesh Srivastava: Bed-tree: an all-purpose index structure for string similarity search based on edit distance. SIGMOD Conference 2010: 915-926

[PVLDB/WangLF10] Jiannan Wang, Guoliang Li, Jianhua Feng: Trie-Join: Efficient Trie-based String Similarity Joins with Edit-Distance Constraints. PVLDB 3(1): 1219-1230 (2010)

[SIGMOD/VernicaCL10] Rares Vernica, Michael J. Carey, Chen Li: Efficient parallel set-similarity joins using MapReduce. SIGMOD Conference 2010: 495-506

[SIGMOD/QinWLXL11] Jianbin Qin, Wei Wang, Yifei Lu, Chuan Xiao, Xuemin Lin: Efficient exact edit similarity query processing with the asymmetric signature scheme. SIGMOD Conference 2011: 1033-1044

References (cont.)

[ICDE/WangLF11] Jiannan Wang, Guoliang Li, Jianhua Feng: Fast-join: An efficient method for fuzzy token matching based string similarity join. ICDE 2011: 458-469

[PVLDB/LiDWF11] Guoliang Li, Dong Deng, Jiannan Wang, Jianhua Feng: PASS-JOIN: A Partition-based Method for Similarity Joins. PVLDB 5(3): 253-264 (2011)

[PVLDB/MetwallyF12] Ahmed Metwally, Christos Faloutsos: V-SMART-Join: A Scalable MapReduce Framework for All-Pair Similarity Joins of Multisets and Vectors. PVLDB 5(8): 704-715 (2012)

[SIGMOD/LuLWLW13] Jiaheng Lu, Chunbin Lin, Wei Wang, Chen Li, Haiyong Wang: String similarity measures and joins with synonyms. SIGMOD Conference 2013: 373-384

[ICDE/DengLFL13] Dong Deng, Guoliang Li, Jianhua Feng, Wen-Syan Li: Top-k string similarity search with edit-distance constraints. ICDE 2013: 925-936

[PVLDB/WangDTZ13] Xiaoli Wang, Xiaofeng Ding, Anthony K. H. Tung, Zhenjie Zhang: Efficient and Effective KNN Sequence Search with Approximate n-grams. PVLDB 7(1): 1-12 (2013)

[ICDE/DengLHWF14] Dong Deng, Guoliang Li, Shuang Hao, Jiannan Wang, Jianhua Feng: MassJoin: A mapreduce-based method for scalable string similarity joins. ICDE 2014: 340-351

[TODS/LuLWLX15] Jiaheng Lu, Chunbin Lin, Wei Wang, Chen Li, Xiaokui Xiao: Boosting the Quality of Approximate String Matching by Synonyms. TODS 40(3): 15:1-15:42 (2015)

[ICDE/WangLDZF15] Jin Wang, Guoliang Li, Dong Deng, Yong Zhang, Jianhua Feng: Two birds with one stone: An efficient hierarchical framework for top-k and threshold-based string similarity search. ICDE 2015: 519-530

[ICDE/RongLSWLD17] Chuitian Rong, Chunbin Lin, Yasin N. Silva, Jianguo Wang, Wei Lu, Xiaoyong Du: Fast and Scalable Distributed Set Similarity Joins for Big Data Analytics. ICDE 2017: 1059-1070

References (cont.)

[ICDE/ShangLLF17] Zeyuan Shang, Yaxiao Liu, Guoliang Li, Jianhua Feng: K-Join: Knowledge-Aware Similarity Join. ICDE 2017: 23-24

[SEMWEB/SlabbekoornHH12] Kristian Slabbekoorn, Laura Hollink, Geert-Jan Houben: Domain-Aware Ontology Matching. International Semantic Web Conference (1) 2012: 542-558

[PVLDB/TaoDS17] Wenbo Tao, Dong Deng, Michael Stonebraker: Approximate String Joins with Abbreviations. PVLDB 11(1): 53-65 (2017)

[PVLDB/DengKMS17] Dong Deng, Albert Kim, Samuel Madden, Michael Stonebraker: SilkMoth: An Efficient Method for Finding Related Sets with Maximum Matching Constraints. PVLDB 10(10): 1082-1093 (2017)

[CIKM/XuL18] Pengfei Xu, Jiaheng Lu: Efficient Taxonomic Similarity Joins with Adaptive Overlap Constraint. CIKM 2018: 1563-1566

[ICDE/WangLZ19] Jin Wang, Chunbin Lin, Carlo Zaniolo: MF-Join: Efficient Fuzzy String Similarity Join with Multi-level Filtering. ICDE 2019: 386-397

[EDBT/WangLLZ19] Jin Wang, Chunbin Lin, Mingda Li, Carlo Zaniolo: An Efficient Sliding Window Approach for Approximate Entity Extraction with Synonyms. EDBT 2019: 109-120

[PVLDB/XuL19] Pengfei Xu, Jiaheng Lu: Towards a Unified Framework for String Similarity Joins. PVLDB 12(11): 1289-1302 (2019)

Part 2

String similarity searches and joins with machine learning

Outline

• Background

• Traditional ML based Approach

• Deep Learning based Approach

• Comprehensive Approach

• Related topics in Natural Language Processing

String Matching can be formulated as…

String as Entity

⮚ The string matching problem is often known as Entity Matching

⮚ Definition: identifying and linking/grouping different manifestations of the same real-world object, e.g.

❖ Different ways of addressing the same person in text (names, emails, Facebook accounts)

❖ Web pages with different descriptions of the same business

❖ Different photos of the same object, etc.

Challenges

⮚ Name/attribute ambiguity, data entry errors, missing data, formatting differences, changing attributes…

Brief introduction of ML

⮚ Machine Learning (ML)

❖ "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E." -- Tom Mitchell, 1997

⮚ ML categories

According to whether the learning process uses labelled data:

❖ Supervised Learning

❖ Unsupervised Learning

Supervised vs. Unsupervised Learning

⮚ String matching mostly adopts supervised methods

Supervised ML models

Recurrent Neural Network (RNN)

⮚ Recurrent Neural Networks take the previous output as an input. The composite input at time t summarizes the history of time steps 1 through t-1.

⮚ RNNs are useful as their intermediate values (state) can store information about past inputs for a time that is not fixed a priori.

LSTM: A variant of RNN [HS 1997]

⮚ Motivation: to solve the vanishing gradient problem.

⮚ Add a gating mechanism to the RNN and keep a memory cell that holds information outside the normal flow of the RNN.

⮚ Take the last step's output as input to the current step, recurrently updating the representation of the sentence.

⮚ The hidden activation vector corresponding to the last word is the sentence embedding vector (blue).

RNN Encoder-decoder [CMG 2014]

⮚ Create a reversible sentence representation.

⮚ The representation can be decoded back into an actual sentence that is reasonable and novel.

RNN Encoder-decoder (cont.)

• The conditional distribution of the next symbol:

$$P(y_t \mid y_{t-1}, y_{t-2}, \ldots, y_1, c) = g(h_{\langle t \rangle}, y_{t-1}, c)$$

$$h_{\langle t \rangle} = f(h_{\langle t-1 \rangle}, y_{t-1}, c)$$

• Add a summary (constant) symbol c that holds the semantics of the sentence.

• For long sentences, add hidden units that remember/forget memory.

Gated Recurrent Unit (GRU):

- Another RNN variant

- Simpler than LSTM

Word Embedding

⮚ Foundation: the Distributional Hypothesis

⮚ Each word in the vocabulary is represented by a low dimensional vector

⮚ All words are embedded into the same space

⮚ Similar words have similar vectors
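To make "similar words have similar vectors" concrete, here is a minimal sketch that compares toy embeddings with cosine similarity. The three-dimensional vectors are made up for illustration (an assumption); real embeddings are learned from a corpus and have many more dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, NOT learned from data.
embedding = {
    "coffee": [0.9, 0.1, 0.0],
    "espresso": [0.8, 0.2, 0.1],
    "laptop": [0.0, 0.1, 0.9],
}

sim_close = cosine(embedding["coffee"], embedding["espresso"])
sim_far = cosine(embedding["coffee"], embedding["laptop"])
```

With these toy values, `sim_close` is much larger than `sim_far`, which is the property a matcher built on embeddings relies on.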

RNN Language Model

⮚ Generate much more meaningful text than n-gram models

⮚ The sparse history is projected into some continuous low-dimensional space, where similar histories get clustered

$$s(t) = f(U\,w(t) + W\,s(t-1))$$

$$y(t) = g(V\,s(t))$$

where

w(t): input word at time t

s(t): hidden layer

y(t): output probability distribution over words

U, V, W: transformation matrices
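The recurrence s(t) = f(U w(t) + W s(t-1)), y(t) = g(V s(t)) can be sketched in a few lines of plain Python. The choice of a sigmoid for f and a softmax for g, the matrix sizes, and all weight values are assumptions made for illustration; a real language model learns U, V, and W from data.

```python
import math

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def softmax(z):
    """Numerically stable softmax."""
    top = max(z)
    exps = [math.exp(x - top) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_lm_step(U, W, V, w_t, s_prev):
    """One step: s(t) = f(U w(t) + W s(t-1)), y(t) = g(V s(t))."""
    pre = [a + b for a, b in zip(matvec(U, w_t), matvec(W, s_prev))]
    s_t = [1.0 / (1.0 + math.exp(-x)) for x in pre]  # f = sigmoid (assumed)
    y_t = softmax(matvec(V, s_t))                    # g = softmax (assumed)
    return s_t, y_t

# Hypothetical toy sizes: vocabulary of 3 words, hidden layer of 2 units.
U = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]      # hidden x vocabulary
W = [[0.1, 0.4], [-0.3, 0.2]]                 # hidden x hidden
V = [[0.7, -0.1], [0.2, 0.5], [-0.6, 0.3]]    # vocabulary x hidden

state = [0.0, 0.0]
for word_id in [0, 2, 1]:
    one_hot = [1.0 if i == word_id else 0.0 for i in range(3)]
    state, probs = rnn_lm_step(U, W, V, one_hot, state)
```

After the loop, `state` summarizes the three-word history and `probs` is a distribution over the next word.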

Word2vec [MCC 2013]

Directly learn the representation of words using context words.

Maximize the objective function over the whole corpus:

$$\sum_{(w,c) \in D} \; \sum_{w_j \in c} \log P(w \mid w_j)$$

Two variants:

⮚ Skip-gram: given the word, predict the context

❖ Works well with small training data; represents even rare words or phrases well

⮚ CBOW: given the context, predict the word

❖ Faster to train; better accuracy for frequent words
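The difference between the two variants shows up in how training examples are built from a sentence. In the usual description, Skip-gram uses the center word as input and each context word as a prediction target, while CBOW uses the whole context as input and the center word as the target. The sketch below only generates the examples; the function names and the window size are illustrative assumptions, and no model is trained.

```python
def skipgram_pairs(tokens, window=2):
    """(center word, context word) pairs: the word predicts its context."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """(context words, target word) examples: the context predicts the word."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            examples.append((context, target))
    return examples

sentence = "string similarity join".split()
sg = skipgram_pairs(sentence, window=1)
cb = cbow_examples(sentence, window=1)
```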

GloVe [PSM 2014]

⮚ Without Distributional Hypothesis.

⮚ Construct the word-word co-occurrence matrix of the whole corpus.

⮚ Inspired by LSA: use matrix factorization to produce word representations.

Loss function:

$$\hat{J} = \sum_{i,j} f(X_{ij}) \left( w_i^T w_j - \log X_{ij} \right)^2$$

$X_{ij}$ is the number of times the j-th word occurs in the context of the i-th word; the $w$ are word vectors. The loss function is minimized.
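The weighted least-squares loss on this slide can be evaluated directly on a toy co-occurrence table. This is a sketch of the simplified formula as written on the slide; the published GloVe model additionally uses separate context vectors and bias terms, which are omitted here. The weighting function with `x_max=100` and `alpha=0.75` follows the defaults reported in the GloVe paper, and the toy counts and vectors are made up.

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(X_ij): damps rare pairs and caps frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss(cooc, w):
    """Sum of f(X_ij) * (w_i . w_j - log X_ij)^2 over nonzero counts."""
    total = 0.0
    for (i, j), x in cooc.items():
        if x <= 0:
            continue  # log(0) is undefined, so zero counts are skipped
        dot = sum(a * b for a, b in zip(w[i], w[j]))
        total += glove_weight(x) * (dot - math.log(x)) ** 2
    return total

# Hypothetical toy data: two words that co-occur 10 times.
cooc = {(0, 1): 10.0, (1, 0): 10.0}

loss_bad = glove_loss(cooc, [[1.0, 0.0], [0.0, 1.0]])   # dot product is 0
loss_fit = glove_loss(cooc, [[1.0], [math.log(10.0)]])  # dot product is log 10
```

Training would adjust the vectors by gradient descent to move from something like `loss_bad` toward `loss_fit`.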

Phrase2vec [MSC 2013]

⮚ From Word2vec to Phrase2vec

❖ Enlarge the phrase vocabulary by an analogical reasoning task.

❑ A : B = C : ?

e.g. In word vector space, "King" – "Man" + "Woman" ≈ "Queen"

❑ Phrase A : Phrase B = Phrase C : Phrase D

e.g. New York : New York Times = San Jose : ? (San Jose Mercury News, San Jose Airport, San Jose State Univ., . . .)

Phrase D is approximately produced, which could enlarge the phrase vocabulary.

❖ Phrase Skip-gram Model

❑ Treat a phrase in the vocabulary like a word. Phrases are words that appear frequently together in the corpus.

❑ Incorporates Hierarchical Softmax and Negative Sampling.
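The analogy task A : B = C : ? is commonly answered by vector arithmetic: form B - A + C and return the nearest remaining vocabulary entry. The sketch below does this with hand-made two-dimensional vectors (an assumption for illustration; the axes loosely stand for "royalty" and "gender"). The same procedure applies when vocabulary entries are phrases.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c, vectors):
    """Answer a : b = c : ? with the entry closest to (b - a + c)."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hypothetical toy vectors, NOT learned from data.
vectors = {
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "apple": [-1.0, 0.2],
}

answer = analogy("man", "king", "woman", vectors)
```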

Paragraph2vec [LM 2014]

⮚ An extension of word2vec model, using a global sentence vector as context.

⮚ When updating word vectors in each iteration, paragraph matrix will also be updated.

Predict the next word according to context

Categorization of ML based String Matching

⮚ Traditional ML approach

❖ Acquire features from entities

❖ Use classification/clustering methods to make decision

⮚ Deep Learning approach

❖ Encode entity/attribute with low-dimensional vectors

❖ Use deep neural network for representation learning

⮚ Comprehensive approach

❖ End-to-end system

❖ Combination of different approaches

❖ Improved usability

Basics of ML approaches

⮚ The ML model can be either a traditional one or a deep learning one

Feature Engineering

⮚ For Deep Learning approaches, representation learning is used instead of such explicit features.

Outline

• Background

• Traditional ML based Approach

• Deep Learning based Approach

• Comprehensive Approach

• Related topic in Natural Language Processing

Rule-based approach [FS 1969]

⮚ r = (x, y) is a record pair, 𝛾 is its comparison vector, M is the set of matches, and U is the set of non-matches

⮚ Decision Rule

⮚ Naïve Bayes Assumption
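The decision rule and the naive Bayes assumption are usually combined as follows: each field of the comparison vector 𝛾 contributes independently to the likelihood ratio P(𝛾|M) / P(𝛾|U), and the pair is declared a match, a non-match, or a possible match by comparing the (log) ratio with two thresholds. The sketch below follows that standard textbook form; the per-field probabilities and the thresholds are hypothetical values, not taken from the slides.

```python
import math

def match_weight(gamma, m, u):
    """log(P(gamma|M) / P(gamma|U)), assuming fields are conditionally independent."""
    weight = 0.0
    for agrees, m_i, u_i in zip(gamma, m, u):
        if agrees:
            weight += math.log(m_i / u_i)
        else:
            weight += math.log((1.0 - m_i) / (1.0 - u_i))
    return weight

def decide(gamma, m, u, upper, lower):
    """Three-way decision rule with an upper and a lower threshold."""
    w = match_weight(gamma, m, u)
    if w >= upper:
        return "match"
    if w <= lower:
        return "non-match"
    return "possible match"

# Hypothetical parameters for three compared fields (name, city, zip).
m = [0.95, 0.90, 0.85]  # P(field agrees | pair is a match)
u = [0.05, 0.20, 0.01]  # P(field agrees | pair is a non-match)

all_agree = decide([1, 1, 1], m, u, upper=3.0, lower=-3.0)
none_agree = decide([0, 0, 0], m, u, upper=3.0, lower=-3.0)
name_only = decide([1, 0, 0], m, u, upper=3.0, lower=-3.0)
```

Pairs that land between the two thresholds are the ones typically sent for manual review.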

Supervised approaches

⮚ The most popular category

❖ Decision trees [TKM 2001]

❖ Support vector machines [BM 2003]

❖ Ensembles of classifiers [CKM 2009]

❖ Conditional Random Fields (CRF) [GS 2009]

⮚ However, there might be potential problems…

Potential drawback: insufficient training data

⮚ Constructing a training set is hard, since most pairs of records are "easy non-matches". E.g.

❖ 100 records from each of 100 cities

❖ Only about 1% of all record pairs come from the same city

⮚ Some pairs are hard to judge even for humans

❖ Inherently ambiguous

E.g., Paris Hilton (person or business)

❖ Missing attributes

E.g., "Starbucks, Toronto" vs. "Starbucks, Queen Street, Toronto"
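The 1% figure can be checked with a short calculation, assuming the example means 100 records from each of 100 cities (10,000 records in total) and counting unordered pairs.

```python
from math import comb

cities = 100
records_per_city = 100
total_records = cities * records_per_city

all_pairs = comb(total_records, 2)                    # every unordered pair
same_city_pairs = cities * comb(records_per_city, 2)  # pairs within one city
fraction = same_city_pairs / all_pairs
```

Here `fraction` comes out just under 0.01, so a random sample of pairs is dominated by easy non-matches.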

Other alternatives

⮚ Semi-supervised/Unsupervised Learning

❖ EM based techniques to learn parameters [Winkler 2006]

❖ Generative Models [RC 2004]

⮚ Active Learning

❖ Committee of Classifiers [SB 2002]

❖ Provably optimizing precision/recall [AGK 2010], [BIP 2012]

❖ Crowdsourcing [WKF 2012]
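A committee of classifiers is typically used for active learning by asking a human to label the pairs on which the committee disagrees most. The sketch below is a generic query-by-committee selection, not the specific algorithm of any cited paper; the committee members are hypothetical Jaccard-threshold rules standing in for trained classifiers.

```python
def jaccard(a, b):
    """Jaccard similarity on whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def disagreement(votes):
    """Share of committee members in the minority; 0 means full agreement."""
    positives = sum(votes)
    return min(positives, len(votes) - positives) / len(votes)

def select_for_labeling(pairs, committee, budget):
    """Pick the `budget` pairs with the highest committee disagreement."""
    scored = [(disagreement([clf(p) for clf in committee]), p) for p in pairs]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [p for _, p in scored[:budget]]

# Hypothetical committee: the same rule with three different thresholds.
committee = [lambda p, t=t: jaccard(p[0], p[1]) >= t for t in (0.3, 0.5, 0.7)]

pairs = [
    ("apple inc", "apple inc"),
    ("apple inc", "apple incorporated"),
    ("apple inc", "banana ltd"),
]
chosen = select_for_labeling(pairs, committee, budget=1)
```

The identical pair and the unrelated pair get unanimous votes, so the labeling budget goes to the borderline pair.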

Summary

⮚ Supervised Learning: the mainstream

⮚ Make use of different features and models

⮚ Bottleneck

❖ Require feature engineering

❖ Insufficient training data

⮚ Active Learning and crowdsourcing methods could be promising

Outline

• Background

• Traditional ML based Approach

• Deep Learning based Approach

• Comprehensive Approach

• Related topic in Natural Language Processing

DeepER [ETJ 2018]

⮚ Adopt Deep Learning models to

❖ Capture both syntactic and semantic information

❖ Avoid labor-intensive manual feature engineering

⮚ Train the model end-to-end

⮚ Use Locality Sensitive Hashing for identifying similar entities

DeepER: Methodology

⮚ Use word embedding on each word in an entity

⮚ Encode the whole entity with RNN models

⮚ Compare the similarity with some predefined metrics, e.g. Element-wise comparison

⮚ Classification layer: use activation functions, e.g. softmax, sigmoid
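The four steps above (embed words, encode the entity, compare element-wise, classify) can be sketched end to end in plain Python. To stay short, this sketch swaps in simple averaging of word vectors for the RNN encoder, and it uses hand-set weights instead of trained ones; the word vectors, weights, and function names are all hypothetical.

```python
import math

def embed_entity(text, word_vectors):
    """Encode an entity as the average of its known word vectors
    (a stand-in for the RNN encoder)."""
    known = [word_vectors[t] for t in text.lower().split() if t in word_vectors]
    if not known:
        raise ValueError("entity has no token with a known embedding")
    return [sum(col) / len(known) for col in zip(*known)]

def compare(u, v):
    """Element-wise comparison: absolute difference per dimension."""
    return [abs(a - b) for a, b in zip(u, v)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def match_probability(e1, e2, word_vectors, weights, bias):
    """Classification layer: logistic unit over the comparison vector."""
    diff = compare(embed_entity(e1, word_vectors), embed_entity(e2, word_vectors))
    return sigmoid(bias + sum(w * d for w, d in zip(weights, diff)))

# Hypothetical toy embeddings and hand-set (untrained) classifier weights.
word_vectors = {
    "iphone": [1.0, 0.0],
    "apple": [0.9, 0.1],
    "banana": [0.0, 1.0],
}
weights, bias = [-4.0, -4.0], 2.0

p_same = match_probability("Apple iPhone", "iphone apple", word_vectors, weights, bias)
p_diff = match_probability("Apple iPhone", "banana", word_vectors, weights, bias)
```

In an end-to-end trained model the encoder and the classification weights would be learned jointly from labelled pairs.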

LinkNBed [TSD 2018]

⮚ General idea: Jointly learn representations and entity linkage

⮚ Key Insights

Model Architecture

Deep Matcher [MLR 2018]

⮚ Three Components

❖ Encode attributes into low-dimensional vectors

❖ Evaluate similarity between representation of entities

❖ Perform classification

⮚ 4 Models

❖ SIF: An Aggregate Function Model

❖ RNN: A Sequence-aware Model

❖ Attention: A Sequence Alignment Model

❖ Hybrid Model

Decomposable Attention-based model

⮚ Attention mechanism

❖ Learn a similarity representation

❖ Expressed with token-wise alignment

⮚ Comparison: Two-layer Highway Network

⮚ Aggregation

❖ Sum over all elements

❖ Normalization
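Token-wise soft alignment can be illustrated as follows: for each token vector of one attribute value, attention weights over the tokens of the other value are computed with a softmax, and the aligned representation is their weighted sum. This sketch scores token pairs with a raw dot product, which is a simplification (the attention model itself applies learned transformations first); the input vectors are made up.

```python
import math

def softmax(z):
    top = max(z)
    exps = [math.exp(x - top) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def soft_align(seq_a, seq_b):
    """For each vector in seq_a, return the attention-weighted sum of seq_b."""
    dim = len(seq_b[0])
    aligned = []
    for a in seq_a:
        scores = [sum(x * y for x, y in zip(a, b)) for b in seq_b]
        weights = softmax(scores)
        aligned.append([sum(w * b[k] for w, b in zip(weights, seq_b)) for k in range(dim)])
    return aligned

# Hypothetical token vectors: the single token of seq_a matches the first token of seq_b.
seq_a = [[5.0, 0.0]]
seq_b = [[5.0, 0.0], [0.0, 5.0]]
aligned = soft_align(seq_a, seq_b)
```

Each token is then compared with its aligned counterpart, and the comparison vectors are aggregated by summation.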

Hybrid Attribute Summarization

⮚ Use a Bi-directional RNN to learn the representation before soft alignment

⮚ Improvement over aggregation

❖ Use RNN to compute weight

❖ A weighted average over all elements

Summary

⮚ Pros

❖ Do not need manually created features

❖ More Powerful

⮚ Cons

❖ Efficiency

❖ Only supports the supervised setting

❖ Overfitting

Outline

• Background

• Traditional ML based Approach

• Deep Learning based Approach

• Comprehensive Approach

• Related topic in Natural Language Processing

End-to-end EM System

⮚ Challenges

❖ Cover the entire EM pipeline

❖ Integrate multiple methods into one system

❖ Provide necessary guidance to users

Two-Stage Framework

How is this done today in practice?

Development stage: find an accurate workflow, using data samples

Production stage: execute workflow on entirety of data

[Figure: tables A and B (1M tuples each) → block → match using supervised learning]
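The block-then-match workflow can be sketched on two tiny tables. Blocking keeps only pairs that share at least one token, which avoids comparing every record of A with every record of B; matching then decides on the surviving candidates. In this sketch a fixed Jaccard threshold stands in for the supervised matcher, and the records and function names are hypothetical.

```python
from collections import defaultdict

def tokens(s):
    return set(s.lower().split())

def jaccard(a, b):
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

def block(table_a, table_b):
    """Candidate pairs (i, j): records that share at least one token."""
    index = defaultdict(set)  # token -> positions in table_b
    for j, rec in enumerate(table_b):
        for tok in tokens(rec):
            index[tok].add(j)
    candidates = set()
    for i, rec in enumerate(table_a):
        for tok in tokens(rec):
            for j in index.get(tok, ()):
                candidates.add((i, j))
    return candidates

def match(table_a, table_b, candidates, threshold=0.5):
    """Stand-in matcher: keep candidates whose Jaccard similarity passes the threshold."""
    return sorted((i, j) for i, j in candidates
                  if jaccard(table_a[i], table_b[j]) >= threshold)

table_a = ["dave smith madison", "joe wilson san jose"]
table_b = ["david smith madison", "dan smith middleton", "joe wilson san jose"]

candidates = block(table_a, table_b)
matches = match(table_a, table_b, candidates)
```

Blocking reduces the six possible pairs to three candidates here; on tables with a million tuples each, that pruning is what makes the matching step feasible.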

The Magellan System [KDS 2016]

⮚ Development Stage (input: data samples)

❖ How-to guide

❖ Supporting tools (as Python commands)

⮚ Production Stage (input: original data)

❖ How-to guide

❖ Supporting tools (as Python commands)

⮚ EM Workflow: passed from the development stage to the production stage

⮚ Users: Power Users, plus Facilities for Lay Users (GUIs, wizards, …)

⮚ PyData ecosystem

❖ Python Interactive Environment, Script Language

❖ Data Analysis Stack: pandas, scikit-learn, matplotlib, numpy, scipy, pyqt, seaborn, …

❖ Big Data Stack: PySpark, mrjob, Pydoop, pp, dispy, …

Workflow: How to guide

[Figure: development-stage workflow]

⮚ Sample A' and B' from tables A and B.

⮚ Run blocker X and blocker Y on A' and B' to obtain candidate sets Cx and Cy; select blocker X.

⮚ Sample S from Cx and label it to obtain G (pairs marked + or -).

⮚ Cross-validate matcher U (0.89 F1) and matcher V (0.93 F1) on G; select matcher V.

⮚ Apply matcher V to Cx and run a quality check: if yes, the workflow is done; if no, go back and revise.
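Matcher selection by F1 on the labelled sample can be sketched as below. This is a simplification: the candidate matchers are fixed threshold rules evaluated once on the labelled set, so there is no training and no fold splitting as in real cross-validation. The similarity scores, labels, and thresholds are hypothetical.

```python
def f1_score(gold, predicted):
    """F1 from parallel lists of boolean gold labels and predictions."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def select_matcher(matchers, pairs, gold):
    """Score every matcher on the labelled pairs and return the best one."""
    scores = {name: f1_score(gold, [predict(p) for p in pairs])
              for name, predict in matchers.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical labelled sample: each pair is reduced to one similarity score.
pairs = [0.9, 0.8, 0.6, 0.4, 0.2]
gold = [True, True, True, False, False]
matchers = {
    "U": lambda s: s >= 0.7,
    "V": lambda s: s >= 0.5,
}

best, scores = select_matcher(matchers, pairs, gold)
```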

PyData Ecosystem

⮚ Development stage does a lot of data analysis

❖ So build tools on data analysis stack in PyData

⮚ Production stage focuses on scaling

❖ So build tools on Big Data stack in PyData

⮚ PyData ecosystem

❖ Used extensively by data scientists

❖ 86,800 packages (in PyPI)

❖ Data analysis stack

❖ Big data stack

❖ Tools to manage user work

❖ Software infrastructure to build tools

❖ Ways to manage/package/distribute tools

❖ Companies, conferences, books, etc.

Design for Open World

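[Figure: closed-world vs. open-world systems]

⮚ Closed-World Systems (RDBMS): SQL queries and commands operate on data (A, B) and metadata (e.g., "A.ssn is a key") inside a single system.

⮚ Open-World Systems (Magellan): Magellan's commands (command 1, command 2) operate on data (A, B) and metadata (e.g., "A.ssn is a key") alongside System X (command x1, command x2) and System Y (command y1, command y2), with further data (C) and metadata.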

Falcon [DSD 2017]

⮚ Boost EM with crowdsourcing and active learning

❖ Define basic operators & use them to model the EM workflow as a DAG

❖ Scale up operators (using MapReduce)

❑ e.g., executing complex blocking rules over large tables

❖ Conduct optimizations both within and across operators

❑ e.g., use crowd time of an operator to mask machine time of another

The EM DAG of Falcon

⮚ Basic idea: use crowd time to mask machine time

⮚ Other Optimizations: indexing, speculatively execute rules, mask pair selection

Smurf [SAD 2018]

⮚ Replace the labeling step of Falcon with a self-service solution

⮚ Use Random Forest to directly match strings with multiple predicates instead of deriving blocking rules

Apply pruning predicates in the fashion of Decision Tree

Multiple Decision Trees: Random Forest
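Matching with a forest of similarity predicates can be sketched as follows: each tree tests one predicate per node, a failed predicate leads to an early "non-match" leaf (the pruning behaviour), and the forest decides by majority vote. The tree encoding, the two features, and all thresholds are hypothetical; a real forest would be learned from labelled pairs.

```python
def jaccard(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def features_of(a, b):
    """Similarity features for one pair of strings."""
    return {
        "jaccard": jaccard(a, b),
        "len_ratio": min(len(a), len(b)) / max(len(a), len(b)),
    }

def predict_tree(node, features):
    """A node is a bool leaf or (feature, threshold, low_branch, high_branch)."""
    while not isinstance(node, bool):
        name, threshold, low, high = node
        node = high if features[name] >= threshold else low
    return node

def predict_forest(forest, features):
    """Majority vote over all trees."""
    votes = sum(predict_tree(tree, features) for tree in forest)
    return votes * 2 > len(forest)

# Hypothetical hand-written forest of three small trees.
forest = [
    ("jaccard", 0.6, False, True),
    ("jaccard", 0.4, False, ("len_ratio", 0.8, False, True)),
    ("len_ratio", 0.5, False, ("jaccard", 0.5, False, True)),
]

is_match = predict_forest(forest, features_of("new york times", "new york times co"))
is_other = predict_forest(forest, features_of("new york times", "san jose mercury news"))
```

Each root-to-leaf path that ends in `True` is a conjunction of similarity predicates, which is what allows such paths to be expressed as join conditions.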

Execution of Random Forests

⮚ In-database fashion of execution

⮚ Optimizations: leveraging the query optimizer of DBMS

❖ Join results reuse

❖ Intra/Inter path filter reuse and ordering

❖ Cost-based plan selection

[Figure panels: Overall process of execution; Express rules with Join operation]

Summary

⮚ Powerful toolkits

⮚ Integration of different approaches

❖ Similarity functions

❖ ML models

❖ Crowdsourcing

⮚ User friendly API

⮚ Efficiency

❖ Benefit from DB techniques

Outline

• Background

• Traditional ML based Approach

• Deep Learning based Approach

• Comprehensive Approach

• Related topic in Natural Language Processing

Text matching in NLP tasks

⮚ The problem can be formulated as:

Match(𝑇1, 𝑇2)= 𝐹(𝜙(𝑇1), 𝜙(𝑇2))

❖ 𝜙: mapping text to representation vector

❖ 𝐹: scoring function based on representation
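
The formulation above can be instantiated in a few lines. In this sketch 𝜙 is a bag-of-words count vector and 𝐹 is cosine similarity; both are stand-ins chosen for simplicity (an assumption of this example), whereas the models later in this section learn 𝜙 and sometimes 𝐹 with neural networks.

```python
import math
from collections import Counter


def phi(text: str) -> Counter:
    """Map text to a sparse representation vector (token counts)."""
    return Counter(text.lower().split())


def F(u: Counter, v: Counter) -> float:
    """Score two representations with cosine similarity."""
    dot = sum(count * v[token] for token, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def match(t1: str, t2: str) -> float:
    """Match(T1, T2) = F(phi(T1), phi(T2))."""
    return F(phi(t1), phi(t2))
```

Swapping in a different `phi` (for example, an embedding model) or a different `F` (for example, a small classifier) changes the model without changing the overall shape of `match`.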

Text matching is a core task in NLP

| Task | Text 1 | Text 2 |
| --- | --- | --- |
| Information Retrieval | Query/Document | Document |
| Question Answering | Question | Answer |
| Paraphrase Identification | Text A | Text B |


Paraphrase Identification

Where can I get very professional and reliable envelope printing service in Sydney?

Where can I get very affordable branded envelope printing service in Sydney?

Why are doctors always late?

Why doctors always make you wait for 15-20 minutes before they see you?


ARC-I and ARC-II [HLL 2014]

Architecture I: Siamese Network

Architecture II: early interaction


PWIM [HL 2016]


Information Retrieval

⮚ Deep learning methods have been successfully applied to information retrieval (IR)

⮚ Human defined features

❖ Time consuming

❖ Incomplete

❖ Over specified


DSSM [HHG 2013]

⮚ First compose the representation of each document/sentence

⮚ Then perform matching between documents/sentences

❖ Cosine similarity between semantic vectors
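
The two-step recipe (represent each side, then compare) can be sketched without any neural layers. The example below keeps only the input encoding commonly associated with DSSM, letter trigrams with `#` as a word-boundary marker, and compares the resulting count vectors with cosine similarity. Omitting the learned projection layers is a deliberate simplification, so the scores here reflect surface overlap only, not learned semantics.

```python
import math
from collections import Counter


def letter_trigrams(word: str) -> list:
    """Split one word into letter trigrams, using '#' as a boundary marker."""
    padded = "#" + word.lower() + "#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def represent(text: str) -> Counter:
    """Step 1: build a trigram count vector for the whole text."""
    vec = Counter()
    for word in text.split():
        vec.update(letter_trigrams(word))
    return vec


def cosine(u: Counter, v: Counter) -> float:
    """Step 2: cosine similarity between two sparse vectors."""
    dot = sum(count * v[gram] for gram, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity(query: str, document: str) -> float:
    return cosine(represent(query), represent(document))
```

Because the vectors are built from sub-word units, near-identical spellings such as "cat" and "cats" still share most of their trigrams, which is one reason this encoding tolerates typos and unseen words.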


Match Pyramid [WLG 2016]

⮚ Focus on capturing matching patterns from interaction
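
A small sketch of what "matching patterns from interaction" means: build a word-by-word matching matrix for the two texts first, then look for structure in it. Here the matrix uses an exact-match indicator, and the longest diagonal run stands in for one kind of pattern (a shared phrase). The diagonal-run function is a hand-written assumption for illustration; in the actual model, convolutional layers would learn such patterns from the matrix.

```python
def matching_matrix(t1: str, t2: str) -> list:
    """M[i][j] = 1 if the i-th word of t1 equals the j-th word of t2."""
    w1, w2 = t1.lower().split(), t2.lower().split()
    return [[1 if a == b else 0 for b in w2] for a in w1]


def longest_diagonal_run(m: list) -> int:
    """Length of the longest run of 1s along a diagonal (a shared phrase)."""
    rows = len(m)
    cols = len(m[0]) if rows else 0
    run = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(rows):
        for j in range(cols):
            if m[i][j]:
                # Extend the run that ends at the upper-left neighbor.
                run[i][j] = 1 + (run[i - 1][j - 1] if i > 0 and j > 0 else 0)
                best = max(best, run[i][j])
    return best
```

For "down the ages" and "the ages of man", the matrix has two 1s on one diagonal, so the shared bigram "the ages" shows up as a run of length 2.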


Question Answering (QA)


Benchmark: Stanford Question Answering Dataset (SQuAD)


Match-LSTM [WJ 2017]


BiDAF: Bi-directional Attention Flow [SKF 2017]


QANet [YDL 2018]


Summary

⮚ Text matching is also a popular and crucial topic in NLP

⮚ Different definitions of texts for different problems

⮚ Basic architecture: Siamese Network

⮚ Models are going deeper…


Reference

⮚ [AGK 2010] On active learning of record matching packages. SIGMOD 2010.

⮚ [BIP 2012] Active sampling for entity matching. KDD 2012.

⮚ [BM 2003] Adaptive Duplicate Detection Using Learnable String Similarity Measures. KDD 2003.

⮚ [CKM 2009] Exploiting context analysis for combining multiple entity resolution systems. SIGMOD 2009.

⮚ [CMG 2014] Learning phrase representations using RNN encoder-decoder for statistical machine translation. EMNLP 2014.

⮚ [DHM 2005] Reference reconciliation in complex information spaces. SIGMOD 2005.

⮚ [DSD 2017] Falcon: Scaling up hands-off crowdsourced entity matching to build cloud services. SIGMOD 2017.

⮚ [ETJ 2018] Distributed representations of tuples for entity resolution. PVLDB 2018.

⮚ [HS 1997] Long Short-Term Memory. Neural Computation, 1997.

⮚ [KDS 2016] Magellan: Toward building entity matching management systems. PVLDB 2016.


Reference (cont.)

⮚ [HHG 2013] Learning Deep Structured Semantic Models for Web Search using Clickthrough Data. CIKM 2013.

⮚ [HL 2016] Pairwise Word Interaction Modeling with Deep Neural Networks for Semantic Similarity Measurement. NAACL 2016.

⮚ [HLL 2014] Convolutional Neural Network Architectures for Matching Natural Language Sentences. NIPS 2014.

⮚ [GS 2009] Answering Table Augmentation Queries from Unstructured Lists on the Web. PVLDB 2009.

⮚ [LM 2014] Distributed representations of sentences and documents. ICML 2014.

⮚ [MLR 2018] Deep learning for entity matching: A design space exploration. SIGMOD 2018.

⮚ [MCC 2013] Efficient Estimation of Word Representations in Vector Space. ICLR Workshop 2013.

⮚ [MSC 2013] Distributed representations of words and phrases and their compositionality. NIPS 2013.

⮚ [PSM 2014] GloVe: Global vectors for word representation. EMNLP 2014.

⮚ [RC 2004] A Hierarchical Graphical Model for Record Linkage. UAI 2004.

⮚ [SAD 2018] Smurf: Self-service string matching using random forests. PVLDB 2018.


Reference (cont.)


⮚ [SKF 2017] Bidirectional Attention Flow for Machine Comprehension. ICLR 2017.

⮚ [TKM 2001] Learning object identification rules for information integration. Inf. Syst. 2001.

⮚ [TSD 2018] LinkNBed: Multi-Graph Representation Learning with Entity Linkage. ACL 2018.

⮚ [Winkler 2006] Overview of Record Linkage and Current Research Directions, Research Report Series, US Census, 2006.

⮚ [WJ 2017] Machine Comprehension Using Match-LSTM and Answer Pointer. ICLR 2017.

⮚ [WKF 2012] CrowdER: Crowdsourcing Entity Resolution. PVLDB 2012.

⮚ [WLG 2016] A Deep Architecture for Semantic Matching with Multiple Positional Sentence Representations. AAAI 2016.

⮚ [YDL 2018] QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension. ICLR 2018.


Open challenges


Open challenges (I)

Optimize the pipeline of String Similarity Queries

⮚ Most existing work focuses on algorithm-level optimization

⮚ Need an end-to-end pipeline to deal with the whole life cycle of the task

⮚ It is essential to build such a pipeline on relational database management systems such as MySQL or NoSQL databases such as MongoDB


Open challenges (II)

More Efficient ML based Approaches

⮚ Machine learning based methods take longer than database techniques because they must train a model

⮚ Supervised methods require a large amount of labeled data as the training set


Open challenges (III)

Combine Human-in-the-Loop Approaches with ML

⮚ Use crowdsourcing approaches in string similarity measurement

⮚ A future direction is to automatically identify when, and to what extent, human labor should be involved in string similarity processing


Conclusion

⮚ String data is ubiquitous

⮚ Two approaches to string similarity processing: database techniques and ML models

⮚ Future work: combine the two approaches into a human-in-the-loop, (semi-)autonomous tool


Thank you

Q & A