Synergy of Database Techniques and Machine Learning Models for String Similarity Search and Join
CIKM 2019 Tutorial
Jiaheng Lu, University of Helsinki
Chunbin Lin, Amazon AWS
Jin Wang, University of California Los Angeles
Chen Li, University of California Irvine
CIKM 2019: Synergy of Database Techniques and Machine Learning Models for String Similarity Search and Join
Outline
Motivation and Background
History and Classification
Databases Techniques
Machine Learning Models
Open Challenges and Discussion
String data is ubiquitous
A string is a sequence of characters.
Example of string data:
• Product catalogs
• DNA sequence data
• Text description
• Customer relationship management data
Example: approximate string processing for a movie database
Star            Title                  Year  Genre
Keanu Reeves    The Matrix             1999  Sci-Fi
Samuel Jackson  Iron man               2008  Sci-Fi
Schwarzenegger  Terminator: Dark Fate  2019  Sci-Fi
Samuel Jackson  The man                2006  Crime
Find movies starring “Schwarzeneger” (missing one ‘g’)
Data may not be clean
Relation R (Star): Keanu Reeves, Samuel Jackson, Schwarzenegger
Relation S (Star): Keanu Reeves, Samuel L. Jackson, Schwarzenegger
Data integration and cleaning: the same person may appear with different spellings (Samuel Jackson vs. Samuel L. Jackson)
Problem definition: approximate string searches
Query q: Schwarrzenger
Collection of strings s: Keanu Reeves, Samuel Jackson, Schwarzenger, …
Search
Output: strings s that satisfy Sim(q,s) ≤ δ for distance functions (e.g. edit distance), or Sim(q,s) ≥ δ for similarity functions (e.g. Jaccard coefficient and cosine similarity)
Problem definition: approximate string joins
Collection of strings s: Keanu Reeves, Samuel Jackson, Schwarzenger, …
Collection of strings t: Keanu Reeves, Samuel Jackson, Schwarzengger, …
Join
Output: string pairs (s, t) that satisfy Sim(s,t) ≤ δ for distance functions (e.g. edit distance), or Sim(s,t) ≥ δ for similarity functions (e.g. Jaccard coefficient and cosine similarity)
Edit distance
A widely used metric to define string similarity
Ed(s1,s2)= minimum # of operations (insertion, deletion, substitution) to change s1 to s2.
Example:
s1: Tom Hanks
s2: Ton Hank
ed(s1, s2) = 2
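The definition above maps directly to the classic dynamic program. A minimal sketch in Python (the function name is ours, not from the tutorial):

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn s1 into s2 (classic DP)."""
    m, n = len(s1), len(s2)
    # dp[i][j] = edit distance between s1[:i] and s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                  # delete all of s1[:i]
    for j in range(n + 1):
        dp[0][j] = j                  # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

print(edit_distance("Tom Hanks", "Ton Hank"))  # 2: substitute m->n, delete s
```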
Jaccard Coefficient
⮚ Jaccard(X, Y) = |X ∩ Y| / |X ∪ Y|
⮚ q-grams: all substrings of length q. For example, the 2-grams of “universal” are (un),(ni),(iv),(ve),(er),(rs),(sa),(al)
Jaccard Coefficient
Jaccard (X,Y) = |X∩Y| / |X∪Y|
Example:
s1: Tom Hanks
s2: Ton Hank
3-Gram(s1) = {Tom, om_, m_H, _Ha, Han, ank, nks}
3-Gram(s2) = {Ton, on_, n_H, _Ha, Han, ank}
Jaccard(s1, s2) = 3/10 = 0.3
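The q-gram Jaccard computation can be reproduced in a few lines, assuming spaces are replaced by '_' as in the slide's gram lists (function names are illustrative):

```python
def qgrams(s: str, q: int = 3) -> set:
    """q-gram set of a string; spaces become '_' so word
    boundaries appear inside grams, matching the slide."""
    s = s.replace(" ", "_")
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def jaccard(s1: str, s2: str, q: int = 3) -> float:
    g1, g2 = qgrams(s1, q), qgrams(s2, q)
    return len(g1 & g2) / len(g1 | g2)

print(jaccard("Tom Hanks", "Ton Hank"))  # 0.3 (3 shared grams out of 10)
```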
Cosine similarity
cosine(A, B) = (Σi Ai·Bi) / (sqrt(Σi Ai²) · sqrt(Σi Bi²))
Example:
s1: Tom Hanks
s2: Ton Hank
3-Gram(s1) = {Tom, om_, m_H, _Ha, Han, ank, nks}
3-Gram(s2) = {Ton, on_, n_H, _Ha, Han, ank}
Cosine(s1, s2) = 3 / sqrt(6 × 7) ≈ 0.46
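With 0/1 gram vectors, the cosine formula reduces to |X ∩ Y| / sqrt(|X| · |Y|); a small sketch reproducing the slide's number (helper names are ours):

```python
import math

def qgram_set(s: str, q: int = 3) -> set:
    # spaces become '_' so word boundaries appear inside grams
    s = s.replace(" ", "_")
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def cosine(s1: str, s2: str, q: int = 3) -> float:
    """Cosine over binary q-gram vectors: the dot product is the
    intersection size, each norm is sqrt(set size)."""
    g1, g2 = qgram_set(s1, q), qgram_set(s2, q)
    return len(g1 & g2) / math.sqrt(len(g1) * len(g2))

print(round(cosine("Tom Hanks", "Ton Hank"), 2))  # 0.46 = 3 / sqrt(6*7)
```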
Synonym based similarity
Synonyms:
CIKM = ACM International Conference on Information and Knowledge Management
China = CN
Example:
S1 = “ACM International Conference on Information and Knowledge Management China”
S2 = “CIKM 2019 CN”
How to use the existing synonyms?
String Similarity Measures (Full-expansion)
S1=“ACM International Conference on Information and Knowledge Management China”
S2=“CIKM 2019 CN”
CIKM = ACM International Conference on Information and Knowledge Management
China = CN
Synonyms
S1’ = “ACM International Conference on Information and Knowledge Management China CIKM CN”
S2’ = “CIKM 2019 CN ACM International Conference on Information and Knowledge Management China”
Expanding using all synonyms
Jaccard(S1’,S2’)= 11/12 = 0.92
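The full-expansion idea can be sketched as follows; the substring-based rule matching is a simplification (it ignores token boundaries) that happens to work for this example:

```python
def full_expand(text: str, synonyms) -> set:
    """Full expansion: append every applicable synonym rewriting
    to the token set. Rules apply in both directions."""
    tokens = text.split()
    for lhs, rhs in synonyms:
        if lhs in text:          # naive substring test, fine here
            tokens += rhs.split()
        if rhs in text:
            tokens += lhs.split()
    return set(tokens)

synonyms = [
    ("CIKM", "ACM International Conference on Information and Knowledge Management"),
    ("China", "CN"),
]
s1 = full_expand("ACM International Conference on Information and Knowledge Management China", synonyms)
s2 = full_expand("CIKM 2019 CN", synonyms)
print(len(s1 & s2) / len(s1 | s2))  # 11/12, approximately 0.92
```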
Scope of this tutorial
⮚ Database techniques
❖ String similarity measures
❖ String similarity search
❖ String similarity join
⮚ Machine learning techniques
❖ Traditional machine learning based approaches (e.g. SVM)
❖ Deep learning based techniques
❖ End-to-end systems
❖ Related topics in natural language processing
Timeline of string similarity processing mechanisms (figure)
Two parts
Part 1: String similarity searches and joins in databases
Part 2: String similarity searches and joins with machine learning
Part 1: String similarity searches and joins in databases
Outline
Motivation & Problem Definition
String Similarity Measures
String Similarity Search
String Similarity Join
Conclusion
Motivation & Problem Definition
Name             Affiliation  Address
Victor D. Vianu  UCSD         9500 Gilman Dr CA

Name             Address            Institution
Victor Viannu    9500 Gilman Drive  University of California San Diego

(typo: Vianu vs. Viannu; different representation: UCSD vs. University of California San Diego)
Inconsistent data stored in different sources
Typos (Vianu vs. Viannu)
Synonyms (UCSD vs. University of California San Diego)
Taxonomy (Coffee vs. Latte)
……
Motivation & Problem Definition
String similarity search: given a query string s and a collection of strings t1, t2, …, tm, output the pairs (s, tj) such that s and tj are similar.
Motivation & Problem Definition
String similarity join: given two collections of strings s1, …, sn and t1, …, tm, output the pairs (si, tj) such that si and tj are similar.
Motivation & Problem Definition
Basic operation: compute the similarity for two strings
s = “VLDB”, t = “PVLDB”
s = “UW”, t = “University of Washington”
s = “UW”, t = “University of Waterloo”
s = “cafe”, t = “coffee shop”
s = “Vianu”, t = “Viannu”
s = “Papakonstantinou”, t = “Papaconstantnou”
Are they similar or not?
Challenge: accuracy
Motivation & Problem Definition
Naïve method: compute the similarity for every pair of strings.
String similarity search: compare the query s against each of t1, …, tm.
String similarity join: compare each si against each tj.
Bad performance! Challenge: performance
String Similarity Joins
Filter-and-Verification Framework
Filter: prune dissimilar string pairs, generate candidates
Verification: compute similarity scores for candidates
All pairs (s1…s5 × t1…t5) → Filtering → Candidates → Verification → Results
Outline
Motivation & Problem Definition
String Similarity Measures
String Similarity Search
String Similarity Join
Conclusion
String Similarity Measures
⮚ Syntactic Similarity
❖ Token-based similarity: Jaccard, Dice, Cosine, …
❖ Character-based similarity: Edit Distance, Edit Similarity, Hamming Distance, …
⮚ Hybrid Similarity: Fast-Join [ICDE/WangLF11], Silkmoth [PVLDB/DengKMS17], MF-Join [ICDE/WangLZ19]
⮚ Semantic Similarity
❖ Synonym-based similarity: JaccT [ICDE/ArasuCK08], Full Expansion [SIGMOD/LuLWLW13], Selective Expansion [SIGMOD/LuLWLW13], Pkduck [PVLDB/TaoDS17], JACCAR [EDBT/WangLLZ19], K-Join [ICDE/ShangLLF17]
❖ Taxonomy-based similarity: GTS [CIKM/XuL18], USIM (Unified String Similarity) [PVLDB/XuL19]
String Similarity Measures
Token-based similarity: Jaccard, Dice, Cosine, etc.
Convert a string into tokens:
s = “University of California San Diego” → s’ = {University, of, California, San, Diego}
t = “University of California Los Angeles” → t’ = {University, of, California, Los, Angeles}
Overlap-based similarity: a higher score means more similar.
String Similarity Measures
Token-based similarity: Jaccard, Dice, Cosine, etc.
Convert a string into q-grams, e.g. the 2-gram set of “AMAZON” is {AM, MA, AZ, ZO, ON}:
s = “AMAZON” → s’ = {AM, MA, AZ, ZO, ON}
t = “AMACON” → t’ = {AM, MA, AC, CO, ON}
String Similarity Measures
Character-based similarity: edit distance, edit similarity, etc.
Edit distance ED: the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other. A lower score means more similar.
ED(“me”, “my”) = 1
ED(“starbucks”, “starbukk”) = 2: starbucks → starbukks (1st edit: substitute c with k) → starbukk (2nd edit: delete s)
Edit similarity EDS: a normalized variant, commonly defined as EDS(s, t) = 1 − ED(s, t) / max(|s|, |t|); a higher score means more similar.
Hybrid similarity: Fuzzy-Overlap, Fuzzy-Jaccard, Fuzzy-Dice, Fuzzy-Cosine
1. Tokenize each string as a set of tokens.
2. Allow fuzzy matching between tokens.
s = “International Conference on Inforomation and Knowledges Management”
t = “Internaitional Conferrence on Information and Knowledge Management”
Jaccard(s, t) = 3/11 ≈ 0.27 (only the exact token matches “on”, “and”, “Management” count)
Hybrid similarity: Fuzzy-Overlap, Fuzzy-Jaccard, Fuzzy-Dice, Fuzzy-Cosine
1. Tokenize each string as a set of tokens.
2. Allow fuzzy matching between tokens.
s = “International Conference on Inforomation and Knowledges Management”
t = “Internaitional Conferrence on Information and Knowledge Management”
String Similarity Measures
Fuzzy-Jaccard(s, t) = 7/7 = 1.0
E.g., allow edit distance less than 2 between tokens
Hybrid similarity: Fuzzy-Overlap, Fuzzy-Jaccard, Fuzzy-Dice, Fuzzy-Cosine
1. Tokenize each string as a set of tokens.
2. Allow fuzzy matching between tokens.
s = “International Conference on Inforomation and Knowledges Management”
t = “on Information and Knowledge Management Internaitional Conferrence”
String Similarity Measures
Fuzzy-Jaccard(s, t) = 7/7 = 1.0
E.g., allow edit distance less than 2 between tokens
Hybrid similarity: Fuzzy-Overlap, Fuzzy-Jaccard, Fuzzy-Dice, Fuzzy-Cosine
1. Tokenize each string as a set of tokens.
2. Allow fuzzy matching between tokens.
Build a weighted bigraph to model the two token sets (token nodes s1, …, sn on one side and t1, …, tm on the other, with edge weights wi,j):
Node: one node per token
Edge: present if the character-based similarity of the two tokens exceeds a threshold
Weight: the character-based similarity
Fuzzy-overlap = maximum weight matching of the bigraph
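As a sketch of fuzzy-overlap, the maximum weight matching can be brute-forced over permutations for the tiny token sets in these examples (real systems such as Fast-Join use proper matching algorithms); `eds` here is the edit similarity 1 − ED/max length, and all names are ours:

```python
from itertools import permutations

def eds(a: str, b: str) -> float:
    """Edit similarity 1 - ED(a, b) / max(|a|, |b|), single-row DP."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1, dp[j - 1] + 1,
                        prev + (a[i - 1] != b[j - 1]))
            prev = cur
    return 1 - dp[n] / max(m, n)

def fuzzy_overlap(s_tokens, t_tokens, tau=0.8):
    """Maximum-weight matching on the token bigraph, brute force;
    edges with similarity below tau contribute nothing."""
    small, big = sorted([s_tokens, t_tokens], key=len)
    best = 0.0
    for perm in permutations(big, len(small)):
        w = sum(eds(a, b) for a, b in zip(small, perm) if eds(a, b) >= tau)
        best = max(best, w)
    return best
```

For example, matching the tokens of “Inforomation Knowledges Management” against “Information Knowledge Management” pairs each misspelled token with its correct counterpart.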
String Similarity Measures
Hybrid similarity: Fuzzy-Overlap, Fuzzy-Jaccard, Fuzzy-Dice, Fuzzy-Cosine
1. Tokenize each string as a set of tokens.
2. Allow fuzzy matching between tokens.
Fast-Join [ICDE/WangLF11]
Token level: token-based similarity
Element level: character-based similarity
String Similarity Measures
Hybrid similarity: Fuzzy-Overlap, Fuzzy-Jaccard, Fuzzy-Dice, Fuzzy-Cosine
1. Tokenize each string as a set of tokens.
2. Allow fuzzy matching between tokens.
Silkmoth [PVLDB/DengKMS17], MF-Join [ICDE/WangLZ19]
Token level: token-based similarity
Element level: different kinds of similarities
String Similarity Measures
Missing cases for syntactic similarity:
Token-based similarity: Jaccard, Dice, Cosine, etc.
Character-based similarity: edit distance
Hybrid similarity: Fuzzy-Jaccard, Fuzzy-Dice, Fuzzy-Cosine
s = “University of California San Diego Computer Science”
t = “UCSD CS”
Jaccard(s, t) = Dice(s, t) = Cosine(s, t) = 0
Edit-distance(s, t) = 44
Fuzzy-Jaccard(s, t) = Fuzzy-Dice(s, t) = Fuzzy-Cosine(s, t) = 0
All of these measures conclude that s and t are not similar at all.
String Similarity Measures
Using synonyms to enhance string similarity measures
Source of synonyms:
Obtained by machine learning models
Provided by domain experts
Extracted from knowledge bases
Example synonyms:
AWS = Amazon Web Services
CIKM = ACM International Conference on Information and Knowledge Management
UW = University of Washington
UW = University of Wisconsin
UW = University of Waterloo
String Similarity Measures
JaccT [ICDE/ArasuCK08]
Transformation-based operation
Enumerate all the transformed strings, pick the pair with largest score
Strings:
s = “VLDB Conf”
t = “Large Database Conference”
Synonyms:
VLDB = Very Large Database
Conf = Conference
Intermediate results:
s1 = “VLDB Conf”, s2 = “Very Large Database Conf”, s3 = “VLDB Conference”, s4 = “Very Large Database Conference”
t1 = “Large Database Conference”, t2 = “Large Database Conf”
SIM(s, t) = Jaccard(s4, t1) = 3/4 = 0.75
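A toy version of the enumerate-and-pick-best idea, applying token-level rules in one direction only (JaccT itself also rewrites in the reverse direction, e.g. producing t2 above); all names are illustrative:

```python
from itertools import product

def expansions(tokens, rules):
    """Yield every token list obtainable by independently replacing
    each token with one of its rule right-hand sides (or keeping it)."""
    choices = []
    for tok in tokens:
        alts = [[tok]] + [rhs.split() for lhs, rhs in rules if lhs == tok]
        choices.append(alts)
    for combo in product(*choices):
        yield [t for part in combo for t in part]

def jacct(s: str, t: str, rules) -> float:
    """Best Jaccard over all pairs of transformed variants."""
    best = 0.0
    for sv in expansions(s.split(), rules):
        for tv in expansions(t.split(), rules):
            a, b = set(sv), set(tv)
            best = max(best, len(a & b) / len(a | b))
    return best

rules = [("VLDB", "Very Large Database"), ("Conf", "Conference")]
print(jacct("VLDB Conf", "Large Database Conference", rules))  # 0.75
```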
String Similarity Measures
PKduck [PVLDB/TaoDS17]
Transformation-based operation
Apply to one side each time
Strings:
s = “VLDB Conf”
t = “Large Database Conference”
Synonyms:
VLDB = Very Large Database
Conf = Conference
Intermediate results (transformations applied to one side at a time):
s1 = “VLDB Conf”, s2 = “Very Large Database Conf”, s3 = “VLDB Conference”, s4 = “Very Large Database Conference”
t = “Large Database Conference”
SIM(s, t) = Jaccard(s4, t) = 3/4 = 0.75
String Similarity Measures
JACCAR [EDBT/WangLLZ19]
Transformation-based operation
Apply to only one side
Strings:
s = “VLDB Conf”
Document t: “… Important dates for contributions to the 35th international very large database conference held in USA …”
Synonyms:
VLDB = Very Large Database
Conf = Conference
Intermediate results (transformations applied to s only):
s1 = “VLDB Conf”, s2 = “Very Large Database Conf”, s3 = “VLDB Conference”, s4 = “Very Large Database Conference”
SIM(s, t) = Jaccard(s4, t) = 4/4 = 1.0
String Similarity Measures
Full-Expansion [SIGMOD/LuLWLW13]
Expansion-based operation
Apply all the applicable rules
Strings:
s = “ACM's Special Interest Group on Management Of Data NY USA”
t = “SIGMOD New York United States of America”
Synonyms:
ACM’s = Association for Computing Machinery’s
SIGMOD = ACM's Special Interest Group on Management Of Data
SIGMOD = International Conference on Management of Data
NY = New York
USA = United States of America
Intermediate results (expanding using all synonyms):
s’ = “ACM's Special Interest Group on Management Of Data SIGMOD NY New York USA United States America Association for Computing Machinery’s”
t’ = “ACM's Special Interest Group on Management of Data SIGMOD NY New York USA United States America International Conference”
SIM(s, t) = Jaccard(s’, t’) = 16/22 = 0.72
String Similarity Measures
Selective-Expansion [SIGMOD/LuLWLW13]
Expansion-based operation
Apply only good applicable rules
Strings:
s = “ACM's Special Interest Group on Management Of Data NY USA”
t = “SIGMOD New York United States of America”
Synonyms:
ACM’s = Association for Computing Machinery’s
SIGMOD = ACM's Special Interest Group on Management Of Data
SIGMOD = International Conference on Management of Data
NY = New York
USA = United States of America
Intermediate results (applying only the good rules):
s’ = “ACM's Special Interest Group on Management Of Data SIGMOD NY New York USA United States America”
t’ = “ACM's Special Interest Group on Management of Data SIGMOD NY New York USA United States America”
SIM(s, t) = Jaccard(s’, t’) = 16/16 = 1.0
String Similarity Measures
GTS [CIKM/XuL18] Using taxonomy to enhance string similarity
Hierarchical taxonomy built from Wikipedia categories (figure), e.g. latte and espresso are types of coffee, which is a drink; bar and coffeehouse are types of restaurants.
s = “American latte”
t = “American espresso”
String Similarity Measures
GTS [CIKM/XuL18]: using taxonomy to enhance string similarity
Node similarity and set similarity are defined over the hierarchical taxonomy (formulas given as figures in the slides).
String Similarity Measures
USIM [PVLDB/XuL19] combines three sources of evidence:
Hierarchical taxonomy (Wikipedia categories), e.g. latte and espresso under coffee
Synonyms, e.g. VLDB = Very Large Databases, USA = American, Info = Information, CS = Computer Science, UW = University of Washington / Wisconsin / Waterloo
Syntactic similarity, e.g. ED(string, strng) = 1, ED(database, databases) = 1, ED(Univeristy, University) = 2, ED(cup, cups) = 1
String Similarity Measures
USIM [PVLDB/XuL19]
s = “USA cold latte”
t = “American cool espresso”
When the strings are segmented, the taxonomy (latte and espresso under coffee), the synonyms (USA = American), and syntactic similarity (ED(cold, cool) = 2) each capture one pair of segments.
USIM [PVLDB/XuL19]
String Similarity Measures
Simple case: when strings are segmented, max similarity can be obtained.
Example: three kinds of inconsistencies in the figure are captured.
USIM [PVLDB/XuL19]
String Similarity Measures
However, raw strings are not segmented. To make things even harder, a pair of strings may have more than one segmentation plan. How can we know P1 is the best?
Outline
Motivation & Problem Definition
String Similarity Measures
String Similarity Search
String Similarity Join
Conclusion
String Similarity Search
Given a query string s and a collection of strings t1, …, tm, output the pairs (s, tj) such that s and tj are similar.
Threshold-based string similarity search:
Input: a set of strings S, a query q, a threshold value θ
Output: the set of string pairs (s, q) where SIM(s, q) > θ
Top-k string similarity search:
Input: a set of strings S, a query q, an integer k
Output: the k string pairs (s, q) with the highest similarities
String Similarity Search
⮚ Threshold-based Search
❖ Syntactic similarity
Token-based similarity: DivideSkip [ICDE/LiLL08]
Character-based similarity: DivideSkip [ICDE/LiLL08], V-Gram [VLDB/LiWY07]
❖ Semantic similarity: SI-Search [TODS/LuLWLX15]
⮚ Top-k Search
❖ Syntactic similarity
Token-based similarity: HS-Topk [ICDE/WangLDZF15]
Character-based similarity: HS-Topk [ICDE/WangLDZF15], TopkSearch [ICDE/DengLFL13], AppGram [PVLDB/WangDTZ13]
String Similarity Search
Filter-and-verification framework
Count filter: if two strings are similar, they must have at least T common tokens (token-based similarity) or q-grams (character-based similarity).
String Similarity Search
Count filter example (threshold = 0.6):
s = “Information and Knowledge Management”
t = “Management of Knowledge Base”
T = ⌈0.6 / (1 + 0.6) × (4 + 4)⌉ = 3, so s and t need at least 3 common tokens.
They share only 2 common tokens, so the pair is not an answer (Jaccard = 0.33 < 0.6).
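The count threshold follows from |X ∩ Y| ≥ θ·|X ∪ Y| together with |X ∪ Y| = |X| + |Y| − |X ∩ Y|, giving T = ⌈θ/(1+θ)·(|X|+|Y|)⌉. A sketch reproducing the slide's numbers (function name is ours):

```python
import math

def count_threshold(theta: float, s_tokens, t_tokens) -> int:
    """Minimum number of common tokens two strings must share
    for Jaccard(s, t) >= theta."""
    return math.ceil(theta / (1 + theta) * (len(s_tokens) + len(t_tokens)))

s = "Information and Knowledge Management".split()
t = "Management of Knowledge Base".split()
T = count_threshold(0.6, s, t)
print(T)                     # 3
print(len(set(s) & set(t)))  # 2 -> fails the count filter, safe to prune
```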
String Similarity Search
Index: inverted lists with q-grams as keys and string ids as values.
s1 = “abcd” → 2-grams {ab, bc, cd}
s2 = “ab” → {ab}
s3 = “bcde” → {bc, cd, de}
Inverted index: ab → {s1, s2}, bc → {s1, s3}, cd → {s1, s3}, de → {s3}
For a query Q, compute the count threshold T and probe the index.
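Building this inverted index is a few lines of Python (identifiers are illustrative):

```python
from collections import defaultdict

def build_index(strings: dict, q: int = 2) -> dict:
    """Inverted index: q-gram -> list of ids of strings containing it."""
    index = defaultdict(list)
    for sid, s in strings.items():
        for gram in {s[i:i + q] for i in range(len(s) - q + 1)}:
            index[gram].append(sid)
    return index

index = build_index({"s1": "abcd", "s2": "ab", "s3": "bcde"})
print(sorted(index["ab"]))  # ['s1', 's2']
print(sorted(index["bc"]))  # ['s1', 's3']
```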
String Similarity Search
For a query Q = “abc”, generate its 2-grams Q’ = {ab, bc} and merge the inverted lists of ab → {s1, s2} and bc → {s1, s3} to count common grams: an expensive operation.
String Similarity Search
Merging the inverted lists is the expensive operation; ScanCount, MergeSkip, and DivideSkip improve this step.
String Similarity Search
ScanCount:
Maintain an array of counts, one per string, initialized to 0.
For each record in an inverted list, increase its count by 1.
Strings whose counts reach the threshold are candidates.
Drawback: needs to process all inverted lists containing tokens of the query.
MergeSkip:
Sort the string ids in each inverted list.
Maintain a heap and increase counts only for popped strings.
Benefit: prunes irrelevant records.
DivideSkip [ICDE/LiLL08]:
Improves MergeSkip by sorting and grouping the inverted lists.
Benefit: avoids scanning long inverted lists.
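ScanCount itself is short; a sketch over a toy index (names are ours):

```python
from collections import defaultdict

def scan_count(index: dict, query_grams: set, T: int) -> set:
    """ScanCount: one counter per string id; strings reaching T
    common grams with the query become candidates."""
    counts = defaultdict(int)
    for gram in query_grams:
        for sid in index.get(gram, []):
            counts[sid] += 1
    return {sid for sid, c in counts.items() if c >= T}

index = {"ab": ["s1", "s2"], "bc": ["s1", "s3"],
         "cd": ["s1", "s3"], "de": ["s3"]}
print(scan_count(index, {"ab", "bc"}, T=2))  # {'s1'}
```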
String Similarity Search
V-Gram [VLDB/LiWY07]
Observation: some grams may be very frequent while others are infrequent, so fixed-length grams may be inefficient.
Solution: use variable-length grams to avoid generating very frequent grams.
String Similarity Search
SI-Search [TODS/LuLWLX15]
Query: build a QP-tree.
Strings: build an SI-tree.
Applies a length filter and a prefix filter over the trees.
String Similarity Search
Running top-k search by iteratively tuning the threshold of a threshold-based search performs badly, because good thresholds are hard to estimate.
A more efficient way: use a priority queue that always examines the strings most likely to appear in the final results.
String Similarity Search
HS-Topk [ICDE/WangLDZF15]
Creates a hierarchical segment tree index (HS-Tree): first groups the strings by length, then constructs a complete binary tree for each group.
Two pruning strategies:
Greedy-match: prune the strings with consecutive errors.
Batch-pruning-based: prune the strings by computing upper and lower bounds.
Drawback: the filter step is inefficient for large thresholds.
String Similarity Search
TopkSearch [ICDE/DengLFL13]
Improves the traditional dynamic-programming algorithm for edit distance to avoid trying a large number of edit-distance thresholds.
Drawback: inefficient for long strings.
String Similarity Search
AppGram [PVLDB/WangDTZ13]
Uses approximate q-gram matchings.
Uses a queue for strings with small distances and two filtering strategies (the CA strategy and the f-queue strategy) to prune the others.
Uses a max-heap to maintain the top-k strings most similar to the query.
Drawback: high space complexity.
Outline
Motivation & Problem Definition
String Similarity Measures
String Similarity Search
String Similarity Join
String Similarity Join
Given two collections of strings s1, …, sn and t1, …, tm, output the pairs (si, tj) such that si and tj are similar.
Threshold-based string similarity join:
Input: a set of strings S, a set of strings T, a threshold value θ
Output: the set of string pairs (s, t) where SIM(s, t) > θ
Top-k string similarity join:
Input: a set of strings S, a set of strings T, an integer k
Output: the k string pairs (s, t) with the highest similarities
String Similarity Join
⮚ Threshold-based Join
❖ Syntactic similarity
Token-based similarity: GramCount [VLDB/GravanoIJKMS01], PPJoin [WWW/XiaoWLY08], PartEnum [VLDB/ArasuGK06]
Character-based similarity: PassJoin [PVLDB/LiDWF11], Trie-Join [PVLDB/WangLF10], Qchunk [SIGMOD/QinWLXL11], ED-Join [PVLDB/XiaoWL08]
❖ Semantic similarity: JaccT [ICDE/ArasuCK08], SI-Join [SIGMOD/LuLWLW13], Pkduck [PVLDB/TaoDS17], AP-Join [CIKM/XuL18], USIM [PVLDB/XuL19]
⮚ Top-k Join: Topk-Join [ICDE/XiaoWLS09], Bed-Join [SIGMOD/ZhangHOS10]
String Similarity Join
Prefix filter:
1. Define a global order.
2. Order the tokens of each string by the global order.
3. Select the first T tokens as signatures.
4. Check whether the signatures overlap.
String Similarity Join
Prefix filter example (threshold = 0.8):
Global ordering: {a b c d e f g h i j k l}
s1 = “c, k, e, a, f”, s2 = “d, b, f, e, k”
1. Order the strings: s1’ = “a, c, e, f, k”, s2’ = “b, d, e, f, k”
2. Signature length T = (1 − 0.8) × 5 + 1 = 2
3. Get signatures: Sig(s1) = “a, c”, Sig(s2) = “b, d”
4. No overlap, so Jacc(s1, s2) < 0.8
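A sketch of the prefix-filter signature; the prefix length (1 − θ)·|s| + 1 from the example equals |s| − ⌈θ·|s|⌉ + 1, and a small epsilon guards the floating-point ceiling (names are ours; the exact prefix length depends on the similarity function):

```python
import math

def prefix_signature(tokens, theta, order):
    """Sort tokens by the global order and keep the first
    T = |s| - ceil(theta * |s|) + 1 tokens as the prefix signature."""
    ordered = sorted(tokens, key=order.index)
    T = len(tokens) - math.ceil(theta * len(tokens) - 1e-9) + 1
    return set(ordered[:T])

order = list("abcdefghijkl")
s1 = ["c", "k", "e", "a", "f"]
s2 = ["d", "b", "f", "e", "k"]
sig1 = prefix_signature(s1, 0.8, order)   # {'a', 'c'}
sig2 = prefix_signature(s2, 0.8, order)   # {'b', 'd'}
print(sig1 & sig2)  # set() -> no overlap, Jaccard(s1, s2) < 0.8, prune
```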
String Similarity Join
Length filter: if two strings are similar, their length difference cannot be large.
Threshold = 0.8
s = “Database Conference”, |s| = 2 tokens
r = “International Conference on Information and Knowledge Management”, |r| = 7 tokens
|r| > |s| / 0.8, so Jaccard(s, r) < 0.8
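The bound behind the length filter is Jaccard(s, t) ≤ min(|s|, |t|) / max(|s|, |t|), since the intersection is at most the smaller set and the union at least the larger one. A sketch (function name is ours):

```python
def passes_length_filter(s_tokens, t_tokens, theta) -> bool:
    """Jaccard(s, t) <= min(|s|, |t|) / max(|s|, |t|), so pairs whose
    token-length ratio is below theta can be pruned without verification."""
    a, b = len(set(s_tokens)), len(set(t_tokens))
    return min(a, b) / max(a, b) >= theta

s = "Database Conference".split()
r = "International Conference on Information and Knowledge Management".split()
print(passes_length_filter(s, r, 0.8))  # False -> prune (2/7 < 0.8)
```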
String Similarity Join
GramCount [VLDB/GravanoIJKMS01]
Uses the length filter and the count filter.
Supports both token-based and character-based similarity.
String Similarity Join
ED-Join [PVLDB/XiaoWL08]
Use prefix filtering
Support only edit distance
Two further optimizations: position filtering and content filtering.
Position filtering: removes unnecessary signatures produced by the prefix filter, e.g. with Signatures(“sigmod”) = {“si”, “gm”, “od”}, if destroying both “si” and “gm” already takes at least τ + 1 edit operations, the last signature can be dropped.
Content filtering: uses the L1 distance between character-frequency vectors as a bound: the edit distance cannot be smaller than half of the L1 distance.
s = “sigmod”, t = “sigkdd”: L1(s, t) / 2 = 4 / 2 = 2, so ED(s, t) ≥ 2.
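Content filtering can be sketched with character-frequency Counters; one edit operation changes the frequency vector by at most 2, hence ED ≥ L1/2 (function names are ours):

```python
from collections import Counter

def l1_char_distance(s: str, t: str) -> int:
    """L1 distance between the character-frequency vectors of s and t."""
    cs, ct = Counter(s), Counter(t)
    return sum(abs(cs[c] - ct[c]) for c in set(cs) | set(ct))

def ed_lower_bound(s: str, t: str) -> float:
    # one edit changes at most two frequency entries by 1 each,
    # so ED(s, t) >= L1(s, t) / 2
    return l1_char_distance(s, t) / 2

print(l1_char_distance("sigmod", "sigkdd"))  # 4
print(ed_lower_bound("sigmod", "sigkdd"))    # 2.0
```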
String Similarity Join
Qchunk [SIGMOD/QinWLXL11]
Use prefix filtering
Support only edit distance
Two types of signatures: q-grams and q-chunks
q-chunks: the q-grams with starting positions at i∗q+1 (0 ≤ i ≤ (l−1)/q)
Strategy 1: index q-grams of r and use q-chunks of s to generate candidates.
Strategy 2: index q-chunks of r and use q-grams of s to generate candidates.
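The two signature types can be sketched as follows (a toy illustration; the end-of-string padding symbol is an assumption, not fixed by the slide):

```python
def q_grams(s, q):
    # all overlapping substrings of length q
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def q_chunks(s, q, pad="$"):
    # the q-grams starting at positions 1, q+1, 2q+1, ... (1-indexed),
    # i.e. disjoint chunks; pad the end so the last chunk has length q
    padded = s + pad * ((-len(s)) % q)
    return [padded[i:i + q] for i in range(0, len(padded), q)]

print(q_grams("sigmod", 2))   # ['si', 'ig', 'gm', 'mo', 'od']
print(q_chunks("sigmod", 2))  # ['si', 'gm', 'od']
```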
String Similarity Join
PPJoin [WWW/XiaoWLY08]
Support Jaccard, Cosine, Dice
Two further optimizations: Position Filtering and Suffix Filtering
Position Filtering: Enhance prefix filter by computing upper bounds
Suffix Filtering: probe tokens in the suffix to estimate a tighter upper bound
Position filtering example (Threshold = 0.8):
Sig(s) = {B, C, D, E, F}, Sig(t) = {A, B, C, D, F}
lower bound of the union size: 3 + 3 = 6
upper bound of the intersection size: 1 + 3 = 4
upper bound of the Jaccard: 4/6 < 0.8 ⇒ prune the pair

Suffix filtering example:
s = {A, B, D, E, ?, ?, ?, ?, ?, ?, Q, ?, ?, ?, ?, ?, ?, ?}
t = {A, C, D, E, ?, ?, ?, ?, Q, ?, ?, ?, ?, ?, ?, ?, ?, ?}
lower bound of the union size: 5 + 1 + 6 + 9 = 21
upper bound of the intersection size: 3 + 1 + 4 + 7 = 15
upper bound of the Jaccard: 15/21 < 0.84 ⇒ prune the pair
String Similarity Join: classification

Threshold-based Join
  Syntactic Similarity
    Token-based Similarity: GramCount [vldb/GravanoIJKMS01], PPJoin [www/XiaoWLY08]
    Character-based Similarity: PassJoin [pvldb/LiDWF11], Trie-Join [pvldb/WangLF10], PartEnum [VLDB/ArasuGK06], Qchunk [sigmod/QinWLXL11], ED-Join [pvldb/XiaoWL08]
  Semantic Similarity (Token-based): JaccT [ICDE/ArasuCK08], SI-Join [SIGMOD/LuLWLW13], Pkduck [PVLDB/TaoDS17], AP-Join [CIKM/XuL18], USIM [PVLDB/XuL19]
Top-k Join: Topk-Join [ICDE/XiaoWLS09], Bed-Join [SIGMOD/ZhangHOS10]
String Similarity Join
JaccT[ICDE/ArasuCK08]
Use synonym rules
Transformation-based operation
Filtering: Prefix filter based signature
Verification: Enumerate all possible transformed strings, then apply Jaccard
String Similarity Join
SI-Join [SIGMOD/LuLWLW13]
Use synonym rules
Expansion-based operation
Filtering: length filter and prefix filter
Propose signature size estimation to choose good signatures
Verification: Use Full-Expansion and Selective-Expansion
String Similarity Join
SI-Join [SIGMOD/LuLWLW13]: workflow
Filtering (prefix filter + length filter) → candidates → Verification (Full expansion / Selective expansion)
String Similarity Join
AP-Join [CIKM/XuL18]
Use taxonomy rules
Transformation-based operation
Filtering: length filter and prefix filter
Propose an approximate segmentation algorithm to segment strings
Verification: Use GTS similarity measure
String Similarity Join
USIM [PVLDB/XuL19]
Use synonym and taxonomy rules
Transformation-based operation
Filtering: length filter and prefix filter
Propose an approximate segmentation algorithm to segment strings
Verification: Use USIM similarity measure
1. Generate pebbles for all similarities;
2. Select signatures for prefix filtering (extension: allowing more than one overlap);
3. Perform filtering by finding pairs having enough overlaps as candidates;
4. Verify the unified similarity for each candidate.
USIM pipeline: (Taxonomy, Synonym, Jaccard) → Pebbles → Prefix Selection → Prefix Signature → Prefix Filtering → Candidates → Calculate sim. → Results
String Similarity Join
Topk-Join [ICDE/XiaoWLS09]
Extends the prefix filtering
Assigns each token a weight, which bounds the largest possible similarity
Uses a priority queue to store the current top-k candidates
Example: the tokens of “FCS, Journal, Computer, Science” get weights 1, 0.75, 0.5, 0.25;
for a string “… Journal …” sharing only “Journal”, the maximum possible similarity is 0.75
String Similarity Join
Bed-Join [SIGMOD/ZhangHOS10]
Uses a B+-tree (the Bed-tree) as index
Supports only edit distance
Uses several mapping functions to transform strings into integer values:
1. Dictionary order
2. Gram counting order
3. Gram location order
Distributed String Similarity Join

Algorithm | # runs of reading original data | avoid duplicates | apply filters | load balancing | predictable size in Reduce | avoid reading original data | candidates generation | MapReduce
RIDPairsPPJoin [SIGMOD/VernicaCL10] | ≥3 | × | √ | √ | √ | × | common prefix | ×
V-Smart-Join [PVLDB/MetwallyF12] | ≥3 | √ | × | √ | × | × | enumerate pairs in inverted lists | √
MassJoin [ICDE/DengLHWF14] | ≥3 | × | √ | √ | × | × | enumerate pairs in inverted lists | ×
FS-Join [ICDE/RongLSWLD17] | 2 | √ | √ | √ | √ | √ | common prefix + segmentation | √

(“avoid duplicates” through “predictable size in Reduce” concern the filtering phase; “avoid reading original data” and “candidates generation” concern the verification phase.)
Distributed similarity join framework (MapReduce):
string collection (HDFS) → Ordering (MapReduce: global order calculator → global order) → Filtering (MapReduce: data partitioner over the ordered token sets → candidate generator → candidates) → Verification (MapReduce: candidate aggregator) → Results: similar pairs (HDFS)
Distributed String Similarity Join
FS-Join [ICDE/RongLSWLD17]
(a) Original sets:
S1 = {D, K, E, F}   S2 = {I, C, J, B, K}   S3 = {H, B, I, A}
S4 = {A, D, C, J, F}   S5 = {G, I, D, E}   S6 = {H, J, B, G, K}

(b) Re-ordered sets, under the global ordering A B C D E F G H I J K:
S1 = {D, E, F, K}   S2 = {B, C, I, J, K}   S3 = {A, B, H, I}
S4 = {A, C, D, F, J}   S5 = {D, E, G, I}   S6 = {B, G, H, J, K}

(c) Partitioning based on pivots {C, F, I}: each set is split into four segments, one per fragment:
      fragment 1   fragment 2   fragment 3   fragment 4
S1:   {}           {D, E, F}    {}           {K}
S2:   {B, C}       {}           {I}          {J, K}
S3:   {A, B}       {}           {H, I}       {}
S4:   {A, C}       {D, F}       {}           {J}
S5:   {}           {D, E}       {G, I}       {}
S6:   {B}          {}           {G, H}       {J, K}
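The partitioning in (c) can be sketched as follows (a minimal sketch; the convention that a token equal to a pivot falls into that pivot's segment is inferred from the example):

```python
from bisect import bisect_left

def vertical_partition(ordered_tokens, pivots):
    """Split a globally ordered token set into len(pivots) + 1 segments;
    a token equal to a pivot falls into that pivot's (left) segment."""
    segments = [[] for _ in range(len(pivots) + 1)]
    for tok in ordered_tokens:
        segments[bisect_left(pivots, tok)].append(tok)
    return segments

# slide example: S2 = {B, C, I, J, K}, pivots {C, F, I}
print(vertical_partition(["B", "C", "I", "J", "K"], ["C", "F", "I"]))
# [['B', 'C'], [], ['I'], ['J', 'K']]
```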
Distributed String Similarity Join
FS-Join [ICDE/RongLSWLD17]
Vertical Partitioning
There are two problems that should be considered:
1. How to select the pivots?
❖ Random Selection (Random)
❖ Even Interval (Even-Interval)
❖ Even Token Frequency (Even-TF)
❑ Generate fragments with the same number of tokens
2. How many pivots should be selected?
Computation Framework of FS-Join
Filter phase (first MapReduce round): generate candidate string pairs.
❖ map: emit each segment keyed by its fragment id, e.g. (1, Seg1 of S1), (2, Seg2 of S1), …
❖ group by keys: all segments of the same fragment reach the same reducer
❖ reduce: within each fragment, count the common tokens of every pair of segments and emit ((Si, Sj), # of common tokens), e.g. (S1, S4), 2 and (S2, S6), 1

Verification phase (second MapReduce round): produce the final similarity join results.
❖ map: re-emit the ((Si, Sj), count) pairs keyed by the string pair
❖ group by keys: the per-fragment counts of the same pair, e.g. (S2, S3), 1 from one fragment and (S2, S3), 1 from another, reach the same reducer
❖ reduce: sum the per-fragment counts and verify the similarity of each candidate pair to output the results
Filtering Methods
⮚ String Length Filtering (StrL-Filter)
⮚ Segment Length Filtering (SegL-Filter)
⮚ Segment Intersection Filtering (SegI-Filter)
⮚ Segment Difference Filtering (SegD-Filter)
Comparison with Existing Methods
(Figures: execution time (×10² sec) vs. similarity threshold, for thresholds 0.75 to 0.9; one plot compares RidPairsPPJoin with FS-Join, the other compares Online-Aggregation, Merge, Merge+Light, RidPairsPPJoin and FS-Join.)
References
[VLDB/GravanoIJKMS01] Luis Gravano, Panagiotis G. Ipeirotis, H. V. Jagadish, Nick Koudas, S. Muthukrishnan, Divesh Srivastava: Approximate String Joins in a Database (Almost) for Free. VLDB 2001: 491-500
[VLDB/ArasuGK06] Arvind Arasu, Venkatesh Ganti, Raghav Kaushik: Efficient Exact Set-Similarity Joins. VLDB 2006: 918-929
[VLDB/LiWY07] Chen Li, Bin Wang, Xiaochun Yang: VGRAM: Improving Performance of Approximate Queries on String Collections
Using Variable-Length Grams. VLDB 2007: 303-314
[ICDE/ArasuCK08] Arvind Arasu, Surajit Chaudhuri, Raghav Kaushik: Transformation-based Framework for Record Matching. ICDE 2008: 40-49
[ICDE/LiLL08] Chen Li, Jiaheng Lu, Yiming Lu: Efficient Merging and Filtering Algorithms for Approximate String Searches. ICDE
2008: 257-266
[WWW/XiaoWLY08] Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu: Efficient similarity joins for near duplicate detection. WWW
2008: 131-140
[PVLDB/XiaoWL08] Chuan Xiao, Wei Wang, Xuemin Lin: Ed-Join: an efficient algorithm for similarity joins with edit distance
constraints. PVLDB 1(1): 933-944 (2008)
[ICDE/XiaoWLS09] Chuan Xiao, Wei Wang, Xuemin Lin, Haichuan Shang: Top-k Set Similarity Joins. ICDE 2009: 916-927
[SIGMOD/ZhangHOS10] Zhenjie Zhang, Marios Hadjieleftheriou, Beng Chin Ooi, Divesh Srivastava: Bed-tree: an all-purpose index
structure for string similarity search based on edit distance. SIGMOD Conference 2010: 915-926
[PVLDB/WangLF10] Jiannan Wang, Guoliang Li, Jianhua Feng: Trie-Join: Efficient Trie-based String Similarity Joins with Edit-Distance Constraints. PVLDB 3(1): 1219-1230 (2010)
[SIGMOD/VernicaCL10] Rares Vernica, Michael J. Carey, Chen Li: Efficient parallel set-similarity joins using MapReduce. SIGMOD
Conference 2010: 495-506
[SIGMOD/QinWLXL11] Jianbin Qin, Wei Wang, Yifei Lu, Chuan Xiao, Xuemin Lin: Efficient exact edit similarity query processing
with the asymmetric signature scheme. SIGMOD Conference 2011: 1033-1044
References
[ICDE/WangLF11] Jiannan Wang, Guoliang Li, Jianhua Feng: Fast-join: An efficient method for fuzzy token matching based string similarity join. ICDE 2011: 458-469
[PVLDB/LiDWF11] Guoliang Li, Dong Deng, Jiannan Wang, Jianhua Feng: PASS-JOIN: A Partition-based Method for Similarity
Joins. PVLDB 5(3): 253-264 (2011)
[PVLDB/MetwallyF12] Ahmed Metwally, Christos Faloutsos: V-SMART-Join: A Scalable MapReduce Framework for All-Pair
Similarity Joins of Multisets and Vectors. PVLDB 5(8): 704-715 (2012)
[SIGMOD/LuLWLW13] Jiaheng Lu, Chunbin Lin, Wei Wang, Chen Li, Haiyong Wang: String similarity measures and joins with
synonyms. SIGMOD Conference 2013: 373-384
[ICDE/DengLFL13] Dong Deng, Guoliang Li, Jianhua Feng, Wen-Syan Li: Top-k string similarity search with edit-distance
constraints. ICDE 2013: 925-936
[PVLDB/WangDTZ13] Xiaoli Wang, Xiaofeng Ding, Anthony K. H. Tung, Zhenjie Zhang: Efficient and Effective KNN Sequence
Search with Approximate n-grams. PVLDB 7(1): 1-12 (2013)
[ICDE/DengLHWF14] Dong Deng, Guoliang Li, Shuang Hao, Jiannan Wang, Jianhua Feng: MassJoin: A mapreduce-based method
for scalable string similarity joins. ICDE 2014: 340-351
[TODS/LuLWLX15] Jiaheng Lu, Chunbin Lin, Wei Wang, Chen Li, Xiaokui Xiao: Boosting the Quality of Approximate String
Matching by Synonyms. TODS. 40(3): 15:1-15:42 (2015)
[ICDE/WangLDZF15] Jin Wang, Guoliang Li, Dong Deng, Yong Zhang, Jianhua Feng: Two birds with one stone: An efficient
hierarchical framework for top-k and threshold-based string similarity search. ICDE 2015: 519-530
[ICDE/RongLSWLD17] Chuitian Rong, Chunbin Lin, Yasin N. Silva, Jianguo Wang, Wei Lu, Xiaoyong Du: Fast and Scalable
Distributed Set Similarity Joins for Big Data Analytics. ICDE 2017: 1059-1070
References
[ICDE/ShangLLF17] Zeyuan Shang, Yaxiao Liu, Guoliang Li, Jianhua Feng: K-Join: Knowledge-Aware Similarity Join. ICDE 2017: 23-24
[SEMWEB/SlabbekoornHH12] Kristian Slabbekoorn, Laura Hollink, Geert-Jan Houben: Domain-Aware Ontology Matching. International Semantic Web Conference (1) 2012: 542-558
[PVLDB/TaoDS17] Wenbo Tao, Dong Deng, Michael Stonebraker: Approximate String Joins with Abbreviations. PVLDB 11(1): 53-65 (2017)
[PVLDB/DengKMS17] Dong Deng, Albert Kim, Samuel Madden, Michael Stonebraker: SilkMoth: An Efficient Method for Finding Related Sets with Maximum Matching Constraints. PVLDB 10(10): 1082-1093 (2017)
[CIKM/XuL18] Pengfei Xu, Jiaheng Lu: Efficient Taxonomic Similarity Joins with Adaptive Overlap Constraint. CIKM 2018: 1563-1566
[ICDE/WangLZ19] Jin Wang, Chunbin Lin, Carlo Zaniolo: MF-Join: Efficient Fuzzy String Similarity Join with Multi-level
Filtering. ICDE 2019: 386-397
[EDBT/WangLLZ19] Jin Wang, Chunbin Lin, Mingda Li, Carlo Zaniolo: An Efficient Sliding Window Approach for Approximate Entity
Extraction with Synonyms. EDBT 2019: 109-120
[PVLDB/XuL19] Pengfei Xu, Jiaheng Lu: Towards a Unified Framework for String Similarity Joins. PVLDB 12(11): 1289-1302 (2019)
Part 2
String similarity searches and joins with machine learning
Outline
• Background
• Traditional ML based Approach
• Deep Learning based Approach
• Comprehensive Approach
• Related topics in Natural Language Processing
String Matching can be formulated as…
String as Entity
⮚ The string matching problem is often known as Entity Matching
⮚ Definition: identifying and linking/grouping different manifestations of the same real-world object, e.g.:
❖ Different ways of addressing the same person in text (names, emails, Facebook accounts)
❖ Web pages with different descriptions of the same business
❖ Different photos taken of the same object, etc.
Challenges
⮚ Name/attribute ambiguity, data entry errors, missing data, formatting differences, changing attributes…
Brief introduction of ML
⮚ Machine Learning (ML)
❖ A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. -- Tom Mitchell, 1997
⮚ ML categories
According to whether the learning process uses labelled data:
❖ Supervised Learning
❖ Unsupervised Learning
Supervised vs. Unsupervised Learning
⮚ String matching mostly adopts supervised methods
Supervised ML models
Recurrent Neural Network (RNN)
⮚ Recurrent Neural Networks take the previous output as part of their input; the composite input at time t summarizes historical information from time steps 1 to t-1.
⮚ RNNs are useful as their intermediate values (state) can store information about past inputs for a time that is not fixed a priori.
LSTM: A variant of RNN [HS 1997]

To solve the vanishing gradients problem, LSTM adds a gating mechanism to the RNN and keeps a memory cell that carries information outside the normal flow of the RNN.
Taking the last step’s output as input of the current step, it recurrently updates the representation of the sentence.
The hidden activation vector corresponding to the last word is the sentence embedding vector.
RNN Encoder-decoder [CMG 2014]
⮚ Create a reversible sentence representation.
⮚ The representation can be reconstructed into an actual sentence, which is reasonable and novel.
RNN Encoder-decoder (cont.)
• The conditional distribution of the next symbol:
  P(y_t | y_{t-1}, y_{t-2}, …, y_1, c) = g(h^<t>, y_{t-1}, c)
  h^<t> = f(h^<t-1>, y_{t-1}, c)
• Add a summary (constant) symbol c; it holds the semantics of the sentence.
• For long sentences, add hidden units to remember/forget memory.

Gated Recurrent Unit (GRU): another RNN variant, simpler than LSTM
Word Embedding
⮚ Foundation: the Distributional Hypothesis
⮚ Each word in the vocabulary is represented by a low dimensional vector
⮚ All words are embedded into the same space
⮚ Similar words have similar vectors
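"Similar words have similar vectors" is usually measured with cosine similarity between the embedding vectors; a small sketch (the three-dimensional vectors below are illustrative values, not trained embeddings):

```python
import math

def cosine_similarity(u, v):
    # cosine of the angle between two embedding vectors; close to 1 for similar words
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

king, queen, banana = [0.9, 0.8, 0.1], [0.85, 0.82, 0.15], [0.1, 0.0, 0.9]
print(cosine_similarity(king, queen) > cosine_similarity(king, banana))  # True
```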
RNN Language Model
⮚ Generate much more meaningful text than n-gram models
⮚ The sparse history is projected into some continuous low-dimensional space, where similar histories get clustered
s(t) = f(U·w(t) + W·s(t-1))
y(t) = g(V·s(t))

w(t): input word at time t
s(t): hidden layer state
y(t): output probability distribution over words
U, V, W: transformation matrices
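One step of this model can be sketched in plain Python, assuming f = sigmoid and g = softmax (the slide leaves f and g unspecified; the toy matrix values below are made up):

```python
import math

def rnn_lm_step(U, W, V, w_onehot, s_prev):
    """One RNN language-model step: s(t) = f(U w(t) + W s(t-1)), y(t) = g(V s(t))."""
    def matvec(M, x):
        return [sum(m * xi for m, xi in zip(row, x)) for row in M]
    pre = [a + b for a, b in zip(matvec(U, w_onehot), matvec(W, s_prev))]
    s = [1.0 / (1.0 + math.exp(-z)) for z in pre]  # f = sigmoid (assumption)
    logits = matvec(V, s)                           # g = softmax (assumption)
    mx = max(logits)
    exps = [math.exp(z - mx) for z in logits]
    total = sum(exps)
    y = [e / total for e in exps]
    return s, y

# toy sizes: vocabulary of 3 words, hidden layer of 2 units
U = [[0.1, 0.2, 0.0], [0.0, 0.1, 0.3]]
W = [[0.5, 0.0], [0.0, 0.5]]
V = [[0.2, 0.1], [0.1, 0.2], [0.3, 0.3]]
s, y = rnn_lm_step(U, W, V, [1, 0, 0], [0.0, 0.0])
print(abs(sum(y) - 1.0) < 1e-9)  # True: y(t) is a probability distribution over words
```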
Word2vec [MCC 2013]
Directly learn the representation of words using context words
Maximize the objective function over the whole corpus:
  Σ_(w,c)∈D Σ_(wj∈c) log P(wj | w)

Two variants:
Skip-gram: given the word, predict the context.
  Works well with small training data; represents well even rare words or phrases.
CBOW: given the context, predict the word.
  Faster to train the model; better accuracy for the frequent words.
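Skip-gram training data generation can be sketched as follows (a toy illustration only; the real word2vec implementation also applies subsampling, dynamic window sizes and negative sampling):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center word, context word) training pairs: for each position,
    pair the word with every word within `window` positions of it."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

print(skipgram_pairs(["the", "matrix", "movie"], window=1))
# [('the', 'matrix'), ('matrix', 'the'), ('matrix', 'movie'), ('movie', 'matrix')]
```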
GloVe [PSM 2014]

⮚ Without the Distributional Hypothesis.
⮚ Constructs the word-word co-occurrence matrix of the whole corpus.
⮚ Inspired by LSA, uses matrix factorization to produce word representations.

Loss function (to minimize):
  J = Σ_(i,j) f(X_ij) (w_i^T w_j - log X_ij)^2
where X_ij counts how often word j occurs in the context of word i, and the w are word vectors.
Phrase2vec [MSC 2013]
⮚ From Word2vec to Phrase2vec
❖ Enlarge phrase vocabulary by analogical reasoning task.
❑ A : B = C : ?
e.g. In word vector space, “Man” + “King” – “Woman” = “Queen”
❑ Phrase A : Phrase B = Phrase C : Phrase D
❖ Phrase Skip-gram Model
❑ Treat phrase in vocabulary like a word.
❑ Incorporates Hierarchical Softmax and Negative Sampling.
e.g. New York : New York Times = San Jose : ?
Candidate answers: San Jose Airport, San Jose State Univ., San Jose Mercury News, …
Phrase D is approximately produced, which could enlarge the phrase vocabulary.
(Phrases are words that appear frequently together in the corpus.)
Paragraph2vec [LM 2014]
⮚ An extension of word2vec model, using a global sentence vector as context.
⮚ When updating word vectors in each iteration, paragraph matrix will also be updated.
Predict the next word according to context
Categorization of ML based String Matching
⮚ Traditional ML approach
❖ Acquire features from entities
❖ Use classification/clustering methods to make decision
⮚ Deep Learning approach
❖ Encode entity/attribute with low-dimensional vectors
❖ Use deep neural network for representation learning
⮚ Comprehensive approach
❖ End-to-end system
❖ Combination of different approaches
❖ Improved usability
Basics of ML approaches
⮚ ML model can be either traditional ones or deep learning ones
Feature Engineering
⮚ For Deep Learning approaches, representation learning is used instead of such explicit features.
Outline
• Background
• Traditional ML based Approach
• Deep Learning based Approach
• Comprehensive Approach
• Related topics in Natural Language Processing
Rule-based approach [FS 1969]
⮚ r = (x, y) is a record pair, γ is its comparison vector, M: matches, U: non-matches
⮚ Decision Rule
⮚ Naïve Bayes Assumption
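The decision rule and the naïve Bayes assumption, whose formulas did not survive extraction, take the standard Fellegi-Sunter form; a sketch of the usual statement:

```latex
% Likelihood-ratio decision rule with thresholds t_mu >= t_lambda
R(\gamma) = \frac{P(\gamma \mid r \in M)}{P(\gamma \mid r \in U)},
\qquad
\text{decision} =
\begin{cases}
\text{match} & \text{if } R(\gamma) \ge t_{\mu},\\
\text{possible match} & \text{if } t_{\lambda} < R(\gamma) < t_{\mu},\\
\text{non-match} & \text{if } R(\gamma) \le t_{\lambda}.
\end{cases}

% Naive Bayes assumption: the components of the comparison vector are
% conditionally independent given the class
P(\gamma \mid r \in M) = \prod_{i} P(\gamma_i \mid r \in M),
\qquad
P(\gamma \mid r \in U) = \prod_{i} P(\gamma_i \mid r \in U).
```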
Supervised approaches
⮚ The most popular category
❖ Decision trees [TKM 2001]
❖ Support vector machines [BM 2003]
❖ Ensembles of classifiers [CKM 2009]
❖ Conditional Random Fields (CRF) [GS 2009]
⮚ However, there might be potential problems…
Potential drawback: insufficient training data
⮚ Constructing a training set is hard, since most pairs of records are “easy non-matches”, e.g.:
❖ 100 records from 100 cities
❖ Only 1% of the total pairs come from the same city
⮚ Some pairs are hard to judge even by humans
❖ Inherently ambiguous
E.g., Paris Hilton (person or business)
❖ Missing attributes
E.g., Starbucks, Toronto vs. Starbucks, Queen Street, Toronto
Other alternatives
⮚ Semi-supervised/Unsupervised Learning
❖ EM based techniques to learn parameters [Winkler 2006]
❖ Generative Models [RC 2004]
⮚ Active Learning
❖ Committee of Classifiers [SB 2002]
❖ Provably optimizing precision/recall [AGK 2010], [BIP 2012]
❖ Crowdsourcing [WKF 2012]
Summary
⮚ Supervised Learning: The main stream
⮚ Make use of different features and models
⮚ Bottleneck
❖ Require feature engineering
❖ Insufficient training data
⮚ Active Learning and crowdsourcing methods could be promising
Outline
• Background
• Traditional ML based Approach
• Deep Learning based Approach
• Comprehensive Approach
• Related topics in Natural Language Processing
DeepER [ETJ 2018]
⮚ Adopt Deep Learning models to
❖ Capture both syntactic and semantic information
❖ Avoid labor-intensive manual feature engineering
⮚ End-to-end training the model
⮚ Use Locality Sensitive Hashing for identifying similar entities
DeepER: Methodology
⮚ Use word embedding on each word in an entity
⮚ Encode the whole entity with RNN models
⮚ Compare the similarity with some predefined metrics, e.g. element-wise comparison
⮚ Classification layer: use activation functions, e.g. softmax, sigmoid
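A minimal sketch of the comparison and classification steps, assuming element-wise absolute difference as the predefined metric and a single sigmoid unit as the classification layer (illustrative, not DeepER's exact architecture):

```python
import math

def similarity_vector(u, v):
    """Element-wise comparison of two entity embeddings: the absolute
    difference per dimension (one common choice of predefined metric)."""
    return [abs(a - b) for a, b in zip(u, v)]

def classify(sim_vec, weights, bias):
    # classification layer: a single sigmoid unit over the similarity vector
    z = sum(w * x for w, x in zip(weights, sim_vec)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # probability that the pair matches

# identical entity embeddings -> similarity vector of zeros
u = [0.2, 0.7, 0.1]
print(similarity_vector(u, u))  # [0.0, 0.0, 0.0]
```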
LinkNBed [TSD 2018]
⮚ General idea: Jointly learn representations and entity linkage
⮚ Key Insights
Model Architecture
Deep Matcher [MLR 2018]

⮚ Three Components
❖ Encode attributes into low-dimensional vectors
❖ Evaluate similarity between representations of entities
❖ Perform classification

⮚ 4 Models
❖ SIF: An Aggregate Function Model
❖ RNN: A Sequence-aware Model
❖ Attention: A Sequence Alignment Model
❖ Hybrid Model
Decomposable Attention-based model
⮚ Attention mechanism
❖ Learn a similarity representation
❖ Expressed with token-wise alignment
⮚ Comparison: Two-layer Highway Network
⮚ Aggregation
❖ Sum over all elements
❖ Normalization
Hybrid Attribute Summarization
⮚ Use a Bi-directional RNN to learn the representation before soft alignment
⮚ Improvement over aggregation
❖ Use RNN to compute weight
❖ A weighted average over all elements
Summary
⮚ Pros
❖ Do not need manually created features
❖ More Powerful
⮚ Cons
❖ Efficiency
❖ Only supports the supervised setting
❖ Overfitting
Outline
• Background
• Traditional ML based Approach
• Deep Learning based Approach
• Comprehensive Approach
• Related topics in Natural Language Processing
End-to-end EM System
⮚ Challenges
❖ Cover the entire EM pipeline
❖ Integrate multiple methods into one system
❖ Provide necessary guidance to users
Two-Stage Framework

How is this done today in practice?
Development stage: find an accurate workflow, using data samples
Production stage: execute the workflow on the entirety of the data

Tables A and B (1M tuples each) → block → match (using supervised learning)
The Magellan System [KDS 2016]

Development Stage (on data samples):
❖ How-to guide and supporting tools (as Python commands)
❖ Python interactive environment and script language
❖ Data analysis stack: pandas, scikit-learn, matplotlib, numpy, scipy, pyqt, seaborn, …
❖ Facilities for lay users: GUIs, wizards, …
❖ Power users produce the EM workflow

Production Stage (on the original data):
❖ How-to guide and supporting tools (as Python commands)
❖ Big Data stack: PySpark, mrjob, Pydoop, pp, dispy, …

Both stages build on the PyData ecosystem.
Workflow: How-to guide

1. Take samples A’ and B’ from tables A and B.
2. Apply a blocker X to (A’, B’) to obtain a candidate set Cx of tuple pairs.
3. Take a sample S from Cx and label its pairs (+/-) to get a golden set G.
4. Cross-validate matchers on G (e.g. matcher U: 0.89 F1, matcher V: 0.93 F1) and pick the best one, matcher V.
5. Run a quality check: if it fails (no), revise the workflow (e.g. try another blocker Y, producing Cy) and repeat; if it passes (yes), output the workflow.
PyData Ecosystem

⮚ Development stage does a lot of data analysis
❖ So build tools on the data analysis stack in PyData
⮚ Production stage focuses on scaling
❖ So build tools on the Big Data stack in PyData
⮚ PyData ecosystem
❖ Used extensively by data scientists
❖ 86,800 packages (in PyPI)
❖ Data analysis stack and Big Data stack
❖ Tools to manage user work
❖ Software infrastructure to build tools
❖ Ways to manage/package/distribute tools
❖ Companies, conferences, books, etc.
Design for Open World

Closed-World Systems (e.g. an RDBMS): SQL queries and commands operate only on the system’s own data (tables A, B) and metadata (e.g. “A.ssn is a key”).

Open-World Systems (e.g. Magellan): commands share data (tables A, B, C) and metadata (e.g. “A.ssn is a key”) with other systems X and Y and their commands.
Falcon [DSD 2017]
⮚ Boost EM with crowdsourcing and active learning
❖ Define basic operators & use them to model the EM workflow as a DAG
❖ Scales up operators (using MapReduce)
❑ e.g., executing complex blocking rules over large tables
❖ Conduct optimizations both within and across operators
❑ e.g., use crowd time of an operator to mask machine time of another
The EM DAG of Falcon
⮚ Basic idea: use crowd time to mask machine time
⮚ Other Optimizations: indexing, speculatively execute rules, mask pair selection
Smurf [SAD 2018]
⮚ Replaces the labeling step of Falcon with a self-service solution
⮚ Uses a Random Forest to directly match strings with multiple predicates instead of deriving blocking rules
❖ Pruning predicates are applied in the fashion of a decision tree; multiple decision trees form the Random Forest
Execution of Random Forests
⮚ In-database fashion of execution: express the rules with join operations
⮚ Optimizations: leveraging the query optimizer of the DBMS
❖ Join results reuse
❖ Intra/inter-path filter reuse and ordering
❖ Cost-based plan selection
Summary
⮚ Powerful toolkits
⮚ Integration of different approaches
❖ Similarity functions
❖ ML models
❖ Crowdsourcing
⮚ User friendly API
⮚ Efficiency
❖ Benefit from DB techniques
Outline
• Background
• Traditional ML based Approach
• Deep Learning based Approach
• Comprehensive Approach
• Related topics in Natural Language Processing
Text matching in NLP tasks
⮚ The problem can be formulated as:
Match(𝑇1, 𝑇2) = 𝐹(𝜙(𝑇1), 𝜙(𝑇2))
❖ 𝜙: maps a text to a representation vector
❖ 𝐹: scoring function over the representations
Text matching is a core task in NLP
Task Text 1 Text 2
Information Retrieval Query/Document Document
Question Answering Question Answer
Paraphrase Identification Text A Text B
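A minimal concrete instance of this formulation, with deliberately simple choices for both functions (𝜙 as a bag-of-words term-frequency vector and 𝐹 as cosine similarity; the neural models discussed next learn both instead):

```python
import math
from collections import Counter

def phi(text):
    # Representation function: a toy bag-of-words term-frequency vector.
    return Counter(text.lower().split())

def F(v1, v2):
    # Scoring function: cosine similarity between the two vectors.
    dot = sum(v1[w] * v2[w] for w in v1)
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

def match(t1, t2):
    return F(phi(t1), phi(t2))

score = match("why are doctors always late",
              "why doctors always make you wait")
```

Every model in this part of the tutorial can be read as a different trade-off in how much capacity goes into 𝜙 (the representation) versus 𝐹 (the interaction/scoring).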
Paraphrase Identification
Where can I get very professional and reliable
envelope printing service in Sydney?
Where can I get very affordable branded envelope
printing service in Sydney?
Why are doctors always late?
Why doctors always make you wait for 15-20 minutes
before they see you?
ARC-I and ARC-II [HLL 2014]
Architecture I: Siamese network
Architecture II: early interaction
PWIM [HL 2016]
Information Retrieval
⮚ Deep learning methods have been successfully applied to information retrieval (IR)
⮚ Human-defined features are:
❖ Time-consuming
❖ Incomplete
❖ Over-specified
DSSM [HHG 2013]
⮚ First compose the representation of each document/sentence
⮚ Then perform matching between documents/sentences
❖ Cosine similarity between semantic vectors
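DSSM hashes each word into letter trigrams before the deep layers. The sketch below is a simplification: it shows the word-hashing step and the cosine scoring, but applies cosine directly to the hashed counts and omits DSSM's learned nonlinear layers entirely.

```python
import math
from collections import Counter

def letter_trigrams(text):
    # DSSM-style word hashing: pad each word with '#' and break it into
    # letter trigrams, e.g. "web" -> "#we", "web", "eb#".
    grams = []
    for word in text.lower().split():
        padded = f"#{word}#"
        grams += [padded[i:i + 3] for i in range(len(padded) - 2)]
    return grams

def cosine(grams1, grams2):
    # Cosine similarity between two trigram-count vectors.
    c1, c2 = Counter(grams1), Counter(grams2)
    dot = sum(c1[g] * c2[g] for g in c1)
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

def score(query, doc):
    return cosine(letter_trigrams(query), letter_trigrams(doc))
```

Word hashing keeps the input dimensionality small and makes the model robust to out-of-vocabulary words and minor misspellings, since close variants share most of their letter trigrams.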
Match Pyramid [WLG 2016]
⮚ Focuses on capturing matching patterns from word-level interactions
Question Answering (QA)
Benchmark: Stanford Question Answering Dataset (SQuAD)
Match-LSTM [WJ 2017]
BiDAF: Bi-directional Attention Flow [SKF 2017]
QANet [YDL 2018]
Summary
⮚ Text matching is also a popular and crucial topic in NLP
⮚ Different tasks define the texts to be matched differently
⮚ Basic architecture: Siamese Network
⮚ Models are going deeper…
Reference
⮚ [AGK 2010] On active learning of record matching packages. SIGMOD 2010.
⮚ [BIP 2012] Active sampling for entity matching. KDD 2012.
⮚ [BM 2003] Adaptive Duplicate Detection Using Learnable String Similarity Measures. KDD 2003.
⮚ [CKM 2009] Exploiting context analysis for combining multiple entity resolution systems. SIGMOD 2009.
⮚ [CMG 2014] Learning phrase representations using RNN encoder-decoder for statistical machine translation. EMNLP 2014.
⮚ [DHM 2005] Reference reconciliation in complex information spaces. SIGMOD 2005.
⮚ [DSD 2017] Falcon: Scaling up hands-off crowdsourced entity matching to build cloud services. SIGMOD 2017.
⮚ [ETJ 2018] Distributed representations of tuples for entity resolution. PVLDB 2018.
⮚ [HS 1997] Long Short-Term Memory. Neural Computation, 1997.
⮚ [KDS 2016] Magellan: Toward building entity matching management systems. PVLDB 2016.
Reference (cont.)
⮚ [HHG 2013] Learning Deep Structured Semantic Models for Web Search using Clickthrough Data. CIKM 2013.
⮚ [HL 2016] Pairwise Word Interaction Modeling with Deep Neural Networks for Semantic Similarity Measurement. NAACL 2016.
⮚ [HLL 2014] Convolutional Neural Network Architectures for Matching Natural Language Sentences. NIPS 2014.
⮚ [GS 2009] Answering Table Augmentation Queries from Unstructured Lists on the Web, PVLDB 2009.
⮚ [LM 2014] Distributed representations of sentences and documents. ICML 2014.
⮚ [MLR 2018] Deep learning for entity matching: A design space exploration. SIGMOD 2018.
⮚ [MCC 2013] Efficient Estimation of Word Representations in Vector Space. ICLR Workshop 2013.
⮚ [MSC 2013] Distributed representations of words and phrases and their compositionality. NIPS 2013.
⮚ [PSM 2014] GloVe: Global Vectors for Word Representation. EMNLP 2014.
⮚ [RC 2004] A Hierarchical Graphical Model for Record Linkage. UAI 2004.
Reference (cont.)
⮚ [SAD 2018] Smurf: Self-service string matching using random forests. PVLDB 2018.
⮚ [SKF 2017] Bidirectional Attention Flow for Machine Comprehension. ICLR 2017.
⮚ [TKM 2001] Learning object identification rules for information integration. Inf. Syst. 2001.
⮚ [TSD 2018] LinkNBed: Multi-Graph Representation Learning with Entity Linkage. ACL 2018.
⮚ [Winkler 2006] Overview of Record Linkage and Current Research Directions, Research Report Series, US Census, 2006.
⮚ [WJ 2017] Machine Comprehension Using Match-LSTM and Answer Pointer. ICLR 2017.
⮚ [WKF 2012] CrowdER: Crowdsourcing Entity Resolution. PVLDB 2012.
⮚ [WLG 2016] A Deep Architecture for Semantic Matching with Multiple Positional Sentence Representations. AAAI 2016.
⮚ [YDL 2018] QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension. ICLR 2018.
Open challenges
Open challenges (I)
Optimize the pipeline of String Similarity Queries
⮚ Most existing works optimize at the algorithm level
⮚ An end-to-end pipeline is needed to handle the whole life cycle of the task
⮚ It is essential to build such a pipeline on relational database management systems like MySQL or NoSQL databases like MongoDB
Open challenges (II)
More Efficient ML-based Approaches
⮚ Machine-learning-based methods need more time to train models than database techniques take
⮚ Supervised methods require a large amount of labeled data to serve as the training set
Open challenges (III)
Combine Human-in-the-Loop Approaches with ML
⮚ Use crowdsourcing approaches in string similarity measurement
⮚ The future direction is to automatically identify when, and to what extent, human labor should be involved in string similarity processing
Conclusion
⮚ String data is ubiquitous
⮚ Two families of approaches for string similarity processing: database techniques and ML models
⮚ Future work: combine the two approaches into a Human-in-the-Loop, (semi-)autonomous tool
Thank you
Q & A