Synergy of Database Techniques and Machine Learning Models for String Similarity Search and Join
CIKM 2019 Tutorial
Jiaheng Lu, University of Helsinki
Chunbin Lin, Amazon AWS
Jin Wang, University of California Los Angeles
Chen Li, University of California Irvine
CIKM 2019: Synergy of Database Techniques and Machine Learning Models for String Similarity Search and Join
Outline
Motivation and Background
History and Classification
Databases Techniques
Machine Learning Models
Open Challenges and Discussion
String data is ubiquitous
A string is a sequence of characters.
Example of string data:
• Product catalogs
• DNA sequence data
• Text description
• Customer relationship management data
Example: approximate string processing for a movie database
Star            Title                  Year  Genre
Keanu Reeves    The Matrix             1999  Sci-Fi
Samuel Jackson  Iron man               2008  Sci-Fi
Schwarzenegger  Terminator: Dark Fate  2019  Sci-Fi
Samuel Jackson  The man                2006  Crime
Find movies starring “Schwarzeneger” (missing one ‘g’)
Data may not be clean
Relation R (Star): Keanu Reeves, Samuel Jackson, Schwarzenegger
Relation S (Star): Keanu Reeves, Samuel L. Jackson, Schwarzenegger
Data integration and cleaning: the same person may appear with different spellings (Samuel Jackson vs. Samuel L. Jackson)
Problem definition: approximate string searches
Query q: Schwarrzenger
Collection of strings s: Keanu Reeves, Samuel Jackson, Schwarzenger, …
Search
Output: strings s that satisfy Sim(q,s) ≤ δ for distance functions (e.g. edit distance), or Sim(q,s) ≥ δ for similarity functions (e.g. Jaccard coefficient and cosine similarity)
Problem definition: approximate string joins
Collection of strings s: Keanu Reeves, Samuel Jackson, Schwarzenger, …
Collection of strings t: Keanu Reeves, Samuel Jackson, Schwarzengger, …
Join
Output: string pairs (s, t) that satisfy Sim(s,t) ≤ δ for distance functions (e.g. edit distance), or Sim(s,t) ≥ δ for similarity functions (e.g. Jaccard coefficient and cosine similarity)
Edit distance
A widely used metric to define string similarity
Ed(s1,s2)= minimum # of operations (insertion, deletion, substitution) to change s1 to s2.
Example:
s1: Tom Hanks
s2: Ton Hank
ed(s1, s2) = 2
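The definition above maps directly to the classic dynamic program. A minimal sketch in Python (the function name is ours, not from the tutorial):

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn s1 into s2 (classic DP)."""
    m, n = len(s1), len(s2)
    # dp[i][j] = edit distance between s1[:i] and s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                  # delete all of s1[:i]
    for j in range(n + 1):
        dp[0][j] = j                  # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

print(edit_distance("Tom Hanks", "Ton Hank"))  # 2: substitute m->n, delete s
```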
Jaccard Coefficient
⮚ Jaccard(X, Y) = |X ∩ Y| / |X ∪ Y|
⮚ q-grams: all substrings of length q. For example, the 2-grams of “universal” are (un),(ni),(iv),(ve),(er),(rs),(sa),(al)
Jaccard Coefficient
Jaccard (X,Y) = |X∩Y| / |X∪Y|
Example:
s1: Tom Hanks
s2: Ton Hank
3-Gram(s1) = {Tom, om_, m_H, _Ha, Han, ank, nks}
3-Gram(s2) = {Ton, on_, n_H, _Ha, Han, ank}
Jaccard(s1, s2) = 3/10 = 0.3
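The q-gram Jaccard computation can be reproduced in a few lines, assuming spaces are replaced by '_' as in the slide's gram lists (function names are illustrative):

```python
def qgrams(s: str, q: int = 3) -> set:
    """q-gram set of a string; spaces become '_' so word
    boundaries appear inside grams, matching the slide."""
    s = s.replace(" ", "_")
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def jaccard(s1: str, s2: str, q: int = 3) -> float:
    g1, g2 = qgrams(s1, q), qgrams(s2, q)
    return len(g1 & g2) / len(g1 | g2)

print(jaccard("Tom Hanks", "Ton Hank"))  # 0.3 (3 shared grams out of 10)
```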
Cosine similarity
cosine(A, B) = (Σi Ai·Bi) / (sqrt(Σi Ai²) · sqrt(Σi Bi²))
Example:
s1: Tom Hanks
s2: Ton Hank
3-Gram(s1) = {Tom, om_, m_H, _Ha, Han, ank, nks}
3-Gram(s2) = {Ton, on_, n_H, _Ha, Han, ank}
Cosine(s1, s2) = 3 / sqrt(6 × 7) ≈ 0.46
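With 0/1 gram vectors, the cosine formula reduces to |X ∩ Y| / sqrt(|X| · |Y|); a small sketch reproducing the slide's number (helper names are ours):

```python
import math

def qgram_set(s: str, q: int = 3) -> set:
    # spaces become '_' so word boundaries appear inside grams
    s = s.replace(" ", "_")
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def cosine(s1: str, s2: str, q: int = 3) -> float:
    """Cosine over binary q-gram vectors: the dot product is the
    intersection size, each norm is sqrt(set size)."""
    g1, g2 = qgram_set(s1, q), qgram_set(s2, q)
    return len(g1 & g2) / math.sqrt(len(g1) * len(g2))

print(round(cosine("Tom Hanks", "Ton Hank"), 2))  # 0.46 = 3 / sqrt(6*7)
```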
Synonym based similarity
Synonyms:
CIKM = ACM International Conference on Information and Knowledge Management
China = CN
Example:
S1 = “ACM International Conference on Information and Knowledge Management China”
S2 = “CIKM 2019 CN”
How to use the existing synonyms?
String Similarity Measures (Full-expansion)
S1=“ACM International Conference on Information and Knowledge Management China”
S2=“CIKM 2019 CN”
CIKM = ACM International Conference on Information and Knowledge Management
China = CN
Synonyms
S1’ = “ACM International Conference on Information and Knowledge Management China CIKM CN”
S2’ = “CIKM 2019 CN ACM International Conference on Information and Knowledge Management China”
Expanding using all synonyms
Jaccard(S1’,S2’)= 11/12 = 0.92
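The full-expansion idea can be sketched as follows; the substring-based rule matching is a simplification (it ignores token boundaries) that happens to work for this example:

```python
def full_expand(text: str, synonyms) -> set:
    """Full expansion: append every applicable synonym rewriting
    to the token set. Rules apply in both directions."""
    tokens = text.split()
    for lhs, rhs in synonyms:
        if lhs in text:          # naive substring test, fine here
            tokens += rhs.split()
        if rhs in text:
            tokens += lhs.split()
    return set(tokens)

synonyms = [
    ("CIKM", "ACM International Conference on Information and Knowledge Management"),
    ("China", "CN"),
]
s1 = full_expand("ACM International Conference on Information and Knowledge Management China", synonyms)
s2 = full_expand("CIKM 2019 CN", synonyms)
print(len(s1 & s2) / len(s1 | s2))  # 11/12, approximately 0.92
```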
Scope of this tutorial
⮚ Database techniques
❖ String similarity measures
❖ String similarity search
❖ String similarity join
⮚ Machine learning techniques
❖ Traditional machine learning based approaches (e.g. SVM)
❖ Deep learning based techniques
❖ End-to-end systems
❖ Related topics in natural language processing
Timeline of string similarity processing mechanisms (figure)
Two parts
Part 1: String similarity searches and joins in databases
Part 2: String similarity searches and joins with machine learning
Part 1: String similarity searches and joins in databases
Outline
Motivation & Problem Definition
String Similarity Measures
String Similarity Search
String Similarity Join
Conclusion
Motivation & Problem Definition
Name             Affiliation  Address
Victor D. Vianu  UCSD         9500 Gilman Dr CA

Name             Address            Institution
Victor Viannu    9500 Gilman Drive  University of California San Diego

(typo: Vianu vs. Viannu; different representation: UCSD vs. University of California San Diego)
Inconsistent data stored in different sources
Typos (Vianu vs. Viannu)
Synonyms (UCSD vs. University of California San Diego)
Taxonomy (Coffee vs. Latte)
……
Motivation & Problem Definition
String similarity search: given a query string s and a collection of strings t1, t2, …, tm, output the pairs (s, tj) such that s and tj are similar.
Motivation & Problem Definition
String similarity join: given two collections of strings s1, …, sn and t1, …, tm, output the pairs (si, tj) such that si and tj are similar.
Motivation & Problem Definition
Basic operation: compute the similarity for two strings
s = “VLDB”, t = “PVLDB”
s = “UW”, t = “University of Washington”
s = “UW”, t = “University of Waterloo”
s = “cafe”, t = “coffee shop”
s = “Vianu”, t = “Viannu”
s = “Papakonstantinou”, t = “Papaconstantnou”
Are they similar or not?
Challenge: accuracy
Motivation & Problem Definition
Naïve method: compute the similarity for every pair of strings.
String similarity search: compare the query s against each of t1, …, tm.
String similarity join: compare each si against each tj.
Bad performance! Challenge: performance
String Similarity Joins
Filter-and-Verification Framework
Filter: prune dissimilar string pairs, generate candidates
Verification: compute similarity scores for candidates
All pairs (s1…s5 × t1…t5) → Filtering → Candidates → Verification → Results
Outline
Motivation & Problem Definition
String Similarity Measures
String Similarity Search
String Similarity Join
Conclusion
String Similarity Measures
⮚ Syntactic Similarity
❖ Token-based similarity: Jaccard, Dice, Cosine, …
❖ Character-based similarity: Edit Distance, Edit Similarity, Hamming Distance, …
⮚ Hybrid Similarity: Fast-Join [ICDE/WangLF11], Silkmoth [PVLDB/DengKMS17], MF-Join [ICDE/WangLZ19]
⮚ Semantic Similarity
❖ Synonym-based similarity: JaccT [ICDE/ArasuCK08], Full Expansion [SIGMOD/LuLWLW13], Selective Expansion [SIGMOD/LuLWLW13], Pkduck [PVLDB/TaoDS17], JACCAR [EDBT/WangLLZ19], K-Join [ICDE/ShangLLF17]
❖ Taxonomy-based similarity: GTS [CIKM/XuL18], USIM (Unified String Similarity) [PVLDB/XuL19]
String Similarity Measures
Token-based similarity: Jaccard, Dice, Cosine, etc.
Convert a string into tokens:
s = “University of California San Diego” → s’ = {University, of, California, San, Diego}
t = “University of California Los Angeles” → t’ = {University, of, California, Los, Angeles}
Overlap-based similarity: a higher score means more similar.
String Similarity Measures
Token-based similarity: Jaccard, Dice, Cosine, etc.
Convert a string into q-grams, e.g. the 2-gram set of “AMAZON” is {AM, MA, AZ, ZO, ON}:
s = “AMAZON” → s’ = {AM, MA, AZ, ZO, ON}
t = “AMACON” → t’ = {AM, MA, AC, CO, ON}
String Similarity Measures
Character-based similarity: edit distance, edit similarity, etc.
Edit distance ED: the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other. A lower score means more similar.
ED(“me”, “my”) = 1
ED(“starbucks”, “starbukk”) = 2: starbucks → starbukks (1st edit: substitute c with k) → starbukk (2nd edit: delete s)
Edit similarity EDS: a normalized variant, commonly defined as EDS(s, t) = 1 − ED(s, t) / max(|s|, |t|); a higher score means more similar.
Hybrid similarity: Fuzzy-Overlap, Fuzzy-Jaccard, Fuzzy-Dice, Fuzzy-Cosine
1. Tokenize each string as a set of tokens.
2. Allow fuzzy matching between tokens.
s = “International Conference on Inforomation and Knowledges Management”
t = “Internaitional Conferrence on Information and Knowledge Management”
Jaccard(s, t) = 3/11 ≈ 0.27 (only the exact token matches “on”, “and”, “Management” count)
Hybrid similarity: Fuzzy-Overlap, Fuzzy-Jaccard, Fuzzy-Dice, Fuzzy-Cosine
1. Tokenize each string as a set of tokens.
2. Allow fuzzy matching between tokens.
s = “International Conference on Inforomation and Knowledges Management”
t = “Internaitional Conferrence on Information and Knowledge Management”
String Similarity Measures
Fuzzy-Jaccard(s, t) = 7/7 = 1.0
E.g., allow edit distance less than 2 between tokens
Hybrid similarity: Fuzzy-Overlap, Fuzzy-Jaccard, Fuzzy-Dice, Fuzzy-Cosine
1. Tokenize each string as a set of tokens.
2. Allow fuzzy matching between tokens.
s = “International Conference on Inforomation and Knowledges Management”
t = “on Information and Knowledge Management Internaitional Conferrence”
String Similarity Measures
Fuzzy-Jaccard(s, t) = 7/7 = 1.0
E.g., allow edit distance less than 2 between tokens
Hybrid similarity: Fuzzy-Overlap, Fuzzy-Jaccard, Fuzzy-Dice, Fuzzy-Cosine
1. Tokenize each string as a set of tokens.
2. Allow fuzzy matching between tokens.
Build a weighted bigraph to model the two token sets (token nodes s1, …, sn on one side and t1, …, tm on the other, with edge weights wi,j):
Node: one node per token
Edge: present if the character-based similarity of the two tokens exceeds a threshold
Weight: the character-based similarity
Fuzzy-overlap = maximum weight matching of the bigraph
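As a sketch of fuzzy-overlap, the maximum weight matching can be brute-forced over permutations for the tiny token sets in these examples (real systems such as Fast-Join use proper matching algorithms); `eds` here is the edit similarity 1 − ED/max length, and all names are ours:

```python
from itertools import permutations

def eds(a: str, b: str) -> float:
    """Edit similarity 1 - ED(a, b) / max(|a|, |b|), single-row DP."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1, dp[j - 1] + 1,
                        prev + (a[i - 1] != b[j - 1]))
            prev = cur
    return 1 - dp[n] / max(m, n)

def fuzzy_overlap(s_tokens, t_tokens, tau=0.8):
    """Maximum-weight matching on the token bigraph, brute force;
    edges with similarity below tau contribute nothing."""
    small, big = sorted([s_tokens, t_tokens], key=len)
    best = 0.0
    for perm in permutations(big, len(small)):
        w = sum(eds(a, b) for a, b in zip(small, perm) if eds(a, b) >= tau)
        best = max(best, w)
    return best
```

For example, matching the tokens of “Inforomation Knowledges Management” against “Information Knowledge Management” pairs each misspelled token with its correct counterpart.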
String Similarity Measures
Hybrid similarity: Fuzzy-Overlap, Fuzzy-Jaccard, Fuzzy-Dice, Fuzzy-Cosine
1. Tokenize each string as a set of tokens.
2. Allow fuzzy matching between tokens.
Fast-Join [ICDE/WangLF11]
Token level: token-based similarity
Element level: character-based similarity
String Similarity Measures
Hybrid similarity: Fuzzy-Overlap, Fuzzy-Jaccard, Fuzzy-Dice, Fuzzy-Cosine
1. Tokenize each string as a set of tokens.
2. Allow fuzzy matching between tokens.
Silkmoth [PVLDB/DengKMS17], MF-Join [ICDE/WangLZ19]
Token level: token-based similarity
Element level: different kinds of similarities
String Similarity Measures
Missing cases for syntactic similarity:
Token-based similarity: Jaccard, Dice, Cosine, etc.
Character-based similarity: edit distance
Hybrid similarity: Fuzzy-Jaccard, Fuzzy-Dice, Fuzzy-Cosine
s = “University of California San Diego Computer Science”
t = “UCSD CS”
Jaccard(s, t) = Dice(s, t) = Cosine(s, t) = 0
Edit-distance(s, t) = 44
Fuzzy-Jaccard(s, t) = Fuzzy-Dice(s, t) = Fuzzy-Cosine(s, t) = 0
All of these measures conclude that s and t are not similar at all.
String Similarity Measures
Using synonyms to enhance string similarity measures
Source of synonyms:
Obtained by machine learning models
Provided by domain experts
Extracted from knowledge bases
Example synonyms:
AWS = Amazon Web Services
CIKM = ACM International Conference on Information and Knowledge Management
UW = University of Washington
UW = University of Wisconsin
UW = University of Waterloo
String Similarity Measures
JaccT [ICDE/ArasuCK08]
Transformation-based operation
Enumerate all the transformed strings, pick the pair with largest score
Strings:
s = “VLDB Conf”
t = “Large Database Conference”
Synonyms:
VLDB = Very Large Database
Conf = Conference
Intermediate results:
s1 = “VLDB Conf”, s2 = “Very Large Database Conf”, s3 = “VLDB Conference”, s4 = “Very Large Database Conference”
t1 = “Large Database Conference”, t2 = “Large Database Conf”
SIM(s, t) = Jaccard(s4, t1) = 3/4 = 0.75
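A toy version of the enumerate-and-pick-best idea, applying token-level rules in one direction only (JaccT itself also rewrites in the reverse direction, e.g. producing t2 above); all names are illustrative:

```python
from itertools import product

def expansions(tokens, rules):
    """Yield every token list obtainable by independently replacing
    each token with one of its rule right-hand sides (or keeping it)."""
    choices = []
    for tok in tokens:
        alts = [[tok]] + [rhs.split() for lhs, rhs in rules if lhs == tok]
        choices.append(alts)
    for combo in product(*choices):
        yield [t for part in combo for t in part]

def jacct(s: str, t: str, rules) -> float:
    """Best Jaccard over all pairs of transformed variants."""
    best = 0.0
    for sv in expansions(s.split(), rules):
        for tv in expansions(t.split(), rules):
            a, b = set(sv), set(tv)
            best = max(best, len(a & b) / len(a | b))
    return best

rules = [("VLDB", "Very Large Database"), ("Conf", "Conference")]
print(jacct("VLDB Conf", "Large Database Conference", rules))  # 0.75
```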
String Similarity Measures
PKduck [PVLDB/TaoDS17]
Transformation-based operation
Apply to one side each time
Strings:
s = “VLDB Conf”
t = “Large Database Conference”
Synonyms:
VLDB = Very Large Database
Conf = Conference
Intermediate results (transformations applied to one side at a time):
s1 = “VLDB Conf”, s2 = “Very Large Database Conf”, s3 = “VLDB Conference”, s4 = “Very Large Database Conference”
t = “Large Database Conference”
SIM(s, t) = Jaccard(s4, t) = 3/4 = 0.75
String Similarity Measures
JACCAR [EDBT/WangLLZ19]
Transformation-based operation
Apply to only one side
Strings:
s = “VLDB Conf”
Document t: “… Important dates for contributions to the 35th international very large database conference held in USA …”
Synonyms:
VLDB = Very Large Database
Conf = Conference
Intermediate results (transformations applied to s only):
s1 = “VLDB Conf”, s2 = “Very Large Database Conf”, s3 = “VLDB Conference”, s4 = “Very Large Database Conference”
SIM(s, t) = Jaccard(s4, t) = 4/4 = 1.0
String Similarity Measures
Full-Expansion [SIGMOD/LuLWLW13]
Expansion-based operation
Apply all the applicable rules
Strings:
s = “ACM's Special Interest Group on Management Of Data NY USA”
t = “SIGMOD New York United States of America”
Synonyms:
ACM’s = Association for Computing Machinery’s
SIGMOD = ACM's Special Interest Group on Management Of Data
SIGMOD = International Conference on Management of Data
NY = New York
USA = United States of America
Intermediate results (expanding using all synonyms):
s’ = “ACM's Special Interest Group on Management Of Data SIGMOD NY New York USA United States America Association for Computing Machinery’s”
t’ = “ACM's Special Interest Group on Management of Data SIGMOD NY New York USA United States America International Conference”
SIM(s, t) = Jaccard(s’, t’) = 16/22 = 0.72
String Similarity Measures
Selective-Expansion [SIGMOD/LuLWLW13]
Expansion-based operation
Apply only good applicable rules
Strings:
s = “ACM's Special Interest Group on Management Of Data NY USA”
t = “SIGMOD New York United States of America”
Synonyms:
ACM’s = Association for Computing Machinery’s
SIGMOD = ACM's Special Interest Group on Management Of Data
SIGMOD = International Conference on Management of Data
NY = New York
USA = United States of America
Intermediate results (applying only the good rules):
s’ = “ACM's Special Interest Group on Management Of Data SIGMOD NY New York USA United States America”
t’ = “ACM's Special Interest Group on Management of Data SIGMOD NY New York USA United States America”
SIM(s, t) = Jaccard(s’, t’) = 16/16 = 1.0
String Similarity Measures
GTS [CIKM/XuL18] Using taxonomy to enhance string similarity
Hierarchical taxonomy built from Wikipedia categories (figure), e.g. latte and espresso are types of coffee, which is a drink; bar and coffeehouse are types of restaurants.
s = “American latte”
t = “American espresso”
String Similarity Measures
GTS [CIKM/XuL18]: using taxonomy to enhance string similarity
Node similarity and set similarity are defined over the hierarchical taxonomy (formulas given as figures in the slides).
String Similarity Measures
USIM [PVLDB/XuL19] combines three sources of evidence:
Hierarchical taxonomy (Wikipedia categories), e.g. latte and espresso under coffee
Synonyms, e.g. VLDB = Very Large Databases, USA = American, Info = Information, CS = Computer Science, UW = University of Washington / Wisconsin / Waterloo
Syntactic similarity, e.g. ED(string, strng) = 1, ED(database, databases) = 1, ED(Univeristy, University) = 2, ED(cup, cups) = 1
String Similarity Measures
USIM [PVLDB/XuL19]
s = “USA cold latte”
t = “American cool espresso”
When the strings are segmented, the taxonomy (latte and espresso under coffee), the synonyms (USA = American), and syntactic similarity (ED(cold, cool) = 2) each capture one pair of segments.
USIM [PVLDB/XuL19]
String Similarity Measures
Simple case: when strings are segmented, max similarity can be obtained.
Example: three kinds of inconsistencies in the figure are captured.
USIM [PVLDB/XuL19]
String Similarity Measures
However, raw strings are not segmented. To make things even harder, a pair of strings may have more than one segmentation plan. How can we know P1 is the best?
Outline
Motivation & Problem Definition
String Similarity Measures
String Similarity Search
String Similarity Join
Conclusion
String Similarity Search
Given a query string s and a collection of strings t1, …, tm, output the pairs (s, tj) such that s and tj are similar.
Threshold-based string similarity search:
Input: a set of strings S, a query q, a threshold value θ
Output: the set of string pairs (s, q) where SIM(s, q) > θ
Top-k string similarity search:
Input: a set of strings S, a query q, an integer k
Output: the k string pairs (s, q) with the highest similarities
String Similarity Search
⮚ Threshold-based Search
❖ Syntactic similarity
Token-based similarity: DivideSkip [ICDE/LiLL08]
Character-based similarity: DivideSkip [ICDE/LiLL08], V-Gram [VLDB/LiWY07]
❖ Semantic similarity: SI-Search [TODS/LuLWLX15]
⮚ Top-k Search
❖ Syntactic similarity
Token-based similarity: HS-Topk [ICDE/WangLDZF15]
Character-based similarity: HS-Topk [ICDE/WangLDZF15], TopkSearch [ICDE/DengLFL13], AppGram [PVLDB/WangDTZ13]
String Similarity Search
Filter-and-verification framework
Count filter: if two strings are similar, they must have at least T common tokens (token-based similarity) or q-grams (character-based similarity).
String Similarity Search
Count filter example (threshold = 0.6):
s = “Information and Knowledge Management”
t = “Management of Knowledge Base”
T = ⌈0.6 / (1 + 0.6) × (4 + 4)⌉ = 3, so s and t need at least 3 common tokens.
They share only 2 common tokens, so the pair is not an answer (Jaccard = 0.33 < 0.6).
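The count threshold follows from |X ∩ Y| ≥ θ·|X ∪ Y| together with |X ∪ Y| = |X| + |Y| − |X ∩ Y|, giving T = ⌈θ/(1+θ)·(|X|+|Y|)⌉. A sketch reproducing the slide's numbers (function name is ours):

```python
import math

def count_threshold(theta: float, s_tokens, t_tokens) -> int:
    """Minimum number of common tokens two strings must share
    for Jaccard(s, t) >= theta."""
    return math.ceil(theta / (1 + theta) * (len(s_tokens) + len(t_tokens)))

s = "Information and Knowledge Management".split()
t = "Management of Knowledge Base".split()
T = count_threshold(0.6, s, t)
print(T)                     # 3
print(len(set(s) & set(t)))  # 2 -> fails the count filter, safe to prune
```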
String Similarity Search
Index: inverted lists with q-grams as keys and string ids as values.
s1 = “abcd” → 2-grams {ab, bc, cd}
s2 = “ab” → {ab}
s3 = “bcde” → {bc, cd, de}
Inverted index: ab → {s1, s2}, bc → {s1, s3}, cd → {s1, s3}, de → {s3}
For a query Q, compute the count threshold T and probe the index.
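Building this inverted index is a few lines of Python (identifiers are illustrative):

```python
from collections import defaultdict

def build_index(strings: dict, q: int = 2) -> dict:
    """Inverted index: q-gram -> list of ids of strings containing it."""
    index = defaultdict(list)
    for sid, s in strings.items():
        for gram in {s[i:i + q] for i in range(len(s) - q + 1)}:
            index[gram].append(sid)
    return index

index = build_index({"s1": "abcd", "s2": "ab", "s3": "bcde"})
print(sorted(index["ab"]))  # ['s1', 's2']
print(sorted(index["bc"]))  # ['s1', 's3']
```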
String Similarity Search
For a query Q = “abc”, generate its 2-grams Q’ = {ab, bc} and merge the inverted lists of ab → {s1, s2} and bc → {s1, s3} to count common grams: an expensive operation.
String Similarity Search
Merging the inverted lists is the expensive operation; ScanCount, MergeSkip, and DivideSkip improve this step.
String Similarity Search
ScanCount:
Maintain an array of counts, one per string, initialized to 0.
For each record in an inverted list, increase its count by 1.
Strings whose counts reach the threshold are candidates.
Drawback: needs to process all inverted lists containing tokens of the query.
MergeSkip:
Sort the string ids in each inverted list.
Maintain a heap and increase counts only for popped strings.
Benefit: prunes irrelevant records.
DivideSkip [ICDE/LiLL08]:
Improves MergeSkip by sorting and grouping the inverted lists.
Benefit: avoids scanning long inverted lists.
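ScanCount itself is short; a sketch over a toy index (names are ours):

```python
from collections import defaultdict

def scan_count(index: dict, query_grams: set, T: int) -> set:
    """ScanCount: one counter per string id; strings reaching T
    common grams with the query become candidates."""
    counts = defaultdict(int)
    for gram in query_grams:
        for sid in index.get(gram, []):
            counts[sid] += 1
    return {sid for sid, c in counts.items() if c >= T}

index = {"ab": ["s1", "s2"], "bc": ["s1", "s3"],
         "cd": ["s1", "s3"], "de": ["s3"]}
print(scan_count(index, {"ab", "bc"}, T=2))  # {'s1'}
```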
String Similarity Search
V-Gram [VLDB/LiWY07]
Observation: some grams may be very frequent while others are infrequent, so fixed-length grams may be inefficient.
Solution: use variable-length grams to avoid generating very frequent grams.
String Similarity Search
SI-Search [TODS/LuLWLX15]
Query: build a QP-tree.
Strings: build an SI-tree.
Applies a length filter and a prefix filter over the trees.
String Similarity Search
Running top-k search by iteratively tuning the threshold of a threshold-based search performs badly, because good thresholds are hard to estimate.
A more efficient way: use a priority queue that always examines the strings most likely to appear in the final results.
String Similarity Search
HS-Topk [ICDE/WangLDZF15]
Creates a hierarchical segment tree index (HS-Tree): first groups the strings by length, then constructs a complete binary tree for each group.
Two pruning strategies:
Greedy-match: prune the strings with consecutive errors.
Batch-pruning-based: prune the strings by computing upper and lower bounds.
Drawback: the filter step is inefficient for large thresholds.
String Similarity Search
TopkSearch [ICDE/DengLFL13]
Improves the traditional dynamic-programming algorithm for edit distance to avoid trying a large number of edit-distance thresholds.
Drawback: inefficient for long strings.
String Similarity Search
AppGram [PVLDB/WangDTZ13]
Uses approximate q-gram matchings.
Uses a queue for strings with small distances and two filtering strategies (the CA strategy and the f-queue strategy) to prune the others.
Uses a max-heap to maintain the top-k strings most similar to the query.
Drawback: high space complexity.
Outline
Motivation & Problem Definition
String Similarity Measures
String Similarity Search
String Similarity Join
String Similarity Join
Given two collections of strings s1, …, sn and t1, …, tm, output the pairs (si, tj) such that si and tj are similar.
Threshold-based string similarity join:
Input: a set of strings S, a set of strings T, a threshold value θ
Output: the set of string pairs (s, t) where SIM(s, t) > θ
Top-k string similarity join:
Input: a set of strings S, a set of strings T, an integer k
Output: the k string pairs (s, t) with the highest similarities
String Similarity Join
⮚ Threshold-based Join
❖ Syntactic similarity
Token-based similarity: GramCount [VLDB/GravanoIJKMS01], PPJoin [WWW/XiaoWLY08], PartEnum [VLDB/ArasuGK06]
Character-based similarity: PassJoin [PVLDB/LiDWF11], Trie-Join [PVLDB/WangLF10], Qchunk [SIGMOD/QinWLXL11], ED-Join [PVLDB/XiaoWL08]
❖ Semantic similarity: JaccT [ICDE/ArasuCK08], SI-Join [SIGMOD/LuLWLW13], Pkduck [PVLDB/TaoDS17], AP-Join [CIKM/XuL18], USIM [PVLDB/XuL19]
⮚ Top-k Join: Topk-Join [ICDE/XiaoWLS09], Bed-Join [SIGMOD/ZhangHOS10]
String Similarity Join
Prefix filter:
1. Define a global order.
2. Order the tokens of each string by the global order.
3. Select the first T tokens as signatures.
4. Check whether the signatures overlap.
String Similarity Join
Prefix filter example (threshold = 0.8):
Global ordering: {a b c d e f g h i j k l}
s1 = “c, k, e, a, f”, s2 = “d, b, f, e, k”
1. Order the strings: s1’ = “a, c, e, f, k”, s2’ = “b, d, e, f, k”
2. Signature length T = (1 − 0.8) × 5 + 1 = 2
3. Get signatures: Sig(s1) = “a, c”, Sig(s2) = “b, d”
4. No overlap, so Jacc(s1, s2) < 0.8
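A sketch of the prefix-filter signature; the prefix length (1 − θ)·|s| + 1 from the example equals |s| − ⌈θ·|s|⌉ + 1, and a small epsilon guards the floating-point ceiling (names are ours; the exact prefix length depends on the similarity function):

```python
import math

def prefix_signature(tokens, theta, order):
    """Sort tokens by the global order and keep the first
    T = |s| - ceil(theta * |s|) + 1 tokens as the prefix signature."""
    ordered = sorted(tokens, key=order.index)
    T = len(tokens) - math.ceil(theta * len(tokens) - 1e-9) + 1
    return set(ordered[:T])

order = list("abcdefghijkl")
s1 = ["c", "k", "e", "a", "f"]
s2 = ["d", "b", "f", "e", "k"]
sig1 = prefix_signature(s1, 0.8, order)   # {'a', 'c'}
sig2 = prefix_signature(s2, 0.8, order)   # {'b', 'd'}
print(sig1 & sig2)  # set() -> no overlap, Jaccard(s1, s2) < 0.8, prune
```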
String Similarity Join
Length filter: if two strings are similar, their length difference cannot be large.
Threshold = 0.8
s = “Database Conference”, |s| = 2 tokens
r = “International Conference on Information and Knowledge Management”, |r| = 7 tokens
|r| > |s| / 0.8, so Jaccard(s, r) < 0.8
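The bound behind the length filter is Jaccard(s, t) ≤ min(|s|, |t|) / max(|s|, |t|), since the intersection is at most the smaller set and the union at least the larger one. A sketch (function name is ours):

```python
def passes_length_filter(s_tokens, t_tokens, theta) -> bool:
    """Jaccard(s, t) <= min(|s|, |t|) / max(|s|, |t|), so pairs whose
    token-length ratio is below theta can be pruned without verification."""
    a, b = len(set(s_tokens)), len(set(t_tokens))
    return min(a, b) / max(a, b) >= theta

s = "Database Conference".split()
r = "International Conference on Information and Knowledge Management".split()
print(passes_length_filter(s, r, 0.8))  # False -> prune (2/7 < 0.8)
```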
String Similarity Join
GramCount [VLDB/GravanoIJKMS01]
Uses the length filter and the count filter.
Supports both token-based and character-based similarity.
String Similarity Join
ED-Join [PVLDB/XiaoWL08]
Use prefix filtering
Support only edit distance
Two further optimizations: position filtering and content filtering.
Position filtering: removes unnecessary signatures produced by the prefix filter, e.g. with Signatures(“sigmod”) = {“si”, “gm”, “od”}, if destroying both “si” and “gm” already takes at least τ + 1 edit operations, the last signature can be dropped.
Content filtering: uses the L1 distance between character-frequency vectors as a bound: the edit distance cannot be smaller than half of the L1 distance.
s = “sigmod”, t = “sigkdd”: L1(s, t) / 2 = 4 / 2 = 2, so ED(s, t) ≥ 2.
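Content filtering can be sketched with character-frequency Counters; one edit operation changes the frequency vector by at most 2, hence ED ≥ L1/2 (function names are ours):

```python
from collections import Counter

def l1_char_distance(s: str, t: str) -> int:
    """L1 distance between the character-frequency vectors of s and t."""
    cs, ct = Counter(s), Counter(t)
    return sum(abs(cs[c] - ct[c]) for c in set(cs) | set(ct))

def ed_lower_bound(s: str, t: str) -> float:
    # one edit changes at most two frequency entries by 1 each,
    # so ED(s, t) >= L1(s, t) / 2
    return l1_char_distance(s, t) / 2

print(l1_char_distance("sigmod", "sigkdd"))  # 4
print(ed_lower_bound("sigmod", "sigkdd"))    # 2.0
```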
String Similarity Join
Qchunk [SIGMOD/QinWLXL11]
Use prefix filtering
Support only edit distance
Two types of signatures: q-grams and q-chunks
q-chunks: the q-grams with starting positions at i∗q+1 (0 ≤ i ≤ (l−1)/q)
Strategy 1: index q-grams of r and use q-chunks of s to generate candidates.
Strategy 2: index q-chunks of r and use q-grams of s to generate candidates.
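The two signature types can be sketched as follows (a toy illustration; the end-of-string padding symbol is an assumption, not fixed by the slide):

```python
def q_grams(s, q):
    # all overlapping substrings of length q
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def q_chunks(s, q, pad="$"):
    # the q-grams starting at positions 1, q+1, 2q+1, ... (1-indexed),
    # i.e. disjoint chunks; pad the end so the last chunk has length q
    padded = s + pad * ((-len(s)) % q)
    return [padded[i:i + q] for i in range(0, len(padded), q)]

print(q_grams("sigmod", 2))   # ['si', 'ig', 'gm', 'mo', 'od']
print(q_chunks("sigmod", 2))  # ['si', 'gm', 'od']
```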
String Similarity Join
PPJoin [WWW/XiaoWLY08]
Support Jaccard, Cosine, Dice
Two further optimizations: Position Filtering and Suffix Filtering
Position Filtering: Enhance prefix filter by computing upper bounds
Suffix Filtering: probe tokens in the suffix to estimate a tighter upper bound
Position filtering example (Threshold = 0.8):
Sig(s) = {B, C, D, E, F}, Sig(t) = {A, B, C, D, F}
lower bound of the union size: 3 + 3 = 6
upper bound of the intersection size: 1 + 3 = 4
upper bound of the Jaccard: 4/6 < 0.8 ⇒ prune the pair

Suffix filtering example:
s = {A, B, D, E, ?, ?, ?, ?, ?, ?, Q, ?, ?, ?, ?, ?, ?, ?}
t = {A, C, D, E, ?, ?, ?, ?, Q, ?, ?, ?, ?, ?, ?, ?, ?, ?}
lower bound of the union size: 5 + 1 + 6 + 9 = 21
upper bound of the intersection size: 3 + 1 + 4 + 7 = 15
upper bound of the Jaccard: 15/21 < 0.84 ⇒ prune the pair
String Similarity Join: classification

Threshold-based Join
  Syntactic Similarity
    Token-based Similarity: GramCount [vldb/GravanoIJKMS01], PPJoin [www/XiaoWLY08]
    Character-based Similarity: PassJoin [pvldb/LiDWF11], Trie-Join [pvldb/WangLF10], PartEnum [VLDB/ArasuGK06], Qchunk [sigmod/QinWLXL11], ED-Join [pvldb/XiaoWL08]
  Semantic Similarity (Token-based): JaccT [ICDE/ArasuCK08], SI-Join [SIGMOD/LuLWLW13], Pkduck [PVLDB/TaoDS17], AP-Join [CIKM/XuL18], USIM [PVLDB/XuL19]
Top-k Join: Topk-Join [ICDE/XiaoWLS09], Bed-Join [SIGMOD/ZhangHOS10]
String Similarity Join
JaccT[ICDE/ArasuCK08]
Use synonym rules
Transformation-based operation
Filtering: Prefix filter based signature
Verification: Enumerate all possible transformed strings, then apply Jaccard
String Similarity Join
SI-Join [SIGMOD/LuLWLW13]
Use synonym rules
Expansion-based operation
Filtering: length filter and prefix filter
Propose signature size estimation to choose good signatures
Verification: Use Full-Expansion and Selective-Expansion
String Similarity Join
SI-Join [SIGMOD/LuLWLW13]: workflow
Filtering (prefix filter + length filter) → candidates → Verification (Full expansion / Selective expansion)
String Similarity Join
AP-Join [CIKM/XuL18]
Use taxonomy rules
Transformation-based operation
Filtering: length filter and prefix filter
Propose an approximate segmentation algorithm to segment strings
Verification: Use GTS similarity measure
String Similarity Join
USIM [PVLDB/XuL19]
Use synonym and taxonomy rules
Transformation-based operation
Filtering: length filter and prefix filter
Propose an approximate segmentation algorithm to segment strings
Verification: Use USIM similarity measure
1. Generate pebbles for all similarities;
2. Select signatures for prefix filtering (extension: allowing more than one overlap);
3. Perform filtering by finding pairs having enough overlaps as candidates;
4. Verify the unified similarity for each candidate.
USIM pipeline: (Taxonomy, Synonym, Jaccard) → Pebbles → Prefix Selection → Prefix Signature → Prefix Filtering → Candidates → Calculate sim. → Results
String Similarity Join
Topk-Join [ICDE/XiaoWLS09]
Extends the prefix filtering
Assigns each token a weight, which bounds the largest possible similarity
Uses a priority queue to store the current top-k candidates
Example: the tokens of “FCS, Journal, Computer, Science” get weights 1, 0.75, 0.5, 0.25;
for a string “… Journal …” sharing only “Journal”, the maximum possible similarity is 0.75
String Similarity Join
Bed-Join [SIGMOD/ZhangHOS10]
Uses a B+-tree (the Bed-tree) as index
Supports only edit distance
Uses several mapping functions to transform strings into integer values:
1. Dictionary order
2. Gram counting order
3. Gram location order
Distributed String Similarity Join

Algorithm | # runs of reading original data | avoid duplicates | apply filters | load balancing | predictable size in Reduce | avoid reading original data | candidates generation | MapReduce
RIDPairsPPJoin [SIGMOD/VernicaCL10] | ≥3 | × | √ | √ | √ | × | common prefix | ×
V-Smart-Join [PVLDB/MetwallyF12] | ≥3 | √ | × | √ | × | × | enumerate pairs in inverted lists | √
MassJoin [ICDE/DengLHWF14] | ≥3 | × | √ | √ | × | × | enumerate pairs in inverted lists | ×
FS-Join [ICDE/RongLSWLD17] | 2 | √ | √ | √ | √ | √ | common prefix + segmentation | √

(“avoid duplicates” through “predictable size in Reduce” concern the filtering phase; “avoid reading original data” and “candidates generation” concern the verification phase.)
Distributed similarity join framework (MapReduce):
string collection (HDFS) → Ordering (MapReduce: global order calculator → global order) → Filtering (MapReduce: data partitioner over the ordered token sets → candidate generator → candidates) → Verification (MapReduce: candidate aggregator) → Results: similar pairs (HDFS)
Distributed String Similarity Join
FS-Join [ICDE/RongLSWLD17]
(a) Original sets:
S1 = {D, K, E, F}   S2 = {I, C, J, B, K}   S3 = {H, B, I, A}
S4 = {A, D, C, J, F}   S5 = {G, I, D, E}   S6 = {H, J, B, G, K}

(b) Re-ordered sets, under the global ordering A B C D E F G H I J K:
S1 = {D, E, F, K}   S2 = {B, C, I, J, K}   S3 = {A, B, H, I}
S4 = {A, C, D, F, J}   S5 = {D, E, G, I}   S6 = {B, G, H, J, K}

(c) Partitioning based on pivots {C, F, I}: each set is split into four segments, one per fragment:
      fragment 1   fragment 2   fragment 3   fragment 4
S1:   {}           {D, E, F}    {}           {K}
S2:   {B, C}       {}           {I}          {J, K}
S3:   {A, B}       {}           {H, I}       {}
S4:   {A, C}       {D, F}       {}           {J}
S5:   {}           {D, E}       {G, I}       {}
S6:   {B}          {}           {G, H}       {J, K}
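The partitioning in (c) can be sketched as follows (a minimal sketch; the convention that a token equal to a pivot falls into that pivot's segment is inferred from the example):

```python
from bisect import bisect_left

def vertical_partition(ordered_tokens, pivots):
    """Split a globally ordered token set into len(pivots) + 1 segments;
    a token equal to a pivot falls into that pivot's (left) segment."""
    segments = [[] for _ in range(len(pivots) + 1)]
    for tok in ordered_tokens:
        segments[bisect_left(pivots, tok)].append(tok)
    return segments

# slide example: S2 = {B, C, I, J, K}, pivots {C, F, I}
print(vertical_partition(["B", "C", "I", "J", "K"], ["C", "F", "I"]))
# [['B', 'C'], [], ['I'], ['J', 'K']]
```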
Distributed String Similarity Join
FS-Join [ICDE/RongLSWLD17]
Vertical Partitioning
There are two problems that should be considered:
1. How to select the pivots?
❖ Random Selection (Random)
❖ Even Interval (Even-Interval)
❖ Even Token Frequency (Even-TF)
❑ Generate fragments with the same number of tokens
2. How many pivots should be selected?
Computation Framework of FS-Join
Filter phase (first MapReduce round): generate candidate string pairs.
❖ map: emit each segment keyed by its fragment id, e.g. (1, Seg1 of S1), (2, Seg2 of S1), …
❖ group by keys: all segments of the same fragment reach the same reducer
❖ reduce: within each fragment, count the common tokens of every pair of segments and emit ((Si, Sj), # of common tokens), e.g. (S1, S4), 2 and (S2, S6), 1

Verification phase (second MapReduce round): produce the final similarity join results.
❖ map: re-emit the ((Si, Sj), count) pairs keyed by the string pair
❖ group by keys: the per-fragment counts of the same pair, e.g. (S2, S3), 1 from one fragment and (S2, S3), 1 from another, reach the same reducer
❖ reduce: sum the per-fragment counts and verify the similarity of each candidate pair to output the results
Filtering Methods
⮚ String Length Filtering (StrL-Filter)
⮚ Segment Length Filtering (SegL-Filter)
⮚ Segment Intersection Filtering (SegI-Filter)
⮚ Segment Difference Filtering (SegD-Filter)
Comparison with Existing Methods
(Figures: execution time (×10² sec) vs. similarity threshold, for thresholds 0.75 to 0.9; one plot compares RidPairsPPJoin with FS-Join, the other compares Online-Aggregation, Merge, Merge+Light, RidPairsPPJoin and FS-Join.)
References
[VLDB/GravanoIJKMS01] Luis Gravano, Panagiotis G. Ipeirotis, H. V. Jagadish, Nick Koudas, S. Muthukrishnan, Divesh Srivastava: Approximate String Joins in a Database (Almost) for Free. VLDB 2001: 491-500
[VLDB/ArasuGK06] Arvind Arasu, Venkatesh Ganti, Raghav Kaushik: Efficient Exact Set-Similarity Joins. VLDB 2006: 918-929
[VLDB/LiWY07] Chen Li, Bin Wang, Xiaochun Yang: VGRAM: Improving Performance of Approximate Queries on String Collections
Using Variable-Length Grams. VLDB 2007: 303-314
[ICDE/ArasuCK08] Arvind Arasu, Surajit Chaudhuri, Raghav Kaushik: Transformation-based Framework for Record Matching. ICDE 2008: 40-49
[ICDE/LiLL08] Chen Li, Jiaheng Lu, Yiming Lu: Efficient Merging and Filtering Algorithms for Approximate String Searches. ICDE
2008: 257-266
[WWW/XiaoWLY08] Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu: Efficient similarity joins for near duplicate detection. WWW
2008: 131-140
[PVLDB/XiaoWL08] Chuan Xiao, Wei Wang, Xuemin Lin: Ed-Join: an efficient algorithm for similarity joins with edit distance
constraints. PVLDB 1(1): 933-944 (2008)
[ICDE/XiaoWLS09] Chuan Xiao, Wei Wang, Xuemin Lin, Haichuan Shang: Top-k Set Similarity Joins. ICDE 2009: 916-927
[SIGMOD/ZhangHOS10] Zhenjie Zhang, Marios Hadjieleftheriou, Beng Chin Ooi, Divesh Srivastava: Bed-tree: an all-purpose index
structure for string similarity search based on edit distance. SIGMOD Conference 2010: 915-926
[PVLDB/WangLF10] Jiannan Wang, Guoliang Li, Jianhua Feng: Trie-Join: Efficient Trie-based String Similarity Joins with Edit-Distance Constraints. PVLDB 3(1): 1219-1230 (2010)
[SIGMOD/VernicaCL10] Rares Vernica, Michael J. Carey, Chen Li: Efficient parallel set-similarity joins using MapReduce. SIGMOD
Conference 2010: 495-506
[SIGMOD/QinWLXL11] Jianbin Qin, Wei Wang, Yifei Lu, Chuan Xiao, Xuemin Lin: Efficient exact edit similarity query processing
with the asymmetric signature scheme. SIGMOD Conference 2011: 1033-1044
References
[ICDE/WangLF11] Jiannan Wang, Guoliang Li, Jianhua Feng: Fast-join: An efficient method for fuzzy token matching based string similarity join. ICDE 2011: 458-469
[PVLDB/LiDWF11] Guoliang Li, Dong Deng, Jiannan Wang, Jianhua Feng: PASS-JOIN: A Partition-based Method for Similarity
Joins. PVLDB 5(3): 253-264 (2011)
[PVLDB/MetwallyF12] Ahmed Metwally, Christos Faloutsos: V-SMART-Join: A Scalable MapReduce Framework for All-Pair
Similarity Joins of Multisets and Vectors. PVLDB 5(8): 704-715 (2012)
[SIGMOD/LuLWLW13] Jiaheng Lu, Chunbin Lin, Wei Wang, Chen Li, Haiyong Wang: String similarity measures and joins with
synonyms. SIGMOD Conference 2013: 373-384
[ICDE/DengLFL13] Dong Deng, Guoliang Li, Jianhua Feng, Wen-Syan Li: Top-k string similarity search with edit-distance
constraints. ICDE 2013: 925-936
[PVLDB/WangDTZ13] Xiaoli Wang, Xiaofeng Ding, Anthony K. H. Tung, Zhenjie Zhang: Efficient and Effective KNN Sequence
Search with Approximate n-grams. PVLDB 7(1): 1-12 (2013)
[ICDE/DengLHWF14] Dong Deng, Guoliang Li, Shuang Hao, Jiannan Wang, Jianhua Feng: MassJoin: A mapreduce-based method
for scalable string similarity joins. ICDE 2014: 340-351
[TODS/LuLWLX15] Jiaheng Lu, Chunbin Lin, Wei Wang, Chen Li, Xiaokui Xiao: Boosting the Quality of Approximate String
Matching by Synonyms. TODS. 40(3): 15:1-15:42 (2015)
[ICDE/WangLDZF15] Jin Wang, Guoliang Li, Dong Deng, Yong Zhang, Jianhua Feng: Two birds with one stone: An efficient
hierarchical framework for top-k and threshold-based string similarity search. ICDE 2015: 519-530
[ICDE/RongLSWLD17] Chuitian Rong, Chunbin Lin, Yasin N. Silva, Jianguo Wang, Wei Lu, Xiaoyong Du: Fast and Scalable
Distributed Set Similarity Joins for Big Data Analytics. ICDE 2017: 1059-1070
References
[ICDE/ShangLLF17] Zeyuan Shang, Yaxiao Liu, Guoliang Li, Jianhua Feng: K-Join: Knowledge-Aware Similarity Join. ICDE 2017: 23-24
[SEMWEB/SlabbekoornHH12] Kristian Slabbekoorn, Laura Hollink, Geert-Jan Houben: Domain-Aware Ontology Matching. International Semantic Web Conference (1) 2012: 542-558
[PVLDB/TaoDS17] Wenbo Tao, Dong Deng, Michael Stonebraker: Approximate String Joins with Abbreviations. PVLDB 11(1): 53-65 (2017)
[PVLDB/DengKMS17] Dong Deng, Albert Kim, Samuel Madden, Michael Stonebraker: SilkMoth: An Efficient Method for Finding Related Sets with Maximum Matching Constraints. PVLDB 10(10): 1082-1093 (2017)
[CIKM/XuL18] Pengfei Xu, Jiaheng Lu: Efficient Taxonomic Similarity Joins with Adaptive Overlap Constraint. CIKM 2018: 1563-1566
[ICDE/WangLZ19] Jin Wang, Chunbin Lin, Carlo Zaniolo: MF-Join: Efficient Fuzzy String Similarity Join with Multi-level
Filtering. ICDE 2019: 386-397
[EDBT/WangLLZ19] Jin Wang, Chunbin Lin, Mingda Li, Carlo Zaniolo: An Efficient Sliding Window Approach for Approximate Entity
Extraction with Synonyms. EDBT 2019: 109-120
[PVLDB/XuL19] Pengfei Xu, Jiaheng Lu: Towards a Unified Framework for String Similarity Joins. PVLDB 12(11): 1289-1302 (2019)
Part 2
String similarity searches and joins with machine learning
Outline
• Background
• Traditional ML based Approach
• Deep Learning based Approach
• Comprehensive Approach
• Related topics in Natural Language Processing
String Matching can be formulated as…
String as Entity
⮚ The string matching problem is often known as Entity Matching
⮚ Definition: identifying and linking/grouping different manifestations of the same real-world object, e.g.:
❖ Different ways of addressing the same person in text (names, emails, Facebook accounts)
❖ Web pages with different descriptions of the same business
❖ Different photos taken of the same object, etc.
Challenges
⮚ Name/attribute ambiguity, data entry errors, missing data, formatting differences, changing attributes…
Brief introduction of ML
⮚ Machine Learning (ML)
❖ A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. -- Tom Mitchell, 1997
⮚ ML categories
According to whether the learning process uses labelled data:
❖ Supervised Learning
❖ Unsupervised Learning
Supervised vs. Unsupervised Learning
⮚ String matching mostly adopts supervised methods
Supervised ML models
Recurrent Neural Network (RNN)
⮚ Recurrent Neural Networks take the previous output as part of their input; the composite input at time t summarizes historical information from time steps 1 to t-1.
⮚ RNNs are useful as their intermediate values (state) can store information about past inputs for a time that is not fixed a priori.
LSTM: A variant of RNN [HS 1997]

To solve the vanishing gradients problem, LSTM adds a gating mechanism to the RNN and keeps a memory cell that carries information outside the normal flow of the RNN.
Taking the last step’s output as input of the current step, it recurrently updates the representation of the sentence.
The hidden activation vector corresponding to the last word is the sentence embedding vector.
RNN Encoder-decoder [CMG 2014]
⮚ Create a reversible sentence representation.
⮚ The representation can be reconstructed into an actual sentence, which is reasonable and novel.
RNN Encoder-decoder (cont.)
• The conditional distribution of the next symbol:
  P(y_t | y_{t-1}, y_{t-2}, …, y_1, c) = g(h^<t>, y_{t-1}, c)
  h^<t> = f(h^<t-1>, y_{t-1}, c)
• Add a summary (constant) symbol c; it holds the semantics of the sentence.
• For long sentences, add hidden units to remember/forget memory.

Gated Recurrent Unit (GRU): another RNN variant, simpler than LSTM
Word Embedding
⮚ Foundation: the Distributional Hypothesis
⮚ Each word in the vocabulary is represented by a low dimensional vector
⮚ All words are embedded into the same space
⮚ Similar words have similar vectors
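"Similar words have similar vectors" is usually measured with cosine similarity between the embedding vectors; a small sketch (the three-dimensional vectors below are illustrative values, not trained embeddings):

```python
import math

def cosine_similarity(u, v):
    # cosine of the angle between two embedding vectors; close to 1 for similar words
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

king, queen, banana = [0.9, 0.8, 0.1], [0.85, 0.82, 0.15], [0.1, 0.0, 0.9]
print(cosine_similarity(king, queen) > cosine_similarity(king, banana))  # True
```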
RNN Language Model
⮚ Generate much more meaningful text than n-gram models
⮚ The sparse history is projected into some continuous low-dimensional space, where similar histories get clustered
s(t) = f(U·w(t) + W·s(t-1))
y(t) = g(V·s(t))

w(t): input word at time t
s(t): hidden layer state
y(t): output probability distribution over words
U, V, W: transformation matrices
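One step of this model can be sketched in plain Python, assuming f = sigmoid and g = softmax (the slide leaves f and g unspecified; the toy matrix values below are made up):

```python
import math

def rnn_lm_step(U, W, V, w_onehot, s_prev):
    """One RNN language-model step: s(t) = f(U w(t) + W s(t-1)), y(t) = g(V s(t))."""
    def matvec(M, x):
        return [sum(m * xi for m, xi in zip(row, x)) for row in M]
    pre = [a + b for a, b in zip(matvec(U, w_onehot), matvec(W, s_prev))]
    s = [1.0 / (1.0 + math.exp(-z)) for z in pre]  # f = sigmoid (assumption)
    logits = matvec(V, s)                           # g = softmax (assumption)
    mx = max(logits)
    exps = [math.exp(z - mx) for z in logits]
    total = sum(exps)
    y = [e / total for e in exps]
    return s, y

# toy sizes: vocabulary of 3 words, hidden layer of 2 units
U = [[0.1, 0.2, 0.0], [0.0, 0.1, 0.3]]
W = [[0.5, 0.0], [0.0, 0.5]]
V = [[0.2, 0.1], [0.1, 0.2], [0.3, 0.3]]
s, y = rnn_lm_step(U, W, V, [1, 0, 0], [0.0, 0.0])
print(abs(sum(y) - 1.0) < 1e-9)  # True: y(t) is a probability distribution over words
```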
Word2vec [MCC 2013]
Directly learn the representation of words using context words
Maximize the objective function over the whole corpus:
  Σ_(w,c)∈D Σ_(wj∈c) log P(wj | w)

Two variants:
Skip-gram: given the word, predict the context.
  Works well with small training data; represents well even rare words or phrases.
CBOW: given the context, predict the word.
  Faster to train the model; better accuracy for the frequent words.
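Skip-gram training data generation can be sketched as follows (a toy illustration only; the real word2vec implementation also applies subsampling, dynamic window sizes and negative sampling):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center word, context word) training pairs: for each position,
    pair the word with every word within `window` positions of it."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

print(skipgram_pairs(["the", "matrix", "movie"], window=1))
# [('the', 'matrix'), ('matrix', 'the'), ('matrix', 'movie'), ('movie', 'matrix')]
```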
GloVe [PSM 2014]

⮚ Without the Distributional Hypothesis.
⮚ Constructs the word-word co-occurrence matrix of the whole corpus.
⮚ Inspired by LSA, uses matrix factorization to produce word representations.

Loss function (to minimize):
  J = Σ_(i,j) f(X_ij) (w_i^T w_j - log X_ij)^2
where X_ij counts how often word j occurs in the context of word i, and the w are word vectors.
Phrase2vec [MSC 2013]
⮚ From Word2vec to Phrase2vec
❖ Enlarge phrase vocabulary by analogical reasoning task.
❑ A : B = C : ?
e.g. In word vector space, “Man” + “King” – “Woman” = “Queen”
❑ Phrase A : Phrase B = Phrase C : Phrase D
❖ Phrase Skip-gram Model
❑ Treat phrase in vocabulary like a word.
❑ Incorporates Hierarchical Softmax and Negative Sampling.
e.g. New York : New York Times = San Jose : ?
Candidate answers: San Jose Airport, San Jose State Univ., San Jose Mercury News, …
Phrase D is approximately produced, which could enlarge the phrase vocabulary.
(Phrases are words that appear frequently together in the corpus.)
Paragraph2vec [LM 2014]
⮚ An extension of word2vec model, using a global sentence vector as context.
⮚ When updating word vectors in each iteration, paragraph matrix will also be updated.
Predict the next word according to context
Categorization of ML based String Matching
⮚ Traditional ML approach
❖ Acquire features from entities
❖ Use classification/clustering methods to make decision
⮚ Deep Learning approach
❖ Encode entity/attribute with low-dimensional vectors
❖ Use deep neural network for representation learning
⮚ Comprehensive approach
❖ End-to-end system
❖ Combination of different approaches
❖ Improved usability
Basics of ML approaches
⮚ ML model can be either traditional ones or deep learning ones
Feature Engineering
⮚ For Deep Learning approaches, representation learning is used instead of such explicit features.
Outline
• Background
• Traditional ML based Approach
• Deep Learning based Approach
• Comprehensive Approach
• Related topics in Natural Language Processing
Rule-based approach [FS 1969]
⮚ r = (x, y) is a record pair, γ is its comparison vector, M: matches, U: non-matches
⮚ Decision Rule
⮚ Naïve Bayes Assumption
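The decision rule and the naïve Bayes assumption, whose formulas did not survive extraction, take the standard Fellegi-Sunter form; a sketch of the usual statement:

```latex
% Likelihood-ratio decision rule with thresholds t_mu >= t_lambda
R(\gamma) = \frac{P(\gamma \mid r \in M)}{P(\gamma \mid r \in U)},
\qquad
\text{decision} =
\begin{cases}
\text{match} & \text{if } R(\gamma) \ge t_{\mu},\\
\text{possible match} & \text{if } t_{\lambda} < R(\gamma) < t_{\mu},\\
\text{non-match} & \text{if } R(\gamma) \le t_{\lambda}.
\end{cases}

% Naive Bayes assumption: the components of the comparison vector are
% conditionally independent given the class
P(\gamma \mid r \in M) = \prod_{i} P(\gamma_i \mid r \in M),
\qquad
P(\gamma \mid r \in U) = \prod_{i} P(\gamma_i \mid r \in U).
```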
Supervised approaches
⮚ The most popular category
❖ Decision trees [TKM 2001]
❖ Support vector machines [BM 2003]
❖ Ensembles of classifiers [CKM 2009]
❖ Conditional Random Fields (CRF) [GS 2009]
⮚ However, there might be potential problems…
Potential drawback: insufficient training data
⮚ Constructing a training set is hard, since most pairs of records are “easy non-matches”, e.g.:
❖ 100 records from 100 cities
❖ Only 1% of the total pairs come from the same city
⮚ Some pairs are hard to judge even by humans
❖ Inherently ambiguous
E.g., Paris Hilton (person or business)
❖ Missing attributes
E.g., Starbucks, Toronto vs. Starbucks, Queen Street, Toronto
Other alternatives
⮚ Semi-supervised/Unsupervised Learning
❖ EM based techniques to learn parameters [Winkler 2006]
❖ Generative Models [RC 2004]
⮚ Active Learning
❖ Committee of Classifiers [SB 2002]
❖ Provably optimizing precision/recall [AGK 2010], [BIP 2012]
❖ Crowdsourcing [WKF 2012]
Summary
⮚ Supervised Learning: The main stream
⮚ Make use of different features and models
⮚ Bottleneck
❖ Require feature engineering
❖ Insufficient training data
⮚ Active Learning and crowdsourcing methods could be promising
Outline
• Background
• Traditional ML based Approach
• Deep Learning based Approach
• Comprehensive Approach
• Related topics in Natural Language Processing
DeepER [ETJ 2018]
⮚ Adopt Deep Learning models to
❖ Capture both syntactic and semantic information
❖ Avoid labor-intensive manual feature engineering
⮚ End-to-end training the model
⮚ Use Locality Sensitive Hashing for identifying similar entities
DeepER: Methodology
⮚ Use word embedding on each word in an entity
⮚ Encode the whole entity with RNN models
⮚ Compare the similarity with some predefined metrics, e.g. element-wise comparison
⮚ Classification layer: use activation functions, e.g. softmax, sigmoid
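A minimal sketch of the comparison and classification steps, assuming element-wise absolute difference as the predefined metric and a single sigmoid unit as the classification layer (illustrative, not DeepER's exact architecture):

```python
import math

def similarity_vector(u, v):
    """Element-wise comparison of two entity embeddings: the absolute
    difference per dimension (one common choice of predefined metric)."""
    return [abs(a - b) for a, b in zip(u, v)]

def classify(sim_vec, weights, bias):
    # classification layer: a single sigmoid unit over the similarity vector
    z = sum(w * x for w, x in zip(weights, sim_vec)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # probability that the pair matches

# identical entity embeddings -> similarity vector of zeros
u = [0.2, 0.7, 0.1]
print(similarity_vector(u, u))  # [0.0, 0.0, 0.0]
```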
LinkNBed [TSD 2018]
⮚ General idea: Jointly learn representations and entity linkage
⮚ Key Insights
Model Architecture
Deep Matcher [MLR 2018]

⮚ Three Components
❖ Encode attributes into low-dimensional vectors
❖ Evaluate similarity between representations of entities
❖ Perform classification

⮚ 4 Models
❖ SIF: An Aggregate Function Model
❖ RNN: A Sequence-aware Model
❖ Attention: A Sequence Alignment Model
❖ Hybrid Model
Decomposable Attention-based model
⮚ Attention mechanism
❖ Learn a similarity representation
❖ Expressed with token-wise alignment
⮚ Comparison: Two-layer Highway Network
⮚ Aggregation
❖ Sum over all elements
❖ Normalization
Hybrid Attribute Summarization
⮚ Use a Bi-directional RNN to learn the representation before soft alignment
⮚ Improvement over aggregation
❖ Use RNN to compute weight
❖ A weighted average over all elements
Summary
⮚ Pros
❖ Do not need manually created features
❖ More Powerful
⮚ Cons
❖ Efficiency
❖ Only supports the supervised setting
❖ Overfitting
Outline
• Background
• Traditional ML based Approach
• Deep Learning based Approach
• Comprehensive Approach
• Related topics in Natural Language Processing
End-to-end EM System
⮚ Challenges
❖ Cover the entire EM pipeline
❖ Integrate multiple methods into one system
❖ Provide necessary guidance to users
Two-Stage Framework

How is this done today in practice?
Development stage: find an accurate workflow, using data samples
Production stage: execute the workflow on the entirety of the data

Tables A and B (1M tuples each) → block → match (using supervised learning)
The Magellan System [KDS 2016]

Development Stage (on data samples):
❖ How-to guide and supporting tools (as Python commands)
❖ Python interactive environment and script language
❖ Data analysis stack: pandas, scikit-learn, matplotlib, numpy, scipy, pyqt, seaborn, …
❖ Facilities for lay users: GUIs, wizards, …
❖ Power users produce the EM workflow

Production Stage (on the original data):
❖ How-to guide and supporting tools (as Python commands)
❖ Big Data stack: PySpark, mrjob, Pydoop, pp, dispy, …

Both stages build on the PyData ecosystem.
Workflow: How-to guide

1. Take samples A’ and B’ from tables A and B.
2. Apply a blocker X to (A’, B’) to obtain a candidate set Cx of tuple pairs.
3. Take a sample S from Cx and label its pairs (+/-) to get a golden set G.
4. Cross-validate matchers on G (e.g. matcher U: 0.89 F1, matcher V: 0.93 F1) and pick the best one, matcher V.
5. Run a quality check: if it fails (no), revise the workflow (e.g. try another blocker Y, producing Cy) and repeat; if it passes (yes), output the workflow.
PyData Ecosystem

⮚ Development stage does a lot of data analysis
❖ So build tools on the data analysis stack in PyData
⮚ Production stage focuses on scaling
❖ So build tools on the Big Data stack in PyData
⮚ PyData ecosystem
❖ Used extensively by data scientists
❖ 86,800 packages (in PyPI)
❖ Data analysis stack and Big Data stack
❖ Tools to manage user work
❖ Software infrastructure to build tools
❖ Ways to manage/package/distribute tools
❖ Companies, conferences, books, etc.
Design for Open World

Closed-World Systems (e.g. an RDBMS): SQL queries and commands operate only on the system’s own data (tables A, B) and metadata (e.g. “A.ssn is a key”).

Open-World Systems (e.g. Magellan): commands share data (tables A, B, C) and metadata (e.g. “A.ssn is a key”) with other systems X and Y and their commands.
Falcon [DSD 2017]
⮚ Boost EM with crowdsourcing and active learning
❖ Define basic operators & use them to model the EM workflow as a DAG
❖ Scales up operators (using MapReduce)
❑ e.g., executing complex blocking rules over large tables
❖ Conduct optimizations both within and across operators
❑ e.g., use crowd time of an operator to mask machine time of another
The EM DAG of Falcon
⮚ Basic idea: use crowd time to mask machine time
⮚ Other Optimizations: indexing, speculatively execute rules, mask pair selection
Smurf [SAD 2018]
⮚ Replaces the labeling step of Falcon with a self-service solution
⮚ Uses a Random Forest to directly match strings with multiple predicates instead of deriving blocking rules
❖ Pruning predicates are applied in the fashion of a decision tree; multiple decision trees form the Random Forest
Execution of Random Forests
⮚ In-database fashion of execution: express the rules with join operations
⮚ Optimizations: leveraging the query optimizer of the DBMS
❖ Join results reuse
❖ Intra/inter-path filter reuse and ordering
❖ Cost-based plan selection
Summary
⮚ Powerful toolkits
⮚ Integration of different approaches
❖ Similarity functions
❖ ML models
❖ Crowdsourcing
⮚ User friendly API
⮚ Efficiency
❖ Benefit from DB techniques
Outline
• Background
• Traditional ML based Approach
• Deep Learning based Approach
• Comprehensive Approach
• Related topics in Natural Language Processing
Text matching in NLP tasks
⮚ The problem can be formulated as:
Match(𝑇1, 𝑇2) = 𝐹(𝜙(𝑇1), 𝜙(𝑇2))
❖ 𝜙: maps a text to a representation vector
❖ 𝐹: scoring function over the representations
Text matching is a core task in NLP
Task Text 1 Text 2
Information Retrieval Query/Document Document
Question Answering Question Answer
Paraphrase Identification Text A Text B
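A minimal concrete instance of this formulation, with deliberately simple choices for both functions (𝜙 as a bag-of-words term-frequency vector and 𝐹 as cosine similarity; the neural models discussed next learn both instead):

```python
import math
from collections import Counter

def phi(text):
    # Representation function: a toy bag-of-words term-frequency vector.
    return Counter(text.lower().split())

def F(v1, v2):
    # Scoring function: cosine similarity between the two vectors.
    dot = sum(v1[w] * v2[w] for w in v1)
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

def match(t1, t2):
    return F(phi(t1), phi(t2))

score = match("why are doctors always late",
              "why doctors always make you wait")
```

Every model in this part of the tutorial can be read as a different trade-off in how much capacity goes into 𝜙 (the representation) versus 𝐹 (the interaction/scoring).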
Paraphrase Identification
Where can I get very professional and reliable
envelope printing service in Sydney?
Where can I get very affordable branded envelope
printing service in Sydney?
Why are doctors always late?
Why doctors always make you wait for 15-20 minutes
before they see you?
ARC-I and ARC-II [HLL 2014]
Architecture I: Siamese network
Architecture II: early interaction
PWIM [HL 2016]
Information Retrieval
⮚ Deep learning methods have been successfully applied to information retrieval (IR)
⮚ Human-defined features are:
❖ Time-consuming
❖ Incomplete
❖ Over-specified
DSSM [HHG 2013]
⮚ First compose the representation of each document/sentence
⮚ Then perform matching between documents/sentences
❖ Cosine similarity between semantic vectors
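DSSM hashes each word into letter trigrams before the deep layers. The sketch below is a simplification: it shows the word-hashing step and the cosine scoring, but applies cosine directly to the hashed counts and omits DSSM's learned nonlinear layers entirely.

```python
import math
from collections import Counter

def letter_trigrams(text):
    # DSSM-style word hashing: pad each word with '#' and break it into
    # letter trigrams, e.g. "web" -> "#we", "web", "eb#".
    grams = []
    for word in text.lower().split():
        padded = f"#{word}#"
        grams += [padded[i:i + 3] for i in range(len(padded) - 2)]
    return grams

def cosine(grams1, grams2):
    # Cosine similarity between two trigram-count vectors.
    c1, c2 = Counter(grams1), Counter(grams2)
    dot = sum(c1[g] * c2[g] for g in c1)
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

def score(query, doc):
    return cosine(letter_trigrams(query), letter_trigrams(doc))
```

Word hashing keeps the input dimensionality small and makes the model robust to out-of-vocabulary words and minor misspellings, since close variants share most of their letter trigrams.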
Match Pyramid [WLG 2016]
⮚ Focuses on capturing matching patterns from word-level interactions
Question Answering (QA)
Benchmark: Stanford Question Answering Dataset (SQuAD)
Match-LSTM [WJ 2017]
BiDAF: Bi-directional Attention Flow [SKF 2017]
QANet [YDL 2018]
Summary
⮚ Text matching is also a popular and crucial topic in NLP
⮚ Different tasks define the texts to be matched differently
⮚ Basic architecture: Siamese Network
⮚ Models are going deeper…
Reference
⮚ [AGK 2010] On active learning of record matching packages. SIGMOD 2010.
⮚ [BIP 2012] Active sampling for entity matching. KDD 2012.
⮚ [BM 2003] Adaptive Duplicate Detection Using Learnable String Similarity Measures. KDD 2003.
⮚ [CKM 2009] Exploiting context analysis for combining multiple entity resolution systems. SIGMOD 2009.
⮚ [CMG 2014] Learning phrase representations using RNN encoder-decoder for statistical machine translation. EMNLP 2014.
⮚ [DHM 2005] Reference reconciliation in complex information spaces. SIGMOD 2005.
⮚ [DSD 2017] Falcon: Scaling up hands-off crowdsourced entity matching to build cloud services. SIGMOD 2017.
⮚ [ETJ 2018] Distributed representations of tuples for entity resolution. PVLDB 2018.
⮚ [HS 1997] Long Short-Term Memory. Neural Computation, 1997.
⮚ [KDS 2016] Magellan: Toward building entity matching management systems. PVLDB 2016.
Reference (cont.)
⮚ [HHG 2013] Learning Deep Structured Semantic Models for Web Search using Clickthrough Data. CIKM 2013.
⮚ [HL 2016] Pairwise Word Interaction Modeling with Deep Neural Networks for Semantic Similarity Measurement. NAACL 2016.
⮚ [HLL 2014] Convolutional Neural Network Architectures for Matching Natural Language Sentences. NIPS 2014.
⮚ [GS 2009] Answering Table Augmentation Queries from Unstructured Lists on the Web, PVLDB 2009.
⮚ [LM 2014] Distributed representations of sentences and documents. ICML 2014.
⮚ [MLR 2018] Deep learning for entity matching: A design space exploration. SIGMOD 2018.
⮚ [MCC 2013] Efficient Estimation of Word Representations in Vector Space. ICLR Workshop 2013.
⮚ [MSC 2013] Distributed representations of words and phrases and their compositionality. NIPS 2013.
⮚ [PSM 2014] GloVe: Global Vectors for Word Representation. EMNLP 2014.
⮚ [RC 2004] A Hierarchical Graphical Model for Record Linkage. UAI 2004.
Reference (cont.)
⮚ [SAD 2018] Smurf: Self-service string matching using random forests. PVLDB 2018.
⮚ [SKF 2017] Bidirectional Attention Flow for Machine Comprehension. ICLR 2017.
⮚ [TKM 2001] Learning object identification rules for information integration. Inf. Syst. 2001.
⮚ [TSD 2018] LinkNBed: Multi-Graph Representation Learning with Entity Linkage. ACL 2018.
⮚ [Winkler 2006] Overview of Record Linkage and Current Research Directions, Research Report Series, US Census, 2006.
⮚ [WJ 2017] Machine Comprehension Using Match-LSTM and Answer Pointer. ICLR 2017.
⮚ [WKF 2012] CrowdER: Crowdsourcing Entity Resolution. PVLDB 2012.
⮚ [WLG 2016] A Deep Architecture for Semantic Matching with Multiple Positional Sentence Representations. AAAI 2016.
⮚ [YDL 2018] QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension. ICLR 2018.
Open challenges
Open challenges (I)
Optimize the pipeline of String Similarity Queries
⮚ Most existing works optimize at the algorithm level
⮚ An end-to-end pipeline is needed to handle the whole life cycle of the task
⮚ It is essential to build such a pipeline on relational database management systems like MySQL or NoSQL databases like MongoDB
Open challenges (II)
More Efficient ML-based Approaches
⮚ Machine-learning-based methods need more time to train models than database techniques take
⮚ Supervised methods require a large amount of labeled data to serve as the training set
Open challenges (III)
Combine Human-in-the-Loop Approaches with ML
⮚ Use crowdsourcing approaches in string similarity measurement
⮚ The future direction is to automatically identify when, and to what extent, human labor should be involved in string similarity processing
Conclusion
⮚ String data is ubiquitous
⮚ Two families of approaches for string similarity processing: database techniques and ML models
⮚ Future work: combine the two approaches into a Human-in-the-Loop, (semi-)autonomous tool
Thank you
Q & A