Fast-Join: An Efficient Method for Fuzzy Token Matching based String Similarity Join

Jiannan Wang (Tsinghua, China) Guoliang Li (Tsinghua, China) Jianhua Feng (Tsinghua, China)

Slide 1

Jiannan Wang (Tsinghua, China), Guoliang Li (Tsinghua, China), Jianhua Feng (Tsinghua, China)

Slide 2: Outline
Introduction; Fuzzy-Token Similarity; String Similarity Join using Fuzzy-Token Similarity; Signature Scheme for Token Sets; Signature Scheme for Tokens; Experiment; Conclusion
2011/4/13, Fast-Join @ ICDE2011

Slide 3: Background
String Similarity Join: find similar string pairs between two string sets. An essential operation in many applications.

Card #   | Name           | Addr                  | Phn
1234**** | Jeffery Ullman | CS Dept. Stanford, CA | 111-1111
1018**** | Marvin Minsky  | CS Dept., MIT, MA     | 222-2222

Card #   | Name           | Email             | Tel
1205**** | David          | [email protected] |
0101**** | Jeffrey Ullman | [email protected] | (650)111-1111

Perform a similarity join on the name attribute: Jeffery Ullman and Jeffrey Ullman.

Slide 4: Background
String Similarity Join: find similar string pairs between two string sets. An essential operation in many applications.

User Id  | Query              | Timestamp
1018**** | ICDE 2011 Hanover  | 2011-01-15 10:12:10
1234**** | NBA All Stars 2011 | 2011-01-15 11:05:06
2823**** | ICDE Hannover      | 2011-01-15 11:10:10
6345**** | weather Hanover    | 2011-01-15 12:34:10

Perform a self similarity join on the query attribute.

Slide 5: Motivation
Existing similarity metrics:
Token-based similarity: Dice, Cosine, Jaccard
Character-based similarity: Edit Distance, Edit Similarity
Hybrid similarity: GED [SIGMOD 03]
Example: S1 = nba mcgrady, S2 = macgrady nba
Jaccard(S1, S2) = 1/3; ED(S1, S2) = 8; GED(S1, S2) = 0

Slide 6: Outline
Introduction; Fuzzy-Token Similarity; String Similarity Join using Fuzzy-Token Similarity; Signature Scheme for Token Sets; Signature Scheme for Tokens; Experiment; Conclusion

Slide 7: Token-based Similarity
Dice similarity, Cosine similarity, Jaccard similarity
Example: T1 = {nba, mcgrady}, T2 = {macgrady, nba}, |T1 ∩ T2| = 1
Exactly matched token pairs, i.e. T1 ∩ T2

Slide 8
[Figure: weighted bipartite graph between the tokens of T1 and T2. Node labels: mcgrady, nba, wnba, macgrady, nba. Edge weights: 0.125, 0.75, 0.875, 0.143, 1, 0.125.]
3. Fuzzy Overlap: maximum weighted matching (quantifies token similarity)
Better than |T1 ∩ T2| = 1

Slide 9: Fuzzy-Token Similarity
Fuzzy-Dice similarity, Fuzzy-Cosine similarity, Fuzzy-Jaccard similarity
Example: T1 = {nba, mcgrady}, T2 = {macgrady, nba}

Slide 10: Comparison with Existing Similarities

Slide 11: Outline
Introduction; Fuzzy-Token Similarity; String Similarity Join using Fuzzy-Token Similarity; Signature Scheme for Token Sets; Signature Scheme for Tokens; Experiment; Conclusion

Slide 12: String Similarity Join using Fuzzy-Token Similarity
Tokenization:
s1 = kobe and trancy, T1 = {kobe, and, trancy}
s2 = trcy macgrady mvp, T2 = {trcy, macgrady, mvp}
s'1 = kobe bryant age, T'1 = {kobe, bryant, age}
s'2 = mvp tracy mcgrady, T'2 = {mvp, tracy, mcgrady}
Result: (s2, s'2)
Naive solution: enumerating N^2 pairs. Quite expensive.

Slide 13: Using Existing Methods

Slide 14: Our Signature Scheme
The superscript denotes which token generates the signature.

Slide 15: Outline
Introduction; Fuzzy-Token Similarity; String Similarity Join using Fuzzy-Token Similarity; Signature Scheme for Token Sets; Signature Scheme for Tokens; Experiment; Conclusion

Slide 16: Prefix Filtering Signature Scheme
Alphabetical order. Remove the 2 largest signatures.

Slide 17: Token Sensitive Signature Scheme
Prefix Filtering: No! Token Sensitive: Yes!
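The prefix-filtering step named on slide 16 can be illustrated with a minimal sketch. It assumes classic prefix filtering on exact tokens with an overlap threshold c: each set is sorted alphabetically, its c - 1 largest elements are dropped, and two sets become a candidate pair only if their remaining prefixes share an element. With c = 3 this drops the 2 largest elements, as on the slide. The function names, the threshold parameter, and the example sets X, Y, Z, W are hypothetical; the deck applies the idea to signatures generated from tokens, which this sketch does not model.

```python
from collections import defaultdict
from itertools import combinations


def prefix(tokens, c):
    """Sort alphabetically and drop the c - 1 largest elements."""
    ordered = sorted(set(tokens))
    keep = max(len(ordered) - (c - 1), 0)
    return ordered[:keep]


def candidate_pairs(token_sets, c):
    """Return pairs of set ids whose prefixes share at least one element.

    Any two sets with exact overlap >= c are guaranteed to appear here;
    pairs that appear may still fail verification (false positives).
    """
    index = defaultdict(list)  # inverted index: element -> ids of sets
    for set_id, tokens in token_sets.items():
        for tok in prefix(tokens, c):
            index[tok].append(set_id)
    candidates = set()
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            candidates.add((a, b))
    return candidates


# Hypothetical example sets, not taken from the slides.
example_sets = {
    "X": {"a", "b", "c", "d"},
    "Y": {"b", "c", "d", "e"},
    "Z": {"d", "e", "f"},
    "W": {"x", "y", "z"},
}

if __name__ == "__main__":
    # c = 3 removes the 2 largest elements of every set.
    print(candidate_pairs(example_sets, 3))
```

Here X and Y keep the prefixes [a, b] and [b, c], share b, and really do overlap in three elements, while Y and Z (overlap 2) never meet in the index. The filter is only a necessary condition, so every candidate still has to be verified against the actual similarity function.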
Slide 18: Token Sensitive Signature Scheme (Contd)
Alphabetical order. Delete the maximal number of largest signatures that contain 2 tokens.
Candidates: {(T2, T4)}
Candidates: {(T1, T2), (T1, T3), (T1, T4), (T2, T4)}

Slide 19: Outline
Introduction; Fuzzy-Token Similarity; String Similarity Join using Fuzzy-Token Similarity; Signature Scheme for Token Sets; Signature Scheme for Tokens; Experiment; Conclusion

Slide 20: Partition-NED Signature Scheme

Slide 21: Partition t

Slide 22: Partition t

Slide 23: Partition t (Contd)

Slide 24: Pruning Techniques
Reduce substrings from 21 to 8.

Slide 25: Comparison with Partition-ED (SIGMOD 09)

Slide 26: Outline
Introduction; Fuzzy-Token Similarity; String Similarity Join using Fuzzy-Token Similarity; Signature Scheme for Token Sets; Signature Scheme for Tokens; Experiment; Conclusion

Slide 27: Experiment Setup
Data sets:
DBLP Author: author names from the DBLP dataset
AOL Query Log: queries from the AOL dataset
Environment:
C++, GCC 4.2.3, Ubuntu
Intel Core 2 Quad X5450 3.00GHz processor and 4 GB memory

Slide 28: Result Quality

Slide 29: Evaluation on Different Signature Schemes for Tokens

Slide 30: Evaluation on Different Signature Schemes for Token Sets

Slide 31: Put Everything Together

Slide 32: Outline
Introduction; Fuzzy-Token Similarity; String Similarity Join using Fuzzy-Token Similarity; Signature Scheme for Token Sets; Signature Scheme for Tokens; Experiment; Conclusion

Slide 33: Conclusion
Fuzzy-token similarity:
Hybrid similarity
Subsumes many well-known similarities
High result quality
String similarity join using fuzzy-token similarity:
Signature-based framework
Token-sensitive signature scheme
Partition-NED signature scheme
Achieves higher performance than the state-of-the-art methods, both theoretically and experimentally

Slide 34
http://dbgroup.cs.tsinghua.edu.cn/wangjn/projects/fastjoin/
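The fuzzy overlap from slide 8 can be made concrete with a small worked sketch for the deck's running example, T1 = {nba, mcgrady} and T2 = {macgrady, nba}. It rests on several assumptions that the transcript does not spell out: token similarity is edit similarity (1 minus edit distance over the longer length, which reproduces the 0.875 edge weight on slide 8), only edges with similarity at least a threshold `delta` count, and Fuzzy-Jaccard is obtained by substituting the fuzzy overlap for |T1 ∩ T2| in the Jaccard formula. The function names and the value delta = 0.8 are hypothetical, and the matching is brute force, suitable only for tiny token sets.

```python
from itertools import permutations


def edit_distance(a, b):
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def edit_similarity(a, b):
    """Normalized edit similarity: 1 - ED / max(|a|, |b|)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest


def fuzzy_overlap(t1, t2, delta):
    """Maximum weighted matching by brute force; edges below delta count 0."""
    small, large = sorted((list(t1), list(t2)), key=len)
    best = 0.0
    for perm in permutations(large, len(small)):
        total = 0.0
        for a, b in zip(small, perm):
            sim = edit_similarity(a, b)
            if sim >= delta:
                total += sim
        best = max(best, total)
    return best


def jaccard(t1, t2):
    """Exact-token Jaccard similarity of two non-empty sets."""
    return len(t1 & t2) / len(t1 | t2)


def fuzzy_jaccard(t1, t2, delta):
    """Assumed form: Jaccard with the fuzzy overlap in place of |T1 ∩ T2|."""
    overlap = fuzzy_overlap(t1, t2, delta)
    return overlap / (len(t1) + len(t2) - overlap)


T1 = {"nba", "mcgrady"}
T2 = {"macgrady", "nba"}

if __name__ == "__main__":
    print(jaccard(T1, T2))
    print(fuzzy_overlap(T1, T2, 0.8))
    print(fuzzy_jaccard(T1, T2, 0.8))
```

For this example the exact Jaccard is 1/3, as on slide 5, while the fuzzy overlap is 1 + 0.875 = 1.875, because nba matches nba exactly and mcgrady matches macgrady with one edit over eight characters. A real join would replace the permutation search with a proper bipartite matching algorithm.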