March 13, 2022 © Copyright 2004 Sunaptic Solutions. All rights reserved. Text Search and Fuzzy Matching Presented by Andrei Kossoroukov, Sunaptic Solutions [email protected]

Text Search and Fuzzy Matching


Page 1: Text Search  and Fuzzy Matching

April 19, 2023© Copyright 2004 Sunaptic Solutions. All rights reserved.

Text Search and Fuzzy Matching

Presented by Andrei Kossoroukov, Sunaptic Solutions [email protected]

Page 2: Text Search  and Fuzzy Matching


Focus of the Presentation

Text Search in Big Databases
Data Cleansing in ETL
Word Matching
Usage of Different Matching Algorithms

Page 3: Text Search  and Fuzzy Matching


Scenarios

Scenario 1: “Dirty” data from other systems; “clean” search request from the user interface.
Scenario 2 (ETL): “Dirty” data is transformed into “clean” data.
Scenario 3: “Clean” data from other systems; “dirty” search request from the user interface.

Page 4: Text Search  and Fuzzy Matching


Text Search Challenges

Improving Search Speed
  Searching for a substring in a string regardless of the substring nature.
Improving Relevance of Results
  Searching for words of a human language.
  Domain dependence.

Page 5: Text Search  and Fuzzy Matching


Word Matching Approaches

Exact Matching
Partial Matching (Pattern Matching)
Grammatical Algorithms: Stemming Matching and Synonym Matching (Semantics)
Phonetic Matching
Fuzzy Matching

Page 6: Text Search  and Fuzzy Matching


Exact Matching

No additional challenge except speed.
Domain does not really matter.
Example: searching in a file using the Notepad program.
Example (SQL): SELECT field FROM table WHERE field = 'string'
MS SQL Server: Proper indexing improves speed.

Page 7: Text Search  and Fuzzy Matching


Partial Matching

Domain does not really matter.
Example: wildcards, search patterns.
Example (SQL): SELECT field FROM table WHERE field LIKE 'string%'
MS SQL Server: Proper indexing improves speed.
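
The exact and prefix queries from the last two slides can be tried side by side. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for MS SQL Server; the table name, column name, index name, and sample rows are assumptions made up for the illustration, and SQLite's indexing and LIKE rules are not identical to SQL Server's.

```python
import sqlite3

def search_demo():
    """Run an exact match and a prefix (LIKE) match on an in-memory table."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table, index, and rows, used only for illustration.
    conn.execute("CREATE TABLE people (name TEXT)")
    conn.execute("CREATE INDEX idx_people_name ON people (name)")
    conn.executemany(
        "INSERT INTO people (name) VALUES (?)",
        [("Smith",), ("Smithson",), ("Smyth",), ("Jones",)],
    )
    # Exact matching: WHERE field = 'string'
    exact = [row[0] for row in conn.execute(
        "SELECT name FROM people WHERE name = ? ORDER BY name", ("Smith",))]
    # Partial matching: WHERE field LIKE 'string%'
    prefix = [row[0] for row in conn.execute(
        "SELECT name FROM people WHERE name LIKE ? ORDER BY name", ("Smith%",))]
    conn.close()
    return exact, prefix

exact, prefix = search_demo()
print(exact)   # rows equal to "Smith"
print(prefix)  # rows starting with "Smith"
```

Neither query finds "Smyth", which is the gap that the phonetic and fuzzy approaches later in the deck try to close.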

Page 8: Text Search  and Fuzzy Matching


Full-text Search in MS SQL Server

Needs MS Search Service (for SQL Server 2000)
Included in MS SQL Server 2005 as SQL Server Full Text Search Service
CONTAINS Predicate
  Unlike LIKE, CONTAINS matches words.
  Can search for a word inflectionally generated from another (stemming matching).
  Can search for a word near another word.
  SQL Server discards noise words from the search criteria.
FREETEXT Predicate
  Matches a word or phrase close in meaning to the search word or phrase.
Needs Additional Space on Disk

Page 9: Text Search  and Fuzzy Matching


Full-text Search Architecture in MS SQL 2000

Page 10: Text Search  and Fuzzy Matching


Full-text Search Architecture in MS SQL 2005

Page 11: Text Search  and Fuzzy Matching


Grammatical Algorithms

Stemming Match
  We already saw SQL Server Full Text search.
  Google example: “cutting and paste”.
  Why a dictionary is needed: to determine the stem.
  MS Search Service provides only inflectional, not derivational, word generation.
Synonym Match
Most Grammatical Algorithms are Based on Dictionaries
Quasi Stemming Match
  Can be developed without a main dictionary (using a quasi-endings tree).
  Relatively low relevance.
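
To make the dictionary-free idea concrete, here is a rough sketch of ending stripping in Python. It is not the presenter's method: the slide does not define the quasi-endings tree, so the flat `ENDINGS` tuple, the minimum stem length of 3, and the consonant-undoubling rule are all assumptions chosen for illustration.

```python
# Hypothetical list of "quasi-endings"; a real system would use a larger,
# tree-structured set. Longer endings are tried first.
ENDINGS = ("ing", "ed", "es", "s")

def quasi_stem(word):
    """Strip one known ending without consulting any dictionary."""
    w = word.lower()
    for ending in ENDINGS:
        # Assumed rule: keep at least 3 characters of stem.
        if w.endswith(ending) and len(w) - len(ending) >= 3:
            w = w[: -len(ending)]
            # Assumed rule: undo consonant doubling ("cutt" -> "cut").
            if len(w) >= 2 and w[-1] == w[-2] and w[-1] not in "aeiou":
                w = w[:-1]
            break
    return w

print(quasi_stem("cutting"), quasi_stem("cuts"))  # both reduce to "cut"
print(quasi_stem("pasted"), quasi_stem("paste"))  # "past" vs "paste"
```

The second line of output shows why relevance is relatively low: without a dictionary, "pasted" and "paste" end up with different stems.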

Page 12: Text Search  and Fuzzy Matching


Phonetic Matching

Phonetic Matching Algorithms (or Phonetic Encoding, or “Sounds Alike” Algorithms)

Language Dependent
Domain Dependent

Page 13: Text Search  and Fuzzy Matching


Phonetic Matching Algorithms

The original SoundEx Algorithm
  Has been used in US census since late 1890s.
  Was patented by Margaret O'Dell and Robert C. Russell in 1918.
  Improvements: Phonix (1988), Editex (phonetic distance measuring, circa 2000), etc.
Metaphone and Double Metaphone Algorithms
  Author: Lawrence Phillips. 1990 and 2000.

Page 14: Text Search  and Fuzzy Matching


SoundEx Algorithm

1. Capitalize all letters in the word and drop all punctuation marks. Pad the word with rightmost blanks as needed during each procedure step.
2. Retain the first letter of the word.
3. Change all occurrences of the following letters to '0' (zero): 'A', 'E', 'I', 'O', 'U', 'H', 'W', 'Y'.
4. Change letters from the following sets into the digit given:
   1 = 'B', 'F', 'P', 'V'
   2 = 'C', 'G', 'J', 'K', 'Q', 'S', 'X', 'Z'
   3 = 'D', 'T'
   4 = 'L'
   5 = 'M', 'N'
   6 = 'R'
5. Collapse every pair of identical digits that occur beside each other into a single digit in the string that resulted after step 4.
6. Remove all zeros from the string that results from step 5 (placed there in step 3).
7. Pad the string that resulted from step 6 with trailing zeros and return only the first four positions, which will be of the form <uppercase letter> <digit> <digit> <digit>.
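
The seven steps can be sketched in Python as follows. This is one reading of the slide, not a reference implementation: it assumes that the first letter's own digit takes part in the duplicate collapsing of step 5 before being replaced by the letter, and that characters outside A-Z are simply dropped. Database SOUNDEX functions may differ in such details.

```python
# Step 3 and step 4 tables: vowels plus H, W, Y map to '0'.
GROUPS = {
    "0": "AEIOUHWY",
    "1": "BFPV",
    "2": "CGJKQSXZ",
    "3": "DT",
    "4": "L",
    "5": "MN",
    "6": "R",
}
CODES = {letter: digit for digit, letters in GROUPS.items() for letter in letters}

def soundex(word):
    """Encode a word as <uppercase letter><digit><digit><digit>."""
    # Step 1: capitalize and drop everything that is not A-Z.
    letters = [c for c in word.upper() if c in CODES]
    if not letters:
        return ""
    # Steps 3-4: translate every letter to its digit.
    digits = [CODES[c] for c in letters]
    # Step 5: collapse identical digits that sit beside each other.
    collapsed = [digits[0]]
    for d in digits[1:]:
        if d != collapsed[-1]:
            collapsed.append(d)
    # Step 2 and step 6: keep the first letter itself, drop zeros elsewhere.
    tail = "".join(d for d in collapsed[1:] if d != "0")
    # Step 7: pad with zeros and keep four positions.
    return (letters[0] + tail + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # same code for both names
print(soundex("Kathy"), soundex("Cathy"))    # differ only in the first letter
```

"Robert" and "Rupert" receive the same code, while "Kathy" and "Cathy" do not, because the first letter is kept verbatim rather than encoded.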

Page 15: Text Search  and Fuzzy Matching


More About SoundEx

Example (SQL)
DIFFERENCE
Oracle SOUNDEX – Slightly Different from SQL Server SOUNDEX
Seems That Major DBMSs (SQL Server, Oracle, DB2) Don’t Have a Better Phonetic Matching
Enhancements
  Replace DG with G etc.
  Phonix algorithm.

Page 16: Text Search  and Fuzzy Matching


SoundEx Limitations

SoundEx is only usable in applications that can tolerate high false positives (when words that don't match the sound of the inquiry are returned) and high false negatives (when words that match the sound of the inquiry are NOT returned).

In many instances, unreliable interfaces are used as a foundation, upon which a reliable layer may be built. Interfaces that build a reliable layer, based on context, over a SoundEx foundation may also be possible.

SQL: the word can’t start with a space.
A mistake in the first letter results in a 100% mismatch.

Page 17: Text Search  and Fuzzy Matching


Metaphone and Double Metaphone

Metaphone
  An algorithm to code English words phonetically by reducing them to 16 consonant sounds.
Double Metaphone
  An algorithm to code English words (and foreign words often heard in the United States) phonetically by reducing them to 12 consonant sounds.
Author: Lawrence Phillips, 1990 and 2000
Metaphone Description and Demo: http://www.wbrogden.com/phonetic/
SQL Example

Page 18: Text Search  and Fuzzy Matching


Double Metaphone Advantages and Limitations

Free, Efficient, and Easy to Use
Provides Better Results Compared to SOUNDEX
Returns Two Possible Matches
Works Best with Proper Names
May Fail to Match Misspelled Words
Much Slower than SOUNDEX

Page 19: Text Search  and Fuzzy Matching


Fuzzy Matching

What is Fuzzy Matching?
  Fuzzy queries in Index Server are simple prefix matching (for example, dog* returns dogmatic and doghouse) plus stem matching.
  Originally Meant “Not Exact Matching”
  Web Search Engines
Edit Distance Based Algorithms
  Simple: Hamming distance algorithms.
  Most popular: Levenshtein distance algorithm.
Q-Gram Based Algorithms
Both Types of Algorithms Are Language and Domain Independent
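
As an illustration of the simplest edit-distance measure named above, here is a minimal Python sketch of Hamming distance. It counts the positions at which two strings differ and, by definition, only applies to strings of equal length; raising `ValueError` for unequal lengths is a design choice of this sketch.

```python
def hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))

print(hamming("john", "jonh"))  # two positions differ
```

Because it cannot handle insertions or deletions, Hamming distance is rarely enough for misspelled words, which is where Levenshtein distance comes in.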

Page 20: Text Search  and Fuzzy Matching


Levenshtein Distance

Developed in 1965
LD is a Measure of the Similarity Between Two Strings
  It is the smallest number of insertions, deletions, and substitutions required to change one string into another.
Language and Domain Independent
Demo: http://www.merriampark.com/ld.htm
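
The definition above translates into a short dynamic-programming routine. The following Python sketch is one common way to compute it, keeping only two rows of the distance table; it is offered as an illustration, not as the code behind the linked demo.

```python
def levenshtein(source, target):
    """Smallest number of insertions, deletions, and substitutions
    needed to turn source into target."""
    # previous[j] is the distance between the processed prefix of source
    # and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i characters into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))
print(levenshtein("john smith", "jonh smith"))
```

Note that a transposition such as "john" versus "jonh" counts as two edits under this definition, since swapping adjacent characters is not one of the three allowed operations.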

Page 21: Text Search  and Fuzzy Matching


Q-Grams

Q-Grams Are Obtained by Sliding a Window of Size Q over the Characters of a Given String
  Example: 2-grams of “john smith” are $j jo oh hn n_ _s sm mi it th h# ($ and # mark the start and end of the string; _ stands for the space).
IDEA: If Strings Match, They Have Many Common Q-Grams
  Example: “john smith” and “jonh smith” have 8 common 2-grams out of 11 each, because the transposition changes only three of them.
Language and Domain Independent
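
A sliding-window q-gram extractor takes only a few lines of Python. This sketch follows the padding convention of the slide's example ($ at the start, # at the end, _ for a space); the function names and the multiset-overlap count are choices made for this illustration.

```python
from collections import Counter

def qgrams(text, q=2, start="$", end="#", space="_"):
    """Return the q-grams of text, padded as in the slide's example."""
    padded = start * (q - 1) + text.replace(" ", space) + end * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]

def common_qgrams(a, b, q=2):
    """Count q-grams shared by a and b, respecting repeated q-grams."""
    overlap = Counter(qgrams(a, q)) & Counter(qgrams(b, q))
    return sum(overlap.values())

print(" ".join(qgrams("john smith")))
print(common_qgrams("john smith", "jonh smith"))
```

With this padding, the sketch counts 8 shared 2-grams for "john smith" and "jonh smith": each string has 11, and the transposed letters affect three of them.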

Page 22: Text Search  and Fuzzy Matching


“Fuzzy” SSIS

Fuzzy Lookup matches input records with clean, standardized records in a reference table.

Fuzzy Grouping identifies groups of records in a table where each record in the group potentially corresponds to the same real-world entity.

Designed for data cleanup.
Based on Q-Grams and Levenshtein Distance (?).

Page 23: Text Search  and Fuzzy Matching


Design a Simple SSIS Fuzzy Lookup Package

Setting Up String Data Types (DT_STR and DT_WSTR)
ETI (Error-Tolerant Index), Tokens, Delimiters
  Tokens are not Q-Grams
Similarity Threshold
Number of Matches
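
The roles of the similarity threshold and the number of matches can be shown outside SSIS. The sketch below is not the SSIS Fuzzy Lookup transformation and does not use its index or scoring: it swaps in Python's standard `difflib.get_close_matches`, whose `cutoff` and `n` parameters play roughly analogous roles. The reference list and the function name `fuzzy_lookup` are hypothetical.

```python
import difflib

# Hypothetical "clean" reference table.
REFERENCE = ["John Smith", "Jane Smyth", "Bob Jones"]

def fuzzy_lookup(value, reference, threshold=0.8, max_matches=1):
    """Return up to max_matches reference entries whose difflib
    similarity to value is at least threshold, best match first."""
    return difflib.get_close_matches(
        value, reference, n=max_matches, cutoff=threshold)

print(fuzzy_lookup("Jonh Smith", REFERENCE))  # misspelled input record
print(fuzzy_lookup("Zzzz", REFERENCE))        # nothing similar enough
```

Raising the threshold trades missed matches for fewer false ones; raising the number of matches returns more candidates per input record for later review.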

Page 24: Text Search  and Fuzzy Matching


Can Fuzzy Lookup Be Accessed From C# Code?

NOT YET

Page 25: Text Search  and Fuzzy Matching


Conclusions

Language and Domain Knowledge is Important

No Implementations? – Develop Yourself!

Questions?

Page 26: Text Search  and Fuzzy Matching


For More Information Contact

Andrei Kossoroukov, Sunaptic Solutions [email protected] 604-629-0891 ext. 105