Efficient Approximate String Matching with Synonyms and

Matemaattis-luonnontieteellinen tiedekuntaMatemaattis-luonnontieteellinen tiedekunta

Efficient Approximate String Matching with Synonyms and Taxonomies

Pengfei XuCustos: Professor Jiaheng Lu, University of Helsinki

Opponent: Professor Jan Holub, Czech Technical University in Prague

1

PhD Defense

Matemaattis-luonnontieteellinen tiedekunta

§ Motivation, challenge, and research problems§ Main contributions of this thesis§ Impacts of this research

Lectio Praecursoria

2


Motivation

3

Typos…

… should be corrected


Motivation

4

Q: A ≈ B and C ≈ D?A: Yes.


We want the data be clean and consistent.But data in reality is often far from that

Motivation

5


Research Problems

6

most similar records all similar recordsWhat we want:


Research Problems

7

most similar records all similar records

typographic diffs semantic diffs

synonym

DeutscheBahn

taxonomy

What we want:

bodygauardWhat we have

in data:


Research Problems

8



synonym

DeutscheBahn

taxonomy

What we want:


in data:


Research Questions

9



synonym

DeutscheBahn

taxonomy

What we want:


in data:

RQ1

RQ2

RQ3

RQ4


RQ1: find most similar records given a query string w.r.t. synonyms

Main Contributions (RQ1)

10

Twin TriesLeast space cost

Slow lookup speed

Expansion TrieMost space cost

Fast lookup speed

Hybrid TrieModerate space cost

Moderate (on the faster side) lookup speed


Twin tries§ Records and synonym rules in separate trie§ Save space but slowdown the lookup

Expansion trie§ Attach synonyms directly to records§ Use more space but fast lookup

Hybrid trie§ Select some rules and attach them to records, left others

in a separate trie§ Branch and bound to solve a knapsack problem

§ Moderate space (under threshold) and lookup time


11



12

Space cost

Lookup speed


RQ2: find most similar records between two sets w.r.t. taxonomies


13

Sorted List LP-Sorted List

intermediate node

real node in record

Trie



14

1. Two pointers on the first record

2. Advance the large pointer until not similar

3. Advance the small pointer

4. Repeat Steps 2-35. Stop when both pointers

reach the end



15

Trie solution is able to skip visiting some records.

It makes uses of ml (min length) to tell the depth of the shallowest node among all descendants, skipping infesiable subtries.

intermediate node

real node in record

Trie



16

Trie solution is able to skip visiting some records.

It makes uses of ml (min length) to tell the depth of the shallowest node among all descendants, skipping infesiable subtries.

intermediate node

real node in record

Prefix “1.5” in two tries lead to at most 2 / max(3, 4) = 0.5 similarity.

Trie



17

Lookup speed


RQ3: find all similar records between two sets w.r.t. taxonomies


18


§ Finding the maximal similarity is a bipartite matching problem (Hungarian algorithm)

§ Prefix filtering can be adapted to reduce the search space


19


We extended the popular prefix filtering to allow multiple overlaps§ Normal prefix filtering:

If 𝑆 and 𝑇 has similarity 𝑡, they must share at least 1 prefix in the first (1 − 𝑡) |𝑆| + 1 (or (1 − 𝑡) |𝑇| + 1 ) prefixes

§ Our enhanced prefix filtering:If 𝑆 and 𝑇 has similarity 𝑡, they must share at least 𝑛 prefix, each of which has a similarity at least ! " #$%&

' #$%&


20


We extended the popular prefix filtering to allow multiple overlaps§ Normal prefix filtering:

If 𝑆 and 𝑇 has similarity 𝑡, they must share at least 1 prefix in the first (1 − 𝑡) |𝑆| + 1 (or (1 − 𝑡) |𝑇| + 1 ) prefixes

§ Our enhanced prefix filtering:If 𝑆 and 𝑇 has similarity 𝑡, they must share at least 𝑛 prefix, each of which has a similarity at least ! " #$%&

' #$%&


21


Iterative Bernoulli sampling:1. pick small samples, run the filtering and join algorithm2. scale the time up to the whole dataset3. repeat above steps and return the mean of all

estimated values


22


RQ4: find all similar records w.r.t. typos, synonyms, and taxonomies


23


What type of similarity should be used?à NP-hard problem: weighted MIS


24


Model the segmentation problem as a weighted MIS on d-claw-free graph§ Approximable in polynomial time


25

4+1-claw free graph4-claw


Model the segmentation problem as a weighted MIS on d-claw-free graph§ Approximable in polynomial time


26



27

The filtering framework1. Generate pebbles for all similarities

Synonym TaxonomyTypo

Pebbles

Prefix Signature

Candidates

Results

Prefix Selection

Prefix Filtering

Calculate similarity



28

The filtering framework1. Generate pebbles for all similarities2. Select signature for prefix filtering

Heuristicruns in a quasi-linear time

DPslower but produce shorter signatures


Pebbles

Prefix Signature

Candidates

Results

Prefix Selection

Prefix Filtering




29

The filtering framework1. Generate pebbles for all similarities2. Select signature for prefix filtering3. Find record pairs having enough

common prefixes§ Allow to find multiple overlaps, with a

sampling algorithm


Pebbles

Prefix Signature

Candidates

Results

Prefix Selection

Prefix Filtering




30

The filtering framework1. Generate pebbles for all similarities2. Select signature for prefix filtering3. Find record pairs having enough

common prefixes4. Verify the real similarity for each

candidate


Pebbles

Prefix Signature

Candidates

Results

Prefix Selection

Prefix Filtering



Result shows a huge improvement over existing algorithms


31


§ Semantic knowledge, including synonyms and taxonomies, can be used for string matching tasks

§ Prefix filtering can be extended to increase the filtering power, accelerating verification process

§ Multiple types of similarites can be intergrated to discover more related records

Research Impacts

32


§ Top-k string auto-completion with synonyms, DASFAA 2017 (answers RQ1)

§ Efficient taxonomic similarity joins with adaptive overlap constraint, CIKM 2018 (answers RQ2)

§ Efficient string similarity join with taxonomy knowledge, Under review, KIS (answers RQ3)

§ Towards a unified framework for string similarity joins, VLDB 2019 (answers RQ4)

Publications

33


Thesis

34

ISSN 1238-8645ISBN 978-951-51-6987-7 (PAPERBACK)

ISBN 978-951-51-6988-4 (PDF)UNIGRAFIA

HELSINKI 2021

UNIVERSITY OF HELSINKIFACULTY OF SCIENCE

EFFICIENT APPROXIMATE STRING MATCHING WITH SYNONYMS AND TAXONOMIESPENGFEI XU

PENGFEI XU | EFFICIENT APPROXIMATE STRING M

ATCHING WITH SYNO

NYMS AND TAXO

NOM

IES

DEPARTMENT OF COMPUTER SCIENCE PHD THESIS

A-2021-2

UNIVERSITY OF HELSINKIFACULTY OF SCIENCE

DEPARTMENT OF COMPUTER SCIENCESERIES OF PUBLICATIONS A, REPORT A-2021-2

Available online for comments:

http://urn.fi/URN:ISBN:978-951-51-6988-4

http://urn.fi/URN:ISBN:978-951-51-6988-4

Documents

Efficient Approximate String Matching with Synonyms and