Efficient Approximate String Matching with Synonyms and Taxonomies

Pengfei Xu
Custos: Professor Jiaheng Lu, University of Helsinki
Opponent: Professor Jan Holub, Czech Technical University in Prague
§ Motivation, challenge, and research problems
§ Main contributions of this thesis
§ Impacts of this research
Lectio Praecursoria
Faculty of Science (Matemaattis-luonnontieteellinen tiedekunta)
We want data to be clean and consistent, but real-world data is often far from that.
Motivation
Research Problems
Typographic differences; semantic differences
RQ1: find most similar records given a query string w.r.t. synonyms
Main Contributions (RQ1)
Slow lookup speed
Fast lookup speed
Moderate (on the faster side) lookup speed
Twin tries
§ Records and synonym rules in separate tries
§ Saves space but slows down lookup
Expansion trie
§ Attach synonyms directly to records
§ Uses more space but gives fast lookup
Hybrid trie
§ Select some rules and attach them to records; leave the others in a separate trie
§ Branch and bound to solve a knapsack problem
§ Moderate space (under a threshold) and lookup time
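The rule selection behind the hybrid trie can be pictured as a 0/1 knapsack. The following is a minimal sketch under stated assumptions, not the thesis implementation: each synonym rule is given a hypothetical `space_cost` (extra trie nodes if attached to records) and a hypothetical `benefit` (lookup time saved), and `select_rules` is an illustrative name. Branch and bound prunes any branch whose fractional-knapsack upper bound cannot beat the best selection found so far.

```python
def select_rules(rules, budget):
    """Pick rules to attach to records without exceeding the space budget.

    rules:  list of (name, space_cost, benefit) with space_cost > 0 (hypothetical values)
    budget: maximum total space_cost allowed
    Returns (best_total_benefit, sorted list of chosen rule names).
    """
    # Sort by benefit density so the fractional bound is valid and tight.
    items = sorted(rules, key=lambda r: r[2] / r[1], reverse=True)
    best = [0, []]

    def bound(i, space_left, value):
        # Upper bound: fill greedily, allowing a fraction of the first rule that does not fit.
        for _, cost, gain in items[i:]:
            if cost <= space_left:
                space_left -= cost
                value += gain
            else:
                return value + gain * space_left / cost
        return value

    def branch(i, space_left, value, chosen):
        if value > best[0]:
            best[0], best[1] = value, list(chosen)
        if i == len(items) or bound(i, space_left, value) <= best[0]:
            return  # nothing left, or this branch cannot improve on the best
        name, cost, gain = items[i]
        if cost <= space_left:
            chosen.append(name)  # branch 1: attach this rule to the records
            branch(i + 1, space_left - cost, value + gain, chosen)
            chosen.pop()
        branch(i + 1, space_left, value, chosen)  # branch 2: leave it in the separate trie

    branch(0, budget, 0, [])
    return best[0], sorted(best[1])


rules = [("r1", 4, 10), ("r2", 3, 7), ("r3", 2, 4), ("r4", 5, 9)]
result = select_rules(rules, budget=7)
```

Rules that are not selected stay in the separate rule trie, which is what keeps the total space under the threshold.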
RQ2: find most similar records between two sets w.r.t. taxonomies
Main Contributions (RQ2)
2. Advance the large pointer until the records are no longer similar
3. Advance the small pointer
4. Repeat Steps 2-3
5. Stop when both pointers reach the end
The trie solution can skip visiting some records.
It uses ml (min length), the depth of the shallowest node among all descendants, to skip infeasible subtries.
Legend: intermediate node; real node in record
The prefix “1.5” shared by two tries leads to a similarity of at most 2 / max(3, 4) = 0.5.
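The bound in this example can be reproduced with a few lines of Python. This is a sketch under an assumption inferred from the slide's arithmetic: taxonomy nodes are written as dotted paths such as `1.5.2`, and the similarity of two nodes is the depth of their common prefix divided by the larger of their two depths. The function names are illustrative only.

```python
def common_prefix_depth(a, b):
    """Number of leading path components shared by two dotted taxonomy paths."""
    depth = 0
    for x, y in zip(a.split("."), b.split(".")):
        if x != y:
            break
        depth += 1
    return depth


def taxonomy_similarity(a, b):
    """Assumed measure: shared depth divided by the deeper node's depth."""
    return common_prefix_depth(a, b) / max(len(a.split(".")), len(b.split(".")))


def upper_bound(shared_depth, ml_left, ml_right):
    """Best similarity reachable below a shared prefix, given each subtrie's ml (min length)."""
    return shared_depth / max(ml_left, ml_right)


sim = taxonomy_similarity("1.5.2", "1.5.3.1")   # shared prefix "1.5", depths 3 and 4
bound = upper_bound(2, 3, 4)                    # the slide's 2 / max(3, 4)
```

If the bound for a pair of subtries falls below the similarity already found or required, every record below them can be skipped without being visited.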
Trie
RQ3: find all similar records between two sets w.r.t. taxonomies
Main Contributions (RQ3)
§ Finding the maximal similarity is a bipartite matching problem (Hungarian algorithm)
§ Prefix filtering can be adapted to reduce the search space
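To make the bipartite matching view concrete, here is a small sketch that finds the assignment of left elements to right elements with the highest total similarity. It deliberately swaps in exhaustive search over permutations for the Hungarian algorithm, so it is only suitable for tiny inputs; the Hungarian algorithm returns the same optimum in polynomial time. The matrix values and the name `best_matching` are made up for illustration.

```python
from itertools import permutations


def best_matching(sim):
    """Maximum-total-similarity matching by brute force.

    sim[i][j] is the similarity between left element i and right element j;
    the left side must not be larger than the right side.
    Returns (best_score, assignment) where assignment[i] is the matched right index.
    """
    n, m = len(sim), len(sim[0])
    best_score, best_assign = -1.0, None
    for cols in permutations(range(m), n):  # each left element gets a distinct right element
        score = sum(sim[i][c] for i, c in enumerate(cols))
        if score > best_score:
            best_score, best_assign = score, cols
    return best_score, list(best_assign)


# Greedy would pair (0, 0) first for 0.9 and end with 0.9 + 0.1; the optimum is 0.8 + 0.8.
sim = [[0.9, 0.8],
       [0.8, 0.1]]
score, assignment = best_matching(sim)
```

The example shows why a matching algorithm is needed: picking the single best pair first does not give the maximal overall similarity.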
We extended the popular prefix filtering to allow multiple overlaps
§ Normal prefix filtering:
If records $r$ and $s$ have similarity at least $\theta$, they must share at least one prefix element among the first $\lfloor (1-\theta)|r| \rfloor + 1$ elements of $r$ (or the first $\lfloor (1-\theta)|s| \rfloor + 1$ elements of $s$), with elements sorted in a global order
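Normal prefix filtering can be sketched as follows. The code assumes Jaccard similarity over token sets and a rarest-first global order, neither of which is stated on the slide, and it covers only the single-overlap case, not the multiple-overlap extension. The threshold is converted to an exact fraction so that the prefix length is not shortened by floating-point rounding.

```python
from collections import Counter, defaultdict
from fractions import Fraction
from math import floor


def prefix_length(size, theta):
    """floor((1 - theta) * size) + 1, computed exactly."""
    return floor((1 - Fraction(str(theta))) * size) + 1


def candidate_pairs(records, theta):
    """Return index pairs (i, j), i < j, whose prefixes share at least one token."""
    freq = Counter(t for rec in records for t in set(rec))
    index = defaultdict(set)  # token -> ids of earlier records with that token in their prefix
    pairs = set()
    for i, rec in enumerate(records):
        tokens = sorted(set(rec), key=lambda t: (freq[t], t))  # global order: rare tokens first
        for t in tokens[:prefix_length(len(tokens), theta)]:
            pairs.update((j, i) for j in index[t])
            index[t].add(i)
    return pairs


def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


records = [["a", "b", "c", "d"], ["a", "b", "c", "e"], ["x", "y", "z"]]
candidates = candidate_pairs(records, 0.6)
verified = {(i, j) for i, j in candidates if jaccard(records[i], records[j]) >= 0.6}
```

Only the candidate pairs reach the verification step, so records that share no prefix token, such as the third record here, are never compared in full.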
Iterative Bernoulli sampling:
1. Pick small samples, run the filtering and join algorithm
2. Scale the time up to the whole dataset
3. Repeat the above steps and return the mean of all estimated values
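The three steps can be sketched as a small estimator. Several choices here are assumptions rather than details from the slides: the cost of a run is any number returned by a caller-supplied `cost_fn` (for example a pair count used as a proxy for time), and the scale-up factor is 1/p², on the reasoning that a pair of records survives Bernoulli sampling with rate p with probability p².

```python
import random


def bernoulli_sample(records, p, rng):
    """Keep each record independently with probability p."""
    return [r for r in records if rng.random() < p]


def estimate_cost(records, cost_fn, p, rounds, seed=0):
    """Mean of scaled-up costs measured on `rounds` independent Bernoulli samples."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(rounds):
        sample = bernoulli_sample(records, p, rng)   # step 1: small sample, run the algorithm
        estimates.append(cost_fn(sample) / (p * p))  # step 2: scale to the whole dataset (assumed 1/p^2)
    return sum(estimates) / len(estimates)           # step 3: mean of all estimates


def pair_count(sample):
    """Illustrative cost: number of record pairs a naive join would compare."""
    return len(sample) * (len(sample) - 1) // 2


records = list(range(10))
exact = estimate_cost(records, pair_count, p=1.0, rounds=3)
rough = estimate_cost(records, pair_count, p=0.5, rounds=200, seed=1)
```

With p = 1.0 every record is kept, so the estimate equals the exact cost; smaller rates trade accuracy for a much cheaper run, and averaging over rounds reduces the variance.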
RQ4: find all similar records w.r.t. typos, synonyms, and taxonomies
Main Contributions (RQ4)
What type of similarity should be used? → NP-hard problem: weighted maximum independent set (MIS)
Model the segmentation problem as a weighted MIS on a d-claw-free graph
§ Approximable in polynomial time
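As a rough illustration of approximating a weighted independent set, the sketch below uses a simple heaviest-first greedy heuristic. It is not claimed to be the approximation algorithm used in the thesis. The reading of the graph is also an assumption: vertices stand for candidate segment matches with a similarity weight, and an edge joins two candidates that conflict, so an independent set is a conflict-free segmentation.

```python
def greedy_weighted_independent_set(weights, edges):
    """Pick vertices heaviest-first, skipping any vertex adjacent to one already chosen.

    weights: dict vertex -> weight (hypothetical similarity scores)
    edges:   iterable of (u, v) conflicts between vertices
    Returns the chosen vertices in the order they were picked.
    """
    neighbours = {v: set() for v in weights}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)

    chosen, blocked = [], set()
    for v in sorted(weights, key=lambda v: (-weights[v], v)):  # heaviest first, ties by name
        if v not in blocked:
            chosen.append(v)
            blocked.add(v)
            blocked |= neighbours[v]  # conflicting candidates can no longer be used
    return chosen


weights = {"a": 3, "b": 4, "c": 3, "d": 2}
edges = [("a", "b"), ("b", "c")]
picked = greedy_weighted_independent_set(weights, edges)
total = sum(weights[v] for v in picked)
```

On this input the heuristic returns weight 6 while the optimum (a, c, d) has weight 8, which shows that the result is an approximation rather than an exact solution.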
Synonym, Taxonomy, Typo
The filtering framework
1. Generate pebbles for all similarities
2. Select signatures for prefix filtering
   § Heuristic runs in quasi-linear time
   § DP is slower but produces shorter signatures
3. Find record pairs having enough common prefixes
   § Allows finding multiple overlaps, with a sampling algorithm
4. Verify the real similarity for each candidate
§ Semantic knowledge, including synonyms and taxonomies, can be used for string matching tasks
§ Prefix filtering can be extended to increase the filtering power, accelerating the verification process
§ Multiple types of similarities can be integrated to discover more related records
Research Impacts
§ Efficient taxonomic similarity joins with adaptive overlap constraint, CIKM 2018 (answers RQ2)
§ Efficient string similarity join with taxonomy knowledge, Under review, KIS (answers RQ3)
§ Towards a unified framework for string similarity joins, VLDB 2019 (answers RQ4)
Publications
ISBN 978-951-51-6988-4 (PDF) UNIGRAFIA
EFFICIENT APPROXIMATE STRING MATCHING WITH SYNONYMS AND TAXONOMIES PENGFEI XU
DEPARTMENT OF COMPUTER SCIENCE SERIES OF PUBLICATIONS A, REPORT A-2021-2