Indexing Mixed Types for Approximate Retrieval

Liang Jin* UC Irvine

Nick Koudas University of Toronto

Chen Li* UC Irvine

Anthony K.H. Tung National University of Singapore

* Liang Jin and Chen Li: supported by NSF CAREER Award IIS-0238586

2

Queries with Mixed-Type Predicates

Star             Title                                          Year   Genre
Keanu Reeves     The Matrix                                     1999   Sci-Fi
Samuel Jackson   Star Wars: Episode III - Revenge of the Sith   2005   Sci-Fi
Schwarzenegger   The Terminator                                 1984   Sci-Fi
Samuel Jackson   Goodfellas                                     1990   Drama
…                …                                              …      …

SELECT *
FROM Movies
WHERE star SIMILARTO 'Schwarrzenger'
AND |year - 1980| <= 5;

• SIMILARTO:
– a domain-specific function
– returns a similarity value between two strings
• Example: edit distance ed(Tom Hanks, Ton Hank) = 2
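
As an illustration of the example above, edit distance can be computed with the standard dynamic program. This is a minimal Python sketch; the function name `edit_distance` is chosen here for illustration, and the slides do not say how SIMILARTO is implemented.

```python
def edit_distance(s, t):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions that turn s into t."""
    prev = list(range(len(t) + 1))            # distances from "" to prefixes of t
    for i, cs in enumerate(s, 1):
        cur = [i]                             # distance from s[:i] to ""
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # delete cs
                           cur[j - 1] + 1,              # insert ct
                           prev[j - 1] + (cs != ct)))   # substitute or match
        prev = cur
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2 (substitute m->n, delete s)
```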

3

Why fuzzy predicates?

• Errors in queries
– User doesn't remember a string exactly
– User types a wrong string

Relation R (Star)    Relation S (Star)
Keanu Reeves         Keanu Reeves
Samuel Jackson       Samuel L. Jackson
Schwarzenegger       Schwarzenegger
Samuel Jackson       Samuel L. Jackson
…                    …

• Errors in databases:
– Data is not clean
– Especially true in data integration and cleansing

4

Problem Formulation

SELECT *
FROM Movies
WHERE star SIMILARTO 'Schwarrzenger'
AND |year - 1980| <= 5;

Given: A query with fuzzy predicates on strings and range predicates on numeric attributes, on a single relation

Goal: Answer the query efficiently

5

Rest of the talk

• Motivation: supporting queries with mixed-type predicates

• Our approach: MAT tree

• Construction and maintenance of MAT tree

• Experiments

6

Assumptions

SELECT *
FROM Movies
WHERE star SIMILARTO 'Schwarrzenger'
AND |year - 1980| <= 5;

• One fuzzy string predicate (edit distance)

• One numeric predicate

Query: (Qs, δs, Qn, δn) = ('Schwarrzenger', 2, 1980, 5)

7

Intuition of MAT (Mixed-attribute-type) Tree

• “2 > 1 + 1”
– One integrated indexing structure is better than two independent indexing structures on two attributes

• Indexing numeric attributes: B-tree or R-tree

• Indexing strings as a tree to support fuzzy predicates?

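[Figure: MAT tree. Four leaf nodes hold the records (Spielberg, 1946), (Hanks, 1956) | (Gibson, 1956), (Hanks, 1957) | (Crowe, 1964), (Robert, 1968) | (DiCaprio, 1974), (Roberrts, 1977). Each entry in the tree carries a numeric range ("MBR") together with a string summary drawn as *. Leaf ranges: <1946,1956>, <1956,1957>, <1964,1968>, <1974,1977>. Root entries: <1946,1957> and <1964,1977>.]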

8

Answering a query (Qs, δs, Qn, δn)

• Traverse the MAT-tree top-down

• At each node, prune by checking:
– whether [Qn – δn, Qn + δn] overlaps the node's numeric range
– whether minEditDistance(Qs, Tn) <= δs, where Tn is the trie stored at the node

• A subtree is visited only if both checks pass.
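
The two pruning checks can be sketched as follows. This is a simplified illustration, not the authors' implementation: the `Node` class and helper names are hypothetical, and each node keeps a plain set of strings as a stand-in for the compressed trie, so minEditDistance is computed by brute force over that set. The records are the ones from the MAT tree figure.

```python
from dataclasses import dataclass, field


def edit_distance(s, t):
    """Standard Levenshtein dynamic program."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]


@dataclass
class Node:  # hypothetical node layout
    lo: int
    hi: int
    strings: frozenset            # assumption: stand-in for the compressed trie
    children: list = field(default_factory=list)
    records: list = field(default_factory=list)   # filled for leaves only


def make_leaf(records):
    years = [y for _, y in records]
    return Node(min(years), max(years), frozenset(s for s, _ in records),
                records=list(records))


def make_internal(children):
    return Node(min(c.lo for c in children), max(c.hi for c in children),
                frozenset().union(*(c.strings for c in children)),
                children=list(children))


def search(node, qs, ds, qn, dn):
    """Answer (Qs, ds, Qn, dn) by top-down traversal with both pruning checks."""
    if node.hi < qn - dn or node.lo > qn + dn:              # numeric range check
        return []
    if min(edit_distance(qs, s) for s in node.strings) > ds:  # string check
        return []
    if not node.children:                                    # leaf: verify records
        return [(s, y) for s, y in node.records
                if abs(y - qn) <= dn and edit_distance(qs, s) <= ds]
    hits = []
    for child in node.children:
        hits.extend(search(child, qs, ds, qn, dn))
    return hits


leaves = [make_leaf([("Spielberg", 1946), ("Hanks", 1956)]),
          make_leaf([("Gibson", 1956), ("Hanks", 1957)]),
          make_leaf([("Crowe", 1964), ("Robert", 1968)]),
          make_leaf([("DiCaprio", 1974), ("Roberrts", 1977)])]
root = make_internal([make_internal(leaves[:2]), make_internal(leaves[2:])])
print(search(root, "Roberts", 1, 1970, 8))
```

For the query ("Roberts", 1, 1970, 8), the left half of the tree is discarded by the numeric check alone, since its range <1946,1957> does not overlap [1962, 1978].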


9

Challenge

• How to represent strings so that they fit into a limited space (disk based)

• and still support fuzzy-predicate pruning

10

Existing Approaches to Indexing Strings as Trees

• M-tree:
– Edit distance: metric space

• Q-tree:
– Utilize the q-gram property of strings.
– See our paper for details

11

Representing strings as a trie

Strings: aad, abcde, abdfg, beh, ca, cb

[Figure: trie over these strings with nodes n1 (root) through n17; each edge is labeled with one character.]
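
A trie over the six strings can be built as in the sketch below. The class and function names are illustrative only. Counting the root, the trie has 17 nodes, which matches the labels n1 to n17 on the slide.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.is_end = False   # True if a stored string ends here


def build_trie(strings):
    root = TrieNode()
    for s in strings:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True
    return root


def count_nodes(node):
    """Number of nodes in the subtree, including the node itself."""
    return 1 + sum(count_nodes(c) for c in node.children.values())


def contains(root, s):
    node = root
    for ch in s:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.is_end


trie = build_trie(["aad", "abcde", "abdfg", "beh", "ca", "cb"])
print(count_nodes(trie))  # 17
```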

12

Compressing a trie

• Select k representative nodes (centers).

• Each center is in the format of <alphabet,height>.

• A compressed trie accepts more strings than the original trie

[Figure: compression of the trie from the previous slide. The compressed trie keeps nodes n1, n2, n4, n5, n6, n7, n11, n13, and n15. Labels shown on compressed nodes: <{a,b,c},2>, <{a},1>, <{a,d},2>, <{b,d},2>, <{b},2>, <{c,d,e},3>, <{e},1>, <{f,g},2>, <{h},1>.]

13

Minimum edit distance between a string and a trie

• minEditDistance(Qs, Tn)?
– Convert a trie to an automaton.
– Compute the min distance between a string and an automaton [Myers and Miller, 1989]
– Early termination possible

[Figure: edit graph between the query string “ac” and an automaton over the characters a, b, d. Edges are labeled with character pairs such as [a,*], [c,*], [*,a], [*,b], [*,d], [*,*], [a,a], [a,b], [a,d], [c,a], [c,b], [c,d].]
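
The idea of computing a minimum edit distance against a whole trie, with early termination, can be illustrated on an uncompressed trie. The sketch below is not the Myers and Miller automaton algorithm cited on the slide; it swaps in a simpler, well-known technique that carries one row of the edit-distance table down each trie path. All names, including the `limit` parameter, are assumptions made for this example.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_end = False


def build_trie(strings):
    root = TrieNode()
    for s in strings:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True
    return root


def min_edit_distance(query, root, limit=float("inf")):
    """Smallest edit distance between query and any string in the trie.
    Returns inf if no stored string is within `limit`."""
    best = float("inf")

    def visit(node, row):
        nonlocal best
        if node.is_end:
            best = min(best, row[-1])
        for ch, child in node.children.items():
            new_row = [row[0] + 1]
            for j in range(1, len(query) + 1):
                cost = 0 if query[j - 1] == ch else 1
                new_row.append(min(new_row[j - 1] + 1, row[j] + 1,
                                   row[j - 1] + cost))
            # Early termination: the row minimum is a lower bound on the
            # distance to every string below this child.
            if min(new_row) < best and min(new_row) <= limit:
                visit(child, new_row)

    visit(root, list(range(len(query) + 1)))
    return best


trie = build_trie(["aad", "abcde", "abdfg", "beh", "ca", "cb"])
print(min_edit_distance("ac", trie))  # 2
```

With a threshold δs passed as `limit`, a node can be pruned as soon as the function returns a value larger than δs.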

14

Compressed trie → Automaton

• Each node is a state.

• Each edge becomes a transition between two states.

• For compressed node <Σ, L>, expand it to L levels. At each level, all characters in Σ become single states and are connected to a common tail ε.

Example: convert a compressed node <{a,b,c},2> into automaton nodes.

[Figure: two levels, each with the states a, b, c.]
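
The expansion of a compressed node <Σ, L> can be sketched as below. The sketch assumes that the node may be left after any level from 1 to L, which is one reading of the common tail ε; the slides do not spell this out, and both function names are hypothetical.

```python
from itertools import product


def expand_compressed_node(alphabet, height):
    """One list of character states per level, as in the expansion rule."""
    return [sorted(alphabet) for _ in range(height)]


def accepted_substrings(alphabet, height):
    """Strings readable while passing through the node, assuming it can be
    left after any level from 1 to height (an assumption of this sketch)."""
    result = []
    for length in range(1, height + 1):
        result.extend("".join(p) for p in product(sorted(alphabet), repeat=length))
    return result


print(expand_compressed_node({"a", "b", "c"}, 2))    # [['a', 'b', 'c'], ['a', 'b', 'c']]
print(len(accepted_substrings({"a", "b", "c"}, 2)))  # 3 + 9 = 12
```

Under this reading, the count grows as |Σ| + |Σ|^2 + ... + |Σ|^L, which shows why a compressed node accepts many strings that were never in the original trie.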

16

Constructing MAT-tree

• Option 1: insert records one by one.

• Option 2:
– bulk-load records
– construct the MAT-tree bottom-up

17

Compressing a trie

• Important:
– Accurately represent strings in a limited space.
– Minimize “information loss”.
– Maintain the pruning power during a traversal.

• Three methods:
– (1) Reducing # of accepted strings
– (2) Keeping accepted strings “clustered”
– (3) Combining (1) and (2)

18

Method (1): Reducing # of accepted strings

• Intuition:
– reducing this # makes the compressed trie more accurate

• Goodness function: # of accepted strings

• Algorithm: “Randomized”
– Randomly select k initial centers
– Randomly select one of the centers
– Randomly select an unselected node
– Swap them if the swap improves the goodness function
– Repeat for a fixed # of iterations
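
The steps above amount to a randomized swap search. A minimal sketch follows, with the goodness function passed in as a parameter. The `cost` table in the usage example is a made-up stand-in: a real goodness function would count the strings accepted by the trie compressed with the given centers, which the slides do not specify in enough detail to reproduce here.

```python
import random


def randomized_centers(nodes, k, goodness, iterations=1000, seed=0):
    """Pick k centers from `nodes` (a list), lower goodness is better."""
    rng = random.Random(seed)
    centers = rng.sample(nodes, k)                  # k random initial centers
    others = [n for n in nodes if n not in centers]
    best = goodness(centers)
    for _ in range(iterations):
        if not others:
            break
        i = rng.randrange(len(centers))             # a random center
        j = rng.randrange(len(others))              # a random unselected node
        centers[i], others[j] = others[j], centers[i]
        score = goodness(centers)
        if score < best:
            best = score                            # keep the improving swap
        else:
            centers[i], others[j] = others[j], centers[i]   # undo the swap
    return centers, best


# Hypothetical per-node costs, used only to exercise the search loop.
cost = {"n2": 5, "n3": 1, "n4": 4, "n5": 2, "n6": 7, "n7": 3}
centers, score = randomized_centers(list(cost), 2,
                                    lambda cs: sum(cost[c] for c in cs))
print(centers, score)
```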

19

Method (2): Keeping accepted strings clustered

• Intuition:
– keeping the accepted strings similar to the original ones by letting them share a common prefix.
– Place k centers as close to the root as possible.

• Algorithm: “BreadthFirst”

[Figure: BreadthFirst example with nodes n1 to n9 and edges a, b, c below the root n1. Labels shown on compressed nodes: <{a,d},2>, <{b,c,d,e,f,g},4>, <{e,h},2>, <{a},1>, <{b},1>.]


20

Method (3): Combining (1) and (2)

• Intuition:
– minimize the number of accepted strings, and at the same time maintain their similarity to the originals.

• Algorithm: “BottomUp”
– Keep shrinking the trie bottom-up until we have k nodes
– At each step, compress the node that adds the fewest additional strings

21

Dynamic maintenance

Insertion (s, n)

• Search the index for (s, n). If it is not in the index, identify the correct leaf node.

• If no overflow:
– update the “MBR” of the leaf node and of its ancestors recursively if necessary.

• If overflow:
– Split the leaf node
– Construct two compressed tries
– Cascade the split to the ancestors if necessary.

Deletion and update are handled similarly.

23

Setting

• Data
– IMDB: 100K movie star records (name and year of birth, YOB)
– Customers: 50K records (name and YOB)

• Test bed
– PC: 2.4 GHz P4, 1.2 GB memory, Windows XP
– Visual C++ compiler

• Results are similar on both data sets; we report those for IMDB.

24

Implemented approaches

• B-tree
• Q-tree
• B-tree & Q-tree
• BQ-tree
• BM-tree
• Sequential scan

“BBQ-tree”?

25

“2 > 1 + 1”

An integrated indexing structure is better than two separate indexing structures

δs=3, δn=4

26

Scalability

27

Effect of numeric threshold δn

28

Effect of string threshold δs

29

Dynamic Maintenance: time

30

Dynamic maintenance: MAT quality

31

Number of centers

• Increasing cluster # may not reduce the running time: pruning power versus computational cost

• For BottomUp and BreadthFirst (compared to Randomized):

– Centers are close to the root, so early termination is more likely

32

Conclusion

• MAT-tree: an efficient indexing structure for queries with mixed-type predicates

• Can be efficiently constructed and maintained

• Future work: develop a uniform framework to support different kinds of similarity functions

Q&A?

The Flamingo Project : http://www.ics.uci.edu/~flamingo/

33

Backup Slides

34

Constructing MAT-tree

• Option 1: inserting records one by one.

• Option 2: bulk-loading data records and constructing the MAT-tree in a bottom-up fashion.
– Records are sorted based on one attribute.
– Fill pages with records until full.
– Calculate the numeric range and the compressed trie for each leaf node.
– Merge leaf nodes into internal nodes recursively according to the desired fanout, until a single root is formed.
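
The bulk-loading steps can be sketched as follows. This is an illustration under simplifying assumptions, not the authors' code: records are sorted on the numeric attribute, each node keeps a plain set of strings in place of a compressed trie, and the `Node` class, `page_size`, and `fanout` names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:  # hypothetical node layout
    lo: int
    hi: int
    strings: frozenset            # assumption: stand-in for the compressed trie
    children: list = field(default_factory=list)
    records: list = field(default_factory=list)


def bulk_load(records, page_size=2, fanout=2):
    if not records or page_size < 1 or fanout < 2:
        raise ValueError("need at least one record, page_size >= 1, fanout >= 2")
    records = sorted(records, key=lambda r: r[1])    # sort on the numeric attribute
    level = []
    for i in range(0, len(records), page_size):      # fill pages until full
        page = records[i:i + page_size]
        level.append(Node(page[0][1], page[-1][1],
                          frozenset(s for s, _ in page), records=page))
    while len(level) > 1:                            # merge until a single root
        parents = []
        for i in range(0, len(level), fanout):
            group = level[i:i + fanout]
            parents.append(Node(min(c.lo for c in group),
                                max(c.hi for c in group),
                                frozenset().union(*(c.strings for c in group)),
                                children=group))
        level = parents
    return level[0]


records = [("Spielberg", 1946), ("Hanks", 1956), ("Gibson", 1956), ("Hanks", 1957),
           ("Crowe", 1964), ("Robert", 1968), ("DiCaprio", 1974), ("Roberrts", 1977)]
root = bulk_load(records)
print([(c.lo, c.hi) for c in root.children])  # [(1946, 1957), (1964, 1977)]
```

With two records per page and a fanout of two, the numeric ranges reproduce those shown in the MAT tree figure.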

35

Example – Customer Service Call Center

Name            SSN            YOB
Jack Lemmon     430-871-8294   1978
Harrison Ford   292-918-2913   1962
Tom Hanks       234-762-1234   1956
Tim Legler      125-457-8654   1870
…               …              …

Customer calls in

Issue a fuzzy query:

Name LIKE “Tom Hanks” AND YOB CLOSE to 1958

Return result

Serve the customer

In this example, the underlying system should be able to support fuzzy queries on both the string and the numeric attributes!

36

Scalability test (IO)
