A Combination of Trie-trees and Inverted files for the Indexing of Set-valued Attributes

Manolis Terrovitis (NTUA), Spyros Passas (NTUA), Panos Vassiliadis (UoI), Timos Sellis (NTUA)


Terrovitis et al., CIKM '06

Problem

We are interested in low-cardinality set values:
– Retail store transaction logs
– Web logs
– Biomedical databases, etc.

We address the efficient evaluation of containment queries:
– In which transactions were products ‘a’ and ‘b’ sold together?
– Which users visited only the main page or the download page of our site?

We propose the Hybrid Trie-Inverted file (HTI) index


Outline

– Problem definition
– The HTI index
– Query evaluation
– Experiments
– Conclusions


Data and queries

tid  products         tid  products
1    {f,a}            9    {a,e}
2    {a,d,c}          10   {g,c,a}
3    {c,b,a}          11   {b,a,e}
4    {f,a,c}          12   {b,d,c}
5    {c,g}            13   {c,f,a,d,b}
6    {a,b,g,c,d,e}    14   {b,d}
7    {a,d,b}          15   {e}
8    {a,e,b}          16   {b,f,a}


Find all transactions that contain ‘a’, ‘b’ and ‘d’ (subset)

Find all transactions that contain exactly ‘a’, ‘b’ and ‘d’ (equality)

Find all transactions that contain only items from ‘a’, ‘b’ and ‘d’ (superset)
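The three query types can be stated precisely as set predicates. The sketch below is a brute-force scan over the sample transactions, not the indexed evaluation proposed in the talk; the function names are hypothetical and chosen only for illustration.

```python
# Sample database from the slides: tid -> set of products.
DB = {
    1: {"f", "a"}, 2: {"a", "d", "c"}, 3: {"c", "b", "a"}, 4: {"f", "a", "c"},
    5: {"c", "g"}, 6: {"a", "b", "g", "c", "d", "e"}, 7: {"a", "d", "b"},
    8: {"a", "e", "b"}, 9: {"a", "e"}, 10: {"g", "c", "a"}, 11: {"b", "a", "e"},
    12: {"b", "d", "c"}, 13: {"c", "f", "a", "d", "b"}, 14: {"b", "d"},
    15: {"e"}, 16: {"b", "f", "a"},
}


def subset_query(db, q):
    """Transactions that contain every item of q (q is a subset of them)."""
    return sorted(tid for tid, items in db.items() if q <= items)


def equality_query(db, q):
    """Transactions whose item set is exactly q."""
    return sorted(tid for tid, items in db.items() if items == q)


def superset_query(db, q):
    """Transactions that contain only items from q (q is a superset of them)."""
    return sorted(tid for tid, items in db.items() if items <= q)


Q = {"a", "b", "d"}
print(subset_query(DB, Q))    # [6, 7, 13]
print(equality_query(DB, Q))  # [7]
print(superset_query(DB, Q))  # [7, 14]
```

A full scan like this touches every record, which is exactly the cost an index is meant to avoid.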


Data and queries

Traditional methods:
– Signature files
– Inverted files

Differences from text databases:
– Low cardinality
– Large number of records in comparison with vocabulary size
– New types of queries (equality, superset)


The HTI index: Background – The inverted file

Vocabulary (I)   Inverted (postings) lists
a                1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 13, 16
b                3, 6, 7, 8, 11, 12, 13, 14, 16
c                2, 3, 4, 5, 6, 10, 12, 13
d                2, 6, 7, 12, 13, 14
e                6, 8, 9, 11, 15
f                1, 4, 13, 16
g                5, 6, 10

The postings lists are built from the database transactions (D), e.g. 14 {b, d} and 16 {b, f, a}.
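As a minimal sketch of how such an inverted file can be derived from the sample transactions, the code below maps each item to the ascending list of tids that contain it. The name `build_inverted_file` is hypothetical, and the in-memory dictionary stands in for what would be disk-resident lists in practice.

```python
from collections import defaultdict

# Sample database from the slides: tid -> set of products.
DB = {
    1: {"f", "a"}, 2: {"a", "d", "c"}, 3: {"c", "b", "a"}, 4: {"f", "a", "c"},
    5: {"c", "g"}, 6: {"a", "b", "g", "c", "d", "e"}, 7: {"a", "d", "b"},
    8: {"a", "e", "b"}, 9: {"a", "e"}, 10: {"g", "c", "a"}, 11: {"b", "a", "e"},
    12: {"b", "d", "c"}, 13: {"c", "f", "a", "d", "b"}, 14: {"b", "d"},
    15: {"e"}, 16: {"b", "f", "a"},
}


def build_inverted_file(db):
    """Return item -> ascending list of tids of the transactions containing it."""
    inverted = defaultdict(list)
    for tid in sorted(db):          # ascending tids keep every list sorted
        for item in db[tid]:
            inverted[item].append(tid)
    return dict(inverted)


inverted = build_inverted_file(DB)
for item in sorted(inverted):
    print(item, inverted[item])
```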


HTI index: Inverted files – problems

The evaluation of containment queries relies on merge-joining the inverted lists

The inverted lists become very long:
– when the database size is very big compared to the vocabulary
– when the items’ distribution is skewed

This is often the case in the real world!
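To make the cost concrete, here is a minimal sketch of a subset query answered by merge-joining sorted postings lists. The helper names are hypothetical, the postings are those derived from the sample transactions, and starting from the shortest list is a common heuristic assumed here rather than something stated in the slides.

```python
def merge_join(list_a, list_b):
    """Intersect two ascending tid lists in one linear pass."""
    out, i, j = [], 0, 0
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            out.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1
        else:
            j += 1
    return out


def subset_via_inverted_file(inverted, query_items):
    """Tids containing all query items, by intersecting their postings lists."""
    lists = sorted((inverted.get(item, []) for item in query_items), key=len)
    if not lists:
        return []
    result = lists[0]               # shortest list first (assumed heuristic)
    for postings in lists[1:]:
        result = merge_join(result, postings)
    return result


# Postings derived from the sample transactions (only the lists needed here).
INVERTED = {
    "a": [1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 13, 16],
    "b": [3, 6, 7, 8, 11, 12, 13, 14, 16],
    "d": [2, 6, 7, 12, 13, 14],
}
print(subset_via_inverted_file(INVERTED, ["a", "b", "d"]))  # [6, 7, 13]
```

Each join reads its input lists in full, so the work grows with the length of the lists, which is why long lists for frequent items dominate the cost.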


HTI index: Solution?

We need to break up the lists!

But how?
– Let's make a list for every combination of items!


We assume a total order based on the frequency of appearance for the items of the database

We order the items in each set-value and we transform it to a sequence

We create a path in the access tree for each sequence
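The three steps above can be sketched as follows. This is an illustrative reading, not the authors' code: the trie is stored flat as a dictionary from each path (a tuple of items) to the tids whose ordered sequence starts with that path, and ties in frequency are broken alphabetically, which is an assumption (the sample data has no ties).

```python
from collections import Counter

# Sample database from the slides: tid -> set of products.
DB = {
    1: {"f", "a"}, 2: {"a", "d", "c"}, 3: {"c", "b", "a"}, 4: {"f", "a", "c"},
    5: {"c", "g"}, 6: {"a", "b", "g", "c", "d", "e"}, 7: {"a", "d", "b"},
    8: {"a", "e", "b"}, 9: {"a", "e"}, 10: {"g", "c", "a"}, 11: {"b", "a", "e"},
    12: {"b", "d", "c"}, 13: {"c", "f", "a", "d", "b"}, 14: {"b", "d"},
    15: {"e"}, 16: {"b", "f", "a"},
}


def frequency_order(db):
    """Rank items by descending frequency (ties alphabetical: an assumption)."""
    counts = Counter(item for items in db.values() for item in items)
    ranked = sorted(counts, key=lambda item: (-counts[item], item))
    return {item: rank for rank, item in enumerate(ranked)}


def to_sequence(items, rank):
    """Turn a set-value into a sequence that follows the total order."""
    return tuple(sorted(items, key=rank.__getitem__))


def build_trie(db, rank):
    """Map every path to the tids whose ordered sequence starts with it."""
    trie = {}
    for tid in sorted(db):
        seq = to_sequence(db[tid], rank)
        for depth in range(1, len(seq) + 1):
            trie.setdefault(seq[:depth], []).append(tid)
    return trie


rank = frequency_order(DB)
trie = build_trie(DB, rank)
print(to_sequence(DB[13], rank))  # ('a', 'b', 'c', 'd', 'f')
print(trie[("a", "b")])           # [3, 6, 7, 8, 11, 13, 16]
print(trie[("a", "b", "c")])      # [3, 6, 13]
```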

HTI index: All combinations?

Ordered transactions:

tid  items            tid  items
1    {a,f}            9    {a,e}
2    {a,c,d}          10   {a,c,g}
3    {a,b,c}          11   {a,b,e}
4    {a,c,f}          12   {b,c,d}
5    {c,g}            13   {a,b,c,d,f}
6    {a,b,c,d,e,g}    14   {b,d}
7    {a,b,d}          15   {e}
8    {a,b,e}          16   {a,b,f}

[Figure, built up over several animation steps: a trie rooted at Null with one path per ordered transaction. Each node stores the tid’s of the transactions whose sequence passes through it, e.g. a: 1,2,3,4,6,7,8,9,10,11,13,16; a→b: 3,6,7,8,11,13,16; a→b→c: 3,6,13; a→c: 2,4,10; b: 12,14; c: 5; e: 15]

All combinations? Maybe, not…

HTI index: An access tree for the frequent items

Null
  a        tid’s: 1,2,3,4,6,7,8,9,10,11,13,16
    b      tid’s: 3,6,7,8,11,13,16
      c    tid’s: 3,6,13
    c      tid’s: 2,4,10
  b        tid’s: 12,14
    c      tid’s: 12
  c        tid’s: 5

The HTI index

Vocabulary: a, b, c, d, e, f, g

Access tree (frequent items a, b, c), each node pointing to its inverted sublist:

Null
  a        1,2,3,4,6,7,8,9,10,11,13,16
    b      3,6,7,8,11,13,16
      c    3,6,13
    c      2,4,10
  b        12,14
    c      12
  c        5

Inverted lists (remaining items):

d    2, 6, 7, 12, 13, 14
e    6, 8, 9, 11, 15
f    1, 4, 13, 16
g    5, 6, 10


HTI index: The basic points

The access tree is used only for the most frequent items

The inverted lists are restructured so that each node of the access tree points to a different inverted sublist

We keep the access tree in main memory
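A minimal sketch of these points on the sample data is given below. It assumes that an item counts as frequent when it appears in at least a given fraction of the transactions; the slides do not define the threshold precisely, so `threshold`, `build_hti` and the flat path-to-sublist dictionary are illustrative assumptions rather than the authors' structures.

```python
from collections import Counter, defaultdict

# Sample database from the slides: tid -> set of products.
DB = {
    1: {"f", "a"}, 2: {"a", "d", "c"}, 3: {"c", "b", "a"}, 4: {"f", "a", "c"},
    5: {"c", "g"}, 6: {"a", "b", "g", "c", "d", "e"}, 7: {"a", "d", "b"},
    8: {"a", "e", "b"}, 9: {"a", "e"}, 10: {"g", "c", "a"}, 11: {"b", "a", "e"},
    12: {"b", "d", "c"}, 13: {"c", "f", "a", "d", "b"}, 14: {"b", "d"},
    15: {"e"}, 16: {"b", "f", "a"},
}


def build_hti(db, threshold):
    """Build (frequent items, access tree, inverted lists) for a toy HTI index.

    threshold: minimum fraction of transactions an item must appear in to be
    treated as frequent (assumed definition).
    """
    counts = Counter(item for items in db.values() for item in items)
    frequent = sorted(
        (i for i in counts if counts[i] / len(db) >= threshold),
        key=lambda i: (-counts[i], i),
    )
    rank = {item: r for r, item in enumerate(frequent)}
    access_tree = defaultdict(list)  # path of frequent items -> inverted sublist
    inverted = defaultdict(list)     # non-frequent item -> ordinary inverted list
    for tid in sorted(db):
        # Project the transaction onto the frequent items, in frequency order.
        path = tuple(sorted((i for i in db[tid] if i in rank), key=rank.__getitem__))
        for depth in range(1, len(path) + 1):
            access_tree[path[:depth]].append(tid)
        for item in db[tid]:
            if item not in rank:
                inverted[item].append(tid)
    return frequent, dict(access_tree), dict(inverted)


frequent, access_tree, inverted = build_hti(DB, threshold=0.5)
print(frequent)                      # ['a', 'b', 'c']
print(access_tree[("a", "b", "c")])  # [3, 6, 13]
print(inverted["d"])                 # [2, 6, 7, 12, 13, 14]
```

With a threshold of 0.5 the sketch yields the seven tree nodes and four inverted lists of the running example; a higher threshold shrinks the tree and leaves more items in plain inverted lists.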


Query Evaluation: Basic Steps

1. Find the frequent items of the query set

2. Use the access tree to detect the sublists which might participate in the answer

3. Merge-join these sublists with the inverted lists of the non-frequent items
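One plausible reading of these three steps for a subset query is sketched below, with the index of the running example hard-coded. The exact tree traversal used by the authors is not recoverable from the slides, so the node selection rule here (take every node that ends in the last frequent query item and whose path contains all frequent query items) is an assumption, as are all names.

```python
FREQUENT = ["a", "b", "c"]  # frequent items, most frequent first
ACCESS_TREE = {             # path of frequent items -> inverted sublist
    ("a",): [1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 13, 16],
    ("a", "b"): [3, 6, 7, 8, 11, 13, 16],
    ("a", "b", "c"): [3, 6, 13],
    ("a", "c"): [2, 4, 10],
    ("b",): [12, 14],
    ("b", "c"): [12],
    ("c",): [5],
}
INVERTED = {                # inverted lists of the non-frequent items
    "d": [2, 6, 7, 12, 13, 14],
    "e": [6, 8, 9, 11, 15],
    "f": [1, 4, 13, 16],
    "g": [5, 6, 10],
}


def merge_join(xs, ys):
    """Intersect two ascending tid lists."""
    out, i, j = [], 0, 0
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            out.append(xs[i])
            i += 1
            j += 1
        elif xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    return out


def subset_query(query):
    # Step 1: split the query into frequent and non-frequent items.
    q_freq = [i for i in FREQUENT if i in query]
    q_rest = [i for i in sorted(query) if i not in FREQUENT]
    lists = []
    # Step 2: collect the sublists of the tree nodes that can hold answers.
    if q_freq:
        candidates = []
        for path, sublist in ACCESS_TREE.items():
            if path[-1] == q_freq[-1] and set(q_freq) <= set(path):
                candidates.extend(sublist)
        lists.append(sorted(candidates))
    # Step 3: merge-join with the inverted lists of the non-frequent items.
    lists.extend(INVERTED.get(i, []) for i in q_rest)
    if not lists:
        return []
    result = lists[0]
    for other in lists[1:]:
        result = merge_join(result, other)
    return result


print(subset_query({"b", "c", "d"}))  # [6, 12, 13]
```

For the query (‘b’, ‘c’, ‘d’) only the sublists of a→b→c and b→c are read (four tids in total) before the join with the list of d, instead of the full lists of b and c.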

Subset – (‘b’, ‘c’, ‘d’)

[Figure, repeated over five animation steps: the HTI index (vocabulary a to g, the access tree over a, b, c with its inverted sublists, and the inverted lists of d, e, f, g) as the subset query is evaluated]

Equality – (‘b’, ‘c’, ‘d’)

[Figure, repeated over four animation steps: the same HTI index as the equality query is evaluated]

Superset – (‘b’, ‘c’, ‘d’)

[Figure, repeated over seven animation steps: the same HTI index as the superset query is evaluated]


Experiments: Setup

Real data from UCI:
– web log from microsoft.com [320k records, 294 items]
– web log from msnbc.com [1M records, 17 items]

Synthetic data:
– Zipfian distribution of order 1
– 100k-1M records
– 1k-10k items
– Queries with 2-22 items
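For readers who want to reproduce a comparable workload, the following is a rough sketch of a Zipfian transaction generator. It is not the authors' generator: the transaction length distribution (`max_len`, uniform), the seed, and the sampling-with-replacement shortcut are all assumptions, since the slides only give the distribution order, record counts and vocabulary sizes.

```python
import random


def zipf_weights(vocabulary_size, order=1.0):
    """Unnormalized Zipf weights: the item of rank r gets weight 1 / r**order."""
    return [1.0 / (rank ** order) for rank in range(1, vocabulary_size + 1)]


def generate_transactions(num_records, vocabulary_size, max_len=10, order=1.0, seed=0):
    """Return tid -> set of item ids drawn from a Zipfian item distribution."""
    rng = random.Random(seed)
    weights = zipf_weights(vocabulary_size, order)
    items = range(vocabulary_size)
    db = {}
    for tid in range(1, num_records + 1):
        size = rng.randint(1, max_len)  # assumed: uniform length in [1, max_len]
        # Sampling with replacement and deduplicating, so a set may end up
        # smaller than `size`.
        db[tid] = set(rng.choices(items, weights=weights, k=size))
    return db


# Small run for illustration; the slides use 100k-1M records and 1k-10k items.
db = generate_transactions(num_records=1000, vocabulary_size=100, seed=1)
print(len(db), "records generated")
```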

Experiments: Query performance – DB size

[Figure: synthetic data, DB size. x-axis: 1000's of records (0 to 1000); y-axis: disk page accesses (0 to 3000); series: IF, HTI - 0.5%, HTI - 1%, HTI - 3%]

Experiments: Query performance – query length

[Figure: synthetic data, query length. x-axis: query length (2 to 22); y-axis: disk page accesses (0 to 2500); series: IF, HTI - 0.5%, HTI - 1%, HTI - 3%]

Experiments: Query performance – query length

[Figure: real data, subset queries. x-axis: query length (2 to 7); y-axis: disk page accesses (0 to 400); series: IF, HTI - 5%, HTI - 20%, HTI - 40%]

Experiments: Query performance – query length

[Figure: real data, equality queries. x-axis: query length (2 to 7); y-axis: disk page accesses (0 to 400); series: IF, HTI - 5%, HTI - 20%, HTI - 40%]

Experiments: Query performance – query length

[Figure: real data, superset queries. x-axis: query length (2 to 7); y-axis: disk page accesses (0 to 1000); series: IF, HTI - 5%, HTI - 20%, HTI - 40%]

Experiments: Access tree size – DB size

[Figure: effect of the DB size. x-axis: 1000's of records (0 to 1000); y-axis: 1000's of tree nodes (0 to 2500); series: HTI - 0.5%, HTI - 1%, HTI - 3%]

Experiments: Access tree size – DB size

[Figure: effect of the DB size. x-axis: millions of records (0 to 30); y-axis: 1000's of tree nodes (0 to 1800)]


Experiments

The HTI scales much better than the inverted file as the query size and the database size grow

A small threshold is enough for a performance gain over an order of magnitude

The main memory requirements do not exceed 0.5M for the real data.


Conclusions

The HTI index relies on breaking up the larger inverted lists into smaller lists that contain known combinations of items

The HTI index significantly outperforms the inverted file for small domains and skewed item distributions

It has moderate memory requirements that can be adjusted by using the right threshold


The End

Thank You!

Experiments: Vocabulary size

[Figure: effect of the vocabulary size. x-axis: vocabulary size in 1000's of items (1 to 9); y-axis: 1000's of tree nodes (0 to 1600); series: HTI - 0.5%, HTI - 1%, HTI - 3%]

Experiments: Threshold choice

[Figure: effect of the threshold. x-axis: threshold (0% to 10%); y-axis: 1000's of tree nodes (0 to 1400)]

Experiments: Threshold choice

[Figure: effect of the threshold. x-axis: threshold (0% to 10%); y-axis: average number of disk page accesses (0 to 300)]