A Combination of Trie-trees and Inverted files for the Indexing of Set-valued Attributes
Manolis Terrovitis (NTUA)
Spyros Passas (NTUA)
Panos Vassiliadis (UoI)
Timos Sellis (NTUA)
Terrovitis et. al., CIKM '06
Problem
We are interested in low cardinality set-values
– Retail store transaction logs
– Web logs
– Biomedical databases etc.
We address the efficient evaluation of containment queries
– In which transactions were products ‘a’ and ‘b’ sold together?
– Which users visited only the main page or the download page of our site?
We propose the Hybrid Trie-Inverted file (HTI) index
Outline
Problem definition The HTI index Query evaluation Experiments Conclusions
Data and queries
Find all transactions that contain ‘a’, ‘b’ and ‘d’ (subset)
Find all transactions that contain exactly ‘a’, ‘b’ and ‘d’ (equality)
Find all transactions that contain only items from ‘a’, ‘b’ and ‘d’ (superset)
tid  products         tid  products
1    {f,a}            9    {a,e}
2    {a,d,c}          10   {g,c,a}
3    {c,b,a}          11   {b,a,e}
4    {f,a,c}          12   {b,d,c}
5    {c,g}            13   {c,f,a,d,b}
6    {a,b,g,c,d,e}    14   {b,d}
7    {a,d,b}          15   {e}
8    {a,e,b}          16   {b,f,a}
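The three query types can be spelled out directly on the sample data. The following is only a brute-force sketch that scans every transaction (no index involved); the function names are made up for illustration.

```python
# Sample database from the slide: tid -> set of products.
DB = {tid: set(s) for tid, s in enumerate(
    "fa adc cba fac cg abgcde adb aeb ae gca bae bdc cfadb bd e bfa".split(), start=1)}

def subset_query(db, qs):
    """Transactions that contain every item of the query set qs."""
    return sorted(tid for tid, items in db.items() if qs <= items)

def equality_query(db, qs):
    """Transactions that contain exactly the items of qs."""
    return sorted(tid for tid, items in db.items() if items == qs)

def superset_query(db, qs):
    """Transactions that contain only items taken from qs."""
    return sorted(tid for tid, items in db.items() if items <= qs)

q = {"a", "b", "d"}
print(subset_query(DB, q))    # [6, 7, 13]
print(equality_query(DB, q))  # [7]
print(superset_query(DB, q))  # [7, 14]
```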
Data and queries
Traditional methods
– Signature files
– Inverted files
Differences from text databases:
– Low cardinality
– Large number of records in comparison with vocabulary size
– New types of queries (equality, superset)
The HTI index
Background – The inverted file
Vocabulary (I) and inverted (postings) lists over the database transactions (D):
item  inverted list
a     1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 13, 16
b     3, 6, 7, 8, 11, 12, 13, 14, 16
c     2, 3, 4, 5, 6, 10, 12, 13
d     2, 6, 7, 12, 13, 14
e     6, 8, 9, 11, 15
f     1, 4, 13, 16
g     5, 6, 10
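As a rough sketch of how such an inverted file could be built from the sample transactions (the function name is illustrative, and the whole structure is kept in memory for simplicity):

```python
from collections import defaultdict

# Sample database from the earlier slide: tid -> set of products.
DB = {tid: set(s) for tid, s in enumerate(
    "fa adc cba fac cg abgcde adb aeb ae gca bae bdc cfadb bd e bfa".split(), start=1)}

def build_inverted_file(db):
    """Map every item of the vocabulary to the sorted list of tids that contain it."""
    inverted = defaultdict(list)
    for tid in sorted(db):           # visiting tids in order keeps each list sorted
        for item in db[tid]:
            inverted[item].append(tid)
    return dict(inverted)

inverted = build_inverted_file(DB)
for item in sorted(inverted):
    print(item, inverted[item])
```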
HTI index
Inverted files - problems
The evaluation of containment queries relies on merge-joining the inverted lists
The inverted lists become very long
– when the database size is very big compared to the vocabulary
– when the items’ distribution is skewed
This is often the case in the real world!
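To see why long lists hurt, here is a minimal merge-join over sorted tid lists. Its cost grows with the combined length of the lists it reads, however few tids survive. The lists are those of the sample database; the function name is illustrative.

```python
def merge_join(left, right):
    """Intersect two sorted tid lists in a single linear pass."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            out.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return out

# Inverted lists of 'a', 'b' and 'd' for the sample database.
a = [1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 13, 16]
b = [3, 6, 7, 8, 11, 12, 13, 14, 16]
d = [2, 6, 7, 12, 13, 14]

# Subset query {a, b, d}: 27 postings are read to produce 3 answers.
print(merge_join(merge_join(a, b), d))  # [6, 7, 13]
```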
HTI index
Solution?
We need to break up the lists!
But how?
– Let's make a list for every combination of items!
HTI index
Solution?
We assume a total order based on the frequency of appearance for the items of the database
We order the items in each set-value and we transform it to a sequence
We create a path in the access tree for each sequence
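The three steps above can be sketched as follows. The nested-dict node layout and the alphabetical tie-break between equally frequent items are assumptions made for this illustration, not details taken from the slides.

```python
from collections import Counter

# Sample database from the earlier slide: tid -> set of products.
DB = {tid: set(s) for tid, s in enumerate(
    "fa adc cba fac cg abgcde adb aeb ae gca bae bdc cfadb bd e bfa".split(), start=1)}

def frequency_order(db):
    """Total order on the items: most frequent first (ties alphabetical, an assumption)."""
    counts = Counter(item for items in db.values() for item in items)
    return sorted(counts, key=lambda item: (-counts[item], item))

def build_trie(db, order):
    """Insert each transaction as a path of its items, sorted by the total order."""
    rank = {item: r for r, item in enumerate(order)}
    root = {"tids": [], "children": {}}
    for tid in sorted(db):
        node = root
        for item in sorted(db[tid], key=rank.get):
            node = node["children"].setdefault(item, {"tids": [], "children": {}})
            node["tids"].append(tid)   # every node on the path records the tid
    return root

order = frequency_order(DB)
trie = build_trie(DB, order)
print(order)                                            # ['a', 'b', 'c', 'd', 'e', 'f', 'g']
print(trie["children"]["a"]["children"]["b"]["tids"])   # [3, 6, 7, 8, 11, 13, 16]
```

For the sample data the frequency order happens to coincide with the alphabetical one, which is why the ordered transactions look simply sorted.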
HTI index
All combinations?
Ordered transactions:
1 {a,f}
2 {a,c,d}
3 {a,b,c}
4 {a,c,f}
5 {c,g}
6 {a,b,c,d,e,g}
7 {a,b,d}
8 {a,b,e}
9 {a,e}
10 {a,c,g}
11 {a,b,e}
12 {b,c,d}
13 {a,b,c,d,f}
14 {b,d}
15 {e}
16 {a,b,f}
[Figure: trie rooted at Null with one path per ordered transaction; each node is labeled with the tids of the transactions that pass through it, e.g. node a has tid’s 1,2,3,4,6,7,8,9,10,11,13,16 and its child b has tid’s 3,6,7,8,11,13,16.]
HTI index
All combinations? Maybe not…
[Figure: the same trie over all ordered transactions.]
HTI index
An access tree for the frequent items
Access tree over the frequent items a, b, c (built from the ordered transactions):
Null
  a  tid’s: 1,2,3,4,6,7,8,9,10,11,13,16
    b  tid’s: 3,6,7,8,11,13,16
      c  tid’s: 3,6,13
    c  tid’s: 2,4,10
  b  tid’s: 12,14
    c  tid’s: 12
  c  tid’s: 5
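The HTI index
Vocabulary: a, b, c, d, e, f, g
Access tree (frequent items a, b, c), each node pointing to its inverted sublist:
Null
  a: 1,2,3,4,6,7,8,9,10,11,13,16
    b: 3,6,7,8,11,13,16
      c: 3,6,13
    c: 2,4,10
  b: 12,14
    c: 12
  c: 5
Inverted lists (remaining items):
d: 2, 6, 7, 12, 13, 14
e: 6, 8, 9, 11, 15
f: 1, 4, 13, 16
g: 5, 6, 10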
HTI index
The basic points
The access tree is used only for the most frequent items
The inverted lists are restructured so that each node of the access tree points to a different inverted sublist
We keep the access tree in main memory
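A minimal in-memory sketch of these points for the running example. The set of frequent items is passed in explicitly here (how the threshold selects them is not shown on the slides), and the dict-based node layout and the name `build_hti` are assumptions for illustration; a real index would keep the sublists on disk.

```python
# Sample database from the earlier slide: tid -> set of products.
DB = {tid: set(s) for tid, s in enumerate(
    "fa adc cba fac cg abgcde adb aeb ae gca bae bdc cfadb bd e bfa".split(), start=1)}

def build_hti(db, frequent):
    """Return (access_tree, inverted_lists).

    frequent: the frequent items, most frequent first (assumed to be given).
    The access tree covers only the frequent items; each node holds its own
    sublist of tids. All other items keep an ordinary inverted list.
    """
    rank = {item: r for r, item in enumerate(frequent)}
    tree = {"tids": [], "children": {}}
    inverted = {}
    for tid in sorted(db):
        node = tree
        # Path through the access tree: the frequent items of the transaction, in order.
        for item in sorted((i for i in db[tid] if i in rank), key=rank.get):
            node = node["children"].setdefault(item, {"tids": [], "children": {}})
            node["tids"].append(tid)
        # Non-frequent items go to plain inverted lists.
        for item in db[tid]:
            if item not in rank:
                inverted.setdefault(item, []).append(tid)
    return tree, inverted

tree, inverted = build_hti(DB, ["a", "b", "c"])
print(tree["children"]["a"]["children"]["c"]["tids"])  # [2, 4, 10]
print(inverted["d"])                                   # [2, 6, 7, 12, 13, 14]
```

Note that the sublists of all nodes labeled with the same item partition that item's original inverted list, e.g. for b: 3,6,7,8,11,13,16 (under a) and 12,14 (under Null).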
Query Evaluation
Basic Steps
1. Find the frequent items of the query set
2. Use the access tree to detect the sublists which might participate in the answer
3. Merge-join these sublists with the inverted lists of the non-frequent items
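One possible reading of these three steps for a subset query, sketched on the running example. The way candidate nodes are selected (nodes labeled with the least frequent of the query's frequent items whose root path covers all of them) is an assumption of this sketch; the pruning rules of the actual HTI algorithm may differ. All function names are illustrative.

```python
DB = {tid: set(s) for tid, s in enumerate(
    "fa adc cba fac cg abgcde adb aeb ae gca bae bdc cfadb bd e bfa".split(), start=1)}
FREQUENT = ["a", "b", "c"]  # assumed frequent items, most frequent first

def build_hti(db, frequent):
    """Access tree over the frequent items plus inverted lists for the rest."""
    rank = {item: r for r, item in enumerate(frequent)}
    tree, inverted = {"tids": [], "children": {}}, {}
    for tid in sorted(db):
        node = tree
        for item in sorted((i for i in db[tid] if i in rank), key=rank.get):
            node = node["children"].setdefault(item, {"tids": [], "children": {}})
            node["tids"].append(tid)
        for item in db[tid]:
            if item not in rank:
                inverted.setdefault(item, []).append(tid)
    return tree, inverted, rank

def merge_join(left, right):
    """Intersect two sorted tid lists."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            out.append(left[i]); i += 1; j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return out

def matching_sublists(node, needed, last, rank, path=frozenset()):
    """Step 2: sublists of nodes labeled `last` whose root path covers `needed`."""
    for item, child in node["children"].items():
        if item == last:
            if needed <= path | {item}:
                yield child["tids"]
        elif rank[item] < rank[last]:   # items after `last` cannot lead to it
            yield from matching_sublists(child, needed, last, rank, path | {item})

def subset_query(tree, inverted, rank, qs):
    if not qs:
        raise ValueError("query set must not be empty")
    freq = sorted((i for i in qs if i in rank), key=rank.get)   # step 1
    lists = []
    if freq:
        found = matching_sublists(tree, set(freq), freq[-1], rank)
        lists.append(sorted(t for tids in found for t in tids))
    lists += [inverted.get(i, []) for i in qs if i not in rank]
    result = lists[0]
    for other in lists[1:]:                                     # step 3
        result = merge_join(result, other)
    return result

tree, inverted, rank = build_hti(DB, FREQUENT)
print(subset_query(tree, inverted, rank, {"b", "c", "d"}))  # [6, 12, 13]
```

For the query (‘b’, ‘c’, ‘d’) only the sublists 3,6,13 and 12 are read from the access tree before joining with the list of d, instead of the full lists of b and c.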
Subset - (‘b’, ‘c’, ‘d’)
[Figure: the HTI index of the running example (vocabulary, access tree with its sublists, inverted lists of d, e, f, g).]
Equality - (‘b’, ‘c’, ‘d’)
[Figure: the HTI index of the running example (vocabulary, access tree with its sublists, inverted lists of d, e, f, g).]
Superset - (‘b’, ‘c’, ‘d’)
[Figure: the HTI index of the running example (vocabulary, access tree with its sublists, inverted lists of d, e, f, g).]
Experiments
Setup
Real Data from UCI
– web log from microsoft.com [320k records, 294 items]
– web log from msnbc.com [1M records, 17 items]
Synthetic data
– Zipfian distribution of order 1
– 100k-1M records
– 1k-10k items
– Queries with 2-22 items
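For readers who want to reproduce a similar workload, synthetic set-values with Zipf-distributed items (exponent 1) could be generated roughly as below. This is not the generator used in the experiments: the uniform choice of set size, the `max_len` parameter and the function name are assumptions of this sketch.

```python
import random

def zipf_transactions(n_records, n_items, max_len, seed=0):
    """Generate tid -> set of item ids, with item r drawn with weight 1/r (Zipf, order 1)."""
    rng = random.Random(seed)
    weights = [1.0 / r for r in range(1, n_items + 1)]
    items = range(n_items)                 # item 0 is the most frequent one
    db = {}
    for tid in range(1, n_records + 1):
        k = rng.randint(1, max_len)        # assumed: set size uniform in [1, max_len]
        db[tid] = set(rng.choices(items, weights=weights, k=k))  # duplicates collapse
    return db

db = zipf_transactions(n_records=1000, n_items=100, max_len=8, seed=1)
print(len(db), "records generated")
```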
Experiments
Query performance – DB size
[Figure: synthetic data - DB size. x-axis: 1000's of records (0 to 1000); y-axis: disk page accesses (0 to 3000); series: IF, HTI - 0.5%, HTI - 1%, HTI - 3%.]
Experiments
Query performance – query length
[Figure: synthetic data - query length. x-axis: query length (2 to 22); y-axis: disk page accesses (0 to 2500); series: IF, HTI - 0.5%, HTI - 1%, HTI - 3%.]
Experiments
Query performance – query length
[Figure: real data - subset. x-axis: query length (2 to 7); y-axis: disk page accesses (0 to 400); series: IF, HTI - 5%, HTI - 20%, HTI - 40%.]
Experiments
Query performance – query length
[Figure: real data - equality. x-axis: query length (2 to 7); y-axis: disk page accesses (0 to 400); series: IF, HTI - 5%, HTI - 20%, HTI - 40%.]
Experiments
Query performance – query length
[Figure: real data - superset. x-axis: query length (2 to 7); y-axis: disk page accesses (0 to 1000); series: IF, HTI - 5%, HTI - 20%, HTI - 40%.]
Experiments
Access tree size – DB size
[Figure: Effect of the DB size. x-axis: 1000's of records (0 to 1000); y-axis: 1000's of tree nodes (0 to 2500); series: HTI - 0.5%, HTI - 1%, HTI - 3%.]
Experiments
Access tree size – DB size
[Figure: Effect of the DB size. x-axis: millions of records (0 to 30); y-axis: 1000's of tree nodes (0 to 1800).]
Experiments
The HTI scales much better than the inverted file as the query length and the database size grow
A small threshold is enough for a performance gain of over an order of magnitude
The main memory requirements do not exceed 0.5M for the real data.
Conclusions
The HTI index relies on breaking up the larger inverted lists into smaller lists that contain known combinations of items
The HTI index significantly outperforms the inverted file for small domains and skewed item distributions
It has moderate memory requirements that can be adjusted by using the right threshold
The End
Thank You!
Experiments
Vocabulary size
[Figure: Effect of the vocabulary size. x-axis: vocabulary size in 1000's of items (1 to 9); y-axis: 1000's of tree nodes (0 to 1600); series: HTI - 0.5%, HTI - 1%, HTI - 3%.]
Experiments
Threshold choice
[Figure: Effect of the threshold. x-axis: threshold (0% to 10%); y-axis: 1000's of tree nodes (0 to 1400).]
Experiments
Threshold choice
[Figure: Effect of the threshold. x-axis: threshold (0% to 10%); y-axis: Avg of disk page accesses (0 to 300).]