Supporting Efficient Top-k Queries in Type-Ahead Search. Guoliang Li 1, Jiannan Wang 1, Chen Li 2, Jianhua Feng 1. 1 Tsinghua University; 2 UC Irvine, Bimaple Technology Inc. SIGIR 2012, Portland, Oregon. Query suggestions. Type-ahead search (instant search).
Supporting Efficient Top-k Queries in Type-Ahead Search
Guoliang Li1, Jiannan Wang1, Chen Li2, Jianhua Feng1
1 Tsinghua University
2 UC Irvine, Bimaple Technology Inc.
SIGIR 2012, Portland, Oregon
Tsinghua/UC Irvine/Bimaple 2
Query suggestions
Li, Wang, Li, and Feng
Type-ahead search (instant search)
Finding answers instantly!
ipubmed.ics.uci.edu
Fuzzy search
Advantages of instant fuzzy search
Save time
Fat fingers!
Mobile friendly
Correct errors
Challenges
Speed
“100ms rule”
Prefix matching
Fuzzy matching
Quality
Contributions
Techniques for computing top-k answers in instant fuzzy search without generating all candidates
Ranking framework
Index structures
Algorithms
Experimental evaluation
Outline
Problem Formulation
Instant exact search
Instant fuzzy search
Experiments
Problem Formulation
Data: records
Query: w1, w2, …, wm, where wm is a partial keyword (a prefix)
Answers: k best records
Example query: graph icde li (the last keyword, li, is a prefix)

ID   Record
r0   graph icdm
r1   graph group lui
r2   gray icde liu
r3   graph icde lin lui
r4   graph group icdm lin liu
r5   graph gray gross icdm lin liu
r6   gray group icdm lin liu
r7   gray gross group icde lin
r8   gross icde liu
r9   icdm liu
Ranking Framework
Record: graph, gray, gross, icde, lin, liu
Query: graph icde li
Per-keyword scores:
  graph: Score(graph)
  icde: Score(icde)
  li: Max of Score(lin), Score(liu)
Record score: Aggregate of the per-keyword scores
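The ranking framework can be sketched in Python as below. This is a minimal sketch, not the authors' implementation: the name `record_score`, the keyword weights, and the use of sum as the aggregate are assumptions made for illustration (the slide only says "Aggregate").

```python
def record_score(record_weights, query, aggregate=sum):
    """Score one record for a type-ahead query.

    record_weights: dict mapping each keyword in the record to its weight.
    query: list of keywords; the last one is treated as a prefix.
    aggregate: combines the per-keyword scores (sum is an assumption).
    """
    *complete, prefix = query
    scores = []
    for word in complete:
        # A complete keyword must match exactly; 0 means "not in the record".
        scores.append(record_weights.get(word, 0))
    # The partial keyword takes the best score among its completions (Max).
    scores.append(max(
        (w for kw, w in record_weights.items() if kw.startswith(prefix)),
        default=0,
    ))
    return aggregate(scores)


# Hypothetical weights for the record shown on the slide.
weights = {"graph": 4, "gray": 1, "gross": 2, "icde": 2, "lin": 9, "liu": 3}
print(record_score(weights, ["graph", "icde", "li"]))  # 4 + 2 + max(9, 3) = 15
```

A record that lacks a query keyword simply scores 0 for that keyword here; whether such records are dropped entirely is a policy choice this sketch leaves open.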
Index structures
Trie
[Figure: trie over the keywords graph, gray, gross, group, icde, icdm, lin, liu, lui]
Inverted Index
Records: r0 to r9, the same ID/Record table as on the Problem Formulation slide
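A trie whose keyword nodes carry inverted lists could look roughly like the following. It is an illustrative sketch only: `TrieNode`, `build_index`, and `complete` are hypothetical names, and the inverted lists hold bare record IDs without the weights a real index would store.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.postings = None  # inverted list; set only on keyword nodes


def build_index(records):
    """Build a trie over all keywords; each keyword node holds its inverted list."""
    root = TrieNode()
    for rid, keywords in records.items():
        for kw in keywords:
            node = root
            for ch in kw:
                node = node.children.setdefault(ch, TrieNode())
            if node.postings is None:
                node.postings = []
            node.postings.append(rid)
    return root


def complete(root, prefix):
    """Return {keyword: inverted list} for every keyword starting with prefix."""
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return {}
    found, stack = {}, [(node, prefix)]
    while stack:
        node, word = stack.pop()
        if node.postings is not None:
            found[word] = node.postings
        for ch, child in node.children.items():
            stack.append((child, word + ch))
    return found


# The ten records from the slides.
records = {
    "r0": "graph icdm".split(),
    "r1": "graph group lui".split(),
    "r2": "gray icde liu".split(),
    "r3": "graph icde lin lui".split(),
    "r4": "graph group icdm lin liu".split(),
    "r5": "graph gray gross icdm lin liu".split(),
    "r6": "gray group icdm lin liu".split(),
    "r7": "gray gross group icde lin".split(),
    "r8": "gross icde liu".split(),
    "r9": "icdm liu".split(),
}
root = build_index(records)
print(sorted(complete(root, "li")))   # ['lin', 'liu']
print(complete(root, "li")["lin"])    # ['r3', 'r4', 'r5', 'r6', 'r7']
```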
Basic Solution
Query: {graph, icde, li}, k=1
[Figure: trie lookup for the query; the matched keyword nodes are graph, icde, lin, and liu]
Too many candidates
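Optimization 1: Heap-based Method
[Figure: inverted lists of graph, icde, lin, and liu; the lists of lin and liu feed a Max Heap holding the entries (r, 9) and (r5, 8); GetMax() returns the top entry; the per-keyword results feed Aggregate]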
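One way to picture the heap-based method is a lazy merge of the inverted lists that complete the prefix, so that the union list is never materialized. The sketch below assumes each list is already sorted by descending score; `merged_sorted_access` and the example weights are hypothetical, not taken from the paper.

```python
import heapq


def merged_sorted_access(lists):
    """Lazily yield (record, score) in descending score order from several
    inverted lists, each already sorted by descending score.

    A record found in several lists is reported once, with its highest score
    (the Max over the keywords that complete the prefix).
    """
    heap = []
    for i, lst in enumerate(lists):
        if lst:
            rid, score = lst[0]
            # heapq is a min-heap, so scores are negated to emulate a max heap.
            heapq.heappush(heap, (-score, rid, i, 0))
    seen = set()
    while heap:
        neg_score, rid, i, pos = heapq.heappop(heap)  # GetMax()
        if pos + 1 < len(lists[i]):
            next_rid, next_score = lists[i][pos + 1]
            heapq.heappush(heap, (-next_score, next_rid, i, pos + 1))
        if rid not in seen:
            seen.add(rid)
            yield rid, -neg_score


# Illustrative lists for the two completions of the prefix "li".
lin = [("r3", 9), ("r4", 2)]
liu = [("r5", 8), ("r4", 7), ("r2", 3)]
print(list(merged_sorted_access([lin, liu])))
# [('r3', 9), ('r5', 8), ('r4', 7), ('r2', 3)]
```

Because the function is a generator, a caller that needs only the first few entries never touches the tails of the lists.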
Optimization 2: Top-k List-Merging Algorithm
Example: Threshold algorithm
[Figure: threshold algorithm example with Sorted Access on the lists, Random Access to complete each score, and Early termination; the example shows T = 15 and the values 17, 14, 12, 12]
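The threshold algorithm named on this slide can be sketched as follows. This is the textbook form of the algorithm with sum as the aggregate, written under assumptions: the function name, the example lists, and their scores are made up and do not reproduce the numbers in the slide's figure.

```python
import heapq


def threshold_algorithm(sorted_lists, k):
    """Top-k records by summed score.

    sorted_lists: one list per query keyword of (record, score) pairs sorted
    by descending score; a record absent from a list scores 0 there.
    """
    lookup = [dict(lst) for lst in sorted_lists]  # supports random access
    totals = {}                                   # record -> full score
    longest = max((len(lst) for lst in sorted_lists), default=0)
    for depth in range(longest):
        frontier = []
        for lst in sorted_lists:
            if depth < len(lst):
                rid, score = lst[depth]           # sorted access
                frontier.append(score)
                if rid not in totals:             # random access to other lists
                    totals[rid] = sum(d.get(rid, 0) for d in lookup)
            else:
                frontier.append(0)                # exhausted list contributes 0
        threshold = sum(frontier)                 # best score an unseen record can reach
        top = heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top                            # early termination
    return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])


# Hypothetical per-keyword lists for the query graph, icde, li.
graph = [("r4", 7), ("r3", 4), ("r1", 3), ("r0", 2)]
icde = [("r0", 3), ("r2", 2), ("r3", 2)]
li = [("r3", 9), ("r4", 7), ("r2", 3)]
print(threshold_algorithm([graph, icde, li], 1))  # [('r3', 15)]
```

In this example the loop stops after the second round of sorted accesses, once the best full score (15) is at least the threshold (13).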
Efficient Random Access: How?
ID   “graph”   “icde”   “li”
     7         0        ?
[Figure: trie over the keywords]
ID   Record
r0   graph icdm
r1   graph group lui
r2   gray icde liu
r3   graph icde lin lui
r4   graph group icdm lin liu
…    …
Forward index [Ji et al. WWW’09]
ID   “graph”   “icde”   “li”
     7         0        ?
Each forward-list entry is <Keyword ID, Weight>.
ID   Forward list
r0   <1, 2> <5, 3>
r1   <1, 3> <1, 9> <9, 6>
r2   <2, 9> <5, 2> <8, 3>
r3   <1, 4> <5, 2> <7, 9> <9, 4>
r4   <1, 7> <4, 3> <6, 9> <7, 2> <8, 7>
…    …
[Figure: trie whose leaf nodes carry keyword IDs 1 to 9 and whose nodes carry keyword-ID ranges: [1, 1], [1, 2], [3, 3], [4, 4], [3, 4], [1, 4], [1, 4], [2, 2], [5, 6], [5, 6], [5, 6], [7, 8], [7, 9], [9, 9]]
Random Access Using Forward Index
ID   “graph”   “icde”   “li”
     7         0        ? → 7
[Figure: trie whose leaf nodes carry keyword IDs 1 to 9 and whose nodes carry keyword-ID ranges: [1, 1], [1, 2], [3, 3], [4, 4], [3, 4], [1, 4], [1, 4], [2, 2], [5, 6], [5, 6], [5, 6], [7, 8], [7, 9], [9, 9]]
ID   Forward list
r0   <1, 2> <5, 3>
r1   <1, 3> <1, 9> <9, 6>
r2   <2, 9> <5, 2> <8, 3>
r3   <1, 4> <5, 2> <7, 9> <9, 4>
r4   <1, 7> <4, 3> <6, 9> <7, 2> <8, 7>
…    …
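Random access through a forward index can be sketched as a binary search for the first keyword ID inside the trie node's range, followed by a short scan. The sketch makes several assumptions: `prefix_score` is a hypothetical name, the forward list is sorted by keyword ID, keyword IDs follow alphabetical order (so graph is [1, 1] and icde is [5, 5]), and r4 is used as the example record because its forward list reproduces the 7 and 0 shown in the table.

```python
from bisect import bisect_left


def prefix_score(forward_list, lo, hi):
    """Best weight among keyword IDs in [lo, hi] within one forward list.

    forward_list: (keyword_id, weight) pairs sorted by keyword_id.
    [lo, hi]: keyword-ID range stored on the trie node of the query keyword.
    Returns 0 when the record has no keyword in the range.
    """
    # (lo,) sorts before every (lo, weight), so this finds the first ID >= lo.
    i = bisect_left(forward_list, (lo,))
    best = 0
    while i < len(forward_list) and forward_list[i][0] <= hi:
        best = max(best, forward_list[i][1])
        i += 1
    return best


r4 = [(1, 7), (4, 3), (6, 9), (7, 2), (8, 7)]
print(prefix_score(r4, 1, 1))  # graph -> 7
print(prefix_score(r4, 5, 5))  # icde  -> 0
print(prefix_score(r4, 7, 8))  # li    -> 7, the larger of lin (2) and liu (7)
```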
Outline
Problem Formulation
Instant exact search
Instant fuzzy search
Experiments
Ranking Framework (Fuzzy matching)
Record: graph, gray, icdm, gross, lin, liu
Query: graph icde li
Per-keyword scores:
  graph: Score(graph)
  icde: Sim(icde,icdm)*Score(icdm)
  li: Max of Score(lin), Score(liu), Sim(li,i)*Score(lin)
Record score: Aggregate of the per-keyword scores
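With fuzzy matching, each query keyword takes the best similarity-weighted score among the record keywords similar to it. The sketch below takes the similar keywords and their similarities as given input; the function name, the weights, and the similarity values (including the 0.5 entries) are illustrative assumptions, and sum again stands in for the unspecified aggregate.

```python
def fuzzy_record_score(record_weights, similar, aggregate=sum):
    """Score one record under fuzzy matching.

    record_weights: dict mapping each keyword in the record to its weight.
    similar: one dict per query keyword, mapping each similar keyword to its
             similarity in (0, 1]; an exact match has similarity 1.
    """
    per_keyword = []
    for candidates in similar:
        per_keyword.append(max(
            (sim * record_weights[kw]
             for kw, sim in candidates.items() if kw in record_weights),
            default=0,
        ))
    return aggregate(per_keyword)


# Hypothetical weights and similarities for the query graph, icde, li.
weights = {"graph": 7, "gray": 1, "icdm": 9, "gross": 2, "lin": 2, "liu": 7}
similar = [
    {"graph": 1.0},
    {"icde": 1.0, "icdm": 0.5},
    {"lin": 1.0, "liu": 1.0, "lui": 0.5},
]
print(fuzzy_record_score(weights, similar))  # 7.0 + 4.5 + 7.0 = 18.5
```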
Computing Similar Prefixes [Ji et al. WWW’09]
Query: {graph, icde, li}, similarity threshold τ=0.45
[Figure: trie over the keywords]
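Top-k Algorithm
[Figure: top-k computation for the query graph icde li. Each query keyword has a Max Heap over the inverted lists of its similar keywords (icde and icdm for icde; lin, liu, and lui for li), with list scores multiplied by similarity (×1 or ×0.5). GetMax() is called on each heap, the heap entries shown are (r3, 9), (r5, 8), (r4, 4.5), and (r5, 9), and the per-keyword scores are combined by sum.]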
Efficient Random Access (method 1)
Probing on Forward Lists
ID   “graph”   “icde”   “li”
     7         9        ?
ID   Forward list
r0   <1, 2> <5, 3>
r1   <1, 3> <1, 9> <9, 6>
r2   <2, 9> <5, 2> <8, 3>
r3   <1, 4> <5, 2> <7, 9> <9, 4>
r4   <1, 7> <4, 3> <6, 9> <7, 2> <8, 7>
…    …
Binary Search: [5,6], [7,9], [7,8], [9,9], 7, 8, 9
[Figure: trie whose leaf nodes carry keyword IDs 1 to 9 and whose nodes carry keyword-ID ranges: [1, 1], [1, 2], [3, 3], [4, 4], [3, 4], [1, 4], [1, 4], [2, 2], [5, 6], [5, 6], [5, 6], [7, 8], [7, 9], [9, 9]]
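Method 1 can be read as one binary search on the record's forward list per similar range. The following sketch assumes each similar trie node is given as a keyword-ID range paired with a similarity; `fuzzy_keyword_score`, the choice of r4 as the example record, and the similarity values are assumptions, so the printed numbers need not match the table on the slide.

```python
from bisect import bisect_left


def fuzzy_keyword_score(forward_list, similar_ranges):
    """Best similarity-weighted score of one query keyword in one record.

    forward_list: (keyword_id, weight) pairs sorted by keyword_id.
    similar_ranges: ((lo, hi), similarity) pairs, one per similar trie node.
    """
    best = 0.0
    for (lo, hi), sim in similar_ranges:
        i = bisect_left(forward_list, (lo,))  # first entry with ID >= lo
        while i < len(forward_list) and forward_list[i][0] <= hi:
            best = max(best, sim * forward_list[i][1])
            i += 1
    return best


r4 = [(1, 7), (4, 3), (6, 9), (7, 2), (8, 7)]
# Hypothetical similar ranges for the prefix "li".
li_ranges = [((7, 8), 1.0), ((9, 9), 0.5), ((5, 6), 0.5)]
print(fuzzy_keyword_score(r4, li_ranges))  # 7.0
```

The cost grows with the number of similar ranges, since each range triggers its own binary search.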
Efficient Random Access (method 2)
Probing on Trie Leaf Nodes
ID   “graph”   “icde”   “li”
     7         9        ?
[Figure: trie whose leaf nodes carry keyword IDs 1 to 9 and whose nodes carry keyword-ID ranges: [1,1], [1,2], [3,3], [4,4], [3,4], [1,4], [1,4], [2,2], [5,6], [5,6], [5,6], [7,8], [7,9], [9,9]; five leaf nodes are annotated with a query keyword and similarity: (li, 0.5), (li, 0.5), (li, 1), (li, 1), (li, 0.5)]
ID   Forward list
r0   <1, 2> <5, 3>
r1   <1, 3> <1, 9> <9, 6>
r2   <2, 9> <5, 2> <8, 3>
r3   <1, 4> <5, 2> <7, 9> <9, 4>
r4   <1, 7> <4, 3> <6, 9> <7, 2> <8, 7>
…    …
Traverse the forward list of the record
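Method 2 can be sketched as a single pass over the record's forward list, looking up the annotation stored on each keyword's trie leaf. In this sketch the leaf annotations are a plain dictionary; `probe_leaves` is a hypothetical name, the placement of the five "li" annotations on leaves 5 to 9 is inferred from their order on the slide, and the graph and icde annotations are made up.

```python
def probe_leaves(forward_list, leaf_sims, query):
    """Score each query keyword with one traversal of a forward list.

    forward_list: (keyword_id, weight) pairs of one record.
    leaf_sims: keyword_id -> {query keyword: similarity}, standing in for the
               annotations attached to trie leaf nodes.
    """
    best = {q: 0.0 for q in query}
    for keyword_id, weight in forward_list:
        for q, sim in leaf_sims.get(keyword_id, {}).items():
            if q in best:
                best[q] = max(best[q], sim * weight)
    return best


# Assumed annotations; only the "li" entries mirror labels on the slide.
leaf_sims = {
    1: {"graph": 1.0},
    5: {"icde": 1.0, "li": 0.5},
    6: {"icde": 0.5, "li": 0.5},
    7: {"li": 1.0},
    8: {"li": 1.0},
    9: {"li": 0.5},
}
r4 = [(1, 7), (4, 3), (6, 9), (7, 2), (8, 7)]
print(probe_leaves(r4, leaf_sims, ["graph", "icde", "li"]))
# {'graph': 7.0, 'icde': 4.5, 'li': 7.0}
```

Compared with probing by binary search, the cost here depends on the length of the forward list rather than on the number of similar ranges.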
Optimization by materializing union lists
[Figure: trie over the keywords]
Time/space tradeoff
Cost-based analysis for a space budget
Outline
Problem Formulation
Instant exact search
Instant fuzzy search
Experiments
Data sets and index costs
Exact Search (DBLP)
k=10, similarity threshold τ=0.6
Exact Search (DBLP)
k=10, similarity threshold τ=0.6
Fuzzy Search
DBLP, k=10, similarity threshold τ=0.6
[Figure: chart; legend entries include TA and NRA]
Other results (not included in the paper)
More general ranking (e.g., positional information)
Other languages
Location-based search
Conclusions (ipubmed.ics.uci.edu)
Efficient techniques for instant fuzzy search
Acknowledgements
The authors have financial interest in Bimaple Technology Inc., a company currently commercializing some of the techniques described in this publication.
Chen Li was partially supported by NIH grant 1R21LM010143-01A1.
Guoliang Li, Jiannan Wang, and Jianhua Feng were partly supported by the National Natural Science Foundation of China under Grant No. 61003004, the National Grand Fundamental Research 973 Program of China under Grant No. 2011CB302206, a project of Tsinghua University under Grant No. 20111081073, and the “NExT Research Center” funded by MDA, Singapore, under Grant No. WBS:R-252-300-001-490.