Universität Karlsruhe / SAP AG 2008, In-Memory Search Engines / Peter Sanders, Frederik Transier / 2
Demands and Goals of Text Search Engines
• Locating words in a huge number of documents.
• Important operations: AND queries, phrase queries, document reporting.
• Using main memory for maximum query performance.
Inverted Index
• Unique IDs for:
  • documents
  • terms
• A document becomes a list of term IDs.
• Inverted index: for each term, we store a list of document IDs.
• Positional inverted index: a positions list for each term/document pair.
[Figure: Structure of an Inverted Index. A dictionary with entries such as "hello", "world", and "bye" points to the inverted lists of document IDs.]
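The definitions above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the whitespace tokenizer, and the use of list offsets as document IDs are assumptions made for illustration only.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Return (term_ids, index); index maps term ID -> {doc ID: [positions]}."""
    term_ids = {}               # dictionary: term -> unique term ID
    index = defaultdict(dict)   # one positional inverted list per term ID
    for doc_id, text in enumerate(docs):   # doc ID = list offset (assumption)
        for pos, term in enumerate(text.lower().split()):
            term_id = term_ids.setdefault(term, len(term_ids))
            index[term_id].setdefault(doc_id, []).append(pos)
    return term_ids, index


def inverted_list(term_ids, index, term):
    """Sorted doc IDs containing `term` (the plain, non-positional view)."""
    return sorted(index[term_ids[term]])


docs = ["hello world", "bye world", "hello hello bye"]
term_ids, index = build_positional_index(docs)
print(inverted_list(term_ids, index, "hello"))  # [0, 2]
print(index[term_ids["hello"]][2])              # [0, 1]
```

Dropping the position lists and keeping only the sorted document IDs gives the plain inverted index.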
Intersection Algorithms: Randomized Inverted Indices
• Randomized inverted index: document IDs are assigned randomly (e.g., using a pseudo-random permutation).
• Two-level data structure:
  • Split the range of document IDs into buckets based on their most significant bits.
  • Lookup table: direct access to the first value of a bucket.
[Figure: Inverted list with lookup table, n = 12, B = 4 (lookup table size: log2(n/B)). The sorted list 0000111, 0001111, 0011000, 0011111, 0101000, 0101110, 0110111, 0111101, 1000100, 1001011, 1010100, 1011011 occupies positions 0 to n-1 and is split by its two most significant bits. The lookup table maps the bucket keys 00, 01, 10, 11 to the start positions 0, 4, 8, 12.]
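The two-level structure from the figure can be sketched as follows. The function name `build_lookup_table`, its parameters, and the extra sentinel entry at the end of the table are assumptions for illustration, not part of the slides.

```python
def build_lookup_table(doc_ids, id_bits, key_bits):
    """Split a sorted ID list into buckets keyed by the top `key_bits` bits.

    Returns (table, low): table[k] is the position of the first value of
    bucket k, followed by one sentinel entry; low holds the remaining
    least significant bits of every ID.
    """
    low_bits = id_bits - key_bits
    table = [0] * ((1 << key_bits) + 1)
    for d in doc_ids:
        table[(d >> low_bits) + 1] += 1      # count the values per bucket
    for k in range(1 << key_bits):
        table[k + 1] += table[k]             # prefix sums give start positions
    low = [d & ((1 << low_bits) - 1) for d in doc_ids]
    return table, low


# The twelve 7-bit values from the figure, bucketed by their two top bits.
ids = [0b0000111, 0b0001111, 0b0011000, 0b0011111,
       0b0101000, 0b0101110, 0b0110111, 0b0111101,
       0b1000100, 0b1001011, 0b1010100, 0b1011011]
table, low = build_lookup_table(ids, id_bits=7, key_bits=2)
print(table)  # [0, 4, 8, 12, 12]
```

The sentinel makes `table[k + 1]` a valid end position for the last bucket as well.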
Intersection Algorithms: Randomized Inverted Indices
• Lookup is an intersection algorithm that runs on this data structure (inverted list with lookup table, n = 12):
Function lookup(M, N)
  O := {}                        // output
  i := -1                        // current bucket key (initially a dummy)
  foreach d ∈ M do               // unpack M
    h := d >> k_N                // bucket key
    l := d & (2^{k_N} - 1)       // least significant bits
    if h > i then                // a new bucket
      i := h                     // set current bucket
      j := t[i]                  // get start position
      e := t[i+1]                // get end position
    while j < e do               // bucket not exhausted
      l' := N[j]                 // unpack if necessary
      if l ≤ l' then
        if l = l' then O := O ∪ {d}
        break while
      j++
  return O
Lookup runs in expected time O(m + min{n, Bm}), where m = |M| and n = |N|.
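A runnable transcription of the lookup pseudocode, as a sketch only. The parameter names `table`, `low`, and `low_bits` are assumptions: `table` is the lookup table with one extra sentinel entry (so that `table[i + 1]` always exists), `low` holds the least significant bits of the longer list N, and `low_bits` plays the role of k_N. Both lists are assumed to be sorted.

```python
def lookup(m_list, table, low, low_bits):
    """Intersect the sorted list m_list with a bucketed list (table, low)."""
    out = []
    i = -1                       # current bucket key (initially a dummy)
    j = e = 0                    # scan position and end of current bucket
    mask = (1 << low_bits) - 1
    for d in m_list:
        h = d >> low_bits        # bucket key
        l = d & mask             # least significant bits
        if h > i:                # a new bucket
            i = h
            j = table[i]         # start position of the bucket
            e = table[i + 1]     # end position of the bucket
        while j < e:             # bucket not exhausted
            if l <= low[j]:
                if l == low[j]:
                    out.append(d)
                break            # keep j; the next d may fall in this bucket
            j += 1
    return out


# Data of the figure: 7-bit IDs, 2-bit bucket keys, 5 low bits.
ids = [7, 15, 24, 31, 40, 46, 55, 61, 68, 75, 84, 91]
table = [0, 4, 8, 12, 12]
low = [d & 0b11111 for d in ids]
print(lookup([15, 32, 61, 91, 96], table, low, 5))  # [15, 61, 91]
```

Because `j` only moves forward within a bucket, each element of the shorter list costs a constant amount of work plus the part of one bucket it scans, which is the intuition behind the O(m + min{n, Bm}) bound.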
Experiments: Space Consumption (WT2g)
No significant differences between ∆-bitc. and ∆-escaped for the randomized representation.
∆-bitc. does not work well for the deterministic representation.
Escaping can exploit nonuniformity of the input.
Experiments: Performance of Lookup (WT2g)
bit-compressed ∆-encoding
Large buckets are bad for small ratios, but good for nearly equal lengths.
Bucket size 8 seems to be a good compromise.
Experiments: Performance Comparison (WT2g)
Lookup is best up to a ratio close to one.
Zipper, skipper, and lookup are all very good for lists of similar lengths.
Experiments: Space-time Tradeoff of Encodings (WT2g.s)
The performance loss of ∆-encoding is negligible for the randomized representation.
Escaping requires perceptibly more time.
Clearly different running times for the deterministic representation.
Experiments: Impact of Randomization on Lookup
Randomization gives theoretical performance guarantees, but in practice deterministic data often outperforms randomization.
Lookup is also a good heuristic for non-randomized data.
Suffix Arrays
[Figure: Suffix array of "abracadabra". Sorted suffixes: a, abra, abracadabra, acadabra, adabra, bra, bracadabra, cadabra, dabra, ra, racadabra.]
• Suffix arrays (SAs) as full-text index: concatenate all text documents.
• Phrase search: a home match for SAs.
• AND searches:
  • Search for the terms.
  • Intersect the occurrence lists.
• Document reporting.
• Compressed or distinct SAs are available from the Pizza&Chili website: http://pizzachili.di.unipi.it
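Why phrase search suits suffix arrays can be seen in a small sketch: all occurrences of a pattern form one contiguous range of the sorted suffixes, which two binary searches locate. The function names are hypothetical, and the quadratic-time construction by plain sorting is chosen for brevity; it is not one of the compressed indexes from the Pizza&Chili collection.

```python
def build_suffix_array(text):
    """Start positions of all suffixes of `text`, in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """All start positions of `pattern`, found by binary search on `sa`."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                                  # first suffix >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                                  # first suffix > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


text = "abracadabra"
sa = build_suffix_array(text)
print(sa)                                 # [10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2]
print(find_occurrences(text, sa, "abra"))  # [0, 7]
```

An AND search, by contrast, needs one such range per term and then an intersection of the resulting occurrence lists, which is extra work compared with an inverted index.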
Performance of different SAs on WT2g 1-50000
                    | AF-index | SSA2  | CCSA  | CSA
size [MB]           | 279.3    | 302.6 | 500.9 | 230.9
compression         | 0.77     | 0.84  | 1.39  | 0.64
indexing time [min] | 23.0     | 8.7   | 11.5  | 9.3
peak mem usage [GB] | 3.1      | 2.1   | 3.1   | 3.2
[Figure: two panels, "AND" and "phrase".]
Modular Design of the Compressed Inverted Index (CII)
[Figure: Modular design. Modules: preprocessor, dictionary, document-grained inverted index, positional index, bags of words, text delta. Data flows: documents; normalized documents (term ID / doc ID / pos); terms; differences between documents and their normalized versions.]
CII: Document-grained Inverted Index
[Figure: Document-grained inverted index. Labels: directly stored doc IDs; Golomb-coded list of doc ID deltas (max. K = 128 values); two-level lookup lists (> K values).]
Document-grained Index: Lookup List
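[Figure: Inverted list with lookup table, n = 12, B = 4 (lookup table size: log2(n/B)). Same list as before: 0000111, 0001111, 0011000, 0011111, 0101000, 0101110, 0110111, 0111101, 1000100, 1001011, 1010100, 1011011. The lookup table now stores a rank and a position per bucket key: 00 → rank 0, pos 0; 01 → rank 4, pos 28; 10 → rank 8, pos 56; 11 → rank 12, pos 84.]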
• Two-level data structure, a refinement of [Sanders and Transier, 2007]:
  • Split the range of document IDs into buckets based on their most significant bits.
  • Lookup table: direct access to the first value of a bucket.
  • Rank information: the number of values smaller than the contents of a bucket.
• Variable Golomb coding: estimating the average of each bucket.
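To make the Golomb coding step concrete, here is a hedged sketch of Golomb coding applied to doc ID deltas. The bit strings of '0'/'1' characters, all function names, and the rule of thumb m ≈ 0.69 × (average delta) for choosing the parameter are assumptions for illustration; the slides only state that the parameter is derived from an estimated average per bucket.

```python
def golomb_encode(n, m):
    """Golomb code of n >= 0 with parameter m >= 1, as a '0'/'1' string."""
    q, r = divmod(n, m)
    bits = "1" * q + "0"                  # quotient in unary
    b = (m - 1).bit_length()              # ceil(log2(m))
    cutoff = (1 << b) - m                 # remainder in truncated binary
    if r < cutoff:
        bits += format(r, "b").zfill(b - 1)
    elif b > 0:
        bits += format(r + cutoff, "b").zfill(b)
    return bits


def golomb_decode(bits, m, count):
    """Decode `count` values from a string produced by golomb_encode."""
    b = (m - 1).bit_length()
    cutoff = (1 << b) - m
    values, i = [], 0
    for _ in range(count):
        q = 0
        while bits[i] == "1":             # unary quotient
            q += 1
            i += 1
        i += 1                            # skip the terminating '0'
        r = 0
        if b > 0:
            r = int(bits[i:i + b - 1] or "0", 2)
            i += b - 1
            if r >= cutoff:               # long codeword: one more bit
                r = 2 * r + int(bits[i]) - cutoff
                i += 1
        values.append(q * m + r)
    return values


def choose_parameter(gaps):
    """Rule-of-thumb parameter from the average gap (an assumption)."""
    return max(1, round(0.69 * sum(gaps) / len(gaps)))


def encode_doc_ids(doc_ids, m):
    """Encode a strictly increasing ID list as gaps minus one."""
    gaps = [doc_ids[0]] + [b - a - 1 for a, b in zip(doc_ids, doc_ids[1:])]
    return "".join(golomb_encode(g, m) for g in gaps)


def decode_doc_ids(bits, m, count):
    ids, prev = [], -1
    for g in golomb_decode(bits, m, count):
        prev += g + 1
        ids.append(prev)
    return ids


ids = [7, 15, 24, 31, 40, 46, 55, 61, 68, 75, 84, 91]
code = encode_doc_ids(ids, 5)
print(len(code), decode_doc_ids(code, 5, len(ids)) == ids)
```

Choosing the parameter per bucket rather than per list lets dense and sparse regions of the same list each get a suitable code length.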
CII: Positional Index
[Figure: Positional index. Labels: directly stored positions; Golomb-coded list of doc ID deltas; single doc; multi doc / single pos; multi doc / multi pos; two-level list (top level indexed by ranks); bit-compressed lists (indexed by ranks).]
Querying the Compressed Inverted Index
• AND query: intersect the inverted lists in increasing order of list length.
• Phrase query:
  • Proceed as for AND queries, but keep track of the current ranks.
  • Retrieve the corresponding position lists.
  • Check the positions.
[Figure: nested pairwise intersections, ((( ∩ ) ∩ ) ∩ ).]
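The query steps above can be sketched as follows. This is an illustration under simplifying assumptions, not the CII implementation: the pairwise intersection uses a hash set instead of the lookup algorithm, and position lists are fetched from a dict keyed by doc ID rather than by rank. The names `and_query` and `phrase_query` are hypothetical.

```python
def and_query(lists):
    """Intersect sorted doc ID lists, shortest list first."""
    ordered = sorted(lists, key=len)
    result = ordered[0]
    for other in ordered[1:]:
        members = set(other)              # stand-in for the lookup algorithm
        result = [d for d in result if d in members]
    return result


def phrase_query(position_lists):
    """position_lists[k] maps doc ID -> positions of the k-th phrase term."""
    candidates = and_query([sorted(p) for p in position_lists])
    hits = []
    for d in candidates:
        starts = set(position_lists[0][d])
        for offset, plist in enumerate(position_lists[1:], start=1):
            # the k-th term must occur exactly k positions after the start
            starts &= {p - offset for p in plist[d]}
        if starts:
            hits.append(d)
    return hits


# Documents: 0 = "hello world", 1 = "bye world", 2 = "hello hello bye"
hello = {0: [0], 2: [0, 1]}
world = {0: [1], 1: [1]}
bye = {1: [0], 2: [2]}
print(and_query([sorted(hello), sorted(bye)]))  # [2]
print(phrase_query([hello, world]))             # [0]
print(phrase_query([hello, bye]))               # [2]
```

Starting with the shortest list keeps every intermediate result at most as long as that list, so later intersections stay cheap.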
CII: Document Reporting with Text Delta
[Figure: Reporting document 5. The document index, the positional index, and the bag of words yield the list of normalized terms. The text delta (escape dictionary, list of term escapes, list of term separators) turns it into the list of original terms. Example escape dictionary entries: "All chars to uppercase", "First char to uppercase". Example separator annotations: "Next item a term", "Next 8 terms are separated by spaces".]
Experiments: Test System
• We have implemented all algorithms in C++.
• One core of an Intel Core 2 Duo E6600, clocked at 2.4 GHz, with 2 x 2 MB L2 cache and 4 GB main memory.
• openSUSE 10.2 (kernel 2.6.18), gcc 4.1.2 (-O3).
• Timing with PAPI 3.5.0.
Experiments: Test Data
• Real-world instance: the first 50,000 documents of WT2g.
• Pseudo real-world queries:
  • AND and phrase queries.
  • Selecting random hits.
  • Query lengths: 1-10 terms.
Indexing: Space requirements
Input size (normalized) [MB]: 412.8 (360.5)
                      | CSA      | CII
dictionary [MB]       |          | 23.9
document index [MB]   |          | 32.4
positional index [MB] |          | 126.3
bag of words [MB]     |          | 25.1
suffix array [MB]     | 230.8    |
doc bounds [MB]       | 0.1      |
sum [MB]              | 230.9    | 206.1
text delta [MB]       |          | 108.7
sum + text delta [MB] |          | 314.8
compression           | - (0.64) | 0.76 (0.57)
indexing time [min]   | - (9.3)  | 5.6 (5.1)
peak mem usage [GB]   | 3.2      | 0.7
Experiments: Average AND Query Time
3-4 orders of magnitude slower than CII.
Average < 1 ms for all lengths.
[Figure: x-axis: query lengths; y-axis: time [ms].]
Experiments: What Percentage of AND Queries Take Longer than t?
35 s in the worst case.
All queries are done in less than 3.6 ms.
Query length of 2 terms.
[Figure: x-axis: time [ms]; y-axis: % queries to be answered.]
Experiments: Average Phrase Query Time
CSA is marginally faster (factor 0.94 to 0.62, < 1 ms absolute).
CII is more than 20 times faster than CSA.
[Figure: x-axis: query lengths; y-axis: time [ms].]
Experiments: What Percentage of Phrase Queries Take Longer than t?
t < 190 ms for all queries.
Query length of 2 terms.
A factor of 24 apart.
[Figure: x-axis: time [ms]; y-axis: % queries to be answered.]
Experiments: Document Reporting
6-8 MB/s.
The bandwidth of CSA is about 5 times smaller.
Assuming a disk access latency of 5 ms, loading from disk would be faster for documents > 32 KB.
[Figure: x-axis: text size [KB]; y-axis: data rate [MB/s].]
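The break-even size can be checked with a short calculation. It is a rough sketch under stated assumptions: in-memory reporting time is taken as size divided by data rate, the disk contributes only its access latency (transfer time is ignored), and the function name is hypothetical.

```python
def break_even_size_kb(latency_ms, rate_mb_per_s):
    """Size whose in-memory reporting time equals one disk access latency.

    1 MB/s is the same as 1 KB/ms, so the product is already in KB.
    """
    return latency_ms * rate_mb_per_s


for rate in (6.0, 6.4, 8.0):    # data rates in MB/s
    print(rate, break_even_size_kb(5.0, rate))
```

With 5 ms latency, the quoted 6-8 MB/s range gives break-even sizes of about 30 to 40 KB, and 32 KB corresponds to a rate of 6.4 MB/s.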
Outlook & Future Work
• Bag of words: adaptive coding scheme?
• Compression of the dictionary.
• Speeding up the most expensive phrase queries.
• Construction time?
• Fast updates?
Conclusion
Thank you for your attention!
CII on the complete WT2g / WT2g.s
AND: Short queries are slower on WT2g.s, as there are more results. Long queries are faster, as they benefit from more lookup lists.
Phrase: Phrase queries are faster on WT2g.s, because the position lists are shorter.
Copyright 2007 SAP AG. All Rights Reserved
No part of this publication may be reproduced or transmitted in any form or for any purpose without the express permission of SAP AG. The information contained herein may be changed without prior notice.
Some software products marketed by SAP AG and its distributors contain proprietary software components of other software vendors.
Microsoft, Windows, Outlook, and PowerPoint are registered trademarks of Microsoft Corporation.
IBM, DB2, DB2 Universal Database, OS/2, Parallel Sysplex, MVS/ESA, AIX, S/390, AS/400, OS/390, OS/400, iSeries, pSeries, xSeries, zSeries, z/OS, AFP, Intelligent Miner, WebSphere, Netfinity, Tivoli, and Informix are trademarks or registered trademarks of IBM Corporation.
Oracle is a registered trademark of Oracle Corporation.
UNIX, X/Open, OSF/1, and Motif are registered trademarks of the Open Group.
Citrix, ICA, Program Neighborhood, MetaFrame, WinFrame, VideoFrame, and MultiWin are trademarks or registered trademarks of Citrix Systems, Inc.
HTML, XML, XHTML and W3C are trademarks or registered trademarks of W3C®, World Wide Web Consortium, Massachusetts Institute of Technology.
Java is a registered trademark of Sun Microsystems, Inc.
JavaScript is a registered trademark of Sun Microsystems, Inc., used under license for technology invented and implemented by Netscape.
MaxDB is a trademark of MySQL AB, Sweden.
SAP, R/3, mySAP, mySAP.com, xApps, xApp, SAP NetWeaver, and other SAP products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of SAP AG in Germany and in several other countries all over the world. All other product and service names mentioned are the trademarks of their respective companies. Data contained in this document serves informational purposes only. National product specifications may vary.
The information in this document is proprietary to SAP. No part of this document may be reproduced, copied, or transmitted in any form or for any purpose without the express prior written permission of SAP AG.
This document is a preliminary version and not subject to your license agreement or any other agreement with SAP. This document contains only intended strategies, developments, and functionalities of the SAP® product and is not intended to be binding upon SAP to any particular course of business, product strategy, and/or development. Please note that this document is subject to change and may be changed by SAP at any time without notice.
SAP assumes no responsibility for errors or omissions in this document. SAP does not warrant the accuracy or completeness of the information, text, graphics, links, or other items contained within this material. This document is provided without a warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability, fitness for a particular purpose, or non-infringement.
SAP shall have no liability for damages of any kind including without limitation direct, special, indirect, or consequential damages that may result from the use of these materials. This limitation shall not apply in cases of intent or gross negligence.
The statutory liability for personal injury and defective products is not affected. SAP has no control over the information that you may access through the use of hot links contained in these materials and does not endorse your use of third-party Web pages nor provide any warranty whatsoever relating to third-party Web pages.