
DD2476: Lecture 3 Large indexes



DD2476: Lecture 3

• Today: More indexing issues

– Large indexes

– Distributed indexing

– Dynamic indexing

– Index compression

Large indexes

• The web is big:

– 1998: 26 million unique web pages

– 2008: 1 000 000 000 000 (1 trillion) unique web pages!

• What if the index is too large to fit in main memory?

– Dictionary in main memory, postings on disk

– Dictionary (partially) on disk, postings on disk

– Index on several disks (cluster)

[Figure: dictionary terms friend, roman and countryman kept in main memory, each pointing to its postings list on disk (docIDs such as 1, 10, 12, 16, 82).]

Mixed MM and disk storage

• Tree: Nodes below a certain depth stored on disk

• Hashtable: All postings put on disk, hash keys in MM (see the sketch below)

• A distributed hash table allows keys and postings to be distributed across a large number of computers
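To make the memory/disk split concrete, here is a minimal Java sketch (not from the lecture) of the hashtable variant: the dictionary is a hash map from term to file offset kept in main memory, and each postings list is read from disk on demand. The assumed file layout (an int count followed by the docIDs) and the class name are illustrative choices only.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

// Sketch: dictionary (term -> byte offset) in main memory, postings on disk.
// Assumed file layout per list: [int n][int docID_1]...[int docID_n].
public class OnDiskPostings {
    private final Map<String, Long> dictionary = new HashMap<>();
    private final RandomAccessFile postingsFile;

    public OnDiskPostings(String path) throws IOException {
        postingsFile = new RandomAccessFile(path, "r");
        // Loading the term -> offset map from disk at startup is omitted here.
    }

    public int[] getPostings(String term) throws IOException {
        Long offset = dictionary.get(term);        // in-memory lookup
        if (offset == null) return new int[0];
        postingsFile.seek(offset);                 // one disk seek per term
        int n = postingsFile.readInt();
        int[] docIDs = new int[n];
        for (int i = 0; i < n; i++) docIDs[i] = postingsFile.readInt();
        return docIDs;
    }
}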

Hardware basics

• Access to data in memory is much faster than access

to data on disk.

• Disk seeks: No data is transferred from disk while the

disk head is being positioned.

– Therefore: Transferring one large chunk of data from disk

to memory is faster than transferring many small chunks.

– Disk I/O is block-based: Reading and writing of entire

blocks (as opposed to smaller chunks).

– Block sizes: 8 KB to 256 KB.


Hardware assumptions

• In the book:

statistic                                            value
average seek time                                    5 ms = 5 × 10^−3 s
transfer time per byte                               0.02 μs = 2 × 10^−8 s
processor’s clock rate                               10^9 s^−1 (1 GHz)
low-level operation (e.g., compare & swap a word)    0.01 μs = 10^−8 s
size of main memory                                  several GB
size of disk space                                   1 TB or more

Basic indexing

• Term-document pairs are collected when

documents are parsed

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.

Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Term Doc #

I 1

did 1

enact 1

julius 1

caesar 1

I 1

was 1

killed 1

i' 1

the 1

capitol 1

brutus 1

killed 1

me 1

so 2

let 2

it 2

be 2

with 2

caesar 2

the 2

noble 2

brutus 2

hath 2

told 2

you 2

caesar 2

was 2

ambitious 2

Sorting step

• The list of term-doc pairs

is sorted

• This must be done on disk

for large lists

• Goal: Minimize the number

of disk seeks

Term Doc #

I 1

did 1

enact 1

julius 1

caesar 1

I 1

was 1

killed 1

i' 1

the 1

capitol 1

brutus 1

killed 1

me 1

so 2

let 2

it 2

be 2

with 2

caesar 2

the 2

noble 2

brutus 2

hath 2

told 2

you 2

caesar 2

was 2

ambitious 2

Term Doc #

ambitious 2

be 2

brutus 1

brutus 2

capitol 1

caesar 1

caesar 2

caesar 2

did 1

enact 1

hath 2

I 1

I 1

i' 1

it 2

julius 1

killed 1

killed 1

let 2

me 1

noble 2

so 2

the 1

the 2

told 2

you 2

was 1

was 2

with 2

A bottleneck

• Say we want to sort 100,000,000 term-doc pairs.

• A list can be sorted in N log₂ N comparison operations.

– How much time does that take? (assume 10^−8 s per operation)

• Suppose that each comparison additionally took 2 disk seeks

– How much time? (assuming 5 × 10^−3 s per disk seek)
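A rough back-of-the-envelope answer: N log₂ N ≈ 10^8 × 26.6 ≈ 2.7 × 10^9 comparisons, which at 10^−8 s each is roughly 27 s. If every comparison also required 2 disk seeks at 5 × 10^−3 s each, the same sort would instead take about 2.7 × 10^9 × 10^−2 s ≈ 2.7 × 10^7 s, on the order of 300 days, which is why the sort must be organized to avoid random disk access.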


Scaling index construction

• In-memory index construction does not scale.

• How can we construct an index for very large collections?

– As we build the index, we parse docs one at a time.

– The final postings for any term are incomplete until the end.

– At 12 bytes per non-positional postings entry (term, doc, freq), this demands a lot of space for large collections.

– Swedish Wikipedia has > 185,000,000 tokens

– Can be done in-memory in 2012, but real collections are much larger

– Thus: We need to store intermediate results on disk

Blocked sort-based indexing

• 12-byte (4+4+4) records (term, doc, freq).

– Define a Block ~ 10M of such records

– Accumulate postings for each block, sort, write to disk.

– Then merge the blocks into one long sorted order.

Sorting 10 blocks of 10M records

• First, read each block and sort within:

– Quicksort takes N ln N expected steps

– In our case (10M) ln (10M) steps

• Exercise: estimate total time to read each block from disk and quicksort it.

– assuming a transfer time of 2 × 10^−8 s per byte

• 10 times this estimate gives us 10 sorted runs of 10M records each.

Blocked sort-based indexing
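The slide here presumably shows the book's BSBI pseudocode (Figure 4.2), which is not reproduced. As a stand-in, the following is a rough Java sketch of the same idea under some simplifying assumptions: each (termID, docID) record is packed into one long, a block is sorted in memory and written to disk as a run, and the runs are finally merged with a priority queue.

import java.io.*;
import java.util.*;

// Rough sketch of blocked sort-based indexing (BSBI). Each (termID, docID)
// record is packed into one long, (termID << 32) | docID, assuming both IDs
// are non-negative ints, so sorting the longs sorts by termID, then docID.
public class BSBISketch {

    // Sort one block of records in memory and write it to disk as a sorted run.
    static void writeSortedRun(long[] block, int n, String runFile) throws IOException {
        Arrays.sort(block, 0, n);
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(runFile)))) {
            for (int i = 0; i < n; i++) out.writeLong(block[i]);
        }
    }

    // One open run and its current (not yet consumed) record.
    static class Run {
        final DataInputStream in;
        long current;
        Run(String file) throws IOException {
            in = new DataInputStream(new BufferedInputStream(new FileInputStream(file)));
            current = in.readLong();
        }
        boolean advance() throws IOException {   // false when the run is exhausted
            try { current = in.readLong(); return true; }
            catch (EOFException eof) { in.close(); return false; }
        }
    }

    // Merge all sorted runs into one long sorted file of (termID, docID) records.
    static void mergeRuns(List<String> runFiles, String outFile) throws IOException {
        PriorityQueue<Run> heap =
                new PriorityQueue<>(Comparator.comparingLong((Run r) -> r.current));
        for (String f : runFiles) heap.add(new Run(f));
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(outFile)))) {
            while (!heap.isEmpty()) {
                Run r = heap.poll();
                out.writeLong(r.current);        // smallest remaining record
                if (r.advance()) heap.add(r);    // refill the heap from the same run
            }
        }
    }
}

In the book's version the merge also collapses consecutive records with the same termID into a postings list; that final step is omitted here.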


From BSBI to SPIMI

• BSBI requires that the dictionary can be kept in main

memory

• Alternative approach: Construct several separate

indexes and merge them

– Generate separate dictionaries for each block – no need to

maintain term-termID mapping across blocks.

– No need to keep dictionary in main memory

– Accumulate postings directly in postings list (as in lab 1).

• This is called SPIMI – Single-Pass In-Memory Index

construction (Figure 4.4)

SPIMI-invert
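Figure 4.4 (SPIMI-Invert) is likewise not reproduced. The sketch below illustrates the single-pass idea under the same caveats as the BSBI sketch: postings are appended directly to per-term lists in a block-local dictionary, and the block is written to disk, terms in sorted order, whenever it grows too large. File names and the text output format are arbitrary choices for the example.

import java.io.*;
import java.util.*;

// Rough sketch of SPIMI: one pass over the token stream, postings appended
// directly to per-term lists in a block-local dictionary, so no global
// term-termID mapping is needed. A TreeMap keeps the block's terms sorted.
public class SPIMISketch {
    private TreeMap<String, List<Integer>> dictionary = new TreeMap<>();
    private int postingsInBlock = 0;
    private int blockNo = 0;
    private final int maxPostingsPerBlock;       // stands in for "memory is full"

    public SPIMISketch(int maxPostingsPerBlock) {
        this.maxPostingsPerBlock = maxPostingsPerBlock;
    }

    // Called for every (term, docID) pair produced by the parser.
    public void add(String term, int docID) throws IOException {
        dictionary.computeIfAbsent(term, t -> new ArrayList<>()).add(docID);
        if (++postingsInBlock >= maxPostingsPerBlock) writeBlock();
    }

    // Write the current block to disk, terms in sorted order, then start over
    // with a new, empty in-memory dictionary.
    public void writeBlock() throws IOException {
        try (PrintWriter out = new PrintWriter(new BufferedWriter(
                new FileWriter("spimi_block_" + (blockNo++) + ".txt")))) {
            for (Map.Entry<String, List<Integer>> e : dictionary.entrySet())
                out.println(e.getKey() + " " + e.getValue());
        }
        dictionary = new TreeMap<>();
        postingsInBlock = 0;
    }
    // Final step (not shown): merge the block files, concatenating the postings
    // lists of equal terms, much as in the BSBI merge above.
}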

Distributed indexing

• For web-scale indexing, we must use a distributed

computing cluster

• Individual machines are fault-prone

– Can unpredictably slow down or fail

• Google (2007): estimated total of 1 million servers, 3

million processors/cores

• Estimate: Google installs 100,000 servers each quarter.

– Based on expenditures of 200–250 million dollars per year

– This would be 10% of the computing capacity of the world!?!

Distributed indexing

• Use two classes of parallel tasks

– Parsing

– Inverting

• Input is broken into splits

– each split is a subset of documents

• Master computer assigns a split to an idle machine

– Parser will read a document and sort (t,d) pairs

– Inverter will merge, create and write postings


Distributed indexing: Architecture

• Uses an instance of MapReduce

– general architecture for distributed computing

– simple programming model

– manages interactions among clusters of cheap commodity

servers

• Uses Key-Value pairs as primary object of

computation

• An open-source implementation is “Hadoop” by

apache.org

Distributed indexing: Architecture

[Figure: MapReduce architecture for index construction. The master assigns splits of the collection to idle parsers (map phase); each parser writes its (term, docID) pairs into segment files partitioned by term range (a-f, g-p, q-z). In the reduce phase, each inverter collects one term partition from all segment files and writes the corresponding postings.]

MapReduce

• Map takes an input key-value pair and produces a set of intermediate key-value pairs

• Reduce takes an intermediate key and a set of values, and produces another set of values

Example: word count

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    Emit(w, "1")

reduce(String key, Iterator values):
  // key: a word
  // values: a list of "1"s
  int result = 0
  for each v in values:
    result += ParseInt(v)
  Emit(result.toString())

Index construction with MapReduce

[Figure: Doc 1 ("I did enact Julius Caesar ...") and Doc 2 ("So let it be with Caesar ...") are fed to map, which produces (I:d1, did:d1, …, caesar:d1, …, caesar:d2, …); reduce then produces (caesar:(d1,d2), julius:(d1), …).]


Index construction with MapReduce

[Figure: the same Doc 1 / Doc 2 example as above: map produces (I:d1, did:d1, …, caesar:d1, …, caesar:d2, …), and reduce produces (caesar:(d1,d2), julius:(d1), …).]

map(String key, String value):
  // key: document name (docID)
  // value: document contents
  for each word w in value:
    Emit(w, key)

reduce(String key, Iterator values):
  // key: a word
  // values: a list of docIDs
  Emit(key, values.removeDuplicates())

Exercise

• Web documents contain links to one another

• If docs A,B and C link to D, then D’s list of backlinks is

(A,B,C).

• How can we compute the list of backlinks for each

document using the MapReduce framework?

• Give pseudo-code for the Map and Reduce functions. (One possible solution is sketched below.)
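One possible solution, sketched as plain Java rather than code for a real MapReduce framework (the BiConsumer callback stands in for Emit): map emits one (target, source) pair per outgoing link, and reduce collects, for each target, the list of sources that emitted it, which is exactly its backlink list.

import java.util.*;
import java.util.function.BiConsumer;

// Sketch of the backlinks exercise.
public class Backlinks {

    // map: input is one document and the links it contains.
    static void map(String docID, List<String> outgoingLinks,
                    BiConsumer<String, String> emit) {
        for (String target : outgoingLinks) {
            emit.accept(target, docID);          // emit (target, source) pairs
        }
    }

    // reduce: all sources grouped under the same target form its backlink list.
    static void reduce(String target, Iterable<String> sources,
                       BiConsumer<String, List<String>> emit) {
        List<String> backlinks = new ArrayList<>();
        for (String s : sources) backlinks.add(s);
        emit.accept(target, backlinks);
    }
}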

Dynamic indexing

• Up to now, we have assumed that collections are

static.

• They rarely are:

– Documents come in over time and need to be inserted.

– Documents are deleted and modified.

• This means that the dictionary and postings lists have

to be modified:

– Postings updates for terms already in dictionary

– New terms added to dictionary

Simplest approach

• Maintain “big” main index

• New docs go into “small” auxiliary index

• Search across both, merge results (see the sketch below)

• Deletions

– Invalidation bit-vector for deleted docs

– Filter docs output on a search result by this invalidation

bit-vector

• Periodically, re-index into one main index
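A minimal sketch of this scheme; the Index interface and the class name are assumptions made for the example. Queries are answered from both the big main index and the small auxiliary index, the results are merged, and docIDs whose bit is set in the invalidation bit vector are filtered out.

import java.util.*;

// Sketch: main + auxiliary index with an invalidation bit vector for deletions.
public class MainPlusAuxIndex {
    interface Index { List<Integer> search(String term); }   // returns docIDs

    private final Index mainIndex;
    private final Index auxIndex;                  // new documents go here
    private final BitSet deleted = new BitSet();   // invalidation bit vector

    MainPlusAuxIndex(Index mainIndex, Index auxIndex) {
        this.mainIndex = mainIndex;
        this.auxIndex = auxIndex;
    }

    void delete(int docID) { deleted.set(docID); }

    List<Integer> search(String term) {
        // Search across both indexes, merge, and filter out deleted documents.
        List<Integer> result = new ArrayList<>(mainIndex.search(term));
        result.addAll(auxIndex.search(term));
        result.removeIf(deleted::get);
        return result;
    }
    // Periodically, main and auxiliary are re-indexed into one new main index.
}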


Logarithmic merge

• Maintain a series of indexes, each twice as large as

the previous one.

• Keep smallest (Z0) in memory

• Larger ones (I0, I1, …) on disk

• If Z0 gets too big (> n), write it to disk as I0

– or merge it with I0 (if I0 already exists) to form Z1

• Then either write Z1 to disk as I1 (if there is no I1 yet)

– or merge it with I1 to form Z2

• etc. (see the sketch below)
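A minimal sketch of the merge cascade, with an index represented simply as a sorted map from term to postings list and the on-disk generations I0, I1, … abstracted as a list of such maps; "size" is measured loosely as the number of terms. The structure of the cascade, not the storage details, is the point.

import java.util.*;

// Sketch of logarithmic merging. Slot i of 'generations' holds Ii (an index of
// size roughly n * 2^i), or null if that slot is currently empty.
public class LogarithmicMerge {
    private final int n;                                          // max size of Z0
    private TreeMap<String, List<Integer>> z0 = new TreeMap<>();  // in-memory index
    private final List<TreeMap<String, List<Integer>>> generations = new ArrayList<>();

    LogarithmicMerge(int n) { this.n = n; }

    void add(String term, int docID) {
        z0.computeIfAbsent(term, t -> new ArrayList<>()).add(docID);
        if (z0.size() > n) flush();
    }

    // If Z0 gets too big: store it as I0, or merge with the existing I0 to form
    // Z1, which is then stored as I1 or merged with I1 to form Z2, and so on.
    private void flush() {
        TreeMap<String, List<Integer>> z = z0;
        z0 = new TreeMap<>();
        for (int i = 0; ; i++) {
            if (generations.size() <= i) generations.add(null);
            if (generations.get(i) == null) {                     // free slot: store Ii
                generations.set(i, z);
                return;
            }
            z = merge(generations.get(i), z);                     // form Z(i+1)
            generations.set(i, null);                             // Ii has been consumed
        }
    }

    private static TreeMap<String, List<Integer>> merge(
            TreeMap<String, List<Integer>> a, TreeMap<String, List<Integer>> b) {
        TreeMap<String, List<Integer>> out = new TreeMap<>();
        a.forEach((t, p) -> out.put(t, new ArrayList<>(p)));
        b.forEach((t, p) -> out.computeIfAbsent(t, x -> new ArrayList<>()).addAll(p));
        return out;
    }
}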

Dynamic indexing at search engines

• All the large search engines now do dynamic

indexing

• Their indices have frequent incremental changes

– News items, blogs, new topical web pages

• Whitney Houston, Håkan Juholt, …

• But (sometimes/typically) they also periodically

reconstruct the index from scratch

– Query processing is then switched to the new index, and

the old index is then deleted

Why compression (in general)?

• Use less disk space

– Saves a little money

• Keep more stuff in memory

– Increases speed

• Increase speed of data transfer from disk to memory

– [read compressed data | decompress] is faster than

[read uncompressed data]

– Premise: Decompression algorithms are fast

• True of the decompression algorithms we use

Why compression for indexes?

• Dictionary

– Make it small enough to keep in main memory

– Make it so small that you can keep some postings lists in main memory too

• Postings file(s)

– Reduce disk space needed

– Decrease time needed to read postings lists from disk

– Large search engines keep a significant part of the postings in memory.

• Compression lets you keep more in memory

• The book presents various IR-specific compression schemes


Lossless vs lossy compression

• Lossless compression: All information is preserved.

– What we mostly do in IR.

• Lossy compression: Discard some information

• Several of the preprocessing steps can be viewed as

lossy compression: case folding, stop words,

stemming, number elimination.

Vocabulary vs collection size

• How many distinct words are there?

• Heaps’ Law: M = kT^b

M = vocabulary size, T = number of tokens

• Typically 30 ≤ k ≤ 100, b ≈ 0.5

– RCV1 corpus used in the book has 38,365 terms on first

1,000,000 tokens, and k=44.

– Last year’s course corpus: 38,220 terms on 684,540 tokens,

k ≈ 46

– However, Swedish Wikipedia has 3,266,333 terms on

185,546,616 tokens, k ≈ 237
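As a rough check of the RCV1 numbers above: with the book's fitted parameters k = 44 and b ≈ 0.49, Heaps' law gives M = 44 × 1,000,000^0.49 ≈ 38,000 terms after the first million tokens, close to the 38,365 terms actually observed.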

Swedish Wikipedia

[Figure: vocabulary growth for Swedish Wikipedia – number of terms plotted against number of tokens.]

Distribution of terms

• Zipf’s law describes the relative frequency of terms:

cf_i = K/i

• cf_i is the collection frequency: the number of occurrences of the term t_i in the collection

• Consequence: If the most frequent term (the) occurs

cf1 times

– then the second most frequent term (of) occurs cf1/2 times

– the third most frequent term (and) occurs cf1/3 times …


Swedish Wikipedia

i 3648969

quot 3460170

och 2878013

lt 2411738

gt 2365764

a 2344229

en 2169456

av 1718654

kategori 1613462

som 1518518

är 1232640

att 1215843

på 1174394

ö 1138493

de 1127375

den 1071467

till 1052663

med 1029549

för 975414

ref 842771

Swedish Wikipedia

• Top 30 terms account for 41,238,989 tokens

– about 22% of all tokens

Dictionary compression

• Why compress the dictionary?

– Search begins with the dictionary

– We want to keep it in memory

– Memory footprint competition with other applications

– Embedded/mobile devices may have very little memory

– Even if the dictionary isn’t in memory, we want it to be

small for a fast search startup time

– So, compressing the dictionary is important

First idea

• Array of fixed-width entries

– ~400,000 terms; 28 bytes/term = 11.2 MB

Terms (20 bytes)   Freq. (4 bytes)   Postings ptr. (4 bytes)
a                  656,265
aachen             65
….                 ….
zulu               221

[Figure: the array is accessed through a dictionary search structure; each postings pointer points into the postings file.]


Compressing the term list

• Store dictionary as a (long) string of characters:

– The pointer to the next word marks the end of the current word

– Hope to save up to 60% of dictionary space

….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo….

[Figure: a table with columns Freq. (e.g. 33, 29, 44, 126), Postings ptr. and Term ptr.; each term pointer points into the string above.]

– Total string length = 400K × 8 B = 3.2 MB

– Pointers must resolve 3.2M positions: log₂ 3.2M ≈ 22 bits ≈ 3 bytes per term pointer

Blocking

• Store pointers to every kth term string.

– Example below: k = 4

• Need to store term lengths (1 extra byte)

….7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo….

[Figure: as before, a table with columns Freq. (33, 29, 44, 126), Postings ptr. and Term ptr., but now only one term pointer per block of 4 terms.]

– Save 9 bytes on 3 pointers; lose 4 bytes on term lengths.

Front coding

• Front-coding:

– Sorted words commonly have long common prefix

– store differences only

– (for last k-1 in a block of k)

8automata8automate9automatic10automation → 8automat*a1◊e2◊ic3◊ion

(automat* encodes the shared prefix automat; each number gives the extra length beyond automat. A sketch follows below.)

Begins to resemble general string compression.
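A rough Java sketch of front coding for one dictionary block of k sorted terms, using the same markers as in the example above ('*' ends the shared prefix, '◊' separates the suffixes of the remaining terms). The book's actual on-disk layout differs; this only illustrates the prefix-sharing idea.

import java.util.*;

// Sketch of front coding one block of sorted terms, e.g.
// [automata, automate, automatic, automation] -> "8automat*a1◊e2◊ic3◊ion".
public class FrontCoding {

    static String encodeBlock(List<String> block) {
        String first = block.get(0);
        // Longest prefix of the first term shared by every term in the block.
        int prefixLen = first.length();
        for (String term : block) {
            int i = 0;
            while (i < prefixLen && i < term.length()
                    && term.charAt(i) == first.charAt(i)) i++;
            prefixLen = i;
        }
        String prefix = first.substring(0, prefixLen);
        StringBuilder sb = new StringBuilder();
        // First term: full length, shared prefix, '*', then its own remainder.
        sb.append(first.length()).append(prefix).append('*')
          .append(first.substring(prefixLen));
        // Remaining terms: extra length beyond the prefix, '◊', then the suffix.
        for (String term : block.subList(1, block.size())) {
            String suffix = term.substring(prefixLen);
            sb.append(suffix.length()).append('◊').append(suffix);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(encodeBlock(Arrays.asList(
                "automata", "automate", "automatic", "automation")));
        // prints 8automat*a1◊e2◊ic3◊ion
    }
}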

Compressions on the corpus in the book

Technique                                            Size in MB
Fixed width                                          11.2
Dictionary-as-String with pointers to every term      7.6
Also, blocking k = 4                                  7.1
Also, blocking + front coding                         5.9


Compressing postings

• We store the list of docs containing a term in

increasing order of docID.

– computer: 33,47,154,159,202 …

• Consequence: it suffices to store gaps.

– 33,14,107,5,43 …

• Hope: most gaps can be encoded/stored with far fewer bits than it takes to store the docIDs themselves.

Three postings entries

Variable-length encoding

• Aim:

– For arachnocentric, we will use ~20 bits/gap entry.

– For the, we will use ~1 bit/gap entry.

• If the average gap for a term is G, we want to use ~log₂ G bits/gap entry.

• Key challenge: encode every integer (gap) with about

as few bits as needed for that integer.

• This requires a variable length encoding

• Variable length codes achieve this by using short

codes for small numbers

Variable-byte codes

• Begin with one byte to store G and dedicate 1 bit in it

to be a continuation bit c

• If G ≤ 127, binary-encode it in the 7 available bits and set c = 1

• Else encode G’s lower-order 7 bits and then use

additional bytes to encode the higher order bits

using the same algorithm

• At the end, set the continuation bit of the last byte to 1 (c = 1) and of the other bytes to 0 (a sketch follows below).
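A minimal sketch of exactly this scheme, encoding a list of gaps into VB bytes and decoding them back; it reproduces the worked example on the next slide (docIDs 824, 829, 215406). Integers are used in place of raw bytes to keep the sketch simple.

import java.util.*;

// Sketch of variable-byte (VB) encoding of gaps. The last byte of each number
// has its high (continuation) bit set to 1; all preceding bytes have it set to 0.
public class VByte {

    // Encode one gap G into VB bytes.
    static List<Integer> encodeNumber(int g) {
        LinkedList<Integer> bytes = new LinkedList<>();
        while (true) {
            bytes.addFirst(g % 128);          // prepend the lower-order 7 bits
            if (g < 128) break;
            g /= 128;
        }
        bytes.set(bytes.size() - 1, bytes.getLast() + 128);   // c = 1 on the last byte
        return bytes;
    }

    // Encode a whole postings list given as gaps.
    static List<Integer> encode(int[] gaps) {
        List<Integer> out = new ArrayList<>();
        for (int g : gaps) out.addAll(encodeNumber(g));
        return out;
    }

    // Decode a VB byte stream back into gaps.
    static List<Integer> decode(List<Integer> bytes) {
        List<Integer> gaps = new ArrayList<>();
        int n = 0;
        for (int b : bytes) {
            if (b < 128) {
                n = 128 * n + b;              // continuation bit 0: keep accumulating
            } else {
                n = 128 * n + (b - 128);      // continuation bit 1: number is complete
                gaps.add(n);
                n = 0;
            }
        }
        return gaps;
    }

    public static void main(String[] args) {
        // docIDs 824, 829, 215406 -> gaps 824, 5, 214577
        for (int b : encode(new int[] {824, 5, 214577})) {
            System.out.print(
                String.format("%8s", Integer.toBinaryString(b)).replace(' ', '0') + " ");
        }
        System.out.println();  // 00000110 10111000 10000101 00001101 00001100 10110001
    }
}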


Example

docIDs      824                  829         215406
gaps                             5           214577
VB code     00000110 10111000    10000101    00001101 00001100 10110001

Postings stored as the byte concatenation:
000001101011100010000101000011010000110010110001