Information Retrieval ETH Zürich, Fall 2012 Thomas Hofmann LECTURE 2 INDEXING 26.09.2012 Information Retrieval, ETHZ 2012 1


Today’s Overview

1.  Introduction
2.  Dictionaries
3.  Index Construction
4.  Distributed Indexing
5.  Multiple Query Terms
6.  Advanced Posting List Intersection
7.  Web-scale Index Serving

Class from 9:15-10:45 (no break), 11-12: Exercise


INTRODUCTION


Basic Index: Challenge

Design a solution to a simple lookup problem:

Efficiently identify documents containing a given term t.

“Efficiently” = do this in time O(# documents returned).

Use a data structure to be constructed off-line (@ indexing time) in order to avoid linear scanning (@ query time).

Trade-off: better response time & query throughput in exchange for pre-processing costs & index space (memory, disk).

Any data structure for storing a set of records could be used. Here: focus on arrays & linked lists = posting lists.


Pre-Computer Age: Book Index

Book indexes

Record pages mentioning (e.g.) keywords and names

Goes back to the age of printed books (15th century)


Posting Lists

Term: ETHZ

docID_1 = docID(“swissinfo.ch/…”)
docID_2 = docID(“wikipedia.org/wiki/ETH_Zurich”)
docID_3 = docID(“www.systems.ethz.ch/…”)
docID_4 = docID(“www.ethz.ch”)

Posting list (array or linked list): ETHZ → docID_1, docID_2, docID_3, docID_4, …

DICTIONARIES


Basic Index: Dictionary

For each admissible (i.e. single-term) query we need to find the corresponding posting list (if it exists, else NULL).

We need an efficient data structure for term look-up, i.e. a dictionary.

Preferred solution: hash table

§  Hash function

§  Mechanism for dealing with collisions: e.g. linked list

§  O(1) expected access for “good” hash functions and a large enough table size n

§  Standard implementations: resize (rehash) once the load factor exceeds 0.75

¤ Büttcher, S., Clarke Ch. L. A., and Cormack, G. V.: Information Retrieval. Implementing and Evaluating Search Engines, Section 4.2, 2010.
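A minimal sketch of such a dictionary, assuming separate chaining with Python lists as collision lists and doubling of the table once the load factor exceeds 0.75. The class name `HashDictionary` and its methods are hypothetical, chosen for illustration; in practice Python's built-in `dict` already provides a hash table.

```python
class HashDictionary:
    """Term dictionary: hash table with collision lists (separate chaining)."""

    def __init__(self, num_slots=8, max_load=0.75):
        self._slots = [[] for _ in range(num_slots)]
        self._size = 0
        self._max_load = max_load

    def _chain(self, term):
        # The hash function selects the slot; each slot holds a collision list.
        return self._slots[hash(term) % len(self._slots)]

    def put(self, term, posting_list):
        chain = self._chain(term)
        for entry in chain:
            if entry[0] == term:
                entry[1] = posting_list  # term already known: overwrite
                return
        chain.append([term, posting_list])
        self._size += 1
        if self._size / len(self._slots) > self._max_load:
            self._resize()

    def get(self, term):
        # Returns None if the term is not in the dictionary (the NULL case).
        for key, posting_list in self._chain(term):
            if key == term:
                return posting_list
        return None

    def _resize(self):
        # Double the number of slots and re-insert every entry.
        entries = [entry for chain in self._slots for entry in chain]
        self._slots = [[] for _ in range(2 * len(self._slots))]
        for entry in entries:
            self._chain(entry[0]).append(entry)


dictionary = HashDictionary(num_slots=2)
for i, term in enumerate(["class", "ETHZ", "mountain", "weather"]):
    dictionary.put(term, [i])
```

Here the stored value stands in for the posting list address; a lookup walks only the collision list of one slot.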


Dictionary Hash Table

[Figure: the terms (class, ETHZ, mountain, weather) are mapped by the hash function h to the hash table slots 0, 1, 2, …, r, r+1, …, n. Each slot points to a collision list.]

Each collision list entry has the form <token> <posting list address>, e.g.

mountain 549283471
ETHZ 398437231
class 234443989
weather 770209991
… …

INDEX CONSTRUCTION


Basic Index: Generation

Construct all posting lists in one pass over the document collection.

INDEXGENERATION(C)
  for all documents d in collection C
    for all terms t occurring in d
      if not EXISTS(posting_list(t))
        then CREATE(posting_list(t))
             ADD(posting_list(t), d)
      else if not CONTAINS(posting_list(t), d)
        then ADD(posting_list(t), d)
  return posting_list

Note: indexing terms (=vocabulary) can be identified on the fly. Dictionary construction can happen in parallel.
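The pseudocode above could be written in Python roughly as follows. This is a sketch under simplifying assumptions: the collection is a hypothetical `{doc_id: text}` mapping, terms are obtained by lower-casing and splitting on whitespace, and documents are visited in ascending doc-id order so that checking the last list entry is enough to avoid duplicates.

```python
from collections import defaultdict


def index_generation(collection):
    """One pass over the collection; returns {term: [doc_id, ...]}."""
    posting_lists = defaultdict(list)  # a missing list is created on first access
    for doc_id in sorted(collection):
        for term in collection[doc_id].lower().split():
            postings = posting_lists[term]
            # Add each document at most once per term.
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
    return dict(posting_lists)


example_index = index_generation({1: "ETH Zurich", 2: "ETH systems ETH"})
```

The vocabulary is simply the key set of the result, which illustrates how indexing terms are identified on the fly.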


Index Construction

Conceptually: 3 steps

1.  Make a pass through the collection and assemble all postings, i.e. pairs (term, doc-id) or (term-id, doc-id)

2.  Sort the postings using the term(-id) as the primary and the doc-id as the secondary key

3.  Organize doc-ids into posting lists for each term
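The three steps can be sketched in a few lines of Python. The input format (a `{doc_id: text}` mapping) and the whitespace tokenization are assumptions made for illustration, and everything is held in memory.

```python
from itertools import groupby


def sort_based_index(collection):
    # Step 1: assemble all (term, doc-id) postings in one pass.
    postings = [
        (term, doc_id)
        for doc_id, text in collection.items()
        for term in text.lower().split()
    ]
    # Step 2: sort by term (primary key) and doc-id (secondary key).
    postings.sort()
    # Step 3: group the doc-ids of each term into one posting list.
    index = {}
    for term, group in groupby(postings, key=lambda posting: posting[0]):
        doc_ids = []
        for _, doc_id in group:
            if not doc_ids or doc_ids[-1] != doc_id:  # drop duplicates
                doc_ids.append(doc_id)
        index[term] = doc_ids
    return index


sorted_index = sort_based_index({2: "ETH systems ETH", 1: "ETH Zurich"})
```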


Scalable Index Construction

In-memory index construction does not scale. How can we construct an index for very large collections, taking into account the hardware constraints on memory, disk, speed, etc.?


Sort-Based Index Construction

As we build the index, we parse docs one at a time. The final postings list for any term is potentially incomplete until the end.

At 10–12 bytes per postings entry, this demands a lot of space for large collections.

For large document collections, we need to store intermediate results on disk.


Blocked Sort-Based Indexing (BSBI)

12-byte (4+4+4) postings (term-id, doc-id, frequency)

Must now sort many billions of postings by term-id. Define a block to consist of (say) 10M such postings. We can easily fit that many postings into memory.

Basic idea of algorithm:

Accumulate postings for each block, sort, write to disk. Then merge the blocks into one long sorted order.
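The basic idea can be sketched as follows. This is an illustration under stated assumptions, not the lecture's implementation: postings are `(term_id, doc_id)` pairs, runs are stored with `pickle`, and each run is loaded whole when merging, whereas a real system would stream runs from disk with bounded buffers. The function names are hypothetical.

```python
import heapq
import os
import pickle
import tempfile


def write_run(block, tmp_dir, run_no):
    """Sort one block of postings and write it to disk as a sorted run."""
    path = os.path.join(tmp_dir, f"run_{run_no}.pkl")
    with open(path, "wb") as f:
        pickle.dump(sorted(block), f)
    return path


def read_run(path):
    """Yield the postings of one sorted run (loaded whole here for brevity)."""
    with open(path, "rb") as f:
        yield from pickle.load(f)


def bsbi(postings, block_size, tmp_dir):
    """Build {term_id: [doc_id, ...]} from an iterable of (term_id, doc_id)."""
    runs, block = [], []
    for posting in postings:
        block.append(posting)
        if len(block) == block_size:  # block is full: sort and spill to disk
            runs.append(write_run(block, tmp_dir, len(runs)))
            block = []
    if block:
        runs.append(write_run(block, tmp_dir, len(runs)))

    # Merge all sorted runs into one long sorted order of postings.
    index = {}
    for term_id, doc_id in heapq.merge(*(read_run(path) for path in runs)):
        doc_ids = index.setdefault(term_id, [])
        if not doc_ids or doc_ids[-1] != doc_id:
            doc_ids.append(doc_id)
    return index


with tempfile.TemporaryDirectory() as tmp:
    demo_index = bsbi([(2, 1), (1, 1), (1, 2), (2, 3), (1, 3)], block_size=2, tmp_dir=tmp)
```

With `block_size=2` the five postings produce three runs, which `heapq.merge` combines without re-sorting.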


BSBI Index Construction


BSBI: Merging Blocks


Problems with Sort-Based Algorithm

Our assumption was: we can keep the dictionary in memory.

We need the dictionary (which grows dynamically) in order to implement a term to term-id mapping. Actually, we could work with (term, doc-id) postings instead of (term-id, doc-id) postings . . .

. . . but then intermediate files become very large. (We would end up with a scalable, but very slow index construction method.)

… term fingerprinting is an alternative, but inexact.


Single-Pass In-Memory Indexing

Abbreviation: SPIMI

Key idea #1: Generate separate dictionaries for each block – no need to maintain a term to term-id mapping across blocks.

Key idea #2: Don’t sort. Accumulate postings in postings lists as they occur.

With these two ideas we can generate a complete inverted index for each block.

These separate indexes can then be merged into one big index.
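A sketch of the two key ideas, assuming a token stream of `(term, doc_id)` pairs in document order and a hypothetical `max_postings` limit that stands in for "memory is full". Blocks are kept in memory here; a real implementation would write each block index to disk and merge the files.

```python
def spimi_invert(token_stream, max_postings):
    """Yield one complete block index {term: [doc_id, ...]} per block."""
    block, count = {}, 0
    for term, doc_id in token_stream:
        # Key idea #1: the block has its own dictionary, keyed by the term itself.
        postings = block.setdefault(term, [])
        # Key idea #2: no sorting of postings, just append as they occur.
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
            count += 1
        if count >= max_postings:
            yield dict(sorted(block.items()))  # sort terms only when writing a block
            block, count = {}, 0
    if block:
        yield dict(sorted(block.items()))


def merge_blocks(blocks):
    """Merge the separate block indexes into one big index."""
    merged = {}
    for block in blocks:
        for term, postings in block.items():
            merged.setdefault(term, []).extend(postings)
    return {term: sorted(set(postings)) for term, postings in sorted(merged.items())}


stream = [("eth", 1), ("zurich", 1), ("eth", 2), ("eth", 2), ("systems", 2), ("eth", 3)]
spimi_blocks = list(spimi_invert(stream, max_postings=2))
spimi_index = merge_blocks(spimi_blocks)
```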


DISTRIBUTED INDEXING


Distributed Index Generation

For web-scale indexing: must use a distributed computer cluster

Individual machines are fault-prone and may unpredictably slow down or fail

How do we exploit such a pool of machines?


Master Coordination

Maintain a master machine directing the indexing job – considered “safe”

Break up indexing into sets of parallel tasks

Master machine assigns each task to an idle machine from a pool.


Parallel Tasks

We will define two sets of parallel tasks and deploy two types of machines to solve them:

§  Parsers

§  Inverters

Break the input document collection into splits (corresponding to blocks in BSBI/SPIMI)

Each split is a subset of documents.


Parsers

Master assigns a split to an idle parser machine. Parser reads a document at a time and emits (term, doc) pairs.

Parser writes pairs into j term-partitions, each covering a range of terms’ first letters.

E.g., a-f, g-p, q-z (here: j = 3)


Inverters

An inverter collects all (term, doc) pairs (= postings) for one term-partition.

It sorts the pairs and writes them to postings lists.
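The parser and inverter roles can be simulated on a single machine as below. This is only a sketch: the master, the machine pool and the network are left out, the split format (`{doc_id: text}`) is an assumption, and sending terms that start with any other character to the last partition is an arbitrary choice made for the example.

```python
from collections import defaultdict

PARTITIONS = [("a", "f"), ("g", "p"), ("q", "z")]  # j = 3 term-partitions


def partition_of(term):
    for j, (low, high) in enumerate(PARTITIONS):
        if low <= term[0] <= high:
            return j
    return len(PARTITIONS) - 1  # assumption: everything else goes to the last partition


def parse(split):
    """Parser: read one split and emit (term, doc) pairs into j segments."""
    segments = [[] for _ in PARTITIONS]
    for doc_id, text in split.items():
        for term in text.lower().split():
            segments[partition_of(term)].append((term, doc_id))
    return segments


def invert(pairs):
    """Inverter: sort the pairs of one term-partition and build postings lists."""
    postings = defaultdict(list)
    for term, doc_id in sorted(set(pairs)):
        postings[term].append(doc_id)
    return dict(postings)


splits = [{1: "ETH Zurich"}, {2: "systems ETH"}]
parsed = [parse(split) for split in splits]  # one parser task per split
distributed_index = {}
for j in range(len(PARTITIONS)):  # one inverter task per term-partition
    pairs = [pair for segments in parsed for pair in segments[j]]
    distributed_index.update(invert(pairs))
```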


Data Flow


Map Reduce

The index construction algorithm we just described is an instance of Map Reduce.

Map Reduce is a robust and conceptually simple framework for distributed computing … without having to write code for the distribution part.

The open source version is called Hadoop.

Hadoop is a key tool for big data. See lecture 3 of Donald Kossmann’s class.


¤ J. Dean & S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters. 6th Symposium on Operating System Design & Implementation, 2004.


MULTIPLE QUERY TERMS


Basic Index: Modified Challenge

Deal with multiple terms:

§  Efficiently identify documents containing a given set of terms t1,…,tk.

§  This is also known as Boolean retrieval with “AND”.

In which way do we need to generalize the

•  index data structures,
•  index generation, and
•  query processing?


Multi-Term Posting Lists

Challenge: #terms may be large: O(billions), but #sets of terms grows exponentially in the set size.

In practice some sets (or n-grams) of terms may be used frequently (“mountain bike trails”), but most term combinations will never be observed.

Idea #1: Multi-term posting lists

§  Identify frequent k-term combinations (from documents or query logs, k=2 or k=3). Create posting lists for those.

§  Advantage: popular k-term queries can be answered as fast as one-term queries
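As an illustration of Idea #1 for k=2, the sketch below builds posting lists for adjacent term pairs that occur in at least `min_docs` documents. Restricting combinations to adjacent pairs and using a document-count threshold are assumptions made for this example; the slide leaves open how frequent combinations are identified.

```python
from collections import defaultdict


def bigram_posting_lists(collection, min_docs):
    """Return {(term1, term2): [doc_id, ...]} for frequent adjacent term pairs."""
    occurrences = defaultdict(list)
    for doc_id in sorted(collection):
        terms = collection[doc_id].lower().split()
        # Each distinct adjacent pair contributes the document once.
        for pair in sorted(set(zip(terms, terms[1:]))):
            occurrences[pair].append(doc_id)
    return {pair: docs for pair, docs in occurrences.items() if len(docs) >= min_docs}


bigram_index = bigram_posting_lists(
    {1: "mountain bike trails", 2: "mountain bike shop", 3: "bike shop"},
    min_docs=2,
)
```

A two-term query such as ("mountain", "bike") is then a single dictionary look-up instead of an intersection.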


Intersecting Posting Lists

Idea #2: Traverse multiple posting lists in parallel to compute intersection.

In order to be effective (for posting lists of roughly equal length), posting lists must be sorted: sort the entries in each list using the same total order (e.g. ascending documentID).

Basic method:

§  Always advance in the posting list with the smallest current element.

§  Check for documents contained in all lists.
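A sketch of the basic method for k lists, assuming each posting list is a Python list of docIDs in ascending order. The three lists in the usage example are the visible entries of the lecture's ETHZ / systems / information retrieval example, treated here as complete lists.

```python
def intersect_all(posting_lists):
    """Intersect k posting lists that are sorted in ascending docID order."""
    if not posting_lists:
        return []
    positions = [0] * len(posting_lists)
    answer = []
    while all(pos < len(pl) for pos, pl in zip(positions, posting_lists)):
        current = [pl[pos] for pos, pl in zip(positions, posting_lists)]
        smallest, largest = min(current), max(current)
        if smallest == largest:
            # The document is contained in all lists.
            answer.append(smallest)
            positions = [pos + 1 for pos in positions]
        else:
            # Advance in the list with the smallest current element.
            positions[current.index(smallest)] += 1
    return answer


ethz = [370871, 391223, 623920, 789908]
systems = [177883, 300012, 391223, 391248]
information_retrieval = [391223, 800123, 990002, 991226]
result = intersect_all([ethz, systems, information_retrieval])
```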


Intersecting Posting Lists: Example

[Figure: animation over several slides; three sorted posting lists are traversed in parallel.]

ETHZ: 370871 391223 623920 … 789908

systems: 177883 300012 391223 391248 …

information retrieval: 391223 800123 990002 991226 …

An unsorted list (370871 927382 391223 623920 …) is sorted first. The pointers then advance until all three lists point at 391223: add docID 391223 to result set.

Intersection Algorithm

For simplicity, we focus on the case of two posting lists

Multiple terms can be handled by generalizing to k posting lists

… or by creating temp intermediate posting lists and recursion

Optimization: start with shorter posting lists

INTERSECT(p1, p2)
  answer := < >
  while (p1 != NULL) AND (p2 != NULL)
    if docID(p1) == docID(p2) then
      ADD(answer, docID(p1))
      ADVANCE(p1)
      ADVANCE(p2)
    else if docID(p1) < docID(p2)
      ADVANCE(p1)
    else
      ADVANCE(p2)
  return answer
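In Python, with arrays instead of linked lists, INTERSECT might look as follows. The second function sketches the variant with intermediate posting lists and the "start with shorter posting lists" optimization; its name `intersect_many` is hypothetical.

```python
def intersect(p1, p2):
    """Intersect two posting lists sorted in ascending docID order."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def intersect_many(posting_lists):
    """Intersect k lists pairwise, shortest first, via intermediate lists."""
    if not posting_lists:
        return []
    ordered = sorted(posting_lists, key=len)
    result = ordered[0]
    for postings in ordered[1:]:
        result = intersect(result, postings)
    return result
```

Starting with the shortest list keeps every intermediate result at most as long as that list.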


Intersecting Posting Lists

How expensive is the parallel intersection of k posting lists?

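Number of pointer advances: at most |L_1| + … + |L_k|, since every advance moves one pointer one position forward in its list.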

Reasonable efficiency, if posting lists are approximately of the same length.

Access time dominated by longest posting list. Can we also devise a method that is dominated by the shortest?


ADVANCED POSTING LIST INTERSECTION


Alternative Posting List Intersection

Naïve approach when |L_1| << |L_2|

§  Build a hash map dictionary of docIDs for L_1 and L_2

§  Look up the elements of L_1 in the dictionary for L_2

§  O(|L_1|) time

Only works well in highly asymmetric case. Can we do better?
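A sketch of the naïve approach, assuming the hash map for the long list is built ahead of time (here a Python `set`), so that query time only pays for the look-ups of the short list.

```python
def hash_intersect(short_list, long_list_docids):
    """Probe each docID of the short list in the hash set of the long list."""
    return [doc_id for doc_id in short_list if doc_id in long_list_docids]


l1 = [8, 41, 900]
l2 = list(range(0, 1000, 2))  # a much longer posting list
l2_docids = set(l2)           # built once, off-line
common_docs = hash_intersect(l1, l2_docids)
```

The work at query time is proportional to |L_1| expected hash probes, independent of |L_2|.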


Alternative Posting List Intersection: Refinement

§  Compute hashed sets h(L_1) and h(L_2)

§  Bucketed bit set representation of set of hash values

§  Fast intersection in bit set representation

§  Exact intersection

§  #bits in h: small enough to allow for fast intersection; large enough to make L’1 and L’2 small.
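One way to read these steps is sketched below. It is an interpretation rather than the algorithm of the cited paper: the hash function `doc_id % num_bits` is a hypothetical stand-in, a single Python integer serves as the bit set, and the candidate lists are taken to be those elements of each list whose hash value survives the bit set intersection.

```python
def hashed_intersect(l1, l2, num_bits=64):
    def h(doc_id):
        return doc_id % num_bits  # hypothetical hash function

    def bitset(postings):
        bits = 0
        for doc_id in postings:
            bits |= 1 << h(doc_id)
        return bits

    # Fast intersection in the bit set representation (one bitwise AND).
    common = bitset(l1) & bitset(l2)
    # Keep only the elements whose hash value occurs in both hashed sets.
    candidates_1 = [d for d in l1 if (common >> h(d)) & 1]
    candidates_2 = {d for d in l2 if (common >> h(d)) & 1}
    # Exact intersection on the (hopefully small) candidate sets.
    return [d for d in candidates_1 if d in candidates_2]


exact = hashed_intersect([1, 5, 9, 70], [5, 6, 70, 134], num_bits=64)
```

In the example, 6 and 134 collide with 70 under the hash and survive the bit set step; the exact step removes them. Fewer bits make the AND cheaper but leave more such false candidates.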


¤  P. Bille, A. Pagh, and R. Pagh. Fast Evaluation of Union-Intersection Expressions. In ISAAC, pages 739–750, 2007.


Posting Lists with Skip Pointers

Other ways to speed up list based intersection: introduce skip pointers

Traverse skip pointers instead of next element pointer, if whole segment can be skipped.

Where to put skip pointers? Heuristic: sqrt spacing, i.e. roughly √n evenly spaced skip pointers for a posting list of length n

Trade-offs:

(1) space and I/O (!) requirements with skip pointers vs. without

(2) additional comparisons with skip pointers vs. skip gains


Use of Skip Pointers: Example

Suppose 8 has been reached in both lists. The next element in the top list is 41, and we can advance to that element. In the bottom list, however, we can follow a skip pointer over a whole block and move past 31, skipping 4 elements.
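A sketch of intersection with skip pointers on arrays, assuming implicit skip pointers at every index that is a multiple of ⌊√n⌋ (the sqrt-spacing heuristic). The two example lists are chosen to be consistent with the numbers mentioned above; the original figure is not preserved, so they are an assumption.

```python
import math


def intersect_with_skips(p1, p2):
    """Intersect two ascending posting lists, skipping whole segments when safe."""
    s1 = max(1, math.isqrt(len(p1)))  # skip distance in p1
    s2 = max(1, math.isqrt(len(p2)))  # skip distance in p2
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Follow the skip pointer only if its target does not overshoot p2[j].
            if i % s1 == 0 and i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                i += s1
            else:
                i += 1
        else:
            if j % s2 == 0 and j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                j += s2
            else:
                j += 1
    return answer


top = [2, 4, 8, 41, 48, 64, 128]
bottom = [1, 2, 3, 8, 11, 17, 21, 31]
skip_result = intersect_with_skips(top, bottom)
```

Each skip check costs one extra comparison, which is the second trade-off listed for skip pointers.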


WEB SCALE INDEX SERVING


Disk vs. RAM

When building a scalable (i.e. Web-scale) index, one key design question is whether to use disk or RAM (today also: SSD).

§  RAM ~200x more expensive than disk

§  Disk ~10-20x slower to access

§  Additional overhead for random access = disk seeks

Hardware economics also influence system architecture.


Distributed Index: Sharding

A related problem for large indexes is how to split up the index into pieces or shards.

Relevant performance dimensions are response time or latency (how long does it take to compute a response?) as well as throughput (how many queries/s can be answered?). In addition, fault tolerance may be an issue.

There are two basic ways of sharding: document sharding or term sharding.

Document sharding: each shard contains short posting lists (for a subset of documents).

Term sharding: each shard contains few posting lists (for a subset of terms).
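The two schemes can be contrasted on a toy index as sketched below. The assignment rules are assumptions chosen for the example: documents go to shard `doc_id % num_shards`, terms go to a shard picked by a CRC32 hash of the term.

```python
import zlib


def shard_of_term(term, num_shards):
    # Deterministic hash; Python's built-in hash() of strings varies per process.
    return zlib.crc32(term.encode("utf-8")) % num_shards


def document_sharding(index, num_shards):
    """Every shard holds all terms, but only the docs assigned to it."""
    shards = [{} for _ in range(num_shards)]
    for term, postings in index.items():
        for doc_id in postings:
            shards[doc_id % num_shards].setdefault(term, []).append(doc_id)
    return shards


def term_sharding(index, num_shards):
    """Every shard holds the complete posting lists of a subset of terms."""
    shards = [{} for _ in range(num_shards)]
    for term, postings in index.items():
        shards[shard_of_term(term, num_shards)][term] = list(postings)
    return shards


def shards_to_contact(query_terms, num_shards, scheme):
    if scheme == "document":
        return set(range(num_shards))  # every shard must process the query
    return {shard_of_term(term, num_shards) for term in query_terms}


toy_index = {"eth": [1, 2, 3, 4], "zurich": [2, 4], "systems": [3]}
by_document = document_sharding(toy_index, 2)
by_term = term_sharding(toy_index, 2)
```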


Document Sharding


Term Sharding


Document Sharding -  Pros & Cons

Pros

§  each shard can process queries independently

§  easy to keep additional per-doc information

§  network traffic (requests/responses) small

Cons

§  query has to be processed by each shard

§  O(K*N) disk seeks for K word query on N shards

Term Sharding -  Pros & Cons

Pros

§  K word query => handled by at most K shards

§  O(K) disk seeks for K word query

Cons

§  much higher network bandwidth needed

§  data about each term for each matching doc must be collected in one place

§  harder to have per-doc information

Document sharding is the “standard” approach in Web search.

Basic Design Principles

Document Keying

§  Documents assigned small integer ids (docids)

§  Smaller ids for higher quality/more important docs: allows for approximation/cut-offs

Index Servers

§  Given (query) return sorted list of (score, docid, ...)

§  Partitioned (“sharded”) by docid

§  Index shards are replicated for capacity

§  Cost is O(# queries * # docs in index)
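How a frontend might combine the answers of docid-partitioned index servers can be sketched as follows. The assumption here is that every shard returns its hits as `(score, docid)` tuples sorted by descending score; the scores in the usage example are made up for illustration.

```python
import heapq
from itertools import islice


def merge_shard_results(shard_results, k):
    """Merge the sorted per-shard result lists and keep the k best hits."""
    merged = heapq.merge(*shard_results, reverse=True)  # inputs sorted descending
    return list(islice(merged, k))


top_hits = merge_shard_results(
    [
        [(0.9, 3), (0.2, 7)],  # answer of shard 0
        [(0.8, 4), (0.5, 1)],  # answer of shard 1
    ],
    k=3,
)
```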


Web Search Serving System (Google @ year ~2000)


Caching

Cache servers

§  Cache both index results and doc snippets

§  Hit rates typically 30-60%
    •  Depends on frequency of index updates, query traffic, level of personalization, etc.

Main benefits

§  Performance! 10s of machines do work of 100(0)s

§  Reduce query latency on hits

§  Cache-served queries are typically popular and often expensive
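A minimal sketch of a result cache in front of the index servers. The eviction policy (least recently used) and the class name `QueryCache` are assumptions for the example; the slide does not say which policy is used.

```python
from collections import OrderedDict


class QueryCache:
    """Cache query results; fall back to the backend on a miss."""

    def __init__(self, capacity, backend):
        self._capacity = capacity
        self._backend = backend  # callable: query -> results
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def search(self, query):
        if query in self._entries:
            self.hits += 1
            self._entries.move_to_end(query)  # mark as most recently used
            return self._entries[query]
        self.misses += 1
        results = self._backend(query)  # expensive: goes to the index servers
        self._entries[query] = results
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return results


backend_calls = []


def slow_backend(query):
    backend_calls.append(query)
    return [f"result for {query}"]


cache = QueryCache(capacity=2, backend=slow_backend)
for q in ["a", "b", "a", "c", "b"]:
    cache.search(q)
```

Only the repeated query "a" is served from the cache here; "b" has already been evicted when it is asked again.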


Dealing with Growth

More web pages: more shards

More queries: more replicas


From Document Sharding to In-memory Index

Must add shards to keep response time low as index size increases

... but query cost increases with # of shards

§  typically >= 1 disk seek / shard / query term

§  even for very rare terms

As # of replicas increases, total amount of memory available increases

Eventually, there is enough memory to hold an entire copy of the index in memory.

This radically changes many design parameters.


In Memory Index (a la Google)


Anecdote from the Life of a Search Engine 1999 :-)

Index updates (~once per month)

§  Wait until traffic is low

§  Take some replicas offline

§  Copy new index to these replicas

§  Start new frontends pointing at updated index

Disk-optimized update scheme
