82
A Billion Queries Per Day Michael Busch @michibusch [email protected] [email protected] 1

Lucene rev preso busch realtime search lr1010

Embed Size (px)

DESCRIPTION

 

Citation preview

Page 2: Lucene rev preso busch realtime search lr1010

A Billion Queries Per Day

Agenda

‣ Introduction

- Posting list format and early query termination

- Index memory model

- Lock-free algorithms and data structures

- Roadmap

2

Page 3: Lucene rev preso busch realtime search lr1010

Introduction

3

Page 4: Lucene rev preso busch realtime search lr1010

Introduction

• Search @ Twitter today:

• >1,000 TPS (tweets/sec) = >80,000,000 tweets/day

• >12,000 QPS (queries/sec) = >1,000,000,000 queries/day

• <10s latency (entire pipeline)

• <1s latency (indexer)

• Performance goal: Ability to scale to support Twitter’s continued growth

4

Page 5: Lucene rev preso busch realtime search lr1010

Introduction

• 1st gen search engine based on MySQL

• Next-gen engine based on (partially rewritten) Lucene Java

• Significantly more performant (orders of magnitude)

• Much easier to develop new features on the new platform - stay tuned!

5

Page 6: Lucene rev preso busch realtime search lr1010

A Billion Queries Per Day

Agenda

- Introduction

‣ Posting list format and early query termination

- Index memory model

- Lock-free algorithms and data structures

- Roadmap

6

Page 7: Lucene rev preso busch realtime search lr1010

Posting list format and early query termination

7

Page 8: Lucene rev preso busch realtime search lr1010

Objectives

• Only evaluate as few documents as possible before terminating the query

• Rank documents in reverse time order (newest documents first)

8

Page 9: Lucene rev preso busch realtime search lr1010

Inverted Index 101

1 The old night keeper keeps the keep in the town

2 In the big old house in the big old gown.

3 The house in the town had the big old keep

4 Where the old night keeper never did sleep.

5 The night keeper keeps the keep in the night

6 And keeps in the dark and sleeps in the light.

Table with 6 documents

Example from:Justin Zobel , Alistair Moffat, Inverted files for text search engines, ACM Computing Surveys (CSUR)v.38 n.2, p.6-es, 2006

9

Page 10: Lucene rev preso busch realtime search lr1010

Inverted Index 101

1 The old night keeper keeps the keep in the town

2 In the big old house in the big old gown.

3 The house in the town had the big old keep

4 Where the old night keeper never did sleep.

5 The night keeper keeps the keep in the night

6 And keeps in the dark and sleeps in the light.

term freq

and 1 <6>big 2 <2> <3>

dark 1 <6>did 1 <4>

gown 1 <2>had 1 <3>

house 2 <2> <3>in 5 <1> <2> <3> <5> <6>

keep 3 <1> <3> <5>keeper 3 <1> <4> <5>keeps 3 <1> <5> <6>light 1 <6>

never 1 <4>night 3 <1> <4> <5>old 4 <1> <2> <3> <4>

sleep 1 <4>sleeps 1 <6>

the 6 <1> <2> <3> <4> <5> <6>town 2 <1> <3>where 1 <4>

Table with 6 documents

Dictionary and posting lists10

Page 11: Lucene rev preso busch realtime search lr1010

Inverted Index 101

1 The old night keeper keeps the keep in the town

2 In the big old house in the big old gown.

3 The house in the town had the big old keep

4 Where the old night keeper never did sleep.

5 The night keeper keeps the keep in the night

6 And keeps in the dark and sleeps in the light.

term freq

and 1 <6>big 2 <2> <3>

dark 1 <6>did 1 <4>

gown 1 <2>had 1 <3>

house 2 <2> <3>in 5 <1> <2> <3> <5> <6>

keep 3 <1> <3> <5>keeper 3 <1> <4> <5>keeps 3 <1> <5> <6>light 1 <6>

never 1 <4>night 3 <1> <4> <5>old 4 <1> <2> <3> <4>

sleep 1 <4>sleeps 1 <6>

the 6 <1> <2> <3> <4> <5> <6>town 2 <1> <3>where 1 <4>

Table with 6 documents

Dictionary and posting lists

Query: keeper

11

Page 12: Lucene rev preso busch realtime search lr1010

Inverted Index 101

1 The old night keeper keeps the keep in the town

2 In the big old house in the big old gown.

3 The house in the town had the big old keep

4 Where the old night keeper never did sleep.

5 The night keeper keeps the keep in the night

6 And keeps in the dark and sleeps in the light.

term freq

and 1 <6>big 2 <2> <3>

dark 1 <6>did 1 <4>

gown 1 <2>had 1 <3>

house 2 <2> <3>in 5 <1> <2> <3> <5> <6>

keep 3 <1> <3> <5>keeper 3 <1> <4> <5>

keeps 3 <1> <5> <6>light 1 <6>

never 1 <4>night 3 <1> <4> <5>old 4 <1> <2> <3> <4>

sleep 1 <4>sleeps 1 <6>

the 6 <1> <2> <3> <4> <5> <6>town 2 <1> <3>where 1 <4>

Table with 6 documents

Dictionary and posting lists

Query: keeper

12

Page 13: Lucene rev preso busch realtime search lr1010

Early query termination

1 The old night keeper keeps the keep in the town

2 In the big old house in the big old gown.

3 The house in the town had the big old keep

4 Where the old night keeper never did sleep.

5 The night keeper keeps the keep in the night

6 And keeps in the dark and sleeps in the light.

term freq

and 1 <6>big 2 <2> <3>

dark 1 <6>did 1 <4>

gown 1 <2>had 1 <3>

house 2 <2> <3>in 5 <1> <2> <3> <5> <6>

keep 3 <1> <3> <5>keeper 3 <1> <4> <5>keeps 3 <1> <5> <6>light 1 <6>

never 1 <4>night 3 <1> <4> <5>old 4 <1> <2> <3> <4>

sleep 1 <4>sleeps 1 <6>

the 6 <1> <2> <3> <4> <5> <6>town 2 <1> <3>where 1 <4>

Table with 6 documents

Dictionary and posting lists

Query: old

13

Page 14: Lucene rev preso busch realtime search lr1010

Early query termination

1 The old night keeper keeps the keep in the town

2 In the big old house in the big old gown.

3 The house in the town had the big old keep

4 Where the old night keeper never did sleep.

5 The night keeper keeps the keep in the night

6 And keeps in the dark and sleeps in the light.

term freq

and 1 <6>big 2 <2> <3>

dark 1 <6>did 1 <4>

gown 1 <2>had 1 <3>

house 2 <2> <3>in 5 <1> <2> <3> <5> <6>

keep 3 <1> <3> <5>keeper 3 <1> <4> <5>keeps 3 <1> <5> <6>light 1 <6>

never 1 <4>night 3 <1> <4> <5>old 4 <1> <2> <3> <4>

sleep 1 <4>sleeps 1 <6>

the 6 <1> <2> <3> <4> <5> <6>town 2 <1> <3>where 1 <4>

Table with 6 documents

Dictionary and posting lists

Query: old

Walk posting list in reverse order

14

Page 15: Lucene rev preso busch realtime search lr1010

Early query termination

1 The old night keeper keeps the keep in the town

2 In the big old house in the big old gown.

3 The house in the town had the big old keep

4 Where the old night keeper never did sleep.

5 The night keeper keeps the keep in the night

6 And keeps in the dark and sleeps in the light.

term freq

and 1 <6>big 2 <2> <3>

dark 1 <6>did 1 <4>

gown 1 <2>had 1 <3>

house 2 <2> <3>in 5 <1> <2> <3> <5> <6>

keep 3 <1> <3> <5>keeper 3 <1> <4> <5>keeps 3 <1> <5> <6>light 1 <6>

never 1 <4>night 3 <1> <4> <5>old 4 <1> <2> <3> <4>

sleep 1 <4>sleeps 1 <6>

the 6 <1> <2> <3> <4> <5> <6>town 2 <1> <3>where 1 <4>

Table with 6 documents

Dictionary and posting lists

Query: old

E.g. 2 results are requested: Optimal termination after exactly

two postings were read

15

Page 16: Lucene rev preso busch realtime search lr1010

Early query termination

• Possible if time-ordered results are desired (which is often the case in realtime search applications)

• Posting list requirements:

• Pointer from dictionary to last posting in a list (must always be up to date)

• Possibility to iterate lists in reverse order

• Allow efficient skipping

16

Page 17: Lucene rev preso busch realtime search lr1010

Posting list encodings

Doc IDs to encode: 5, 15, 9000, 9002

Delta encoding: 5, 10, 8985, 2

Safes space, if appropriate compression techniques are used, e.g.

variable-length integers (VInt)

• Lucene uses a combination of delta encoding and VInt compression

• VInts are expensive to decode

• Problem 1: How to traverse posting lists backwards?

• Problem 2: How to write a posting atomically, if not a primitive java type

17

Page 18: Lucene rev preso busch realtime search lr1010

Postinglist format

int (32 bits)

docID24 bits

max. 16.7M

textPosition8 bits

max. 255

• Tweet text can only have 140 chars

• Decoding speed significantly improved compared to delta and VInt decoding (early experiments suggest 5x improvement compared to vanilla Lucene with FSDirectory)

18

Page 19: Lucene rev preso busch realtime search lr1010

Postinglist format

• ints can be written atomically in Java

• Backwards traversal easy on absolute docIDs (not deltas)

• Every posting is a possible entry point for a searcher

• Skipping can be done without additional data structures as binary search, even though there are better approaches which should be explored

• On tweet indexes we need about 30% more storage for docIDs compared to delta+Vints, but that’s compensated by savings on position encoding!

• Max. segment size: 2^24 = 16.7M tweets

19

Page 20: Lucene rev preso busch realtime search lr1010

Objectives

• Only evaluate as few documents as possible before terminating the query

• Rank documents in reverse time order (newest documents first)

20

Page 21: Lucene rev preso busch realtime search lr1010

A Billion Queries Per Day

Agenda

- Introduction

- Posting list format and early query termination

‣ Index memory model

- Lock-free algorithms and data structures

- Roadmap

21

Page 22: Lucene rev preso busch realtime search lr1010

Index memory model

22

Page 23: Lucene rev preso busch realtime search lr1010

Objectives

• Store large amount of linked lists of a priori unknown lengths efficiently in memory

• Allow traversal of lists in reverse order (backwards linked)

• Allow efficient garbage collection

23

Page 24: Lucene rev preso busch realtime search lr1010

Inverted Index

1 The old night keeper keeps the keep in the town

2 In the big old house in the big old gown.

3 The house in the town had the big old keep

4 Where the old night keeper never did sleep.

5 The night keeper keeps the keep in the night

6 And keeps in the dark and sleeps in the light.

term freq

and 1 <6>big 2 <2> <3>

dark 1 <6>did 1 <4>

gown 1 <2>had 1 <3>

house 2 <2> <3>in 5 <1> <2> <3> <5> <6>

keep 3 <1> <3> <5>keeper 3 <1> <4> <5>keeps 3 <1> <5> <6>light 1 <6>

never 1 <4>night 3 <1> <4> <5>old 4 <1> <2> <3> <4>

sleep 1 <4>sleeps 1 <6>

the 6 <1> <2> <3> <4> <5> <6>town 2 <1> <3>where 1 <4>

Table with 6 documents

Dictionary and posting lists

Per term we store different kinds of metadata: text pointer, frequency, postings pointer, etc.

24

Page 25: Lucene rev preso busch realtime search lr1010

LUCENE-2329: Parallel posting arrays

class PostingList

int textPointer;int postingsPointer;int frequency;...

• Term hashtable is an array of these objects: PostingList[] termsHash

• For each unique term in a segment we need an instance; this results in a very large number of objects that are long-living, i.e. the garbage collecter can’t remove them quickly (they need to stay in memory until the segment is flushed)

• With a searchable RAM buffer we want to flush much less often and allow DocumentsWriter to fill up the available memory

25

Page 26: Lucene rev preso busch realtime search lr1010

LUCENE-2329: Parallel posting arrays

class PostingList

int textPointer;int postingsPointer;int frequency;...

• Having a large number of long-living objects is very expensive in Java, especially when the default mark-and-sweep garbage collector is used

• The mark phase of GC becomes very expensive, because all long-living objects in memory have to be checked

• We need to reduce the number of objects to improve GC performance!-> Parallel posting arrays

26

Page 27: Lucene rev preso busch realtime search lr1010

LUCENE-2329: Parallel posting arrays

PostingList[]

textPointer;

frequency;

intintint

postingsPointer;

27

Page 28: Lucene rev preso busch realtime search lr1010

LUCENE-2329: Parallel posting arrays

termIDint[] textPointer; frequency;postingsPointer;

int[] int[] int[]

0

1

2

3

4

5

6

28

Page 29: Lucene rev preso busch realtime search lr1010

LUCENE-2329: Parallel posting arrays

0

termIDint[] textPointer; frequency;postingsPointer;

int[] int[] int[]

t0 p0 f00

1

2

3

4

5

6

29

Page 30: Lucene rev preso busch realtime search lr1010

LUCENE-2329: Parallel posting arrays

1

0

termIDint[] textPointer; frequency;postingsPointer;

int[] int[] int[]

t0

t1

p0

p1

f0

f1

0

1

2

3

4

5

6

30

Page 31: Lucene rev preso busch realtime search lr1010

LUCENE-2329: Parallel posting arrays

1

0

2

termIDint[] textPointer; frequency;postingsPointer;

int[] int[] int[]

t0

t1

t2

p0

p1

p2

f0

f1

f2

0

1

2

3

4

5

6

• Total number of objects is now greatly reduced and is constant and independent of number of unique terms

• With parallel arrays we safe 28 bytes per unique term -> 41% savings compared to PostingList object

31

Page 32: Lucene rev preso busch realtime search lr1010

LUCENE-2329: Parallel posting arrays - Performance

• Performance experiments: Index 1M wikipedia docs

1) -Xmx2048M, indexWriter.setMaxBufferSizeMB(200)

4.3% improvement

2) -Xmx256M, indexWriter.setMaxBufferSizeMB(200)

86.5% improvement

32

Page 33: Lucene rev preso busch realtime search lr1010

LUCENE-2329: Parallel posting arrays - Performance

• With large heap there is a small improvement due to per-term memory savings

• With small heap the garbage collector is invoked much more often - huge improvement due to smaller number of objects (depending on doc sizes we have seen improvements of up to 400%!)

• With searchable RAM buffers we want to utilize all the RAM we have; with parallel arrays we can maintain high indexing performance even if we get close to the max heap size

33

Page 34: Lucene rev preso busch realtime search lr1010

Inverted index components

Parallel arraysDictionary

pointer to the most recently indexed posting for a term

Posting list storage

?

34

Page 35: Lucene rev preso busch realtime search lr1010

• Store many single-linked lists of different lengths space-efficiently

• The number of java objects should be independent of the number of lists or number of items in the lists

• Every item should be a possible entry point into the lists for iterators, i.e. items should not be dependent on other items (e.g. no delta encoding)

• Append and read possible by multiple threads in a lock-free fashion (single append thread, multiple reader threads)

• Traversal in backwards order

Posting lists storage - Objectives

35

Page 36: Lucene rev preso busch realtime search lr1010

Memory management

= 32K int[]

4 int[]pools

36

Page 37: Lucene rev preso busch realtime search lr1010

Memory management

= 32K int[]

4 int[]pools

Each pool can be grown

individually by adding 32K

blocks

37

Page 38: Lucene rev preso busch realtime search lr1010

Memory management

• For simplicity we can forget about the blocks for now and think of the pools as continuous, unbounded int[] arrays

• Small total number of Java objects (each 32K block is one object)

4 int[]pools

38

Page 39: Lucene rev preso busch realtime search lr1010

Memory management

• Slices can be allocated in each pool

• Each pool has a different, but fixed slice size

21

24

27

211

slice size

39

Page 40: Lucene rev preso busch realtime search lr1010

Adding and appending to a list

21

24

27

211

slice size

available

allocated

current list

40

Page 41: Lucene rev preso busch realtime search lr1010

Adding and appending to a list

21

24

27

211

slice size

Store first twopostings in this slice

available

allocated

current list

41

Page 42: Lucene rev preso busch realtime search lr1010

Adding and appending to a list

21

24

27

211

slice size

When first slice is full, allocate another one in second pool

available

allocated

current list

42

Page 43: Lucene rev preso busch realtime search lr1010

Adding and appending to a list

21

24

27

211

slice size

available

allocated

current list

Allocate a slice on each level as list grows

43

Page 44: Lucene rev preso busch realtime search lr1010

Adding and appending to a list

21

24

27

211

slice size

available

allocated

current list

On upper most level one list can own multiple slices

44

Page 45: Lucene rev preso busch realtime search lr1010

Posting list format

int (32 bits)

docID24 bits

max. 16.7M

textPosition8 bits

max. 255

• Tweet text can only have 140 chars

• Decoding speed significantly improved compared to delta and VInt decoding (early experiments suggest 5x improvement compared to vanilla Lucene with FSDirectory)

45

Page 46: Lucene rev preso busch realtime search lr1010

Addressing items

• Use 32 bit (int) pointers to address any item in any list unambiguously:

int (32 bits)

poolIndex2 bits0-3

offset in slice1-11 bits

depends on pool

sliceIndex19-29 bits

depends on pool

• Nice symmetry: Postings and address pointers both fit into a 32 bit int

46

Page 47: Lucene rev preso busch realtime search lr1010

Linking the slices

21

24

27

211

slice size

available

allocated

current list

47

Page 48: Lucene rev preso busch realtime search lr1010

Linking the slices

21

24

27

211

slice size

available

allocated

current list

Parallel arraysDictionary

pointer to the last posting indexed for a term

48

Page 49: Lucene rev preso busch realtime search lr1010

Objectives

• Store large amount of linked lists of a priori unknown lengths efficiently in memory

• Allow traversal of lists in reverse order (backwards linked)

• Allow efficient garbage collection

49

Page 50: Lucene rev preso busch realtime search lr1010

A Billion Queries Per Day

Agenda

- Introduction

- Posting list format and early query termination

- Index memory model

‣ Lock-free algorithms and data structures

- Roadmap

50

Page 51: Lucene rev preso busch realtime search lr1010

Lock-free algorithms and data structures

51

Page 52: Lucene rev preso busch realtime search lr1010

Objectives

• Support continuously high indexing rate and concurrent search rate

• Both should be independent, i.e. indexing rate should not decrease when search rate increases and vice versa.

52

Page 53: Lucene rev preso busch realtime search lr1010

Definitions

• Pessimistic locking

• A thread holds an exclusive lock on a resource, while an action is performed [mutual exclusion]

• Usually used when conflicts are expected to be likely

• Optimistic locking

• Operations are tried to be performed atomically without holding a lock; conflicts can be detected; retry logic is often used in case of conflicts

• Usually used when conflicts are expected to be the exception

53

Page 54: Lucene rev preso busch realtime search lr1010

Definitions

• Non-blocking algorithm

Ensures, that threads competing for shared resources do not have their execution indefinitely postponed by mutual exclusion.

• Lock-free algorithm

A non-blocking algorithm is lock-free if there is guaranteed system-wide progress.

• Wait-free algorithm

A non-blocking algorithm is wait-free, if there is guaranteed per-thread progress.

* Source: Wikipedia54

Page 55: Lucene rev preso busch realtime search lr1010

Concurrency

• Having a single writer thread simplifies our problem: no locks have to be used to protect data structures from corruption (only one thread modifies data)

• But: we have to make sure that all readers always see a consistent state of all data structures -> this is much harder than it sounds!

• In Java, it is not guaranteed that one thread will see changes that another thread makes in program execution order, unless the same memory barrier is crossed by both threads -> safe publication

• Safe publication can be achieved in different, subtle ways. Read the great book “Java concurrency in practice” by Brian Goetz for more information!

55

Page 56: Lucene rev preso busch realtime search lr1010

Java Memory Model

• Program order rule

Each action in a thread happens-before every action in that thread that comes later in the program order.

• Volatile variable rule

A write to a volatile field happens-before every subsequent read of that same field.

• Transitivity

If A happens-before B, and B happens-before C, then A happens-before C.

* Source: Brian Goetz: Java Concurrency in Practice56

Page 57: Lucene rev preso busch realtime search lr1010

Concurrency

0RAM

int x;

Cache

Thread A Thread B

time

57

Page 58: Lucene rev preso busch realtime search lr1010

Concurrency

0RAM

int x;

Cache 5

Thread A Thread B

x = 5;

Thread A writes x=5 to cache

time

58

Page 59: Lucene rev preso busch realtime search lr1010

Concurrency

0RAM

int x;

Cache 5

Thread A Thread B

x = 5;

while(x != 5);time

This condition will likely never become false!

59

Page 60: Lucene rev preso busch realtime search lr1010

Concurrency

0RAM

int x;

Cache

Thread A Thread B

time

60

Page 61: Lucene rev preso busch realtime search lr1010

Concurrency

0RAM

int x;

1

Cache

Thread A Thread B

time

volatile int b;

x = 5;5

Thread A writes b=1 to RAM, because b is volatile

b = 1;

61

Page 62: Lucene rev preso busch realtime search lr1010

Concurrency

0RAM

int x;

1

Cache

Thread A Thread B

time

volatile int b;

x = 5;5

Read volatile b

b = 1;

int dummy = b;

while(x != 5);

62

Page 63: Lucene rev preso busch realtime search lr1010

Concurrency

0RAM

int x;

1

Cache

Thread A Thread B

time

volatile int b;

x = 5;5

b = 1;

int dummy = b;

while(x != 5);

• Program order rule: Each action in a thread happens-before every action in that thread that comes later in the program order.

happens-before

63

Page 64: Lucene rev preso busch realtime search lr1010

Concurrency

0RAM

int x;

1

Cache

Thread A Thread B

time

volatile int b;

x = 5;5

b = 1;

int dummy = b;

while(x != 5);

happens-before

• Volatile variable rule: A write to a volatile field happens-before every subsequent read of that same field.

64

Page 65: Lucene rev preso busch realtime search lr1010

Concurrency

0RAM

int x;

1

Cache

Thread A Thread B

time

volatile int b;

x = 5;5

b = 1;

int dummy = b;

while(x != 5);

happens-before

• Transitivity: If A happens-before B, and B happens-before C, then A happens-before C.

65

Page 66: Lucene rev preso busch realtime search lr1010

Concurrency

0RAM

int x;

1

Cache

Thread A Thread B

time

volatile int b;

x = 5;5

b = 1;

int dummy = b;

while(x != 5);

This condition will be false, i.e. x==5

• Note: x itself doesn’t have to be volatile. There can be many variables like x, but we need only a single volatile field.

66

Page 67: Lucene rev preso busch realtime search lr1010

Concurrency

0RAM

int x;

1

Cache

Thread A Thread B

time

volatile int b;

x = 5;5

b = 1;

int dummy = b;

while(x != 5);

Memory barrier

• Note: x itself doesn’t have to be volatile. There can be many variables like x, but we need only a single volatile field.

67

Page 68: Lucene rev preso busch realtime search lr1010

Demo

68

Page 69: Lucene rev preso busch realtime search lr1010

Concurrency

0RAM

int x;

1

Cache

Thread A Thread B

time

volatile int b;

x = 5;5

b = 1;

int dummy = b;

while(x != 5);

Memory barrier

• Note: x itself doesn’t have to be volatile. There can be many variables like x, but we need only a single volatile field.

69

Page 70: Lucene rev preso busch realtime search lr1010

Concurrency

IndexWriter IndexReader

time

write 100 docs

maxDoc = 100

in IR.open(): read maxDoc

search upto maxDoc

maxDoc is volatile

write more docs

70

Page 71: Lucene rev preso busch realtime search lr1010

Concurrency

IndexWriter IndexReader

time

write 100 docs

maxDoc = 100

in IR.open(): read maxDoc

search upto maxDoc

maxDoc is volatile

write more docs

happens-before

• Only maxDoc is volatile. All other fields that IW writes to and IR reads from don’t need to be!

71

Page 72: Lucene rev preso busch realtime search lr1010

• Single writer/multi reader approach

• High performance can be achieved by avoiding many volatile/atomic fields

• Safe publication can be achieved by ensuring that memory barriers are crossed

• Opening an in-memory reader is extremely cheap - can be done for every query (at Twitter we open a billion IndexReaders per day!)

• Problem still unsolved: how to update term->postinglist pointers to ensure correct search results?

So far...

72

Page 73: Lucene rev preso busch realtime search lr1010

Updating posting pointers

21

24

27

211

slice size

written after memory barrier was crossed

postingpointer

73

Page 74: Lucene rev preso busch realtime search lr1010

Updating posting pointers

21

24

27

211

slice size

written after memory barrier was crossed

• Safe publication guarantees that everything after a memory barrier is crossed is visible to a thread

• But: a thread might already see a newer state even if the next memory barrier wasn’t crossed yet

updatedpointer

previouspointer

74

Page 75: Lucene rev preso busch realtime search lr1010

Updating posting pointers

21

24

27

211

slice size

written after memory barrier was crossed

• This means: a reader might get a postings pointer addressing an int[] block that is not allocated yet!

• We have to handle this case carefully

updatedpointer

previouspointer

75

Page 76: Lucene rev preso busch realtime search lr1010

Updating posting pointers

• Basic idea: keep two most recent pointers

• Writer toggles active pointer whenever memory barrier is crossed

• Reader can detect if a pointer is safe to follow

• Optimistic approach: Reader always considers latest pointer first - only if that one isn’t safe it falls back to previous pointer

76

Page 77: Lucene rev preso busch realtime search lr1010

• Not a single exclusive lock

• Writer thread can always make progress

• Optimistic locking (retry-logic) in a few places for searcher thread

• Retry logic very simple and guaranteed to always make progress - one of the two available posting pointers is always safe to use

Wait-free

77

Page 78: Lucene rev preso busch realtime search lr1010

Objectives

• Support continuously high indexing rate and concurrent search rate

• Both should be independent, i.e. indexing rate should not decrease when search rate increases and vice versa.

78

Page 79: Lucene rev preso busch realtime search lr1010

A Billion Queries Per Day

Agenda

- Introduction

- Posting list format and early query termination

- Index memory model

- Lock-free algorithms and data structures

‣ Roadmap

79

Page 80: Lucene rev preso busch realtime search lr1010

Roadmap

80

Page 81: Lucene rev preso busch realtime search lr1010

• LUCENE-2329: Parallel posting arrays

• LUCENE-2324: Per-thread DocumentsWriter and sequence IDs

• LUCENE-2346: Change in-memory postinglist format

• LUCENE-2312: Search on DocumentsWriters RAM buffer

• IndexReader, that can switch from RAM buffer to flushed segment on-the-fly

• Sorted term dictionary (wildcards, numeric queries)

• Stored fields, TermVectors, Payloads (Attributes)

Roadmap

81