University of Pennsylvania
© 2014 A. Haeberlen, Z. Ives
CIS 455/555: Internet and Web Systems
Indexing
February 5, 2014
Announcements
HW1 MS1 is due IN ONE WEEK
  At this point, you should have a feature-complete prototype, so you have time to debug and test your solution
Debugging tips
  When in doubt about protocol details, please look in the HTTP/1.1 spec (RFC 2616; linked from HTTP Made Really Easy)
Reminder: You have three jokers; the late penalty without jokers is 20% per day
Please: Use private questions on Piazza sparingly
Reading: D. Comer: "The Ubiquitous B-Tree"
  http://dl.acm.org/citation.cfm?id=356776
Plan for today
  Inverted indices (NEXT)
  B+ trees
Finding data by content
We’ve seen two approaches to search:
  Flood the network with requests (example: Gnutella), and do all the work at the data stores
  Have a directory based on names (example: LDAP)
Which of these is the 'best'?
An alternative, two-step process:
  Build a content index over what’s out there
    An index is a key → value map
    Typically limited in what kinds of queries can be supported
    Most common instance: an index of document keywords
A common model for search
Index the words in every document
  “Forward index”: document (ID) → list of words
  “Inverted index”: word → document (ID)
Inverted indices
A conceptually very simple map-multiset data structure: <keyword, {list of occurrences}>
  In its simplest form, each occurrence includes a document pointer (e.g., URI), perhaps a count and/or position
  What might a count be useful for? A position?
Requires two components, an indexer and a retrieval system
We’ll consider the cost of building the index, plus searching the index using a single keyword
  Storage efficiency is also a concern
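The indexer / retrieval split above can be sketched in a few lines of Python. This is only an illustration: the function names (build_inverted_index, lookup), the whitespace tokenizer, and the two sample documents are assumptions made for the example, not part of the course material.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Indexer: map each keyword to {doc_id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Assumed tokenizer: lowercase, split on whitespace.
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def lookup(index, keyword):
    """Retrieval: (doc_id, count, positions) for each document containing keyword."""
    postings = index.get(keyword.lower(), {})
    return sorted((doc, len(p), p) for doc, p in postings.items())

# Hypothetical documents, for illustration only.
docs = {
    "doc1": "the quick brown fox",
    "doc2": "the lazy dog and the fox",
}
index = build_inverted_index(docs)
print(lookup(index, "the"))  # [('doc1', 1, [0]), ('doc2', 2, [0, 4])]
```

In this sketch the count could feed a ranking function (more occurrences, higher rank), and the positions could support phrase or proximity queries.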
How do we lay out an inverted index?
Some data structures we could use:
  Unordered list (e.g., a log)
  Ordered list
  Tree
  Hash table
Unordered and ordered lists
Assume that we have entries such as: <keyword, #items, {occurrences}>
  What does ordering buy us?
Assume that we adopt a model in which we use: <keyword, item> <keyword, item>
  Do we get any additional benefits?
How about: <keyword, {items}> where we fix the size of the keyword and the number of items?
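One possible answer to "what does ordering buy us?" is binary search. The sketch below contrasts a linear scan with a bisect-based lookup; the entries, document IDs, and function names are hypothetical and chosen only for illustration.

```python
from bisect import bisect_left

# Hypothetical entries <keyword, {items}>, kept sorted by keyword.
entries = [
    ("ant", [3]),
    ("bob", [1, 7]),
    ("cry", [2]),
    ("dog", [1, 4, 9]),
    ("fox", [5]),
]
keys = [kw for kw, _ in entries]  # parallel list of keywords for bisect

def scan_lookup(entries, keyword):
    """Unordered list: examine entries one by one, O(n) comparisons."""
    for kw, items in entries:
        if kw == keyword:
            return items
    return None

def sorted_lookup(keys, entries, keyword):
    """Ordered list: binary search, O(log n) comparisons."""
    i = bisect_left(keys, keyword)
    if i < len(keys) and keys[i] == keyword:
        return entries[i][1]
    return None

print(sorted_lookup(keys, entries, "dog"))  # [1, 4, 9]
print(sorted_lookup(keys, entries, "cat"))  # None
```

Binary search needs random access to the i-th entry, which is one reason fixed-size records are attractive: the offset of entry i can be computed directly. The price is that inserting a new keyword into the middle of a sorted list means shifting everything after it.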
Tree-based indices
Trees have several benefits over lists:
  Potentially logarithmic search time, as with a well-designed sorted list, if the tree is balanced!
  Ability to handle variable-length records
We’ve already seen how trees might make a natural way of distributing data, as well
How does a binary search tree fare?
  Cost of building?
  Cost of finding an item in it?
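To get a feel for the binary search tree question, the following sketch builds a plain (non-rebalancing) BST twice, once from sorted keys and once from the same keys shuffled, and compares heights. The class and function names, the key count, and the random seed are assumptions made for the example.

```python
import random

class BSTNode:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def bst_insert(root, key):
    """Insert key without any rebalancing; returns the (unchanged) root."""
    if root is None:
        return BSTNode(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = BSTNode(key)
                return root
            node = node.left
        elif key > node.key:
            if node.right is None:
                node.right = BSTNode(key)
                return root
            node = node.right
        else:
            return root  # duplicate key: nothing to do

def height(root):
    """Number of nodes on the longest root-to-leaf path."""
    tallest, stack = 0, [(root, 1)]
    while stack:
        node, depth = stack.pop()
        if node is None:
            continue
        tallest = max(tallest, depth)
        stack.append((node.left, depth + 1))
        stack.append((node.right, depth + 1))
    return tallest

def build(keys):
    root = None
    for k in keys:
        root = bst_insert(root, k)
    return root

n = 500
sorted_keys = list(range(n))
shuffled_keys = sorted_keys[:]
random.Random(0).shuffle(shuffled_keys)

print(height(build(sorted_keys)))    # 500: the tree degenerates into a list
print(height(build(shuffled_keys)))  # much smaller, but not guaranteed
```

With sorted input every lookup costs as much as a list scan, which is part of the motivation for a tree that stays height-balanced by construction.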
Recap: Inverted indices
Useful for search
Different data structures can be used
  Pros / cons
Plan for today
  Inverted indices
  B+ trees (NEXT)
The B+ tree
A flexible, height-balanced, high-fanout tree
Insert/delete at log_F N cost (F = fanout, N = # leaf pages)
  Need to keep tree height-balanced
Minimum 50% occupancy (except for root)
  Each node contains d <= m <= 2d entries
  Inner nodes contain up to 2d+1 pointers
  d is called the order of the tree
Can search efficiently based on equality (or also range, though we don’t need that here)
Figure: index entries at the top (direct search); data entries ("sequence set") at the bottom, connected as a linked list (compare to B-tree!)
Example B+ Tree
Data (inverted list pointers) is at the leaves; intermediate nodes have copies of search keys
Search begins at root, and key comparisons direct it to a leaf
Search for be↓, bobcat↓ ...
Based on the search for bobcat↓, we know it is not in the tree!
Figure:
  Root: [art, best, but, dog]
  Leaves: [a↓ am↓ an↓ ant↓] [art↓ be↓] [best↓ bit↓ bob↓] [but↓ can↓ cry↓] [dog↓ dry↓ elf↓ fox↓]
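The two searches on this slide can be traced with a small sketch. It hard-codes the example tree as a root key list plus five leaves; the grouping of entries into leaves is inferred from the root keys, and the search function and its return shape are assumptions for illustration.

```python
from bisect import bisect_right

# Example tree from the slide (leaf grouping inferred from the root keys).
root_keys = ["art", "best", "but", "dog"]
leaves = [
    ["a", "am", "an", "ant"],
    ["art", "be"],
    ["best", "bit", "bob"],
    ["but", "can", "cry"],
    ["dog", "dry", "elf", "fox"],
]

def search(key):
    """Follow the root's key comparisons to one leaf, then look in that leaf."""
    # Keys equal to a separator go right, because the separator is a copy
    # of the first key in the leaf to its right.
    i = bisect_right(root_keys, key)
    leaf = leaves[i]
    return key in leaf, leaf

print(search("be"))      # (True, ['art', 'be'])
print(search("bobcat"))  # (False, ['best', 'bit', 'bob'])
```

Only one leaf is ever inspected, so once bobcat is missing from the leaf between best and but, it cannot be anywhere else in the tree.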
Inserting data into a B+ Tree
Find correct leaf L
Put data entry onto L
  If L has enough space, we are done!
  Else, must split leaf node L (into L and a new node L2)
    Redistribute entries evenly, copy up middle key
    Insert index entry pointing to L2 into parent of L
This can happen recursively
  To split index node, redistribute entries evenly, but push up middle key. (Contrast with leaf splits.)
Splits “grow” tree; root split increases height
  Tree growth: gets wider or one level taller at the top
Figure:
  Root: [art, best, but, dog]
  Leaves: [a↓ am↓ an↓ ant↓] [art↓ be↓] [best↓ bit↓ bob↓] [but↓ can↓ cry↓] [dog↓ dry↓ elf↓ fox↓]
Inserting “and↓” Example: Copy up
Want to insert and↓ into the first leaf; no room, so split & copy up:
  [a↓ am↓] [an↓ and↓ ant↓]
Entry to be inserted in parent node: an
  (Note that key “an” is copied up and continues to appear in the leaf.)
But where? Parent node is already "full"!
Figure:
  Root: [art, best, but, dog]
  Leaves: [a↓ am↓ an↓ ant↓] [art↓ be↓] [best↓ bit↓ bob↓] [but↓ can↓ cry↓] [dog↓ dry↓ elf↓ fox↓]
Inserting “and↓” Example: Push up 1/2
Need to split node & push up
Figure:
  Root (full): [art, best, but, dog]
  Key to insert into root: an
  Leaves: [a↓ am↓] [an↓ and↓ ant↓] [art↓ be↓] [best↓ bit↓ bob↓] [but↓ can↓ cry↓] [dog↓ dry↓ elf↓ fox↓]
Inserting “and↓” Example: Push up 2/2
Entry to be inserted in parent node: best
  (Note that best is pushed up and only appears once in the index. Contrast this with a leaf split.)
Figure:
  Root: [best]
  Index nodes: [an, art] [but, dog]
  Leaves: [a↓ am↓] [an↓ and↓ ant↓] [art↓ be↓] [best↓ bit↓ bob↓] [but↓ can↓ cry↓] [dog↓ dry↓ elf↓ fox↓]
Summary: Copying vs. splitting
Every keyword (search key) appears in at most one intermediate node
Hence, in splitting an intermediate node, we push up
Every inverted list entry must appear in a leaf
We may also need it in an intermediate node to define a partition point in the tree
We must copy up the key of this entry
Note that B+ trees easily accommodate multiple occurrences of a keyword
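The difference between copying up (leaf split) and pushing up (index node split) can be replayed on the “and” example with a minimal in-memory sketch. It assumes order d = 2 (at most 4 keys per node), stores only keys rather than inverted list pointers, and omits the leaf-level linked list and deletion; Node, insert, and insert_into_tree are hypothetical names.

```python
from bisect import bisect_right, insort

MAX_KEYS = 4  # assumed order d = 2, so at most 2d = 4 keys per node

class Node:
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children  # None marks a leaf

def insert(node, key):
    """Insert key under node. Returns (separator, new_right_node) on a split, else None."""
    if node.children is None:                    # leaf
        insort(node.keys, key)
        if len(node.keys) <= MAX_KEYS:
            return None
        mid = len(node.keys) // 2
        right = Node(node.keys[mid:])
        node.keys = node.keys[:mid]
        return right.keys[0], right              # COPY up: key stays in the leaf
    i = bisect_right(node.keys, key)             # index node: pick child
    split = insert(node.children[i], key)
    if split is None:
        return None
    sep, new_child = split
    node.keys.insert(i, sep)
    node.children.insert(i + 1, new_child)
    if len(node.keys) <= MAX_KEYS:
        return None
    mid = len(node.keys) // 2
    up = node.keys[mid]
    right = Node(node.keys[mid + 1:], node.children[mid + 1:])
    node.keys, node.children = node.keys[:mid], node.children[:mid + 1]
    return up, right                             # PUSH up: key leaves this level

def insert_into_tree(root, key):
    split = insert(root, key)
    if split is None:
        return root
    sep, right = split
    return Node([sep], [root, right])            # root split: one level taller

# The example tree from the slides, before inserting "and".
leaves = [Node(["a", "am", "an", "ant"]), Node(["art", "be"]),
          Node(["best", "bit", "bob"]), Node(["but", "can", "cry"]),
          Node(["dog", "dry", "elf", "fox"])]
root = Node(["art", "best", "but", "dog"], leaves)

root = insert_into_tree(root, "and")
print(root.keys)                                         # ['best']
print([c.keys for c in root.children])                   # [['an', 'art'], ['but', 'dog']]
print([l.keys for l in root.children[0].children])       # [['a', 'am'], ['an', 'and', 'ant'], ['art', 'be']]
```

After the insert, "an" appears both in an index node and in a leaf (copied up), while "best" appears in the root only once among the index nodes (pushed up) and still sits in its leaf as a data entry.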
Some details
How would you choose the order of the tree?
How would you find all the words starting with the letters 'com'?
How would you delete something?
Do you always have to split/merge?
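For the prefix question, one possible approach is to search for the prefix itself and then walk the sequence set in key order until the prefix no longer matches. The sketch below stands in for the linked leaves with a single sorted Python list; the word list and the function name prefix_scan are made up for illustration.

```python
from bisect import bisect_left

def prefix_scan(sorted_keys, prefix):
    """Find the first key >= prefix, then scan forward while keys share the prefix."""
    i = bisect_left(sorted_keys, prefix)
    matches = []
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        matches.append(sorted_keys[i])
        i += 1
    return matches

# Hypothetical keywords; a sorted list plays the role of the linked leaves.
words = sorted(["cat", "com", "comet", "common", "compute", "con", "dog"])
print(prefix_scan(words, "com"))  # ['com', 'comet', 'common', 'compute']
```

In a real B+ tree the forward scan would follow the leaf-to-leaf links rather than list indices, which is what makes range-style queries cheap.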
Virtues of the B+ Tree
B+ tree and other indices are quite efficient:
  Height-balanced; log_F N cost to search
  High fanout (F) means depth rarely more than 3 or 4
  Almost always better than maintaining a sorted file
  Typically, 67% occupancy on average
Berkeley DB library (C, C++, Java; Oracle) is a toolkit for B+ trees that you will be using later in the semester:
Interface: open B+ Tree; get and put items based on key
Handles concurrency, caching, etc.
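A quick back-of-the-envelope check of the "depth rarely more than 3 or 4" claim: the sketch below counts how many index levels are needed to reach N leaf pages with fanout F. The fanout of 100 and the page counts are assumed values for illustration, not figures from the lecture.

```python
def levels_needed(num_leaf_pages, fanout):
    """Smallest h such that fanout**h >= num_leaf_pages (integer arithmetic only)."""
    levels, reach = 0, 1
    while reach < num_leaf_pages:
        reach *= fanout
        levels += 1
    return levels

# Assumed numbers, for illustration only.
print(levels_needed(1_000_000, 100))    # 3
print(levels_needed(100_000_000, 100))  # 4
print(levels_needed(1_000_000, 2))      # 20 (binary fanout, for comparison)
```

With a fanout in the hundreds, even very large indices stay a handful of levels deep, so a lookup touches only a few pages.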
Example: B+ tree
Insert 15, 11, 12, 32, 74
65 130 187
9 25 45 70 80 101 138 150 159 122 180
1 4 6 9 14 16 25 31 38 41 45 61 63 64
65 67 68 69 70 72 75 79
How do we distribute a B+ Tree?
We need to host the root at one machine and distribute the rest
What are the implications for scalability?
Consider building the index as well as searching
Eliminating the root
Sometimes we don’t want a tree-structured system because the higher levels can be a central point of congestion or failure
Two strategies:
  Modified tree structure (e.g., the BATON p2p tree; see Jagadish et al., VLDB 2005)
  Non-hierarchical structure (distributed hash table, discussed in a couple of weeks)
Recap: B+ trees
A very common data structure for indices
Used, e.g., in many file systems and many DBMS
Very efficient
  Height-balanced; log_F N cost to search
  High fanout (F) means depth rarely more than 3 or 4
  Almost always better than maintaining a sorted file
  Typically, 67% occupancy on average