On Demand String Sorting over Unbounded Alphabets Carmel Kent Moshe Lewenstein Dafna Sheinwald

On-Demand String Sorting

Preprocess a set of n strings so that the following operation can then be repeated efficiently: extract the next lexicographically smallest string.

Motivation, e.g.: search engines repeatedly return the next best k (k < n) pages. Pages are typically ranked by relevance, but can also be ranked by the values of a specified field.

Our Heap of Strings (HoS) preprocesses in O(n) time and extracts the next smallest string in O(log n) time, plus O(N) time amortized over all operations on all n strings, where N is the total length of the n strings.

Combines the Heap and Longest Common Prefix properties.

Works for unbounded alphabets.

Sorting of strings is possible in O(n log n + N) time.

Implementing all classic Heap operations, while paying only an extra O(N) amortized, is not simple.

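LCP

Def: $\mathrm{lcp}(S_1, S_2)$ denotes the length of the longest common prefix of $S_1$ and $S_2$.

Folklore Lemma: For strings $S_1 \le S_2 \le \dots \le S_m$, $\mathrm{lcp}(S_1, S_m) = \min_{1 \le i < m} \mathrm{lcp}(S_i, S_{i+1})$.

For strings $S_1 \ge S_2, S_3$: $S_2 \le S_3 \Rightarrow \mathrm{lcp}(S_1, S_2) \le \mathrm{lcp}(S_1, S_3)$. Equivalently: $\mathrm{lcp}(S_1, S_2) > \mathrm{lcp}(S_1, S_3) \Rightarrow S_2 > S_3$.

For strings $S_1 \le S_2, S_3$: $S_2 \le S_3 \Rightarrow \mathrm{lcp}(S_1, S_2) \ge \mathrm{lcp}(S_1, S_3)$. Equivalently: $\mathrm{lcp}(S_1, S_2) < \mathrm{lcp}(S_1, S_3) \Rightarrow S_2 > S_3$.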
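
As a quick sanity check of the lemma and of the second statement above, the following Python sketch computes lcp directly and asserts both on a small example. The helper name `lcp` and the sample words are illustrative assumptions, not taken from the slides.

```python
def lcp(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


# Sample data (hypothetical): any lexicographically sorted list works.
words = sorted(["bank", "banana", "bar", "band", "bandit"])

# Folklore lemma: the lcp of the two extremes is the minimum over neighbours.
neighbours = [lcp(words[i], words[i + 1]) for i in range(len(words) - 1)]
assert lcp(words[0], words[-1]) == min(neighbours)

# For S1 <= S2, S3: a smaller lcp with S1 implies a larger string.
s1, s2, s3 = "band", "bar", "bandit"
assert s1 <= s2 and s1 <= s3
assert lcp(s1, s2) < lcp(s1, s3) and s2 > s3
```

This is the observation that lets a HoS pick the smaller of two children by comparing integers instead of re-reading letters.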

Heap of Strings (HoS)

Binary (balanced) tree.

Each node n holds a string S(n) and an lcp value lcp(n).

Each node n satisfies the HoS property, for parent p of n:

$S(n) \ge S(p)$

$\mathrm{lcp}(n) = \mathrm{lcp}(S(n), S(p))$
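
A minimal sketch of the HoS property as a checker, in Python. It assumes the tree is stored in the usual implicit array layout (node i has parent (i - 1) // 2), which the slides do not specify; the names `S`, `L`, `lcp` and `is_hos` are hypothetical.

```python
def lcp(a, b):
    """Length of the longest common prefix of a and b."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def is_hos(S, L):
    """Check the HoS property for strings S and lcp values L (L[0] is unused)."""
    for i in range(1, len(S)):
        p = (i - 1) // 2                      # parent in the array layout
        if S[i] < S[p] or L[i] != lcp(S[i], S[p]):
            return False
    return True


# Sample HoS (hypothetical data): every child is >= its parent and
# stores its lcp with the parent's string.
S = ["ab", "abc", "b", "abcd", "abd", "ba", "c"]
L = [0, 2, 0, 3, 2, 1, 0]
assert is_hos(S, L)
```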

Procedures for Heapify

Basic Step for 2 (BS2): Given strings S1 and S2, compare the two, letter by letter, until the smaller is identified and the lcp of the two is determined.

Basic Step for 3 (BS3): Given strings S1, S2, and S3, compare all three, letter by letter, until the smallest, Si, is identified and the lcp of each of the other strings with Si is determined.

More Procedures for Heapify

Basic Step for 2, starting in position l, BS2(l): Given strings S1 and S2 with a common prefix of length l, compare the two, letter by letter, starting from position l, until the smaller, Si, is identified and the lcp of the two is determined.

Basic Step for 3, starting in position l, BS3(l): Given strings S1, S2, and S3 with a common prefix of length l, compare all three, letter by letter, starting from position l, until the smallest, Si, is identified and the lcp of each of the other strings with Si is determined.
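
The basic steps can be sketched in Python as follows. This is a sketch under assumptions: the function names and return shapes are hypothetical, and `bs3` is a simplification that calls `bs2` twice and applies the folklore lemma rather than scanning the three strings in lockstep as the slides describe, so it may re-read a few letters.

```python
def bs2(s1, s2, l=0):
    """BS2(l): compare s1 and s2 from position l on.
    Returns (index of the smaller string, lcp of the two)."""
    k = l
    while k < len(s1) and k < len(s2) and s1[k] == s2[k]:
        k += 1
    if k == len(s1):                 # s1 is a prefix of s2 (or equal)
        return 0, k
    if k == len(s2):                 # s2 is a proper prefix of s1
        return 1, k
    return (0 if s1[k] < s2[k] else 1), k


def bs3(s1, s2, s3, l=0):
    """BS3(l), simplified: returns (index of the smallest string,
    {index of each other string: its lcp with the smallest})."""
    strings = (s1, s2, s3)
    w, l12 = bs2(s1, s2, l)          # winner of the first two
    loser = 1 - w
    x, lw3 = bs2(strings[w], s3, l)  # winner against the third
    if x == 0:
        return w, {loser: l12, 2: lw3}
    # s3 <= winner <= loser, so the folklore lemma gives lcp(s3, loser).
    return 2, {w: lw3, loser: min(lw3, l12)}
```

For example, `bs2("abc", "abd")` reports that the first string is smaller with lcp 2, and `bs3("abd", "abc", "ab")` reports the third string as the smallest, with lcp 2 against each of the others.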

Heapify - based on the classic O(n) process

Strings are thrown into a balanced binary tree.

Bottom up, subtrees are made into HoS-s.

Merge two little HoS-es and a root node into one big HoS

BS3 of the three nodes (the two larger get their lcp wrt the smallest).

At most one subtree gets a new root.

This new root, as well as its two children, all have larger strings than the grand root, and have lcp-s wrt it.

Swap, if needed, to get the smallest positioned as root.

Merge two little HoS-es and a root node into one big HoS

Comparing lcp-s suffices. On equality, read on from the lcp, with BS2(l) or BS3(l), until the smallest is found and the updated lcp is determined.

Always record the lcp with the strings found larger, and swap, if needed, to make the smallest the parent, thus maintaining the HoS property.

If a swap was needed, continue recursively, sifting down.

Merge two little HoS-es and a root node into one big HoS: Sifting Down

On each swap, a sub-HoS gets a new root.

That new root, as well as its two children, all have larger strings than, and an lcp wrt, the old root, which is now positioned as the parent of the new root.

Comparing lcp-s suffices to tell the smallest of the three.

On a tie, read more of the two (or three) strings, starting at the common lcp, until the smallest is found and the updated lcp is determined.

Record the updated lcp with the larger string(s).
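
The merge and sift-down steps might be put together as in the Python sketch below. It is written under stated assumptions, not as the authors' implementation: the tree is an implicit array (children of i at 2i + 1 and 2i + 2), the names `S`, `L`, `bs2`, `sift_down` and `heapify` are hypothetical, ties among three strings are resolved by repeated two-way comparisons plus the folklore lemma instead of a lockstep BS3(l), and at the top of each merge the three nodes are treated as having lcp 0 with an imaginary empty parent string.

```python
def bs2(s1, s2, l=0):
    """Compare s1 and s2 from position l on; return (index of smaller, lcp)."""
    k = l
    while k < len(s1) and k < len(s2) and s1[k] == s2[k]:
        k += 1
    if k == len(s1):                 # s1 is a prefix of s2 (or equal)
        return 0, k
    if k == len(s2):                 # s2 is a proper prefix of s1
        return 1, k
    return (0 if s1[k] < s2[k] else 1), k


def sift_down(S, L, i):
    """Sift S[i] down. Precondition: node i and its children all hold lcp
    values wrt one common string that is <= all of them."""
    n = len(S)
    while True:
        cand = [j for j in (i, 2 * i + 1, 2 * i + 2) if j < n]
        m = max(L[j] for j in cand)
        tied = [j for j in cand if L[j] == m]   # largest lcp: smallest strings
        best, losers = tied[0], []
        for j in tied[1:]:                      # on a tie, read on from m
            w, k = bs2(S[best], S[j], m)
            if w == 1:                          # S[j] is smaller: new best
                for q in losers:                # folklore lemma for old losers
                    L[q] = min(L[q], k)
                best, j = j, best
            L[j] = k                            # loser records lcp wrt winner
            losers.append(j)
        if best == i:
            return
        S[i], S[best] = S[best], S[i]           # smallest becomes the parent
        L[i], L[best] = L[best], L[i]
        i = best                                # continue in that subtree


def heapify(strings):
    """Bottom-up construction; returns (S, L) satisfying the HoS property."""
    S, L = list(strings), [0] * len(strings)
    for i in reversed(range(len(S) // 2)):
        sift_down(S, L, i)
    return S, L


S, L = heapify(["c", "a", "ab", "abc", "abd"])
print(S, L)  # ['a', 'abc', 'ab', 'c', 'abd'] [0, 1, 1, 0, 2]
```

A string that is not tied for the largest lcp keeps its lcp value unchanged, since its lcp with the winner equals its lcp with the old parent.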

Sifting down in a HoS of height h

O(h) node operations.

For each string comparison, at least one string has its lcp field increased by the number of letter comparisons made.

Heapifying into a HoS of n strings of total length N takes O(n+N) time

A string with lcp = l never gets its prefix of length l participating in any letter comparison

No more than a total of O(N) letter comparisons

Heapifying completes in O(n) node operations + a total of O(N) letter comparisons.
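
The two counts can be spelled out as a short worked bound. The first line is the standard bottom-up heap-construction sum, assuming at most ⌈n / 2^(h+1)⌉ nodes at height h; the second relies on the slide's claim that letter comparisons are paid for by lcp growth, with each lcp field bounded by the length of its string.

```latex
% Node operations: a sift-down started at height h costs O(h).
\sum_{h=0}^{\lfloor \log_2 n \rfloor}
  \left\lceil \frac{n}{2^{h+1}} \right\rceil \cdot O(h)
  = O\!\left( n \sum_{h \ge 0} \frac{h}{2^{h}} \right)
  = O(n)

% Letter comparisons: lcp(S_i) never decreases and never exceeds |S_i|,
% so the comparisons charged to lcp growth total at most
\sum_{i=1}^{n} |S_i| = N
```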

Extracting next smallest string from a HoS

Extract the root.

Both (now orphan) children are larger than, and have an lcp wrt, their gone parent.

Comparing lcp-s suffices to find the smaller of the two.

In case of a tie, BS2(l) finds the smaller and updates the lcp; record the updated lcp with the larger child.

Promote the smaller child to the vacant parent position.

Recurse in the subtree rooted by the promoted child.

The HoS property is maintained.

The tree might become unbalanced, but not higher.

Extracting next smallest string from a HoS

For a HoS of height h: O(h) node operations.

Letter comparisons only from the common lcp on.

No more than a total of O(N) letter comparisons for heapify followed by sequential extraction of all strings.

For each letter comparison, at least one lcp grows. An lcp never decreases.

The HoS becomes unbalanced, but its height does not grow.

Thm: Sorting of n strings of total length N, over an unbounded alphabet, is possible in O(n log n + N) time, using O(n) space.
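
The extraction step could look like the Python sketch below. Several things here are assumptions made to keep the example short: the array layout with `None` marking vacated nodes, the names `extract_min` and `build_lcps`, and the fact that the sample HoS is built naively from an already heap-ordered list rather than by the heapify procedure of the earlier slides.

```python
def lcp(a, b):
    """Length of the longest common prefix of a and b."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def bs2(s1, s2, l=0):
    """Compare s1 and s2 from position l on; return (index of smaller, lcp)."""
    k = l
    while k < len(s1) and k < len(s2) and s1[k] == s2[k]:
        k += 1
    if k == len(s1):
        return 0, k
    if k == len(s2):
        return 1, k
    return (0 if s1[k] < s2[k] else 1), k


def build_lcps(S):
    """Naive helper (assumption): S must already be in min-heap order."""
    return [0] + [lcp(S[i], S[(i - 1) // 2]) for i in range(1, len(S))]


def extract_min(S, L):
    """Remove and return the root string, promoting smaller children upward."""
    n, top, i = len(S), S[0], 0
    while True:
        kids = [j for j in (2 * i + 1, 2 * i + 2) if j < n and S[j] is not None]
        if not kids:
            S[i] = None                      # vacated leaf: tree gets unbalanced
            return top
        win = kids[0]
        if len(kids) == 2:
            a, b = kids
            if L[a] == L[b]:                 # tie: read on from the common lcp
                w, k = bs2(S[a], S[b], L[a])
                win, lose = (a, b) if w == 0 else (b, a)
                L[lose] = k                  # larger child records updated lcp
            elif L[b] > L[a]:                # larger lcp means smaller string
                win = b
        S[i], L[i] = S[win], L[win]          # promote the smaller child
        i = win                              # recurse into its old position


S = ["ab", "abc", "b", "abcd", "abd", "ba", "c"]   # heap-ordered sample data
L = build_lcps(S)
out = [extract_min(S, L) for _ in range(7)]
print(out)  # ['ab', 'abc', 'abcd', 'abd', 'b', 'ba', 'c']
```

The promoted child keeps its lcp value as is: it was recorded against the string that just moved up to become its parent.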

O(n log n + N) string sorting

String sorting is a classical problem that appears in textbooks [Knuth, AHU].

Variants: multikey sorting, parallel sorting.

A weight-balanced ternary search trie [M 79] achieves this runtime.

QuickSort with an average sorting time of O(n log n + N) [BS, 97].

Multi-phase merge sort, for enhancing cache utilization [I, 05. IBM in the '80s].

O(n log n + N) string sorting

Indexing data structures: suffix trees, suffix arrays, BIS [AKLL, 05], for suffixes of the same string.

These allow O(n log n + N) sorting.

Some (BIS) can be adjusted to a general set of strings.

All use O(n log n + N) time just to build the data structure and get the first result out.

Efficient On Demand Sorting

Thm: On Demand Sorting of n strings of total length N can be done with the extraction of the first result in $O(n + N_1)$ time, after which further results are retrieved in $O(\log n + N_i)$ time for the i-th result, with $\sum_i N_i \le N$.

Variation: Find the smallest k < n strings

Maintain a HoS of k elements, with parents LARGER than children. The root holds the largest of the smallest k.

Build a HoS from an arbitrary k strings of the set.

For each remaining string in the set: compare it with the root and determine the lcp of the two. If the new string is larger than the root, discard the new string; otherwise, discard the root, put the new string in its place, and sift down.

Find the smallest k < n strings

O(n log k + N) to identify the k smallest of n, plus O(k log k) to get these sorted.
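
A sketch of the k-smallest variation in Python, using a HoS with parents larger than children. All names (`bs2_max`, `sift_down_max`, `k_smallest`) and the array layout are assumptions made for illustration, and ties are broken by repeated two-way comparisons rather than a lockstep three-way step. The largest lcp with the common reference string still identifies the winner, here the largest string.

```python
def bs2_max(s1, s2, l=0):
    """Compare from position l on; return (index of the LARGER string, lcp)."""
    k = l
    while k < len(s1) and k < len(s2) and s1[k] == s2[k]:
        k += 1
    if k == len(s2):                 # s2 is a prefix of s1 (or equal)
        return 0, k
    if k == len(s1):                 # s1 is a proper prefix of s2
        return 1, k
    return (0 if s1[k] > s2[k] else 1), k


def sift_down_max(S, L, i):
    """Sift S[i] down in a max-oriented HoS; node i and its children must
    hold lcp values wrt one common string that is >= all of them
    (or all 0, when no such string exists yet)."""
    n = len(S)
    while True:
        cand = [j for j in (i, 2 * i + 1, 2 * i + 2) if j < n]
        m = max(L[j] for j in cand)
        tied = [j for j in cand if L[j] == m]
        best, losers = tied[0], []
        for j in tied[1:]:
            w, k = bs2_max(S[best], S[j], m)
            if w == 1:                          # S[j] is larger: new best
                for q in losers:
                    L[q] = min(L[q], k)
                best, j = j, best
            L[j] = k                            # loser records lcp wrt winner
            losers.append(j)
        if best == i:
            return
        S[i], S[best] = S[best], S[i]
        L[i], L[best] = L[best], L[i]
        i = best


def k_smallest(strings, k):
    """Return the k smallest strings (unsorted); requires 1 <= k <= len(strings)."""
    S, L = list(strings[:k]), [0] * k
    for i in reversed(range(k // 2)):           # HoS from an arbitrary k strings
        sift_down_max(S, L, i)
    for x in strings[k:]:
        w, l = bs2_max(S[0], x)                 # compare with root, get the lcp
        if w == 0:                              # x <= root: x replaces the root
            S[0], L[0] = x, l
            sift_down_max(S, L, 0)
    return S


print(sorted(k_smallest(["d", "b", "e", "a", "c"], 2)))  # ['a', 'b']
```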

Can HoS do additional operations of the ordinary (integer) Heap?

Already seen: extract-min costs O(height), yet does not keep the tree balanced.

The classic delete takes the last leaf and sifts it down from the root, thus maintaining balance. The leaf loses the lcp it has gained, and compares its leading letters again.

Classic insertion works by sifting up. Some nodes get their grandparent as their new parent and need to decrease their lcp.

Insertion, creating embedded data structure

Original nodes do not move. Grandparents do not become parents.

When pumping up for extraction, the smallest node leaves its BIS and becomes a HoS node.

A HoS node never gets into a BIS.

BIS: Balanced Indexing Structures [AKLL, 2005]

Adapted here from suffixes to any set of strings.

A BIS is an AVL tree with fixed-size extra info (lcp-s and pointers) in its nodes.

It allows inserting a string of length l into a tree of size n in O(log n + l) time.

Deletion takes O(log n) time.

Altogether

Thm: It is possible to construct a heap of n strings in O(n) time and support further string insertions and smallest-string extractions in time O(log(n) + log(m)) each, plus O(N) amortized over the whole sequence of heap operations, where m is the number of strings that were inserted after heapifying and have not yet been extracted.

Conclusion

Combining basic elements, we support a modern concept: On Demand

lcp proves once again to be an interesting, useful measure (lcp with what?)

This is a real need: basic sort and k smallest sort are implemented in a search engine product