Sets of Digital Data CSCI 2720 Fall 2005 Kraemer


  • Sets of Digital Data, CSCI 2720, Fall 2005, Kraemer

  • Digital Data
    In earlier work with BSTs and various balanced trees, we compared keys for order or equality.
    Here, we take advantage of the structure of the key:
    Use it as an index, or
    Decompose a string key into characters, or
    Treat the key as a numerical quantity on which we can perform operations.

  • Assumptions
    We will construct and manipulate sets that are drawn from a universe U of size N: U = {u0, ..., uN-1}.
    A relatively simple procedure exists by which we can compute, for an element u ∈ U, the index i such that u = ui.
    This is easy if U is a set of integers.
    It is also easy if U is a set of characters with character codes in a contiguous interval.

  • Bit Vector
    Used to represent a subset S ⊆ U.
    A table of N bits, Bits[0..N-1]:
    Bits[i] == 1 if ui ∈ S
    Bits[i] == 0 if ui ∉ S
    Example: today's attendance

    Bits:            1 1 0 1 0 1 1
    Student number:  0 1 2 3 4 5 6
    1 = present, 0 = absent

  • Bit Vectors
    Assume:
    Determining an element's index takes constant time.
    Accessing a position in the table takes constant time.
    (Each may actually take several operations and depend somewhat on N, the size of the universe, but not on the size of the set represented.)
    Then Insert, Delete, and Member are constant-time operations.
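The bit vector and its three constant-time operations can be sketched as follows. This is a minimal illustration, assuming the universe is simply the integers 0..N-1 so that an element is its own index; the class name `BitVector` and the byte-packed layout are choices made for this example, not something the slides prescribe.

```python
class BitVector:
    """Subset of the universe {0, ..., N-1}, one bit per possible element."""

    def __init__(self, n):
        self.n = n
        # bytearray is zero-filled, so the set starts out empty.
        self.bits = bytearray((n + 7) // 8)

    def insert(self, i):
        # Set bit i: pick the byte (i >> 3), then the bit inside it (i & 7).
        self.bits[i >> 3] |= 1 << (i & 7)

    def delete(self, i):
        # Clear bit i; the 0xFF mask keeps the value inside one byte.
        self.bits[i >> 3] &= ~(1 << (i & 7)) & 0xFF

    def member(self, i):
        # Shift the wanted bit down to position 0 and test it.
        return (self.bits[i >> 3] >> (i & 7)) & 1 == 1


# Attendance example from the slide: students 0, 1, 3, 5 and 6 are present.
attendance = BitVector(7)
for student in (0, 1, 3, 5, 6):
    attendance.insert(student)
present = [attendance.member(i) for i in range(7)]
```

Each operation touches exactly one byte, regardless of how many elements the set holds.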

  • Bit Vectors
    A subset of a set of size N always takes N bits to represent, independent of the size of the subset.
    This makes sense if:
    N is not too large
    We need to represent sets of size comparable to N

  • Storage Efficiency: Bit Vector vs. Binary Trees
    A binary tree for a set of size n requires n(2p + K) bits, where
    K >= lg N is the size of the field that represents a key value, and
    p is the number of bits in a pointer.
    A bit vector takes N bits.
    If n is comparable to N, the bit vector is more efficient.
    If p = K = 32, the tree becomes more space efficient when n/N drops below about 1%.
    The exact break-even point is n(2p + K) = N, which is n/N = 1/96.
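The break-even arithmetic can be checked with a few lines of code. This is only a worked example of the slide's formulas; the function names and the universe size N = 96,000 are assumptions chosen so the numbers come out round.

```python
from fractions import Fraction


def tree_bits(n, p, K):
    """Bits used by a binary tree holding n keys: two pointers plus a key per node."""
    return n * (2 * p + K)


def bit_vector_bits(N):
    """Bits used by a bit vector over a universe of size N, whatever the set size."""
    return N


def break_even_density(p, K):
    """The ratio n/N at which n(2p + K) = N."""
    return Fraction(1, 2 * p + K)


p = K = 32
N = 96_000  # hypothetical universe size; K = 32 >= lg N holds
threshold = break_even_density(p, K)        # 1/96, a little over 1%
sparse = tree_bits(500, p, K)               # set well below the threshold
dense = tree_bits(2_000, p, K)              # set well above the threshold
```

With these numbers the 500-element tree needs 48,000 bits (less than the 96,000-bit vector), while the 2,000-element tree needs 192,000 bits (more than the vector).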

  • When to use Bit Vectors?
    When the universe is relatively small
    When sets are large in relation to the size of the universe

  • Advantages of Bit Vectors
    O(1) implementation of Insert, Delete, Member
    Union and Intersection are easy: implement them via the Boolean OR and AND operations
    They may actually take less than one operation per element, since operations are performed on a full machine word
    If the machine word is 32 bits, one machine operation handles 32 potential elements of the set
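Word-at-a-time Union and Intersection might look like the sketch below. It assumes a 32-bit word and represents each bit vector as a Python list of integers standing in for machine words; the helper names `to_words` and `to_set` are invented for the example.

```python
WORD = 32  # assumed machine word size in bits


def to_words(elements, n):
    """Pack a set of indices from a universe of size n into 32-bit words."""
    words = [0] * ((n + WORD - 1) // WORD)
    for i in elements:
        words[i // WORD] |= 1 << (i % WORD)
    return words


def union(a, b):
    # One OR per word covers 32 potential elements at once.
    return [x | y for x, y in zip(a, b)]


def intersection(a, b):
    # One AND per word covers 32 potential elements at once.
    return [x & y for x, y in zip(a, b)]


def to_set(words):
    """Unpack words back into a Python set of indices (for inspection only)."""
    return {
        w * WORD + b
        for w, word in enumerate(words)
        for b in range(WORD)
        if (word >> b) & 1
    }


a = to_words({1, 5, 40}, 64)
b = to_words({5, 40, 63}, 64)
u = to_set(union(a, b))
v = to_set(intersection(a, b))
```

A universe of 64 elements fits in two words, so each set operation here costs two word operations rather than 64 per-element steps.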

  • Disadvantages of Bit Vectors
    On some computers, access to individual bits can require shifting and masking operations (expensive).
    The result is that Member may be much more expensive than Union.
    Initialization takes Θ(N) time: zero all the bits in the vector.
    A constant-time initialization algorithm can be used instead,
    but that makes the storage requirement go to 2p + 1 bits per element.
    So, in practice, just use machine operations to set the vector to zero, which are efficient.
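The slides do not say which constant-time initialization algorithm they have in mind. One well-known candidate, which matches the 2p + 1 bits per element (a forward pointer, a back pointer, and the bit itself), is the two-array trick sketched below. Python cannot leave memory uninitialized, so the example fills the arrays with random garbage to stand in for it; that fill is an artifact of the demonstration, and the names `when`, `which`, and `top` are assumptions.

```python
import random


class LazyBitVector:
    """Bit vector that never clears its storage up front."""

    def __init__(self, n, seed=0):
        rng = random.Random(seed)
        # Stand-ins for uninitialized memory: arbitrary garbage, never zeroed.
        self.bits = [rng.randint(0, 1) for _ in range(n)]
        self.when = [rng.randrange(n) for _ in range(n)]   # forward pointers
        self.which = [rng.randrange(n) for _ in range(n)]  # back pointers
        self.top = 0  # the only field that really needs initializing

    def _valid(self, i):
        # Position i holds real data only if its forward pointer lands in the
        # used part of `which` and the back pointer there points back at i.
        j = self.when[i]
        return j < self.top and self.which[j] == i

    def _touch(self, i):
        # First real use of position i: register it and give it a clean bit.
        if not self._valid(i):
            self.when[i] = self.top
            self.which[self.top] = i
            self.top += 1
            self.bits[i] = 0

    def insert(self, i):
        self._touch(i)
        self.bits[i] = 1

    def delete(self, i):
        self._touch(i)
        self.bits[i] = 0

    def member(self, i):
        return self._valid(i) and self.bits[i] == 1


v = LazyBitVector(100)
empty_at_start = not any(v.member(i) for i in range(100))
v.insert(42)
```

Garbage can never pass the validity check by accident, because every entry of `which` below `top` was written by a genuine first touch.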

  • Tries and Digital Search Trees
    If the key can be decomposed into characters, then the characters of the key can be used as indices.
    Tries are based on this idea.
    "Trie" comes from the middle of "retrieval"; it is a pun on "tree", but is pronounced "try".

  • Tries
    Assume k possible character values.
    A trie is a (k+1)-ary tree; each node is a table of k+1 pointers:
    One pointer for each possible character
    One for the end-of-string character

  • Trie Example

  • Tries
    The path for a key of m characters has length m, with a pointer at the end-of-string character.
    We don't need to store the key itself: it is the path followed.
    An info field might be pointed to by the end-of-string element.
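A trie with table nodes can be sketched as below. The example assumes an alphabet of the 26 uppercase letters (k = 26), uses slot k of each node as the end-of-string pointer, and stores an arbitrary info value there; the integer info values are placeholders, and the keys are the names that appear in the later slides.

```python
K = 26    # assumed alphabet: uppercase A..Z
END = K   # slot k + 1 of each node is the end-of-string pointer


def new_node():
    # A node is just a table of k + 1 pointers, all initially null.
    return [None] * (K + 1)


def insert(root, key, info):
    node = root
    for ch in key:
        i = ord(ch) - ord("A")      # the character itself is the index
        if node[i] is None:
            node[i] = new_node()
        node = node[i]
    node[END] = info                # the key is the path; only info is stored


def lookup(root, key):
    node = root
    for ch in key:
        node = node[ord(ch) - ord("A")]
        if node is None:
            return None
    return node[END]


root = new_node()
for number, name in enumerate(["TURING", "MENDEL", "MENDELEEV"], start=1):
    insert(root, name, number)
```

Lookup follows one pointer per character, so its cost depends only on the key length, not on how many keys are stored.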

  • Tries: Analysis
    Let:
    n be the number of keys stored in the trie
    l be the length (in characters) of the longest key
    s be the number of nodes in the trie
    k be the size of the alphabet
    Pro: access time is O(l), independent of k, n, and s
    Con: size; the trie requires (k+1) * s * p bits, where p is the number of bits in a pointer
    Most pointers are null, so there is a lot of wasted space

  • Strategies for reducing storage requirements of tries
    Implement a k-ary trie with m nodes as a 2-D, m by k table.
    [Table: one row per node (0, 1, 2, ...), one column per character (A, B, C, D, E, ..., M, ..., P, ..., T, ...); each non-empty entry holds the number of a child node (1 through 10 in the example).]

  • Table approach
    Number the nodes in the diagram of slide 13 from 1 to m.
    The table entry corresponding to the jth child of the ith node is the index of the child node.
    How does that save space? There are just as many nodes and entries as on slide 13, but each entry needs only ceil(lg(m)) bits to represent, which is smaller than a pointer.
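The table approach can be sketched as follows. Two details here are assumptions made for the example rather than taken from the slides: rows are numbered from 0 with the root in row 0, so that 0 can double as the "empty" entry (no child ever points back to the root), and an extra column flags the end of a key.

```python
from math import ceil, log2

K = 26    # assumed alphabet: uppercase A..Z
END = K   # extra column: 1 if a key ends at this node


def build_table(keys):
    table = [[0] * (K + 1)]                  # row 0 is the root
    for key in keys:
        row = 0
        for ch in key:
            col = ord(ch) - ord("A")
            if table[row][col] == 0:         # no child yet: add a new row
                table.append([0] * (K + 1))
                table[row][col] = len(table) - 1
            row = table[row][col]            # entry is a row index, not a pointer
        table[row][END] = 1
    return table


def member(table, key):
    row = 0
    for ch in key:
        row = table[row][ord(ch) - ord("A")]
        if row == 0:
            return False
    return table[row][END] == 1


table = build_table(["TURING", "MENDEL", "MENDELEEV"])
m = len(table)                     # number of nodes (rows)
bits_per_entry = ceil(log2(m))     # bits needed to name any of the m rows
```

For these three keys the table has 16 rows, so 4 bits per entry suffice, compared with a full pointer per entry in the linked version.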

  • Patricia Tree
    Another strategy for reducing space in a trie.
    Patricia: Practical Algorithm to Retrieve Information Coded in Alphanumeric.
    Eliminate nodes with only one nonempty child.
    In our example, we can now skip right from T to the end of TURING,
    and skip ahead to the E-or-end-of-string branch in the MENDEL, MENDELEEV chain.
    But we need to store with each node the index of the character on which it discriminates,
    and we need to store the key itself at the leaf.

  • Patricia tree

  • de la Briandais trees
    Another strategy to save space vs. standard tries.
    Use a linked list instead of a table at the node level.
    Each pointer is labeled with the character it indexes.
    Longer search time than tries; it depends on the size of the character set.
    Saves significant amounts of memory.
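A de la Briandais tree can be sketched with child/sibling links, as below. Each list cell carries the character it stands for, a link to the next cell at the same level, and a link to the first cell of the next level. The `is_key` flag replaces the end-of-string pointer, and the dummy root cell is a convenience of this example; both are assumptions, not details from the slides.

```python
class Cell:
    """One entry in a node's linked list, labeled with its character."""

    def __init__(self, char):
        self.char = char
        self.child = None     # first cell of the list one level down
        self.sibling = None   # next cell in this level's list
        self.is_key = False   # stands in for the end-of-string pointer


def insert(root, key):
    node = root
    for ch in key:
        prev, cur = None, node.child
        while cur is not None and cur.char != ch:   # linear scan of the list
            prev, cur = cur, cur.sibling
        if cur is None:                             # character not present yet
            cur = Cell(ch)
            if prev is None:
                node.child = cur
            else:
                prev.sibling = cur
        node = cur
    node.is_key = True


def member(root, key):
    node = root
    for ch in key:
        cur = node.child
        while cur is not None and cur.char != ch:
            cur = cur.sibling
        if cur is None:
            return False
        node = cur
    return node.is_key


root = Cell("")   # dummy cell; its child list is the top level
for name in ["TURING", "MENDEL", "MENDELEEV"]:
    insert(root, name)
```

Only characters that actually occur get a cell, which is where the memory saving comes from; the price is the list scan at each level, which can take up to k steps.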

  • de la Briandais

  • Another strategy
    Use tries at the first few levels.
    Use ordinary BSTs or de la Briandais trees at the lower levels.
    Reasoning: we get the speed advantage at the top, where not too much extra memory is required, and save space at the lower levels.

  • Digital Search Trees
    Treat keys as bit strings (strings over the alphabet {0,1}).
    Binary tree; the search is directed left on 0, right on 1.
    Each node contains not only two pointers but also a key that matches the bit-string prefix leading to that node.
    Compare for equality before searching left or right.
    If frequencies are known, store higher-frequency keys nearer the root.
    Can be grown dynamically.
    Expected search time: O(log n).
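A digital search tree can be sketched as below. The example assumes keys are distinct non-negative integers that fit in a fixed width of 8 bits and that bits are examined from the most significant end; the width, the class name, and the sample keys are all choices made for illustration.

```python
WIDTH = 8  # assumed fixed key length in bits


class DSTNode:
    def __init__(self, key):
        self.key = key
        self.left = None    # followed when the current bit is 0
        self.right = None   # followed when the current bit is 1


def bit(key, depth):
    """Bit of key examined at the given depth, most significant bit first."""
    return (key >> (WIDTH - 1 - depth)) & 1


def insert(root, key):
    """Insert key and return the (possibly new) root."""
    if root is None:
        return DSTNode(key)
    node, depth = root, 0
    while node.key != key:                       # equality test comes first
        side = "right" if bit(key, depth) else "left"
        nxt = getattr(node, side)
        if nxt is None:
            setattr(node, side, DSTNode(key))    # grow the tree dynamically
            return root
        node, depth = nxt, depth + 1
    return root


def search(root, key):
    node, depth = root, 0
    while node is not None:
        if node.key == key:                      # compare before branching
            return True
        node = node.right if bit(key, depth) else node.left
        depth += 1
    return False


root = None
for k in [5, 200, 77, 130, 3]:
    root = insert(root, k)
```

Because the first inserted key sits at the root, inserting keys in decreasing order of access frequency places the frequent ones nearest the top.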

  • Digital Search Tree