Signature File


  • 8/3/2019 Signature File


    Signature Files

    Information Retrieval: Data Structures and Algorithms

    by W.B. Frakes and R. Baeza-Yates (Eds.)

    Englewood Cliffs, NJ: Prentice Hall, 1992.

    (Chapter 4)


    Signature Files

    Characteristics

    Word-oriented index structures based on hashing

    Low overhead (10% to 20% over the text size), at the cost of forcing a sequential search over the index

    Suitable for texts that are not very large

    Inverted files outperform signature files for most applications


    Structure

    Superimposed coding is used to create the signature.

    Each text is divided into logical blocks.

    A block contains n distinct non-common words. Each word yields a word signature.

    A word signature is a B-bit pattern with m bits set to 1.

    Each word is divided into successive, overlapping triplets, e.g. free --> fr, fre, ree, ee

    Each such triplet is hashed to a bit position.

    The word signatures are ORed to form the block signature.

    Block signatures are concatenated to form the document signature.
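The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hash function (a truncated SHA-1 digest taken modulo B) and the blank padding of the first and last triplets are choices made for the sketch, not something the slides specify. Because triplets can collide, the number of 1 bits per word is at most the number of triplets rather than exactly m.

```python
import hashlib

def triplets(word):
    """Successive, overlapping triplets of a blank-padded word (padding is an assumption)."""
    padded = f" {word} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def bit_position(triplet, B):
    """Hypothetical hash: map a triplet to a bit position in [0, B)."""
    digest = hashlib.sha1(triplet.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % B

def word_signature(word, B):
    """Set one bit per triplet in a B-bit pattern (held in a Python int)."""
    sig = 0
    for t in triplets(word):
        sig |= 1 << bit_position(t, B)
    return sig

def block_signature(words, B):
    """OR the word signatures of a logical block together."""
    sig = 0
    for w in words:
        sig |= word_signature(w, B)
    return sig
```

Storing a signature as an integer keeps the OR step a single operation; a real system would pack the bits into fixed-width records on disk.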


    Example

    Example (n=2, B=12, m=4)

    word               signature
    free               001 000 110 010
    text               000 010 101 001
    block signature    001 010 111 011

    Search

    Use the hash function to determine the m bit positions that are set to 1 in the signature of the search word.

    Examine each block signature for 1s in the bit positions where the signature of the search word has a 1.
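As a small check of the example and of the search rule, the sketch below ORs the two word signatures from the table and tests whether a query signature qualifies. The helper names `sig` and `qualifies` are made up for this illustration, and the signature used to demonstrate a false drop is hypothetical.

```python
def sig(bits):
    """Parse a signature written as in the example, e.g. '001 000 110 010'."""
    return int(bits.replace(" ", ""), 2)

def qualifies(block_sig, word_sig):
    """True if every 1 bit of the word signature is also set in the block signature."""
    return (block_sig & word_sig) == word_sig

free = sig("001 000 110 010")
text = sig("000 010 101 001")
block = free | text                      # superimposed (ORed) block signature

# Hypothetical signature of a word that is NOT in the block but whose
# 1 bits all happen to be covered by the block signature: a false drop.
other = sig("001 010 001 001")
```

Here `qualifies(block, other)` is true even though the word is absent, which is exactly the situation the next slide quantifies.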


    False Drop

    False drop (false alarm, or false hit) probability $F_d$: the probability that a block signature seems to qualify, given that the block does not actually qualify.

    $F_d = \mathrm{Prob}\{\text{signature qualifies} \mid \text{block does not}\}$

    For a given value of B, the value of m that minimizes the false drop probability is such that each row of the matrix contains 1s with probability 0.5. In that case:

    $F_d = 2^{-m}$

    $m = B \ln 2 / n$
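To get a feel for the two formulas, the following sketch evaluates them numerically. The function names are illustrative only. For the parameters of the earlier example (B=12, n=2) the optimum is about 4.16, which is consistent with the m=4 used there.

```python
import math

def optimal_m(B, n):
    """m = B ln 2 / n, the value that minimizes the false drop probability."""
    return B * math.log(2) / n

def false_drop(m):
    """Fd = 2^(-m), valid when m is chosen as above."""
    return 2.0 ** (-m)

m_example = optimal_m(12, 2)     # about 4.16 for the 12-bit, 2-word example
fd_example = false_drop(4)       # false drop probability with m rounded to 4
```

With m=4 this gives a false drop probability of 1/16, which shows why practical signatures use far more than 12 bits.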


    Sequential Signature File (SSF)

    [Figure: signature matrix; axis label: documents]

    Assume documents span exactly one logical block, so the size of the document signature F equals the size of the block signature B.


    Classification of Signature-Based Methods

    Compression: if the signature matrix is deliberately sparse, it can be compressed.

    Vertical partitioning: storing the signature matrix column-wise improves the response time at the expense of insertion time.

    Horizontal partitioning: grouping similar signatures together and/or providing an index on the signature matrix may result in better-than-linear search.


    Classification of Signature-Based Methods

    Sequential storage of the signature matrix

      without compression
        sequential signature files (SSF)

      with compression
        bit-block compression (BC)
        variable bit-block compression (VBC)

    Vertical partitioning

      without compression
        bit-sliced signature files (BSSF, B'SSF)
        frame-sliced signature files (FSSF)
        generalized frame-sliced signature files (GFSSF)


    Classification of Signature-Based Methods (Continued)

    Vertical partitioning (continued)

      with compression
        compressed bit slices (CBS)
        doubly compressed bit slices (DCBS)
        no-false-drop method (NFD)

    Horizontal partitioning

      data independent partitioning
        Gustafson's method
        partitioned signature files

      data dependent partitioning
        2-level signature files
        S-trees


    Criteria

    the storage overhead

    the response time on single word queries

    the performance on insertion, as well as whether the insertion maintains the append-only property


    Compression

    idea

    Create sparse document signatures on purpose.

    Compress them before storing them sequentially.

    Method

    Use a B-bit vector, where B is large.

    Hash each word into one (or k) bit position(s).

    Use run-length encoding (McIlroy 1982).


    Compression using run-length encoding

    word               signature
    data               0000 0000 0000 0010 0000
    base               0000 0001 0000 0000 0000
    management         0000 1000 0000 0000 0000
    system             0000 0000 0000 0000 1000
    block signature    0000 1001 0000 0010 1000

    L1, L2, L3, L4, L5 are the lengths of the runs of 0s in the block signature.

    The signature is stored as [L1] [L2] [L3] [L4] [L5], where [x] is the encoded value of x.

    Search: decode the encoded lengths of all the preceding intervals.

    Example: search for data

    (1) data ==> 0000 0000 0000 0010 0000

    (2) decode [L1]=0000, decode [L2]=00, decode [L3]=000000

    Disadvantage: search becomes slow.
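A minimal sketch of the run-length idea, assuming the run lengths are simply kept as a list of integers; the actual bit-level encoding [x] is not given on the slide and is left out here. The function names are hypothetical.

```python
def run_lengths(bits):
    """Lengths of the runs of 0s: one before each 1, plus the trailing run."""
    runs, current = [], 0
    for b in bits.replace(" ", ""):
        if b == "1":
            runs.append(current)
            current = 0
        else:
            current += 1
    runs.append(current)              # zeros after the last 1
    return runs

def bit_is_set(runs, position):
    """Decode the runs one by one until the 0-based position is reached or passed."""
    index = -1
    for length in runs[:-1]:          # the trailing run has no 1 after it
        index += length + 1           # index of the 1 that ends this run
        if index == position:
            return True
        if index > position:
            return False
    return False

runs = run_lengths("0000 1001 0000 0010 1000")
```

The loop in `bit_is_set` makes the disadvantage visible: every interval in front of the wanted bit has to be decoded before the answer is known.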


    Bit-block Compression (BC)

    Data Structure:

    (1) The sparse vector is divided into groups of consecutive bits (bit-blocks).

    (2) Each bit-block is encoded individually.

    Algorithm:

    Part I. It is one bit long, and it indicates whether there are any 1s in the bit-block (1) or the bit-block is empty (0). In the latter case, the bit-block signature stops here.

        0000 1001 0000 0010 1000
        0    1    0    1    1

    Part II. It indicates the number s of 1s in the bit-block. It consists of s-1 1s and a terminating zero.

        10 0 0

    Part III. It contains the offsets of the 1s from the beginning of the bit-block.

        00 11 10 00

    With 4-bit bit-blocks, the offsets 0, 1, 2, 3 are encoded as 00, 01, 10, 11.

    block signature: 01011 | 10 0 0 | 00 11 10 00
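The three parts can be produced with a short routine. This is a sketch under the assumption that all Part I bits are stored first, then all Part II codes, then all Part III offsets, as in the example; `bc_encode` is a name invented for the illustration.

```python
def bc_encode(bits, block_size=4):
    """Return (part1, part2, part3) as lists of bit strings."""
    bits = bits.replace(" ", "")
    # Number of bits needed to write an offset inside one bit-block.
    offset_width = max(1, (block_size - 1).bit_length())
    part1, part2, part3 = [], [], []
    for start in range(0, len(bits), block_size):
        block = bits[start:start + block_size]
        offsets = [i for i, b in enumerate(block) if b == "1"]
        if not offsets:
            part1.append("0")         # empty bit-block: nothing else is stored
            continue
        part1.append("1")
        part2.append("1" * (len(offsets) - 1) + "0")   # s-1 ones, then a zero
        part3.extend(format(o, f"0{offset_width}b") for o in offsets)
    return part1, part2, part3

p1, p2, p3 = bc_encode("0000 1001 0000 0010 1000")
```

Applied to the block signature of the example, this yields the same three groups as shown above.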


    Bit-block Compression (BC) (Continued)

    Search data

    (1) data ==> 0000 0000 0000 0010 0000

    (2) Check the 4th bit-block of the signature 01011 | 10 0 0 | 00 11 10 00

    (3) OK, there is at least one 1 in the 4th bit-block.

    (4) Check further. The 0 in Part II tells us there is only one 1 in the 4th bit-block. Is it the 3rd bit?

    (5) Yes, the 10 in Part III confirms the result.
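The search steps can be traced in code. The sketch below takes the three parts of the example signature as plain strings and lists (an assumed in-memory layout, not a disk format) and checks a single bit position, counted from 0.

```python
def bc_bit_is_set(part1, part2, part3, position, block_size=4):
    """Check one bit of a bit-block-compressed signature (0-based position)."""
    target_block, target_offset = divmod(position, block_size)
    if part1[target_block] == "0":
        return False                           # Part I: the bit-block is empty
    # Number of non-empty bit-blocks in front of the target one.
    rank = part1[:target_block].count("1")
    # Part II: a code of s symbols (s-1 ones and a zero) means s 1s in the block.
    counts = [len(code) for code in part2]
    first = sum(counts[:rank])                 # where this block's offsets start
    offsets = [int(code, 2) for code in part3[first:first + counts[rank]]]
    return target_offset in offsets            # Part III: compare the offsets

part1 = "01011"
part2 = ["10", "0", "0"]
part3 = ["00", "11", "10", "00"]
data_found = bc_bit_is_set(part1, part2, part3, 14)   # the bit set by "data"
```

Position 14 falls in the 4th bit-block at offset 2, i.e. its 3rd bit, which matches the walk-through above.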

    Discussion:

    (1) Bit-block compression requires less space than the Sequential Signature File for the same false drop probability.

    (2) The response time of bit-block compression is slightly lower than that of the Sequential Signature File.


    Vertical Partitioning

    Idea

    avoid bringing useless portions of the document signature into main memory

    Methods

    store the signature file in a bit-sliced form or in a frame-sliced form

    store the signature matrix column-wise to improve the response time at the expense of insertion time


    Bit-Sliced Signature Files (BSSF)

    [Figure: transposed bit matrix. The signature matrix, in which documents are represented by document signatures, is transposed.]


    The signature matrix is stored as F bit-files, one per bit position.

    Search:

    (1) Retrieve m bit-files.

        e.g., the word signature of free is 001 000 110 010

        A document that contains free has its 3rd, 7th, 8th, and 11th bits set,

        i.e., only the 3rd, 7th, 8th, and 11th bit-files are examined.

    (2) AND these vectors. The 1s in the resulting N-bit vector denote the qualifying logical blocks (documents).

    (3) Retrieve the text file through the pointer file.

    Insertion: requires F disk accesses for a new logical block (document), one for each bit-file, but no rewriting.
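A minimal in-memory sketch of bit-sliced search, assuming each bit-file is held as a Python integer whose bit d belongs to document d. The third document signature is made up for the illustration; the first two reuse the signatures of the earlier 12-bit example.

```python
def build_bit_slices(signatures, F):
    """Transpose N document signatures (bit strings of length F) into F bit-files."""
    slices = [0] * F
    for doc, sig in enumerate(signatures):
        for pos, bit in enumerate(sig.replace(" ", "")):
            if bit == "1":
                slices[pos] |= 1 << doc
    return slices

def search(slices, word_sig, n_docs):
    """AND the bit-files selected by the 1s of the word signature."""
    result = (1 << n_docs) - 1                 # start with every document
    for pos, bit in enumerate(word_sig.replace(" ", "")):
        if bit == "1":
            result &= slices[pos]              # only m bit-files are touched
    return [d for d in range(n_docs) if (result >> d) & 1]

docs = [
    "001 010 111 011",   # block containing free and text
    "000 010 101 001",   # block containing text only
    "110 000 000 100",   # hypothetical unrelated block
]
slices = build_bit_slices(docs, 12)
free_hits = search(slices, "001 000 110 010", len(docs))
text_hits = search(slices, "000 010 101 001", len(docs))
```

Only the slices at the 1 positions of the query are read, which is the point of the column-wise layout; insertion, by contrast, touches every slice.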


    Frame-Sliced Signature File (FSSF)

    Ideas

    random disk accesses are more expensive than sequential ones

    force each word to hash into bit positions that are closer to each other in the document signature

    these bit files are stored together and can be retrieved with a few random accesses

    Procedures

    The document signature (F bits long) is divided into k frames of s consecutive bits each.

    For each word in the document, one of the k frames will be chosen by a hash function.

    Using another hash function, the word sets m bits in that frame.


    Frame-Sliced Signature File (Cont.)

    [Figure: signature matrix divided into frames; axis labels: documents, frames]

    Each frame will be kept in consecutive disk blocks.


    FSSF (Continued)

    Example (n=2, B=12, s=6, k=2, m=3)

    word              signature
    free              000000 110010
    text              010110 000000
    doc. signature    010110 110010

    Search

    Only one frame has to be retrieved for a single-word query, i.e., only one random disk access is required.

    e.g., search for documents that contain the word free

    -> because the word signature of free is placed in the 2nd frame, only the 2nd frame has to be examined.

    For a multi-word query, at most one frame per query word has to be scanned.

    Insertion

    Only k frames have to be accessed instead of F bit-slices.
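The two-hash procedure can be sketched as follows. The hash helper `_h` (a salted SHA-1 digest) and all function names are assumptions made for the illustration; the slides do not say which hash functions are used, so the bit patterns produced here will differ from those in the example table.

```python
import hashlib

def _h(data, salt):
    """Hypothetical hash helper: a salted SHA-1 digest as an integer."""
    return int.from_bytes(hashlib.sha1((salt + data).encode("utf-8")).digest(), "big")

def frame_of(word, k):
    """First hash function: choose one of the k frames."""
    return _h(word, "frame:") % k

def word_signature(word, k, s, m):
    """Return (frame index, s-bit pattern with m bits set inside that frame)."""
    if m > s:
        raise ValueError("m cannot exceed the frame size s")
    positions, i = set(), 0
    while len(positions) < m:             # second hash: m distinct bits in the frame
        positions.add(_h(word, f"bit{i}:") % s)
        i += 1
    pattern = 0
    for p in positions:
        pattern |= 1 << p
    return frame_of(word, k), pattern

def document_signature(words, k, s, m):
    """One s-bit integer per frame; word patterns are ORed into their frame."""
    frames = [0] * k
    for w in words:
        f, pattern = word_signature(w, k, s, m)
        frames[f] |= pattern
    return frames

def may_contain(frames, word, s, m):
    """Single-word query: only the word's own frame is examined."""
    f, pattern = word_signature(word, len(frames), s, m)
    return (frames[f] & pattern) == pattern

frames = document_signature(["free", "text"], k=2, s=6, m=3)
```

Because a word's bits all live in one frame, `may_contain` reads a single entry of `frames`, mirroring the single random disk access described above.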


    Vertical Partitioning with Compression

    idea

    create a very sparse signature matrix

    store it in a bit-sliced form

    compress each bit slice by storing the positions of the 1s in the slice.


    Compressed Bit Slices (CBS)

    Room for improvement

    Searching

    Each search word requires the retrieval of m bit files.

    The search time could be improved if m were forced to be 1.

    Insertion

    Requires too many disk accesses (equal to F, which is typically 600-1000).


    Compressed Bit Slices (CBS) (Continued)

    Let m=1. To maintain the same false drop probability, F has to be increased.

    To compress each bit file, we store only the positions of the 1s.

    Since the number of 1s is unpredictable, we store them in buckets of size Bp.

    [Figure: sparse bit matrix; axis labels: documents, size of a signature]


    Example: h(base)=30

    Hash a word to obtain the bucket address.

    Obtain the pointers to the relevant documents from the buckets.

    Differences from inversion

    The directory (hash table) is sparse

    The actual word is stored nowhere

    Simple structure
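A rough in-memory sketch of the CBS lookup, under several assumptions: the hash function is a truncated SHA-1 digest, a Python dictionary stands in for the sparse directory, and a plain list stands in for the chain of posting buckets of size Bp. The class and method names are invented for the illustration.

```python
import hashlib
from collections import defaultdict

def h(word, S):
    """Hypothetical hash: map a word to one of S directory slots (bit slices)."""
    digest = hashlib.sha1(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % S

class CompressedBitSlices:
    def __init__(self, S):
        self.S = S
        # Sparse directory: only slots that received a word have an entry.
        # Each list holds the positions of the 1s, i.e. the document ids.
        self.postings = defaultdict(list)

    def insert(self, doc_id, words):
        for w in set(words):
            slot = h(w, self.S)               # m = 1: one slot per word
            if doc_id not in self.postings[slot]:
                self.postings[slot].append(doc_id)

    def search(self, word):
        """Candidate documents; collisions between words can cause false drops."""
        return list(self.postings.get(h(word, self.S), []))

index = CompressedBitSlices(S=1024)
index.insert(1, ["data", "base"])
index.insert(2, ["base", "system"])
base_hits = index.search("base")
```

Note that the words themselves are never stored, only their hash slots, which is the main difference from an inverted file pointed out above.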


    Doubly Compressed Bit Slices

    Example: h1(base)=30, h2(base)=011

    Idea: compress the sparse directory.

    Follow the pointers of the posting buckets to retrieve the qualifying documents.

    Distinguish synonyms partially.

    [Figure labels: hash function, S, buckets]
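One way to picture the second hash is sketched below. It assumes that h1 selects a directory slot, that each slot keeps small (h2 code, posting list) entries, and that both hashes are slices of one SHA-1 digest; none of these details is given on the slide, and the names are hypothetical.

```python
import hashlib

def _digest(word):
    return int.from_bytes(hashlib.sha1(word.encode("utf-8")).digest(), "big")

def h1(word, S):
    """Hypothetical first hash: directory slot in [0, S)."""
    return _digest(word) % S

def h2(word, code_bits):
    """Hypothetical second hash: a short code taken from other bits of the digest."""
    return (_digest(word) >> 64) % (1 << code_bits)

class DoublyCompressedBitSlices:
    def __init__(self, S, code_bits=3):
        self.S, self.code_bits = S, code_bits
        self.directory = {}        # slot -> {h2 code -> list of document ids}

    def insert(self, doc_id, words):
        for w in set(words):
            bucket = self.directory.setdefault(h1(w, self.S), {})
            postings = bucket.setdefault(h2(w, self.code_bits), [])
            if doc_id not in postings:
                postings.append(doc_id)

    def search(self, word):
        """Follow only the entry whose h2 code matches the query word."""
        bucket = self.directory.get(h1(word, self.S), {})
        return list(bucket.get(h2(word, self.code_bits), []))

dcbs = DoublyCompressedBitSlices(S=1, code_bits=3)   # S=1 forces h1 collisions
dcbs.insert(1, ["base"])
dcbs.insert(2, ["data"])
```

Words that collide under h1 are told apart only when their h2 codes differ, which is why synonyms are distinguished partially rather than completely.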


    No False Drops Method

    Use a pointer to the word in the text file to distinguish between synonyms completely.


    Horizontal Partitioning

    1. Goal: group the signatures into sets, partitioning the signature matrix horizontally.

    2. Grouping criterion

    [Figure: signature matrix partitioned horizontally; axis label: documents]


    Partitioned Signature Files

    Use a portion of a document signature as a signature key to partition the signature file.

    All signatures with the same key will be grouped into a so-called module.

    When a query signature arrives,

    examine its signature key and look for the corresponding modules

    scan all the signatures within those modules that have been selected
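The module selection can be sketched as below, assuming the signature key is the leading `key_bits` bits of the signature (the slide only says "a portion"). A module can contain a match only if its key has a 1 wherever the query key does, so all other modules are skipped. Names and the third signature are hypothetical.

```python
from collections import defaultdict

def signature_key(sig, F, key_bits):
    """Assumed key: the leading key_bits bits of an F-bit signature."""
    return sig >> (F - key_bits)

class PartitionedSignatureFile:
    def __init__(self, F, key_bits):
        self.F, self.key_bits = F, key_bits
        self.modules = defaultdict(list)       # key -> [(doc_id, signature)]

    def insert(self, doc_id, sig):
        key = signature_key(sig, self.F, self.key_bits)
        self.modules[key].append((doc_id, sig))

    def search(self, query_sig):
        qkey = signature_key(query_sig, self.F, self.key_bits)
        hits = []
        for key, module in self.modules.items():
            if (key & qkey) != qkey:           # module cannot hold a match: skip
                continue
            hits.extend(d for d, s in module if (s & query_sig) == query_sig)
        return hits

psf = PartitionedSignatureFile(F=12, key_bits=3)
psf.insert(0, 0b001010111011)    # block containing free and text
psf.insert(1, 0b000010101001)    # block containing text only
psf.insert(2, 0b110000000100)    # hypothetical unrelated block
free_hits = psf.search(0b001000110010)
text_hits = psf.search(0b000010101001)
```

The query for free has key 001 and scans a single module, while the query for text has key 000 and must scan all of them, which shows how the benefit depends on the 1s that fall inside the key.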