8/3/2019 Signature File
Signature Files
Information Retrieval: Data Structures and Algorithms
by W.B. Frakes and R. Baeza-Yates (Eds.)
Englewood Cliffs, NJ: Prentice Hall, 1992.
(Chapter 4)
Signature Files
Characteristics
Word-oriented index structures based on hashing
Low overhead (10%~20% over the text size) at the cost of forcing a
sequential search over the index
Suitable for not very large texts
Inverted files outperform signature files for most applications
Structure
Use superimposed coding to create signatures.
Each text is divided into logical blocks.
A block contains n distinct non-common words. Each word yields a word signature.
A word signature is a B-bit pattern with m bits set to 1.
Each word is divided into successive, overlapping triplets, e.g. free --> fr, fre, ree, ee (the first and last triplets include the blank at the word boundary).
Each such triplet is hashed to a bit position.
The word signatures are ORed to form the block signature.
Block signatures are concatenated to form the document signature.
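The construction above can be sketched in a few lines of Python. This is a minimal sketch, not the method from the book: the use of MD5 as the triplet hash and the function names `triplets`, `word_signature`, and `block_signature` are assumptions made for illustration. Because two triplets may hash to the same position, a word sets at most as many bits as it has triplets.

```python
import hashlib


def triplets(word):
    """Split a word into overlapping triplets, padding with one blank on each side."""
    padded = f" {word} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def word_signature(word, B):
    """Return a B-bit word signature as an int (assumed hash: MD5 of the triplet)."""
    sig = 0
    for t in triplets(word):
        digest = hashlib.md5(t.encode("utf-8")).digest()
        pos = int.from_bytes(digest[:4], "big") % B  # bit position for this triplet
        sig |= 1 << pos
    return sig


def block_signature(words, B):
    """OR the word signatures of a block together (superimposed coding)."""
    sig = 0
    for w in words:
        sig |= word_signature(w, B)
    return sig
```

For example, `block_signature(["free", "text"], 12)` returns the OR of the two 12-bit word signatures.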
Example
Example (n=2, B=12, m=4)
word              signature
free              001 000 110 010
text              000 010 101 001
block signature   001 010 111 011
Search
Use the hash function to determine the m 1-bit positions of the search word.
Examine each block signature for 1s in the bit positions where the signature of the search word has a 1.
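The search test is a bitwise containment check. The sketch below replays the example signatures for free and text; the helper names `parse` and `may_contain` and the signature of the third, absent word are hypothetical and chosen only to show a false drop.

```python
def parse(bits):
    """Turn a spaced bit string such as '001 000 110 010' into an int."""
    return int(bits.replace(" ", ""), 2)


def may_contain(block_sig, word_sig):
    """True if the block signature has a 1 wherever the word signature has a 1."""
    return (block_sig & word_sig) == word_sig


free = parse("001 000 110 010")
text = parse("000 010 101 001")
block = free | text  # superimposed block signature

# Hypothetical signature of a word that is NOT in the block but whose
# 1s are all covered by the block signature: a false drop.
absent_but_covered = parse("001 010 001 001")

# Hypothetical signature of a word that is correctly rejected.
absent_and_rejected = parse("100 000 000 001")
```

Here `may_contain(block, free)` and `may_contain(block, absent_but_covered)` both hold, while `may_contain(block, absent_and_rejected)` does not.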
False Drop
False alarm (false hit, or false drop) probability F_d: the probability that a block signature seems to qualify, given that the block does not actually qualify.
F_d = Prob{signature qualifies | block does not}
For a given value of B, the value of m that minimizes the false drop probability is such that each row of the matrix contains 1s with probability 0.5.
F_d = 2^{-m}
m = B \ln 2 / n
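The two formulas can be evaluated directly. The sketch below is only a calculator for them; the function names `optimal_m` and `false_drop` are assumptions, and rounding m to an integer is a choice made here, not something the slides prescribe.

```python
import math


def optimal_m(B, n):
    """m = B ln 2 / n: bits per word that minimize the false drop probability."""
    return B * math.log(2) / n


def false_drop(m):
    """F_d = 2^(-m) at the optimal design point."""
    return 2.0 ** (-m)


m = round(optimal_m(12, 2))  # B = 12 bits, n = 2 words per block
fd = false_drop(m)
```

With B=12 and n=2 the optimum is about 4.16, which rounds to m=4, the value used in the earlier free/text example.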
Sequential Signature File (SSF)
Assume documents span exactly one logical block, so the size of the document signature F = the size of the block signature B.
[Figure: signature matrix with rows labelled "documents"]
Classification of Signature-Based Methods
Compression
If the signature matrix is deliberately sparse, it can be compressed.
Vertical partitioning
Storing the signature matrix column-wise improves the response time at the expense of insertion time.
Horizontal partitioning
Grouping similar signatures together and/or providing an index on the signature matrix may result in better-than-linear search.
Classification of Signature-Based Methods
Sequential storage of the signature matrix
  without compression
    sequential signature files (SSF)
  with compression
    bit-block compression (BC)
    variable bit-block compression (VBC)
Vertical partitioning
  without compression
    bit-sliced signature files (BSSF, B'SSF)
    frame-sliced (FSSF)
    generalized frame-sliced (GFSSF)
Classification of Signature-Based Methods (Continued)
  with compression
    compressed bit slices (CBS)
    doubly compressed bit slices (DCBS)
    no-false-drop method (NFD)
Horizontal partitioning
  data independent partitioning
    Gustafson's method
    partitioned signature files
  data dependent partitioning
    2-level signature files
    S-trees
Criteria
the storage overhead
the response time on single word queries
the performance on insertion, as well as whether the insertion maintains the append-only property
Compression
idea
Create sparse document signatures on purpose.
Compress them before storing them sequentially.
Method
Use a B-bit vector, where B is large.
Hash each word into one (or k) bit position(s).
Use run-length encoding (McIlroy 1982).
Compression using run-length encoding
data             0000 0000 0000 0010 0000
base             0000 0001 0000 0000 0000
management       0000 1000 0000 0000 0000
system           0000 0000 0000 0000 1000
block signature  0000 1001 0000 0010 1000
L1 L2 L3 L4 L5
[L1] [L2] [L3] [L4] [L5]
where [x] is the encoded value of x.
search: Decode the encoded lengths of all the preceding intervals
example: search data
(1) data ==> 0000 0000 0000 0010 0000
(2) decode [L1]=0000, decode [L2]=00, decode [L3]=000000
disadvantage: search becomes slow
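The run-length idea and its search cost can be sketched as follows. The slides do not specify how each length [x] is encoded, so this sketch simply keeps the lengths as Python integers; the function names `run_lengths` and `bit_is_set` are assumptions.

```python
def run_lengths(bits):
    """Lengths of the zero runs delimited by the 1s; the last entry is the trailing run."""
    runs, count = [], 0
    for b in bits:
        if b == "1":
            runs.append(count)
            count = 0
        else:
            count += 1
    runs.append(count)
    return runs


def bit_is_set(runs, pos):
    """Test bit `pos` by decoding all preceding intervals, as in the search example."""
    cursor = -1
    for length in runs[:-1]:   # every run except the trailing one ends in a 1
        cursor += length + 1   # position of the 1 that ends this run
        if cursor == pos:
            return True
        if cursor > pos:
            return False
    return False


block = "0000 1001 0000 0010 1000".replace(" ", "")
runs = run_lengths(block)
```

For the block signature above the intervals L1 to L5 have lengths 4, 2, 6, 1, and 3, and testing the bit for data (position 14, counting from 0) has to walk through L1, L2, and L3 first, which is why search is slow.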
Bit-block Compression (BC)
Data Structure:
(1) The sparse vector is divided into groups of consecutive bits (bit-blocks).
(2) Each bit-block is encoded individually.
Algorithm:
Part I. It is one bit long, and it indicates whether there are any 1s in the bit-block (1) or the bit-block is empty (0). In the latter case, the bit-block signature stops here.
    0000 1001 0000 0010 1000
      0    1    0    1    1
Part II. It indicates the number s of 1s in the bit-block. It consists of s-1 1s and a terminating zero.
    10 0 0
Part III. It contains the offsets of the 1s from the beginning of the bit-block.
    00 11 10 00
    With 4-bit bit-blocks, the offsets 0, 1, 2, 3 are encoded as 00, 01, 10, 11.
block signature: 01011 | 10 0 0 | 00 11 10 00
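The three parts can be produced with a short encoder. This is a sketch under the assumptions of the example (fixed bit-blocks of 4 bits, offsets written in fixed-width binary); the function name `bc_encode` and the choice to return the parts as separate values are illustrative.

```python
def bc_encode(bits, block_size=4):
    """Encode a sparse bit string into the three parts of bit-block compression."""
    offset_width = max(1, (block_size - 1).bit_length())
    part1, part2, part3 = [], [], []
    for start in range(0, len(bits), block_size):
        block = bits[start:start + block_size]
        ones = [i for i, b in enumerate(block) if b == "1"]
        if not ones:
            part1.append("0")          # empty bit-block: nothing else is stored
            continue
        part1.append("1")
        part2.append("1" * (len(ones) - 1) + "0")  # s-1 ones and a terminating zero
        part3.extend(format(i, f"0{offset_width}b") for i in ones)
    return "".join(part1), part2, part3


p1, p2, p3 = bc_encode("0000 1001 0000 0010 1000".replace(" ", ""))
```

On the example block signature this yields `01011`, then `10 0 0`, then `00 11 10 00`.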
Bit-block Compression (BC) (Continued)
Search data
(1) data ==> 0000 0000 0000 0010 0000
(2) Check the 4th bit-block of the signature 01011 | 10 0 0 | 00 11 10 00
(3) OK, there is at least one setting in the 4th bit-block.
(4) Check further. 0 tells us there is only one setting in the 4th bit-block. Is it the 3rd bit?
(5) Yes, 10 confirms the result.
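The search steps can be sketched as a lookup over the three parts. The representation (Part I as a string, Parts II and III as lists of codes) and the function name `bc_bit_is_set` are assumptions for illustration; bit positions are counted from 0.

```python
def bc_bit_is_set(part1, part2, part3, pos, block_size=4):
    """Check whether bit `pos` of the original vector is 1, using the BC parts."""
    target_block, target_offset = divmod(pos, block_size)
    if part1[target_block] == "0":
        return False                       # Part I says the bit-block is empty
    k = part1[:target_block].count("1")    # rank among the non-empty bit-blocks
    # Each Part II code has as many symbols as the block has 1s (s-1 ones + a zero).
    skip = sum(len(code) for code in part2[:k])
    count = len(part2[k])
    offsets = [int(code, 2) for code in part3[skip:skip + count]]
    return target_offset in offsets


part1 = "01011"
part2 = ["10", "0", "0"]
part3 = ["00", "11", "10", "00"]
```

The word data sets bit 14 (4th bit-block, 3rd bit), so `bc_bit_is_set(part1, part2, part3, 14)` follows exactly the steps listed above.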
Discussion:
(1) Bit-block compression requires less space than Sequential
Signature File for the same false drop probability.
(2) The response time of bit-block compression is slightly less than that of the Sequential Signature File.
Vertical Partitioning
idea
avoid bringing useless portions of the document signature into main memory
methods
store the signature file in a bit-sliced form or in a frame-sliced form
store the signature matrix column-wise to improve the response time at the expense of insertion time
Bit-Sliced Signature Files (BSSF)
Transposed bit matrix
[Figure: the signature matrix, one row per document (document signature), and its transpose]
[Figure: the signature matrix stored as F bit-files, one per bit position, each holding one bit per document]
search: (1) Retrieve m bit-files.
            e.g., the word signature of free is 001 000 110 010
            In a document that contains free, the 3rd, 7th, 8th, and 11th bits are set,
            i.e., only the 3rd, 7th, 8th, and 11th bit-files are examined.
        (2) AND these vectors. The 1s in the resulting N-bit vector denote the qualifying logical blocks (documents).
        (3) Retrieve the text file through the pointer file.
insertion: requires F disk accesses for a new logical block (document), one for each bit-file, but no rewriting
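A bit-sliced search can be sketched in memory as follows. Each bit-file is modelled as a Python int with one bit per document; real bit-files live on disk, and the function names `build_bit_slices` and `search` as well as the three sample document signatures are assumptions for illustration.

```python
def build_bit_slices(doc_sigs, F):
    """Transpose N document signatures (bit strings of length F) into F bit slices."""
    slices = [0] * F
    for doc_id, sig in enumerate(doc_sigs):
        for pos, bit in enumerate(sig):
            if bit == "1":
                slices[pos] |= 1 << doc_id  # document doc_id has a 1 in slice pos
    return slices


def search(slices, word_sig, n_docs):
    """AND the slices selected by the word signature; return qualifying document ids."""
    result = (1 << n_docs) - 1              # start with every document qualifying
    for pos, bit in enumerate(word_sig):
        if bit == "1":
            result &= slices[pos]           # only the selected slices are read
    return [d for d in range(n_docs) if (result >> d) & 1]


docs = [
    "001010111011",  # contains free and text
    "000010101001",  # contains text only
    "001000110010",  # contains free only
]
slices = build_bit_slices(docs, 12)
free_hits = search(slices, "001000110010", len(docs))
text_hits = search(slices, "000010101001", len(docs))
```

Only the slices at the 1-positions of the query signature are touched, which is the point of the vertical layout.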
Frame-Sliced Signature File (FSSF)
Ideas
random disk accesses are more expensive than sequential ones
force each word to hash into bit positions that are close to each other in the document signature
these bit files are stored together and can be retrieved with a few random accesses
Procedures
The document signature (F bits long) is divided into k frames of s consecutive bits each.
For each word in the document, one of the k frames will be chosen by a hash function.
Using another hash function, the word sets m bits in that frame.
Frame-Sliced Signature File (Cont.)
Each frame will be kept in consecutive disk blocks.
[Figure: signature matrix with rows labelled "documents", divided column-wise into frames]
FSSF (Continued)
Example (n=2, B=12, s=6, f=2, m=3)
Word            Signature
free            000000 110010
text            010110 000000
doc. signature  010110 110010
Search
Only one frame has to be retrieved for a single-word query, i.e., only one random disk access is required.
e.g., search for documents that contain the word free
-> because the word signature of free is placed in the 2nd frame, only the 2nd frame has to be examined.
At most k frames have to be scanned for a k-word query.
Insertion
Only f frames have to be accessed instead of F bit-slices.
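The frame-sliced procedure can be sketched with two hash functions, one choosing the frame and one choosing bits inside it. The salted SHA-256 helper `_h` and the function names are assumptions for illustration, so the resulting bit patterns will not match the hand-picked example above; a word sets at most m bits because two of its hashes may collide.

```python
import hashlib


def _h(word, salt, modulus):
    """Hypothetical salted hash: maps (salt, word) to a value in [0, modulus)."""
    digest = hashlib.sha256(f"{salt}:{word}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % modulus


def fssf_word_signature(word, k, s, m):
    """Return (frame index, signature string of k*s bits) for one word."""
    frame = _h(word, "frame", k)                 # first hash: choose the frame
    bits = ["0"] * (k * s)
    for i in range(m):                           # second hash: set bits inside it
        bits[frame * s + _h(word, f"bit{i}", s)] = "1"
    return frame, "".join(bits)


def fssf_doc_signature(words, k, s, m):
    """OR the word signatures of a document together."""
    sig = 0
    for w in words:
        sig |= int(fssf_word_signature(w, k, s, m)[1], 2)
    return format(sig, f"0{k * s}b")


frame, free_sig = fssf_word_signature("free", k=2, s=6, m=3)
doc_sig = fssf_doc_signature(["free", "text"], k=2, s=6, m=3)
```

Because all bits of a word fall in one frame, a single-word query only needs the frame returned as `frame`.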
Vertical Partitioning with Compression
idea
create a very sparse signature matrix
store it in a bit-sliced form
compress each bit slice by storing the position of the 1s in
the slice.
Compressed Bit Slices (CBS)
Room for improvement
Searching
Each search word requires the retrieval of m bit files.
The search time could be improved if m were forced to be 1.
Insertion
Requires too many disk accesses (equal to F, which is typically 600-1000).
Compressed Bit Slices (CBS) (Continued)
Let m=1. To maintain the same false drop probability, F has to be increased.
To compress each bit file, we store only the positions of the 1s.
Since the number of 1s in each bit file is unpredictable, we store them in buckets of size Bp.
[Figure: sparse bit matrix; axes: documents, size of a signature]
Example: h(base)=30
Hash a word to obtain the bucket address.
Obtain the pointers to the relevant documents from the buckets.
Differences with inversion
The directory (hash table) is sparse
The actual word is stored nowhere
Simple structure
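With m=1, each bit slice reduces to a list of document pointers reached through a hash of the word. The sketch below models that in memory; the class name `CompressedBitSlices`, the SHA-256 hash, and the use of a plain Python list in place of chained buckets of size Bp are all assumptions for illustration.

```python
import hashlib
from collections import defaultdict


class CompressedBitSlices:
    """In-memory sketch: one posting list of document ids per bit position."""

    def __init__(self, F):
        self.F = F                          # number of bit positions (directory size)
        self.postings = defaultdict(list)   # bit position -> ids of documents with a 1

    def _hash(self, word):
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % self.F

    def insert(self, doc_id, words):
        # m = 1: every word sets exactly one bit; store the position of that 1.
        for pos in {self._hash(w) for w in words}:
            self.postings[pos].append(doc_id)

    def search(self, word):
        # Words that collide on a position are indistinguishable (false drops).
        return list(self.postings.get(self._hash(word), []))


index = CompressedBitSlices(F=1000)
index.insert(0, ["data", "base"])
index.insert(1, ["base", "system"])
```

The word itself is never stored, so `index.search("base")` returns every document whose words hash to the same position as base, which includes documents 0 and 1.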
Doubly Compressed Bit Slices
Example: h1(base)=30, h2(base)=011
Follow the pointers of the posting buckets to retrieve the qualifying documents.
Distinguish synonyms partially.
Idea: compress the sparse directory
[Figure: a hash function maps each word into one of S buckets]
No False Drops Method
Use a pointer to the word in the text file to distinguish between synonyms completely.
Horizontal Partitioning
1. Goal: group the signatures into sets, partitioning the signature matrix horizontally.
2. Grouping criterion
[Figure: signature matrix with rows labelled "documents"]
Partitioned Signature Files
Use a portion of a document signature as a signature key to partition the signature file.
All signatures with the same key are grouped into a so-called module.
When a query signature arrives,
examine its signature key and look for the corresponding modules
scan all the signatures within those modules that have been selected
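The partitioning and query steps can be sketched as follows. The slides do not say which portion of the signature forms the key or how modules are matched, so this sketch assumes the key is the first `key_bits` bits and that a module is selected when its key has a 1 wherever the query key has a 1; the function names and sample signatures are illustrative.

```python
from collections import defaultdict


def build_modules(doc_sigs, key_bits):
    """Group document signatures (bit strings) into modules by their signature key."""
    modules = defaultdict(list)
    for doc_id, sig in enumerate(doc_sigs):
        modules[sig[:key_bits]].append((doc_id, sig))  # assumed key: leading bits
    return modules


def covers(sig, query):
    """True if sig has a 1 in every position where query has a 1."""
    return all(s == "1" for s, q in zip(sig, query) if q == "1")


def search(modules, query_sig, key_bits):
    """Select the modules whose key covers the query key, then scan only those."""
    query_key = query_sig[:key_bits]
    hits = []
    for key, members in modules.items():
        if covers(key, query_key):
            hits.extend(d for d, sig in members if covers(sig, query_sig))
    return sorted(hits)


docs = ["001010111011", "000010101001", "001000110010"]
modules = build_modules(docs, key_bits=3)
free_hits = search(modules, "001000110010", key_bits=3)
text_hits = search(modules, "000010101001", key_bits=3)
```

A query whose key has many 1s rules out most modules, while a query whose key is all zeros (as for text here) still has to scan every module.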