www.monash.edu.au

CSE3201/CSE4500 Information Retrieval Systems

Signature Based Text Retrieval Systems

Signature File for Text Retrieval

• A “signature” is created as an abstraction of a document.

• All the signatures that represent the documents in the collection are kept in a file called “signature file”.

Word Signature (WS)

• A word signature
– is a fixed-length bit string that represents a word.
– is described by
> its length (N)
> the number of bits set to 1 (k)

1 1 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0

N=24

k=7

Word Signature Generation

• Use a hash function to find the position of the bit(s) that will be set to 1.

• Use triplets of characters to generate the word signature:

– Divide the word into overlapping triplets.

– For each triplet of characters:
> Convert the characters to a numeric value (for example, the ASCII code of each character).
> Use the number as the input to the hash function.
> The hash function produces a number that gives the bit position of the triplet in the word signature.
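The triplet step can be sketched as follows. This is a minimal illustration, assuming the word is padded with one "-" on each side, as the "signature" example later in these slides suggests; the function name `overlapping_triplets` is made up for this sketch.

```python
def overlapping_triplets(word, pad="-"):
    """Return the overlapping character triplets of `word`.

    Assumption: the word is padded with one `pad` character on each side,
    so that the first and last characters also start/end a triplet.
    """
    padded = pad + word + pad
    # Slide a window of width 3 over the padded word, one character at a time.
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


print(overlapping_triplets("signature"))
```

A word of length L gives L triplets under this padding assumption.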

Signature Generator Algorithm

Set hash_value to 0

for each character in the triplet do

    hash_value := (hash_value * 137 + ASCII value of character) mod 256

K values
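A runnable sketch of the algorithm above, for illustration only. `triplet_hash` follows the pseudocode directly. `word_signature` adds two assumptions that are not stated in the slides: the hash value (0 to 255) is mapped to a bit position with `% n_bits`, and the word is padded with "-" before it is split into triplets.

```python
def triplet_hash(triplet):
    """Hash one triplet exactly as in the pseudocode: result is in 0..255."""
    hash_value = 0
    for ch in triplet:
        hash_value = (hash_value * 137 + ord(ch)) % 256
    return hash_value


def word_signature(word, n_bits=24, pad="-"):
    """Build an n_bits-long word signature as a string of '0'/'1'.

    Assumptions (not from the slides): '-' padding, and the hash value is
    reduced to a bit position with `% n_bits` (positions counted from 0,
    left to right).
    """
    padded = pad + word + pad
    bits = ["0"] * n_bits
    for i in range(len(padded) - 2):
        position = triplet_hash(padded[i:i + 3]) % n_bits
        bits[position] = "1"  # several triplets may hit the same position
    return "".join(bits)


print(triplet_hash("sig"))
print(word_signature("signature"))
```

Because different triplets can hash to the same position, the number of bits actually set can be smaller than the number of triplets.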

Word Signature Generation – simplified example

• Example:

– A signature 111000111001 is generated for the word “signature”.

• Bit positions are read from left to right.

Word: signature

Triplets: -si sig ign gna nat atu tur ure re-

Hash function

Positions of the bits set to 1: 1, 2, 3, 7, 8, 9, 12

Word signature: 1 1 1 0 0 0 1 1 1 0 0 1

Document Signature (DS)

• A document signature can be created using two methods:
– concatenation of word signatures.
– superimposed coding.

Document Signature – Concatenation of WS

• The length of a document signature (DS) can vary.
• A fixed number of bits may precede the document signature (DS) to indicate its length.
• It is possible to fix the length of the document signature (DS).
– The length can be set to the signature length of the longest document in the collection.
– Shorter documents are padded with extra “0” bits.
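The two options above (a length prefix, or a fixed length with zero padding) can be sketched as follows. This is only an illustration: signatures are represented as strings of '0'/'1', and the function names and the `prefix_bits` parameter are assumptions made for this sketch.

```python
def concatenate_signatures(word_signatures, total_length=None):
    """Concatenate word signatures into a document signature.

    If `total_length` is given, pad with '0' bits on the right so that
    every document signature in the collection has the same length.
    """
    ds = "".join(word_signatures)
    if total_length is not None:
        if len(ds) > total_length:
            raise ValueError("document signature longer than the fixed length")
        ds = ds.ljust(total_length, "0")
    return ds


def length_prefixed(ds, prefix_bits=16):
    """Prepend a fixed-width binary length field to a variable-length DS."""
    if len(ds) >= 2 ** prefix_bits:
        raise ValueError("length does not fit in the prefix")
    return format(len(ds), f"0{prefix_bits}b") + ds


print(concatenate_signatures(["0010", "1000"], total_length=12))
print(length_prefixed("00101000", prefix_bits=8))
```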

Document Signature – Superimposed Coding

• Each document is divided into blocks containing a constant number of distinct words.

• To create a block signature, OR together the word signatures of all the words in the block.

free 001 000 110 010

text 000 010 101 001

Block signature 001 010 111 011
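The OR step can be checked with a few lines of code. This sketch stores each signature as an integer so that `|` does the superimposing; the function name `block_signature` is chosen for illustration.

```python
from functools import reduce
from operator import or_


def block_signature(word_signatures):
    """Superimpose (bitwise OR) the word signatures of one block."""
    return reduce(or_, word_signatures, 0)


free = 0b001_000_110_010
text = 0b000_010_101_001

block = block_signature([free, text])
print(format(block, "012b"))  # 001010111011
```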

Document Signature – Superimposed Coding

• To create the document signature, all the block signatures are superimposed.

Query Signature

• The query is converted to a block signature in the same way as a block of a document.

• Example:

free         0 0 1 0 0 0 1 1 0 0 1 0

text         0 0 0 0 1 0 1 0 1 0 0 1

Block/Query  0 0 1 0 1 0 1 1 1 0 1 1

Matching the Query and Document Signature

• Premise:
– The positions of the bits set to 1 represent the existence of particular words in the query or document.

• A relevant document is a document whose signature has bits set to 1 at every position where the query’s signature has a bit set to 1.

• The relevant document’s signature does not have to be an exact match of the query’s signature.

• Example:
– Query: 0100
– Matching document signatures: 1111, 0111, 0110, 0100.
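The matching rule amounts to a bitwise containment test. A minimal sketch, with signatures held as integers and a function name (`matches`) chosen for illustration:

```python
def matches(query_sig, doc_sig):
    """True if every bit set in the query is also set in the document."""
    return (query_sig & doc_sig) == query_sig


query = 0b0100
for doc in (0b1111, 0b0111, 0b0110, 0b0100, 0b1011):
    print(format(doc, "04b"), matches(query, doc))
```

The last signature, 1011, is included as a counterexample: it has a 0 where the query has a 1, so it does not match.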

Query on Signature File

Query: 001 010 111 011

Block signature            Match?
0 0 1 0 0 0 1 1 1 0 1 1    No
0 0 1 1 1 1 1 1 1 0 1 1    Yes
0 0 1 0 1 0 1 0 1 0 1 1    No
0 0 1 0 1 0 1 1 1 0 1 0    No
1 1 1 0 1 0 1 1 1 0 1 1    Yes
0 0 1 1 0 0 1 1 1 0 1 1    No
0 0 1 0 1 0 1 1 1 1 1 1    Yes

Match? Perform an AND operation between the query and the block signature; if (result - query) = 0, they match.
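The example above can be reproduced with a sequential scan over the seven block signatures. This is a sketch only; `parse` and `sequential_scan` are illustrative names, and the signatures are typed in as bit strings.

```python
def parse(bits):
    """Turn a string such as '001 010 111 011' into an integer."""
    return int(bits.replace(" ", ""), 2)


def sequential_scan(query, block_signatures):
    """Compare the query against every block signature, one by one."""
    q = parse(query)
    # AND the query with each signature; a match leaves the query unchanged.
    return [(parse(b) & q) == q for b in block_signatures]


signature_file = [
    "001000111011",
    "001111111011",
    "001010101011",
    "001010111010",
    "111010111011",
    "001100111011",
    "001010111111",
]
print(sequential_scan("001 010 111 011", signature_file))
```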

Signature File Structure

• Sequential
– During searching, each signature is compared to the query signature.
– Time consuming because:
> Memory size is limited, hence all signatures cannot be loaded into memory at once.
> This may result in multiple I/O operations.

• We need a file structure for the signature file that minimises the number of I/O operations.

• Bit-Sliced Signature
– At most, only N (the size of the signature) records need to be retrieved.

Matrix Transposed

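\[
\begin{pmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \end{pmatrix}^{T}
=
\begin{pmatrix} x_{11} & x_{21} \\ x_{12} & x_{22} \\ x_{13} & x_{23} \end{pmatrix}
\]

x_{ij} -> x_{ji}

\[
\begin{pmatrix} a & b & c \\ d & e & f \end{pmatrix}^{T}
=
\begin{pmatrix} a & d \\ b & e \\ c & f \end{pmatrix}
\]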

Bit-Sliced

Sequential (one record of N bits per block):

d1   0 0 1 0 0 0 1 1 1 0 1 1
d2   0 0 1 1 1 1 1 1 1 0 1 1
d3   0 0 1 0 1 0 1 0 1 0 1 1
d4   0 0 1 0 1 0 1 1 1 0 1 0
...
dn

Bit sliced (N records, one per bit position):

Bit   d1 d2 d3 d4 ... dn
 1    0  0  0  0
 2    0  0  0  0
 3    1  1  1  1
 4    0  1  0  0
 5    0  1  1  1
 6    0  1  0  0
 7    1  1  1  1
 8    1  1  0  1
 9    1  1  1  1
10    0  0  0  0
11    1  1  1  1
12    1  1  1  0

Query: 001 010 111 011
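Building the bit-sliced file from the sequential one is a matrix transpose. A minimal sketch, with signatures as strings of '0'/'1' and a function name (`bit_slices`) chosen for illustration:

```python
def bit_slices(signatures):
    """Transpose row-wise signatures into one record per bit position.

    Record i holds bit i of every signature, in signature order.
    """
    return ["".join(column) for column in zip(*signatures)]


sequential = [
    "001000111011",  # d1
    "001111111011",  # d2
    "001010101011",  # d3
    "001010111010",  # d4
]
for position, record in enumerate(bit_slices(sequential), start=1):
    print(position, record)
```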

Bit Sliced Signature File

• Retrieval
– If the ith bit in the query signature is set to 1, retrieve the ith signature block/record.
– If n bits are set to 1 in the query, only n records need to be retrieved.

Bit Sliced Signature File

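Bit-sliced file (rows are bit positions, columns are blocks 1 to 4):

 1    0 0 0 0
 2    0 0 0 0
 3    1 1 1 1
 4    0 1 0 0
 5    0 1 1 1
 6    0 1 0 0
 7    1 1 1 1
 8    1 1 0 1
 9    1 1 1 1
10    0 0 0 0
11    1 1 1 1
12    1 1 1 0

Query: 001 010 111 011

Retrieved records (the bit positions where the query has a 1):

 3    1 1 1 1
 5    0 1 1 1
 7    1 1 1 1
 8    1 1 0 1
 9    1 1 1 1
11    1 1 1 1
12    1 1 1 0

Match: the 2nd block, because all bits in its column are set to 1.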
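The retrieval in this example can be sketched as follows: fetch only the records whose bit position is 1 in the query, then keep the blocks that have a 1 in every fetched record. The function name `bit_sliced_search` and the string representation are assumptions of this sketch.

```python
def bit_sliced_search(slices, query):
    """Return (matching block indices, number of records retrieved).

    `slices[i]` is the record for bit position i (one character per block);
    `query` is the query signature as a string of '0'/'1'.
    Block indices are 0-based.
    """
    n_blocks = len(slices[0]) if slices else 0
    candidates = [True] * n_blocks
    retrieved = 0
    for position, q_bit in enumerate(query):
        if q_bit != "1":
            continue  # this record is never read
        retrieved += 1
        for block, bit in enumerate(slices[position]):
            candidates[block] = candidates[block] and bit == "1"
    return [b for b, ok in enumerate(candidates) if ok], retrieved


slices = [
    "0000", "0000", "1111", "0100", "0111", "0100",
    "1111", "1101", "1111", "0000", "1111", "1110",
]
print(bit_sliced_search(slices, "001010111011"))  # ([1], 7)
```

Only 7 of the 12 records are read, and index 1 corresponds to the 2nd block.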

Bit Sliced Signature File

• Advantages:
– Fewer records are retrieved -> faster retrieval.

• Disadvantages:
– An update operation becomes very costly, since adding or changing one signature touches up to N records.

False Drop

• A false drop occurs when a document’s signature matches a query’s signature but the query’s word does not match any word in the document.

• It is possible because 2 distinct blocks may have the same signature due to:
– the hashing algorithm
– superimposed coding
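A small illustration of a false drop caused by superimposed coding. The signatures for "free" and "text" are the ones used earlier in these slides; the word "other" and its signature are hypothetical, chosen so that its bits are covered partly by "free" and partly by "text".

```python
def signature_match(query_sig, block_sig):
    """True if every bit set in the query is also set in the block."""
    return (query_sig & block_sig) == query_sig


signatures = {
    "free": 0b001000110010,
    "text": 0b000010101001,
    "other": 0b001010001010,  # hypothetical word, not in the block
}

block_words = ["free", "text"]
block_sig = 0
for word in block_words:
    block_sig |= signatures[word]  # superimpose the word signatures

query = "other"
false_drop = signature_match(signatures[query], block_sig) and query not in block_words
print(false_drop)  # True
```

Neither word signature alone covers the query, but their OR does, so the block is retrieved even though it does not contain the word.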

Relation Between the Signature Properties and False Drop

• The rate of false drops depends on:
– The size of the signature (N bits)
> An increase in N decreases the false drop rate.

– The number of bits set to 1 (k bits)
> An increase in k, up to a certain level, decreases the false drop rate.

– The number of unique words per block
> A decrease in the number of unique words per block decreases the false drop rate.
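These three trends can be explored numerically with a commonly used idealised model. The model is not part of these slides: it assumes each word sets k bit positions chosen independently and uniformly at random, and that the query is a single word. The function names are illustrative.

```python
import math


def false_drop_probability(n_bits, k, words_per_block):
    """Approximate chance that a block falsely matches a single-word query.

    Idealised model (an assumption, see above): after superimposing
    `words_per_block` words, a given bit is 1 with probability
    1 - (1 - 1/N)^(k * D); a false drop needs all k query bits to be 1.
    """
    p_bit_set = 1.0 - (1.0 - 1.0 / n_bits) ** (k * words_per_block)
    return p_bit_set ** k


def best_k(n_bits, words_per_block):
    """k that minimises the estimate in this model (about half the bits set)."""
    return n_bits * math.log(2) / words_per_block


for k in (1, 8, 64):
    print(k, false_drop_probability(256, k, 20))
print(best_k(256, 20))
```

In this model the estimate falls as k grows towards `best_k` and rises again beyond it, which is consistent with "up to a certain level" above.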
