www.monash.edu.au CSE3201/CSE4500 Information Retrieval Systems Signature Based Text Retrieval Systems




Signature File for Text Retrieval

• A “signature” is created as an abstraction of a document.

• All the signatures that represent the documents in the collection are kept in a file called a “signature file”.


Word Signature (WS)

• A word signature
– is a fixed-length bit string that represents a word.
– is described by
> the length (N)
> the number of bits set to 1 (k)

• Example (N = 24, k = 7):

1 1 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0


Word Signature Generation

• Use a hash function to find the location of the bit(s) that will be set to 1.

• Use triplets of characters to generate the word signature:

– Divide the word into overlapping triplets.

– For each triplet of characters:
> Convert the characters to a numeric value (this can be the ASCII representation of each character).
> Use the number as the input to the hash function.
> The hash function produces a number that represents the bit position of the triplet in the word signature.
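The triplet split described above can be sketched in Python. This is only an illustration: the function name `overlapping_triplets` is hypothetical, and padding the word with `-` on both ends is an assumption taken from the `-si ... re-` example later in these slides.

```python
def overlapping_triplets(word: str, pad: str = "-") -> list[str]:
    """Split a word into overlapping 3-character windows.

    Assumption: the word is padded with one `pad` character on each side,
    so the first and last letters also appear in a full triplet.
    """
    padded = pad + word + pad
    # Slide a window of width 3 across the padded word, one step at a time.
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


if __name__ == "__main__":
    print(overlapping_triplets("signature"))
```

A word of length L yields L triplets under this padding scheme, so "signature" (9 letters) yields 9 triplets.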


Signature Generator Algorithm

Set hash_value to 0

for each character in the triplet do

    hash_value := (hash_value * 137 + ASCII value of character) mod 256

K values
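A minimal Python sketch of the algorithm above. The recurrence (multiplier 137, modulus 256) comes from the slide. Everything else is an assumption for illustration: the function names, the zero-based bit positions, and folding the hash into an N-bit signature with `% n_bits`, which the slides do not specify.

```python
def triplet_hash(triplet: str) -> int:
    """Hash one triplet with the recurrence from the slide."""
    hash_value = 0
    for ch in triplet:
        # ord(ch) is the character code (the ASCII value for ASCII text).
        hash_value = (hash_value * 137 + ord(ch)) % 256
    return hash_value


def word_signature(triplets: list[str], n_bits: int) -> list[int]:
    """Build an N-bit word signature from a word's triplets.

    Assumption: the hash value (0..255) is mapped to a zero-based bit
    position with a modulo; two triplets may land on the same bit.
    """
    bits = [0] * n_bits
    for triplet in triplets:
        bits[triplet_hash(triplet) % n_bits] = 1
    return bits


if __name__ == "__main__":
    print(triplet_hash("sig"))
    print(word_signature(["-si", "sig", "ign"], 24))
```

Because distinct triplets can hash to the same position, the number of bits actually set may be smaller than the number of triplets.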


Word Signature Generation – simplified example

• Example:

– A signature 111000111001 is generated for the word “signature”.

• The position is read from left to right.

Word:            signature

Triplet:         -si  sig  ign  gna  nat  atu  tur  ure  re-
Hash function:     1    2    7    3    2    3    9   12    8   (position of the bit set to 1)

Word signature:  1 1 1 0 0 0 1 1 1 0 0 1


Document Signature (DS)

• A document signature can be created using two methods:
– concatenation of word signatures.
– superimposed coding.


Document Signature – Concatenation of WS

• The length of a document signature (DS) can vary.

• A fixed number of bits may precede the document signature (DS) to indicate its length.

• It is possible to fix the length of the document signature (DS):
– The length can be set to that of the longest document in the collection.
– Shorter documents are padded with extra “0” bits.


Document Signature – Superimposed Coding

• Each document is divided into blocks containing a constant number of distinct words.

• To create a block signature, perform a bitwise OR operation on the signatures of all the words in the block.

free             001 000 110 010

text             000 010 101 001

Block signature  001 010 111 011
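The OR step can be checked with a short Python sketch. Signatures are written as bit strings without spaces; the function name `superimpose` is hypothetical and the code assumes all signatures have the same length.

```python
from functools import reduce
from operator import or_


def superimpose(signatures: list[str]) -> str:
    """Bitwise-OR equal-length bit-string signatures into one signature."""
    width = len(signatures[0])
    combined = reduce(or_, (int(sig, 2) for sig in signatures))
    # Format back to a zero-padded bit string of the original width.
    return format(combined, f"0{width}b")


if __name__ == "__main__":
    free = "001000110010"
    text = "000010101001"
    print(superimpose([free, text]))  # block signature for "free text"
```

The same function applies one level up: superimposing block signatures instead of word signatures.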


• To create the document signature, all the block signatures are superimposed.


Query Signature

• The query is converted to a block signature in the same way as a document block.

• Example:

free         0 0 1 0 0 0 1 1 0 0 1 0

text         0 0 0 0 1 0 1 0 1 0 0 1

Block/Query  0 0 1 0 1 0 1 1 1 0 1 1


Matching the Query and Document Signature

• Premise:
– The positions of the bits set to 1 represent the existence of particular words in the query or document.

• A relevant document is a document whose signature has bits set to 1 at the same positions as the bits set to 1 in the query’s signature.

• The relevant document’s signature does not have to be an exact match of the query’s signature.

• Example:
– Query: 0100
– Matching document signatures: 1111, 0111, 0110, 0100.


Query on Signature File

Query: 001 010 111 011

Block signature            Match?

0 0 1 0 0 0 1 1 1 0 1 1    No

0 0 1 1 1 1 1 1 1 0 1 1    Yes

0 0 1 0 1 0 1 0 1 0 1 1    No

0 0 1 0 1 0 1 1 1 0 1 0    No

1 1 1 0 1 0 1 1 1 0 1 1    Yes

0 0 1 1 0 0 1 1 1 0 1 1    No

0 0 1 0 1 0 1 1 1 1 1 1    Yes

Match test: perform an AND operation between the query and the block signature. If (result - query) = 0, that is, the result equals the query, they match.
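The match test can be written in a few lines of Python. This is a sketch with a hypothetical function name; signatures are bit strings, and spaces are stripped so the values can be pasted as they appear on the slide.

```python
def signature_matches(query: str, block: str) -> bool:
    """Return True if every bit set in the query is also set in the block."""
    q = int(query.replace(" ", ""), 2)
    b = int(block.replace(" ", ""), 2)
    # (q AND b) equals q exactly when no query bit is missing from the block.
    return (q & b) == q


if __name__ == "__main__":
    query = "001 010 111 011"
    blocks = [
        "0 0 1 0 0 0 1 1 1 0 1 1",
        "0 0 1 1 1 1 1 1 1 0 1 1",
        "0 0 1 0 1 0 1 0 1 0 1 1",
        "0 0 1 0 1 0 1 1 1 0 1 0",
        "1 1 1 0 1 0 1 1 1 0 1 1",
        "0 0 1 1 0 0 1 1 1 0 1 1",
        "0 0 1 0 1 0 1 1 1 1 1 1",
    ]
    for block in blocks:
        print(block, "Yes" if signature_matches(query, block) else "No")
```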


Signature File Structure

• Sequential
– During searching, each signature is compared to the query signature.
– Time consuming because:
> Memory size is limited, hence all signatures cannot be loaded into memory at once.
> This may result in a large number of I/O operations.

• We need a file structure for the signature file that minimises the number of I/O operations.

• Bit-Sliced Signature
– At most, only N records (N being the size of the signature) need to be retrieved.


Matrix Transposed

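\[
\begin{pmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \end{pmatrix}^{T}
=
\begin{pmatrix} x_{11} & x_{21} \\ x_{12} & x_{22} \\ x_{13} & x_{23} \end{pmatrix},
\qquad x_{ij} \rightarrow x_{ji}
\]

\[
\begin{pmatrix} a & b & c \\ d & e & f \end{pmatrix}^{T}
=
\begin{pmatrix} a & d \\ b & e \\ c & f \end{pmatrix}
\]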


Bit-Sliced

Sequential (one record of N bits per document):

d1  0 0 1 0 0 0 1 1 1 0 1 1

d2  0 0 1 1 1 1 1 1 1 0 1 1

d3  0 0 1 0 1 0 1 0 1 0 1 1

d4  0 0 1 0 1 0 1 1 1 0 1 0

...

dn

Bit sliced (N records, one per bit position):

d1 d2 d3 d4 ... dn

 0  0  0  0

 0  0  0  0

 1  1  1  1

 0  1  0  0

 0  1  1  1

 0  1  0  0

 1  1  1  1

 1  1  0  1

 1  1  1  1

 0  0  0  0

 1  1  1  1

 1  1  1  0

Query: 001 010 111 011
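Building the bit-sliced file from the sequential one is a matrix transposition, which can be sketched in Python with `zip`. The function name is hypothetical, and the whole file is held in memory here purely for illustration.

```python
def bit_slices(signatures: list[str]) -> list[str]:
    """Transpose document signatures into bit slices.

    Input:  one N-bit string per document.
    Output: N strings, where slice i holds bit i of every document.
    """
    return ["".join(column) for column in zip(*signatures)]


if __name__ == "__main__":
    documents = [
        "001000111011",  # d1
        "001111111011",  # d2
        "001010101011",  # d3
        "001010111010",  # d4
    ]
    for position, record in enumerate(bit_slices(documents), start=1):
        print(position, record)
```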


Bit Sliced Signature File

• Retrieval
– If the ith bit in the query signature is set to 1, retrieve the ith signature block/record.
– If n bits are set to 1 in the query, only n records need to be retrieved.


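Bit Sliced Signature File

Bit-sliced file (one record per bit position; columns are d1 d2 d3 d4):

 1:  0 0 0 0

 2:  0 0 0 0

 3:  1 1 1 1

 4:  0 1 0 0

 5:  0 1 1 1

 6:  0 1 0 0

 7:  1 1 1 1

 8:  1 1 0 1

 9:  1 1 1 1

10:  0 0 0 0

11:  1 1 1 1

12:  1 1 1 0

Query: 001 010 111 011

Retrieved records (bit positions where the query has a 1):

 3:  1 1 1 1

 5:  0 1 1 1

 7:  1 1 1 1

 8:  1 1 0 1

 9:  1 1 1 1

11:  1 1 1 1

12:  1 1 1 0

Match: the 2nd block, because all bits in its column are set to 1.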
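The retrieval step can be sketched as follows: fetch only the slices where the query has a 1, AND them together, and read off the columns that are still 1. The function name and the in-memory list standing in for the file are assumptions; document indexes are zero-based, so index 1 is the 2nd block.

```python
def bit_sliced_query(slices: list[str], query: str) -> tuple[list[int], int]:
    """Return (matching document indexes, number of records retrieved)."""
    n_docs = len(slices[0])
    result = (1 << n_docs) - 1  # start with every document as a candidate
    retrieved = 0
    for position, bit in enumerate(query):
        if bit == "1":
            # Only slices at query positions set to 1 are read.
            result &= int(slices[position], 2)
            retrieved += 1
    # The leftmost bit of a slice belongs to the first document.
    matches = [d for d in range(n_docs) if (result >> (n_docs - 1 - d)) & 1]
    return matches, retrieved


if __name__ == "__main__":
    slices = ["0000", "0000", "1111", "0100", "0111", "0100",
              "1111", "1101", "1111", "0000", "1111", "1110"]
    print(bit_sliced_query(slices, "001010111011"))
```

With the slide's data, 7 of the 12 records are read and only the second document survives the AND.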


Bit Sliced Signature File

• Advantages:
– A smaller number of records is retrieved -> faster retrieval.

• Disadvantages:
– An update operation becomes very costly, because adding one document signature means updating all N bit-slice records.


False Drop

• False drop occurs when a document’s signature matches a query’s signature but the query’s word does not match any word in the document.

• It is possible because two distinct blocks may have the same signature due to:
– the hashing algorithm
– superimposed coding


Relation Between the Signature Properties and False Drop

• The rate of false drop depends on:
– The size of the signature (N bits)
> An increase in N will decrease the false drop rate.
– The number of bits set to 1 (k bits)
> An increase in k up to a certain level will decrease the false drop rate.
– The number of unique words per block
> A decrease in the number of unique words per block will decrease the false drop rate.
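These three trends can be illustrated with a rough estimate that is not part of the slides. Assume each word sets k bit positions chosen uniformly and independently (so a word may occasionally set fewer than k distinct bits), and a block holds D distinct words. A given bit is then set with probability 1 - (1 - 1/N)^(kD), and a single-word query that is not in the block matches by chance with probability about (1 - (1 - 1/N)^(kD))^k. The symbol D and the function name below are assumptions for this sketch.

```python
def false_drop_estimate(n_bits: int, k: int, words_per_block: int) -> float:
    """Approximate false drop probability for a single-word query.

    Assumes uniform, independent bit positions; a model, not a measurement.
    """
    # Probability that one particular bit of the block signature is set.
    fill = 1.0 - (1.0 - 1.0 / n_bits) ** (k * words_per_block)
    # All k query bits must happen to be set in the block signature.
    return fill ** k


if __name__ == "__main__":
    for k in (1, 4, 16):
        print(k, false_drop_estimate(64, k, 10))
```

Under this model the estimate falls as N grows or as the number of words per block shrinks. It falls and then rises again as k grows, which is consistent with the “up to a certain level” qualification above.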