
Introduction to PAT-Tree and its variations
Kenny Kwok [email protected]
Department of Computer Science and Engineering
The Chinese University of Hong Kong

Shatin, N.T., Hong Kong SAR
A Novel PAT-Tree Approach to Chinese Document Clustering
Outline
Application examples
PAT tree
Definition: a Patricia tree that stores every semi-infinite string (sistring) of a document
Two things we have to know:
PATRICIA TREE
SISTRING
PATRICIA TREE
A particular type of “trie”
Example: a trie and a PATRICIA tree holding the contents ‘010’, ‘011’, and ‘101’.
PATRICIA TREE
Therefore, a PATRICIA tree has the following attributes in its internal nodes:
Index bit (check bit)
Child pointers (each internal node must have exactly 2 children)
Leaf nodes, on the other hand, must store the actual content for the final comparison
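The node layout above can be tried out with a minimal sketch. It assumes equal-length bit strings as keys, and the names `Leaf`, `Internal`, `insert` and `contains` are illustrative choices, not taken from the slides.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Leaf:
    key: str  # full bit string, kept for the final comparison


@dataclass
class Internal:
    check_bit: int   # 0-based index of the bit this node tests
    left: "Node"     # subtree whose keys have '0' at check_bit
    right: "Node"    # subtree whose keys have '1' at check_bit


Node = Union[Leaf, Internal]


def first_diff(a: str, b: str) -> int:
    """Index of the first bit at which two equal-length keys differ."""
    return next(i for i, (x, y) in enumerate(zip(a, b)) if x != y)


def find_leaf(node: Node, key: str) -> Leaf:
    """Follow check bits only; the skipped bits are never examined."""
    while isinstance(node, Internal):
        node = node.right if key[node.check_bit] == "1" else node.left
    return node


def insert(root: Optional[Node], key: str) -> Node:
    if root is None:
        return Leaf(key)
    nearest = find_leaf(root, key).key
    if nearest == key:
        return root  # already stored
    d = first_diff(nearest, key)

    def place(node: Node) -> Node:
        # Descend while the node tests a bit before the differing bit.
        if isinstance(node, Internal) and node.check_bit < d:
            if key[node.check_bit] == "1":
                node.right = place(node.right)
            else:
                node.left = place(node.left)
            return node
        new_leaf = Leaf(key)
        if key[d] == "1":
            return Internal(d, node, new_leaf)
        return Internal(d, new_leaf, node)

    return place(root)


def contains(root: Optional[Node], key: str) -> bool:
    """The leaf reached may not match, so compare the stored content."""
    return root is not None and find_leaf(root, key).key == key


root = None
for key in ("010", "011", "101"):
    root = insert(root, key)
print(root.check_bit, root.left.check_bit)           # 0 2
print(contains(root, "011"), contains(root, "000"))  # True False
```

The left subtree tests bit 2 directly and skips bit 1, because ‘010’ and ‘011’ agree there. The lookup of ‘000’ still reaches the leaf ‘010’, which is why the final comparison against the stored content is needed.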
SISTRING
Sistring is the short form of ‘Semi-Infinite String’
A string, whatever it actually contains, is a binary bit pattern (e.g. 11001)
One of the sistrings in the above example is 11001000…
There are 5 sistrings in total in this example
SISTRING
110010000…
10010000…
0010000…
010000…
10000…
In practice, we cannot store an infinite string. For the above example, we only need to store each sistring up to 5 bits long; that is enough to distinguish each one from the others.
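A short sketch can confirm that 5-bit prefixes are enough for this example. The helper name `bit_sistrings` is an assumption made for illustration.

```python
def bit_sistrings(bits: str, width: int) -> list[str]:
    """Every sistring of `bits`, zero-padded and cut to `width` bits."""
    return [(bits[i:] + "0" * width)[:width] for i in range(len(bits))]


prefixes = bit_sistrings("11001", 5)
print(prefixes)  # ['11001', '10010', '00100', '01000', '10000']
print(len(set(prefixes)) == len(prefixes))  # True: all five are distinct
```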
SISTRING
The bit level is too abstract; depending on the application, we rarely apply sistrings at the bit level. The character level is a better idea!
e.g. CUHK
We require each sistring to be at least 4 characters long.
(Why do we pad 0/NULL at the end of a sistring?)
SISTRING (USAGE)
Sistrings are efficient for storing sub-string information.
A string of n characters has n(n+1)/2 sub-strings. Since the longest one has length n, the storage requirement for all sub-strings is O(n^3).
e.g. ‘CUHK’ is 4 characters long and has 4(5)/2 = 10 different sub-strings: C, U, …, CU, UH, …, CUH, UHK, CUHK.
Storage requirement: O(n^2) sub-strings × maximum length n -> O(n^3)
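The count can be checked by brute force. This is a small sketch with an illustrative helper name, `substrings`; it enumerates one sub-string per (start, end) pair, so the count equals n(n+1)/2 even when characters repeat.

```python
def substrings(s: str) -> list[str]:
    """All contiguous sub-strings of s, one per (start, end) position pair."""
    return [s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)]


subs = substrings("CUHK")
n = len("CUHK")
print(len(subs), n * (n + 1) // 2)  # 10 10
print(sum(len(t) for t in subs))    # 20 characters stored in total
```

The exact total here is n(n+1)(n+2)/6 characters, which grows as O(n^3) and agrees with the bound above.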
SISTRING (USAGE)
We may instead store the sistrings of ‘CUHK’, which requires O(n^2) storage.
CUHK <- represents C, CU, CUH and CUHK at the same time
UHK0 <- represents U, UH and UHK at the same time
HK00 <- represents H and HK at the same time
K000 <- represents K only
Prefix matching on the sistrings is equivalent to exact matching on the sub-strings.
Conclusion: sistrings are a better representation for storing sub-string information.
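The equivalence between prefix matching and sub-string matching can be sketched as follows. The function names are illustrative, and the sketch assumes the pad character ‘0’ never occurs in the text or in a query.

```python
def sistrings(text: str, pad: str = "0") -> list[str]:
    """Suffixes of `text`, each padded on the right to len(text) characters."""
    n = len(text)
    return [text[i:].ljust(n, pad) for i in range(n)]


def occurs(sub: str, text: str) -> bool:
    """Exact sub-string test, done as prefix matching on the sistrings."""
    return any(s.startswith(sub) for s in sistrings(text))


print(sistrings("CUHK"))     # ['CUHK', 'UHK0', 'HK00', 'K000']
print(occurs("UH", "CUHK"))  # True
print(occurs("UK", "CUHK"))  # False
```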
PAT Tree
Now it is time to return to the PAT tree
A PAT tree is a PATRICIA tree that stores every sistring of a document
What if the document contains simply ‘CUHK’?
We prefer to think in characters, but PATRICIA works on bits; therefore, we have to know the bit pattern of each sistring in order to work out the actual shape of the resulting PAT tree
This looks tedious even for a small example, but it is how a PAT tree works!
PAT Tree (Example)
By digitizing the string, we can work out by hand what the PAT tree looks like.
The following are the actual bit patterns of the four sistrings:
[Table: bit patterns of the four sistrings of ‘CUHK’]
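The bit patterns of the four sistrings can be regenerated with a short sketch. It assumes 8-bit ASCII codes and treats the pad character ‘0’ as NUL (all zero bits); the encoding used on the original slide is not recoverable, so both are assumptions.

```python
def to_bits(sistring: str) -> str:
    """8-bit ASCII code of each character; the pad character '0' becomes NUL."""
    return " ".join(format(0 if ch == "0" else ord(ch), "08b") for ch in sistring)


for s in ("CUHK", "UHK0", "HK00", "K000"):
    print(s, to_bits(s))
```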
PAT Tree
We do not view a document as a packed string of characters. A document consists of words, e.g. “Hello. This is a simple document.”
In this case, sistrings can be applied at the ‘document level’: the document is treated as one big string, and we tokenize it word by word instead of character by character.
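Word-by-word sistrings can be sketched as below. The tokenization rule (a word is a run of `\w` characters) and the name `word_sistrings` are assumptions for illustration; the slides do not specify a tokenizer.

```python
import re


def word_sistrings(document: str) -> list[str]:
    """One sistring per word: the text from that word to the end of the document."""
    return [document[m.start():] for m in re.finditer(r"\w+", document)]


doc = "Hello. This is a simple document."
for s in word_sistrings(doc):
    print(repr(s))
```

This gives six sistrings, one per word, instead of one per character.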
PAT Tree (Example)
This works! BUT…
[Figure: example PAT tree; surviving labels: ‘UHK0’ and the bit pattern 01001000 …]
PAT Tree (Actual Structure)
The document itself
Leaf pointers, O(n)
Therefore, the PAT tree is a linear-size data structure that holds the O(n^3) sub-string information
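The idea of keeping pointers instead of copies can be sketched as follows. The class name and methods are hypothetical, and `find` uses a plain linear scan where a real PAT tree would walk its internal nodes; only the storage layout is being illustrated.

```python
class SistringPointers:
    """Keep one copy of the document plus one integer offset per sistring."""

    def __init__(self, document: str) -> None:
        self.document = document                   # stored once, O(n)
        self.offsets = list(range(len(document)))  # one pointer per leaf, O(n)

    def sistring(self, offset: int, length: int, pad: str = "0") -> str:
        """Materialise a sistring on demand instead of storing it."""
        return self.document[offset:offset + length].ljust(length, pad)

    def find(self, sub: str) -> list[int]:
        """Offsets whose sistring starts with `sub` (linear scan, not a tree walk)."""
        return [i for i in self.offsets if self.document.startswith(sub, i)]


p = SistringPointers("CUHK")
print(p.sistring(1, 4))  # UHK0
print(p.find("HK"))      # [2]
```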
The Chinese PAT tree
We can build a PAT tree for English easily: sistrings are decomposed word by word.
For a Chinese document, the layout gives no indication of word boundaries; the characters are simply packed together.
e.g. “”
We know there are 5 characters, but what more?
In fact, there are 2 words, “” and “”, but we have no way to know this just by reading the text without other supporting knowledge.
Semi-Infinite String (Sistring)
The sistrings become:
This makes the sistrings comparable to each other
We can examine any particular bit of a sistring, and no sistring will have a ‘missing bit’
The Chinese PAT tree
In Chinese information processing research, researchers suggest taking the sistrings of a Chinese document at the “sentence level”
i.e. each document is decomposed into sentences at its punctuation marks.
“” will be viewed as 2 sentences, “” and “”
For each sentence, the sistrings can be obtained, like “”, “”, “”, etc.
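Sentence-level sistrings can be sketched as below. The sample text is a made-up example (the original Chinese examples were lost from the slides), and the delimiter set in `PUNCTUATION` is an assumption.

```python
import re

PUNCTUATION = "，。！？；：、"  # assumed set of sentence delimiters


def sentences(document: str) -> list[str]:
    """Split on punctuation marks and drop empty pieces."""
    return [s for s in re.split(f"[{PUNCTUATION}]", document) if s]


def sentence_sistrings(document: str) -> list[str]:
    """Sistrings never cross a sentence boundary."""
    return [s[i:] for s in sentences(document) for i in range(len(s))]


doc = "香港中文大學，中文大學很大。"  # hypothetical sample text
print(sentences(doc))               # ['香港中文大學', '中文大學很大']
print(sentence_sistrings(doc)[:3])  # ['香港中文大學', '港中文大學', '中文大學']
```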
The Chinese PAT tree
In this way, the Chinese PAT tree is built. Since every Chinese word must be a sub-string of the document, all Chinese words can still be found in the Chinese PAT tree efficiently.
Therefore, Chinese word segmentation is one of the most important applications of the PAT tree.
The Chinese PAT Tree Structure
In the Chinese PAT tree, a document is decomposed into sentences. It is possible for the sistrings of one sentence to be a subset of the sistrings of another sentence.
e.g. “”. The sistring “” appears twice; one occurrence is absorbed by the other.
Therefore, we usually attach a frequency count to each leaf node of the tree.
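The effect of the frequency count can be sketched with a counter keyed by sistring. The sample text and delimiter set are assumptions (the original example was lost), and a `Counter` stands in for the counts that the tree keeps on its leaf nodes.

```python
from collections import Counter
import re


def sistring_counts(document: str, delimiters: str = "，。！？；：、") -> Counter:
    """Map each distinct sentence-level sistring to its number of occurrences."""
    counts: Counter = Counter()
    for sentence in re.split(f"[{delimiters}]", document):
        for i in range(len(sentence)):
            counts[sentence[i:]] += 1
    return counts


counts = sistring_counts("香港中文大學，中文大學。")  # hypothetical sample text
print(counts["中文大學"])                 # 2
print(len(counts), sum(counts.values()))  # 6 10
```

Ten sistrings are generated but only six distinct entries are kept; the four duplicates are recorded as counts rather than as extra leaves.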
The Chinese PAT Tree Structure
The internal node remains the same: it holds the check-bit information
The leaf node now has a frequency count attribute
The document is decomposed into a number of sentences.
Storage complexity remains O(n).
Structure modification
We can see that the node structures of internal nodes and leaf nodes are not the same
The tree will be more flexible if its nodes are generic (share a universal node structure)
Trade-off: a generic node structure enlarges the individual node size
But…
Memory is cheap now
Even a low-end computer can support hundreds of MB of RAM
The modified tree is still an O(n) structure
Structure of the modified node
Check Bit
Frequency Count
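One possible shape for such a generic node is sketched below. Only the check bit and the frequency count come from the slide; the child and offset fields, and the name `PatNode`, are assumptions added so that the same record can serve as an internal node or as a leaf.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PatNode:
    check_bit: int                    # bit position tested at this node
    frequency: int = 0                # occurrences of the sub-string it represents
    left: Optional["PatNode"] = None  # assumed field: child for bit 0
    right: Optional["PatNode"] = None # assumed field: child for bit 1
    offset: Optional[int] = None      # assumed field: pointer into the document

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


leaf = PatNode(check_bit=53, frequency=2, offset=0)
print(leaf.is_leaf())  # True
```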
Example of our Modified Version
[Figure: example of the modified tree built from a Chinese text]
Essential Length
Essential Length is the length, in bits, of the whole Chinese characters a tree node can represent
In general, a Chinese character is a double-byte (16-bit) character
The essential length equals the check bit, rounded down to a whole number of Chinese characters
e.g. a node with check bit = 53
It can represent only 3 Chinese characters (48 bits), not 4 Chinese characters (64 bits)
Its essential length = 48
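The rounding rule can be written as a one-line computation. The function name is illustrative; the 16-bit character width is taken from the text above.

```python
BITS_PER_CHAR = 16  # double-byte Chinese characters


def essential_length(check_bit: int) -> int:
    """Round the check bit down to a whole number of characters, in bits."""
    return (check_bit // BITS_PER_CHAR) * BITS_PER_CHAR


print(essential_length(53))                   # 48
print(essential_length(53) // BITS_PER_CHAR)  # 3 characters
```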
Essential Node
We call a node an “Essential Node” (EN) if and only if:
its Essential Length >= 32
its Essential Length is at least 16 more than that of the previous ancestral EN
Each Essential Node uniquely represents a sub-string (phrase).
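The two conditions can be applied along a root-to-leaf path as in the sketch below. The function name and the sample check bits are made up for illustration, and the sketch assumes that a node with no ancestral EN is compared against an essential length of 0.

```python
def essential_nodes(check_bits: list[int], bits_per_char: int = 16) -> list[int]:
    """Return the check bits on a root-to-leaf path that qualify as ENs."""
    result = []
    previous = 0  # assumption: no ancestral EN counts as essential length 0
    for cb in check_bits:
        length = (cb // bits_per_char) * bits_per_char
        if length >= 2 * bits_per_char and length >= previous + bits_per_char:
            result.append(cb)
            previous = length
    return result


print(essential_nodes([10, 35, 40, 53, 70]))  # [35, 53, 70]
```

The node with check bit 40 is skipped because its essential length (32) is the same as that of the EN above it, so it adds no new whole character.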
Essential Node
With the definition of “Essential Node” (EN):
Each essential node represents a possible Chinese sub-string, e.g. “”, “”
With the generalized structure, each EN also has a frequency count, which reflects the number of occurrences of its associated sub-string.
Applications
A PAT tree may embed more information, depending on the application
Well-known Chinese information processing applications include:
Keyword extraction
Sentence segmentation
Document classification

These show the importance of the PAT tree structure to those applications
Conclusion
The PAT tree is an O(n) data structure for document indexing
The PAT tree is good for solving the sub-string matching problem
The Chinese PAT tree takes sistrings at the sentence level. A frequency count is introduced to handle the duplicate-sistring problem
By generalizing the node structure, the modified version increases the PAT tree's capability for various applications





[Figure: PAT tree over the text “This document is simple” (bit pattern 01010100 …); surviving node labels: bit 2, bit 3, bit 4]