Upload
samuel-dennis
View
213
Download
0
Embed Size (px)
Citation preview
Information Retrieval
CSE 8337
Spring 2005
Simple Text Processing
Material for these slides obtained from:
Data Mining Introductory and Advanced Topics by Margaret H. Dunham
http://www.engr.smu.edu/~mhd/book
CSE 8337 Spring 2005 2
Text Processing TOC
Simple Text Storage String Matching String-to-String Correction
(Approximate matching)
CSE 8337 Spring 2005 3
Text storage
EBCDIC/ASCII Array of character Linked list of character Trees- B Tree, Trie Stuart E. Madnick, “String Processing
Techniques,” Communications of the ACM, Vol 10, No 7, July 1967, pp 420-424.
CSE 8337 Spring 2005 4
Pattern Matching(Recognition)
Pattern Matching: finds occurrences of a predefined pattern in the data.
Applications include speech recognition, information retrieval, time series analysis.
CSE 8337 Spring 2005 5
Similarity Measures Determine similarity between two objects. Similarity characteristics:
Alternatively, distance measures measure how unlike or dissimilar objects are.
CSE 8337 Spring 2005 6
String Matching Problem
Input: Pattern – length m Text string – length n
Find one (next, all) occurrences of string in pattern
Ex: String:
00110011011110010100100111 Pattern: 011010
CSE 8337 Spring 2005 7
String Matching Algorithms
Brute Force Kknuth-Morris Pratt Boyer Moore P209 in text
CSE 8337 Spring 2005 8
Brute Force String Matching
Brute Force Handbook of Algorithms and Data Structures
http://www.dcc.uchile.cl/~rbaeza/handbook/algs/7/711a.srch.c.html Space O(m+n) Time O(mn)
00110011011110010100100111
011010011010011010
CSE 8337 Spring 2005 10
Creating FSR
Create FSM: Construct the “correct” spine. Add a default “failure bus” to state 0. Add a default “initial bus” to state 1. For each state, decide its attachments
to failure bus, initial bus, or other failure links.
CSE 8337 Spring 2005 11
Knuth-Morris-Pratt Apply FSM to string by processing
characters one at a time. Accepting state is reached when pattern
is found. Space O(m+n) Time O(m+n) Handbook of Algorithms and Data
Structureshttp://www.dcc.uchile.cl/~rbaeza/handbook/algs/7/712.srch.c.html
CSE 8337 Spring 2005 12
Boyer-Moore Scan pattern from right to left Skip many positions on illegal
character string. O(mn) Expected time better than KMP Expected behavior better Handbook of Algorithms and Data Structureshttp://www.dcc.uchile.cl/~rbaeza/handbook/algs/7/713.preproc.c.html
CSE 8337 Spring 2005 13
String-to-String Correction Measure of similarity between strings Can be used to determine how to convert
from one string to another Cost to convert one to the other Transformations
Match: Current characters in both strings are the same
Delete: Delete current character in input string Insert: Insert current character in target string
into string