15
Information Retrieval CSE 8337 Spring 2005 Simple Text Processing Material for these slides obtained from: Data Mining Introductory and Advanced Topics by Margaret H. Dunham http:// www.engr.smu.edu/~mhd/book

Information Retrieval CSE 8337 Spring 2005 Simple Text Processing Material for these slides obtained from: Data Mining Introductory and Advanced Topics

Embed Size (px)

Citation preview

Information Retrieval

CSE 8337

Spring 2005

Simple Text Processing

Material for these slides obtained from:

Data Mining Introductory and Advanced Topics by Margaret H. Dunham

http://www.engr.smu.edu/~mhd/book

CSE 8337 Spring 2005 2

Text Processing TOC

Simple Text Storage String Matching String-to-String Correction

(Approximate matching)

CSE 8337 Spring 2005 3

Text storage

EBCDIC/ASCII Array of character Linked list of character Trees- B Tree, Trie Stuart E. Madnick, “String Processing

Techniques,” Communications of the ACM, Vol 10, No 7, July 1967, pp 420-424.

CSE 8337 Spring 2005 4

Pattern Matching(Recognition)

Pattern Matching: finds occurrences of a predefined pattern in the data.

Applications include speech recognition, information retrieval, time series analysis.

CSE 8337 Spring 2005 5

Similarity Measures Determine similarity between two objects. Similarity characteristics:

Alternatively, distance measures measure how unlike or dissimilar objects are.

CSE 8337 Spring 2005 6

String Matching Problem

Input: Pattern – length m Text string – length n

Find one (next, all) occurrences of string in pattern

Ex: String:

00110011011110010100100111 Pattern: 011010

CSE 8337 Spring 2005 7

String Matching Algorithms

Brute Force Kknuth-Morris Pratt Boyer Moore P209 in text

CSE 8337 Spring 2005 8

Brute Force String Matching

Brute Force Handbook of Algorithms and Data Structures

http://www.dcc.uchile.cl/~rbaeza/handbook/algs/7/711a.srch.c.html Space O(m+n) Time O(mn)

00110011011110010100100111

011010011010011010

CSE 8337 Spring 2005 9

FSR

CSE 8337 Spring 2005 10

Creating FSR

Create FSM: Construct the “correct” spine. Add a default “failure bus” to state 0. Add a default “initial bus” to state 1. For each state, decide its attachments

to failure bus, initial bus, or other failure links.

CSE 8337 Spring 2005 11

Knuth-Morris-Pratt Apply FSM to string by processing

characters one at a time. Accepting state is reached when pattern

is found. Space O(m+n) Time O(m+n) Handbook of Algorithms and Data

Structureshttp://www.dcc.uchile.cl/~rbaeza/handbook/algs/7/712.srch.c.html

CSE 8337 Spring 2005 12

Boyer-Moore Scan pattern from right to left Skip many positions on illegal

character string. O(mn) Expected time better than KMP Expected behavior better Handbook of Algorithms and Data Structureshttp://www.dcc.uchile.cl/~rbaeza/handbook/algs/7/713.preproc.c.html

CSE 8337 Spring 2005 13

String-to-String Correction Measure of similarity between strings Can be used to determine how to convert

from one string to another Cost to convert one to the other Transformations

Match: Current characters in both strings are the same

Delete: Delete current character in input string Insert: Insert current character in target string

into string

CSE 8337 Spring 2005 14

Distance Between Strings

CSE 8337 Spring 2005 15

Approximate String Matching

Find patterns “close to” the string Fuzzy matching Applications:

Spelling checkers IR

Define similarity (distance) between string and pattern