TCN Spell Checker Team AZP: Mark Biddlecom, Joshua Correa, Jatinder Singh, Zianeh Kemeh-Gama, Eric Engquist


TCN Spell Checker

Team AZP: Mark Biddlecom, Joshua Correa, Jatinder Singh, Zianeh Kemeh-Gama, Eric Engquist


Team AZP

Team descended from previous project groups

Primary roles by member:

Joshua Correa – Project Lead, TCN Liaison
Eric Engquist – Materials and Metrics Manager
Mark Biddlecom – Resource and Process Manager
Zianeh Kemeh-Gama – Schedule Manager
Jatinder Singh – Research Lead
Dr. Ludi – Faculty Advisor

Website: http://www.se.rit.edu/~teamazp/index.htm


TCN

Software development and staffing company based here in Rochester, NY
http://www.tcnus.com

Developer of web-based search and knowledge management programs

KnowledgeTrac
  Customizable multilingual web search tool
  Standalone spider

TecTrac, AppTrac, AuditTrac, HelpTrac, TestTrac
  Document and database search and management tools


Document Collaboration Tool

Online repository for management documents
  Meeting minutes
  Metrics
  Research links
  Presentations and diagrams
  Tasks and issues for each team member
  Email notifications of changes

Custom developed for this project


Spell Checker

Should compensate for mistyped search terms
  Match misspelled words with correct spelling: “atourney” → attorney
  Match misspelled words with correct results: “atourney” → legal services, lawyers

Meant to make searches more useful for average web search users
  1) Takes in search terms from user
  2) Checks spelling/matches with known search terms
  3) Returns suggestions to search engine


Spell Checker Requirements

Functional Requirements:

Look up search terms in a dictionary
Suggest replacements for misspelled terms (closest match)
Add new terms to dictionary
Process phrases (as opposed to single words)
Support multiple dictionaries


Spell Checker Requirements

Non-functional Requirements:

Object-oriented design to be implemented as a web service with VB.NET

Adaptability
  Must support ability to work with different data stores
  Must support the addition of new components

Performance
  Analysis of a search string cannot take longer than one second.


Spell Check Process

Load configuration
Load dictionaries (from cache or rebuild)
Apply rules
  Parse search string
  Apply algorithm to each term
  Short-circuit if enough results have been found
Return result set of suggestions
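
The rule loop above can be sketched roughly as follows. This is only an illustration: the function names, the (parser, algorithm, dictionary) rule tuple, and the use of Python's difflib as a stand-in for the project's own matching algorithms are all assumptions, not the team's VB.NET design.

```python
import difflib

def word_parser(search):
    """Split a search string into lowercase terms (hypothetical parser)."""
    return search.lower().split()

def exact_match(term, dictionary):
    """Hypothetical algorithm: the term is its own suggestion if it is known."""
    return [term] if term in dictionary else []

def closest_match(term, dictionary):
    """Hypothetical algorithm: difflib stands in for the real similarity search."""
    return difflib.get_close_matches(term, dictionary, n=3, cutoff=0.8)

def spell_check(search, rules, max_results=3):
    """Apply each rule in order and stop once enough suggestions are found."""
    suggestions = []
    for parser, algorithm, dictionary in rules:
        for term in parser(search):
            for candidate in algorithm(term, dictionary):
                if candidate not in suggestions:
                    suggestions.append(candidate)
                if len(suggestions) >= max_results:
                    return suggestions  # short-circuit: enough results
    return suggestions

# Made-up sample dictionary and rule list, for illustration only.
WORDS = ["attorney", "lawyer", "legal", "services"]
RULES = [
    (word_parser, exact_match, WORDS),
    (word_parser, closest_match, WORDS),
]

print(spell_check("atourney", RULES))
```

Ordering cheap rules first means the more expensive ones only run when the cheap ones have not already produced enough suggestions.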


Configuration

Application configuration file
  Provides application-level settings (e.g., maximum memory usage, maximum processor time for search)
  Points to search configuration file

Search configuration file
  Allows control over how memory is used vs. algorithm performance
  Defines dictionaries and methodologies
  Methodologies include rules
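
A search configuration file along these lines might look like the sketch below. Only the <root> and <dictionary> tag names and the component names (EnglishWords, BusinessNames, the loaders and formatters from the Data Flow slide) come from the slides. The <search>, <methodology> and <rule> elements, every attribute name, and the source values are hypothetical, since the real schema is not shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration; the element layout and attributes are assumptions.
SEARCH_CONFIG = """
<search>
  <root name="EnglishWords" loader="TextFileLoader" source="english.txt"/>
  <root name="BusinessNames" loader="DatabaseLoader" source="businesses"/>
  <dictionary name="normal" root="EnglishWords" formatter="NormalFormatter"/>
  <dictionary name="phonetic" root="BusinessNames" formatter="PhoneticFormatter"/>
  <methodology name="default">
    <rule parser="WordParser" algorithm="StringSimilarity" dictionary="normal"/>
    <rule parser="WordParser" algorithm="Phonetic" dictionary="phonetic"/>
  </methodology>
</search>
"""

def read_config(text):
    """Return (roots, dictionaries, methodologies) as plain dicts."""
    doc = ET.fromstring(text)
    roots = {r.get("name"): dict(r.attrib) for r in doc.findall("root")}
    dictionaries = {d.get("name"): dict(d.attrib) for d in doc.findall("dictionary")}
    methodologies = {
        m.get("name"): [dict(rule.attrib) for rule in m.findall("rule")]
        for m in doc.findall("methodology")
    }
    return roots, dictionaries, methodologies

roots, dictionaries, methodologies = read_config(SEARCH_CONFIG)
print(sorted(roots), sorted(dictionaries), len(methodologies["default"]))
```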


Loaders

Load a set of words for use in dictionaries
Used to create root dictionaries (<root> in the configuration file)
Word sets returned by loaders are not cached, but instead used to create algorithm dictionaries


Formatters

Provide a dictionary specialized for use with a specific algorithm

Created by <dictionary> tags in the configuration file

Dictionaries created by formatters are cached for use between application sessions
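
The loader and formatter split can be illustrated with a small sketch: the loader's word set is never stored, while the formatter's output is written to a cache file and reused on the next run. The function names, the pickle cache format, and the sample words are assumptions made for this example only.

```python
import os
import pickle
import tempfile

def text_loader(lines):
    """Hypothetical loader: return the raw word set (this is never cached)."""
    return {line.strip().lower() for line in lines if line.strip()}

def normal_formatter(words):
    """Hypothetical formatter: build the structure one algorithm expects."""
    return sorted(words)

def build_dictionary(words, formatter, cache_path):
    """Reuse a cached algorithm dictionary if present, otherwise build and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as handle:
            return pickle.load(handle)
    dictionary = formatter(words)
    with open(cache_path, "wb") as handle:
        pickle.dump(dictionary, handle)
    return dictionary

cache_file = os.path.join(tempfile.mkdtemp(), "normal.cache")
words = text_loader(["attorney\n", "Lawyer\n", "\n"])
first = build_dictionary(words, normal_formatter, cache_file)   # built, then cached
second = build_dictionary(set(), normal_formatter, cache_file)  # served from the cache
print(first, second)
```

The second call passes an empty word set on purpose, to show that the result comes from the cache file rather than from the loader.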


Parsers

Split a search string up into a number of terms

For a given rule, the algorithm is applied to each term supplied by the parser
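
As an illustration, two parsers might differ only in what they treat as a term: single words, or multi-word phrases. Both functions and the max_words parameter are hypothetical, and the slides do not describe how the project's parsers actually split phrases.

```python
def word_parser(search):
    """Return each whitespace-separated word as a lowercase term."""
    return search.lower().split()

def phrase_parser(search, max_words=2):
    """Return runs of up to max_words consecutive words, longest runs first."""
    words = word_parser(search)
    terms = []
    for size in range(min(max_words, len(words)), 0, -1):
        for start in range(len(words) - size + 1):
            terms.append(" ".join(words[start:start + size]))
    return terms

print(word_parser("Rochester atourney"))
print(phrase_parser("Rochester atourney"))
```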


Data Flow

[Diagram. Components shown: BusinessNames, EnglishWords, TextFileLoader, DatabaseLoader, CacheFile, NormalFormatter, PhoneticFormatter, Dictionary Configuration, Dictionary, Methodology, Parser, Algorithm, Search Phrase, Suggestions, TCN Website]


Algorithms – String Similarity

Calculates number of operations to go from one word to another
  Insertion, Deletion, Substitution
Few operations → Good Suggestion
Extra features
  Swapping operation
  Operation weighting
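
A minimal sketch of an operation-counting distance with both extra features, a swap of adjacent letters and a weight per operation. It uses the optimal string alignment form of edit distance, which is an assumption. The slides do not say which formulation or which weights the project used, and the unit weights here are placeholders.

```python
def similarity_distance(a, b, insert=1, delete=1, substitute=1, swap=1):
    """Weighted edit distance from a to b, allowing adjacent-letter swaps."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * delete            # delete every letter of a
    for j in range(1, cols):
        d[0][j] = j * insert            # insert every letter of b
    for i in range(1, rows):
        for j in range(1, cols):
            same = a[i - 1] == b[j - 1]
            d[i][j] = min(
                d[i - 1][j] + delete,
                d[i][j - 1] + insert,
                d[i - 1][j - 1] + (0 if same else substitute),
            )
            # Swap: the last two letters of each prefix are each other's mirror.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + swap)
    return d[-1][-1]

print(similarity_distance("atourney", "attorney"))
print(similarity_distance("teh", "the"))
```

Raising the swap weight makes transposed letters look less alike, which is one way operation weighting could be tuned against real search logs.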


Algorithms – String Similarity

Complexity of O(s1*s2)
  s1, s2: lengths of strings being compared
Can be improved to O(s1*k)
  k is edit distance

        w  o  r  D
     0  1  2  3  4
  W  1  0  1  2  3
  A  2  1  2  3  4
  r  3  2  2  2  2
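
The O(s1*k) bound can be sketched by evaluating only the cells within k of the matrix diagonal and giving up once the distance must exceed k. This version uses unit costs for insertion, deletion and substitution. The function name and the None return value are assumptions, and each row is allocated at full width for readability even though only the band is computed.

```python
def bounded_distance(a, b, k):
    """Return the unit-cost edit distance if it is at most k, otherwise None."""
    if abs(len(a) - len(b)) > k:
        return None                      # length gap alone already exceeds k
    too_far = k + 1                      # stands for "more than k"
    previous = [j if j <= k else too_far for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [too_far] * (len(b) + 1)
        if i <= k:
            current[0] = i
        low = max(1, i - k)              # band limits for this row
        high = min(len(b), i + k)
        for j in range(low, high + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            current[j] = min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
                too_far,
            )
        previous = current
    distance = previous[len(b)]
    return distance if distance <= k else None

print(bounded_distance("war", "word", 2))
print(bounded_distance("war", "word", 1))
```

A spell checker usually only cares about candidates within a small distance, so a small k lets most dictionary words be rejected cheaply.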


Algorithms – Phonetic

Several rules used to parse English words into a sequence of phonetic sounds
  Example: Phonetic → pntk
Parse dictionary, parse search term
String similarity comparison
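
A rough sketch of the phonetic approach. The rules below (drop vowels and "h", write "c" as "k", collapse doubled letters) are hypothetical and were chosen only because they reproduce the slide's single example; the project's actual rule set is not given. For brevity the lookup matches keys exactly, where the slide describes a string similarity comparison between keys.

```python
from collections import defaultdict

def phonetic_key(word):
    """Reduce a word to a rough consonant skeleton (hypothetical rules)."""
    key = []
    previous = ""
    for ch in word.lower():
        if ch == previous:               # collapse doubled letters such as "tt"
            continue
        previous = ch
        if not ch.isalpha() or ch in "aeiouyh":
            continue                     # drop vowels, "h" and non-letters
        key.append("k" if ch == "c" else ch)
    return "".join(key)

def build_phonetic_index(words):
    """Parse the dictionary once: map each phonetic key to its words."""
    index = defaultdict(list)
    for word in words:
        index[phonetic_key(word)].append(word)
    return index

def phonetic_suggestions(term, index):
    """Parse the search term and return dictionary words sharing its key."""
    return index.get(phonetic_key(term), [])

# Made-up sample dictionary, for illustration only.
INDEX = build_phonetic_index(["attorney", "lawyer", "phonetic"])
print(phonetic_key("Phonetic"))
print(phonetic_suggestions("atourney", INDEX))
```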


Deliverable Schedule

Iteration 1: February 1st 2005
  Complete system design for system iterations 1-3
  Instructions for installation and integration with TCN client software
  Research
    Analysis of historic search strings and business names from TCN
    Dictionaries (common words)
    Word search algorithms
  Basic System Implementation
  Database integration
  Testing


Deliverable Schedule

Iteration 2: February 18th 2005
  Suggest replacements for words not in the dictionary
  Addition of a new search algorithm to provide more intelligent searches
    Closest Match
  Using multiple dictionaries
  Unit Testing for all written code


Deliverable Schedule

Iteration 3: March 21st 2005
  Phonetic Matching
  Dynamically add words/phrases to the dictionary
  Support phrase searching
  Addition of further search algorithms
  GUI Configuration tool
  Algorithm Optimization


Metrics

Schedule/estimation accuracy
  Estimation accuracy (hours per task)
  Slippage percentages

Defect statistics and analysis
  Severity and complexity of defects
  Defect source tracking
  Average age of defects
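
Two of these metrics can be written down as simple calculations. The definitions below are common conventions and are assumptions here, since the slides do not state the formulas the team used. The sample hours and dates are made up for illustration.

```python
from datetime import date

def slippage_percent(estimated_hours, actual_hours):
    """Overrun relative to the estimate; positive means the task ran over."""
    return (actual_hours - estimated_hours) / estimated_hours * 100.0

def defect_age_days(opened, closed, today):
    """Age of a defect: open defects are aged up to today, closed ones to closure."""
    return ((closed or today) - opened).days

def average_defect_age(defects, today):
    """Mean age in days over (opened, closed_or_None) pairs."""
    ages = [defect_age_days(opened, closed, today) for opened, closed in defects]
    return sum(ages) / len(ages) if ages else 0.0

# Made-up sample data.
sample_defects = [
    (date(2005, 2, 1), date(2005, 2, 5)),   # closed after 4 days
    (date(2005, 2, 3), None),               # still open
]
print(slippage_percent(8, 10))
print(average_defect_age(sample_defects, today=date(2005, 2, 11)))
```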


Age of Known Defects

[Chart: Age of Defects (Open and Closed). X axis: Age in Days (0 to 10). Y axis: Number of Known Defects (0 to 5).]


Severity of Defects

[Chart: Defects by Severity Index (1-5). X axis: Severity (Trivial, Low, Moderate, High, Critical). Y axis: Number of Defects (0 to 6).]


Complexity of Defects

[Chart: Defects by Complexity Index (1-5). X axis: Complexity (Trivial, Low, Moderate, High, Critical). Y axis: Number of Defects (0 to 6).]


Sources of Defects

[Chart: Defects by Source. Axis: Number of Defects (0 to 6).]


Research References

“Approximate String Matching” by Ricardo Baeza-Yates at University of Chile

“A Guided Tour to Approximate String Matching” by Gonzalo Navarro at University of Chile, 2001

“An Extension of Ukkonen’s Enhanced Dynamic Programming ASM Algorithm” by Hal Berghel (U of Arkansas) and David Roach (Acxiom Corp.), 1996