TCN Spell Checker
Team AZP: Mark Biddlecom, Joshua Correa, Jatinder Singh, Zianeh Kemeh-Gama, Eric Engquist
Team AZP
Team is a descendant of previous project groups
Primary roles by member:
- Joshua Correa – Project Lead, TCN Liaison
- Eric Engquist – Materials and Metrics Manager
- Mark Biddlecom – Resource and Process Manager
- Zianeh Kemeh-Gama – Schedule Manager
- Jatinder Singh – Research Lead
- Dr. Ludi – Faculty Advisor
Website: http://www.se.rit.edu/~teamazp/index.htm
TCN
- Software development and staffing company based here in Rochester, NY (http://www.tcnus.com)
- Developer of web-based search and knowledge management programs
- KnowledgeTrac
  - Customizable multilingual web search tool
  - Standalone spider
- TecTrac, AppTrac, AuditTrac, HelpTrac, TestTrac
  - Document and database search and management tools
Document Collaboration Tool
- Online repository for management documents
  - Meeting minutes
  - Metrics
  - Research links
  - Presentations and diagrams
  - Tasks and issues for each team member
  - Email notifications of changes
- Custom developed for this project
Spell Checker
- Should compensate for mistyped search terms
  - Match misspelled words with correct spelling: “atourney” → attorney
  - Match misspelled words with correct results: “atourney” → legal services, lawyers
- Meant to make searches more useful for average web search users
  1) Takes in search terms from user
  2) Checks spelling/matches with known search terms
  3) Returns suggestions to search engine
Spell Checker Requirements
Functional Requirements:
- Look up search terms in a dictionary
- Suggest replacements for misspelled terms (closest match)
- Add new terms to dictionary
- Process phrases (as opposed to single words)
- Support multiple dictionaries
Spell Checker Requirements
Non-functional Requirements:
- Object-oriented design to be implemented as a web service with VB.NET
- Adaptability
  - Must support ability to work with different data stores
  - Must support the addition of new components
- Performance
  - Analysis of a search string cannot take longer than one second.
Spell Check Process
- Load configuration
- Load dictionaries (from cache or rebuild)
- Apply rules
- Parse search string
- Apply algorithm to each term
- Short-circuit if enough results have been found
- Return the result set of suggestions
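
The process steps listed here can be sketched roughly as follows. This is only an illustration in Python (the slides specify VB.NET); the function names, the `threshold` and `enough` settings, and the use of the standard library's `difflib` as a stand-in scoring function are all assumptions, not the team's implementation.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in scoring function (stdlib difflib), not the project's algorithm."""
    return SequenceMatcher(None, a, b).ratio()


def check_search_string(search, dictionary, threshold=0.75, enough=3):
    """Return {unknown term: [suggestions]} for a search string.

    `threshold` and `enough` are hypothetical settings for this sketch.
    """
    known = set(dictionary)
    suggestions = {}
    for term in search.lower().split():       # parse the search string
        if term in known:                     # correctly spelled: nothing to do
            continue
        found = []
        for word in dictionary:               # apply the algorithm to the term
            if similarity(term, word) >= threshold:
                found.append(word)
                if len(found) >= enough:      # short-circuit: enough results
                    break
        suggestions[term] = found
    return suggestions


result = check_search_string("atourney fees", ["attorney", "fees", "lawyer"])
```

Here `result` maps the misspelled term "atourney" to the single suggestion "attorney", while the correctly spelled "fees" produces no entry.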
Configuration
- Application configuration file
  - Provides application-level settings (e.g., maximum memory usage, maximum processor time for search)
  - Points to search configuration file
- Search configuration file
  - Allows control over how memory is used vs. algorithm performance
  - Defines dictionaries and methodologies
  - Methodologies include rules
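
To make the idea of a search configuration file concrete, here is a purely hypothetical example read with Python's standard `xml.etree.ElementTree`. Only the `<root>` and `<dictionary>` tag names appear in the slides; the `<search>`, `<methodology>`, and `<rule>` elements, every attribute name, and the file name are invented for this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical search configuration: only the <root> and <dictionary> tag
# names come from the slides; all attributes and the <methodology>/<rule>
# layout are assumptions made for illustration.
SEARCH_CONFIG = """
<search>
  <root name="EnglishWords" loader="TextFileLoader" source="english.txt"/>
  <dictionary name="EnglishPhonetic" root="EnglishWords" formatter="PhoneticFormatter"/>
  <methodology name="default">
    <rule algorithm="StringSimilarity" dictionary="EnglishPhonetic"/>
  </methodology>
</search>
"""

config = ET.fromstring(SEARCH_CONFIG.strip())

# Root dictionaries: which loader supplies each raw word set.
roots = {r.get("name"): r.get("loader") for r in config.findall("root")}

# Algorithm dictionaries: which root word set each one is built from.
dictionaries = {d.get("name"): d.get("root") for d in config.findall("dictionary")}

# Rules inside methodologies: (algorithm, dictionary) pairs.
rules = [(r.get("algorithm"), r.get("dictionary")) for r in config.iter("rule")]
```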
Loaders
- Load a set of words for use in dictionaries
- Used to create root dictionaries (<root> in the configuration file)
- Word sets returned by loaders are not cached, but instead used to create algorithm dictionaries
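
A loader in this sense might look like the sketch below, written in Python for illustration. The class names, the `load()` method, and the one-word-per-line input format are assumptions; `ListLoader` merely stands in for a database-backed loader by wrapping rows that were already fetched.

```python
import io
from typing import Iterable, TextIO


class TextFileLoader:
    """Hypothetical loader: reads one word per line from a text stream."""

    def __init__(self, stream: TextIO):
        self.stream = stream

    def load(self) -> set[str]:
        # Normalize case and skip blank lines; the result is not cached here.
        return {line.strip().lower() for line in self.stream if line.strip()}


class ListLoader:
    """Stand-in for a database-backed loader: wraps rows already fetched."""

    def __init__(self, rows: Iterable[str]):
        self.rows = rows

    def load(self) -> set[str]:
        return {row.strip().lower() for row in self.rows if row.strip()}


# Made-up sample data for illustration.
english = TextFileLoader(io.StringIO("Attorney\nlawyer\n\nLegal\n")).load()
business = ListLoader(["Acme Plumbing", "lawyer"]).load()
root_words = english | business   # combined word set for a root dictionary
```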
Formatters
- Provide a dictionary specialized for use with a specific algorithm
- Created by <dictionary> tags in the configuration file
- Dictionaries created by formatters are cached for use between application sessions
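
The "build once, then reuse from cache" behavior could be sketched as follows in Python. The JSON cache format, the file name, and the function names are assumptions made for illustration; the slides do not say how the cache is stored.

```python
import json
import tempfile
from pathlib import Path


def normal_format(words):
    """Hypothetical 'normal' formatter: sorted, lower-cased word list."""
    return sorted({w.lower() for w in words})


def load_or_build(cache_file: Path, words, formatter):
    """Return (dictionary, came_from_cache), reading the cache when it exists."""
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8")), True
    built = formatter(words)
    cache_file.write_text(json.dumps(built), encoding="utf-8")
    return built, False


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "english.normal.json"
    # First call formats the words and writes the cache file.
    first, first_cached = load_or_build(cache, ["Legal", "attorney"], normal_format)
    # Second call ignores its (empty) word set and reads the cache instead.
    second, second_cached = load_or_build(cache, [], normal_format)
```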
Parsers
- Split a search string up into a number of terms
- For a given rule, the algorithm is applied to each term supplied by the parser
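
A minimal parser along these lines, sketched in Python, might split on anything that is not a letter, digit, or apostrophe. The tokenization rule and both function names are assumptions; the project's parsers may treat phrases or punctuation differently.

```python
import re


def parse_terms(search: str) -> list[str]:
    """Split a search string into lower-cased terms."""
    return re.findall(r"[a-z0-9']+", search.lower())


def apply_rule(search: str, algorithm) -> dict[str, list[str]]:
    """Run `algorithm` (any callable: term -> suggestions) on every term."""
    return {term: algorithm(term) for term in parse_terms(search)}


terms = parse_terms("Atourney, Rochester NY")
```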
Data Flow
[Diagram. Components shown: BusinessNames, EnglishWords, Text-File Loader, Database Loader, Cache File, Normal Formatter, Phonetic Formatter, Dictionary Configuration, Methodology, Parser, Algorithm, Dictionary, TCN Website, Search Phrase, Suggestions.]
Algorithms – String Similarity
- Calculates number of operations to go from one word to another
  - Insertion, Deletion, Substitution
- Few operations → Good suggestion
- Extra features
  - Swapping operation
  - Operation weighting
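
As a rough illustration of these ideas, the Python sketch below computes an edit distance with insertion, deletion, substitution, and adjacent-character swaps, each with its own weight. It uses the optimal string alignment variant; the function name and the default weights of 1 are assumptions, not values from the project.

```python
def edit_distance(source: str, target: str,
                  insert=1, delete=1, substitute=1, swap=1) -> int:
    """Weighted edit distance with adjacent-character swaps.

    Optimal string alignment variant; the weights are hypothetical defaults.
    """
    rows, cols = len(source) + 1, len(target) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * delete              # delete every source character
    for j in range(1, cols):
        d[0][j] = j * insert              # insert every target character
    for i in range(1, rows):
        for j in range(1, cols):
            same = source[i - 1] == target[j - 1]
            d[i][j] = min(
                d[i - 1][j] + delete,
                d[i][j - 1] + insert,
                d[i - 1][j - 1] + (0 if same else substitute),
            )
            # Swap: the last two characters appear in reversed order.
            if (i > 1 and j > 1
                    and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + swap)
    return d[-1][-1]
```

With the default weights, `edit_distance("atourney", "attorney")` is 2 (one insertion and one deletion) and `edit_distance("teh", "the")` is 1 (one swap). Raising a weight makes that kind of error count for more when ranking suggestions.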
Algorithms – String Similarity
- Complexity of O(s1*s2)
  - s1, s2: lengths of strings being compared
- Can be improved to O(s1*k)
  - k is edit distance

          w  o  r  D
       0  1  2  3  4
    W  1  0  1  2  3
    A  2  1  2  3  4
    r  3  2  2  2  2
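
One common way to get the O(s1*k) behavior is to fill only the cells within k of the main diagonal and give up once the distance must exceed k. The Python sketch below shows that idea for plain Levenshtein distance with unit costs; it is an illustration under those assumptions, not the algorithm from the references or the project, and the function name is hypothetical.

```python
def bounded_distance(s1: str, s2: str, k: int):
    """Return the Levenshtein distance if it is <= k, else None.

    The inner loop visits at most 2k + 1 cells per row. For simplicity this
    sketch still allocates full-length rows.
    """
    if abs(len(s1) - len(s2)) > k:
        return None                      # lengths alone rule out a match
    big = k + 1                          # stands for "more than k"
    prev = [j if j <= k else big for j in range(len(s2) + 1)]
    for i in range(1, len(s1) + 1):
        lo, hi = max(1, i - k), min(len(s2), i + k)
        cur = [big] * (len(s2) + 1)
        if i <= k:
            cur[0] = i                   # first column is inside the band
        for j in range(lo, hi + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost, big)
        if min(cur) > k:
            return None                  # every path already exceeds k
        prev = cur
    return prev[-1] if prev[-1] <= k else None
```

For example, `bounded_distance("atourney", "attorney", 2)` returns 2, while the same call with `k=1` returns `None` because the true distance exceeds the limit.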
Algorithms – Phonetic
- Several rules used to parse English words into a sequence of phonetic sounds
  - Example: Phonetic → pntk
- Parse dictionary, parse search term
- String similarity comparison
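
The slides do not list the actual phonetic rules, so the Python sketch below uses a deliberately simplified rule set chosen only because it reproduces the slide's example: drop vowels (including "y") and "h", map "c" to "k", and collapse repeated letters. The rules and function names are assumptions. For brevity it matches keys exactly rather than running a string similarity comparison on them.

```python
from collections import defaultdict

VOWELS_AND_H = set("aeiouyh")


def phonetic_key(word: str) -> str:
    """Simplified, hypothetical phonetic key (not the project's rule set)."""
    key = []
    for ch in word.lower():
        if not ch.isalpha() or ch in VOWELS_AND_H:
            continue                      # drop vowels, 'h', and non-letters
        if ch == "c":
            ch = "k"                      # crude: ignores soft 'c'
        if not key or key[-1] != ch:      # collapse repeated letters
            key.append(ch)
    return "".join(key)


def build_phonetic_index(words):
    """Parse the dictionary once: phonetic key -> words sharing that key."""
    index = defaultdict(list)
    for word in words:
        index[phonetic_key(word)].append(word)
    return index


index = build_phonetic_index(["attorney", "phonetic", "lawyer"])
matches = index.get(phonetic_key("atourney"), [])
```

Under these rules "Phonetic" becomes "pntk", and both "atourney" and "attorney" reduce to "trn", so `matches` contains "attorney".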
Deliverable Schedule
Iteration 1: February 1st 2005
- Complete system design for system iterations 1-3
- Instructions for installation and integration with TCN client software
- Research
  - Analysis of historic search strings and business names from TCN
  - Dictionaries (common words)
  - Word search algorithms
- Basic System Implementation
- Database integration
- Testing
Deliverable Schedule
Iteration 2: February 18th 2005
- Suggest replacements for words not in the dictionary
- Addition of a new search algorithm to provide more intelligent searches
  - Closest Match
- Using multiple dictionaries
- Unit Testing for all written code
Deliverable Schedule
Iteration 3: March 21st 2005
- Phonetic Matching
- Dynamically add words/phrases to the dictionary
- Support phrase searching
- Addition of further search algorithms
- GUI Configuration tool
- Algorithm Optimization
Metrics
- Schedule/estimation accuracy
  - Estimation accuracy (hours per task)
  - Slippage percentages
- Defect statistics and analysis
  - Severity and complexity of defects
  - Defect source tracking
  - Average age of defects
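
The slides do not define how these metrics were calculated. One plausible reading, sketched in Python with made-up sample values, treats slippage as the overrun relative to the estimate and defect age as the number of days since the defect was opened. Both formulas and all numbers below are assumptions for illustration.

```python
from datetime import date


def slippage_percent(estimated_hours: float, actual_hours: float) -> float:
    """Overrun relative to the estimate; positive means the task ran long."""
    return (actual_hours - estimated_hours) / estimated_hours * 100


def average_defect_age(opened: list[date], today: date) -> float:
    """Mean age in days of the given defects as of `today`."""
    return sum((today - d).days for d in opened) / len(opened)


# Made-up sample values, for illustration only.
slip = slippage_percent(estimated_hours=8, actual_hours=10)
age = average_defect_age([date(2005, 2, 1), date(2005, 2, 5)], date(2005, 2, 11))
```

With these sample values, `slip` is 25.0 (percent) and `age` is 8.0 (days).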
Age of Known Defects
[Chart: Age of Defects (Open and Closed). x-axis: Age in Days (0 to 10); y-axis: Number of Known Defects (0 to 5).]
Severity of Defects
[Chart: Defects by Severity Index (1-5). Axes: Severity (Trivial, Low, Moderate, High, Critical); Number of Defects (0 to 6).]
Complexity of Defects
[Chart: Defects by Complexity Index (1-5). Axes: Complexity (Trivial, Low, Moderate, High, Critical); Number of Defects (0 to 6).]
Sources of Defects
[Chart: Defects by Source. Axis: Number of Defects (0 to 6).]
Research References
“Approximate String Matching” by Ricardo Baeza-Yates at University of Chile
“A Guided Tour to Approximate String Matching” by Gonzalo Navarro at University of Chile, 2001
“An Extension of Ukkonen’s Enhanced Dynamic Programming ASM Algorithm” by Hal Berghel (U of Arkansas) and David Roach (Acxiom Corp.), 1996