Project Overview
• Motivation: Email overload
• Potential solution: Automatic categorization and management techniques
• Problem: The potential solution is very experimental. Email use and user interaction are difficult to model, requiring a prototype that users can try on actual email
• The purpose of this work is to present a Microsoft Outlook 2000™ add-in that:
  – Can be used as a first step toward more experimental research into automatic email management techniques
  – Helps manage the inbox via classification and relevance-based search
What’s the Problem with Email?
• Too much
• 6/26/2001 USA Today
  – “Workers polled this year by market researcher Gartner spent an average of 49 minutes a day on e-mail, 30% to 35% more time than they did a year ago. Ferris Research estimates management-level workers will spend four hours a day on e-mail by 2002.”
Solutions?
• Educate users
  – Don’t send so much mail, don’t subscribe to lists
• Use technology in some way
  – Current efforts are toward some type of classification system that learns
[Figure: Training: the system learns what email belongs to the folder “Conferences”, which holds emails regarding conferences. A new SIGIR email is classified into “Conferences”; a new Miss Cleo email is classified into “Trash”.]
This Project
• An architecture for exploring automatic email management techniques
• Built on Outlook 2000
  – Primary code in Visual Basic
    • Produces DLL add-in for Outlook
  – Visual C++ DLL component
    • Hashes strings to longs (logical operators not available in VB)
    • Referenced from VB
  – Not tested with Outlook 2002!
Architectural Overview
[Figure: architecture diagram with these labeled components: Outlook; Outlook Object Model and Events; C++ Helper DLL (Hash Strings); VB Add-In DLL; Outlook / Class Interface Glue; Folder Class (AddMsg(), GetMessages via Dictionary, CompareMsg()); Message Class (AddTerms(), Display(), Get Vals, CompareMsg()).]
Add-In Interface : Messages
• Message Class
  – Mail folders are scanned on startup and a class instance is created for each mail item (except in Trash and Sent Items).
  – Message text is tokenized and stoplisted, drawing on these fields:
    • Sender
    • Recipients
    • Subject
    • Text body (possible to use more fields if desired)
  – Text tokens are hashed to 32-bit longs to save space and greatly speed up token comparison
    • Hash function by Bob Jenkins
    • 2 collisions on 87111 dictionary words
    • 10x faster to compare longs vs. strings via strcmp on Pentium II
  – CompareMsg function computes similarity between two email messages
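
The tokenize, stoplist, and hash steps above can be sketched as follows. This is only an illustration in Python, not the add-in's VB/C++ code. The slides do not say which of Bob Jenkins's hash functions was used, so the sketch assumes his one-at-a-time hash; the stoplist, the tokenizing regular expression, and the name `add_terms` are likewise hypothetical.

```python
import re
from collections import Counter

# Hypothetical stoplist; the add-in's actual list is not given in the slides.
STOPLIST = {"a", "an", "and", "the", "of", "to", "in", "for", "is", "re"}

def jenkins_one_at_a_time(token):
    """Bob Jenkins's one-at-a-time hash, kept within 32 bits (assumed variant)."""
    h = 0
    for byte in token.encode("utf-8"):
        h = (h + byte) & 0xFFFFFFFF
        h = (h + (h << 10)) & 0xFFFFFFFF
        h ^= h >> 6
    h = (h + (h << 3)) & 0xFFFFFFFF
    h ^= h >> 11
    h = (h + (h << 15)) & 0xFFFFFFFF
    return h

def add_terms(sender, recipients, subject, body):
    """Tokenize the four fields, drop stopwords, and count the hashed tokens."""
    text = " ".join([sender, recipients, subject, body]).lower()
    tokens = re.findall(r"[a-z0-9]+", text)
    return Counter(jenkins_one_at_a_time(t) for t in tokens if t not in STOPLIST)

terms = add_terms("alice@example.com", "bob@example.com",
                  "The SIGIR deadline", "the deadline for SIGIR is near")
```

Storing each message as a bag of integers means later comparisons are integer equality checks rather than string comparisons, which is the saving the slide describes.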
Add-In Interface : Folders
• Folder Class
  – User-created mail folders are scanned on startup and a folder instance is created for each mail folder (except Trash and Sent Items).
  – Messages that the user has placed in each folder are added to the folder’s classifier for training
  – CompareMsg function computes similarity between a new message and the classifier for the folder
    • i.e., it can be used to classify a new message into folders
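
A minimal sketch of how a folder-level CompareMsg could drive classification, written in Python for illustration rather than in the add-in's Visual Basic. The similarity measure (count of shared term occurrences) and the helper names `term_overlap` and `classify` are assumptions; the slides name only AddMsg and CompareMsg.

```python
from collections import Counter

def term_overlap(a, b):
    """Assumed similarity: shared term occurrences (sum of per-term minimum counts)."""
    return sum((a & b).values())

class Folder:
    def __init__(self, name):
        self.name = name
        self.messages = []          # one term-count bag per training message

    def add_msg(self, terms):
        self.messages.append(terms)

    def compare_msg(self, terms):
        """Nearest-neighbor score: best match against any message in the folder."""
        return max((term_overlap(terms, m) for m in self.messages), default=0)

def classify(terms, folders):
    """Return the name of the folder whose nearest message is most similar."""
    return max(folders, key=lambda f: f.compare_msg(terms)).name

conferences = Folder("Conferences")
conferences.add_msg(Counter("sigir paper deadline".split()))
trash = Folder("Trash")
trash.add_msg(Counter("free psychic hotline reading".split()))

best = classify(Counter("sigir deadline extended".split()), [conferences, trash])
```

Here the new message shares two terms with the conference message and none with the junk example, so it is assigned to “Conferences”.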
Classifier Implementation
• CompareMsg
  – The goal of this project is to experiment with different classifiers and algorithms as the implementation of CompareMsg to find out what works and what doesn’t
  – A simple classification scheme is implemented for now
    • Nearest Neighbor, common terms & frequencies
  – Other schemes that have been examined in the past:
    • TF-IDF, Neural Networks, Bayesian, Rule Induction, SVM
• What should the classifier do when new email arrives?
  – Some options
    • Move new email directly to classified folder
    • Annotate email with a category tag
Classifier Usage Challenges
• In previous work, we built a proprietary rule induction and tf-idf classifier into Outlook and GroupWise that classified messages into categories. It was tested on managers and developers.
• Problems we encountered were usage-driven:
  1. The need for constant re-training to keep up with dynamically changing categories.
  2. Classification errors are puzzling and instill distrust in users.
  3. Insufficient data may be available as training examples.
  4. It is difficult for a user to examine or manually edit a classifier.
Challenge 1: Categories Change
• Common for categories to change over time; “Topic Drift” as in Newsgroups
  – Project ends or changes direction
  – Conversation slowly changes topics
  – General discussion might turn more technical
• Problems for learning algorithms
  – Classifiers need to be re-trained; how well can they handle it? How fast is it?
    • Our users were willing to wait seconds, not minutes
    • Most classifiers are not incremental; require re-training using all positive/negative examples, not just new ones
    • Often too slow for many algorithms (e.g. rule induction)
  – Vector-based classifiers
    • Fast to re-train but may have problems with threshold calculations or new vocabulary not in the vector
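
To illustrate why a vector-based classifier can be cheap to re-train, here is a hedged Python sketch of a centroid that is updated one message at a time. The class and method names are hypothetical, and this is not the classifier from the earlier work. Storing the vector as a sparse mapping is a design choice of this sketch that lets unseen vocabulary be added on the fly; a fixed-length vector would not allow that.

```python
from collections import Counter

class CentroidClassifier:
    """Running mean of term-count vectors, updated incrementally."""

    def __init__(self):
        self.term_totals = Counter()   # summed counts over all training messages
        self.num_messages = 0

    def add_example(self, terms):
        # Cost is proportional to the size of the new message only;
        # earlier examples are never revisited.
        self.term_totals.update(terms)
        self.num_messages += 1

    def centroid(self):
        return {t: c / self.num_messages for t, c in self.term_totals.items()}

folder = CentroidClassifier()
folder.add_example(Counter({"budget": 2, "report": 1}))
folder.add_example(Counter({"budget": 1, "sigir": 1}))   # "sigir" is new vocabulary
center = folder.centroid()
```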
Challenge 2: Classifiers Make Errors, Destroy User Trust
• Users tolerate few errors
• Want immediate corrections so the same error won’t happen again
  – Vector classifier may require several examples before centroid shifts enough to include similar message
  – Rule classifiers need explicit retrain
• Classification errors are inevitable
  – Classifier may over-generalize or be too specific
  – Errors could “break” users’ hard work setting up a folder
  – In some cases it’s more work to fix errors than the savings the tool is intended to provide!
• Trust is easy to lose, users abandon the system
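
The point about centroids shifting slowly can be made concrete with a small worked example. The Python sketch below uses made-up data: a folder trained on ten messages about “budget” and “report”, a corrected message about “sigir” and “paper”, and an assumed acceptance threshold of 0.5 on cosine similarity. None of these numbers come from the original system.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def corrections_needed(folder_total, message, threshold, limit=100):
    """Count how many times the message must be added before it is accepted."""
    total = Counter(folder_total)      # cosine ignores scale, so sums stand in for the mean
    for k in range(limit + 1):
        if cosine(total, message) >= threshold:
            return k
        total.update(message)          # the user files one more corrected example
    return None

folder_total = Counter({"budget": 10, "report": 10})   # ten earlier messages
message = Counter({"sigir": 1, "paper": 1})
needed = corrections_needed(folder_total, message, threshold=0.5)
```

Under these assumptions the similarity after k corrections is k / sqrt(100 + k²), so the user has to make the same correction six times before the folder accepts that kind of message.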
Challenge 3: Insufficient Data Available
• Many classifiers require a large amount of training data, e.g. statistical-based classifiers
  – May not have enough email available
  – Users expect system to work well given only 6-12 training examples
  – Effort to find more examples typically too high
  – One solution: Bootstrap using data in existing folders
    • What about negative examples? Can be problematic for some classification algorithms
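
One way the bootstrap idea could look in code is sketched below in Python. Treating messages filed in other folders as negative examples is an assumption of this sketch, not something the slides prescribe, and the function name is hypothetical. A user with a single folder would get no negatives at all, which is exactly the difficulty the slide raises.

```python
def bootstrap_training_sets(folders):
    """Map each folder name to (positives, negatives) built from existing mail.

    Positives are the folder's own messages. Negatives are, by assumption,
    every message the user filed somewhere else.
    """
    sets = {}
    for name, messages in folders.items():
        negatives = [m for other, msgs in folders.items()
                     if other != name for m in msgs]
        sets[name] = (list(messages), negatives)
    return sets

existing = {
    "Conferences": ["SIGIR call for papers", "Camera-ready deadline"],
    "Work": ["Budget report"],
}
training = bootstrap_training_sets(existing)
```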
Challenge 4: Model Editing and Understanding
• Some users want to manually fix or edit the classifier
  – These are naïve users, not programmers!
• Easy to understand, modify
  – Rule-based classifiers
• More difficult
  – Vector classifiers, may have many keywords
• Very difficult
  – Neural Network
  – SVM
Current Implementation
• Publicly available source, binaries for open development purposes
• Simple nearest-neighbor classifier for Folders
  – Speed, easy to train and classify
  – May help classify user-created folders that really encompass multiple sub-folders (e.g. “work” where there are many work projects) better than classification techniques that rely on global data
    • Individual term frequencies of sub-folder topics will be low
    • But message-to-message comparison may be high
  – Don’t need negative examples
• Tag messages with category rather than move into a folder
  – Hopefully not too critical when misclassifications occur
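
The argument for nearest neighbor on folders that mix several sub-topics can be checked with a toy example. The Python sketch below uses invented messages and cosine similarity as an assumed measure. It compares a new message against a “work” folder in two ways: against the folder-wide term totals (global data) and against the single closest message.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# A hypothetical "work" folder covering three unrelated projects.
work = [
    Counter("alpha budget forecast".split()),
    Counter("beta compiler bug".split()),
    Counter("gamma travel booking".split()),
]
new_message = Counter("beta compiler crash".split())

folder_totals = sum(work, Counter())           # global term frequencies
global_score = cosine(new_message, folder_totals)
nearest_score = max(cosine(new_message, m) for m in work)
```

In this toy case the nearest-neighbor score is about 0.67 while the folder-wide score is about 0.38, because the terms of any one project are diluted by the other projects in the folder totals.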
Current Implementation : User Interface
Upon startup of Outlook: scan Outlook folders, create classifiers and messages
View inbox grouped by category
Current Interface : New Email
New email is automatically classified into the best-matching folder (but not moved, only grouped)
Current Interface : Related Email
• Interface also supports finding other email similar to the current one
  – Iterate through all email message class objects invoking the comparison function
    • Simple term-frequency comparison of both emails for now
• Linear time, but not too bad
  – 300 of the author’s messages scanned per second on 400 MHz PII
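
The related-email search described above amounts to a linear scan with a ranking step. A minimal Python sketch follows, assuming a shared-term-count similarity and hypothetical names (`tf_overlap`, `related`); the add-in's actual comparison function may differ.

```python
import heapq
from collections import Counter

def tf_overlap(a, b):
    """Assumed term-frequency comparison: sum of per-term minimum counts."""
    return sum((a & b).values())

def related(current_id, messages, k=3):
    """Scan every other message once and return the ids of the k best matches."""
    current = messages[current_id]
    scored = ((tf_overlap(current, terms), mid)
              for mid, terms in messages.items() if mid != current_id)
    return [mid for score, mid in heapq.nlargest(k, scored) if score > 0]

messages = {
    "m1": Counter("sigir paper deadline extended".split()),
    "m2": Counter("sigir paper deadline".split()),
    "m3": Counter("sigir hotel booking".split()),
    "m4": Counter("free psychic reading".split()),
}
similar = related("m1", messages)
```

Each query touches every stored message once. If the reported rate of 300 messages per second held at larger sizes, a store of 3,000 messages would take roughly ten seconds per query.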
Current Interface: Related Email
Select a message, click on button
List of similar messages displayed, click to open
Comments on Personal Use
• No formal user studies performed yet
• But, I’ve been using it… some anecdotes:
  – Nearest Neighbor classifier OK, could be better
  – Would be useful to index trash or sent-items
    • If not indexed, there is no folder to classify into when junk mail arrives so it gets put somewhere else
    • Temporary solution: Make a “Trash” folder with examples
    • But indexing trash could be a lot of messages…
  – Grouping of incoming email useful?
    • Not really needed for frequent email reading
    • Useful when returning from a trip and need to triage the mail
  – Relevant email
    • Useful for finding uncoupled email threads
    • Sent-Items would be useful to index here
Lots of Work To Do
• Experiment with other classifiers
  – Need to see relation with users on training issues, speed, etc., not just classification accuracy
• Latch onto more events
  – Better mail detection, drag & drop events
• Clean up code implementation
  – Support persistence, speed issues on startup scan
  – Implementation issues
  – Compatibility with Outlook 2002, VB .NET
• Other forms of visualization / categorization
  – E.g., color, thread information, graphical techniques
• Extend to other forms of Outlook data
  – Calendaring, Notes, Files
Try It Out
• Source Code & Binaries available online
– http://www.math.uaa.alaska.edu/~afkjm/emailaddin/
– Only tested with Windows 2000 & Outlook 2000
– Feel free to use or modify code as you see fit
– Warning: Developer docs and code cleanup still need to be done!
• But I’ll be glad to answer any questions!