
Topes: Meeting the Challenges of User Input Validation Christopher Scaffidi Key collaborators: Brad Myers, Mary Shaw Carnegie Mellon University


Christopher Scaffidi

Key collaborators: Brad Myers, Mary Shaw

Carnegie Mellon University

Hurricane Katrina “Person Locator” site: Many inputs unvalidated... and error-ful

Introduction Challenges Topes Tools Evaluation Conclusion

Data errors reduce the usefulness of data.

Even little typos impede data de-duplication.

Age is not useful for flying my helicopter to come rescue you.

Nor is a “city name” with 1 letter.

Hurricane Katrina sites are not alone in lacking input validation.

• E.g.: Google Base web application
  – 13 primary web forms
  – Even numeric fields accept unreasonable inputs (such as a salary of “-45”)

• E.g.: Spreadsheets
  – 40% of cells are non-numeric, non-date textual data
  – Often used to gather/organize textual data for reports

Outline

1. Challenges of data validation

2. Topes
   • Model for describing data
   • Tools for creating/using topes

3. Evaluation

4. Conclusion

Digging into the details: real user inputs that need validation.

• Sources:
  – Interviews of Hurricane Katrina website creators
  – Survey of Information Week readers
  – Contextual inquiry of information workers who created and used websites
  – Logs of what admin assistants typed into browsers
  – Exploration of the EUSES spreadsheet corpus

• Validating user inputs has 3 primary challenges…

1. Inputs don’t always conform well to the simple “binary” validation model.

• Data is sometimes questionable… yet valid.
  – E.g.: a suspiciously long email address
  – In practice, person names and other proper nouns are never validated with regexps… too brittle.
  – Life is full of corner cases and exceptions.

• If code can identify questionable data, then it can double-check the data:
  – Ask an application end user to confirm the input
  – Flag the input for checking by a system administrator
  – Compare the value to a list of known exceptions
  – Call up a server and see if it can confirm the value

2. User inputs often can occur in multiple different formats.

• Two different strings can be equivalent.
  – How many ways can you write a date?
  – What if an end user types a date in the wrong format?
  – “Jan-1-2007” and “1/1/2007” mean the same thing because of the category that they are in: date.
  – Sometimes the interpretation is ambiguous. In real life, preferences and experience guide interpretation.

• If code can transform among formats (i.e., not just recognize formats with regexps), then it can put data in an unambiguous format as needed.
  – Display result so users can check/fix interpretation
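A minimal sketch of what “transform among formats” could look like for the two date strings above. The function name `toIsoDate`, the YYYY-MM-DD target format, and the month/day/year reading of “1/1/2007” are assumptions made for illustration; they are not part of the tope tools.

```javascript
// Hypothetical sketch: recognize two date formats and rewrite both into one
// unambiguous target format (YYYY-MM-DD). Assumes US month/day/year ordering.
const MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
                "jul", "aug", "sep", "oct", "nov", "dec"];

function pad2(n) {
  return String(n).padStart(2, "0");
}

function toIsoDate(input) {
  const text = input.trim();

  // Format A: "Jan-1-2007"
  let m = /^([A-Za-z]{3})-(\d{1,2})-(\d{4})$/.exec(text);
  if (m) {
    const month = MONTHS.indexOf(m[1].toLowerCase()) + 1;
    if (month === 0) return null; // unknown month abbreviation
    return `${m[3]}-${pad2(month)}-${pad2(Number(m[2]))}`;
  }

  // Format B: "1/1/2007" (read here as month/day/year)
  m = /^(\d{1,2})\/(\d{1,2})\/(\d{4})$/.exec(text);
  if (m) {
    return `${m[3]}-${pad2(Number(m[1]))}-${pad2(Number(m[2]))}`;
  }

  return null; // not in any format this sketch knows
}
```

Both `toIsoDate("Jan-1-2007")` and `toIsoDate("1/1/2007")` yield the same string, which could then be displayed so the user can check or fix the interpretation.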

3. The meaning of data is often tied to its “parts”, not directly to its characters.

• Data often has parts, each with a meaning.
  – What are the parts of a date, 12/31/2008?
  – Valid data obeys intra- and inter-part constraints.
  – Constraints are usually platform-independent
  – Writing regexps requires you to translate constraints into a character sequence… tough in many cases, practically or truly impossible in others.

• If code could succinctly state the parts, as well as mandatory and optional constraints on the parts, wouldn’t the code be easier to write and maintain?
  – Especially if it was platform-independent!
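To illustrate the difference between constraints on parts and constraints on characters, here is a rough sketch for the date example. The names `checkDate` and `daysInMonth` and the month/day/year layout are assumptions; the point is only that the inter-part rule (the allowed day depends on month and year) is easy to state on parts and awkward to state in a regexp.

```javascript
// Hypothetical sketch: validate "MM/DD/YYYY" by naming its parts and stating
// constraints on them, rather than encoding everything in one regexp.
function daysInMonth(month, year) {
  const isLeap = (year % 4 === 0 && year % 100 !== 0) || year % 400 === 0;
  const days = [31, isLeap ? 29 : 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31];
  return days[month - 1];
}

function checkDate(text) {
  // The regexp is used only to split the string into parts.
  const m = /^(\d{1,2})\/(\d{1,2})\/(\d{4})$/.exec(text.trim());
  if (!m) return { ok: false, reason: "does not split into month/day/year" };
  const parts = { month: Number(m[1]), day: Number(m[2]), year: Number(m[3]) };

  // Intra-part constraints: each part must be in range on its own.
  if (parts.month < 1 || parts.month > 12) {
    return { ok: false, reason: "month out of range" };
  }
  if (parts.day < 1) {
    return { ok: false, reason: "day out of range" };
  }

  // Inter-part constraint: the allowed day depends on the month and year.
  if (parts.day > daysInMonth(parts.month, parts.year)) {
    return { ok: false, reason: "day too large for this month" };
  }
  return { ok: true, parts };
}
```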

Limitations of existing approaches

• Types do not support questionable values

• Grammars do not, either, nor can they reformat

• Information extraction algorithms rely on grammatical cues that are absent during validation

• Cues, Forms/3, -calculus, Slate, pollution markers, etc, infer numerical constraints but not constraints on strings, nor are they platform-independent

Imagine a world where…

• Code can ask an oracle, “Is this a company name?”, and the oracle replies yes, no, almost definitely, probably not, and other shades of gray.

• Code allows input in any reasonable format, since the code can ask the oracle to put the input into the format that is actually needed.

• People teach the oracle about a new data category by concisely stating its parts and constraints.

New Approach: Topes

• A tope = a platform-independent abstraction describing how to recognize and transform strings in one category of data

• Greek word for “place,” because each corresponds to a data category with a natural place in the problem domain

• Validating with topes improves
  – Accuracy of validation
  – Reusability of validation code
  – Subsequent duplicate identification

A tope is a graph. Node = format, edge = transformation

Notional representation for a CMU room number tope…

• Formal building name & room number: Elliot Dunlap Smith Hall 225

• Colloquial building name & room number: Smith 225

• Building abbreviation & room number: EDSH 225
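The graph view can be sketched as a small data structure: formats as nodes, transformation functions as edges, and a search that chains edges to get from one format to another. Everything here is illustrative: the names `roomTope`, `viaTable`, and `transform` are made up, the lookup tables hold only the one building from the slide, and which formats are directly linked is an assumption.

```javascript
// Hypothetical sketch of a tope as a graph: nodes are formats, edges are
// transformations. The lookup tables hold just the one example building.
const FORMAL_TO_ABBREV = { "Elliot Dunlap Smith Hall": "EDSH" };
const FORMAL_TO_COLLOQUIAL = { "Elliot Dunlap Smith Hall": "Smith" };

function invert(table) {
  return Object.fromEntries(Object.entries(table).map(([k, v]) => [v, k]));
}

// Build an edge function that swaps the building name via a lookup table
// and keeps the room number unchanged.
function viaTable(table) {
  return (text) => {
    const m = /^(.*) (\d+)$/.exec(text);
    return m && table[m[1]] ? `${table[m[1]]} ${m[2]}` : null;
  };
}

const roomTope = {
  formats: ["formal", "colloquial", "abbreviation"],
  edges: [
    { from: "formal", to: "abbreviation", trf: viaTable(FORMAL_TO_ABBREV) },
    { from: "abbreviation", to: "formal", trf: viaTable(invert(FORMAL_TO_ABBREV)) },
    { from: "formal", to: "colloquial", trf: viaTable(FORMAL_TO_COLLOQUIAL) },
    { from: "colloquial", to: "formal", trf: viaTable(invert(FORMAL_TO_COLLOQUIAL)) },
  ],
};

// Breadth-first search over the format graph, applying each edge's trf
// to the value carried along the path.
function transform(tope, text, from, to) {
  const queue = [{ format: from, value: text }];
  const seen = new Set([from]);
  while (queue.length > 0) {
    const { format, value } = queue.shift();
    if (format === to) return value;
    for (const edge of tope.edges) {
      if (edge.from !== format || seen.has(edge.to)) continue;
      const next = edge.trf(value);
      if (next !== null) {
        seen.add(edge.to);
        queue.push({ format: edge.to, value: next });
      }
    }
  }
  return null; // no chain of transformations reaches the target format
}
```

With these assumed edges, `transform(roomTope, "Smith 225", "colloquial", "abbreviation")` goes through the formal format and returns "EDSH 225".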

A tope is a conceptual abstraction. A tope implementation is code.

• Each tope implementation has executable functions:
  – 1 $isa : string \to [0, 1]$ function per format, for recognizing instances of the format (a fuzzy set)
  – 0 or more $trf : string \to string$ functions linking formats, for transforming values from one format to another

• Validation function: $\varphi(str) = \max_f \, isa_f(str)$, where $f$ ranges over the tope’s formats
  – Valid when $\varphi(str) = 1$
  – Invalid when $\varphi(str) = 0$
  – Questionable when $0 < \varphi(str) < 1$
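The validation function can be sketched in a few lines, reusing the room-number example. The patterns, the set of known buildings, and the score of 0.5 for an unknown but plausible building are all assumptions; only the max-over-formats rule and the three outcomes come from the slide.

```javascript
// Hypothetical sketch: one isa function per format, each returning a score
// in [0, 1]; the validation function takes the maximum over all formats.
const KNOWN_ABBREVIATIONS = new Set(["EDSH"]);
const KNOWN_COLLOQUIAL = new Set(["Smith"]);

// Build an isa function: 1 for a known building, an assumed 0.5 for an
// unknown building that still matches the pattern, 0 otherwise.
function makeIsa(pattern, known) {
  return (s) => {
    const m = pattern.exec(s);
    if (!m) return 0;
    return known.has(m[1]) ? 1 : 0.5;
  };
}

const roomFormats = [
  { name: "abbreviation", isa: makeIsa(/^([A-Z]{2,5}) (\d{1,4})$/, KNOWN_ABBREVIATIONS) },
  { name: "colloquial", isa: makeIsa(/^([A-Z][a-z]+) (\d{1,4})$/, KNOWN_COLLOQUIAL) },
];

// Validation function: the maximum isa score over the tope's formats.
function phi(formats, s) {
  return Math.max(...formats.map((f) => f.isa(s)));
}

function classify(formats, s) {
  const score = phi(formats, s);
  if (score === 1) return "valid";
  if (score === 0) return "invalid";
  return "questionable";
}
```

Under these assumptions, "EDSH 225" is valid, an unlisted abbreviation such as "WEH 5000" is questionable, and a string matching no format is invalid.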

Common kinds of topes: enumerations and proper nouns

• Multi-format enumerations, e.g.: US states
  – “New York”, “CA”, maybe “Guam”

• Open-set proper nouns, e.g.: company names
  – Whitelist of definitely valid names (“Google”), with alternate formats (e.g. “Google Corp”, “GOOG”)
  – Augmented with a pattern for promising inputs that are not yet on the whitelist
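A rough sketch of the whitelist-plus-pattern idea for company names. The whitelist entries are the ones on the slide; the "promising" pattern (capitalized words) and the 0.5 score are assumptions, as are the function names.

```javascript
// Hypothetical sketch of an open-set proper-noun tope for company names.
// Each whitelist entry maps an accepted spelling to one canonical name.
const WHITELIST = new Map([
  ["Google", "Google"],
  ["Google Corp", "Google"],
  ["GOOG", "Google"],
]);

// Assumed pattern for promising names: one or more capitalized words.
const PROMISING = /^[A-Z][A-Za-z0-9&.'-]*( [A-Z][A-Za-z0-9&.'-]*)*$/;

function isaCompany(s) {
  const text = s.trim();
  if (WHITELIST.has(text)) return 1;       // definitely valid
  if (PROMISING.test(text)) return 0.5;    // questionable: not yet on the list
  return 0;                                // invalid
}

// Transformation to the canonical name; only possible for whitelisted inputs.
function canonicalCompany(s) {
  const text = s.trim();
  return WHITELIST.has(text) ? WHITELIST.get(text) : null;
}
```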

Two other common kinds of topes: numeric and hierarchical

• Numeric, e.g.: human masses
  – Numeric and in a certain range
  – Values slightly outside range might be questionable
  – (Very rarely) labeled with an explicit unit
  – Transformation usually by multiplication

• Hierarchical, e.g.: address lines
  – Parts described with other topes (e.g.: “100 Main St.” uses a numeric, a proper noun, and an enum)
  – Simple isas can be implemented with regexps.
  – Transformations involve permutation of parts, changes to separators, arithmetic, and lookup tables.
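A small sketch of a numeric tope for human masses. The range limits, the width of the "questionable" band, and the 0.5 score are assumptions chosen only to show the shape of the idea; the kilogram-per-pound factor is the standard definition.

```javascript
// Hypothetical sketch of a numeric tope for human masses in kilograms.
const KG_PER_LB = 0.45359237;

function isaMassKg(s) {
  // A number, (very rarely) labeled with an explicit unit.
  const m = /^(\d+(\.\d+)?)( ?kg)?$/.exec(s.trim());
  if (!m) return 0;
  const kg = Number(m[1]);
  if (kg >= 1 && kg <= 200) return 1;    // assumed typical range: valid
  if (kg > 0 && kg <= 300) return 0.5;   // assumed band just outside: questionable
  return 0;                              // far outside the range: invalid
}

// Transformations between unit formats are plain multiplication.
function lbToKg(lb) {
  return lb * KG_PER_LB;
}

function kgToLb(kg) {
  return kg / KG_PER_LB;
}
```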

Tope Development Environment (TDE)

• Topei module: infers tope from examples

• Toped module: enables EUPs (end-user programmers) to create/edit topes

• Topeg module: generates context-free grammars and transformations

• Topep module: parses data against grammars, performs transformations

• Plug-ins: read/write program data
  – Robofox: web macros
  – Vegemite/CoScripter: web macros
  – Microsoft Excel: spreadsheets
  – Visual Studio.NET: web applications

Toped User Interface

Features
• Format inference
• Format/part names
• Soft constraints
• Testing features
• Format reusability

User Study: EUPs are fast & accurate at creating tope formats

Integration with programming platforms

• Microsoft Excel: buttons and menus

• Visual Studio: drag-and-drop code generation

Other integrations to date: CoScripter, Robofox, XML/HTML library

Evaluating accuracy, reusability, and usefulness for data cleaning

• Implemented topes for spreadsheet data
  – 32 topes based on 720 online spreadsheets
  – Tested accuracy

• Reused topes on web application data
  – 8 data categories in Google Base and 5 data categories in Hurricane Katrina site
  – Tested accuracy

• Used transformations to reformat data
  – 5 data categories in Hurricane Katrina site
  – Measured increase in number of duplicates identified

Extracting spreadsheet test data

• Cluster spreadsheet columns based on data category
  – EUSES spreadsheet corpus “database” section
  – Hierarchical agglomerative clustering
  – Manual inspection
  – Result = 1713 columns in 246 clusters (1 cluster per data category)

• Created 1 tope for each of 32 most common categories
  – Yielding 32 topes
  – Covered 70% of clustered columns

We considered 5 validation strategies

• Strategy 1: Current spreadsheet practice (accept all inputs)

• Strategy 2: Current webapp practice (validate with regexp or fixed list, when available; accept all other inputs)
  – 36 regexps + 35 fixed lists, in 7 categories

• Strategy 3A: Tope rejecting questionable (accept when $\varphi(str) = 1$)

• Strategy 3B: Tope accepting questionable (accept when $\varphi(str) > 0$)

• Strategy 4: Tope warn on questionable (simulate double-check by user when $0 < \varphi(str) < 1$)
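The tope-based strategies differ only in how they treat a score strictly between 0 and 1, which a short sketch makes concrete. The function names and the `userConfirms` callback (standing in for the simulated double-check) are assumptions.

```javascript
// Hypothetical sketch of the acceptance rules, given a validation score in [0, 1].

// Strategy 1: current spreadsheet practice, accept everything.
function acceptStrategy1() {
  return true;
}

// Strategy 3A: reject questionable inputs (accept only a score of exactly 1).
function acceptStrategy3A(score) {
  return score === 1;
}

// Strategy 3B: accept questionable inputs (accept any score above 0).
function acceptStrategy3B(score) {
  return score > 0;
}

// Strategy 4: ask for a double-check only when the input is questionable.
// `userConfirms` is an assumed callback that returns true or false.
function acceptStrategy4(score, userConfirms) {
  if (score === 1) return true;
  if (score === 0) return false;
  return userConfirms();
}
```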

Measurements

• Based on 100 random values per category

• Used F1 to measure accuracy
  – Standard measure of accuracy for classifiers: $F_1 = \frac{precision \cdot recall}{avg(precision, recall)} = \frac{2 \cdot precision \cdot recall}{precision + recall}$

• Considered topes with 1, 2, 3, 4, or 5 formats
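For reference, the F1 measure above can be computed as follows. This sketch assumes "invalid input" is the positive class (so recall is the fraction of invalid inputs that were caught); the function name and the zero-division handling are illustrative choices.

```javascript
// Sketch: F1 for a validator, treating "invalid input" as the positive class
// (an assumption made for this example).
function f1Score(truePositives, falsePositives, falseNegatives) {
  const flagged = truePositives + falsePositives;       // inputs the validator rejected
  const actuallyInvalid = truePositives + falseNegatives;
  const precision = flagged === 0 ? 0 : truePositives / flagged;
  const recall = actuallyInvalid === 0 ? 0 : truePositives / actuallyInvalid;
  if (precision + recall === 0) return 0;
  // Equivalent to (precision * recall) / avg(precision, recall).
  return (2 * precision * recall) / (precision + recall);
}
```

A validator that accepts every input never flags an invalid one, so its recall and therefore its F1 are both 0 under this convention.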

Recognizing multiple formats and questionable inputs raises accuracy

Condition 4: Hypothetical user has to help on ~ 3% of inputs

Condition 1: Recall = 0 (fails to identify any invalid inputs)

Topes based on spreadsheet data were accurate on web application data.

(Chart panels: Google Base; Hurricane Katrina)

Putting data in a consistent format improves duplicate identification.

• Randomly extracted 10000 values for each of 5 Hurricane Katrina data categories

• Implemented transformations for each 5-format tope from the less commonly used formats to the most commonly used

• Found approximately 8% more duplicates after transformation
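The effect of transformation on duplicate identification can be illustrated with a toy example. The data, the two-entry lookup table, and the function names are made up for illustration and have nothing to do with the actual Hurricane Katrina values; the sketch only shows why putting values into one format exposes more duplicates.

```javascript
// Hypothetical sketch: count duplicates before and after transforming every
// value into the same format (here, US state abbreviations).
const STATE_ABBREVIATIONS = new Map([
  ["new york", "NY"],
  ["california", "CA"],
]);

// Transform full names and lowercase abbreviations into uppercase abbreviations.
function normalizeState(value) {
  const text = value.trim();
  const abbreviation = STATE_ABBREVIATIONS.get(text.toLowerCase());
  return abbreviation !== undefined ? abbreviation : text.toUpperCase();
}

// A value counts as a duplicate if an identical string was already seen.
function countDuplicates(values) {
  const seen = new Set();
  let duplicates = 0;
  for (const v of values) {
    if (seen.has(v)) duplicates += 1;
    else seen.add(v);
  }
  return duplicates;
}

const raw = ["New York", "NY", "ny", "CA", "California"];
const before = countDuplicates(raw);                     // exact string matches only
const after = countDuplicates(raw.map(normalizeState));  // matches after reformatting
```

In this toy list no two raw strings are identical, while after normalization three of the five values are recognized as duplicates.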

Conclusion: Topes improve data validation

• Validating with topes improves
  – Accuracy of validation
  – Reusability of validation code
  – Subsequent duplicate identification

• Contributions:
  – Support for ambiguous data categories
  – Support for transforming values
  – Platform-independent validation

Primary Limitations

Topes are only appropriate…

• For string-ish, categorical data
  – Not for validating images, audio files, etc.
  – Values must appear in a single field or variable
  – Validation rules derive from categorical constraints

• When validation rules are known by programmer
  – Who must label the field/variable with a tope
  – Who must implement the tope, which runs locally (future work…)

Future Work: Sharing topes


Future topes development/use process:

1. People implement new topes by using the basic tope editor (or another language such as JavaScript)

2. People publish tope implementations on repositories.

3. People download tope implementations to local cache

4. Tool plug-ins let people browse their local cache and associate topes with variables and input fields.

5. Plug-ins use tope implementations to validate data.

Stay tuned (or come collaborate !!)

Thank You…

• To Margaret Burnett, Martin Erwig, and many others for suggestions over the past 3 years

• To Oregon State University for this opportunity

• To NSF for funding

Professional programmers use lots of tricks to simplify validation code. E.g.: njtransit.com

Split inputs into many easy-to-validate fields. Who cares if the user has to type tabs now, or if he can’t just copy-paste into one field?

Make users pick from drop-downs. Who cares if it’s faster for users to type “NJ” or “1/2007”? (Disclaimer: drop-downs sometimes are good!)

I implemented this site in 2003.

Even with these tricks, writing validation is still very time-consuming.

Overall, the site had over 1100 lines of JavaScript just for validation… Plus equivalent server-side Java code (too bad code isn’t platform-independent)

    if (!rfcCheckEmail(frm.primaryemail.value))
        return messageHelper(frm.primaryemail, "Please enter a valid Primary Email address.");
    var atloc = frm.primaryemail.value.indexOf('@');
    if (atloc > 31 || atloc < frm.primaryemail.value.length-33)
        return messageHelper(frm.primaryemail,
            "Sorry. You may only enter 32 characters or less for your email name\r\n" +
            "and 32 characters or less for your email domain (including @).");

That was worst case. Best case: reusable regexps.

• Many IDEs allow the programmer to enter one regular expression for validating each input field.
  – Usually, this drastically reduces the amount of code, since most validation ain’t fancy.
  – So why don’t programmers validate most inputs?
