Lecture presented at Catedra Walter Lippmann, Universidad del Rosario, Bogota, Colombia, 23 Nov. 2012 See: http://issuu.com/consejo_de_redaccion/docs/ur_-_semana_-_seminario_walter_lippmann_2012_2
Árbol de vida de los datos
(Data validation in the Digital Age)
Tom Johnson
Managing Director
Inst. for Analytic Journalism
Santa Fe, New Mexico USA
tom@jtjohnson.com
@jtjohnson
Data validation in the Digital Age
Presentation by Tom Johnson at
Cátedra Walter Lippmann de Periodismo y Opinión Pública
Claustro de la Universidad
Universidad del Rosario, Bogota, Colombia
Date/Time: 22 November 2012
This PowerPoint deck and Tipsheets posted at:
http://sdrv.ms/wNtiM7
Important point 1: You know more than I do
Each of you knows more about some aspect of ensuring data quality than I do.
DataSet--Story
[Diagram: the DataSet, drawn as a block of binary digits, and The STORY!]
DataSet--CollectionProcess
[Diagram: CollectionProcess, the DataSet (block of binary digits), and The STORY!]
DataSet--ValidationProcess
[Diagram: CollectionProcess, ValidationProcess, and The STORY!]
Paying the price of bad data
Illinois and Missouri sex-offender DB
• St. Louis Post-Dispatch - 2 May 1999: A11 – “ABOUT 700 SEX OFFENDERS DO NOT APPEAR TO LIVE AT THE ADDRESSES LISTED ON A ST. LOUIS REGISTRY; MANY SEX OFFENDERS NEVER MAKE THE LIST” By Reese Dunklin; Data Analysis By David Heath and Julie Luca
•Sun, 3 Oct 2004 - THE DALLAS MORNING NEWS - PAGE-1A “Criminal checks deficient; State's database of convictions is hurt by lack of reporting, putting public safety at risk, law officials say” By Diane Jennings and Darlean Spangenberger
How bad data can do you wrong
2011 - New Mexico Sec. of State’s “questionable voters” data set – “The Big Bundle”
• ~1.1m voters
• Previous Sec. of State didn’t clean rolls
• Matched name, address, DoB and SS#
• SSA data base; NM driver’s licenses
• 2 variables “mismatch” = Questionable?
• Asked State Police (not AG’s office) to investigate
Problems with Sec. of State methodology
• What is the error rate of original DB?
• Definition of “error”? (Gonzales or Gonzalez)
• Sample(s) by county and state total?
• Error rates of comparative DBs?
• Aggregation of error problem
• 2011 Help America Vote Verification Transaction Totals, Year-to-Date, by State https://www.socialsecurity.gov/open/havv/havv-year-to-date-2011.html
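The “Gonzales or Gonzalez” question can be made concrete with a short sketch. This is an illustration only, not the Sec. of State's matching method: the function names and the 0.85 threshold are assumptions chosen for the example. It uses Python's standard-library difflib to score how alike two spellings are, so near matches can be set aside for human review rather than counted as mismatches.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(name):
    """Lower-case the name, trim whitespace, and strip accents."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return no_accents.strip().lower()


def similarity(a, b):
    """Similarity score between 0.0 and 1.0 for two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def classify_pair(a, b, near_threshold=0.85):
    """Label a pair of names; the threshold is an assumed, tunable value."""
    if normalize(a) == normalize(b):
        return "match"
    if similarity(a, b) >= near_threshold:
        return "near match"  # candidate for human review, not an "error"
    return "mismatch"


for pair in [("González", "Gonzalez"), ("Gonzales", "Gonzalez"), ("Gonzales", "Martinez")]:
    print(pair, classify_pair(*pair))
```

Where the threshold sits is a reporting decision: set it too high and spelling variants become “questionable voters”; set it too low and different people are merged.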
DataSet--CollectionProcess
[Diagram: CollectionProcess, the DataSet (block of binary digits), and The STORY!]
Data sets are living things; they have pedigree and genealogy
Important point 2
• Most [all?] data sets are living things.
• And they have a pedigree, a genealogy, an “árbol de vida” (tree of life).
• Data sets live in a dynamic environment.
• Understand the DB ecology
Data sets are living things; they have pedigree and genealogy
Important point 3
• NEVER work with your original data set; always work on a copy of the file(s)
• More combined data sets = greater chance of error
• Larger data sets = greater chance of error
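Why combining data sets raises the chance of error can be shown with simple arithmetic. The sketch below assumes that errors in each source are independent and uses made-up error rates (2%, 3% and 5%); neither the rates nor the function name come from any real data base.

```python
from math import prod


def share_with_any_error(error_rates):
    """Expected share of matched records carrying at least one error,
    assuming errors in the different sources are independent."""
    return 1 - prod(1 - rate for rate in error_rates)


# Hypothetical error rates for three data sets that are matched together.
rates = [0.02, 0.03, 0.05]
print(f"{share_with_any_error(rates):.2%} of matched records may contain an error")
```

Even when each source looks fairly clean on its own, the combined share is larger than any single rate, and on a very large file that share translates into many records.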
[Diagram: the DataSet, drawn as a block of binary digits]
Types of Data
Data Quality = function of…
• Objectives, reputation of data-base creator
• Validity and precision of the collection/creation process – and resulting data
• Statistical Data?
• Primary Data (collected, managed by agency or individual)
• Secondary (Agency or individual is using someone else’s “primary” data)
Pyramid of significance
• How to judge whether some data – and its potential stories – are more trustworthy than others?
• Go back to librarians’ hierarchy of trusted sources when searching? (Has anyone tested the “quality” of data sets from those strata of sources? If not, a good research project.)
Learn from Librarians
• Evaluating Web Pages: Techniques to Apply & Questions to Ask http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/Evaluate.html
• What can the URL tell you?
• Gov’t agency? Scholarly? Interest Group? Individual?
• Has a reputation for accuracy been created over time?
Learn from Librarians
• Does it all add up?
• Why was the page put on the web?
• Inform, give facts, give data?
• Explain, persuade?
• Sell, entice?
• Share?
• Disclose?
• Is the information current? When was it last updated and by whom?
• If the data is available on other sites, who/what was the original creator and editor of the data?
Hierarchy of Trust
• Information on .gov, .edu, or .mil sites has probably been vetted before it was posted.
• Domains ending in .gov, .edu and .mil have to be applied for, and their use is controlled.
• That does not make them foolproof, though.
• ".org" is organization. Sites that end in .org are usually non-profit organizations.
• Can be very good sources or very poor sources; take care to research their possible agendas or political biases.
• “.net” means network.
• “.info” is the Internet’s first unrestricted top-level domain since .COM. There are no restrictions on who may register .INFO names. .INFO was created for general use around the world.
Source: http://www.morriscs.org/webpages/jwaffle/index.cfm?subpage=1317299
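The “What can the URL tell you?” check can be sketched in a few lines. This is a minimal illustration using Python's standard urllib.parse; the notes in the lookup table paraphrase the slides, and the table and function names are assumptions made for the example. It gives a first impression only and says nothing about the quality of the data behind the page.

```python
from urllib.parse import urlparse

# First-impression notes per top-level domain, paraphrased from the slides.
TLD_NOTES = {
    "gov": "controlled domain; probably vetted, but not foolproof",
    "edu": "controlled domain; probably vetted, but not foolproof",
    "mil": "controlled domain; probably vetted, but not foolproof",
    "org": "usually a non-profit; research possible agendas or biases",
    "net": "network",
    "info": "unrestricted; anyone may register",
}


def tld_of(url):
    """Return the last label of the host name, or '' if there is none."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1] if "." in host else ""


def first_impression(url):
    """Look up the note for a URL's top-level domain."""
    return TLD_NOTES.get(tld_of(url), "unclassified; research the site owner")


print(first_impression("https://www.socialsecurity.gov/open/havv/havv-year-to-date-2011.html"))
```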
Hierarchy of Trust
• Credible websites should list contact information and resources.
• Only cell phone numbers and PO boxes = reason for suspicion
• If the author is named, find his/her web page to…
• Verify educational credentials
• Discover if the writer is published in a scholarly journal
• Verify that the writer is employed by a research institution or university
Hierarchy of Trust
• Internet pages that have been published more recently are usually more credible.
• Find this information at the bottom of a website, in the "about us" page, or via "view page source"
Hierarchy of Trust
• Selling something?
• Asking you to sign up for something?
• If so, the site may not be presenting you with neutral, unbiased information.
Hierarchy of Trust
Probably reliable sites, but not necessarily reliable data
DataSet--CollectionProcess
[Diagram: CollectionProcess, the DataSet (block of binary digits), and The STORY!]
Process of Data Evaluation
1. Pre-planning
• 2nd Monitor
• “Logbook” (bitácora) apps
• Checklist of intended steps
2. Lit. review/ interview peers
• Nothing is new; everything has a precedent
• How have others attacked this problem?
3. Do data fit theoretical models?
- Depends on subject: traffic flow vs. crime, or educational level vs. income
- Sometimes it is good to use non-traditional models: crime and disease
Process of Data Evaluation
4. Do a “critical biography” of the data
- Why was data collected? Who ordered its creation (law? Agency? Individual?)
- When first collected?
- News stories about the data?
5. Does biography raise critical warnings?
- Have laws related to data remained the same?
- Have definitions remained the same?
6. Have others run analysis of this data?
- Not only journalists, but other agencies/people
Process of Data Evaluation
7. Acquire latest data and related documentation
- Get data schema & code sheet
- Get instructions to data collectors and data entry clerks
Process of DB evaluation
Ask for copy of DATA ENTRY form
• Data Sheet Codes & Explanation
• Data base schema sheet
• Computer Data-Entry Sheet
Process of Data Evaluation
8. Compare record layout to tables
This may tell you:
- What data you did not receive
- Possibly, what data is feeding into other variables or calculations
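Comparing the record layout to the tables you actually received can be as simple as comparing two lists of field names. The sketch below is one way to do that, assuming you have typed the documented field names from the schema sheet into a list; all field names shown are invented for the example.

```python
def compare_layout(documented_fields, received_fields):
    """Report fields that are documented but absent, and fields that
    arrived without documentation. Names are compared case-insensitively."""
    documented = {name.strip().lower() for name in documented_fields}
    received = {name.strip().lower() for name in received_fields}
    return {
        "missing_from_delivery": sorted(documented - received),
        "undocumented_in_delivery": sorted(received - documented),
    }


# Hypothetical field names: the first list from the record layout,
# the second from the header row of the delivered file.
layout = ["CASE_ID", "DOB", "OFFENSE_CODE", "ADDRESS", "RISK_SCORE"]
delivered = ["case_id", "dob", "offense_code", "address", "entry_clerk"]
print(compare_layout(layout, delivered))
```

A documented field that never arrived is a question for the agency; an undocumented field is a question for whoever maintains the code sheet.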
9. Do documents specify expected ranges & frequencies?
- Suggests variables to be found. If expected range is 1-7 and you find 8…
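The “expected range is 1-7 and you find 8” test can be run as a frequency count against the code sheet. This sketch assumes a coded variable whose documented codes are 1 through 7; the values are invented for the example.

```python
from collections import Counter


def code_frequencies(values, expected_codes):
    """Count how often each code appears and single out codes that
    the documentation does not list."""
    counts = Counter(values)
    unexpected = {code: n for code, n in counts.items() if code not in expected_codes}
    return counts, unexpected


# Hypothetical coded column; the code sheet documents codes 1 through 7.
column = [1, 2, 2, 7, 8, 3, 8]
counts, unexpected = code_frequencies(column, expected_codes=range(1, 8))
print("Frequencies:", dict(counts))
print("Not on the code sheet:", unexpected)
```

The frequency table is worth reading as well: a documented code that never appears, or one that dominates, may point to a change in definitions.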
Process of Data Evaluation
10. Are data values missing or out of range?
- Use Excel (or R) formula to test “expected” ranges
- =MIN(A1:A100) or =MAX(A1:A100)
- Use Excel's conditional formatting feature
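For readers working outside Excel, the same MIN/MAX test can be sketched in Python. The function name, the sample ages and the 0-120 limits are assumptions for illustration; blank strings and None are treated as missing.

```python
def range_report(values, low, high):
    """Summarize a numeric column: min, max, rows with missing values,
    and rows whose value falls outside the expected low..high range."""
    missing = [i for i, v in enumerate(values) if v is None or v == ""]
    present = [(i, float(v)) for i, v in enumerate(values) if v is not None and v != ""]
    numbers = [v for _, v in present]
    return {
        "min": min(numbers) if numbers else None,
        "max": max(numbers) if numbers else None,
        "missing_rows": missing,
        "out_of_range_rows": [i for i, v in present if not low <= v <= high],
    }


# Hypothetical "age" column with two blanks and one impossible value.
ages = [34, "", 51, None, 212, 18]
print(range_report(ages, low=0, high=120))
```

Row positions are counted from zero here, so add the header offset before looking them up in a spreadsheet.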
Process of DB evaluation
10. Review major checklist
- Revise your list of major checkpoints
Major questions
• Are there changes in definitions?
• Changed by law?
• By the administrators?
• Formal or informal by data entry process?
• Are there changes in the collection methods, data entry, editing of data, quality checking, and the type and form of files?
• Were there changes in the users and the use of the data?
• Now it is time to clean the data
Is perfection necessary?
• How “clean” must the data be?
• Depends on the goals – and scale – of the analysis
• How important is the actual age of an individual? Or…
• How precise should the lat/longitude data be?
• Precision: Are the numbers rounded?
• Hope for fine-grained, not summaries or aggregates
• Can be especially important with temporal and geographic data, e.g. what is the range(s) of the time scales?
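How precise lat/longitude data needs to be is easier to judge once decimal places are translated into distance on the ground. The sketch below rests on the common rule of thumb that one degree of latitude is roughly 111 km, and that east-west distance shrinks with the cosine of the latitude; the constant and the function name are assumptions for illustration, not survey-grade values.

```python
import math

KM_PER_DEGREE_LATITUDE = 111.0  # rough rule of thumb, not an exact figure


def ground_resolution_m(decimal_places, latitude_deg=0.0):
    """Approximate size in meters of the last decimal place of a coordinate.
    Returns (north-south, east-west) distances."""
    step_deg = 10 ** (-decimal_places)
    north_south = KM_PER_DEGREE_LATITUDE * 1000 * step_deg
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


for places in (2, 4, 6):
    ns, ew = ground_resolution_m(places, latitude_deg=35.0)
    print(f"{places} decimal places: about {ns:,.2f} m north-south, {ew:,.2f} m east-west")
```

Coordinates rounded to two decimal places can place a record in the wrong neighborhood, which matters for a street-level story and hardly at all for a state-level one.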
Data Quality checkpoints
• Constancy of definitions and coding categories?
• Completeness:
• How many records have unfilled cells?
• Are the patterns of “nulls” consistent across all records and variable types?
COMMON VERIFICATION METHODS
• Counting
• Do you have the number of records indicated/promised?
• If >1,000 records, sample to test
• To confirm your methodology
• Proportion of completed fields
• If a record has X fields, what % of records are complete?
• Are there trends of null (empty) fields?
• Draw on many Excel functions:
• COUNTIFS or SUMIF
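The counting and completeness checks above can be sketched in Python as an alternative to COUNTIFS. Everything here is illustrative: the records, the promised count and the function names are invented, and a cell counts as empty if it is None or blank.

```python
import random


def is_blank(value):
    """Treat None and whitespace-only text as an empty cell."""
    return value is None or str(value).strip() == ""


def completeness_report(records, promised_count):
    """Check the record count and measure how completely fields are filled."""
    fields = sorted({key for record in records for key in record})
    complete = sum(
        1 for record in records if all(not is_blank(record.get(f)) for f in fields)
    )
    return {
        "count_matches": len(records) == promised_count,
        "pct_complete": 100 * complete / len(records) if records else 0.0,
        "nulls_by_field": {
            f: sum(1 for record in records if is_blank(record.get(f))) for f in fields
        },
    }


def sample_for_review(records, size=100, seed=2012):
    """Draw a reproducible random sample to check by hand."""
    return random.Random(seed).sample(records, min(size, len(records)))


# Hypothetical records; the agency promised five.
rows = [
    {"id": 1, "name": "Ana", "dob": "1980-01-02"},
    {"id": 2, "name": "", "dob": "1975-05-09"},
    {"id": 3, "name": "Luis", "dob": None},
    {"id": 4, "name": "Marta", "dob": "1990-12-30"},
]
print(completeness_report(rows, promised_count=5))
```

Fixing the seed makes the sample repeatable, so a colleague can check the same records later.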
ScatterPlots+BoxPlots
Box Plots
What is a scatterplot?
• Scatterplot is often 1st step in analysis
• Examine relationship between the variables; determine if there are any problems/issues with the data
• Scatterplot indicates anything unique or interesting about the data, such as:
• How is the data dispersed?
• Are there outliers? A scatterplot is useful for "eyeballing" the presence of outliers.
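“Eyeballing” outliers can be backed up with the rule a box plot commonly uses for its whiskers: flag any value more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that conventional rule with Python's standard statistics module; the values are invented and the 1.5 multiplier is a convention, not a law.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return values lying more than k interquartile ranges outside
    the first or third quartile (the usual box-plot whisker rule)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical ages with one data-entry slip.
ages = [21, 23, 24, 25, 26, 27, 29, 30, 31, 250]
print(iqr_outliers(ages))
```

A flagged value is a question to ask, not a record to delete: it may be a typing error, or it may be the story.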
Convergence of Data Quality with Data Veracity
What is the difference?
• Data quality is the responsibility of whoever, or whatever agency, is collecting or creating the data set
This suggests questions journalists should ask about DQ
Do methodologies differ?
Resources
• Free
• Power Pivot – Excel 2010 add-on for working with large data sets
• R – free software environment for statistical computing and graphics
• Shiny – Lets R users turn analyses into interactive web applications
• Google Refine - tool for working with messy data, cleaning it up, transforming it from one format into another, extending it with web services, and linking it to databases
• Google Fusion Tables - an experimental data visualization web application to gather, visualize, and share larger data tables.
• Tableau Public - Interact with the data, download it, or create visualizations of it
• Junar - cloud-based platform for opening data
Resources
• Open Source
• Flat File Checker - a simple, intuitive tool for validation of structured data in flat files (*.txt, *.csv, etc.).
• Shiny – Lets R users turn analyses into interactive web applications
• Excel add-ons
• Commercial Companies & Products
• Techspeed Data Cleansing
• SAS® Data Quality Advanced
Resources
Professional disciplines and organizations
• International Association for Information and Data Quality
• DAMA International
• Forensic Accounting/ Performance Measurement
• National Association of Forensic Accountants (NAFA)
• Certified Fraud Examiner (CFE)
• International Forensic Accounting Association
• Forensic Accountants Society of North America
• International City/County Management Association
Forensic Accounting (Contabilidad Forense)
Resources
Professional disciplines, organizations and others (Spanish-language articles)
• La Contabilidad o Auditoria Forense: un conocimiento básico en Colombia
• Contabilidad Forense: ¿El lado sexy de la Contaduría?
• La Contabilidad Forense
• Contabilidad Forense, una herramienta que busca la verdad
• Aplicación del Derecho a la Contabilidad Forense: La práctica indagatoria contra el delito económico