Data Quality Step by Step
Ronnie Babigumira
PEN Workshop, 08/01/08
Ronnie Babigumira Data Quality Step by Step
Outline
Part I: General Principles
Part II: PEN’s Approach to DQ
Outline of Part I
1 A generalized approach
    Background
    How can we get it right?
Outline of Part II
2 PEN’s DQ Procedures
    The Dilemma
    The Fixes
Overview
Part I
General Principles
Overview: Background, How can we get it right?
The big questions
... need Data ... At one level "data" are the world that we want to explain ... At the other level, they are the source of all our troubles ...
Griliches, Zvi. "Data and Econometricians-The Uneasy Alliance." American Economic Review 75, no. 2 (1985): 196-200.
Anyone who has delved into data from the real world knows it can be messy, very messy.
And yet the quality of data is of prime importance for accurate, reliable and valid results.
Where does it go wrong
Main Culprit
The Human Element
Accomplice
The Tech Element
Good Quality Data Workflow
Research Question
Research Question
... It is better to use an approximate solution to the right question than an exact solution to a wrong question ...
If you don’t ask the right question, you will likely correct the wrong data
    Clarity
    Optimal ignorance
Good Quality Data Workflow
Research Question -> Survey Instrument [H.E]
Survey Instrument
My heart sinks when someone produces their own questionnaire, consisting of questions that they have thought of on the bus [Dr Fisher’s Casebook: A shy nurse consults the good doctor. 2006. Significance 3 (3): 122-122.]
Hurriedly throwing a questionnaire together is at best a waste of time and at worst a source of flawed data that could affect your work and your reputation.
More details in section 6.4 of the PEN technical guidelines.
Good Quality Data Workflow
Research Question -> Survey Instrument [H.E] -> Survey
The Survey
The Government are very keen on amassing statistics. They collect them, add them, raise them to the nth power, take the cube root and prepare wonderful diagrams. But you must never forget that every one of these figures comes in the first instance from the village watchman, who just puts down what he damn pleases - Sir Josiah Stamp, Inland Revenue Department of England, 1896-1919
Pre-testing
Hiring and training enumerators (probing: respondents may mis-report)
Sampling
Jens will talk more about this
Also see section 6.5 of the PEN technical guidelines
Good Quality Data Workflow
Research Question -> Survey Instrument [H.E] -> Survey -> Checking
Checking
Visual inspection of hard copies of the questionnaires as they return. Best done while in the field. Look out for:
    Gaps
    Legibility
    Consistency, e.g. no livestock reported but livestock data filled in
A debriefing session is recommended, where enumerators share their experiences.
Good Quality Data Workflow
Research Question -> Survey Instrument [H.E] -> Survey -> Checking -> Coding
Coding
Coding is a crucial part of data preparation and should be treated as such.
“Close” all open questions before entering data.
The creation of the “other” code is best done on the PC, i.e. only code something as “other” as a last resort.
All questionnaires should be coded before data entry.
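A minimal Stata sketch of "closing" an open question: free-text answers are normalised, matched against a code list, and only what is left over becomes "other". The variable names, the code list and the value 99 are assumptions made for illustration, not PEN's actual codes.

```stata
* Hypothetical free-text answers to an open question (illustrative data)
clear
input str20 product_txt
"Firewood"
"fire wood"
"charcoal"
"Wild honey"
"bamboo shoots"
end

* Normalise spelling before assigning codes: lower case, no spaces
generate str20 key = subinstr(lower(trim(product_txt)), " ", "", .)

* Assign codes from an assumed central code list
generate byte product = .
replace product = 1 if key == "firewood"
replace product = 2 if key == "charcoal"
replace product = 3 if key == "wildhoney"

* "Other" is the last resort: only what no existing code covers
replace product = 99 if missing(product)

label define productlbl 1 "Firewood" 2 "Charcoal" 3 "Wild honey" 99 "Other"
label values product productlbl

* Review everything that ended up as "other" before accepting it
list product_txt if product == 99
```

Listing the leftovers at the end keeps the "other" category small and visible, so a new code can be added if the same answer keeps recurring.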
Good Quality Data Workflow
Research Question -> Survey Instrument [H.E] -> Survey -> Checking -> Coding -> Data Entry [T.E]
Data Entry
Good practice
Is about capturing & storing raw data as accurately as possible
Not to be confused with data management or analysis
Do it in a way that eases analysis. Preferably done in the field
What to use
    Word processors or text editors
    Spreadsheets ("To err is human. But to really foul things up, you need Excel")
    Statistical packages, e.g. SPSS. The base package is not good for entry; pay more and you get the data entry module. EpiData highly recommended (gratis)
    Database packages. Hard to set up but good rewards
Use the wrong software and not only will it garble your data, it might eat you up as well.
Good Quality Data Workflow
Research Question -> Survey Instrument [H.E] -> Survey -> Checking -> Coding -> Data Entry [T.E] -> Data Cleaning
Data Cleaning
I found mothers younger than their children. I found men who claimed to have cervical smears ("Dr Fisher’s Casebook: The Trouble with Data." Significance 4 (2007))
What is it
A set of procedures aimed at ensuring the reliability and correctness of the data by detecting and correcting (or removing) corrupt or inaccurate records.
Almost always necessary yet often ignored
What to look for
Completeness
Outliers (what to do about them?)
Range
Logic
Arithmetic
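The five checks above can be sketched in Stata roughly as follows. The data, the variable names and the thresholds (an age range of 15 to 110, a median-plus-three-IQR rule for outliers) are all assumptions chosen for illustration; real limits come from the questionnaire.

```stata
* Illustrative household data; every variable name here is an assumption
clear
input hhid age_head age_child livestock n_cattle inc_forest inc_other inc_total
1001 45 12 1 4 200 300  500
1002 30 35 0 3 150 100  250
1003  . 10 1 0  50  50  100
1004 52 20 1 2 100 400 9000
end

* Completeness: key variables that were left blank
list hhid if missing(age_head)

* Range: values outside what the questionnaire allows
list hhid age_head if age_head < 15 | (age_head > 110 & !missing(age_head))

* Logic: answers that contradict each other
list hhid age_head age_child if age_child >= age_head & !missing(age_child)
list hhid livestock n_cattle if livestock == 0 & n_cattle > 0 & !missing(n_cattle)

* Arithmetic: reported totals that differ from the sum of their parts
list hhid inc_total if inc_total != inc_forest + inc_other

* Outliers: flag for review rather than delete automatically
quietly summarize inc_total, detail
local upper = r(p50) + 3 * (r(p75) - r(p25))
list hhid inc_total if inc_total > `upper' & !missing(inc_total)
```

Each check only lists suspect records; deciding whether to correct, explain or drop them is a separate step that belongs with whoever collected the data.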
Good Quality Data Workflow
Research Question -> Survey Instrument [H.E] -> Survey -> Checking -> Coding -> Data Entry [T.E] -> Data Cleaning -> Data Management
Data Management
What is it
Some say everything from data entry to analysis
Our definition: All the pre-analysis processing of clean data. Includes
    Data documentation (labeling)
    Creating new variables / observations and changing existing ones
    Managing files: separating, combining, collapsing, reshaping
Good practice
Raw data: Leave it alone, create flat files.
Transparency transparency transparency: Leave a clear audit trail.
Document: Some say, if data is not documented, run as fast as you can.
Tech Element: Save your hair & sanity. Use a command-driven program.
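As a rough illustration of these practices in a command-driven program, the Stata sketch below logs every step, labels the data, derives a new variable in code and saves the result under a new name. File names, variable names and values are hypothetical.

```stata
* Keep an audit trail: everything below is recorded in a log file
capture log close
log using "manage_demo.log", replace text

* Stand-in for a raw file (illustrative data and names)
clear
input hhid hhsize inc_total
1001 5 500
1002 3 250
1003 6 100
end

* Document: label the data set and its variables
label data "Demo household file"
label variable hhid "Household identifier"
label variable hhsize "Number of household members"
label variable inc_total "Total household income"

* New variables are derived in code, never typed over the raw values
generate inc_pc = inc_total / hhsize
label variable inc_pc "Income per household member"
notes inc_pc: inc_total divided by hhsize

* Raw data stays untouched: write the result to a separate file
save "demo_managed.dta", replace
log close
```

Because the whole sequence lives in a do-file, rerunning it from the raw data reproduces the managed file exactly, which is what makes the trail auditable.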
Good Quality Data Workflow
Research Question -> Survey Instrument [H.E] -> Survey -> Checking -> Coding -> Data Entry [T.E] -> Data Cleaning -> Data Management -> Data Analysis
Data Analysis
This is what it is all about
It is all well and good to collect data at great length, expense, and effort but the most important aspect is often not the actual information collected but the interpretation and use it is put to.
Columb, M.O., P. Haji-Michael, and P. Nightingale. 2003. Data collection in the emergency setting. Emergency Medicine Journal 20 (5): 459-463.
PEN
Part II
PEN’s Approach to DQ
PEN: The Dilemma, The Fixes
PEN in a Second
30+ researchers, scientists and partners working to answer a big question about the importance of forests to the livelihoods of millions of poor people worldwide.
PEN in the Workflow
Research Question -> Survey Instrument [H.E] -> Survey -> Checking -> Coding -> Data Entry [T.E] -> Data Cleaning -> Data Management -> Data Analysis
PEN in the Workflow
Research Question X
Survey instrument X
Survey X
    364 villages
    9,100 households
Visual Inspection X
Coding X
    All responses have been coded
    Coding was centralized, which has provided a unified understanding of products and no duplication
A monster is born
The Challenge
Stock
    294,150 questionnaire pages
    700,000+ tables
    Heterogeneity in data management skills of partners
How do we
    1 Accurately capture the raw data from the questionnaires
    2 Store it in an organized fashion
    3 Prepare and make it ready for individual and global analysis
    4 Analyze it
Data entry
Excel was never an option. Why?
    Individual creativity is a nightmare for the collective
    More serious Excel issues such as mixed variable types, propagation of blanks, checks, and dangerous proximity to data
After a discussion on what to use, MS Access was chosen in the end.
The Database
We [1] have designed and implemented MS Access database modules for data entry
What
    Each survey is a database
    Within each database, “each” page of the questionnaire is a table
Why
    Uniformity and compatibility among users, i.e. variable names, codes etc.
    Intuitive and user-friendly interface
    Data quality controls
    Flexible
[1] The database development team is Arild, Betty and Ronnie
A Modular approach
Intuitive and user friendly GUI
Controls
Questionnaire
Controls
Eye Candy
Controls
Deceptively Simple
Controls
Bells and Whistles
In Control
Flexible
Household Data
Flexible
Common Solutions
Flexible
Access.. Flexible format
The last word on Flexibility
We focus on ease of entry and deal with data management later
Wide Format
hhid   village   district   d_mkt
1001   Moss      Akershus   15
1002   Ås        Akershus   1
1003   Drøbak    Akershus   42

Long Format
hhid   pid   sex
1001   1     1
1001   2     1
1001   3     0
1002   1     2
1002   2     1
1002   3     1
1002   3     2
1003   1     1
1003   1     2
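One way the "deal with data management later" step might look in Stata: the two tables above are entered as they are, identifiers are checked, and the household information is then attached to every member row. The check on hhid plus pid and the use of merge are an illustrative choice, not a description of PEN's own programs.

```stata
* Household-level (wide) table, values as in the example above
clear
input hhid str10 village str10 district d_mkt
1001 "Moss"   "Akershus" 15
1002 "Ås"     "Akershus"  1
1003 "Drøbak" "Akershus" 42
end
* Stops with an error unless hhid identifies one row per household
isid hhid
tempfile households
save `households'

* Member-level (long) table, values as in the example above
clear
input hhid pid sex
1001 1 1
1001 2 1
1001 3 0
1002 1 2
1002 2 1
1002 3 1
1002 3 2
1003 1 1
1003 1 2
end

* hhid + pid should identify one person; list any pair that repeats
duplicates tag hhid pid, generate(dup)
list hhid pid sex if dup > 0

* Attach the household information to every member row
merge m:1 hhid using `households'
tabulate _merge
```

Keeping entry tables in whatever shape matches the questionnaire page, and reshaping or merging in code afterwards, means data entry staff never have to think about the analysis layout.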
Data Cleaning
Example of data cleaning in Stata

* -----------------------------------------------------------------
* CIFOR PEN Poverty Environment Network
* A demo do-file to clean v2
* By: Ronnie Babigumira
* -----------------------------------------------------------------
cd e:\pen\cleaning\miriam\s_data\v2
use v2_a_geo, clear

/* Some simple checks
1. Year is between 2005 and 2007
2. Month is between 1 and 12
3. Day is between 1 and 31
*/

list village villcode intyear if intyear < 2005 | (intyear > 2007 & intyear != .)
list village villcode intmon if intmon < 1 | (intmon > 12 & intmon != .)
list village villcode intday if intday < 1 | (intday > 31 & intday != .)
PEN’s data submission workflow
1 Partner submits Access data
2 Data transferred to Stata
3 Cleaning program run on data. Report of bugs sent back to partners
4 Partner addresses bugs, re-submits data
5 Steps 1 to 4 repeated till all bugs are addressed or can be explained. Final report on data compiled
6 Data sent to master database for archiving
7 Programs written to access master data set and create flat files for data analysis
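Step 3 could be sketched in Stata along these lines: a cleaning program runs a list of range checks and writes one line per problem to a plain-text report that can be sent back to the partner. The data, the report file name and the loop structure are assumptions for illustration; only the three date ranges come from the demo do-file.

```stata
* Hypothetical submitted data standing in for a converted Access table
clear
input villcode intyear intmon intday
101 2006  5 12
102 2004  7 30
103 2007 13  2
104 2006 11 30
end

* Write one line per problem to a plain-text report for the partner
tempname report
file open `report' using "bug_report.txt", write replace text

local nbugs = 0
* Each check is "variable minimum maximum"
foreach check in "intyear 2005 2007" "intmon 1 12" "intday 1 31" {
    tokenize `check'
    quietly count if (`1' < `2' | `1' > `3') & !missing(`1')
    if r(N) > 0 {
        file write `report' "`1': `r(N)' value(s) outside `2'-`3'" _n
        local nbugs = `nbugs' + r(N)
    }
}
file write `report' "Total problems found: `nbugs'" _n
file close `report'

* Archive only when the report is clean or every remaining bug is explained
display "Problems to send back to partner: `nbugs'"
```

Rerunning the same program on each re-submission gives a comparable report every round, so steps 1 to 4 can repeat until the count reaches zero or the remainder is explained.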
Putting it together
1 Local Access databases from all PEN partners (Country, V1, V2, A1, A2, Quarterly)
2 Data cleaning
3 Clean data in Access and Stata (Country, V1, V2, A1, A2, Quarterly)
4 Global dataset: aggregation of local datasets (Country, V1, V2, A1, A2, Quarterly)
5 Sub-datasets for analysis: flat files
Conclusion
What is a good data set
Accurate
Complete
Clean
Documented
Easy to use.
But
... data preparation may occupy 90% of a project time line and is a major source of delay. Mikhail Golovnya, Salford Systems 2007.
A number of steps, a number of opportunities to mess up. It’s all in your hands.
Minds think with ideas, not information. No amount of data, bandwidth, or processing power can substitute for inspired thought. Clifford Stoll