ZBW is a member of the Leibniz Association
A Data Restore Model
for Reproducibility in Computational Statistics
Daniel Bahls, ZBW, I-Know 2013, Graz, Austria
Outline
1. Motivation – Repeatability in Empirical Research
2. Our Approach – The Data Restore Model
3. Outlook – Status of this Work / Next Steps
Repeatability in Science
• Fundamental criterion: verification is the job of the community
• Experiments must lead to the same findings
  • for different researchers
  • under certain constant parameters
• Further
  • Robustness (w.r.t. measuring errors, etc.)
  • Repeatability vs. Reproducibility vs. Verifiability
Repeatability in Economics and the infamous case of Rogoff and Reinhart
Improving Review Processes
- Justin Wolfers, Betsey Stevenson, economists at University of Michigan
... so we need access to the data
If we try it all on our own and cannot reproduce the results, what does it mean?
McCullough – Experiences & Recommendations
McCullough – Requirements & Experiences
Sweave – Literate Programming for Statistics
Data Publishing in Economics / Social Sciences
Different disciplines have different challenges
Characteristics of empirical research:
• sensitive / protected data
• distributed external data sources
Data Sharing
submit data bundles to 3rd-party repositories?
Data Management: The Black Box Approach
data review, curation, legal situation
re-use, transparency, repeatability
a data set copy (some resource bundle)
Statistical Data on the Semantic Web
2. Our Approach – The Data Restore Model
Data Restore Model
Spreadsheet
obs data set
[Diagram: UserDataSet (type: DataSet); Data Items (type: Data Items from own survey); includesData → external dataset; buildScript]
No gaps – Trust – Incentive
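One way to read the model above is as a small data structure: a user data set stores only the items the researcher collected, references external data sets at their source, and carries a build script that restores the full data set on demand. The Python sketch below illustrates that reading. The class names mirror the slide labels (UserDataSet, includesData, buildScript); the fields, the `restore` function, the in-memory `ARCHIVES` lookup, and the toy values are assumptions made for illustration and are not taken from the original work.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, float]


@dataclass(frozen=True)
class ExternalDataSet:
    """Reference to a data set that stays at its original source."""
    source: str
    name: str
    version: str


@dataclass
class UserDataSet:
    """Own survey items plus references to external data and a build script."""
    own_items: List[Row]                  # data items from own survey
    includes_data: List[ExternalDataSet]  # "includesData" references
    build_script: Callable[[List[Row], List[List[Row]]], List[Row]]  # "buildScript"


# Hypothetical stand-in for the archives that hold the external data.
ARCHIVES: Dict[ExternalDataSet, List[Row]] = {}


def restore(ds: UserDataSet) -> List[Row]:
    """Fetch every referenced external data set and run the build script."""
    external = [ARCHIVES[ref] for ref in ds.includes_data]  # KeyError if unavailable
    return ds.build_script(ds.own_items, external)


def merge_all(own: List[Row], external: List[List[Row]]) -> List[Row]:
    """Toy build script: append all external rows to the own rows."""
    rows = list(own)
    for table in external:
        rows.extend(table)
    return rows


# Toy values, invented for the example.
household = ExternalDataSet("EuroStat", "Household XZ", "0.2")
ARCHIVES[household] = [{"year": 2008.0, "income": 100.0}]

user_ds = UserDataSet(
    own_items=[{"year": 2009.0, "income": 110.0}],
    includes_data=[household],
    build_script=merge_all,
)
restored = restore(user_ds)
```

Because the external rows are fetched at restore time rather than copied into the bundle, the published artefact contains no third-party data, which is the point of restoring instead of redistributing.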
Source: EuroStat
Dataset: Household XZ
Version: 0.2
Published: Jan 2009
[read more]
Integration with Research Environments
Review and Re-use
[Diagram: Client; Source Code Repository; Archives A, B, C, D; DOI; Code and Data Templates; Authenticate & Request Data]
Data Infrastructure Concept
• One source per data set
transparency, curation by highest expertise
• Data protection
make data publishing possible for all scenarios
• Data and code integration
one-click solution: no manual effort for replication attempts
• Precise Citation
traceable data provenance
Incentives for the Research Community
• Transparency increases trust:
no gaps – trust – incentive
• Easy re-use:
the applied research models live longer
• More impact:
more citations
• Material for tutorials:
Students learn computational research in practice
• Research is more efficient:
Easier to understand and pick up the research of others
• Secured Knowledge:
Replication attempts in different research environments and contexts
discussion, inspiration, innovation
“Non-Findings” may get more recognition
3. Outlook – Status of this Work / Next Steps
What we are currently working on
The Rogoff and Reinhart / Herndon case
• apply the Data Restore Model
• add semantic data documentation (partly available as RDF already)
• model with the Data and Code ontology
Data and Code Ontology
Data and Code
System Environment
Resources
HW
SW
Replication Attempts
Experiment Setup
• Maven
• Make
• Build
• Virtualisation
• Emulation
• Linked Science
• Social Media
Data References
• Semantic Coding?
The Koenker Zeileis case
• Model relations between Data and Code instances
[Diagram: data set (protected, public use file); figures; transformation by code]
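Relations between data and code instances can be recorded as a small provenance graph in which each derived artefact points to its input and to the code that produced it. The sketch below is one possible reading and not the ontology itself. The `Derivation` class, the `lineage` function, the script names, and the direction of the chain (protected data set, then public use file, then figures) are assumptions for illustration. It also assumes each artefact has at most one direct source.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Derivation:
    """One transformation step: `code` turns `source` into `output`."""
    output: str
    source: str
    code: str


def lineage(artefact: str, derivations: List[Derivation]) -> List[Derivation]:
    """Walk back from an artefact to its origin, one derivation per step."""
    by_output: Dict[str, Derivation] = {d.output: d for d in derivations}
    steps: List[Derivation] = []
    seen: Set[str] = set()
    current = artefact
    # Stop at an artefact with no recorded derivation, or on a cycle.
    while current in by_output and current not in seen:
        seen.add(current)
        step = by_output[current]
        steps.append(step)
        current = step.source
    return steps


# Hypothetical script names; only the artefact labels come from the slide.
derivations = [
    Derivation("public use file", "protected data set", "anonymise_script"),
    Derivation("figures", "public use file", "plot_script"),
]
chain = lineage("figures", derivations)
```

Walking the chain backwards answers the reviewer's question of which data and which code a published figure depends on, even when the first link is a protected data set that cannot be shared directly.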
The Koenker Zeileis case
Data Access and Retrieval
Next Steps
1. Challenge, Goals, Requirements
2. The Data Restore Model
3. Semantic Linkup / Data Annotation
4. Data Retrieval and Reuse
5. System Architecture
6. Validation / Evaluation
So there are still gaps
Examples:
• Data set is titled “EU Unemployment statistics 2012, EuroStat”
  • age class? seasonal adjustments?
• Executing the code does not produce the results
  • wrong data? system environment? error?
  • cf. Herndon’s replication of the Rogoff/Reinhart research
• DOI does not specify the file format
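Gaps of this kind can be detected mechanically once a data reference carries structured metadata instead of a free-text title. The following sketch checks a reference against a checklist of fields. The field names in `REQUIRED_FIELDS` and the `missing_fields` function are hypothetical, derived only from the example gaps listed above, and do not represent an established metadata standard.

```python
from typing import Dict, List

# Assumed checklist based on the example gaps above; not a standard.
REQUIRED_FIELDS = [
    "title",
    "source",
    "year",
    "age_class",
    "seasonal_adjustment",
    "file_format",
]


def missing_fields(metadata: Dict[str, str]) -> List[str]:
    """Return the required fields that are absent or empty, in checklist order."""
    return [name for name in REQUIRED_FIELDS if not metadata.get(name, "").strip()]


# A reference that carries only what the title reveals.
reference = {
    "title": "EU Unemployment statistics 2012",
    "source": "EuroStat",
    "year": "2012",
}
gaps = missing_fields(reference)
```

For the title-only reference, the check reports the age class, the seasonal adjustment, and the file format as unspecified, which matches the questions raised in the examples.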
Data and Code Ontology
[Diagram labels: observation; string value; s, p, o; data ref; default value; for_stata; for_spss]
Such relationship can be stated within the semantic model
Proxy Relations
Dataset for economic growth (GDP or the like)
Dataset for Aluminium Price Index
Describes the proxy relation:
- details on correlation
- best practices
- frequency of use
- ...
hasProxyRel
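A proxy relation of this kind can be written as subject-predicate-object statements. The sketch below uses plain Python tuples instead of an RDF library to stay self-contained. Only `hasProxyRel` appears on the slide; the identifiers, the `proxyDataset` and `note` predicates, and the assumption that the relation points from the growth data set to the aluminium price index are invented for illustration.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical identifiers and predicates, except "hasProxyRel".
triples: List[Triple] = [
    ("dataset:economic_growth", "hasProxyRel", "rel:growth_aluminium"),
    ("rel:growth_aluminium", "proxyDataset", "dataset:aluminium_price_index"),
    ("rel:growth_aluminium", "note", "details on correlation"),
]


def proxies_for(dataset: str, graph: List[Triple]) -> Set[str]:
    """Return the data sets linked to `dataset` through a proxy relation node."""
    relations = {o for s, p, o in graph if s == dataset and p == "hasProxyRel"}
    return {o for s, p, o in graph if s in relations and p == "proxyDataset"}


growth_proxies = proxies_for("dataset:economic_growth", triples)
```

Modelling the relation as its own node, rather than as a direct link between the two data sets, leaves room to attach the descriptive details listed above (correlation, best practices, frequency of use) to the relation itself.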