Upload
meryl-shelton
View
220
Download
0
Tags:
Embed Size (px)
Citation preview
W. Thomas - 2014
DDI Background
Wendy ThomasDDI Sprint 2015 - Dagstuhl
Copyright © DDI 2015Published under Creative Commons Attribute-ShareAlike 3.0 Unported
W. Thomas - 2014
Credits• Slides marked “W. Thomas – 2014” were
created by Wendy Thomas• Except as noted all other slides are from the
slide decks created for DDI training workshop at Dagstuhl October 2014 and related events– Arofan Gregory, Wendy Thomas, Joachim
Wackerow, Jon Johnson• All slides are published under Creative
Commons Attribute-ShareAlike 3.0 UnportedOn-line available at: http://creativecommons.org/licenses/by-sa/3.0/This is a human-readable summary of the Legal Code at:
http://creativecommons.org/licenses/by-sa/3.0/legalcode
W. Thomas - 2014
Environment
• Background of content metadata– Rapid advances in data/metadata storage and
access since 1960’s– Rising expectations for access and openness– Shifting focus on metadata coverage– DDI development in light of these changes
• DDI history
W. Thomas - 2014
Time Period Access Expectation Applied Metadata DDI
1960-1970 Electronic data capture and dissemination
Print; Bibliographic records for machine readable files
Discussion begins regarding consistent formats for metadata
1980-1990 Increased access to computing; internet; CD-ROMS
Data file description; Electronic format; Discovery and basic codebook metadata
NSF grant to develop DDI; focus on simple study; description of a data file
2000-2005 Web access; web publication
Access systems; Generic access system
Publication of DDI 1.0; NESSTAR and local use tools; Publication of DDI through 2.1; DDI Alliance; IHSN Toolkit
2006-2010 One-stop-shopping; Open Data
Preservation; Provenance
DDI-L [3.0, 3.1]; Expansion of IHSN Toolkit to over 80 countries
2011-2015 "Semantic Statistics"; Linked Data; Big Data
Shift from top down to bottom up
DDI-L 3.2; Moving Forward Project
W. Thomas - 2014
Types of Metadata
• Bibliographic– Short description applicable to multiple object types for the purpose
of discovery and access• Access metadata
– Additional details required to access specific content of an object (location of content within the object, required hardware/software, etc.)
• Content metadata– Detailed information on the content structure and source of a data file
• Production metadata– Metadata that prescribes and drives processes– Metadata that is not limited to a single data file
W. Thomas - 2014
Types of metadataBibliographic Access
MetadataContent Metadata
Production Metadata
Dublin CoreMARCDMARCEtc.
SPSSSASRStataPREMISEtc.
EMIANZLICSDMXDDI-C
BPLDDI-LGSBPMGSLPMGSIM
W. Thomas - 2014
What is DDI
• DDI describes provenance, processing, structure, and meaning of your data for intelligent use by human beings and machines
• DDI allows you to say what you plan to do, what you actually did, and what the analyst needs to know to make intelligent use of the data
DDI Background• Concept of DDI and definition of needs grew out of
the data archival community• Established in 1995 as a grant funded project
initiated and organized by ICPSR• Members:– Social Science Data Archives (US, Canada, Europe) – Statistical data producers (including US Bureau of the
Census, the US Bureau of Labor Statistics, Statistics Canada and Health Canada)
• February 2003 – Formation of DDI Alliance– Membership based alliance– Formalized development procedures
DDI Alliance
• The Alliance is an unincorporated, self‐sustaining membership organization whose members have a voice in the development, promotion, and dissemination of DDI specifications.
• 40 member institutions including universities, archives, national statistical institutes, and international organizations.
DDI Alliance Structure• DDI-L specifications are created by committees drawn from
among the member organizations– Some outside experts are invited to attend
• The elected Executive Board governs the organization– They are elected by Member representatives
• The Scientific Board vote to approve all published work– One representative per member organization
• The Technical Committee (TC) creates the technical work products (XML schemas, UML models, documentation, etc.)
• Working Groups are short term groups working on future DDI topical content (i.e., Active Data Management Plan)
• Web Site Maintenance Group
DDI-C and DDI-L
• DDI has 2 development lines– DDI Codebook (DDI-C)– DDI Lifecycle (DDI-L)
• Both lines will continue to be improved– DDI-C focusing just on single study codebook
structures– DDI-L focusing on a more inclusive lifecycle model
and support for machine actionability
Early DDI:Characteristics of DDI-C
• Focuses on the static object of a codebook• Designed for limited uses– End user data discovery via the variable or high level study
identification (bibliographic)– Only heavily structured content relates to information
used to drive statistical analysis • Coverage is focused on single study, single data file,
simple survey and aggregate data files• Variable contains majority of information (question,
categories, data typing, physical storage information, statistics)
Limitations of these Characteristics
• Treated as an “add on” to the data collection process• Focus is on the data end product and end users
(static)• Limited tools for creation or exploitation• The Variable must exist before metadata can be
created• Producers hesitant to take up DDI creation because it
is a cost and does not support their development or collection process
DDI Lifecycle Model
Metadata Reuse
Concept Universe
Representation
Geography
Measurement Means of Capture
Processes InstrumentCollection Event
Methods / Protocols
Variables Data Relations
Storage Structure
Store (file level)
Summary Statistics
Management
Agents
Comparison
Discovery
Information Captured in DDI
W. Thomas - 2014
DDI-L Driving Principles• Capture the full lifecycle of data• Capture metadata at the point of creation– Increases accuracy– Prevents loss through neglect
• Reuse– Increases accuracy– Provides implicit comparability
• Management over time– Provides context– Ensures preservation
W. Thomas - 2014
Additional Issues• Metadata is “touched” by multiple agents– Metadata needs to be accessible throughout the lifecycle
• Metadata can be an “input”, a process definition, and an “output”– Metadata needs to be captured and retained at all points
in the development, production, and usage lifecycle• Metadata can be descriptive or prescriptive;
actionable or non-actionable– The role of the metadata dictates how it needs to be
captured
Metadata Management Over Time
• Modern best practice is to manage metadata in centralized repositories for use throughout the organization
• Metadata management is the foundation of all other applications
• Metadata is not static. It versions, it is enhanced to support new applications, use cases, and to capture provenance at all levels.
Metadata Reuse
• Metadata is often not managed in such a way that it can be easily reused
• Reuse of metadata provides a high degree of consistency and comparability – elements of data quality
• This benefits data producers, managers, and users
Metadata-Driven Systems
• DDI supports the automation of many tasks which today are manual and resource-intensive
• This requires detailed and rich metadata in a machine-actionable form
• May require the re-design of workflows within an organization – put the metadata first!
• A major part of the “modernization” of production and dissemination systems
• Funding agencies are demanding more metadata to allow the reuse and replication of research data they pay for
W. Thomas - 2014
DDI RDF Products
• DDI held two workshops on “Semantic Statistics” at Schlöss Dagstuhl over the past few years to address the expression of DDI in an RDF environment
• DDI has just completed the first review for the following RDF vocabularies:– DISCO – DDI-RDF Discovery Vocabulary– XKOS – Extended Knowledge Organization System– PHDD – Physical Data Description
• DDI 4 will be model based and implemented in both XML and RDF
DDI Moving Forward Project
• The development of the next version of DDI is a project with a limited lifespan
• The project has an Advisory Committee to oversee the organization of the work
• It has its own project management and working group structure
• All development work is being done in an open, transparent fashion
• The outputs (for review or publication) are overseen by the Technical Committee
STANDARDS USED WITH DDI
DDI and Other Standards
• DDI is used in combination with other standards in many different contexts. The following are two examples:– Data archives and libraries doing preservation,
management, and dissemination– Statistical offices doing data production
• This is not a comprehensive list of scenarios, but is indicative of what takes place
Archival/Management Standards
• Dublin Core – discovery metadata• ISO/IEC 11179 – Metadata registries• ISO/IEC 19115 (and related series) –
geographical systems• OAIS/PREMIS – archival process • METS – information wrapper for transport and
storage• GLBPM – process description for longitudinal
data
Official Statistics
• Generic Statistical Business Process Model (GSBPM) – process model of statistical production
• Generic Statistical Information Model (GSIM) – information model for statistical production
• ISO 17369 SDMX – Statistical data and metadata exchange and dissemination
• Common Statistical Production Architecture (CSPA) – standard architecture for building re-usable statistical services
• Business Process Modelling Notation (BPMN) – generic standard for modelling and communicating business processes
RDF and the Web of Linked Data
• RDF and related standards are becoming increasingly important to the DDI Community
• There are a large number of RDF vocabularies which are important (dcterms, skos, foaf, prov, qb, etc.)
• In its own RDF vocabularies, DDI has made every effort to align with existing vocabularies in accordance with best practice
DDI as a Good Citizen
• The DDI Alliance has made many efforts to reach out to other standards organizations – Usually the goal is alignment of the standards
• The DDI Alliance has also tried to stay current with best practice as a standard specification– Shift to model-driven approach– Use of W3C XML Schema– Support for RDF standards
Expanded Domain Usage
• Traditionally, DDI has been used by those in the Social, Behavioral, and Economic (SBE) sciences and by official statistical agencies.
• There are several “growth” domains:– Health research– Education– Environmental science
• Thus, even more existing standards are becoming of interest to the DDI Alliance
DDI TOOLS OVERVIEW
DDI Tools Outline
• Editing• Data Collection• Transformation• Management• Metadata/Data Dissemination
Editing Tools• Rogatus Survey (DIPF) – WIP, Open Source, .NET, DDI 3.2
– Being updated – available end of 2015• Colectica – Commercial, .NET, DDI-C and DDI-L 3.2
– Also, free Excel Plug-In• DDI Editor (DDA) – Open Source, Java, DDI 3.1• OM Survey Manager – Open Source, Java, DDI 3.1
– Supports merging codes/categories created from transforming DDI-C• GESIS tool for researcher documentation, variable harmonization
(CharmStats)- WIP• Nesstar Publisher (NSD) – Freeware, DDI Codebook• Metadata Management Toolkit (IHSN) – freeware/open source,
Java (based on Nesstar); editor and publications tools for websites/CD-ROM, DDI Codebook
• CED2AR (Cornell) – Open Source, Java, DDI Codebook
Data Collection• Rogatus Survey (DIPF) – WIP, Open Source, .NET, DDI 3.2
– Generate & document surveys, manages fieldwork– Being updated – available end of 2015
• Michigan Questionnaire Documentation System (SRO)– Export DDI documentation from Blaise instruments
• BlaiseDoc (SRO) - WIP – Create documented Blaise instruments
• Colectica – DDI-C, DDI-L (3.2), Commercial, .NET – Read & generates Blaise code, CASES, CSPro, and others– Document questionnaires manually
• DDI Editor (Danish Data Archive) – Open Source, Java, DDI 3.1– Document questionnaires manually
• R code for integrating REDCap questionnaires with DDI (Larry Hoyle) – DDI 3.1
Transformation Tools• DDI XML Upgrade (Metadata Technology NA) – Open Source, DDI-C
to DDI 3.1, command line transform (Java)• Sledgehammer (Metadata Technology NA)– Free and paid versions,
GUI or batch mode, mines metadata from stats packages and produces set-ups for other stats packages, SQL, other formats. All versions of DDI, SSS. (Java)
• Caelum (Metadata Technology NA) - command-line tool for XSLT reports from DDI metadata, freeware (Java)
• StatTransfer – commercial tool supporting DDI 3.1 extraction from stats packages and spreadsheets
• DDI with R Tool – creates R set-ups from DDI Codebook (will be extended for DDI Lifecycle) – Adrian Dusa
• r2ddi (SOEP) – creates R set-ups from DDI 3.1
Management Tools
• Colectica Repository – Commercial, .NET• Rogatus Repository (DIPF) – WIP, Open Source,
.NET, DDI 3.2• Ariā (Metadata Technology North America) –
Commercial, Java, various DDI versions. Classification management system.
Metadata/Data Dissemination Tools
• Colectica Web – Commercial, .NET, DDI 3.2• Rogatus Repository (DIPF) – Open Source, .NET, DDI 3.2 • SuperCross (Space Time Research) – Commercial, online
tabulation reading DDI 3.1 • Nesstar Server (NSD) – Online tabulation and search, DDI
Codebook• NADA Catalog (IHSN) – Open Source, Java, online catalog tool,
DDI Codebook• Questasy (CentreData) – community development, PHP, online
system for documentation creation and dissemination• DDI on Rails (SOEP) – Open Source, based on Ruby on Rails web
framework
W. Thomas - 2014
RESOURCES
DDI Resources• DDI Alliance Site
– http://www.ddialliance.org– General link to all resources/news– Link to Sourceforge for standards distributions– Controlled Vocabularies
• Tools/Resources Page– http://tools.ddialliance.org– Best place for tools, slides, and resources
• Mailing Lists– www.icpsr.umich.edu/mailman/admin/– All of the lists starting with “DDI” are related to DDI topics
• DDI Users (best place to link into the group)• List for each sub-committee (not all groups are active)
S20 38
DDI Publications
• Best Practices Across the Data Life Cycle– www.ddialliance.org/resources/publications/working/bestpractices
• Use Cases– www.ddialliance.org/resources/publications/working/usecases
• IASSIST Quarterly– www.iassistdata.org/iq/– A special double feature focusing on various projects related to DDI 3
and it’s enhanced features– Articles related to DDI can be found in many issues of the IQ
S20 39
Want to know more on DDI 4.0?
• Minutes and formal products from the Dagstuhl and EDDI sprints are found at the DDI site– http://www.ddialliance.org/ddi-moving-forward-proces
s• A new public wiki has been established to maintain
and organize all working and final documents– https://ddi-alliance.atlassian.net/wiki/pages/viewpage.
action?pageId=491703– Current development platform for content capture– http://lion.ddialliance.org/