Preserving Social Science Research Data Using Fedora Bryan Beecher Inter-university Consortium for Political and Social Research (ICPSR) CNI Fall 2010 Membership Meeting

This is a talk from the Coalition for Networked Information Fall 2010 Member Meeting (CNIfall2010). I talked about our project to use Fedora as archival storage for social science research data and documentation.

Page 1: Beecher cni fall 2010 v4

Preserving Social Science Research Data Using Fedora

Bryan Beecher

Inter-university Consortium for Political and Social Research (ICPSR)

CNI Fall 2010 Membership Meeting

Page 2: Beecher cni fall 2010 v4

ICPSR

• World’s largest social science research data archive
  – Lots of files (millions)
  – Small files (6TB total)

• Long track record of success
  – 50 yrs
  – Trust us
  – Enormous legacy burden

Page 3: Beecher cni fall 2010 v4

ICPSR

• Survey data are our core
  – Low volume of new content compared to natural sciences
  – We curate each item extensively (disclosure, quality, format, usability)

• Strong access orientation
  – Talk like an archive
  – Walk like an archive?

Page 4: Beecher cni fall 2010 v4

Walking the walk

• Good storage container for content and its metadata
• OAIS-compliant
• Generate SIPs and AIPs (and DIPs)
• But…

Page 5: Beecher cni fall 2010 v4

What should we do?

Page 6: Beecher cni fall 2010 v4

Where to begin?

Focus areas
• Preservation
• Going forward
• Reusable

Do not try to include
• Access
• Everything we have

Page 7: Beecher cni fall 2010 v4

A Solution

• Fedora objects
  – Container for stuff we ingest and preserve

• Fedora services
  – To generate AIPs and SIPs

• Tool to generate FOs (Fedora Objects) from existing content and metadata

Page 8: Beecher cni fall 2010 v4

Ingest

• The Motivated Depositor
  – Eager to describe the research data in great detail
  – Uploads complete, machine-readable metadata

Page 9: Beecher cni fall 2010 v4

Ingest (continued)

• The Unmotivated Depositor
  – Uploads a variety of proprietary file formats for documentation and data
  – Leaves the baby on the doorstep

Page 10: Beecher cni fall 2010 v4

Ingest – Nov 2010 deposits

Page 11: Beecher cni fall 2010 v4

Ingest (continued)

• Typical deposit
  – Research data in one of the common stat packages (SAS, SPSS, etc.)
  – Technical documentation in a proprietary format (Word, PDF)
  – A proto-SIP in quasi-OAIS terms
  – Minimal level of metadata regarding how the survey was conducted

Page 12: Beecher cni fall 2010 v4

Ingest container – file level

• Vanilla Fedora Object
  – Will never know what sort of content format to expect
  – Use the RELS-EXT to connect related files
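The RELS-EXT linkage between a file-level object and its parent can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's actual tooling: the helper name `build_rels_ext` and the `demo:` PIDs are hypothetical. The `isPartOf` predicate and its namespace come from Fedora's standard relationship ontology.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Namespace of Fedora's built-in relationship ontology (isPartOf, isMemberOf, ...).
REL_NS = "info:fedora/fedora-system:def/relations-external#"


def build_rels_ext(file_pid: str, parent_pid: str) -> str:
    """Return RELS-EXT RDF/XML stating that file_pid isPartOf parent_pid.

    Hypothetical helper for illustration only.
    """
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("rel", REL_NS)
    rdf = ET.Element(f"{{{RDF_NS}}}RDF")
    # The subject of every RELS-EXT statement is the object that holds the datastream.
    description = ET.SubElement(
        rdf,
        f"{{{RDF_NS}}}Description",
        {f"{{{RDF_NS}}}about": f"info:fedora/{file_pid}"},
    )
    ET.SubElement(
        description,
        f"{{{REL_NS}}}isPartOf",
        {f"{{{RDF_NS}}}resource": f"info:fedora/{parent_pid}"},
    )
    return ET.tostring(rdf, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical PIDs; a real deposit would use the repository's own naming.
    print(build_rels_ext("demo:deposit-1-file-1", "demo:deposit-1"))
```

Because the file-level object carries the pointer, the content datastream itself can hold any format without the object needing to know what it is.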

Page 13: Beecher cni fall 2010 v4

Ingest container – deposit

• Another plain Fedora Object
  – Points to all of the files stored in the file-level objects
  – Relatively little metadata stored for this level of object

Page 14: Beecher cni fall 2010 v4

Ingest container – example

Page 15: Beecher cni fall 2010 v4

Ingest container – example

Page 16: Beecher cni fall 2010 v4

Ingest and the OAIS PDI

• Reference – unique Fedora PID
• Fixity – Fedora-generated checksum
• Provenance – identity of depositor recorded in the DC Datastream
• Context – original file name captured in the content Datastream
• Access Rights – terms of deposit
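The five PDI categories above map naturally onto a small record type. The sketch below is illustrative only: the class name, field names, and `describe` helper are hypothetical, and the SHA-256 digest is computed locally as a stand-in for the checksum that Fedora itself generates in the real workflow.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PreservationDescription:
    """One field per OAIS PDI category (hypothetical structure)."""

    reference: str      # unique Fedora PID
    fixity: str         # checksum over the content
    provenance: str     # identity of the depositor
    context: str        # original file name
    access_rights: str  # terms of deposit


def describe(pid: str, content: bytes, depositor: str,
             original_name: str, terms: str) -> PreservationDescription:
    # Assumption: SHA-256 computed here stands in for the Fedora-generated checksum.
    digest = hashlib.sha256(content).hexdigest()
    return PreservationDescription(
        reference=pid,
        fixity=f"SHA-256:{digest}",
        provenance=depositor,
        context=original_name,
        access_rights=terms,
    )


if __name__ == "__main__":
    # All values below are made up for illustration.
    pdi = describe("demo:deposit-1-file-1", b"example bytes",
                   "depositor@example.org", "survey.sav", "terms of deposit v1")
    print(pdi)
```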

Page 17: Beecher cni fall 2010 v4

Generating OAIS SIPs

• Original content
  – Normalized version too, if applicable
  – What’s normalization in this context?

• Preservation Description Information (PDI)
  – As described previously

• Delivered via SDef/SDep combo

Page 18: Beecher cni fall 2010 v4

Ingest – continued

• Data
  – Disclosure analysis
  – Recoding

• Documentation
  – Corrections
  – Clarifications

• Normalized formats

Page 19: Beecher cni fall 2010 v4

Ingest – finale

• Packaged into a “study”
  – Data, doc, questionnaire, user guide, etc.
  – Normalized formats for preservation
  – Convenient formats for access

Page 20: Beecher cni fall 2010 v4

Ingest – finale

[Diagram: Fedora objects making up a packaged study, each shown with its datastreams]

• PID
  – REPORT (text/plain)
  – objectProperties, DC, RELS-EXT, AUDIT

• icpsr:release-28748-file-3
  – QUESTIONNAIRE (application/pdf)
  – objectProperties, DC, RELS-EXT (isPartOf: release-15868), AUDIT

• icpsr:release-28748-file-1
  – STATA-DICT (text/plain)
  – DATA (text/plain)
  – DDI (text/xml)
  – SAS-SETUPS (text/plain)
  – SPSS-SETUPS (text/plain)
  – STATA-SETUPS (text/plain)
  – objectProperties, DC, RELS-EXT (isPartOf: release-15868), AUDIT

• icpsr:release-28748-file-2
  – CODEBOOK (application/pdf)
  – objectProperties, DC, RELS-EXT (isPartOf: release-15868), AUDIT

Page 21: Beecher cni fall 2010 v4

Generating OAIS AIPs

• For each object (file)
  – Everything from the SIP plus
    • Preservation events
    • Description of the transformation used
    • Preservation commitment
  – Its post-processed version

• Delivered via SDef/SDep combo

Page 22: Beecher cni fall 2010 v4

Example AIP

[Diagram: example AIP, showing the same study objects as the previous diagram plus one additional object]

• PID
  – REPORT (text/plain)
  – objectProperties, DC, RELS-EXT, AUDIT

• icpsr:release-28748-file-3
  – QUESTIONNAIRE (application/pdf)
  – objectProperties, DC, RELS-EXT (isPartOf: release-15868), AUDIT

• icpsr:release-28748-file-1
  – STATA-DICT (text/plain)
  – DATA (text/plain)
  – DDI (text/xml)
  – SAS-SETUPS (text/plain)
  – SPSS-SETUPS (text/plain)
  – STATA-SETUPS (text/plain)
  – objectProperties, DC, RELS-EXT (isPartOf: release-15868), AUDIT

• icpsr:release-28748-file-2
  – CODEBOOK (application/pdf)
  – objectProperties, DC, RELS-EXT (isPartOf: release-15868), AUDIT

• PID
  – objectProperties, DC, RELS-EXT, AUDIT

Page 23: Beecher cni fall 2010 v4

Questions we faced

• Datastreams or relationships?
• What about our XML?
• AIPs or DIPs?
• How to build FOXML?

Page 24: Beecher cni fall 2010 v4

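Datastreams / relationships?

[Diagram: two alternative layouts]

• Separate objects connected by relationships
  – One object (PID) holding CONTENT X, with objectProperties, DC, RELS-EXT, AUDIT
  – One object (PID) holding CONTENT Y, with objectProperties, DC, RELS-EXT, AUDIT

• A single object with multiple datastreams
  – One object (PID) holding both CONTENT X and CONTENT Y, with objectProperties, DC, RELS-EXT, AUDIT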

Page 25: Beecher cni fall 2010 v4

Our XML

• DDI v2
  – Contains lots of the information one might expect to find in the DC

• Strategy
  – Duplicate it
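One way to picture the "duplicate it" strategy is a small crosswalk that copies a few DDI elements into a Dublin Core record. The sketch below is an assumption-laden illustration, not the project's code: the `ddi_to_dc` helper and the three-entry mapping are hypothetical, and the sample input omits the XML namespace that real DDI files declare.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"

# Illustrative subset: DDI 2 element path (relative to codeBook) -> DC element.
DDI_TO_DC = {
    "stdyDscr/citation/titlStmt/titl": "title",
    "stdyDscr/citation/rspStmt/AuthEnty": "creator",
    "stdyDscr/stdyInfo/abstract": "description",
}


def ddi_to_dc(ddi_xml: str) -> str:
    """Copy selected DDI fields into an oai_dc record (hypothetical helper).

    Assumes an un-namespaced DDI document for simplicity.
    """
    ET.register_namespace("dc", DC_NS)
    ET.register_namespace("oai_dc", OAI_DC_NS)
    codebook = ET.fromstring(ddi_xml)
    dc_root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for path, dc_name in DDI_TO_DC.items():
        for node in codebook.findall(path):
            text = "".join(node.itertext()).strip()
            if text:  # skip empty elements rather than emit blank DC fields
                ET.SubElement(dc_root, f"{{{DC_NS}}}{dc_name}").text = text
    return ET.tostring(dc_root, encoding="unicode")


if __name__ == "__main__":
    # Made-up sample record for illustration.
    sample = (
        "<codeBook><stdyDscr><citation>"
        "<titlStmt><titl>Example Survey</titl></titlStmt>"
        "<rspStmt><AuthEnty>Doe, Jane</AuthEnty></rspStmt>"
        "</citation></stdyDscr></codeBook>"
    )
    print(ddi_to_dc(sample))
```

The DDI stays the richer source of record; the DC datastream simply repeats the overlapping fields so that generic Fedora tooling can find them.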

Page 26: Beecher cni fall 2010 v4

AIPs or DIPs

• Lots of copies
• Destination
  – Archival Storage remote location
  – Repository for ingest

Page 27: Beecher cni fall 2010 v4

Building FOXML

• Source
  – Database
  – DDI XML

• Re-usable tool
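A re-usable FOXML builder might look roughly like the following. This is a simplified sketch, not the tool described in the talk: the `build_foxml` function, the shape of the `record` dictionary (imagined as the result of a database query), the `demo:` PID, and the example URL are all hypothetical, and the output has not been validated against the FOXML 1.1 schema.

```python
import xml.etree.ElementTree as ET

FOXML_NS = "info:fedora/fedora-system:def/foxml#"
MODEL_NS = "info:fedora/fedora-system:def/model#"


def _q(tag: str) -> str:
    """Qualify a tag name with the FOXML namespace."""
    return f"{{{FOXML_NS}}}{tag}"


def build_foxml(record: dict) -> str:
    """Build a minimal FOXML document from a plain dictionary.

    Expected keys (assumed for this sketch): pid, label, and datastreams,
    where each datastream has id, label, mimetype, and url.
    """
    ET.register_namespace("foxml", FOXML_NS)
    obj = ET.Element(_q("digitalObject"), {"VERSION": "1.1", "PID": record["pid"]})

    props = ET.SubElement(obj, _q("objectProperties"))
    ET.SubElement(props, _q("property"),
                  {"NAME": MODEL_NS + "state", "VALUE": "Active"})
    ET.SubElement(props, _q("property"),
                  {"NAME": MODEL_NS + "label", "VALUE": record["label"]})

    for ds in record["datastreams"]:
        # CONTROL_GROUP "M" asks the repository to copy and manage the content.
        stream = ET.SubElement(obj, _q("datastream"), {
            "ID": ds["id"], "STATE": "A",
            "CONTROL_GROUP": "M", "VERSIONABLE": "true",
        })
        version = ET.SubElement(stream, _q("datastreamVersion"), {
            "ID": ds["id"] + ".0", "LABEL": ds["label"], "MIMETYPE": ds["mimetype"],
        })
        ET.SubElement(version, _q("contentLocation"),
                      {"TYPE": "URL", "REF": ds["url"]})
    return ET.tostring(obj, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical record; a real one would come from the database or DDI XML.
    print(build_foxml({
        "pid": "demo:study-1-file-2",
        "label": "Codebook",
        "datastreams": [{
            "id": "CODEBOOK", "label": "Codebook",
            "mimetype": "application/pdf",
            "url": "http://example.org/codebook.pdf",
        }],
    }))
```

Keeping the builder driven by a plain dictionary is what would make it re-usable: the same function can be fed from a database row, a parsed DDI file, or a test fixture.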

Page 28: Beecher cni fall 2010 v4

Special Thanks

The Team
• Peggy Overcashier
• Nathan Adams
• Nancy McGovern
• Mary Vardigan

The Funder
• National Science Foundation Award 0958382
• INTEROP EAGER program