UC3 Summer Webinar Series. An Introduction to the Merritt Curation Repository. University of California Curation Center Team, California Digital Library. June 9, 2011.
An Introduction to the Merritt Curation Repository
University of California Curation Center Team
California Digital Library
June 9, 2011
UC3 Summer Webinar Series
First, a word about the webinar series…
• A forum for timely topics of interest to the UC community
– Highlighting projects, services, and developments in the areas of digital preservation, web archiving, and data curation
– Intended to raise awareness of issues, and provide information on useful resources and services available to the UC community
– 2nd and 4th Thursday of the month, and as scheduled, featuring UC3 staff and UC librarians, content managers, and technologists
Teleconference +1 (866) 740-1260, access code 9879016#
Webconference http://bit.ly/jdjMAP
First, a word about the webinar series…
• Some logistics…
– Participant phones will be muted during the formal presentation, but we will be monitoring the online chat
– Slides, Q & A, and web and voice recordings will be posted after each presentation
– Schedule available at http://www.cdlib.org/uc3/uc3webinars.html
– Please suggest additional [email protected]
– Take the short survey http://www.surveymonkey.com/s/XSGWP8R
Now on with the show…
• Today’s topic is an introduction to the Merritt curation repository
– Who is it for?
– What can it do?
– Why use it?
– What does it cost?
– Next steps?
– Q & A
What keeps you up at night?
Are there standards or best practices I should be aware of?
How much will it cost?
How can I transfer my content to an appropriate curation environment?
How do I know my content is safe?
What’s the best strategy to ensure permanent availability?
Do I need to create new derivatives just for preservation purposes?
How can I get a persistent reference to my content?
What if my content needs to evolve over time?
Can I control who can see my content?
I have a good discovery platform; how can I add preservation services?
“There’s an app for that”
Automatic replication and high-availability redundancy
Periodic fixity audit
Simple submission UI/API
METS “feeder” duplicates existing DPR workflow
Model free: no packaging, format, or metadata requirements
Strongly versioned
Integration with EZID and DataCite
Curator-defined access control rules
Modular micro-services “toolkit”
UC3 consultation
Storage at $1.04/GB/year
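A periodic fixity audit, as listed above, comes down to recomputing each file's digest and comparing it with the value recorded at ingest. The sketch below only illustrates that idea; it is not Merritt's implementation. The function names (`file_digest`, `audit`), the choice of SHA-256, and the in-memory inventory are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, algorithm: str = "sha256") -> str:
    """Return the hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit(inventory: dict[str, str], root: Path) -> dict[str, str]:
    """Compare digests recorded at ingest with freshly computed ones.

    `inventory` (a hypothetical structure) maps a file name relative to
    `root` to its recorded digest. Each file is reported as "ok",
    "corrupt", or "missing".
    """
    report = {}
    for name, expected in inventory.items():
        path = root / name
        if not path.is_file():
            report[name] = "missing"
        elif file_digest(path) == expected:
            report[name] = "ok"
        else:
            report[name] = "corrupt"
    return report
```

Run on a schedule, any "corrupt" or "missing" entry would be the trigger for repairing the file from a replica.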
Merritt repository
• Merritt is available for use by all members of the UC community
– Libraries/archives/museums
– ORU/MRUs
– Faculty/staff
• Centrally hosted by UC3/CDL on behalf of the UC community
– Economies of scale
– Shared experience and expertise
Mediated through campus libraries
Modes of use: dark archive
• Pro-active preservation, but no expectation of direct end user access
– Legacy DPR content contributed by campus libraries
– Cultural heritage texts, master images, sound, moving image, data sets
– All DPR content will be automatically migrated to Merritt
Modes of use: bright archive
• Provide preservation and end user access
– NIH Healthy Pathways project on bio-demographics
• Multi-institutional: UC Davis, University of Colorado, University of Virginia, Syddansk University (Denmark)
• Need to restrict access to project partners initially, with eventual public access
Modes of use: bright archive
• Content discovery: search
Modes of use: bright archive
• Content discovery: browse
Modes of use: preservation “back end”
• Preservation only; content discovery/delivery provided by well-known external systems
– Using direct hooks into Merritt to retrieve content
– eScholarship: open access publishing
– Open Context: archaeological data publishing
– Investigating integration with Islandora/Drupal and Alfresco
Modes of use: distributed data grids
• DataONE “Enable new science and knowledge creation through universal access to data about life on earth and the environment that sustains it”
More information
• Online help http://merritt.cdlib.org/help
• FAQ http://merritt.cdlib.org/docs/merritt_handout.pdf
• User’s guide http://merritt.cdlib.org/docs/merritt_user_guide.pdf
• UC3 contact http://www.cdlib.org/uc3/[email protected]
Merritt cost model
• UC3 provides technical infrastructure, data center hosting, staff, monitoring, maintenance, enhancements, help, outreach, consultation, etc.
• Contributors are charged only for storage used, at the UC3 recovery rate of $1.04/GB/year
• Developing an “endowment” model: Pay once, preserve forever
• Will soon extend model for non-UC contributors
How does this compare?
• Cost of a physical book in RLF †: $4.62/year
• Cost of a digital book in HathiTrust ‡: $0.15/year
• Cost of a digital book in Merritt: $0.06/year
† Gary Lawrence (2007), Internal analysis, CDL.
‡ Paul Courant and Matthew Nielsen (2010), On the cost of keeping a book, HathiTrust.
Average collection sizes and costs
Collection        Objects   Size        Annual cost
CA DOE reports      8,000    12.0 GB    $12.48
Cal Cultures          420    65.6 GB    $68.22
eScholarship       46,425   118.6 GB    $123.34
A “cost calculator” spreadsheet is available at http://www.cdlib.org/uc3/docs/Merritt-cost-calculator-v3.xlsx
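The annual costs in this table follow directly from the storage rate of $1.04/GB/year quoted in the cost model. As a rough sketch of the arithmetic (the function name `annual_cost` and half-up rounding to the cent are assumptions; the cost calculator spreadsheet may round differently):

```python
from decimal import Decimal, ROUND_HALF_UP

RATE_PER_GB_YEAR = Decimal("1.04")  # UC3 recovery rate quoted in the slides


def annual_cost(size_gb: str) -> Decimal:
    """Annual storage cost in dollars for `size_gb` gigabytes of content."""
    cost = Decimal(size_gb) * RATE_PER_GB_YEAR
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


collections = [
    ("CA DOE reports", "12.0"),
    ("Cal Cultures", "65.6"),
    ("eScholarship", "118.6"),
]
for name, size in collections:
    print(f"{name}: ${annual_cost(size)}")
```

With the sizes listed above this reproduces the three annual costs in the table: $12.48, $68.22, and $123.34.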
Average ETD size and cost
Campus             ETD titles   Size       Annual cost
Berkeley                  797   12.4 GB    $12.88
Davis                     837   13.0 GB    $13.52
Irvine                    390    6.1 GB    $6.30
Los Angeles               720   11.2 GB    $11.63
Riverside                 192    2.9 GB    $3.10
San Diego                 558    8.7 GB    $9.02
San Francisco *           560    8.7 GB    $9.05
Santa Barbara             325    5.0 GB    $5.25
Santa Cruz                155    2.4 GB    $2.50
Based on 2009 holdings in ProQuest
* UCSF based on total ETD holdings in Merritt
Average research data size and cost
• Almost 50% of all research data is less than 1 GB
Source: Science 331:6018 (February 11, 2011): 692-693 <DOI: 10.1126/science.331.6018.692>
Size             Percentage   Annual cost
< 1 GB           48.3 %       < $1.04
1 – 100 GB       32.0 %       $1.04 – 104.00
100 GB – 1 TB    12.1 %       $104.00 – 1,040.00
> 1 TB            7.6 %       > $1,040.00
Next steps
• UC3 is working with campus partners to determine ongoing development and collection priorities
Replication
IdM/Authn/Authz
Ingest, Access, Inventory, Queuing
Storage and Identity
Technology watch
Metadata standards
Policy and business model
Data management guidelines
Object and collection modeling
New content acquisition
Next steps
In production
• Model-free objects
• Submission via UI and API
• Persistent identifiers
• Format identification
• Version provenance
• Automated replication
• Automated fixity audit
• Role-based access control
• Collections
• Semantic index and search
• Object/version/file download
In progress
• Simplified update
• Enhanced characterization (JHOVE2)
• Faceted search and browse (XTF)
• CMS/DAMS-like function (Islandora)
In planning
• Simplified batch
• UCTrust integration
• Linked data
• Transformation
• Notification
• Annotation
• Support for NGTS/DLSTF recommendations
We welcome your feedback on needs and priorities!
http://www.cdlib.org/uc3/[email protected]
Simplified update
• Variant form of object update requiring the submission of only the changed components
• Client-side tools to simplify the creation of batch manifests
#%checkm_0.7
#%profile | http://uc3.cdlib.org/registry/ingest/mani
#%prefix | mrt: | http://merritt.cdlib.org/terms#
#%prefix | nfo: | http://www.semanticdesktop.org/onto
#%fields | nfo:fileUrl | nfo:hashAlgorithm | nfo:hash
http://merritt.cdlib.org/samples/goldenDragon.jpg | m
http://merritt.cdlib.org/samples/tumbleBug.jpg | md5
http://merritt.cdlib.org/samples/generalDrapery.jpg |
http://merritt.cdlib.org/samples/generalDrapery.jpg |
#%eof
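To illustrate what submitting "only the changed components" could look like on the client side, the sketch below compares MD5 digests for the previous and current versions of an object and formats manifest rows for just the files that were added or modified. It follows the `nfo:fileUrl | nfo:hashAlgorithm | nfo:hash` field order shown in the manifest above. The helper names, the digest dictionaries, and the example.org URLs are hypothetical, and this is not the Merritt client tooling itself.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Hex MD5 digest of a byte string (stands in for hashing a real file)."""
    return hashlib.md5(data).hexdigest()


def changed_components(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    """URLs that are new, or whose digest differs from the previous version."""
    return sorted(url for url, digest in current.items() if previous.get(url) != digest)


def manifest_rows(current: dict[str, str], urls: list[str], algorithm: str = "md5") -> list[str]:
    """One 'fileUrl | hashAlgorithm | hash' row per URL."""
    return [f"{url} | {algorithm} | {current[url]}" for url in urls]


# Hypothetical object: a.jpg is unchanged, b.jpg is modified, c.jpg is new.
previous = {
    "http://example.org/a.jpg": md5_hex(b"a-v1"),
    "http://example.org/b.jpg": md5_hex(b"b-v1"),
}
current = {
    "http://example.org/a.jpg": md5_hex(b"a-v1"),
    "http://example.org/b.jpg": md5_hex(b"b-v2"),
    "http://example.org/c.jpg": md5_hex(b"c-v1"),
}

rows = manifest_rows(current, changed_components(previous, current))
print("\n".join(rows))
```

Only the rows for b.jpg and c.jpg are emitted. The manifest header lines and the handling of deleted files are left out, since the slides do not specify them.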
Enhanced characterization
• JHOVE2 next-generation framework for format-aware characterization http://jhove2.org/
– Automated extraction and inference of extensive technical metadata significant for preservation analysis and planning
"Module": {
  "scope": "ICCModule",
  "Header": {
    "scope": "ICCHeader",
    "ProfileSize": { "unit": "byte", "value": 60960 },
    "ProfileVersionNumber": "4.2.0.0",
    "ProfileDeviceClass_raw": "spac",
    "ProfileDeviceClass_descriptive": "ColorSpace Conversion profile",
    "ColourSpace_raw": "RGB ",
    "ColourSpace_descriptive": "rgbData",
    "ProfileConnectionSpace_raw": "Lab ",
    "ProfileConnectionSpace_descriptive": "labData"
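Once characterization output like the excerpt above is available as complete JSON, a script can pull out the properties that matter for preservation planning. The sketch below assumes the excerpt has been wrapped in an enclosing object with its braces closed (the slide shows only a fragment) and uses a subset of its properties. `find_property` is a hypothetical helper, not part of JHOVE2.

```python
import json

# A closed-off subset of the excerpt; the wrapping braces are an assumption.
SAMPLE = """
{"Module": {"scope": "ICCModule",
  "Header": {"scope": "ICCHeader",
    "ProfileSize": {"unit": "byte", "value": 60960},
    "ProfileVersionNumber": "4.2.0.0",
    "ColourSpace_raw": "RGB ",
    "ColourSpace_descriptive": "rgbData"}}}
"""


def find_property(node, name):
    """Depth-first search for the first value stored under key `name`."""
    if isinstance(node, dict):
        if name in node:
            return node[name]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_property(child, name)
        if found is not None:
            return found
    return None


doc = json.loads(SAMPLE)
print(find_property(doc, "ProfileVersionNumber"))
print(find_property(doc, "ProfileSize")["value"])
```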
Enhanced discovery via XTF
• eXtensible Text Framework http://xtf.cdlib.org/
– CDL developed/supported open source discovery platform
– Robust, scalable faceted search and browse
CMS/DAMS-like function
• Many campuses are looking for CMS/DAMS solutions
• Investigating integration with Islandora to provide a Drupal CMS/DAMS front-end to Merritt
http://islandora.ca/ http://drupal.org/
Questions?
Upcoming webinars
Date/time                       Topic
Wednesday, June 15, 12:30 pm    Data Sharing by Scientists: Practices and Perceptions (Carol Tenopir, Univ. Tennessee; Mike Frame, USGS)
Thursday, June 30, 2:00 pm      The Data Management Planning Tool (DMP Tool) (Trisha Cruse, UC3)
Thursday, July 14, 2:00 pm      Data as Publication (John Kunze, UC3; Catherine Mitchell, CDL Publishing Program)
Thursday, July 28, 2:00 pm      Merritt: Depositing Content and Providing Access
Thursday, August 11, 2:00 pm    DCXL (Data Curation Excel)
http://www.cdlib.org/uc3/uc3webinars.html
Please take the webinar survey http://www.surveymonkey.com/s/XSGWP8R
For more information
UC Curation Center
http://www.cdlib.org/uc3
http://www.cdlib.org/uc3/[email protected]
Stephen Abrams, Lisa Colvin, Patricia Cruse, Scott Fisher, Erik Hetzner, Greg Janée, John Kunze, Margaret Low, David Loy, Mark Reyes, Tracy Seneca, Joan Starr, Marisa Strong, Perry Willett
UC3 webinar series
http://www.cdlib.org/uc3/uc3webinars.html
Merritt repository
http://merritt.cdlib.org/
http://merritt.cdlib.org/help
http://merritt.cdlib.org/docs/merritt_handout.pdf
http://merritt.cdlib.org/docs/merritt_user_guide.pdf