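Considerations for Data Preservation across the Information Life Cycle - Internal Seminar of the National Digital Archiving Strategy Research TF - 2010. 4. 1. 정영임, Korea Institute of Science and Technology Information (KISTI), Information Distribution Division, Knowledge Base Office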

20100401 정영임 da 전략 tft_0330




Table of Contents

1. Digital Archiving in the Framework of Information Life Cycle Management

2. Creation

3. Acquisition

4. Cataloging/Identification

5. Storage

6. Preservation

7. Access


Digital Archiving in the Framework of Information Life Cycle Management

Digital archiving framework

– Must be considered at every stage of information life cycle management

– Information life cycle

• Creation

• Acquisition

• Cataloging/Identification

• Storage

• Preservation

• Access


Creation

Creation

– Defined as the act of producing the information product, in the broadest sense

– Should be regarded as the starting point of long-term archiving and preservation

Suggestion that creators provide a preservation indicator

– U.S. Department of Agriculture’s Digital Publications Preservation Steering Committee

Establishment of guidelines for creators

– Oak Ridge National Laboratory, USA

• A Guide To Record Series Supporting Epidemiological Studies Conducted for the Department of Energy

• Limits on software

• Format and layout of the documents


Creation

Adoption of Standard Descriptive Languages

– Standards groups incorporate XML and RDF architectures

Attachment of Metadata to Digital Content


Acquisition and Collection Development

Three main aspects to acquisition of digital objects

– Collection policies

– Gathering methods

– Intellectual Property Concerns


Establishment of Collection Policies

Collection policies

– Selecting What to Archive

• Purpose

– For Dark Archiving: Back issue

– For Light Archiving: Current issue

• Criteria

– Ease of content acquisition

– Quality of content

– Utilization

– On-going access fee

• Content Type Coverage: E-journals/R&D Reports/Patents/Scientific Data

– Determining Extent

– Archiving Links

– Refreshing the Archived Contents


Considerations on Gathering Method

Gathering methods

– Hand selection

• Value Judgment and Retention Scheduling (Edinburgh University Library)

– Not preserved

– Preserved for defined period

– Preserved indefinitely

– Automatic selection

• National Library of Sweden: Automatic acquisition without making value judgments (priority: periodicals, static documents, HTML pages >> conferences, usenet groups, ftp archives)

• EVA project: Time limits established to avoid overloading
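The three retention outcomes listed under hand selection can be sketched as a small data model. This is only an illustrative sketch: the names `Retention`, `RetentionDecision`, and `retention_ends`, and the `years` field, are assumptions made for this example and do not come from the Edinburgh University Library scheme.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Retention(Enum):
    """The three retention categories listed for hand selection."""
    NOT_PRESERVED = "not preserved"
    DEFINED_PERIOD = "preserved for defined period"
    INDEFINITE = "preserved indefinitely"


@dataclass(frozen=True)
class RetentionDecision:
    category: Retention
    years: Optional[int] = None  # hypothetical field, used only for DEFINED_PERIOD


def retention_ends(decision: RetentionDecision, acquired: date) -> Optional[date]:
    """Return the end of a defined retention period, or None when no end date applies."""
    if decision.category is not Retention.DEFINED_PERIOD:
        return None
    if decision.years is None or decision.years <= 0:
        raise ValueError("a defined-period decision needs a positive number of years")
    try:
        return acquired.replace(year=acquired.year + decision.years)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; fall back to 28 February.
        return acquired.replace(year=acquired.year + decision.years, day=28)
```

Recording the decision as data rather than prose lets a repository list the objects whose defined period is about to end.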


Considerations on Intellectual Property Concerns

Reliance on Legislation

– Freedom of Information Act 2001

• The public may have unrestricted access to certain records. (Consider which categories of information may need to be viewed by the public; these records need to remain accessible at all times.)

– In the absence of international digital deposit legislation, practice varies

• PANDORA project seeks permission from the copyright owner

• Swedish and Finnish national library projects do not contact the owners

Making Agreement with Content Providers

– E-journal: Publishers or academic associations

• CLIR/DLF draft model license, NESLi2 Standard license model

• Agreement of Cornell University with publishers

– Government document: Open to public

– Scientific data: individual creators or data centers

• Arts and Humanities Data Service provides information on what is needed for a digital archive and what creators are likely to be willing to deposit


Agreement of Cornell University with Publishers

Topics identified in the agreement (Thomas and Kroch, 2000)

– The general responsibilities of the publishers and Cornell

– Characteristics of the data, accompanying metadata, and any additional documentation that are to be deposited

– Guidelines on transmission methods and media for deposit

– Procedures for the deposit

– Procedures and protocols Cornell will use to verify the arrival and completeness of the data

– Rights of the depositing organizations to audit the repository

– The respective roles, responsibilities, and rights of Cornell and the data producers with regard to the data

– Articulation of Cornell's responsibilities and capabilities with regard to the accessioning, description, management, and even transformation of the deposited data

– Access policies for users of the repository, and how they may vary over time

– Conditions on the use of the data, and again how they may vary over time

– Fees (if any) associated with the deposit

– Cornell's ability to share the data with partners to create an agreed-upon level of redundancy

– Clarification of issues surrounding copyright retained by authors


Identification and Cataloging

Identification

– Provision of a unique key for finding the digital object and linking it to other related objects

Cataloging in the form of metadata

– Support for organization, access and curation


Persistent Identification

Problems in using URL as Identifier

– Use of the server as a location identifier can result in a lack of persistence over time, both for the source object and for any linked objects

– URLs nevertheless remain in continuous use

New approaches on persistent identification

– OCLC: PURLs

– ACS: Digital Object Identifier (DOI), MN (Manuscript Number)

– DTIC: Handle® system

– AAS: Bibcode, PubRef numbers
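What these approaches share is a level of indirection: citations carry a stable identifier, and a resolver maps it to wherever the object currently lives. The sketch below illustrates only that idea with an in-memory table. It is not the API of PURL, DOI, or the Handle system, and the class name, identifier string, and example.org URLs are all hypothetical.

```python
from typing import Dict


class IdentifierResolver:
    """Maps persistent identifiers to the current location of each object."""

    def __init__(self) -> None:
        self._locations: Dict[str, str] = {}

    def register(self, identifier: str, url: str) -> None:
        if identifier in self._locations:
            raise KeyError(f"{identifier} is already registered")
        self._locations[identifier] = url

    def relocate(self, identifier: str, new_url: str) -> None:
        # Only the table entry changes; the identifier cited elsewhere stays the same.
        if identifier not in self._locations:
            raise KeyError(f"{identifier} is not registered")
        self._locations[identifier] = new_url

    def resolve(self, identifier: str) -> str:
        return self._locations[identifier]


resolver = IdentifierResolver()
resolver.register("example:report/0001", "http://old-server.example.org/reports/0001.pdf")
resolver.relocate("example:report/0001", "http://archive.example.org/objects/0001.pdf")
current_location = resolver.resolve("example:report/0001")
```

After the server move, links that cite `example:report/0001` still resolve, whereas links that cite the old URL directly would break.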


Creation of Metadata at Cataloging Stage (1/3)

Creation Method of Metadata

– Manual creation of metadata

– Automatic generation of metadata

• A project by US Environmental Protection Agency

• Defense Information Technology Testbed project


Creation of Metadata at Cataloging Stage (2/3)

Formats of Descriptive Metadata

– E-journal

• Full MARC cataloging

– Traditional library cataloging standards

– NLA’s PANDORA Archive

• Current development of descriptive metadata standards

– MARCXML, MODS (Metadata Object Description Schema)

– Web-based resources

• Dublin Core-like format

• EVA project

– Non-textual data

• Identification of metadata elements needed for non-textual data types such as images, video, multimedia and others

– Z39.87 NISO/AIIM Technical metadata for digital still images

– AES X089 core audio metadata
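As an illustration of the Dublin Core-like format mentioned for web-based resources, the sketch below serializes a few Dublin Core elements to XML with Python's standard library. The namespace URI and the element names (`title`, `creator`, `date`, `format`, `identifier`) are from the Dublin Core Metadata Element Set. The `record` wrapper element, the function name, and all field values are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from typing import Dict

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)  # serialize elements with the conventional "dc" prefix


def dublin_core_record(fields: Dict[str, str]) -> ET.Element:
    """Build a flat record with one Dublin Core element per field."""
    record = ET.Element("record")  # hypothetical wrapper element
    for name, value in fields.items():
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return record


# Hypothetical values for a harvested web page.
record = dublin_core_record({
    "title": "Example harvested web page",
    "creator": "Example Organization",
    "date": "2010-04-01",
    "format": "text/html",
    "identifier": "example:web/0001",
})
xml_text = ET.tostring(record, encoding="unicode")
```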


Creation of Metadata at Cataloging Stage (3/3)

Management of Heterogeneous Metadata Format

– Translation between various metadata formats

– Key to the development of networked, heterogeneous archives

– Adoption of packaging metadata standards

• Open Archival Information System (OAIS) Reference Model

– Developed by the Consultative Committee for Space Data Systems (CCSDS) and published as ISO 14721

– Encapsulates specific metadata as needed for each object type in a consistent data model

• Metadata Encoding and Transmission Standard (METS)

– Produced by the Library of Congress standards office and the Digital Library Federation

– Provides framework for holding all types of metadata for digital object

• Others

– MPEG-21 Digital Item Declaration Language

– IMS Global Learning Consortium Content Packaging Standards

– Sharable Content Object Reference Model (SCORM)

– CCSDS XML Packaging scheme
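Translation between metadata formats is commonly handled with a crosswalk, a table that maps each field of one schema to a field of another. The sketch below is a minimal illustration under stated assumptions: the local field names and the `LOCAL_TO_DC` table are hypothetical and do not reproduce any published crosswalk. Only the target names are real Dublin Core elements.

```python
from typing import Dict, List, Tuple

# Hypothetical crosswalk from a local cataloging schema to Dublin Core element names.
LOCAL_TO_DC: Dict[str, str] = {
    "main_title": "title",
    "author": "creator",
    "issued": "date",
    "file_type": "format",
}


def crosswalk(record: Dict[str, str], mapping: Dict[str, str]) -> Tuple[Dict[str, str], List[str]]:
    """Translate a record field by field and report the fields that have no mapping."""
    translated: Dict[str, str] = {}
    unmapped: List[str] = []
    for field, value in record.items():
        target = mapping.get(field)
        if target is None:
            unmapped.append(field)  # keep track of information the target format cannot hold
        else:
            translated[target] = value
    return translated, unmapped


translated, unmapped = crosswalk(
    {"main_title": "Annual report", "author": "Example Lab", "shelf_mark": "A-17"},
    LOCAL_TO_DC,
)
```

Returning the unmapped fields explicitly makes the information lost in translation visible, which matters when the translated record becomes the archival copy.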


Development of Technical Model for Storage

Recommendations for developing a technical model for the repository (Cornell University)

– Establishing a baseline of e-journal software and file format needs

– Specifying the archival repository

– Specifying monitoring tools that will flag documents within the repository that require migration

– Specifying a baseline hardware and software infrastructure to house the repository

– Exploring the need and implementation models for redundancy in the repository


Issues on Changing Storage Media

Problem of changing storage media

– Block size, tape size and tape drive mechanism have changed over time.

Common Solution

– Data migration to new storage systems

• High cost and imperfect transfer remain issues.

• Check/validation algorithms are extremely important.

• Manual checking is still necessary.

• Atmospheric Radiation Monitoring Center plans to migrate to new storage systems every 4-5 years

– Each data migration will take 6-12 months
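One common form of check/validation algorithm is a fixity check: compute a checksum before the copy, compute it again on the new medium, and refuse the migration if the two differ. The sketch below uses SHA-256 from Python's standard library. The function names and file names are assumptions for this example, and the slides do not say which algorithm any of the cited centers uses.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so that large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate_with_check(source: Path, target: Path) -> str:
    """Copy source to target and verify that the copy is bit-identical."""
    before = sha256_of(source)
    shutil.copyfile(source, target)
    after = sha256_of(target)
    if before != after:
        raise IOError(f"checksum mismatch after migrating {source} to {target}")
    return after


# Usage with throwaway files standing in for the old and new storage media.
with tempfile.TemporaryDirectory() as workdir:
    old = Path(workdir) / "old_medium.dat"
    new = Path(workdir) / "new_medium.dat"
    old.write_bytes(b"ACGT" * 1000)
    checksum = migrate_with_check(old, new)
    migrated_ok = new.read_bytes() == old.read_bytes()
```

A matching checksum shows only that the bits survived the transfer. It says nothing about whether the content is still interpretable, which is one reason manual checking remains necessary.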


Issues on Terabytes of Data Storage

Problem of dealing with large-scale data

– Extensive validation routines are needed to ensure the quality of the information as it is migrated

• NCBI has 30 Ph.D.s reviewing the information manually, even after it has passed a variety of validation algorithms

• Comparable cost is incurred for

– Corrections and additions to particular records

– Maintenance of a history of changes

– Approval by the owner of all changes controlled by NCBI

Common Solution

– Large-scale data can be stored in different file formats

• Biological sequence data is held in simple ASCII files for preservation purposes.

• Data in a structured database is provided for searching, reporting and maintenance

– Extensive tasks can be transitioned to a non-profit consortium

• Protein Data Bank: Research Collaboratory for Structural Bioinformatics


Preservation

Long-term preservation

– No common agreement on the definition of long-term preservation

Main aspects on preservation

– Selection of digital preservation strategies/technologies

– Cycle for hardware/software migration

• No specific investigation of the cycle for HW/SW migration has been done.

• Depending on the particular technologies and subject disciplines, it can vary from 2 to 10 years.

– Preservation of the “look and feel” of digital contents


Digital Preservation Strategies

Bitstream Copying

Refreshing

Durable/Persistent Media

Technology Preservation

Digital Archaeology

Analog Backups

Migration (SW, HW migration)

Replication

Reliance on Standards

Normalization

Canonicalization

Emulation

Encapsulation

Universal Virtual Computer


Hardware and Software Migration

Problems on Migration

– Migration is not guaranteed to work for all data types

– Migration of information products that use sophisticated software features is unreliable

– Generally, there is no backward compatibility, and where it is possible, the result certainly loses some integrity.

Emulation as an alternative to migration

– Encapsulates the behavior of the hardware/software with the objects

• MS Word 2000 document with metadata indicating how to reconstruct the document at the engineering level

– Creates an emulation registry identifying the HW/SW environment and providing information on how to recreate the environment
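An emulation registry of this kind can be pictured as a lookup from a format to the environment needed to render it. The sketch below is a rough illustration only: the `Environment` fields, the registry key, and the entry's values are assumptions, not a description of any existing registry.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Environment:
    hardware: str
    operating_system: str
    application: str
    rebuild_notes: str  # where to find instructions for recreating the environment


# Hypothetical registry with a single illustrative entry.
EMULATION_REGISTRY: Dict[str, Environment] = {
    "ms-word-2000": Environment(
        hardware="x86 PC",
        operating_system="Windows 98",
        application="Microsoft Word 2000",
        rebuild_notes="Hypothetical pointer to installation media and setup steps.",
    ),
}


def environment_for(format_key: str) -> Environment:
    """Look up the environment registered for a format, failing loudly if there is none."""
    try:
        return EMULATION_REGISTRY[format_key]
    except KeyError:
        raise LookupError(f"no emulation environment registered for {format_key!r}") from None
```

Calling `environment_for("ms-word-2000")` returns the registered environment, while an unregistered format raises an error, which flags objects that can no longer be rendered.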


Advantages and Disadvantages of Preservation Strategies


Selection of Preservation Strategies

A schematic diagram for selection of preservation techniques of digital information (Lee et al, 2002)


Preservation of the Look and Feel

Format of materials

– In order to save the “look and feel” of material

• TIFF

– The most prevalent format among organizations converting paper back files

» e.g., JSTOR

– Does not allow embedded references to be active hyperlinks

• SGML/HTML

– Used by many large publishers after years of converting publication systems from proprietary formats to SGML

– American Astronomical Society (AAS)

• PDF

– The most prevalent format for purely electronic documents, used for both formal publications and grey literature

– National Library of Sweden

– Concerns remain for long-term preservation

» It may not be accepted as a legal depository format because of its proprietary nature


Normalization vs. Native Formats

Normalization

– Process of converting the native format to a standard format

• AAS, ACS transform the incoming file into SGML-tagged ASCII format

– The electronic master copy can serve as the robust electronic archival copy.

– A well-tagged copy can be updated periodically, at very little cost.

– It takes advantage of advances in both technology and standards.

» Content remains unchanged, but the public electronic version can be updated to remain compatible with browsers and other access technology

– Examples of data normalization from the data community

• NASA Distributed Active Archive Centers

– Transform incoming satellite and ground monitoring information into the standard Common Data Format

• UK’s National Digital Archive of Datasets

– Transforms the native format into one of its own devising

• Normalized formats are considered to be the archival versions

– Intellectual property question


Reliance on Standards

Emphasis on Standards

– DOE OSTI

• Limited the number of acceptable input formats

• Text in SGML (and its relatives HTML and XML), PDF, WordPerfect and Word

• Images in TIFF Group 4 and PDF Image


Preservation Strategies Used in Major Projects

Project abbreviations:

CSI: CISTI Csi; ECO: OCLC Electronic Collections Online; EJO: OhioLINK Electronic Journal Center; KB: KB e-Depot; KOP: Kopal DDB; LA: LOCKSS Alliance; LANL: Los Alamos National Laboratory Research Library; NLA: National Library of Australia PANDORA; OSP: Ontario Scholars Portal; PMC: PubMed Central; PORT: Portico

Issues on Access

Access Mechanisms

– Access and display mechanisms

• Providing access

• Restricting access

Rights Management and Security Requirements

– Security and version control

– Creation of metadata to manage encryption, watermarks, and digital signatures


Access Mechanisms

Providing Access

– NLM’s Profiles in Science

• Creates an electronic archive of the photographs, text, video, etc.

• The electronic archive is used to create new access versions as access mechanisms change

– Providing access technologies

• Super Distribution

• Value-chain support

Restricting Access

– Usage rule

– Persistent protection


Access

Rights Management and Security Requirements

– Most difficult access issues for digital archiving

– Security and version control impact digital archiving

• Rights management includes providing or restricting access as appropriate

• Content protection technologies

– Contents Encryption

– Trusted Environment

– Metadata for managing encryption, watermarks, and digital signatures needs to be created.


References

CLIR, 2002. The State of Digital Preservation: An International Perspective [online] [cited 2009-07-23]

Hodge, 2000. Best Practices for Digital Archiving: An Information Life Cycle Approach, D-Lib Magazine: 6(1) [online] [cited 2009-07-23] <http://www.dlib.org/dlib/january00/01hodge.html>

Hodge et al, 2004. Digital Preservation and Permanent Access to Scientific Information [online] [cited 2009-07-23]

ICPSR, 2009. Digital Preservation Management: Implementing Short-term Strategies for Long-term Problems [online] [cited 2009-12-03] <http://www.icpsr.umich.edu/dpm/index.html>

Kenney, A. R., Entlich, R., Hirtle, P. B., McGovern, N. Y. and Buckley E. L., 2006. E-Journal Archiving Metes and Bounds: A Survey of the Landscape [online] [cited 2009-12-03]

Lee, K., Slattery, O., Lu, R., Tang, X. and McCrary, V., 2002. The State of the Art and Practice in Digital Preservation, Journal of Research of the National Institute of Standards and Technology: 107(1), 93-106.

Thomas, S. E. and Kroch, C. A., 2000. Project Harvest: The Cornell University Library's Proposal to The Andrew W. Mellon Foundation To Develop a Repository for E-Journals [online] [cited 2010-03-26] <http://www.diglib.org/preserve/cornellprop.htm>

Edinburgh University Library Digital Archives Research Project. A report and recommendations
