e-Services to Keep Your Digital Files Current. Presented by: Peter Bajcsy (Research Scientist at NCSA; Associate Director of I-CHASS, I3 Institute; Adjunct Assistant Professor, CS & ECE, UIUC), National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign

e-Services to Keep Your Digital Files Current


DESCRIPTION

These slides were presented on April 1st, 2010 at Archives 2, College Park, Washington DC


Page 1: e-Services to Keep Your Digital Files Current

e-Services to Keep Your Digital Files Current

Presented by: Peter Bajcsy
- Research Scientist at NCSA
- Associate Director of I-CHASS, I3 Institute
- Adjunct Assistant Professor, CS & ECE, UIUC

National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign

Page 2: e-Services to Keep Your Digital Files Current

Acknowledgement

• This research was partially supported by a National Archives and Records Administration (NARA) supplement to NSF PACI cooperative agreement CA #SCI-9619019 and NCSA Industrial Partners.

• The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the National Archives and Records Administration, or the U.S. government.

• Contributions by: Peter Bajcsy, Kenton McHenry, Rob Kooper, Michal Ondrejcek, Jason Kastner, William McFadden, Sang-Chul Lee, Luigi Marini

Imaginations unbound

Page 3: e-Services to Keep Your Digital Files Current

Outline

• Introduction
• Technologies
  • File format conversion software registry
  • Automated file format conversions
  • Conversion quality assessment
• Summary
• Future Work

Page 4: e-Services to Keep Your Digital Files Current

Introduction

Page 5: e-Services to Keep Your Digital Files Current

Supporting NARA’s Strategic Plan

• According to The Strategic Plan of The National Archives and Records Administration 2006–2016, “Preserving the Past to Protect the Future”:
  • “Strategic Goal: We will preserve and process records to ensure access by the public as soon as legally possible”
  • “Part D. We will improve the efficiency with which we manage our holdings from the time they are scheduled through accessioning, processing, storage, preservation, and public use.”

Page 6: e-Services to Keep Your Digital Files Current

To Preserve or Not To Preserve?

[Diagram labels: Digital representation of information & knowledge; Preservation; Information transfer?; AGENCY; ARCHIVES]

Page 7: e-Services to Keep Your Digital Files Current

Do We Know the Answers?

• (1) What is the granularity of information that one should preserve about a decision process in order to reconstruct it?
  • Example: the granularity of information collected from a decision process based on visual inspection of images has implications for storage and computational requirements/costs - ImageProvenance2Learn (IP2Learn)

Page 8: e-Services to Keep Your Digital Files Current

Do We Know the Answers?

• (2) Given thousands of DVDs with files, which files are related?
  • Example: given files that contain 2D scans of blueprints and 3D CAD models, find the content-based file correspondence - File2Learn prototype system

[Figure: Relationship Discovery; labels: 784 files, 30 files]

Page 9: e-Services to Keep Your Digital Files Current

Do We Know the Answers?

• (3) Given hundreds of versions of the ‘same’ file, which file version(s) are similar and which one(s) should be preserved?
  • Example: given a collection of Adobe PDF documents, compare all pairs of Adobe PDF documents containing text, images, vector graphics, … and order them chronologically or based on similarities - Doc2Learn prototype

Page 10: e-Services to Keep Your Digital Files Current

Do We Know the Answers?

• (4) Given thousands of file formats, which conversion software and which target file format should be used so that the content of those thousands of files remains viewable in the long run?
  • Focus of today’s talk is on examples of technologies that would provide answers to (4) at large processing scale with computational scalability.

Page 11: e-Services to Keep Your Digital Files Current

Goal

• Observation: File format conversions are inevitably a part of our daily life

• Question: Can file format conversions help make digital content created today accessible and viewable throughout its lifecycle?

• Consideration: we do not know what file formats will be around 100+ years down the road

• Goal: to make files backward and forward compatible

Page 12: e-Services to Keep Your Digital Files Current

Background on File Format Conversions

• A very large number of file formats in which digital content is stored.

• An increasing number of complex file formats containing multiple types of digital content (e.g., Adobe PDF, HDF) or having very elaborate specifications (e.g., STEP).

• Many software implementations of import (read) and export (write) operations.

• A wide spectrum of quality of software implementations when reading and storing content in various file formats.

• Ephemeral support for many file formats and software implementations

• Hardware dependency of many software implementations

Page 13: e-Services to Keep Your Digital Files Current

Illustration of 3D File Format Reality

[Figure: 3D file formats shown: *.k3d, *.ma, *.mb, *.mp, *.pdf (*.prc, *.u3d), *.w3d, *.dwg, *.max, *.3ds, *.blend, *.iam, *.lwo, *.c4d]

Page 14: e-Services to Keep Your Digital Files Current

Challenges and Objective

• Challenges:
  • The quality of file format conversions is unknown when using a particular software package to do the conversion
  • The volume of file format conversions requires significant computational resources
  • Understanding information loss due to file format conversions is application dependent
  • Estimating information loss is complicated due to the complexity of file formats
  • The file format, software and hardware dependencies are often unknown

• Objective: Design and prototype services using a computational cloud to support forward-looking decisions

Page 15: e-Services to Keep Your Digital Files Current

Parameters of File Format Conversions

• File format: Content representation depends on a file format

• Software: Retrieval and storage of content in a file format depends on the quality of the software implementation

• Hardware: Software execution depends on access to storage media, operating system, and hardware platform

• Criteria defining information loss: Information loss due to file format conversions is defined by application-specific criteria

Page 16: e-Services to Keep Your Digital Files Current

Three Example Services of Interest

• (a) Find file format conversion software to convert from any file format to any other file format

• (b) Execute file format conversions with any available third party software

• (c) Evaluate information loss due to file format conversion over a set of files in multiple complex file formats

Page 17: e-Services to Keep Your Digital Files Current

Technologies

Page 18: e-Services to Keep Your Digital Files Current

Overview

Page 19: e-Services to Keep Your Digital Files Current

#1: Conversion Software Registry (CSR)

• Problem: Find file format conversion software to convert from any file format to any other file format

• Technology: Conversion Software Registry (CSR) at https://isda.ncsa.uiuc.edu/NARA/CSR/

• Features: Support for searching, editing and adding information about file format conversion software, open access and login-based modification

Page 20: e-Services to Keep Your Digital Files Current

Movie of CSR

Page 21: e-Services to Keep Your Digital Files Current

Comparison of CSR with Other Systems

• File Format Registries
  • PRONOM developed by the National Archives of the United Kingdom
  • Unified Digital Formats Registry (UDFR, previously GDFR)

• Software Registries/Catalogues
  • Community specific
    • The Geotechnical and Geoenvironmental Software Directory (GGSD)
    • The Natural Language Software Registry (NLSR)
  • Business oriented
    • The Bit9 Global Software Registry (whitelisting software)
    • Cnet (available software with links to feature descriptions)

• File Format Conversion Registries
  • The Planets test bed (password protected, 18 software packages)

Page 22: e-Services to Keep Your Digital Files Current

Novelty of Conversion Software Registry

• Existing file format registries focus on file format specifications

• Catalogues of software focus on software of interest to a specific community and include information about top level description, vendors and price but not capabilities to import and export file formats

• A file format conversion registry like Planets.org supports 16 software packages, only single-hop conversion paths and couples software to the registry.

• Novelty: CSR provides answers about multi-hop conversion paths from about 70+ software packages currently

[Figure: Two-hop conversion path]
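
A multi-hop query over a registry of import/export capabilities can be sketched as a breadth-first search on a graph whose nodes are file formats and whose edges are conversions. This is a minimal illustration only: the registry entries, the tool names `ToolA` and `ToolB`, and the function `find_conversion_path` are hypothetical and are not taken from the CSR implementation or its database.

```python
from collections import deque

# Hypothetical registry entries: (software, input format, output format).
# These names are illustrative only and are not taken from the CSR database.
REGISTRY = [
    ("ToolA", "wrl", "stp"),
    ("ToolA", "stp", "stl"),
    ("ToolB", "stl", "ply"),
    ("ToolB", "wrl", "obj"),
]


def find_conversion_path(registry, source, target):
    """Return the shortest list of (software, src, dst) hops, or None."""
    # Index the registry by input format for quick neighbour lookup.
    outgoing = {}
    for software, src, dst in registry:
        outgoing.setdefault(src, []).append((software, src, dst))

    queue = deque([(source, [])])
    visited = {source}
    while queue:
        fmt, path = queue.popleft()
        if fmt == target:
            return path
        for hop in outgoing.get(fmt, []):
            nxt = hop[2]
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + [hop]))
    return None


if __name__ == "__main__":
    # No single tool converts wrl to stl directly here, so two hops are needed.
    for hop in find_conversion_path(REGISTRY, "wrl", "stl"):
        print(hop)
```

Breadth-first search returns a path with the fewest hops, which is a reasonable default if each additional conversion risks additional information loss.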

Page 23: e-Services to Keep Your Digital Files Current

#2: File Format Conversion Engine

• Problem: Execute file format conversions with any available third party software

• Technology: Polyglot version 1, operating on NCSA hardware resources, downloadable for private deployment

• Features: web-based access to a computational cloud consisting of commodity hardware and installations of third party software with import/export capabilities

Page 24: e-Services to Keep Your Digital Files Current

Movie of Polyglot

Page 25: e-Services to Keep Your Digital Files Current

Polyglot Design

[Diagram labels: EXTENSIBILITY; AUTOMATION; COMPUTATIONAL SCALABILITY; Cloud Computing; Services to Archivists]

Page 26: e-Services to Keep Your Digital Files Current

Comparison of File Format Conversion Systems

• Some existing file format conversion services
  • http://www.ps2pdf.com
    • Supports only certain conversion types
  • http://www.zamzar.com
    • Supports conversion of document, image, music, video and a couple of CAD formats
  • http://media-convert.com
    • Supports about 20 multi-media formats

• Drawbacks: The existing systems are not extensible (limited by specific libraries), cannot be downloaded for private use (files with sensitive info), computational scalability is unknown

Page 27: e-Services to Keep Your Digital Files Current

Format Conversion Extensibility Via Software Reuse

• Observation: Nobody has the resources to load every possible file format
  • Fully supporting the many available formats is an enormous undertaking
  • If a file format is closed/proprietary it may be difficult to retrieve the data directly from the file
  • Vendor file formats sometimes store application-feature-specific pieces of information that are not supported in other formats
  • Most software supports importing/exporting of a subset of application domain specific file formats.

• Conclusion: Software reuse and extensibility are the key characteristics of file format conversion systems

Page 28: e-Services to Keep Your Digital Files Current

File Format Conversion Extensibility

• Extensibility in Polyglot: Software is reused by wrapping 3rd party software while utilizing whatever access the software vendors make available to embedded functionality
  • published Application Programming Interface (API), command line and Graphical User Interfaces (GUI)

• Novelty: Polyglot provides a single user interface that allows the user to execute multiple conversion software applications automatically, and over distributed computers that have a license for the software needed to do the conversion and/or have the computing resources necessary for the size of the job (computational scalability).
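
The idea of reusing third party software through its command line can be sketched as a thin wrapper that records which formats a tool imports and exports and how to invoke it. This is an assumption-laden illustration, not Polyglot's actual code: the class `CommandLineWrapper`, its `{src}`/`{dst}` placeholders, and the stand-in "converter" (a Python one-liner that upper-cases a text file) are all hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


class CommandLineWrapper:
    """Wraps one third-party converter reachable through a command line."""

    def __init__(self, name, inputs, outputs, command_template):
        self.name = name
        self.inputs = set(inputs)
        self.outputs = set(outputs)
        # Template items may contain the placeholders {src} and {dst}.
        self.command_template = command_template

    def can_convert(self, src_format, dst_format):
        return src_format in self.inputs and dst_format in self.outputs

    def convert(self, src, dst):
        # Substitute the file paths into the template and run the tool.
        command = [
            part.replace("{src}", str(src)).replace("{dst}", str(dst))
            for part in self.command_template
        ]
        subprocess.run(command, check=True)
        return Path(dst)


# Stand-in "converter": a Python one-liner that upper-cases a text file.
# A real deployment would name an installed third-party executable instead.
STAND_IN = CommandLineWrapper(
    name="upper-caser (hypothetical)",
    inputs={"txt"},
    outputs={"up"},
    command_template=[
        sys.executable, "-c",
        "import sys; open(sys.argv[2], 'w').write(open(sys.argv[1]).read().upper())",
        "{src}", "{dst}",
    ],
)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "note.txt"
        src.write_text("hello")
        out = STAND_IN.convert(src, Path(tmp) / "note.up")
        print(out.read_text())
```

Because each wrapper only declares formats and an invocation recipe, adding a new tool means adding one more wrapper instance rather than writing a new parser for its file formats.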

Page 29: e-Services to Keep Your Digital Files Current

#3: File Comparison Engines

• Problem: Compare two files and evaluate information loss due to file format conversion over a set of files in multiple complex file formats

• Technologies:
  • Initial prototypes: ModelBrowser (four 3D comparison metrics); Doc2Learn (one metric across multiple digital objects); Doc2LearnHadoop (computational scalability using Hadoop)
  • Work-in-progress: A general API for content-based comparison of any two files - Versus

Page 30: e-Services to Keep Your Digital Files Current

3D Comparison Example (ModelBrowser)

• Software: Adobe 3D Reviewer
• Original File: WRL
• Converted Files: STP, STL, IGS, U3D
• Comparison Method: Light Fields [Chen, 2003] compares silhouettes from various viewing angles around the objects

[Figure: renderings of heart.wrl, heart.stl, heart.stp]

Conclusion: Information loss (WRL→STP) = Information loss (WRL→STL)

Page 31: e-Services to Keep Your Digital Files Current

Multiple Object Comparisons (Doc2Learn)

Adobe PDF documents ~ {text, images, vector graphics, ….}

Page 32: e-Services to Keep Your Digital Files Current

Multiple Method Comparisons (Versus)

• Software: MS Paint
• Original File: TIF
• Converted Files: PNG, GIF, JPG, BMP
• Comparison Method: Pixel by pixel difference (sum of Euclidean distances over all pixels)

[Figure: User Inputs]

Conclusion 1: Information loss (TIF→BMP or TIF→PNG) = 0
Conclusion 2: Information loss (TIF→GIF) > Information loss (TIF→JPG)
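
The pixel by pixel metric named above can be sketched as follows. This is a minimal illustration and not the Versus API: the function `pixel_difference` and the representation of an image as nested lists of RGB tuples are assumptions made for the example, and the pixel values are made up.

```python
import math


def pixel_difference(image_a, image_b):
    """Sum of Euclidean distances between corresponding RGB pixels."""
    if len(image_a) != len(image_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(image_a, image_b)
    ):
        raise ValueError("images must have the same dimensions")
    total = 0.0
    for row_a, row_b in zip(image_a, image_b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            total += math.dist(pixel_a, pixel_b)
    return total


if __name__ == "__main__":
    # Made-up 2x2 images: an original, an identical copy, and a copy in
    # which one pixel was altered by a hypothetical lossy conversion.
    original = [[(255, 0, 0), (0, 255, 0)], [(0, 0, 255), (10, 10, 10)]]
    lossless = [[(255, 0, 0), (0, 255, 0)], [(0, 0, 255), (10, 10, 10)]]
    lossy = [[(252, 4, 0), (0, 255, 0)], [(0, 0, 255), (10, 10, 10)]]
    print(pixel_difference(original, lossless))  # 0.0
    print(pixel_difference(original, lossy))     # 5.0
```

A score of zero corresponds to a conversion that preserved every pixel, which matches the form of Conclusion 1; larger scores indicate more altered content under this particular criterion.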

Page 33: e-Services to Keep Your Digital Files Current

Information Loss Evaluation

Setup:
• Inputs: a set of files, a set of software packages, criteria for defining information loss
• Wanted output: information loss ‘score’ per file format conversion

Approach:
• Phase I: Find all round-trip conversion paths from a given file format to the same file format
• Phase II: Execute all conversions to obtain converted files.
• Phase III: Compare the original and converted files
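
The three phases can be sketched end to end on toy data. Everything here is hypothetical: the conversion table, the tool names, the simulated converters (plain Python functions acting on a list of integers instead of real files), and the placeholder loss metric are assumptions chosen only to show how the phases fit together.

```python
# Hypothetical conversions: (software, input format, output format).
CONVERSIONS = [
    ("A", "stp", "stl"),
    ("A", "stl", "stp"),
    ("B", "stl", "ply"),
    ("B", "ply", "stp"),
]

# Simulated converters acting on a list of integers instead of real files.
# The stp -> stl step is made lossy on purpose (it drops the last digit).
CONVERTERS = {
    ("A", "stp", "stl"): lambda values: [v // 10 * 10 for v in values],
    ("A", "stl", "stp"): lambda values: list(values),
    ("B", "stl", "ply"): lambda values: list(values),
    ("B", "ply", "stp"): lambda values: list(values),
}


def find_round_trips(conversions, start, max_hops=3):
    """Phase I: enumerate conversion cycles start -> ... -> start."""
    paths = []

    def extend(fmt, path, seen):
        for software, src, dst in conversions:
            if src != fmt:
                continue
            hop = (software, src, dst)
            if dst == start:
                paths.append(path + [hop])
            elif dst not in seen and len(path) + 1 < max_hops:
                extend(dst, path + [hop], seen | {dst})

    extend(start, [], {start})
    return paths


def run_path(path, converters, content):
    """Phase II: apply each hop's converter in order."""
    for hop in path:
        content = converters[hop](content)
    return content


def information_loss(original, converted):
    """Phase III: a placeholder metric (sum of absolute differences)."""
    return sum(abs(a - b) for a, b in zip(original, converted))


if __name__ == "__main__":
    original = [12, 50]
    for path in find_round_trips(CONVERSIONS, "stp"):
        converted = run_path(path, CONVERTERS, original)
        formats = " -> ".join(["stp"] + [hop[2] for hop in path])
        print(formats, information_loss(original, converted))
```

Round trips are convenient because the original and the final file share a format, so one comparison method suffices; the trade-off is that the score reflects the combined loss of every hop on the path rather than of a single conversion.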

Page 34: e-Services to Keep Your Digital Files Current

Information Loss Evaluation: Computational Requirements

• Files: one file in STP file format
• Software: Adobe 3D Reviewer, Cyberware PlyTool
• Comparison Method: Light Fields [Chen, 2003]
• Number of paths: 10 (28 individual conversions)

[Figure panels: Phase I: Find; Phase II: Execute; Phase III: Compare]

Page 35: e-Services to Keep Your Digital Files Current

Summary

Page 36: e-Services to Keep Your Digital Files Current

Information Technology Lessons

• Better understanding of preservation and reconstruction of electronic records in terms of file format conversions

• The data model needed for documenting existing file format conversion software

• A framework (test bed) for software reuse and extensibility to provide file format conversion services

• The complexity of performing content-based file comparison and measurements of information loss due to file format conversions

• The computational cost of file format conversions, file comparisons and information loss evaluations

• The computational scalability of file format conversions and file comparisons using parallel processing paradigms

Page 37: e-Services to Keep Your Digital Files Current

The Value for Archivists

• Prototype services are freely available to the digital preservation community and provide decision support tools
  • to select an ‘optimal’ file format to be preserved
  • to evaluate file format conversion software
  • to select minimum cost for a chosen file format conversion path

• The framework for conversion software documentation, software reuse and functionality extensibility has a major impact on
  • Efficiency with which we manage our holdings
  • Understanding of the information loss introduced due to conversions
  • The cost of updating file format conversion services

Page 38: e-Services to Keep Your Digital Files Current

Development Plans

• Prototype services are open to the public at
  • https://isda.ncsa.uiuc.edu/NARA/CSR/
  • http://teeve3.ncsa.uiuc.edu/polyglot/convert.php

• Software is open source technology and downloadable from http://isda.ncsa.uiuc.edu/download/

• We have been building a second generation of these file format conversion services

• Feedback is very welcome
• Questions: Peter Bajcsy - [email protected]