Steps Towards Better Digital Archives Hiroyuki Kawano Department of Information and Telecommunication, Faculty of Mathematical Sciences and Information Engineering, Nanzan university, Japan Adjunct researcher in Digital Library Section at National Diet


Steps Towards Better Digital Archives

Hiroyuki Kawano

Department of Information and Telecommunication, Faculty of Mathematical Sciences and Information Engineering, Nanzan University, Japan

Adjunct researcher in the Digital Library Section at the National Diet Library (Japan)

Outline: Today's Talk

・ Fragile Intelligence on the Internet
  - Disappearing Scientific/Artistic/Cultural Contents
  - Statistics of web contents in Japan (by NDL)
・ Problems of building Digital Archives
  - Technology / Legislation / Organization
・ Towards Better Digital Archives
  - Technical problems: distributed crawling programs, huge storage systems using hierarchical architecture
  - Social problems: intellectual properties (copyright law, Creative Commons)

Self-introduction

Background of My Research: Mondou

[Figure: search results by Mondou, with related keywords provided by association rule mining]

Text/Web Mining
・ Mondou (1996)
  - Relevant keywords provided by text mining association rules
・ Document clustering
・ Information visualization
・ Discover web communities
・ Distributed and cooperative web robots
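As a rough illustration of how related keywords can be derived with association rules, the sketch below treats each document as a set of keywords and keeps a rule "query -> keyword" when its support and confidence reach given thresholds. This is not Mondou's actual implementation: the function name `related_keywords`, the thresholds and the sample documents are all hypothetical.

```python
from collections import Counter

def related_keywords(docs, query, min_support=0.2, min_confidence=0.5):
    """Return (keyword, support, confidence) for rules query -> keyword.

    docs is a list of keyword sets, one per document (hypothetical input).
    support    = fraction of all documents containing both query and keyword
    confidence = fraction of documents containing query that also contain keyword
    """
    n_docs = len(docs)
    with_query = [d for d in docs if query in d]
    if n_docs == 0 or not with_query:
        return []
    # Count how often each other keyword co-occurs with the query keyword.
    co_counts = Counter(k for d in with_query for k in d if k != query)
    rules = []
    for keyword, count in co_counts.items():
        support = count / n_docs
        confidence = count / len(with_query)
        if support >= min_support and confidence >= min_confidence:
            rules.append((keyword, support, confidence))
    # Strongest rules first; ties are broken alphabetically for stable output.
    return sorted(rules, key=lambda r: (-r[2], -r[1], r[0]))

# Hypothetical sample data: five documents described by keyword sets.
docs = [
    {"archive", "library", "metadata"},
    {"archive", "crawler", "web"},
    {"archive", "library", "preservation"},
    {"search", "web", "crawler"},
    {"archive", "library"},
]
rules = related_keywords(docs, "archive")
```

With this sample, only "library" co-occurs with "archive" often enough to be suggested as a related keyword; the other keywords fail the confidence threshold.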

Differences between Search Engine and Web Archive

Crawling
・ Web Search Engine: freshness by time stamps and informative file types (html, text, pdf, doc and others)
・ Web Archive: accurate crawling of entire web pages stored in target web sites, as rapidly as possible

Quality
・ Web Search Engine: focusing on special attributes and descriptions
  - title, meta, hyperlink tags
・ Web Archive: quality control is strongly required
  - Original/Master copies
  - Archiving shots management

Search
・ Web Search Engine: recall and precision; results are sometimes influenced by commerciality; simple and easy query input
・ Web Archive: difficulties of document searches
  - Historical change and heterogeneous keywords
  - Evolution of hyperlink structures

Preservation
・ Web Search Engine: short time, several months
  - Most users prefer popular and fresh web pages.
・ Web Archive: long time, several centuries, as with paper, microfilm etc.
  - migration, transformation

Adjunct researcher (2002-) in the Digital Library Section at the National Diet Library (Japan)

Roles of NDL KANSAI-kan

・ Collaborative Service between East and West
  - PhD. Thesis (45%), Journal & Magazine (29%), Reports of Grant-in-Aid for Scientific Research of MEXT (15%), Scientific Reports (7%), Asian Library (4%)
・ Digital Portal in Japan
  - Digital Library (in Meiji-era, during 1868-1911) (2007/7): 97,000 Titles, about 143,000 Books
  - WARP (Web Archiving Project): 1,499 Titles (E-book, journal, article, white paper etc.); 46 Government Organizations, 1,907 Cooperative Organizations
  - Governmental contents are also edited, modified and deleted…
  - Dnavi: 9,900 directories; 1,100 URLs added per year; an investigation found 2,300 of 5,600 URLs deleted

WARP (Web Archiving Project)

・ The House of Councillors
・ Consolidation of cities, organizations, universities etc.


[Figure: volume of information over time, from B.C. 300 to 2001, 2003 and 2005, on a scale marked 1PB, 10PB and 100PB. Labels recovered from the chart:]

・ Alexander Library: 0.5 million, 32TB
・ Science: 9,000 (1900), 90,000 (1950), 0.9 million (2000)
・ % of Archive: 30~50%
・ Web Pages, 2001: Surface Web 14TB (1 billion pages), Deep Web 7.5PB (550 billion pages)
・ Web Pages, 2003: Surface Web 167TB, Deep Web 67~92PB
・ Web Pages, 2005: Web (Japan) 0.45 billion pages (18.4TB), Web (go.jp) 20 million pages (1.6TB)
・ Books, reports, others: 782 million (50PB = 50,000TB)
・ Books (Public): 4.8 million (308TB)
・ Books (Current): 3.20 million (205TB)
・ Books (Unknown): 24 million (1,540TB)
・ Scan This Book! http://www.nytimes.com/

Statistics of Web Sites

・ 2001: 1 billion pages (Surface Web), 550 billion pages (Deep Web; 7.5PB)
  http://www.brightplanet.com/technology/deepweb.asp
・ 2002: 2 billion pages (English: 56.4%, German: 7.7%, French: 5.6%, Japanese: 4.9%)
  http://www.netz-tipp.de/languages.html
・ 2003: 167TB (Surface Web), 92PB (Deep Web)
  http://www2.sims.berkeley.edu/research/projects/how-much-info-2003
・ 2005, January: 11.5 billion searchable Web pages in 75 languages
  http://www.cs.uiowa.edu/~asignori/web-size/

Survey Report of Japanese Web Sites (by NDL, 2005)

・ Web Data
  - HTML files: about 44 million files
  - Picture files: about 55 million files
  - Estimated total number of files: 450 million files
  - Estimated total volume of data: 18.4TB
・ jp domain: 182,093 hosts
  - go.jp hosts: 2,336 hosts (1.28%), Files 4.4%, Volume 8.5%

http://www.ndl.go.jp/jp/aboutus/bulkresearch2005summary.html
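The survey percentages can be cross-checked against the totals with a few lines of arithmetic. The sketch below uses only the figures quoted above; the variable names are illustrative, and 1TB is assumed to mean 10^12 bytes.

```python
# Figures quoted from the NDL 2005 survey summary above.
total_files = 450_000_000       # estimated total number of files
total_volume_tb = 18.4          # estimated total volume of data (TB)
jp_hosts = 182_093
go_jp_hosts = 2_336
go_jp_file_share = 0.044        # go.jp share of files (4.4%)
go_jp_volume_share = 0.085      # go.jp share of volume (8.5%)

# Share of go.jp hosts among all jp hosts, in percent.
host_share_pct = round(100 * go_jp_hosts / jp_hosts, 2)

# Number of files and data volume attributable to go.jp.
go_jp_files = total_files * go_jp_file_share
go_jp_volume_tb = total_volume_tb * go_jp_volume_share

# Average file size over the whole jp web, assuming 1TB = 10**12 bytes.
avg_file_kb = total_volume_tb * 10**12 / total_files / 1000

print(f"go.jp hosts: {host_share_pct}% of jp hosts")
print(f"go.jp files: about {go_jp_files / 1e6:.1f} million")
print(f"go.jp volume: about {go_jp_volume_tb:.2f}TB")
print(f"average file size: about {avg_file_kb:.1f}kB")
```

The results, roughly 19.8 million files and 1.56TB for go.jp, agree with the rounded "20 million pages (1.6TB)" figure for go.jp shown earlier in the talk.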

Digital Archives

Organization | From | Characteristics
Internet Archive | 1996 | Wayback Machine (Fair Use)
Austria, National Lib. | 1996/6 | Legislation
Sweden, Royal Lib. | 1996/9 | Legislation
Denmark, Royal Lib. | 1997/6 | Legislation
Australia, National Lib. | 1997/6 | Discussion
France, National Lib. | 1999 | Discussion
USA, Lib. of Congress | 2000 | NDIIPP
Finland, National Lib. | 2000/8 | Proposal of Legislation
Britain, Lib. | 2001/5 | 2003, Legislation (non-print material)
China, Lib. | 2003/1 | WICP, “Discussing Legislation for Networked Electric Publishing” (2003/5)
Korea, National Lib. | 2006/2 | OASIS, Discussing Legislation; National Digital Library is under construction to open in 2008
Germany, National Lib. | 2006/6 | Legislation
Japan, National Diet Lib. | 2002/6 | Middle term planning (2004)

Towards Better Digital Archives

・ Preserve Fragile Born-digital Contents
  - Academic/Scientific/Artistic/Cultural Resources
・ Archive of Digital Information
  - Technologies of Long Term Preserving
  - Legislation of Long Term Preserving
  - Organization of Long Term Preserving
・ Organization
  - National libraries for digital preservation projects
  - IIPC (International Internet Preservation Consortium)

National Archive Library for preserving Digital Information

Organization (Mandatory): Belief
● National Diet Library
● National Archives of Japan
● Public/Private Libraries
● NII
● Government

Legislation (Law, Consensus): Commons
● Law of National Diet Library
● Law of Libraries
● Law of National Archive
● Law of Museums etc.
● Intellectual Properties
● Copyright Law
● Copyleft/Creative Commons

Technologies (Architecture): Mission-driven
● Internet Technologies
● Natural Language (CJK) including Vietnamese
● Information Retrieval
● Database Technologies
● Archive Technologies

Various Technical Problems

・ Programs for crawling contents from surface and deep webs provided by dynamic web services
  - emulation and migration of dynamic content
  - Heritrix
・ Collaboration and optimization of distributed systems
  - preserve monotonically increasing digital contents
  - crawling, storage, information retrieval with time-line
  - WERA (Web ARchive Access), OAIS, DSpace etc.
・ Metadata formats
  - URI, RDF, MODS (Metadata Object Description Schema)

Various Technical Issues: before Constructing Web Archives

・ Problems of web crawling
  - Discovery of starting URL
  - Frequency of retrieving
  - Target file extensions
  - Domain, directory, depth
  - Contents in cross-domains
  - Scripting URL: javascript, java, flash etc.
・ Quality control required
  - No missing pages
  - No imperfect capturing (imperfection caused by timeout)
・ Cost performance of advanced storage systems
  - Properties of various storage media
  - Archiving units in web sites
  - Compression techniques
  - Differentiating archives
  - Duplication prepared for troubles and disasters
・ Conservation of originality
  - Certification of master copy
  - Hyperlink, Coding, Layout, Script etc.
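Several of the crawling issues listed above (starting URL, target file extensions, domain, directory, depth, cross-domain content, scripting URLs) amount to a scoping decision made for every discovered link. The following is a minimal sketch of such a decision, not the rule set of WARP or of any particular crawler; the function `in_scope`, its defaults and the example.com URLs are hypothetical.

```python
from urllib.parse import urlparse
import posixpath

def in_scope(url, seed, max_depth=3,
             extensions=(".html", ".htm", ".txt", ".pdf", ".doc")):
    """Decide whether url should be crawled, given a starting URL (seed)."""
    u, s = urlparse(url), urlparse(seed)
    if u.scheme not in ("http", "https"):
        return False                     # e.g. javascript: pseudo-URLs
    if u.hostname != s.hostname:
        return False                     # cross-domain content is skipped here
    # Directory of the starting URL, always ending with "/".
    seed_dir = s.path[: s.path.rfind("/") + 1] or "/"
    path = u.path or "/"
    if not path.startswith(seed_dir):
        return False                     # outside the target directory
    rest = path[len(seed_dir):]
    if rest.count("/") > max_depth:
        return False                     # too many directory levels below the seed
    name = rest.rsplit("/", 1)[-1]
    ext = posixpath.splitext(name)[1].lower()
    # Directory indexes and extensionless paths are accepted in this sketch.
    return ext == "" or ext in extensions

seed = "http://www.example.com/docs/index.html"
candidates = [
    "http://www.example.com/docs/report/2005/summary.pdf",
    "http://other.example.com/docs/a.html",
    "javascript:void(0)",
    "http://www.example.com/docs/a/b/c/d/e.html",
    "http://www.example.com/docs/movie.swf",
    "http://www.example.com/news/a.html",
]
decisions = [in_scope(c, seed) for c in candidates]
```

Only the first candidate is accepted here. A production crawler would also need the retrieval frequency and timeout handling mentioned above, which this sketch leaves out.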

Hidden webs and archiving

[Figure: Web Servers connected to Agents that communicate via KQML; Search and Archiving Robots feed the Web Archiving Systems (Metadata, Site Summaries, Frequent navigational patterns, Representative web contents)]

・ Advanced techniques: KQML, Mediator, Wrapper, Association rules
・ Web mining: knowledge and rules derived from Metadata Repository, Web summaries

How do we archive contents stored in hidden webs?

Growth of Storage Market

・ Trend of Storage Volume: 10 times in 2010
  - 2010: volume of storage 1370PB (10 times the volume in 2005)
  - Growth rate 56.9%/year (IDC Japan)
・ Next DVD: 25-30GB
・ 2010: Holographic Disc: 200GB-1TB

[Figure: Storage Market in Japan, unit: ¥100M, TB (JEITA); cost per TB falling through ¥13.07M/TB, ¥8.42M/TB, ¥5.02M/TB and ¥2.73M/TB; vendors shown: Dell, Hitachi, Others]
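The "10 times" figure follows from compounding the quoted annual growth rate over the five years from 2005 to 2010. A short check, using only the numbers above (the variable names are illustrative):

```python
import math

annual_growth = 0.569      # 56.9% per year (IDC Japan figure quoted above)
years = 5                  # 2005 -> 2010
volume_2010_pb = 1370      # storage volume in 2010, in PB

# Compound growth over five years.
growth_factor = (1 + annual_growth) ** years

# Volume in 2005 implied by the 2010 figure and the growth rate.
implied_volume_2005_pb = volume_2010_pb / growth_factor

# Number of years needed for an exact tenfold increase at this rate.
years_to_tenfold = math.log(10) / math.log(1 + annual_growth)

print(f"growth over {years} years: x{growth_factor:.2f}")
print(f"implied 2005 volume: about {implied_volume_2005_pb:.0f}PB")
print(f"years to reach x10: {years_to_tenfold:.2f}")
```

The compounded factor is about 9.5, so "10 times" is a slight rounding up; an exact tenfold increase takes a little over five years at this rate.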

Architecture of hierarchical storage

・ First Level Storage (plain files, full text search)
・ Second Level Storage (compressed files, partial indexing)
・ Third Level Storage (archiving multiple files with compression, low cost devices)
・ Cache Storage (prefetch)
・ Operational Database of Archiving System (log files of web robots, search queries, navigational patterns)
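One way to read this architecture is that the operational database (access logs, queries) drives the placement of each archived file on one of the three levels. The sketch below is only an assumption about how such a policy might look: the class `ArchivedFile`, the thresholds `hot` and `warm`, and the use of gzip are hypothetical, and bundling of multiple files on the third level is not shown.

```python
import gzip
from dataclasses import dataclass

@dataclass
class ArchivedFile:
    url: str
    content: bytes
    accesses_per_year: int   # hypothetical figure taken from the operational logs

def choose_level(item, hot=100, warm=10):
    """Pick a storage level from access frequency (thresholds are hypothetical)."""
    if item.accesses_per_year >= hot:
        return 1   # first level: plain files, full text search
    if item.accesses_per_year >= warm:
        return 2   # second level: compressed files, partial indexing
    return 3       # third level: compressed archives on low cost devices

def store(item):
    """Return (level, stored_bytes); levels 2 and 3 keep gzip-compressed data."""
    level = choose_level(item)
    data = item.content if level == 1 else gzip.compress(item.content)
    return level, data

files = [
    ArchivedFile("http://www.example.com/index.html", b"<html>top page</html>", 500),
    ArchivedFile("http://www.example.com/2004/report.html", b"<html>report</html>", 20),
    ArchivedFile("http://www.example.com/1998/old.html", b"<html>old page</html>", 0),
]
levels = [store(f)[0] for f in files]
```

Frequently accessed pages stay uncompressed on the first level, while rarely accessed ones are compressed and can be restored with `gzip.decompress` when requested.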

Guidelines of Metadata

http://www.loc.gov/standards/

Various Formats and Standards

Resource Description Formats
・ MARC 21 formats: representation and communication of descriptive metadata about information items
・ MARCXML: MARC 21 data in an XML structure
・ MODS (Metadata Object Description Schema): XML markup for selected metadata from existing MARC 21 records as well as original resource description
・ MADS (Metadata Authority Description Schema): XML markup for selected authority data from MARC 21 records as well as original authority data
・ EAD (Encoded Archival Description): XML markup designed for encoding finding aids

Digital Library Standards
・ METS (Metadata Encoding & Transmission Standard): structure for encoding descriptive, administrative, and structural metadata (www.loc.gov/mets)
・ MIX (NISO Metadata for Images in XML): XML schema for encoding technical data elements required to manage digital image collections
・ PREMIS (Preservation Metadata): a data dictionary and supporting XML schemas for core preservation metadata needed to support the long-term preservation of digital materials

Options (metadata) of OPAC and WARP

OPAC | WARP
Title | Title
Authors/Editors | Authors, Editors
Location | Start URL
Year | Duration
Category
Category No. (NDC, NDLC, LCC, DDC, UDC, GPO) | NDC
Standard No. (ISBN, ISSN, CODEN, UTM, ISRN, ISMN etc.) | ISSN+ISBN
Book ID (JAPANMARC, USMARC, UKMARC, OCLC etc.)
Management No. | Meta ID
Codes (Language, Original Language, Gov., Univ. etc.)
Japanese/Western Books, Digital Contents, Music/Video, Ashihara Collection etc. | Collections
NDL Resource


Guideline of NDL Meta Data

・ NDL-DA (NDL-Digital Archive) System is based on the OAIS reference model
・ Information Package consists of
  - Content Information
  - Metadata
・ Organizing Unit
  - Bibliography, Volume, Number, Article
  - Web Site, Web pages

http://www.ndl.go.jp/jp/standards/da/index.html

OAIS (Open Archival Information System)

http://www.rlg.org/en/pdfs/rlgnews/news56.pdf

・ Submission Information Package
・ Archival Information Package
・ Dissemination Information Package
・ Descriptive Metadata is stored separately
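The flow between the three package types can be pictured with a toy model. OAIS is a reference model and does not prescribe an implementation, so everything in the sketch below (the `Package` and `Archive` classes, the truncated SHA-256 identifier, the in-memory dictionaries) is a hypothetical illustration of the idea that a submitted package becomes an archival package, descriptive metadata is kept in a separate catalog, and a dissemination package is assembled on request.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class Package:
    kind: str                                  # "SIP", "AIP" or "DIP"
    content: bytes
    metadata: dict = field(default_factory=dict)

class Archive:
    def __init__(self):
        self.storage = {}   # AIP id -> archival package
        self.catalog = {}   # AIP id -> descriptive metadata, stored separately

    def ingest(self, sip):
        """Turn a submission package into an archival package."""
        digest = hashlib.sha256(sip.content).hexdigest()
        aip_id = digest[:12]
        self.catalog[aip_id] = dict(sip.metadata)
        # The AIP itself only carries preservation-related data (a fixity value).
        self.storage[aip_id] = Package("AIP", sip.content, {"fixity_sha256": digest})
        return aip_id

    def disseminate(self, aip_id):
        """Build a dissemination package from stored content plus catalog entry."""
        aip = self.storage[aip_id]
        return Package("DIP", aip.content, dict(self.catalog[aip_id]))

archive = Archive()
sip = Package("SIP", b"<html>archived page</html>", {"title": "Example page"})
aip_id = archive.ingest(sip)
dip = archive.disseminate(aip_id)
```

Keeping the catalog apart from the stored packages means descriptive records can be searched or corrected without touching the preserved content.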

NDL: Meta Data

・ Information Package Metadata
  - Preserving contents and associated metadata
・ Description Metadata
  - Bibliography: Title, Publisher, Volume, Number etc.
・ Technical Metadata
  - CPU, Hardware, Operating System, Software etc.
・ Preservation Metadata
  - Long-term preservation: Ingest/Migration history etc.
・ Rights Metadata
  - Permission, Creator, Authority, Audience etc.
・ Control Metadata
  - Other Data for Preservation/Utilization/Management

NDL: Meta Data

・ Information Package Metadata: METS1.6
  - METS (Metadata Encoding and Transmission Standard)
・ Description Metadata: MODS3.2 and NDL-DA Metadata Scheme
  - MODS3.2 (Metadata Object Description Schema)
  - MODS is a derivative of MARC21, and it is not so complex
・ Technical Metadata: PREMIS based Scheme
・ Preservation Metadata: PREMIS based Scheme
・ Rights Metadata: PREMIS based Scheme
  - PREMIS (PREservation Metadata: Implementation Strategies)
  - View Path is “Preservation Layer Model” in DIAS (Digital Information Archiving System, Netherlands)
・ Control Metadata: NDL-DA Metadata Scheme

Sample: Attribute Values

・ typeOfResource (based on MARC21)
  - text; cartographic; notated music; sound recording; sound recording-musical; sound recording-nonmusical; still image; moving image; three dimensional object; software, multimedia; mixed material
・ digitalOrigin (based on MODS)
  - born digital; reformatted digital; digitized microfilm; digitized other analog
・ Japanese Kana: script or transliteration

<titleInfo>
  <title>国立国会図書館</title>
</titleInfo>
<titleInfo script="Kana">
  <title>コクリツ コッカイ トショカン</title>
</titleInfo>
<titleInfo script="latn">
  <title>kokuritsu kokkai toshokan</title>
</titleInfo>
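The three title variants above can be generated programmatically. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper `title_info` is hypothetical, and the MODS namespace and the rest of the record are omitted for brevity, so this is not a complete or schema-valid MODS document.

```python
import xml.etree.ElementTree as ET

def title_info(title, script=None):
    """Build one <titleInfo> element, optionally tagged with a script attribute."""
    elem = ET.Element("titleInfo")
    if script:
        elem.set("script", script)
    ET.SubElement(elem, "title").text = title
    return elem

# Root element without namespace: a simplification made for this sketch.
mods = ET.Element("mods")
mods.append(title_info("国立国会図書館"))
mods.append(title_info("コクリツ コッカイ トショカン", script="Kana"))
mods.append(title_info("kokuritsu kokkai toshokan", script="latn"))

xml_text = ET.tostring(mods, encoding="unicode")

# Reading the variants back, e.g. to index the title under every script.
parsed = ET.fromstring(xml_text)
variants = [(t.get("script"), t.find("title").text)
            for t in parsed.findall("titleInfo")]
```

Storing the kana and romanized readings alongside the original title allows the same record to be retrieved by queries written in any of the three scripts.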

Information Package <mets>

・ Description MD: <dmdSec id="101">
・ Control MD: <amdSec>
  - Technical/Preservation MD: <techMD id="201">
  - Rights MD: <rightsMD id="301">
  - Preservation/Management MD: <digiprovMD id="401">
・ Contents: <fileSec>/<fileGrp>
  - PDF File: <file id="001" amdid="201 401">
・ Structure Map: <structMap>
  - Bibliographic Unit: <div dmdid="101" amdid="301"><fptr fileid="001"/></div>
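The cross-references in this package (a file pointing to its technical and provenance metadata, a structural division pointing to its description and rights metadata) can be sketched as a METS skeleton. Two assumptions differ from the slide notation: the METS schema spells the attributes in upper case (ID, ADMID, DMDID, FILEID), and XML IDs cannot begin with a digit, so the numeric identifiers are given hypothetical prefixes. The metadata sections are left empty, and `unresolved_references` is an illustrative helper, not part of any METS toolkit.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)

def m(tag):
    """Qualify a tag name with the METS namespace."""
    return f"{{{METS_NS}}}{tag}"

mets = ET.Element(m("mets"))

# Description metadata and the three administrative metadata sections.
ET.SubElement(mets, m("dmdSec"), ID="dmd101")
amd = ET.SubElement(mets, m("amdSec"))
ET.SubElement(amd, m("techMD"), ID="tech201")
ET.SubElement(amd, m("rightsMD"), ID="rights301")
ET.SubElement(amd, m("digiprovMD"), ID="prov401")

# The PDF file refers to its technical and provenance metadata.
file_grp = ET.SubElement(ET.SubElement(mets, m("fileSec")), m("fileGrp"))
ET.SubElement(file_grp, m("file"), ID="file001", ADMID="tech201 prov401")

# The bibliographic unit refers to description and rights metadata and to the file.
struct_map = ET.SubElement(mets, m("structMap"))
div = ET.SubElement(struct_map, m("div"), DMDID="dmd101", ADMID="rights301")
ET.SubElement(div, m("fptr"), FILEID="file001")

def unresolved_references(root):
    """Return every referenced identifier that no element declares as its ID."""
    ids = {e.get("ID") for e in root.iter() if e.get("ID")}
    missing = set()
    for e in root.iter():
        for attr in ("ADMID", "DMDID", "FILEID"):
            for ref in (e.get(attr) or "").split():
                if ref not in ids:
                    missing.add(ref)
    return missing

dangling = unresolved_references(mets)
```

A check of this kind is useful at ingest time, since a package whose references do not resolve cannot be interpreted reliably later.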

Conclusion

・ The web archive is one of the dominant information infrastructures in the digital information society.
  - Technical problems: distributed crawling, long-term huge storage, advanced IR
  - Social problems: intellectual properties (copyright law, Creative Commons)
・ Huge volume and long-term preserving
  - Distributed crawling programs: surface and hidden webs, complex web services
  - Huge storage systems using hierarchical architecture: storage media, archiving formats, compression methods and rates
  - Retrieving mechanism: navigational pattern mining in web archive; preserving strategies by importance and access frequencies
  - Effective emulation and migration of dynamic contents

Discussion

・ Digital Archives
  - Infrastructure of Digital Contents
・ Problems of Digital Archives
  - Technology: Collaboration of Standardization
  - Legislation: Consensus among Stakeholders
  - Organization: Store/Preservation/Utilization
・ Towards Better Digital Archives
  - Collaboration for Integrated Digital Archives
  - Library, National Archive, Museum, University, Laboratory, Company etc.