Integrating web archiving in preservation workflows Louise Fauduet, Clément Oury, Sébastien Peyrard

Presented at a working session on web archives at the Biblioteca Nacional de España (BNE) on 8 July 2013.

Page 1: Integrating web archiving in preservation workflows. Louise Fauduet, Clément Oury y Sébastien Peyrard

Integrating web archiving in preservation workflows

Louise Fauduet, Clément Oury, Sébastien Peyrard


28th November 2012 Session 6 - Integrating web archiving in preservation workflows 2

Objectives of the session

> Present the current issues and solutions in web archive preservation

> Present the main characteristics of a preservation repository, taking “SPAR” as an example

> Explore the benefits and issues of integrating web archives in a shared digital repository

Preserving web archives

> First and foremost: bit-level preservation
> Logical preservation = preservation of our ability to read and understand the series of “0” and “1”
> Three kinds of logical preservation
  > Preservation of the container format
  > Preservation of the contained files
  > Preservation of all information necessary to understand web archives
> Web archive preservation: worst case possible?
  > Huge amount of data
  > Very little information on contained file formats
  > Heterogeneity of web archive collections within the same institution

Web archive preservation: characteristics and issues

Preserving container formats: the ARC format

> ARC format
  > Container format
  > Since 1996: http://archive.org/web/researcher/ArcFileFormat.php
  > Groups together data and metadata
  > Size arbitrarily limited to 100 MB

W/ARC file design

> A W/ARC file is a sequence of W/ARC records
> Each W/ARC record has two parts:
  > Header. Ex: record ID, capture date, record type, …
  > Block. Ex: HTTP response, JPEG file, …

The origins: the ARC format

Version block (filedesc header, version-1-block, URL-record-definition):

filedesc://IA-001102.arc 0 19960923142103 text/plain 76
1 0 AlexaInternet
URL IP-address Archive-date Content-type Archive-length

ARC record (URL-record, then network_doc = protocol response + object):

http://www.dryswamp.edu:80/index.html 127.10.100.2 19961104142103 text/html 202
HTTP/1.0 200 Document follows
Date: Mon, 04 Nov 1996 14:21:06 GMT
Server: NCSA/1.4.1
Content-type: text/html
Last-modified: Sat,10 Aug 1996 22:33:11 GMT
Content-length: 30

<HTML>
Hello World!!!
</HTML>
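
As an illustration of the URL-record shown on this slide, here is a minimal Python sketch that splits one version-1 URL-record line into the five fields named by the URL-record-definition (URL, IP-address, Archive-date, Content-type, Archive-length). The class and function names are hypothetical, and the sketch assumes the fields are separated by single spaces, as in the example; it is not a full ARC reader.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ArcUrlRecord:
    url: str
    ip_address: str
    archive_date: datetime
    content_type: str
    archive_length: int  # number of bytes in the network_doc that follows


def parse_arc_url_record(line: str) -> ArcUrlRecord:
    """Split one version-1 URL-record line into its five space-separated fields."""
    url, ip_address, date, content_type, length = line.strip().split(" ")
    return ArcUrlRecord(
        url=url,
        ip_address=ip_address,
        # Archive-date is written as YYYYMMDDhhmmss in the example
        archive_date=datetime.strptime(date, "%Y%m%d%H%M%S"),
        content_type=content_type,
        archive_length=int(length),
    )


# The URL-record line from the slide
record = parse_arc_url_record(
    "http://www.dryswamp.edu:80/index.html 127.10.100.2 19961104142103 text/html 202"
)
print(record.archive_date.isoformat(), record.content_type, record.archive_length)
```

Because the record carries no identifier of its own, a reader can only refer to it by file name and offset, which is one of the shortcomings the next slides list.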

Why standardize ARC?

> The ARC format had technical shortcomings
  > Only two record types: hard to distinguish data and metadata
  > There was no way to uniquely identify a record
> The ARC specifications had formal shortcomings
  > The specification was not perfectly clear and was open to interpretation (which made it difficult to define what a valid ARC file was when BnF designed a validation tool)
  > ARC was an Internet Archive de facto standard, not an internationally recognized standard

Why standardize ARC?

> A standard format was needed to:
  > Promote the use of a single format by all web archiving institutions (not only IIPC members)
  > Ensure that the format will not be subject to uncontrolled changes
  > Allow a clear validation of WARC files
  > Foster the development of tools
  > Ensure confidence in the long-term preservation of web archives
  > Integrate web archive formats into the set of international bibliographic standards

WARC standardization history

> WARC = Web ARChive file format
  > Created by IIPC
> WARC is the next generation of the ARC file format
  > ARC format created by the Internet Archive
  > Most legacy web archives are in ARC
> Original discussion: Sept 2004
> First Internet Draft: May 2005
> First ISO Working Draft: Feb 2006
> Final ISO Draft: June 2008
> Final Publication: May 2009

WARC improvements

> A unique identifier for each record
> Eight record types
  > warcinfo
  > response
  > resource
  > request
  > metadata
  > revisit
  > conversion
  > continuation
> Better ways to describe and document the harvesting process (and the deduplication process)
> Possibility to manage format migrations

Example of warcinfo record

WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2006-09-19T17:20:14Z
WARC-Record-ID: <urn:uuid:d7ae5c10-e6b3-4d27-967d-34780c58ba39>
Content-Type: application/warc-fields
Content-Length: 381

software: Heritrix 1.12.0 http://crawler.archive.org
hostname: crawling017.archive.org
ip: 207.241.227.234
isPartOf: testcrawl-20050708
description: testcrawl with WARC output
operator: IA_Admin
http-header-user-agent: Mozilla/5.0 (compatible; heritrix/1.4.0 +http://crawler.archive.org)
format: WARC file version 0.17
conformsTo: http://www.archive.org/documents/WarcFileFormat-0.17.html
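
To make the header/block split concrete, here is a minimal Python sketch that reads the named fields of a WARC record header, using the warcinfo header from this slide. The function name is hypothetical, and the sketch assumes one field per line (it does not handle folded header lines or read the block); a production reader would rather rely on an existing WARC library.

```python
from typing import Dict, Tuple


def parse_warc_headers(raw: str) -> Tuple[str, Dict[str, str]]:
    """Return the version line and the named fields of one WARC record header."""
    lines = raw.strip().splitlines()
    version = lines[0].strip()
    fields: Dict[str, str] = {}
    for line in lines[1:]:
        if not line.strip():
            break  # an empty line ends the header; the block follows
        # Split on the first colon only: values such as dates and URNs contain colons
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return version, fields


sample = """WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2006-09-19T17:20:14Z
WARC-Record-ID: <urn:uuid:d7ae5c10-e6b3-4d27-967d-34780c58ba39>
Content-Type: application/warc-fields
Content-Length: 381
"""

version, headers = parse_warc_headers(sample)
print(version, headers["WARC-Type"], headers["WARC-Record-ID"])
```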

Example of request record

WARC/1.0
WARC-Type: request
WARC-Target-URI: http://www.archive.org/images/logoc.jpg
WARC-Date: 2006-09-19T17:20:24Z
Content-Length: 236
WARC-Record-ID: <urn:uuid:4885803b-eebd-4b27-a090-144450c11594>
Content-Type: application/http;msgtype=request
WARC-Concurrent-To: <urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>

GET /images/logoc.jpg HTTP/1.0
User-Agent: Mozilla/5.0 (compatible; heritrix/1.10.0)
From: [email protected]
Connection: close
Referer: http://www.archive.org/
Host: www.archive.org
Cookie: PHPSESSID=009d7bb11022f80605aa87e18224d824

Example of response record

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://www.archive.org/images/logoc.jpg
WARC-Date: 2006-09-19T17:20:24Z
WARC-Block-Digest: sha1:UZY6ND6CCHXETFVJD2MSS7ZENMWF7KQ2
WARC-Payload-Digest: sha1:CCHXETFVJD2MUZY6ND6SS7ZENMWF7KQ2
WARC-IP-Address: 207.241.233.58
WARC-Record-ID: <urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>
Content-Type: application/http;msgtype=response
WARC-Identified-Payload-Type: image/jpeg
Content-Length: 1902

HTTP/1.1 200 OK
Date: Tue, 19 Sep 2006 17:18:40 GMT
Server: Apache/2.0.54 (Ubuntu)
Last-Modified: Mon, 16 Jun 2003 22:28:51 GMT
ETag: "3e45-67e-2ed02ec0"
Accept-Ranges: bytes
Content-Length: 1662
Connection: close
Content-Type: image/jpeg

[image/jpeg binary data here]

Example of resource record

WARC/1.0
WARC-Type: resource
WARC-Target-URI: file://var/www/htdoc/images/logoc.jpg
WARC-Date: 2006-09-19T17:20:24Z
WARC-Record-ID: <urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>
Content-Type: image/jpeg
WARC-Payload-Digest: sha1:CCHXETFVJD2MUZY6ND6SS7ZENMWF7KQ2
WARC-Block-Digest: sha1:CCHXETFVJD2MUZY6ND6SS7ZENMWF7KQ2
Content-Length: 1662

[image/jpeg binary data here]
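
The digests in these examples are written as `sha1:` followed by 32 Base32 characters. The Python sketch below shows one way such a value can be produced; it assumes SHA-1 with Base32 encoding, as the examples suggest, and the function name is hypothetical. In a resource record the block is the payload itself, which is why both digests are identical on this slide, whereas in a response record the block digest also covers the HTTP headers.

```python
import base64
import hashlib


def warc_sha1_digest(data: bytes) -> str:
    """Return a digest label in the 'sha1:<Base32>' form used in the examples."""
    raw = hashlib.sha1(data).digest()  # 20 bytes, i.e. exactly 32 Base32 characters
    return "sha1:" + base64.b32encode(raw).decode("ascii")


payload = b"stand-in for the JPEG bytes"       # hypothetical payload
http_headers = b"HTTP/1.1 200 OK\r\n\r\n"       # hypothetical protocol headers

payload_digest = warc_sha1_digest(payload)
block_digest = warc_sha1_digest(http_headers + payload)
print(payload_digest)
print(block_digest)
```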

Example of metadata record

WARC/1.0
WARC-Type: metadata
WARC-Target-URI: http://www.archive.org/images/logoc.jpg
WARC-Date: 2006-09-19T17:20:24Z
WARC-Record-ID: <urn:uuid:16da6da0-bcdc-49c3-927e-57494593b943>
WARC-Refers-To: <urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>
Content-Type: application/warc-fields
WARC-Block-Digest: sha1:UZY6ND6CCHXETFVJD2MSS7ZENMWF7KQ2
Content-Length: 59

via: http://www.archive.org/
hopsFromSeed: E
fetchTimeMs: 565

Example of revisit record

WARC/1.0
WARC-Type: revisit
WARC-Target-URI: http://www.archive.org/images/logoc.jpg
WARC-Date: 2007-03-06T00:43:35Z
WARC-Profile: http://netpreserve.org/warc/0.17/server-not-modified
WARC-Record-ID: <urn:uuid:16da6da0-bcdc-49c3-927e-57494593bbbb>
WARC-Refers-To: <urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>
Content-Type: message/http
Content-Length: 226

HTTP/1.x 304 Not Modified
Date: Tue, 06 Mar 2007 00:43:35 GMT
Server: Apache/2.0.54 (Ubuntu) PHP/5.0.5-2ubuntu1.4
Connection: Keep-Alive
Keep-Alive: timeout=15, max=100
Etag: "3e45-67e-2ed02ec0"

Example of conversion record

WARC/1.0
WARC-Type: conversion
WARC-Target-URI: http://www.archive.org/images/logoc.jpg
WARC-Date: 2016-09-19T19:00:40Z
WARC-Record-ID: <urn:uuid:16da6da0-bcdc-49c3-927e-57494593dddd>
WARC-Refers-To: <urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>
WARC-Block-Digest: sha1:XQMRY75YY42ZWC6JAT6KNXKD37F7MOEK
Content-Type: image/neoimg
Content-Length: 934

[image/neoimg binary data here]

Example of continuation record 1/2

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://www.archive.org/images/logoc.jpg
WARC-Date: 2006-09-19T17:20:24Z
WARC-Block-Digest: sha1:2ASS7ZUZY6ND6CCHXETFVJDENAWF7KQ2
WARC-Payload-Digest: sha1:CCHXETFVJD2MUZY6ND6SS7ZENMWF7KQ2
WARC-IP-Address: 207.241.233.58
WARC-Record-ID: <urn:uuid:39509228-ae2f-11b2-763a-aa4c6ec90bb0>
WARC-Segment-Number: 1
Content-Type: application/http;msgtype=response
Content-Length: 1600

HTTP/1.1 200 OK
Date: Tue, 19 Sep 2006 17:18:40 GMT
Server: Apache/2.0.54 (Ubuntu)
Last-Modified: Mon, 16 Jun 2003 22:28:51 GMT
ETag: "3e45-67e-2ed02ec0"
Accept-Ranges: bytes
Content-Length: 1662
Connection: close
Content-Type: image/jpeg

[first 1360 bytes of image/jpeg binary data here]

Example of continuation record 2/2

WARC/1.0
WARC-Type: continuation
WARC-Target-URI: http://www.archive.org/images/logoc.jpg
WARC-Date: 2006-09-19T17:20:24Z
WARC-Block-Digest: sha1:T7HXETFVA92MSS7ZENMFZY6ND6WF7KB7
WARC-Record-ID: <urn:uuid:70653950-a77f-b212-e434-7a7c6ec909ef>
WARC-Segment-Origin-ID: <urn:uuid:39509228-ae2f-11b2-763a-aa4c6ec90bb0>
WARC-Segment-Number: 2
WARC-Segment-Total-Length: 1902
WARC-Identified-Payload-Type: image/jpeg
Content-Length: 302
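
The two segments can be checked against each other: the first carries 1600 bytes, the second 302, and 1600 + 302 = 1902, the value of WARC-Segment-Total-Length. A minimal Python sketch of that consistency check follows; the `Segment` class and `segments_are_consistent` function are hypothetical names, and the sketch assumes that the total length is read from the last segment, as in the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    number: int                         # WARC-Segment-Number
    content_length: int                 # Content-Length of this segment's block
    total_length: Optional[int] = None  # WARC-Segment-Total-Length, if present


def segments_are_consistent(segments: List[Segment]) -> bool:
    """Check that segments are numbered 1..n and add up to the declared total."""
    ordered = sorted(segments, key=lambda s: s.number)
    if [s.number for s in ordered] != list(range(1, len(ordered) + 1)):
        return False  # a segment is missing or duplicated
    total = ordered[-1].total_length
    return total is not None and sum(s.content_length for s in ordered) == total


# Values taken from the two slides above
parts = [Segment(1, 1600), Segment(2, 302, total_length=1902)]
ok = segments_are_consistent(parts)
print(ok)
```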

Integrating web archiving in preservation workflows: the objectives

> Share development and storage costs with other entities of the library
> Benefit from the technology watch performed for other kinds of digital documents
> Obtain a global overview of all of the institution’s digital assets
> And manage all of them in a consistent manner

Integrating web archiving in preservation workflows: the issues

> Digital collections are highly heterogeneous
  > Legal status / preservation status
  > Access constraints
  > Volume of resources
  > Technical characteristics (formats)
  > Data and metadata models
> Heterogeneity may occur even for similar kinds of documents
  > Digitization or harvesting procedures have evolved over time
> Competition may occur between different kinds of digital collections
  > For development priorities, ingest priorities, storage or preservation quality

A shared digital repository: the SPAR example

Louise


Digital everywhere

Functional issue: a digital preservation copy

Digitization as a means of preservation and dissemination

Technical issue: volume

Then: black & white, 300 dpi, TIFF G4. 1 page ~ 200 KB

Now: color (24 bits), 400 dpi, uncompressed TIFF. 1 page ~ 80 MB

More than x500 !!!

Business issue: born digital

More and more documents exist only in digital form

Digital preservation is not alone

[Diagram: SPAR (Infrastructure, Realization) sits between production applications (preservation digitization, web archiving, audiovisual, …) and dissemination applications (wayback, …). SPAR modules: Ingest, Data management, Storage, Access, Administration, Preservation planning, Storage Abstraction Service (SAS).]

How to manage diversity?

> Build a central repository to reduce the diversity (media, formats, departments …)
> Rely on best practices and standards
> Key requirements:
  > OAIS compliance (ISO 14721:2012)
  > modularity and distribution
  > abstraction
  > use of well-known formats and standards
  > use of open-source technical building blocks

A generic model

OAIS reference model: http://public.ccsds.org/publications/archive/650x0m2.pdf

A generic repository solution

[Diagram: each track (preservation digitization, web archives, and so on) has its own Pre-Ingest module feeding SPAR. SPAR modules: Ingest, Storage, Data management, Access, Administration, Preservation planning, on top of a Storage abstraction service and the Infrastructure. Labels on the flows: SIP, DIP, mets, rdf.]

Preservation workflow

> Granularity at the archival package level, to allow for parallelization
> Each module is independent and interacts with the others through defined interfaces
> Every package follows the same basic workflow, with specific features when needed


Information about a package

And with a little patience…

[Timeline, 2004 to 2013. Labels: Study (2004), Working Groups (WG), RFP, Infrastructure, Core part, Operations (May 2010), Other channels, TPS, Admin, AV, WLD, Renewal, New tender (2013).]

Requirements for the infrastructure

> Openness of the solution
  > Openness to multiple environments
  > Compatible with multiple providers
> Availability
  > Reliability of all parts
  > Usage of the system 24 hours a day, 7 days a week, 365 days a year
  > Maximum downtime: 2 hours per month
> Upgradeability
  > Large potential for growth in volume and processing power
  > Coverage of needs for the next 8 years
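
As a quick worked figure, the downtime requirement can be turned into an availability percentage. The Python sketch below assumes a 30-day month, which is an assumption of this example and not something stated on the slide.

```python
# Assumption of this sketch: a month of 30 days, i.e. 720 hours
HOURS_PER_MONTH = 30 * 24
MAX_DOWNTIME_HOURS = 2  # from the slide: maximum downtime of 2 hours per month

availability = 1 - MAX_DOWNTIME_HOURS / HOURS_PER_MONTH
print(f"Required availability: {availability:.2%}")
```

Under that assumption the requirement corresponds to roughly 99.7% availability.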

Infrastructure

[Diagram: main site with servers, online storage, lookup storage, primary storage and secondary storage; backup site with backup servers, backup lookup storage, backup storage and backup secondary storage.]

Primary and backup storage

Oracle StorageTek SL8500
• up to 64 tape drives
• up to 8500 tapes
• up to 8 hand pickers
• up to 32 linked libraries

Primary storage: 2 libraries, 16 PB maximum

Backup storage: 2 libraries, 16 PB maximum


Tape library in-situ

Media

Primary storage: LTO5, capacity 1.5 TB, transfer rate 140 MB/s (previously: 9840C, 40 GB)

Backup storage: T10000B, capacity 1 TB, transfer rate 120 MB/s (previously: T10000A, 500 GB)

How to manage diversity, really?

> To deal with the variability and heterogeneity of the data: tracks
  > built on the relation between the digital objects and the archival system, independently of any given organization
  > Preservation digitization
  > Audiovisual material
  > Negotiated legal deposit (dark Web, regional press)
  > Automatic legal deposit (web harvests)
  > Administrative production
  > Deposit / Third party archiving
  > Acquisition / Donation
> Then channels based on technical criteria
> And then: DECISIONS!

Decisions, decisions: Service Level Agreements

> The Service Level Agreements are contracts between the users and SPAR, to offer a more transparent system
> The system is no longer a black box only known by technical experts

Workflows driven by the SLA

[Diagram: the SPAR modules (Pre-Ingest, Ingest, Storage, Data management, Access, Preservation, Administration, Storage abstraction service) exchanging SIP, AIP and DIP packages, with mets and rdf metadata.]

Questions answered by the SLA:
> Which formats are allowed?
> How many copies are needed, on what kind of media?
> What is the maximum size of a package?
> Do we need to log each access?

SLA: a reference package

> 3 SLAs: Ingest, Preservation, Access
> They formalize in XML the ways of managing the packages
> Those 3 SLAs are recorded in a reference package that describes the channel

Contents of the reference package:
> SLA-I.xml, SLA-P.xml, SLA-A.xml
> Mets.xml
> Contract.pdf


Decisions, decisions: metadata

Decisions, decisions: levels of granularity

> set: Le Matin
> group: Year 1882, Year 1883
> object: 01/07/1882, 02/07/1882, 03/07/1882, 28/02/1883, 01/03/1883
> file

Decisions, decisions: formats (1)

> Define the scope
  > For instance: accept files of large-format posters directly from the printers as a substitute for the legal deposit of paper posters
> Negotiation between:
  > Producers: what the printers can output
  > Librarians: what can be displayed/handled easily
  > Preservation experts: what is acceptable given significant properties, communities, standards…

Decisions, decisions: formats (2)

Criteria: true to the original, characterization tools, standards, previous expertise, producer

> TIFF: **********
> PDF/X: ***********
> QuarkXPress: ***000***

For this purpose, PDF/X was chosen as a good compromise between fidelity to the original, wide usage and standardization

Requirements on formats

Kind: Description

> Stored: No technical information. Bit stream preservation.
> Identified: Identified format => associated MIME type. No preservation strategy planned by the institution.
> Known: Format identified, documented, with tools => associated schema of technical description. Preservation strategy planned by the institution.
> Managed: Identified format. Documentation and tools owned by the institution. Profile of use defined in the institution.


Format definition: a reference package

Mets.xml: manifest

T000001.tiff: sample

format.xml: machine readable description

format.txt: human description

Characterization?

> Identification. The process of determining the presumptive format of a digital object on the basis of suggestive extrinsic hints and intrinsic signatures, both internal (e.g. magic number) and external (e.g. file extension).

> Validation. The process of determining the level of conformance to the normative syntactic and semantic rules defined by the authoritative specification of the object's format.

> Feature extraction. The process of reporting the intrinsic properties of a digital object significant for purposes of classification, analysis, and use.

> Assessment. The process of determining the level of acceptability of a digital object for a specific purpose on the basis of locally-defined policy rules.

http://www.fao.org/oek/jhove2/digital-preservation-and-jhove2-home/jhove2-tutorial/en/

Tools for identification

> At BnF = getting a MimeType
> MagicMimeTypeIdentifier: from the Aperture project
  > Can use the magic number or the file extension
  > Very narrow database
> Libmagic: from the file(1) utility
  > Very old library, extensive database, still actively maintained
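
To illustrate the two kinds of hints mentioned here (magic number first, file extension as a fallback), here is a small Python sketch. It is not the Aperture or libmagic implementation: the signature table is a tiny hand-written sample chosen for this example, the function name is hypothetical, and only the standard-library `mimetypes` module is used for the extension lookup.

```python
import mimetypes

# Tiny illustrative signature table (an assumption of this sketch, far from complete)
MAGIC_SIGNATURES = [
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"%PDF-", "application/pdf"),
    (b"\x1f\x8b", "application/gzip"),
    (b"filedesc://", "application/x-ia-arc"),  # an uncompressed ARC file starts with its version block
]


def identify(file_name: str, head: bytes) -> str:
    """Return a presumptive MIME type from the first bytes, else from the extension."""
    for magic, mime_type in MAGIC_SIGNATURES:
        if head.startswith(magic):
            return mime_type
    guessed, _encoding = mimetypes.guess_type(file_name)
    return guessed or "application/octet-stream"


print(identify("logoc.bin", b"\xff\xd8\xff\xe0"))      # internal signature wins
print(identify("index.html", b"<HTML> Hello World"))   # falls back to the extension
```

The result is only presumptive: a file named `.jpg` that does not start with a JPEG signature would still be reported as JPEG by the fallback, which is why validation is a separate step.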

Tools for characterization

> At BnF = getting properties, using the proper tool based on MimeType

MimeType: tool and output schema

> text/*: output schema textMD
> image/*: output schema mix
> audio/*: output schema MPEG-7
> video/*: output schema MPEG-7
> application/x-ia-arc: tool Jhove2, output schema containerMD
> application/gzip: tool Jhove2, output schema containerMD

containerMD: http://bibnum.bnf.fr/containerMD-v1
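
Choosing the output schema from the MimeType amounts to a pattern lookup. The Python sketch below encodes the mapping from this slide as a list of wildcard patterns; the function name is hypothetical, the pairing of audio/* and video/* with MPEG-7 is read from the slide layout, and the real dispatch inside SPAR may work differently.

```python
from fnmatch import fnmatch
from typing import Optional

# MIME type pattern -> output schema, as read from the slide
OUTPUT_SCHEMAS = [
    ("text/*", "textMD"),
    ("image/*", "mix"),
    ("audio/*", "MPEG-7"),
    ("video/*", "MPEG-7"),
    ("application/x-ia-arc", "containerMD"),
    ("application/gzip", "containerMD"),
]


def output_schema_for(mime_type: str) -> Optional[str]:
    """Return the technical metadata schema for a MIME type, or None if unmapped."""
    for pattern, schema in OUTPUT_SCHEMAS:
        if fnmatch(mime_type, pattern):
            return schema
    return None


for mime in ("image/jpeg", "application/x-ia-arc", "application/zip"):
    print(mime, "->", output_schema_for(mime))
```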

The ingest process

Steps shown in the diagram (ACT_01 to ACT_09):
> Ingest request reception
> SIP reception
> Audit
  > Manifest validation
  > Package search within SPAR
  > SIP characteristics audit
  > SIP files audit and characterization
> ARK identifier generation
> SET processing
> Ingest completion

ACT_06: File processing

> ACT_06.1.1 File identification => [ Identified file ]
> ACT_06.1.2 File characterization => [ Known file ]
> ACT_06.1.2 Significant properties extraction => [ Managed format ]
> ACT_06.2.1 Managed format in SLA? [ Accepted ] / [ Rejected ] / [ Undefined ]
> ACT_06.2.2 Known format in SLA? [ Accepted ] / [ Rejected ] / [ Undefined ]
> ACT_06.2.3 Identified format in SLA? [ Accepted ] / [ Rejected ] / [ Undefined ]
> ACT_06.2.4 Stored format in SLA? (for a [ Stored file ]) [ Accepted ] / [ Rejected ]
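
One possible reading of this decision flow is a cascade: the SLA is consulted at the highest level the file reached (managed, known, identified, stored); an "undefined" answer falls through to the next lower level. The Python sketch below follows that reading. The names, the shape of the SLA as a dictionary, the `"*"` wildcard and the final rejection when every level is undefined are all assumptions of this example, not the SPAR implementation (where the SLAs are XML documents).

```python
from enum import IntEnum
from typing import Dict, Optional, Tuple


class Level(IntEnum):
    STORED = 1
    IDENTIFIED = 2
    KNOWN = 3
    MANAGED = 4


# For each level: format -> True (accepted) or False (rejected); absent = undefined
Sla = Dict[Level, Dict[str, bool]]


def check_file(fmt: str, reached: Level, sla: Sla) -> Tuple[bool, Optional[Level]]:
    """Walk down from the level the file reached until the SLA gives a verdict."""
    for level in sorted(Level, reverse=True):
        if level > reached:
            continue  # the file never qualified for this level
        rules = sla.get(level, {})
        verdict = rules.get(fmt, rules.get("*"))  # "*" stands for any format
        if verdict is not None:
            return verdict, level
    return False, None  # assumption: nothing defined anywhere means rejection


sla: Sla = {
    Level.MANAGED: {"image/tiff": True},
    Level.IDENTIFIED: {"application/x-shockwave-flash": False},
    Level.STORED: {"*": True},
}

print(check_file("image/tiff", Level.MANAGED, sla))
print(check_file("image/tiff", Level.IDENTIFIED, sla))
print(check_file("application/x-shockwave-flash", Level.KNOWN, sla))
```

In this sample SLA, a TIFF that only reached the "identified" level is still accepted, but merely as a stored file with bit-stream preservation.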

Decisions, decisions: building a package out of it

Decisions, decisions: tweaking METS

> Header
> DmdSec
> AmdSec
  > TechMD
  > DigiProvMD
  > SourceMD
  > RightsMD
> FileSec
> StructMap
> Structlink
> BehaviorSec

Structural metadata: METS

Descriptive and source metadata: qualified Dublin Core

Provenance metadata: PREMIS

Technical metadata: depends on the data objects (e.g. MPEG-7)

Integrating web archives into the SPAR repository

Preserving the French web archives

Background (1): one mandate, heterogeneous collections

> 1996-2005: 70 TB
> 2002 & 2004: 0.5 TB
> 2004-2008: 45 TB
> 2006-2010: 22 TB
> 2010-now: 150 TB

[Figure labels: operator, robot, + Alexa bot]

2006-08-01: French copyright law entitles BnF to collect the French Internet

Background (2): a generic repository solution at BnF

[Diagram: digitized books, digitized audiovisual documents and web archiving each go through their own Pre-Ingest module.]

Web archives have been in the scope of SPAR since the beginning, but there is still a need to align with the existing implementation


But first things first!

What do we want to preserve?

The harvested files

> Here is what I crawled on the Web (HTML files)
> And how I packaged them in a web archive container file (ARC data)

Intellectual information

> This is the 2012 French election collection
> This is the daily news collection

Provenance information

> This was harvested with the Heritrix tool v1.5.2, using NetarchiveSuite
> This was harvested using HTTrack

Provenance information, II

> First harvest (config): we harvested content suited for Mozilla Firefox. We respected the robots.txt. The job crashed.
> Second harvest (config): we harvested content suited for Internet Explorer. We ignored the robots.txt. The job went well.

Provenance information, III

> First harvest (log, report): this was captured on 8 May 2012. This captured 1098 websites. This produced 105 ARC files.
> Second harvest (log, report): this was captured in 2006. This captured 145 websites. This produced 50 ARC files.


Consistent data before SPAR

Cleaning the stuff before preserving it

All data on a single target

> 1996-2005: 70 TB
> 2002 & 2004: 0.5 TB
> 2004-2008: 45 TB
> 2006-2010: 22 TB
> 2010-now: 150 TB

[Figure labels: unknown, + Alexa bot]

Aligning collections before ingest: the NetarchiveSuite target workflow

> This is a collection containing French election websites (harvest 1 + harvest 2 + harvest 3)
> Here are the files we harvested (HTML harvested files)
> They are included in web-archive-specific files (ARC data)
> This was done with these tools (ARC metadata: log, config, report)

Aligning collections before ingest: the NetarchiveSuite target workflow

A three-layered model in SPAR

> Harvest Definition (curator collection)

> Harvest Instance (“technical” harvest = job)

> ARC file (data or metadata)
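The three layers can be pictured as nested records. The following is only a minimal sketch: the class names mirror the slide, while the field names and the data file name `32-1.arc` are assumptions made for illustration, not SPAR's actual data model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ArcFile:
    """Lowest layer: one ARC file, holding either harvested data or harvest metadata."""
    name: str
    kind: str  # "data" or "metadata"


@dataclass
class HarvestInstance:
    """Middle layer: one "technical" harvest, i.e. one crawl job."""
    job_id: int
    arc_files: List[ArcFile] = field(default_factory=list)


@dataclass
class HarvestDefinition:
    """Top layer: the collection as defined by a curator."""
    title: str
    instances: List[HarvestInstance] = field(default_factory=list)

    def arc_files(self, kind: str) -> List[ArcFile]:
        """Return every ARC file of the given kind across all harvest instances."""
        return [f for inst in self.instances for f in inst.arc_files if f.kind == kind]


elections = HarvestDefinition(
    title="French election websites",
    instances=[
        HarvestInstance(job_id=32, arc_files=[
            ArcFile("32-1.arc", "data"),  # hypothetical file name
            ArcFile("32-metadata-1.arc", "metadata"),
        ]),
    ],
)
```

Calling `elections.arc_files("metadata")` walks the definition down through its jobs to the individual ARC files, which is the navigation path the three-layered model is meant to support.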

An « ARC metadata » sample

filedesc://32-metadata-1.arc 0.0.0.0 20100416092026 text/plain 77

1 0 InternetArchive

URL IP-address Archive-date Content-type Archive-length

metadata://netarkivet.dk/crawl/setup/harvestInfo.xml?heritrixVersion=1.14.3&harvestid=1&jobid=32 172.20.16.214 20100414095814 text/xml 366

<?xml version="1.0" encoding="UTF-8"?> <harvestInfo> <version>0.2</version> <jobId>32</jobId> <priority>LOWPRIORITY</priority> <harvestNum>0</harvestNum> <origHarvestDefinitionID>1</origHarvestDefinitionID> <maxBytesPerDomain>-1</maxBytesPerDomain> <maxObjectsPerDomain>1000</maxObjectsPerDomain> <orderXMLName>default</orderXMLName> </harvestInfo>

metadata://netarkivet.dk/crawl/setup/order.xml?heritrixVersion=1.14.3&harvestid=1&jobid=32 172.20.16.214 20100414095815 text/xml 44775

<?xml version="1.0" encoding="UTF-8"?> <crawl-order xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="heritrix_settings.xsd">
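Each record in the sample starts with a header line whose five fields follow the "URL IP-address Archive-date Content-type Archive-length" definition. The sketch below shows one way such a line could be read, and how the harvest parameters carried by the `metadata://` URL could be pulled out. The class and function names are illustrative assumptions, not part of NetarchiveSuite or SPAR.

```python
from dataclasses import dataclass
from datetime import datetime
from urllib.parse import parse_qs


@dataclass
class ArcRecordHeader:
    url: str
    ip_address: str
    archive_date: datetime
    content_type: str
    archive_length: int


def parse_header(line: str) -> ArcRecordHeader:
    """Split one space-separated ARC v1 record header into its five fields."""
    url, ip, date, content_type, length = line.strip().split(" ")
    return ArcRecordHeader(url, ip, datetime.strptime(date, "%Y%m%d%H%M%S"),
                           content_type, int(length))


def harvest_parameters(header: ArcRecordHeader) -> dict:
    """Return the query parameters found after the '?' of a metadata:// URL."""
    query = header.url.partition("?")[2]
    return {key: values[0] for key, values in parse_qs(query).items()}


sample = ("metadata://netarkivet.dk/crawl/setup/harvestInfo.xml"
          "?heritrixVersion=1.14.3&harvestid=1&jobid=32 "
          "172.20.16.214 20100414095814 text/xml 366")
header = parse_header(sample)
params = harvest_parameters(header)
```

Here `params` links the record back to the Heritrix version, the harvest definition and the job, which is the provenance information the ARC metadata file is there to keep.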


All data on a single target

1996-2005: two layers

- Collection

- ARC files

2010-now: three layers

- Harvest Definition

- Harvest instance

- ARC files

All data on a single target

2006-2010: four layers

- Collection

- Harvest division

- Harvest instance

- ARC files

2010-now: three layers

- Harvest Definition

- Harvest instance

- ARC files


What SPAR needs

Decisions, decisions: levels of granularity

> set: Le Matin

> group: Year 1882, Year 1883

> object: 01/07/1882, 02/07/1882, 03/07/1882, 28/02/1883, 01/03/1883

> file

Different layers

set / group

> Contains nothing but metadata

> Curator information that allows grouping AIPs sharing the same intellectual content

AIP

> Must contain files to be preserved

> Each AIP is an autonomous unit

[Diagram: the Le Matin set and its groups sit above the AIPs, one AIP per dated issue (01/07/1882, 02/07/1882, 03/07/1882, 28/02/1883, 01/03/1883)]

METS and PREMIS to store all the preserved content

<mets>

<dmdSec>: Intellectual metadata

<amdSec>: Administrative metadata

<sourceMD>: Metadata about the source used to produce this content

<techMD>: Technical metadata (MPEG-7)

<digiprovMD>: Provenance metadata

<fileSec>: List of the files

<structMap>: Structure of the package

> No management of files within files so far

> Neither tools nor XML schema for the technical characteristics of ARC files

RDF database to query them

> The database is powerful but LIMITED

> Thus, we cannot express all the information for each harvested file


Mixing all of this very hard…


« Prometheus, we start the mapping!!! »

PREMIS as a preservation koinè

> Object: harvestInstance, ARC files, report

> Event: Harvest event (outcome extensions: hosts report)

> Agent: persons (admins), software, organizations

> Relationships: “has harvest instance”, “is documented in”

In other terms…

[Diagram: the collection containing French election websites is made of harvest 1 + harvest 2 + harvest 3; each harvest has ARC data files holding the harvested HTML files, plus an ARC metadata file holding log, config and report]

In other terms…

[Diagram: the collection containing French election websites is mapped to a set, its harvests to groups, and each ARC data or ARC metadata file to an AIP]


Preserving web archives

Challenge 2: analyze ARC files and content files


Tools to analyze the ARC files

Need to

> identify and validate ARC files

> characterize ARC files (extract information)

> handle GZIP compression

> do at least identification of content files

> for large scale collections

[Diagram: ARC and ARC.GZ containers whose content files (HTML or unknown) still have to be identified]

Development of JHOVE2 ARC and GZIP modules
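To illustrate the "handle GZIP compression" requirement, here is a minimal sketch (not the JHOVE2 module itself) that identifies a gzip container by its two magic bytes and decompresses it. It assumes the common ARC.GZ layout in which each record is compressed as its own gzip member and the members are simply concatenated. The record contents are made up for the example.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip member (RFC 1952)


def is_gzip(data: bytes) -> bool:
    """Identify a gzip container by its magic number rather than its file extension."""
    return data[:2] == GZIP_MAGIC


def read_container(data: bytes) -> bytes:
    """Return the uncompressed bytes of an ARC or ARC.GZ container."""
    # gzip.decompress reads all concatenated members, not just the first one.
    return gzip.decompress(data) if is_gzip(data) else data


# Two made-up records, each compressed as its own gzip member.
record_1 = b"http://example.org/a 0.0.0.0 20120508000000 text/html 6\nhello\n"
record_2 = b"http://example.org/b 0.0.0.0 20120508000001 text/html 6\nworld\n"
compressed = gzip.compress(record_1) + gzip.compress(record_2)
```

Checking the magic number first means the same code path can serve both plain ARC and ARC.GZ files, which matters when a large collection mixes the two.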

Managing the ARC structure

First ARC record (version block):

filedesc://IA-001102.arc 0 19960923142103 text/plain 76

1 0 AlexaInternet

URL IP-address Archive-date Content-type Archive-length

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

<arcmetadata>

<arc:software>Heritrix 1.14.2 http://crawler.archive.org</arc:software>

</arcmetadata>

URL record:

http://www.dryswamp.edu:80/index.html 127.10.100.2 19961104142103 text/html 202

HTTP/1.0 200 Document follows

Date: Mon, 04 Nov 1996 14:21:06 GMT

Server: NCSA/1.4.1

Content-type: text/html

Last-modified: Sat, 10 Aug 1996 22:33:11 GMT

Content-length: 30

<HTML>

Hello World!!!

</HTML>

[Diagram labels: the first ARC record is the version block, made of the filedesc header, the URL record definition and a metadata object; each URL record is made of a header, a protocol response and a data object]
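The decomposition of a URL record into header, protocol response and data object can be sketched as follows. This is an illustration on the sample record only, with a hypothetical function name. A real parser would use the Archive-length field to delimit records rather than rely on a blank line.

```python
def split_url_record(record: bytes):
    """Split one ARC URL record into (header fields, protocol response, data object)."""
    header_line, _, content = record.partition(b"\n")
    fields = header_line.decode("ascii").split(" ")
    # The protocol response ends at the first blank line (CRLF or LF line endings).
    positions = [(content.find(sep), sep) for sep in (b"\r\n\r\n", b"\n\n")]
    positions = [(pos, sep) for pos, sep in positions if pos != -1]
    if not positions:
        return fields, content, b""
    pos, sep = min(positions)
    return fields, content[:pos], content[pos + len(sep):]


record = (
    b"http://www.dryswamp.edu:80/index.html 127.10.100.2 19961104142103 text/html 202\n"
    b"HTTP/1.0 200 Document follows\n"
    b"Date: Mon, 04 Nov 1996 14:21:06 GMT\n"
    b"Server: NCSA/1.4.1\n"
    b"Content-type: text/html\n"
    b"Content-length: 30\n"
    b"\n"
    b"<HTML>\nHello World!!!\n</HTML>\n"
)
fields, response, data_object = split_url_record(record)
```

The three return values correspond to the three parts a characterization tool reports on: what the crawler recorded, what the server answered, and the harvested file itself.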

Problem with JHOVE2 outputs

> Too verbose: 100 Mb of output for 100 Mb of data (with only identification of content files)

> Need to aggregate and compact all this information

> Need to handle ARC format peculiarities

> Need for a container file-specific format

containerMD

http://bibnum.bnf.fr/containerMD-v1

Where containerMD fits

<mets>

<dmdSec>: Intellectual metadata

<amdSec>: Administrative metadata

<sourceMD>: Metadata about the source used to produce this content

<techMD>: Technical metadata (MPEG-7, containerMD)

<digiprovMD>: Provenance metadata

<fileSec>: List of the files

<structMap>: Structure of the package
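As a rough illustration of this layout, the sketch below builds a METS skeleton in which an empty containerMD element is wrapped inside a `techMD` section. It assumes containerMD is embedded through the standard METS `mdWrap`/`xmlData` mechanism with `MDTYPE="OTHER"`. The `ID` values are invented, and the result is a skeleton only, not a schema-valid SPAR package.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
CONTAINERMD_NS = "http://bibnum.bnf.fr/containerMD-v1"  # namespace given on the slides

ET.register_namespace("mets", METS_NS)
ET.register_namespace("containerMD", CONTAINERMD_NS)


def q(namespace: str, tag: str) -> str:
    """Build a namespace-qualified tag name in ElementTree's {ns}tag notation."""
    return f"{{{namespace}}}{tag}"


def build_mets() -> ET.Element:
    """Return a METS skeleton with containerMD wrapped inside techMD."""
    mets = ET.Element(q(METS_NS, "mets"))
    ET.SubElement(mets, q(METS_NS, "dmdSec"), ID="DMD.1")  # intellectual metadata
    amd = ET.SubElement(mets, q(METS_NS, "amdSec"))        # administrative metadata
    tech = ET.SubElement(amd, q(METS_NS, "techMD"), ID="TECH.1")
    wrap = ET.SubElement(tech, q(METS_NS, "mdWrap"),
                         MDTYPE="OTHER", OTHERMDTYPE="containerMD")
    xml_data = ET.SubElement(wrap, q(METS_NS, "xmlData"))
    ET.SubElement(xml_data, q(CONTAINERMD_NS, "containerMD"))
    ET.SubElement(amd, q(METS_NS, "digiprovMD"), ID="PROV.1")  # provenance metadata
    ET.SubElement(mets, q(METS_NS, "fileSec"))    # list of the files
    ET.SubElement(mets, q(METS_NS, "structMap"))  # structure of the package
    return mets


mets = build_mets()
xml_text = ET.tostring(mets, encoding="unicode")
```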

containerMD features

> containerMD root element

> container

> entries

> entriesInformation: aggregated information about the entries

> entry (repeated)

> ARC-specific extensions: ARCContainer, ARCEntries, ARCRecord (repeated)

Aggregation example: format information

Verbose information:

> entry: format: text/html, size: 6026213

> entry: format: application/pdf, size: 602621

> entry: format: image/tiff, size: 60262132

> entry: format: text/html, size: 165165

Aggregated information (factorizing and sum):

> entriesInformation count=400

> format: text/html, count=300, globalSize=40645654

> format: application/pdf, count=20, globalSize=265464

> etc.
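The "factorizing and sum" step amounts to grouping the verbose entries by format, then counting them and adding up their sizes. A minimal sketch, using the four entries shown on the slide as input (the dictionary layout is an assumption for illustration, not the containerMD schema):

```python
from collections import defaultdict


def aggregate(entries):
    """Group entries by format, keeping a file count and a summed size per format."""
    formats = defaultdict(lambda: {"count": 0, "globalSize": 0})
    for entry in entries:
        summary = formats[entry["format"]]
        summary["count"] += 1
        summary["globalSize"] += entry["size"]
    return {"count": len(entries), "formats": dict(formats)}


entries = [
    {"format": "text/html", "size": 6026213},
    {"format": "application/pdf", "size": 602621},
    {"format": "image/tiff", "size": 60262132},
    {"format": "text/html", "size": 165165},
]
info = aggregate(entries)
```

However many entries an ARC file holds, the aggregated result grows only with the number of distinct formats, which is what keeps the metadata compact.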

Now, here is some cool stuff we will be able to ask SPAR

> Give me all the jobs that crashed last year

> Give me a list of all the file formats per broad crawl…

> and the number of files per format

> and the global size before and after decompression

> remove the error pages

> order them by decreasing number of files

> Give me, for all the newspaper collections

> all the crawls

> order them by date

> the number of harvested files per crawl
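The second request above (formats per crawl, error pages removed, ordered by decreasing number of files) can be sketched on a small in-memory list. In SPAR the question would be put to the RDF database. The records, the field names and the choice of treating HTTP status 400 and above as "error pages" are all assumptions made for this illustration.

```python
from collections import Counter

# Hypothetical per-file records; SPAR itself only stores aggregated information.
files = [
    {"crawl": "broad-2011", "format": "text/html", "status": 200},
    {"crawl": "broad-2011", "format": "text/html", "status": 200},
    {"crawl": "broad-2011", "format": "text/html", "status": 404},
    {"crawl": "broad-2011", "format": "application/pdf", "status": 200},
    {"crawl": "broad-2011", "format": "image/jpeg", "status": 200},
    {"crawl": "broad-2011", "format": "image/jpeg", "status": 200},
    {"crawl": "broad-2011", "format": "image/jpeg", "status": 200},
    {"crawl": "broad-2010", "format": "text/html", "status": 200},
]


def formats_per_crawl(files, crawl):
    """List (format, number of files) for one crawl, largest count first, errors excluded."""
    counts = Counter(
        f["format"] for f in files
        if f["crawl"] == crawl and f["status"] < 400  # assumption: >= 400 is an error page
    )
    return counts.most_common()


report = formats_per_crawl(files, "broad-2011")
```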

Conclusion and next steps

> A pragmatic approach to handle large-scale and heterogeneous collections

> The huge mass of data is still an issue

> The benefits of a shared repository

> Cross-domain investigation on preservation strategies (AV material, office formats, e-book formats…)

> Different policies (SLAs) for different collections

> Improve file format information

> By testing different tools

> By improving format information databases

> International cooperation is key

At international level: the preservation working group

> Goals of the PWG

> Exchange information and best practices (WARC format, information packages, etc.)

> Promote the development of tools, review specifications and perform tests (WARC tools, JHOVE, etc.)

> Promote the web archive needs within the digital preservation community and projects

> Working fields

> Objectives and concepts of preserving archived web resources

> Preservation metadata

> Preservation workflows and digital repository functions and requirements

> Preservation strategies (migration, emulation…)

> Web environment technical documentation

> Evaluation of digital preservation tools and gaps towards web archives

> Organizational issues (costs, sustainability, promotion, skills,…)


Web archiving at the British Library

Helen Hockx-Yu

Head of Web Archiving


Overview

> Part 1: Background, history and organisation

> Part 2: Web Archiving Tools (including demos)

> Part 3: Access

> Part 4: Non-print Legal Deposit and future strategy

29th November 2012 Session 7 - Web archiving at the British Library 2


BL Structure

> BL Board and Executive Team

> e-Strategy and Information Systems (eIS) > IT-based products and services

> Finance and Corporate Services (F&CS) > Money

> Human Resources > People

> Operations & Services (O&S) > Front line services

> Scholarship and Collections (S&C) > Content (Arts and humanities, Social Sciences, Science, Technology & Medicine)

> Strategic Marketing and Communications (SMC) > Brand and reputation


Web archiving timeline


Current web archiving strategy

> Selective archiving of websites in 4 areas, those that

> reflect the diversity of lives, interests and activities throughout the UK

> contain research value or are of research interest

> feature political, cultural, social and economic events of national interest

> demonstrate innovative use of the web

> Also prioritise websites at risk and web-only content

> Permission based

> Permission to archive, to provide online access and to preserve. Also ask for 3rd party rights clearance

> 30% success rate, 5% explicit refusal (mostly due to 3rd party rights)

> Online access through UK Web Archive

> Expect to crawl at domain level (from April 2013) for Non-print Legal Deposit


The current Web Archiving team

Skills Profile

> IT

> Collection management, digital curation

> Management

> Communications

> Web Archiving

(Internal Collaboration)

> The Web Archiving Team is involved in the end-to-end process but works with other departments / teams in the library

Department / Team | Activity / Support

S&C (Subject specialist group, Curator’s Choice project) | Selection, curation

eIS | Network, hardware and IT support

O&S Resource Discovery & Research | Corporate level resource discovery http://explore.bl.uk/

CA&D Digital Processing | Cataloguing (special collection level)

SMC | Publicity, press release, events

The Legal Deposit Programme | Domain crawl capability / process and policy

Curator’s Choice

> Pilot project with a small group of dedicated curators / subject specialists

> Special Collections of curator’s choice. Curators take responsibility for owning, maintaining and growing the collections over time

> Evolving Role of Libraries in the UK

> Political Action and Communication

> Slavery and Abolition in the Caribbean

> UK relations with the Low Countries

> 19th Century English Literature

> Oral History in the UK

> Film in the UK

> Energy


Web Archiving Advisory Group

> Provide advice and support to the Web Archiving Team

> Act as a ‘critical friend’ to assist in the development of policy and practice.

> Specific advice and support on:

> Purpose, vision and benefits.

> Strategic direction and planning.

> Synergy with internal teams and collaboration with external stakeholders/partners.

> Policy changes and risk management


(External) Collaboration

> UK Web Archiving Consortium (2004-2007): centralised infrastructure and development, distributed collections

> UK Web Archive partners, National Archives, Legal Deposit Libraries (LDLs)

> External Collaborators

> Wellcome Library

> Live Art Development Agency

> The Cambridge Innovation Network

> The Women’s Library

> Institute of Historical Research, University of London

> Individual researchers, specialists

> General public – ca. 20 nominations / week

> National organisations: DPC, JISC

> International: IIPC


JISC UK Web Domain Dataset (1996-2010)

> Collaboration with JISC and the Internet Archive

> UK Web Domain Dataset (1996-2010) – UK websites extracted from the Internet Archive's collection and supported by funding from the JISC

> 35TB research dataset

> No local access to individual websites but access to secondary dataset allowed

> BL has developed visualisations of the dataset

> JISC funded 2 further projects using this dataset

> Analytical Access to the Domain Dark Archive

> Big Data: Demonstrating the Value of the UK Web Domain Dataset for Social Science Research


Web Archiving Tools

> Support key processes: selection, harvesting, storage, access, preservation

> Mostly open source tools, some developed in-house

> New tools / changes to current tools expected when business processes change due to non-print Legal Deposit


Selection Tools

> Selection: decide what websites to archive and to include as part of a web archive collection

> Selection and Permission Tool: https://wct.bl.uk/selection/

> Submit selection – real time checking of duplicates, fetching meta tags from live sites

> Collect metadata

> Add contact details

> Suggest crawl frequency

> Permissions management – send emails, direct users to online licence form, store the completed forms, pass details to WCT (create authorisation record and a pending target)

> Reports

> Twittervane


Harvesting Tools

> Harvesting: automated downloading of selected websites using crawler software; quality assurance regarded as an element

> The Web Curator Tool (WCT): https://wct.bl.uk/wct/

> Job scheduling

> Metadata

> Access control

> Harvesting (uses Heritrix)

> QA


Quality Assurance

> Placing more emphasis on intellectual content than appearance or behaviour of a website

> Use four aspects to define quality:

> Completeness of capture: whether the intended content has been captured as part of the harvest.

> Intellectual content: whether the intellectual content (as opposed to styling and layout) can be replayed in the Access Tool.

> Behaviour: whether the harvested copy can be replayed including the behaviour present on the live site, such as the ability to browse between links interactively.

> Appearance: look and feel of a website.

> Rely on visual comparison, previous harvests & crawl logs

> Recent development of QA module to allow bulk operation, reduce # of clicks and make QA recommendations


Supporting Long-term Preservation

> Storing data in WARCs and metadata in METS

> Migrate all legacy data into WARCs

> WCT outputs WARC files

> Submission Information Package (SIP) profiles for selective and domain crawls

> Storing descriptive metadata (e.g. permission information) & technical metadata (e.g. crawl log, crawl configurations, virus scan events)

> Ingest archived websites in the Digital Library System (DLS)

> Command line tool generates SIPs

> Providing access from the DLS (in future)


Demo (45 minutes)

> Selection and Permission Tool (https://wct.bl.uk/selection/)

> Web Curator Tool (https://wct.bl.uk/wct/)


Access

> Currently 3 ways to access the web archive

> Online through the UK Web Archive

> Catalogue records (of special collections)

> Keyword search through Primo (corporate resource discovery system)

> Conduct researcher survey to understand requirements

> Analytical access


Catalogue Records


Keyword search through Primo


UK Web Archive


> Websites archived by BL and partners since 2004 (65% by BL)

> 122,99 websites, 50,866 instances, 13.6TB WARCs

> Over 100,000 unique visits since 1st April 2012

> Key websites include videos

> Full-text, N-gram, title and URL search

> Browse by subject / special collection, visual browsing

http://www.webarchive.org.uk


Analytical Access

> Shift of focus from the level of single webpages or websites to the entire web archive collection.

> Use web archives as datasets

> Support survey, annotation, contextualisation and visualisation

> Allows discovery of patterns, trends and relationships in inter-linked web pages

> Extracting value from the “haystacks”

> Helps address a number of challenging issues

> Scalability

> Accessibility of individual websites

> Components missed by crawlers


Visualising the UK Web

> http://www.webarchive.org.uk/ukwa/visualisation

> N-gram search

> Links analysis

> Format Analysis

> Geo-index

> http://www.webarchive.org.uk/bluebox/

> uses the Memento aggregate TimeGate hosted by lanl.gov

> “resource not in archive” – who else has it?

> Open data

> Dataset and APIs for general use

> Enable broader community to re-use, explore and visualise content of web archive
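For the Memento-based lookup mentioned above, a client asks a TimeGate for the capture of a URL closest to a given date by sending an `Accept-Datetime` header. The sketch below only builds such a request and does not send it. The TimeGate address is a placeholder, and appending the target URL to the TimeGate URL is an assumed (though common) convention.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request

# Hypothetical TimeGate address; the real aggregator URL is not given on the slide.
TIMEGATE = "http://timegate.example.org/timegate/"


def timegate_request(timegate: str, target_url: str, when: datetime) -> Request:
    """Build a Memento datetime-negotiation request for target_url at the given time."""
    request = Request(timegate + target_url)
    # Memento expects an RFC 1123 date expressed in GMT.
    request.add_header("Accept-Datetime", format_datetime(when, usegmt=True))
    return request


req = timegate_request(
    TIMEGATE,
    "http://www.bl.uk/",
    datetime(2012, 11, 29, tzinfo=timezone.utc),
)
```

An aggregate TimeGate answers on behalf of several archives, which is what makes the "resource not in archive, who else has it?" scenario possible.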


Web Archiving Infrastructure


Non-print Legal Deposit: Time of change

> Expected to be in place in April 2013

> Access restricted to premises of Legal Deposit Libraries

> Library-wide Legal Deposit Programme to develop capability and end-to-end process

> Web Archiving Team acts as “technical supplier” for a number of projects

> Still need to work out how current (permission-based) selective archiving relates to domain crawl under Legal Deposit

> Will we request permissions for online access?

> Will we stop crawling some of the sites we are crawling now and include them in the annual / bi-annual broad domain crawl?

> Who does what?

Web Archiving Strategy

> Domain harvesting (domain crawl): broad sweep of .uk domain; once or twice a year

> Events & key sites: events of national interest; sites that need to be captured frequently

> Special Collection: focused, thematic collections; support priority subjects

Web Archiving Workshop

Leïla Medjkoune, Internet Memory

IIPC workshop, BNF, Paris, November 2012

Internet Memory

Internet Memory Foundation (European Archive)

• Established in 2004 in Amsterdam and then Paris

• Mission: Preserve Web content by building a shared WA platform

• Actions: Dissemination, R&D and partnerships with research groups and cultural institutions

• Open Access Collections: UK National Archives & Parliament, PRONI, CERN and The National Library of Ireland

Internet Memory Research

• Spin-off of IM established in June 2011 in Paris

• Missions: Operate large scale or selective crawls & develop new technologies (crawl, access, processing and extraction)

Internet Memory Infrastructure

Green datacenters

Repository and data access for large-scale data management:

• HDFS (Hadoop File System): distributed, fault-tolerant file system

• HBase: a distributed key-value index

• Convenient model for temporal archives

• MapReduce: a distributed execution framework

• Reliable mechanism to run an analysis job on very large datasets
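The MapReduce model mentioned above can be illustrated on a single machine: a map step emits key-value pairs per archived record, the pairs are grouped by key, and a reduce step folds each group. This is a toy sketch in plain Python, not Hadoop code, and the record fields and values are invented for the example.

```python
from collections import defaultdict
from itertools import chain


def map_record(record):
    """Map step: emit one (content type, 1) pair per archived record."""
    yield record["content_type"], 1


def reduce_counts(key, values):
    """Reduce step: sum the values emitted for one key."""
    return key, sum(values)


def run_job(records):
    """Run map, group by key (the 'shuffle'), then reduce, all in memory."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(map_record(r) for r in records):
        groups[key].append(value)
    return dict(reduce_counts(key, values) for key, values in groups.items())


# Invented records; a temporal archive would key each capture by URL and timestamp.
records = [
    {"url": "http://example.org/", "timestamp": "20121128000000", "content_type": "text/html"},
    {"url": "http://example.org/", "timestamp": "20121129000000", "content_type": "text/html"},
    {"url": "http://example.org/logo.png", "timestamp": "20121128000000", "content_type": "image/png"},
]
counts = run_job(records)
```

Because each map call looks at one record and each reduce call at one key, the same job can be spread over many machines, which is what makes the approach suitable for very large datasets.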

 

Internet Memory

Focused crawling:

• Automated crawls

• Quality focused crawls:

– Video capture, Twitter crawls

– Execution tools to overcome crawling issues on specific content

Large scale crawling

• In-house developed distributed software

• Scalable crawler (10-50 Bn pages)

• Also designed for focused crawl and complex scoping

Research projects and focus

Web Archiving and Preservation

✓ Living Web Archives (2007-2010)

✓ Archives to Community MEMories (2010-2013)

✓ SCAlable Preservation Environment (2010-2013)

Webscale data Archiving and Extraction

✓ Living Knowledge (2009-2012)

✓ Longitudinal Analytics of Web Archive data (2010-2013)

✓ TrendMiner (2011-2014)

✓ DOPA (2012-2014)

✓ AnnoMarket (2012-2014)

Web Archiving project?

Organisational challenges:

• Selection/QA: Librarian / Archivist, Quality assurance team, Project manager

• Content capture/services development: Engineers, developers, technicians

• Infrastructure deployment and maintenance: Engineers, System administrators

➥ Web Archiving projects require strong competences and experienced human resources combined with a scalable infrastructure

IM Shared platform

Since its creation in 2004, the Internet Memory Foundation has worked in close collaboration with partner institutions and research groups through European projects:

• To develop methods and tools improving web archiving quality

• To grow its expertise and technological taskforce

Archivethe.Net (1)

• To pool knowledge and skills between institutions

• To share internal developments with partner institutions

• To cut services and R&D costs

Archivethe.Net (2)

• Archivethe.net is a shared web archiving platform associated with a service.

• The platform combines new technology and user needs to ensure good service quality in terms of reliability and efficiency

• For whom? Our current partners, our new partners and … ourselves

Benefits?

• Integrated web archiving process: from selection to access

• Ongoing technological developments through specific or common R&D projects

• Dedicated and highly skilled team to follow partners’ projects

• Dedicated infrastructure


How does it work? (1)

• ATN is designed as SaaS (Software as a Service)

• The platform offers a user-friendly interface to record partners’ web archiving orders

• A pipeline organizes and manages the production

• A QA team ensures that the quality of the archive meets partners’ requirements


How does it work? (2)

     

Demo  


ARCOMEM Archivist tool?

Set and follow web archive campaigns

• V1: A crawler cockpit and a search and retrieval application

Intelligent content acquisition:

• Seed URLs

• Keywords

• Social web site APIs

• Social Media Categories (SMC)


SARA

Search and retrieval interface:

• Advanced search functionalities

• Filtering via faceting

• Sorting by content type, social media platform, text/image contextual information (event, entity, ...), etc.

 

 


Crawler Cockpit Interface

• Create/select a campaign

• Describe campaign (title, description, comments, etc.)

• Define scope: select criteria such as language, keyword, URL, organisation, etc.

• Select social media categories and APIs to explore

• Set precedence rules for some content types or sources (images, videos, tweets, news, etc.)
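The campaign settings listed above can be pictured as a small data structure. The Python sketch below is illustrative only: the `Campaign` class, its field names, and the `in_scope` and `priority` helpers are hypothetical and do not describe the actual data model of the ARCOMEM Crawler Cockpit.

```python
from dataclasses import dataclass, field


@dataclass
class Campaign:
    """Hypothetical campaign record; every field name is an assumption."""
    title: str
    description: str = ""
    seed_urls: list = field(default_factory=list)
    keywords: list = field(default_factory=list)
    languages: list = field(default_factory=list)
    # Content types in order of precedence: earlier entries come first.
    precedence: list = field(default_factory=list)

    def in_scope(self, url: str, text: str, language: str) -> bool:
        """Return True if a page matches the campaign scope criteria."""
        if self.languages and language not in self.languages:
            return False
        if url in self.seed_urls:
            return True
        lowered = text.lower()
        return any(keyword.lower() in lowered for keyword in self.keywords)

    def priority(self, content_type: str) -> int:
        """Lower number means higher precedence; unknown types go last."""
        if content_type in self.precedence:
            return self.precedence.index(content_type)
        return len(self.precedence)


campaign = Campaign(
    title="Example campaign",
    seed_urls=["http://example.org/"],
    keywords=["election"],
    languages=["en"],
    precedence=["tweets", "images", "videos"],
)

# Order a queue of content types according to the precedence rules.
queue = sorted(["videos", "news", "tweets"], key=campaign.priority)
```

Here the precedence list doubles as a sort key, so content types the archivist ranked first are handled first and unranked types fall to the end of the queue.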


Crawler cockpit interface

     

Demo    


ARCOMEM Archivist Tool V2

• Refinement mode: refine crawl parameters to improve crawls

• Improve access application (SARA): preview function so that users can review the results of the campaign setup


QA for Web Archives?

IM QA is based on:

• Tools internally developed

• Tools developed in the context of European projects

• Automated processes

• Knowledge and skills of our crawl engineer and QA teams

 


QA Methodology and tools?

Methodology

• Based upon crawler behaviour

• Based on institutions’ needs and policies

• Can be manual (visual) or “automated”

• Can be done at pre- or post-crawl time

Tools

• Open source tools such as plugins, proxies, etc.

• Internally developed tools (fetchers, automated checks, etc.)

• Bug trackers to record information and communicate with partner institutions


QA Methodology and tools?

SCAPE: Scalable Preservation Environments

• Automate visual QA to detect rendering issues:

• Improve archive quality and cut QA costs

• Feed “preservation watch and planning” tools

• First test made on over 400 pairs of URLs

• In-house “Execution platform” under deployment

• Results and processes to be disseminated to IIPC members for feedback!
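To make the idea of comparing pairs of URLs concrete, here is a minimal Python sketch of one way a rendering difference could be flagged automatically. It is not the SCAPE tool's method: it assumes both renderings (for example the live page and its archived copy) have already been captured and decoded into equal-sized grids of grayscale values, and the function names, `tolerance`, and `threshold` values are hypothetical.

```python
def pixel_difference_ratio(live, archived, tolerance=10):
    """Fraction of pixels whose grayscale values differ by more than tolerance.

    Both arguments are equal-sized lists of rows of 0-255 integers.
    """
    if len(live) != len(archived) or any(
        len(a) != len(b) for a, b in zip(live, archived)
    ):
        raise ValueError("screenshots must have the same dimensions")
    total = sum(len(row) for row in live)
    if total == 0:
        raise ValueError("screenshots must not be empty")
    differing = sum(
        1
        for row_live, row_arch in zip(live, archived)
        for p, q in zip(row_live, row_arch)
        if abs(p - q) > tolerance
    )
    return differing / total


def needs_manual_review(live, archived, threshold=0.05):
    """Flag a pair when too many pixels differ (threshold is an assumption)."""
    return pixel_difference_ratio(live, archived) > threshold


# Tiny 2x4 "screenshots": the second row of the archived copy renders differently.
live_page = [[255, 255, 0, 0], [255, 255, 0, 0]]
archived_page = [[255, 255, 0, 0], [255, 250, 200, 200]]
ratio = pixel_difference_ratio(live_page, archived_page)
```

A check of this kind only narrows down which pairs a QA operator should look at; a raw pixel comparison is sensitive to harmless changes such as rotating banners, so a production tool would need a more robust similarity measure.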

 


Technical challenges

Capture

• Dynamically generated content, deep web, etc.

• Non-HTTP protocols (e.g. RTMP)

• Social media platforms, ...

Access

• Replicate live functionalities and look & feel

• Provide access to very large files

➥ Fast evolving technologies

➥ Ephemeral content

➥ Multiplication of production means

➥ Increase of user generated content

                                 


Technical Solutions

• Execution based crawling (vs parsing)

• API crawling

• Application aware crawling

• Bespoke fetchers

➥ Orchestration of tools

 ARCOMEM content acquisition
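The contrast between parsing and execution based crawling can be illustrated with a short Python sketch. It uses only the standard library `html.parser` module; the sample page and the `LinkExtractor` class are made up for illustration and are not part of any ARCOMEM or Internet Memory tool.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags found in static markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical page: one static link, one link created by a script at run time.
page = """
<html><body>
  <a href="/about.html">About</a>
  <script>
    var a = document.createElement("a");
    a.href = "/gallery/" + 42;
    document.body.appendChild(a);
  </script>
</body></html>
"""

extractor = LinkExtractor()
extractor.feed(page)
found_links = extractor.links  # only the static link is discovered
```

The link built by the script exists only after the script has run, so a parser never sees it. An execution based crawler, which loads the page in a browser engine and inspects the resulting document, is in a position to discover such links.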


Technical Solutions

Access tool:

• Player replacement: reproduce player functionalities

• Adapt access solution to type of content/platforms (generic solutions)

Storage infrastructure / format:

• Enable access to large files

• Fast access to large amounts of content to facilitate search & retrieval


Use cases

• Social media capture and access:

  • YouTube

  • Twitter

  • Flickr, etc.

• Web Archiving related services:

  • Redirection service

  • Memento

  • Legal issues with captured content

  • Full text search

  • etc.