Integrating web archiving in preservation workflows. Louise Fauduet, Clément Oury and Sébastien Peyrard


28th November 2012, Session 6 - Integrating web archiving in preservation workflows

Objectives of the session

> Present the current issues and solutions of web archives preservation

> Present the main characteristics of a preservation repository, taking “SPAR” as an example

> Explore benefits and issues of the integration of web archives in a shared digital repository


Preserving web archives

> First and foremost: bit-level preservation

> Logical preservation = preservation of our ability to read and understand the series of “0” and “1”

> Three kinds of logical preservation

> Preservation of the container format

> Preservation of the contained files

> Preservation of all information necessary to understand web archives

Web archive preservation: characteristics and issues

> Web archives preservation: worst case possible?

> Huge amount of data

> Very little information on contained file formats

> Heterogeneity of web archive collections within the same institution


Preserving container formats: the ARC format

> ARC format

> Container format

> Since 1996: http://archive.org/web/researcher/ArcFileFormat.php

> Groups together data and metadata

> Size arbitrarily limited to 100 MB


W/ARC file design

> A W/ARC file contains W/ARC records

> Each W/ARC record has a header (e.g. record ID, capture date, record type…)

> and a block (e.g. HTTP response, JPEG file…)


The origins: the ARC format

filedesc://IA-001102.arc 0 19960923142103 text/plain 76

1 0 AlexaInternet

URL IP-address Archive-date Content-type Archive-length

http://www.dryswamp.edu:80/index.html 127.10.100.2 19961104142103 text/html 202

HTTP/1.0 200 Document follows

Date: Mon, 04 Nov 1996 14:21:06 GMT

Server: NCSA/1.4.1

Content-type: text/html

Last-modified: Sat,10 Aug 1996 22:33:11 GMT

Content-length: 30

<HTML>

Hello World!!!

</HTML>

[Annotations on the example: the version block header holds the filedesc line, the version-1-block and the URL-record-definition; the ARC record that follows holds the URL-record line and the network_doc (protocol response and object).]
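The URL-record line of the ARC example follows the field list declared in the version block (URL, IP-address, Archive-date, Content-type, Archive-length). As a minimal sketch of how such a line could be read, the Python below splits it into typed fields; the `ArcUrlRecord` type and the `parse_url_record` function are hypothetical names introduced for illustration, not part of any ARC tool.

```python
from datetime import datetime
from typing import NamedTuple


class ArcUrlRecord(NamedTuple):
    """Fields of an ARC version-1 URL-record line (hypothetical helper type)."""
    url: str
    ip_address: str
    archive_date: datetime
    content_type: str
    archive_length: int


def parse_url_record(line: str) -> ArcUrlRecord:
    # The five fields are space-separated, in the order announced by the
    # URL-record-definition line of the version block.
    url, ip, date, content_type, length = line.strip().split(" ")
    return ArcUrlRecord(
        url=url,
        ip_address=ip,
        archive_date=datetime.strptime(date, "%Y%m%d%H%M%S"),
        content_type=content_type,
        archive_length=int(length),
    )


record = parse_url_record(
    "http://www.dryswamp.edu:80/index.html 127.10.100.2 19961104142103 text/html 202"
)
print(record.url, record.archive_date.isoformat(), record.archive_length)
```

The archive length (202 here) tells a reader how many bytes of network_doc follow the header line, which is what makes sequential reading of the container possible.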


Why standardize ARC?

> The ARC format had technical shortcomings

> Only two record types: hard to distinguish data and metadata

> There was no way to uniquely identify a record

> The ARC specifications had formal shortcomings

> It was not perfectly clear and was open to interpretation (=> it was difficult to define what a valid ARC file was when BnF designed a validation tool)

> ARC was an Internet Archive de facto standard, not an internationally recognized standard


Why standardize ARC?

> Necessity to have a standard format

> Promote the use of a single format by all web archiving institutions (not only IIPC members)

> Ensure that the format will not be subject to uncontrolled changes

> Allow a clear validation of WARC files

> Foster the development of tools

> Ensure confidence in the long term preservation of web archives

> Integrate web archive formats in the set of international bibliographic standards


WARC standardization history

> WARC = Web ARChive file format

> Created by IIPC

> WARC is the next generation of the ARC file format

> ARC format created by the Internet Archive

> Most legacy web archives are in ARC

> Original discussion: Sept 2004

> First Internet Draft: May 2005

> First ISO Working Draft: Feb 2006

> Final ISO Draft: June 2008

> Final Publication: May 2009


WARC improvements

> A unique identifier for each record

> Eight record types

> warcinfo

> response

> resource

> request

> metadata

> revisit

> conversion

> continuation

> Better ways to describe and document the harvesting process (and the deduplication process)

> Possibility to manage format migrations


Example of Warcinfo record

WARC/1.0

WARC-Type: warcinfo

WARC-Date: 2006-09-19T17:20:14Z

WARC-Record-ID: <urn:uuid:d7ae5c10-e6b3-4d27-967d-34780c58ba39>

Content-Type: application/warc-fields

Content-Length: 381

software: Heritrix 1.12.0 http://crawler.archive.org

hostname: crawling017.archive.org

ip: 207.241.227.234

isPartOf: testcrawl-20050708

description: testcrawl with WARC output

operator: IA_Admin

http-header-user-agent: Mozilla/5.0 (compatible; heritrix/1.4.0 +http://crawler.archive.org)

format: WARC file version 0.17

conformsTo: http://www.archive.org/documents/WarcFileFormat-0.17.html
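A WARC record header is a version line followed by named fields, and it ends at the first empty line; the block (here the warc-fields) comes after. The Python below is a minimal sketch of reading such a header, using an abridged copy of the warcinfo example; `parse_warc_header` is a hypothetical helper, not the API of an existing WARC library.

```python
from typing import Dict, Tuple


def parse_warc_header(text: str) -> Tuple[str, Dict[str, str]]:
    """Return the version line and the named fields of a WARC record header."""
    lines = text.splitlines()
    version = lines[0].strip()
    fields: Dict[str, str] = {}
    for line in lines[1:]:
        if not line.strip():
            break  # an empty line separates the header from the block
        # Split on the first colon only: values such as <urn:uuid:...> contain colons.
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return version, fields


SAMPLE = (
    "WARC/1.0\r\n"
    "WARC-Type: warcinfo\r\n"
    "WARC-Date: 2006-09-19T17:20:14Z\r\n"
    "WARC-Record-ID: <urn:uuid:d7ae5c10-e6b3-4d27-967d-34780c58ba39>\r\n"
    "Content-Type: application/warc-fields\r\n"
    "Content-Length: 381\r\n"
    "\r\n"
    "software: Heritrix 1.12.0 http://crawler.archive.org\r\n"
)

version, fields = parse_warc_header(SAMPLE)
print(version, fields["WARC-Type"], fields["WARC-Record-ID"])
```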


Example of request record

WARC/1.0

WARC-Type: request

WARC-Target-URI: http://www.archive.org/images/logoc.jpg

WARC-Date: 2006-09-19T17:20:24Z

Content-Length: 236

WARC-Record-ID: <urn:uuid:4885803b-eebd-4b27-a090-144450c11594>

Content-Type: application/http;msgtype=request

WARC-Concurrent-To: <urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>

GET /images/logoc.jpg HTTP/1.0

User-Agent: Mozilla/5.0 (compatible; heritrix/1.10.0)

From: [email protected]

Connection: close

Referer: http://www.archive.org/

Host: www.archive.org

Cookie: PHPSESSID=009d7bb11022f80605aa87e18224d824


Example of response record

WARC/1.0

WARC-Type: response


WARC-Date: 2006-09-19T17:20:24Z

WARC-Block-Digest: sha1:UZY6ND6CCHXETFVJD2MSS7ZENMWF7KQ2

WARC-Payload-Digest: sha1:CCHXETFVJD2MUZY6ND6SS7ZENMWF7KQ2

WARC-IP-Address: 207.241.233.58

WARC-Record-ID: <urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>

Content-Type: application/http;msgtype=response

WARC-Identified-Payload-Type: image/jpeg

Content-Length: 1902

HTTP/1.1 200 OK

Date: Tue, 19 Sep 2006 17:18:40 GMT

Server: Apache/2.0.54 (Ubuntu)

Last-Modified: Mon, 16 Jun 2003 22:28:51 GMT

ETag: "3e45-67e-2ed02ec0"

Accept-Ranges: bytes


Connection: close

Content-Type: image/jpeg

[image/jpeg binary data here]
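The WARC-Block-Digest and WARC-Payload-Digest fields carry a labelled digest (algorithm, colon, value). The sketch below assumes the convention visible in the example, a SHA-1 digest written in Base32, and shows how such a value could be recomputed to check a record at bit level; `warc_sha1_digest` and `digest_matches` are hypothetical helper names.

```python
import base64
import hashlib


def warc_sha1_digest(data: bytes) -> str:
    """Labelled SHA-1 digest, Base32-encoded (assumed convention)."""
    raw = hashlib.sha1(data).digest()  # 20 bytes, so 32 Base32 characters, no padding
    return "sha1:" + base64.b32encode(raw).decode("ascii")


def digest_matches(data: bytes, declared: str) -> bool:
    """Compare recomputed and declared digests, e.g. during an integrity audit."""
    return warc_sha1_digest(data) == declared


print(warc_sha1_digest(b""))
```

The same function applied to the payload of two captures gives a cheap way to detect unchanged content, which is what the deduplication mentioned for the revisit record type relies on.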


Example of resource record

WARC/1.0

WARC-Type: resource

WARC-Target-URI: file://var/www/htdoc/images/logoc.jpg

WARC-Date: 2006-09-19T17:20:24Z

WARC-Record-ID: <urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>



WARC-Block-Digest: sha1:CCHXETFVJD2MUZY6ND6SS7ZENMWF7KQ2


[image/jpeg binary data here]


Example of metadata record

WARC/1.0

WARC-Type: metadata


WARC-Date: 2006-09-19T17:20:24Z

WARC-Record-ID: <urn:uuid:16da6da0-bcdc-49c3-927e-57494593b943>

WARC-Refers-To: <urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>

Content-Type: application/warc-fields

WARC-Block-Digest: sha1:UZY6ND6CCHXETFVJD2MSS7ZENMWF7KQ2

Content-Length: 59

via: http://www.archive.org/

hopsFromSeed: E

fetchTimeMs: 565


Example of revisit record

WARC/1.0

WARC-Type: revisit


WARC-Date: 2007-03-06T00:43:35Z

WARC-Profile: http://netpreserve.org/warc/0.17/server-not-modified

WARC-Record-ID: <urn:uuid:16da6da0-bcdc-49c3-927e-57494593bbbb>

WARC-Refers-To: <urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>

Content-Type: message/http

Content-Length: 226

HTTP/1.x 304 Not Modified

Date: Tue, 06 Mar 2007 00:43:35 GMT

Server: Apache/2.0.54 (Ubuntu) PHP/5.0.5-2ubuntu1.4

Connection: Keep-Alive

Keep-Alive: timeout=15, max=100

Etag: "3e45-67e-2ed02ec0"


Example of conversion record

WARC/1.0

WARC-Type: conversion


WARC-Date: 2016-09-19T19:00:40Z

WARC-Record-ID: <urn:uuid:16da6da0-bcdc-49c3-927e-57494593dddd>

WARC-Refers-To: <urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>

WARC-Block-Digest: sha1:XQMRY75YY42ZWC6JAT6KNXKD37F7MOEK

Content-Type: image/neoimg

Content-Length: 934

[image/neoimg binary data here]


Example of continuation record 1/2

WARC/1.0

WARC-Type: response


WARC-Date: 2006-09-19T17:20:24Z

WARC-Block-Digest: sha1:2ASS7ZUZY6ND6CCHXETFVJDENAWF7KQ2


WARC-IP-Address: 207.241.233.58

WARC-Record-ID: <urn:uuid:39509228-ae2f-11b2-763a-aa4c6ec90bb0>

WARC-Segment-Number: 1

Content-Type: application/http;msgtype=response


HTTP/1.1 200 OK

Date: Tue, 19 Sep 2006 17:18:40 GMT

Server: Apache/2.0.54 (Ubuntu)

Last-Modified: Mon, 16 Jun 2003 22:28:51 GMT

ETag: "3e45-67e-2ed02ec0"

Accept-Ranges: bytes


Connection: close


[first 1360 bytes of image/jpeg binary data here]


Example of continuation record 2/2

WARC/1.0

WARC-Type: continuation


WARC-Date: 2006-09-19T17:20:24Z

WARC-Block-Digest: sha1:T7HXETFVA92MSS7ZENMFZY6ND6WF7KB7

WARC-Record-ID: <urn:uuid:70653950-a77f-b212-e434-7a7c6ec909ef>

WARC-Segment-Origin-ID: <urn:uuid:39509228-ae2f-11b2-763a-aa4c6ec90bb0>

WARC-Segment-Number: 2

WARC-Segment-Total-Length: 1902

WARC-Identified-Payload-Type: image/jpeg

Content-Length: 302
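The two continuation examples show a record split into numbered segments, the last one announcing the total length. A minimal Python sketch of how a reader might put the segments back together is given below. The `Segment` type and `reassemble` function are hypothetical, and the toy block sizes are an assumption chosen so that they add up to the 1902 bytes announced in the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    """One segment of a split record (hypothetical helper type)."""
    number: int                         # WARC-Segment-Number
    block: bytes                        # the record block
    total_length: Optional[int] = None  # WARC-Segment-Total-Length, last segment only


def reassemble(segments: List[Segment]) -> bytes:
    ordered = sorted(segments, key=lambda s: s.number)
    # Segment numbers must run 1, 2, 3... without gaps or duplicates.
    if not ordered or [s.number for s in ordered] != list(range(1, len(ordered) + 1)):
        raise ValueError("missing or duplicated segment")
    data = b"".join(s.block for s in ordered)
    expected = ordered[-1].total_length
    if expected is not None and expected != len(data):
        raise ValueError(f"expected {expected} bytes, got {len(data)}")
    return data


# Toy data only: real blocks would hold the HTTP response and the image bytes.
first = Segment(number=1, block=b"a" * 1600)
last = Segment(number=2, block=b"b" * 302, total_length=1902)
payload = reassemble([last, first])
print(len(payload))
```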


Integrating web archiving in preservation workflows: the objectives

> Share development and storage costs with other entities of the library

> Benefit from technology watch performed for other kinds of digital documents

> Obtain a global overview of all kinds of the institution’s digital assets

> And manage all of them in a consistent manner


Integrating web archiving in preservation workflows: the issues

> Digital collections are highly heterogeneous

> Legal status / preservation status

> Access constraints

> Volume of resources

> Technical characteristics (formats)

> Data and metadata models

> Heterogeneity may occur even for similar kinds of documents

> Digitization or harvesting procedures have evolved over time

> Competition may occur between different kinds of digital collections

> For development priorities, ingest priorities, storage or preservation quality

A shared digital repository: the SPAR example

Louise

Digital everywhere

Functional issue: a digital preservation copy

Digitization as a means for preservation and dissemination

Technical issue: volume

> Then: black & white, 300 dpi, TIFF G4: 1 page ~ 200 KB

> Now: colour (24 bits), 400 dpi, uncompressed TIFF: 1 page ~ 80 MB

> More than x500 !!!

Business issue: born digital

More and more documents only in digital form

Digital preservation is not alone

[Diagram: SPAR - Infrastructure and SPAR - Realization. SPAR modules: Ingest, Data management, Storage, Access, Administration, Preservation planning, on top of the Storage Abstraction Service (SAS). Around SPAR: production applications and dissemination applications, including preservation digitization, wayback, web archiving, audiovisual…]


How to manage diversity?

> Build a central repository to reduce the diversity (media, formats, departments …)

> Relying on best practices and standards

> Key requirements:

> OAIS compliance (ISO 14721:2012)

> modularity and distributivity

> abstraction

> use of well known formats and standards

> use of open-source technical building blocks


A generic model

http://public.ccsds.org/publications/archive/650x0m2.pdf


A generic repository solution

[Diagram: several Pre-Ingest modules (preservation digitization, web archives, and so on) in front of SPAR. SPAR modules: Ingest, Storage, Data management, Access, Administration, Preservation planning, on top of the Storage abstraction service and the Infrastructure. Other labels on the diagram: SIP, DIP, mets, rdf.]


Preservation workflow

> Granularity at the archival package level, to allow for parallelization

> Each module is independent and interacts through defined interfaces

> Every package follows the same basic workflow, with specific features when needed


Information about a package


And with a little patience…

[Timeline, 2004 to 2013. Labels on the diagram: Study, WG, RFP, Infrastructure, Core part, Other channels, TPS, Admin, AV, WLD, Working Groups, Operations (May 2010), Renewal, New tender.]


Requirements for the infrastructure

> Openness of the solution

> Openness to multiple environments

> Compatible with multiple providers

> Availability

> Reliability of all parts

> Usage of the system 24h./day, 7d./week, 365d./year

> Maximum down time: 2h per month

> Upgradeability

> Strong potential for growth in volume and power

> Coverage of needs for the next 8 years
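The availability requirement (a system usable around the clock, with at most 2 hours of downtime per month) can be turned into a percentage. The short Python sketch below does the arithmetic, assuming a 30-day month; the constant and function names are illustrative only.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month, i.e. 720 hours


def availability(max_downtime_hours: float, period_hours: float = HOURS_PER_MONTH) -> float:
    """Fraction of the period during which the system must be up."""
    return 1 - max_downtime_hours / period_hours


print(f"{availability(2):.3%}")
```

Under that assumption, 2 hours of downtime per month corresponds to an availability of roughly 99.7%.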

Infrastructure

[Diagram: a main site (servers, online storage, lookup storage, primary storage, secondary storage) and a backup site (backup servers, backup lookup storage, backup storage, backup secondary storage).]

Primary and backup storage

Oracle StorageTek SL8500

• up to 64 tape drives

• up to 8500 tapes

• up to 8 hand pickers

• up to 32 linked libraries

> Primary storage: 2 libraries, 16 PB maximum

> Backup storage: 2 libraries, 16 PB maximum

Tape library in-situ

Media

> Primary storage: LTO5, capacity 1.5 TB, transfer rate 140 MB/s (previously: 9840C, 40 GB)

> Backup storage: T10000B, capacity 1 TB, transfer rate 120 MB/s (previously: T10000A, 500 GB)


How to manage diversity, really?

> To deal with the variability and heterogeneity of the data: tracks

> built on the relation between the digital objects and the archival system, independently of any given organization

> Preservation digitization

> Audiovisual material

> Negotiated legal deposit (dark Web, regional press)

> Automatic legal deposit (web harvests)

> Administrative production

> Deposit / Third party archiving

> Acquisition / Donation

> Then channels based on technical criteria

> And then: DECISIONS!


Decisions, decisions : Service Level Agreements

> The Service Level Agreements are contracts between the users and SPAR, to offer a more transparent system

> The system is no longer a black box only known by technical experts

Workflows driven by the SLA

[Diagram: the SPAR modules (Pre-Ingest, Ingest, Storage, Data management, Access, Preservation, Administration, Storage abstraction service) and the SIP, AIP and DIP packages, annotated with the questions the SLA answers:]

> Which formats are allowed?

> How many copies are needed, on what kind of media?

> What is the maximum size of a package?

> Do we need to log each access?


SLA: a reference package

> 3 SLA: Ingest, Preservation, Access

> Formalize in XML the ways of managing the packages

> Those 3 SLA are recorded in a reference package that describes the channel

SLA-I.xml, SLA-P.xml, SLA-A.xml

Mets.xml

Contract.pdf


Decisions, decisions: metadata


Decisions, decisions: levels of granularity

[Diagram, using the newspaper Le Matin as an example:]

> set: Le Matin

> group: Year 1882, Year 1883

> object: 01/07/1882, 02/07/1882, 03/07/1882 … 28/02/1883, 01/03/1883

> file


Decisions, decisions: formats (1)

> Define the scope

> For instance: accept files of large size posters directly from the printers as a substitute for the legal deposit of paper posters

> Negotiation between:

> Producers: what the printers can output

> Librarians: what can be displayed/handled easily

> Preservation experts: what is acceptable given significant properties, communities, standards…

Decisions, decisions: formats (2)

[Table: TIFF, PDF/X and QuarkXPress rated with stars on five criteria: Producer, True to the original, Characterization tools, Standards, Previous expertise.]

For this purpose, PDF/X was chosen as a good compromise between faithfulness to the original, wide usage and standardization

Requirements on formats

Kind: Description

> Stored: no technical information; bit stream preservation

> Identified: identified format => associated mime type; no preservation strategy planned by the institution

> Known: format identified, documented, with tools => associated schema of technical description; preservation strategy planned by the institution

> Managed: identified format; documentation and tools owned by the institution; profile of use defined in the institution

Format definition: a reference package

Mets.xml: manifest

T000001.tiff: sample

format.xml: machine readable description

format.txt: human description


Characterization?

> Identification. The process of determining the presumptive format of a digital object on the basis of suggestive extrinsic hints and intrinsic signatures, both internal (e.g. magic number) and external (e.g. file extension).

> Validation. The process of determining the level of conformance to the normative syntactic and semantic rules defined by the authoritative specification of the object's format.

> Feature extraction. The process of reporting the intrinsic properties of a digital object significant for purposes of classification, analysis, and use.

> Assessment. The process of determining the level of acceptability of a digital object for a specific purpose on the basis of locally-defined policy rules.

http://www.fao.org/oek/jhove2/digital-preservation-and-jhove2-home/jhove2-tutorial/en/


Tools for identification

> At BnF = getting a MimeType

> MagicMimeTypeIdentifier: from the Aperture project

> Could use : magic number or extension

> Very narrow database

> Libmagic: from the file(1) utility

> Very old library, extensive database, still actively maintained
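Identification as described here combines an internal signature (magic number) with an external hint (file extension). The Python sketch below illustrates that two-step logic with the standard `mimetypes` module and a deliberately tiny signature table; the table and the `identify` function are assumptions made for the example and do not reproduce the Aperture or libmagic databases.

```python
import mimetypes
from typing import Optional

# Deliberately narrow, illustrative signature table (leading bytes -> MIME type).
MAGIC_NUMBERS = [
    (b"\x1f\x8b", "application/gzip"),
    (b"%PDF-", "application/pdf"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"II*\x00", "image/tiff"),
    (b"MM\x00*", "image/tiff"),
    (b"filedesc://", "application/x-ia-arc"),
]


def identify(name: str, head: bytes) -> Optional[str]:
    """Return a presumptive MIME type: magic number first, extension as fallback."""
    for signature, mime in MAGIC_NUMBERS:
        if head.startswith(signature):
            return mime
    guessed, _encoding = mimetypes.guess_type(name)
    return guessed


print(identify("logoc.jpg", b"\xff\xd8\xff\xe0"))
print(identify("index.html", b"<HTML>"))
```

Trying the signature before the extension reflects the fact that file names harvested from the web are often misleading, while the first bytes of a file are harder to get wrong.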


Tools for characterization

> At BnF = getting properties, using the proper tool based on MimeType

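MimeType: tool and output schema

> text/*: output schema textMD

> image/*: output schema MIX

> audio/*: output schema MPEG-7

> video/*: output schema MPEG-7

> application/x-ia-arc: tool Jhove2, output schema containerMD

> application/gzip: tool Jhove2, output schema containerMD

containerMD: http://bibnum.bnf.fr/containerMD-v1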
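Choosing the characterization output from the MimeType amounts to a small dispatch table. The Python below is a minimal sketch of that lookup, built from the MimeType and schema pairs of this slide; the `OUTPUT_SCHEMAS` table and `output_schema` function are illustrative names, not SPAR code.

```python
from fnmatch import fnmatchcase
from typing import Optional

# MIME type patterns and the technical metadata schema expected for each.
OUTPUT_SCHEMAS = [
    ("text/*", "textMD"),
    ("image/*", "MIX"),
    ("audio/*", "MPEG-7"),
    ("video/*", "MPEG-7"),
    ("application/x-ia-arc", "containerMD"),
    ("application/gzip", "containerMD"),
]


def output_schema(mime_type: str) -> Optional[str]:
    """Return the output schema for a MIME type, or None when no tool is planned."""
    for pattern, schema in OUTPUT_SCHEMAS:
        if fnmatchcase(mime_type, pattern):
            return schema
    return None


for mime in ("image/tiff", "application/x-ia-arc", "application/pdf"):
    print(mime, "->", output_schema(mime))
```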

The ingest process

[Workflow diagram, activities ACT_01 to ACT_09. Labels on the diagram: Ingest request reception, SIP reception, Manifest validation, Package search within SPAR, Audit, SIP characteristics audit, SIP files audit and characterization, ARK identifier generation, SET processing, Ingest completion.]

ACT_06: File processing

> ACT_06.1.1 File identification => [ Identified file ]

> ACT_06.1.2 File characterization => [ Known file ]

> ACT_06.1.2 Significant properties extraction

> ACT_06.2.1 Managed format in SLA? => [ Accepted ], [ Rejected ] or [ Undefined ]

> ACT_06.2.2 Known format in SLA? => [ Accepted ], [ Rejected ] or [ Undefined ]

> ACT_06.2.3 Identified format in SLA? => [ Accepted ], [ Rejected ] or [ Undefined ]

> ACT_06.2.4 Stored format in SLA? => [ Accepted ] or [ Rejected ]
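One possible reading of the ACT_06.2 tests is a cascade: the SLA is consulted for the most demanding level first (managed), and an undefined answer sends the file to the next level down, until the stored level gives a final verdict. The Python sketch below illustrates that reading only. The `Level` enumeration, the SLA dictionary layout, the `"*"` wildcard and the default rejection are all assumptions, not the SPAR implementation.

```python
from enum import IntEnum
from typing import Dict, Optional, Tuple


class Level(IntEnum):
    STORED = 1
    IDENTIFIED = 2
    KNOWN = 3
    MANAGED = 4


# Hypothetical SLA: for each level, a verdict per MIME type ("*" = any format).
Sla = Dict[Level, Dict[str, str]]


def decide(mime_type: str, sla: Sla) -> Tuple[str, Optional[Level]]:
    """Walk the levels from MANAGED down to STORED; a missing rule means 'undefined'."""
    for level in sorted(Level, reverse=True):
        rules = sla.get(level, {})
        verdict = rules.get(mime_type, rules.get("*"))
        if verdict is not None:
            return verdict, level
    return "rejected", None  # assumption: nothing defined anywhere means rejection


channel_sla: Sla = {
    Level.MANAGED: {"image/tiff": "accepted"},
    Level.KNOWN: {"application/pdf": "accepted"},
    Level.IDENTIFIED: {"application/x-shockwave-flash": "rejected"},
    Level.STORED: {"*": "accepted"},
}

print(decide("image/tiff", channel_sla))
print(decide("application/octet-stream", channel_sla))
```

Keeping the rules in data rather than in code matches the idea that each channel records its own SLA, so the same cascade can serve very different collections.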


Decisions, decisions: building a package out of it


> Header

> DmdSec

> AmdSec

> TechMD

> DigiProvMD

> SourceMD

> RightsMD

> FileSec

> StructMap

> Structlink

> BehaviorSec

Decisions, decisions : tweaking METS

Structural metadata: METS

Descriptive and source metadata: qualified Dublin Core

Provenance metadata: PREMIS

Technical metadata: depends on the data objects (e.g. MPEG-7)

Integrating Web archives to the SPAR repository: preserving the French web archives

Background (1): one mandate, heterogeneous collections

> 1996-2005: 70 TB

> 2002 & 2004: 0.5 TB

> 2004-2008: 45 TB

> 2006-2010: 22 TB

> 2010-now: 150 TB

> 2006-08-01: French copyright law entitles BnF to collect the French Internet

[Other labels on the timeline: operator, robot, + Alexa bot]


Background (2): a generic repository solution at BnF

[Diagram: one Pre Ingest module per channel: digitized books, digitized audiovisual documents, web archiving.]

> Web archives have been in the scope of SPAR since the beginning, but there is still a need to align with the existing implementation

But first things first! What do we want to preserve?


The harvested files

> Here is what I crawled on the Web (HTML files…)

> And how I packaged them in a web archive container file (ARC data)


Intellectual information

> This is the 2012 French election collection

> This is the daily news collection


Provenance information

> This was harvested with the Heritrix tool v1.5.2, using NetarchiveSuite

> This was harvested using HTTrack


Provenance information, II

> One harvest: we harvested content suited for Mozilla Firefox; we respected the robots.txt; the job crashed

> Another harvest: we harvested content suited for Internet Explorer; we ignored the robots.txt; the job went well

[Each harvest is shown with its HTML files and its config]


Provenance information, III

> One harvest: this was captured on 8 May 2012; this captured 1098 websites; this produced 105 ARC files

> Another harvest: this was captured in 2006; this captured 145 websites; this produced 50 ARC files

[Each harvest is shown with its HTML files, its log and its report]

Consistent data before SPAR

Cleaning the stuff before preserving it


All data on a single target

[Timeline: 1996-2005 (70 TB), 2002 & 2004 (0.5 TB), 2004-2008 (45 TB), 2006-2010 (22 TB), 2010-now (150 TB). Other labels: unknown, + Alexa bot.]


Aligning collections before ingest: the NetarchiveSuite target workflow

[Diagram: for each harvest (harvest 1, harvest 2, harvest 3…), the harvested files (HTML…) are packaged in ARC data files, and the log, config and report are packaged in an ARC metadata file.]

> This is a collection containing French election websites

> Here are the files we harvested

> They are included in web-archive-specific files

> This was done with these tools


Aligning collections before ingest: the NetarchiveSuite target workflow

A three-layered model in SPAR

> Harvest Definition (curator collection)

> Harvest Instance (“technical” harvest = job)

> ARC file (data or metadata)


An « ARC metadata » sample

filedesc://32-metadata-1.arc 0.0.0.0 20100416092026 text/plain 77

1 0 InternetArchive

URL IP-address Archive-date Content-type Archive-length

metadata://netarkivet.dk/crawl/setup/harvestInfo.xml?heritrixVersion=1.14.3&harvestid=1&jobid=32 172.20.16.214 20100414095814 text/xml 366

<?xml version="1.0" encoding="UTF-8"?> <harvestInfo> <version>0.2</version> <jobId>32</jobId> <priority>LOWPRIORITY</priority> <harvestNum>0</harvestNum> <origHarvestDefinitionID>1</origHarvestDefinitionID> <maxBytesPerDomain>-1</maxBytesPerDomain> <maxObjectsPerDomain>1000</maxObjectsPerDomain> <orderXMLName>default</orderXMLName> </harvestInfo>

metadata://netarkivet.dk/crawl/setup/order.xml?heritrixVersion=1.14.3&harvestid=1&jobid=32 172.20.16.214 20100414095815 text/xml 44775

<?xml version="1.0" encoding="UTF-8"?> <crawl-order xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="heritrix_settings.xsd">

…
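The harvestInfo.xml record of the ARC metadata sample carries the link between a job and its harvest definition. As a minimal sketch, the Python below reads that XML with the standard library and returns its elements as a dictionary; `read_harvest_info` is a hypothetical helper name, and the XML is copied from the sample.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

HARVEST_INFO = b"""<?xml version="1.0" encoding="UTF-8"?>
<harvestInfo>
  <version>0.2</version>
  <jobId>32</jobId>
  <priority>LOWPRIORITY</priority>
  <harvestNum>0</harvestNum>
  <origHarvestDefinitionID>1</origHarvestDefinitionID>
  <maxBytesPerDomain>-1</maxBytesPerDomain>
  <maxObjectsPerDomain>1000</maxObjectsPerDomain>
  <orderXMLName>default</orderXMLName>
</harvestInfo>"""


def read_harvest_info(xml_bytes: bytes) -> Dict[str, Optional[str]]:
    """Map each child element of <harvestInfo> to its text content."""
    root = ET.fromstring(xml_bytes)
    return {child.tag: child.text for child in root}


info = read_harvest_info(HARVEST_INFO)
print("job", info["jobId"], "belongs to harvest definition", info["origHarvestDefinitionID"])
```

The job identifier and the harvest definition identifier are the two values needed to place an ARC file in a layered model of harvest definition, harvest instance and ARC file.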






> 1996-2005: two layers (Collection, ARC files)

> 2006-2010: four layers (Collection, Harvest division, Harvest instance, ARC files)

> 2010-now: three layers (Harvest Definition, Harvest instance, ARC files)

What SPAR needs


Decisions, decisions: levels of granularity

[Diagram, using the newspaper Le Matin as an example: a set (Le Matin), groups (Year 1882, Year 1883), objects (01/07/1882, 02/07/1882, 03/07/1882 … 28/02/1883, 01/03/1883) and files.]

Different layers

> set: contains nothing but metadata. Curator information, allows grouping the AIPs that share the same intellectual content

> AIP: must contain files to be preserved. Each AIP is an autonomous unit

METS and PREMIS to store all the preserved content

<mets>

> <dmdSec>: intellectual metadata

> <amdSec>: administrative metadata

> <sourceMD>: metadata about the source used to produce this content

> <techMD>: technical metadata (e.g. MPEG-7)

> <digiprovMD>: provenance metadata

> <fileSec>: list of the files

> <structMap>: structure of the package

> No management of files within files so far

> Neither tools nor XML schema for the technical characteristics of ARC files
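To make the METS layout more concrete, the Python sketch below builds an empty skeleton with the sections named on this slide, using the standard library and the METS namespace. It is an illustration only: the ID values are invented, and the result is not a schema-valid METS document, since the metadata sections are left empty.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)


def q(tag: str) -> str:
    """Qualify a tag name with the METS namespace."""
    return f"{{{METS_NS}}}{tag}"


def build_mets_skeleton() -> ET.Element:
    mets = ET.Element(q("mets"))
    ET.SubElement(mets, q("dmdSec"), ID="DMD.1")        # intellectual metadata
    amd = ET.SubElement(mets, q("amdSec"), ID="AMD.1")  # administrative metadata
    for section in ("techMD", "sourceMD", "digiprovMD"):
        ET.SubElement(amd, q(section), ID=f"AMD.1.{section}")
    ET.SubElement(mets, q("fileSec"))                   # list of the files
    ET.SubElement(mets, q("structMap"))                 # structure of the package
    return mets


skeleton = build_mets_skeleton()
print(ET.tostring(skeleton, encoding="unicode"))
```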


RDF database to query them

> The database is powerful but LIMITED

> Thus, we cannot express all the information for each harvested file

Mixing all of this is very hard…

« Prometheus, we start the mapping!!! »


PREMIS as a preservation koinè

> Object: harvestInstance, ARC files, report

> Event: harvest event (with outcome extensions)

> Agent: persons (admins), software, organizations

[Relations shown on the diagram: has harvest instance, is documented in, hosts report]


In other terms…

[Diagram, in successive steps: the harvested files (HTML…) of each harvest (harvest 1, harvest 2, harvest 3…) are held in ARC data files, and the log, config and report in an ARC metadata file; the ARC files are then shown as AIPs, gathered into groups and into a set.]

Preserving web archives

Challenge 2: analyze ARC files and content files


Tools to analyze the ARC files

Need to

> identify and validate ARC files

> characterize ARC files (extract information)

> handle GZIP compression

> do at least identification of content files

> for large scale collections

[Diagram: ARC and ARC.GZ containers holding HTML files and files of unknown format (?)]

Development of JHOVE2 ARC and GZIP modules


Managing the ARC structure

filedesc://IA-001102.arc 0 19960923142103 text/plain 76

1 0 AlexaInternet

URL IP-address Archive-date Content-type Archive-length

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

<arcmetadata>

<arc:software>Heritrix 1.14.2 http://crawler.archive.org</arc:software>

</arcmetadata>

http://www.dryswamp.edu:80/index.html 127.10.100.2 19961104142103 text/html 202

HTTP/1.0 200 Document follows

Date: Mon, 04 Nov 1996 14:21:06 GMT

Server: NCSA/1.4.1

Content-type: text/html

Last-modified: Sat,10 Aug 1996 22:33:11 GMT

Content-length: 30

<HTML>

Hello World!!!

</HTML>

[Annotations on the example: the first ARC record holds the version-block header (filedesc line and URL record definition) and a metadata object; the next record holds the URL record line, the protocol response and the data object.]


Problem with JHOVE2 outputs

> Too verbose: 100 MB of output for 100 MB of data (with only identification of content files)

> Need to aggregate and compact all this information

> Need to handle ARC format peculiarities

> Need for a container file-specific format: containerMD

http://bibnum.bnf.fr/containerMD-v1


Where containerMD fits

<mets>

> <dmdSec>: intellectual metadata

> <amdSec>: administrative metadata

> <sourceMD>: metadata about the source used to produce this content

> <techMD>: technical metadata (MPEG-7, containerMD)

> <digiprovMD>: provenance metadata

> <fileSec>: list of the files

> <structMap>: structure of the package


containerMD features

> containerMD root element

> container (ARC-specific extension: ARCContainer)

> entries (ARC-specific extension: ARCEntries)

> entriesInformation: aggregated information about the entries

> entry, entry, entry… (ARC-specific extension: ARCRecord, ARCRecord, ARCRecord…)


Aggregation example: format information

Verbose information (one element per entry):

> entry: format text/html, size 6026213

> entry: format application/pdf, size 602621

> entry: format image/tiff, size 60262132

> entry: format text/html, size 165165

> …

Aggregated information (factorizing and sum):

> entriesInformation count=400

> format text/html: count=300, globalSize=40645654

> format application/pdf: count=20, globalSize=265464

> etc.
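The aggregation shown here, one count and one global size per format instead of one element per entry, can be sketched in a few lines of Python. The four entries are the ones listed in the example; the `aggregate` function and the dictionary keys are illustrative and do not reproduce the containerMD schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def aggregate(entries: Iterable[Tuple[str, int]]) -> Dict[str, Dict[str, int]]:
    """Factorize entries by format, summing their number and their sizes."""
    summary: Dict[str, Dict[str, int]] = defaultdict(lambda: {"count": 0, "globalSize": 0})
    for fmt, size in entries:
        summary[fmt]["count"] += 1
        summary[fmt]["globalSize"] += size
    return dict(summary)


entries = [
    ("text/html", 6026213),
    ("application/pdf", 602621),
    ("image/tiff", 60262132),
    ("text/html", 165165),
]

for fmt, info in aggregate(entries).items():
    print(fmt, info)
```

The size of the aggregated description grows with the number of distinct formats rather than with the number of harvested files, which is what keeps the technical metadata of a large container compact.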


Now, here is some cool stuff we will be able to ask SPAR

> Give me all the jobs that crashed last year

> Give me a list of all the file formats per broad crawl…

> and the number of files per format

> and the global size before and after decompression

> remove the error pages

> order them by decreasing number of files

> Give me, for all the newspapers collection

> all the crawls

> order them by date

> the number of harvested files per crawl


Conclusion and next steps

> A pragmatic approach to handle large-scale and heterogeneous collections

> The huge mass of data is still an issue

> The benefits of a shared repository

> Cross-domain investigation on preservation strategies (AV material, office formats, e-books formats…)

> Different policies (SLAs) for different collections

> Improve file format information

> By testing different tools

> By improving format information databases

> International cooperation is key


At international level: the preservation working group

> Goals of the PWG

> Exchange information and best practices (WARC format, information packages, etc)

> Promote the development of tools, review specifications and perform tests (WARC tools, Jhove, etc)

> Promote the web archive needs within the digital preservation community and projects

> Working fields

> Objectives and concepts of preserving archived web resources

> Preservation metadata

> Preservation workflows and digital repository functions and requirements

> Preservation strategies (migration, emulation…)

> Web environment technical documentation

> Evaluation of digital preservation tools and gaps towards web archives

> Organizational issues (costs, sustainability, promotion, skills,…)

Web archiving at the British Library

Helen Hockx-Yu

Head of Web Archiving

Overview

> Part 1: Background, history and organisation

> Part 2: Web Archiving Tools (including demos)

> Part 3: Access

> Part 4: Non-print Legal Deposit and future strategy

29th November 2012, Session 7 - Web archiving at the British Library

BL Structure

> BL Board and Executive Team

> e-Strategy and Information Systems (eIS) > IT-based products and services

> Finance and Corporate Services (F&CS) > Money

> Human Resources > People

> Operations & Services (O&S) > Front line services

> Scholarship and Collections (S&C) > Content (Arts and humanities, Social Sciences, Science, Technology & Medicine)

> Strategic Marketing and Communications (SMC) > Brand and reputation


Web archiving timeline


Current web archiving strategy

> Selective archiving of websites in 4 areas, those that

> reflect the diversity of lives, interests and activities throughout the UK

> contain research value or are of research interest

> feature political, cultural, social and economic events of national interest

> demonstrate innovative use of the web

> Also prioritise websites at risk and web-only content

> Permission based

> Permission to archive, to provide online access and to preserve. Also ask for 3rd party rights clearance

> 30% success rate, 5% explicit refusal (mostly due to 3rd party rights)

> Online access through UK Web Archive

> Expect to crawl at domain level (from April 2013) for Non-print Legal Deposit


The current Web Archiving team


> Skills profile: IT; collection management, digital curation; management; communications; web archiving

(Internal Collaboration)

> The Web Archiving Team is involved in the end to end process but works with other departments / teams in the library


Department / Team: Activity / Support

> S&C (Subject specialist group, Curator’s Choice project): selection, curation

> eIS: network, hardware and IT support

> O&S Resource Discovery & Research: corporate level resource discovery, http://explore.bl.uk/

> CA&D Digital Processing: cataloguing (special collection level)

> SMC: publicity, press release, events

> The Legal Deposit Programme: domain crawl capability / process and policy

Curator’s Choice

> Pilot project with a small group of dedicated curators / subject specialists

> Special Collections of curator’s choice. Curators take responsibility for owning, maintaining and growing the collections over time

> Evolving Role of Libraries in the UK

> Political Action and Communication

> Slavery and Abolition in the Caribbean

> UK relations with the Low Countries

> 19th Century English Literature

> Oral History in the UK

> Film in the UK

> Energy


Web Archiving Advisory Group

> Provide advice and support to the Web Archiving Team

> Act as a ‘critical friend’ to assist in the development of policy and practice.

> Specific advice and support on:

> Purpose, vision and benefits.

> Strategic direction and planning.

> Synergy with internal teams and collaboration with external stakeholders/partners.

> Policy changes and risk management


(External) Collaboration

> UK Web Archiving Consortium (2004-2007): centralised infrastructure and development, distributed collections

> UK Web Archive partners, National Archives, Legal Deposit Libraries (LDLs)

> External Collaborators

> Wellcome Library

> Live Art Development Agency

> The Cambridge Innovation Network

> The Women’s Library

> Institute of Historical Research, University of London

> Individual researchers, specialists

> General public – ca. 20 nominations / week

> National organisations: DPC, JISC

> International: IIPC


JISC UK Web Domain Dataset (1996-2010)

> Collaboration with JISC and the Internet Archive

> UK Web Domain Dataset (1996-2010) – UK websites extracted from the Internet Archive's collection and supported by funding from the JISC

> 35TB research dataset

> No local access to individual websites but access to secondary dataset allowed

> BL has developed visualisations of the dataset

> JISC funded 2 further projects using this dataset

> Analytical Access to the Domain Dark Archive

> Big Data: Demonstrating the Value of the UK Web Domain Dataset for Social Science Research


http://domaindarkarchive.blogspot.co.uk/

http://www.oii.ox.ac.uk/research/projects/?id=88

Web Archiving Tools

> Support key processes: selection, harvesting, storage, access, preservation

> Mostly open source tools, some developed in-house

> New tools / changes to current tools expected when business processes change due to non-print Legal Deposit


Selection Tools

> Selection: decide what websites to archive and to include as part of a web archive collection

> Selection and Permission Tool: https://wct.bl.uk/selection/

> Submit selection – real time checking of duplicates, fetching meta tags from live sites

> Collect metadata

> Add contact details

> Suggest crawl frequency

> Permissions management – send emails, direct users to online licence form, store the completed forms, pass details to WCT (create authorisation record and a pending target)

> Reports

> Twittervane
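
The duplicate check and meta tag fetching listed above can be illustrated with a minimal sketch. It is not the Selection and Permission Tool's code: the normalisation rule (ignore scheme, a leading "www." and a trailing slash) and the function names are assumptions made for illustration, and the HTML is passed in as a string rather than fetched from a live site.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


def normalise(url):
    """Reduce a URL to a comparison key (hypothetical rule, see lead-in)."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def is_duplicate(url, already_selected):
    """True if the nominated URL matches one that was selected earlier."""
    key = normalise(url)
    return any(normalise(existing) == key for existing in already_selected)


class MetaTagParser(HTMLParser):
    """Collect <meta name=... content=...> pairs and the <title> text."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name, content = attrs.get("name"), attrs.get("content")
        if tag == "meta" and name and content is not None:
            self.meta[name.lower()] = content
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def describe(html):
    """Return the title and meta tags found in an HTML page."""
    parser = MetaTagParser()
    parser.feed(html)
    return {"title": parser.title.strip(), **parser.meta}
```

A selector would call `is_duplicate` when a nomination is typed in, and `describe` on the fetched home page to pre-fill the metadata form.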


http://netpreserve.org/sites/default/files/resources/Hockx-Yu.pdf

Harvesting Tools

> Harvesting: automated downloading of selected websites using crawler software; quality assurance regarded as an element

> The Web Curator Tool (WCT): https://wct.bl.uk/wct/

> Job scheduling

> Metadata

> Access control

> Harvesting (uses Heritrix)

> QA


Quality Assurance

> Placing more emphasis on intellectual content than appearance or behaviour of a website

> Use four aspects to define quality:

> Completeness of capture: whether the intended content has been captured as part of the harvest.

> Intellectual content: whether the intellectual content (as opposed to styling and layout) can be replayed in the Access Tool.

> Behaviour: whether the harvested copy can be replayed including the behaviour present on the live site, such as the ability to browse between links interactively.

> Appearance: look and feel of a website.

> Rely on visual comparison, previous harvests & crawl logs

> Recent development of QA module to allow bulk operation, reduce the number of clicks and make QA recommendations
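
As an illustration of how crawl logs can support the completeness check, the sketch below summarises response codes from crawl log lines. It assumes the default Heritrix `crawl.log` layout, with the fetch status in the second whitespace-separated column and the URI in the fourth; the function name and the "anything outside 2xx needs review" rule are simplifications for illustration, not the QA module's logic.

```python
from collections import Counter


def summarise_crawl_log(lines):
    """Count fetch statuses and list URIs that did not return a 2xx code."""
    statuses = Counter()
    to_review = []
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        status, uri = fields[1], fields[3]
        if uri.startswith("dns:"):
            continue  # DNS lookups are logged too but are not content
        statuses[status] += 1
        if not status.startswith("2"):
            to_review.append(uri)
    return statuses, to_review
```

The list of URIs to review could then be compared with previous harvests of the same site before a curator looks at the pages.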


Supporting Long-term Preservation

> Storing data in WARCs and metadata in METS

> Migrate all legacy data into WARCs

> WCT outputs WARC files

> Submission Information Package (SIP) profiles for selective and domain crawls

> Storing descriptive metadata (e.g. permission information) & technical metadata (e.g. crawl log, crawl configurations, virus scan events)

> Ingest archived websites in the Digital Library System (DLS)

> Command line tool generates SIPs

> Providing access from the DLS (in future)
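
To make "storing data in WARCs" concrete, here is a minimal sketch of a single WARC/1.0 record: a version line, named header fields, a blank line, the content block, and two trailing CRLFs. It follows the general record layout of the WARC format and is not the WCT's writer; the function name `build_warc_record` is hypothetical, and the METS metadata and SIP packaging are not shown.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri, payload, content_type,
                      record_type="resource", when=None):
    """Serialise one WARC/1.0 record as bytes (illustrative only)."""
    when = when or datetime.now(timezone.utc)
    headers = [
        ("WARC-Type", record_type),
        ("WARC-Record-ID", "<urn:uuid:%s>" % uuid.uuid4()),
        ("WARC-Date", when.strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    # Header lines end with CRLF; an empty line separates header and block.
    head = "WARC/1.0\r\n" + "".join("%s: %s\r\n" % h for h in headers) + "\r\n"
    # Each record is terminated by two CRLFs after the block.
    return head.encode("utf-8") + payload + b"\r\n\r\n"
```

Records built this way can be concatenated into one file, which is what lets a single WARC hold both the harvested content and records describing the crawl.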


Demo (45 minutes)

> Selection and Permission Tool (https://wct.bl.uk/selection/)

> Web Curator Tool (https://wct.bl.uk/wct/)


Access

> Currently 3 ways to access the web archive

> Online through the UK Web Archive

> Catalogue records (of special collections)

> Keyword search through Primo (corporate resource discovery system)

> Conduct researcher survey to understand requirements

> Analytical access


Catalogue Records


Keyword search through Primo


UK Web Archive


> Websites archived by BL and partners since 2004 (65% by BL)

> 122,99 websites, 50,866 instances, 13.6TB WARCs

> Over 100,000 unique visits since 1st April 2012

> Key websites include videos

> Full-text, N-gram, title and URL search

> Browse by subject / special collection, visual browsing

http://www.webarchive.org.uk

Analytical Access

> Shift of focus from the level of single webpages or websites to the entire web archive collection.

> Use web archives as datasets

> Support survey, annotation, contextualisation and visualisation

> Allows discovery of patterns, trends and relationships in inter-linked web pages

> Extracting value from the “haystacks”

> Helps address a number of challenging issues

> Scalability

> Accessibility of individual websites

> Components missed by crawlers
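
Using the archive as a dataset can be illustrated with a small n-gram trend count: for each crawl year, what share of all n-grams equals a given phrase. This is a toy sketch over in-memory text, not the UK Web Archive's N-gram search; the function names and the `(year, text)` input shape are assumptions.

```python
from collections import Counter


def ngrams(text, n):
    """Split text on whitespace and return its word n-grams, lower-cased."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_trend(documents, phrase):
    """documents: iterable of (crawl_year, text).

    Returns {year: share of that year's n-grams that equal the phrase}.
    """
    n = len(phrase.split())
    hits, totals = Counter(), Counter()
    for year, text in documents:
        grams = ngrams(text, n)
        totals[year] += len(grams)
        hits[year] += grams.count(phrase.lower())
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}
```

Normalising by the yearly total matters because the volume crawled differs from year to year; raw counts would mostly reflect collection size.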


Visualising the UK Web

> http://www.webarchive.org.uk/ukwa/visualisation

> N-gram search

> Links analysis

> Format Analysis

> Geo-index

> http://www.webarchive.org.uk/bluebox/

> uses the Memento aggregate TimeGate hosted by lanl.gov

> “resource not in archive” – who else has it?

> Open data

> Dataset and APIs for general use

> Enable broader community to re-use, explore and visualise content of web archive
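
The Memento TimeGate lookup mentioned above works by datetime negotiation: the client asks a TimeGate for an original URL and passes the desired date in an `Accept-Datetime` header. The sketch below only builds such a request and does not send it. The TimeGate address is a placeholder, since the slides do not give the aggregator's actual URL, and `timegate_request` is a hypothetical helper name.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request

# Placeholder address; substitute the TimeGate you actually want to query.
TIMEGATE = "http://timegate.example.org/timegate/"


def timegate_request(original_url, when):
    """Build a HEAD request asking for the capture closest to `when` (UTC)."""
    request = Request(TIMEGATE + original_url, method="HEAD")
    # Memento expects an HTTP-date in GMT, e.g. "Wed, 28 Nov 2012 00:00:00 GMT".
    request.add_header("Accept-Datetime", format_datetime(when, usegmt=True))
    return request
```

An aggregate TimeGate answers with a redirect to the closest capture held by any archive it knows about, which is how a "resource not in archive" page can point to a copy held elsewhere.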


Web Archiving Infrastructure


Non-print Legal Deposit: Time of change

> Expected to be in place in April 2013

> Access restricted to premises of Legal Deposit Libraries

> Library-wide Legal Deposit Programme to develop capability and end-to-end process

> Web Archiving Team acts as “technical supplier” for a number of projects

> Still need to work out how current (permission-based) selective archiving relates to domain crawl under Legal Deposit

> Will we request permissions for online access?

> Will we stop crawling some of the sites we are crawling now and include them in the annual / bi-annual broad domain crawl?

> Who does what?



Web Archiving Strategy

[Diagram: Domain Crawl, Key sites, Events, Special collections]

Domain harvesting:
• Broad sweep of .uk domain
• Once or twice a year

Events & key sites:
• Events of national interest
• Sites need to be captured frequently

Special Collection:
• Focused, thematic collections
• Support priority subjects

Web Archiving Workshop

Leïla Medjkoune, Internet Memory

IIPC workshop, BNF, Paris, November 2012

Internet Memory

Internet Memory Foundation (European Archive)
• Established in 2004 in Amsterdam and then Paris
• Mission: Preserve Web content by building a shared WA platform
• Actions: Dissemination, R&D and partnerships with research groups and cultural institutions
• Open Access Collections: UK National Archives & Parliament, PRONI, CERN and The National Library of Ireland

Internet Memory Research
• Spin-off of IM established in June 2011 in Paris
• Missions: Operate large scale or selective crawls & develop new technologies (crawl, access, processing and extraction)

Internet Memory Infrastructure

• Green datacenters
• Repository and data access for large-scale data management:
  – HDFS (Hadoop Distributed File System): distributed, fault-tolerant file system
  – HBase: a distributed key-value index; a convenient model for temporal archives
  – MapReduce: a distributed execution framework; a reliable mechanism to run an analysis job on very large datasets
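
Why a key-value index suits temporal archives, and what a MapReduce analysis job looks like, can be shown with a toy sketch in plain Python. It does not use HBase or Hadoop and does not reflect Internet Memory's actual schema: the row key layout (reversed host, path, 14-digit timestamp) is an assumed convention, and `run_job` merely imitates the map, group and reduce phases in memory.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def row_key(url, timestamp14):
    """Hypothetical row key: reversed host + path + capture timestamp.

    Sorting such keys keeps all captures of a URL together, in time order.
    """
    parts = urlsplit(url)
    host = ".".join(reversed(parts.hostname.split(".")))
    return "%s%s|%s" % (host, parts.path or "/", timestamp14)


def map_capture(capture):
    """Map phase: emit (MIME type, 1) for each capture."""
    yield capture["mime"], 1


def reduce_counts(key, values):
    """Reduce phase: sum the counts emitted for one key."""
    return key, sum(values)


def run_job(captures):
    """Imitate map -> group by key -> reduce on a small in-memory list."""
    groups = defaultdict(list)
    for capture in captures:
        for key, value in map_capture(capture):
            groups[key].append(value)
    return dict(reduce_counts(key, values) for key, values in groups.items())
```

On a real cluster the map and reduce functions run in parallel over blocks of the distributed file system; the grouping step in the middle is what the framework provides.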

Internet Memory

Focused crawling:
• Automated crawls
• Quality focused crawls:
  – Video capture, Twitter crawls
  – Execution tools to overcome crawling issues on specific content

Large scale crawling
• In-house developed distributed software
• Scalable crawler (10-50 Bn pages)
• Also designed for focused crawl and complex scoping

Research projects and focus

Web Archiving and Preservation
✓ Living Web Archives (2007-2010)
✓ Archives to Community MEMories (2010-2013)
✓ SCAlable Preservation Environment (2010-2013)

Webscale data Archiving and Extraction
✓ Living Knowledge (2009-2012)
✓ Longitudinal Analytics of Web Archive data (2010-2013)
✓ TrendMiner (2011-2014)
✓ DOPA (2012-2014)
✓ AnnoMarket (2012-2014)

Web Archiving project ?

Organisational challenges:
• Selection/QA: Librarian / Archivist, Quality assurance team, Project manager
• Content capture/services development: Engineers, developers, technicians
• Infrastructure deployment and maintenance: Engineers, System administrators

➥ Web Archiving projects require strong competences and experienced human resources combined with a scalable infrastructure

IM Shared platform

Since its creation in 2004, the Internet Memory Foundation has worked in close collaboration with partner institutions and research groups through European projects:
• To develop methods and tools improving web archiving quality
• To grow its expertise and technological taskforce

Archivethe.Net (1)

• To pool knowledge and skills between institutions
• To share internal developments with partner institutions
• To cut services and R&D costs

Archivethe.Net (2)

• Archivethe.net is a shared web archiving platform associated with a service.
• The platform combines new technology and user needs to ensure good service quality in terms of reliability and efficiency
• For whom? Our current partners, our new partners and … for ourselves

Benefits?
• Integrated web archiving process: from selection to access
• Ongoing technological developments through specific or common R&D projects
• Dedicated and highly skilled team to follow partners’ projects
• Dedicated infrastructure

How does it work? (1)

• ATN is designed as SaaS (Software as a Service)
• The platform offers a friendly user interface to record partners’ web archiving orders
• A pipeline organizes and manages the production
• A QA team ensures the quality of the archive to meet partners’ requirements

How does it work? (2)

Demo

ARCOMEM Archivist tool?

Set and follow web archive campaigns
• V1: A crawler cockpit and a search and retrieval application

Intelligent content acquisition:
• Seed URLs
• Keywords
• Social web sites APIs
• Social Media Categories (SMC)

SARA

Search and retrieval interface:
• Advanced search functionalities
• Filtering via faceting
• Sorting by content type, social media platform, text/image contextual information (event, entity,...), etc.

Crawler Cockpit Interface

• Create/select a campaign
• Describe campaign (title, description, comments, etc.)
• Define scope: select criteria such as language, keyword, URL, organisation, etc.
• Select social media categories and APIs to explore
• Set precedence rules for some content type or source (images, videos, tweets, news, etc.)

Crawler cockpit interface

Demo

ARCOMEM Archivist Tool V2

• Refinement mode: Refine crawl parameters to improve crawls
• Improve access application (SARA): Preview function so that users can review the results of the campaign set-up

QA for Web Archives?

IM QA is based on:
• Tools internally developed
• Tools developed in the context of European projects
• Automated processes
• Knowledge and skills of our crawl engineer and QA teams

QA Methodology and tools?

Methodology
• Based upon crawler behaviour
• Based on institutions’ needs and policy
• Can be manual (visual) or “automated”
• Can be made at pre- or post-crawl time

Tools
• Open source tools such as plugins, proxies, etc.
• Internally developed tools (fetchers, automated checks, etc.)
• Bug trackers to record information and communicate with partner institutions

QA Methodology and tools?

SCAPE: Scalable Preservation Environments
• Automate visual QA to detect rendering issues:
  – Improve archives quality and cut QA costs
  – Feed “preservation watch and planning” tools
• First test made on over 400 pairs of URLs
• In-house “Execution platform” under deployment
• Results and processes to be disseminated to IIPC members for feedback!
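
Automated visual QA on pairs of URLs can be pictured as comparing a rendering of the live page with a rendering of the archived copy. The slides do not say which comparison method is used, so the sketch below substitutes a deliberately simple stand-in: the share of pixels that differ between two same-sized greyscale images, given as nested lists. The function names, the tolerance and the review threshold are all assumptions.

```python
def difference_ratio(image_a, image_b, tolerance=10):
    """Share of pixels whose grey level differs by more than `tolerance`."""
    same_shape = len(image_a) == len(image_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(image_a, image_b)
    )
    if not same_shape:
        raise ValueError("renderings must have the same dimensions")
    total = sum(len(row) for row in image_a)
    changed = sum(
        1
        for row_a, row_b in zip(image_a, image_b)
        for pixel_a, pixel_b in zip(row_a, row_b)
        if abs(pixel_a - pixel_b) > tolerance
    )
    return changed / total if total else 0.0


def needs_review(live, archived, threshold=0.05):
    """Flag a pair for manual QA when too many pixels differ."""
    return difference_ratio(live, archived) > threshold
```

Only the flagged pairs would go to a human reviewer, which is where the saving in QA cost comes from.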

Technical challenges

Capture
• Dynamically generated content, deep web, etc.
• Non-HTTP protocols (e.g. RTMP)
• Social media platforms, ...

Access
• Replicate live functionalities and look & feel
• Provide access to very large files

➥ Fast evolving technologies
➥ Ephemeral content
➥ Multiplication of production means:
➥ Increase of user generated content

Technical Solutions

• Execution based crawling (vs parsing)
• API crawling
• Application aware crawling
• Bespoke fetchers
➥ Orchestration of tools

ARCOMEM content acquisition

Technical Solutions

Access tool:
• Player replacement: reproduce players’ functionalities
• Adapt access solution to type of content/platforms (generic solutions)

Storage infrastructure / format:
• Enable access to large files
• Fast access to large amount of content to facilitate search & retrieval

Use cases

• Social media capture and access:
  – YouTube
  – Twitter
  – Flickr, etc.
• Web Archiving related services:
  – Redirection service
  – Memento
  – Legal issues with captured content
  – Full text search
  – etc.