Presented at a working session on Web Archives at the Biblioteca Nacional de España (BNE), on 8 July 2013
Integrating web archiving in
preservation workflows
Louise Fauduet, Clément Oury,
Sébastien Peyrard
28th November 2012, Session 6: Integrating web archiving in preservation workflows
Objectives of the session
> Present the current issues and solutions of web archives preservation
> Present the main characteristics of a preservation repository, taking “SPAR” as an example
> Explore the benefits and issues of integrating web archives in a shared digital repository
Preserving web archives
> First and foremost: bit-level preservation
> Logical preservation = preservation of our ability to read and
understand the series of “0” and “1”
> Three kinds of logical preservation
> Preservation of the container format
> Preservation of the contained files
> Preservation of all information necessary to understand web archives
> Web archives preservation: the worst case possible?
> Huge amounts of data
> Very little information on contained file formats
> Heterogeneity of web archive collections within the same institution
Web archive preservation: characteristics and issues
Preserving container formats: the ARC format
> ARC format
> A container format
> In use since 1996: http://archive.org/web/researcher/ArcFileFormat.php
> Groups together data and metadata
> Size arbitrarily limited to 100 MB
W/ARC file design (diagram): a W/ARC file contains W/ARC records; each record has a header (e.g. record ID, capture date, record type) and a block (e.g. an HTTP response, a JPEG file).
The origins: the ARC format
An ARC file begins with a filedesc record (the URL-record-definition and version block header); each following URL-record holds a network document, e.g. a protocol response object.
Example of a version-1 ARC file:
filedesc://IA-001102.arc 0 19960923142103 text/plain 76
1 0 AlexaInternet
URL IP-address Archive-date Content-type Archive-length
http://www.dryswamp.edu:80/index.html 127.10.100.2 19961104142103 text/html 202
HTTP/1.0 200 Document follows
Date: Mon, 04 Nov 1996 14:21:06 GMT
Server: NCSA/1.4.1
Content-type: text/html
Last-modified: Sat, 10 Aug 1996 22:33:11 GMT
Content-length: 30
<HTML>
Hello World!!!
</HTML>
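The version-1 URL-record header shown above is a single space-separated line. A minimal parsing sketch (illustrative only; the field order is taken from the URL-record-definition in the example, and field names are our own):

```python
from datetime import datetime

# Field order taken from the version-1 URL-record-definition shown above:
# URL IP-address Archive-date Content-type Archive-length
ARC_V1_FIELDS = ["url", "ip_address", "archive_date", "content_type", "archive_length"]

def parse_arc_v1_header(line: str) -> dict:
    """Parse a version-1 ARC URL-record header line into a dict."""
    parts = line.strip().split(" ")
    if len(parts) != len(ARC_V1_FIELDS):
        raise ValueError(f"expected {len(ARC_V1_FIELDS)} fields, got {len(parts)}")
    record = dict(zip(ARC_V1_FIELDS, parts))
    # Archive-date is a 14-digit YYYYMMDDhhmmss timestamp
    record["archive_date"] = datetime.strptime(record["archive_date"], "%Y%m%d%H%M%S")
    record["archive_length"] = int(record["archive_length"])
    return record

rec = parse_arc_v1_header(
    "http://www.dryswamp.edu:80/index.html 127.10.100.2 19961104142103 text/html 202"
)
print(rec["content_type"], rec["archive_length"])  # text/html 202
```

A real validator (such as the one BnF designed) must also handle the filedesc record and the record bodies that follow each header.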
Why standardize ARC?
> The ARC format had technical shortcomings
> Only two record types: hard to distinguish data and metadata
> There was no way to uniquely identify a record
> The ARC specifications had formal shortcomings
> They were not perfectly clear and were open to interpretation (which made it hard to define what a valid ARC file was when BnF designed a validation tool)
> ARC was an Internet Archive de facto standard, not an internationally
recognized standard
Why standardize ARC?
> The need for a standard format
> Promote the use of a single format by all web archiving institutions (not
only IIPC members)
> Ensure that the format will not be subject to uncontrolled changes
> Allow a clear validation of WARC files
> Foster the development of tools
> Ensure confidence in the long term preservation of web archives
> Integrate web archive formats in the set of international bibliographic
standards
WARC standardization history
> WARC = Web ARChive file format
> Created by IIPC
> WARC is the next generation of the ARC file format
> The ARC format was created by the Internet Archive
> Most legacy web archives are in ARC
> Original discussion: Sept 2004
> First Internet Draft: May 2005
> First ISO Working Draft: Feb 2006
> Final ISO Draft: June 2008
> Final Publication: May 2009
WARC improvements
> A unique identifier for each record
> Eight record types
> warcinfo
> response
> resource
> request
> metadata
> revisit
> conversion
> continuation
> Better ways to describe and document the harvesting
process (and the deduplication process)
> Possibility to manage format migrations
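Every WARC record starts with a version line followed by named fields, as the examples on the next slides show. A minimal sketch of parsing such a header block (illustrative; a real tool, e.g. the warcio library, also handles continuation lines and record bodies):

```python
def parse_warc_headers(header_block: str):
    """Return (version, fields) for a single WARC record header block.

    The first line is the version line (e.g. 'WARC/1.0'); each following
    line is 'Name: value'. Splitting on the first ':' keeps values such as
    '<urn:uuid:...>' intact.
    """
    lines = header_block.strip().splitlines()
    version = lines[0]
    fields = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return version, fields

version, fields = parse_warc_headers(
    "WARC/1.0\n"
    "WARC-Type: warcinfo\n"
    "WARC-Date: 2006-09-19T17:20:14Z\n"
    "WARC-Record-ID: <urn:uuid:d7ae5c10-e6b3-4d27-967d-34780c58ba39>\n"
    "Content-Type: application/warc-fields\n"
    "Content-Length: 381\n"
)
print(version, fields["WARC-Type"])  # WARC/1.0 warcinfo
```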
Example of Warcinfo record
WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2006-09-19T17:20:14Z
WARC-Record-ID: <urn:uuid:d7ae5c10-e6b3-4d27-967d-34780c58ba39>
Content-Type: application/warc-fields
Content-Length: 381
software: Heritrix 1.12.0 http://crawler.archive.org
hostname: crawling017.archive.org
ip: 207.241.227.234
isPartOf: testcrawl-20050708
description: testcrawl with WARC output
operator: IA_Admin
http-header-user-agent:
Mozilla/5.0 (compatible; heritrix/1.4.0 +http://crawler.archive.org)
format: WARC file version 0.17
conformsTo:
http://www.archive.org/documents/WarcFileFormat-0.17.html
Example of request record
WARC/1.0
WARC-Type: request
WARC-Target-URI: http://www.archive.org/images/logoc.jpg
WARC-Date: 2006-09-19T17:20:24Z
Content-Length: 236
WARC-Record-ID: <urn:uuid:4885803b-eebd-4b27-a090-144450c11594>
Content-Type: application/http;msgtype=request
WARC-Concurrent-To: <urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>
GET /images/logoc.jpg HTTP/1.0
User-Agent: Mozilla/5.0 (compatible; heritrix/1.10.0)
From: [email protected]
Connection: close
Referer: http://www.archive.org/
Host: www.archive.org
Cookie: PHPSESSID=009d7bb11022f80605aa87e18224d824
Example of response record
WARC/1.0
WARC-Type: response
WARC-Target-URI: http://www.archive.org/images/logoc.jpg
WARC-Date: 2006-09-19T17:20:24Z
WARC-Block-Digest: sha1:UZY6ND6CCHXETFVJD2MSS7ZENMWF7KQ2
WARC-Payload-Digest: sha1:CCHXETFVJD2MUZY6ND6SS7ZENMWF7KQ2
WARC-IP-Address: 207.241.233.58
WARC-Record-ID: <urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>
Content-Type: application/http;msgtype=response
WARC-Identified-Payload-Type: image/jpeg
Content-Length: 1902
HTTP/1.1 200 OK
Date: Tue, 19 Sep 2006 17:18:40 GMT
Server: Apache/2.0.54 (Ubuntu)
Last-Modified: Mon, 16 Jun 2003 22:28:51 GMT
ETag: "3e45-67e-2ed02ec0"
Accept-Ranges: bytes
Content-Length: 1662
Connection: close
Content-Type: image/jpeg
[image/jpeg binary data here]
Example of resource record
WARC/1.0
WARC-Type: resource
WARC-Target-URI: file://var/www/htdoc/images/logoc.jpg
WARC-Date: 2006-09-19T17:20:24Z
WARC-Record-ID: <urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>
Content-Type: image/jpeg
WARC-Payload-Digest: sha1:CCHXETFVJD2MUZY6ND6SS7ZENMWF7KQ2
WARC-Block-Digest: sha1:CCHXETFVJD2MUZY6ND6SS7ZENMWF7KQ2
Content-Length: 1662
[image/jpeg binary data here]
Example of metadata record
WARC/1.0
WARC-Type: metadata
WARC-Target-URI: http://www.archive.org/images/logoc.jpg
WARC-Date: 2006-09-19T17:20:24Z
WARC-Record-ID: <urn:uuid:16da6da0-bcdc-49c3-927e-57494593b943>
WARC-Refers-To: <urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>
Content-Type: application/warc-fields
WARC-Block-Digest: sha1:UZY6ND6CCHXETFVJD2MSS7ZENMWF7KQ2
Content-Length: 59
via: http://www.archive.org/
hopsFromSeed: E
fetchTimeMs: 565
Example of revisit record
WARC/1.0
WARC-Type: revisit
WARC-Target-URI: http://www.archive.org/images/logoc.jpg
WARC-Date: 2007-03-06T00:43:35Z
WARC-Profile: http://netpreserve.org/warc/0.17/server-not-modified
WARC-Record-ID: <urn:uuid:16da6da0-bcdc-49c3-927e-57494593bbbb>
WARC-Refers-To: <urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>
Content-Type: message/http
Content-Length: 226
HTTP/1.x 304 Not Modified
Date: Tue, 06 Mar 2007 00:43:35 GMT
Server: Apache/2.0.54 (Ubuntu) PHP/5.0.5-2ubuntu1.4
Connection: Keep-Alive
Keep-Alive: timeout=15, max=100
Etag: "3e45-67e-2ed02ec0"
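The revisit record supports deduplication: when a capture's payload digest matches an earlier capture of the same URI (or the server answers 304 Not Modified), the crawler can write a revisit record referring to the original response instead of storing the content again. A hypothetical sketch of the digest check (the index structure and function are our own, not a crawler API):

```python
# Hypothetical digest-based deduplication: if the payload digest of a new
# capture matches an earlier capture of the same URI, write a 'revisit'
# record referring to the original instead of storing the payload again.

def choose_record_type(uri: str, payload_digest: str, seen: dict):
    """Return ('response', None) for new content, or
    ('revisit', original_record_id) for duplicated content."""
    key = (uri, payload_digest)
    if key in seen:
        return "revisit", seen[key]
    return "response", None

seen = {
    ("http://www.archive.org/images/logoc.jpg",
     "sha1:CCHXETFVJD2MUZY6ND6SS7ZENMWF7KQ2"):
        "<urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>",
}
kind, refers_to = choose_record_type(
    "http://www.archive.org/images/logoc.jpg",
    "sha1:CCHXETFVJD2MUZY6ND6SS7ZENMWF7KQ2",
    seen,
)
print(kind)  # revisit
```

The returned original record ID is what goes into the WARC-Refers-To field of the revisit record above.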
Example of conversion record
WARC/1.0
WARC-Type: conversion
WARC-Target-URI: http://www.archive.org/images/logoc.jpg
WARC-Date: 2016-09-19T19:00:40Z
WARC-Record-ID: <urn:uuid:16da6da0-bcdc-49c3-927e-57494593dddd>
WARC-Refers-To: <urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>
WARC-Block-Digest: sha1:XQMRY75YY42ZWC6JAT6KNXKD37F7MOEK
Content-Type: image/neoimg
Content-Length: 934
[image/neoimg binary data here]
Example of continuation record 1/2
WARC/1.0
WARC-Type: response
WARC-Target-URI: http://www.archive.org/images/logoc.jpg
WARC-Date: 2006-09-19T17:20:24Z
WARC-Block-Digest: sha1:2ASS7ZUZY6ND6CCHXETFVJDENAWF7KQ2
WARC-Payload-Digest: sha1:CCHXETFVJD2MUZY6ND6SS7ZENMWF7KQ2
WARC-IP-Address: 207.241.233.58
WARC-Record-ID: <urn:uuid:39509228-ae2f-11b2-763a-aa4c6ec90bb0>
WARC-Segment-Number: 1
Content-Type: application/http;msgtype=response
Content-Length: 1600
HTTP/1.1 200 OK
Date: Tue, 19 Sep 2006 17:18:40 GMT
Server: Apache/2.0.54 (Ubuntu)
Last-Modified: Mon, 16 Jun 2003 22:28:51 GMT
ETag: "3e45-67e-2ed02ec0"
Accept-Ranges: bytes
Content-Length: 1662
Connection: close
Content-Type: image/jpeg
[first 1360 bytes of image/jpeg binary data here]
Example of continuation record 2/2
WARC/1.0
WARC-Type: continuation
WARC-Target-URI: http://www.archive.org/images/logoc.jpg
WARC-Date: 2006-09-19T17:20:24Z
WARC-Block-Digest: sha1:T7HXETFVA92MSS7ZENMFZY6ND6WF7KB7
WARC-Record-ID: <urn:uuid:70653950-a77f-b212-e434-7a7c6ec909ef>
WARC-Segment-Origin-ID: <urn:uuid:39509228-ae2f-11b2-763a-aa4c6ec90bb0>
WARC-Segment-Number: 2
WARC-Segment-Total-Length: 1902
WARC-Identified-Payload-Type: image/jpeg
Content-Length: 302
Integrating web archiving in preservation workflows: the objectives
> Share development and storage costs with other entities of the library
> Benefit from the technology watch performed for other kinds of digital documents
> Obtain a global overview of all kinds of the institution’s
digital assets
> And manage all of them in a consistent manner
Integrating web archiving in preservation workflows: the issues
> Digital collections are highly heterogeneous
> Legal status / preservation status
> Access constraints
> Volume of resources
> Technical characteristics (formats)
> Data and metadata models
> Heterogeneity may occur even within similar kinds of documents
> Digitization and harvesting procedures have evolved over time
> Competition may occur between different kinds of digital collections
> For development priorities, ingest priorities, storage or preservation quality
A shared digital repository: the SPAR
example
Digital everywhere
Functional issue: a digital preservation copy
Digitization as a means of preservation and dissemination
Technical issue: volume
Then: black & white, 300 dpi, TIFF G4; 1 page ~ 200 KB
Now: color (24 bits), 400 dpi, uncompressed TIFF; 1 page ~ 80 MB
More than a 500-fold increase!
Business issue: born digital
More and more documents exist only in digital form
Digital preservation is not alone
SPAR infrastructure and realization (diagram): production applications (preservation digitization, web archiving with Wayback, audiovisual, etc.) feed SPAR through Ingest; SPAR comprises the modules Ingest, Data management, Storage (via a Storage Abstraction Service, SAS), Administration, Preservation planning and Access, serving dissemination applications.
How to manage diversity?
> Build a central repository to reduce the diversity (media, formats, departments …)
> Rely on best practices and standards
> Key requirements:
> OAIS compliance (ISO 14721:2012)
> modularity and distributivity
> abstraction
> use of well known formats and standards
> use of open-source technical building blocks
A generic model
http://public.ccsds.org/publications/archive/650x0m2.pdf
A generic repository solution (diagram): pre-ingest processes for each channel (preservation digitization, web archives, and so on) prepare SIPs for the repository; inside, a storage abstraction service underpins the modules Ingest, Storage, Data management, Administration, Preservation planning and Access, which delivers DIPs; packages are described in METS manifests and metadata is managed as RDF.
Preservation workflow
> Granularity at the archival package level, to allow for parallelization
> Each module is independent and interacts through defined interfaces
> Every package follows the same basic workflow with specific
features when needed
Information about a package
And with a little patience…
Infrastructure timeline (diagram), 2004-2013: initial study (2004), working groups, RFP, development of the core part, then additional channels (TPS, Admin, AV, WLD) developed through working groups; operations since May 2010; renewal with a new tender in 2013.
Requirements for the infrastructure
> Openness of the solution
> Openness to multiple environments
> Compatible with multiple providers
> Availability
> Reliability of all parts
> Use of the system 24h/day, 7d/week, 365d/year
> Maximum downtime: 2h per month
> Upgradeability
> Large headroom for growth in volume and computing power
> Coverage of needs for the next 8 years
Infrastructure (diagram): a main site with servers, online storage, primary storage, secondary storage and lookup storage; a backup site with backup servers, backup storage, backup secondary storage and backup lookup storage.
Primary and backup storage
Oracle StorageTek SL8500:
• up to 64 tape drives
• up to 8,500 tapes
• up to 8 hand pickers
• up to 32 linked libraries
Primary storage: 2 libraries, 16 PB maximum
Backup storage: 2 libraries, 16 PB maximum
Tape library in-situ
Media
Primary storage: LTO5, capacity 1.5 TB, transfer rate 140 MB/s (previously: 9840C, 40 GB)
Backup storage: T10000B, capacity 1 TB, transfer rate 120 MB/s (previously: T10000A, 500 GB)
How to manage diversity, really?
> To deal with the variability and heterogeneity of the data: tracks
> Built on the relation between the digital objects and the archival system, independently of any given organization
> Preservation digitization
> Audiovisual material
> Negotiated legal deposit (dark Web, regional press)
> Automatic legal deposit (web harvests)
> Administrative production
> Deposit / Third-party archiving
> Acquisition / Donation
> Then channels based on technical criteria
> And then: DECISIONS!
Decisions, decisions: Service Level Agreements
> The Service Level Agreements are contracts between the users and SPAR, to offer a more transparent system
> The system is no longer a black box only known by technical
experts
Workflows driven by the SLA (diagram): pre-ingest feeds SIPs to Ingest; the storage abstraction service underpins Ingest, Storage, Data management, Administration, Preservation planning and Access; AIPs are stored and DIPs disseminated, with METS manifests and RDF metadata. The SLAs answer questions such as:
> Which formats are allowed?
> How many copies are needed, and on what kind of media?
> What is the maximum size of a package?
> Do we need to log each access?
SLA: a reference package
> 3 SLAs: Ingest, Preservation, Access
> They formalize in XML the ways of managing the packages
> These 3 SLAs are recorded in a reference package that describes the channel
SLA-I.xml, SLA-P.xml, SLA-A.xml
Mets.xml
Contract.pdf
Decisions, decisions: metadata
Decisions, decisions: levels of granularity
(Diagram) Using the newspaper Le Matin as an example: set = the title (Le Matin); group = a year (1882, 1883); object = an issue (e.g. 01/07/1882, 02/07/1882, 03/07/1882, 28/02/1883, 01/03/1883); file = the files of an issue.
Decisions, decisions: formats (1)
> Define the scope
> For instance: accept files of large-size posters directly from the printers as a substitute for the legal deposit of paper posters
> Negotiation between:
> Producers: what the printers can output
> Librarians: what can be displayed/handled easily
> Preservation experts: what is acceptable given significant properties,
communities, standards…
Decisions, decisions: formats (2)
(Diagram) Candidate formats (TIFF, PDF/X, QuarkXPress) were scored against criteria such as truth to the original, characterization tools, standards, previous expertise and producer constraints.
For this purpose, PDF/X was chosen as a good compromise between truth to the original, wide usage and standardization
Requirements on formats
> Stored: no technical information; bit-stream preservation only
> Identified: identified format with an associated MIME type; no preservation strategy planned by the institution
> Known: format identified and documented, with tools and an associated schema of technical description; preservation strategy planned by the institution
> Managed: identified format; documentation and tools owned by the institution; profile of use defined in the institution
Format definition: a reference package
Mets.xml: manifest
T000001.tiff: sample
format.xml: machine-readable description
format.txt: human-readable description
Characterization?
> Identification. The process of determining the presumptive format of a digital object on the basis of suggestive extrinsic hints and intrinsic signatures, both internal (e.g. magic number) and external (e.g. file extension).
> Validation. The process of determining the level of conformance to the normative syntactic and semantic rules defined by the authoritative specification of the object's format.
> Feature extraction. The process of reporting the intrinsic properties of a digital object significant for purposes of classification, analysis, and use.
> Assessment. The process of determining the level of acceptability of a digital object for a specific purpose on the basis of locally-defined policy rules.
http://www.fao.org/oek/jhove2/digital-preservation-and-jhove2-home/jhove2-tutorial/en/
Tools for identification
> At BnF = getting a MIME type
> MagicMimeTypeIdentifier: from the Aperture project
> Can use magic number or extension
> Very narrow database
> Libmagic: from the file(1) utility
> Long-established library with an extensive database, still actively maintained
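The idea behind both tools is the same: match the file's leading bytes against known signatures, with the extension as a fallback hint. A minimal sketch, with an illustrative signature table that is far smaller than libmagic's database:

```python
import mimetypes

# A minimal sketch of signature-based identification. The signature table
# and extension fallback are illustrative, not BnF's actual configuration.
SIGNATURES = [
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\x1f\x8b", "application/gzip"),
]

def identify(data: bytes, filename: str = "") -> str:
    """Return a MIME type from the magic number, else from the extension."""
    for magic, mime in SIGNATURES:
        if data.startswith(magic):
            return mime  # internal signature (magic number)
    guessed, _ = mimetypes.guess_type(filename)  # external hint (extension)
    return guessed or "application/octet-stream"

print(identify(b"\xff\xd8\xff\xe0...", "logoc.jpg"))  # image/jpeg
print(identify(b"<HTML>Hello", "index.html"))         # text/html
```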
Tools for characterization
> At BnF = getting properties, using the proper tool based on MIME type
> text/*: textMD
> image/*: MIX
> audio/*: MPEG-7
> video/*: MPEG-7
> application/x-ia-arc: JHOVE2, containerMD
> application/gzip: JHOVE2, containerMD
http://bibnum.bnf.fr/containerMD-v1
The ingest process (diagram), steps ACT_01 to ACT_09: ingest request reception; manifest validation; package search within SPAR; SIP reception; SIP characteristics audit; SIP files audit and characterization; ARK identifier generation; SET processing; ingest completion.
ACT_06: file processing (flowchart): each file is identified (ACT_06.1.1), characterized (ACT_06.1.2) and its significant properties extracted; the resulting format level is then checked against the SLA in order: managed format (ACT_06.2.1), known format (ACT_06.2.2), identified format (ACT_06.2.3), stored format (ACT_06.2.4); at each step the outcome is accepted, rejected, or undefined (falling through to the next check).
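The cascade can be sketched as a small decision function. This is a hypothetical reading of the flowchart, not SPAR's implementation: the SLA gives a verdict per format level, and 'undefined' falls through to the next, less demanding level.

```python
# Hypothetical sketch of the ACT_06 decision cascade: the SLA lists, for each
# level (managed, known, identified, stored), whether files at that level are
# accepted or rejected; 'undefined' falls through to the next level.
LEVELS = ["managed", "known", "identified", "stored"]

def act_06_decision(file_level: str, sla: dict) -> str:
    """Return 'accepted' or 'rejected' for a file of the given format level.

    `sla` maps level name -> 'accepted' | 'rejected' | 'undefined'.
    """
    # Only levels at or below the file's own level apply: a file whose format
    # is merely 'identified' cannot satisfy the 'managed' or 'known' checks.
    start = LEVELS.index(file_level)
    for level in LEVELS[start:]:
        verdict = sla.get(level, "undefined")
        if verdict in ("accepted", "rejected"):
            return verdict
    return "rejected"  # nothing matched: refuse the file

sla = {"managed": "accepted", "known": "accepted",
       "identified": "undefined", "stored": "rejected"}
print(act_06_decision("known", sla))       # accepted
print(act_06_decision("identified", sla))  # rejected (falls through to 'stored')
```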
Decisions, decisions: building a package out
of it
Decisions, decisions: tweaking METS
> Header
> DmdSec
> AmdSec
> TechMD
> DigiProvMD
> SourceMD
> RightsMD
> FileSec
> StructMap
> StructLink
> BehaviorSec
Structural metadata: METS
Descriptive and source metadata: qualified Dublin Core
Provenance metadata: PREMIS
Technical metadata: depends on the data objects (e.g. MPEG-7)
Integrating Web archives into the SPAR repository: preserving the French web archives
Background (1): one mandate, heterogeneous collections
Collections (diagram): 1996-2005 (70 TB), 2002 & 2004 (0.5 TB), 2004-2008 (45 TB), 2006-2010 (22 TB), 2010-now (150 TB), harvested by various means (operator, robot, Alexa bot).
2006-08-01: the French copyright law entitles BnF to collect the French Internet
Background (2): a generic repository solution at BnF
(Diagram) Each track (digitized books, digitized audiovisual documents, web archiving) has its own pre-ingest. Web archives have been in the scope of SPAR since the beginning, but still need to be aligned with the existing implementation.
But first things first!
What do we want to preserve?
The harvested files
(Diagram) "Here is what I crawled on the Web, and how I packaged it in a web archive container file": harvested HTML files are stored in an ARC data file.
Intellectual information
(Diagram) "This is the 2012 French election collection"; "This is the daily news collection".
Provenance information
(Diagram) "This was harvested with the Heritrix tool v1.5.2, using NetarchiveSuite"; "This was harvested using HTTrack".
Provenance information, II
(Diagram) "We harvested content suited for Mozilla Firefox; we respected the robots.txt; the job crashed" vs. "We harvested content suited for Internet Explorer; we ignored the robots.txt; the job went well" (recorded in the crawler configuration).
Provenance information, III
(Diagram) "This was captured on May 8th, 2012; it captured 1,098 websites and produced 105 ARC files" vs. "This was captured in 2006; it captured 145 websites and produced 50 ARC files" (recorded in crawl logs and reports).
Consistent data before SPAR
Cleaning the stuff before preserving it
All data on a single target
Collections (diagram): 1996-2005 (70 TB), 2002 & 2004 (0.5 TB), 2004-2008 (45 TB), 2006-2010 (22 TB), 2010-now (150 TB); some harvesting setups are unknown.
Aligning collections before ingest: the NetarchiveSuite target workflow
(Diagram) Each harvest produces ARC data files (the harvested files, included in web-archive-specific container files) plus an ARC metadata file (configuration, logs and reports documenting the tools used); harvests 1, 2, 3… are grouped under a collection (e.g. "a collection containing French election websites").
Aligning collections before ingest: the NetarchiveSuite target workflow
A three-layered model in SPAR:
> Harvest Definition (curator collection)
> Harvest Instance ("technical" harvest = job)
> ARC file (data or metadata)
An "ARC metadata" sample
filedesc://32-metadata-1.arc 0.0.0.0 20100416092026 text/plain 77
1 0 InternetArchive
URL IP-address Archive-date Content-type Archive-length
metadata://netarkivet.dk/crawl/setup/harvestInfo.xml?heritrixVersion=1.14.3&harvestid=1&jobid=32 172.20.16.214 20100414095814 text/xml 366
<?xml version="1.0" encoding="UTF-8"?> <harvestInfo> <version>0.2</version> <jobId>32</jobId> <priority>LOWPRIORITY</priority> <harvestNum>0</harvestNum> <origHarvestDefinitionID>1</origHarvestDefinitionID> <maxBytesPerDomain>-1</maxBytesPerDomain> <maxObjectsPerDomain>1000</maxObjectsPerDomain> <orderXMLName>default</orderXMLName> </harvestInfo>
metadata://netarkivet.dk/crawl/setup/order.xml?heritrixVersion=1.14.3&harvestid=1&jobid=32 172.20.16.214 20100414095815 text/xml 44775
<?xml version="1.0" encoding="UTF-8"?> <crawl-order xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="heritrix_settings.xsd">
…
All data on a single target
(Diagram) The older collections (e.g. 1996-2005) have two layers (Collection, ARC files), while 2010-now has three layers (Harvest Definition, Harvest Instance, ARC files).
All data on a single target
(Diagram) 2006-2010 has four layers (Collection, Harvest division, Harvest instance, ARC files), while 2010-now has three layers (Harvest Definition, Harvest instance, ARC files).
What SPAR needs
Decisions, decisions: levels of granularity
(Diagram) The Le Matin example again: set (the title), group (Year 1882, Year 1883), object (each issue), file.
Different layers (diagram):
> set: contains nothing but metadata; curator information that allows grouping AIPs sharing the same intellectual content
> AIP: must contain the files to be preserved; each AIP is an autonomous unit
METS and PREMIS to store all the preserved content
<mets>
<dmdSec>: intellectual metadata
<amdSec>: administrative metadata
<sourceMD>: metadata about the source used to produce this content
<techMD>: technical metadata (e.g. MPEG-7)
<digiprovMD>: provenance metadata
<fileSec>: list of the files
<structMap>: structure of the package
> No management of files within files so far
> Neither tools nor an XML schema for the technical characteristics of ARC files
RDF database to query them
> The database is powerful but LIMITED
> Thus, we cannot express all the information for each
harvested file
Mixing all of this is very hard…
"Prometheus, we start the mapping!!!"
PREMIS as a preservation koinè
(Diagram) The harvest is modeled as a PREMIS Event, with outcome extensions, linking Objects (the harvest instance, its ARC files, and the report that documents it) and Agents (persons such as administrators, software, organizations).
In other terms…
(Diagram) The collection ("French election websites") groups harvests 1, 2, 3…; each harvest consists of ARC data files (holding the harvested HTML files) plus an ARC metadata file (config, log, report).
In other terms…
(Diagram) The collection becomes a set AIP; each harvest becomes a group of AIPs; each ARC data or ARC metadata file becomes an individual AIP.
Preserving web archives
Challenge 2: analyze ARC files and content files
Tools to analyze the ARC files
Need to:
> identify and validate ARC files
> characterize ARC files (extract information)
> handle GZIP compression
> do at least identification of content files
> work on large-scale collections (ARC and ARC.GZ files whose content formats are unknown)
Development of JHOVE2 ARC and GZIP modules
Managing the ARC structure
filedesc://IA-001102.arc 0 19960923142103 text/plain 76
1 0 AlexaInternet
URL IP-address Archive-date Content-type Archive-length
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<arcmetadata>
<arc:software>Heritrix 1.14.2 http://crawler.archive.org</arc:software>
</arcmetadata>
http://www.dryswamp.edu:80/index.html 127.10.100.2 19961104142103 text/html 202
HTTP/1.0 200 Document follows
Date: Mon, 04 Nov 1996 14:21:06 GMT
Server: NCSA/1.4.1
Content-type: text/html
Last-modified: Sat, 10 Aug 1996 22:33:11 GMT
Content-length: 30
<HTML>
Hello World!!!
</HTML>
The first ARC record is the filedesc record (URL record definition, version block header, and a metadata object); each following URL record holds a protocol response and a data object.
Problem with JHOVE2 outputs
> Too verbose: 100 MB of output for 100 MB of data (with only identification of content files)
> Need to aggregate and compact all this information
> Need to handle ARC format peculiarities
> Need for a container-file-specific format
containerMD
http://bibnum.bnf.fr/containerMD-v1
Where containerMD fits
(Diagram) Within the METS document, containerMD sits alongside MPEG-7 as technical metadata (<techMD>), next to descriptive metadata (<dmdSec>), source metadata (<sourceMD>), provenance metadata (<digiprovMD>), the file list (<fileSec>) and the package structure (<structMap>).
containerMD features
(Diagram) Under the containerMD root element: a container element and an entries element; entriesInformation carries aggregated information about the entries; ARC-specific extensions (ARCContainer, ARCEntries, ARCRecord) describe the ARC records themselves.
Aggregation example: format information
Verbose information (one entry per file):
> entry: format text/html, size 6026213
> entry: format application/pdf, size 602621
> entry: format image/tiff, size 60262132
> entry: format text/html, size 165165
> …
Aggregated information (factorizing and summing):
> entriesInformation count=400
> format text/html: count=300, globalSize=40645654
> format application/pdf: count=20, globalSize=265464
> etc.
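The factorize-and-sum step above can be sketched in a few lines. This is an illustration of the aggregation principle, not containerMD's actual serialization; the entry values are the ones from the verbose example:

```python
from collections import defaultdict

# Collapse one-entry-per-file format information into per-format counts and
# global sizes, as containerMD's entriesInformation does.
def aggregate(entries):
    """entries: iterable of (format, size) pairs -> aggregated summary."""
    per_format = defaultdict(lambda: {"count": 0, "globalSize": 0})
    for fmt, size in entries:
        per_format[fmt]["count"] += 1
        per_format[fmt]["globalSize"] += size
    return {"count": sum(f["count"] for f in per_format.values()),
            "formats": dict(per_format)}

summary = aggregate([
    ("text/html", 6026213),
    ("application/pdf", 602621),
    ("image/tiff", 60262132),
    ("text/html", 165165),
])
print(summary["count"])                               # 4
print(summary["formats"]["text/html"]["globalSize"])  # 6191378
```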
Now, here is some cool stuff we will be able to
ask SPAR
> Give me all the jobs that crashed last year
> Give me a list of all the file formats per broad crawl…
> and the number of files per format
> and the global size before and after decompression
> remove the error pages
> order them by decreasing number of files
> Give me, for the whole newspapers collection
> all the crawls
> order them by date
> the number of harvested files per crawl
28th November 2012 Session 6 - Integrating web archiving in preservation workflows 92
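As a toy illustration of the query logic behind the questions above, here is a Python sketch over invented job and format records. The field names and values are hypothetical, not SPAR's actual data model.

```python
from collections import Counter

# Invented job records; not SPAR's actual data model.
jobs = [
    {"id": "job-1", "year": 2011, "status": "crashed"},
    {"id": "job-2", "year": 2011, "status": "completed"},
    {"id": "job-3", "year": 2010, "status": "crashed"},
]

def crashed_in_year(jobs, year):
    """'Give me all the jobs that crashed last year.'"""
    return [j["id"] for j in jobs
            if j["year"] == year and j["status"] == "crashed"]

def formats_by_count(format_counts):
    """'... order them by decreasing number of files.'"""
    return Counter(format_counts).most_common()

print(crashed_in_year(jobs, 2011))  # ['job-1']
print(formats_by_count({"text/html": 300, "application/pdf": 20}))
```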
Conclusion and next steps
> A pragmatic approach to handle large-scale and heterogeneous
collections
> The huge mass of data is still an issue
> The benefits of a shared repository
> Cross-domain investigation on preservation strategies (AV material, office
formats, e-books formats…)
> Different policies (SLAs) for different collections
> Improve file format information
> By testing different tools
> By improving format information databases
> International cooperation is key
28th November 2012 Session 6 - Integrating web archiving in preservation workflows 93
At international level: the preservation
working group
> Goals of the PWG
> Exchange information and best practices (WARC format, information packages, etc)
> Promote the development of tools, review specifications and perform tests (WARC tools, JHOVE, etc.)
> Promote the web archive needs within the digital preservation community and projects
> Working fields
> Objectives and concepts of preserving archived web resources
> Preservation metadata
> Preservation workflows and digital repository functions and requirements
> Preservation strategies (migration, emulation…)
> Web environment technical documentation
> Evaluation of digital preservation tools and of their gaps with regard to web archives
> Organizational issues (costs, sustainability, promotion, skills,…)
Web archiving at the British Library
Helen Hockx-Yu
Head of Web Archiving
Overview
> Part 1: Background, history and organisation
> Part 2: Web Archiving Tools (including demos)
> Part 3: Access
> Part 4: Non-print Legal Deposit and future strategy
29th November 2012 Session 7 - Web archiving at the British Library 2
BL Structure
> BL Board and Executive Team
> e-Strategy and Information Systems (eIS) > IT-based products and services
> Finance and Corporate Services (F&CS) > Money
> Human Resources > People
> Operations & Services (O&S) > Front line services
> Scholarship and Collections (S&C) > Content (Arts and humanities, Social Sciences, Science, Technology & Medicine)
> Strategic Marketing and Communications (SMC) > Brand and reputation
29th November 2012 Session 7 - Web archiving at the British Library 3
Web archiving timeline
29th November 2012 Session 7 - Web archiving at the British Library 4
Current web archiving strategy
> Selective archiving of websites that > reflect the diversity of lives, interests and activities throughout the UK
> contain research value or are of research interest
> feature political, cultural, social and economic events of national interest
> demonstrate innovative use of the web (4 areas)
> Also prioritise websites at risk and web-only content
> Permission based > Permission to archive, to provide online access and to preserve. Also ask for 3rd party rights clearance
> 30% success rate, 5% explicit refusal (mostly due to 3rd party rights)
> Online access through UK Web Archive
> Expect to crawl at domain level (from April 2013) for Non-print Legal Deposit
29th November 2012 Session 7 - Web archiving at the British Library 5
The current Web Archiving team
29th November 2012 Session 7 - Web archiving at the British Library 6
Skills Profile > IT > Collection management, digital curation > Management > Communications > Web Archiving
(Internal Collaboration)
> The Web Archiving Team is involved in the end-to-end process but works with other departments / teams in the library
29th November 2012 Session 7 - Web archiving at the British Library 7
Department / Team: Activity / Support
> S&C (Subject specialist group, Curator’s Choice project): Selection, curation
> eIS: Network, hardware and IT support
> O&S (Resource Discovery & Research): Corporate level resource discovery http://explore.bl.uk/
> CA&D (Digital Processing): Cataloguing (special collection level)
> SMC: Publicity, press releases, events
> The Legal Deposit Programme: Domain crawl capability / process and policy
Curator’s Choice
> Pilot project with a small group of dedicated curators / subject specialists
> Special Collections of curator’s choice. Curators take responsibility for owning, maintaining and growing the collections over time > Evolving Role of Libraries in the UK
> Political Action and Communication
> Slavery and Abolition in the Caribbean
> UK relations with the Low Countries
> 19th Century English Literature
> Oral History in the UK
> Film in the UK
> Energy
29th November 2012 Session 7 - Web archiving at the British Library 8
Web Archiving Advisory Group
> Provide advice and support to the Web Archiving Team
> Act as a ‘critical friend’ to assist in the development of policy and practice.
> Specific advice and support on:
> Purpose, vision and benefits.
> Strategic direction and planning.
> Synergy with internal teams and collaboration with external stakeholders/partners.
> Policy changes and risk management
29th November 2012 Session 7 - Web archiving at the British Library 9
(External) Collaboration
> UK Web Archiving Consortium (2004-2007): centralised infrastructure and development, distributed collections
> UK Web Archive partners, National Archives, Legal Deposit Libraries (LDLs)
> External Collaborators
> Wellcome Library
> Live Art Development Agency
> The Cambridge Innovation Network
> The Women’s Library
> Institute of Historical Research, University of London
> Individual researchers, specialists
> General public – ca. 20 nominations / week
> National organisations: DPC, JISC
> International: IIPC
29th November 2012 Session 7 - Web archiving at the British Library 10
JISC UK Web Domain Dataset (1996-2010)
> Collaboration with JISC and the Internet Archive
> UK Web Domain Dataset (1996-2010) – UK websites extracted from the Internet Archive's collection and supported by funding from the JISC
> 35TB research dataset
> No local access to individual websites but access to secondary dataset allowed
> BL has developed visualisations of the dataset
> JISC funded 2 further projects using this dataset > Analytical Access to the Domain Dark Archive
> Big Data: Demonstrating the Value of the UK Web Domain Dataset for Social Science Research
29th November 2012 Session 7 - Web archiving at the British Library 11
Web Archiving Tools
> Support key processes: selection, harvesting, storage, access, preservation
> Mostly open source tools, some developed in-house
> New tools / changes to current tools expected when business processes change due to non-print Legal Deposit
29th November 2012 Session 7 - Web archiving at the British Library 12
Selection Tools
> Selection: decide what websites to archive and to include as part of a web archive collection
> Selection and Permission Tool: https://wct.bl.uk/selection/ > Submit selection – real time checking of duplicates, fetching meta tags from live
sites
> Collect metadata
> Add contact details
> Suggest crawl frequency
> Permissions management – send emails, direct users to online licence form, store the completed forms, pass details to WCT (create authorisation record and a pending target)
> Reports
> Twittervane
29th November 2012 Session 7 - Web archiving at the British Library 13
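Fetching meta tags from live sites, as the Selection and Permission Tool does at submission time, can be sketched with the standard library's HTML parser. The sample page below is invented; the real tool's implementation is not shown here.

```python
from html.parser import HTMLParser

class MetaTagParser(HTMLParser):
    """Collect <meta name=... content=...> pairs from an HTML page,
    roughly what a selection tool fetches from a nominated site."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            if "name" in d and "content" in d:
                self.meta[d["name"]] = d["content"]

html = '<html><head><meta name="description" content="A UK site"></head></html>'
p = MetaTagParser()
p.feed(html)
print(p.meta)  # {'description': 'A UK site'}
```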
Harvesting Tools
> Harvesting: automated downloading of selected websites using crawler software; quality assurance is regarded as an element of this process
> The Web Curator Tool (WCT): https://wct.bl.uk/wct/ > Job scheduling
> Metadata
> Access control
> Harvesting (uses Heritrix)
> QA
29th November 2012 Session 7 - Web archiving at the British Library 14
Quality Assurance
> Placing more emphasis on intellectual content than on the appearance or behaviour of a website
> Use four aspects to define quality: > Completeness of capture: whether the intended content has been captured as
part of the harvest.
> Intellectual content: whether the intellectual content (as opposed to styling and layout) can be replayed in the Access Tool.
> Behaviour: whether the harvested copy can be replayed including the behaviour present on the live site, such as the ability to browse between links interactively.
> Appearance: look and feel of a website.
> Rely on visual comparison, previous harvests & crawl logs
> Recent development of QA module to allow bulk operation, reduce # of clicks and make QA recommendations
29th November 2012 Session 7 - Web archiving at the British Library 15
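As a rough illustration of the "completeness of capture" aspect, a QA script might compute the share of successful fetches from a crawl log. The two-field log format below is a deliberate simplification; real crawl logs carry many more fields.

```python
def completeness(log_lines):
    """Fraction of fetch attempts with a 2xx status code."""
    statuses = [int(line.split()[0]) for line in log_lines]
    ok = sum(1 for s in statuses if 200 <= s < 300)
    return ok / len(statuses)

# Simplified "<status> <url>" lines (invented sample data).
log_lines = [
    "200 http://example.org/",
    "200 http://example.org/about",
    "404 http://example.org/missing",
    "503 http://example.org/busy",
]

print(completeness(log_lines))  # 0.5
```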
Supporting Long-term Preservation
> Storing data in WARCs and metadata in METS > Migrate all legacy data into WARCs
> WCT outputs WARC files
> Submission Information Package (SIP) profiles for selective and domain crawls > Storing descriptive metadata (eg permission information) & technical metadata
(eg crawl log, crawl configurations, virus scan events)
> Ingest archived websites in the Digital Library System (DLS) > Command line tool generates SIPs
> Providing access from the DLS (in future)
29th November 2012 Session 7 - Web archiving at the British Library 16
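For illustration, a minimal WARC/1.0 response record can be serialized by hand as below. This is only a sketch of the record layout; in the workflow above the WARC files are produced by WCT / Heritrix, not hand-rolled.

```python
import uuid
from datetime import datetime, timezone

def warc_response_record(url, payload):
    """Serialize one minimal WARC/1.0 response record: a header block,
    a blank line, the payload, then two CRLFs (a sketch only)."""
    headers = [
        ("WARC-Type", "response"),
        ("WARC-Record-ID", "<urn:uuid:%s>" % uuid.uuid4()),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", url),
        ("Content-Type", "application/http; msgtype=response"),
        ("Content-Length", str(len(payload))),  # length of the block
    ]
    head = "WARC/1.0\r\n" + "".join("%s: %s\r\n" % kv for kv in headers)
    return head.encode("ascii") + b"\r\n" + payload + b"\r\n\r\n"

record = warc_response_record("http://example.org/",
                              b"HTTP/1.1 200 OK\r\n\r\nhello")
print(record.decode()[:8])  # WARC/1.0
```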
Demo (45 minutes)
> Selection and Permission Tool (https://wct.bl.uk/selection/)
> Web Curator Tool (https://wct.bl.uk/wct/)
29th November 2012 Session 7 - Web archiving at the British Library 17
Access
> Currently 3 ways to access the web archive > Online through the UK Web Archive
> Catalogue records (of special collections)
> Keyword search through Primo (corporate resource discovery system)
> Conduct researcher survey to understand requirements
>Analytical access
29th November 2012 Session 7 - Web archiving at the British Library 18
Catalogue Records
29th November 2012 Session 7 - Web archiving at the British Library 19
Keyword search through Primo
29th November 2012 Session 7 - Web archiving at the British Library 20
UK Web Archive
29th November 2012 Session 7 - Web archiving at the British Library 21
> Websites archived by BL and partners since 2004 (65% by BL)
> 122,99 websites, 50,866 instances, 13.6TB WARCs
> Over 100,000 unique visits since 1st April 2012
> Key websites include videos
> Full-text, N-gram, title and URL search
> Browse by subject / special collection, visual browsing
http://www.webarchive.org.uk
Analytical Access
> Shift of focus from the level of single webpages or websites to the entire web archive collection.
> Use web archives as datasets
> Support survey, annotation, contextualisation and visualisation
> Allows discovery of patterns, trends and relationships in inter-linked web pages
> Extracting value from the “haystacks”
> Helps address a number of challenging issues > Scalability
> Accessibility of individual websites
> Components missed by crawlers
29th November 2012 Session 7 - Web archiving at the British Library 22
Visualising the UK Web
> http://www.webarchive.org.uk/ukwa/visualisation > N-gram search
> Links analysis
> Format Analysis
> Geo-index
> http://www.webarchive.org.uk/bluebox/ > uses the Memento aggregate TimeGate hosted by lanl.gov
> “resource not in archive” – who else has it?
> Open data > Dataset and APIs for general use
> Enable broader community to re-use, explore and visualise content of web archive
29th November 2012 Session 7 - Web archiving at the British Library 23
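A Memento lookup, as used by the bluebox service above, asks a TimeGate for the snapshot closest to a date via the Accept-Datetime header (RFC 7089). A sketch of building such a request follows; the aggregator base URL is illustrative, not necessarily the endpoint the BL uses.

```python
import urllib.request

def timegate_request(timegate_base, url, accept_datetime):
    """Build (but do not send) a datetime-negotiated request to a
    Memento TimeGate. The TimeGate resolves it to the memento whose
    capture date is closest to Accept-Datetime."""
    req = urllib.request.Request(timegate_base + url)
    req.add_header("Accept-Datetime", accept_datetime)
    return req

req = timegate_request("http://timetravel.mementoweb.org/timegate/",
                       "http://example.org/",
                       "Thu, 29 Nov 2012 12:00:00 GMT")
print(req.full_url)
```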
Web Archiving Infrastructure
29th November 2012 Session 7 - Web archiving at the British Library 24
Non-print Legal Deposit: Time of change
> Expected to be in place in April 2013 > Access restricted to premises of Legal Deposit Libraries
> Library-wide Legal Deposit Programme to develop capability and end-to-end process
> Web Archiving Team acts as “technical supplier” for a number of projects
> Still need to work out how current (permission-based) selective archiving relates to domain crawl under Legal Deposit > Will we request permissions for online access?
> Will we stop crawling some of the sites we are crawling now and include them in the annual / bi-annual broad domain crawl?
> Who does what?
29th November 2012 Session 7 - Web archiving at the British Library 25
29th November 2012 Session 7 - Web archiving at the British Library 26
Web Archiving Strategy
[Diagram: strategy pyramid with a broad domain crawl at the base, topped by events & key sites and special collections]
> Domain harvesting: broad sweep of .uk domain, once or twice a year
> Events & key sites: events of national interest; sites need to be captured frequently
> Special collections: focused, thematic collections; support priority subjects
Web Archiving Workshop
Leïla Medjkoune, Internet Memory IIPC workshop, BNF, Paris, November 2012
Internet Memory
Internet Memory Foundation (European Archive)
• Established in 2004 in Amsterdam and then Paris
• Mission: Preserve Web content by building a shared WA platform
• Actions: Dissemination, R&D and partnerships with research groups and cultural institutions
• Open Access Collections: UK National Archives & Parliament, PRONI, CERN and The National Library of Ireland
Internet Memory Research
• Spin-off of IM established in June 2011 in Paris
• Missions: Operate large scale or selective crawls & develop new technologies (crawl, access, processing and extraction)
Internet Memory Infrastructure
Green datacenters
Repository and data access for large-scale data management:
• HDFS (Hadoop Distributed File System): distributed, fault-tolerant file system
• HBase: a distributed key-value index
• Convenient model for temporal archives
• MapReduce: a distributed execution framework
• Reliable mechanism to run an analysis job on very large datasets
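The MapReduce model mentioned above can be illustrated in miniature: a map phase emits (key, 1) pairs and a reduce phase sums them per key, here counting archived records per MIME type. The records are invented, and Hadoop would of course distribute both phases over a cluster.

```python
from itertools import groupby
from operator import itemgetter

# Invented (filename, MIME type) records standing in for archive entries.
records = [("a.html", "text/html"), ("b.pdf", "application/pdf"),
           ("c.html", "text/html")]

def map_phase(records):
    """Emit one (mime, 1) pair per record."""
    return [(mime, 1) for _, mime in records]

def reduce_phase(pairs):
    """Group pairs by key and sum the counts per key."""
    pairs = sorted(pairs)  # stand-in for MapReduce's shuffle/sort step
    return {k: sum(v for _, v in g)
            for k, g in groupby(pairs, key=itemgetter(0))}

print(reduce_phase(map_phase(records)))  # {'application/pdf': 1, 'text/html': 2}
```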
Internet Memory
Focused crawling:
• Automated crawls
• Quality focused crawls:
– Video capture, Twitter crawls
– Execution tools to overcome crawling issues on specific content
Large scale crawling:
• In-house developed distributed software
• Scalable crawler (10-50 Bn pages)
• Also designed for focused crawls and complex scoping
Research projects and focus
Web Archiving and Preservation
✓ Living Web Archives (2007-2010)
✓ Archives to Community MEMories (2010-2013)
✓ SCAlable Preservation Environments (2010-2013)
Webscale Data Archiving and Extraction
✓ Living Knowledge (2009-2012)
✓ Longitudinal Analytics of Web Archive data (2010-2013)
✓ TrendMiner (2011-2014)
✓ DOPA (2012-2014)
✓ AnnoMarket (2012-2014)
Web Archiving project?
Organisational challenges:
• Selection/QA: Librarian / Archivist, Quality assurance team, Project manager
• Content capture/services development: Engineers, developers, technicians
• Infrastructure deployment and maintenance: Engineers, System administrators

➥ Web Archiving projects require strong competences and experienced human resources combined with a scalable infrastructure
IM Shared platform
Since its creation in 2004, the Internet Memory Foundation has worked in close collaboration with partner institutions and research groups through European projects:
• To develop methods and tools improving web archiving quality
• To grow its expertise and technological taskforce
Archivethe.Net (1)
• To mutualize knowledge and skills between institutions
• To share internal developments with partner institutions
• To cut services and R&D costs
Archivethe.Net (2)
• Archivethe.net is a shared web archiving platform associated with a service.
• The platform combines new technology and user needs to ensure a good service quality in terms of reliability and efficiency
• For whom? Our current partners, our new partners and … for ourselves
Benefits?
• Integrated web archiving process: from selection to access
• Ongoing technological developments through specific or common R&D projects
• Dedicated and highly skilled team to follow partners’ projects
• Dedicated infrastructure
How does it work? (1)
• ATN is designed as SaaS (Software as a Service)
• The platform offers a friendly user interface to record partners’ web archiving orders
• A pipeline organizes and manages the production
• A QA team ensures the quality of the archive to meet partners’ requirements
How does it work? (2)
Demo
ARCOMEM Archivist tool?
Set up and follow web archive campaigns
• V1: A crawler cockpit and a search and retrieval application
Intelligent content acquisition:
• Seed URLs
• Keywords
• Social web sites APIs
• Social Media Categories (SMC)
SARA
Search and retrieval interface:
• Advanced search functionalities
• Filtering via faceting
• Sorting by content type, social media platform, text/image contextual information (event, entity, ...), etc.
Crawler Cockpit Interface
• Create/select a campaign
• Describe campaign (title, description, comments, etc.)
• Define scope: select criteria such as language, keyword, URL, organisation, etc.
• Select social media categories and APIs to explore
• Set precedence rules for some content types or sources (images, videos, tweets, news, etc.)
Crawler cockpit interface
Demo
ARCOMEM Archivist Tool V2
• Refinement mode: refine crawl parameters to improve crawls
• Improved access application (SARA): preview function so that users can review the results of the campaign set-up
QA for Web Archives?
IM QA is based on:
• Tools internally developed
• Tools developed in the context of European projects
• Automated processes
• Knowledge and skills of our crawl engineers and QA teams
QA Methodology and tools?
Methodology
• Based upon crawler behaviour
• Based on institutions’ needs and policy
• Can be manual (visual) or “automated”
• Can be made at pre- or post-crawl time
Tools
• Open source tools such as plugins, proxies, etc.
• Internally developed tools (fetchers, automated checks, etc.)
• Bug trackers to record information and communicate with partner institutions
QA Methodology and tools?
SCAPE: Scalable Preservation Environments
• Automate visual QA to detect rendering issues
• Improve archives quality and cut QA costs
• Feed “preservation watch and planning” tools
• First tests made on over 400 pairs of URLs
• In-house “Execution platform” under deployment
• Results and processes to be disseminated to IIPC members for feedback!
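The automated visual-QA idea can be caricatured as a pixel-diff between renderings of the live and the archived page: a pair whose difference ratio exceeds a threshold is flagged for review. Everything below (the toy bitmaps, the 0.1 threshold) is illustrative, not SCAPE's actual method, which uses real page renderings.

```python
def diff_ratio(img_a, img_b):
    """Fraction of pixels that differ between two equally sized bitmaps,
    each represented here as a flat list of RGB tuples."""
    assert len(img_a) == len(img_b), "renderings must have equal size"
    differing = sum(1 for a, b in zip(img_a, img_b) if a != b)
    return differing / len(img_a)

# Toy 10-pixel "renderings": the archive lost the two black pixels.
live     = [(255, 255, 255)] * 8 + [(0, 0, 0)] * 2
archived = [(255, 255, 255)] * 10

ratio = diff_ratio(live, archived)
print(ratio, "rendering issue" if ratio > 0.1 else "ok")
```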
Technical challenges
Capture
• Dynamically generated content, deep web, etc.
• Non-HTTP protocols (e.g. RTMP)
• Social media platforms, ...
Access
• Replicate live functionalities and look & feel
• Provide access to very large files

➥ Fast evolving technologies
➥ Ephemeral content
➥ Multiplication of production means
➥ Increase of user-generated content
Technical Solutions
• Execution-based crawling (vs parsing)
• API crawling
• Application-aware crawling
• Bespoke fetchers
➥ Orchestration of tools
ARCOMEM content acquisition
Technical Solutions
Access tool:
• Player replacement: reproduce players’ functionalities
• Adapt access solution to type of content/platforms (generic solutions)
Storage infrastructure / format:
• Enable access to large files
• Fast access to large amounts of content to facilitate search & retrieval
Use cases
• Social media capture and access:
  • YouTube
  • Twitter
  • Flickr, etc.
• Web Archiving related services:
  • Redirection service
  • Memento
  • Legal issues with captured content
  • Full text search
  • etc.