netarkivet
RESAW seminar, Dec 2-3, 2013
Day 1
Who are we today
□ Birgit N. Henriksen, head of digital preservation, KB
□ Bjarne Andersen, head of digital preservation, SB
□ Eld Zierau, developer and researcher, KB
□ Ditte Laursen, curator and researcher, SB
□ Henrik Smith-Sivertsen, researcher, KB
Organization
□ a virtual center (SB/KB – IT development, IT operation, Collection department)
□ steering committee
□ daily manager
□ editorial advisory board
Collection policy
□ Legal deposit law 2005: "Materials made public via electronic communication network"
□ Danish materials
  ■ Websites on the .dk TLD
  ■ Websites aimed at a Danish audience / written in Danish
  ■ Websites about Danish people (Hans Christian Andersen etc.)
  ■ More or less any site of interest to Denmark
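Of the criteria above, only the .dk rule can be checked mechanically from a URL alone. A minimal Python sketch of that single check follows; the helper name `is_dk_domain` is hypothetical and not taken from NetarchiveSuite, and the other criteria (audience, language, subject) would need content analysis or curatorial judgement that is not attempted here.

```python
from urllib.parse import urlsplit


def is_dk_domain(url: str) -> bool:
    """Return True if the URL's host is under the .dk top-level domain.

    Hypothetical helper for illustration only; not part of NetarchiveSuite.
    """
    # hostname is lower-cased by urlsplit; it is None when the URL has no scheme
    host = urlsplit(url).hostname or ""
    # Drop the trailing dot of a fully qualified name such as "dr.dk."
    host = host.rstrip(".")
    return host.endswith(".dk")


print(is_dk_domain("http://www.dr.dk/nyheder"))
print(is_dk_domain("https://example.dk.com/"))
```

Note that a bare string such as `dr.dk` without a scheme has no parsed host and is rejected by this sketch.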
Collection strategies
□ 4 strategies
  ■ 4 annual snapshots (KB)
    □ ensure the broad picture
  ■ Selective harvesting of 80 domains (SB)
    □ ensure coverage of frequently updated websites
  ■ Event harvesting of 2-3 national events per year (KB/SB)
    □ 2013: Teachers’ lockout, International Melodi Grandprix, Danish local elections, Election of the pope (IIPC) …
  ■ Special harvests (KB/SB), e.g. wikileaks, kriseinfo.dk, nyalliance.dk …
Collection strategies
[Figure: the four strategies (snapshot, selective, event, special) plotted on axes of coverage and time]
Access
□ The archive contains sensitive personal data, therefore the entire archive is considered sensitive
  ■ only researchers, including PhD students, can be granted access
    □ if the research concerns sensitive personal data, the Data Protection Agency assesses the application
    □ if not, the library assesses the application
    □ the Copyright Act defines research as being from PhD level and up
    □ the Privacy Act defines research as something with a 'scientific purpose'
□ Netarkivet is working on wider access
  ■ for students and for the general public
  ■ small corpus
Use of the archive
□ Only a handful of active researchers
  ■ no user-friendly way of accessing the archive
  ■ lack of knowledge about the archive
  ■ new kind of data source
□ Research projects – examples
  ■ dr.dk's history 1996-2006
  ■ the history of internet newspapers
  ■ the mediation of art in the network society
  ■ the digital music revolution – the case of Sys Bjerre
  ■ Danish parliamentary elections 2007-2011 …
Technical setup
□ NetarchiveSuite (open source)
□ 44 servers, 260 running Java apps
□ WayBack-machine
□ Batch-jobs
□ Full-text indexing experiments
□ ARC/WARC
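The slide only names the ARC/WARC container formats. As a rough illustration of what a WARC record looks like, the Python sketch below splits the header of one record into its version line and named fields. The sample record is made up for this example and is not taken from the archive; the function name `parse_warc_header` is hypothetical, and the sketch ignores details such as folded header lines and compressed files.

```python
from __future__ import annotations


def parse_warc_header(raw: bytes) -> tuple[str, dict[str, str]]:
    """Split one WARC record header into (version line, named fields)."""
    # The header block ends at the first empty line (CRLF CRLF).
    head, _, _ = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]
    fields: dict[str, str] = {}
    for line in lines[1:]:
        # Split at the first colon only, so URLs and timestamps stay intact.
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return version, fields


# Illustrative record; every value here is an assumption for the example.
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: resource\r\n"
    b"WARC-Target-URI: http://example.dk/\r\n"
    b"WARC-Date: 2013-12-02T00:00:00Z\r\n"
    b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>\r\n"
    b"Content-Type: text/plain\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
)

version, fields = parse_warc_header(sample)
print(version)
for name, value in fields.items():
    print(f"{name} = {value}")
```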
Some numbers
□ Total: 414 TB – 13 billion objects
  ■ Snapshots: 353 TB
  ■ Selective: 47 TB
  ■ Events: 13 TB
□ One snapshot: approx. 30 TB (2006: 9 TB)
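To put these figures in proportion, the short Python sketch below computes each strategy's share of the stated total and the growth of a single snapshot since 2006. All inputs are the numbers on the slide; the variable names are illustrative. The three listed parts add up to 1 TB less than the stated total, and the slide does not say whether that is rounding or material counted elsewhere.

```python
# Figures in TB, copied from the slide.
total_tb = 414
by_strategy_tb = {"Snapshots": 353, "Selective": 47, "Events": 13}

# Share of the stated total per strategy, in percent.
shares = {name: 100 * tb / total_tb for name, tb in by_strategy_tb.items()}

# Sum of the three listed parts, for comparison with the stated total.
listed_tb = sum(by_strategy_tb.values())

# Growth factor of one snapshot: approx. 30 TB now versus 9 TB in 2006.
snapshot_growth = 30 / 9

for name, pct in shares.items():
    print(f"{name}: {pct:.1f} % of total")
print(f"Listed parts: {listed_tb} TB of {total_tb} TB")
print(f"Snapshot growth since 2006: {snapshot_growth:.1f}x")
```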
Current challenges
□ wider access
□ better access (free text search)
□ inclusion of older net collections
□ collection of websites with restricted access
□ advanced websites, e.g. with sound/video/live interaction (chat, virtual worlds …)
□ electronic communication networks ≠ the web
□ long-term preservation
□ documentation
2013-2014
□ Tools
  ■ search - free text indexes
  ■ harvesting - the use of Heritrix3 and Live Archiving proxy
□ Infrastructure
  ■ web archives as part of a research infrastructure
  ■ access to archived material using Persistent Identifiers
□ Archiving methods
  ■ capturing online games
  ■ automatic methods to locate relevant Danish web materials outside the Danish TLD .dk
Ongoing activities related to RESAW’s topics
□ API improvement / so-called service layer
□ corpus building
□ documentation
□ full-text search
□ statistics
□ legal aspects (e.g. broader access, data mining policy)
What is the RESAW project in 10 years?
□ a very strong partner to IIPC
□ common infrastructure across borders (ERIC / ESFRI status)
□ coordinated European collection building