Voyage to the new world
Dr. Adam Farquhar
Head of Digital Library Technology
Head of Digital Scholarship (from April 2011)
The British Library
Chair, Open Planets Foundation
President, DataCite
NFAIS 2011, Philadelphia 1-Mar-2011
The British Library: some facts and figures
Helping people
advance knowledge to
enrich lives
GIA Funding 08/09:
£94.8m operational,
£12m capital
Other funding secured 07/08:
c.£33m
National library of the UK.
Serves researchers, business,
libraries, education & the general
public
Collection includes over 2m
sound recordings, 5m reports, theses
and conference papers, the world’s
largest patents collection (c.50m)
3 main sites in London and
Yorkshire. Circa 2,000 staff
Business and IP Centre:
Providing inspiration, and enabling
protection of creative capital and
business development
Generates value to the UK
economy each year of 4.4 times
public funding
Collection fills over 600km of
shelving and grows at 11km per year
70 TB of digital material through
voluntary deposit
British Library Act 1972
National centre for reference, study, bibliographical and other information services, in relation both to scientific and
technological matters, and to the humanities.
Science and Innovation Investment Framework 2004-2014, H.M. Treasury (2004)
UK research base must have ready and efficient access to information of all kinds – such as experimental data sets,
journals, theses, conference proceedings and patents. This is the life blood of research and innovation.
The largest document supply
service in the world. Secure
e-delivery and ‘just in time’
digitisation enables desktop
delivery within 2 hours
The Expanding Digital Universe
Content holding
organisations project a
twenty-five-fold rise, from a
median of less than 20TB
now to over 500TB in 2019
IDC projects
Doubling every 18
months
700 exabytes in 2010
source:
“Are You Ready? Assessing European Organisations’
Preparations for Digital Preservation”
Planets Deliverable D7b, November 2009
source:
“The Diverse and Exploding Digital Universe”
IDC White Paper, March 2008
http://www.emc.com/collateral/analyst-reports/diverse-exploding-digital-universe.pdf
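The two projections above can be put on a common footing with a little arithmetic. The sketch below is only illustrative: it assumes "now" means 2009 (the year of the Planets survey), so the survey figure spans ten years, and the function names are invented for this example. Under that assumption the survey's twenty-five-fold rise works out to a doubling time of roughly two years, while doubling every 18 months compounds to about a hundredfold over the same period.

```python
import math

def doubling_time_years(growth_factor: float, years: float) -> float:
    """Doubling time implied by growing `growth_factor`-fold over `years`."""
    return years / math.log2(growth_factor)

def growth_over(doubling_months: float, years: float) -> float:
    """Total growth after `years` when volume doubles every `doubling_months`."""
    return 2 ** (years * 12 / doubling_months)

# Assumption: "now" is 2009, the year of the Planets survey, so the span is 10 years.
SPAN_YEARS = 10

# Planets survey: median holdings rise from 20 TB to 500 TB.
survey_doubling = doubling_time_years(500 / 20, SPAN_YEARS)

# IDC: the digital universe doubles every 18 months.
idc_growth = growth_over(18, SPAN_YEARS)

print(f"Survey projection implies doubling every {survey_doubling:.1f} years")
print(f"Doubling every 18 months gives about {idc_growth:.0f}x in {SPAN_YEARS} years")
```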
Information storage – 196 BC
Carrier
Solid material (granodiorite)
114cm x 72cm x 28cm
760 kg
Encoding
Human-readable characters
Three language scripts (hieroglyphic, demotic, ancient Greek)
Access
Human, capable of reading (at least) one of the scripts
Information storage – 2010 AD
Carrier
Hardware
Storage medium (hard disk, optical disc, …)
Rendering environment (display, printer, …)
Software
Operating system
Applications (browser, editor, …)
Encoding
Machine-readable: Binary data
Human-readable: Characters
Access
Human, capable of reading
We need software
We need rendering facilities
We need knowledge of how to operate the hardware and software
Losing just one bit can be catastrophic
source: Manfred Thaller, UzK
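A rough way to see the effect is to flip each bit of a compressed file in turn and check whether the content can still be recovered. The sketch below is not from the talk: it uses zlib compression as a stand-in for any encoded format, and the sample text and helper names are assumptions chosen for illustration.

```python
import zlib

def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of `data` with exactly one bit inverted."""
    damaged = bytearray(data)
    damaged[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(damaged)

def survives(stream: bytes, original: bytes) -> bool:
    """True if `stream` still decompresses to exactly the original bytes."""
    try:
        return zlib.decompress(stream) == original
    except zlib.error:
        return False

# Sample payload (illustrative only).
original = b"Helping people advance knowledge to enrich lives. " * 20
compressed = zlib.compress(original)

# Try every possible single-bit error in the compressed stream.
total_bits = len(compressed) * 8
survivors = sum(
    survives(flip_bit(compressed, i), original) for i in range(total_bits)
)
fatal = total_bits - survivors

print(f"{fatal} of {total_bits} single-bit flips make the payload unrecoverable")
```

Compressed and encoded formats concentrate meaning in every bit, which is why a single error can cost the whole object rather than one character.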
Content Preservation: Understand a digital file
SOI
APP0 JFIF 1.2
APP13 IPTC
APP2 ICC
DQT
SOF0 183x512
DRI
DHT
SOS
ECS0
RST0
ECS1
RST1
ECS2…
ffd8ffe000104a46494600010201
008300830000ffed0fb050686f74
6f73686f7020332e30003842494d
03e90a5072696e7420496e666f00
0000007800000000004800480000
000002f40240ffeeffee03060252
0347052803fc0002000000480048
0000000002d80228000100000064
000000010003030300000001270f
0001000100000000000000000000
0000600800190190000000000000
0000000000000000000000000000
0000000000000000000000003842
494d03ed0a5265736f6c7574696f
6e0000000010008313a3000200…
source: Stephen Abrams, California Digital Library
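The link between the hex dump and the segment list can be sketched in a few lines. The example below walks the first 28 bytes of the dump and reports each marker with its declared length. It is a minimal illustration and not a full JPEG parser: the marker table covers only the segments named on the slide, and the function name is an assumption.

```python
import struct

# Marker codes for the segments named on the slide; others are shown as hex.
MARKER_NAMES = {
    0xD8: "SOI", 0xE0: "APP0", 0xE2: "APP2", 0xED: "APP13", 0xDB: "DQT",
    0xC0: "SOF0", 0xDD: "DRI", 0xC4: "DHT", 0xDA: "SOS",
}

def list_segments(data: bytes):
    """Return (name, declared_length) for each marker segment found in `data`."""
    segments = []
    pos = 0
    while pos + 2 <= len(data) and data[pos] == 0xFF:
        code = data[pos + 1]
        name = MARKER_NAMES.get(code, f"0x{code:02X}")
        if code == 0xD8:
            # Start Of Image is a bare marker with no length field.
            segments.append((name, 0))
            pos += 2
            continue
        if pos + 4 > len(data):
            break  # truncated before the length field
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        segments.append((name, length))
        if code == 0xDA:
            break  # entropy-coded data follows Start Of Scan
        pos += 2 + length  # the length field counts itself but not the marker
    return segments

# First 28 bytes of the hex dump shown above.
header = bytes.fromhex(
    "ffd8ffe000104a46494600010201"
    "008300830000ffed0fb050686f74"
)
segments = list_segments(header)
identifier = header[6:10]

for name, length in segments:
    print(name, length)
```

Even this short prefix shows that the bytes only become content once the format is known: `4a464946` is the ASCII text `JFIF`, and the APP13 segment declares a length far beyond the bytes shown.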
Working together to preserve content in the UK
StP
BSp
NLW
NLS
Ox
Ca
TCD
JANET
Access Gateway
Storage Node
BL Digital Library System
Resilient
No single point of failure
Geographically distributed
Multi-site
Multi-organisation
Recoverable
Background integrity
checking
Self-healing from alternate
sites
Authentic
Time-stamped digital
signatures
Tamper-resistant/evident
hardware
Scalable
Underlying design principle
Scale horizontally /
vertically
Use commodity storage
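Background integrity checking and self-healing from alternate sites can be illustrated with a small fixity check. This is a sketch under stated assumptions, not a description of the BL Digital Library System: the site names, the in-memory replica store and the function names are hypothetical, and SHA-256 is chosen here simply as a common fixity digest.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Fixity value for a stored object."""
    return hashlib.sha256(data).hexdigest()

def check_and_heal(replicas: dict, expected_digest: str) -> list:
    """Overwrite damaged replicas with a verified copy; return repaired sites."""
    healthy = [
        site for site, data in replicas.items()
        if sha256_hex(data) == expected_digest
    ]
    if not healthy:
        raise RuntimeError("no intact replica available at any site")
    good_copy = replicas[healthy[0]]

    repaired = []
    for site in list(replicas):
        if sha256_hex(replicas[site]) != expected_digest:
            replicas[site] = good_copy  # heal from an alternate site
            repaired.append(site)
    return repaired

# Hypothetical three-site store; the digest is recorded at ingest time.
content = b"digitised newspaper page"
expected = sha256_hex(content)
replicas = {
    "site_a": content,
    "site_b": b"digitised newspaper pagf",  # simulated silent corruption
    "site_c": content,
}

repaired = check_and_heal(replicas, expected)
print("repaired:", repaired)
```

Keeping the expected digest separate from the copies is what lets the check tell which replica is the damaged one, rather than only that the replicas disagree.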
Working together to preserve content
The Open Planets
Foundation helps members
meet their digital preservation
challenges
OPF Stewardship:
Assure development & maintenance of a comprehensive DP suite
Leverage R&D results
Help to mature prototypes & demonstrators
Support Open Source initiatives
Partner with university sector to develop DP curriculum
Promote commercial partnerships to grow the DP marketplace
Charter Members
Affiliate Members
Partner Initiatives
Technology & Service Providers
OPF
SCAlable Preservation Environments
SCAPE addresses critical challenges that have been identified by the Commission and key stakeholders
SCAPE results will enable organisations to
Keep pace with the rapid growth of digital collections
Use a highly scalable architecture and reduce human intervention
Ensure that their preservation actions have been effective
Fulfill their increasing regulatory obligations
Reduce the risks to their digital material
SCAPE will
Transform the manner in which we safeguard digital content
Increase confidence in the long-term accessibility and integrity of collections
At the European, national, and organisational levels
Change how we design and build digital repositories
SCAPE started 1-Feb-2011
Good science relies on good data
Currently…
No effective way to link
between articles and
datasets
No widely used method to
identify datasets
No widely used method to
cite datasets
Articles
Underlying
data
As a result…
Datasets are:
Difficult to discover
Difficult to access
In danger of being lost
Sharing data on request is not effective
Wicherts et al (2006) requested data from 141 articles in
psychology
“6 months later, after … 400 emails, [sending] detailed
descriptions of our study aims, approvals of our ethical
committee, signed assurances not to share data with
others, and even our full resumes…” only 27% of authors
complied
Campbell et al. (2002) surveyed geneticists
Most frequent reason for withholding data was the effort
required to share it (80%).
28% were unable to confirm published findings because of
data withholding
Modified from: Todd Vision, U North Carolina
DataCite
Support researchers by enabling them to locate, identify, and cite research datasets with confidence
Support data centres by providing persistent identifiers for datasets, workflows and standards for data publication
Support publishers by enabling research articles to be linked to the underlying data
DataCite : Data Centres :: CrossRef : Publishers
Digital Object Identifiers (DOIs) offer a solution
A DOI is a unique identifier, similar in concept to an ISBN.
Most widely used identifier for scientific articles
Researchers, authors, publishers know how to use them
Put datasets on the same playing field as articles
DataCite
Dataset
Yancheva et al (2007). Analyses
on sediment of Lake Maar.
PANGAEA.
doi:10.1594/PANGAEA.587840
URLs are not persistent
(e.g. Wren JD: URL decay in MEDLINE- a 4-year
follow-up study. Bioinformatics. 2008, Jun
1;24(11):1381-5).
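In practice a DOI is cited as a name and resolved through a central resolver, so the citation stays valid even if the dataset moves. The sketch below turns the DOI from the example above into a resolvable link and a citation string. The helper names are assumptions, `https://doi.org/` is the public DOI resolver, and the citation layout simply mirrors the example on this slide.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"

def doi_to_url(doi: str) -> str:
    """Turn 'doi:10.xxxx/suffix' (or a bare DOI name) into a resolver URL."""
    name = doi.strip()
    if name.lower().startswith("doi:"):
        name = name[4:]
    # Keep the prefix/suffix slash; percent-encode anything unsafe in a URL.
    return DOI_RESOLVER + quote(name, safe="/")

def format_citation(creators: str, year: int, title: str,
                    publisher: str, doi: str) -> str:
    """Dataset citation in the style of the example on the slide."""
    return f"{creators} ({year}). {title}. {publisher}. {doi_to_url(doi)}"

url = doi_to_url("doi:10.1594/PANGAEA.587840")
citation = format_citation(
    "Yancheva et al", 2007, "Analyses on sediment of Lake Maar",
    "PANGAEA", "doi:10.1594/PANGAEA.587840",
)
print(citation)
```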
Thanks for your attention!
Infrastructure for digital
content
Save the bits
Save the content
Work together through
OPF
Infrastructure for
scholarly communication
Citation is key
DataCite
More information
www.bl.uk/dp
www.bl.uk/datasets
www.openplanetsfoundation.org
www.datacite.org
Adam {.} Farquhar {@} BL.UK