Presented by Vince Smith at the pro-iBiosphere meeting in Berlin, 21-23 May 2013.
Digitised collections: Toward a digital strategy for the NHM, London
Vince Smith
Workshop 3, pro-iBiosphere, Berlin, 23 May 2013
Digital Ambition: NHM Science Strategy 2013-2017
A New Voyage of Discovery
Three Focal Areas
1. Scientific discovery
2. Scientific Infrastructure
3. Scientific engagement
Five Challenges
1. The Digital NHM
2. Origins, evolution & futures
3. Biodiversity discovery
4. Natural resources & hazards
5. Science, society & skills
Resources & funding
Measuring success
data.nhm.ac.uk/globe/
Digital Ambition: NHM Science Strategy 2013-2017
Scientific impact: 1,000 papers in leading journals
Digital access: 20M specimens available digitally
Engagement: 1M face-to-face engagements
Collections: Globally important collections
Diagnostic tools: Diagnostic tools for key groups
Deep time: Timeline of key transitions
Science & society: Articulate the role of science
UK network: Act as a national museum
Earth sciences: Earth Sciences Centre
Funding: £10M for Five Challenge Areas
Overview
1. Existing digital content, sources & formats
• Research data
• Collections data
2. Making collections data digital
• Priorities
• Protocols & pathfinder activities
• Crowdsourcing transcription
3. Aggregation & delivery
• The NHM data portal
• Data visualisation, data sub-portals
4. Identifiers, links & interoperability
• DataCite DOIs
• Third party aggregators
• Portal APIs, download & analytical functions
5. Timeline & constraints
• Data policies
• Next steps
Digitisation activities
Data portal
NHM Research Outputs
• 49 papers, 45 available online (4 print only or behind paywalls)
• 9 had supplementary data files
• 39 papers with tables, charts & other data
  o >1000 sequences
  o 826 figures
  o 76 tables
  o 1 genome
• No collective view of these data (37 journals)
• No consistent way of citing NHM data
• No consistent mechanism to access data
• Effectively invisible at the institutional level
One Month of NHM Science group papers
Data via Carolyn Lowry e-mail, 13th Feb. 2013
1. Existing digital content
NHM Collections Outputs: data
• Huge investment in NHM collection management system
• ≠ Imaging
• Most research projects need spatio-temporal records
• Different requirements for different purposes
NHM COLLECTIONS April 2013
Collection area   Est. no. of specimens   No. records in database   % collection in database   % records with location info
Botany            6,000,000               626,000                   ~10%                       96%
Entomology        32,000,000              316,000                   <1%                        68%
Mineralogy        500,000                 422,000                   ~95%                       79%
Palaeontology     9,000,000               342,000                   ~3%                        89%
Zoology           28,000,000              1,131,000                 ~60% (via lots)            69%
TOTAL             76,000,000              2,837,000                 3% (23%)
1. Existing digital content
• Many, many imaging projects (highly fragmented)
• Circa 40 TB for major collections (excluding library)
• 120,000 images in KE EMu (many others not in KE!)
• Circa 250,000 via NHM Photo unit (limited metadata)
Collection area   No. image files   Disk space (GB)
Botany            140,133           35,302
Entomology        529,106           3,172
Mineralogy        14,000            6
Palaeontology     122,548           993
Zoology           12,975            1,598
TOTAL             818,762           41,070
NHM Collections Outputs: images
1. Existing digital content
Current data formats
• Darwin Core Archive (DwCA) & extensions (collections)
• Circa 2020 fields mapped to 50 fields to generate archive
• Images mainly JPG & TIFF
• Metadata using EML & Genesis II standard
• Research data files in a wide array of formats (blob files)
Nexus (character data and Newick formatted phylogenetic trees)
Non-NHM specimen lists (as Darwin Core Archive files)
PhyloXML (an XML standard for representing phylogenetic trees)
Output from the Imaging and Analysis Centre (Micro CT datafile formats)
NeXML (an XML standard for representing character data)
Collections of images from digitisation projects (as a collection of links or a zipped archive)
Sequence trace files (.scf sequence chromatogram format files)
Environmental sequence files
Taxon checklists (as Darwin Core Archive files)
Collection level descriptions
1. Existing digital content
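The mapping of many internal database fields onto a much smaller set of Darwin Core fields can be sketched as below. This is only an illustration: the internal column names (reg_number, determination and so on) are hypothetical and are not taken from KE EMu, while the target names are standard Darwin Core terms. The sketch writes only the core data file; a complete Darwin Core Archive also needs a meta.xml descriptor, which is not shown.

```python
import csv
import io

# Hypothetical internal column names (NOT real KE EMu fields) mapped to
# standard Darwin Core terms.
FIELD_MAP = {
    "reg_number": "catalogNumber",
    "determination": "scientificName",
    "collection_date": "eventDate",
    "lat": "decimalLatitude",
    "lon": "decimalLongitude",
    "country_name": "country",
}


def to_darwin_core(record):
    """Keep only mapped fields, renamed to Darwin Core terms; missing values become ''."""
    return {dwc: record.get(src, "") for src, dwc in FIELD_MAP.items()}


def write_core_csv(records):
    """Return the core data file of an archive as CSV text (header plus one row per record)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(FIELD_MAP.values()))
    writer.writeheader()
    for record in records:
        writer.writerow(to_darwin_core(record))
    return buf.getvalue()
```

Unmapped internal fields are dropped, which mirrors the reduction from a large internal schema to a compact exchange format.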
• Priorities linked to science strategic priorities
  o Disease, sustainability, crop wild relatives, pests etc.
• Tiered approach, different needs for different collections
• Low hanging fruit (2D objects e.g. herb. sheets & slides)
• Linked to strategic collaborations & financial opportunities
  o e.g. RBG Kew, RBG Edinburgh, Nat. Mus. Wales, Hunterian etc.
• Priorities dictate order – we plan to do it all (eventually)!
2. Making collections data digital
Digitisation Priorities
• Exercise to develop digitisation protocols across collection
  o Slides, spirit, herbarium sheets, pinned, multispecimen/drawer
• Protocols mapped to high level collections descriptions
• Workflow software supporting rapid digitisation (to KE & DAMS)
• Pathfinder activities for less well understood projects
  o Entomological dry material (30 M specimens)
    - iCollections (specimen-by-specimen) approach
    - SatScan (drawer level multi-specimen) approach
2. Making collections data digital
Digitisation Protocols
• Specimen-by-specimen, traditional, dedicated 6 person team
• Digitising British Isles Lepidoptera collection
• ~500,000 specimens, 5,000 drawers
• Re-curation & specimen imaging
• Complete label information including georeferencing
• For use in Climate Change initiative
2. Making collections data digital
iCollections Initiative
• 4-6 people over 3 years, work broken into small tasks by teams
• Average imaging rate 163 specimens per person per day
• Averaging >3 min per specimen (prep., imaging & databasing)
• >£1/specimen
• BUT: 6,800 person years for the entire collection
2. Making collections data digital
iCollections Initiative
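As a rough check on the iCollections figures above, the sketch below turns an imaging rate into person-years of effort. The 220 working days per year is an assumption made for illustration and does not come from the slides.

```python
def person_years(specimens, rate_per_person_day, working_days_per_year=220):
    """Estimate effort in person-years for specimen-by-specimen digitisation.

    working_days_per_year defaults to 220, an assumed value.
    """
    if rate_per_person_day <= 0 or working_days_per_year <= 0:
        raise ValueError("rate and working days must be positive")
    return specimens / (rate_per_person_day * working_days_per_year)


# British Isles Lepidoptera: ~500,000 specimens at 163 specimens per person per day.
lepidoptera_effort = person_years(500_000, 163)

# 163 specimens at 3 minutes each gives the minimum minutes worked per person per day.
minutes_per_day = 163 * 3
```

Under that assumption the Lepidoptera collection comes out at roughly 14 person-years, which sits within the 12 to 18 person-years implied by 4-6 people over 3 years. At just over 3 minutes per specimen, 163 specimens is about a full working day (489 minutes).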
• Drawer level digitisation, segmented down to specimens
• Very fast imaging, no specimen handling, just one view
• No label information, but some data extracted from drawer
• Specimens retrospectively cropped & annotated
2. Making collections data digital
SatScan Initiative
• Dedicated specimen-level rapid annotation software
2. Making collections data digital
SatScan Initiative
Crowdsourcing & Transcription
• We have a massive transcription problem
• Experiments via Notes-from-Nature (a Zooniverse project)
• Transcribing the NHM ornithological accession registers
• Wikimedian in Residence (Wikisource transcription)
• 4-month project, including specimen label transcription
2. Making collections data digital
data.nhm.ac.uk
• A focus for deposition and discovery of major NHM data sets
• Promote innovation through re-use of museum data
• Open Access, at a dedicated subdomain of the NHM website
• Started Jan. 2013 (3 years), consultation throughout 2012
NHM Data Portal
Functional components of the data portal
3. Aggregation & Delivery
Search
Datasets matching criteria
Individual dataset
Results
Browse & search criteria
Advanced display options
• Dataset registry, for dataset discovery, modeled on data.gov.uk
• Uses CKAN, an open-source data portal software platform
3. Aggregation & Delivery
NHM Data Portal: Registry
Metadata about the dataset
Name
Geographic scope
Tags
“Social”
Authors
License
Download
Developer tools
Technical info. (extracted from data file)
• Dataset metadata discovery
3. Aggregation & Delivery
NHM Data Portal: Registry
• Simple datasets upload workflow for non-collections data
1. Name the dataset
2. Upload / link the data file
3. Describe the data file
4. Theme & tag
5. Add additional resources
6. Temporal coverage
7. Geographic coverage
8. Save & finish
3. Aggregation & Delivery
NHM Data Portal: Dataset upload
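The upload steps listed above could be modelled as a simple record that reports which steps are still incomplete. This is a hypothetical sketch: the class name, field names and the choice of which steps are mandatory (here steps 1 to 3) are assumptions for illustration, not the portal's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetSubmission:
    """Hypothetical record mirroring the upload steps; not the portal's real schema."""
    name: str = ""                                             # step 1
    data_file: str = ""                                        # step 2: uploaded path or link
    file_description: str = ""                                 # step 3
    themes_and_tags: list = field(default_factory=list)        # step 4
    additional_resources: list = field(default_factory=list)   # step 5
    temporal_coverage: str = ""                                # step 6
    geographic_coverage: str = ""                              # step 7

    def missing_steps(self):
        """List the steps assumed mandatory (1 to 3) that are still empty."""
        required = [
            ("1. Name the dataset", self.name),
            ("2. Upload / link the data file", self.data_file),
            ("3. Describe the data file", self.file_description),
        ]
        return [label for label, value in required if not value]

    def ready_to_save(self):
        """Step 8 (save & finish) is allowed once no mandatory step is missing."""
        return not self.missing_steps()
```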
Zoomable map
Applied filters
Toggle map, table & stats views
Search, download & display options
No. records
No. Georef. records
• Dedicated interface to visualise & explore major datasets
• Focused on collections data, based on Canadensys.net, uses CartoDB
3. Aggregation & Delivery
NHM Data Portal: Data visualisation
Collections views
Statistical summary
Specimen record views
Data field mappings
Summary preview
Full record
Tables
Download
3. Aggregation & Delivery
NHM Data Portal: Data visualisation
• Using DataCite DOIs in the data portal
• Datasets (2014) & specimens (2015)
• Unique, persistent and resolvable identifiers
• Easy to cite, alias existing specimen identifiers
• Conform to minimum DataCite requirements
• Landing page, min. metadata standard, fee, min. 10 yr. contract, DOI (pre)fixes
NHM Data Portal & DataCite
Breaks us out of the biodiversity data silo
4. Identifiers, links & interoperability
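How a DOI can alias an existing specimen identifier is sketched below. The prefix 10.12345 and the specimen identifier are made up for illustration, and the prefix check is a simplification (it accepts only "10." followed by 4 to 9 digits). Resolution through https://doi.org/ is the standard DOI proxy.

```python
import re
from urllib.parse import quote


def make_doi(prefix, suffix):
    """Join a registrant prefix and a local identifier into a DOI name.

    Simplified validation: the prefix must look like '10.' plus 4-9 digits,
    and the suffix must be non-empty with no whitespace.
    """
    if not re.fullmatch(r"10\.\d{4,9}", prefix):
        raise ValueError(f"unexpected DOI prefix: {prefix!r}")
    if not suffix or any(ch.isspace() for ch in suffix):
        raise ValueError("suffix must be non-empty and contain no whitespace")
    return f"{prefix}/{suffix}"


def resolver_url(doi):
    """Return the URL at which the DOI proxy resolves the identifier."""
    return "https://doi.org/" + quote(doi, safe="/")


# Hypothetical prefix and specimen identifier, for illustration only.
example_doi = make_doi("10.12345", "BMNH-E-123456")
example_url = resolver_url(example_doi)
```

Keeping the existing specimen identifier as the DOI suffix means the new identifier stays recognisable to curators while becoming resolvable and citable.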
• Content within the NHM data portal will be highly accessible
  o Collections harvestable (e.g. by GBIF as a DwCA)
  o Download DwCAs on any search facet
  o Wide set of APIs available for datasets (part of CKAN)
• Sub-portals (selected content, themed by topic)
  o e.g. Virtual Herbarium, NHM Science initiatives, geographic regions
• Analytical interface planned for 2015 (but not specified)
Data Aggregation, APIs & download
4. Identifiers, links & interoperability
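Since the dataset APIs come with CKAN, a dataset search might look like the sketch below. It assumes the portal exposes CKAN's standard Action API (package_search) at the root of data.nhm.ac.uk, which the slides do not confirm. The code only builds the request URL and parses a response body, so it runs without network access.

```python
import json
from urllib.parse import urlencode

# Assumption: the portal serves CKAN's Action API at this base URL.
PORTAL = "https://data.nhm.ac.uk"


def package_search_url(query, rows=10, base=PORTAL):
    """Build a CKAN package_search URL for a free-text dataset query."""
    return f"{base}/api/3/action/package_search?" + urlencode({"q": query, "rows": rows})


def dataset_titles(response_text):
    """Extract dataset titles from a CKAN package_search JSON response."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("CKAN reported a failed request")
    return [d.get("title") or d.get("name", "") for d in payload["result"]["results"]]
```

The response text would normally come from an HTTP GET on the generated URL, for example with urllib.request.urlopen.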
• Data portal will be “open-by-default”
• Ambiguity in what this means & top down schizophrenia
• Conflicting mandates on open access & revenue opportunities
• Lots of guidance available, will use to form a common policy
• A cross institutional policy would be useful (but challenging)
Data Policies & Next Steps
5. Timeline & constraints
NHM Data portal timeline
[Timeline figure, Jan 2013 to Jan 2016. Milestones: project start; private alpha; stable public beta; full release & sub-portals. Phases: requirements & dataset discovery; internal feedback, data visualisation & DOIs; subportals & analytical tools.]
Next 6 months
• More documentation (PID and Tech Spec)
• Consultation and advocacy (internal and external)
• Data mapping from KE EMu and software testing
• Development
  o Website wireframe design
  o Drafting data visualisation subcontract
  o Construction of private alpha release
5. Timeline & constraints
Data Policies & Next Steps
NHM digitisation timeline
[Timeline figure, Jan 2013 to 2018. Labels: project start; path-finding & programme development; private alpha; stable public beta; major funding applications & a new gallery?; Digitise… Digitise… Digitise…; 20 Million!!]
Next 6 months
• Initial conclusions from path-finding digitisation activities
• Initial grant funding bids developed
• Advocacy, outreach & development of a digitisation “programme”
• Investigate possibilities for gallery development
• Develop crowdsourcing strategy
5. Timeline & constraints
Data Policies & Next Steps
QUESTIONS
Digitisation Priorities
• Priorities linked to science strategic priorities
  o Disease, sustainability, crop wild relatives, pests etc.
[Bar chart: Crop Wild Relatives (accepted taxa only), by plant family: Poaceae, Brassicaceae, Solanaceae, Rubiaceae, Anacardiaceae, Arecaceae, Malvaceae, Cucurbitaceae, Grossulariaceae, Aquifoliaceae, Juglandaceae, Apiaceae, Asparagaceae, Pedaliaceae, Lauraceae, Convolvulaceae, Oleaceae, Bromeliaceae, Lecythidaceae; value axis 0 to 700]
2. Making collections data digital
• Priorities linked to science strategic priorities
  o Disease, sustainability, crop wild relatives, pests etc.
• Tiered approach, different needs for different collections
Nick Poole, UK Collections Trust
2. Making collections data digital
Digitisation Priorities