Publication of facility investigations
Brian Matthews
Scientific Information Group, Scientific Computing Department
STFC Rutherford Appleton Laboratory
brian.matthews@stfc.ac.uk
Scientific Computing develops and operates computing infrastructure: HPC, PB datastore, software, data management…
Funds and operates large-scale science for the UK research base: physics, astronomy, chemistry, materials
ESO: Alma Array
STFC
Major Science Facilities
Big Science: Particle Physics (exploring the very small), Space Science (exploring the very large)
Small Science: understanding the world around us at a molecular level. Lasers, Neutron & Light Source (ISIS & Diamond)
Facilities Support
Big Facilities for Small Science
Diamond
ISIS
CLF
Science at STFC Facilities
data
Computing, Analysis, Modelling
knowledge
beam, sample, imaging
detector
Neutrons and photons provide complementary views of matter:
Photons “see” electric charge – high atomic number nuclei
Neutrons “see” nucleons – especially hydrogen atoms
The science we do - Structure of materials
Fitting experimental data to model
Bioactive glass for bone growth
Structure of cholesterol in crude oil
Hydrogen storage for zero emission vehicles
Magnetic moments in electronic storage
• ~30,000 user visitors each year in Europe:
– physics, chemistry, biology, medicine
– energy, environmental, materials, culture
– pharmaceuticals, petrochemicals, microelectronics
Longitudinal strain in aircraft wing
Diffraction pattern from sample
Visit facility on research campus
Place sample in beam
• Billions of € of investment
– c. £400M for DLS
– plus running costs
• Over 5,000 high-impact publications per year in Europe
– But so far no integrated data repositories
– Lacking sustainability & traceability
• Similar architecture used for DLS
• Scaling is a constant concern
• Data rates keep increasing
• 70 TB per month and rising
• Tailored ICAT
• Reengineered StorageD
Diamond e-Infrastructure
(Architecture diagram. The labels recoverable from it are listed below.)

Components:
duodesk; DLS Proposal Entry (http://duo.diamond.ac.uk/propman); DUO Desk Applications; ICAT; IDMAN; CDR; SSO-MyProxy; vintela; GDA; XML Ingest; StorageD; SRB Scripts; b1-storage1; Atlas Data Store; DATA PORTAL/ICAT API; Active Directory; KDC; CA; DArc; lustre; Local Beamline lustre Client; EDNA MX/DNA; ISPyB; iKitten Databases

Beamline labels:
I12, I03, I02, B22, B18, B16, I22, I20, I19, I18, I16, I15, I11, I07, I06, I04, I24, B23

Scheduled jobs:
JOB: icat_dls_propagation; ON: orisa.icatdls; FREQUENCY: 1 hour; DB LINK: duodesk.dl.ac.uk; ACTION: Pull data from DuoDesk to ICAT
JOB: icat_dls_propogation; ON: orisa.icatdls; FREQUENCY: 30 mins; ACTION: Load lookup data into ICAT
JOB: SSO - SYNCRONISATION PROD; ON: orisa.sso; FREQUENCY: daily at 08:45; DB LINK: cdr.esc.rl.ac.uk; ACTION: Pull data from CDR to IDMAN
JOB: cron script; ON: sso-myproxy; FREQUENCY: daily at 09:18; ACTION: Pull data from IDMAN to gridmap file (mapping FedID to DN)
JOB: icatdls33_propagation; ON: orisa.icatdls33; FREQUENCY: 30 mins; ACTION: Push data to iKittens

Other labels:
External lookup data: /home/oracle/external_tables/dls33; valid user check; Certificate; Kerberos Token; FedID/Password; Check FedID/Password; Kerberos Authentication; SRB containers; Transfer data to tape; User; SQL; Scommands; Drop file; MX: strategy for data collection; data; UNIX Group created for Visit/Users; File to linux administrator; Picture location; Federal ID
Proposals
Once awarded beamtime at ISIS, an entry will be created in ICAT that describes your proposed experiment.
Experiment
Data collected from your experiment will be indexed by ICAT (with additional experimental conditions) and made available to your experimental team
Analysed Data
You will have the capability to upload any desired analysed data and associate it with your experiments.
Publication
Using ICAT you will also be able to associate publications to your experiment and even reference data from your publications.
Example ISIS Proposal
β-lactoglobulin protein interfacial structure
GEM – High intensity, high resolution neutron diffractometer
H2-(zeolite) vibrational frequencies vs polarising potential of cations
Central Facility
• Secure access to user’s data
• Flexible data searching
• Scalable and extensible architecture
• Integration with analysis tools
• Access to high-performance resources
• Linking to other scientific outputs
• Data policy aware
http://code.google.com/p/icatproject/
Investigation
Publication
Keyword
Topic
Sample
Sample Parameter
Dataset
Dataset Parameter
Datafile
Datafile Parameter
Investigator
Related Datafile
Parameter
Authorisation
Core Scientific Metadata Model (CSMD)
The Core Metadata model forms the information model for ICAT.
Designed to describe facility-based experiments in Structural Science.
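The containment hierarchy named in the CSMD diagram can be pictured as nested records. The following is a minimal Python sketch that assumes only the entity names shown on the slide (Investigation, Dataset, Datafile, Sample, Investigator, Parameter). The field names and example values are illustrative assumptions and are not the actual ICAT schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Parameter:
    # Name/value/units record; these field names are assumptions, not the ICAT schema.
    name: str
    value: str
    units: str = ""


@dataclass
class Datafile:
    name: str
    location: str
    parameters: List[Parameter] = field(default_factory=list)


@dataclass
class Dataset:
    name: str
    datafiles: List[Datafile] = field(default_factory=list)
    parameters: List[Parameter] = field(default_factory=list)


@dataclass
class Investigation:
    title: str
    investigators: List[str] = field(default_factory=list)
    samples: List[str] = field(default_factory=list)
    datasets: List[Dataset] = field(default_factory=list)

    def datafile_count(self) -> int:
        # Total number of data files across all datasets of this investigation.
        return sum(len(ds.datafiles) for ds in self.datasets)


# Hypothetical example values, for illustration only.
raw = Dataset(
    name="raw",
    datafiles=[Datafile("run001.nxs", "/data/run001.nxs")],
    parameters=[Parameter("temperature", "300", "K")],
)
inv = Investigation(
    title="Example investigation",
    investigators=["A. Scientist"],
    samples=["sample-1"],
    datasets=[raw],
)
print(inv.datafile_count())
```

The point of the nesting is that a data file is never described on its own: it is always reached through a dataset and the investigation that produced it.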
TopCat
DOIs for Data Publication
Is this enough?
• What we have so far is good for:
– us to manage data
– users to access their own data
– citation of raw data
• But
– Traceability and validation?
– Reuse of the data?
• Need to make context more explicit
– The dataset alone is the wrong subject of discourse
Support the wider Facilities Lifecycle
Proposal: Scientist submits application for beamtime
Approval: Facility committee approves application
Scheduling: Facility registers, trains, and schedules scientist's visit
Experiment: Scientist visits, facility runs experiment
Data storage: Raw data filtered and stored
Data analysis: Tools for processing made available
Record Publication: Subsequent publication registered with facility
As in PanData-ODI – D6.1 (which has much more detail)
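The lifecycle can be read as an ordered sequence of stages that an investigation record passes through. Here is a small Python sketch of that reading. The stage names come from the slide, but the strict linear ordering (in particular, placing data analysis before publication) is an assumption made for illustration.

```python
from enum import IntEnum
from typing import Optional


class Stage(IntEnum):
    # Stage names are taken from the slide; the numeric order is an assumption.
    PROPOSAL = 1
    APPROVAL = 2
    SCHEDULING = 3
    EXPERIMENT = 4
    DATA_STORAGE = 5
    DATA_ANALYSIS = 6
    RECORD_PUBLICATION = 7


def next_stage(stage: Stage) -> Optional[Stage]:
    """Return the stage that follows `stage`, or None after the last one."""
    if stage is Stage.RECORD_PUBLICATION:
        return None
    return Stage(stage + 1)


# Walk the whole lifecycle from the first stage to the last.
stage = Stage.PROPOSAL
path = [stage.name]
while (stage := next_stage(stage)) is not None:
    path.append(stage.name)
print(" -> ".join(path))
```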
Publishing Investigations
• So what we want is a record of EXPERIMENTS, not data.
• Thus we want the record of the context
– The experimental intention and actors
– The instruments and configurations used
– The sample
– The environmental parameters and context
– The raw data
• Thus we want to publish a record of the whole INVESTIGATION
– Can get most of this from what we have
• The Investigation becomes a “first class” research object
– Published
– Identified and treated as a single entity
– Cited and credited
– Record of the output of the facility
• Analogous to a journal article
– Investigation as the unit of discourse for scientific facilities
• But also as an access point for validation and reuse
– Because we have a record of what actually happened
Our DataCite entries are in fact Investigations (red is for “data” notion, and green is for “investigation”)
“DataCite abuse”
As we have seen, we use DataCite for Investigations, with Datasets only referred to from them.
Other data curators sometimes use DataCite for Publications (“documents”) that contain data: http://data.datacite.org/10.7480/OA
So “data” DOIs tend to resolve into either Investigations or Publications
• Extend the Resource Type
• Also may not want to have a landing page for all DOIs
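To make the resource-type point concrete, here is a hedged Python sketch of how an Investigation might be described with the core DataCite properties today. It assumes that the controlled list of general resource types has no "Investigation" value, so a general type is paired with a free-text type. The DOI, names and year are placeholders, and choosing "Collection" as the general type is an assumption rather than the facility's actual practice.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: every value below is a placeholder, not a real entry.
record = {
    "identifier": "10.5072/example.investigation.1",
    "creators": ["A. Scientist"],
    "title": "Example investigation",
    "publisher": "Example Facility",
    "publicationYear": "2013",
    # Assumption: a general type plus a free-text "Investigation" label.
    "resourceTypeGeneral": "Collection",
    "resourceType": "Investigation",
}


def resource_type_element(rec: dict) -> ET.Element:
    """Build the resourceType element: general type as attribute, free text as content."""
    elem = ET.Element(
        "resourceType", {"resourceTypeGeneral": rec["resourceTypeGeneral"]}
    )
    elem.text = rec["resourceType"]
    return elem


xml_text = ET.tostring(resource_type_element(record), encoding="unicode")
print(xml_text)
```

Extending the resource type, as the slide suggests, would let "Investigation" appear as a recognised value instead of free text.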
Research Objects
• Represent the “investigation” as a Research Object
– Research Objects (ROs) are semantically rich aggregations of resources that bring together data, methods and people in scientific investigations. Their goal is to create a class of artifacts that can encapsulate our digital knowledge and provide a mechanism for sharing and discovering assets of reusable research and scientific knowledge.
• www.researchobject.org and elsewhere (WorkFlow4Ever)
• Represent Investigation as a Research Object
– Build a graph structure for the links in the research object
– Using an RDF representation, URIs
– Publish as a linked data object
Bechhofer et al. Why Linked Data is Not Enough for Scientists. Proceedings of the 10th IEEE e-Science Conference, Brisbane, Australia (2010). http://eprints.ecs.soton.ac.uk/21587/5/research-objects-final.pdf
Arif Shaon, Sarah Callaghan, Bryan Lawrence, Brian Matthews. Opening up Climate Research: a linked data approach to publishing data provenance. 7th Int Digital Curation Conference (2011).
RDF representation of CSMD model

<!-- csmd:Investigation -->
<owl:Class rdf:about="csmd:Investigation">
    <rdfs:label>Investigation</rdfs:label>
    <rdfs:comment>An investigation or experiment</rdfs:comment>
</owl:Class>

<!-- csmd:Facility -->
<owl:Class rdf:about="csmd:Facility">
    <rdfs:label>Facility</rdfs:label>
    <rdfs:comment>An experimental facility</rdfs:comment>
</owl:Class>

<!-- csmd:Dataset -->
<owl:Class rdf:about="csmd:Dataset">
    <rdfs:label>Dataset</rdfs:label>
    <rdfs:comment>A collection of data files and part of an investigation</rdfs:comment>
</owl:Class>

<!-- csmd:Datafile -->
<owl:Class rdf:about="csmd:Datafile">
    <rdfs:label>Datafile</rdfs:label>
    <rdfs:comment>A data file</rdfs:comment>
</owl:Class>
After proposal: Initialise the Research Object
Investigation #n
DOI:STFC.xxx.n
:instrument
:investigator
:n a csmd:Investigation ;
    csmd:investigation_doi doi:stfc.xxx.n ;
    csmd:investigation_investigationUser :iu1 ;
    csmd:investigation_instrument :inst1 .
:iu1 a csmd:investigationUser ;
    csmd:investigationUser_user :u1 .
:u1 a csmd:User .
:inst1 a csmd:Instrument .
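Initialising the research object amounts to emitting a small, fixed set of triples as soon as the proposal exists. The Python sketch below shows one way this could be generated using only the standard library. The function names and the plain-tuple graph representation are assumptions made for illustration. The property and class names are copied from the slide's Turtle.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def initialise_investigation(n: str, doi: str, user: str, instrument: str) -> List[Triple]:
    """Build the seed triples for an investigation, mirroring the slide's Turtle."""
    inv = f":{n}"
    inv_user = f":iu_{user}"  # hypothetical naming scheme for the link node
    return [
        (inv, "a", "csmd:Investigation"),
        (inv, "csmd:investigation_doi", doi),
        (inv, "csmd:investigation_investigationUser", inv_user),
        (inv, "csmd:investigation_instrument", f":{instrument}"),
        (inv_user, "a", "csmd:investigationUser"),
        (inv_user, "csmd:investigationUser_user", f":{user}"),
        (f":{user}", "a", "csmd:User"),
        (f":{instrument}", "a", "csmd:Instrument"),
    ]


def to_turtle_lines(triples: List[Triple]) -> List[str]:
    # One statement per line; prefix declarations are omitted in this sketch.
    return [f"{s} {p} {o} ." for s, p, o in triples]


graph = initialise_investigation("n", "doi:stfc.xxx.n", "u1", "inst1")
for line in to_turtle_lines(graph):
    print(line)
```

Later lifecycle stages (datasets, samples, publications) would then append further triples to the same graph rather than creating a new record.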
After the experiment: Experimental Data Metadata
Investigation #n
DOI:STFC.xxx.n
:dataset
:instrument
:investigator
• Own metadata format (CSMD)
• More or less what ICAT currently supports
• Adds extra details on parameters, datasets, formats etc.
:sample
Data Storage
Linking Publication into Investigation
Raw Data Repository
Publication Repository
:dataset
:publication
:publication
:investigator
cito:cites
cito:cites
Investigation #n
DOI:STFC.xxx.n
:instrument :sample
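Linking a publication adds two kinds of edge: one from the investigation to the publication, and citation edges using cito:cites. A minimal Python sketch follows, assuming the graph is held as a list of plain triples and that the citation runs from the publication to the cited data or investigation. The identifiers :pub1 and :ds1 are hypothetical.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def link_publication(graph: List[Triple], investigation: str,
                     publication: str, cited: List[str]) -> List[Triple]:
    """Return a new graph with the publication attached and its citations recorded."""
    new = list(graph)  # copy, so the original graph is left unchanged
    new.append((investigation, ":publication", publication))
    for target in cited:
        # Assumed direction: the publication cites the data or the investigation.
        new.append((publication, "cito:cites", target))
    return new


seed = [(":n", "a", "csmd:Investigation"), (":n", ":dataset", ":ds1")]
linked = link_publication(seed, ":n", ":pub1", [":ds1", ":n"])
cites = [o for s, p, o in linked if p == "cito:cites"]
print(cites)
```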
Publication Store
Raw Data Repository
Derived Data Repository
Publication Repository
:dataset
:publication
:publication
:investigator
Investigation #n
DOI:STFC.xxx.n
:instrument :sample
• Note that derived data could be on a different site
:relatedDataset
Linking the derived data into the Investigation
Linking the software into the Investigation
:dataset
:relatedDataset
:publication
:publication
:investigator
• W3C PROV ontology
• Assume that the software is in a repository
SoftwarePackage 1
cito:cites
cito:cites
:inputDataset
:outputDataset
:application
Software Repository
Investigation #n
DOI:STFC.xxx.n
:instrument :sample
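The software link can be expressed with terms from the W3C PROV ontology: an analysis run is an activity that used the input dataset, the output dataset was generated by that run, and the software package is the agent associated with it. The Python sketch below builds those triples with the standard library only. The identifiers (:run1, :softwarePackage1, :rawDataset, :derivedDataset) and the helper name are assumptions for illustration.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def record_analysis(run: str, software: str,
                    inputs: List[str], outputs: List[str]) -> List[Triple]:
    """Describe one analysis run using PROV-O classes and properties."""
    triples: List[Triple] = [
        (run, "a", "prov:Activity"),
        (software, "a", "prov:SoftwareAgent"),
        (run, "prov:wasAssociatedWith", software),
    ]
    for ds in inputs:
        triples.append((run, "prov:used", ds))
    for ds in outputs:
        triples.append((ds, "prov:wasGeneratedBy", run))
        for src in inputs:
            # Each output is recorded as derived from every input of the run.
            triples.append((ds, "prov:wasDerivedFrom", src))
    return triples


prov = record_analysis(":run1", ":softwarePackage1",
                       [":rawDataset"], [":derivedDataset"])
for s, p, o in prov:
    print(s, p, o, ".")
```

Keeping the run as its own node means a validator can later ask which software and which inputs produced a given derived dataset.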
Generate Landing page from RO
Setting the Boundary: It depends on your Point of View
Investigations
Extended Publication
E-Portfolio
Setting a boundary: OAI-ORE
Preserving Investigations
• Now becomes preserving the research object
– Preserving a linked data graph
– Persistence of identifiers
– Managing integrity of external artefacts
– Link checking
– Copying and mirroring, checking consistency
• Representation Information to give more context on the objects
– And on the aggregate as a whole
• PDI (Provenance, Integrity etc.) on the whole aggregate object
– As well as components
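Two of the checks listed above lend themselves to a short sketch: integrity (fixity) of the components, and finding the externally hosted artefacts that need periodic link checking. The Python below is a minimal illustration using only the standard library. The file names, hosts and the choice of SHA-256 are assumptions, and no network access is attempted.

```python
import hashlib
from typing import Dict, List
from urllib.parse import urlparse


def sha256_hex(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()


def failed_fixity(recorded: Dict[str, str], current: Dict[str, bytes]) -> List[str]:
    """Return identifiers whose content is missing or no longer matches its recorded digest."""
    bad = []
    for ident, digest in recorded.items():
        content = current.get(ident)
        if content is None or sha256_hex(content) != digest:
            bad.append(ident)
    return sorted(bad)


def external_links(uris: List[str], home_host: str) -> List[str]:
    """Return URIs hosted outside `home_host`; these are candidates for link checking."""
    return [u for u in uris if urlparse(u).hostname != home_host]


# Hypothetical components of one investigation object.
recorded = {"raw.nxs": sha256_hex(b"raw data"), "derived.dat": sha256_hex(b"derived")}
current = {"raw.nxs": b"raw data", "derived.dat": b"changed"}
broken = failed_fixity(recorded, current)
outside = external_links(
    ["https://facility.example/inv/1", "https://journal.example/paper/9"],
    "facility.example",
)
print(broken)
print(outside)
```

Running both checks over the whole aggregate, and not only over individual files, matches the point that preservation information applies to the object as a whole as well as to its components.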
Adding Preservation Information – Rep Info for various items
:dataset
:relatedDataset
:publication
:publication
:investigator
• Would probably be more
• Work into a RepInfo Repository
• Would also have a RepInfo Network
:application
Investigation #n
DOI:STFC.xxx.n
:instrument :sample
Instrument description (website)
Raw data format description (e.g. NeXus)
Parameter description (e.g. NXDL, Con Vocab)
Software classification
Software description
Sample description
Analysed data format description
Publication format description
Adding Preservation Information – Rep Info for the whole aggregate
:dataset
:relatedDataset
:publication
:publication
:investigator
:application
Investigation #n
DOI:STFC.xxx.n
:instrument :sample
Software classification
CSMD Vocabulary description
Summary
• Investigation is an appropriate unit of discourse for facilities science
– Publishable, citable, reportable
– Can be used as a vehicle for validation and reuse
• Basic principles of building research objects for facilities science
– Follow research lifecycle
– Consider Investigation a RO “seed”
– Apply Linked Data principles
– Re-use existing vocabularies and ontologies
– Share ROs via recognizable data formats and APIs
• Applicable beyond Facilities
– Other analogous objects: “experiments”, “observations”, “studies”
• The subject of preservation
– How do we maintain the integrity of Investigation objects?
Thank You
Questions?
brian.matthews@stfc.ac.uk
www.e-science.stfc.ac.uk