Measurement Data Archive – Project Highlights GEC12 Nov 2011

Giridhar Manepalli
Corporation for National Research Initiatives

http://www.cnri.reston.va.us/

Why Archive?

• The obvious: for use by others or by yourself in the future

• The Fourth Paradigm
  • Data-intensive science
  • Emergent phenomena

• Funding bodies increasingly asking for data plans

• Citations from journal articles to data sets on the rise

• Consistent archiving standards enhance the use of data over time and within a domain

Measurement Data Archive

[Figure: Experimenter X and Experimenter Y each work with a Slice and a Workspace, connected over the Internet to a CNRI Workspace and the Archive; Public Journals connect to the Archive over the Internet. The Archive holds a DO (Digital Object) built from a Measurement Data Template: Object A, identifier 10510.0.1/0-L2NucmlnZW5p, containing Run 1 (Logs), Run 2 (Logs), and Metadata 2. Legend: DO = Digital Object; separate markers denote "Data Model TBD" and "Prototype" components.]

Key:
1. Experiment Initiated
2. Measurement Data Collected
3. Measurement Data Archived
4. Archived Data Referenced
5. Archived Data Retrieved
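
The object layout and the five key steps in the diagram can be pictured with a small sketch. This is only an illustration, not the archive's actual data model (the slide marks that as TBD): the class names `DigitalObject` and `Run`, the field names, and the `Stage` enum are all hypothetical. Only the identifier, the run/logs/metadata grouping, and the step names come from the diagram.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import IntEnum


class Stage(IntEnum):
    """The five steps from the diagram key (enum itself is hypothetical)."""
    EXPERIMENT_INITIATED = 1
    DATA_COLLECTED = 2
    DATA_ARCHIVED = 3
    DATA_REFERENCED = 4
    DATA_RETRIEVED = 5


@dataclass
class Run:
    """One experiment run with its logs and optional metadata (assumed shape)."""
    name: str
    logs: list[str] = field(default_factory=list)
    metadata: dict[str, str] = field(default_factory=dict)


@dataclass
class DigitalObject:
    """A digital object (DO) holding the runs of one experiment."""
    identifier: str
    runs: list[Run] = field(default_factory=list)
    stage: Stage = Stage.EXPERIMENT_INITIATED

    def advance(self) -> Stage:
        """Move to the next step in the key; stay put at the last one."""
        if self.stage < Stage.DATA_RETRIEVED:
            self.stage = Stage(self.stage + 1)
        return self.stage


obj = DigitalObject(identifier="10510.0.1/0-L2NucmlnZW5p")
obj.runs.append(Run(name="Run 1"))
obj.runs.append(Run(name="Run 2", metadata={"label": "Metadata 2"}))
obj.advance()  # step 2: measurement data collected
obj.advance()  # step 3: measurement data archived
```

Keeping the stage as an ordered enum makes the one-way progression in the key explicit; a real deployment might track these steps per run rather than per object.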

Current Usage

• Early adopters in GENI:
  • OnTimeMeasure - Ohio State University
  • INSTOOLS - University of Kentucky

• Possible usage in other projects:
  • DARPA Transformative Apps program, for managing mobile-app-related data
  • Internal to CNRI, for sharing documents and presentations across groups

Next Steps – I&M Standpoint

• Revisit the protocols for pushing data into the workspace
• Associate metadata with data effectively
  • Where does the metadata live?
  • How is it associated with data? At what level of granularity is it specified?
• Support GENI and I&M schemes of authentication, authorization, metadata enforcement, etc.
• Allow multiple workspace deployments
• Identify the process to push data from the workspace into the archive
  • Should metadata be enforced before data is pushed into the archive?
  • How is the data serialized in the archive?
  • How is data visibility managed in the archive?

Next Steps – GENI-wide

• Extend services offered by the archive beyond data storage
  • Developed a visualization service prototype to demonstrate automatic visualization of data for DataCite
  • Designed a theoretical model for enforcing terms & conditions, licenses, etc. prior to disseminating data

• Goal: Expand archive into an eco-system to entice communities into using it

• Use archive for experiments, not just for I&M

Suite of Services

[Figure: a journal article ("Science Times", Article Title, Data ID) points into the Archive. Archive Services: a suite of extensible services end users can leverage by following the ID. Services shown: Stores & Retrieves Data, Visualization, License Enforcement ("Terms: …", "I Agree"), Data Set Dissemination, Data Processing. Data contributors shown: Ohio University VDC Experiment, Experimenter, Other Experiments, Other Experimenters.]

1. User follows Data ID into the Archive.
2. User is redirected to requested Archive Service.
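
The two steps above can be sketched as a simple dispatch on the Data ID. This is a minimal illustration under stated assumptions, not the archive's interface: the function `follow_data_id`, the `SERVICES` registry, the handler names, and the `terms_accepted` flag are all hypothetical. The only things taken from the slide are the idea of following an ID to a requested service and the "I Agree" license step.

```python
from __future__ import annotations

from typing import Callable


# Hypothetical service handlers; the slides do not specify the real services' APIs.
def retrieve(data_id: str) -> str:
    return f"retrieve:{data_id}"


def visualize(data_id: str) -> str:
    return f"visualize:{data_id}"


# Assumed registry mapping a requested service name to its handler.
SERVICES: dict[str, Callable[[str], str]] = {
    "retrieve": retrieve,
    "visualize": visualize,
}


def follow_data_id(data_id: str, service: str, terms_accepted: bool) -> str:
    """Step 1: the user arrives with a Data ID. Step 2: redirect to the service."""
    if service not in SERVICES:
        raise KeyError(f"unknown archive service: {service}")
    if not terms_accepted:
        # License enforcement: refuse until the user has clicked "I Agree".
        raise PermissionError("terms must be accepted before dissemination")
    return SERVICES[service](data_id)


result = follow_data_id("10510.0.1/0-L2NucmlnZW5p", "visualize", terms_accepted=True)
```

Because services are looked up by name, new ones can be registered without changing how IDs are followed, which matches the "extensible services" idea on the slide.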

Related Slides

Prototype Limitations

• Only one workspace service is deployed
  • Multiple workspaces, within and outside GENI networks, can be hosted that push data to the archive
• Authentication and authorization model is simple and redundant
  • Should conform to and use one scheme across GENI (or at least across I&M)
• No metadata standard applied
  • I&M metadata requirements must be applied once identified

What is Metadata and Why Do I Need It?

• Lots of miscommunication because
  • Metadata is not a type of data
  • Metadata is a type of relationship between two pieces of data
• Needed for Understanding and Finding
  • Understanding (sometimes called Descriptive MD)
    • How do I parse this?
    • How do I interpret this?
  • Finding (sometimes called Subject MD)
    • Finding one item in a population of 10 is easy
    • Finding one item in a population of 1M is impossible w/o some way to distinguish them
• Generally requires a human in the loop at some level
  • Sometimes the object is self-describing (journal article)
  • Automatic indexing/classification works for some domains

Why is Metadata Hard?

• To be effective it must be consistent, and consistently applied, within a given domain
  • What is the scope of the domain?
  • What aspects of the object need to be described?
  • What is the vocabulary? Is it open or closed?
• Even within a defined domain, there are many points of view
  • Especially true for any sort of subject description
  • May have to allow for multiple metadata records for a single described object
• Spending time on creating good metadata is Good For You
  • The best sources for good metadata are the creators/owners of the described object, but they may lack interest and training
  • Some types of metadata are difficult to automate, e.g., a good title
• Keep it simple – favor consistency and coverage over depth

Misc Points

• Precision and Recall are useful concepts in searching
  • Precision: % of search results that are on target
  • Recall: % of the correct result set that my search retrieved
  • Desirable tradeoff is situational
• Consider University Libraries as reliable archive holders
• Variety of approaches to managing a useful vocabulary of terms
  • Controlled vocabulary: set of terms – use these instead of slight variations
  • Taxonomy: parent-child relationships
  • Ontologies: introduce other types of relationships
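
As a small worked example of the precision and recall definitions on this slide, the sketch below computes both for one search. The function name `precision_recall` and the sample item IDs are made up for illustration; the formulas are the standard ones (hits divided by results returned, and hits divided by correct results).

```python
from __future__ import annotations


def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for one search, as fractions between 0 and 1."""
    hits = len(retrieved & relevant)  # results that are on target
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up example: the search returned four items; three items were actually correct.
retrieved = {"a", "b", "c", "d"}
relevant = {"a", "b", "e"}
precision, recall = precision_recall(retrieved, relevant)
# Two of the four results are on target, and two of the three correct items were found.
```

Returning more results tends to raise recall and lower precision, which is why the slide calls the desirable tradeoff situational.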
