Measurement Data Archive – Project Highlights
GEC12, Nov 2011
Giridhar Manepalli
Corporation for National Research Initiatives
http://www.cnri.reston.va.us/
Why Archive?
• The obvious: for use by others or by yourself in the future
• The Fourth Paradigm
  • Data-intensive science
  • Emergent phenomena
• Funding bodies increasingly asking for data plans
• Citations from journal articles to data sets on the rise
• Consistent archiving standards enhance the use of data over time and within a domain
Measurement Data Archive
[Figure: architecture diagram. Labels: Experimenter X, Experimenter Y, Slice, Workspace (CNRI Workspace), Internet, Archive, Public Journals, Measurement Data Template, and a Digital Object with identifier 10510.0.1/0-L2NucmlnZW5p ("Object A": Run 1 logs, Run 2 logs, metadata). Legend: DO = Digital Object; other legend entries read "Data Model TBD" and "Prototype". Numbered markers in the diagram refer to the key.]
Key:
1. Experiment Initiated
2. Measurement Data Collected
3. Measurement Data Archived
4. Archived Data Referenced
5. Archived Data Retrieved
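The five steps in the key can be sketched as a small in-memory model. This is an illustration only, not the prototype's code: the `Workspace` and `Archive` classes, their method names, and the `example/` identifier prefix are all assumptions made for the sketch, and the real archive's identifier scheme and storage are not modeled.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Workspace:
    """Hypothetical staging area for one experiment (step 1)."""
    files: Dict[str, bytes] = field(default_factory=dict)

    def collect(self, name: str, data: bytes) -> None:
        """Step 2: measurement data is collected into the workspace."""
        self.files[name] = data


@dataclass
class Archive:
    """Hypothetical archive that stores snapshots under generated IDs."""
    objects: Dict[str, Dict[str, bytes]] = field(default_factory=dict)

    def deposit(self, workspace: Workspace) -> str:
        """Step 3: copy the workspace contents into the archive.

        The returned ID is what a publication would cite (step 4).
        The 'example/' prefix is made up for this sketch.
        """
        object_id = f"example/{uuid.uuid4().hex}"
        self.objects[object_id] = dict(workspace.files)
        return object_id

    def retrieve(self, object_id: str) -> Dict[str, bytes]:
        """Step 5: a reader follows the cited ID back to the data."""
        return self.objects[object_id]


workspace = Workspace()                      # step 1
workspace.collect("run1.log", b"rtt=12ms")   # step 2
archive = Archive()
cited_id = archive.deposit(workspace)        # steps 3 and 4
archived = archive.retrieve(cited_id)        # step 5
```

The deposit takes a copy of the workspace contents, so later changes in the workspace do not alter what the cited ID returns.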
Current Usage
• Early adopters in GENI:
  • OnTimeMeasure - Ohio State University
  • INSTOOLS - University of Kentucky
• Possible usage in other projects:
  • DARPA Transformative Apps program for managing mobile apps related data
  • Internal to CNRI for sharing documents and presentations across groups
Next Steps – I&M Standpoint
• Revisit the protocols for pushing data into workspace
• Associate metadata with data effectively
  • Where does the metadata live?
  • How is it associated with data? At what level of granularity is it specified?
• Support GENI and I&M schemes of authentication, authorization, metadata enforcement, etc.
• Allow multiple workspace deployments
• Identify the process to push data from workspace into the archive
  • Should metadata be enforced before data is pushed into the archive?
  • How is the data serialized in the archive?
  • How is data visibility managed in the archive?
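One possible answer to the enforcement question above is a gate that refuses to archive data whose metadata is incomplete. The sketch below assumes a made-up set of required fields (`title`, `creator`, `created`) and a plain list standing in for the archive; the actual I&M metadata requirements were still to be identified.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical required fields; the real I&M requirements are not defined here.
REQUIRED_FIELDS = {"title", "creator", "created"}


def missing_metadata(metadata: Dict[str, str]) -> Set[str]:
    """Return the required fields that are absent or blank."""
    missing = set()
    for name in REQUIRED_FIELDS:
        value = metadata.get(name)
        if value is None or not str(value).strip():
            missing.add(name)
    return missing


def push_to_archive(
    data: bytes,
    metadata: Dict[str, str],
    archive: List[Tuple[bytes, Dict[str, str]]],
) -> None:
    """Append (data, metadata) to the archive only if the metadata is complete."""
    missing = missing_metadata(metadata)
    if missing:
        raise ValueError(f"missing required metadata: {sorted(missing)}")
    archive.append((data, dict(metadata)))


archive_store: List[Tuple[bytes, Dict[str, str]]] = []
push_to_archive(
    b"rtt=12ms",
    {"title": "Run 1 logs", "creator": "Experimenter X", "created": "2011-11-01"},
    archive_store,
)
```

Enforcing at push time keeps incomplete records out of the archive, at the cost of making the workspace the only place where partially described data can live.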
Next Steps – GENI-wide
• Extend services offered by the archive beyond data storage
  • Developed a visualization service prototype to demonstrate automatic visualization of data for DataCite
• Designed a theoretical model for enforcing terms & conditions, licenses, etc. prior to disseminating data
• Goal: Expand archive into an eco-system to entice communities into using it
• Use archive for experiments, not just for I&M
SUITE OF SERVICES
[Figure: services diagram. Labels: Science Times article (Article Title, Data ID); Ohio University VDC Experiment; Experimenter; Other Experiments; Other Experimenters; Archive. Archive Services: a suite of extensible services end users can leverage by following the ID, shown as Stores & Retrieves Data, Visualization, License Enforcement ("Terms: …", "I Agree"), Data Set Dissemination, and Data Processing.]
1. User follows Data ID into the Archive.
2. User is redirected to requested Archive Service.
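The two steps can be sketched as a lookup followed by a dispatch. Everything named here is hypothetical: the `example/abc123` ID, the `ARCHIVE` and `SERVICES` tables, and the handler functions are placeholders, and the "visualize" handler only summarizes the data rather than drawing anything.

```python
from typing import Callable, Dict

# Hypothetical in-memory archive keyed by a made-up Data ID.
ARCHIVE: Dict[str, dict] = {
    "example/abc123": {"data": b"\x01\x02\x03", "terms": "Example terms of use"},
}


def retrieve(obj: dict, agreed: bool) -> bytes:
    """Disseminate the raw data only after the user accepts the terms."""
    if not agreed:
        raise PermissionError("terms must be accepted: " + obj["terms"])
    return obj["data"]


def summarize(obj: dict, agreed: bool) -> str:
    """Stand-in for a visualization service: describes the data, draws nothing."""
    return f"{len(obj['data'])} bytes"


# Extensible table of services; adding a service means adding an entry.
SERVICES: Dict[str, Callable[[dict, bool], object]] = {
    "retrieve": retrieve,
    "visualize": summarize,
}


def follow_data_id(data_id: str, service: str, agreed: bool = False):
    obj = ARCHIVE[data_id]        # step 1: follow the Data ID into the archive
    handler = SERVICES[service]   # step 2: redirect to the requested service
    return handler(obj, agreed)
```

Keeping the services in a table mirrors the "extensible" idea: the ID stays stable while the set of things a user can do with it grows.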
Related Slides
Prototype Limitations
• Only one workspace service is deployed
  • Multiple workspaces, within and outside GENI networks, can be hosted that push data to the archive
• Authentication and authorization model is simple and redundant
  • Should conform to and use one scheme across GENI (or at least across I&M)
• No metadata standard applied
  • I&M metadata requirements must be applied once identified
What is Metadata and Why Do I Need It?
• Lots of miscommunication because
  • Metadata is not a type of data
  • Metadata is a type of relationship between two pieces of data
• Needed for Understanding and Finding
  • Understanding (sometimes called Descriptive MD)
    • How do I parse this?
    • How do I interpret this?
  • Finding (sometimes called Subject MD)
    • Finding one item in a population of 10 is easy
    • Finding one item in a population of 1M is impossible w/o some way to distinguish them
• Generally requires a human in the loop at some level
  • Sometimes the object is self-describing (journal article)
  • Automatic indexing/classification works for some domains
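The point that metadata is a relationship rather than a kind of data can be made concrete with a small sketch. The `DataItem` and `Describes` types, the `kind` labels, and the sample items are all invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DataItem:
    """Any piece of data; nothing marks it as 'metadata' on its own."""
    item_id: str
    content: str


@dataclass(frozen=True)
class Describes:
    """A metadata relationship: `description` is metadata for `subject`."""
    description: DataItem
    subject: DataItem
    kind: str  # e.g. "descriptive" (understanding) or "subject" (finding)


def metadata_for(item: DataItem, relations: List[Describes]) -> List[DataItem]:
    """Return every item that acts as metadata for `item`."""
    return [r.description for r in relations if r.subject == item]


log = DataItem("run1.log", "1320105600,12.4")
readme = DataItem("README", "columns: unix timestamp, rtt in ms")
note = DataItem("note", "README written by Experimenter X")

relations = [
    Describes(readme, log, "descriptive"),   # README is metadata for the log
    Describes(note, readme, "descriptive"),  # README is the described data here
]
```

The README appears on both sides: it is metadata with respect to the log and plain data with respect to the note, which is why the role belongs to the relationship and not to the item.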
Why is Metadata Hard?
• To be effective it must be consistent, and consistently applied, within a given domain
  • What is the scope of the domain?
  • What aspects of the object need to be described?
  • What is the vocabulary, is it open or closed?
• Even within a defined domain, there are many points of view
  • Especially true for any sort of subject description
  • May have to allow for multiple metadata records for a single described object
• Spending time on creating good metadata is Good For You
• The best sources for good metadata are the creators/owners of the described object, but they may lack interest and training
• Some types of metadata are difficult to automate, e.g., a good title
• Keep it simple: favor consistency and coverage over depth
Misc Points
• Precision and Recall are useful concepts in searching
  • Precision: what % of the search results are on target?
  • Recall: what % of the correct result set did my search retrieve?
  • Desirable tradeoff is situational
• Consider University Libraries as reliable archive holders
• Variety of approaches to managing a useful vocabulary of terms
  • Controlled vocabulary: set of terms; use these instead of slight variations
  • Taxonomy: parent-child relationships
  • Ontologies: introduce other types of relationships
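Precision and recall as described above reduce to two ratios over sets. The function below is a minimal sketch; the name `precision_recall` and the convention of returning 0.0 for an empty set are choices made for this example.

```python
from typing import Hashable, Set, Tuple


def precision_recall(
    retrieved: Set[Hashable], relevant: Set[Hashable]
) -> Tuple[float, float]:
    """Compute (precision, recall) for one search.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    An empty denominator yields 0.0 (a convention chosen for this sketch).
    """
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# A search returns 4 items; 2 of them are among the 3 correct ones.
precision, recall = precision_recall({1, 2, 3, 4}, {2, 4, 6})
```

Here precision is 2/4 and recall is 2/3. Returning more items can raise recall while lowering precision, which is the situational tradeoff the slide refers to.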