Metadata Semantics and the Earth System Curator Rocky Dunlap Earth System Curator Georgia Tech

Earth System Curator

3-year NSF-funded project

Funded collaborators:
• Cecelia DeLuca (NCAR, PI)
• Balaji (GFDL, Co-PI)
• Don Middleton (NCAR, Co-PI)
• Chris Hill (MIT, Co-PI)
• Spencer Rugaber (Ga Tech, Co-PI)
• Leo Mark (Ga Tech)
• Julien Chastang (NCAR)
• Sergey Nikonov (GFDL)
• Angela Navarro (Ga Tech)
• Me (Ga Tech)

Also working with:
• Lois and Katherine (NMM)
• Sophie Valcke (PRISM/OASIS)
• Others...

Curator Doctrine

Currently a gap in the way we treat models and datasets (are they really so different?)

Best description of a dataset is a comprehensive description of the model run that created the dataset (+ post processing)

Model components are data objects for exchange

Metadata-centric view
• Don’t start with a dataset and try to find the metadata...
• Start with good metadata that leads you to the datasets you want, even if they don’t yet exist! (No, really, that’s how we think.)

Haiku are a valid form of model metadata

Earth System Curator Applications (Proofs of Concept)

Catalog of modeling components along with comprehensive metadata
• CDP Curator (Michael B., Don, Luca, Julien)

Demonstrate compatibility checking of components
• Primarily “technical” compatibility: platforms, compilers, required fields, field data types, calendar/time

Demonstrate auto-generation of coupler component based on metadata

Demonstrate automation of workflow tasks
• Model assembly, execution, archive, post-processing
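As a rough illustration of the “technical” compatibility check listed above, the following Python sketch compares two component descriptions. The ComponentMeta class, its attribute names, and the sample atmosphere/ocean values are hypothetical and are not taken from any Curator schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ComponentMeta:
    """Hypothetical, simplified component description (not a Curator schema)."""
    name: str
    platforms: set[str]
    compilers: set[str]
    calendar: str
    provides: dict[str, str] = field(default_factory=dict)  # field name -> data type
    requires: dict[str, str] = field(default_factory=dict)  # field name -> data type


def check_compatibility(a: ComponentMeta, b: ComponentMeta) -> list[str]:
    """Return human-readable problems; an empty list means no problem was found."""
    problems = []
    if not a.platforms & b.platforms:
        problems.append("no common platform")
    if not a.compilers & b.compilers:
        problems.append("no common compiler")
    if a.calendar != b.calendar:
        problems.append(f"calendar mismatch: {a.calendar} vs {b.calendar}")
    # Each required field must be provided by the other component with the same type.
    for consumer, producer in ((a, b), (b, a)):
        for fname, ftype in consumer.requires.items():
            if fname not in producer.provides:
                problems.append(f"{consumer.name} requires missing field {fname}")
            elif producer.provides[fname] != ftype:
                problems.append(f"type mismatch for {fname}")
    return problems


# Illustrative values only.
atm = ComponentMeta("atm", {"linux-x86_64"}, {"ifort"}, "noleap",
                    provides={"surface_wind": "float64"},
                    requires={"sea_surface_temperature": "float64"})
ocn = ComponentMeta("ocn", {"linux-x86_64"}, {"ifort", "gfortran"}, "gregorian",
                    provides={"sea_surface_temperature": "float32"},
                    requires={"surface_wind": "float64"})

problems = check_compatibility(atm, ocn)
print(problems)
```

Here the two components share a platform and a compiler, but the check reports a calendar mismatch and a data type mismatch on one coupling field.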

Schema Development Fun

To accomplish these goals, we need:

Comprehensive descriptions of climate models: model metadata

Includes both “semantic” and “syntactic” elements (“discovery” vs. “use”)
• Semantic: component name, type, owner, description, source code location, component architecture of model, platform, framework
• Syntactic: parameter settings, input datasets, boundary conditions, coupling details, grid coordinates

Lots of schemata...

• Component (NMM)
• Potential Model (NMM/Curator)
• Model (NMM)
• PMIOD/SMIOC (PRISM coupling spec)
• CRE/Curator Complete (workflow)
• Application (NMM)
• Gridspec

Reminiscing on Metadata Development

Observations:

(It seems) much of the community is in support of metadata development
• Although there are different opinions on levels of comprehensiveness

People are using metadata for different reasons:
• Annotate large datasets for retrieval
• Inform analysis tools
• Archiving of modeling components
• Automation of workflow (runtime environ.)
• Exchange datasets

Each application requires different (but often overlapping) metadata

How should we think about schemata?

Schemata are typically written for applications:
• I have a particular task I want to accomplish
• What metadata do I need to accomplish it?
• Write a schema.

But...

Now we have lots of schemata sitting around
• They may contain overlapping information
• Different ways of expressing the same information
• Each schema is used for a small number of tasks and understood by a small number of applications
• May need to reference elements in another schema, or aggregate elements from multiple schemata

A Unified View of Metadata

Given all of the current metadata development efforts, Curator is promoting a unified view of metadata
• Metadata reuse must be a priority
• Metadata aggregation is key: schemata built (generated!) from repository of existing metadata elements (let’s call them types)

We must think conceptually first and then syntactically; ideally, all groups will agree at both levels

What’s In a Schema?

[Figure: an XML Schema (e.g., gridspec.xsd) shown as a collection of XML types: GridTile, ContactRegion, Boundary, GridDescriptor]

These are syntactic and conceptual constructs

Re-using schema elements

How do I best use/re-use metadata elements from (multiple) schema(ta) to accomplish my particular application?

You need:
• A conceptual understanding of the “types” (concepts) in the schema: Glossary
• The syntactic representation of that type (so you can actually use it in implementations): XML Type Library

WE ARE HERE

Multi-Schema Semantic Glossary

Community-wide glossary of metadata types/concepts from multiple schemata

Concepts aggregated into a centralized glossary

Schema authors and users can get explanations/definitions of metadata elements. Examples:
• What does the contact_region tag mean in the Gridspec schema?
• What goes under the intent tag in the PMIOD?
• What is a potential model anyway?

Multi-Schema Semantic Glossary

For each metadata concept provide:
• Human-readable definition
• Source schema
• Example usage
• Change notes/provenance
• Semantic relationships with other concepts (e.g., broader than, narrower than, part of, parent of, synonym, etc.)
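One way to picture the per-concept record listed above is as a small data structure. This is a minimal Python sketch; the GlossaryEntry class and its attribute names are assumptions made for illustration, not part of the Curator glossary.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class GlossaryEntry:
    """One metadata concept; attribute names are illustrative assumptions."""
    label: str
    definition: str
    source_schema: str
    examples: list[str] = field(default_factory=list)
    change_notes: list[str] = field(default_factory=list)
    # relationship name -> labels of related concepts, e.g. "synonym": ["run"]
    relations: dict[str, list[str]] = field(default_factory=dict)

    def relate(self, relation: str, other: str) -> None:
        """Record a semantic relationship to another concept."""
        self.relations.setdefault(relation, []).append(other)


entry = GlossaryEntry(
    label="potential model",
    definition="A set of components at the source code level that can "
               "potentially form an executable model.",
    source_schema="Potential Model (NMM/Curator)",
)
entry.relate("related", "model")
print(entry.relations)
```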

Glossary Design

Schema authors embed descriptions directly inside each XML schema
• Keep the human-readable definitions close to the formal syntactic definitions
• When schema is updated, it is easy to update glossary

Glossary entries from distributed schemata are harvested (nightly?) and placed into centralized glossary (alternatively, live access?)

Simple interface allows users to query glossary for concepts

Glossary Design

Simple Knowledge Organization System (SKOS) data model for glossary entries
• http://www.w3.org/2004/02/skos/
• SKOS supports knowledge organization systems like glossaries, thesauri, taxonomies, etc.

RDF based – move the community toward languages with higher semantics (eventually get down to dataset level)

Sample SKOS RDF (Basic)

<skos:Concept rdf:about="http://.../schema/1.0#PotentialModel">
  <skos:prefLabel>potential model</skos:prefLabel>
  <skos:definition>
    A set of components at the source code level that can potentially form an executable model....
  </skos:definition>
</skos:Concept>

Where should glossary entries be stored?

Example Annotated Schema

...
<xsd:complexType name="PotentialModel">
  <xsd:annotation>
    <xsd:documentation>
      <skos:Concept rdf:about="http://.../schema/1.0#PotentialModel">
        <skos:prefLabel>potential model</skos:prefLabel>
        <skos:definition>
          A set of components at the source code level that can potentially form an executable model.
        </skos:definition>
      </skos:Concept>
    </xsd:documentation>
  </xsd:annotation>
</xsd:complexType>
...
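Harvesting glossary entries from an annotated schema like the one above could look roughly as follows. This is a sketch using only Python's standard library; the harvest function, the dictionary keys, and the example.org URI (standing in for the elided schema URL) are assumptions, and the actual Curator harvester may work differently.

```python
import xml.etree.ElementTree as ET

NS = {
    "xsd": "http://www.w3.org/2001/XMLSchema",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}

# Sample annotated schema; the example.org URI is a placeholder.
SCHEMA = """\
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:skos="http://www.w3.org/2004/02/skos/core#"
            xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <xsd:complexType name="PotentialModel">
    <xsd:annotation>
      <xsd:documentation>
        <skos:Concept rdf:about="http://example.org/schema/1.0#PotentialModel">
          <skos:prefLabel>potential model</skos:prefLabel>
          <skos:definition>
            A set of components at the source code level that can
            potentially form an executable model.
          </skos:definition>
        </skos:Concept>
      </xsd:documentation>
    </xsd:annotation>
  </xsd:complexType>
</xsd:schema>
"""


def harvest(schema_text):
    """Collect the SKOS concepts embedded in top-level complexType annotations."""
    root = ET.fromstring(schema_text)
    entries = []
    for ctype in root.iterfind("xsd:complexType", NS):
        path = "xsd:annotation/xsd:documentation/skos:Concept"
        for concept in ctype.iterfind(path, NS):
            definition = concept.findtext("skos:definition", default="", namespaces=NS)
            entries.append({
                "type": ctype.get("name"),
                "uri": concept.get(f"{{{NS['rdf']}}}about"),
                "label": concept.findtext("skos:prefLabel", default="", namespaces=NS),
                # Collapse the indentation whitespace inside the definition.
                "definition": " ".join(definition.split()),
            })
    return entries


entries = harvest(SCHEMA)
for entry in entries:
    print(entry["type"], "->", entry["label"])
```

Keeping the definition inside the schema means a nightly harvest only has to re-read each schema file to pick up changes.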

Sample SKOS RDF Triples

esc:PotentialModel   rdf:type          skos:Concept
esc:PotentialModel   skos:prefLabel    ‘potential model’
esc:PotentialModel   skos:definition   ‘A set of components at the source code level that can potentially form an executable model.’
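The triples above can be mimicked with plain tuples to show how subject/predicate/object pattern matching works. This is only a toy sketch in Python: the match helper is hypothetical, and a real deployment would more likely hold the triples in an RDF store and query them with SPARQL.

```python
# Each triple is (subject, predicate, object), mirroring the slide.
TRIPLES = [
    ("esc:PotentialModel", "rdf:type", "skos:Concept"),
    ("esc:PotentialModel", "skos:prefLabel", "potential model"),
    ("esc:PotentialModel", "skos:definition",
     "A set of components at the source code level that can potentially "
     "form an executable model."),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# All concepts in the glossary, then the preferred label of one of them.
concepts = [s for s, _, _ in match(TRIPLES, p="rdf:type", o="skos:Concept")]
labels = [o for _, _, o in match(TRIPLES, s=concepts[0], p="skos:prefLabel")]
print(concepts, labels)
```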

Other SKOS Fields

<skos:Concept rdf:about="http://purl.oclc.org/NMM/Model/011/#model">
  <skos:prefLabel>model</skos:prefLabel>
  <skos:definition>
    The root element of a NMM Model description. There is one model per xml file. This model can have one or more related component configurations.
  </skos:definition>
  <skos:altLabel>simulation</skos:altLabel>
  <skos:altLabel>job</skos:altLabel>
  <skos:altLabel>run</skos:altLabel>
  <skos:example>UK Met Office Unified Model</skos:example>
  <skos:related rdf:resource="http://...NMMPotentialModel/1.0/#PotentialModel"/>
  <skos:changeNote rdf:parseType="Resource">
    <rdf:value>The label 'model' was changed from NMM_Model.</rdf:value>
    <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">
      <foaf:Person xmlns:foaf="http://xmlns.com/foaf/0.1/">
        <foaf:name>Katherine Bouton</foaf:name>
        <foaf:mbox rdf:resource="mailto:..."/>
      </foaf:Person>
    </dc:creator>
    <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2007-02-02</dc:date>
  </skos:changeNote>
  <dc:source rdf:resource="http://purl.oclc.org/NMM/Model"/>
</skos:Concept>

Semantic Relationships

[Figure: graph of the concepts esc:PotentialModel, nmm:Component, nmm:Model, and prism:Model, linked by skosx:childOf, skos:related, and skos:synonym relationships]

Putting it all Together

[Figure: glossary architecture, in five numbered stages]
1. Namespace schemata (e.g., NMM, Curator-NMM, Gridspec, ESG), marked up with glossary metadata (terms, definitions, relationships)
2. Aggregate Glossary RDF (glossary metadata harvested nightly)
3. Joseki RDF Server (SPARQL queries)
4. Glossary Web Application on Tomcat (www.earthsystemcurator.org/glossary)
5. Client Web Browser (search for terms, view relationships, etc.)

More info:

http://glossary.earthsystemcurator.org/
http://www.earthsystemcurator.org/index.php?option=com_content&task=view&id=54&Itemid=84

Glossary Interface

[Screenshot: glossary interface, with callouts for Search, Schemata to Include, Concept List, Concept Details, and Links to related concepts]

Syntactic Metadata Re-use

So, if we agree on the concepts, what about the syntax? (i.e., XML representation)
• Concept = XML Type

How do we share XML types from multiple schemata across the community?

One idea: XML Type Library (or Catalog or Repository)
• “Preliminary Research”
• This is NOT the same thing as a single complex schema that describes everything: types are first-class objects and can be manipulated individually

How does an XML Type Library work?

Operations (web service?)
• Submit an XML type
• Get a list of all types
• Query for types
• Validate a type (Is my XML fragment a valid X?)
• Type membership (What types does my XML fragment fit?)
• Generate an XML Schema
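Several of the operations listed above can be sketched as an in-memory Python class. The TypeLibrary class, its method names, and the sample documentation text are hypothetical. Validation and type membership are left out because Python's standard library has no XML Schema validator, so they would need an external one.

```python
import xml.etree.ElementTree as ET

XSD = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xsd", XSD)  # serialize with the familiar xsd: prefix


class TypeLibrary:
    """Hypothetical in-memory library; a real one might sit behind a web service."""

    def __init__(self):
        self._types = {}  # type name -> parsed xsd:complexType element

    def submit(self, xml_text):
        """Store one xsd:complexType definition and return its name."""
        element = ET.fromstring(xml_text)
        if element.tag != f"{{{XSD}}}complexType":
            raise ValueError("expected an xsd:complexType element")
        name = element.get("name")
        if not name:
            raise ValueError("complexType needs a name attribute")
        self._types[name] = element
        return name

    def list_types(self):
        """Names of all stored types, sorted."""
        return sorted(self._types)

    def query(self, keyword):
        """Names of types whose name or embedded text mentions the keyword."""
        keyword = keyword.lower()
        return sorted(
            name for name, element in self._types.items()
            if keyword in name.lower()
            or keyword in "".join(element.itertext()).lower()
        )

    def generate_schema(self, names):
        """Build an xsd:schema document containing only the selected types."""
        schema = ET.Element(f"{{{XSD}}}schema")
        for name in names:
            schema.append(self._types[name])
        return ET.tostring(schema, encoding="unicode")


library = TypeLibrary()
library.submit(
    f'<xsd:complexType xmlns:xsd="{XSD}" name="GridTile">'
    "<xsd:annotation><xsd:documentation>Illustrative text about a mosaic"
    "</xsd:documentation></xsd:annotation></xsd:complexType>"
)
library.submit(f'<xsd:complexType xmlns:xsd="{XSD}" name="Boundary"/>')

all_types = library.list_types()
hits = library.query("mosaic")
schema_text = library.generate_schema(["GridTile"])
print(all_types, hits)
print(schema_text)
```

Because each type is stored individually, an application schema can be generated from whichever subset of types a task needs.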

How does an XML Type Library work?

What metadata is available per type?
• Definition (e.g., XML Schema complexType)
• SKOS Glossary entry (for queries)
• Example usage scenarios
• Dependencies on other types
• Versioning metadata
• Available operations/web services
  • “If you have an XML fragment of type X, you can use the following services...”

Use Case: Submit Type

<xsd:complexType name="PotentialModel">
  <xsd:annotation>
    <xsd:documentation>
      <skos:Concept rdf:about="http://.../schema/1.0#PotentialModel">
        <skos:prefLabel>potential model</skos:prefLabel>
        <skos:definition>A set of components at the source code...</skos:definition>
      </skos:Concept>
    </xsd:documentation>
  </xsd:annotation>
</xsd:complexType>

[Figure: types are extracted from existing schemata and submitted to the Type Library]

Use Case: Validation

XML Fragment:

<horizontal_coord_system type="cartesian">
  <x_axis>...</x_axis>
  <y_axis>...</y_axis>
</horizontal_coord_system>

[Figure: the XML fragment is sent to the Type Library for validation, which answers “Valid” or “Invalid”]

Use Case: Find Services

XML Fragment:

<horizontal_coord_system type="cartesian">
  <x_axis>...</x_axis>
  <y_axis>...</y_axis>
</horizontal_coord_system>

[Figure: the XML fragment is sent to the Type Library with a Find Services request]

List of available services based on type of fragment:
• Interpolate_Service()
• Extract_Variable()
• Massage_Data()
• Another_Operation()
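The “find services” lookup above might be sketched as a simple registry keyed by type. In this Python sketch the SERVICES_BY_TYPE registry and the find_services function are hypothetical, and the fragment's root tag is assumed to stand in for its XML type; a real library would determine type membership by validating against the stored type definitions.

```python
import xml.etree.ElementTree as ET

# Hypothetical registry: type name -> service names (names taken from the slide).
SERVICES_BY_TYPE = {
    "horizontal_coord_system": [
        "Interpolate_Service",
        "Extract_Variable",
        "Massage_Data",
        "Another_Operation",
    ],
}


def find_services(fragment_text, registry=SERVICES_BY_TYPE):
    """Assumption: the fragment's root tag identifies its XML type."""
    root = ET.fromstring(fragment_text)
    return list(registry.get(root.tag, []))


fragment = """<horizontal_coord_system type="cartesian">
  <x_axis>...</x_axis>
  <y_axis>...</y_axis>
</horizontal_coord_system>"""

services = find_services(fragment)
print(services)
```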

Some Conclusions

With a large amount of metadata activity already in progress, metadata re-use must be a priority

Conceptual understanding is essential
• Adoption of a glossary of concepts

Syntactic agreement is desirable
• Concepts assigned concrete XML types and stored in a library

Some Haiku

Retile the Shower
Tessellated Mosaic
First Write a Gridspec

Forever summer
questions and answers
Curator complete

Potential Model
Like a cool autumn breeze
Potentially mad

Extra Slides...

Example Gridspec Applications

Not written for one particular application: general grid metadata has many potential uses
• IPCC Model Documentation table
• Moving variables to common grid for analysis
• Regridding vertical from 24 to 40 levels

There are two levels, conceptual and syntactic; ideally, we would agree at both of these levels!
• If we only have conceptual agreement, we can still interoperate, but must do transformations

Type Reuse Scenario

[Figure: partial schemata derived from a full schema]

Application: NARCCAP Vertical Interpolation

[Figure: a partial schema taken from Gridspec.xsd, containing the description of the vertical coordinate scheme]

Metadata required for NARCCAP experiment: interpolate from 24 to 40 vertical levels

Schema Aggregation Scenario

[Figure: XML types drawn from Schema A, Schema B, Schema C, and Schema D are aggregated into an Application Schema]

Application: Component Compatibility Checking

Source schemata: NMM Component, Gridspec, Coupling Spec (PMIOD)

Application Schema:
• Technical details (e.g., supported platforms)
• Required coupling fields
• Horizontal grid descriptor

All metadata required for compatibility checking of two components
