ODM present and futures for internal discussion 5 September 2007 Ilya Zaslavsky, Dave Valentine, Tom Whitenack, Catharine van Ingen


Preamble

• We’ve now had experience using the first version of the ODM information model including the WaterOneFlow services, WaterML schema and ODM database schema.

• We’ve learned a lot.

• It’s time to start work on the next version while continuing to support the testbed and other users.

• Before beginning that work, let’s look at what we’ve learned and what has changed.

PART I: LESSONS LEARNED

ODM is an Information Model

• ODM defines a canonical information model and semantics for hydrologic observations
• ODM also provides a relational implementation of the model, tuned to locally collected observations, typically under the control of a PI
• There are many application scenarios for which the ODM information model should be useful, including:
– data discovery via site catalogs
– data transformations and versioning within the database
– managing streaming data
– long-term preservation of observations data
– community annotation of hydrologic data and model results, …
• Within the same application scenario, the data content may differ from canonical ODM recommendations

Learning: The Good News

• The semantics behind ODM are accepted by the community.
• ODM services are standardized.
• ODM service creation and tuning is simple.
• Pairing a data model with a relational database has provided a general storage schema for observations.
– Enables data curation and archive.
• The testbeds have enough to get started.
– Services from key agencies, DASH, and Hydroseek are up and running.
– We will learn a lot from the early community use.

Learning: The Bad News

• The ODM schema is complex.
– Even we get confused as to what the various schema elements mean.
– Many scenarios need only a subset of the schema. For example, data discovery doesn’t need the DataValues table and real-time data loading doesn’t need Samples.
• The ODM variable structure is confusing.
– Even we get confused as to how to fill out (or search on) the various category or quality fields.
– Unit conversions, temporal granularity and other data attributes seem hidden.
• ODM lacks collection versioning objects and other provenance mechanisms.
– The present ODM structure is centered on the DataValues table. Unless a database has DataValues, we can’t mix discovery information (sites and variables) from different sources in the same database or service. For example, today we need separate services for NWIS daily and NWIS realtime.
– Collections are difficult, and possibly at the wrong granularity. Relating individual data points can be confusing, and such relationships will not scale to very large data sets.
– ODM tries to work around idiosyncrasies of individual collections, but this needs to be matched by adding provenance information to stay true to the sources.
– In “some” scenarios, we need to keep several versions of the data, with all transformations and roll-back abilities.

Enabling Community Development

• To date, CUAHSI HIS has been developed by relatively few people working closely rather than a loosely connected community of small projects.
– We’ve covered many scenarios and needs.
• Enabling access to agency data repositories via web services is a major step forward
• The test beds and other users can leverage the infrastructure to avoid writing everything from scratch
– We haven’t thought of everything.
• An example is the photos table in the BearRiverOD sample database.
• It’s time to enable all of the community to extend the infrastructure and develop tools.
– The challenge is how to enable organic growth without organic chaos and confusion.

We need a general checklist and simple tests to ensure that we are all “community aware”

Some chaos is unavoidable, but we shall plan for it

PART II: PROPOSAL

The Proposal

• Focus on data discovery, access, and analysis scenarios
• Layer the information model
– Each layer can be used to solve a specific set of scenarios.
– Layers build on core and on one another.
– All tools written to a specific layer have guarantees of completeness (and a test suite)
• Define both web service and database schema interfaces.
– Direct access to database tables is used where speed requires it or where SSIS, SSAS or other applications need it.
– Build compliance tests to ensure robustness of the interface.
• Define a few “rules of the road” for community extensions.
– How methods, tables, or vocabularies can be added
– How to share extensions with others
• Start with the immediately useful and relatively well understood scenarios.
– First such extension demonstrates the process
– Document what we’re already doing and saying today

ODM Layers High Level Overview

[Figure: layer diagram. Labels shown: Web Service Applications; Database Applications; Web Service methods; Database tables and views; ODMSensor: Streaming Data Loaders; ODMProject: Data Assembly and Analysis; ODMCatalog: Data Discovery and Catalogs; ODMNext ???: Data Archive, Education, Publication; ODMCore.]

Put the Series Catalog at the Center

• Needed for all scenarios from discovery through archive.
• Adding provenance at the series gives clear tracking while being lighter weight than at the data value.
– Include data source information (where did the data come from?)
– Include change information (when was the data series created, last changed, by whom?)
– Identifying vocabularies for variables and sites allows translation and abstraction.
• Build out from here for each scenario, adding tables and rows.

The trade-off: overhead of dealing with a series of one
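As a rough illustration of series-level provenance, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (Sources, SeriesCatalog, Organization, CreatedBy), the site code and the sample rows are assumptions made for this example; they are not the ODM schema.

```python
import sqlite3

# Hypothetical, simplified tables; not the actual ODM schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Sources (
    SourceID     INTEGER PRIMARY KEY,
    Organization TEXT NOT NULL
);
CREATE TABLE SeriesCatalog (
    SeriesID     INTEGER PRIMARY KEY,
    SourceID     INTEGER NOT NULL REFERENCES Sources(SourceID),
    SiteCode     TEXT NOT NULL,
    VariableCode TEXT NOT NULL,
    CreatedDate  TEXT NOT NULL,   -- when was the series created?
    CreatedBy    TEXT NOT NULL    -- by whom?
);
""")

conn.executemany("INSERT INTO Sources VALUES (?, ?)",
                 [(1, "NWIS daily"), (2, "NWIS realtime")])
conn.executemany(
    "INSERT INTO SeriesCatalog VALUES (?, ?, ?, ?, ?, ?)",
    [(1, 1, "SITE-A", "discharge", "2007-09-01", "loader"),
     (2, 2, "SITE-A", "discharge", "2007-09-02", "loader")],
)

# Same site and variable, but two distinct series with distinct sources.
# No data values table is needed to answer this discovery question.
rows = conn.execute("""
    SELECT sc.SeriesID, s.Organization
    FROM SeriesCatalog AS sc
    JOIN Sources AS s ON s.SourceID = sc.SourceID
    WHERE sc.SiteCode = ? AND sc.VariableCode = ?
    ORDER BY sc.SeriesID
""", ("SITE-A", "discharge")).fetchall()
print(rows)
```

Because the source is attached to the series rather than to each data value, the catalog can hold discovery information from several sources even when no values have been loaded.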

Example simplified activities

• I’ve been archiving realtime discharge data and want to compare it to the daily discharge data I just downloaded.
– Different data series distinguish them
• NWIS reports turbidity, grain size, and suspended sediment measurements at different stations and times. I want to use all of them to get the best suspended sediment estimates I can.
– New data series contains computed values from originals
• I’m analyzing phosphorus dynamics
– Need to convert/aggregate different measures
• I’m computing evapotranspiration
– New derived data series depends on series containing air temperature, radiation, latent heat and precipitation
• I want to plot discharge over time and tag the gage by agency, so need to extend the Matlab object.

A Few Observations

• The lower level database design is not intended to be human friendly
– Separate the information model from the database model
– Views and automatically generated machine schemas can be built to present the “Information model” that is ODM.
• Plan now for localization
– Language and character sets change as you move around the world
– Location descriptions and addresses change (e.g. Name and Address Markup Language)
• While web service calls are stateless, applications often need to preserve state across calls
– Use identifiers (analogous to cookies) for efficiency and robustness
– At the same time, avoid using identifiers for persistence
• Writing software is cheap; maintaining software is expensive
– Resist the urge to write more software than you have to! (individual software is not a deliverable in infrastructure projects)
– Leverage, reuse, and document the software you have, especially infrastructure software; follow standards and common models

Example checklist items

• Do all controlled vocabularies have a default (none) or (unknown) value?

• Does the implementation have at least one data source? Does a testbed have a data source identifying itself?

• Are all ODM tables present?

• Are any additional tables not prefixed by ODM?

• Does the series catalog correctly represent the contents of the Data Series table?
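A few of these checklist items could be automated as simple tests against a database. The sketch below uses Python's built-in sqlite3 module; the required-table list, the convention that controlled vocabulary tables end in "CV" and expose a Term column, and the sample rows are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical names: the required-table list and the *CV naming
# convention are assumptions for this sketch.
REQUIRED_TABLES = {"Sources", "SeriesCatalog", "VariableNameCV"}
DEFAULT_TERMS = {"(none)", "(unknown)"}

def table_names(conn):
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    return {name for (name,) in rows}

def run_checklist(conn):
    """Return a dict mapping each check name to True (pass) or False."""
    tables = table_names(conn)
    results = {"required tables present": REQUIRED_TABLES <= tables}

    # Every controlled vocabulary table (assumed to end in 'CV' and to
    # have a Term column) should carry a default term.
    cv_ok = True
    for table in sorted(t for t in tables if t.endswith("CV")):
        terms = {term for (term,) in conn.execute(f'SELECT Term FROM "{table}"')}
        cv_ok = cv_ok and bool(terms & DEFAULT_TERMS)
    results["vocabularies have a default term"] = cv_ok

    has_source = False
    if "Sources" in tables:
        (count,) = conn.execute("SELECT COUNT(*) FROM Sources").fetchone()
        has_source = count > 0
    results["at least one data source"] = has_source
    return results

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Sources (SourceID INTEGER PRIMARY KEY, Organization TEXT);
CREATE TABLE SeriesCatalog (SeriesID INTEGER PRIMARY KEY, SourceID INTEGER);
CREATE TABLE VariableNameCV (Term TEXT PRIMARY KEY);
INSERT INTO Sources VALUES (1, 'This testbed');
INSERT INTO VariableNameCV VALUES ('(unknown)'), ('Discharge');
""")
report = run_checklist(conn)
for check, passed in report.items():
    print("PASS" if passed else "FAIL", check)
```

Returning a named result per check, rather than stopping at the first failure, lets a testbed see everything that needs attention in one run.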

Next Steps

• Define ODMCore, a.k.a. WaterOneFlow Tables
• Define ODMCatalog proposal (David Valentine and friends)
– Supports data discovery from agency web sites and testbed network servers.
– Used to implement all catalog servers.
– Web Service interface is a subset of WaterML
– Database tables include those necessary for DASH
• Define ODMSensor proposal (Jeff Horsburgh and friends)
– Supports data streaming from real time sensors.
– Includes sensor configurations and definitions
– WebService extensions (for monitoring)
– Only those database tables and columns necessary for streaming, assuming the initial configuration (variables, sites, etc.) exists.
• Compare these proposals
– What would it mean to migrate?
– What’s still to do?
• No change to testbeds at this time.

PART III: INITIAL STEPS

SeriesCatalog at the Center

• The series catalog is the primary collection object or “data folder”

• All provenance and versioning happens here

– Larger groupings for analysis or archive can be constructed via a spline table

– M:N derivations tracked by the same spline table replacing the DerivedFrom table
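One way to picture a single table that carries both groupings and M:N derivations is a link (association) table between series. The sketch below reads the slide's "spline table" that way, which is an assumption; the names SeriesLink, OwnerSeriesID, MemberSeriesID and LinkType are hypothetical, and the example reuses the evapotranspiration case from the earlier slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE SeriesCatalog (
    SeriesID INTEGER PRIMARY KEY,
    Label    TEXT NOT NULL
);
-- One row per relationship between two series. The owner is the derived
-- series (or the group); the member is an input (or a group member).
CREATE TABLE SeriesLink (
    OwnerSeriesID  INTEGER NOT NULL REFERENCES SeriesCatalog(SeriesID),
    MemberSeriesID INTEGER NOT NULL REFERENCES SeriesCatalog(SeriesID),
    LinkType       TEXT NOT NULL,   -- 'derived' or 'group'
    PRIMARY KEY (OwnerSeriesID, MemberSeriesID, LinkType)
);
""")
conn.executemany("INSERT INTO SeriesCatalog VALUES (?, ?)", [
    (1, "air temperature"), (2, "radiation"), (3, "latent heat"),
    (4, "precipitation"), (5, "evapotranspiration (computed)"),
])
# Series 5 is derived from series 1-4: many inputs feed one output.
conn.executemany("INSERT INTO SeriesLink VALUES (?, ?, 'derived')",
                 [(5, member) for member in (1, 2, 3, 4)])

def inputs_of(series_id):
    """Return the labels of the series that series_id was derived from."""
    rows = conn.execute("""
        SELECT sc.Label
        FROM SeriesLink AS l
        JOIN SeriesCatalog AS sc ON sc.SeriesID = l.MemberSeriesID
        WHERE l.OwnerSeriesID = ? AND l.LinkType = 'derived'
        ORDER BY sc.SeriesID
    """, (series_id,))
    return [label for (label,) in rows]

print(inputs_of(5))
```

The same table could hold 'group' rows for analysis or archive collections, so relationships are recorded per series rather than per data value.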

Sources near the Center

• All data series are associated with a source
– Includes agencies, individual investigators, data archives, etc.
• Each source can define a variable and site vocabulary
– Translation tables between sources are built up over time
– A given source can always reuse an existing vocabulary
– The authoritative list of sources is kept somewhere
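A translation table between source vocabularies can be as small as a mapping from (source, code) to a shared concept. The sketch below is plain Python; the source names, the codes and the concept names are all invented for illustration.

```python
# (source, source-specific code) -> shared concept name.
# Every entry here is hypothetical; real tables would be built up over time.
TRANSLATIONS = {
    ("AgencyA", "Q"): "discharge",
    ("AgencyB", "streamflow_daily"): "discharge",
    ("AgencyB", "turb"): "turbidity",
}

def to_shared(source, code):
    """Translate a source-specific variable code to the shared concept.

    Returns None when no mapping has been recorded yet.
    """
    return TRANSLATIONS.get((source, code))

def codes_for(concept):
    """List every (source, code) pair known to mean the given concept."""
    return sorted(key for key, value in TRANSLATIONS.items() if value == concept)

print(to_shared("AgencyA", "Q"))
print(codes_for("discharge"))
```

A source that reuses an existing vocabulary would simply add no rows of its own.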

SeriesCatalog Changes

• Columns that duplicate foreign key table columns, such as VariableCode, are removed
– A view can (and should) be used
– All foreign keys are identity columns to allow localization
• New provenance information added to track create and modify actions
• Source replaced by indirection through Site and Variable
• DataType and General Category replaced by indirection through Variable
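To show how a view can stand in for the removed duplicate columns, here is a minimal sketch with Python's built-in sqlite3 module. The tables are cut down to a few columns and the view name SeriesCatalogView is hypothetical; only the idea of storing integer keys and joining for readable columns comes from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Variables (
    VariableID   INTEGER PRIMARY KEY,
    VariableCode TEXT NOT NULL,
    DataType     TEXT NOT NULL
);
CREATE TABLE Sites (
    SiteID   INTEGER PRIMARY KEY,
    SiteCode TEXT NOT NULL
);
-- The base table stores only integer keys.
CREATE TABLE SeriesCatalog (
    SeriesID   INTEGER PRIMARY KEY,
    SiteID     INTEGER NOT NULL REFERENCES Sites(SiteID),
    VariableID INTEGER NOT NULL REFERENCES Variables(VariableID)
);
-- The view restores the readable columns by joining.
CREATE VIEW SeriesCatalogView AS
SELECT sc.SeriesID, s.SiteCode, v.VariableCode, v.DataType
FROM SeriesCatalog AS sc
JOIN Sites AS s ON s.SiteID = sc.SiteID
JOIN Variables AS v ON v.VariableID = sc.VariableID;

INSERT INTO Variables VALUES (1, 'discharge', 'Average');
INSERT INTO Sites VALUES (1, 'SITE-A');
INSERT INTO SeriesCatalog VALUES (1, 1, 1);
""")
row = conn.execute("SELECT * FROM SeriesCatalogView").fetchone()
print(row)
```

Since the code lives only in Variables, renaming or localizing it there changes what the view reports without touching SeriesCatalog.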

Provenance additions

• ProcessGroup table holds descriptive information for one or more data series

• DataSeries and ProcessGroup have CreatedDate/ModifiedDate/LastChangerID summary information

– A log table to track all changes can be added if desired
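The summary columns and the optional log table could work together roughly as follows. This is a sketch with Python's built-in sqlite3 module: the column names CreatedDate, ModifiedDate and LastChangerID come from the slide, while the DataSeriesLog table, the trigger and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DataSeries (
    SeriesID      INTEGER PRIMARY KEY,
    Description   TEXT NOT NULL,
    CreatedDate   TEXT NOT NULL,
    ModifiedDate  TEXT NOT NULL,
    LastChangerID TEXT NOT NULL
);
-- Optional log: one row per change, so history survives later edits.
CREATE TABLE DataSeriesLog (
    LogID       INTEGER PRIMARY KEY,
    SeriesID    INTEGER NOT NULL,
    ChangedDate TEXT NOT NULL,
    ChangerID   TEXT NOT NULL
);
CREATE TRIGGER LogDataSeriesUpdate AFTER UPDATE ON DataSeries
BEGIN
    INSERT INTO DataSeriesLog (SeriesID, ChangedDate, ChangerID)
    VALUES (NEW.SeriesID, NEW.ModifiedDate, NEW.LastChangerID);
END;
""")

def modify_series(series_id, description, when, who):
    """Change a series and stamp the summary columns in one statement."""
    conn.execute(
        "UPDATE DataSeries SET Description = ?, ModifiedDate = ?, "
        "LastChangerID = ? WHERE SeriesID = ?",
        (description, when, who, series_id),
    )

conn.execute(
    "INSERT INTO DataSeries VALUES (1, 'raw', '2007-09-01', '2007-09-01', 'alice')"
)
modify_series(1, "quality controlled", "2007-09-03", "bob")
modify_series(1, "gap filled", "2007-09-05", "carol")

# The summary columns hold only the latest change; the log holds them all.
summary = conn.execute(
    "SELECT ModifiedDate, LastChangerID FROM DataSeries WHERE SeriesID = 1"
).fetchone()
history = conn.execute(
    "SELECT ChangedDate, ChangerID FROM DataSeriesLog ORDER BY LogID"
).fetchall()
print(summary, history)
```

Keeping the summary on the series row answers "who touched this last?" cheaply, and the log can be left out where full history is not needed.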

Site Changes

• Site properties separated from site identifier
– A view can be used for common properties
• Properties may be reported in different units (e.g. multiple spatial reference datums)
• Properties may change over time (e.g. a resurvey may change latitude/longitude)
• The properties of interest depend on the science (e.g. including IGBP class)
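Separating properties from the site identifier could look like the following sketch, again with Python's built-in sqlite3 module. The SiteProperties table, its columns, the datum label and the coordinate values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Sites (SiteID INTEGER PRIMARY KEY, SiteCode TEXT NOT NULL);
-- One row per property, per unit or datum, per date it became valid.
CREATE TABLE SiteProperties (
    SiteID    INTEGER NOT NULL REFERENCES Sites(SiteID),
    Name      TEXT NOT NULL,   -- e.g. 'Latitude'
    Value     TEXT NOT NULL,
    Units     TEXT NOT NULL,   -- units or spatial reference datum
    ValidFrom TEXT NOT NULL,   -- ISO date, so text comparison orders correctly
    PRIMARY KEY (SiteID, Name, Units, ValidFrom)
);
INSERT INTO Sites VALUES (1, 'SITE-A');
INSERT INTO SiteProperties VALUES
    (1, 'Latitude', '41.7000', 'degrees, datum A', '2000-01-01'),
    (1, 'Latitude', '41.7003', 'degrees, datum A', '2007-06-01');
""")

def property_as_of(site_id, name, units, as_of):
    """Return the most recent value recorded on or before as_of, or None."""
    row = conn.execute("""
        SELECT Value FROM SiteProperties
        WHERE SiteID = ? AND Name = ? AND Units = ? AND ValidFrom <= ?
        ORDER BY ValidFrom DESC
        LIMIT 1
    """, (site_id, name, units, as_of)).fetchone()
    return row[0] if row else None

print(property_as_of(1, "Latitude", "degrees, datum A", "2005-01-01"))
print(property_as_of(1, "Latitude", "degrees, datum A", "2007-09-05"))
```

A resurvey adds a row instead of overwriting one, and a science-specific property such as an IGBP class is just another Name.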
