Supersite Relational Database Project (Data Portal?): a sub-project of the St. Louis Midwest Supersite Project. Draft of the November 16, 2001 Presentation to the Supersite Program, Nov 13, 2001

Page 1: 2004-11-13 Supersite Relational Database Project: (Data Portal?)

Supersite Relational Database Project:

(Data Portal?)

a sub-project of the St. Louis Midwest Supersite Project

Draft of the November 16, 2001 Presentation to the Supersite Program

Nov 13, 2001

Page 2: 2004-11-13 Supersite Relational Database Project: (Data Portal?)

Purpose of the Supersite Relational Database System

Design, populate and maintain a database which:

– Includes monitoring data from Supersites and auxiliary projects

– Facilitates cross-Supersite [regional or comparative] data analyses

– Supports the analyses by a variety of research groups

Page 3: 2004-11-13 Supersite Relational Database Project: (Data Portal?)

Stated Features of Relational Data System

• Data Input:

– Data input electronically through FTP, Web browser (CD, if necessary)

– Modest amount of metadata on sites, instruments, data sources/version, contacts etc.

– Data structures, formats and submission procedures simple for the submitters

• Data Storage and Maintenance:

– Data stored in relational database(s), possibly distributed over multiple servers

– Maintenance of data holdings catalog and request logs

– Data updates quarterly

• Data Access:

– Access method: User-friendly web access by multiple authorized users

– Data finding: Metadata catalog of datasets

– Data query: by parameter, method, location, date/time, or other metadata

– Data output format: ASCII, spreadsheet, other (dbf, XML)
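As a rough illustration of the output formats listed above, the sketch below writes the same query result as ASCII (comma-separated) text and as XML. The field names and the `dataset`/`obs` element names are assumptions made for this example; the slides do not specify an output layout.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical column names for one observation record.
FIELDS = ["site_id", "obs_datetime", "parameter_id", "obs_value"]


def to_csv(rows):
    """Render a list of dict records as comma-separated ASCII text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_xml(rows):
    """Render the same records as a small XML document."""
    root = ET.Element("dataset")
    for row in rows:
        obs = ET.SubElement(root, "obs")
        for field in FIELDS:
            ET.SubElement(obs, field).text = str(row[field])
    return ET.tostring(root, encoding="unicode")


sample = [{"site_id": "A1", "obs_datetime": "2001-07-01",
           "parameter_id": "PM25", "obs_value": 12.5}]
csv_text = to_csv(sample)
xml_text = to_xml(sample)
```

Both functions take the same list of records, so a web front end could offer the format as a simple user choice.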

Page 4: 2004-11-13 Supersite Relational Database Project: (Data Portal?)

Database Schema Design

• Fact Table: A fact table (yellow) contains the main data of interest, i.e. the pollutant concentration by location, day, pollutant and measurement method.

• Star Schema consists of a central fact table surrounded by de-normalized dimensional tables (blue) describing the sites, parameters, methods.

• Snowflake Schema is an extension of the star schema where each point of the star ‘explodes’ into further fully normalized tables, expanding the description of each dimension.

• Snowflake schema can capture all the key data content and relationships in full detail. It is well suited for capturing and encoding complex monitoring data into a robust relational database.

Page 5: 2004-11-13 Supersite Relational Database Project: (Data Portal?)

Abstract (Minimal) Star Schema for Integrative, Cross-Supersite, Spatio-Temporal Analysis

• The minimal Site table includes SiteID, Name and Lat/Lon.

• The minimal Parameter table consists of ParameterID, Description and Unit.

• The time dimensional table is usually skipped since time is self-describing

• The minimal Fact (Data) table consists of the Obs_Value and the three dimensional codes for Obs_DateTime, Site_ID and Parameter_ID

For integrative, cross-Supersite analysis, with data queries by time, location and parameter, the database must have time, location and parameter as dimensions.

The above minimal (multidimensional) schema has been used in the CAPITA data exploration software, Voyager, for the past 22 years, encoding 1000+ datasets.

Most Supersite data require a more elaborate schema to fully capture the content
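The minimal star schema described on this slide can be sketched with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: the column types, the key constraints, the ISO-8601 text encoding of `Obs_DateTime`, and the sample sites and values are invented for the example and are not taken from the SRDS design.

```python
import sqlite3


def build_minimal_star(conn):
    """Create the Site and Parameter dimension tables and the Fact table."""
    conn.executescript("""
        CREATE TABLE Site (
            Site_ID INTEGER PRIMARY KEY,
            Name    TEXT NOT NULL,
            Lat     REAL NOT NULL,
            Lon     REAL NOT NULL
        );
        CREATE TABLE Parameter (
            Parameter_ID INTEGER PRIMARY KEY,
            Description  TEXT NOT NULL,
            Unit         TEXT NOT NULL
        );
        -- Time is self-describing, so there is no time dimension table.
        CREATE TABLE Fact (
            Obs_DateTime TEXT    NOT NULL,
            Site_ID      INTEGER NOT NULL REFERENCES Site(Site_ID),
            Parameter_ID INTEGER NOT NULL REFERENCES Parameter(Parameter_ID),
            Obs_Value    REAL,
            PRIMARY KEY (Obs_DateTime, Site_ID, Parameter_ID)
        );
    """)


def query_cube(conn, start, end, parameter_id, bbox):
    """Select a space/time/parameter sub-cube; bbox is (lat_min, lat_max, lon_min, lon_max)."""
    lat_min, lat_max, lon_min, lon_max = bbox
    return conn.execute("""
        SELECT f.Obs_DateTime, s.Name, p.Description, f.Obs_Value, p.Unit
        FROM Fact f
        JOIN Site s ON s.Site_ID = f.Site_ID
        JOIN Parameter p ON p.Parameter_ID = f.Parameter_ID
        WHERE f.Obs_DateTime >= ? AND f.Obs_DateTime < ?
          AND f.Parameter_ID = ?
          AND s.Lat BETWEEN ? AND ? AND s.Lon BETWEEN ? AND ?
        ORDER BY f.Obs_DateTime, s.Name
    """, (start, end, parameter_id, lat_min, lat_max, lon_min, lon_max)).fetchall()


def demo():
    conn = sqlite3.connect(":memory:")
    build_minimal_star(conn)
    # Made-up sites and values, for illustration only.
    conn.executemany("INSERT INTO Site VALUES (?,?,?,?)",
                     [(1, "Site A", 38.6, -90.2), (2, "Site B", 29.7, -95.3)])
    conn.execute("INSERT INTO Parameter VALUES (1, 'PM2.5 mass', 'ug/m3')")
    conn.executemany("INSERT INTO Fact VALUES (?,?,?,?)", [
        ("2001-07-01T00:00", 1, 1, 12.5),
        ("2001-07-01T00:00", 2, 1, 9.0),
        ("2001-07-02T00:00", 1, 1, 15.0),
    ])
    # One day, one parameter, a bounding box that contains only Site A.
    return query_cube(conn, "2001-07-01", "2001-07-02", 1, (35.0, 40.0, -95.0, -85.0))
```

The single `query_cube` function covers the three query dimensions named on the slide; any richer schema would add joins but keep the same query shape.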

Page 6: 2004-11-13 Supersite Relational Database Project: (Data Portal?)

Extended Star Schema for SRDS

The Supersite program employs a variety of instruments, sampling methods and procedures

Hence, at least one additional dimension table is needed for Methods

An example extended star schema encodes the IMPROVE relational database (B. Schichtel)
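A hedged sketch of what adding a Methods dimension could look like, again using `sqlite3`. The `Method` table layout, the `Method_ID` key in the fact table and the sample method descriptions are assumptions for illustration; this is not the IMPROVE schema. The Site and Parameter tables are omitted to keep the example short.

```python
import sqlite3

SCHEMA = """
CREATE TABLE Method (
    Method_ID   INTEGER PRIMARY KEY,
    Description TEXT NOT NULL
);
-- The fact table gains one more dimension key, Method_ID.
CREATE TABLE Fact (
    Obs_DateTime TEXT    NOT NULL,
    Site_ID      INTEGER NOT NULL,
    Parameter_ID INTEGER NOT NULL,
    Method_ID    INTEGER NOT NULL REFERENCES Method(Method_ID),
    Obs_Value    REAL,
    PRIMARY KEY (Obs_DateTime, Site_ID, Parameter_ID, Method_ID)
);
"""


def mean_by_method(conn, parameter_id):
    """Average one parameter separately for each measurement method."""
    return conn.execute("""
        SELECT m.Description, AVG(f.Obs_Value), COUNT(*)
        FROM Fact f
        JOIN Method m ON m.Method_ID = f.Method_ID
        WHERE f.Parameter_ID = ?
        GROUP BY m.Method_ID, m.Description
        ORDER BY m.Method_ID
    """, (parameter_id,)).fetchall()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # Made-up methods and values, for illustration only.
    conn.executemany("INSERT INTO Method VALUES (?,?)",
                     [(1, "filter, gravimetric"), (2, "continuous monitor")])
    conn.executemany("INSERT INTO Fact VALUES (?,?,?,?,?)", [
        ("2001-07-01", 1, 1, 1, 10.0),
        ("2001-07-01", 1, 1, 2, 12.0),
        ("2001-07-02", 1, 1, 1, 14.0),
        ("2001-07-02", 1, 1, 2, 16.0),
    ])
    return mean_by_method(conn, 1)
```

Because the method is part of the fact key, two instruments can report the same parameter at the same site and time without colliding, which is what makes method comparisons possible.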

Page 7: 2004-11-13 Supersite Relational Database Project: (Data Portal?)

Snowflake Example: Central Calif. AQ Study, CCAQS

The CCAQS schema incorporates a rich set of parameters needed for QA/QC (e.g. sample tracking) as well as for data analysis.

The fully relational CCAQS schema permits the enforcing of integrity constraints and it has been demonstrated to be useful for data entry/verification.

However, no two snowflakes are identical. The rich snowflake schemata for one sampling/analysis environment cannot be easily transplanted elsewhere.

More importantly, many of the recorded parameters ‘on the fringes’ are not particularly useful for integrative, cross-supersite, regional analyses.

Page 8: 2004-11-13 Supersite Relational Database Project: (Data Portal?)

Data Portal: Features

• Data reside in their respective home environment. ‘Uprooted’ data in separate databases are not easily updated, maintained, enriched.

• Abstract (universal) query/retrieval facilitates integration and comparison along the key dimensions (space, time, parameter, method)

• The open-architecture data portal (based on Web Services) promotes the building of further value chains: Data Viewers, Data Integration Programs, Automatic Report Generators, etc.

Page 9: 2004-11-13 Supersite Relational Database Project: (Data Portal?)

From Heterogeneous to Homogeneous Schema

• Individual Supersite SQL databases can be queried along the spatial, temporal and parameter dimensions. However, the query to retrieve the same information depends on the schema of the particular database.

• A way to homogenize the distributed data is to access all the data through a Data Adapter using only a subset of the tables/fields from any particular database (red)

• The proposed extracted (abstract) schema is the Minimal Star Schema (possibly expanded ….). The final form of the extracted data schema will be arrived at by consensus.

[Figure: Data Adapter. Extraction of homogeneous data from heterogeneous sources. Labels: Subset used; Abstract Schema; Fact Table.]
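The Data Adapter idea can be sketched as one small function per provider, each reading only the subset of fields it needs and emitting the same record shape. Everything named here (the `Obs` record, the two provider schemas, the column names) is hypothetical and invented for the example.

```python
import sqlite3
from typing import Iterator, NamedTuple


class Obs(NamedTuple):
    """Uniform record in the abstract (minimal star) schema."""
    site_id: str
    obs_datetime: str
    parameter_id: str
    obs_value: float


def make_provider_a():
    # Hypothetical provider with a "long" layout: one row per observation.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE obs (station TEXT, ts TEXT, species TEXT, conc REAL, operator TEXT)")
    conn.execute("INSERT INTO obs VALUES ('A1', '2001-07-01', 'PM25', 12.5, 'jd')")
    return conn


def make_provider_b():
    # Hypothetical provider with a "wide" layout: one column per parameter.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE samples (loc TEXT, sample_date TEXT, pm25 REAL, so4 REAL, flag TEXT)")
    conn.execute("INSERT INTO samples VALUES ('B7', '2001-07-01', 9.0, 3.1, 'V0')")
    return conn


def adapter_a(conn) -> Iterator[Obs]:
    """Use only the four fields that map onto the abstract schema."""
    for row in conn.execute("SELECT station, ts, species, conc FROM obs"):
        yield Obs(*row)


def adapter_b(conn) -> Iterator[Obs]:
    """Unpivot the wide table into one record per parameter."""
    for loc, day, pm25, so4 in conn.execute(
            "SELECT loc, sample_date, pm25, so4 FROM samples"):
        yield Obs(loc, day, "PM25", pm25)
        yield Obs(loc, day, "SO4", so4)


def extract_all():
    rows = list(adapter_a(make_provider_a())) + list(adapter_b(make_provider_b()))
    return sorted(rows)
```

Fields outside the abstract schema (`operator`, `flag` in this example) stay in the home database and are simply not read by the adapter.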

Page 10: 2004-11-13 Supersite Relational Database Project: (Data Portal?)

Federated Data Warehouse Architecture

• Three-tier architecture consisting of:

– Provider Tier: Back-end servers containing heterogeneous data, maintained by the federation members

– Proxy Tier: Retrieves designated Provider data and homogenizes it into common, uniform Datasets

– User Tier: Accesses the Proxy Server and uses the uniform data for presentation, integration or processing

• The Provider servers interact only with the Proxy Server in accordance with the Federation Contract

– The contract sets the rules of interaction (accessible data subsets, types of queries)

– Strong server security measures enforced, e.g. through Secure Sockets Layer

• The data User interacts only with the generic Proxy Server using a flexible Web Services interface

– Generic data queries, applicable to all data in the Warehouse (e.g. space, time, parameter data sub-cube)

– The data query is addressed to a Web Service provided by the Proxy Server of the Federation

– Uniformly formatted, self-describing data packages are handed to the user for presentation or further processing

[Figure: Federated Data Warehouse. Provider Tier (heterogeneous data; member servers SQLServer1, SQLServer2, LegacyServer). Fire wall, Federation Contract. Proxy Tier (proxy server; data homogenization, etc.; SQLDataAdapter1, SQLDataAdapter2, CustomDataAdapter). Web Service, uniform query and data. User Tier (data access and use; data consumption: presentation, processing, integration).]
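The user-facing half of this architecture, a generic sub-cube query answered with a self-describing package, might look roughly like the sketch below. It is an in-process stand-in only: `CubeQuery`, `ProxyServer`, the package keys and the provider callables are assumed names, and the Web Service transport, the firewall and the Federation Contract checks are deliberately left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional, Tuple

# (site_id, obs_datetime, parameter_id, obs_value), already homogenized.
Record = Tuple[str, str, str, float]


@dataclass(frozen=True)
class CubeQuery:
    """Generic space/time/parameter query, valid for every provider."""
    start: str
    end: str
    parameter_id: str
    site_ids: Optional[frozenset] = None  # None means all sites


class ProxyServer:
    def __init__(self):
        self._providers: Dict[str, Callable[[], Iterable[Record]]] = {}

    def register(self, name, fetch):
        """Add a member server, represented here by a callable returning records."""
        self._providers[name] = fetch

    def query(self, q: CubeQuery) -> dict:
        rows = []
        for name, fetch in self._providers.items():
            for site, when, param, value in fetch():
                if param != q.parameter_id:
                    continue
                if not (q.start <= when < q.end):
                    continue
                if q.site_ids is not None and site not in q.site_ids:
                    continue
                rows.append({"provider": name, "site_id": site,
                             "obs_datetime": when, "obs_value": value})
        # The package carries its own description alongside the data.
        return {
            "columns": ["provider", "site_id", "obs_datetime", "obs_value"],
            "parameter_id": q.parameter_id,
            "time_range": [q.start, q.end],
            "rows": sorted(rows, key=lambda r: (r["obs_datetime"], r["site_id"])),
        }


def demo():
    proxy = ProxyServer()
    # Made-up member data, for illustration only.
    proxy.register("member1", lambda: [("A1", "2001-07-01", "PM25", 12.5),
                                       ("A1", "2001-08-01", "PM25", 20.0)])
    proxy.register("member2", lambda: [("B7", "2001-07-01", "PM25", 9.0),
                                       ("B7", "2001-07-01", "SO4", 3.1)])
    return proxy.query(CubeQuery("2001-07-01", "2001-08-01", "PM25"))
```

The user code never names a member server or its schema; it sees only the query object and the returned package.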

Page 11: 2004-11-13 Supersite Relational Database Project: (Data Portal?)

Universal Query/Response from SQL servers

• A common feature of all SQL databases for AQ data is that they can be queried along the spatial, temporal and parameter dimensions.

• However, the query to retrieve the same information depends on the schema of the particular database.

• A way to homogenize the distributed data is to access all the data through an abstract virtual schema.

Page 12: 2004-11-13 Supersite Relational Database Project: (Data Portal?)

Summary of Proposed Database Schema Design

• The starting point for the design of Supersite Relational Database schema will be the Minimal Star Schema for fixed-location monitoring data.

• Extensions will be made if they clearly benefit regional analysis and cross-Supersite comparisons

• The possible extensions, based on user needs, may include the addition of:

– A ‘Methods’ dimension table to identify the sampling/analysis method of each observation

– Additional attributes (columns) in the Site and Parameter tables

• The Supersite data are not yet ready for submission to the NARSTO archive. Thus, there is still time to develop an agreed-upon schema for the Supersite data in SRDS.

• The schema modifications and the consensus-building will be conducted through the SRDS website

Page 13: 2004-11-13 Supersite Relational Database Project: (Data Portal?)

Data Entry to the Supersite Relational Data System:

[Figure: data entry paths into the Supersite SQL Server. Labels: EPA Supersite Data; Coordinated Supersite Relational Tables; Auxiliary Batch Data; NARSTO ORNL (DES, Data Ingest); EOSDIS Data Archive; DES-SQL Transformer; Manual-SQL Transformer; Direct Web Data Input; Supersite SQL Server; Data Query; Table Output.]

1. Automatic translation and transfer of NARSTO-archived DES data to SQL

2. Web-submission of relational tables by the data producers/custodians

3. Batch transfer of large auxiliary datasets to the SQL server

Page 14: 2004-11-13 Supersite Relational Database Project: (Data Portal?)

Data Preparation Procedures:

• Data gathering, QA/QC and standard formatting are to be done by individual projects

• The data exchange standards, data ingest and archives are the responsibility of ORNL and NASA

• Data ingest is to be automated, aided by tools and procedures supplied by this project:

– NARSTO DES-SQL translator

– Web submission tools and procedures

– Metadata Catalog and I/O facilities

• Data submissions and access will be password protected as set by the community.

• Submitted data will be retained in a temporary buffer space and, following verification, transferred to the shared SQL database.

• The data access, submissions etc. will be automatically recorded and summarized in human-readable reports.

Page 16: 2004-11-13 Supersite Relational Database Project: (Data Portal?)

Related CAPITA Projects

• EPA Network Design Project (~$150K/yr – April 2003). Development of novel quantitative methods of network optimization. The network performance evaluation is conducted using the complete PM FRM data set in AIRS, which will be available for input into the SRDS.

• EPA WebVis Project (~$120K/yr – April 2003). Delivery of current visibility data to the public through a web-based system. The surface met data are being transferred into the SQL database (since March 2001) and will be available to SRDS.

• NSF Collaboration Support Project (~$140K/yr – Dec 2004). Continuing development of interactive web sites for community discussions and for web-based data sharing (directly applicable to this project).

• NOAA ASOS Analysis Project (~$50K/yr – May 2002). Evaluate the potential utility of the ASOS visibility sensors (900 sites, one-minute resolution) as PM surrogate. Data now available for April-October 2001; can be incorporated into the Supersite Relational Data System.

• St. Louis Supersite Project website (~$50K/yr – Dec 2003). The CAPITA group maintains the St. Louis Supersite website and some auxiliary data. It will also be used for this project.

Page 17: 2004-11-13 Supersite Relational Database Project: (Data Portal?)

Federated Data Warehouse Architecture

[Figure: Federated Data Warehouse architecture. Data Warehouse Tier: Satellite, Vector, GIS Data, XDim Data, OLAP Cube, SQL Table, Text Data, Web Page; accessed through XML Web Services, OpenGIS Services, HTTP Services. Managers: Data View Manager, Connection Manager, Data Access Manager, Cursor-Query Manager. Data View & Process Tier: Layered Map, Time Chart, Scatter Chart, Text/Table, Cursor.]

Data are rendered by linked Data Views (map, time, text)

Distributed data of multiple types (spatial, temporal, text)

The Broker handles the views, connections, data access, cursor

Page 18: 2004-11-13 Supersite Relational Database Project: (Data Portal?)

Example Data Viewer (to be made more Supersite relevant)

[Figure: data viewer with four panels: Map View, Variable View, Time View, WebCam View.]

The views are linked so that making a change in one view, such as selecting a different location in the map view, updates the other views.
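One common way to obtain this kind of linkage is a shared cursor held by a broker that notifies every other view when one view changes it. The sketch below is a generic illustration of that pattern, not the viewer's actual code; the class names and the cursor fields are assumptions.

```python
class ViewBroker:
    """Holds the shared cursor and pushes changes to the attached views."""

    def __init__(self):
        self._views = []
        self.cursor = {"site_id": None, "obs_datetime": None}

    def attach(self, view):
        self._views.append(view)

    def set_cursor(self, source, **changes):
        self.cursor.update(changes)
        for view in self._views:
            if view is not source:  # the originating view is already up to date
                view.refresh(self.cursor)


class View:
    def __init__(self, name, broker):
        self.name = name
        self.broker = broker
        self.shown = None  # last cursor state this view rendered
        broker.attach(self)

    def select(self, **changes):
        """User action in this view, e.g. clicking a site on the map."""
        self.broker.set_cursor(self, **changes)

    def refresh(self, cursor):
        self.shown = dict(cursor)


broker = ViewBroker()
map_view = View("map", broker)
time_view = View("time", broker)
map_view.select(site_id="A1")  # time_view now shows site A1
```

Keeping the cursor in one place means a new view type only has to implement `refresh` to take part in the linkage.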

Page 19: 2004-11-13 Supersite Relational Database Project: (Data Portal?)

Supersite Relational Data System: Schedule

• The first four months are devoted to the design of the relational database, associated data transformers and I/O, which are then submitted to the Supersite workgroups for comment

• In six months, Supersite data preparation and entry begins

• In Year 2 and Year 3, data sets will be updated by providers as needed; the system will be accessible to the data user community

[Figure: schedule chart spanning Year 1 - 2002, Year 2 - 2003 and Year 3 - 2004. Tasks: RDMS Design; Feedback; Impl. & Test; SQL Supersite Data Entry; Auxiliary Data Entry; Other Coordinated Data Entry; Supersite, Coordinated and Auxiliary Data Updates.]

Page 20: 2004-11-13 Supersite Relational Database Project: (Data Portal?)

Personnel, Management and Facilities

Personnel

• PI, R. B. Husar (10%), Kari Hoijarvi (25%). Software experience at CAPITA, Microsoft, Visala.

• 20% of project budget ($12k/yr) to consultants: J. Watson, DRI; W. White and J. Turner, WU.

• Collaborators (CAPITA associates): B. Schichtel, CIRA; S. Falke, EPA; M. Bezic, Microsoft.

Management

• This project is a sub-project of the St. Louis-Midwest Supersite project, Dr. Jay Turner, PI.

• Special focus on supporting large-scale, crosscutting, and integrative analysis.

• This project will leverage the other CAPITA data sharing projects

Resources and Facilities

• CAPITA has the largest known privately held collection of air quality, meteorological and emission data, available in uniform Voyager format and extensively accessed from the CAPITA website

• The computing and communication facilities include two servers, ten workstations and laptops, connected internally and externally through high-speed networks.

• Software development tools, including Visual Studio, part of the .NET development environment

Page 21: 2004-11-13 Supersite Relational Database Project: (Data Portal?)

Miscellaneous Stuff

• The remaining pages are potentially reusable material, not yet organized.

Page 22: 2004-11-13 Supersite Relational Database Project: (Data Portal?)

OpenGIS Web Services

• Mission: Definition and specification of geospatial web services.

• A Web service is an application that can be published, located, and dynamically invoked across the Web.

• Applications and other Web services can discover and invoke the service.

• The sponsors of the Web services initiative include:

– Federal Geographic Data Committee

– Natural Resources Canada

– Lockheed Martin

– National Aeronautics and Space Administration

– U.S. Army Corps of Engineers Engineer Research and Development Center

– U.S. Environmental Protection Agency EMPACT Program

– U.S. Geological Survey

– US National Imagery and Mapping Agency

• Phase I - February 2002

– Common Architecture: OGC Services Model, OGC Registry Services, and Sensor Model Language.

– Web Mapping: Map Server (raster), Feature Server (vector), Coverage Server (image), Coverage Portrayal Services.

– Sensor Web: OpenGIS Sensor Collection Service for accessing data from a variety of land, water, air and other sensors.

Page 23: 2004-11-13 Supersite Relational Database Project: (Data Portal?)

Distributed Data Analysis & Dissemination System: D-DADS

• Specifications:

– Uses standardized forms of data, metadata and access protocols

– Supports distributed data archives, each run by its own provider

– Provides tools for data exploration, analysis and presentation

• Features:

– Data are structured as relational tables and multidimensional data cubes

– Dimensional data cubes are distributed but shared

– Analysis is supported by built-in and user functions

– Supports other data types, such as images, GIS data layers, etc.

Page 24: 2004-11-13 Supersite Relational Database Project: (Data Portal?)

D-DADS Architecture

[Figure: D-DADS architecture. Main labels: Data Providers; Standardized Description & Format; Data Access and Manipulation Tools; User Interaction. Other labels: SQL Database; Legacy Database; Database (SQL, Oracle, etc.); ARC/INFO; OLAP Service Provider; Custom OLAP Translator; ArcSDE Translator; Data Cube; GIS Table; GIS Map; OLAP; Virtual Data Cube; ArcIMS.]

Page 25: 2004-11-13 Supersite Relational Database Project: (Data Portal?)

The D-DADS Components

• Data Providers supply primary data to the system, through SQL or other data servers.

• Standardized Description & Format populate and describe the data cubes and other data types using standard metadata.

• Data Access and Manipulation tools provide a unified interface to data cubes, GIS data layers, etc. for accessing and processing (filtering, aggregating, fusing) data and integrating data into virtual data cubes.

• Users are the analysts who access the D-DADS and produce knowledge from the data

The multidimensional data access and manipulation component of D-DADS will be implemented using OLAP.

Page 26: 2004-11-13 Supersite Relational Database Project: (Data Portal?)

Interoperability

One requirement for an effective distributed environmental data system is interoperability, defined as:

“the ability to freely exchange all kinds of spatial information about the Earth and about objects and phenomena on, above, and below the Earth’s surface; and to cooperatively, over networks, run software capable of manipulating such information.” (Buehler & McKee, 1996)

Such a system has two key elements:

• Exchange of meaningful information

• Cooperative and distributed data management

Page 27: 2004-11-13 Supersite Relational Database Project: (Data Portal?)

On-line Analytical Processing: OLAP

• A multidimensional data model making it easy to select, navigate, integrate and explore the data.

• An analytical query language providing power to filter, aggregate and merge data as well as explore complex data relationships.

• Ability to create calculated variables from expressions based on other variables in the database.

• Pre-calculation of frequently queried aggregated values, e.g. monthly averages, enables fast response time to ad hoc queries.
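The pre-calculation idea in the last bullet can be imitated in plain SQL by materializing an aggregate table once and answering later queries from it. This is a stand-in written with `sqlite3`, not an OLAP engine; the `Monthly_Avg` table and the column names are assumptions for the example.

```python
import sqlite3


def build(conn, facts):
    """Load observations, then pre-compute monthly averages once."""
    conn.execute("CREATE TABLE Fact (Obs_DateTime TEXT, Site_ID TEXT, "
                 "Parameter_ID TEXT, Obs_Value REAL)")
    conn.executemany("INSERT INTO Fact VALUES (?,?,?,?)", facts)
    # 'YYYY-MM' prefix of the ISO date serves as the month key.
    conn.execute("""
        CREATE TABLE Monthly_Avg AS
        SELECT substr(Obs_DateTime, 1, 7) AS Month, Site_ID, Parameter_ID,
               AVG(Obs_Value) AS Avg_Value, COUNT(*) AS N
        FROM Fact
        GROUP BY substr(Obs_DateTime, 1, 7), Site_ID, Parameter_ID
    """)


def monthly_average(conn, month, site_id, parameter_id):
    """Answer from the pre-computed table instead of rescanning Fact."""
    row = conn.execute(
        "SELECT Avg_Value FROM Monthly_Avg "
        "WHERE Month = ? AND Site_ID = ? AND Parameter_ID = ?",
        (month, site_id, parameter_id)).fetchone()
    return None if row is None else row[0]


def demo():
    conn = sqlite3.connect(":memory:")
    # Made-up values, for illustration only.
    build(conn, [
        ("2001-07-01", "A1", "PM25", 10.0),
        ("2001-07-15", "A1", "PM25", 20.0),
        ("2001-08-01", "A1", "PM25", 30.0),
    ])
    return (monthly_average(conn, "2001-07", "A1", "PM25"),
            monthly_average(conn, "2001-08", "A1", "PM25"),
            monthly_average(conn, "2001-09", "A1", "PM25"))
```

The trade-off is the usual one: the aggregate table must be rebuilt or updated whenever the underlying observations change.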

Page 28: 2004-11-13 Supersite Relational Database Project: (Data Portal?)

User Interaction with D-DADS

[Figure: user interaction with D-DADS. Labels: Query; Data View (Table, Map, etc.); Distributed Database; XML data (shown twice).]

Page 29: 2004-11-13 Supersite Relational Database Project: (Data Portal?)

Metadata Standardization

Metadata standards for describing air quality data are currently being actively pursued by several organizations, including:

• The Supersite Data Management Workgroup

• NARSTO

• FGDC

Page 30: 2004-11-13 Supersite Relational Database Project: (Data Portal?)

Potential D-DADS Nodes

The following organizations are potential nodes in a distributed data analysis and dissemination system:

• CAPITA

• NPS-CIRA

• EPA Supersites

– California

– Texas

– St. Louis

Page 31: 2004-11-13 Supersite Relational Database Project: (Data Portal?)

Summary

In the past, data analysis has been hampered by data flow resistances. However, the tools and framework to overcome each of these resistances now exist, including:

• World Wide Web

• XML

• OLAP

• OpenGIS

• Metadata standards

Incorporating these tools will initiate a distributed data analysis and dissemination system.

Page 32: 2004-11-13 Supersite Relational Database Project: (Data Portal?)

‘Global’ and ‘Local’ AQ Analysis

• AQ data analysis needs to be performed at both global and local levels

• ‘Global’ refers to regional, national, and global analysis. It establishes the larger-scale context.

• ‘Local’ analysis focuses on the specific and detailed local features

• Both global and local analyses are needed for full understanding.

• Global-local interaction (information flow) needs to be established for effective management.

National and Local AQ Analysis

Page 33: 2004-11-13 Supersite Relational Database Project: (Data Portal?)

Data Re-Use and Synergy

• Data producers maintain their own workspace and resources (data, reports, comments).

• Part of the resources is shared by creating common virtual resources.

• Web-based integration of the resources can be across several dimensions:

– Spatial scale: Local-global data sharing

– Data content: Combination of data generated internally and externally

• The main benefits of sharing are data re-use, data complementing and synergy.

• The goal of the system is to have the benefits of sharing outweigh the costs.

[Figure: users with local and global content contribute the shared part of their resources to Virtual Shared Resources (data, knowledge, tools, methods).]

Page 34: 2004-11-13 Supersite Relational Database Project: (Data Portal?)

Integration for Global-Local Activities

Global Activity => Local Benefit

Global data, tools => Improved local productivity

Global data analysis => Spatial context; initial analysis

Analysis guidance => Standardized analysis, reporting

Local Activity => Global Benefit

Local data, tools => Improved global productivity

Local data analysis => Elucidate, expand initial analysis

Identify relevant issues => Responsive, relevant global work

Global and local activities are both needed – e.g. ‘think global, act local’

‘Global’ and ‘Local’ here refer to relative, not absolute, scale

Page 35: 2004-11-13 Supersite Relational Database Project: (Data Portal?)

Content Integration for Multiple Uses (Reports)

• Data from multiple measurements are shared by their providers or custodians

• Data are integrated, filtered, aggregated and fused in the process of analysis

• Reports use the analysis for Status and Trends; Exposure Assessment; Compliance …

The creation of the needed reports requires data sharing and integration from multiple sources.