5/19/05 New Geoscience Applications

A DISTRIBUTED WORKFLOW DATABASE DESIGNED FOR COREWALL APPLICATIONS

Bill Kamp, Limnological Research Center, Univ. of Minnesota

The Corewall

Overview
- The data required for a core interpretation session can be very large.
- An individual IODP core's data can be in the 10 to 100 gigabyte range.
- To compound this problem, many users will be interpreting at locations with slow internet connections.
- Finally, users may be interpreting data from databases that are often designed as read-only archives and not designed to hold 'works in progress' of investigators.
- Our goal is to provide a very smart clipboard.

The Data Requirements Demand a Database
- Workflow oriented
- Large throughput
- Internet aware
- Accepts all data types
- Connects to Geowall locally and remotely
- Integrates with legacy tools
- And most importantly: transparent
  – Little or no CWD work by the researcher
- Automatic, automatic, automatic

Legacy Tools
- Core Log Integration Platform from Lamont-Doherty Earth Observatory (LDEO)
  – Splicer: Provides interactive depth-shifting of multiple holes of core data to build composite sections
  – Sagan: Allows the composite sections output by Splicer to be mapped to their true stratigraphic depths, unifying core and log records

Sample Plot

Interfaces
- We will provide interfaces that enable the CWD (Computer Workflow Database) to retrieve user-selected data from established databases such as JANUS, LacCore Vault, dbSEABED, and PaleoStrat.
- We hope to also pull data through the emerging portals such as CHRONOS.
- The result is fast cached access to multiple data sources.

Features
- The CWD captures the results of analyses and interpretations.
- As the workflow is captured, it can be accessed by other collaborators locally or remotely.
- In a high-bandwidth environment, such as a core lab or a university office, a group of collaborators could track one another's work as they work on the same cores.
- In a low-bandwidth environment, we will cache the data locally upon first access.
- In a zero-bandwidth environment, the CWD can be copied to a portable mass storage device: all pointers are relative to the location of the CWD.
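
The relative-pointer idea in the last bullet can be illustrated with a minimal sketch. It is not the CWD's actual code: the function names `to_pointer` and `resolve_pointer` and the `bincache/` directory name are assumptions made for the example. The point is only that a pointer stored relative to the CWD root still resolves after the whole tree is copied to another device.

```python
from pathlib import Path, PurePosixPath


def to_pointer(cwd_root: Path, absolute_file: Path) -> str:
    """Store a file location relative to the CWD root (hypothetical helper)."""
    # as_posix() keeps the stored pointer identical on every platform.
    return absolute_file.relative_to(cwd_root).as_posix()


def resolve_pointer(cwd_root: Path, pointer: str) -> Path:
    """Rebuild a full path from a stored pointer and the current CWD root."""
    return cwd_root.joinpath(*PurePosixPath(pointer).parts)


# The same stored pointer works before and after the CWD is moved.
pointer = to_pointer(Path("/srv/cwd"), Path("/srv/cwd/bincache/core.bmp"))
on_server = resolve_pointer(Path("/srv/cwd"), pointer)
on_portable_disk = resolve_pointer(Path("/media/usb/cwd"), pointer)
```

Because only the root changes between locations, nothing in the database needs rewriting when the CWD is carried to a site without a connection.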

Coordinate Systems
- Co-registration across coordinate systems, e.g. wire length, geologic boundary, and/or geologic age.
- We use the standard algorithms from SAGAN and SPLICER for this purpose.
- We intend to take advantage of existing technologies such as the Storage Resource Broker and Meta-data Catalog [SRBMDC] to facilitate the locating of replicated data sets.
- We will use SESAR identifiers to uniquely and automatically identify the sample, the author, and the experiment when the data is loaded.
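
To make the idea of co-registration concrete, here is a minimal sketch that maps a value from one coordinate system to another by linear interpolation between tie points. This is a generic technique chosen for illustration, not the SAGAN or SPLICER algorithms referred to above. The function name `map_coordinate` and the tie-point values are hypothetical, and the source coordinates are assumed to be strictly increasing.

```python
from bisect import bisect_right


def map_coordinate(value, tie_points):
    """Piecewise-linear map from a source coordinate to a target coordinate.

    tie_points is a list of (source, target) pairs sorted by source,
    with at least two entries and strictly increasing source values.
    """
    sources = [s for s, _ in tie_points]
    if not (sources[0] <= value <= sources[-1]):
        raise ValueError("value lies outside the tie-point range")
    # Index of the tie point that closes the interval containing value.
    i = min(bisect_right(sources, value), len(tie_points) - 1)
    (s0, t0), (s1, t1) = tie_points[i - 1], tie_points[i]
    return t0 + (value - s0) * (t1 - t0) / (s1 - s0)


# Hypothetical tie points: depth in metres -> age in years.
depth_to_age = [(0.0, 0.0), (10.0, 5000.0), (20.0, 20000.0)]
age_at_5m = map_coordinate(5.0, depth_to_age)
age_at_15m = map_coordinate(15.0, depth_to_age)
```

The same function could map wire length to depth or depth to age; only the tie-point table changes.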

Database Design
- The paradigm for the metadata is:
  – Author
  – Experiment
  – Raw Data
  – Presentation
- Data type is missing from this list: we support all MIME data types
  – XML and text stored in the database
  – All other data stored in the Bin Cache
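
The storage rule above (XML and text in the database, everything else in the Bin Cache) could be sketched as follows. The function names and the exact set of MIME types treated as XML are assumptions for illustration only; the slides do not describe how the CWD makes this decision internally.

```python
import mimetypes

# Assumption: these MIME types, plus any "+xml" suffix, count as XML.
XML_TYPES = {"application/xml", "text/xml"}


def storage_for(mime_type):
    """Return 'database' for XML/text content, 'bin_cache' for all else."""
    if mime_type is None:
        return "bin_cache"  # unknown types are treated as opaque binary
    if (mime_type in XML_TYPES
            or mime_type.endswith("+xml")
            or mime_type.startswith("text/")):
        return "database"
    return "bin_cache"


def storage_for_file(filename):
    """Guess the MIME type from the file name, then apply the rule."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return storage_for(mime_type)
```

For example, `storage_for_file("readme.txt")` selects the database, while an image type such as `image/jpeg` selects the Bin Cache.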

The Data Diagram

Caches
- Uploading requires a caching system
  – Upload Cache: accessed directly, via FTP, or via HTTP upload
  – Archive Cache: All data is stored in raw form in an archive that is permanent
  – Staging: A temporary holding place for data while it is examined and transformed
  – Bin Cache: The location of the binary data managed by the database
- The complete uploading process, including automatic recognition of the data type, is available as a single script, called ForceUpload.
  – It is the best approach when you have multiple data sets of the same data type.
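
The flow through the four caches can be sketched as below. This is not the ForceUpload script, whose contents are not shown in the slides: the function name `ingest`, the directory names, and the returned record layout are all hypothetical. For brevity, every file ends up in the Bin Cache here; the split that sends XML and text to the database is left out.

```python
import mimetypes
import shutil
from pathlib import Path


def ingest(upload_file: Path, root: Path) -> dict:
    """Move one uploaded file through archive, staging, and bin cache."""
    archive = root / "archive"
    staging = root / "staging"
    bin_cache = root / "bincache"
    for directory in (archive, staging, bin_cache):
        directory.mkdir(parents=True, exist_ok=True)

    # Archive Cache: keep a permanent copy of the raw upload.
    shutil.copy2(upload_file, archive / upload_file.name)

    # Staging: hold a working copy while the data type is recognized.
    staged = Path(shutil.copy2(upload_file, staging / upload_file.name))
    mime_type, _encoding = mimetypes.guess_type(staged.name)

    # Bin Cache: final location, recorded relative to the root.
    final = bin_cache / staged.name
    shutil.move(str(staged), str(final))

    # Upload Cache: the original upload is no longer needed.
    upload_file.unlink()

    return {
        "name": final.name,
        "mime_type": mime_type,
        "pointer": final.relative_to(root).as_posix(),
    }
```

Calling `ingest` in a loop over a directory of files of one type would mirror the batch use case mentioned in the last bullet.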

Data Access
- All raw data is available via URLs.
- The author has the option of refining the automatically generated presentation, i.e. the HTML page that shows the data.
- Presentations can be dynamically built using database data. Tools are provided.
- If data is not local, it is transferred to the local bin cache, and the CWD is updated.
- If you are not on the internet, you need to bring with you the database (small) and the bin cache.
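
The fetch-on-first-access behaviour described above might look like the following sketch. The record layout (a dict with `name`, `url`, and `local_pointer` keys), the function name `ensure_local`, and the `bincache/` directory are assumptions made for illustration; in the real system the pointer would live in the CWD database, not in a Python dict.

```python
import shutil
from pathlib import Path
from urllib.request import urlopen


def ensure_local(record: dict, cwd_root: Path) -> Path:
    """Return a local path for the record's data, fetching it if needed."""
    pointer = record.get("local_pointer")
    if pointer and (cwd_root / pointer).exists():
        return cwd_root / pointer  # already cached, no network access

    # Not local yet: copy the data from its URL into the bin cache.
    pointer = f"bincache/{record['name']}"
    target = cwd_root / pointer
    target.parent.mkdir(parents=True, exist_ok=True)
    with urlopen(record["url"]) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)

    # Update the record so later accesses use the local copy.
    record["local_pointer"] = pointer
    return target
```

After the first call the data is served from the local bin cache, which is what makes the low-bandwidth and offline cases workable.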

Sample Presentations
- 9.134.readme.txt.html
- 9.137.cwilocs.zip.html
- 1.195.logo.bmp.html
- 1.148.kamp_1218c_021x_07.jpg.html
- 1.7.MOLE-JUAN03-1A.Geotek.and.L-a-b.data.xls.html
- 7.122.GLAD4-HVT03-4B-9H-1.BMP.html
- 7.123.GLAD4-HVT03-4C-1H-1.BMP.html
- 7.93.GLAD4-HVT03-4B-1H-1.BMP.html

Replication
- The database is replicated to multiple sites on the internet automatically via TCP/IP. This is a MySQL feature.
- The URL of the data is sent to the replicated database.
- Upon first access, if the data is not local, it is fetched to the bin cache via a URL, and the pointers in the local CWD are updated.
- Currently we have a parent-child relationship: all data is first uploaded to the main CWD.
- When we complete the integration of SESAR identifiers, the design will support peer-to-peer relationships.

Database Access
- Data uploaded via a web site
- Data pulled out of the CWD via Corewall
- Data will automatically cross-load to other DBs such as CHRONOS when there is a meta-data match
- The latter will be enforced via XSLTs

Current State
- Test versions are on the web:
  – Currently at http://www.iagp.net/LRC/LrcVault
  – Soon to be at http://burnout.geo.umn.edu
  – Documented at http://mm/html/iagp/LRC/LrcVault/
- Currently holds 10 GByte of test data