A quick start guide to using HDF5 files in GLOBE Claritas

V6.0

Getting Started With HDF5

• Why have we brought in a new data format?
• What actually is HDF5?
• How do I create HDF5 files?
• How do I read in HDF5 files?
– Reading one file at a time
– Reading multiple files and selections
• Points to Note
• Future Developments

SEGY is great but…

• It is designed to be read sequentially from tape
– and our “index” file solution didn’t scale well to “big data”
– and our index file solution only allowed primary key access
• It only has 240 bytes of 16- and 32-bit integer trace headers defined
– and our extended trace headers didn’t scale well to “big data”
• Some processes require “n-key random access”
– “surface consistent” suite, PreSTM, 3DSRME etc.
• You need to read the whole file to access trace headers
– Some “database” systems offer more flexibility
• Parallel I/O doesn’t scale well on large clusters
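To illustrate why trace header access means walking the whole SEGY file, here is a minimal Python sketch. It builds a tiny synthetic SEGY-like byte string and scans it trace by trace. The function names are hypothetical, and the sketch assumes fixed-length traces with 4-byte samples; it is not how Claritas reads SEGY.

```python
import struct

TEXT_HEADER = 3200    # bytes of textual reel header
BINARY_HEADER = 400   # bytes of binary reel header
TRACE_HEADER = 240    # bytes of header in front of every trace
BYTES_PER_SAMPLE = 4  # assumption: a 4-byte sample format


def build_segy(shot_ids, n_samples):
    """Build a tiny synthetic SEGY-like byte string for the demonstration."""
    binary = bytearray(BINARY_HEADER)
    # Samples per trace: bytes 3221-3222 of the file, i.e. offset 20 here.
    struct.pack_into(">h", binary, 20, n_samples)
    out = bytearray(TEXT_HEADER) + binary
    for shot in shot_ids:
        header = bytearray(TRACE_HEADER)
        # Field record number: bytes 9-12 of the trace header.
        struct.pack_into(">i", header, 8, shot)
        out += header + bytes(n_samples * BYTES_PER_SAMPLE)
    return bytes(out)


def scan_shot_ids(data):
    """Walk the file trace by trace, collecting one header word per trace."""
    (n_samples,) = struct.unpack_from(">h", data, TEXT_HEADER + 20)
    stride = TRACE_HEADER + n_samples * BYTES_PER_SAMPLE
    pos = TEXT_HEADER + BINARY_HEADER
    shots = []
    while pos + TRACE_HEADER <= len(data):
        (shot,) = struct.unpack_from(">i", data, pos + 8)
        shots.append(shot)
        pos += stride  # headers are interleaved with the samples
    return shots


segy = build_segy([900, 100, 500], n_samples=1000)
shots = scan_shot_ids(segy)
header_bytes = TRACE_HEADER * len(shots)
print(shots, f"{header_bytes} header bytes spread over {len(segy)} file bytes")
```

Because each 240-byte header sits in front of its own samples, collecting even one header word per trace touches every part of the file. A format that stores the headers together as a table avoids this.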

So what is HDF5?

• Developed over the last 20 years
• Initially by the National Center for Supercomputing Applications http://www.ncsa.illinois.edu/
• Now developed by The HDF Group http://www.hdfgroup.org

• A suite of technologies, not just a file format
• General purpose library and file format for storing scientific data
• Fully supported set of command line tools, APIs and interfaces

• A pan-industry open standard
• Used for storage by both MATLAB and Scilab, can be read by Mathematica

• A self-describing format
• No ambiguity about integer or floating point types or storage in trace bytes
• Names can be allocated to components, as you would in a database structure

• Built for “big data”
• Petabyte+ scale datasets running on tens of thousands of cores

Our Implementation of HDF5

HDFView 2.9: free, third party tool, showing how any HDF5 application can open the new format

Data, Processing History, 400-byte reel header, 3200-byte text header, history and trace headers from Claritas extended SEGY all present

Seismic samples displayed graphically – could also be displayed as a table

All trace headers – SEGY 240-byte and extended – opened in a spreadsheet; full mathematical operations

We have “encapsulated” the GLOBE Claritas SEGY in HDF5

The 400-byte binary reel header opened as a table, so that values can be edited or modified

Creating HDF5 Files : SEISWRITE

Specify a file name!

Optimisation controls: these have smart defaults, and can be modified for very large datasets where you know that non-sequential read access, or partial reads of trace samples, will be needed

Replaces current use of DISCWRITE, although this will continue to be available

New functionality development will focus on SEISWRITE and HDF5 format data

Reading HDF5 files : SEISREAD

With HDF5 format, you use SEISREAD in place of the DISCxxxxx modules

You don’t need to worry about the order of data on disc, just how you want to read it

Simple Reading

File Name

Primary key order; default is all, ascending

Secondary key order; default is all, ascending

Tertiary key order; only when needed

You can read data in ANY order; original order doesn’t matter
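The idea of reading in any key order, whatever the storage order, can be sketched in a few lines of Python. This is an illustration only: the header values are made up, `read_order` is a hypothetical helper, and nothing here reflects how SEISREAD is implemented internally.

```python
# Hypothetical trace headers, listed in the order the traces sit on disc.
headers = [
    {"SHOTID": 2, "CHANNEL": 1, "CDP": 10, "CDPTRACE": 1},
    {"SHOTID": 1, "CHANNEL": 2, "CDP": 10, "CDPTRACE": 2},
    {"SHOTID": 2, "CHANNEL": 2, "CDP": 11, "CDPTRACE": 1},
    {"SHOTID": 1, "CHANNEL": 1, "CDP": 9, "CDPTRACE": 1},
]


def read_order(headers, keys):
    """Return storage positions sorted ascending by primary, secondary, ... key."""
    return sorted(
        range(len(headers)),
        key=lambda i: tuple(headers[i][k] for k in keys),
    )


# Same stored data, two different read orders.
shot_order = read_order(headers, ["SHOTID", "CHANNEL"])
cdp_order = read_order(headers, ["CDP", "CDPTRACE"])
print(shot_order)  # [3, 1, 0, 2]
print(cdp_order)   # [3, 0, 1, 2]
```

With the headers held as a table, the read order is just a sorted list of trace positions, so the same file can serve shot-ordered and CDP-ordered reads without being rewritten.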

Selection and Repeats

6 Repeat copies specified

Primary key SHOTID, with only SHOTID 900 selected; note tolerance

Secondary key CHANNEL, all selected, in ascending order (default)

Six copies of SHOTID 900 passed to the processing flow, with REPEAT set from 1-6

More Complex Selections

Two copies of SHOTIDs from 100 to 900 with an increment of 100, all channels in ascending order, with REPEAT set to 1 and 2

More complex SHOTID selection using the same syntax as DISCREAD; note tolerance is set to 0
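The selection and repeat behaviour described on the last two slides can be mimicked with a short Python sketch. It rests on several assumptions that the slides do not confirm: a trace is accepted when its key lies within the tolerance of a requested value, each repeat delivers the full selection before the next one starts, and `read_with_repeats` is a hypothetical helper rather than a Claritas function.

```python
def read_with_repeats(traces, key, targets, repeats, tolerance=0):
    """Select traces whose `key` matches a target, once per repeat copy."""
    targets = list(targets)
    selected = [
        t for t in traces
        if any(abs(t[key] - target) <= tolerance for target in targets)
    ]
    # Default ordering: ascending primary key, then ascending CHANNEL.
    selected.sort(key=lambda t: (t[key], t["CHANNEL"]))
    out = []
    for repeat in range(1, repeats + 1):
        # Assumption: a whole copy of the selection per REPEAT value.
        out.extend({**t, "REPEAT": repeat} for t in selected)
    return out


# Made-up survey: shots 50, 100, ..., 950 with two channels each.
traces = [
    {"SHOTID": shot, "CHANNEL": chan}
    for shot in range(50, 1000, 50)
    for chan in (1, 2)
]

# Six copies of SHOTID 900.
single = read_with_repeats(traces, "SHOTID", [900], repeats=6)

# Two copies of SHOTIDs 100 to 900 in steps of 100, tolerance 0.
multi = read_with_repeats(traces, "SHOTID", range(100, 901, 100), repeats=2)

print(len(single), len(multi))  # 12 36
```

A tolerance of 0 means only exact key matches are kept, which is why shots such as 150 or 950 never appear in the second selection.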

Sorting to CDP (DISCGATH)

Identical to simple reading

Specify CDP as primary key

Specify CDPTRACE as secondary key

Default is to read all data in ascending primary/secondary key order

Reading Multiple Files

Seismic File List used in the same format as with DISCREAD, with selections

SETREPEAT parameter used as per DISCREAD to create panels; files are merged if this is “no”

Primary Key defined here is used in the Seismic File List definition

This last file has a “native” ordering of CDP, CDPTRACE, but will automatically be re-ordered to SHOT, CHANNEL on read

Points to Note

• Can only specify a primary key in a Seismic File List
– Same as DISCWRITE, although the original data order no longer matters
• User needs to manage the merge of extended trace headers
– Use DELHDR prior to merging files; this requirement will be removed in future releases
• Files can be 10-15% larger than SEGY
• Compatible with Cluster File Systems (Gluster etc.)
• I/O above about 2 Gbytes should be improved

Future development

• Improved PKEY/SKEY/TKEY selection handling
• Direct update of trace headers from applications
– Geometry, SV (FB picks) etc.
• Add HDF5 support in KPRET2D
– Only module where this is not available
• Add full parallel I/O to iMage suite
– Increase parallel scalability even further
• Algorithmic optimisation
– Re-write to take full advantage of random access

