
Victoria, May 2006

DAL for theorists: Implementation of the SNAP service for the TVO

Claudio Gheller, Giuseppe Fiameni
Interuniversity Computing Center CINECA, Bologna

Ugo Becciani, Alessandro Costa
Astrophysical Observatory of Catania


The Simple Numerical Access Protocol Service

The Snap service extracts or "cuts out" rectangular (spherical or even irregular) regions of some larger theory dataset, returning a subset of the requested size to the client.

Snap basic components:

• DATA

• SNAP code

• SERVICE


1. Data and Data Model

In order to analyze the requirements of data produced by numerical simulations, we have considered a wide spectrum of applications:

• Particle-based cosmological simulations

• Grid-based cosmological simulations

• Magnetohydrodynamics simulations

• Planck mission simulated data

• ...(thanks to V. Antonuccio, G. Bodo, S. Borgani, N. Lanza, L. Tornatore)

At the moment, we consider only RAW data


1. Data

In general, data produced by numerical simulations are

• Large (GB to TB scale)

• Monolithic (a few files contain a large amount of data)

• Hard to compress

• Non-standard (proprietary formats are the rule)

• Non-portable (dependent on the simulation machine)

• Scarcely annotated (little or no metadata)

• Heterogeneous in units (often code units)


Data: the HDF5 format

HDF5 (http://hdf.ncsa.uiuc.edu) represents a possible solution for dealing with such data.

HDF5 is:

• Portable across most modern platforms

• High performance

• Well supported

• Well documented

• Rich in tools

HDF5 data files are:

• Platform independent (portable)

• Well organized

• Self-describing

• Metadata enriched

• Efficiently accessible

HDF5 drawbacks:

• Requires some expertise and skill to be used

• Information is difficult to access

• Can be subject to major library changes (see HDF4 to HDF5)


Data: our HDF5 implementation

Each file represents an output time

The structure is simple: all the data objects are at the root level:

/BmMassDensity Dataset {512, 512, 512}

/BmTemperature Dataset {512, 512, 512}

/BmVelocity Dataset {512, 512, 512, 3}

/DmMassDensity Dataset {512, 512, 512}

/DmPosition Dataset {134217728, 3}

/DmVelocity Dataset {134217728, 3}

HDF5 metadata make the file completely self-describing.

Structural metadata (strictly required by the library):

• Rank

• Dimensionality

Annotation metadata (required by our implementation):

• Data object name

• Data object description

• Unit

• Formula

Data objects (at the moment) can be:

• Structured grid: rank 4 (scalars or vectors)

• Unstructured points: rank 2 (scalars or vectors)


Data Model schema


Implementation of the model

The database is at present implemented on a PostgreSQL Linux installation.


The Snap Code: overview

The Snap code operates on large data files on different platforms. It has therefore been implemented according to the following requirements:

• Efficiency

• Robustness

• Portability

• Extensibility

We have adopted the C++ programming language on top of the HDF5 format and APIs.

It is compiled under Linux (GNU compiler) and AIX (xlC compiler).

[Diagram: the SNAP service cuts a source HDF5 file (Dataset 1 … Dataset N) into a snapped HDF5 file (Dataset 1 … Dataset M), which the client then downloads.]


The Snap Code

Input:

Data filename

Data objects (one or more)

Spatial Units

Box Center

Box Size

Output filename

Data objects names

Output:

One or more HDF5 files with the same descriptive metadata as the original dataset.

Goal: select all the data that fall inside a pre-defined region. At present the region can only be rectangular.
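The input parameters above define an axis-aligned selection box. As a minimal illustration (the struct and field names are hypothetical, not taken from the actual Snap code), the box bounds follow directly from the center and size:

```cpp
#include <array>
#include <string>
#include <vector>

// Hypothetical sketch of a SNAP cutout request: the region is given by
// its center and edge size, from which the min/max bounds are derived.
// (Spatial units and output object names are omitted for brevity.)
struct SnapRequest {
    std::string inputFile;             // source HDF5 file name
    std::vector<std::string> objects;  // data object names to cut out
    std::array<double, 3> center;      // box center (spatial units)
    double size;                       // box edge length
    std::string outputFile;            // destination HDF5 file name

    // Lower and upper corners of the selection box along each axis.
    std::array<double, 3> lower() const {
        return {center[0] - size / 2, center[1] - size / 2, center[2] - size / 2};
    }
    std::array<double, 3> upper() const {
        return {center[0] + size / 2, center[1] + size / 2, center[2] + size / 2};
    }
};
```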


The Snap Code

Mesh Based Data:

Selection is performed using HDF5 hyperslab selection functions. Only the necessary data are loaded into memory.

Selection is extremely fast.
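The mapping from a physical selection box to hyperslab parameters can be sketched as follows. This is an illustration only: the function name and the assumption of a uniform N³ grid spanning [0, domain) are ours, and the resulting start/count values are what one would pass to the HDF5 hyperslab selection call.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstddef>

// Hyperslab parameters per axis: start index and number of cells.
struct Hyperslab {
    std::array<std::size_t, 3> start;
    std::array<std::size_t, 3> count;
};

// Map a physical box (center, edge size) onto hyperslab indices for a
// uniform n^3 grid covering [0, domain) along each axis.
Hyperslab boxToHyperslab(const std::array<double, 3>& center, double size,
                         double domain, std::size_t n) {
    const double cell = domain / static_cast<double>(n);
    Hyperslab h{};
    for (int d = 0; d < 3; ++d) {
        const double lo = center[d] - size / 2.0;
        const double hi = center[d] + size / 2.0;
        // Clamp to the domain; periodic wrapping is a listed future upgrade.
        const long first = std::max(0L, static_cast<long>(std::floor(lo / cell)));
        const long last  = std::min(static_cast<long>(n) - 1,
                                    static_cast<long>(std::floor(hi / cell)));
        h.start[d] = static_cast<std::size_t>(first);
        h.count[d] = static_cast<std::size_t>(last - first + 1);
    }
    return h;
}
```

Because the hyperslab describes the subregion directly in index space, HDF5 can read just those cells from disk, which is why mesh-based selection is fast and memory-light.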

Particle Based Data:

Particle positions are loaded in memory

Particles inside the selected region are identified and their ids are stored (linked list)

Other particle-based datasets are loaded into memory and the list is used to select the target particles

Selected particles are written in the file

The procedure can become “heavy”
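The particle-based steps above can be sketched in plain C++ (illustrative names, not the actual Snap code): positions are scanned once to build the list of ids inside the box, and that list is then reused to filter every other particle-based dataset (velocities, densities, …).

```cpp
#include <array>
#include <cstddef>
#include <vector>

using Vec3 = std::array<double, 3>;

// Step 1-2: scan positions and record the ids of particles inside the box.
std::vector<std::size_t> selectIds(const std::vector<Vec3>& pos,
                                   const Vec3& lower, const Vec3& upper) {
    std::vector<std::size_t> ids;
    for (std::size_t i = 0; i < pos.size(); ++i) {
        bool inside = true;
        for (int d = 0; d < 3; ++d)
            inside = inside && pos[i][d] >= lower[d] && pos[i][d] <= upper[d];
        if (inside) ids.push_back(i);
    }
    return ids;
}

// Step 3-4: apply the stored id list to another dataset with the same
// particle ordering, keeping only the selected entries.
template <typename T>
std::vector<T> filterByIds(const std::vector<T>& data,
                           const std::vector<std::size_t>& ids) {
    std::vector<T> out;
    out.reserve(ids.size());
    for (std::size_t id : ids) out.push_back(data[id]);
    return out;
}
```

Since the whole position dataset must reside in memory during the scan, this path is heavier than the mesh-based hyperslab selection.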

Data Geometry and Topology: at present we support regular mesh-based data and unstructured data (particles). The data structure is crucial for the features of the Snap implementation.

Future upgrades:

Support of spherical (or even irregular) regions

Support of periodic boundary conditions

Parallel implementation


Access to the Archive (the service)

The archive can be accessed in two complementary ways:

• Via web and web portal

• Via web service and high level applications

[Diagram: the Data Archive (data + metadata + applications) is exposed on the WEB side through a Web Portal (PHP, Java…), and on the WEB SERVICE side through Tomcat + Axis and OGSA-DAI, serving clients such as VisIVO and other user applications.]