17
M.Lautenschlager (WDCC, Hamburg) / 09.07.03 / 1 Semantic Data Management for Organising Terabyte Data Archives Michael Lautenschlager World Data Center for Climate (M&D/MPIMET, Hamburg) HAMSTER-WS, Langen, 11.07.03

Semantic Data Management for Organising Terabyte Data Archives

  • Upload
    wray

  • View
    31

  • Download
    0

Embed Size (px)

DESCRIPTION

Semantic Data Management for Organising Terabyte Data Archives. Michael Lautenschlager World Data Center for Climate (M&D/MPIMET, Hamburg). HAMSTER-WS , Langen, 11.07.03. Content: DKRZ archive development CERA 1) concept and structure Automatic fill process CERA access statistic. - PowerPoint PPT Presentation

Citation preview

Page 1: Semantic Data Management for Organising Terabyte Data Archives

M.Lautenschlager (WDCC, Hamburg) / 09.07.03 / 1

Semantic Data Management forOrganising Terabyte Data Archives

Michael Lautenschlager

World Data Center for Climate(M&D/MPIMET, Hamburg)

HAMSTER-WS, Langen, 11.07.03

Page 2: Semantic Data Management for Organising Terabyte Data Archives

M.Lautenschlager (WDCC, Hamburg) / 09.07.03 / 2

Content:

• DKRZ archive development

• CERA1) concept and structure

• Automatic fill process

• CERA access statistic

1) Climate and Environmental data Retrieval and Archiving

Page 3: Semantic Data Management for Organising Terabyte Data Archives

M.Lautenschlager (WDCC, Hamburg) / 09.07.03 / 4

DKRZ Archive Development

DKRZ's Archive Increase (Estim. 06.03)

6001150

17502350

3250

4150

12 40 160 360660

960

2002 2003 2004 2005 2006 2007

Years

Dat

a A

mo

un

t [T

B]

Unix-File Archive

CERA DB

Page 4: Semantic Data Management for Organising Terabyte Data Archives

M.Lautenschlager (WDCC, Hamburg) / 09.07.03 / 5

Problems in file archive access: Missing Data CatalogueDirectory structure of the Unix file system is not sufficient to organise

millions of files. Data are not stored application-orientedRaw data contain time series of 4D data blocks.

Access pattern is time series of 2D fields. Lack of experience with climate model dataProblems in extracting relevant information from climate model raw data

files. Lack of computing facilities at client siteNon-modelling scientists are not equipped to handle large amounts of data

(1/2 TB = 10 years T106 or 50 years T42 in 6 hour storage intervals).

Year 2003 2004 2005 2006 2007

Estimated File Archive Size

1,2 PB 1,8 PB 2,4 PB 3,3 PB 4,2 PB

Page 5: Semantic Data Management for Organising Terabyte Data Archives

M.Lautenschlager (WDCC, Hamburg) / 09.07.03 / 6

Model Raw DataStructure

Application-orientedData Storage

5D Model Data Structure

6-hour StorageInterval

Page 6: Semantic Data Management for Organising Terabyte Data Archives

M.Lautenschlager (WDCC, Hamburg) / 09.07.03 / 7

(I) Data catalogue and pointer to Unix files Enable search and identification of data Allow for data access as they are

(II) Application-oriented data storage Time series of individual variables are stored as BLOB

entries in DB TablesAllow for fast and selective data access

Storage in standard file-format (GRIB)Allow for application of standard data processing routines

(PINGOs)

CERA Concept:Semantic Data Management

Page 7: Semantic Data Management for Organising Terabyte Data Archives

M.Lautenschlager (WDCC, Hamburg) / 09.07.03 / 8

CERA Database: 7.1 TB (12.2001)* Data Catalogue* Processed Climate Data * Pointer to Raw Data files

Mass Storage Archive:210 TB neglecting Security Copies (12.2001)

CE

RA

Dat

abas

eS

yste

m

Web-Based User InterfaceCatalogue Inspection

Climate Data Retrieval

DK

RZ

Mas

s S

tora

ge A

rch

ive

In

tern

etA

cces

s

Current database size is 15.3778 Terabyte Number of experiments: 299 Number of datasets: 19818 Number of blob within CERA at 20-MAY-03: 1104798147

Typical BLOB sizes: 17 kB and 100 kB

Number of data retrievals:

1500 – 8000 / month

Parts of CERA DB

DB Size 09.07.03: 19.6 TB

Page 8: Semantic Data Management for Organising Terabyte Data Archives

M.Lautenschlager (WDCC, Hamburg) / 09.07.03 / 9

Level 1 - Interface:Metadata entries(XML, ASCII)

Level 2 – Interf.:Separate filescontaining BLOBtable data

Experiment Description

Pointer toUnix-Files

Dataset 1Description

Dataset nDescription

BLOB DataTable

BLOB DataTable

Data Structure in CERA DB

Page 9: Semantic Data Management for Organising Terabyte Data Archives

M.Lautenschlager (WDCC, Hamburg) / 09.07.03 / 10

Climate Model Raw Data

Application-oriented Data Storage(Interface level 2)

Primary DataProcessing

Page 10: Semantic Data Management for Organising Terabyte Data Archives

M.Lautenschlager (WDCC, Hamburg) / 09.07.03 / 11

Creation of application-orienteddata storage must beautomatic !!!

Automatic Fill Process (AFP)

Page 11: Semantic Data Management for Organising Terabyte Data Archives

M.Lautenschlager (WDCC, Hamburg) / 09.07.03 / 12

Archive Data Flow per month

ComputeServer

CommonFile

System

MassStorageArchive

CERADB

System

50 TB 75 TB (2006+)

2003: 4 TB2004: 10 TB2005: 17 TB2006+: 25 TB

Unix-Files

Application OrientedData Hierarchy

Application OrientedData Hierarchy

Unix-Files

MetadataInitialisation

Important:Automatic fill processhas to be performedbefore correspondingfiles migrate to massstorage archive.

Page 12: Semantic Data Management for Organising Terabyte Data Archives

M.Lautenschlager (WDCC, Hamburg) / 09.07.03 / 13

Automatic Fill ProcessSteps and Relations

DB-Server:

1. Initialisation of CERA DBMetadata and BLOB data tables are created

Compute Server:

1. Climate model calculation starts with 1. month

2. Next model month starts and primary data processing of previous monthBLOB table input is produced and stored in the dynamic DB fill cache

3. Step 2 repeated until end of model experiment

DB Server:

1. BLOB data table input accessed from DB fill cache

2. BLOB table injection and update of metadata

3. Step 2 repeated until table partition is filled (BLOB table fill cache)

4. Close partition, write corresponding DB files to HSM archive, open new partition and continue with 2)

5. Close entire table and update metadata after end of model experiment

Page 13: Semantic Data Management for Organising Terabyte Data Archives

M.Lautenschlager (WDCC, Hamburg) / 09.07.03 / 14

GFS

Automatic Fill ProcessDevelopment

192 CPU's1.5 TB Memory

ca. 60 TB

Oracle RDBMS iscurrently not connected with GFS.

Future: • RDBMS movementto Linux system and connection with GFS,• AFP is planned to move to DB-server

AFP developmentwith 1 TB dynamic DB fill cache

4 TB/month

BLOB table fill cache

Page 14: Semantic Data Management for Organising Terabyte Data Archives

M.Lautenschlager (WDCC, Hamburg) / 09.07.03 / 15

AFP Cache Sizes

Dynamic DB fill cache• In order to guarantee stable operation the fill cache should buffer data from 10 days production.

• Cache size is determined by the automatic data fill rate.

Year AFP [TB/month] DB Fill Cache [TB]

2003 4 1.5

2005 17 6

2006 - 2007 25 9

Page 15: Semantic Data Management for Organising Terabyte Data Archives

M.Lautenschlager (WDCC, Hamburg) / 09.07.03 / 16

AFP Cache Sizes

BLOB table partition cache• Number of parallel climate model calculations and size of BLOB table partitions determine the cache size.

• BLOB table sections of 10 years for the high resolution model and 50 years for the medium resolution result in 0.5 TB partitions

• Present-day experience on parallel model runs result in BLOB table partition cache sizes equal to dynamic DB fill cache.

• Modification will be inferred if number of parallel runs and/or table partitions change.

Page 16: Semantic Data Management for Organising Terabyte Data Archives

M.Lautenschlager (WDCC, Hamburg) / 09.07.03 / 17

CERA Access Statistic

Month Number of downloads

Volume (GB)

Average Transfersize (MB)

June 2003 3426 78 23 May 2003 5803 117 21 APRIL 2003 5343 66 16 MARCH 2003 3611 109 31 FEBRUARY 2003 8058 168 21 JANARY 2003 2312 204 90 DECEMBER 2002 1985 200 103 NOVEMBER 2002 2813 134 49 OCTOBER 2002 4270 225 54 SEPTEMBER 2002 4054 264 67 AUGUST 2002 5475 216 40 JULY 2002 2888 196 70 JUNE 2002 1835 219 122 MAY 2002 2317 150 66 APRIL 2002 1284 170 136 MARCH 2002 1682 88 54 FEBRUARY 2002 2258 105 47 JANUARY 2002 420 35 85

Page 17: Semantic Data Management for Organising Terabyte Data Archives

M.Lautenschlager (WDCC, Hamburg) / 09.07.03 / 18

CERA DB using countries