DATA ACCESS SERVICES AT DPCT TO SUPPORT SCIENTISTS’ ONLINE AND OFFLINE DATA ANALYSIS

Rosario Messineo on behalf of Angelo Mulone and DPCT Team
[email protected]

www.altecspace.it
All rights reserved © 2016 - Altec

Source: …old.esaconferencebureau.com/custom/16M05/bids/ALL/D3... (ESA, 2016)

ALTEC Activities

[Diagram; labels recovered from the slide:]
• ISS SUPPORT: FLIGHT CONTROL, HUMAN SPACE FLIGHT & BIOMED, PMM, CREW TRAINING, LOGISTICS CENTER
• CONTROL CENTERS & DATA PROCESSING: ALTEC ESC, ALTEC MSC, EXOMARS ROCC, IXV MCC, GAIA DPCT
• PLANETARY EXPLORATION
• RESEARCH PROGRAMS: STEPS 2, SMAT F2, FP7, FP7, H2020
• FACILITIES: OPSYS, VRLAB (Virtual Reality Laboratory), MMTD
• OUTREACH INITIATIVES

Outline

Introduction
• DPCT Overview
• Gaia Main Database at DPCT
• Technical Challenge

Data Management
• Data Flow
• Data Store
• Database Technologies
• Processing Databases
• Repository Database

Data Access and Analysis Services
• Data Access and Analysis Services
• Data Access Applications and Tools
• On Demand Database
• Next Data Exploitation Platform

Conclusion

Introduction

Overview

DPCT is one of six DPCs in the Gaia Data Processing and Analysis Consortium (DPAC), built and hosted in Turin.

DPCT is the set of products, people and processes that deliver the Astrometric Verification Unit (AVU) of the Gaia mission (Mission Requirements).

DPCT is the result of a joint effort between the science team of the Astrophysical Observatory of Torino (P.I. Prof. Mario Lattanzi) and the ALTEC industry team.

DPCT is funded by ASI (Italian Space Agency).

DPCT has been in operation since the Gaia satellite launch (19 December 2013).

Gaia MDB at DPCT

Data produced by the DPCs and defined in the common data model are received at DPCE as files.

Data distribution from DPCE to the other DPCs follows a subscription mechanism.

DPCT populates the MDB with the output of the AVU systems.

DPCT maintains an important part of the MDB content in its data stores.

DPCT receives data from the MDB on both a daily and a cyclic basis.
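
The subscription mechanism mentioned above can be pictured with a small sketch. This is only an illustration of the general publish/subscribe idea, not the actual DPCE software: the class `SubscriptionHub`, the table names `TableA` and `TableB`, the centre name `DPC_X` and the file name are all hypothetical.

```python
from collections import defaultdict


class SubscriptionHub:
    """Toy stand-in for a hub that forwards data-model tables to subscribers."""

    def __init__(self):
        # Data-model table name -> set of subscribed centre names.
        self._subscribers = defaultdict(set)

    def subscribe(self, centre, table):
        """Register a centre as interested in one table of the data model."""
        self._subscribers[table].add(centre)

    def deliveries_for(self, table, filename):
        """Return (centre, filename) pairs for every centre subscribed to table."""
        centres = self._subscribers.get(table, ())
        return sorted((centre, filename) for centre in centres)


hub = SubscriptionHub()
hub.subscribe("DPCT", "TableA")   # hypothetical table names
hub.subscribe("DPCT", "TableB")
hub.subscribe("DPC_X", "TableA")  # hypothetical second centre

# A new file for TableA goes only to the centres that asked for TableA.
deliveries = hub.deliveries_for("TableA", "table_a_0001.dat")
```

The point of the design is that a centre receives only the tables it has subscribed to, so the hub never needs to know what each centre does with the data.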

DPCT Technical Challenge

Managing the overall data flow while ensuring data availability, integrity, consistency and protection is the major technical challenge of DPCT.

The DPCT databases will grow to hundreds of terabytes, which will have to be stored for a long time.

In addition, data access must be efficient enough to satisfy processing time constraints.

DPCT will manage a large infrastructure of up to 1 PB with fast and secure data access.

DPCT also has to support Gaia data exploitation beyond the mission objectives.

Data Management

Data Flow

Data are received as files over the internet and stored in a staging area.

Received data are ingested into two groups of databases: the processing databases (LEVEL1) and the repository database (LEVEL0).

Processing databases are dedicated to the hosted systems:
• LOCALDB for the daily pipeline
• GSRDB for the cyclic pipeline

Ingested data are pre-processed.

In the repository database (REPDB) data are stored as they are received; it contains all data received at DPCT since the beginning of the mission.

Consolidated processing output is moved from LOCALDB to REPDB.

50 GB of data are received daily.

TBs of data are received on a cyclic basis.

Data transfer runs at up to 500 Mbps.
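
To put the figures above in perspective, the following back-of-the-envelope calculation estimates how long the daily delivery takes at the peak rate. It is a rough sketch under stated assumptions: decimal units (1 GB = 1000 MB), a link sustained at the full 500 Mbps, and no protocol overhead. The function name `transfer_seconds` is illustrative.

```python
def transfer_seconds(size_gb, rate_mbps):
    """Time to move size_gb (decimal gigabytes) over a link of rate_mbps (megabits/s)."""
    megabits = size_gb * 1000 * 8  # 1 GB = 1000 MB, 1 byte = 8 bits
    return megabits / rate_mbps


daily_seconds = transfer_seconds(50, 500)            # 800 seconds
daily_minutes = daily_seconds / 60                   # about 13.3 minutes
one_tb_hours = transfer_seconds(1000, 500) / 3600    # about 4.4 hours per terabyte
```

Under these assumptions the daily 50 GB needs roughly a quarter of an hour, while each terabyte of cyclic data needs several hours.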

Data Management Activities

Ingestion
• Automatic data ingestion at reception time.
• Bulk ingestion for cyclic data.
• Millions of observations ingested into REPDB and the processing databases.

Extraction
• Extraction of the output at the end of processing to deliver data to the MDB.
• Provide data to support development and integration processes.

Data Access
• For scientific pipelines
• For scientists

Consolidation
• Regular checks to exclude duplicated and/or missing data.
• Pipeline output is archived by moving data to the repository DB.
• Data moved to the repository are removed from the processing DB.

Retraction
• Data retraction is the deletion of data that are no longer valid.
• If input data are deleted, the data produced from them must be deleted in turn (deletion chain).
• Data deletion is a very complex task.

Protection
• Strategy on two levels: disk (snapshot technology) and tape.
• First-level backups are instantaneous, while tape backups take days.
• Copies of the backups will be moved to an external site to be used in case of disaster.
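
The deletion chain described under Retraction can be sketched as a graph traversal. This is a minimal illustration, assuming provenance is available as a mapping from each product to the inputs it was built from; the function `retraction_set` and the identifiers `obs1`, `out1`, `summary` are hypothetical and do not reflect the DPCT implementation.

```python
from collections import deque


def retraction_set(retracted, derived_from):
    """Return every data item that must be deleted when `retracted` items are.

    derived_from maps an item id to the ids of the inputs it was produced from.
    """
    # Invert the provenance map: input id -> ids of items produced from it.
    consumers = {}
    for item, inputs in derived_from.items():
        for source in inputs:
            consumers.setdefault(source, set()).add(item)

    # Breadth-first walk from the retracted items to everything derived from them.
    to_delete = set(retracted)
    queue = deque(retracted)
    while queue:
        current = queue.popleft()
        for child in consumers.get(current, ()):
            if child not in to_delete:
                to_delete.add(child)
                queue.append(child)
    return to_delete


provenance = {
    "out1": ["obs1", "obs2"],     # out1 was computed from obs1 and obs2
    "out2": ["obs3"],
    "summary": ["out1", "out2"],  # summary depends on both outputs
}
doomed = retraction_set(["obs1"], provenance)
```

Retracting `obs1` here also removes `out1` and `summary`, while `out2` survives because none of its inputs were retracted.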

Data Stores

Data stores: databases and filesystems.

Databases are implemented using technologies that allow:
• data always online,
• good performance through table partitioning at two levels,
• disk space usage kept under control using data compression,
• very quick queries by activating parallelism,
• users' trust.

Data consolidation procedures are executed at database level for:
• controlling the amount of storage space used by the DBMS,
• organizing data so that they are ready to be queried, and
• maximizing the usage of the different types of disks.
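
A consolidation step of the kind described above (archive processing output in the repository, then free the processing database) can be sketched as a single transaction. The example uses Python's built-in `sqlite3` purely as a stand-in so that it is runnable; DPCT uses Oracle, and the table names, columns and the function `consolidate` are assumptions made for illustration.

```python
import sqlite3


def consolidate(conn, run_id):
    """Move one run's rows from the processing table to the repository table."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute(
            "INSERT INTO repository (obs_id, run_id, value) "
            "SELECT obs_id, run_id, value FROM processing WHERE run_id = ?",
            (run_id,),
        )
        moved = conn.execute(
            "DELETE FROM processing WHERE run_id = ?", (run_id,)
        ).rowcount
    return moved


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE processing (obs_id INTEGER PRIMARY KEY, run_id INTEGER, value REAL)"
)
# The primary key on the repository rejects duplicated observations.
conn.execute(
    "CREATE TABLE repository (obs_id INTEGER PRIMARY KEY, run_id INTEGER, value REAL)"
)
conn.executemany(
    "INSERT INTO processing VALUES (?, ?, ?)",
    [(1, 10, 0.5), (2, 10, 0.7), (3, 11, 0.9)],
)
conn.commit()

moved = consolidate(conn, 10)
left = conn.execute("SELECT COUNT(*) FROM processing").fetchone()[0]
archived = conn.execute("SELECT COUNT(*) FROM repository").fetchone()[0]
```

Doing the copy and the delete in one transaction means a failure part-way through leaves both tables unchanged, so no data are lost or duplicated.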

Data Store Architecture

[Architecture diagram; labels recovered from the slide:]
• Storage: 700 TB usable disk space on two HP P7400 storage units; FC disks and NL disks.
• Tape libraries: HP ESL G3 / LTO-5.
• Database servers: DB-1, DB-2, DB-3, running instances of all databases (Oracle Real Application 11g).
• Server hardware: HP DL580 G7, 32 CPU cores, 256 GB RAM, SAN 8 Gbps, LAN 10 Gbps.
• Processing servers: PR-1, PR-2, ..., PR-N.
• Backup server: mounts the snapshot and performs the backup to tape.
• Databases: REPDB, LOCALDB, GSRDB; DBs - Snap.

Why SQL DB and Oracle?

With such a large data volume we decided to use RDBMS technology because:
• At the time the DPCT architecture was consolidated, NoSQL technologies were not as mature as they are today.
• Oracle is provided to DPCT as COTS by the scientific community, which is responsible for license procurement.
• Oracle support is very reliable.
• It allows online analysis on a consistent database.

After more than two years of mission there have been no performance or robustness issues, but:
• A complex physical configuration was needed.
• High technical expertise is needed to maintain the databases.
• More disk space is needed to keep the data in Oracle.
• Data have to be accessed through the configured time or spatial fields, which is not suitable for generic data analysis.
• Changes to the common data model impact the database structure.

Database LOCALDB

Dedicated to the AIM and BAM daily systems.

Partitioning is used to improve data access performance:
• Tables are partitioned by range using time-based fields.
• Big tables are also sub-partitioned: partitioned by range and sub-partitioned by hash (data distributed over a fixed number of parts).

[Chart: LOCALDB Size (GB), axis from 0 to 25,000]

[Chart: LOCALDB Composition; legend AIM, BAM, Other; slices 80%, 15% and 5%]
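
The two-level scheme described for LOCALDB (range partitioning on a time field, hash sub-partitioning into a fixed number of parts) can be illustrated as follows. This is a conceptual sketch only: the monthly granularity, the count of 8 sub-partitions, the `source_id` key and the use of CRC32 are all assumptions, and Oracle applies its own internal hash function.

```python
from datetime import datetime
import zlib

SUBPARTITIONS = 8  # assumed fixed number of hash sub-partitions


def partition_for(obs_time, source_id):
    """Return (range partition, hash sub-partition) for one row."""
    # Range partitioning on a time-based field: one partition per calendar month.
    range_part = f"{obs_time.year:04d}_{obs_time.month:02d}"
    # Hash sub-partitioning: spread rows over a fixed number of buckets.
    bucket = zlib.crc32(str(source_id).encode("ascii")) % SUBPARTITIONS
    return range_part, bucket


part, bucket = partition_for(datetime(2014, 7, 25, 12, 0), 123456789)
```

A query restricted to a time interval only has to touch the matching range partitions, while the hash level spreads the rows of each range evenly so that they can be scanned in parallel.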

Database GSRDB

• Dedicated to GSR cyclic processing
• New empty schema at start of cycle
• Size depends on processing cycle
• Populated from Repository

[Chart: GSRDB Size (GB), axis from 13,000 to 14,600]

[Chart: GSRDB Space Used; legend Occupied, Allocated Free; slices 92% and 8%]

Database REPDB

Repository database dedicated to managing data coming from the MDB and the output of DPCT systems.

Tables are partitioned on a field identifying the processing run, to ease data retraction.

Many procedures manage table partitioning/sub-partitioning and address the database size problem.

Data stored in REPDB are available to scientists at any time.

Scanning a table with more than 40 billion observations is feasible in less than 20 minutes.

[Chart: REPDB Size (GB), axis from 0 to 160,000]

[Chart: Disk Class Occupancy; legend Fast, Slow; slices 75% and 25%]
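
Why partitioning by processing run eases retraction can be seen in a toy model: when all rows of a run live in one partition, retracting the run means dropping that partition rather than searching the table row by row. The class `RunPartitionedTable` and the run identifiers are hypothetical, and a plain dictionary stands in for the database partitions.

```python
from collections import defaultdict


class RunPartitionedTable:
    """Toy table whose rows are grouped by the processing run that produced them."""

    def __init__(self):
        self._partitions = defaultdict(list)  # run_id -> rows

    def insert(self, run_id, row):
        self._partitions[run_id].append(row)

    def retract_run(self, run_id):
        """Drop the whole partition for run_id; return how many rows went away."""
        return len(self._partitions.pop(run_id, []))

    def row_count(self):
        return sum(len(rows) for rows in self._partitions.values())


table = RunPartitionedTable()
table.insert("run_001", {"obs": 1})
table.insert("run_001", {"obs": 2})
table.insert("run_002", {"obs": 3})

removed = table.retract_run("run_001")  # one operation removes the whole run
```

In a real RDBMS the equivalent is a partition-level drop, which avoids a row-by-row delete over a very large table.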

Data Access and Analysis Services (DAAS)

Data Access and Analysis Services

The DPCT architecture is defined and implemented to provide data access and analysis services to scientists.

Scientists have read-only access to the data managed at DPCT.

DPCT has a client area, exposing several services, which is accessible remotely through secure access.

Access to DPCT tools and facilities is governed by a strict security policy: VPN, LDAP, password expiration, certificates, etc.

Direct Access to Repository

• Custom workflow environment named DAAS to package data according to pipeline modules. Extremely useful for pipeline troubleshooting and ad hoc analysis.
• IDL + Data Mining configured to read data from the DPCT repository database. A library bridges Java to IDL, allowing any Java Gaia library to be used from IDL.
• Scientists’ development environment configured to read data from the DPCT repository database.

Data Access Applications and Tools

• Dedicated web interfaces to monitor the processing and see results.
• Dedicated web interfaces for data request management: scientists submit a data request; a DPCT operator evaluates its feasibility and performs the data extraction.
• Common tools released by DPAC and configured on top of the DPCT data stores.
• General tools such as TOPCAT, DS9, etc.
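
The data request workflow described above (a scientist submits a request, an operator evaluates feasibility, then performs the extraction) can be modelled as a small state machine. This is a hypothetical sketch, not the DPCT web application: the class `DataRequest`, the `Status` values and the sample request are invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    SUBMITTED = "submitted"
    REJECTED = "rejected"
    APPROVED = "approved"
    EXTRACTED = "extracted"


@dataclass
class DataRequest:
    scientist: str
    description: str
    status: Status = Status.SUBMITTED

    def evaluate(self, feasible):
        """Operator decision on feasibility; only valid for a new request."""
        if self.status is not Status.SUBMITTED:
            raise ValueError("request already evaluated")
        self.status = Status.APPROVED if feasible else Status.REJECTED

    def extract(self):
        """Operator performs the extraction; only valid after approval."""
        if self.status is not Status.APPROVED:
            raise ValueError("request not approved")
        self.status = Status.EXTRACTED


request = DataRequest("scientist_a", "hypothetical extraction of one table subset")
request.evaluate(feasible=True)
request.extract()
```

Making the allowed transitions explicit ensures that no extraction can run before an operator has judged the request feasible.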

On Demand Database

«On demand» databases are foreseen to support data analysis.

An «On demand» DB is a new dedicated database created, populated, hosted and provided to scientists for a specific data analysis.

Several database technologies could be used to implement «On demand».

«On demand» databases are provided together with tools and applications enabling data analysis.

The Oracle 12c multitenancy feature was tested. Through multitenancy users can move a database between two different sites while maintaining data, permissions and logical structures.

Multitenancy in ALTEC Cloud

[Diagram: Gaia DPCT]

In the ALTEC cloud a DPCT user can own a virtual machine with a custom database holding the needed data.

Virtual machines are configured with a set of tools that can interact with the database (e.g. IDL, Cassandra, ...).

TECSEL2 Architecture

TECSEL2: TEmporal Characterization of the remote SEnsors response to radiation damage in L2

• Data exported to a database built with Cassandra technology.
• Algorithms executed in memory through Spark.
• AIM dataset: ~100M input objects.

DPCT Data Exploitation Experimental Platform

[Diagram; labels recovered from the slide: DPCT Operational Platform; REPDB; ALTEC Development Environment; Cloud IaaS - PaaS; Big Data; ALTEC Space; Data Processing; Data Access; Tools. Timeline labels: Now, In the future.]

Conclusion

Conclusion

DPCT has a complex database architecture, based on commercial technologies, defined and implemented to support the mission.

REPDB is a very large database and is the data source for data exploitation tasks at DPCT. In REPDB the data are online and accessible through several data access services.

The next challenge at DPCT is to create a data analysis environment for experimenting with the new approaches, methodologies and techniques needed for scientific data exploitation, in order to maximize the extraction of information from existing datasets that are not yet fully exploited.

On demand databases will be available in the data analysis environment.

Thanks for your attention!

[email protected]

www.altecspace.it