www.altecspace.it
All rights reserved © 2016 - Altec
DATA ACCESS SERVICES AT DPCT TO SUPPORT SCIENTISTS’ ONLINE AND
OFFLINE DATA ANALYSIS
Rosario Messineo on behalf of Angelo Mulone and DPCT Team
ALTEC Activities
[Diagram labels: ISS support; flight control; human space flight & biomed; PMM; crew training; logistics center; control centers & data processing; ALTEC ESC; ALTEC MSC; ExoMars ROCC; IXV MCC; Gaia DPCT; planetary exploration; research programs (STEPS 2, SMAT F2, FP7, H2020); facilities (OPSYS; VRLAB, Virtual Reality Laboratory; MMTD); outreach initiatives.]
Introduction
• DPCT Overview
• Gaia Main Database at DPCT
• Technical Challenge
Data Management
• Data Flow
• Data Store
• Database Technologies
• Processing Databases
• Repository Database
Data Access and Analysis Services
• Data Access and Analysis Services
• Data Access Application and Tools
• On Demand database
• Next Data Exploitation Platform
Conclusion
Outline
Overview
DPCT is one of six DPCs in the Gaia Data Processing and Analysis Consortium (DPAC), built and hosted in Turin.
DPCT is the set of products, people and processes that implements the Astrometric Verification Unit (AVU) of the Gaia mission (Mission Requirements).
DPCT is the result of a joint effort between the science team of the Astrophysical Observatory of Torino (P.I. Prof. Mario Lattanzi) and the ALTEC industry team.
DPCT is funded by ASI (Italian Space Agency).
DPCT has been in operation since the Gaia satellite launch (19 December 2013).
Gaia MDB at DPCT
Data produced by the DPCs and defined in the common data model are received at DPCE as files.
Data distribution from DPCE to the other DPCs follows a subscription mechanism.
DPCT populates the MDB with the output of the AVU systems.
DPCT maintains an important part of the MDB content in its data stores.
DPCT receives data from the MDB on both a daily and a cyclic basis.
The management of the overall data flow, ensuring data availability, integrity, consistency and protection, is the major technical challenge of DPCT.
The DPCT databases will grow to hundreds of terabytes, which will have to be stored for a long time.
In addition, data access must be efficient enough to satisfy processing time constraints.
DPCT will manage a large infrastructure, up to 1 PB, with fast and secure data access.
DPCT must also support Gaia data exploitation beyond the mission objectives.
DPCT Technical Challenge
Data Flow
Data are received as files over the internet and stored in a staging area.
Received data are ingested into two different groups of databases: the processing databases (LEVEL1) and the repository database (LEVEL0).
The processing databases are dedicated to the hosted systems:
• LOCALDB for the daily pipeline
• GSRDB for the cyclic pipeline
Ingested data are pre-processed.
In the repository database (REPDB), data are stored as they are received: it contains all data received at DPCT since the beginning of the mission.
Consolidated processing output is moved from LOCALDB to REPDB.
50 GB of data are received daily.
TBs of data are received on a cyclic basis.
Data transfers run at up to 500 Mbps.
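As a rough sanity check on the figures above, the daily volume can be related to the peak transfer rate. The sketch below assumes decimal units (1 GB = 10^9 bytes, 1 Mbps = 10^6 bit/s) and a link sustained at its peak rate with no protocol overhead; the function name is hypothetical and not part of any DPCT tool.

```python
def transfer_seconds(volume_gb: float, rate_mbps: float) -> float:
    """Time needed to move volume_gb gigabytes over a link running at rate_mbps.

    Assumes decimal units and a fully sustained rate (no protocol overhead).
    """
    bits = volume_gb * 1e9 * 8        # gigabytes -> bits
    return bits / (rate_mbps * 1e6)   # bits / (bits per second)


daily_seconds = transfer_seconds(50, 500)
print(f"{daily_seconds:.0f} s (~{daily_seconds / 60:.1f} min)")
```

Under these assumptions the 50 GB daily delivery needs about 800 s, roughly 13 minutes, at 500 Mbps.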
Data Management Activities
Ingestion
• Automatic data ingestion at reception time.
• Bulk ingestion for cyclic data.
• Millions of observations ingested into REPDB and the processing databases.
Extraction
• Extraction of the output at the end of processing, to deliver data to the MDB.
• Provision of data to support development and integration processes.
Data Access
• For scientific pipelines.
• For scientists.
Consolidation
• Regular checks to exclude duplicated and/or missing data.
• Pipeline output is archived by moving data to the repository DB.
• Data moved to the repository are removed from the processing DB.
Retraction
• Data retraction is the deletion of data that are no longer valid.
• If input data are deleted, the deletion chains to all data produced from them.
• Data deletion is a very complex task.
Protection
• Two-level strategy: disk (snapshot technology) and tape.
• First-level backups are instantaneous, while tape backups take days.
• Copies of the backups will be moved to an external site, to be used in case of disaster.
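The deletion chain described under Retraction can be illustrated with a small lineage walk. This is a minimal sketch, not the DPCT procedure: the lineage dictionary, the item names and the function `retraction_set` are all hypothetical, and the slides do not describe how DPCT actually records which products were derived from which inputs.

```python
from collections import deque
from typing import Dict, List, Set


def retraction_set(derived_from: Dict[str, List[str]], retracted: Set[str]) -> Set[str]:
    """Return every item to delete when the `retracted` inputs are withdrawn.

    derived_from maps each product to the inputs it was produced from
    (a hypothetical lineage record used only for illustration).
    """
    # Invert the lineage: input -> products that used it.
    consumers: Dict[str, List[str]] = {}
    for product, inputs in derived_from.items():
        for item in inputs:
            consumers.setdefault(item, []).append(product)

    to_delete = set(retracted)
    queue = deque(retracted)
    while queue:  # breadth-first walk down the deletion chain
        current = queue.popleft()
        for product in consumers.get(current, []):
            if product not in to_delete:
                to_delete.add(product)
                queue.append(product)
    return to_delete


lineage = {
    "aim_run_7": ["obs_A", "obs_B"],
    "bam_run_3": ["obs_C"],
    "report_12": ["aim_run_7"],
}
print(sorted(retraction_set(lineage, {"obs_A"})))
```

Retracting `obs_A` here also removes `aim_run_7` and, transitively, `report_12`, while `bam_run_3` is untouched because none of its inputs were retracted.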
Data Stores
Data stores: databases and filesystems.
The databases are implemented with technologies that provide:
• Data always online,
• Good performance through table partitioning at two levels,
• Disk space usage kept under control through data compression,
• Very quick queries through parallelism,
• User trust.
Data consolidation procedures are executed at database level for:
• controlling the amount of storage space used by the DBMS,
• organizing data so that they are ready to be queried, and
• making the best use of the different types of disks.
Data Store Architecture
700 TB usable disk space on two HP P7400 storage units (FC disks and NL disks).
Tape libraries: HP ESL G3 / LTO-5.
Database servers DB-1, DB-2, DB-3: HP DL580 G7, 32 CPU cores, 256 GB RAM, SAN 8 Gbps, LAN 10 Gbps; running instances of all databases (REPDB, LOCALDB, GSRDB); Oracle Real Application 11g.
Processing servers: PR-1, PR-2, ... PR-N.
Backup server: mounts the database snapshots (DBs - Snap) and performs the backup on tape.
Why SQL DB and Oracle?
With such a large data volume we decided to use RDBMS technology because:
• When the DPCT architecture was consolidated, NoSQL technologies were not as mature as they are today.
• Oracle is provided to DPCT as COTS by the scientific community, which is responsible for license procurement.
• Oracle support is very reliable.
• It allows online analysis on a consistent database.
After more than two years of mission there have been no performance or robustness issues, but:
• A complex physical configuration was needed.
• High technical expertise is needed to maintain the databases.
• Keeping data in Oracle takes more disk space than the data themselves require.
• Data have to be accessed through the configured time or spatial fields, which is not suitable for generic data analysis.
• Changes to the common data model impact the database structure.
Database LOCALDB
Dedicated to the AIM and BAM daily systems.
Partitioning is used to improve data access performance:
• Tables are partitioned by range on time-based fields.
• Big tables are also sub-partitioned: partitioned by range and sub-partitioned by hash (data distributed over a fixed number of parts).
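The two-level scheme described above (range partitions on a time field, hash sub-partitions over a fixed number of parts) can be sketched as a mapping from a row to its partition. Everything specific here is an assumption made for illustration: the monthly range granularity, the `source_id` key, the count of 8 sub-partitions, and the use of CRC32 as a stand-in for the hash function that the DBMS applies internally.

```python
import zlib
from datetime import datetime
from typing import Tuple

N_SUBPARTITIONS = 8  # assumed fixed number of hash sub-partitions


def partition_for(obs_time: datetime, source_id: int) -> Tuple[str, int]:
    """Map a row to (range partition, hash sub-partition).

    Range partition: one per calendar month of the time field (assumed granularity).
    Sub-partition: stable hash of the key modulo a fixed count.
    """
    range_part = f"P_{obs_time:%Y_%m}"
    # crc32 is only a deterministic stand-in for the DBMS hash function.
    sub_part = zlib.crc32(str(source_id).encode("ascii")) % N_SUBPARTITIONS
    return range_part, sub_part


print(partition_for(datetime(2014, 7, 25, 12, 0), 123456789))
```

A query restricted to a time interval only needs to touch the matching range partitions, while the hash level spreads the rows of each range evenly over a fixed number of parts that can be scanned in parallel.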
[Chart: LOCALDB Size (GB), vertical axis from 0 to 25,000 GB]
[Pie chart: LOCALDB Composition, shares of 80%, 15% and 5%; legend: AIM, BAM, Other]
Database GSRDB
• Dedicated to GSR cyclic processing.
• New empty schema at the start of each cycle.
• Size depends on the processing cycle.
• Populated from the repository.
[Chart: GSRDB Size (GB), vertical axis from 13,000 to 14,600 GB]
[Pie chart: GSRDB Space Used, shares of 92% and 8%; legend: Occupied, Allocated Free]
Database REPDB
Repository database dedicated to managing the data coming from the MDB and the output of the DPCT systems.
Tables are partitioned on a field identifying the processing run, to ease data retraction.
Many procedures manage table partitioning and sub-partitioning and address the database size problem.
Data stored in REPDB are available to scientists at any time.
Scanning a table with more than 40 billion observations is feasible in less than 20 minutes.
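The scan figure quoted above implies a minimum sustained read rate, which is easy to work out. The helper below is a hypothetical one-liner for that arithmetic; it treats "more than 40 billion" as exactly 40 billion rows and "less than 20 minutes" as exactly 20 minutes, so the result is a lower bound.

```python
def min_scan_rate(rows: float, minutes: float) -> float:
    """Rows per second needed to scan `rows` rows within `minutes` minutes."""
    return rows / (minutes * 60)


rate = min_scan_rate(40e9, 20)
print(f"{rate / 1e6:.1f} million rows/s")
```

That is at least about 33 million rows per second across the whole scan.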
[Chart: REPDB Size (GB), vertical axis from 0 to 160,000 GB]
[Pie chart: Disk Class Occupancy, shares of 75% and 25%; legend: Fast, Slow]
Data Access and Analysis Services
The DPCT architecture is designed and implemented to provide data access and analysis services to scientists.
Scientists have read-only access to the data managed at DPCT.
DPCT has a client area, exposing several services, that is accessible remotely through a secure connection.
Access to DPCT tools and facilities is governed by a strict security policy: VPN, LDAP, password expiration, certificates, etc.
Direct Access to Repository
A custom workflow environment, named DAAS, packages data according to pipeline modules. It is extremely useful for pipeline troubleshooting and ad hoc analysis.
IDL + Data Mining is configured to read data from the DPCT repository database. A library bridges Java to IDL, allowing any Java Gaia library to be used from IDL.
The scientists' development environment is configured to read data from the DPCT repository database.
Data Access Applications and Tools
Dedicated web interfaces to monitor the processing and see the results.
Dedicated web interfaces for data request management:
• Scientists insert a data request.
• The DPCT operator evaluates the feasibility of the request and performs the data extraction.
Common tools released by DPAC and configured on top of the DPCT data stores.
General tools such as Topcat, DS9, etc.
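The data request handling described above (a scientist inserts a request, an operator evaluates its feasibility and then performs the extraction) can be pictured as a small state machine. This is only an illustrative sketch: the state names, the allowed transitions and the `advance` helper are assumptions, since the slides name the steps but not how the web interface models them.

```python
from enum import Enum


class RequestState(Enum):
    SUBMITTED = "submitted"   # inserted by a scientist
    ACCEPTED = "accepted"     # operator judged the request feasible
    REJECTED = "rejected"     # operator judged the request not feasible
    EXTRACTED = "extracted"   # operator performed the data extraction


# Allowed transitions (assumed; the slides only name the steps).
TRANSITIONS = {
    RequestState.SUBMITTED: {RequestState.ACCEPTED, RequestState.REJECTED},
    RequestState.ACCEPTED: {RequestState.EXTRACTED},
    RequestState.REJECTED: set(),
    RequestState.EXTRACTED: set(),
}


def advance(current: RequestState, target: RequestState) -> RequestState:
    """Move a request to `target`, refusing transitions the workflow does not allow."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot go from {current.value} to {target.value}")
    return target


state = RequestState.SUBMITTED
state = advance(state, RequestState.ACCEPTED)
state = advance(state, RequestState.EXTRACTED)
print(state.value)
```

Keeping the transitions in one table makes it explicit that extraction can only follow a positive feasibility evaluation.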
On Demand Database
«On demand» databases are foreseen to support data analysis.
An «On demand» DB is a new dedicated database that is created, populated, hosted and provided to scientists for a specific data analysis.
Several database technologies could be used to implement «On demand» databases.
«On demand» databases are provided together with the tools and applications that enable data analysis.
The Oracle 12c multitenancy feature was tested. Through multitenancy, users can move a database between two different sites while preserving data, permissions and logical structures.
Multitenancy in ALTEC Cloud
Gaia DPCT
In the ALTEC cloud, a DPCT user can own a virtual machine with a custom database containing the needed data.
Virtual machines are configured with a set of tools that can interact with the database (e.g. IDL, Cassandra, ...).
TECSEL2 Architecture
TECSEL2: TEmporal Characterization of the remote SEnsors response to radiation damage in L2.
Data are exported to a database built with Cassandra technology.
Algorithms are executed in memory through Spark.
Input: AIM dataset, ~100M input objects.
DPCT Data Exploitation Experimental Platform
[Diagram labels: DPCT Operational Platform; REPDB; ALTEC Development Environment; Cloud IaaS - PaaS; Big Data; ALTEC Space Data Processing; Data Access Tools; timeline from "Now" to "In the future".]
Conclusion
DPCT has a complex database architecture, based on commercial technologies, defined and implemented to support the mission.
REPDB is a very large database and is the data source for data exploitation tasks at DPCT. In REPDB, data are online and accessible through several data access services.
The next challenge at DPCT is to create a data analysis environment for experimenting with the new approaches, methodologies and techniques needed for scientific data exploitation, to maximize the extraction of information from existing datasets that are not yet fully exploited.
On demand databases will be available in the data analysis environment.
Thanks for your attention!