ASDC Data Storage Re-architecture

• As a prudent means of data stewardship, we want to routinely examine the ASDC data holdings provided by each data provider
• Our goal is to maximize the use of the storage resources (currently online disks and tape) to optimally serve the needs of users of the data
• NASA IT Security directives: all publicly accessible data and websites must be moved to the DMZ (outside the internal campus network)
ASDC Data Storage Re-architecture
Some questions to ask as we work with projects to examine data sets:

• Was the data set only used as input for generation of final products and is no longer required?
  – Has it been superseded by another version and is now obsolete?
  – Has it been replaced by an alternative data set and is no longer required?
  – Does the data set still need to remain in the ASDC archive?
• What level of protection does the data set warrant?
  – ASDC will archive and preserve long term all publicly orderable data sets. This includes a disaster recovery copy at an off-site location.
  – ASDC will archive ancillary data sets used to support production of the current and previous versions of final data products.
  – ASDC is not chartered to permanently archive ancillary data sets where another organization has responsibility for long-term archive.
• Is the data set being stored in the most effective location for required user or application access?
  – Publicly orderable data sets will be located on a fast-access disk system in the LaRC DMZ.
  – Data sets required for production and use by local scientists will be stored in the DPO.
  – Data sets not required for current or near-term access will likely be retrieved from the tape archive.
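The location criteria above can be sketched as a simple triage function. This is a hypothetical illustration of the decision flow only; the flag names and tier labels are assumptions, not ASDC software:

```python
def choose_storage_tier(publicly_orderable: bool,
                        needed_for_production: bool,
                        near_term_access: bool) -> str:
    """Hypothetical triage of a data set into one of the three ASDC
    storage locations described above (DMZ disk, DPO disk, tape)."""
    if publicly_orderable:
        # Publicly orderable data sets go to fast-access disk in the LaRC DMZ.
        return "DMZ"
    if needed_for_production or near_term_access:
        # Production inputs and locally used data sets stay in the DPO.
        return "DPO"
    # Everything else is retrieved on demand from the tape archive.
    return "TAPE"
```

Note the ordering: public orderability outranks production need, since DMZ placement also satisfies local access via the orders workflow.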
ASDC Data Storage Re-architecture
Architecture of Current Data Storage

ANGe sends data to three storage locations:
• Tape Archive: Oracle SL8500 with LTO-4/LTO-6 tapes (4/5 PB)
• DPO: SGI/NetApp IS5000 Storage Units (using 3.3 PB of 5.5 PB)
• Orders Cache: IBM DS5300 RAID Unit (524 TB)
ASDC Data Storage Re-architecture
Architecture of Current Data Storage (cont.)

• Tape Archive
  – Oracle SL8500 Tape Library
  – Primary archive media for all data archived in ANGe
  – 24 x LTO-4 tape drives; 12 x LTO-6 tape drives
  – 10,000 tape slots available
  – 6,000 LTO-4 tapes (800 GB each; 4 PB capacity)
  – 2,000 LTO-6 tapes (2.5 TB each; 5 PB capacity)
ASDC Data Storage Re-architecture
Architecture of Current Data Storage (cont.)

• DPO (Data Products Online)
  – Online disk storage for most of the data archived via ANGe (/ASDC_archive4, /ASDC_archive5, /ASDC_archive6)
  – SGI/NetApp IS5000 Storage Systems (5.5 PB usable storage); configured as 5 GPFS Building Blocks (1.1 PB each)
• Orders Cache
  – Online disk storage located in the DMZ for caching orderable data products, based on the most frequently ordered
  – IBM DS5300 Storage System (520 TB usable storage)

Data capacities as of May 2016: ~2.3 PB
ASDC Data Storage Re-architecture
Technology Refresh

• New tape technology: LTO-4 (800 GB per tape) → LTO-6 (2.5 TB per tape)
• Quantum StorNext Software → IBM HPSS
  – Cost savings:
    • StorNext is licensed by the capacity of data stored, so data must be stored on the shelf to live within the licensed capacity; the HPSS license cost does not increase as data volume continues to grow
    • StorNext has been operationally expensive due to software instabilities
  – IBM HPSS can be integrated with the existing IBM GPFS file systems to provide an end-to-end data management solution
  – ANGe can write a single copy of data files to the DPO, and the IBM software will handle writing the tape copies
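The GPFS-to-HPSS hand-off described in the last bullet is typically driven by GPFS ILM policy rules feeding the GHI migration utility. The fragment below is an illustrative sketch only; the pool names, threshold values, and the ghi_migrate path are assumptions, and the exact rule syntax should be checked against the IBM GHI documentation:

```text
/* Illustrative GPFS ILM policy: migrate DPO files to HPSS via GHI.    */
/* Pool names, thresholds, and the ghi_migrate path are assumed here.  */
RULE EXTERNAL POOL 'hpss' EXEC '/opt/hpss/bin/ghi_migrate'
RULE 'ToTape' MIGRATE FROM POOL 'system' THRESHOLD(85,75) TO POOL 'hpss'
```

With a policy of this shape, ANGe only writes the disk copy; the policy engine later migrates candidate files to tape, which is the single-write behavior the bullet describes.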
ASDC Data Storage Re-architecture
Other IBM HPSS Pros

• The IBM High Performance Storage System (HPSS) has been used at NASA Langley for over 20 years and is deployed in production environments at many multi-petabyte sites around the world.
• IBM has a track record of helping customers transition from StorNext and other archiving solutions.
[Diagram: GPFS/HPSS Interface (GHI) architecture. ANGe ingests/archives into the DPO GPFS file systems (ASDC_archive4/5/6); migration and threshold policies drive the GHI session (Process Manager, Scheduler Daemon, Event Daemon, Mount Daemon, Configuration Manager) and its utilities (ghi_migrate, ghi_recall, ghi_stage, ghi_list); GHI I/O Managers (IOM) with ISHTAR move data into HPSS (HPSS Core, HPSS Mover, GHI DB).]
ASDC Data Storage Re-architecture
Storage Tiers for ASDC Data

Storage Tier | Description | Current Medium | Profile
DMZ Data Store | DMZ-accessible online disk-based data store | SGI/NetApp disk system; ~1 PB capacity | High-performance IBM GPFS file system; low latency; 4 TB disk drives with RAID 6 protection
Internal Data Store (DPO) | Internally accessible only online disk-based data store | SGI/NetApp disk system; ~4 PB capacity | High-performance IBM GPFS file system; low latency; 4 TB disk drives with RAID 6 protection
Tape Archive Data Store | Data on tapes in local tape library | Oracle SL8500 tape library with LTO-6 tapes; 5 PB capacity | IBM HPSS-managed tape archive with two copies of data (separate tapes)
DR Tape Data Store | Data on tapes stored at disaster recovery site | Iron Mountain (Ashland, VA) | Second tape copy of data sent to DR site within 14 days after creation
ASDC Data Storage Re-architecture: Data Policy
Establish Data Retention Policies by Data Categories

Categories are intended to cover all ASDC data holdings (CALIPSO, CERES, MISR, MODIS, etc.)

Code | Data Category
P0 | Publicly Orderable Data (current versions)
P0CP | Publicly Orderable Data (current versions; input for current production stream)
P1 | Publicly Orderable Data (1 version back from current version)
P1CP | Publicly Orderable Data (1 version back; input for current production stream)
P2 | Publicly Orderable Data (2 or more versions back from current version)
P2CP | Publicly Orderable Data (2 or more versions back; input for current production stream)
L0 | Level 0 Data (including associated orbit/attitude)
I0 | Intermediate Data (current production stream)
I1 | Intermediate Data (non-current production stream)
A0 | Ancillary Data (used in current or planned production)
A1 | Ancillary Data (used for non-current production)
ANP | Ancillary Data (not used for production)
QA | QA/QC Output Data (short-term validation)
VAL | Validation Output Data (long-term validation)
PD | Project Documentation (ATBDs, mission website, etc.)
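For scripting against these categories (e.g., tagging holdings in an inventory report), the code table can be captured as a simple mapping. This is a hypothetical sketch; the slides do not describe any actual ASDC policy tooling:

```python
# Hypothetical mapping of the ASDC data category codes from the table above
# to short descriptions, for use in inventory or reporting scripts.
DATA_CATEGORIES = {
    "P0":   "Publicly orderable (current versions)",
    "P0CP": "Publicly orderable (current versions; current production input)",
    "P1":   "Publicly orderable (1 version back)",
    "P1CP": "Publicly orderable (1 version back; current production input)",
    "P2":   "Publicly orderable (2+ versions back)",
    "P2CP": "Publicly orderable (2+ versions back; current production input)",
    "L0":   "Level 0 data (including associated orbit/attitude)",
    "I0":   "Intermediate data (current production stream)",
    "I1":   "Intermediate data (non-current production stream)",
    "A0":   "Ancillary data (current or planned production)",
    "A1":   "Ancillary data (non-current production)",
    "ANP":  "Ancillary data (not used for production)",
    "QA":   "QA/QC output data (short-term validation)",
    "VAL":  "Validation output data (long-term validation)",
    "PD":   "Project documentation (ATBDs, mission website, etc.)",
}

def is_publicly_orderable(code: str) -> bool:
    """True for the P0/P1/P2 families, which ASDC archives and preserves
    long term (note: 'PD' is documentation, not a P* data category)."""
    return code.startswith(("P0", "P1", "P2"))
```

The `startswith` tuple guards against the PD (Project Documentation) code, which would otherwise be misclassified by a bare `startswith("P")`.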
ASDC Data Storage Re-architecture: Data Policy
Publicly Orderable Data (current versions)

Data Category: P0
Description: Current versions of our publicly orderable data products; the most recent ones the science teams point people toward for any defined period of time and spatial region.
Category: Publicly orderable
Required for current production processing input: Some products
Required for historical production processing: n/a
Version: Current
Availability: Publicly available
Access Speed Required: High
Retention Priority: High

Example Data Sets: CALIPSO V3/V4; CATS V2; CERES Ed3/Ed4; MISR V006; MOPITT V006; SAGE III Meteor-3M V004; TES V006

1. Data will be actively stored and managed on the fast-access storage tier and be accessible to the public outside the campus firewall (in the DMZ)
2. Data will be actively stored and managed on the fast-access storage tier designated for inputs to production and use by local scientists and the DMT
3. A copy of the data will be actively stored and managed in the archive storage tier with high latency (tape media)
4. A copy of the data will be actively stored and managed at the disaster recovery facility; integrity of data stored at the DR site validated annually
5. Revisit the status of the data annually or when a new version is published
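The five rules above amount to fanning each current-version public product out to every storage tier, plus an annual (or version-triggered) review. A minimal sketch, assuming the tier labels from the storage-tier table; none of this is ASDC software:

```python
from datetime import date, timedelta

# Rules 1-4: a P0 data set is placed in all four storage tiers.
P0_PLACEMENTS = [
    "DMZ Data Store",             # rule 1: public fast-access disk in the DMZ
    "Internal Data Store (DPO)",  # rule 2: fast-access disk for production/local use
    "Tape Archive Data Store",    # rule 3: high-latency tape copy
    "DR Tape Data Store",         # rule 4: DR copy, integrity validated annually
]

def next_review(last_review: date, new_version_published: bool) -> date:
    """Rule 5: revisit the data set's status annually, or immediately
    when a new version is published."""
    if new_version_published:
        return date.today()
    return last_review + timedelta(days=365)
```

A version publication would also trigger re-categorization (P0 becomes P1), which is why rule 5 reviews on publication rather than waiting out the year.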
ASDC Data Storage Re-architecture
The Plan Forward

• New HPSS configuration being deployed by IBM, with initial operations March/April 2017
• Need data retention policies in place to govern the migration of data from disk when it no longer requires immediate and rapid access
• Need to remove data from the archive that no longer has value
• Will continue to work with Walt in defining data policies for CERES data sets