Scientific Data Management Center (ISIC)
http://sdmcenter.lbl.gov (contains extensive publication list)

Scientific Data Management Center: Participating Institutions
Center PI: Arie Shoshani, LBNL
DOE laboratory co-PIs:
• Bill Gropp, Rob Ross (ANL)
• Arie Shoshani, Doron Rotem (LBNL)
• Terence Critchlow, Chandrika Kamath (LLNL)
• Nagiza Samatova, Andy White (ORNL)
University co-PIs:
• Mladen Vouk (North Carolina State)
• Alok Choudhary (Northwestern)
• Reagan Moore, Bertram Ludaescher (UC San Diego / SDSC)
• Calton Pu (Georgia Tech)
• Steve Parker (U of Utah, future)
Phases of Scientific Exploration: Data Generation
From large-scale simulations or experiments; fast data growth with computational power. Examples:
• HENP: 100 teraops and 10 petabytes by 2006
• Climate: spatial resolution T42 (280 km) -> T85 (140 km) -> T170 (70 km); T42 is about 1 TB per 100-year run => factor of ~10-20
Problems:
• Can't dump the data to storage fast enough: waste of compute resources
• Can't move terabytes of data over the WAN robustly: waste of the scientist's time
• Can't steer the simulation: waste of time and resources
• Need to reorganize and transform data: large data-intensive tasks slow progress
Phases of Scientific Exploration: Data Analysis
Analysis of large data volumes; the data cannot all fit in memory. Problems:
• Find the relevant data: need efficient indexing
• Cluster analysis: need linear scaling
• Feature selection: need efficient high-dimensional analysis
• Data heterogeneity: need to combine data from diverse sources
• Streamline analysis steps: the output of one step needs to match the input of the next
Example Data Flow in TSI
[Diagram: input data feeds a highly parallel compute stage whose output (~500x500 files) is aggregated to ~500 files (< 2 to 10+ GB each) and archived to local mass storage (14+ TB) through a data depot on the Logistical Network (L-Bone); files are then aggregated to one file (1+ TB each) for visualization on the viz wall and viz client, backed by viz software on a local 44-processor data cluster where data sits on local nodes for weeks.]
Courtesy: John Blondin
Goal: Reduce the Data Management Overhead
• Efficiency. Example: parallel I/O, indexing, matching storage structures to the application
• Effectiveness. Example: access data by attributes, not files; facilitate massive data movement
• New algorithms. Example: specialized PCA techniques to separate signals or to achieve better spatial data compression
• Enabling ad-hoc exploration of data. Example: an exploratory "run and render" capability to analyze and visualize simulation output while the code is running
Approach
Use an integrated framework that:
• Provides a scientific workflow capability
• Supports data mining and analysis tools
• Accelerates storage and access to data
Simplify data management tasks for the scientist:
• Hide details of the underlying parallel and indexing technology
• Permit assembly of modules using a simple graphical workflow description tool
[Diagram: the SDM framework places three layers between the scientific application and scientific understanding: a Scientific Process Automation layer, a Data Mining & Analysis layer, and a Storage Efficient Access layer.]
Technology Details by Layer
Scientific Process Automation (SPA) layer: workflow management tools; web wrapping tools
Data Mining & Analysis (DMA) layer: efficient parallel visualization (pVTK); efficient indexing (Bitmap Index); data analysis tools (PCA, ICA); ASPECT integration framework
Storage Efficient Access (SEA) layer: Parallel NetCDF software layer; Parallel Virtual File System; Storage Resource Manager (to HPSS); ROMIO MPI-IO system
All layers sit on the hardware, OS, and MSS (HPSS).
Accomplishments: Storage Efficient Access (SEA)
Developed Parallel netCDF (see the sketch below)
• Enables high-performance parallel I/O to netCDF datasets
• Achieves up to 10-fold performance improvement over HDF5
Enhanced ROMIO
• Provides MPI access to PVFS
• Advanced parallel file system interfaces for more efficient access
Developed PVFS2
• Adds Myrinet GM and InfiniBand support, improved fault tolerance, and asynchronous I/O
• Offered by Dell and HP for clusters
Deployed an HPSS Storage Resource Manager (SRM) with PVFS
• Automatic access of HPSS files to PVFS through the MPI-IO library
• SRM is a middleware component
[Diagrams: before, processes P0-P3 write through serial netCDF to the parallel file system; after, they write concurrently through Parallel netCDF. Charts: Parallel Virtual File System enhancements (shared memory communication) and deployment; FLASH I/O benchmark performance (8x8x8 block sizes).]
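A minimal sketch of the "after" access pattern, written against netCDF4-python built with MPI support (the center's PnetCDF library itself exposes a C API, and the file and variable names here are illustrative):

from mpi4py import MPI
import numpy as np
from netCDF4 import Dataset

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

# Every process opens the same file in parallel, instead of funneling
# all data through process 0 as in the "before" picture.
ds = Dataset("output.nc", "w", parallel=True, comm=comm)
ds.createDimension("x", 1000 * nprocs)
var = ds.createVariable("temp", "f4", ("x",))

# Each process writes its own disjoint slice of the global array.
local = np.random.rand(1000).astype("f4")
var[rank * 1000:(rank + 1) * 1000] = local
ds.close()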
Robust Multi-file Replication
Problem: move thousands of files robustly
• Takes many hours
• Need error recovery from mass storage system failures and network failures
• Use Storage Resource Managers (SRMs)
Problem: too slow
• Use parallel streams
• Use concurrent transfers
• Use large FTP windows
• Pre-stage files from MSS
(A sketch of the recovery loop follows below.)
[Diagram: a DataMover, runnable anywhere, issues SRM-COPY for thousands of files and SRM-GET for one file at a time; the source SRM at NCAR gets the list of files, stages them from MSS, and performs reads; GridFTP GET (pull mode) performs the network transfer into the LBNL disk cache, where the destination SRM performs writes and archives the files.]
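A minimal sketch of that recovery loop: concurrent transfers with per-file retry on transient failures. The transfer_file() stand-in (a local copy) replaces the real GridFTP/SRM machinery, and all names and limits here are illustrative:

import shutil
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_RETRIES = 3           # retry each file a few times before giving up
CONCURRENT_TRANSFERS = 4  # "use concurrent transfers" from the slide

def transfer_file(src, dst):
    shutil.copy(src, dst)  # stand-in for a GridFTP GET (pull mode)

def move_with_retry(src, dst):
    for attempt in range(MAX_RETRIES):
        try:
            transfer_file(src, dst)
            return src, True
        except OSError:  # transient MSS or network failure
            continue
    return src, False

def replicate(file_list, dst_cache):
    with ThreadPoolExecutor(max_workers=CONCURRENT_TRANSFERS) as pool:
        futures = [pool.submit(move_with_retry, f, dst_cache) for f in file_list]
        failed = []
        for fut in as_completed(futures):
            src, ok = fut.result()
            if not ok:
                failed.append(src)  # files still needing recovery
    return failed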
File tracking helps to identify bottlenecks
[Chart of per-file transfer timelines, showing that archiving is the bottleneck.]
File tracking shows recovery from transient failures
[Chart: total transferred: 45 GB.]
Accomplishments: Data Mining and Analysis (DMA)
Developed Parallel-VTK
• Efficient 2D/3D parallel scientific visualization for NetCDF and HDF files
• Built on top of PnetCDF
Developed a "region tracking" tool
• For exploring 2D/3D scientific databases
• Uses bitmap technology to identify regions based on multi-attribute conditions
Implemented an Independent Component Analysis (ICA) module
• Used for accurate signal separation
• Used for discovering key parameters that correlate with observed data
Developed highly effective data reduction (see the sketch below)
• Achieves 15-fold reduction with a high level of accuracy
• Uses parallel Principal Component Analysis (PCA) technology
Developed ASPECT
• A framework that supports a rich set of pluggable data analysis tools, including all the tools above
• A rich suite of statistical tools based on the R package
[Charts: PVTK serial vs. parallel writer on an 80 MB dataset, plotting time in seconds against number of processors; El Nino signal (red) and estimation (blue) closely match; combustion region tracking.]
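For illustration, a minimal numpy sketch of PCA-based data reduction; the center's implementation is parallel and tuned for scientific data, while the shapes and the choice of k here are arbitrary:

import numpy as np

def pca_reduce(data, k):
    """Keep the top-k principal components of (n_samples, n_features) data."""
    mean = data.mean(axis=0)
    centered = data - mean
    # SVD of the centered data gives the principal directions in vt.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:k]               # (k, n_features)
    scores = centered @ components.T  # reduced representation
    return scores, components, mean

def pca_restore(scores, components, mean):
    """Approximate reconstruction from the reduced representation."""
    return scores @ components + mean

data = np.random.rand(1000, 300)              # e.g. 300 variables per sample
scores, comps, mean = pca_reduce(data, k=20)  # 300 -> 20 values per sample
approx = pca_restore(scores, comps, mean)     # lossy but close reconstruction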
ASPECT Analysis Environment
Example analysis, as a pipeline of data select, data access, correlate, render, and display steps:
Select (temp, pressure) From astro-data Where (step=101) and (entropy>1000);
take a sample of (temp, pressure); run an R analysis; run a pVTK filter; visualize a scatter plot in Qt.
(A sketch of the bitmap selection step follows below.)
[Diagram: the Select Data step uses the bitmap index (condition) to get variables (var-names, ranges); data is passed between the R analysis tool, the sampling step, and the pVTK tool through read/write buffers; the Data Mining & Analysis layer sits on the Storage Efficient Access layer (Parallel NetCDF, PVFS, bitmap index selection), which sits on the hardware, OS, and MSS (HPSS).]
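A minimal sketch of the bitmap selection idea, with uncompressed numpy boolean vectors standing in for compressed bitmaps (a production index stores one compressed bitmap per attribute bin, but the query logic, a bitwise AND per condition, is the same; the data here is synthetic):

import numpy as np

n = 1_000_000
step = np.full(n, 101)                 # one value per record
entropy = np.random.rand(n) * 2000

# One bitmap per condition: a bit is set where the record qualifies.
bitmap_step = (step == 101)
bitmap_entropy = (entropy > 1000)

# Multi-attribute condition = bitwise AND of the per-condition bitmaps.
selection = bitmap_step & bitmap_entropy
row_ids = np.nonzero(selection)[0]     # ids of qualifying records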
Accomplishments: Scientific Process Automation (SPA)
Unique requirements of scientific workflows:
• Moving large volumes between modules: tightly-coupled, efficient data movement
• Specification of granularity-based iteration: e.g., in spatio-temporal simulations a time step is a "granule"
• Support for data transformation: complex data types, including file formats such as netCDF and HDF
• Dynamic steering of the workflow by the user: dynamic user examination of results
Developed a working scientific workflow system (see the sketch below):
• Automatic microarray analysis
• Uses web-wrapping tools developed by the center
• Uses the Kepler workflow engine; Kepler is an adaptation of the UC Berkeley tool Ptolemy
• Workflow steps are defined graphically, and workflow results are presented to the user
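An illustrative Python sketch (not Kepler's actual API) of the dataflow idea behind such workflows: each step's output must match the next step's input, and the engine adds scheduling, provenance, and the graphical front end. All names and data here are hypothetical:

def fetch_microarray(sample_id):
    # Stand-in for a web-wrapped data source.
    return {"sample": sample_id, "raw": [0.8, 1.2, 0.5]}

def normalize(record):
    total = sum(record["raw"])
    return {**record, "norm": [v / total for v in record["raw"]]}

def report(record):
    print(f"sample {record['sample']}: {record['norm']}")
    return record

# The workflow is a declared sequence of steps; output feeds input.
PIPELINE = [fetch_microarray, normalize, report]

def run_workflow(sample_id):
    data = sample_id
    for step in PIPELINE:
        data = step(data)
    return data

run_workflow("GSM-001")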
GUI for setting up and running workflows
Re-applying Technology

Technology                  Initial Application    New Applications
Parallel NetCDF             Astrophysics           Climate
Parallel VTK                Astrophysics           Climate
Compressed bitmaps          HENP                   Combustion, Astrophysics
Storage Resource Managers   HENP                   Astrophysics
Feature Selection           Climate                Fusion
Scientific Workflow         Biology                Astrophysics (planned)

SDM technology, developed for one application, can be effectively targeted at many other applications.
Broad Impact of the SDM Center
Astrophysics: high-speed storage technology, parallel NetCDF, parallel VTK, and ASPECT integration software used for Terascale Supernova Initiative (TSI) and FLASH simulations. (Tony Mezzacappa, ORNL; John Blondin, NCSU; Mike Zingale, U of Chicago; Mike Papka, ANL)
Climate: high-speed storage technology, parallel NetCDF, and ICA technology used for climate modeling projects. (Ben Santer, LLNL; John Drake, ORNL; John Michalakes, NCAR)
Combustion: compressed bitmap indexing used for fast generation of flame regions and tracking their progress over time. (Wendy Koegler, Jacqueline Chen, Sandia Lab)
[Images: ASCI FLASH with parallel NetCDF; dimensionality reduction; region growing.]
Broad Impact (cont.)
Biology: the Kepler workflow system and web-wrapping technology used for executing complex, highly repetitive workflow tasks for processing microarray data. (Matt Coleman, LLNL)
High Energy Physics: compressed bitmap indexing and Storage Resource Managers used for locating desired subsets of data (events) and automatically retrieving data from HPSS. (Doug Olson, LBNL; Eric Hjort, LBNL; Jerome Lauret, BNL)
Fusion: a combination of PCA and ICA technology used to identify the key parameters that are relevant to the presence of edge harmonic oscillations in a tokamak. (Keith Burrell, General Atomics)
[Images: building a scientific workflow; dynamic monitoring of HPSS file transfers; identifying key parameters for the DIII-D tokamak.]
Goals for Years 4-5
Fully develop the integrated SDM framework
• Implement the three-layer framework on the SDM center facility
• Provide a way to select only the components needed
• Develop self-guiding web pages on the use of SDM components
• Use existing successful examples as guides
Generalize components for reuse
• Develop general interfaces between components in the layers
• Support loosely-coupled WSDL interfaces
• Support tightly-coupled components for efficient dataflow
Integrate operation of components in the framework
• Hide details from the user: automate parallel access and indexing
• Develop a reusable library of components that can be selected for use in the workflow system