MosaStore - A Versatile Storage System
Lauro Costa, Abdullah Gharaibeh, Samer Al-Kiswany,
Matei Ripeanu, Emalayan Vairavanathan (and many others from UBC, ANL, ORNL)
Networked Systems Laboratory (NetSysLab), University of British Columbia
http://netsyslab.ece.ubc.ca
A golf course …
… a (nudist) beach
(… and 199 days of rain each year)
The Landscape
Storage System Middleware
Platforms: Supercomputers, Desktop Grids, Cloud Computing
Workloads: Workflows, Checkpointing, Data Analysis
Diverse platform capabilities
Diverse workload characteristics
Challenge: Design an efficient storage system middleware
Motivation: Underprovisioned storage systems on many HPC platforms (e.g., BlueGene/P at ANL)
[Figure: BlueGene/P storage path. 160K cores on a Hi-Speed Network (2.5 GBps per node); 2.5K IO nodes (850 MBps per 64 nodes); 10 Gb/s switch complex; GPFS on 24 servers (IO rate: 8 GBps = 51 KBps / core)]
The shared storage is a bottleneck
There are underutilized resources close to the application
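The "51 KBps / core" figure can be checked with a few lines of arithmetic. This is only a sketch: reading "8 GBps" and "160K" as binary units (GiB/s and 160 x 1024) is an assumption that happens to reproduce the slide's number, and the function name is illustrative.

```python
KIB = 2**10
GIB = 2**30


def per_core_share(aggregate_bytes_per_s: float, cores: int) -> float:
    """Bandwidth each core gets if all cores hit the shared store at once."""
    return aggregate_bytes_per_s / cores


# Figures from the slide, interpreted in binary units (assumption).
gpfs_rate = 8 * GIB      # aggregate GPFS I/O rate
cores = 160 * KIB        # 160K cores

share = per_core_share(gpfs_rate, cores)
print(f"{share / KIB:.1f} KiB/s per core")
```

Compared with the 2.5 GBps per node quoted for the interconnect, the per-core share of the shared file system is smaller by several orders of magnitude, which is the imbalance the slide points at.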
Solution: a temporary shared datastore
[Figure: BlueGene/P storage path. 160K cores with a shared data-store (2.5 GBps per node); 2.5K IO nodes (850 MBps per 64 nodes); 10 Gb/s switch complex; GPFS on 24 servers (IO rate: 8 GBps = 51 KBps / core)]
Nodes dedicated to an application
Storage system coupled with the application’s execution
Benefits
[Figure: BlueGene/P storage path with the shared data-store, as on the previous slide]
Storage closer to the application. Ability to specialize
Evaluation: Harnessing ‘Close to Application’ Underutilized Resources
Workflow stages (DOCK6), storage optimizations, and results (8K cores):
• Read input, compute, and write temporary results: cache the input data (1.06x)
• Summarize, sort, and select: cache temporary files (11.76x)
• Archive: asynchronous flush of results to GPFS (1.51x)
Overall: 1.52x
Exploiting the underutilized resources can critically improve the storage system performance
Zhang et al., “Design and Evaluation of a Collective I/O Model for Loosely-coupled Petascale Programming”, MTAGS ’08.
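The three optimizations listed above (cache inputs, keep temporary files local, flush results asynchronously) can be illustrated with a toy in-memory model. This is a sketch under stated assumptions, not the system evaluated in the cited work: the class and method names are hypothetical, and plain dictionaries stand in for both the node-local store and GPFS.

```python
import queue
import threading


class IntermediateStore:
    """Toy stand-in for a datastore placed between an application and a slow backend."""

    def __init__(self, backend: dict):
        self.backend = backend            # stands in for the shared file system
        self.cache = {}                   # stands in for storage close to the application
        self._pending = queue.Queue()     # results waiting to be flushed
        threading.Thread(target=self._drain, daemon=True).start()

    def read(self, name):
        # Input caching: fetch from the backend once, then serve locally.
        if name not in self.cache:
            self.cache[name] = self.backend[name]
        return self.cache[name]

    def write_temp(self, name, data):
        # Temporary files stay in the local store and never reach the backend.
        self.cache[name] = data

    def write_result(self, name, data):
        # Results are readable locally at once; the backend copy is written later.
        self.cache[name] = data
        self._pending.put((name, data))

    def _drain(self):
        # Background thread: copy queued results to the backend one by one.
        while True:
            name, data = self._pending.get()
            self.backend[name] = data
            self._pending.task_done()

    def wait_for_flush(self):
        # Block until every queued result has reached the backend.
        self._pending.join()


gpfs = {"input.mol2": b"ligand"}
store = IntermediateStore(gpfs)
store.read("input.mol2")
store.write_temp("scores.tmp", b"partial")
store.write_result("archive.tar", b"final")
store.wait_for_flush()
print(sorted(gpfs))  # temporary file is absent from the backend
```

The application only waits in `wait_for_flush`, so the backend write overlaps with later computation.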
Evaluation: Specialization
[Figure: average bandwidth (MB/s, 0 to 600) vs. number of clients (16 to 560); series: Lustre average, stdchk average]
MosaStore throughput at larger scale (pool of 35 nodes)
Experiment by: Henry Monti (Virginia Tech) on Cray XT4 cluster at ORNL
Deduplication benefits a checkpointing workload
• 3x higher throughput
• 25-70% less storage space and network effort
• Scales to hundreds of clients
Specialization can critically improve the storage system performance
[S. Al-Kiswany, M. Ripeanu, S. Vazhkudai, A. Gharaibeh, “stdchk: A Checkpoint Storage System for Desktop Grid Computing”, ICDCS ’08]
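Why deduplication saves storage space and network effort for checkpoints can be seen in a minimal content-addressed store. This is a sketch only: fixed-size chunking, SHA-256, and the tiny chunk size are assumptions chosen for illustration and are not necessarily what stdchk uses; all names are hypothetical.

```python
import hashlib


class DedupStore:
    """Stores each distinct chunk once, keyed by the hash of its content."""

    def __init__(self, chunk_size: int = 4):
        self.chunk_size = chunk_size
        self.chunks = {}   # digest -> chunk bytes
        self.files = {}    # file name -> ordered list of digests

    def put(self, name: str, data: bytes) -> int:
        """Store a file and return how many bytes were actually new."""
        new_bytes = 0
        recipe = []
        for off in range(0, len(data), self.chunk_size):
            chunk = data[off:off + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.chunks:
                # Only unseen chunks cost storage (and, in a real system, network).
                self.chunks[digest] = chunk
                new_bytes += len(chunk)
            recipe.append(digest)
        self.files[name] = recipe
        return new_bytes

    def get(self, name: str) -> bytes:
        """Rebuild a file from its chunk list."""
        return b"".join(self.chunks[d] for d in self.files[name])


store = DedupStore(chunk_size=4)
first = store.put("ckpt-1", b"AAAABBBBCCCC")    # all three chunks are new
second = store.put("ckpt-2", b"AAAABBBBDDDD")   # only the last chunk differs
print(first, second)
```

Successive checkpoints of the same application often share most of their content, so the second `put` stores a fraction of the bytes while `get` still returns each checkpoint in full.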
Summary so far
• MosaStore: a versatile storage architecture that:
    Exploits underutilized resources ‘close’ to the application
    Supports specialization and configurability
• The system is configured at deployment time
    Deployment lifetime is coupled with that of the target application
[S. Al-Kiswany, A. Gharaibeh, M. Ripeanu, “The Case for a Versatile Storage System”, HotStorage ’09]
MosaStore - Storage System Prototype
Goals: (1) exploration platform, and (2) support for large-scale computational science research projects.
Versatile Storage
Configurable and extensible storage system that can be specialized for a broad set of apps. [ICDCS ’08, HotStorage ’09]
StoreGPU
How to harness massively multicore processors to support storage system operations? [HPDC ’08, JoCC ’09, IPCCC ’09, HPDC ’10]
Cross-layer Optimizations
Can one enable cross-layer optimizations? [HPDC HotTopics ’08, CCGrid ’12, WSLF ’11]
CMFS API
Automating config. choice
How do I choose a good configuration for my application? [ERSS ’11, GRID ’10]
• Application → Storage System: applications can present hints on the desired use of the data, e.g., desired replication levels, caching, data importance, etc.
• Storage System → Application: storage can expose storage-level attributes, e.g., file location characteristics, file health status
Today: applications and storage systems treat data items uniformly
Opportunity: additional information can enable differentiated treatment of data items
POSIX API, Custom Metadata
Our use-case: a workflow-aware file system
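The two information flows can be sketched with per-file key/value metadata. This is a toy in-memory model under stated assumptions: the class, the `replication` and `location` keys, and the placement rule are all hypothetical and do not reflect MosaStore's actual interface.

```python
class TaggedStore:
    """Toy store where per-file metadata carries hints down and attributes up."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.hints = {}      # file name -> {key: value}, set by the application
        self.replicas = {}   # file name -> nodes holding a copy, set by the store

    def set_hint(self, name, key, value):
        # Application -> storage: describe the desired treatment of one file.
        self.hints.setdefault(name, {})[key] = value

    def write(self, name, writer_node):
        # The store reads the hint and places replicas accordingly.
        wanted = int(self.hints.get(name, {}).get("replication", 1))
        others = [n for n in self.nodes if n != writer_node]
        self.replicas[name] = [writer_node] + others[:max(0, wanted - 1)]

    def get_attr(self, name, key):
        # Storage -> application: expose a storage-level attribute.
        if key == "location":
            return list(self.replicas[name])
        return self.hints.get(name, {}).get(key)


store = TaggedStore(["n1", "n2", "n3"])
store.set_hint("db.dat", "replication", 3)   # file read by many tasks
store.write("db.dat", writer_node="n2")
store.write("tmp.out", writer_node="n1")     # no hint: single local copy
print(store.get_attr("db.dat", "location"))
print(store.get_attr("tmp.out", "location"))
```

The two files receive different treatment even though both went through the same `write` call, which is the differentiated handling the slide argues for.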
Workflow Applications
Montage workflow
File based communication
Irregular and application-dependent data access
100,000s of processes, running for weeks
Generates large I/O volumes (100 TB cumulative)
[Chart: Execution 29%; Data management 30%; Scheduling and idle 41%]
Source: [Zhao et al., 2012]; 512 BG/P cores, GPFS intermediate file system
I/O patterns in Workflow Applications
• Pipeline
• Broadcast
• Reduce
• Scatter
• Gather
Wozniak et al., “Case studies in storage access by loosely coupled petascale applications”, PDSW ’09
Application: Montage
Stage 5: Reduce pattern
Stages 6, 7, 8: Pipeline pattern
Stage 9: Pipeline pattern
Stage 10: Reduce pattern
I/O Patterns and Storage Optimizations
Pattern: Optimizations
Pipeline: Locality-aware scheduling
Broadcast: Replication
Reduce: Data placement, locality-aware scheduling
Scatter: Block-level placement, locality-aware scheduling
Gather: Block-level co-placement, locality-aware scheduling
Data-item specific patterns and optimizations! Need for information flows in both directions
Idea: Cross-layer communication to support this
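The pattern-to-optimization mapping above, together with one of the optimizations, can be written down as a short sketch. The lookup table restates the slide; the scheduler function is an assumption for illustration (its name, arguments, and tie-breaking rule are hypothetical). It consumes file locations, i.e. the storage-to-application direction of the information flow.

```python
# Mapping taken from the slide: access pattern -> storage optimizations.
OPTIMIZATIONS = {
    "pipeline": ["locality-aware scheduling"],
    "broadcast": ["replication"],
    "reduce": ["data placement", "locality-aware scheduling"],
    "scatter": ["block-level placement", "locality-aware scheduling"],
    "gather": ["block-level co-placement", "locality-aware scheduling"],
}


def pick_node(input_files, locations, idle_nodes):
    """Locality-aware scheduling: prefer the idle node that holds most inputs.

    `locations` maps a file name to the nodes storing it. Ties go to the
    node listed first in `idle_nodes`.
    """
    def local_inputs(node):
        return sum(node in locations.get(f, ()) for f in input_files)

    return max(idle_nodes, key=local_inputs)


locations = {"a.fits": ["n1"], "b.fits": ["n2", "n3"], "c.fits": ["n2"]}
chosen = pick_node(["b.fits", "c.fits"], locations, ["n1", "n2", "n3"])
print(chosen, OPTIMIZATIONS["pipeline"])
```

Here the task reading `b.fits` and `c.fits` lands on `n2`, the only node that holds both inputs, so neither file crosses the network.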
A workflow-aware file system
Thesis: cross-layer communication supported by file-level metadata is the key mechanism to enable a workflow-aware file system
Progress so far: promising evaluation of potential gains (CCGrid ’12)
Next step: build the system and evaluate it with applications (SC ’12?)
MosaStore - Storage System Prototype
Goals: (1) exploration platform, and (2) support for large-scale computational science research projects.
Versatile Storage
Configurable and extensible storage system that can be specialized for a broad set of apps. [ICDCS ’08, HotStorage ’09]
StoreGPU
Harnessing massively multicore processors to support storage system operations. [HPDC ’08, JoCC ’09, IPCCC ’09, HPDC ’10]
Cross-layer Optimizations
Enabling bidirectional cross-layer optimizations. [HPDC HotTopics ’08, CCGrid ’12, WSLF ’11]
CMFS API
Automating config. choice
How do I choose a good configuration for my application? [ERSS ’11, GRID ’10]