Thomas Steinke, Zuse Institute Berlin (ZIB) <www.zib.de>
Efficiency Considerations of Cauchy Reed-Solomon Implementations on Accelerator and Multi-Core Platforms
SAAHPC
June 15, 2010
Knoxville, TN
Kathrin Peter
Sebastian Borchert
Outline
Motivation
The Reed-Solomon algorithm
Platforms and implementations
Reed-Solomon throughput and efficiency
Conclusions
Motivation I: Fault-Tolerant Storage Systems
Mean Time To Data Loss (MTTDL) for 100k disk deployments:
RAID-5 is a non-starter with 100k disks: MTTDL ~ 9 days!
RAID-D2 (8+2P stripes): MTTDL ~ 100 years
RAID-D3 (8+3P stripes): MTTDL ~ 130 million years!
source: IBM, Almaden Research Center – Storage Systems, SC’09
Motivation II: Application Level Fault-Recovery
Mean Time To Interrupt (MTTI) for Petascale+ class compute configurations: O(1 day)
application level fault-recovery, application level checkpoint-restart
example: Charm++ provides in-memory distributed checkpoint scheme
- memory footprint doubled
F Cappello, A Geist, B Gropp, S Kale, B Kramer, M Snir: Toward Exascale Resilience, 2009
Scope & Limitations
objective: investigating alternative processing platforms for RS encoding (decoding)
focus on one particular step of the overall processing pipeline
aspects ignored include …
- application (producer): data injection bandwidth
- disk I/O bandwidth, disk grouping
project is not aiming to design a storage system
no disk and data path configuration options considered here
The Reed-Solomon algorithm
Non-binary, cyclic block code (1960 I. Reed, G. Solomon)
Applications:
- Reliable data transmission
- Reliable data storage: en-/decoding in the disk (RAID) controller
Requirement: Fast encoding
[Figure: data distributed over several disks, one of them marked as crashed; (re)calculation when read / write]
Advantages of the Reed-Solomon Coding
Flexibility in the coding schema
k + m RS code means:
- k data blocks
- m check blocks
- up to m errors can be tolerated
Encoding Principle for (k+m) RS Schema
Encoding is a matrix-vector multiplication:
Galois field multiplication is expensive; this motivates Cauchy Reed-Solomon
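The matrix-vector view of encoding can be illustrated with a minimal sketch. It is not the implementation measured in this study: the field GF(2^8), the reduction polynomial 0x11D, the choice x_i = i, y_j = m + j for the Cauchy matrix, and all function names are assumptions made for illustration only. Each "block" is reduced to a single byte to keep the example short.

```python
# Sketch: check symbols as a matrix-vector product over GF(2^8).
# Assumed (not from the slides): polynomial 0x11D, x_i = i, y_j = m + j.

def gf_mul(a: int, b: int, poly: int = 0x11D) -> int:
    """Multiply two GF(2^8) elements by shift-and-XOR with reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # addition in GF(2^8) is XOR
        a <<= 1
        if a & 0x100:
            a ^= poly            # reduce back into 8 bits
        b >>= 1
    return result

def gf_inv(a: int) -> int:
    """Multiplicative inverse by brute force; fine for 256 elements."""
    for c in range(1, 256):
        if gf_mul(a, c) == 1:
            return c
    raise ZeroDivisionError("0 has no inverse in GF(2^8)")

def cauchy_matrix(k: int, m: int):
    """m x k matrix with entries 1 / (x_i + y_j), where '+' is XOR."""
    return [[gf_inv(i ^ (m + j)) for j in range(k)] for i in range(m)]

def encode(data, m: int):
    """Return m check symbols for the k data symbols in `data`."""
    checks = []
    for row in cauchy_matrix(len(data), m):
        acc = 0
        for coeff, symbol in zip(row, data):
            acc ^= gf_mul(coeff, symbol)
        checks.append(acc)
    return checks

# 5+3 schema: five data symbols, three check symbols
checks = encode([1, 2, 3, 4, 5], m=3)
```

Every output symbol costs k field multiplications, each of which is a loop of shifts and conditional XORs here; that per-symbol cost is what the slide refers to as expensive.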
The Cauchy Variant of Reed-Solomon
Cauchy Reed-Solomon: work of J. S. Plank et al.
arithmetic over GF(2): only XOR operations
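Why XOR suffices can be sketched as follows. Multiplication by a fixed element c of GF(2^w) is linear over GF(2), so it can be written as a w x w bit matrix whose column j is the product of c with the j-th basis element. The product c * d is then the XOR of the columns selected by the set bits of d. The field size w = 8, the polynomial 0x11D and the function names are assumptions for illustration; production codes apply the same idea to whole packets rather than single bytes.

```python
# Sketch: multiplication by a constant expressed with XORs only.
# Assumed (not from the slides): w = 8, polynomial 0x11D.

def gf_mul(a: int, b: int, poly: int = 0x11D) -> int:
    """Reference GF(2^8) multiplication (shift-and-XOR with reduction)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result

def bit_matrix_columns(c: int, w: int = 8):
    """Column j is c times the j-th basis element, stored as an integer."""
    return [gf_mul(c, 1 << j) for j in range(w)]

def mul_by_xor(columns, d: int) -> int:
    """Compute c * d by XOR-ing the columns selected by the bits of d."""
    acc = 0
    for j, column in enumerate(columns):
        if (d >> j) & 1:
            acc ^= column
    return acc

# The columns depend only on the matrix coefficient, so they are
# precomputed once and reused for every data symbol.
columns = bit_matrix_columns(0x57)
product = mul_by_xor(columns, 0x83)
```

Because the columns are constants of the coding matrix, the inner loop contains no field multiplication at all, which maps well onto wide SIMD registers or an FPGA XOR tree.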
Platforms Used in this Study
GPGPU: NVIDIA Tesla C1070/SGI XE500, Tesla C870/Sun Ultra27
CUDA 2.3, CUDA 3.0
FPGA: SGI RC 100/SGI Altix 450
Mitrionics SDK 2.0, RASClib 2.2, Xilinx ISE 9.2
Cell BE: IBM PowerXCell8i/IBM QS22
IBM CBE SDK 3.1
SIMD Processor: ClearSpeed CSX e620/Sun X4600M2
ClearSpeed’s Cn Compiler, CSAPI v 3.11
Memory Hierarchy
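[Figure: memory hierarchy and data path. Host RAM, global device memory, local memory (in: GPU, ClearSpeed, CBE); data and check blocks move between data source and data processing]
Link bandwidths: QPI: 32 GB/s, XDR: 25 GB/s, PCIe x16: 8 GB/s, NUMAlink4: 6 GB/s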
General Implementation Strategy
5+3 Reed-Solomon schema, Cauchy RS
input data volumes: 150 … 2048 MBytes, to saturate the complete data path
co-processor model (except CBE and x86): requires overlapping of data processing & communication
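The overlap of data processing and communication in the co-processor model can be sketched with a bounded queue of depth two (double buffering). This is only a structural illustration: the thread stands in for a DMA or PCIe transfer, `encode_chunk` is a hypothetical placeholder for the RS kernel, and Python threads will not show the real speedup that asynchronous device transfers give.

```python
import queue
import threading

def encode_chunk(chunk: bytes) -> int:
    """Hypothetical stand-in for the RS kernel: XOR of all bytes."""
    acc = 0
    for byte in chunk:
        acc ^= byte
    return acc

def pipeline(chunks, depth: int = 2):
    """Overlap 'transfer' (producer thread) with 'compute' (caller)."""
    buffers = queue.Queue(maxsize=depth)   # depth 2 = double buffering

    def transfer():
        for chunk in chunks:
            buffers.put(bytes(chunk))      # blocks while all buffers are full
        buffers.put(None)                  # sentinel: no more data

    worker = threading.Thread(target=transfer)
    worker.start()
    results = []
    while True:
        chunk = buffers.get()              # next buffer that finished 'transfer'
        if chunk is None:
            break
        results.append(encode_chunk(chunk))
    worker.join()
    return results

results = pipeline([b"\x01\x02", b"\x03\x03", b"\xff"])
```

While one buffer is being processed, the producer can already fill the other; with only one buffer, transfer and compute would alternate and the link would sit idle during every kernel call.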
Platform Specific Optimizations
x86:
- SSE
- parallelization: OpenMP
- NUMA: memory affinity (numactl)
FPGA:
- XOR tree w/ constants
- 128 bit wide I/O
- double buffering via RASClib
- 1/5 resource utilization
GPGPU:
- transfer models:
  - synchronous xfer (block)
  - asynchronous stream
- kernel is called as a 2D grid with 1D thread pool
Cell BE:
- 8 SPUs, 512 byte blocks
- double buffering SPU
- NUMA 8-16 SPUs
[Figure: FPGA resource utilization (flip-flops, slices, BRAM)]
Metrics Used for Performance Evaluation
raw throughput performance: Reed-Solomon (RS) rate
RS rate := size of input data set / total time
host memory-to-host memory performance (includes data transfers)
normalization:
relative RS rate := RS rate / link bandwidth
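The two metrics can be written down as a short sketch. The function names are hypothetical, and 1 MByte is assumed to be 10^6 bytes, which the slides do not state. The example inputs are the best RS rates and the link bandwidth limits quoted on the x86 and PowerXCell8i scaling slides.

```python
def rs_rate_mbyte_per_s(input_bytes: int, total_seconds: float) -> float:
    """RS rate := size of input data set / total time (host to host)."""
    return input_bytes / total_seconds / 1e6   # assumes 1 MByte = 10^6 bytes

def relative_rs_rate(rs_rate: float, link_bandwidth: float) -> float:
    """relative RS rate := RS rate / link bandwidth (same units for both)."""
    return rs_rate / link_bandwidth

# x86 Nehalem: 14503 MB/s against the 32000 MB/s QPI limit
x86_efficiency = relative_rs_rate(14503, 32000)

# PowerXCell8i: 14476 MB/s against the 25600 MB/s XDR limit
cell_efficiency = relative_rs_rate(14476, 25600)

print(f"x86:  {x86_efficiency:.1%}")
print(f"Cell: {cell_efficiency:.1%}")
```

The normalization is what makes platforms with very different links comparable: two nearly identical absolute rates end up in different efficiency categories because the available link bandwidth differs.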
RS Rates (Comparing Apples with Oranges…)
[Figure: Best Reed-Solomon Rates and Kernel Rates, 5+3 RS schema. Horizontal bar chart of overall RS rate and kernel rate, axis: RS Rate [MByte/s], 0 to 45000. Platforms: x86 (2009) X5570 (8x); CBE (2008) PowerXCell8i (8x); GPU (2007) G80 (64x); GPU (2009) T10 (32x); FPGA (2006) XC4VLX200; ClearSpeed (2007) CSX600 (96x)]
Data labels as extracted (in chart order; the assignment to individual bars is not fully recoverable): 14503, 14476, 1630, 3255, 1442, 384, 23774, 41505, 605
Reference data:
Curry et al. (2008): 13+3 RS schema on GTX 260, RS rate: 1.4 GB/s
Brinkmann et al. (2009): X-8 RS schema on 8800 GTS, RS rate: 1.0 GB/s
Overall RS & Kernel Efficiencies
[Figure: Reed-Solomon Efficiencies. Horizontal bar chart of overall RS efficiency and kernel efficiency, axis: Efficiency [%], 0 to 100. Platforms: x86 (2009) X5570 (8x); CBE (2008) PowerXCell8i (8x); GPU (2007) G80 (64x); GPU (2009) T10 (32x); FPGA (2006) XC4VLX200; ClearSpeed (2007) CSX600 (96x)]
Reed-Solomon on x86: Performance & Scaling
[Figure: Reed-Solomon Rate: Intel Nehalem. RS rate [MB/s] vs. # threads]
# Threads    RS Rate [MB/s]
1            4090
2            8122
4            12542
8            14503
QPI bandwidth limit: 32000 MB/s
optimization level: SSE, OpenMP, NUMA
RS on CBE PowerXCell8i: Performance & Scaling
[Figure: Reed-Solomon Rate: PowerXCell8i. RS rate [MB/s] vs. # threads]
# Threads    RS Rate [MB/s]
1            5551
2            10375
4            14023
8            14476
XDR bandwidth limit: 25600 MB/s
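The scaling behaviour on both multi-core platforms can be quantified from the thread counts and RS rates given on the two scaling slides. The sketch below computes speedup and parallel efficiency relative to the single-thread run; the function name and the dictionary layout are hypothetical.

```python
def scaling(rates):
    """Map thread count -> (speedup, parallel efficiency) vs. 1 thread."""
    base = rates[1]
    return {n: (r / base, r / (base * n)) for n, r in sorted(rates.items())}

# RS rates in MB/s per thread count, as given on the slides
x86_nehalem = {1: 4090, 2: 8122, 4: 12542, 8: 14503}
powerxcell8i = {1: 5551, 2: 10375, 4: 14023, 8: 14476}

for name, rates in (("x86", x86_nehalem), ("Cell", powerxcell8i)):
    for threads, (speedup, efficiency) in scaling(rates).items():
        print(f"{name:4s} {threads} threads: "
              f"speedup {speedup:.2f}, efficiency {efficiency:.0%}")
```

Both platforms scale almost linearly from one to two threads and flatten between four and eight, well below the quoted QPI and XDR bandwidth limits.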
Overall Results
Performance & Efficiency Summary
Ranking according to sustained Reed-Solomon rate:
1. Cell BE, x86 Nehalem
2. GPGPU, FPGA
3. ClearSpeed
Categories according to Reed-Solomon efficiency:
Category 50+: Cell BE
Category 40: x86 Nehalem, FPGA, GPGPU-C1060
Category 20: GPGPU-C870, ClearSpeed
Limitations of the Study
we measured the performance of the encoding step only, for a fixed 5+3 RS schema
performance of the decoding step can be considered similar
the total data processing workflow includes additional steps:
- application (producer): data injection bandwidth
- permanent storage: disk I/O bandwidth
Conclusion
Reed-Solomon encoding is feasible using non-ASIC technology
algorithmic improvements: Cauchy Reed-Solomon
technology improvements: energy efficient accelerators
Reed-Solomon application scenarios:
1. non-critical requirements (power, cooling): x86 platform is a convenient solution
2. data intensive processing environments: FPGA integrated into data path
Acknowledgement
Thanks to …
Michael Peick, Johannes Bock (initial GPU & FPGA version)
Matthias Fouquet-Lapar, SGI (Tesla C1070 on SGI’s AEP system)
Willi Homberg, FZ Jülich (QS22 system JUICEnext)