Efficiency Considerations of Cauchy Reed-Solomon …saahpc.ncsa.illinois.edu/10/presentations/day3/session1/... · 2010. 6. 15. · Reed-Solomon on x86: Performance & Scaling Reed-Solomon

Thomas Steinke

Zuse Institute Berlin (ZIB) <www.zib.de>

[email protected]

Efficiency Considerations of Cauchy Reed-Solomon Implementations on Accelerator and Multi-Core Platforms

SAAHPC

June 15, 2010

Knoxville, TN

Kathrin Peter

Sebastian Borchert

Outline

Motivation

The Reed-Solomon algorithm

Platforms and implementations

Reed-Solomon throughput and efficiency

Conclusions

Motivation I: Fault-Tolerant Storage Systems

Mean Time To Data Loss (MTTDL) for 100k disk deployments:

RAID-5 is a non-starter with 100k disks: MTTDL ~ 9 days!

RAID-D2 (8+2P stripes): MTTDL ~ 100 years

RAID-D3 (8+3P stripes): MTTDL ~ 130 million years!

source: IBM, Almaden Research Center – Storage Systems, SC’09

Motivation II: Application Level Fault-Recovery

Mean Time To Interrupt (MTTI) for Petascale+ class compute configurations: O(1 day)

application level fault-recovery, application level checkpoint-restart

example: Charm++ provides in-memory distributed checkpoint scheme

- memory footprint doubled

F Cappello, A Geist, B Gropp, S Kale, B Kramer, M Snir: Toward Exascale Resilience, 2009

Scope & Limitations

objective: investigating alternative processing platforms for RS encoding (decoding)

focus on one particular step of the overall processing pipeline

aspects ignored include …

- application (producer): data injection bandwidth

- disk I/O bandwidth, disk grouping

project is not aiming to design a storage system

no disk and data path configuration options considered here

The Reed-Solomon algorithm

Non-binary, cyclic block code (1960 I. Reed, G. Solomon)

Applications:

- Reliable data transmission

- Reliable data storage: en-/decoding in the disk (RAID) controller

Requirement: Fast encoding

[Figure: data distributed over several disks; (re)calculation on read/write; one disk crashes]

Advantages of the Reed-Solomon Coding

Flexibility in the coding schema

k + m RS code means:

- k data blocks

- m check blocks

- up to m lost blocks (erasures) can be tolerated

Encoding Principle for (k+m) RS Schema

Encoding is a matrix-vector multiplication:

Galois field multiplication is expensive

Cauchy Reed-Solomon
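The matrix-vector encoding named on this slide can be sketched in Python as follows. This is only an illustration, not the implementation measured in the talk: the field GF(2^8), the polynomial 0x11D, the choice of Cauchy elements (x_i = i, y_j = m + j), and all function names are assumptions made for the example.

```python
# Sketch only: field size, polynomial and Cauchy element choice are
# assumptions for illustration, not taken from the slides.
PRIM_POLY = 0x11D  # x^8 + x^4 + x^3 + x^2 + 1 (a common GF(2^8) polynomial)


def gf_mul(a, b):
    """Multiply two GF(2^8) elements by shift-and-add with reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # addition in GF(2^8) is XOR
        a <<= 1
        if a & 0x100:
            a ^= PRIM_POLY       # reduce back into 8 bits
        b >>= 1
    return result


def gf_inv(a):
    """Multiplicative inverse by exhaustive search (256 elements only)."""
    if a == 0:
        raise ZeroDivisionError("0 has no inverse in GF(2^8)")
    for c in range(1, 256):
        if gf_mul(a, c) == 1:
            return c


def cauchy_matrix(k, m):
    """m x k Cauchy matrix with entries 1 / (x_i + y_j), x_i = i, y_j = m + j."""
    return [[gf_inv(i ^ (m + j)) for j in range(k)] for i in range(m)]


def encode(matrix, data):
    """Check symbols = matrix * data over GF(2^8)."""
    checks = []
    for row in matrix:
        acc = 0
        for coeff, symbol in zip(row, data):
            acc ^= gf_mul(coeff, symbol)
        checks.append(acc)
    return checks


if __name__ == "__main__":
    coding = cauchy_matrix(k=5, m=3)           # 5+3 schema
    print(encode(coding, [10, 20, 30, 40, 50]))
```

Every check symbol costs k Galois field multiplications per data symbol position, which is the expense the slide refers to.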

The Cauchy Variant of Reed-Solomon

Cauchy Reed-Solomon: work of J. S. Plank et al.

GF(2): only XOR operations
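Why the Cauchy variant needs only XOR can be illustrated with a small sketch: multiplying by a fixed field element is a linear map over GF(2), so it can be written as a w x w bit matrix applied to the bits of the data symbol. The field GF(2^8), the polynomial 0x11D, and the helper names below are assumptions for the example, not details from the slides.

```python
# Sketch only: GF(2^8) with polynomial 0x11D is an assumed field choice.
PRIM_POLY = 0x11D
W = 8  # bits per field element


def gf_mul(a, b):
    """Reference GF(2^8) multiplication (shift-and-add with reduction)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM_POLY
        b >>= 1
    return result


def bit_matrix(c):
    """Bit matrix of 'multiply by c', stored as W row masks.

    Column j of the matrix holds the bits of c * x^j; bit j of rows[i]
    is set when input bit j contributes to output bit i.
    """
    cols = [gf_mul(c, 1 << j) for j in range(W)]
    rows = []
    for i in range(W):
        mask = 0
        for j in range(W):
            if (cols[j] >> i) & 1:
                mask |= 1 << j
        rows.append(mask)
    return rows


def mul_by_xor(rows, d):
    """Multiply d by the constant behind `rows` using bit selection and XOR."""
    out = 0
    for i, mask in enumerate(rows):
        parity = bin(mask & d).count("1") & 1   # XOR of the selected input bits
        out |= parity << i
    return out


if __name__ == "__main__":
    rows = bit_matrix(0x53)
    print(mul_by_xor(rows, 0xCA) == gf_mul(0x53, 0xCA))
```

In Cauchy Reed-Solomon implementations each "bit" typically stands for a whole packet of data, so every XOR in the bit matrix acts on wide machine words instead of single bits.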

Platforms Used in this Study

GPGPU: NVIDIA Tesla C1070 / SGI XE500, Tesla C870 / Sun Ultra27

- CUDA 2.3, CUDA 3.0

FPGA: SGI RC 100 / SGI Altix 450

- Mitrionics SDK 2.0, RASClib 2.2, Xilinx ISE 9.2

Cell BE: IBM PowerXCell8i / IBM QS22

- IBM CBE SDK 3.1

SIMD Processor: ClearSpeed CSX e620 / Sun X4600M2

- ClearSpeed’s Cn Compiler, CSAPI v 3.11

Memory Hierarchy

[Figure: data path from data source to data processing. Levels: Host RAM, Global Device Memory, Local Memory (in: GPU, ClearSpeed, CBE). Labels: Data, Check.]

Link bandwidths:

- QPI: 32 GB/s
- XDR: 25 GB/s
- PCIe x16: 8 GB/s
- NUMAlink4: 6 GB/s

General Implementation Strategy

5+3 Reed-Solomon schema, Cauchy RS

input data volumes: 150 … 2048 MBytes, to saturate the complete data path

co-processor model (except CBE and x86): requires overlapping of data processing & communication
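The overlap of communication and processing can be sketched as a two-stage pipeline. This is a structural illustration only: the "transfer" is a plain copy, `xor_parity` is a hypothetical stand-in for the encoding kernel, and Python threads do not reproduce the performance behaviour of real DMA or CUDA streams.

```python
import queue
import threading


def xor_parity(chunk):
    """Stand-in kernel: XOR of all bytes in a chunk (not the real RS kernel)."""
    acc = 0
    for byte in chunk:
        acc ^= byte
    return acc


def process_overlapped(data, chunk_size):
    """Process `data` chunk by chunk while the next chunk is being transferred.

    A queue of depth 1 plus the chunk currently being processed models two
    buffers: one is filled by the transfer thread while the other is consumed.
    """
    buffers = queue.Queue(maxsize=1)

    def transfer():
        for offset in range(0, len(data), chunk_size):
            # Stand-in for a host-to-device copy of one chunk.
            buffers.put(bytes(data[offset:offset + chunk_size]))
        buffers.put(None)  # end-of-data marker

    worker = threading.Thread(target=transfer)
    worker.start()

    results = []
    while True:
        chunk = buffers.get()
        if chunk is None:
            break
        results.append(xor_parity(chunk))
    worker.join()
    return results


if __name__ == "__main__":
    print(process_overlapped(bytes([1, 2, 3, 4, 5, 6, 7]), chunk_size=3))
```

The bounded queue is the design point: the producer can run at most one chunk ahead, which mirrors having exactly two buffers in flight.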

Platform Specific Optimizations

x86

- SSE
- parallelization: OpenMP
- NUMA: mem affinity (numactl)

FPGA

- XOR tree w/ constants
- 128 bit wide I/O
- double buffering via RASClib
- 1/5 resource utilization (chart: Flip Flops, Slices, BRAM)

GPGPU

- transfer models: synchronous xfer (block), asynchronous stream
- kernel is called as a 2D grid with 1D thread pool

Cell BE

- 8 SPUs, 512 byte blocks
- double buffering SPU
- NUMA 8-16 SPUs

Metrics Used for Performance Evaluation

raw throughput performance: Reed-Solomon rate

RS rate := size of input data set / total time

host memory-to-host memory performance (includes data transfers)

normalization:

relative RS rate := RS rate / link bandwidth
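The two metrics can be written down directly. The sketch below uses MB and MB/s throughout; the function names are illustrative, and the example figures are the best rates and link limits quoted on the x86 and Cell BE scaling slides of this talk.

```python
def rs_rate(input_mbytes, total_seconds):
    """RS rate := size of input data set / total time (host memory to host memory)."""
    return input_mbytes / total_seconds


def relative_rs_rate(rate_mb_s, link_bandwidth_mb_s):
    """relative RS rate := RS rate / link bandwidth (dimensionless, 0..1)."""
    return rate_mb_s / link_bandwidth_mb_s


if __name__ == "__main__":
    # Best rates and link limits as quoted on the scaling slides.
    print(f"X5570 vs QPI:        {relative_rs_rate(14503, 32000):.1%}")
    print(f"PowerXCell8i vs XDR: {relative_rs_rate(14476, 25600):.1%}")
```

Normalizing by the link bandwidth is what makes platforms with very different data paths comparable at all.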

RS Rates (Comparing Apples with Oranges…)

Best Reed-Solomon Rates and Kernel Rates (5+3 RS schema)

Platform            Device              Overall RS rate [MByte/s]
x86 (2009)          X5570 (8x)          14503
CBE (2008)          PowerXCell8i (8x)   14476
GPU (2007)          G80 (64x)           1630
GPU (2009)          T10 (32x)           3255
FPGA (2006)         XC4VLX200           1442
ClearSpeed (2007)   CSX600 (96x)        384

Kernel rate bar labels [MByte/s], in chart order: 23774, 41505, 605

[Bar chart. Axis: RS Rate [MByte/s], 0 to 45000. Series: overall RS rate, kernel rate.]

Reference data:

Curry et al. (2008): 13+3 RS schema on GTX 260, RS rate: 1.4 GB/s

Brinkmann et al. (2009): X-8 RS schema on 8800 GTS, RS rate: 1.0 GB/s

Overall RS & Kernel Efficiencies

Reed-Solomon Efficiencies

[Bar chart. Axis: Efficiency [%], 0 to 100. Series: overall RS efficiency, kernel efficiency. Platforms: x86 (2009): X5570 (8x); CBE (2008): PowerXCell8i (8x); GPU (2007): G80 (64x); GPU (2009): T10 (32x); FPGA (2006): XC4VLX200; ClearSpeed (2007): CSX600 (96x).]

Reed-Solomon on x86: Performance & Scaling

Reed-Solomon Rate: Intel Nehalem

# Threads   RS Rate [MB/s]
1           4090
2           8122
4           12542
8           14503

QPI bandwidth limit: 32000 MB/s

optimization level: SSE, OpenMP, NUMA

RS on CBE PowerXCell8i: Performance & Scaling

Reed-Solomon Rate: PowerXCell8i

# Threads   RS Rate [MB/s]
1           5551
2           10375
4           14023
8           14476

XDR bandwidth limit: 25600 MB/s
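The scaling behaviour on the two multi-core platforms can be summarized as speedup and parallel efficiency relative to one thread. The sketch below uses the per-thread rates and link limits from the Nehalem and PowerXCell8i scaling slides; the function and variable names are illustrative.

```python
# RS rates in MB/s per thread count, copied from the two scaling slides.
NEHALEM = {1: 4090, 2: 8122, 4: 12542, 8: 14503}
POWERXCELL = {1: 5551, 2: 10375, 4: 14023, 8: 14476}


def scaling(rates):
    """Return {threads: (speedup, parallel_efficiency)} relative to 1 thread."""
    base = rates[1]
    return {n: (r / base, r / (base * n)) for n, r in sorted(rates.items())}


def link_utilization(rates, link_limit_mb_s):
    """Best measured rate as a fraction of the link bandwidth limit."""
    return max(rates.values()) / link_limit_mb_s


if __name__ == "__main__":
    for name, rates, limit in (
        ("Nehalem", NEHALEM, 32000),           # QPI limit
        ("PowerXCell8i", POWERXCELL, 25600),   # XDR limit
    ):
        print(name)
        for n, (speedup, eff) in scaling(rates).items():
            print(f"  {n} threads: speedup {speedup:.2f}, efficiency {eff:.0%}")
        print(f"  link utilization: {link_utilization(rates, limit):.0%}")
```

On both platforms the rate flattens between 4 and 8 threads while remaining well below the quoted link limit.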

Overall Results

Performance & Efficiency Summary

Ranking according to sustained Reed-Solomon rate:

1. Cell BE, x86 Nehalem
2. GPGPU, FPGA
3. ClearSpeed

Categories according to Reed-Solomon efficiency:

Category 50+: Cell BE
Category 40: x86 Nehalem, FPGA, GPGPU-C1060
Category 20: GPGPU-C870, ClearSpeed

Limitations of the Study

we measured only the performance of the encoding step, for a fixed 5+3 RS schema

performance of the decoding step can be considered similar

the total data processing workflow includes additional steps:

- application (producer): data injection bandwidth

- permanent storage: disk I/O bandwidth

Conclusion

Reed-Solomon encoding is feasible using non-ASIC technology

algorithmic improvements: Cauchy Reed-Solomon

technology improvements: energy efficient accelerators

Reed-Solomon application scenarios:

1. non-critical requirements (power, cooling): x86 platform is a convenient solution

2. data intensive processing environments: FPGA integrated into data path

Acknowledgement

Thanks to …

Michael Peick, Johannes Bock (initial GPU & FPGA version)

Matthias Fouquet-Lapar, SGI (Tesla C1070 on SGI’s AEP sys)

Willi Homberg, FZ Jülich (QS22 system JUICEnext)
