pFPC: A Parallel Compressor for Floating-Point Data

Martin Burtscher¹ and Paruj Ratanaworabhan²

¹The University of Texas at Austin   ²Cornell University

March 2009
Introduction

Scientific programs
- Often produce and transfer lots of floating-point data (e.g., program output, checkpoints, messages)
- Large amounts of data are expensive and slow to transfer and store

FPC algorithm for IEEE 754 double-precision data
- Compresses linear streams of FP values fast and well
- Single-pass operation and lossless compression
Introduction (cont.)

Large-scale high-performance computers
- Consist of many networked compute nodes
- Compute nodes have multiple CPUs but only one link

Want to speed up data transfer
- Need real-time compression to match the link throughput

pFPC: a parallel version of the FPC algorithm
- Exceeds 10 Gb/s on four Xeon processors
Sequential FPC Algorithm [DCC’07]

- Make two predictions
- Select the closer value
- XOR with the true value
- Count the leading zero bytes
- Encode the value
- Update the predictors
[Figure: FPC compressor diagram — each 64-bit double from the uncompressed 1D stream is predicted by an FCM and a DFCM predictor; the closer prediction is selected and XORed with the true value, the leading zero bytes of the result are counted, and the encoder emits a 1-bit predictor-selector code plus a 3-bit leading-zero-byte count (two such codes packed per byte) followed by the 0 to 8 remainder bytes into the compressed stream.]
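The per-value compression step listed above can be sketched in C as follows. This is a minimal illustration, not the actual FPC source: the real algorithm uses large hash-table-based FCM and DFCM predictors, whereas `fcm_pred` and `dfcm_delta` here are degenerate single-entry stand-ins.

```c
#include <assert.h>
#include <stdint.h>

/* Degenerate single-entry predictors (the real FPC uses hash tables). */
static uint64_t fcm_pred;    /* FCM: predicted next value           */
static uint64_t dfcm_delta;  /* DFCM: predicted delta to next value */
static uint64_t last_val;

/* Number of leading zero bytes in x (0 to 8). */
int leading_zero_bytes(uint64_t x) {
    int n = 0;
    while (n < 8 && (x >> 56) == 0) { x <<= 8; n++; }
    return n;
}

/* Compress one 64-bit value: pick the closer prediction (more leading
   zero bytes after XOR), emit a selector bit and a zero-byte count, and
   the residual whose leading zero bytes need not be stored. */
void fpc_step(uint64_t val, int *selector, int *count, uint64_t *residual) {
    uint64_t r_fcm  = val ^ fcm_pred;
    uint64_t r_dfcm = val ^ (last_val + dfcm_delta);
    int z_fcm  = leading_zero_bytes(r_fcm);
    int z_dfcm = leading_zero_bytes(r_dfcm);
    if (z_fcm >= z_dfcm) { *selector = 0; *count = z_fcm;  *residual = r_fcm;  }
    else                 { *selector = 1; *count = z_dfcm; *residual = r_dfcm; }
    /* Update the predictors with the true value. */
    fcm_pred   = val;
    dfcm_delta = val - last_val;
    last_val   = val;
}
```

Decompression mirrors this step: it runs the same predictors, reads the selector and count, and XORs the residual with the selected prediction to recover the exact bits, which is what makes the scheme lossless.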
pFPC: Parallel FPC Algorithm

pFPC operation
- Divide the data stream into chunks
- Logically assign chunks round-robin to threads
- Each thread compresses its data with FPC

Key parameters
- Chunk size & number of threads
[Figure: the 1D stream of doubles is split into chunks of `chunk size` consecutive doubles; chunks A, C, E, ... go to thread 1 and chunks B, D, F, ... to thread 2.]
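The round-robin scheme can be sketched as follows. `elements_owned` is a hypothetical helper that shows which portion of an n-element stream a given thread processes; the real pFPC worker loop compresses each owned chunk with thread-private predictor tables, so no synchronization is needed on the prediction state.

```c
#include <assert.h>
#include <stddef.h>

/* Thread `tid` of `nthreads` owns chunks tid, tid + nthreads,
   tid + 2*nthreads, ... Returns how many elements it processes. */
size_t elements_owned(size_t n, size_t chunk_size, int tid, int nthreads) {
    size_t nchunks = (n + chunk_size - 1) / chunk_size;
    size_t owned = 0;
    for (size_t c = (size_t)tid; c < nchunks; c += (size_t)nthreads) {
        size_t begin = c * chunk_size;
        size_t end = begin + chunk_size < n ? begin + chunk_size : n;
        /* ... compress data[begin..end) with this thread's FPC state ... */
        owned += end - begin;
    }
    return owned;
}
```

For example, with 10 doubles, a chunk size of 4, and 2 threads, thread 0 owns chunks 0 and 2 (6 doubles) and thread 1 owns chunk 1 (4 doubles).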
Evaluation Method

Systems
- 3.0 GHz Xeon with 4 processors
- Others in paper

Datasets: linear streams of real-world data (18 – 277 MB)
- 3 observations: error, info, spitzer
- 3 simulations: brain, comet, plasma
- 3 messages: bt, sp, sweep3d
Compression Ratio vs. Thread Count

Configuration
- Small predictor
- Chunk size = 1

Compression ratio
- Low (floating-point data)
- Other algorithms do worse

Fluctuations
- Due to multidimensional data
[Chart: compression ratio (y axis, 1.0 to 1.6) vs. thread count (x axis, 1 to 16).]
Compression Ratio vs. Chunk Size

Configuration
- Small predictor
- 1 to 4 threads

Compression ratio
- Flat for 1 thread
- Steep initial drop

Chunk size
- Larger is better for history-based predictors
[Chart: compression ratio (y axis, 1.11 to 1.19) vs. chunk size (x axis) for 1, 2, 3, and 4 threads.]
Throughput on Xeon System

Throughput increases with chunk size
- Loop overhead, false sharing, TLB performance

Throughput scales with thread count
- Limited by load balance and memory bandwidth
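One way to act on the false-sharing and TLB observations, assuming the common 4 kB page, is to size chunks in whole pages. `page_sized_chunk` below is an illustrative helper, not part of pFPC; on POSIX systems the actual page size can be queried with `sysconf(_SC_PAGESIZE)`.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest chunk size (in doubles) that fills a whole page: page-aligned
   chunks of at least this size mean different threads never write to the
   same page, avoiding false sharing and keeping TLB behavior regular. */
size_t page_sized_chunk(size_t page_bytes) {
    return page_bytes / sizeof(double);
}
```

With 4096-byte pages and 8-byte doubles, this gives 512 doubles per chunk, matching the region of the throughput charts where performance levels off.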
[Charts: compression (left) and decompression (right) throughput in MB/s (200 to 1800) vs. chunk size (1 to 65536) for 1 to 4 threads with 8 kB predictor tables.]
Summary

pFPC algorithm
- Chunks up the data and logically assigns the chunks in round-robin fashion to threads
- Reaches 10.9 and 13.6 Gb/s throughput with a compression ratio of 1.18 on a 4-core 3 GHz Xeon
- Portable C source code is available online: http://users.ices.utexas.edu/~burtscher/research/pFPC/
Conclusions

- For the best compression ratio, the thread count should equal, or be a small multiple of, the data’s dimensionality, and the chunk size should be one
- For the highest throughput, the chunk size should at least match the system’s page size (and chunks should be page aligned)
- Larger chunks also yield higher compression ratios with history-based predictors
- Parallel scaling is limited by the memory bandwidth
- Future work should focus on improving the compression ratio without increasing the memory bandwidth requirement
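A worked example of the first recommendation, under the assumption that multidimensional data is stored interleaved (x0, y0, x1, y1, ... for two dimensions): with chunk size 1 and a thread count equal to the dimensionality, round-robin assignment gives each thread exactly one dimension of the data, which its history-based predictors can model well.

```c
#include <assert.h>
#include <stddef.h>

/* Which thread owns element `index` under round-robin chunk assignment. */
int owner_of(size_t index, size_t chunk_size, int nthreads) {
    return (int)((index / chunk_size) % nthreads);
}
```

With `chunk_size = 1` and `nthreads = 2` on an interleaved 2D stream, thread 0 owns every x value (even indices) and thread 1 every y value (odd indices).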