pFPC: A Parallel Compressor for Floating-Point Data

Martin Burtscher¹ and Paruj Ratanaworabhan²

¹The University of Texas at Austin   ²Cornell University

March 2009
Introduction

Scientific programs
- Often produce and transfer lots of floating-point data (e.g., program output, checkpoints, messages)
- Large amounts of data are expensive and slow to transfer and store

FPC algorithm for IEEE 754 double-precision data
- Compresses linear streams of FP values fast and well
- Single-pass operation and lossless compression
Introduction (cont.)

Large-scale high-performance computers
- Consist of many networked compute nodes
- Compute nodes have multiple CPUs but only one link

Want to speed up data transfer
- Need real-time compression to match the link throughput

pFPC: a parallel version of the FPC algorithm
- Exceeds 10 Gb/s on four Xeon processors
Sequential FPC Algorithm [DCC’07]

- Make two predictions
- Select the closer value
- XOR with the true value
- Count the leading zero bytes
- Encode the value
- Update the predictors
[Figure: FPC compressor diagram — each 64-bit double from the uncompressed 1D stream is predicted by an FCM and a DFCM predictor; the closer prediction is selected and XORed with the true value, the leading zero bytes of the result are counted, and the encoder emits a 1-bit predictor-selector code plus a 3-bit leading-zero-byte count (two such codes packed per byte) followed by the 0 to 8 remainder bytes into the compressed stream.]
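The per-value compression step listed above can be sketched in C as follows. This is a minimal illustration, not the actual FPC source: the real algorithm uses large hash-table-based FCM and DFCM predictors, whereas `fcm_pred` and `dfcm_delta` here are degenerate single-entry stand-ins.

```c
#include <assert.h>
#include <stdint.h>

/* Degenerate single-entry predictors (the real FPC uses hash tables). */
static uint64_t fcm_pred;    /* FCM: predicted next value           */
static uint64_t dfcm_delta;  /* DFCM: predicted delta to next value */
static uint64_t last_val;

/* Number of leading zero bytes in x (0 to 8). */
int leading_zero_bytes(uint64_t x) {
    int n = 0;
    while (n < 8 && (x >> 56) == 0) { x <<= 8; n++; }
    return n;
}

/* Compress one 64-bit value: pick the closer prediction (more leading
   zero bytes after XOR), emit a selector bit and a zero-byte count, and
   the residual whose leading zero bytes need not be stored. */
void fpc_step(uint64_t val, int *selector, int *count, uint64_t *residual) {
    uint64_t r_fcm  = val ^ fcm_pred;
    uint64_t r_dfcm = val ^ (last_val + dfcm_delta);
    int z_fcm  = leading_zero_bytes(r_fcm);
    int z_dfcm = leading_zero_bytes(r_dfcm);
    if (z_fcm >= z_dfcm) { *selector = 0; *count = z_fcm;  *residual = r_fcm;  }
    else                 { *selector = 1; *count = z_dfcm; *residual = r_dfcm; }
    /* Update the predictors with the true value. */
    fcm_pred   = val;
    dfcm_delta = val - last_val;
    last_val   = val;
}
```

Decompression mirrors this step: it runs the same predictors, reads the selector and count, and XORs the residual with the selected prediction to recover the exact bits, which is what makes the scheme lossless.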
pFPC: Parallel FPC Algorithm

pFPC operation
- Divide the data stream into chunks
- Logically assign chunks round-robin to threads
- Each thread compresses its data with FPC

Key parameters
- Chunk size & number of threads
[Figure: the 1D stream of doubles is split into chunks of `chunk size` consecutive doubles; chunks A, C, E, ... go to thread 1 and chunks B, D, F, ... to thread 2.]
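The round-robin scheme can be sketched as follows. `elements_owned` is a hypothetical helper that shows which portion of an n-element stream a given thread processes; the real pFPC worker loop compresses each owned chunk with thread-private predictor tables, so no synchronization is needed on the prediction state.

```c
#include <assert.h>
#include <stddef.h>

/* Thread `tid` of `nthreads` owns chunks tid, tid + nthreads,
   tid + 2*nthreads, ... Returns how many elements it processes. */
size_t elements_owned(size_t n, size_t chunk_size, int tid, int nthreads) {
    size_t nchunks = (n + chunk_size - 1) / chunk_size;
    size_t owned = 0;
    for (size_t c = (size_t)tid; c < nchunks; c += (size_t)nthreads) {
        size_t begin = c * chunk_size;
        size_t end = begin + chunk_size < n ? begin + chunk_size : n;
        /* ... compress data[begin..end) with this thread's FPC state ... */
        owned += end - begin;
    }
    return owned;
}
```

For example, with 10 doubles, a chunk size of 4, and 2 threads, thread 0 owns chunks 0 and 2 (6 doubles) and thread 1 owns chunk 1 (4 doubles).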
Evaluation Method

Systems
- 3.0 GHz Xeon with 4 processors
- Others in paper

Datasets: linear streams of real-world data (18 – 277 MB)
- 3 observations: error, info, spitzer
- 3 simulations: brain, comet, plasma
- 3 messages: bt, sp, sweep3d
Compression Ratio vs. Thread Count

Configuration
- Small predictor
- Chunk size = 1

Compression ratio
- Low (floating-point data)
- Other algorithms do worse

Fluctuations
- Due to multidimensional data
[Chart: compression ratio (y axis, 1.0 to 1.6) vs. thread count (x axis, 1 to 16).]
Compression Ratio vs. Chunk Size

Configuration
- Small predictor
- 1 to 4 threads

Compression ratio
- Flat for 1 thread
- Steep initial drop

Chunk size
- Larger is better for history-based predictors
[Chart: compression ratio (y axis, 1.11 to 1.19) vs. chunk size (x axis) for 1, 2, 3, and 4 threads.]
Throughput on Xeon System

Throughput increases with chunk size
- Loop overhead, false sharing, TLB performance

Throughput scales with thread count
- Limited by load balance and memory bandwidth
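One way to act on the false-sharing and TLB observations, assuming the common 4 kB page, is to size chunks in whole pages. `page_sized_chunk` below is an illustrative helper, not part of pFPC; on POSIX systems the actual page size can be queried with `sysconf(_SC_PAGESIZE)`.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest chunk size (in doubles) that fills a whole page: page-aligned
   chunks of at least this size mean different threads never write to the
   same page, avoiding false sharing and keeping TLB behavior regular. */
size_t page_sized_chunk(size_t page_bytes) {
    return page_bytes / sizeof(double);
}
```

With 4096-byte pages and 8-byte doubles, this gives 512 doubles per chunk, matching the region of the throughput charts where performance levels off.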
[Charts: compression (left) and decompression (right) throughput in MB/s (200 to 1800) vs. chunk size (1 to 65536) for 1 to 4 threads with 8 kB predictor tables.]
Summary

pFPC algorithm
- Chunks up the data and logically assigns the chunks in round-robin fashion to threads
- Reaches 10.9 and 13.6 Gb/s throughput with a compression ratio of 1.18 on a 4-core 3 GHz Xeon
- Portable C source code is available online: http://users.ices.utexas.edu/~burtscher/research/pFPC/
Conclusions

- For the best compression ratio, the thread count should equal, or be a small multiple of, the data’s dimensionality, and the chunk size should be one
- For the highest throughput, the chunk size should at least match the system’s page size (and chunks should be page aligned)
- Larger chunks also yield higher compression ratios with history-based predictors
- Parallel scaling is limited by the memory bandwidth
- Future work should focus on improving the compression ratio without increasing the memory bandwidth requirement
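A worked example of the first recommendation, under the assumption that multidimensional data is stored interleaved (x0, y0, x1, y1, ... for two dimensions): with chunk size 1 and a thread count equal to the dimensionality, round-robin assignment gives each thread exactly one dimension of the data, which its history-based predictors can model well.

```c
#include <assert.h>
#include <stddef.h>

/* Which thread owns element `index` under round-robin chunk assignment. */
int owner_of(size_t index, size_t chunk_size, int nthreads) {
    return (int)((index / chunk_size) % nthreads);
}
```

With `chunk_size = 1` and `nthreads = 2` on an interleaved 2D stream, thread 0 owns every x value (even indices) and thread 1 every y value (odd indices).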