
pFPC: A Parallel Compressor for Floating-Point Data

Martin Burtscher¹ and Paruj Ratanaworabhan²

¹The University of Texas at Austin, ²Cornell University


March 2009

Introduction

- Scientific programs often produce and transfer lots of floating-point data (e.g., program output, checkpoints, messages)
- Large amounts of data are expensive and slow to transfer and store
- FPC algorithm for IEEE 754 double-precision data
  - Compresses linear streams of FP values fast and well
  - Single-pass operation and lossless compression


Introduction (cont.)

- Large-scale high-performance computers consist of many networked compute nodes; each compute node has multiple CPUs but only one link
- Want to speed up data transfer: need real-time compression to match the link throughput
- pFPC: a parallel version of the FPC algorithm; exceeds 10 Gb/s on four Xeon processors



Sequential FPC Algorithm [DCC’07]

- Make two predictions (FCM and DFCM)
- Select the closer value
- XOR it with the true value
- Count the leading zero bytes
- Encode the value
- Update the predictors

[Figure: FPC compression pipeline. An FCM and a DFCM predictor each predict the next 64-bit double from the uncompressed 1D stream; the closer prediction is selected and XORed with the true value, the leading zero bytes of the residual are counted, and the encoder emits a 4-bit code (a 1-bit predictor selector plus a 3-bit leading-zero-byte count) followed by the 0 to 8 remaining residual bytes of the compressed stream.]
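The predict/XOR/count step above can be sketched as follows. This is a minimal illustration with a toy direct-mapped FCM table and hash; the real FPC uses much larger tables, a different hash function, and an additional DFCM predictor alongside the FCM.

```c
#include <stdint.h>

/* Count the leading zero bytes of a 64-bit residual (0..8). */
static int leading_zero_bytes(uint64_t x)
{
    int n = 0;
    while (n < 8 && (x >> 56) == 0) {
        x <<= 8;
        n++;
    }
    return n;
}

/* Toy FCM: a small direct-mapped table indexed by a hash of recent
   values.  Table size and hash are illustrative, not FPC's actual ones. */
#define TABLE_BITS 4
static uint64_t fcm_table[1 << TABLE_BITS];
static uint64_t fcm_hash;

/* Process one double (as raw bits): predict, update the predictor,
   XOR with the true value, and return how many leading bytes of the
   residual are zero, i.e., how many bytes the encoder can elide. */
static int fpc_step(uint64_t val, uint64_t *residual)
{
    uint64_t pred = fcm_table[fcm_hash];   /* predict from history   */
    fcm_table[fcm_hash] = val;             /* update predictor table */
    fcm_hash = ((fcm_hash << 2) ^ (val >> 40)) & ((1u << TABLE_BITS) - 1);
    *residual = val ^ pred;                /* XOR with true value    */
    return leading_zero_bytes(*residual);
}
```

A perfectly predicted value XORs to all zeros (8 leading zero bytes), so only the 4-bit code is emitted for it; a mispredicted value keeps its high-order residual bytes.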


pFPC: Parallel FPC Algorithm

- Operation: divide the data stream into chunks; logically assign the chunks round-robin to threads; each thread compresses its data with FPC
- Key parameters: chunk size and number of threads


[Figure: round-robin chunk assignment. The uncompressed 1D stream of doubles is split into fixed-size chunks A, B, C, D, E, F, ...; with two threads, thread 1 processes chunks A, C, E, ... and thread 2 processes chunks B, D, F, ...]
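The round-robin partitioning can be sketched as below; `owner_of` and `thread_work` are illustrative names, not pFPC's actual API.

```c
#include <stddef.h>

/* Round-robin chunking: value index i falls into chunk i / chunk_size,
   and chunk c belongs to thread c % nthreads. */
static int owner_of(size_t index, size_t chunk_size, int nthreads)
{
    return (int)((index / chunk_size) % nthreads);
}

/* Each thread walks only its own chunks, preserving within-chunk order,
   and compresses them with its private predictor state (the compression
   call itself is omitted here). */
static void thread_work(int tid, int nthreads, size_t chunk_size,
                        const double *data, size_t n)
{
    size_t stride = (size_t)nthreads * chunk_size;
    for (size_t base = (size_t)tid * chunk_size; base < n; base += stride)
        for (size_t i = base; i < base + chunk_size && i < n; i++)
            (void)data[i]; /* compress data[i] here */
}
```

With chunk size 1 and a thread count equal to the data's dimensionality, each thread sees exactly one dimension of an interleaved stream, which is why the compression ratio fluctuates with thread count on multi-dimensional data.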


Evaluation Method

- Systems: 3.0 GHz Xeon with 4 processors (others in the paper)
- Datasets: linear streams of real-world data (18 to 277 MB)
  - 3 observations: error, info, spitzer
  - 3 simulations: brain, comet, plasma
  - 3 messages: bt, sp, sweep3d


Compression Ratio vs. Thread Count

- Configuration: small predictor, chunk size = 1
- Compression ratio is low (typical for FP data); other algorithms fare worse
- Fluctuations are due to multi-dimensional data


[Chart: compression ratio (y-axis, 1.0 to 1.6) versus thread count (x-axis, 1 to 16).]


Compression Ratio vs. Chunk Size

- Configuration: small predictor, 1 to 4 threads
- Compression ratio is flat for 1 thread but shows a steep initial drop with more threads
- Larger chunk sizes are better for history-based predictors


[Chart: compression ratio (y-axis, 1.11 to 1.19) versus chunk size for 1, 2, 3, and 4 threads.]


Throughput on Xeon System

- Throughput increases with chunk size (loop overhead, false sharing, TLB performance)
- Throughput scales with thread count, limited by load balance and memory bandwidth


[Charts: compression and decompression throughput (y-axis, 200 to 1800 MB/s) versus chunk size (x-axis, 1 to 65536) for 1 to 4 threads with 8 kB predictor tables.]


Summary

- pFPC algorithm: chunks up the data and logically assigns the chunks in round-robin fashion to threads
- Reaches 10.9 and 13.6 Gb/s throughput with a compression ratio of 1.18 on a 4-core 3 GHz Xeon
- Portable C source code is available online: http://users.ices.utexas.edu/~burtscher/research/pFPC/



Conclusions

- For the best compression ratio, the thread count should equal, or be a small multiple of, the data's dimensionality, and the chunk size should be one
- For the highest throughput, the chunk size should at least match the system's page size (and be page aligned); larger chunks also yield higher compression ratios with history-based predictors
- Parallel scaling is limited by the memory bandwidth
- Future work should focus on improving the compression ratio without increasing the memory bandwidth requirements
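The page-size guidance above can be sketched as follows, assuming a POSIX system; the conservative 4096-byte floor and the function names are illustrative choices, not pFPC's actual code.

```c
#include <stdlib.h>
#include <unistd.h>

/* Choose a chunk size of at least one page worth of doubles so each
   thread's chunk occupies whole pages, which avoids false sharing
   across threads and helps TLB behavior. */
static size_t chunk_doubles(void)
{
    long page = sysconf(_SC_PAGESIZE);   /* page size in bytes */
    if (page < 4096)
        page = 4096;                     /* conservative floor */
    return (size_t)page / sizeof(double);
}

/* Allocate a page-aligned buffer for one chunk of doubles. */
static double *alloc_chunk(size_t doubles)
{
    void *buf = NULL;
    long page = sysconf(_SC_PAGESIZE);
    if (posix_memalign(&buf, (size_t)page, doubles * sizeof(double)) != 0)
        return NULL;
    return (double *)buf;
}
```

Allocating on page boundaries also keeps each thread's working set in distinct pages, which matters once parallel scaling is bound by the memory system rather than by compute.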
