
Page 1

Efficient Reproducible Floating-Point Reduction Operations on Large Scale Systems

James Demmel, Hong Diep Nguyen

ParLab - EECS - UC Berkeley

SIAM AN13 Jul 8-12, 2013

Page 2

Plan

Introduction

Algorithms

Experimental results

Conclusions and Future work

Page 4

Floating-point arithmetic defines a discrete subset of the real values and suffers from rounding errors.

→ Floating-point operations (+, ×) are commutative but not associative:

(−1 + 1) + 2^(−53) ≠ −1 + (1 + 2^(−53)).

Consequence: the result of a floating-point computation depends on the order of computation.
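
A two-line check of this example, as a minimal C sketch (assuming IEEE binary64 doubles and round-to-nearest): the left grouping keeps 2^(−53), the right one loses it.

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double eps = ldexp(1.0, -53);        /* 2^(-53) */
        double left  = (-1.0 + 1.0) + eps;   /* 0 + 2^(-53) = 2^(-53)         */
        double right = -1.0 + (1.0 + eps);   /* 1 + 2^(-53) rounds to 1, so 0 */
        printf("left = %g, right = %g\n", left, right);
        return 0;
    }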

Reproducibility: the ability to obtain bit-wise identical results from run to run on the same input data, with different resources.

Page 5

Motivations

Demands for reproducible floating-point computations:

▸ Debugging: look inside the code step by step, which might require rerunning multiple times on the same input data.

▸ Understanding the reliability of the output. Example [1]: in a Power State Estimation problem (SpMV + dot product), after the 5th step the Euclidean norm of the residual vector differs by up to 20% from one run to another.

▸ Contractual reasons (road type approval, drug design),

▸ . . .

[1] Villa et al., Effects of Floating-Point Non-Associativity on Numerical Computations on Massively Multithreaded Systems, CUG 2009 Proceedings.

Page 6

Sources of non-reproducibility

A performance-optimized floating-point library is prone to non-reproducibility for various reasons:

▸ Changing data layouts:
  ▸ Data alignment,
  ▸ Data partitioning,
  ▸ Data ordering,

▸ Changing hardware resources:
  ▸ Fused Multiply-Add support,
  ▸ Intermediate precision (64 bits, 80 bits, 128 bits, etc.),
  ▸ Data path (SSE, AVX, GPU warp, etc.),
  ▸ Cache line size,
  ▸ Number of processors,
  ▸ Network topology,
  ▸ ...

Page 8

Reproducibility at Large Scale

Large scale: improve performance by increasing the number of processors.

▸ Highly dynamic scheduling,

▸ Network heterogeneity: reduction tree shape can vary,

▸ Drastically increased communication time

Cost = Arithmetic (FLOPs) + Communication (#words moved + #messages)

▸ Communication-avoiding algorithms change the order of computation on purpose, e.g. 2.5D Matmult, 2.5D LU, etc.,

▸ A little extra arithmetic cost is allowed so long as the communication cost is controlled.

Page 9

Communication cost

[Figure: DDOT normalized timing breakdown (n = 10^6). X-axis: # processors (1, 2, 4, ..., 1024); Y-axis: time normalized by ddot time, 0 to 1; stacked bars of computation and communication per processor count.]

Page 10

State of the art

Source of floating-point non-reproducibility: rounding errors lead to dependence of the computed result on the order of computation.

To obtain reproducibility:

▸ Fix the order of computations:
  ▸ sequential mode: intolerably costly on large-scale systems,
  ▸ fixed reduction tree: substantial communication overhead.

▸ Eliminate/reduce the rounding errors:
  ▸ exact arithmetic (rounded at the end): much more expensive in communication, and requires very wide multi-word arithmetic,
  ▸ fixed-point arithmetic: limited range of values,
  ▸ higher precision: reproducible with high probability (not certain).

▸ Our proposed solution: deterministic errors.

Page 11

Plan

Introduction

Algorithms

Experimental results

Conclusions and Future work

Page 12

A proposed solution for global sum

Objectives:

▸ bit-wise identical results from run to run, regardless of hardware heterogeneity, # processors, reduction tree shape, . . .

▸ independent of data ordering,

▸ only 1 reduction per sum,

▸ no severe loss of accuracy.

Idea: pre-round the input values.

Page 13

Pre-rounding technique

[Figure: the significands of x1, x2, ..., x6 drawn as bit ranges, aligned by exponent between EMAX and EMIN.]

Rounding occurs at each addition. The computation's error depends on the intermediate results, which depend on the order of computation.

Page 14

Pre-rounding technique

[Figure: the same alignment of x1..x6 with a vertical Boundary; the bits of each xi below the Boundary are discarded in advance.]

No rounding error at each addition. The computation's error depends on the Boundary, which depends on max|xi|, not on the ordering.
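
A minimal sketch of the pre-rounding step in C (function name and constants are mine, not the talk's; assumes binary64, round-to-nearest, |x| below 2^(b+50), and no -ffast-math):

    #include <math.h>

    /* Round x to the nearest multiple of 2^b: adding the constant M, whose
       ulp is 2^b, pushes the bits of x below 2^b out of the significand;
       subtracting M back is exact (Sterbenz), leaving the pre-rounded x. */
    double preround(double x, int b) {
        double M = ldexp(3.0, b + 51);   /* M = 1.5 * 2^(b+52), ulp(M) = 2^b */
        return (x + M) - M;
    }

Once every xi is a multiple of 2^b and the partial sums stay below 2^(b+53), every addition is exact, so any summation order yields the same bits.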

Page 15

Pre-rounding technique

[Figure: the same alignment with x1..x6 distributed over proc 1, proc 2, proc 3, all sharing one Boundary below which bits are discarded in advance.]

No rounding error at each addition. The computation's error depends on the Boundary, which depends on max|xi|, not on the ordering ⇒ extra communication among processors to agree on the Boundary.
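
For concreteness, a hedged sketch of the resulting two-reduction scheme (my code and boundary choice, not the talk's; assumes at most 2^21 summands in total, gmax > 0, binary64 round-to-nearest, no -ffast-math). The 1-Reduction technique on the next slide removes the first all-reduce.

    #include <math.h>
    #include <mpi.h>

    /* Reproducible sum with TWO reductions: agree on the global max first,
       derive the boundary b from it, then add pre-rounded values exactly. */
    double repro_sum_2red(const double *x, int n, MPI_Comm comm) {
        double lmax = 0.0, gmax;
        for (int i = 0; i < n; i++) lmax = fmax(lmax, fabs(x[i]));
        MPI_Allreduce(&lmax, &gmax, 1, MPI_DOUBLE, MPI_MAX, comm); /* extra pass */
        int b = ilogb(gmax) - 31;          /* headroom for <= 2^21 terms        */
        double M = ldexp(3.0, b + 51);     /* extraction constant, ulp(M) = 2^b */
        double s = 0.0, total;
        for (int i = 0; i < n; i++)
            s += (x[i] + M) - M;           /* pre-round, then add: all exact    */
        MPI_Allreduce(&s, &total, 1, MPI_DOUBLE, MPI_SUM, comm);
        return total;   /* exact sum of pre-rounded values: order-independent */
    }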

Page 16

1-Reduction technique

[Figure: the same alignment across proc 1, proc 2, proc 3, now with W-bit wide bins at precomputed boundaries.]

Boundaries are precomputed. Special reduction operator: MAX of the boundaries combined with SUM of the corresponding partial sums.
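
A heavily simplified sketch of such an operator (names are mine; the real algorithm also carries k bins plus carry bookkeeping, omitted here): each partial result holds its boundary and the sum of its pre-rounded values, and combining takes the MAX of the boundaries and re-rounds both sums onto it before an exact add.

    #include <math.h>

    typedef struct { int b; double sum; } partial_t;

    /* Combine two partial results in one pass; usable as a user-defined
       reduction operator (e.g. via MPI_Op_create), whatever the tree shape. */
    static partial_t combine(partial_t p, partial_t q) {
        partial_t r;
        r.b = p.b > q.b ? p.b : q.b;                   /* MAX of boundaries */
        double M = ldexp(3.0, r.b + 51);               /* grid 2^(r.b)      */
        r.sum = ((p.sum + M) - M) + ((q.sum + M) - M); /* exact: same grid  */
        return r;
    }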

Page 18

k-fold Algorithm

[Figure: the same alignment across proc 1, proc 2, proc 3, with several W-bit bins per processor.]

Increase the number of bins to increase the accuracy.
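
A sketch of the k-fold extraction of a single input (my names; the talk's parameters are k = 3, W = 40): peel W-bit slices off x from the boundary b downward; both the rounding trick and the remainder subtraction are error-free, so the bins can later be summed and reduced independently and added from most to least significant at the end.

    #include <math.h>

    enum { K = 3, W = 40 };

    /* Split x (with |x| < 2^(b+50)) into K slices: bins[i] is x rounded to
       the grid 2^(b - i*W), minus the coarser slices already extracted. */
    void extract(double x, int b, double bins[K]) {
        for (int i = 0; i < K; i++) {
            double M = ldexp(3.0, (b - i * W) + 51);
            double s = (x + M) - M;   /* x rounded to a multiple of 2^(b-i*W) */
            bins[i] = s;
            x -= s;                   /* exact remainder, fed to the next bin */
        }
    }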

Page 20

k-fold Algorithm: Accuracy

The k-fold algorithm has the error bound

    absolute error ≤ N · Boundary_k < N · 2^(−(k−1)·W) · max|xi|.

In practice, k = 3 and W = 40:

    absolute error < N · 2^(−80) · max|xi| = 2^(−27) · N · ε · max|xi|   (ε = 2^(−53) in double precision)

Standard sum's error bound: ≤ (N − 1) · ε · Σ|xi|.

Page 21

Plan

Introduction

Algorithms

Experimental results

Conclusions and Future work

Page 22

Experimental results: Accuracy

Summation of n = 10^6 floating-point numbers. The computed results of both the reproducible summation and the standard summation (with different orderings: ascending value, descending value, ascending magnitude, descending magnitude) are compared with the result computed in quad-double precision.

Generator of xi                reproducible   standard (min ÷ max over orderings)
drand48()                      0              -8.5e-15 ÷ 1.0e-14
drand48() − 0.5                1.5e-16        -1.7e-13 ÷ 1.8e-13
sin(2.0 ∗ π ∗ i/n)             1.5e-15        -1.0 ÷ 1.0
sin(2.0 ∗ π ∗ i/n) ∗ 2^(−5)    1.0            -1.0 ÷ 1.0
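
A sketch of the kind of ordering comparison behind the table (my code, for the drand48() − 0.5 row): the standard left-to-right sum typically changes in its trailing bits when the same data is summed in a different order.

    #include <stdio.h>
    #include <stdlib.h>

    static int cmp(const void *a, const void *b) {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    int main(void) {
        enum { N = 1000000 };
        static double x[N];
        for (int i = 0; i < N; i++) x[i] = drand48() - 0.5;
        qsort(x, N, sizeof x[0], cmp);                  /* sort by value    */
        double up = 0.0, down = 0.0;
        for (int i = 0; i < N; i++)      up   += x[i];  /* ascending order  */
        for (int i = N - 1; i >= 0; i--) down += x[i];  /* descending order */
        printf("ascending  %.17g\ndescending %.17g\n", up, down);
        return 0;
    }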

Page 23

Experimental results: Performance

[Figure: DDOT normalized timing breakdown (n = 10^6), comparing ddot and prddot at 1, 2, 4, ..., 1024 processors. Y-axis: time normalized by ddot time; stacked bars of computation and communication. Numbers above the prddot bars, left to right: 4.5, 5.5, 5.4, 5.2, 4.6, 3.8, 3.4, 2.5, 2.1, 1.7, 1.2.]

Page 24

Experimental results: Performance

[Figure: DASUM normalized timing breakdown (n = 10^6), comparing dasum and prdasum at 1, 2, 4, ..., 1024 processors. Y-axis: time normalized by dasum time; stacked bars of computation and communication. Numbers above the prdasum bars, left to right: 6.0, 5.9, 5.7, 5.3, 4.2, 3.3, 2.9, 2.4, 1.6, 1.5, 1.3.]

Page 25

Experimental results: Performance

[Figure: DNRM2 normalized timing breakdown (n = 10^6), comparing dnrm2 and prdnrm2 at 1, 2, 4, ..., 1024 processors. Y-axis: time normalized by dnrm2 time, 0 to 1; stacked bars of computation and communication. Numbers above the prdnrm2 bars, left to right: 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.4, 0.4, 0.5, 0.7, 0.9.]

Page 26

Conclusions

The proposed 1-Reduction pre-rounding technique

▸ provides bit-wise identical reproducibility, regardless of
  ▸ data permutation, data assignment,
  ▸ processor count, reduction tree shape,
  ▸ hardware heterogeneity, etc.,

▸ obtains a better error bound than the standard sum's,

▸ can be done on the fly,

▸ requires only ONE reduction for the global parallel summation,

▸ is suitable for very large scale systems (exascale),

▸ can be applied to Cloud computing environments,

▸ can be applied to other operations that use summation as the reduction operator.

Page 27

Future work

In Progress

▸ Reproducible BLAS level 1,

▸ Parallel Prefix Sum,

▸ Matrix-vector / Matrix-matrix multiplication,

TODO

▸ Higher-level driver routines: trsm, factorizations like LU, . . .

▸ n.5D algorithms (2.5D Matmult, 2.5D LU),

▸ SpMV,

▸ Other associative operations,

▸ Real-world applications?

Page 28

Experimental results: Performance (single precision)

[Figure: SDOT normalized timing breakdown (n = 10^6), comparing sdot and prsdot at 1, 2, 4, ..., 1024 processors. Y-axis: time normalized by sdot time; stacked bars of computation and communication. Numbers above the prsdot bars, left to right: 4.5, 4.5, 4.4, 4.1, 3.9, 3.3, 2.9, 2.1, 1.5, 1.3, 1.3.]

Page 29

Experimental results: Performance (single precision)

[Figure: SASUM normalized timing breakdown (n = 10^6), comparing sasum and prsasum at 1, 2, 4, ..., 1024 processors. Y-axis: time normalized by sasum time; stacked bars of computation and communication. Numbers above the prsasum bars, left to right: 6.8, 6.8, 6.4, 5.9, 5.2, 4.3, 3.3, 2.0, 1.6, 1.2, 1.3.]

Page 30

Experimental results: Performance (single precision)

[Figure: SNRM2 normalized timing breakdown (n = 10^6), comparing snrm2 and prsnrm2 at 1, 2, 4, ..., 1024 processors. Y-axis: time normalized by snrm2 time; stacked bars of computation and communication. Numbers above the prsnrm2 bars, left to right: 2.5, 2.4, 2.4, 2.3, 2.2, 2.0, 1.8, 1.6, 1.4, 1.4, 1.2.]