
Page 1

Efficient Reproducible Floating-Point Reduction Operations on Large Scale Systems

James Demmel, Hong Diep Nguyen

ParLab - EECS - UC Berkeley

SIAM AN13 Jul 8-12, 2013

Page 2

Plan

Introduction

Algorithms

Experimental results

Conclusions and Future work

Page 4

Floating-point arithmetic defines a discrete subset of the real values and suffers from rounding errors.

→ Floating-point operations (+, ×) are commutative but not associative:

(−1 + 1) + 2^(−53) ≠ −1 + (1 + 2^(−53)).

Consequence: the result of a floating-point computation depends on the order of computation.
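
A two-line check of this example, as a minimal C sketch (assuming IEEE binary64 doubles and round-to-nearest): the left grouping keeps 2^(−53), the right one loses it.

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double eps = ldexp(1.0, -53);        /* 2^(-53) */
        double left  = (-1.0 + 1.0) + eps;   /* 0 + 2^(-53) = 2^(-53)         */
        double right = -1.0 + (1.0 + eps);   /* 1 + 2^(-53) rounds to 1, so 0 */
        printf("left = %g, right = %g\n", left, right);
        return 0;
    }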

Reproducibility: the ability to obtain bit-wise identical results from run to run on the same input data, with different resources.

Page 5

Motivations

Demands for reproducible floating-point computations:

▸ Debugging: look inside the code step by step, which might require rerunning multiple times on the same input data.

▸ Understanding the reliability of the output. Example [1]: in a Power State Estimation problem (SpMV + dot product), after the 5th step the Euclidean norm of the residual vector differs by up to 20% from one run to another.

▸ Contractual reasons (road type approval, drug design),

▸ . . .

[1] Villa et al., Effects of Floating-Point Non-Associativity on Numerical Computations on Massively Multithreaded Systems, CUG 2009 Proceedings.

Page 6

Sources of non-reproducibility

A performance-optimized floating-point library is prone to non-reproducibility for various reasons:

▸ Changing data layouts:
  ▸ Data alignment,
  ▸ Data partitioning,
  ▸ Data ordering,

▸ Changing hardware resources:
  ▸ Fused Multiply-Add support,
  ▸ Intermediate precision (64 bits, 80 bits, 128 bits, etc.),
  ▸ Data path (SSE, AVX, GPU warp, etc.),
  ▸ Cache line size,
  ▸ Number of processors,
  ▸ Network topology,
  ▸ ...

Page 8

Reproducibility at Large Scale

Large scale: improve performance by increasing the number of processors.

▸ Highly dynamic scheduling,

▸ Network heterogeneity: reduction tree shape can vary,

▸ Drastically increased communication time

Cost = Arithmetic (FLOPs) + Communication (#words moved + #messages)

▸ Communication-avoiding algorithms change the order of computation on purpose, e.g. 2.5D Matmult, 2.5D LU, etc.,

▸ A little extra arithmetic cost is allowed so long as the communication cost is controlled.

Page 9

Communication cost

[Figure: DDOT normalized timing breakdown (n = 10^6). X-axis: # processors (1, 2, 4, ..., 1024); Y-axis: time normalized by ddot time, 0 to 1; stacked bars of computation and communication per processor count.]

Page 10

State of the art

Source of floating-point non-reproducibility: rounding errors lead to dependence of the computed result on the order of computation.

To obtain reproducibility:

▸ Fix the order of computations:
  ▸ sequential mode: intolerably costly on large-scale systems,
  ▸ fixed reduction tree: substantial communication overhead.

▸ Eliminate/reduce the rounding errors:
  ▸ exact arithmetic (rounded at the end): much more expensive in communication, and requires very wide multi-word arithmetic,
  ▸ fixed-point arithmetic: limited range of values,
  ▸ higher precision: reproducible with high probability (not certain).

▸ Our proposed solution: deterministic errors.

Page 11

Plan

Introduction

Algorithms

Experimental results

Conclusions and Future work

Page 12

A proposed solution for global sum

Objectives:

▸ bit-wise identical results from run to run, regardless of hardware heterogeneity, # processors, reduction tree shape, . . .

▸ independent of data ordering,

▸ only 1 reduction per sum,

▸ no severe loss of accuracy.

Idea: pre-round the input values.

Page 13

Pre-rounding technique

[Figure: the significands of x1, x2, ..., x6 drawn as bit ranges, aligned by exponent between EMAX and EMIN.]

Rounding occurs at each addition. The computation's error depends on the intermediate results, which depend on the order of computation.

Page 14

Pre-rounding technique

[Figure: the same alignment of x1..x6 with a vertical Boundary; the bits of each xi below the Boundary are discarded in advance.]

No rounding error at each addition. The computation's error depends on the Boundary, which depends on max|xi|, not on the ordering.
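
A minimal sketch of the pre-rounding step in C (function name and constants are mine, not the talk's; assumes binary64, round-to-nearest, |x| below 2^(b+50), and no -ffast-math):

    #include <math.h>

    /* Round x to the nearest multiple of 2^b: adding the constant M, whose
       ulp is 2^b, pushes the bits of x below 2^b out of the significand;
       subtracting M back is exact (Sterbenz), leaving the pre-rounded x. */
    double preround(double x, int b) {
        double M = ldexp(3.0, b + 51);   /* M = 1.5 * 2^(b+52), ulp(M) = 2^b */
        return (x + M) - M;
    }

Once every xi is a multiple of 2^b and the partial sums stay below 2^(b+53), every addition is exact, so any summation order yields the same bits.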

Page 15

Pre-rounding technique

[Figure: the same alignment with x1..x6 distributed over proc 1, proc 2, proc 3, all sharing one Boundary below which bits are discarded in advance.]

No rounding error at each addition. The computation's error depends on the Boundary, which depends on max|xi|, not on the ordering ⇒ extra communication among processors to agree on the Boundary.
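
For concreteness, a hedged sketch of the resulting two-reduction scheme (my code and boundary choice, not the talk's; assumes at most 2^21 summands in total, gmax > 0, binary64 round-to-nearest, no -ffast-math). The 1-Reduction technique on the next slide removes the first all-reduce.

    #include <math.h>
    #include <mpi.h>

    /* Reproducible sum with TWO reductions: agree on the global max first,
       derive the boundary b from it, then add pre-rounded values exactly. */
    double repro_sum_2red(const double *x, int n, MPI_Comm comm) {
        double lmax = 0.0, gmax;
        for (int i = 0; i < n; i++) lmax = fmax(lmax, fabs(x[i]));
        MPI_Allreduce(&lmax, &gmax, 1, MPI_DOUBLE, MPI_MAX, comm); /* extra pass */
        int b = ilogb(gmax) - 31;          /* headroom for <= 2^21 terms        */
        double M = ldexp(3.0, b + 51);     /* extraction constant, ulp(M) = 2^b */
        double s = 0.0, total;
        for (int i = 0; i < n; i++)
            s += (x[i] + M) - M;           /* pre-round, then add: all exact    */
        MPI_Allreduce(&s, &total, 1, MPI_DOUBLE, MPI_SUM, comm);
        return total;   /* exact sum of pre-rounded values: order-independent */
    }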

Page 16

1-Reduction technique

[Figure: the same alignment across proc 1, proc 2, proc 3, now with W-bit wide bins at precomputed boundaries.]

Boundaries are precomputed. Special reduction operator: MAX of the boundaries combined with SUM of the corresponding partial sums.
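
A heavily simplified sketch of such an operator (names are mine; the real algorithm also carries k bins plus carry bookkeeping, omitted here): each partial result holds its boundary and the sum of its pre-rounded values, and combining takes the MAX of the boundaries and re-rounds both sums onto it before an exact add.

    #include <math.h>

    typedef struct { int b; double sum; } partial_t;

    /* Combine two partial results in one pass; usable as a user-defined
       reduction operator (e.g. via MPI_Op_create), whatever the tree shape. */
    static partial_t combine(partial_t p, partial_t q) {
        partial_t r;
        r.b = p.b > q.b ? p.b : q.b;                   /* MAX of boundaries */
        double M = ldexp(3.0, r.b + 51);               /* grid 2^(r.b)      */
        r.sum = ((p.sum + M) - M) + ((q.sum + M) - M); /* exact: same grid  */
        return r;
    }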

Page 18

k-fold Algorithm

[Figure: the same alignment across proc 1, proc 2, proc 3, with several W-bit bins per processor.]

Increase the number of bins to increase the accuracy.
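
A sketch of the k-fold extraction of a single input (my names; the talk's parameters are k = 3, W = 40): peel W-bit slices off x from the boundary b downward; both the rounding trick and the remainder subtraction are error-free, so the bins can later be summed and reduced independently and added from most to least significant at the end.

    #include <math.h>

    enum { K = 3, W = 40 };

    /* Split x (with |x| < 2^(b+50)) into K slices: bins[i] is x rounded to
       the grid 2^(b - i*W), minus the coarser slices already extracted. */
    void extract(double x, int b, double bins[K]) {
        for (int i = 0; i < K; i++) {
            double M = ldexp(3.0, (b - i * W) + 51);
            double s = (x + M) - M;   /* x rounded to a multiple of 2^(b-i*W) */
            bins[i] = s;
            x -= s;                   /* exact remainder, fed to the next bin */
        }
    }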

Page 20

k-fold Algorithm: Accuracy

The k-fold algorithm has the error bound

    absolute error ≤ N · Boundary_k < N · 2^(−(k−1)·W) · max|xi|.

In practice, k = 3 and W = 40:

    absolute error < N · 2^(−80) · max|xi| = 2^(−27) · N · ε · max|xi|   (ε = 2^(−53) in double precision)

Standard sum's error bound: ≤ (N − 1) · ε · Σ|xi|.

Page 21

Plan

Introduction

Algorithms

Experimental results

Conclusions and Future work

Page 22

Experimental results: Accuracy

Summation of n = 10^6 floating-point numbers. The computed results of both the reproducible summation and the standard summation (with different orderings: ascending value, descending value, ascending magnitude, descending magnitude) are compared with the result computed in quad-double precision.

Generator of xi                reproducible   standard (min ÷ max over orderings)
drand48()                      0              -8.5e-15 ÷ 1.0e-14
drand48() − 0.5                1.5e-16        -1.7e-13 ÷ 1.8e-13
sin(2.0 ∗ π ∗ i/n)             1.5e-15        -1.0 ÷ 1.0
sin(2.0 ∗ π ∗ i/n) ∗ 2^(−5)    1.0            -1.0 ÷ 1.0
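
A sketch of the kind of ordering comparison behind the table (my code, for the drand48() − 0.5 row): the standard left-to-right sum typically changes in its trailing bits when the same data is summed in a different order.

    #include <stdio.h>
    #include <stdlib.h>

    static int cmp(const void *a, const void *b) {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    int main(void) {
        enum { N = 1000000 };
        static double x[N];
        for (int i = 0; i < N; i++) x[i] = drand48() - 0.5;
        qsort(x, N, sizeof x[0], cmp);                  /* sort by value    */
        double up = 0.0, down = 0.0;
        for (int i = 0; i < N; i++)      up   += x[i];  /* ascending order  */
        for (int i = N - 1; i >= 0; i--) down += x[i];  /* descending order */
        printf("ascending  %.17g\ndescending %.17g\n", up, down);
        return 0;
    }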

Page 23

Experimental results: Performance

[Figure: DDOT normalized timing breakdown (n = 10^6), comparing ddot and prddot at 1, 2, 4, ..., 1024 processors. Y-axis: time normalized by ddot time; stacked bars of computation and communication. Numbers above the prddot bars, left to right: 4.5, 5.5, 5.4, 5.2, 4.6, 3.8, 3.4, 2.5, 2.1, 1.7, 1.2.]

Page 24

Experimental results: Performance

[Figure: DASUM normalized timing breakdown (n = 10^6), comparing dasum and prdasum at 1, 2, 4, ..., 1024 processors. Y-axis: time normalized by dasum time; stacked bars of computation and communication. Numbers above the prdasum bars, left to right: 6.0, 5.9, 5.7, 5.3, 4.2, 3.3, 2.9, 2.4, 1.6, 1.5, 1.3.]

Page 25

Experimental results: Performance

[Figure: DNRM2 normalized timing breakdown (n = 10^6), comparing dnrm2 and prdnrm2 at 1, 2, 4, ..., 1024 processors. Y-axis: time normalized by dnrm2 time, 0 to 1; stacked bars of computation and communication. Numbers above the prdnrm2 bars, left to right: 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.4, 0.4, 0.5, 0.7, 0.9.]

Page 26

Conclusions

The proposed 1-Reduction pre-rounding technique

▸ provides bit-wise identical reproducibility, regardless of
  ▸ data permutation, data assignment,
  ▸ processor count, reduction tree shape,
  ▸ hardware heterogeneity, etc.,

▸ obtains a better error bound than the standard sum's,

▸ can be done on the fly,

▸ requires only ONE reduction for the global parallel summation,

▸ is suitable for very large scale systems (exascale),

▸ can be applied to Cloud computing environments,

▸ can be applied to other operations that use summation as the reduction operator.

Page 27

Future work

In Progress

▸ Reproducible BLAS level 1,

▸ Parallel Prefix Sum,

▸ Matrix-vector / Matrix-matrix multiplication,

TODO

▸ Higher-level driver routines: trsm, factorizations like LU, . . .

▸ n.5D algorithms (2.5D Matmult, 2.5D LU),

▸ SpMV,

▸ Other associative operations,

▸ Real-world applications?

Page 28

Experimental results: Performance (single precision)

[Figure: SDOT normalized timing breakdown (n = 10^6), comparing sdot and prsdot at 1, 2, 4, ..., 1024 processors. Y-axis: time normalized by sdot time; stacked bars of computation and communication. Numbers above the prsdot bars, left to right: 4.5, 4.5, 4.4, 4.1, 3.9, 3.3, 2.9, 2.1, 1.5, 1.3, 1.3.]

Page 29

Experimental results: Performance (single precision)

[Figure: SASUM normalized timing breakdown (n = 10^6), comparing sasum and prsasum at 1, 2, 4, ..., 1024 processors. Y-axis: time normalized by sasum time; stacked bars of computation and communication. Numbers above the prsasum bars, left to right: 6.8, 6.8, 6.4, 5.9, 5.2, 4.3, 3.3, 2.0, 1.6, 1.2, 1.3.]

Page 30

Experimental results: Performance (single precision)

[Figure: SNRM2 normalized timing breakdown (n = 10^6), comparing snrm2 and prsnrm2 at 1, 2, 4, ..., 1024 processors. Y-axis: time normalized by snrm2 time; stacked bars of computation and communication. Numbers above the prsnrm2 bars, left to right: 2.5, 2.4, 2.4, 2.3, 2.2, 2.0, 1.8, 1.6, 1.4, 1.4, 1.2.]