Automated Floating-Point Precision Analysis


Automated Floating-Point Precision Analysis

Michael O. Lam

Ph.D. Defense, 6 Jan 2014

Jeff Hollingsworth, Advisor


Context

• Floating-point arithmetic is ubiquitous

Context

Floating-point arithmetic represents real numbers as ±1.frac × 2^exp

– Sign bit
– Exponent
– Significand ("mantissa" or "fraction")

Single precision: sign bit, 8-bit exponent, 23-bit significand
Double precision: sign bit, 11-bit exponent, 52-bit significand

Context

Representing 2.0:
– Single precision: 0x40000000
– Double precision: 0x4000000000000000
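These bit patterns are easy to reproduce; a minimal C sketch (not part of CRAFT) that reinterprets the encodings of 2.0 shown above (the same approach works for the values on the following slides):

    /* print the IEEE-754 bit patterns of a value, e.g. 2.0 */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void) {
        float  f = 2.0f;
        double d = 2.0;
        uint32_t fbits;
        uint64_t dbits;
        memcpy(&fbits, &f, sizeof fbits);   /* reinterpret bits safely */
        memcpy(&dbits, &d, sizeof dbits);
        printf("single: 0x%08X\n", fbits);                        /* 0x40000000 */
        printf("double: 0x%016llX\n", (unsigned long long)dbits); /* 0x4000000000000000 */
        return 0;
    }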

Context

Representing 2.625:
– Single precision: 0x40280000
– Double precision: 0x4005000000000000

Context

Representing 0.1:
– Single precision: 0x3DCCCCCD
– Double precision: 0x3FB999999999999A

Context

Representing 1.234:
– Single precision: 0x3F9DF3B6
– Double precision: 0x3FF3BE76C8B43958

Context

• Floating-point is ubiquitous but problematic
  – Rounding error
    • Accumulates after many operations
    • Not always intuitive (e.g., non-associative; see the example below)
    • Naïve approach: higher precision
  – Lower precision is preferable
    • Tesla K20X is 2.3X faster in single precision
    • Xeon Phi is 2.0X faster in single precision
    • Single precision uses 50% of the memory bandwidth
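A minimal C illustration of the non-associativity bullet above: the result depends on evaluation order, because the small addend is absorbed by rounding when paired with the large one.

    /* floating-point addition is not associative */
    #include <stdio.h>

    int main(void) {
        float big = 1.0e8f, small = 1.0f;
        printf("%.1f\n", (big + small) - big);  /* prints 0.0: 1.0 is absorbed */
        printf("%.1f\n", (big - big) + small);  /* prints 1.0 */
        return 0;
    }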

Problem

• Current analysis solutions are lacking
  – Numerical analysis methods are difficult
  – Static analysis is too conservative
  – Trial-and-error is time-consuming
• We need better analysis solutions
  – Produce easy-to-understand results
  – Incorporate runtime effects
  – Automated or semi-automated


Thesis

Automated runtime analysis techniques can inform application developers regarding floating-point behavior,

and can provide insights to guide developers towards reducing precision with minimal impact on accuracy.

Contributions

1. Floating-point software analysis framework
2. Cancellation detection
3. Mixed-precision configuration
4. Reduced-precision analysis

Initial emphasis on capability over performance

Example: Sum2PI_X

/* SUM2PI_X – approximate pi*x in a computationally-
 * heavy way to demonstrate various CRAFT analyses */

#include <math.h>
#include <stdio.h>

typedef double real;    /* precision under study; CRAFT toggles 32/64-bit */

/* constants */
#define PI  3.14159265359
#define EPS 1e-7

/* loop iterations; OUTER is X */
#define OUTER 2000
#define INNER 30

int sum2pi_x() {
    int i, j, k;
    real x, y, acc, sum;
    real final = PI * OUTER;        /* correct answer */

    sum = 0.0;
    for (i=0; i<OUTER; i++) {
        acc = 0.0;
        for (j=1; j<INNER; j++) {

            /* calculate 2^j */
            x = 1.0;
            for (k=0; k<j; k++)
                x *= 2.0;           /* 870K execs */

            /* approximately calculate pi */
            y = (real)PI / x;       /* 58K execs */
            acc += y;               /* 58K execs */
        }
        sum += acc;                 /* 2K execs */
    }
    real err = fabs(final-sum)/fabs(final);
    if (err < EPS) printf("SUCCESSFUL!\n");
    else           printf("FAILED!!!\n");
}


Contribution 1 of 4

Software Framework


Framework

CRAFT: Configurable Runtime Analysis for Floating-point Tuning



Framework

• Dyninst: a binary analysis library
  – Parses executable files (InstructionAPI & ParseAPI)
  – Inserts instrumentation (DyninstAPI)
  – Supports full binary modification (PatchAPI)
  – Rewrites binary executable files (SymtabAPI)
• Binary-level analysis benefits
  – Programming language-agnostic
  – Supports closed third-party libraries
  – Sensitive to compiler transformations


Framework

• CRAFT framework
  – Dyninst-based binary mutator (C/C++)
  – Swing-based GUI viewers (Java)
  – Automated search scripts (Ruby)
• Proof-of-concept analyses
  – Instruction counting
  – Not-a-Number (NaN) detection
  – Range tracking (from Brown et al. 2007)


Sum2PI_X

No NaNs detected


Contribution 2 of 4

Cancellation Detection


Cancellation

• Loss of significant digits due to subtraction
• Cancellation detection
  – Instrument every addition and subtraction
  – Report cancellation events (see the sketch below)

Example (significant digits of precision in parentheses):

      2.491264 (7)            1.613647 (7)
    - 2.491252 (7)          - 1.613647 (7)
      0.000012 (2)            0.000000 (0)
    (5 digits cancelled)    (all digits cancelled)
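A minimal C sketch of one standard detection criterion, comparing the exponent of the larger operand against the exponent of the result; the function name is illustrative, and CRAFT performs the equivalent check via binary instrumentation:

    /* cancelled bits in a +/- operation: exponent of the larger
     * operand minus exponent of the result */
    #include <math.h>
    #include <stdio.h>

    int cancelled_bits(double a, double b, double result) {
        int ea, eb, er;
        frexp(a, &ea);
        frexp(b, &eb);
        frexp(result, &er);
        if (result == 0.0) return 52;        /* complete cancellation */
        int emax = ea > eb ? ea : eb;
        return emax > er ? emax - er : 0;
    }

    int main(void) {
        printf("%d bits cancelled\n",
               cancelled_bits(2.491264, -2.491252, 2.491264 - 2.491252));
        /* 18 bits, i.e. roughly 5 decimal digits */
        return 0;
    }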


Cancellation: GUI


Cancellation: Sum2PI_X

Version   Significand Size (bits)   Canceled Bits
Single    23                        18
Mixed     23/52                     23
Double    52                        29


Cancellation: Results

• Gaussian elimination
  – Detect effects of a small pivot value
  – Highlight algorithmic differences
• Domain-specific insights
  – Dense point fields
  – Color saturations
• Error checking
  – Larger cancellations are better


Cancellation: Conclusions

• Automated analysis can detect cancellation
• Cancellation detection serves a wide variety of purposes
• Later work expanded the ability to identify problematic cancellation [Benz et al. 2012]


Contribution 3 of 4

Mixed Precision


Mixed Precision

• Tradeoff: Single (32 bits) vs. Double (64 bits)
• Single precision is faster
  – 2X+ computational speedup in recent hardware
  – 50% reduction in memory storage and bandwidth
• Double precision is more accurate
  – 16 digits vs. 7 digits (see the example below)
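A quick C demonstration of the accuracy gap, assuming nothing beyond the standard IEEE types:

    /* single vs. double precision: accurate digits of 1/3 */
    #include <stdio.h>

    int main(void) {
        float  f = 1.0f / 3.0f;
        double d = 1.0  / 3.0;
        printf("single: %.20f\n", f);  /* correct to ~7 digits  */
        printf("double: %.20f\n", d);  /* correct to ~16 digits */
        return 0;
    }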


Mixed Precision

• Most operations use single precision
• Crucial operations use double precision

Mixed-precision linear solver [Buttari 2008]:

 1: LU ← PA
 2: solve Ly = Pb
 3: solve Ux_0 = y
 4: for k = 1, 2, ... do
 5:     r_k ← b − Ax_{k−1}
 6:     solve Ly = Pr_k
 7:     solve Uz_k = y
 8:     x_k ← x_{k−1} + z_k
 9:     check for convergence
10: end for

(Red text in the original slide marks the double-precision steps; all other steps are single-precision.)

Difficult to prototype by hand (a structural sketch follows)

50% speedup on average (12X in special cases)
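A self-contained C sketch of the idea, with the O(n³) factorization and triangular solves in single precision and only the O(n²) residual in double; it uses no pivoting and a fixed iteration count, so it illustrates the structure only, not Buttari's implementation:

    #include <stdio.h>
    #define N 3

    /* in-place LU factorization, single precision (no pivoting) */
    static void lu_factor(float A[N][N]) {
        for (int k = 0; k < N; k++)
            for (int i = k + 1; i < N; i++) {
                A[i][k] /= A[k][k];
                for (int j = k + 1; j < N; j++)
                    A[i][j] -= A[i][k] * A[k][j];
            }
    }

    /* solve LUx = b in single precision */
    static void lu_solve(float A[N][N], float b[N], float x[N]) {
        float y[N];
        for (int i = 0; i < N; i++) {          /* Ly = b (unit diagonal) */
            y[i] = b[i];
            for (int j = 0; j < i; j++)
                y[i] -= A[i][j] * y[j];
        }
        for (int i = N - 1; i >= 0; i--) {     /* Ux = y */
            x[i] = y[i];
            for (int j = i + 1; j < N; j++)
                x[i] -= A[i][j] * x[j];
            x[i] /= A[i][i];
        }
    }

    int main(void) {
        double A[N][N] = {{4,1,0},{1,4,1},{0,1,4}};
        double b[N] = {1,2,3}, x[N] = {0,0,0};
        float Af[N][N], rf[N], zf[N];

        for (int i = 0; i < N; i++)            /* single-precision copy */
            for (int j = 0; j < N; j++)
                Af[i][j] = (float)A[i][j];
        lu_factor(Af);                         /* O(n^3) work in single */

        for (int k = 0; k < 5; k++) {          /* iterative refinement */
            double r[N];
            for (int i = 0; i < N; i++) {      /* r = b - Ax, in double */
                r[i] = b[i];
                for (int j = 0; j < N; j++)
                    r[i] -= A[i][j] * x[j];
                rf[i] = (float)r[i];
            }
            lu_solve(Af, rf, zf);              /* correction in single */
            for (int i = 0; i < N; i++)
                x[i] += zf[i];
        }
        for (int i = 0; i < N; i++)
            printf("x[%d] = %.15f\n", i, x[i]);
        return 0;
    }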

Mixed Precision

Workflow: Original Binary (double precision) + Mixed Config → CRAFT → Modified Binary (mixed precision)

Mixed Precision

• Simulate single precision by storing the 32-bit version inside the 64-bit double-precision field (sketched below)
  – Down-cast conversion from double to single
  – High 32 bits of the replaced double: the flag 0x7FF4DEAD (a non-signalling NaN)
  – Low 32 bits: the single-precision value
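A minimal C sketch of this encoding; the helper names are illustrative, and CRAFT performs the equivalent steps in-place via binary instrumentation:

    /* store a single-precision value in a double-precision slot,
     * tagged with a non-signalling NaN flag in the high word */
    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    #define FLAG 0x7FF4DEAD00000000ULL
    #define HIGH 0xFFFFFFFF00000000ULL

    double replace(double d) {
        float f = (float)d;                /* down-cast conversion */
        uint32_t fbits;
        uint64_t bits;
        memcpy(&fbits, &f, sizeof fbits);
        bits = FLAG | fbits;               /* high word = flag, low word = single */
        memcpy(&d, &bits, sizeof d);
        return d;
    }

    int is_replaced(double d) {
        uint64_t bits;
        memcpy(&bits, &d, sizeof bits);
        return (bits & HIGH) == FLAG;
    }

    float extract(double d) {
        uint64_t bits;
        uint32_t fbits;
        float f;
        memcpy(&bits, &d, sizeof bits);
        fbits = (uint32_t)bits;            /* low 32 bits */
        memcpy(&f, &fbits, sizeof f);
        return f;
    }

    int main(void) {
        double d = replace(3.14159265358979);
        printf("replaced: %d, value: %f\n", is_replaced(d), extract(d));
        return 0;
    }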

Mixed Precision

Original double-precision instruction stream for gvec[i,j] = gvec[i,j] * lvec[3] + gvar:

1  movsd  0x601e38(%rax,%rbx,8) → %xmm0
2  mulsd  -0x78(%rsp) * %xmm0 → %xmm0
3  addsd  -0x4f02(%rip) + %xmm0 → %xmm0
4  movsd  %xmm0 → 0x601e38(%rax,%rbx,8)

Mixed Precision

Instrumented stream for gvec[i,j] = gvec[i,j] * lvec[3] + gvar (check/replace inserted before each arithmetic instruction; opcodes narrowed from double to single):

1  movsd  0x601e38(%rax,%rbx,8) → %xmm0
   check/replace -0x78(%rsp) and %xmm0
2  mulss  -0x78(%rsp) * %xmm0 → %xmm0
   check/replace -0x4f02(%rip) and %xmm0
3  addss  -0x4f02(%rip) + %xmm0 → %xmm0
4  movsd  %xmm0 → 0x601e38(%rax,%rbx,8)


Mixed Precision

push %rax
push %rbx

<for each input operand>
    <copy input into %rax>
    mov %rbx, 0xffffffff00000000
    and %rax, %rbx                # extract high word
    mov %rbx, 0x7ff4dead00000000
    cmp %rax, %rbx                # check for flag
    je next                       # skip if already replaced
    <copy input into %rax>
    cvtsd2ss %rax, %rax           # down-cast value
    or %rax, %rbx                 # set flag
    <copy %rax back into input>
next:
<next operand>

pop %rbx
pop %rax

<replaced instruction>            # e.g. addsd => addss


Mixed Precision

• Question: Which parts to replace?
• Answer: Automatic search
  – Empirical, iterative feedback loop
  – User-defined verification routine
  – Heuristic search optimization


Automated Search

• Keys to search algorithm (sketched below)
  – Depth-first search
    • Look for replaceable larger structures first
    • Modules, functions, blocks, etc.
  – Prioritization
    • Inspect highly-executed routines first
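A hypothetical C sketch of that strategy; the Component structure and the verification stub are illustrative stand-ins, since CRAFT's real search is driven by Ruby scripts and a user-defined verification routine:

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct Component {
        const char *name;               /* module, function, block, ... */
        long exec_count;                /* runtime execution count */
        struct Component **children;
        int nchildren;
    } Component;

    /* stand-in for the user-defined verification routine; in CRAFT this
     * runs the modified binary and checks the program's own result */
    static int passes_verification(const Component *c) {
        return c->exec_count < 1000;    /* toy criterion for this sketch */
    }

    /* sort hook: larger execution counts first */
    static int hotter_first(const void *a, const void *b) {
        long x = (*(Component *const *)a)->exec_count;
        long y = (*(Component *const *)b)->exec_count;
        return (y > x) - (y < x);
    }

    /* depth-first: try whole structures first, descend only on failure */
    static void search(Component *c) {
        if (passes_verification(c)) {
            printf("replace %s with single precision\n", c->name);
            return;
        }
        if (c->nchildren == 0)
            return;                     /* leaf must stay double */
        qsort(c->children, c->nchildren, sizeof *c->children, hotter_first);
        for (int i = 0; i < c->nchildren; i++)
            search(c->children[i]);     /* inspect hot routines first */
    }

    int main(void) {
        Component f1 = {"func_hot", 5000, NULL, 0};
        Component f2 = {"func_cold", 500, NULL, 0};
        Component *kids[] = {&f1, &f2};
        Component mod = {"module", 5500, kids, 2};
        search(&mod);
        return 0;
    }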


Mixed Precision: Sum2PI_X

Failed single-precision replacement


Mixed Precision: Sum2PI_X

int sum2pi_x() {
    int i, j, k;
    real x, y, acc;
    sum_type sum;       /* 'real' and 'sum_type' are searched separately */

    real final = PI * OUTER;

    sum = 0.0;
    for (i=0; i<OUTER; i++) {
        acc = 0.0;
        for (j=1; j<INNER; j++) {
            x = 1.0;
            for (k=0; k<j; k++)
                x *= 2.0;
            y = (real)PI / x;
            acc += y;
        }
        sum += acc;
    }
    real err = fabs(final-sum)/fabs(final);
    if (err < EPS) printf("SUCCESSFUL!\n");
    else           printf("FAILED!!!\n");
}

(Constants PI, EPS, OUTER, and INNER as in the original Sum2PI_X listing.)

Search state (✔ = verified, ✗ = failed, ? = being tested):

                real = 32   real = 64
sum_type = 32       ✗
sum_type = 64       ?           ✔


Mixed Precision: Sum2PI_X

(Same code as above.)

Search result: 'sum' must stay in double precision, but every other 'real' can be single:

                real = 32   real = 64
sum_type = 32       ✗
sum_type = 64       ✔           ✔


Mixed Precision: Results

• SuperLU
  – Lower error threshold = fewer replacements

Threshold   % Executions Replaced   Final Error
1.0e-03     99.9                    1.59e-04
1.0e-04     87.3                    4.42e-05
7.5e-05     52.5                    4.40e-05
5.0e-05     45.2                    3.00e-05
2.5e-05     26.6                    1.69e-05
1.0e-05      1.6                    7.15e-07
1.0e-06      1.6                    4.70e-07


Mixed Precision: Results

• AMGmk
  – Highly-adaptive multigrid microkernel
  – Built-in error tolerance
  – Search found complete replacement
  – Manual conversion
• Speedup: 175s to 95s (1.8X)
• Conventional x86_64 hardware


Mixed Precision: Results

Benchmark      Candidate      Configurations   % Dynamic
(name.CLASS)   Instructions   Tested           Replaced
bt.W            6,228          3,934            83.2
bt.A            6,262          4,000            78.6
cg.W              962            251             7.4
cg.A              956            255             5.6
ep.W              423            117            47.2
ep.A              423            114            45.5
ft.W              426             75             0.3
ft.A              426             74             0.2
lu.W            6,038          4,117            57.4
lu.A            6,014          3,057            57.4
mg.W            1,393            443            39.2
mg.A            1,393            437            36.6
sp.W            4,458          5,124            40.5
sp.A            4,507          4,920            30.5


Mixed Precision: Results

• Memory-based analysis
  – Replacement candidates: output operands
  – Generally higher replacement rates
  – Analysis found several valid variable-level replacements

Benchmark      Candidate   Configurations   % Executions
(name.CLASS)   Operands    Tested           Replaced
bt.A            2,342        300             97.0
cg.A              287         68             71.3
ep.A              236         59             37.9
ft.A              466        108             46.2
lu.A            1,742        104             99.9
mg.A              597        153             83.4
sp.A            1,525      1,094             88.9


Mixed Precision: Conclusions

• Automated tools can prototype mixed-precision configurations

• Automated search can provide precision-level replacement insights

• Precision analysis could provide another “knob” for application tuning

• Even if computation requires double precision, storage/communication may not


Contribution 4 of 4

Reduced Precision


Reduced Precision

• Simulate reduced precision with truncation (sketched below)
  – Truncate the result after every operation
  – Allows anywhere from zero bits up to full double (64-bit) precision
  – Less overhead (fewer added operations)
• Search routine
  – Identifies component-level precision requirements

(Diagram: a continuous precision scale from 0 through single to double, vs. the binary single-or-double choice of mixed-precision analysis.)
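A minimal C sketch of the truncation operation, assuming IEEE-754 doubles; the function name is illustrative, and CRAFT applies the equivalent masking after each instrumented operation:

    /* keep only the top 'bits' bits (0..52) of a double's significand */
    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    double truncate_significand(double d, int bits) {
        uint64_t u;
        memcpy(&u, &d, sizeof u);
        u &= ~((1ULL << (52 - bits)) - 1);   /* zero the low (52 - bits) bits */
        memcpy(&d, &u, sizeof d);
        return d;
    }

    int main(void) {
        double x = 1.0 / 3.0;
        printf("%.17f  (full double)\n", x);
        printf("%.17f  (23-bit significand, ~single)\n", truncate_significand(x, 23));
        printf("%.17f  (0 bits, exponent only)\n",       truncate_significand(x, 0));
        return 0;
    }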


Reduced Precision: GUI

• Bit-level precision requirements, shown on a scale from 0 through single to double


Reduced Precision: Sum2PI_X

Precision requirements found for Sum2PI_X:
– 0 bits (single – exponent only)
– 22 bits (single)
– 27 bits (double – overly conservative)
– 32 bits (double)


Reduced Precision

• Faster search convergence compared to mixed-precision analysis

Benchmark   Instructions   Original Wall Time (s)   Speedup
cg.A            956             1,305               59.2%
ep.A            423               978               42.5%
ft.A            426               825               50.2%
lu.A          6,014           514,332               86.7%
mg.A          1,393             2,898               66.0%
sp.A          4,507           422,371               44.1%


Reduced Precision

• General precision requirement profiles, ranging from low sensitivity to high sensitivity


Reduced Precision: Results

(Precision profiles for NAS (top): bt.A (78.6%), mg.A (36.6%), ft.A (0.2%); and LAMMPS (bottom): chute, lj, rhodo.)


Reduced Precision: Results

NAS mg.W (incremental search):
>5.0%  – 4:66
>1.0%  – 5:93
>0.5%  – 9:45
>0.1%  – 15:45
>0.05% – 23:60
Full   – 28:71


Reduced Precision: Conclusions

• Automated analysis can identify general precision level requirements

• Reduced-precision analysis provides results more quickly than mixed-precision analysis

• Incremental searches reduce the time to solution without sacrificing fidelity


Contributions

• General floating-point analysis framework
  – 32.3K LOC total in ~200 files
  – LGPL on Sourceforge: sf.net/p/crafthpc
• Cancellation detection
  – WHIST'11 paper, PARCO 39/3 article
• Mixed-precision configuration
  – SC'12 poster, ICS'13 paper
• Reduced-precision analysis
  – ICS'14 submission in preparation


Future Work

• Short term
  – Optimization and platform ports
  – Analysis extension and composition
  – Further case studies
• Long term
  – Compiler-based implementation
  – IDE and development cycle integration
  – Program modeling and verification


Conclusion

Automated runtime analysis techniques can inform application developers regarding floating-point behavior,

and can provide insights to guide developers towards reducing precision with minimal impact on accuracy.


Acknowledgements

– Collaborators –
Jeff Hollingsworth (advisor) and Pete Stewart (UMD)
Bronis de Supinski, Matt Legendre, et al. (LLNL)

– Colleagues –
Ananta Tiwari, Tugrul Ince, Geoff Stoker, Nick Rutar, Ray Chen, et al.
CS Department @ UMD
Intel XED2

– Family & Friends –
Lindsay Lam (spouse)
Neil & Alice Lam, Barry & Susan Walters
Wallace PCA and Elkton EPC

(cartoon by Nick Rutar)
