Introduction to Numerical Analysis I
MATH/CMPSC 455
Floating Point Representation of Real Numbers
FLOATING POINT REPRESENTATION OF REAL NUMBERS
This lecture covers how computers represent and operate on real numbers.
This helps us understand rounding errors.
We discuss the IEEE 754 Floating Point Standard.
Representing binary numbers in a computer:
1. format
2. machine representation
Formats for decimal system
Standard Notation
Scientific Notation
Normalized Scientific Notation
FLOATING POINT FORMAT
Format for a floating point number (binary representation)
Normalized IEEE floating point standard: $\pm 1.bbb\ldots b \times 2^{p}$
o sign (+ or -)
o mantissa, which contains the significant bits (the N b's, each 0 or 1)
o exponent (p, represented as an M-bit binary number)
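The normalized format above can be explored with a short sketch. It is only an illustration, assuming Python floats are IEEE doubles (N = 52); the helper name `normalized_form` is hypothetical and not part of the lecture.

```python
import math

def normalized_form(x: float, n_bits: int = 52):
    """Split a nonzero finite float into (sign, mantissa bits, exponent p)
    so that x = sign 1.b1 b2 ... bN * 2**p (illustrative helper)."""
    if x == 0.0 or math.isinf(x) or math.isnan(x):
        raise ValueError("expects a nonzero finite number")
    sign = "-" if x < 0 else "+"
    m, e = math.frexp(abs(x))       # abs(x) = m * 2**e with 0.5 <= m < 1
    significand = 2.0 * m           # rescale so that 1 <= significand < 2
    p = e - 1                       # compensate for the rescaling
    # The fractional part has at most n_bits bits, so this product is exact.
    fraction = int((significand - 1.0) * 2**n_bits)
    return sign, format(fraction, f"0{n_bits}b"), p

sign, bits, p = normalized_form(9.4)
print(f"{sign}1.{bits} x 2^{p}")
```

For 9.4 the exponent is p = 3 and the mantissa starts with 0010110011..., which matches 9.4 = 1001.0110011... in binary.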
FLOATING POINT FORMAT
Precision     Sign   Exponent (M)   Mantissa (N)
single        1      8              23
double        1      11             52
long double   1      15             64
Definition (machine epsilon, $\epsilon_{\text{mach}}$): It is the distance between 1 and the smallest floating point number greater than 1.
For the IEEE double precision floating point standard: $\epsilon_{\text{mach}} = 2^{-52} \approx 2.22 \times 10^{-16}$
It is NOT the smallest representable positive number!
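As a quick check of the definition, the sketch below finds machine epsilon by repeated halving and compares it with the value Python reports. It assumes Python floats are IEEE doubles evaluated in double precision; `machine_epsilon` is an illustrative name.

```python
import sys

def machine_epsilon() -> float:
    """Halve eps until 1 + eps/2 is no longer distinguishable from 1."""
    eps = 1.0
    while 1.0 + eps / 2.0 > 1.0:
        eps /= 2.0
    return eps

eps = machine_epsilon()
print(eps == 2.0**-52)                # distance from 1 to the next double
print(eps == sys.float_info.epsilon)  # the value Python itself reports
print(sys.float_info.min)             # smallest normalized positive double, far below eps
```

The last line illustrates the warning above: positive numbers much smaller than machine epsilon are still representable.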
ROUNDING
How do we fit an infinite binary number into a finite number of bits?
IEEE Rounding to Nearest Rule:
For double precision, if the 53rd bit to the right of the binary point is 0, then round down (truncate after the 52nd bit). If the 53rd bit is 1, then round up (add 1 to bit 52), unless all known bits to the right of the 1 are 0's, in which case 1 is added to bit 52 if and only if bit 52 is 1.
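The Rounding to Nearest Rule can be sketched with exact rational arithmetic. This is a minimal illustration, not a full IEEE implementation: it assumes 1 <= x < 2 so that the binary point sits right after the leading 1, and the name `fl_nearest` is hypothetical.

```python
from fractions import Fraction

def fl_nearest(x: Fraction) -> Fraction:
    """Round x (assumed 1 <= x < 2) to 52 bits after the binary point."""
    if not (1 <= x < 2):
        raise ValueError("this sketch only handles 1 <= x < 2")
    scaled = x * 2**52
    kept = scaled.numerator // scaled.denominator  # leading 1 plus bits 1..52
    rest = scaled - kept                           # bits 53, 54, ... as a fraction in [0, 1)
    half = Fraction(1, 2)                          # bit 53 is 1 and all later bits are 0
    if rest > half:
        kept += 1                                  # more than halfway: round up
    elif rest == half and kept % 2 == 1:
        kept += 1                                  # exact tie: round up only if bit 52 is 1
    return Fraction(kept, 2**52)

x = Fraction(4, 3)
# Python's own conversion to float should agree with the rule.
print(fl_nearest(x) == Fraction(float(x)))
```

The comparison uses the fact that converting a `Fraction` to `float` in CPython is correctly rounded, so it should agree with the rule as stated.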
ROUNDING
Notation: Denote the IEEE double precision floating point number associated with x, using the Rounding to Nearest Rule, by fl(x).
Definition (absolute error & relative error): Let $x_c$ be a computed version of the exact quantity $x$. Then the absolute error is $|x_c - x|$ and the relative error is $|x_c - x| / |x|$ (for $x \neq 0$).
ROUNDING
Example:
Example:
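Relative rounding error: $\frac{|\mathrm{fl}(x) - x|}{|x|} \le \frac{1}{2}\,\epsilon_{\text{mach}}$, where $\epsilon_{\text{mach}}$ is machine epsilon.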
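The rounding error of fl(x) can be measured exactly with rational arithmetic. The sketch below is an illustration under the assumption that Python floats are IEEE doubles; the function name `rounding_errors` and the choice of x = 1/10 are not from the lecture.

```python
from fractions import Fraction

def rounding_errors(x: Fraction):
    """Return (absolute error, relative error) of fl(x) for a nonzero rational x."""
    fl_x = Fraction(float(x))   # the double nearest to x, converted back exactly
    abs_err = abs(fl_x - x)
    rel_err = abs_err / abs(x)
    return abs_err, rel_err

abs_err, rel_err = rounding_errors(Fraction(1, 10))
print(abs_err, rel_err)
print(rel_err <= Fraction(1, 2**53))   # relative error is at most 2^-53
```

Here 2^-53 is half the distance between 1 and the next double, so the comparison checks the usual bound on the relative rounding error for this particular x.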
MACHINE REPRESENTATION
$s \;\; e_1 e_2 \ldots e_{11} \;\; b_1 b_2 \ldots b_{52}$
• Sign: 1 bit, 0 for positive, 1 for negative;
• Mantissa: 52 bits, …
• Exponent: 11 bits, the positive binary integer obtained by adding 1023 (the bias) to the exponent p
  • 1 ~ 2046: p = -1022 ~ 1023;
  • 2047: infinity if the mantissa is all zeros, NaN otherwise;
  • 0: subnormal floating point numbers (small numbers, including 0)
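The three fields can be read directly from the 64 stored bits. The following is a small sketch using the standard `struct` module; the helper names `decode_double` and `classify` are illustrative.

```python
import struct

def decode_double(x: float):
    """Return (sign bit, 11-bit biased exponent, 52-bit mantissa) of a double."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))  # reinterpret as a 64-bit integer
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF
    mantissa = bits & ((1 << 52) - 1)
    return sign, exponent, mantissa

def classify(x: float) -> str:
    """Name the case selected by the biased exponent field."""
    _, exponent, mantissa = decode_double(x)
    if exponent == 2047:
        return "infinity" if mantissa == 0 else "NaN"
    if exponent == 0:
        return "zero or subnormal"
    return f"normalized, p = {exponent - 1023}"

for value in (1.0, -2.0, float("inf"), float("nan"), 5e-324, 0.0):
    print(value, decode_double(value), classify(value))
```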
ADDITION OF FLOATING POINT NUMBERS
Step 1: line up the two numbers
Step 2: add them
Step 3: store the result as a floating point number
The operands are stored in double precision, the addition is carried out in higher precision, and the result is rounded back to double precision.
Example :
Example :
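The three steps can be imitated with exact rational arithmetic standing in for the higher-precision addition. This is only a sketch, assuming Python floats are IEEE doubles; `add_exactly_then_round` is a hypothetical helper, and the sample operands are chosen for illustration.

```python
from fractions import Fraction

def add_exactly_then_round(a: float, b: float) -> float:
    """Add two doubles exactly, then store the result as a double."""
    exact = Fraction(a) + Fraction(b)   # steps 1 and 2: line up and add without error
    return float(exact)                 # step 3: round to the nearest double

tiny = 2.0**-53
print(add_exactly_then_round(1.0, tiny) == 1.0 + tiny)          # hardware addition agrees
print(1.0 + tiny == 1.0)                                        # the tie rounds down to 1
print((1.0 + tiny) - 1.0)                                       # so the small term is lost
print(1.0 + 3 * tiny == 1.0 + 2.0**-51)                         # this tie rounds up instead
```

In the last line the exact sum 1 + 2^-52 + 2^-53 is a tie with bit 52 equal to 1, so 1 is added to bit 52 and the stored result is 1 + 2^-51.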