Floating point representation
IEEE single precision format (Sun, Pentium)
[Layout of the 32-bit word: bit 31 is the sign bit (sgn), bits 23-30 hold the exponent e, bits 0-22 hold the mantissa d1 d2 ... d23.]
Number is:
(-1)^sgn (1.d1 d2 ... d23)_2 × 2^(e-127)
or, more precisely,
(-1)^sgn (1 + d1·2^-1 + d2·2^-2 + ... + d23·2^-23) × 2^(e-127)
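The formula can be checked by unpacking the three bit fields and comparing against the machine's own interpretation of the same word (a small sketch; the helper name `decode_f32` is ours):

```python
import struct

def decode_f32(bits: str) -> float:
    """Decode a 32-bit binary string via
    (-1)^sgn * (1 + mantissa/2^23) * 2^(e-127)."""
    word = int(bits, 2)
    sgn = (word >> 31) & 0x1        # bit 31
    e = (word >> 23) & 0xFF         # bits 23-30
    mantissa = word & 0x7FFFFF      # bits 0-22
    return (-1) ** sgn * (1 + mantissa / 2 ** 23) * 2.0 ** (e - 127)

# Cross-check: pack 3.25 as single precision, decode its bits by hand.
bits = format(int.from_bytes(struct.pack('>f', 3.25), 'big'), '032b')
assert decode_f32(bits) == 3.25
```

(The decode is exact here because mantissa/2^23 and the power of two are both exactly representable in double precision.)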
Example: IEEE single precision
Sign bit is 0: (-1)^0 = 1.
Exponent field: 01111100 = 4 + 8 + 16 + 32 + 64 = 124.
Mantissa: 1 + 2^-2 = 1.25.
Value = 1.25 × 2^(124-127) = 1.25 × 2^-3 = 1.25/8 = 0.15625.
Example
Given number: 0.15625 = 0 01111100 01000000000000000000000
Next number: 0 01111100 01000000000000000000001
= (1 + 1/4 + 2^-23) × 2^(124-127) = 0.15625 + 2^-26
Previous number: 0 01111100 00111111111111111111111
= (1 + 1/8 + 1/16 + ... + 2^-23) × 2^-3 = (1 + 1/4 - 2^-23) × 2^-3 = 0.15625 - 2^-26
Real numbers in between 0.15625 and 0.15625 + 2^-26 may all be represented as 0.15625. Discrete representation leads to approximation errors.
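These neighbors can be verified directly by interpreting the bit patterns with Python's struct module (a sketch; the helper name is ours):

```python
import struct

def f32_from_bits(bits: str) -> float:
    """Interpret a 32-bit binary string as an IEEE single-precision number."""
    return struct.unpack('>f', int(bits, 2).to_bytes(4, 'big'))[0]

x    = f32_from_bits('0' + '01111100' + '01000000000000000000000')  # 0.15625
nxt  = f32_from_bits('0' + '01111100' + '01000000000000000000001')  # next number
prev = f32_from_bits('0' + '01111100' + '00111111111111111111111')  # previous number

# The gap to either neighbor is 2^-26, as computed on the slide.
assert nxt - x == 2 ** -26 and x - prev == 2 ** -26
```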
Finiteness of floating-pt
Floating-point is a finite arithmetic. On most computers, for historical reasons, an
integer and a floating-point number each occupy one word of storage.
Therefore there are the same number of representable numbers of each kind. For example, in IBM System/370 a word is 32 bits, so there are 2^32 representable integers (between -2^31 and 2^31 - 1), and there are also 2^32 representable floating-point numbers.
Largest and smallest nos.
Largest number represented is (1.111111…1)_2 × 2^128.
Largest negative number represented is -(1.111111…1)_2 × 2^128.
Smallest number in absolute value represented is 1.0 × 2^(0-127) = 2^-127.
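Under the slide's simplified model (every exponent pattern usable, so e - 127 runs up to 128), these bounds can be evaluated in double precision; note that (1.111…1)_2 with 23 fraction bits equals 2 - 2^-23:

```python
# Bounds of the slide's simplified single-precision model (a sketch;
# real IEEE 754 reserves the top exponent pattern for Inf/NaN).
largest = (2 - 2 ** -23) * 2.0 ** 128   # (1.111...1)_2 x 2^128
smallest = 2.0 ** -127                  # 1.0 x 2^(0-127)
print(largest, smallest)                # roughly 6.8e38 and 5.9e-39
```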
Underflow and Overflow
Consider IEEE single precision.
Numbers smaller than 2^-128 are indistinguishable from 0.
Such numbers occurring in calculations are said to be in underflow and are treated as 0.
Numbers that are larger than 2^128 cannot be represented.
Such numbers occurring in calculations are said to have overflowed.
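Both effects can be seen from Python (which computes in double precision) by forcing values through single precision with struct. One caveat: where real IEEE arithmetic would produce an infinity on overflow, struct.pack instead raises OverflowError for values too large for the format:

```python
import struct

def to_f32(x: float) -> float:
    """Round a double-precision value to IEEE single precision and back."""
    return struct.unpack('>f', struct.pack('>f', x))[0]

print(to_f32(1e-46))      # underflow: too small to distinguish from zero -> 0.0
try:
    to_f32(1e39)          # larger than ~2^128: cannot be represented
except OverflowError:
    print("overflow")
```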
Overflow and underflow: the two issues of floating point computation
The set of representable floating-point numbers is not closed under the arithmetic operations available on the computer.
Product of two large numbers may not be representable: too large to “fit”. This is overflow.
Conversely, underflow is caused by any calculation whose result is too small to be distinguished from zero.
Floating Point Arithmetic
Approximation Errors in Calculations
(underflow/overflow)
Representation
For simplicity, we will leave the IEEE single precision standard, and work with decimal numbers.
Assume that floating point numbers are represented as normalized numbers as follows:
0.d1 d2 … dk × 10^e, with 1 <= d1 <= 9, 0 <= di <= 9 for i = 2..k, and -m <= e <= m.
Numerical range of machine is -0.99…9 × 10^m to 0.99…9 × 10^m.
Errors in representation
Any number x in the numerical range is represented as follows. x can be written as
x = 0.d1 d2 … dk dk+1 … × 10^n
To represent x in floating point, we keep only the k most significant digits:
x* = 0.d1 d2 … dk × 10^n
This is called chopping the number. The other option is to round the number: add 5 × 10^(n-k-1) to x and then chop.
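Both schemes can be sketched for base-10, k-digit floats (the helper names are ours; the arithmetic is done in binary floats, so borderline cases can be perturbed, but the examples below are exact):

```python
import math

def chop(x: float, k: int) -> float:
    """Keep the k most significant decimal digits of x = 0.d1 d2 ... x 10^n."""
    n = math.floor(math.log10(abs(x))) + 1      # normalization exponent
    scale = 10.0 ** (k - n)
    return math.trunc(x * scale) / scale

def round_k(x: float, k: int) -> float:
    """Round to k digits: add 5 x 10^(n-k-1), then chop."""
    n = math.floor(math.log10(abs(x))) + 1
    return chop(x + 5 * 10.0 ** (n - k - 1), k)

assert chop(2.718281, 3) == 2.71     # chopping drops the trailing digits
assert round_k(2.718281, 3) == 2.72  # rounding looks at the next digit
```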
Approximation errors
Let x be a real number and let x* be its floating point representation on some machine. Let us use IEEE single precision: 1 sign bit, 8 exponent bits, 23 bits of precision (mantissa). Error is defined in two ways: absolute error and relative error.
Absolute error: |x* - x|
Relative error: |x* - x| / |x|, provided x is not 0.
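For instance, 0.1 has no exact single-precision representation, so both errors can be measured directly (a sketch, again using struct to round to single precision):

```python
import struct

x = 0.1
# Nearest single-precision value to 0.1 (round-trip through 32-bit format).
x_star = struct.unpack('>f', struct.pack('>f', x))[0]

abs_err = abs(x_star - x)
rel_err = abs_err / abs(x)
print(abs_err, rel_err)   # both tiny; rel_err stays below 2**-24
```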
Evaluating Polynomials
Evaluation
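Assuming the natural baseline of term-by-term evaluation, a direct sketch (the function name is ours; coefficients are listed from the leading term down):

```python
def eval_naive(coeffs, x):
    """Evaluate a_n x^n + ... + a_1 x + a_0 term by term.
    coeffs = [a_n, ..., a_1, a_0]."""
    n = len(coeffs) - 1
    return sum(a * x ** (n - i) for i, a in enumerate(coeffs))

# p(x) = 2x^3 - 6x^2 + 2x - 1 at x = 3
assert eval_naive([2, -6, 2, -1], 3) == 5
```

Each term recomputes a power of x from scratch, which is the cost Horner's rule removes.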
Horner’s Rule
Horner’s Rule
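Horner's rule rewrites a_n x^n + … + a_1 x + a_0 as ((…(a_n x + a_{n-1}) x + …) x + a_0), needing only n multiplications and n additions. A minimal sketch:

```python
def horner(coeffs, x):
    """Evaluate a polynomial via nested multiplication.
    coeffs = [a_n, ..., a_1, a_0], leading coefficient first."""
    result = 0
    for a in coeffs:
        result = result * x + a
    return result

# p(x) = 2x^3 - 6x^2 + 2x - 1 at x = 3
assert horner([2, -6, 2, -1], 3) == 5
```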
Acknowledgments
These slides are modified from those of Sumit Ganguly.
Recommended