Introduction to Numerical Analysis I MATH/CMPSC 455 Floating Point Representation of Real Numbers

Introduction to Numerical Analysis I

MATH/CMPSC 455

Floating Point Representation of Real Numbers

FLOATING POINT REPRESENTATION OF REAL NUMBERS

This lecture covers how computers represent real numbers and operate on them.

This helps us understand rounding errors.

We discuss the IEEE 754 floating point standard.

Representing binary numbers in a computer:

1. format

2. machine representation

Formats for the decimal system:

Standard Notation

Scientific Notation

Normalized Scientific Notation

FLOATING POINT FORMAT

Format for a floating point number (binary representation)

Normalized IEEE floating point standard: $\pm 1.bbb \ldots b \times 2^{p}$, made up of

o sign (+ or -)

o mantissa $1.bbb \ldots b$, which contains the significant bits (N b's, each 0 or 1)

o exponent (p, an M-bit binary number representing the exponent)
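The decomposition into sign, mantissa, and exponent can be sketched in Python. This is only an illustration: the helper name `normalized_form` is made up here, and it works on the mathematical value of a double through `math.frexp` rather than on the stored bits.

```python
import math

def normalized_form(x):
    """Return (sign, mantissa, p) with |x| = mantissa * 2**p and 1 <= mantissa < 2.

    Hypothetical helper for illustration; not part of any standard API.
    """
    if x == 0 or math.isinf(x) or math.isnan(x):
        raise ValueError("only nonzero finite numbers have a normalized form")
    sign = "-" if x < 0 else "+"
    m, e = math.frexp(abs(x))   # abs(x) == m * 2**e with 0.5 <= m < 1
    return sign, 2 * m, e - 1   # rescale so the leading binary digit is 1

print(normalized_form(6.0))        # 6 = +1.5 * 2**2
print(normalized_form(-0.15625))   # -0.15625 = -1.25 * 2**-3
print((6.0).hex())                 # built-in view of the same form, in hexadecimal
```

The built-in `float.hex` method shows the same normalized form, with the mantissa bits written in hexadecimal and the exponent after `p`.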

… …

FLOATING POINT FORMAT

Precision      Sign    Exponent (M)    Mantissa (N)

single         1       8               23

double         1       11              52

long double    1       15              64

Definition (machine epsilon, $\epsilon_{\text{mach}}$): Machine epsilon is the distance between 1 and the smallest floating point number greater than 1.

For the IEEE double precision floating point standard: $\epsilon_{\text{mach}} = 2^{-52}$.

It is NOT the smallest representable number!
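A small Python sketch can make the definition concrete. It assumes Python floats are IEEE double precision (true on common platforms); the function name `machine_epsilon` is chosen here for illustration.

```python
import sys

def machine_epsilon():
    """Halve eps until 1 + eps/2 is no longer distinguishable from 1."""
    eps = 1.0
    while 1.0 + eps / 2 > 1.0:
        eps /= 2
    return eps

eps = machine_epsilon()
print(eps == 2.0 ** -52)               # gap between 1 and the next double
print(eps == sys.float_info.epsilon)   # value reported by the runtime
print(sys.float_info.min < eps)        # far smaller positive doubles exist
```

The last line illustrates the warning above: `sys.float_info.min`, the smallest positive normalized double, is many orders of magnitude smaller than machine epsilon.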

ROUNDING

How do we fit an infinite binary number into a finite number of bits?

IEEE Rounding to Nearest Rule:

For double precision, if the 53rd bit to the right of the binary point is 0, then round down (truncate after the 52nd bit). If the 53rd bit is 1, then round up (add 1 to bit 52), unless all known bits to the right of that 1 are 0's, in which case 1 is added to bit 52 if and only if bit 52 is 1.
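The cases of the rule can be checked with exact rational inputs. The sketch below assumes that converting a `Fraction` to `float` in CPython is correctly rounded (round to nearest, ties to even); the name `fl` is used here only as a stand-in for the rounding operation.

```python
from fractions import Fraction

def fl(x):
    """Round an exact rational x to the nearest double (assumed ties-to-even)."""
    return float(Fraction(x))

one = Fraction(1)
bit52 = Fraction(1, 2 ** 52)   # weight of the last stored mantissa bit
bit53 = Fraction(1, 2 ** 53)   # weight of the first discarded bit
bit55 = Fraction(1, 2 ** 55)   # a bit further to the right

# 53rd bit is 0: round down (truncate).
print(fl(one + bit55) == 1.0)
# 53rd bit is 1 and later bits are not all 0: round up.
print(fl(one + bit53 + bit55) == 1.0 + 2.0 ** -52)
# 53rd bit is 1, all later bits 0, bit 52 is 0: nothing is added.
print(fl(one + bit53) == 1.0)
# 53rd bit is 1, all later bits 0, bit 52 is 1: add 1 to bit 52.
print(fl(one + bit52 + bit53) == 1.0 + 2.0 ** -51)
```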

ROUNDING

Notation: Denote by fl(x) the IEEE double precision floating point number associated with x under the Rounding to Nearest Rule.

Definition (absolute error & relative error): Let $x_c$ be a computed version of the exact quantity $x$. Then

absolute error $= |x_c - x|$,

relative error $= \dfrac{|x_c - x|}{|x|}$, provided $x \neq 0$.
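As a small worked illustration of the two error measures, the sketch below writes `x_c` for the computed value and `x` for the exact one (names chosen here for illustration) and uses exact rational arithmetic so that the errors themselves are not polluted by rounding.

```python
from fractions import Fraction

def absolute_error(x_c, x):
    """|x_c - x|, computed exactly."""
    return abs(Fraction(x_c) - Fraction(x))

def relative_error(x_c, x):
    """|x_c - x| / |x|, defined only for x != 0."""
    x = Fraction(x)
    if x == 0:
        raise ZeroDivisionError("relative error is undefined for x = 0")
    return absolute_error(x_c, x) / abs(x)

x = Fraction(1, 3)           # exact quantity
x_c = Fraction(3333, 10000)  # four-digit decimal approximation 0.3333

print(absolute_error(x_c, x))   # 1/30000
print(relative_error(x_c, x))   # 1/10000
```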

ROUNDING

Example:

Example:

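Relative rounding error: $\dfrac{|\mathrm{fl}(x) - x|}{|x|} \le \dfrac{1}{2}\,\epsilon_{\text{mach}}$, where $\epsilon_{\text{mach}}$ is machine epsilon.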
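The bound on the relative rounding error, half of machine epsilon, can be spot-checked on a few rationals that are not exactly representable in binary. The sample values and the helper name below are chosen here for illustration, and the sketch again assumes that `float(Fraction)` rounds to nearest.

```python
from fractions import Fraction

HALF_EPS = Fraction(1, 2 ** 53)   # half of machine epsilon 2**-52

def relative_rounding_error(x):
    """Exact value of |fl(x) - x| / |x| for a nonzero rational x."""
    x = Fraction(x)
    fl_x = Fraction(float(x))     # converting a double to a Fraction is exact
    return abs(fl_x - x) / abs(x)

samples = [Fraction(1, 10), Fraction(1, 3), Fraction(94, 10)]
for x in samples:
    err = relative_rounding_error(x)
    print(x, float(err), err <= HALF_EPS)
```

A value such as 1/2, which is exactly representable, has relative rounding error 0.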

MACHINE REPRESENTATION

… …

• Sign: 1 bit, 0 for positive, 1 for negative;

• Mantissa: 52 bits, …

• Exponent: 11 bits, the positive binary integer obtained by adding 1023 to the exponent:
    • 1 ~ 2046: exponents -1022 ~ 1023;
    • 2047: infinity if the mantissa is all zeros, NaN otherwise;
    • 0: subnormal floating point numbers (small numbers including 0).
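The three fields can be read off a double with Python's `struct` module. This is a minimal sketch assuming Python floats are IEEE double precision; `decode_double` is a name invented here.

```python
import struct

def decode_double(x):
    """Return (sign, biased_exponent, mantissa) as integers from the 64 bits of x."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))  # reinterpret as a 64-bit integer
    sign = bits >> 63                      # 1 bit
    exponent = (bits >> 52) & 0x7FF        # 11 bits, stored with bias 1023
    mantissa = bits & ((1 << 52) - 1)      # 52 bits
    return sign, exponent, mantissa

print(decode_double(1.0))            # exponent field 1023 means 2**0
print(decode_double(-2.0))           # sign bit set, exponent field 1024
print(decode_double(float("inf")))   # exponent field 2047, mantissa all zeros
print(decode_double(5e-324))         # smallest subnormal: exponent field 0
```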

ADDITION OF FLOATING POINT NUMBERS

Step 1: line up the two numbers (double precision)

Step 2: add them (higher precision)

Step 3: store the result as a floating point number (double precision)

Example :

Example :