COSC 3361 Numerical Analysis I - University of Houston

COSC 3361 Numerical Analysis I
Solution of nonlinear equations (II)

Edgar Gabriel
Fall 2005


Summary of the last lecture

• Three methods presented for finding zeros of non-linear equations
– Bisection method: linear convergence
– Newton’s method: quadratic convergence
– Secant method: convergence of order ~1.62

• All methods have their advantages and disadvantages
– Number of iterations required until convergence
– Number of function evaluations required per iteration (=speed)
– Stability
– Function derivatives required/not required
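The trade-off between the three methods can be tried out with a minimal sketch. The function names, the tolerance and the test function f(x) = x^2 - 2 are assumptions chosen for illustration, not part of the lecture.

```python
def bisection(f, a, b, tol=1e-12, maxit=200):
    """Halve the bracket [a, b] until it is shorter than tol."""
    fa = f(a)
    for it in range(1, maxit + 1):
        c = 0.5 * (a + b)
        fc = f(c)
        if fa * fc <= 0:
            b = c              # sign change in [a, c]
        else:
            a, fa = c, fc      # sign change in [c, b]
        if b - a < tol:
            return 0.5 * (a + b), it
    return 0.5 * (a + b), maxit


def newton(f, df, x, tol=1e-12, maxit=50):
    """Newton's method; needs the derivative df."""
    for it in range(1, maxit + 1):
        x_new = x - f(x) / df(x)
        if abs(x_new - x) < tol:
            return x_new, it
        x = x_new
    return x, maxit


def secant(f, x0, x1, tol=1e-12, maxit=50):
    """Secant method; replaces the derivative by a difference quotient."""
    for it in range(1, maxit + 1):
        f0, f1 = f(x0), f(x1)
        if f1 == f0:           # flat secant, cannot continue
            return x1, it
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)
        if abs(x2 - x1) < tol:
            return x2, it
        x0, x1 = x1, x2
    return x1, maxit


def f(x):
    return x * x - 2.0         # zero at sqrt(2)


def df(x):
    return 2.0 * x


root_b, it_b = bisection(f, 1.0, 2.0)
root_n, it_n = newton(f, df, 1.0)
root_s, it_s = secant(f, 1.0, 2.0)
print("bisection:", it_b, "Newton:", it_n, "secant:", it_s)
```

Bisection needs far more iterations than the other two, but only one function evaluation per iteration and no derivative.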


Summary of the last lecture (II)

• None of the methods took into account the structure of the given functions (e.g. polynomials, trigonometric functions, exponential functions etc.)


Computing Roots of Polynomials

• A polynomial of the form

$$p(z) = a_n z^n + a_{n-1} z^{n-1} + \dots + a_2 z^2 + a_1 z + a_0 \qquad (1)$$

– has degree n if $a_n \neq 0$
– A polynomial of degree n has exactly n roots in the complex plane, it being agreed that each root shall be counted a number of times equal to its multiplicity

• Complex roots: e.g. $f(x) = x^2 + 1$ has no real root, its two roots are $\pm i$


Polynomials

• If a polynomial p of degree n is divided by a linear factor $z - c$, the result is a quotient q and a remainder e

$$p(z) = (z - c)\,q(z) + e$$

– q is a polynomial of degree n-1
– If c = r is a zero of p, then e = 0

• Applying such a division repeatedly leads to

$$p(z) = (z - r_1)(z - r_2) \cdots (z - r_n)\,q_n$$

– with $q_n$ being a constant
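A small worked example may help here. The quadratic below is chosen only for illustration and does not appear in the lecture.

```latex
\begin{align*}
p(z) &= z^2 - 3z + 2 \\
     &= (z - 3)\,z + 2        && \text{division by } z - 3:\ q(z) = z,\ e = 2 = p(3) \\
     &= (z - 1)(z - 2) + 0    && \text{division by } z - 1:\ c = 1 \text{ is a zero, so } e = 0 \\
     &= (z - 1)(z - 2) \cdot 1 && \text{repeated division: } r_1 = 1,\ r_2 = 2,\ q_2 = 1
\end{align*}
```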


Horner’s Algorithm

• also called nested multiplication or synthetic division
• If a polynomial p and a (complex) number $z_0$ are given, Horner’s algorithm will produce the number $p(z_0)$ and the polynomial

$$q(z) = \frac{p(z) - p(z_0)}{z - z_0} \qquad (2)$$

• From (2) we can derive

$$p(z) = (z - z_0)\,q(z) + p(z_0) \qquad (3)$$

• Let the unknown polynomial q be represented by

$$q(z) = b_0 + b_1 z + \dots + b_{n-1} z^{n-1} \qquad (4)$$


Horner’s Algorithm (II)

• Substitute (4) and (1) in (3)

$$p(z) = (z - z_0)\,q(z) + p(z_0)$$

$$a_n z^n + a_{n-1} z^{n-1} + \dots + a_1 z + a_0 = (z - z_0)(b_{n-1} z^{n-1} + \dots + b_1 z + b_0) + p(z_0)$$

• Comparison of the coefficients of like powers of z leads to

$$b_{n-1} = a_n$$
$$b_{n-2} = a_{n-1} + z_0 b_{n-1}$$
$$\vdots$$
$$b_0 = a_1 + z_0 b_1$$
$$p(z_0) = a_0 + z_0 b_0$$


Horner’s Algorithm (III)

• The algorithm in compact form

input n, a(k) for 0<=k<=n, z0
b(n-1) = a(n)
for k = n-1 : 0 step -1
    b(k-1) = a(k) + z0*b(k)
end
output b(0), …, b(n-1) and p(z0) = b(-1)
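The compact form can be sketched in Python as follows. The function name `horner_divide` and the convention that `a[k]` holds the coefficient of z^k are assumptions made for this sketch.

```python
def horner_divide(a, z0):
    """Divide p by (z - z0) using Horner's algorithm.

    a = [a_0, a_1, ..., a_n], i.e. a[k] is the coefficient of z**k.
    Returns (p(z0), b) where b = [b_0, ..., b_{n-1}] are the
    coefficients of the quotient q.
    """
    n = len(a) - 1
    if n == 0:
        return a[0], []            # constant polynomial, empty quotient
    b = [0] * n
    b[n - 1] = a[n]
    for k in range(n - 1, 0, -1):
        b[k - 1] = a[k] + z0 * b[k]
    value = a[0] + z0 * b[0]       # the remainder p(z0)
    return value, b


# p(z) = z^4 - 4z^3 + 7z^2 - 5z - 2 at z0 = 3
value, q = horner_divide([-2, -5, 7, -4, 1], 3)
print(value, q)
```

Each coefficient costs one multiplication and one addition, so evaluating p(z0) takes n multiplications in total.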


Example

• Use Horner’s algorithm to evaluate p(3) with p being

$$p(z) = z^4 - 4z^3 + 7z^2 - 5z - 2$$

        1   -4    7   -5   -2
  3          3   -3   12   21
        1   -1    4    7   19

• Thus $p(3) = 19$ and

$$p(z) = (z - 3)(z^3 - z^2 + 4z + 7) + 19$$


Taylor expansion with Horner’s algorithm

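• Suppose we are looking for the coefficients $c_k$ of

$$p(z) = a_n z^n + a_{n-1} z^{n-1} + \dots + a_2 z^2 + a_1 z + a_0$$
$$= c_n (z - z_0)^n + c_{n-1} (z - z_0)^{n-1} + \dots + c_1 (z - z_0) + c_0$$

• Elements could be calculated by the usual, inefficient formula

$$c_k = \frac{p^{(k)}(z_0)}{k!}$$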


Taylor expansion with Horner’s algorithm (II)

but:

$$c_0 = p(z_0)$$

and because of

$$q(z) = \frac{p(z) - p(z_0)}{z - z_0} = c_n (z - z_0)^{n-1} + c_{n-1} (z - z_0)^{n-2} + \dots + c_2 (z - z_0) + c_1$$

you have

$$c_1 = q(z_0)$$

• This process is repeated until all coefficients $c_k$ are found


Example

• Find the Taylor expansion at $z_0 = 3$ for the polynomial of the previous example

        1   -4    7   -5   -2
  3          3   -3   12   21
        1   -1    4    7   19
  3          3    6   30
        1    2   10   37
  3          3   15
        1    5   25
  3          3
        1    8

$$p(z) = (z - 3)^4 + 8 (z - 3)^3 + 25 (z - 3)^2 + 37 (z - 3) + 19$$
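The repeated synthetic division of the example can be sketched as one in-place loop. The function name `taylor_coefficients` and the coefficient ordering (`a[k]` belongs to z^k) are assumptions of this sketch.

```python
def taylor_coefficients(a, z0):
    """Return [c_0, ..., c_n] with p(z) = sum of c_k * (z - z0)**k.

    a = [a_0, ..., a_n]. Pass number `start` performs one Horner
    division on the current quotient and leaves c_start in work[start].
    """
    work = list(a)
    n = len(work) - 1
    for start in range(n):
        for k in range(n - 1, start - 1, -1):
            work[k] += z0 * work[k + 1]
    return work


# p(z) = z^4 - 4z^3 + 7z^2 - 5z - 2 expanded at z0 = 3
c = taylor_coefficients([-2, -5, 7, -4, 1], 3)
print(c)
```

The total work is about n^2/2 multiplications, compared with evaluating n derivatives separately.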


Taylor expansion with polynomials

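• Please note: $c_0 = p(z_0)$ and $c_1 = q(z_0) = p'(z_0)$, so two nested Horner steps deliver the value and the first derivative at the same time

function horner(n, a(k) for 0<=k<=n, z0)
    alpha = a(n)
    beta = 0
    for k = n-1 : 0 step -1
        beta = alpha + z0*beta
        alpha = a(k) + z0*alpha
    end
    return alpha, beta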
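A direct Python transcription of this function could look as follows. It is a sketch assuming the coefficients are passed as a list `a` with `a[k]` the coefficient of z^k, so n is implied by the list length.

```python
def horner(a, z0):
    """Return (p(z0), p'(z0)) for p given by a = [a_0, ..., a_n]."""
    alpha = a[-1]      # running value of p
    beta = 0           # running value of p'
    for k in range(len(a) - 2, -1, -1):
        beta = alpha + z0 * beta      # must use the old alpha
        alpha = a[k] + z0 * alpha
    return alpha, beta


# p(z) = z^4 - 4z^3 + 7z^2 - 5z - 2 at z0 = 3
value, derivative = horner([-2, -5, 7, -4, 1], 3)
print(value, derivative)
```

The order of the two updates matters: beta has to be updated before alpha is overwritten.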


Theorem on Horner’s Method

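• Let $p(z) = a_n z^n + a_{n-1} z^{n-1} + \dots + a_2 z^2 + a_1 z + a_0$. Define pairs $(\alpha_i, \beta_i)$ for i = n, n-1, …, 0 by the algorithm

$$(\alpha_n, \beta_n) = (a_n, 0)$$
$$(\alpha_j, \beta_j) = (a_j + z\,\alpha_{j+1},\ \alpha_{j+1} + z\,\beta_{j+1}) \qquad (n-1 \geq j \geq 0)$$

Then $\alpha_0 = p(z)$ and $\beta_0 = p'(z)$

• Easy to verify that the theorem is correct for n=0, since $p(z) = a_0$ and $p'(z) = 0$

• Easy to verify for n=1: $p(z) = a_1 z + a_0$

$$(\alpha_0, \beta_0) = (a_0 + z\,\alpha_1,\ \alpha_1 + z\,\beta_1) = (a_0 + z\,a_1,\ a_1) = (p(z), p'(z))$$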


Theorem on Horner’s Method (II)

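$$p(z) = a_n z^n + a_{n-1} z^{n-1} + \dots + a_2 z^2 + a_1 z + a_0$$
$$= z\,(a_n z^{n-1} + a_{n-1} z^{n-2} + \dots + a_2 z + a_1) + a_0$$
$$= a_0 + z\,q(z)$$

$$p'(z) = z\,q'(z) + q(z)$$

Suppose the theorem holds for polynomials of degree m. For degree n = m+1 the polynomial q has degree m, so the theorem gives $(\alpha_1, \beta_1) = (q(z), q'(z))$ and only one more step has to be added:

$$(\alpha_0, \beta_0) = (a_0 + z\,\alpha_1,\ \alpha_1 + z\,\beta_1)$$
$$= (a_0 + z\,q(z),\ q(z) + z\,q'(z))$$
$$= (p(z), p'(z))$$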


Newton iterations

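$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$$

For a polynomial p with the Taylor coefficients $c_0$, $c_1$ at $x_n$:

$$x_{n+1} = x_n - \frac{p(x_n)}{p'(x_n)} = x_n - \frac{c_0}{c_1}$$

input n, a(k) for 0<=k<=n, z0, maxit, precision

for j = 1 : maxit
    [alpha, beta] = horner(n, a, z0)
    z1 = z0 - alpha/beta
    if ( |z1-z0| < precision ) stop
    z0 = z1
end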
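Put together, the Newton iteration for polynomials may be sketched as below. The names `newton_polynomial` and `horner`, the default tolerances and the two test polynomials are assumptions for illustration. Complex start values work unchanged because Python arithmetic handles `complex` numbers.

```python
def horner(a, z0):
    """Return (p(z0), p'(z0)) for p given by a = [a_0, ..., a_n]."""
    alpha, beta = a[-1], 0
    for k in range(len(a) - 2, -1, -1):
        beta = alpha + z0 * beta
        alpha = a[k] + z0 * alpha
    return alpha, beta


def newton_polynomial(a, z0, maxit=50, precision=1e-12):
    """Newton iteration z1 = z0 - p(z0)/p'(z0) using Horner's algorithm."""
    for _ in range(maxit):
        alpha, beta = horner(a, z0)
        if beta == 0:
            raise ZeroDivisionError("p'(z0) = 0, choose another start value")
        z1 = z0 - alpha / beta
        if abs(z1 - z0) < precision:
            return z1
        z0 = z1
    return z0          # not converged within maxit iterations


real_root = newton_polynomial([-2, 0, 1], 1.0)         # z^2 - 2
complex_root = newton_polynomial([1, 0, 1], 1 + 1j)    # z^2 + 1
print(real_root, complex_root)
```

A real start value can never reach the complex roots of z^2 + 1, which is why the second call starts off the real axis.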


Hardware issues

• The performance of a numerical algorithm depends not only on the algorithm itself, but also on its implementation

• Some hardware issues have to be understood in order to design and efficiently implement numerical algorithms
– pipelining concept
– caches


Pipelining

• Splits a single (expensive) operation into several (cheap) sub-operations

• The sub-operations can be executed in parallel, each one working on a different element

[Figure: non-pipelined vs. pipelined execution of an operation consisting of four sub-operations for three elements; time axis t with ticks at 4, 8, 12]
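The effect for three elements and four sub-operations can be counted with a minimal sketch. It assumes that every sub-operation takes exactly one clock and that a new element can enter the pipeline every clock; the function names are made up for this illustration.

```python
def cycles_non_pipelined(n, m):
    """Each of the n elements passes all m sub-operations before the next starts."""
    return n * m


def cycles_pipelined(n, m):
    """First result after m clocks, then one further result per clock."""
    return m + (n - 1)


n_elements, n_stages = 3, 4
print(cycles_non_pipelined(n_elements, n_stages))
print(cycles_pipelined(n_elements, n_stages))
```

For long vectors the pipelined version approaches one result per clock, i.e. a speedup close to the number of stages.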


Pipelining (II)

• Example for a (fictitious) pipelined addition

[Figure: the addition $10^8 + 1$ (data: .100000e9 + .10000e1, result: .100000e9 = 0.1e9) passing through the five stages of the pipeline]

– Comparison of leading sign
– Comparison of exponents
– Alignment shift
– Addition
– Normalization


Pipelining – Metrics (I)

$T_c$: clock time, time to finish one segment/sub-operation
$m$: number of stages of the pipeline
$n$: length of the vector
$S$: startup time in clocks, time after which the first result is available, $S = m$
$N_{1/2}$: length of the loop to achieve half of the maximum speed

Assuming a simple loop like:

for (i=0; i<n; i++) {
    a[i] = b[i] + c[i];
}


Pipelining – Metrics (II)

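$op$: number of operations per loop iteration
$op_{total}$: total number of operations for the loop, with

$$op_{total} = op \cdot n$$

Speed of the loop is

$$F = \frac{op_{total}}{time} = \frac{op \cdot n}{T_c\,(m + (n-1))} = \frac{op}{T_c\,\left(1 + \frac{m-1}{n}\right)}$$

For $n \to \infty$ we get

$$F_{max} = \frac{op}{T_c}$$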
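The speed formula can be evaluated directly. The sketch below assumes one operation per iteration by default; the function names and the numbers in the example are chosen for illustration only.

```python
def loop_speed(n, m, t_c, op=1):
    """Operations per second for a loop of length n on an m-stage pipeline."""
    return op * n / (t_c * (m + n - 1))


def peak_speed(t_c, op=1):
    """Limit of loop_speed for n -> infinity."""
    return op / t_c


t_c = 0.5e-9     # hypothetical clock time of a 2 GHz processor
for n in (1, 10, 100, 10000):
    print(n, loop_speed(n, 4, t_c) / peak_speed(t_c))
```

The printed ratio F/F_max grows towards 1 with increasing loop length.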


Pipelining – Metrics (III)

Because of the definition of $N_{1/2}$ we now get

$$\frac{op}{T_c\,\left(1 + \frac{m-1}{N_{1/2}}\right)} = \frac{1}{2} F_{max} = \frac{1}{2}\,\frac{op}{T_c}$$

or

$$1 + \frac{m-1}{N_{1/2}} = 2$$

and

$$N_{1/2} = m - 1$$

→ the length of the loop required to achieve half of the theoretical peak performance of a pipeline is approximately equal to the number of segments (stages) of the pipeline


Pipelining – Metrics (IV)

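More general: $N_\alpha$ is defined through

$$\frac{op}{T_c\,\left(1 + \frac{m-1}{N_\alpha}\right)} = \alpha\,\frac{op}{T_c}$$

and leads to

$$N_\alpha = \frac{m-1}{\frac{1}{\alpha} - 1} \approx m \cdot \frac{1}{\frac{1}{\alpha} - 1}$$

E.g. for $\alpha = \frac{3}{4}$ you get

$$N_{3/4} \approx 3 \cdot m$$

→ the closer you would like to get to the maximum performance of your pipeline, the larger the iteration count of your loop has to be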
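The relation between the target fraction of peak performance and the required loop length can be sketched as follows. It uses the exact expression with m - 1 rather than the approximation with m; the function names are assumptions.

```python
def n_alpha(m, alpha):
    """Loop length needed to reach the fraction alpha (0 < alpha < 1) of peak speed."""
    return (m - 1) * alpha / (1 - alpha)


def fraction_of_peak(n, m):
    """F / F_max for a loop of length n on an m-stage pipeline."""
    return n / (m + n - 1)


m = 5
for alpha in (0.5, 0.75, 0.9, 0.99):
    n = n_alpha(m, alpha)
    print(alpha, n, fraction_of_peak(n, m))
```

The required loop length grows without bound as alpha approaches 1.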


The memory bottleneck (I)

• Every loop iteration requires 3 memory operations
– 2 loads
– 1 store

for (i=0; i<n; i++ ) {
    c[i] = a[i] + b[i];
}

• For a micro-processor having a frequency of 2 GHz this would require

$$3 \cdot 4\ \text{Bytes} \cdot 2 \cdot 10^9\ \text{s}^{-1} = 24\ \text{GBytes/s}$$

to satisfy one Floating Point Unit (FPU)
• Most modern processors have 2 FPUs and 2 integer units (IUs) which can work in parallel
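The bandwidth estimate is a simple product and can be recomputed for other operand sizes. The sketch assumes one loop iteration per clock and 4-byte operands as in the formula above; the function name is made up.

```python
def required_bandwidth(mem_ops_per_iteration, bytes_per_operand, iterations_per_second):
    """Bytes per second the memory has to deliver to keep one FPU busy."""
    return mem_ops_per_iteration * bytes_per_operand * iterations_per_second


single = required_bandwidth(3, 4, 2 * 10**9)   # 4-byte operands
double = required_bandwidth(3, 8, 2 * 10**9)   # 8-byte operands
print(single / 10**9, "GBytes/s", double / 10**9, "GBytes/s")
```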


Memory technology (www.kingston.com/newtech)

• DDR: Double Data Rate SDRAM
• Bandwidth of a memory module

$$SB_{max} = SB_{Bus} \cdot f_{Bus} \cdot Op/Cycle$$

with
$SB_{max}$: max. memory bandwidth
$SB_{Bus}$: bandwidth of the memory bus (64 Bit = 8 Bytes)
$f_{Bus}$: frequency of the memory bus
$Op/Cycle$: transfers per bus cycle (1 for SDRAM, 2 for DDR)


Memory bandwidth

Name            Frequency of memory bus (MHz)    max. bandwidth
PC100 SDRAM     100                              800 MB/s
PC133 SDRAM     133                              1.1 GB/s
PC1600 DDR      100                              1.6 GB/s
PC2100 DDR      133                              2.1 GB/s
PC2700 DDR      166                              2.7 GB/s
PC3200 DDR      200                              3.2 GB/s
PC3700 DDR      233                              3.7 GB/s
PC4200 DDR      266                              4.2 GB/s
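The table entries follow from the bandwidth formula of a memory module. The sketch assumes an 8-byte wide bus, one transfer per cycle for SDRAM and two for DDR; the function name is hypothetical.

```python
def module_bandwidth(bus_width_bytes, bus_frequency_hz, ops_per_cycle):
    """Maximum bandwidth of a memory module in bytes per second."""
    return bus_width_bytes * bus_frequency_hz * ops_per_cycle


pc100 = module_bandwidth(8, 100 * 10**6, 1)     # PC100 SDRAM
pc3200 = module_bandwidth(8, 200 * 10**6, 2)    # PC3200 DDR
print(pc100 / 10**6, "MB/s", pc3200 / 10**9, "GB/s")
```

Even the fastest module in the table stays far below the 24 GBytes/s that the simple vector addition loop would need.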


Memory modules (cont.)

• Dual Channel Memory: 2 I/O channels between memory controller and memory module

• DDR2: further evolution of the DDR technology
– uses 1.8 Volts vs. 2.5 Volts technology
– larger capacity of the chips
– higher frequency

Name        Frequency of memory bus    Bandwidth of a module    Dual Channel DDR2 bandwidth
PC2-3200    400 MHz                    3.2 GB/s                 6.4 GB/s
PC2-4200    533 MHz                    4.2 GB/s                 8.4 GB/s
PC2-5300    667 MHz                    5.3 GB/s                 10.6 GB/s
PC2-6400    800 MHz                    6.4 GB/s                 12.8 GB/s


Memory interleaving

• Split the main memory into several physical areas (banks)

• each area can serve a memory request without blocking the other ones
– several memory requests can be interleaved as long as they are using different memory banks
• A PC has between 1 and 4 memory banks
• (A vector supercomputer, e.g. NEC SX-8, has 8192 memory banks)


Memory hierarchies

Level                          Size           Access time [cycles]
Register                       < 256 Words    1 - 2
Caches                         ~ 1 MB         2 - 50
Main memory                    ~ 1 GB         100 - 1000
Primary data storage (disk)    ~ 100 GB       > 10^6
Backup (tape)                  TB, PB


Caches

• Fast memory which is closer to the CPU/FPU
• significantly smaller than the main memory
• often also organized in several hierarchy levels
– level 1 cache
– level 2 cache
– …

• The lower the level number, the closer the cache is to the CPU, and the faster and smaller it is

• Reason for not having only fast memory (=cache): money, money, money…


Caches (II)

• Caches are organized in cache-lines
– e.g. on a PC a cache-line is typically 64 bytes

• cache hit: the data which the processor is asking for is already in the cache

• cache miss: the data which the processor is asking for is not in the cache yet
– well-performing code needs a high ratio of cache hits to cache misses
– compilers/processors try to avoid cache misses through techniques like pre-fetching etc.


Caches (III)

• If a data element has to be loaded into the cache first, the whole cache-line is loaded
– more than the processor asked for
– the program should use this data, else the load of the whole cache-line has been wasted!
• Caches are organized internally either as
– direct mapped cache
– n-way associative cache


Direct mapped cache

• The cache is split into chunks of the length of a cache-line
• Each address in the memory can uniquely be mapped into a block of the cache

[Figure: memory addresses 0xffff8e10, 0xffff8e50, …, 0xffff8f90 and the cache blocks they are mapped to]

• Problem with direct mapped cache: two addresses which map onto the same cache-block can cause consecutive cache misses
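The conflict problem can be reproduced with a tiny simulation. The cache geometry (8 blocks of 64 bytes) and all names are assumptions chosen to keep the example small; real caches are much larger.

```python
LINE_SIZE = 64     # bytes per cache-line (assumed)
NUM_BLOCKS = 8     # number of cache blocks (hypothetical, tiny on purpose)


def cache_block(address):
    """Index of the cache block a memory address is mapped to."""
    return (address // LINE_SIZE) % NUM_BLOCKS


def count_misses(addresses):
    """Simulate a direct mapped cache and count the cache misses."""
    stored = [None] * NUM_BLOCKS       # memory line held by each block
    misses = 0
    for address in addresses:
        line = address // LINE_SIZE
        block = line % NUM_BLOCKS
        if stored[block] != line:
            misses += 1
            stored[block] = line       # replace the previous content
    return misses


a = 0x1000
b = a + LINE_SIZE * NUM_BLOCKS         # maps onto the same block as a
c = a + LINE_SIZE                      # maps onto the neighbouring block

print(count_misses([a, b, a, b]))      # a and b evict each other
print(count_misses([a, c, a, c]))      # a and c coexist
```

In the first access pattern every access is a miss although only two cache-lines are used.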


n-way associative cache

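• Cache is split into chunks of the length of a cache-line
• each chunk is replicated n times
– n is typically 2 or 4

[Figure: memory addresses 0xffff8e10, 0xffff8e50, …, 0xffff8f90 and the cache blocks they are mapped to]

• The conflict problem of the direct mapped cache is mitigated: up to n addresses mapping onto the same cache-block can be held at once
• Algorithms for determining which entry of a cache-block has to be replaced
– random replacement
– least used replacement
– longest not touched replacement (least recently used, LRU)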
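A minimal simulation of an n-way associative cache with "longest not touched" (LRU) replacement is sketched below. The geometry (8 sets, 64-byte lines) and the names are assumptions for illustration.

```python
from collections import OrderedDict


def count_misses(addresses, ways=2, num_sets=8, line_size=64):
    """Simulate an n-way associative cache with LRU replacement."""
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for address in addresses:
        line = address // line_size
        entries = sets[line % num_sets]
        if line in entries:
            entries.move_to_end(line)          # mark as most recently used
        else:
            misses += 1
            if len(entries) == ways:
                entries.popitem(last=False)    # evict the longest not touched entry
            entries[line] = True
    return misses


a, b, c = 0, 8 * 64, 16 * 64                   # all three map onto set 0

print(count_misses([a, b, a, b], ways=1))      # behaves like a direct mapped cache
print(count_misses([a, b, a, b], ways=2))      # both lines fit
print(count_misses([a, b, c, a, b, c], ways=2))
```

The last pattern shows the limit: three conflicting lines cycled through a 2-way cache with LRU replacement still miss on every access.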


Cache coherence protocols

• Problem: what happens if two processors share some data
– if one processor modifies the data in its cache, the copy of the same element in the other cache has to be invalidated

[Figure: CPU 1 and CPU 2, each with its own cache, connected to a shared memory]
