COSC 3361 – Numerical Analysis I
Solution of nonlinear equations (II)
Edgar Gabriel, Fall 2005
Summary of the last lecture
• Three methods presented for finding zeros of non-linear equations
  – Bisection method: linear convergence
  – Newton's method: quadratic convergence
  – Secant method: convergence of order ~1.62
• All methods have their advantages and disadvantages
  – Number of iterations required until convergence
  – Number of function evaluations required per iteration (= speed)
  – Stability
  – Function derivatives required/not required
Summary of the last lecture (II)
• None of the methods took into account the structure of the provided functions (e.g. polynomials, trigonometric functions, exponential functions etc.)
Computing Roots of Polynomials
• A polynomial of the form
    $p(z) = a_n z^n + a_{n-1} z^{n-1} + \dots + a_2 z^2 + a_1 z + a_0$    (1)
  – has degree n if $a_n \neq 0$
  – A polynomial of degree n has exactly n roots in the complex plane, it being agreed that each root shall be counted a number of times equal to its multiplicity
• Complex roots: e.g. $f(x) = x^2 + 1$ has no real roots, only the complex roots $x = \pm i$
Polynomials
• If a polynomial p of degree n is divided by a linear factor $z - c$, the result is a quotient q and a remainder e
    $p(z) = (z - c) q(z) + e$
  – q is a polynomial of degree n-1
  – If c = r is a zero of p, then e = 0
• Applying such a division repeatedly leads to
    $p(z) = (z - r_1)(z - r_2) \cdots (z - r_n) q_n$
  – with $q_n$ being a constant
Horner’s Algorithm
• also called nested multiplication or synthetic division
• If a polynomial p and a (complex) number $z_0$ are given, Horner's algorithm will produce the number $p(z_0)$ and the polynomial
    $q(z) = \frac{p(z) - p(z_0)}{z - z_0}$    (2)
• From (2) we can derive
    $p(z) = (z - z_0) q(z) + p(z_0)$    (3)
• Let the unknown polynomial q be represented by
    $q(z) = b_0 + b_1 z + \dots + b_{n-1} z^{n-1}$    (4)
Horner’s Algorithm (II)
• Substitute (4) and (1) in (3)
    $p(z) = (z - z_0) q(z) + p(z_0)$
    $a_n z^n + a_{n-1} z^{n-1} + \dots + a_1 z + a_0 = (z - z_0)(b_{n-1} z^{n-1} + \dots + b_1 z + b_0) + p(z_0)$
• Comparison of the coefficients of like powers of z leads to
    $b_{n-1} = a_n$
    $b_{n-2} = a_{n-1} + z_0 b_{n-1}$
    …
    $b_0 = a_1 + z_0 b_1$
    $p(z_0) = a_0 + z_0 b_0$
Horner’s Algorithm (III)
• The algorithm in compact form
input n, a(k) (0<=k<=n), z0
b(n-1) = a(n)
for k = n-1 : 0 step -1
    b(k-1) = a(k) + z0*b(k)
end
output b(0), ..., b(n-1) and p(z0) = b(-1)
Example
• Use Horner's algorithm to evaluate p(3) with p being
    $p(z) = z^4 - 4z^3 + 7z^2 - 5z - 2$

          1   -4    7   -5   -2
    3          3   -3   12   21
          1   -1    4    7   19

• Thus p(3) = 19 and
    $p(z) = (z - 3)(z^3 - z^2 + 4z + 7) + 19$
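The compact form of Horner's algorithm can be checked against this example with a short Python sketch. The function name `horner_divide` and the convention of storing coefficients lowest power first (so that `a[k]` matches a(k) in the pseudocode) are assumptions made for illustration, not part of the lecture.

```python
def horner_divide(a, z0):
    """Divide p(z) by (z - z0) using Horner's algorithm.

    a[k] is the coefficient of z**k (so a[-1] is a_n).
    Returns (b, p_z0): b[k] are the coefficients of the quotient q,
    p_z0 is the value p(z0).
    """
    n = len(a) - 1
    if n == 0:
        return [], a[0]          # constant polynomial: empty quotient
    b = [0] * n
    b[n - 1] = a[n]
    for k in range(n - 1, 0, -1):
        b[k - 1] = a[k] + z0 * b[k]
    p_z0 = a[0] + z0 * b[0]
    return b, p_z0


# p(z) = z^4 - 4z^3 + 7z^2 - 5z - 2, stored lowest power first
b, value = horner_divide([-2, -5, 7, -4, 1], 3)
print(value)  # 19
print(b)      # [7, 4, -1, 1], i.e. q(z) = z^3 - z^2 + 4z + 7
```

Each coefficient costs one multiplication and one addition, which is why the scheme is preferred over evaluating the powers of z separately.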
Taylor expansion with Horner’s algorithm
• Suppose we are looking for the coefficients $c_k$ of
    $p(z) = a_n z^n + a_{n-1} z^{n-1} + \dots + a_2 z^2 + a_1 z + a_0$
    $\quad = c_n (z - z_0)^n + c_{n-1} (z - z_0)^{n-1} + \dots + c_1 (z - z_0) + c_0$
• Elements could be calculated by the usual, inefficient formula
    $c_k = \frac{p^{(k)}(z_0)}{k!}$
Taylor expansion with Horner’s algorithm
but:
    $c_0 = p(z_0)$
and because of
    $q(z) = \frac{p(z) - p(z_0)}{z - z_0} = c_n (z - z_0)^{n-1} + c_{n-1} (z - z_0)^{n-2} + \dots + c_2 (z - z_0) + c_1$
you have
    $c_1 = q(z_0)$
• This process is repeated until all coefficients $c_k$ are found
Example
• Find the Taylor expansion at $z_0 = 3$ for the polynomial of the previous example

          1   -4    7   -5   -2
    3          3   -3   12   21
          1   -1    4    7   19
    3          3    6   30
          1    2   10   37
    3          3   15
          1    5   25
    3          3
          1    8

    $p(z) = (z - 3)^4 + 8 (z - 3)^3 + 25 (z - 3)^2 + 37 (z - 3) + 19$
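The tableau can be reproduced by applying Horner's algorithm repeatedly to the successive quotients. The following Python sketch is one way to write that loop; the function name `taylor_coefficients` and the highest-power-first ordering (chosen to match the first row of the tableau) are illustrative assumptions.

```python
def taylor_coefficients(coeffs, z0):
    """Return [c_0, c_1, ..., c_n] of the expansion of p around z0.

    coeffs lists the coefficients of p from the highest power down to
    the constant term, as in the first row of the tableau.
    """
    row = list(coeffs)
    c = []
    while row:
        # One Horner pass: the last value is the remainder c_k, the
        # values before it are the coefficients of the next quotient.
        values = [row[0]]
        for a in row[1:]:
            values.append(a + z0 * values[-1])
        c.append(values[-1])
        row = values[:-1]
    return c


print(taylor_coefficients([1, -4, 7, -5, -2], 3))  # [19, 37, 25, 8, 1]
```

The output lists c_0 first, so it reads the right-hand column of the tableau from top to bottom, followed by the leading coefficient.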
Taylor expansion with polynomials
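• Please note: $c_0 = p(z_0)$ and $c_1 = q(z_0) = p'(z_0)$ (from $c_k = p^{(k)}(z_0)/k!$ with k = 1), so two Horner passes deliver both the value and the first derivative

function horner( n, a(k) (0<=k<=n), z0 )
    alpha = a(n)
    beta = 0
    for k = n-1 : 0 step -1
        beta = alpha + z0*beta
        alpha = a(k) + z0*alpha
    end
    return alpha, beta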
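A direct Python transcription of this function is sketched below, checked against the polynomial of the earlier example. The name `horner_with_derivative` and the lowest-power-first list layout (`a[k]` multiplies z**k) are assumptions for illustration.

```python
def horner_with_derivative(a, z0):
    """Return (p(z0), p'(z0)) for p(z) = sum of a[k] * z**k."""
    n = len(a) - 1
    alpha = a[n]   # running value of p
    beta = 0       # running value of p'
    for k in range(n - 1, -1, -1):
        # beta must be updated first: it needs the old alpha
        beta = alpha + z0 * beta
        alpha = a[k] + z0 * alpha
    return alpha, beta


# p(z) = z^4 - 4z^3 + 7z^2 - 5z - 2 at z0 = 3
print(horner_with_derivative([-2, -5, 7, -4, 1], 3))  # (19, 37)
```

The result agrees with c_0 = 19 and c_1 = 37 from the Taylor expansion example.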
Theorem on Horner’s Method
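• Let $p(z) = a_n z^n + a_{n-1} z^{n-1} + \dots + a_2 z^2 + a_1 z + a_0$. Define pairs $(\alpha_j, \beta_j)$ for j = n, n-1, …, 0 by the algorithm
    $(\alpha_n, \beta_n) = (a_n, 0)$
    $(\alpha_j, \beta_j) = (a_j + z \alpha_{j+1}, \; \alpha_{j+1} + z \beta_{j+1}) \qquad (n-1 \geq j \geq 0)$
  Then $\alpha_0 = p(z)$ and $\beta_0 = p'(z)$
• Easy to verify that the theorem is correct for n=0, since $p(z) = a_0$ and $p'(z) = 0$
• Easy to verify for n=1: $p(z) = a_1 z + a_0$
    $(\alpha_0, \beta_0) = (a_0 + z \alpha_1, \; \alpha_1 + z \beta_1) = (a_0 + z a_1, \; a_1) = (p(z), p'(z))$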
Theorem on Horner’s Method (II)
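    $p(z) = a_n z^n + a_{n-1} z^{n-1} + \dots + a_2 z^2 + a_1 z + a_0$
    $\quad = a_0 + z (a_n z^{n-1} + a_{n-1} z^{n-2} + \dots + a_2 z + a_1)$
    $\quad = a_0 + z q(z)$
    $p'(z) = z q'(z) + q(z)$
Suppose the theorem holds for polynomials of degree m. For degree n = m+1 the steps j = n, …, 1 are exactly the algorithm applied to q, which has degree m, so $(\alpha_1, \beta_1) = (q(z), q'(z))$. You then have to add one more step:
    $(\alpha_0, \beta_0) = (a_0 + z \alpha_1, \; \alpha_1 + z \beta_1)$
    $\quad = (a_0 + z q(z), \; q(z) + z q'(z))$
    $\quad = (p(z), p'(z))$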
Newton iterations
    $x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$
For a polynomial p:
    $x_{n+1} = x_n - \frac{p(x_n)}{p'(x_n)} = x_n - \frac{c_0}{c_1}$

input n, a(k) (0<=k<=n), z0, maxit, precision
for j = 1 : maxit
    [alpha, beta] = horner( n, a, z0 )
    z1 = z0 - alpha/beta
    if ( |z1-z0| < precision ) stop
    z0 = z1
end
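The iteration can be tried out with a small Python sketch. The function names, the default values for `maxit` and `precision`, and the two test polynomials are assumptions chosen for illustration; the complex starting value shows that the same loop finds the complex roots of $z^2 + 1$.

```python
def horner(a, z0):
    """Return (p(z0), p'(z0)); a[k] is the coefficient of z**k."""
    alpha, beta = a[-1], 0
    for k in range(len(a) - 2, -1, -1):
        beta = alpha + z0 * beta
        alpha = a[k] + z0 * alpha
    return alpha, beta


def newton_polynomial(a, z0, maxit=50, precision=1e-12):
    """Newton iteration for a polynomial, using Horner for p and p'."""
    for _ in range(maxit):
        alpha, beta = horner(a, z0)
        z1 = z0 - alpha / beta      # raises ZeroDivisionError if p'(z0) == 0
        if abs(z1 - z0) < precision:
            return z1
        z0 = z1
    raise RuntimeError("no convergence within maxit iterations")


print(newton_polynomial([-2, 0, 1], 1.0))     # real root of z^2 - 2
print(newton_polynomial([1, 0, 1], 1 + 1j))   # complex root of z^2 + 1
```

The sketch stops on the step size, as in the pseudocode; a production routine would typically also check the residual |p(z)| and guard against a vanishing derivative.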
Hardware issues
• The performance of a numerical algorithm does not only depend on the algorithm, but also on the implementation of the algorithm
• Some hardware issues have to be understood in order to design and efficiently implement numerical algorithms
  – pipelining concept
  – caches
Pipelining
• Splits a single (expensive) operation into several (cheap) sub-operations
• The sub-operations can be executed in parallel, each one working on a different element
[Figure: two timelines over t (ticks at 4, 8, 12). Top: non-pipelined execution of an operation consisting of four sub-operations for three elements. Bottom: pipelined execution of the same operation for three elements.]
Pipelining (II)
• Example for a (fictitious) pipelined addition
[Figure: the five pipeline stages (comparison of leading sign, comparison of exponents, alignment shift, addition, normalization) applied to the data $(10^8) + (1)$, i.e. .100000e9 + .10000e1, with the result 0.1e9]
Pipelining – Metrics (I)
$T_c$: clock time, time to finish one segment/sub-operation
$m$: number of stages of the pipeline
$n$: length of the vector
$S$: startup time in clocks, time after which the first result is available, $S = m$
$N_{1/2}$: length of the loop to achieve half of the maximum speed
Assuming a simple loop like:
    for (i=0; i<n; i++) {
        a[i] = b[i] + c[i];
    }
Pipelining – Metrics (II)
$op$: number of operations per loop iteration
$op_{total}$: total number of operations for the loop, with $op_{total} = op \cdot n$
Speed of the loop is
    $F = \frac{op_{total}}{time} = \frac{op \cdot n}{T_c (m + (n - 1))} = \frac{op}{T_c (\frac{m-1}{n} + 1)}$
For $n \to \infty$ we get
    $F_{max} = \frac{op}{T_c}$
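To get a feeling for how quickly F approaches F_max, the formula can be evaluated for a few loop lengths. The Python sketch below assumes a hypothetical pipeline with m = 4 stages and a clock time of 1 ns; these numbers and the function names are illustrative only.

```python
def loop_speed(op, n, m, t_c):
    """F = op*n / (T_c * (m + (n - 1))), in operations per second."""
    return op * n / (t_c * (m + (n - 1)))


def max_speed(op, t_c):
    """F_max = op / T_c, the limit of loop_speed for n -> infinity."""
    return op / t_c


# Hypothetical pipeline: 4 stages, 1 ns per stage, 1 operation per iteration
m, t_c, op = 4, 1e-9, 1
for n in (1, 3, 10, 100, 1000):
    fraction = loop_speed(op, n, m, t_c) / max_speed(op, t_c)
    print(n, round(fraction, 3))
```

For this pipeline a loop of length 3 already reaches half of the peak speed, while getting above 99% needs a loop length in the hundreds.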
Pipelining – Metrics (III)
Because of the definition of $N_{1/2}$ we now get
    $\frac{op}{T_c (\frac{m-1}{N_{1/2}} + 1)} = \frac{1}{2} F_{max} = \frac{1}{2} \frac{op}{T_c}$
or
    $\frac{m-1}{N_{1/2}} + 1 = 2$
and
    $N_{1/2} = m - 1$
→ the length of the loop required to achieve half of the theoretical peak performance of a pipeline is one less than, i.e. approximately equal to, the number of segments (stages) of the pipeline
Pipelining – Metrics (IV)
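More general: $N_\alpha$ is defined through
    $\frac{op}{T_c (\frac{m-1}{N_\alpha} + 1)} = \alpha \frac{op}{T_c}$
and leads to
    $N_\alpha = \frac{m-1}{\frac{1}{\alpha} - 1} \approx m \cdot \frac{1}{\frac{1}{\alpha} - 1}$
E.g. for $\alpha = \frac{3}{4}$ you get
    $N_{3/4} \approx 3 m$
→ the closer you would like to get to the maximum performance of your pipeline, the larger the iteration count of your loop has to be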
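The growth of the required loop length with α can be tabulated with a few lines of Python. The sketch uses exact fractions to avoid rounding noise; the function names and the choice m = 4 are assumptions for illustration.

```python
from fractions import Fraction


def n_alpha_exact(m, alpha):
    """Solve (m - 1)/N + 1 = 1/alpha for N."""
    return (m - 1) / (1 / alpha - 1)


def n_alpha_approx(m, alpha):
    """Approximation used on the slide: m - 1 replaced by m."""
    return m / (1 / alpha - 1)


m = 4  # hypothetical number of pipeline stages
for alpha in (Fraction(1, 2), Fraction(3, 4), Fraction(9, 10), Fraction(99, 100)):
    print(alpha, n_alpha_exact(m, alpha), n_alpha_approx(m, alpha))
```

For α = 3/4 this gives 9 (exact) versus 12 (approximation, 3m), and for α = 99/100 the loop already needs 297 iterations on a four-stage pipeline.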
The memory bottleneck (I)
    for (i=0; i<n; i++ ) {
        c[i] = a[i] + b[i];
    }
• Every loop iteration requires 3 memory operations
  – 2 loads
  – 1 store
• For a micro-processor having a frequency of 2 GHz this would require
    $3 \cdot 4\,\text{Bytes} \cdot 2 \cdot 10^9\,\text{s}^{-1} = 24\,\text{GBytes/s}$
  to satisfy one Floating Point Unit (FPU)
• Most modern processors have 2 FPUs and 2 IUs (integer units) which can work in parallel
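The arithmetic behind this estimate is easy to vary. The Python sketch below assumes, as the 4-byte factor suggests, single-precision elements and one loop iteration completed per clock cycle; the function name and the 8-byte (double precision) variant are illustrative assumptions.

```python
def required_bandwidth(mem_ops_per_iter, bytes_per_element, frequency_hz):
    """Bytes per second needed to feed one FPU that finishes one
    loop iteration per clock cycle."""
    return mem_ops_per_iter * bytes_per_element * frequency_hz


single = required_bandwidth(3, 4, 2 * 10**9)
print(single / 10**9, "GB/s")  # 24.0 GB/s

# Assumption: 8-byte double precision elements instead of 4-byte floats
double = required_bandwidth(3, 8, 2 * 10**9)
print(double / 10**9, "GB/s")  # 48.0 GB/s
```

Both figures are far above the memory module bandwidths of a few GB/s, which is the bottleneck the slide title refers to.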
Memory technology (www.kingston.com/newtech)
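• DDR: Double Data Rate SDRAM
• Bandwidth of a memory module
    $SB_{max} = SB_{Bus} \cdot f_{Bus} \cdot \text{Op/Cycle}$
  with
    $SB_{max}$: max. memory bandwidth
    $SB_{Bus}$: bandwidth (width) of the memory bus (64 Bit = 8 Bytes)
    $f_{Bus}$: frequency of the memory bus
    Op/Cycle: number of transfers per bus cycle (1 for SDRAM, 2 for DDR)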
Memory bandwidth
Name            Frequency of memory bus (MHz)    max. bandwidth
PC100 SDRAM     100                              800 MB/s
PC133 SDRAM     133                              1.1 GB/s
PC1600 DDR      100                              1.6 GB/s
PC2100 DDR      133                              2.1 GB/s
PC2700 DDR      166                              2.7 GB/s
PC3200 DDR      200                              3.2 GB/s
PC3700 DDR      233                              3.7 GB/s
PC4200 DDR      266                              4.2 GB/s
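The table entries follow from the bandwidth formula of the previous slide. The Python sketch below recomputes a few of them, assuming an 8-byte bus and Op/Cycle = 1 for SDRAM and 2 for DDR; the function name and the selection of modules are illustrative.

```python
def max_bandwidth_mb_s(bus_width_bytes, bus_frequency_mhz, ops_per_cycle):
    """SB_max = SB_Bus * f_Bus * Op/Cycle, in MB/s."""
    return bus_width_bytes * bus_frequency_mhz * ops_per_cycle


# (name, bus frequency in MHz, transfers per cycle)
modules = [
    ("PC100 SDRAM", 100, 1),
    ("PC133 SDRAM", 133, 1),
    ("PC1600 DDR", 100, 2),
    ("PC3200 DDR", 200, 2),
]
for name, f_mhz, ops in modules:
    print(name, max_bandwidth_mb_s(8, f_mhz, ops), "MB/s")
```

PC133 comes out as 1064 MB/s, which the table rounds to 1.1 GB/s; the DDR module names simply state the resulting bandwidth in MB/s.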
Memory modules (cont.)
• Dual Channel Memory: 2 I/O channels between memory controller and memory module
• DDR2: further evolution of the DDR technology
  – uses 1.8 Volts vs. 2.5 Volts technology
  – larger capacity of the chips
  – higher frequency

Name        Frequency of memory bus    Bandwidth of a module    Dual Channel DDR2 bandwidth
PC2-3200    400 MHz                    3.2 GB/s                 6.4 GB/s
PC2-4200    533 MHz                    4.2 GB/s                 8.4 GB/s
PC2-5300    667 MHz                    5.3 GB/s                 10.6 GB/s
PC2-6400    800 MHz                    6.4 GB/s                 12.8 GB/s
Memory interleaving
• Split the main memory into several physical areas (banks)
• Each area can serve a memory request without blocking the other ones
  – several memory requests can be interleaved as long as they are using different memory banks
• A PC has between 1 and 4 memory banks
• (A vector supercomputer, e.g. the NEC SX-8, has 8192 memory banks)
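Which accesses can be interleaved depends on how addresses are distributed over the banks. The Python sketch below assumes a simple word-interleaved layout, in which consecutive 8-byte words go to consecutive banks; this mapping, the word size, and the function name are assumptions for illustration, not a description of a specific machine.

```python
def bank_of(address, n_banks, word_bytes=8):
    """Bank index under the assumed word-interleaved layout."""
    return (address // word_bytes) % n_banks


n_banks = 4
# Stride-1 access to 8-byte elements: consecutive elements hit different banks
stride_1 = [bank_of(i * 8, n_banks) for i in range(8)]
# Stride equal to the number of banks: every access hits the same bank
stride_4 = [bank_of(i * 4 * 8, n_banks) for i in range(8)]

print(stride_1)  # [0, 1, 2, 3, 0, 1, 2, 3]
print(stride_4)  # [0, 0, 0, 0, 0, 0, 0, 0]
```

Under this assumption a stride-1 loop keeps all banks busy, whereas a stride that is a multiple of the bank count serializes on a single bank.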
Memory hierarchies
Level                          Size           Access time [cycles]
Register                       < 256 Words    1 - 2
Caches                         ~ 1 MB         2 - 50
Main memory                    ~ 1 GB         100 - 1000
Primary data storage (disk)    ~ 100 GB       > 10^6
Backup (tape)                  TB, PB
Caches
• Fast memory which is closer to the CPU/FPU
• Significantly smaller than the main memory
• Often also organized in several hierarchies
  – level 1 cache
  – level 2 cache
  – …
• The lower the level number, the closer to the CPU, the faster, and the smaller the cache
• Reason for not having only fast memory (= cache): money, money, money…
Caches (II)
• Caches are organized in cache-lines
  – e.g. on a PC a cache-line is typically 64 bytes
• cache hit: the data which the processor is asking for is already in the cache
• cache miss: the data which the processor is asking for is not in the cache yet
  – well-performing code needs a high ratio of cache hits to cache misses
  – compilers/processors try to avoid cache misses through techniques like pre-fetching etc.
Caches (III)
• If a data element has to be loaded into the cache first, the whole cache-line is loaded
  – more than the processor asked for
  – the processor better uses this data, else the load of the whole cache-line has been wasted!
• Caches are organized internally either as
  – direct mapped cache
  – n-way associative cache
Direct mapped cache
• The cache is split into chunks of the length of a cache-line
• Each address in the memory can uniquely be mapped into a block of the cache
• Problem with direct mapped cache: two addresses which map onto the same cache-block can cause consecutive cache misses
[Figure: memory addresses 0xffff8e10, 0xffff8e50, …, 0xffff8f90 mapped onto the blocks of the cache]
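The conflict problem can be made concrete with a tiny simulation. The Python sketch below assumes 64-byte cache-lines and a hypothetical direct mapped cache with 512 blocks (32 KB), using the common "line number modulo number of blocks" mapping; the address `c` is constructed to lie exactly one cache size below 0xffff8e10 and is not taken from the slide.

```python
LINE_SIZE = 64     # bytes per cache-line
NUM_BLOCKS = 512   # hypothetical: 512 blocks * 64 bytes = 32 KB cache


def cache_block(address):
    """Block of the direct mapped cache that an address maps to."""
    return (address // LINE_SIZE) % NUM_BLOCKS


def count_misses(addresses):
    """Simulate a direct mapped cache and count the cache misses."""
    cache = {}  # block index -> memory line currently stored there
    misses = 0
    for addr in addresses:
        line = addr // LINE_SIZE
        block = line % NUM_BLOCKS
        if cache.get(block) != line:
            misses += 1
            cache[block] = line   # replace whatever was in the block
    return misses


a = 0xffff8e10
b = 0xffff8e50
c = a - LINE_SIZE * NUM_BLOCKS   # hypothetical address, same block as a

print(cache_block(a), cache_block(b), cache_block(c))  # 56 57 56
print(count_misses([a, b] * 3))  # 2: only the first touch of each line misses
print(count_misses([a, c] * 3))  # 6: a and c keep evicting each other
```

Alternating between `a` and `c` misses on every access even though the rest of the cache is empty, which is the situation the last bullet describes.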
n-way associative cache
• The cache is split into chunks of the length of a cache-line
• Each chunk is replicated n times
  – n is typically 2 or 4
• The problem of the direct mapped cache is solved
• Algorithms for determining which entry of a cache-block has to be replaced
  – random replacement
  – least used replacement
  – longest not touched replacement (least recently used, LRU)
[Figure: memory addresses 0xffff8e10, 0xffff8e50, …, 0xffff8f90 mapped onto the replicated blocks of the cache]
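A small simulation shows how the replication removes the conflict between two addresses that share a cache-block. The Python sketch below assumes 64-byte cache-lines, 512 sets, and the "longest not touched" (LRU) replacement policy; the sizes, names, and the second address are hypothetical.

```python
from collections import OrderedDict

LINE_SIZE = 64   # bytes per cache-line
NUM_SETS = 512   # hypothetical number of cache-blocks (sets)


def count_misses(addresses, ways=2):
    """Simulate an n-way associative cache with LRU replacement."""
    # One ordered dict per set: keys are memory lines, oldest entry first
    sets = [OrderedDict() for _ in range(NUM_SETS)]
    misses = 0
    for addr in addresses:
        line = addr // LINE_SIZE
        entries = sets[line % NUM_SETS]
        if line in entries:
            entries.move_to_end(line)        # mark as most recently used
        else:
            misses += 1
            if len(entries) == ways:
                entries.popitem(last=False)  # evict the longest-not-touched line
            entries[line] = True
    return misses


a = 0xffff8e10
c = a - LINE_SIZE * NUM_SETS   # hypothetical address mapping to the same set

print(count_misses([a, c] * 3, ways=2))  # 2: both lines fit into the set
print(count_misses([a, c] * 3, ways=1))  # 6: behaves like a direct mapped cache
```

With three or more conflicting lines accessed cyclically a 2-way cache with LRU would thrash again, so associativity reduces conflicts rather than eliminating them entirely.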
Cache coherence protocols
• Problem: what happens if two processors share some data?
  – if one processor modifies the data in its cache, the copy of the same element in the other cache has to be invalidated
[Figure: CPU 1 and CPU 2, each with its own cache, both connected to a shared memory]