36
1 Lecture 5: Part 1 Performance Laws: Speedup and Scalability

Pc98 Lect5 Part1 Speedup

Embed Size (px)

Citation preview

Page 1: Pc98 Lect5 Part1 Speedup

1

Lecture 5: Part 1

Performance Laws:

Speedup and Scalability

Page 2: Pc98 Lect5 Part1 Speedup

2

Sequential Execution TimeExecution time = Seconds = Instructions x Cycles x Seconds

(Te) Program Program Instruction Cycle

500 MHz CPU: 500 x 106 clock cycles/secMy program consists of operations:

addition: 500 x 106 (1 cycle per inst)mul: 180 x 106 (3 cycles/inst)div: 120 x 106 (2 cycles/inst)data move: 300 x 106 (1 cycle/inst)

Expected execution time: 2 ns x (500x1+180x3+120x2+300x1)x 106

Page 3: Pc98 Lect5 Part1 Speedup

3

Basic Performance Metrics MIPS = (instructions/second)x 10-6

MFLOPS = (floating point ops/second)x 10-6

CPI = Average cycles per instruction Throughput: number of results per second Workload: W, number of Ops. required to complete

the program Speed: W/TE

Speedup (S)= Te / Timprove

Efficiency (using P processors) = Speedup / P

Page 4: Pc98 Lect5 Part1 Speedup

4

Part I: Speedup

Page 5: Pc98 Lect5 Part1 Speedup

5

Speedup General concepts of Speedup in parallel

computing: How much faster an application runs on parallel

computer ? What benefits derive from the use of parallelism?

General agreement : speedup = serial time/parallel time

Page 6: Pc98 Lect5 Part1 Speedup

6

Speedup Fixed-Workload Speedup (Amdahl’s Law) Fixed-Time Speedup (Gustafson’s Law) Fixed-Memory Speedup (Sun and Ni)

Page 7: Pc98 Lect5 Part1 Speedup

7

Amdahl's LawSpeedup due to enhancement E:

ExTime w/o E Speedup(E) = ------------- ExTime w/ E

Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected

ExTime: execution time

1/S

F F/S

Page 8: Pc98 Lect5 Part1 Speedup

8

Amdahl’s Law

ExTimenew = ExTimeold x (1 - Fractionenhanced) + Fractionenhanced

Speedupoverall =ExTimeold

ExTimenew

Speedupenhanced

=1

(1 - Fractionenhanced) + Fractionenhanced

Speedupenhanced

Page 9: Pc98 Lect5 Part1 Speedup

9

Amdahl’s Law Floating point instructions improved to run 2X;

but only 10% of actual instructions are FP

Speedupoverall =

ExTimenew =

Page 10: Pc98 Lect5 Part1 Speedup

10

Amdahl’s Law Floating point instructions improved to run 2X;

but only 10% of actual instructions are FP

Speedupoverall = 10.95

= 1.053

ExTimenew = ExTimeold x (0.9 + .1/2) = 0.95 x ExTimeold

Page 11: Pc98 Lect5 Part1 Speedup

11

Amdahl’s Law Some applications need real-time response Amdahl’s law shows the upper bound of the

achievable speedup for a given problem size. Limit on the achievable speedup:

W: total workload of W must be executed sequentially

Sp = W / ( W+(1- )(W/P))

= P / (1+(P-1) ) 1 / as P

Page 12: Pc98 Lect5 Part1 Speedup

12

Fixed-Time Speedup Gustafson’s Law (1988) Scaling the problem size along with the increase of

machine size within the same execution time Scaling for higher accuracy Parameters:

W: workload done by a single node W’ = W+(1- )WP = work done by P nodes

Sp = ( W+(1- )WP) / W = +(1-)P Speedup is linear function of P

Page 13: Pc98 Lect5 Part1 Speedup

13

Fixed-Memory Speedup Sun and Ni’s Law: Memory bounding (1993) To solve the largest possible problem, limited

only by the available memory space. As P increases, use up all the increased memory

by scaling the problem size also See Kai’s book (Section 3.6.3)

Page 14: Pc98 Lect5 Part1 Speedup

14

More on Speedup Can you measure the sequential time?

What if the memory is too small ? Is the sequential machine using the same

processor as the one in parallel machine ? Can the sequential time be shorter than the

parallel time ? Is speedup really important ? Or only the

execution time is important ?

Page 15: Pc98 Lect5 Part1 Speedup

15

More on Speedup Speedup larger than “P” (Superlinear) ?

What if the disk swapping effects happen in sequential execution?

More caches/memory used in the parallel machines (PxC, PxM)

Both sequential and parallel machines use the same OS/compiler ?

Is the data distribution/collection time included in the parallel time? Some parallel machines provide parallel I/O !!

Page 16: Pc98 Lect5 Part1 Speedup

16

More on Speedup What if we use different algorithms for sequential

and parallel solutions Take a look at the parallel sorting assignment.

What is the maximum speedup for parallel sorting?

Page 17: Pc98 Lect5 Part1 Speedup

17

Definitions of Speedup Diverse definitions of serial and parallel execution

times [Sahni:1996] Relative Real Absolute Asymptotic Asymptotic relative

Page 18: Pc98 Lect5 Part1 Speedup

18

Parameters Used in Speedup Definitions I = problem instance P = number of processors Q = parallel program n = size of I

Page 19: Pc98 Lect5 Part1 Speedup

19

Relative Speedup serial time : execution time of the

parallel program on a single node of the parallel computer. (mpirun -np 1)

Relative speedup (I,P) = time to solve I using program Q and 1 processor ÷ time to solve I using Q program and P processors

Page 20: Pc98 Lect5 Part1 Speedup

20

Relative Speedup Depends on the characteristics of the

instance I being solved as well as the size P of the parallel computer

Same OS, same node architecture, using same compiler

Extra overheads in serial time (distribution/collection).

Page 21: Pc98 Lect5 Part1 Speedup

21

Real Speedup Parallel time vs. the fastest serial

algorithm or program running on a single node of the parallel computer

Real speedup (I,P) = time to solve I using best serial program and 1 processor ÷ time to solve I using Q program and P processors

Page 22: Pc98 Lect5 Part1 Speedup

22

Problems on Real Speedup The fastest algorithm might not be

known. No single algorithm might be fastest in

all instances for some applications In practice, we use the runtime of the

most frequently used sequential algorithm.

Page 23: Pc98 Lect5 Part1 Speedup

23

Absolute Speedup Parallel time vs. the fastest sequential

algorithm run on the fastest serial computer

Absolute speedup (I,P) = time to solve I using best serial program and 1 fastest processor ÷ time to solve I using Q program and P processors

Page 24: Pc98 Lect5 Part1 Speedup

24

Absolute Speedup Can also use the sequential algorithm

most often used in practice. Time-variant: researchers keep

designing new algorithms Speedup could be less than 1

Page 25: Pc98 Lect5 Part1 Speedup

25

Asymptotic Real Speedup Compares the execution time of the best

serial algorithm for a particular problem with the asymptotic complexity of the parallel algorithm

Assuming the the parallel computer has all the processors it can use

Page 26: Pc98 Lect5 Part1 Speedup

26

Asymptotic Real Speedup Asymptotic real speedup (n) =

asymptotic complexity of best serial algorithm ÷ asymptotic complexity of Q using as many processors as needed

Page 27: Pc98 Lect5 Part1 Speedup

27

Asymptotic Real Speedup For problem such as sorting where the

asymptotic complexity is not uniquely characterized by the instance size n, the worst-case complexity is used -- O(n2), O(n log n)

Unbounded number of processors.

Page 28: Pc98 Lect5 Part1 Speedup

28

Asymptotic Relative Speedup Uses the asymptotic time complexity of

the parallel algorithm when run on a single processor.

Asymptotic relative speedup = asymptotic complexity of Q using 1 processor ÷ asymptotic complexity of Q using as many processors as needed

Page 29: Pc98 Lect5 Part1 Speedup

29

Matrix Multiplication (n X n) Serial time: O(n3) Parallel time (on n3/log n processors

hypercube): O(log n) time Asymptotic Relative Speedup: O(n3/log n)

Others are measured Speedup Relative Real Absolute

Asymptotic Relative Speedup

Page 30: Pc98 Lect5 Part1 Speedup

30

Part II: Scalability

Page 31: Pc98 Lect5 Part1 Speedup

31

Scalability Algorithmic Scalability:

the available parallelism increases at least linearly with problem size.

Architectural (Size) Scalability: the architecture continues to yield the

same performance per processor, as the number of processors is increased and as the problem size is increased.

Page 32: Pc98 Lect5 Part1 Speedup

32

Architectural Scalability

Architecturalscalability

Interconnection Network: latency/bandwidth

I/O performance

OS support(lock, processessynchronizationoverheads)

Processor SpeedNo. of issues, depthof pipelining

Memory Subsystem:cache/memory speed

Performance grows in all aspects

Page 33: Pc98 Lect5 Part1 Speedup

33

Discussion of Architectural Scalability: bus-based SMPs Bus (single set of wires) : fixed bandwidth shared by all

processors (P) Bandwidth accessible by each processor decreases as P

increases Physical constraints: fixed number of slots Electrical constraints:

bus loading, wire length determine frequency power

Cost scaling: X (bus is the core) OS scaling : scheduling, locking, I/O

Page 34: Pc98 Lect5 Part1 Speedup

34

Scalable Interconnection Network

Bandwidth (P)? Latency (P)? Cost (P)? More bandwidth, smaller

L greater cost SGI XL: 1.5 M HK$ (1.2

GB/s) ALR SMP: 0.1 M HK$

(533 MB/s)

Network (bus ? multistage,mesh, torus)

PE PE PE PE

Page 35: Pc98 Lect5 Part1 Speedup

35

Bottom Line There is NO “truly scalable” machine Goal is to design for a given “range of scale”

one (instruction level) two to tens (SMP):

Top-end: Sun Enterprise 1000 : 64 processors, COMPACQ 64 Alpha processors

tens to a few thousands (Distributed-Memory Multicomputers: T3E, SP2, Paragon, ...)

tens of thousands (ASCI machines: cluster of SMPs) Techniques at one scale may not be cost effect at another

Page 36: Pc98 Lect5 Part1 Speedup

36

Bottom Line Even though we can meet the engineering requirements

to physically scale over the range of interest, we must ensure that the communication and synchronization operations required to support target programming models also scale and are cost effective.