CSC 4250 Computer Architectures October 27, 2006 Chapter 3. Instruction-Level Parallelism & Its Dynamic Exploitation




Nested Loops

        DADDIU R1,R0,#80
Loop1:  L.D    F2,1600(R1)
        DADDIU R2,R0,#40
Loop2:  L.D    F0,1000(R2)
        ADD.D  F0,F0,F2
        S.D    F0,1000(R2)
        DADDIU R2,R2,#-8
        BNEZ   R2,Loop2
        DADDIU R1,R1,#-8
        BNEZ   R1,Loop1

How many times do Loop1 and Loop2 iterate?
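One way to answer the question is to trace the two counter registers. The sketch below is a plain Python model of the DADDIU/BNEZ pairs, not a MIPS simulator. The helper name `count_iterations` is made up for illustration, and it assumes the start value is a positive multiple of 8 so that the register actually reaches zero.

```python
def count_iterations(start, step=-8):
    """Count body executions of a loop that exits when its counter register hits zero."""
    reg = start               # DADDIU Rx,R0,#start
    iterations = 0
    outcomes = []             # outcome of the closing BNEZ on each pass
    while True:
        iterations += 1       # the loop body runs once
        reg += step           # DADDIU Rx,Rx,#-8
        taken = reg != 0      # BNEZ Rx,Loop is taken while Rx != 0
        outcomes.append("T" if taken else "N")
        if not taken:
            return iterations, "".join(outcomes)

loop1_count, loop1_history = count_iterations(80)   # R1 starts at 80
loop2_count, loop2_history = count_iterations(40)   # R2 starts at 40
print(loop1_count, loop2_count, loop2_history)      # 10 5 TTTTN
```

Under this model Loop1 runs 10 times, and Loop2 runs 5 times on each pass of Loop1.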


BNEZ R2,Loop2

Branch history:   TTTTN|TTTTN|TTTTN|TTTTN|…  (N means branch not taken.)

1-bit predictor:  TTTTT|NTTTT|NTTTT|NTTTT|…  → two errors per Loop1 iteration (one full pass through Loop2).

2-bit predictor:  TTTTT|TTTTT|TTTTT|TTTTT|…  → one error per Loop1 iteration.

The error behavior for Loop1 is similar.

Put more bits in the counter to improve error behavior?
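The two prediction strings can be reproduced with a short simulation. This is a minimal sketch, assuming the 1-bit predictor starts at "taken" and the 2-bit counter starts at 3 (strongly taken); those initial states are chosen to match the first group on the slide, and the function names are illustrative only.

```python
def one_bit_predictions(history, state="T"):
    """1-bit predictor: predict whatever the branch did last time."""
    predictions = []
    for outcome in history:
        predictions.append(state)
        state = outcome                      # remember only the latest outcome
    return "".join(predictions)

def two_bit_predictions(history, counter=3):
    """2-bit saturating counter: 0-1 predict not taken, 2-3 predict taken."""
    predictions = []
    for outcome in history:
        predictions.append("T" if counter >= 2 else "N")
        if outcome == "T":
            counter = min(3, counter + 1)
        else:
            counter = max(0, counter - 1)
    return "".join(predictions)

def errors_per_group(history, predictions, group=5):
    """Mispredictions in each pass through Loop2 (5 branches per pass)."""
    return [sum(h != p for h, p in zip(history[i:i + group], predictions[i:i + group]))
            for i in range(0, len(history), group)]

history = "TTTTN" * 4
one_bit = one_bit_predictions(history)
two_bit = two_bit_predictions(history)
print(one_bit, errors_per_group(history, one_bit))   # TTTTTNTTTTNTTTTNTTTT [1, 2, 2, 2]
print(two_bit, errors_per_group(history, two_bit))   # TTTTTTTTTTTTTTTTTTTT [1, 1, 1, 1]
```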


Global Branch History

Global branch history:

    Outcome:  TTTTN|T|TTTTN|T|TTTTN|T|TTTTN|T| …
    Loop:     22222|1|22222|1|22222|1|22222|1| …

Can we use global branch history to get a better result?

(On previous slide, we looked at local branch history.)
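The interleaved stream above can be generated directly from the loop bounds. The sketch assumes 10 outer and 5 inner iterations, which is what the initial values 80 and 40 with a step of 8 give; the function name is hypothetical.

```python
def global_history(outer_iters=10, inner_iters=5):
    """Return the global outcome string and, per outcome, which loop's branch produced it."""
    outcomes, labels = [], []
    for i in range(outer_iters):
        for j in range(inner_iters):
            # BNEZ R2,Loop2: taken on every pass except the last
            outcomes.append("T" if j < inner_iters - 1 else "N")
            labels.append("2")
        # BNEZ R1,Loop1: taken on every pass except the last
        outcomes.append("T" if i < outer_iters - 1 else "N")
        labels.append("1")
    return "".join(outcomes), "".join(labels)

outcomes, labels = global_history()
print(outcomes[:12])   # TTTTNTTTTTNT
print(labels[:12])     # 222221222221
```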


5-Bit Global Branch History

We keep a 5-bit global branch history, and use the bit pattern to choose one of 2^5 = 32 1-bit predictors:

    History   Prediction
    TTTTT     N
    TTTTN     T
    TTTNT     T
    TTNTT     T
    TNTTT     T
    NTTTT     T
    …         …
    NNNNN     T

We get 100% accuracy in the steady state.

This strategy works if at least 5 bits are used.
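The claim about 5 bits can be checked with a small simulation. This is a sketch under stated assumptions: the table is a dictionary standing in for the 1-bit predictors, unseen histories default to "taken", the history register starts as all T, and the steady-state stream is modeled as the repeating pattern TTTTNT (five Loop2 branches followed by one Loop1 branch).

```python
def global_mispredictions(stream, history_bits, default="T"):
    """Index 1-bit predictors by the last `history_bits` global outcomes."""
    table = {}                               # history pattern -> last outcome seen after it
    history = default * history_bits
    misses = []
    for outcome in stream:
        prediction = table.get(history, default)
        misses.append(prediction != outcome)
        table[history] = outcome             # 1-bit update: remember the latest outcome
        history = (history + outcome)[-history_bits:]
    return misses

stream = "TTTTNT" * 20
five = global_mispredictions(stream, 5)
four = global_mispredictions(stream, 4)
print(sum(five[6:]))     # 0
print(sum(four[-6:]))    # 2
```

With 4 bits the pattern TTTT is followed sometimes by T and sometimes by N, so its predictor keeps flipping; with 5 bits every pattern in the stream has a unique successor.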


Correlating Branch Predictors (p. 200)

A 2-bit predictor uses only the recent behavior of a single branch.

SPEC92 benchmark eqntott (the worst case in Figures 3.8 and 3.9, with an 18% error rate):

    if (aa==2)
        aa=0;
    if (bb==2)
        bb=0;
    if (aa!=bb) {


MIPS Code

Assume that aa and bb are assigned to R1 and R2:

        DSUBUI R3,R1,#2
        BNEZ   R3,L1       ;branch b1 (aa!=2)
        DADD   R1,R0,R0    ;aa=0
L1:     DSUBUI R3,R2,#2
        BNEZ   R3,L2       ;branch b2 (bb!=2)
        DADD   R2,R0,R0    ;bb=0
L2:     DSUBU  R3,R1,R2    ;R3=aa-bb
        BEQZ   R3,L3       ;branch b3 (aa==bb)

Consider the branches. The behavior of branch b3 is correlated with the behavior of branches b1 and b2: if both b1 and b2 are not taken, then b3 will be taken (as aa and bb are equal).
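The correlation can be confirmed by brute force. The sketch below mirrors the three branches in Python and searches a small range of values for a counterexample; the function name and the tested range are arbitrary choices for illustration.

```python
def branch_outcomes(aa, bb):
    """Return (b1, b2, b3) as booleans, True meaning the branch is taken."""
    b1 = aa != 2       # BNEZ R3,L1: taken when aa != 2
    if not b1:
        aa = 0         # DADD R1,R0,R0
    b2 = bb != 2       # BNEZ R3,L2: taken when bb != 2
    if not b2:
        bb = 0         # DADD R2,R0,R0
    b3 = aa == bb      # BEQZ R3,L3: taken when aa == bb
    return b1, b2, b3

# Cases where b1 and b2 are both not taken but b3 is also not taken.
violations = [(aa, bb)
              for aa in range(-4, 5) for bb in range(-4, 5)
              if branch_outcomes(aa, bb)[:2] == (False, False)
              and not branch_outcomes(aa, bb)[2]]
print(violations)   # []
```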


Simplified Example (p. 202)

Suppose that d has values 0, 1, and 2:

    if (d==0)
        d=1;
    if (d==1)

MIPS Code: Assume that d is assigned to R1:

        BNEZ   R1,L1       ;branch b1 (d!=0)
        DADDUI R1,R0,#1    ;d==0, so d=1
L1:     DADDUI R3,R1,#-1
        BNEZ   R3,L2       ;branch b2 (d!=1)
        …
L2:


Figure 3.10. Possible execution sequence

    Initial value of d   d==0?   b1   Value of d before b2   d==1?   b2
    0                    Yes     NT   1                      Yes     NT
    1                    No      T    1                      Yes     NT
    2                    No      T    2                      No      T
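The rows of this table follow mechanically from the code fragment. A minimal sketch, with a hypothetical helper name, that recomputes them:

```python
def branches_for(d):
    """Return (b1 outcome, value of d before b2, b2 outcome) for an initial d."""
    b1 = "T" if d != 0 else "NT"    # b1 is taken when d != 0
    if d == 0:
        d = 1                       # fall-through path sets d = 1
    d_before_b2 = d
    b2 = "T" if d != 1 else "NT"    # b2 is taken when d != 1
    return b1, d_before_b2, b2

for d in (0, 1, 2):
    print(d, *branches_for(d))
```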


Figure 3.11. Behavior of 1-bit predictor initialized to NT

Suppose that d = 2, 0, 2, 0, …

    d=?   b1 prediction   b1 action   New b1 prediction   b2 prediction   b2 action   New b2 prediction
    2     NT              T           T                   NT              T           T
    0     T               NT          NT                  T               NT          NT
    2     NT              T           T                   NT              T           T
    0     T               NT          NT                  T               NT          NT

Misprediction Rate = 100%!
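The 100% figure can be reproduced by stepping two independent 1-bit predictors through the sequence. This is a sketch only; `outcomes` and `run_one_bit` are illustrative names, and both predictors start at NT as in the figure.

```python
def outcomes(d):
    """Outcomes of b1 and b2, in program order, for an initial value of d."""
    b1 = "T" if d != 0 else "NT"
    if d == 0:
        d = 1
    b2 = "T" if d != 1 else "NT"
    return [("b1", b1), ("b2", b2)]

def run_one_bit(d_values):
    """One 1-bit predictor per branch, each initialized to NT."""
    state = {"b1": "NT", "b2": "NT"}
    misses = total = 0
    for d in d_values:
        for branch, action in outcomes(d):
            total += 1
            misses += state[branch] != action   # count a misprediction
            state[branch] = action              # 1-bit update
    return misses, total

misses, total = run_one_bit([2, 0, 2, 0])
print(misses, total)   # 8 8
```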


Figure 3.12. Meaning of Prediction Bits

    Prediction bits   Prediction if last branch not taken   Prediction if last branch taken
    NT/NT             NT                                    NT
    NT/T              NT                                    T
    T/NT              T                                     NT
    T/T               T                                     T


Fig. 3.13. Action of 1-bit predictor with 1 bit of correlation.

Initialized to NT/NT

    d=?   b1 prediction   b1 action   New b1 prediction   b2 prediction   b2 action   New b2 prediction
    2     NT/NT           T           T/NT                NT/NT           T           NT/T
    0     T/NT            NT          T/NT                NT/T            NT          NT/T
    2     T/NT            T           T/NT                NT/T            T           NT/T
    0     T/NT            NT          T/NT                NT/T            NT          NT/T
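The table can be replayed in code to see which predictions miss. The sketch assumes that the branch executed just before the first b1 was not taken; that is consistent with the first row, where the first of the two bits is the one updated. All names are illustrative.

```python
def outcomes(d):
    """Outcomes of b1 and b2, in program order, for an initial value of d."""
    b1 = "T" if d != 0 else "NT"
    if d == 0:
        d = 1
    b2 = "T" if d != 1 else "NT"
    return [("b1", b1), ("b2", b2)]

def run_correlating(d_values, last_branch="NT"):
    """(1,1) predictor: each branch keeps a pair of 1-bit predictions.

    Slot 0 is used if the last branch was not taken, slot 1 if it was taken.
    """
    table = {"b1": ["NT", "NT"], "b2": ["NT", "NT"]}
    misses = []
    for d in d_values:
        for branch, action in outcomes(d):
            slot = 1 if last_branch == "T" else 0
            misses.append(table[branch][slot] != action)
            table[branch][slot] = action     # update only the slot that was used
            last_branch = action             # 1 bit of global history
    return misses, table

misses, table = run_correlating([2, 0, 2, 0])
print(misses)   # [True, True, False, False, False, False, False, False]
print(table)    # {'b1': ['T', 'NT'], 'b2': ['NT', 'T']}
```

Only the two predictions for the first d = 2 miss; after that the pair T/NT for b1 and NT/T for b2 predicts every branch correctly.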


Figure 3.14. A (2,2) Branch Prediction Buffer

This buffer uses a 2-bit global history to choose from among 2^2 = 4 predictors for each branch address. Each predictor is in turn a 2-bit predictor for that branch.

Figure 3.12 shows a (1,1) branch prediction buffer.
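A general (m,n) buffer can be sketched as a table with one row per branch-address entry and 2^m n-bit counters per row. The class below is an illustrative model, not the hardware organization from the figure: the class name, the modulo indexing by address, the counters starting at 0, and the 1024-entry size in the usage lines are all assumptions.

```python
class CorrelatingPredictor:
    """(m, n) predictor: m bits of global history select one of 2**m n-bit counters."""

    def __init__(self, m, n, entries):
        self.m, self.n, self.entries = m, n, entries
        self.history = 0                          # newest outcome in bit 0
        self.max_count = (1 << n) - 1
        self.counters = [[0] * (1 << m) for _ in range(entries)]

    def total_bits(self):
        return (1 << self.m) * self.n * self.entries

    def predict(self, address):
        counter = self.counters[address % self.entries][self.history]
        return counter > self.max_count // 2      # upper half of the range means taken

    def update(self, address, taken):
        row = self.counters[address % self.entries]
        if taken:
            row[self.history] = min(self.max_count, row[self.history] + 1)
        else:
            row[self.history] = max(0, row[self.history] - 1)
        mask = (1 << self.m) - 1
        self.history = ((self.history << 1) | int(taken)) & mask

predictor = CorrelatingPredictor(m=2, n=2, entries=1024)
misses = []
for step in range(20):
    taken = step % 2 == 0                         # one branch alternating T, N, T, N, ...
    misses.append(predictor.predict(0x40) != taken)
    predictor.update(0x40, taken)
print(predictor.total_bits(), sum(misses), any(misses[10:]))   # 8192 3 False
```

An alternating branch defeats a lone 2-bit counter, but here the two history patterns that precede T and N select different counters, so the misses stop after warm-up.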


Figure 3.15. Comparison of 2-bit Predictors


Tournament Predictors (p. 206)

Adaptively combine local and global predictors.

Alpha 21264 has a tournament predictor using 4K 2-bit counters indexed by the local branch address to choose between a global predictor and a local predictor.

The global predictor also has 4K entries and is indexed by the history of the last 12 branches; each entry in the global predictor is a standard 2-bit predictor.

The local predictor is a 2-level predictor. The top level is a local history table consisting of 1024 10-bit entries. Each entry is used to index a table of 1K entries consisting of 3-bit saturating counters, which provides the local prediction.

(Total = 29K bits. For SPECfp95 benchmarks, less than 1 misprediction per 1000 completed instructions.)
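The 29K-bit total is the sum of the four tables described above. A quick arithmetic check, with variable names chosen here for readability:

```python
K = 1024

selector_bits      = 4 * K * 2    # 4K 2-bit counters that choose between the predictors
global_bits        = 4 * K * 2    # 4K-entry global predictor, 2 bits per entry
local_history_bits = 1024 * 10    # local history table: 1024 entries of 10 bits
local_counter_bits = 1 * K * 3    # 1K 3-bit saturating counters

total_bits = selector_bits + global_bits + local_history_bits + local_counter_bits
print(total_bits, total_bits / K)   # 29696 29.0

# The index widths match the table sizes: 12 history bits address 4K entries,
# and a 10-bit local history addresses 1K counters.
print(2 ** 12 == 4 * K, 2 ** 10 == K)   # True True
```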


Fig. 3.16. State Transition Diagram for Tournament Predictor

The counter is incremented whenever the “predicted” predictor is correct and the other predictor is incorrect, and it is decremented in the reverse situation.
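One common way to read this rule is as a 2-bit saturating counter that drifts toward whichever predictor was right when the two disagree. The sketch below takes that reading as an assumption: the encoding (0 and 1 select predictor P1, 2 and 3 select P2) and the function names are illustrative and are not taken from the figure.

```python
def update_selector(counter, p1_correct, p2_correct):
    """Move the 2-bit selector toward the predictor that alone was correct."""
    if p1_correct and not p2_correct:
        return max(0, counter - 1)    # strengthen or move toward P1
    if p2_correct and not p1_correct:
        return min(3, counter + 1)    # strengthen or move toward P2
    return counter                    # both right or both wrong: no change

def selected(counter):
    """Which predictor the selector currently trusts."""
    return "P1" if counter <= 1 else "P2"

counter = 0                                        # strongly prefer P1
counter = update_selector(counter, False, True)    # P2 alone correct once
first = selected(counter)
counter = update_selector(counter, False, True)    # P2 alone correct again
second = selected(counter)
print(first, second)   # P1 P2
```

Two consecutive cases where only the other predictor is correct are needed before a strongly held choice flips, which is the hysteresis a 2-bit counter provides.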


Figure 3.17. Fraction of predictions from local predictor for a tournament predictor using SPEC89


Figure 3.18. Misprediction rates for three different predictors on SPEC89 as total # of bits is increased