André Seznec Caps Team IRISA/INRIA Design tradeoffs for the Alpha EV8 Conditional Branch Predictor André Seznec, IRISA/INRIA Stephen Felix, Intel Venkata Krishnan, Stargen Inc Yiannakis Sazeides, University of Cyprus


André Seznec Caps Team

IRISA/INRIA

Design tradeoffs for the Alpha EV8 Conditional Branch Predictor

André Seznec, IRISA/INRIA

Stephen Felix, Intel

Venkata Krishnan, Stargen Inc

Yiannakis Sazeides, University of Cyprus


Alpha EV8 (cancelled June 2001)

SMT: 4 threads wide-issue superscalar processor:

8-way issue

Single process performance is the goal

Multithreaded performance is a bonus

5-10 % overhead for SMT


Challenges on the EV8 conditional branch predictor

High accuracy is needed: 14 cycles minimum miss penalty

Up to 16 predictions per cycle: from two non-contiguous fetch blocks!

Various implementation constraints:
• master the number of physical memory arrays
• use of single-ported memory cells
• timing constraints


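Instruction fetch blocks on EV8

[Figure: two example fetch blocks, each containing two branches (br). Outcome labels in the figure: taken and not taken for the first block; not taken and not taken for the second block.]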


Alpha EV8 front-end pipeline

Fetches up to two 8-instruction blocks per cycle from the I-cache:
• a block ends either on an aligned 8-instruction end or on a taken control flow
• up to 16 conditional branches fetched and predicted per cycle

Next two block addresses must be predicted in a single cycle:
• critical path: use of a line predictor backed with a complex PC address generator (conditional branch predictor, RAS, jump predictor, ...)
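The block-termination rule above can be sketched in a few lines of Python. This is only an illustration: the `Instr` record, its `index` field (instruction number rather than byte address) and the function name `form_fetch_block` are assumptions made for the example, not part of the EV8 design.

```python
from dataclasses import dataclass
from typing import List

BLOCK_SIZE = 8  # instructions per aligned fetch block, as stated on the slide


@dataclass
class Instr:
    index: int                    # hypothetical: instruction number, not a byte address
    is_cond_branch: bool = False
    taken: bool = False           # True for any taken control-flow instruction


def form_fetch_block(stream: List[Instr], start: int) -> List[Instr]:
    """Collect instructions from stream[start:] until the fetch block ends.

    A block ends on an aligned 8-instruction boundary or right after a
    taken control-flow instruction, whichever comes first.
    """
    block: List[Instr] = []
    for instr in stream[start:]:
        block.append(instr)
        if instr.taken:
            break  # a taken control flow ends the block
        if (instr.index + 1) % BLOCK_SIZE == 0:
            break  # reached the aligned 8-instruction end
    return block


# Dynamic instruction stream: starts mid-block at instruction 3,
# then runs 8, 9, 10 where instruction 10 is a taken conditional branch.
stream = [Instr(i) for i in range(3, 8)]
stream += [Instr(8), Instr(9, True, False), Instr(10, True, True), Instr(40)]

first = form_fetch_block(stream, 0)            # ends on the aligned boundary
second = form_fetch_block(stream, len(first))  # ends on the taken branch
```

Two such blocks are fetched per cycle, which is where the bound of 16 conditional branches per cycle comes from.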


PC address generation pipeline

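[Figure: three-cycle pipeline diagram (Cycle 1, Cycle 2, Cycle 3) marking the points where line prediction is completed, prediction table read is completed, and PC address generation is completed, for the successive block pairs Y and Z, A and B, C and D.]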


EV8 predictor: derived from 2Bc-gskew

[Figure: predictor structure, with the e-gskew part labelled.]
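For readers without the figure, the selection logic of 2Bc-gskew can be sketched as follows. The sketch follows the published description of 2Bc-gskew (a bimodal table BIM, two gskew tables G0 and G1, and a meta-predictor Meta) rather than anything spelled out on this slide; the function names and the polarity of the meta bit are assumptions.

```python
def majority(a: bool, b: bool, c: bool) -> bool:
    """Majority vote over three one-bit predictions."""
    return (int(a) + int(b) + int(c)) >= 2


def predict_2bc_gskew(bim: bool, g0: bool, g1: bool, meta: bool) -> bool:
    """Combine the four table outputs read for one branch.

    e-gskew is the majority vote of BIM, G0 and G1.
    Assumption: meta == True selects e-gskew, meta == False selects BIM alone.
    """
    egskew = majority(bim, g0, g1)
    return egskew if meta else bim


# BIM says not taken, both history-indexed tables say taken.
with_meta = predict_2bc_gskew(bim=False, g0=True, g1=True, meta=True)
without_meta = predict_2bc_gskew(bim=False, g0=True, g1=True, meta=False)
```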


2Bc-gskew: degrees of freedom

Partial update policy:
• on correct predictions, only update the components that predicted correctly
• do not destroy other predictions: better accuracy!

On correct predictions:
• the prediction bit is only read
• the hysteresis bit is only written

USE OF DISTINCT PREDICTION AND HYSTERESIS ARRAYS!

No reason for the hysteresis and prediction arrays to have the same size
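A minimal sketch of a 2-bit counter table split into a prediction array and a smaller hysteresis array is given below. The class name, the modulo mapping from prediction entries to shared hysteresis entries, and the `pred_writes` counter are assumptions made for illustration; the slide only states that the two arrays are distinct and need not have the same size.

```python
class SplitCounterTable:
    """2-bit counters stored as separate prediction and hysteresis bits."""

    def __init__(self, pred_entries: int, hyst_entries: int) -> None:
        assert pred_entries % hyst_entries == 0
        self.pred = [False] * pred_entries   # prediction bits (direction)
        self.hyst = [False] * hyst_entries   # hysteresis bits (True = strong)
        self.pred_writes = 0                 # counts writes to the prediction array

    def _hyst_index(self, index: int) -> int:
        # Assumption: several prediction entries share one hysteresis entry.
        return index % len(self.hyst)

    def read(self, index: int) -> bool:
        return self.pred[index]

    def update(self, index: int, outcome: bool) -> None:
        h = self._hyst_index(index)
        if self.pred[index] == outcome:
            self.hyst[h] = True              # correct: hysteresis write only
        elif self.hyst[h]:
            self.hyst[h] = False             # wrong but strong: weaken
        else:
            self.pred[index] = outcome       # wrong and weak: flip the prediction
            self.pred_writes += 1


table = SplitCounterTable(pred_entries=8, hyst_entries=4)
table.update(1, True)   # weak not-taken mispredicts: prediction bit flips
table.update(1, True)   # now correct: only the hysteresis bit is written
```

On a correct prediction the prediction array is never written, which is what makes a single-ported, separately sized hysteresis array workable.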


EV8 predictor: leveraging degrees of freedom

• Different history lengths

• Smaller bimodal table


Dealing with implementation constraints


Issues on global history

[Figure: timeline of successive block pairs: blocks Y and Z, blocks A and B, blocks C and D.]

Branch information from C, B and A is not yet available when predicting D!

On each cycle, up to 16 branches are predicted: 0 to 16 bits to be inserted in the history vector!?


Block compressed history lghist

Incorporate at most one bit in the history per fetch block:
• 0, 1 or 2 bits to be incorporated in the history vector per cycle

Which bit?
• direction of the last conditional branch in the block (previous ones are not taken)
• XORed with its position (1st half / 2nd half) in the block: more uniform distribution of the history vectors
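The lghist insertion rule can be sketched as below. The encoding of a block as a list of `(slot, taken)` pairs, the convention that slots 4 to 7 form the second half, and both function names are assumptions for this example.

```python
from typing import List, Optional, Tuple

BLOCK_SIZE = 8  # instructions per fetch block


def lghist_bit(cond_branches: List[Tuple[int, bool]]) -> Optional[int]:
    """Return the bit a fetch block contributes to lghist, or None.

    cond_branches lists (slot, taken) for each conditional branch of the
    block in program order, with slot in 0..7. A block without any
    conditional branch inserts nothing.
    """
    if not cond_branches:
        return None
    slot, taken = cond_branches[-1]          # only the last conditional branch counts
    second_half = 1 if slot >= BLOCK_SIZE // 2 else 0
    return int(taken) ^ second_half          # direction XOR position in the block


def push_lghist(history: int, bit: Optional[int], length: int) -> int:
    """Shift one bit into a history register of the given length."""
    if bit is None:
        return history
    return ((history << 1) | bit) & ((1 << length) - 1)


history = 0
history = push_lghist(history, lghist_bit([(1, False), (2, True)]), 8)  # taken, 1st half
history = push_lghist(history, lghist_bit([]), 8)                       # no branch
history = push_lghist(history, lghist_bit([(6, True)]), 8)              # taken, 2nd half
```

With two blocks fetched per cycle, this inserts 0, 1 or 2 bits per cycle instead of up to 16.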


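Instruction fetch blocks on EV8

[Figure: two example fetch blocks, each containing two branches (br). First block: label "taken", 1 is inserted. Second block: labels "taken" and "not taken", 0 is inserted.]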


The EV8 branch predictor information vector

History information is not available for the three previous blocks A, B and C, but their addresses are available!

Information vector to index the predictor:
1. Instruction address
2. lghist (3-block-old history + path)
3. Path info on the last three blocks
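One way to picture the three components is a small delay line: history bits from the three most recent blocks are held back, while their addresses are used as path information. The sketch below is illustrative only; `InfoVector`, `InfoVectorBuilder`, `record_block` and the default history length are assumed names and values, and the actual EV8 selection of address and path bits is not shown.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Optional, Tuple

DELAY = 3  # history from the three previous blocks is not yet usable


@dataclass(frozen=True)
class InfoVector:
    block_address: int
    old_lghist: int          # lghist as it stood three blocks ago
    path: Tuple[int, ...]    # addresses of the last three blocks


class InfoVectorBuilder:
    def __init__(self, hist_len: int = 21) -> None:
        self.hist_len = hist_len
        self.lghist = 0
        # Blocks whose history bit has not been folded into lghist yet.
        self.pending: Deque[Tuple[int, Optional[int]]] = deque()

    def vector_for(self, block_address: int) -> InfoVector:
        path = tuple(addr for addr, _ in self.pending)
        return InfoVector(block_address, self.lghist, path)

    def record_block(self, block_address: int, bit: Optional[int]) -> None:
        """Register a fetched block and the lghist bit it will insert, if any."""
        self.pending.append((block_address, bit))
        if len(self.pending) > DELAY:
            _, old_bit = self.pending.popleft()
            if old_bit is not None:
                mask = (1 << self.hist_len) - 1
                self.lghist = ((self.lghist << 1) | old_bit) & mask


builder = InfoVectorBuilder(hist_len=4)
for addr, bit in [(0x100, 1), (0x120, 0), (0x140, 1)]:   # blocks A, B, C
    builder.record_block(addr, bit)
vec_d = builder.vector_for(0x160)                          # predicting block D
```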


Using single-ported memory arrays

The challenge:

16 predictions to be performed per cycle from two non-contiguous blocks !

8 updates per cycle: for two non-contiguous blocks !

But single-ported arrays are highly desirable :-)


Bank-interleaved or double-ported branch predictor ?

Reads of predictions for two 8-instruction blocks:

Double-porting: memory cells twice as large
• losing half of the entries?

Bank-interleaving: need for arbitration
• longer critical electrical path
• losing throughput
• short loops fitting in a single 8-instruction block!?

????????


Conflict free interleaved bank predictor

Key idea:
• force adjacent predictions to lie in distinct banks
• bank for A is determined by Y and Z

4-way interleaved:

if (y6,y5) == Bz then Ba = (y6,y5+1) else Ba = (y6,y5)
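The bank rule can be checked with a short sketch. Here `y6` and `y5` are taken to be bits 6 and 5 of the address of block Y, and `y5+1` is read as a one-bit increment, that is, flipping `y5`; both readings, and the function names, are assumptions.

```python
def bits_6_5(addr: int) -> int:
    """Extract address bits (6, 5) as a 2-bit bank candidate."""
    return (addr >> 5) & 0b11


def next_bank(y_addr: int, prev_bank: int) -> int:
    """Bank for block A, given the address of Y and the bank Bz used by Z.

    The candidate (y6, y5) is used unless it equals the previous bank,
    in which case y5 is flipped, so the result always differs from Bz.
    """
    candidate = bits_6_5(y_addr)
    if candidate == prev_bank:
        candidate ^= 0b01        # (y6, y5+1) on a single bit: flip y5
    return candidate


# Chain the rule over a stream of block addresses: each bank differs
# from its predecessor, so two adjacent reads never hit the same bank.
banks = [0]
for addr in [0x000, 0x040, 0x040, 0x060, 0x020, 0x020]:
    banks.append(next_bank(addr, banks[-1]))
```

Because the bank depends only on an older address and the previous bank, it can be computed ahead of the access.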


Conflict free bank-interleaved predictor (2)

Conflicts are avoided by construction

Bank number is computed one cycle ahead: not on the critical path

Single ported bank-interleaved memory arrays !


« Logical view » vs real implementation

Logical view:
• 4 tables * 4 banks * 2 (pred. + hyst.): 32 memory arrays
• indexing functions are computed, then arrays are accessed

Real implementation:
• 4 banks * 2 (pred. + hyst.), 4 tables in a single array: 8 memory arrays
• no time to lose: start the access and compute part of the index in parallel


Reading the branch prediction tables

• Bank selection: 1 out of 4
• Wordline selection: 1 out of 64
• Column selection: 8 out of 256
• Unshuffle: 8 to 8

[Figure labels: Meta, G0, G1, BIM]
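The four selection steps account for 2 + 6 + 5 + 3 = 16 index bits, which matches a 64K-entry logical table. The sketch below splits a 16-bit entry index into those fields. The field order within the index and the names are assumptions, and in hardware all 8 bits of the selected column group are read and permuted together rather than one entry being picked.

```python
from typing import NamedTuple


class ArrayAccess(NamedTuple):
    bank: int        # 1 out of 4   -> 2 bits
    wordline: int    # 1 out of 64  -> 6 bits
    column: int      # 8 out of 256 -> 32 groups of 8 bits -> 5 bits
    unshuffle: int   # position within the 8-bit group -> 3 bits


def split_index(index: int) -> ArrayAccess:
    """Split a 16-bit logical index into array access fields.

    Assumption: fields are laid out, from high to low bits, as
    bank | wordline | column | unshuffle.
    """
    assert 0 <= index < (1 << 16)
    unshuffle = index & 0b111
    column = (index >> 3) & 0b11111
    wordline = (index >> 8) & 0b111111
    bank = (index >> 14) & 0b11
    return ArrayAccess(bank, wordline, column, unshuffle)


access = split_index((2 << 14) | (5 << 8) | (9 << 3) | 4)
entries = 4 * 64 * 32 * 8   # banks * wordlines * column groups * bits per group
```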


Reading the branch prediction tables (2)

Spans 5 cycle phases:

Cycle -1:
• bank number computation
• bank selection

Cycle 0:
• phase 0: wordline selection
• phase 1: column selection

Cycle 1:
• phase 0: unshuffle permutation


Constraints for indices composition

Strong: wordline bits
• immediate availability
• common to the four logical tables

Medium: column bits
• a single 2-input XOR gate

Weak: unshuffle bits
• near complete freedom, a full tree of XOR gates if needed


Designing the indexing functions (1): 6 wordline bits

Must be available at the beginning of the cycle:
• block address bits
• 3-block-old lghist bits
• path bits

Tradeoff:
• address bits for emphasizing bimodal component behavior
• lghist bits are more uniformly distributed

4 lghist bits + 2 address bits


Designing the indexing functions (2): column selection and unshuffle

Favor independence of the four indexing functions:
• if two (address, history) pairs conflict on a table, then try to avoid repeating the conflict on another table

Guarantee that, for a single address, two histories that differ by only one or two bits will not map onto the same entry

Favor usage of the whole table:
• lghist bits are more uniformly distributed than address bits

XORing 2 lghist bits for column bits

An XOR tree with up to 11 bits for unshuffle
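The one-or-two-bit guarantee has a simple test when every index bit is an XOR of selected history bits. Flipping history bit j changes exactly the index bits fed by j, so each history bit must feed at least one index bit, and no two history bits may feed the same set of index bits. The masks below are hypothetical and only illustrate the check; they are not the EV8 indexing functions.

```python
from typing import List


def hash_index(history: int, masks: List[int]) -> int:
    """Index bit i is the parity (XOR tree) of the history bits in masks[i]."""
    index = 0
    for bit_pos, mask in enumerate(masks):
        parity = bin(history & mask).count("1") & 1
        index |= parity << bit_pos
    return index


def separates_close_histories(masks: List[int], hist_len: int) -> bool:
    """True if histories differing in one or two bits never share an index."""
    # Signature of history bit j: the set of index bits it feeds.
    sigs = [
        sum(((m >> j) & 1) << i for i, m in enumerate(masks))
        for j in range(hist_len)
    ]
    if any(s == 0 for s in sigs):
        return False                     # flipping that bit alone changes nothing
    return len(set(sigs)) == hist_len    # equal signatures cancel when both flip


# Hypothetical 6-bit history hashed to 3 index bits.
good_masks = [0b010101, 0b100110, 0b111000]
bad_masks = [0b000011, 0b001100, 0b110000]   # bits 0 and 1 feed the same XOR

good_ok = separates_close_histories(good_masks, 6)
bad_ok = separates_close_histories(bad_masks, 6)
```

For a fixed address, the address bits only add a constant to every XOR, so the check on the history masks alone is sufficient in this simplified model.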


EV8 branch predictor configuration

208 Kbits for prediction and 144 Kbits for hysteresis:
• «BIM»: 16 K + 16 K, 4 lghist bits (+ 3-block path)
• G0: 64 K + 32 K, 13 lghist bits
• G1: 64 K + 64 K, 21 lghist bits
• Meta: 64 K + 32 K, 17 lghist bits

4 prediction banks and 4 hysteresis banks
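The totals can be rechecked from the per-table figures above. The dictionary layout and variable names are just a convenience for this check.

```python
# (prediction Kbits, hysteresis Kbits, lghist bits) per logical table,
# copied from the configuration listed on this slide.
CONFIG = {
    "BIM": (16, 16, 4),
    "G0": (64, 32, 13),
    "G1": (64, 64, 21),
    "Meta": (64, 32, 17),
}

prediction_kbits = sum(pred for pred, _, _ in CONFIG.values())
hysteresis_kbits = sum(hyst for _, hyst, _ in CONFIG.values())
total_kbits = prediction_kbits + hysteresis_kbits
```

The sums give 208 Kbits of prediction, 144 Kbits of hysteresis and 352 Kbits overall.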


Performance evaluation

Sorry,

SPEC 95 :-)


Benchmarks characteristics

Highly optimized SPECint 95:

• many more not-taken than taken branches
• ratio lghist/ghist length: from 1.12 to 1.59
• from 8.9 to 16.2 branches per 100 instructions


2Bc-gskew vs other global history predictors

[Figure: bar chart of mispredictions per 1000 instructions (Mispredictions/KI, axis 0 to 12) for com, gcc, go, ijp, li, m88, perl, vor. Series: 512K 2Bc-gskew 0,17,20,27; 576K YAGS 25; 256K 2Bc-gskew 0,13,16,23; 288K YAGS 23; 544K bimode 20; 2M gshare.]


Quality of information vector

[Figure: bar chart of mispredictions per 1000 instructions (Mispredictions/KI, axis 0 to 12) for com, gcc, go, ijp, li, m88, perl, vor. Series: ghist (512K 2Bc-gskew); lghist, no path; lghist, path; 3-old lghist; EV8 info vector.]


Reducing some table sizes: no significant impact

[Figure: bar chart of mispredictions per 1000 instructions (Mispredictions/KI, axis 0 to 12) for com, gcc, go, ijp, li, m88, perl, vor. Series: 4*64K 2Bc-gskew ghist; small BIM; EV8 size.]


Quality of indexing functions

[Figure: bar chart of mispredictions per 1000 instructions (Mispredictions/KI, axis 0 to 12) for com, gcc, go, ijp, li, m88, perl, vor. Series: address only, no path; address only, path; no path; EV8; EV8 + complete hash; 4*64K 2Bc-gskew ghist.]


Conclusion

Design of a real branch predictor leads to challenges ignored in most academic studies:
• 3-block-old history vector
• inability to maintain a complete history
• simultaneous accesses to the predictor
• minimization of the number of memory arrays
• timing constraints on the indexing functions

We overcame these difficulties and adapted a state of the art academic branch predictor to real world constraints.


Summary of the contributions

An efficient information vector can be built by mixing path and compressed history: don’t focus on the info vector, use what is convenient!

Use of different table sizes, history lengths in the predictor.

Sharing of hysteresis bits

Conflict free parallel access scheme for the predictor

Engineering of indexing functions


Acknowledgements

To the whole EV8 design team

Special mention to:

Ta-chung Chang, George Chrysos, John Edmondson, Joel Emer, Tryggve Fossum, Glenn Giacalone, Balakrishnan Iyer, Manickavelu Balasubramanian, Harish Patil, George Tien and James Vash.