André Seznec Caps Team
IRISA/INRIA
Design tradeoffs for the Alpha EV8 Conditional Branch Predictor
André Seznec, IRISA/INRIA
Stephen Felix, Intel
Venkata Krishnan, Stargen Inc
Yiannakis Sazeides, University of Cyprus
The Alpha EV8 Conditional Branch Predictor
André Seznec, Caps Team
Irisa
Alpha EV8 (cancelled June 2001)
SMT: 4-thread, wide-issue superscalar processor:
8-way issue
Single-process performance is the goal
Multithreaded performance is a bonus
5-10% overhead for SMT
Challenges on the EV8 conditional branch predictor
High accuracy is needed: 14-cycle minimum misprediction penalty
Up to 16 predictions per cycle, from two non-contiguous fetch blocks!
Various implementation constraints:
• keep the number of physical memory arrays under control
• use single-ported memory cells
• timing constraints
Instruction fetch blocks on EV8
(figure: fetch blocks containing conditional branches (br), each branch marked taken or not taken)
Alpha EV8 front-end pipeline
Fetches up to two 8-instruction blocks per cycle from the I-cache:
• a block ends either at an aligned 8-instruction boundary or on a taken control-flow instruction
• up to 16 conditional branches fetched and predicted per cycle
The next two block addresses must be predicted in a single cycle:
• critical path: use of a line predictor backed by a complex PC address generator (conditional branch predictor, RAS, jump predictor, ...)
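The block-formation rule above can be illustrated with a small model. This is a minimal sketch, assuming fixed 4-byte instructions and a hypothetical trace format of `(pc, is_taken_control_flow)` pairs; it only shows where fetch blocks end, not how the EV8 hardware fetches them.

```python
from typing import List, Tuple

INSN_BYTES = 4                        # assumption: fixed 4-byte instructions
BLOCK_INSNS = 8                       # a fetch block holds up to 8 instructions
BLOCK_BYTES = INSN_BYTES * BLOCK_INSNS


def split_into_fetch_blocks(trace: List[Tuple[int, bool]]) -> List[List[int]]:
    """Group a dynamic trace of (pc, is_taken_control_flow) into fetch blocks.

    A block ends at the last slot of an aligned 8-instruction chunk,
    or right after a taken control-flow instruction, whichever comes first.
    """
    blocks: List[List[int]] = []
    current: List[int] = []
    for pc, taken in trace:
        current.append(pc)
        at_aligned_end = pc % BLOCK_BYTES == BLOCK_BYTES - INSN_BYTES
        if taken or at_aligned_end:
            blocks.append(current)
            current = []
    if current:  # trailing partial block at the end of the trace
        blocks.append(current)
    return blocks


trace = [(0x100, False), (0x104, True),   # ends on a taken branch
         (0x218, False), (0x21C, False),  # ends on an aligned boundary
         (0x220, False)]
blocks = split_into_fetch_blocks(trace)
```

Consecutive blocks in this list would then be grouped two per cycle; whenever the first block of a pair ends on a taken branch, the two blocks are non-contiguous.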
PC address generation pipeline
Cycle 1: line prediction is completed
Cycle 2: prediction table read is completed
Cycle 3: PC address generation is completed
(figure: block pairs C and D, A and B, Y and Z in flight in the pipeline)
EV8 predictor: derived from 2Bc-gskew
(figure: 2Bc-gskew structure, built on the e-gskew predictor)
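In 2Bc-gskew as usually described, three components (BIM, G0, G1) form an e-gskew majority vote, and a meta predictor chooses between that vote and the bimodal prediction alone. A minimal sketch of the final selection step, assuming the four prediction bits have already been read; index computation is omitted, and the meta polarity chosen below is an arbitrary assumption.

```python
def majority(a: bool, b: bool, c: bool) -> bool:
    """Majority vote of three predictions (the e-gskew part)."""
    return (a and b) or (a and c) or (b and c)


def predict_2bc_gskew(bim: bool, g0: bool, g1: bool, meta: bool) -> bool:
    """Final 2Bc-gskew prediction from the four prediction bits read.

    Assumed polarity: meta == True selects the e-gskew majority vote,
    meta == False selects the bimodal (BIM) prediction alone.
    """
    egskew = majority(bim, g0, g1)
    return egskew if meta else bim
```

The BIM bit is used both on its own and as one of the three voters, which is what lets the EV8 design treat the four tables (BIM, G0, G1, Meta) as a single predictor read in parallel.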
2Bc-gskew: degrees of freedom
Partial update policy: on correct predictions, only the components that predicted correctly are updated:
• other predictions are not destroyed: better accuracy!
On correct predictions, the prediction bit is only read and the hysteresis bit is only written:
USE OF DISTINCT PREDICTION AND HYSTERESIS ARRAYS!!
No reason for the hysteresis and prediction arrays to have the same size
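The split between prediction and hysteresis bits can be sketched as follows. This is a minimal model, assuming the hysteresis bit means "strong" and that, when the hysteresis array is smaller, entries whose indices are congruent modulo its size share one hysteresis bit (a hypothetical mapping). The `partial_update` helper is a simplified reading of the policy above for the direction components only; the meta predictor's own update rule is not modeled.

```python
from typing import List, Sequence


class SplitCounterTable:
    """A table of 2-bit counters stored as two separate bit arrays.

    pred[i] is the predicted direction; hyst[j] == True means "strong".
    A smaller hysteresis array is shared: entry i uses hyst[i % len(hyst)].
    """

    def __init__(self, pred_entries: int, hyst_entries: int) -> None:
        assert pred_entries % hyst_entries == 0
        self.pred: List[bool] = [False] * pred_entries
        self.hyst: List[bool] = [False] * hyst_entries

    def read(self, index: int) -> bool:
        return self.pred[index]

    def update(self, index: int, taken: bool) -> None:
        """2-bit saturating counter update expressed on the split bits."""
        h = index % len(self.hyst)
        if self.pred[index] == taken:
            self.hyst[h] = True        # correct: only the hysteresis bit is written
        elif self.hyst[h]:
            self.hyst[h] = False       # strong -> weak, direction unchanged
        else:
            self.pred[index] = taken   # weak -> flip direction, stays weak


def partial_update(tables: Sequence[SplitCounterTable], indices: Sequence[int],
                   final_prediction: bool, taken: bool) -> None:
    """Simplified partial update for the direction components (BIM, G0, G1).

    On a correct overall prediction, only components that were themselves
    correct are updated (strengthened); on a misprediction all are updated.
    """
    for table, idx in zip(tables, indices):
        if final_prediction != taken or table.read(idx) == taken:
            table.update(idx, taken)
```

Because a correct component's update only touches `hyst`, the prediction array is read-only on correct predictions, which is the property that makes distinct, differently sized arrays attractive.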
EV8 predictor: leveraging degrees of freedom
Different history lengths
Smaller bimodal table
Dealing with implementation constraints
Issues on global history
(figure: successive fetch-block pairs: blocks A and B, blocks Y and Z, blocks C and D)
Branch information from blocks C, B and A cannot be used to predict D!
On each cycle, up to 16 branches are predicted: 0 to 16 bits to be inserted in the history vector!?
Block compressed history: lghist
Incorporate at most one bit in the history per fetch block: 0, 1 or 2 bits to be incorporated in the history vector per cycle
Which bit? The direction of the last conditional branch in the block (the previous ones are not taken), XORed with its position (1st half / 2nd half) in the block
• more uniform distribution of the history vectors
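The lghist rule above can be sketched in a few lines. This is a minimal model, assuming each block is described by its conditional branches as `(slot, taken)` pairs with slot 0..7 inside the aligned block, that slots 4..7 form the "2nd half", and that a block with no conditional branch inserts no bit (our reading of "at most one bit per fetch block"). The names `lghist_bit` and `push_cycle` are hypothetical.

```python
from typing import Iterable, List, Optional, Tuple

Block = List[Tuple[int, bool]]  # conditional branches of one block: (slot 0..7, taken)


def lghist_bit(block: Block) -> Optional[int]:
    """History bit contributed by one fetch block, or None if it has no
    conditional branch (such a block inserts nothing)."""
    if not block:
        return None
    slot, taken = block[-1]           # earlier branches are necessarily not taken
    second_half = 1 if slot >= 4 else 0
    return int(taken) ^ second_half   # direction XOR position in the block


def push_cycle(history: int, length: int, blocks: Iterable[Block]) -> int:
    """Insert the 0, 1 or 2 bits produced by the blocks fetched in one cycle."""
    mask = (1 << length) - 1
    for block in blocks:
        bit = lghist_bit(block)
        if bit is not None:
            history = ((history << 1) | bit) & mask
    return history
```

With at most one bit per block and two blocks per cycle, the history register shifts by at most 2 positions per cycle instead of up to 16.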
Instruction fetch blocks on EV8
(figure: example blocks with two conditional branches (br): when the last branch is taken, 1 is inserted; when the last branch is not taken, 0 is inserted)
The EV8 branch predictor information vector
History information is not available for the three previous blocks A, B and C, but their addresses are!!
Information vector used to index the predictor:
1. Instruction address
2. lghist (3-block-old history + path)
3. Path information on the last three blocks
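The three ingredients above can be assembled as in the following sketch. It assumes a block's lghist bit becomes usable only once three younger blocks have been fetched, and that path information is a few low address bits of each in-flight block; `PATH_BITS`, `BLOCK_SHIFT` and the class name are hypothetical choices, not the EV8 bit selection.

```python
from collections import deque
from typing import Deque, Optional, Tuple

HIST_LEN = 21     # longest lghist length in the EV8 configuration (G1)
PATH_BITS = 2     # hypothetical: address bits kept from each in-flight block
BLOCK_SHIFT = 5   # assumption: 32-byte (8 x 4-byte) aligned fetch blocks


class InfoVectorBuilder:
    """Builds (address, 3-block-old lghist, path) for each predicted block."""

    def __init__(self) -> None:
        self.lghist = 0  # history visible at prediction time (3 blocks stale)
        self.in_flight: Deque[Tuple[int, Optional[int]]] = deque()

    def info_vector(self, block_addr: int) -> Tuple[int, int, int]:
        # Outcomes of the 3 most recent blocks are unknown, but their
        # addresses are: fold a few address bits from each into `path`.
        path = 0
        for addr, _ in self.in_flight:
            path = (path << PATH_BITS) | ((addr >> BLOCK_SHIFT) & ((1 << PATH_BITS) - 1))
        return block_addr, self.lghist, path

    def push_block(self, block_addr: int, hist_bit: Optional[int]) -> None:
        # Record a newly predicted block; once more than 3 blocks are in
        # flight, the oldest one's lghist bit becomes usable.
        self.in_flight.append((block_addr, hist_bit))
        if len(self.in_flight) > 3:
            _, bit = self.in_flight.popleft()
            if bit is not None:
                self.lghist = ((self.lghist << 1) | bit) & ((1 << HIST_LEN) - 1)
```

The point of the design is that nothing in the vector waits on the outcomes of A, B and C: only their addresses, which the line predictor has already produced, are used.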
Using single-ported memory arrays
The challenge:
16 predictions to be performed per cycle, from two non-contiguous blocks!
8 updates per cycle, for two non-contiguous blocks!
But single-ported arrays are highly desirable :-)
Bank-interleaved or double-ported branch predictor?
Reading predictions for two 8-instruction blocks per cycle:
Double-porting: memory cells twice as large
• losing half of the entries?
Bank-interleaving: need for arbitration
• longer critical electrical path
• losing throughput
• short loops fitting in a single 8-instruction block!?
Conflict-free interleaved bank predictor
Key idea:
Force adjacent predictions to lie in distinct banks
The bank for block A is determined by blocks Y and Z (address bits y6, y5 of Y and bank Bz of Z):
if (y6, y5) == Bz then Ba = (y6, y5 + 1) else Ba = (y6, y5)
4-way interleaved
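The bank rule above can be checked with a small simulation. This sketch reads "y5 + 1" as addition modulo 2 (flipping y5), and it extends the rule from A to every block: each block's bank comes from the address of the block two positions earlier and the bank of the immediately preceding block. That generalization and the seed banks are assumptions. Since every bank differs from its predecessor's, the two blocks read in the same cycle never conflict.

```python
from typing import List, Tuple


def bank_bits(addr: int) -> int:
    """Address bits (a6, a5) read as a 2-bit bank number (0..3)."""
    return (addr >> 5) & 0b11


def next_bank(addr_two_back: int, prev_bank: int) -> int:
    """Bank of a block from the address of the block two positions earlier
    and the bank of the immediately preceding block (e.g. A from Y and Z)."""
    candidate = bank_bits(addr_two_back)
    if candidate == prev_bank:
        candidate ^= 0b01   # (y6, y5 + 1): y5 + 1 taken modulo 2
    return candidate


def assign_banks(addrs: List[int], first_two: Tuple[int, int] = (0, 1)) -> List[int]:
    """Banks for a sequence of fetch blocks; first_two (hypothetical) seeds it."""
    assert first_two[0] != first_two[1]
    banks = list(first_two)
    for n in range(2, len(addrs)):
        banks.append(next_bank(addrs[n - 2], banks[n - 1]))
    return banks[:len(addrs)]
```

Only addresses from the previous cycle feed the computation, which is consistent with the bank number being computed one cycle ahead.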
Conflict-free bank-interleaved predictor (2)
Conflicts are avoided by construction
The bank number is computed one cycle ahead: not on the critical path
Single-ported bank-interleaved memory arrays!
« Logical view » vs real implementation
Logical view: 4 tables * 4 banks * 2 (pred. + hyst.) = 32 memory arrays; the indexing functions are computed, then the arrays are accessed
Real implementation: the 4 tables share a single array: 4 banks * 2 (pred. + hyst.) = 8 memory arrays
No time to lose: start the access and compute part of the index in parallel
Reading the branch prediction tables
Bank selection: 1 out of 4
(figure: Meta, G0, G1 and BIM tables within each bank)
Wordline selection: 1 out of 64
Column selection: 8 out of 256
Unshuffle: 8 to 8
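The read path above can be modeled for one logical table within one bank. A 64 x 256 slice holds 16K bits, which matches G0's 64K prediction bits spread over 4 banks. This sketch assumes the column selection picks one aligned group of 8 columns (5 index bits) and that the unshuffle is an XOR permutation driven by 3 index bits; both are illustrative assumptions about the circuit.

```python
from typing import List

WORDLINES = 64     # wordline selection: 1 out of 64
COLUMNS = 256      # column selection: 8 out of 256
GROUP = 8          # 8 predictions, one per instruction slot of the block


def read_block_predictions(bank: List[List[int]], wordline: int,
                           column_group: int, unshuffle: int) -> List[int]:
    """Read the 8 prediction bits for one fetch block from one bank slice.

    bank:         WORDLINES x COLUMNS bit matrix (one logical table's slice)
    wordline:     6-bit row number
    column_group: 5-bit choice of one of the 32 groups of 8 columns
    unshuffle:    3-bit value; instruction slot i receives bit (i XOR unshuffle)
    """
    row = bank[wordline]                                     # wordline selection
    start = column_group * GROUP
    selected = row[start:start + GROUP]                      # 8 out of 256
    return [selected[i ^ unshuffle] for i in range(GROUP)]   # 8-to-8 permutation
```

Under these assumptions, one table index is 6 + 5 + 3 = 14 bits (16K entries per bank), and the 3 unshuffle bits are needed last, which is why they can come from a deeper XOR tree.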
Reading the branch prediction tables (2)
Spans 5 cycle phases:
Cycle -1:
• bank number computation
• bank selection
Cycle 0:
• phase 0: wordline selection
• phase 1: column selection
Cycle 1:
• phase 0: unshuffle permutation
Constraints for index composition
Strong: wordline bits: immediately available, common to the four logical tables
Medium: column bits: a single 2-input XOR gate
Weak: unshuffle bits: near-complete freedom, a full tree of XOR gates if needed
Designing the indexing functions (1): 6 wordline bits
Must be available at the beginning of the cycle: block address bits, 3-block-old lghist bits, path bits
Tradeoff: address bits emphasize bimodal component behavior; lghist bits are more uniformly distributed
4 lghist bits + 2 address bits
Designing the indexing functions (2): column selection and unshuffle
Favor independence of the four indexing functions: if two (address, history) pairs conflict on one table, then try to avoid repeating the conflict on another table
Guarantee that for a single address, two histories that differ by only one or two bits will not map onto the same entry
Favor usage of the whole table: lghist bits are more uniformly distributed than address bits
XOR of 2 lghist bits for column bits; an XOR tree with up to 11 bits for unshuffle
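The constraints above can be illustrated with a hypothetical index function for a table using 13 lghist bits (like G0). Every bit selection below is invented for illustration; only the shape follows the slides: 6 wordline bits made of 4 raw lghist bits and 2 address bits, column bits that are each one 2-input XOR of lghist bits, and wider XOR trees for the 3 unshuffle bits. Because the index is XOR-linear in the history for a fixed address, the "one or two differing bits" guarantee can be checked by comparing every 1- or 2-bit mask against history 0.

```python
def g0_style_index(addr: int, hist: int) -> int:
    """Hypothetical 14-bit index: 6 wordline + 5 column + 3 unshuffle bits.

    Uses 13 lghist bits g[0..12] and 10 block-address bits a[0..9];
    the bit choices are illustrative, not the EV8 assignment.
    """
    a = [(addr >> i) & 1 for i in range(10)]
    g = [(hist >> i) & 1 for i in range(13)]
    # Wordline (strong constraint): raw bits only, available at cycle start.
    wordline = [g[0], g[1], g[2], g[3], a[0], a[1]]
    # Column (medium constraint): one 2-input XOR of lghist bits each.
    column = [g[4] ^ g[9], g[5] ^ g[10], g[6] ^ g[11], g[7] ^ g[12], g[8] ^ g[3]]
    # Unshuffle (weak constraint): wider XOR trees mixing address and history.
    unshuffle = [g[9] ^ g[12] ^ a[2] ^ a[3] ^ a[4],
                 g[10] ^ g[12] ^ a[5] ^ a[6] ^ a[7],
                 g[11] ^ a[8] ^ a[9] ^ a[2]]
    index = 0
    for b in wordline + column + unshuffle:
        index = (index << 1) | b
    return index


def one_or_two_bit_differences_separated(addr: int, hist_len: int = 13) -> bool:
    """For a fixed address the index is XOR-linear in the history, so two
    histories collide iff their XOR maps to the same index as history 0."""
    base = g0_style_index(addr, 0)
    for i in range(hist_len):
        for j in range(i, hist_len):
            mask = (1 << i) | (1 << j)   # i == j gives a single-bit mask
            if g0_style_index(addr, mask) == base:
                return False
    return True
```

The check passes here because each history bit touches a distinct, non-empty set of index bits; choosing those sets is the kind of engineering the slide refers to.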
EV8 branch predictor configuration
208 Kbits for prediction and 144 Kbits for hysteresis (prediction + hysteresis per table):
• «BIM»: 16 K + 16 K, 4 lghist bits (+ 3-block path)
• G0: 64 K + 32 K, 13 lghist bits
• G1: 64 K + 64 K, 21 lghist bits
• Meta: 64 K + 32 K, 17 lghist bits
4 prediction banks and 4 hysteresis banks
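The configuration figures above can be cross-checked with a few lines of arithmetic. The sketch only restates the listed sizes; the per-bank split assumes storage is divided evenly across the 4 prediction and 4 hysteresis banks.

```python
from typing import Dict, Tuple

# component: (prediction Kbits, hysteresis Kbits, lghist length), as listed above
EV8_CONFIG: Dict[str, Tuple[int, int, int]] = {
    "BIM": (16, 16, 4),
    "G0": (64, 32, 13),
    "G1": (64, 64, 21),
    "Meta": (64, 32, 17),
}


def totals(config: Dict[str, Tuple[int, int, int]]) -> Tuple[int, int]:
    """Total prediction and hysteresis storage, in Kbits."""
    pred = sum(p for p, _, _ in config.values())
    hyst = sum(h for _, h, _ in config.values())
    return pred, hyst


def sharing_ratio(config: Dict[str, Tuple[int, int, int]]) -> Dict[str, int]:
    """Prediction entries per hysteresis bit (2 means one shared hysteresis bit)."""
    return {name: p // h for name, (p, h, _) in config.items()}


pred_kbits, hyst_kbits = totals(EV8_CONFIG)
per_bank = (pred_kbits // 4, hyst_kbits // 4)   # assumed even split over banks
```

The totals reproduce the 208 Kbits and 144 Kbits on the slide, and the sharing ratio shows where hysteresis bits are shared: G0 and Meta use one hysteresis bit for every two prediction bits.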
Performance evaluation
Sorry, SPEC 95 :-)
Benchmark characteristics
Highly optimized SPECint 95:
• many more not-taken than taken branches
• ratio of lghist length to ghist length: from 1.12 to 1.59
• from 8.9 to 16.2 branches per 100 instructions
2Bc-gskew vs other global history predictors
(figure: mispredictions/KI, 0 to 12, on compress, gcc, go, ijpeg, li, m88ksim, perl and vortex; predictors: 512K 2Bc-gskew 0,17,20,27; 576K YAGS 25; 256K 2Bc-gskew 0,13,16,23; 288K YAGS 23; 544K bimode 20; 2M gshare)
Quality of information vector
(figure: mispredictions/KI, 0 to 12, on compress, gcc, go, ijpeg, li, m88ksim, perl and vortex; configurations: ghist (512K 2Bc-gskew), lghist without path, lghist with path, 3-block-old lghist, EV8 information vector)
Reducing some table sizes: no significant impact
(figure: mispredictions/KI, 0 to 12, on compress, gcc, go, ijpeg, li, m88ksim, perl and vortex; configurations: 4*64K 2Bc-gskew with ghist, small BIM, EV8 size)
Quality of indexing functions
(figure: mispredictions/KI, 0 to 12, on compress, gcc, go, ijpeg, li, m88ksim, perl and vortex; configurations: address only without path, address only with path, no path, EV8, EV8 + complete hash, 4*64K 2Bc-gskew with ghist)
Conclusion
Designing a real branch predictor leads to challenges ignored in most academic studies:
• 3-block-old history vector
• inability to maintain a complete history
• simultaneous accesses to the predictor
• minimization of the number of memory arrays
• timing constraints on the indexing functions
We overcame these difficulties and adapted a state-of-the-art academic branch predictor to real-world constraints.
Summary of the contributions
An efficient information vector can be built by mixing path and compressed history: don't focus on the info vector, use what is convenient!
Use of different table sizes and history lengths in the predictor
Sharing of hysteresis bits
Conflict-free parallel access scheme for the predictor
Engineering of indexing functions
Acknowledgements
To the whole EV8 design team
Special mention to:
Ta-chung Chang, George Chrysos, John Edmondson, Joel Emer, Tryggve Fossum, Glenn Giacalone, Balakrishnan Iyer, Manickavelu Balasubramanian, Harish Patil, George Tien and James Vash.