Register Write Specialization

Register Read Specialization

A path to complexity effective wide-issue superscalar processors

André Seznec, Eric Toullec, Olivier Rochecouste

IRISA / INRIA

AS-ET-OR, Caps Team

Irisa

Why design wide-issue superscalar processors?

SMT superscalar processors!

Doubling the issue width

Functional units:
• Silicon area: 2x
• Power consumption: 2x
• Same latency

Register file:
• Silicon area: > 8x
• Power consumption: > 4x
• Access time: 1.5x

Wake-up logic entries: monitor twice as many inputs
• area, consumption, response time

Bypass network: wider multiplexors >2x longer communications

An unwritten rule applied in all superscalar processor designs

For general purpose registers:

Any physical register can be the source or the result of any instruction executed on any functional unit

The register file issue

Silicon area for the physical register file

Conventional clustered design

[Figure: clusters C0, C1, C2, C3; register file]

Distributed register file

[Figure: clusters C0, C1, C2, C3]

Local register file: shorter read access time but larger silicon area

8-way against 4-way (100 nm, 5 GHz)

8-way distributed register file, 4 identical copies:
• 14.5 W (x 4.5)
• 4 cycles (+1)
• 256 x 1792 w² x W (x 11)

8-way monolithic register file:
• 16 W (x 5)
• 5 cycles (+2)
• 256 x 1120 w² x W (x 8)

4-way distributed register file, 2 identical copies:
• 3.1 W
• 3 cycles
• 128 x 320 w² x W
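The per-bit area figures in this comparison (1120, 1792 and 320 w²) can be reproduced with a simple port-count model. The sketch below rests on assumptions that are not stated on the slides: the area of one register cell with R read ports and W write ports is taken to be (R + W)(R + 2W) w², and each 2-way cluster is assumed to read 4 operands and write 3 results per cycle. The function names are hypothetical, and this is not necessarily the model behind the slides' numbers.

```python
def cell_area(read_ports: int, write_ports: int) -> int:
    """Area of one register bit cell in units of w^2 (assumed model)."""
    return (read_ports + write_ports) * (read_ports + 2 * write_ports)


def register_bit_area(copies: int, read_ports: int, write_ports: int) -> int:
    """Area per register bit when the register file is replicated `copies` times."""
    return copies * cell_area(read_ports, write_ports)


# Assumed port counts: 4 reads and 3 writes per 2-way cluster.
# Monolithic 8-way: one copy serving all 16 reads and 12 writes.
monolithic_8way = register_bit_area(copies=1, read_ports=16, write_ports=12)
# Distributed 8-way: one copy per cluster, local reads only, all 12 writes.
distributed_8way = register_bit_area(copies=4, read_ports=4, write_ports=12)
# Distributed 4-way: two clusters, so two copies and 6 writes.
distributed_4way = register_bit_area(copies=2, read_ports=4, write_ports=6)

print(monolithic_8way, distributed_8way, distributed_4way)
```

Under these assumptions, replication shortens the read path of each copy but every copy still carries all write ports, which is why the distributed file ends up larger than the monolithic one.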

Let us reduce the number of ports on each individual register


[Figure: clusters C0, C1, C2, C3 and register subsets S0, S1, S2, S3]

Distributed Register File and Register Write Specialization

[Figure: clusters C0, C1, C2, C3]


Each cluster writes only a subset of the registers

Fewer write ports on every individual physical register

But allocation to clusters must precede register renaming

4-cluster 8-way distributed register file, 512 entries:
• 320 x w² per register bit
• 3 cycles access time
• 8.5 W

Register Write Specialization and Register Renaming

1: Op R6, R7 -> R5
2: Op R2, R5 -> R6
3: Op R6, R3 -> R4
4: Op R4, R6 -> R2

4 free odd registers
4 free even registers

4-bit subset target vector

1: Op L6, L7 -> res1
2: Op L2, res1 -> res2
3: Op res2, L3 -> res3
4: Op res3, res2 -> res4

4 new free registers + old map table

1: Op P6, P7 -> RES1
2: Op P2, RES1 -> RES2
3: Op RES2, P3 -> RES3
4: Op RES3, RES2 -> RES4

New map table
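The renaming example on this slide can be read as the procedure sketched below. It is a minimal sketch under stated assumptions: two register subsets (even and odd physical registers), a hypothetical initial mapping of logical register Rn to physical register n, and an arbitrary subset target vector chosen for illustration. The function name is hypothetical, and the sequential loop only stands in for what hardware would do in parallel for the whole group.

```python
def rename_group(instrs, map_table, free_lists, target_vector):
    """Rename one group of instructions under register write specialization.

    instrs:        list of (src1, src2, dest) logical register numbers
    map_table:     dict logical -> physical register, updated in place
    free_lists:    one list of free physical registers per subset
    target_vector: subset receiving the result of each instruction
    """
    n = len(instrs)
    # Reserve n free registers from every subset up front (worst case).
    reserved = [[free.pop(0) for _ in range(n)] for free in free_lists]
    renamed = []
    for (src1, src2, dest), subset in zip(instrs, target_vector):
        p1, p2 = map_table[src1], map_table[src2]  # sees earlier renames in the group
        pd = reserved[subset].pop(0)               # destination comes from the target subset
        map_table[dest] = pd
        renamed.append((p1, p2, pd))
    # Reserved but unused registers have to be recycled into the free lists.
    leftovers = [reg for pool in reserved for reg in pool]
    return renamed, leftovers


# The four instructions of the slide, as (src1, src2, dest).
instrs = [(6, 7, 5), (2, 5, 6), (6, 3, 4), (4, 6, 2)]
map_table = {r: r for r in range(8)}                # assumed initial mapping
free_lists = [[8, 10, 12, 14], [9, 11, 13, 15]]     # subset 0: even, subset 1: odd
target_vector = [1, 0, 0, 1]                        # illustrative choice
renamed, leftovers = rename_group(instrs, map_table, free_lists, target_vector)
```

Half of the reserved registers are left over in this run, which illustrates why this scheme consumes registers quickly.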

Register Write Specialization and Register Renaming (2)

Consumes a lot of registers: need for recycling

1: build two lists of registers to be recycled
2: pack both lists
3: concatenate the two lists
4: append to the free list

Register Write Specialization and Register Renaming (3)

An alternative:
• Compute the number of registers needed in each register subset
• Pick the right number of registers from each of the free lists
• No need for recycling registers

Think about round-robin distribution!
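The alternative can be sketched as follows: count how many results go to each subset, then take exactly that many registers from each free list, so nothing is left to recycle. This is a hedged illustration, assuming four subsets, a hypothetical initial mapping of logical register Rn to physical register n, and made-up free lists; all names are hypothetical. With round-robin distribution the per-subset counts are known in advance, which is presumably the point of the remark above.

```python
from collections import Counter


def round_robin_targets(group_size, num_subsets, start=0):
    """Subset receiving each result when instructions are spread round-robin."""
    return [(start + k) % num_subsets for k in range(group_size)]


def rename_exact(instrs, map_table, free_lists, target_vector):
    """Rename a group, taking only the registers actually needed per subset."""
    needed = Counter(target_vector)  # number of results per subset
    picked = {s: [free_lists[s].pop(0) for _ in range(count)]
              for s, count in needed.items()}
    renamed = []
    for (src1, src2, dest), subset in zip(instrs, target_vector):
        p1, p2 = map_table[src1], map_table[src2]
        pd = picked[subset].pop(0)
        map_table[dest] = pd
        renamed.append((p1, p2, pd))
    return renamed


instrs = [(6, 7, 5), (2, 5, 6), (6, 3, 4), (4, 6, 2)]    # (src1, src2, dest)
map_table = {r: r for r in range(8)}                     # assumed initial mapping
free_lists = [[8, 12], [9, 13], [10, 14], [11, 15]]      # illustrative free lists
targets = round_robin_targets(len(instrs), num_subsets=4)
renamed = rename_exact(instrs, map_table, free_lists, targets)
```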

Performance issues

Register Write Specialization only, round-robin allocation:
• no extra stage for register renaming
• shorter register access time

Overall shorter pipeline: slightly better performance

Register Read Specialization

[Figure: clusters C0, C1, C2, C3 and register subsets S0, S1]

Register Read Specialization

Limits number of read ports on each individual register

Puts strong constraints on allocation of instructions to clusters

Caution:

Personal opinion: don’t use it alone !

Interconnection topology must ensure that every instruction is executable

WSRS architectures

Combining Register Read Specialization and Register Write Specialization


4-cluster WSRS architecture

[Figure: clusters C0, C1, C2, C3 and register subsets S0, S1, S2, S3]

The positions of an instruction's operands determine the execution cluster

4-cluster WSRS architecture: allocating instructions to clusters

[Figure: clusters C0, C1, C2, C3 and register subsets S0, S1, S2, S3]

Op: R6, R7 -> R5
S1, S2 -> S0

First operand determines the top or bottom bicluster

Second operand determines the left or right bicluster

4-cluster WSRS architecture: allocating instructions to clusters (2)

$I: S_i, S_j \rightarrow S_k$

$i = 2 i_1 + i_0$

$j = 2 j_1 + j_0$

$k = 2 i_1 + j_0$

Op: R6, R7 -> R5
S1, S2 -> S0

The computations of the two bits are independent :-)
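The allocation rule can be written as a two-bit computation. The sketch below is one reading of the slides, labeled as an assumption: the top or bottom choice is taken to be the high bit of the first operand's subset index, and the left or right choice the low bit of the second operand's subset index. That reading agrees with the slide's example (operands in S1 and S2, result in S0). The function name is hypothetical.

```python
def execution_cluster(first_subset: int, second_subset: int) -> int:
    """Cluster (and result subset) index for a dyadic instruction.

    Assumed reading of the slides: the two bits of the cluster index are
    computed independently, one from each operand.
    """
    top_or_bottom = (first_subset >> 1) & 1  # from the first operand only
    left_or_right = second_subset & 1        # from the second operand only
    return 2 * top_or_bottom + left_or_right


# Slide example: first operand in S1, second operand in S2.
example = execution_cluster(1, 2)

# Every (first, second) subset pair maps to exactly one cluster.
table = {(i, j): execution_cluster(i, j) for i in range(4) for j in range(4)}
```

Because each bit depends on a single operand, the two lookups can proceed in parallel, and the sixteen operand placements spread evenly over the four clusters.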

4-cluster 8-way WSRS architecture: the register file

Each individual physical register:
• 4 identical copies of (2-read, 3-write) registers
• 8x smaller than conventional monolithic approach
• 12.8x smaller than conventional distributed approach

WSRS, 512 registers: 6.25 W, 3 cycles

Conventional, 256 registers: (16 W, 5 cycles) or (14.5 W, 4 cycles)

4-cluster 8-way WSRS architecture: the wake-up logic

The wake-up logic monitors all possible sources for each operand:
• FUs from only two clusters are possible sources
• only 6 possible sources!

8-way WSRS architecture wake-up logic entry complexity = 4-way issue wake-up logic entry complexity

4-cluster 8-way WSRS architecture: bypass network

Possible sources for each operand: FUs from only two clusters are possible sources

Bypass point: (pipeline length) x (possible FU sources) + register file

• 8-way distributed: 4 cycles, 49 pos. op.
• WSRS: 3 cycles, 19 pos. op.
• 8-way monolithic: 5 cycles, 61 pos. op.
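The three counts follow from the bypass point formula on this slide. The short check below assumes 3 results per cluster per cycle, which is consistent with the 6 possible sources quoted for two clusters and gives 12 sources for the four clusters of a conventional 8-way design; the function name and constants are hypothetical.

```python
RESULTS_PER_CLUSTER = 3  # assumption: 3 results written per cluster per cycle


def bypass_inputs(pipeline_length: int, fu_sources: int) -> int:
    """(pipeline length) x (possible FU sources) + 1 for the register file."""
    return pipeline_length * fu_sources + 1


conventional_sources = 4 * RESULTS_PER_CLUSTER  # any of the 4 clusters
wsrs_sources = 2 * RESULTS_PER_CLUSTER          # only 2 clusters per operand

distributed_8way = bypass_inputs(4, conventional_sources)
monolithic_8way = bypass_inputs(5, conventional_sources)
wsrs = bypass_inputs(3, wsrs_sources)

print(distributed_8way, monolithic_8way, wsrs)
```

Both factors shrink under WSRS: fewer FU sources per operand and a shorter register access pipeline.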

4-cluster WSRS architecture: fast-forwarding

Local fast-forwarding inside a single cluster: 2 out of 4 consumers are reached on the next cycle

Partial fast-forwarding inside a pair of adjacent clusters: 3 out of 4 consumers are reached on the next cycle!

Complete fast-forwarding: consumer is close: may be possible to implement!
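One way to read the "2 out of 4" and "3 out of 4" figures is to count, for a result written into subset Sk, the (cluster, operand position) pairs that can read it. The sketch below assumes the allocation rule suggested by the earlier slides (first operand selects the top or bottom pair of clusters through the high bit of its subset index, second operand selects left or right through the low bit) and assumes that cluster Ck writes subset Sk. Names are hypothetical.

```python
def readers(subset: int):
    """(cluster, operand position) pairs that can read a register of `subset`."""
    high, low = subset >> 1, subset & 1
    as_first = [(2 * high + column, "first") for column in (0, 1)]
    as_second = [(2 * row + low, "second") for row in (0, 1)]
    return as_first + as_second


def reached(producer: int, fast_clusters) -> int:
    """Consumers of the producer's results that sit in clusters with a fast path."""
    return sum(1 for cluster, _ in readers(producer) if cluster in fast_clusters)


local = [reached(k, {k}) for k in range(4)]               # inside a single cluster
pair_row = [reached(k, {k, k ^ 1}) for k in range(4)]     # with the row neighbour
pair_col = [reached(k, {k, k ^ 2}) for k in range(4)]     # with the column neighbour
complete = [reached(k, {k, k ^ 1, k ^ 2}) for k in range(4)]
```

Under these assumptions every result has four possible consumers spread over three clusters, so forwarding to two neighbours in addition to the local cluster covers all of them.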

4-cluster WSRS architecture: nothing is entirely free!

Strong constraint on allocation of instructions to clusters: the cluster executing a dyadic instruction depends on the position of its operands in the register subsets.

Degrees of freedom:
• Monadic instructions can be executed on two clusters
• One out of two commutative dyadic instructions can be executed on two clusters
• Design clusters able to execute instructions in two forms?
  • A-B and -B + A
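These degrees of freedom can be sketched as a set of candidate clusters per instruction, from which an allocation policy picks one. The code below is an illustration under assumptions: the cluster index takes its high bit from the first operand's subset and its low bit from the second operand's subset, a monadic instruction is routed through the first operand position only, and the least-loaded policy is a made-up example rather than a policy from the slides. All names are hypothetical.

```python
def cluster_for(first_subset: int, second_subset: int) -> int:
    """Cluster executing a dyadic instruction (assumed allocation rule)."""
    return 2 * (first_subset >> 1) + (second_subset & 1)


def candidate_clusters(kind: str, subsets) -> set:
    """Clusters able to execute an instruction whose operands sit in `subsets`."""
    if kind == "monadic":
        (subset,) = subsets
        row = subset >> 1
        return {2 * row, 2 * row + 1}           # free choice of left or right
    first, second = subsets
    candidates = {cluster_for(first, second)}
    if kind == "commutative":
        candidates.add(cluster_for(second, first))  # operands may be swapped
    return candidates


def pick_cluster(candidates: set, load) -> int:
    """Illustrative policy: least-loaded candidate, lowest index on ties."""
    return min(sorted(candidates), key=lambda cluster: load[cluster])


monadic = candidate_clusters("monadic", (0,))
commutative = candidate_clusters("commutative", (0, 2))
fixed = candidate_clusters("dyadic", (1, 2))
choice = pick_cluster(commutative, load=[5, 0, 1, 0])
```

A cluster able to compute both A-B and -B + A would let non-commutative instructions use the swapped candidate as well.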

Using monadic instructions for load balancing

[Figure: clusters C0, C1, C2, C3 and register subsets S0, S1, S2, S3; annotation: S0 or S1]

Commutativity for load balancing

[Figure: clusters C0, C1, C2, C3 and register subsets S0, S1, S2, S3; annotation: S0 op S2]

4-cluster WSRS architecture: nothing comes for free (2)

Extra free lists and associated logic

Extra pipeline stage(s):
• Instructions must be allocated to clusters before the last step in register renaming: + 3 cycles
• But shorter register access time: - 2 cycles

Performance issues on 4-way WSRS architectures

Workload may be unbalanced among the clusters: use of the degrees of freedom
• monadic instructions
• « commutative » clusters

Higher probability of local consumption of a register

Naive allocation policies on WSRS compete favorably with naive policies on a conventional architecture

Summary

Register Write Specialization:
• limits the number of write ports on each physical register
• leads naturally to the use of a distributed register file
• masters power consumption, silicon area and access time

But: some extra complexity in register renaming

Summary (2)

Register Write Specialization + Register Read Specialization:
• further limits the number of ports on each physical register
• masters power consumption, silicon area and access time

Side effects:
• mastering wake-up logic and bypass network complexity

But constrains instruction allocation to clusters

Future work

Intelligent instruction allocation policies

Exploration of other possible interconnections

Use of heterogeneous clusters

SMT mode
