Register Write Specialization, Register Read Specialization: A path to complexity-effective wide-issue superscalar processors
André Seznec, Eric Toullec, Olivier Rochecouste
IRISA/INRIA



AS-ET-ORCaps Team

Irisa

Why design wide-issue superscalar processors?

SMT superscalar processors!


Doubling the issue width

Functional units:
• Silicon area: 2x
• Power consumption: 2x
• Same latency

Register file:
• Silicon area: > 8x
• Power consumption: > 4x
• Access time: 1.5x

Wake-up logic entries:
• each entry monitors twice as many inputs
• impact on area, consumption, response time

Bypass network:
• wider multiplexors; >2x longer communications


An unwritten rule applied in all superscalar processor designs

For general-purpose registers:

Any physical register can be the source or the result of any instruction executed on any functional unit


The register file issue


Silicon area for the physical register file


Conventional clustered design

[Figure: clusters C0, C1, C2, C3 and a register file]


Distributed register file

[Figure: clusters C0, C1, C2, C3]

Local register file: shorter read access time but larger silicon area


8-way against 4-way (100 nm, 5 GHz)

Register file        Copies               Power            Access time     Silicon area
8-way distributed    4 identical copies   14.5 W (x 4.5)   4 cycles (+1)   256 x 1792 w² x W (x 11)
8-way monolithic     -                    16 W (x 5)       5 cycles (+2)   256 x 1120 w² x W (x 8)
4-way distributed    2 identical copies   3.1 W            3 cycles        128 x 320 w² x W
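The per-bit area figures in this comparison (1120, 1792 and 320 w²) can be reproduced with a simple port-count model. The sketch below is only an illustration: the cell-area formula, (read + write) x (read + 2 x write) in units of w², and the port counts per copy are assumptions that are not stated on the slides. They were chosen because they yield the three per-bit figures quoted in the comparison.

```python
def cell_area_w2(read_ports: int, write_ports: int) -> int:
    """Area of one register bit cell in units of w^2 (assumed first-order model)."""
    return (read_ports + write_ports) * (read_ports + 2 * write_ports)


def area_per_bit_w2(copies: int, read_ports: int, write_ports: int) -> int:
    """Total area per register bit when the file is replicated `copies` times."""
    return copies * cell_area_w2(read_ports, write_ports)


# Hypothetical port counts per copy (not given on the slides).
MONOLITHIC_8WAY = area_per_bit_w2(copies=1, read_ports=16, write_ports=12)
DISTRIBUTED_8WAY = area_per_bit_w2(copies=4, read_ports=4, write_ports=12)
DISTRIBUTED_4WAY = area_per_bit_w2(copies=2, read_ports=4, write_ports=6)

if __name__ == "__main__":
    print(MONOLITHIC_8WAY, DISTRIBUTED_8WAY, DISTRIBUTED_4WAY)  # 1120 1792 320
```

Under this model, replicating the file shortens the read path of each copy but multiplies the total area, because every copy still needs all the write ports.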


Let us reduce the number of ports on each individual register


Register Write Specialization

[Figure: clusters C0, C1, C2, C3 and register subsets S0, S1, S2, S3]


Distributed Register File and Register Write Specialization

[Figure: clusters C0, C1, C2, C3]


Register Write Specialization

Each cluster writes only a subset of the registers

Fewer write ports on every individual physical register

But allocation to clusters must precede register renaming

4-cluster 8-way distributed register file, 512 entries:
• 320 x w² per register bit
• 3 cycles access time
• 8.5 W


Register Write Specialization and Register Renaming

Instructions to rename:
1: Op R6, R7 -> R5
2: Op R2, R5 -> R6
3: Op R6, R3 -> R4
4: Op R4, R6 -> R2

4 free odd registers + 4 free even registers, selected by a 4-bit subset target vector -> 4 new free registers

Intra-group dependencies resolved:
1: Op L6, L7 -> res1
2: Op L2, res1 -> res2
3: Op res2, L3 -> res3
4: Op res3, res2 -> res4

+ Old map table:
1: Op P6, P7 -> RES1
2: Op P2, RES1 -> RES2
3: Op RES2, P3 -> RES3
4: Op RES3, RES2 -> RES4

-> New map table
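The register-picking step of this example can be sketched as follows. This is a minimal illustration, assuming two subsets (0 for even registers, 1 for odd registers) and one free list per subset; the function name `rename_group` and the data layout are hypothetical and not taken from the slides. One register is picked from every subset for every instruction slot, and the subset target vector then selects the register each slot actually uses.

```python
from collections import deque


def rename_group(free_lists, targets):
    """Over-pick one register per subset for every slot, then select per slot.

    free_lists: dict mapping subset id -> deque of free physical registers.
    targets:    subset id required by each instruction of the group
                (the "subset target vector").
    Returns (chosen, unused): the destination register of each instruction,
    and, per subset, the picked registers with None where the slot was used.
    """
    width = len(targets)
    picked = {s: [fl.popleft() for _ in range(width)] for s, fl in free_lists.items()}
    chosen = [picked[s][slot] for slot, s in enumerate(targets)]
    unused = {
        s: [None if targets[slot] == s else reg for slot, reg in enumerate(regs)]
        for s, regs in picked.items()
    }
    return chosen, unused


if __name__ == "__main__":
    free = {0: deque([0, 2, 4, 6, 8]), 1: deque([1, 3, 5, 7, 9])}
    chosen, unused = rename_group(free, targets=[1, 0, 0, 1])
    print(chosen)   # [1, 2, 4, 7]
    print(unused)   # {0: [0, None, None, 6], 1: [None, 3, 5, None]}
```

Half of the picked registers go unused in this two-subset example, which is why the unused ones have to be returned to the free lists.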


Register Write Specialization and Register Renaming (2)

Consumes a lot of registers: need for recycling

1: build two lists of registers to be recycled
2: pack both lists
3: concatenate the two lists
4: append to the free list
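The four recycling steps can be sketched as below. This is an assumption-laden illustration: the slides do not say what the two lists contain or how they are encoded, so the sketch treats each as a sparse vector in which `None` marks an empty slot, and the names `pack` and `recycle` are hypothetical.

```python
from collections import deque


def pack(sparse):
    """Step 2: compact a sparse list by dropping its empty (None) slots."""
    return [reg for reg in sparse if reg is not None]


def recycle(free_list, list_a, list_b):
    """Steps 3 and 4: concatenate the packed lists and append them to the free list."""
    free_list.extend(pack(list_a) + pack(list_b))
    return free_list


if __name__ == "__main__":
    free = deque([8])
    # Step 1 (building the two lists) is assumed to have produced these vectors.
    recycle(free, [0, None, None, 6], [None, 12, None, None])
    print(list(free))  # [8, 0, 6, 12]
```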


Register Write Specialization and Register Renaming (3)

An alternative:
• Compute the number of registers needed in each register subset
• Pick the right number of registers from each of the free lists
• No need for recycling registers

Think about round-robin distribution!
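A minimal sketch of this alternative, assuming one free list per subset: the group's demand is counted per subset and exactly that many registers are taken, so nothing is left to recycle. The function names are hypothetical. With a round-robin distribution of results over the subsets, the per-subset counts are known in advance.

```python
from collections import Counter, deque


def pick_exact(free_lists, targets):
    """Take from each subset's free list exactly as many registers as needed."""
    needed = Counter(targets)
    for subset, count in needed.items():
        if len(free_lists[subset]) < count:
            raise RuntimeError(f"subset {subset} has too few free registers")
    # Popping in instruction order consumes exactly needed[s] entries of list s.
    return [free_lists[s].popleft() for s in targets]


def round_robin_targets(width, n_subsets, start=0):
    """Subset target vector for a round-robin distribution over the subsets."""
    return [(start + slot) % n_subsets for slot in range(width)]


if __name__ == "__main__":
    free = {0: deque([0, 2, 4, 6]), 1: deque([1, 3, 5, 7])}
    print(pick_exact(free, [1, 0, 0, 1]))   # [1, 0, 2, 3]
    print(round_robin_targets(8, 4))        # [0, 1, 2, 3, 0, 1, 2, 3]
```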


Performance issues

Register Write Specialization only, with round-robin allocation:
• no extra stage for register renaming
• shorter register access time

Overall shorter pipeline: slightly better performance


Register Read Specialization

[Figure: clusters C0, C1, C2, C3 and register subsets S0, S1]


Register Read Specialization

Limits number of read ports on each individual register

Puts strong constraints on allocation of instructions to clusters

Caution: the interconnection topology must ensure that every instruction is executable

Personal opinion: don’t use it alone!


WSRS architectures

Combining Register Read Specialization and Register Write Specialization


4-cluster WSRS architecture

[Figure: clusters C0, C1, C2, C3 and register subsets S0, S1, S2, S3]

The positions of an instruction's operands determine the execution cluster


4-cluster WSRS architecture: allocating instructions to clusters

[Figure: clusters C0, C1, C2, C3 and register subsets S0, S1, S2, S3]

Op: R6, R7 -> R5 (subsets: S1, S2 -> S0)

The first operand determines the top or bottom bicluster

The second operand determines the left or right bicluster


4-cluster WSRS architecture: allocating instructions to clusters (2)

For an instruction I: S_i, S_j -> S_k, with

$i = 2 i_1 + i_0$
$j = 2 j_1 + j_0$
$k = 2 i_1 + j_0$

Op: R6, R7 -> R5 (subsets: S1, S2 -> S0)

The computations of the two bits are independent :-)
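The allocation rule can be sketched in a few lines. This is an illustration under one assumption: the slide's formula is read as k = 2·i1 + j0, where i1 is the high bit of the first operand's subset index and j0 is the low bit of the second operand's subset index. The function name `execution_cluster` is hypothetical.

```python
def execution_cluster(first_subset: int, second_subset: int) -> int:
    """Cluster (and result subset) index for operands in S_i and S_j.

    Assumed rule: the high bit comes from the first operand's subset
    (top or bottom bicluster), the low bit from the second operand's
    subset (left or right bicluster). The two bits are independent.
    """
    high_bit = first_subset & 0b10   # i1, kept in bit position 1
    low_bit = second_subset & 0b01   # j0, kept in bit position 0
    return high_bit | low_bit


if __name__ == "__main__":
    # Slide example: operands in S1 and S2, result in S0.
    print(execution_cluster(1, 2))  # 0
```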


4-cluster 8-way WSRS architecture: the register file

Each individual physical register:
• 4 identical copies of (2-read, 3-write) registers
• 8x smaller than conventional monolithic approach
• 12.8x smaller than conventional distributed approach

WSRS, 512 registers: 6.25 W, 3 cycles

Conventional, 256 registers: (16 W, 5 cycles) or (14.5 W, 4 cycles)


4-cluster 8-way WSRS architecture: the wake-up logic

The wake-up logic monitors all possible sources for each operand:
• FUs from only two clusters are possible sources
• only 6 possible sources!

8-way WSRS architecture wake-up logic entry complexity = 4-way issue wake-up logic entry complexity


4-cluster 8-way WSRS architecture: bypass network

Possible sources for each operand: FUs from only two clusters are possible sources

Bypass points: (pipeline length) x (possible FU sources) + register file

Design               Cycles   pos. op.
8-way distributed    4        49
8-way monolithic     5        61
WSRS                 3        19
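The three counts follow from the bypass-point expression on this slide. In the sketch below, the figure of 12 possible FU sources for the conventional 8-way designs is an assumption, inferred from the 6 sources quoted for two clusters out of four; the register file counts as one additional source.

```python
def bypass_points(cycles: int, fu_sources: int) -> int:
    """(pipeline length) x (possible FU sources) + 1 for the register file."""
    return cycles * fu_sources + 1


# Assumed: 12 FU sources when all four clusters can feed an operand,
# 6 when only two clusters can (WSRS).
DISTRIBUTED_8WAY = bypass_points(cycles=4, fu_sources=12)
MONOLITHIC_8WAY = bypass_points(cycles=5, fu_sources=12)
WSRS = bypass_points(cycles=3, fu_sources=6)

if __name__ == "__main__":
    print(DISTRIBUTED_8WAY, MONOLITHIC_8WAY, WSRS)  # 49 61 19
```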


4-cluster WSRS architecture: fast-forwarding

Local fast-forwarding inside a single cluster: 2 out of 4 consumers are reached on the next cycle

Partial fast-forwarding inside a pair of adjacent clusters: 3 out of 4 consumers are reached on the next cycle!

Complete fast-forwarding: the consumer is close, so it may be possible to implement!
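The 2-out-of-4 and 3-out-of-4 figures can be checked by enumeration under two assumptions that the slides do not spell out: a "consumer" is taken to be a (cluster, operand position) pair that can read the produced register, and cluster C_c reads its first operand from subsets sharing its high index bit and its second operand from subsets sharing its low index bit. The function names are hypothetical.

```python
def consumers(k: int):
    """(cluster, operand position) pairs able to read a register of subset S_k."""
    first = [(c, "first") for c in range(4) if (c >> 1) == (k >> 1)]
    second = [(c, "second") for c in range(4) if (c & 1) == (k & 1)]
    return first + second


def reached(k: int, clusters) -> int:
    """Number of consumers of S_k located in the given set of clusters."""
    return sum(1 for cluster, _ in consumers(k) if cluster in clusters)


if __name__ == "__main__":
    # A result produced by cluster C0 is written into subset S0.
    print(len(consumers(0)))    # 4
    print(reached(0, {0}))      # 2 (local fast-forwarding)
    print(reached(0, {0, 1}))   # 3 (pair of adjacent clusters)
```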


4-cluster WSRS architecture: nothing is entirely free!

Strong constraint on allocation of instructions to clusters:
• The cluster executing a dyadic instruction depends on the position of its operands in the register subsets.

Degrees of freedom:
• Monadic instructions can be executed on two clusters
• One out of two commutative dyadic instructions can be executed on two clusters
• Design clusters able to execute instructions in two forms?
  A - B and -B + A
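These degrees of freedom can be sketched as candidate-cluster sets. The sketch assumes the allocation rule k = 2·i1 + j0 (high bit from the first operand's subset, low bit from the second's), and further assumes that a monadic instruction reads its single operand through the first-operand port, which leaves the left or right choice free. All function names are hypothetical.

```python
def execution_cluster(first_subset: int, second_subset: int) -> int:
    """Assumed rule: high bit from the first operand, low bit from the second."""
    return (first_subset & 0b10) | (second_subset & 0b01)


def monadic_candidates(subset: int):
    """Clusters for a one-operand instruction (operand on the first port, assumed)."""
    return sorted({execution_cluster(subset, free_bit) for free_bit in (0, 1)})


def commutative_candidates(first_subset: int, second_subset: int):
    """Clusters for a dyadic instruction whose operands may be swapped."""
    return sorted({
        execution_cluster(first_subset, second_subset),
        execution_cluster(second_subset, first_subset),
    })


if __name__ == "__main__":
    print(monadic_candidates(0))          # [0, 1]
    print(commutative_candidates(0, 2))   # [0, 2]
    print(commutative_candidates(1, 1))   # [1]
```

A cluster able to compute both A - B and -B + A would let non-commutative instructions use `commutative_candidates` as well.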


Using monadic instructions for load balancing

[Figure: clusters C0, C1, C2, C3 and register subsets S0, S1, S2, S3; annotation: S0 or S1]


Commutativity for load balancing

[Figure: clusters C0, C1, C2, C3 and register subsets S0, S1, S2, S3; annotation: S0 op S2]


4-cluster WSRS architecture: nothing comes for free (2)

Extra free lists and associated logic

Extra pipeline stage(s):
• Instructions must be allocated to clusters before the last step in register renaming: + 3 cycles
• But shorter register access time: - 2 cycles


Performance issues on 4-way WSRS architectures

Workload may be unbalanced among the clusters: use of the degrees of freedom
• monadic instructions
• « commutative » clusters

Higher probability of local consumption of a register

Naive allocation policies on WSRS compete favorably with naive policies on a conventional architecture


Summary

Register Write Specialization:
• limits the number of write ports on each physical register
• leads naturally to the use of a distributed register file
• keeps power consumption, silicon area and access time under control

But:
• some extra complexity in register renaming


Summary (2)

Register Write Specialization + Register Read Specialization:
• further limits the number of ports on each physical register
• keeps power consumption, silicon area and access time under control

Side effects:
• keeps wake-up logic and bypass network complexity under control

But it constrains instruction allocation to clusters


Future work

Intelligent instruction allocation policies

Exploration of other possible interconnections

Use of heterogeneous clusters

SMT mode