
Hardware acceleration of sequential loops

ALF team INRIA Rennes – Bretagne Atlantique

December 12, 2011

Pierre Michaud


Outline

1.  Sequential accelerators

2.  Hardware acceleration of sequential loops


Monster core

•  Let’s assume we are given a large silicon area and power budget and we seek to maximize single-thread performance

•  What silicon area and power would we need to double the performance of a 4-wide superscalar core?
 –  4 times the area and power of a 4-wide superscalar?
 –  8 times?

•  What about a monster core 128 times bigger than a 4-wide superscalar?


The multi-core era


Small single-thread performance improvement

2005: superscalar ➔ 2009: 4 cores


The multi-core era


2013: 16 cores


The many-core era


2017: 64 cores?


The many-core era


2021: 256 cores?


What do we choose for 2021?


Do we prefer to double parallel performance (maybe) or to double sequential performance (the monster core)?


How do we build a monster core?

•  8-wide superscalar?
 –  Probably the first step, but that does not make a monster core

•  16-wide superscalar?
 –  Never implemented so far
 –  Can we increase the IPC without impacting the clock cycle?

•  Huge NUCA cache?

•  What else?


Sequential accelerator

•  In the remainder, “core” = “superscalar core”

•  A sequential accelerator may consist of several physical cores

•  A sequential accelerator is specialized for sequential execution


Monster core = sequential accelerator (“sequential accelerator” being the better term)


Proposition A

•  Build a very aggressive superscalar core –  8-wide, high clock frequency

•  Replicate that core (e.g., 8 times)

•  Use a single core at a time, power-gate the unused ones

•  Migrate the execution periodically to maintain temperature at a safe level
 –  P. Michaud, Y. Sazeides, A. Seznec, “Proposition for a sequential accelerator in future general-purpose many-core processors and the problem of migration-induced cache misses”, Computing Frontiers 2010


Proposition B

•  Hardware acceleration of sequential loops

•  The focus of this presentation

–  P. Michaud, “Hardware acceleration of sequential loops”, INRIA technical report, 2011.

–  http://hal.inria.fr/hal-00641350


Hardware acceleration of sequential loops


Dynamic loop

•  dynamic loop = periodic sequence of dynamic instructions

•  Dynamic loop ≠ program loop

–  A program loop does not always generate a dynamic loop
    •  control-flow behavior inside the loop body may be more or less random
    •  e.g., “if” statements inside the loop body

–  A dynamic loop may result from a recursive function


Idea

•  Why not build a hardware accelerator for dynamic loops?

–  Loop detector = hardware mechanism for detecting dynamic loops

–  When a dynamic loop is encountered, migrate the execution to the accelerator

–  When a loop exit condition occurs (e.g., a branch changes direction), migrate the execution back to the superscalar core


Let’s kill the suspense


[Chart: percentage of performance increase over baseline.]


Definitions

•  loop length = sequence length (in instructions)

•  loop body size B = smallest period

•  loop body = first B instructions


ABCDE ABCDE ABCDE ABCDE ➔ body = ABCDE, length = 20
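To make “smallest period” concrete, here is a minimal Python sketch (a hypothetical helper; the real detector works incrementally on the retired-instruction stream rather than scanning a stored trace):

```python
def body_size(pcs, max_body=128):
    """Return the smallest period B such that pcs[i] == pcs[i + B]
    for all valid i, or None if no period <= max_body exists."""
    n = len(pcs)
    for b in range(1, min(max_body, n) + 1):
        if all(pcs[i] == pcs[i + b] for i in range(n - b)):
            return b
    return None

trace = list("ABCDE" * 4)      # the slide's example: length 20
assert body_size(trace) == 5   # body = "ABCDE"
```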


Some questions

•  What fraction of their time do programs spend in dynamic loops?

•  Are dynamic loops long enough to amortize the penalty of migrating the execution to and from the accelerator?

•  What is the typical loop body size?

•  ➔ First, let’s define the loop detector


Loop detector

•  Loop detector = loop monitor + loop table

•  Loop monitor
 –  Observes instructions as they retire from the processor reorder buffer
 –  Detects long dynamic loops
    •  when the loop length exceeds 900, count subsequent instructions as loop instructions, until the periodic behavior ends

•  Loop table
 –  Records information about loops already encountered


Remarks

•  A loop body may contain several instances of the same static instruction
 –  Tiny unrolled loop, same function called multiple times, …

•  There is a backward jump somewhere in the loop body

•  A loop body may contain several backward jumps


Loop monitor

•  Associate a monitor with each static backward jump

•  For each retired instruction
 –  Update a signature = hash of instruction PCs
 –  Increment the body size
 –  If body size > MaxBodySize, close the monitor

•  Each time we encounter that same backward jump
 –  Compare the signature with the body signature
 –  If the signatures match, add the body size to the loop length
 –  Reset the signature and the body size

•  Monitor several backward jumps simultaneously
 –  4 monitors are enough in practice; the one that first reaches length 900 “wins”
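A minimal Python sketch of one such monitor, assuming a simple multiplicative hash of the PCs as the signature (names and constants are illustrative, not the actual hardware design):

```python
class LoopMonitor:
    """Sketch of a monitor attached to one static backward jump."""
    MAX_BODY = 128   # MaxBodySize
    WIN = 900        # loop length needed to declare a long dynamic loop

    def __init__(self, jump_pc):
        self.jump_pc = jump_pc
        self.signature = 0          # hash of PCs retired since the last jump
        self.body_size = 0
        self.body_signature = None  # signature of the candidate loop body
        self.loop_length = 0
        self.closed = False

    def retire(self, pc):
        if self.closed:
            return False
        self.signature = (self.signature * 31 + pc) & 0xFFFFFFFF
        self.body_size += 1
        if self.body_size > self.MAX_BODY:
            self.closed = True           # body too large: give up
        elif pc == self.jump_pc:         # back at the same backward jump
            if self.signature == self.body_signature:
                self.loop_length += self.body_size   # one more iteration
            else:                        # different body: restart counting
                self.body_signature = self.signature
                self.loop_length = self.body_size
            self.signature, self.body_size = 0, 0
        return self.loop_length > self.WIN   # this monitor 'wins'
```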


Loop table

•  If the loop length was greater than 900, record an entry for that loop in the loop table

•  A loop is identified by its body signature

•  Next time we encounter that loop, immediately start counting instructions as loop instructions
 –  don’t wait until 900 instructions have been retired
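A sketch of how the table shortens detection, assuming loops are identified purely by their body signature (hypothetical structure; the real table also holds information prepared for the accelerator, as described later):

```python
class LoopTable:
    """Loops already seen, indexed by body signature."""
    THRESHOLD = 900

    def __init__(self):
        self.known = set()

    def record(self, body_signature, loop_length):
        if loop_length > self.THRESHOLD:
            self.known.add(body_signature)

    def detection_delay(self, body_signature):
        # 0 on a hit: count loop instructions immediately;
        # otherwise observe THRESHOLD instructions of periodicity first
        return 0 if body_signature in self.known else self.THRESHOLD
```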


Simulations

•  Trace-driven, based on Pin (x86)

•  Nehalem-like microarchitecture (roughly)

•  Hardware cache prefetchers
 –  L1, L2, L3

•  Benchmarks: SPEC CPU2006
 –  1 trace per benchmark
 –  1 trace = 40 samples of 50 million instructions (total = 2 billion)
 –  Samples regularly spaced throughout the whole benchmark execution


CINT2006 loop behavior

[Chart: for each CINT2006 benchmark, the percentage of dynamic instructions in loops, the percentage of clock cycles in loops, and the maximum body size; average loop lengths are annotated, from about 5000 up to 18000 instructions.]


CFP2006 loop behavior

[Chart: the same data for the CFP2006 benchmarks; annotated average loop lengths range from about 2000 instructions up to 6M.]


Summary

•  Long dynamic loops represent a significant part of the execution for half of the CPU2006 benchmarks
 –  Up to 90%

•  Typical loop length is several thousand instructions

•  Body size is diverse, from a few tens up to a few hundred instructions

•  ➔ Worth trying to accelerate loops
 –  But all loops, whether the body is small or large


How to accelerate loops?

•  Modify a superscalar core?
 –  instruction fetch (done), register renaming, what else?

•  Loop accelerator
 –  Sits beside the superscalar core
 –  Specialized for executing periodic sequences of instructions

•  Try to remove repetitive work from the loop execution ➔ do that work before executing the loop
 –  E.g., register dependency analysis
 –  When possible, store information in the loop table for future occurrences of that loop


Global view

[Block diagram: retired micro-ops from the superscalar core (with its DL1 and architectural registers) feed the Loop Detector; a Loop Builder sets up the Loop Accelerator, which sits beside the core; L2 and L3 caches.]


Transitions

[State diagram:]

•  Normal mode: execute on the core

•  Build mode (entered when a long loop is detected): for X loop iterations, still execute on the core and prepare the loop; an early loop exit returns to normal mode

•  Acceleration mode (entered after build mode by flushing the reorder buffer, reading the loop input registers, and migrating the execution): execute on the loop accelerator; on loop exit, update the architectural registers and return to normal mode


Proposed loop accelerator

•  Get rid of the main superscalar bottlenecks
 –  Register renaming, dynamic scheduling, bypass network, …

•  Scheduling is done during build mode
 –  Each static micro-op is mapped to an Execution Unit (EU)
 –  All dynamic instances of that micro-op will be executed by that EU, in program order

•  Static micro-ops communicate through FIFO queues
 –  No registers
 –  Data-flow execution
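A functional Python sketch of this execution model, representing each static micro-op as a function plus its input and output FIFOs (untimed and purely illustrative of the firing rule, not of the hardware):

```python
from collections import deque

def run_eu(microops, steps):
    """One EU: its static micro-ops fire strictly in program order,
    and a micro-op fires only when all of its input FIFOs are non-empty.
    microops: list of (fn, input_queues, output_queues) in program order."""
    i = 0
    for _ in range(steps):
        fn, ins, outs = microops[i]
        if all(q for q in ins):                  # every operand ready?
            result = fn(*(q.popleft() for q in ins))
            for q in outs:
                q.append(result)                 # one copy per consumer
            i = (i + 1) % len(microops)          # next micro-op
        # otherwise: stall, waiting on an operand queue

# One-micro-op body "x = x + 1": the value circulates through a queue
# feeding the micro-op back to itself (a loop-carried dependency).
feedback, out = deque([0]), deque()
run_eu([(lambda x: x + 1, [feedback], [out, feedback])], steps=5)
print(list(out))   # [1, 2, 3, 4, 5]
```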


Cyclic register dependency graph

[Figure: an example loop with a body of 5 micro-ops (A B C D E A B C D E A B C ...) and its cyclic register dependency graph over A–E. Cycles in the graph correspond to long dependency chains; a dependency spanning many instructions is latency-tolerant.]
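A Python sketch of how build mode could derive this graph, reducing each static micro-op to a (destination register, source registers) pair and ignoring memory dependencies (illustrative only):

```python
def crdg(body):
    """body: list of (dst_reg, src_regs) in program order, one per static
    micro-op. Returns edges (producer_index, consumer_index); a source not
    yet defined in this iteration comes from the previous iteration, which
    is what makes the graph cyclic."""
    last_writer = {}
    for i, (dst, _) in enumerate(body):
        last_writer[dst] = i                 # last write of the body wins
    edges, seen = set(), {}
    for i, (dst, srcs) in enumerate(body):
        for r in srcs:
            if r in seen:                    # produced earlier, same iteration
                edges.add((seen[r], i))
            elif r in last_writer:           # produced by previous iteration
                edges.add((last_writer[r], i))
        seen[dst] = i
    return edges

body = [("i", ["i"]),     # i = i + 1
        ("x", ["i"])]     # x = load(a + i), modeled as consuming i
print(crdg(body))         # {(0, 0), (0, 1)}: the (0, 0) self-edge is a cycle
```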


Constraint edges

[Figure: the same graph with constraint edges added: dynamic instances of a static micro-op execute in program order. These edges do not increase the length of the longest chain.]


Proposed loop accelerator (cont’d)

[Ring diagram: 8 nodes (4 I-nodes, 2 F-nodes, 1 L-node, 1 S-node), 4 EUs per node, connected by a 4-lane pipelined ring.]


Proposed loop accelerator (cont’d)

[Same ring diagram: I-nodes execute 1-cycle micro-ops (ALU, branch, address generation, ...).]

Proposed loop accelerator (cont’d)

[Same ring diagram: F-nodes execute FP add, FP mul, int mul, …]

Proposed loop accelerator (cont’d)

[Same ring diagram: the L-node executes loads.]

Proposed loop accelerator (cont’d)

[Same ring diagram: the S-node executes stores.]

Remarks

•  Implement only a subset of the ISA
 –  Some operations are rare in the loops we have found
 –  E.g., integer division

•  Build mode determines whether the loop can be executed by the accelerator

•  Floating-point divisions are not rare
 –  One specific I-node can be configured during build mode to execute FP divs only


I-node / F-node

[Diagram: I-node/F-node internals. Four lanes (lane 0 to lane 3); each lane has an Input Queue (IQ), an EU with its Cyclic Execution Controller (CEC), operand queues, an Output Queue (OQ), and a Loop Exit Queue (LEQ); a DISTRIBUTOR connects the lanes.]

Lane 0

•  Up to 12 static micro-ops may be mapped on one EU

•  The Cyclic Execution Controller schedules them in program order

•  Dependent micro-ops can be mapped on the same EU

•  The Input Queue buffers copies of values traveling on the lane
 –  The IQ must never get full ➔ early loop exit when this happens

•  Values coming from other nodes are stored in one of 12 Operand Queues
 –  A given queue holds up to 32 values from a single static micro-op

•  Values produced by the EU are sent to other nodes through the Output Queue

•  Thanks to the distributor, copies of values produced on a lane can be sent to EUs on other lanes

•  The Loop Exit Queue contains the last 128 values produced by the EU
 –  for updating the architectural registers after a loop exit

L-node, lane 0

[Diagram: one L-node lane. A Load Issue Buffer (LIB) with its CEC and operand queues, a Load Output Queue (LOQ), and an LEQ; four ports into the DL1 cache and its MSHRs.]

•  The load address comes from an I-node

•  64-entry Load Issue Buffer ➔ select one load per cycle

•  128-entry Load Output Queue ➔ sends data to lane 0 in program order

S-node, lane 0

[Diagram: one S-node lane. An EU with its CEC and operand queues (store address, store data) feeding a 64-entry Loop Store Queue (LSQ).]


S-node, store validator

[Diagram: the LSQs of lanes 0–3 feed the Store Validator, which writes into the DL1 cache.]

•  Validates stores in program order
 –  A validated store can write into the DL1

•  The global synchronizer updates an executed iteration count that tells which stores can be validated


Global synchronization

•  At a given time, different EUs may be in different loop iterations

•  Make sure that no store is validated before older instructions have been executed

•  When a loop exit occurs, make sure that the values needed to update the architectural registers have not been pushed out of the Loop Exit Queue
 –  Prevent an EU from going too far ahead of the others


Global synchronizer

[Ring diagram: the global synchronizer is connected to all 8 nodes.]


Global synchronization (cont’d)

•  Each EU has an iteration counter (IC)

•  An EU freezes when IC = ICMAX
 –  ICMAX is not the same for all EUs
 –  For the S-node, ICMAX = ICMIN
 –  For the other nodes, ICMAX ≥ ICMIN

•  When the IC reaches ICMIN, the EU sends a signal to the synchronizer

•  After the synchronizer has received a signal from each EU, it sends back an unfreeze signal to all EUs

•  Upon receiving the unfreeze signal, each EU subtracts ICMIN from its IC, and the store validator adds ICMIN to the executed iteration count
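A minimal sketch of this freeze/unfreeze protocol, with ICMIN as an illustrative constant and signal latency not modeled:

```python
ICMIN = 8   # illustrative value

class EUCounter:
    """Iteration counter of one EU. ICMAX == ICMIN for the S-node,
    ICMAX >= ICMIN for the other nodes."""
    def __init__(self, icmax):
        self.ic, self.icmax = 0, icmax

    def frozen(self):
        return self.ic >= self.icmax

    def complete_iteration(self):
        if not self.frozen():
            self.ic += 1
        return self.ic >= ICMIN     # signal the synchronizer at ICMIN

def synchronize(eus, executed_iterations):
    """Once every EU has reached ICMIN: rebase all counters (the unfreeze
    signal) and advance the store validator's executed iteration count."""
    if all(eu.ic >= ICMIN for eu in eus):
        for eu in eus:
            eu.ic -= ICMIN
        executed_iterations += ICMIN
    return executed_iterations
```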


Loop exit

•  Typical loop exit: a branch changes its direction

•  An EU that detects a loop exit condition sends a signal to the synchronizer
 –  The signal contains the loop exit point = the IC value and the micro-op rank inside the body

•  The synchronizer forwards the loop exit point to all the EUs and to the store validator –  EUs freeze when they reach the loop exit point –  Store validation stops at the loop exit point


Memory dependencies

•  The instruction window can be very large
 –  example: body size = 100, ICMAX = 32 ➔ 3200 instructions

•  Exploit loop properties

–  spatial locality is often very good

–  constant-stride accesses are frequent

–  store-load dependency generally repeats on each iteration


Memory independence checker

•  Accessed by loads and stores

•  At store validation, tells whether the store might conflict with an already executed load
 –  exploits spatial locality and constant-stride accesses to keep structures reasonably sized
 –  see the tech report for details

•  If the answer is no, everything is OK

•  If the answer is yes, there may be a memory order violation ➔ trigger an early loop exit
 –  the store is a safe loop exit point
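A deliberately conservative sketch of the question asked at store validation, using a plain set of cache-line numbers where the real checker uses compact structures based on spatial locality and constant strides (LINE and all names are hypothetical):

```python
LINE = 64   # bytes; line granularity keeps the structure small but conservative

class IndependenceChecker:
    """At store validation: might this store conflict with a load that
    already executed? A 'yes' triggers an early loop exit."""
    def __init__(self):
        self.load_lines = set()

    def record_load(self, addr):
        self.load_lines.add(addr // LINE)        # executed load

    def store_may_conflict(self, addr):
        return addr // LINE in self.load_lines   # conservative answer

    def reset(self):
        self.load_lines.clear()   # e.g., once validation has caught up
```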


Store-load bypassing

[Figure: store-load bypassing. For a predicted dependency between a store (address S) and a load (address L) whose result feeds consumers X, Y, Z, the loop body is transformed during build mode: mov micro-ops forward the stored data directly to the consumers, and a check micro-op compares the store and load addresses.]
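A Python sketch of this build-mode rewrite, with micro-ops as (opcode, destination, sources) triples; it is a simplification in that it uses a single mov where the figure forwards the data to each consumer:

```python
def bypass_store_load(body, store_idx, load_idx):
    """For a predicted store->load dependency: replace the load by a mov
    that copies the store's data register, and append a check micro-op
    comparing the two addresses; a mismatch triggers an early loop exit."""
    _, _, (s_addr, s_data) = body[store_idx]      # store has no destination
    _, l_dst, (l_addr,) = body[load_idx]
    new_body = list(body)
    new_body[load_idx] = ("mov", l_dst, (s_data,))      # data bypasses memory
    new_body.append(("check", None, (s_addr, l_addr)))  # verify the prediction
    return new_body

body = [("store", None, ("r1", "r2")),    # mem[r1] = r2
        ("load",  "r3", ("r4",))]         # r3 = mem[r4], predicted == mem[r1]
print(bypass_store_load(body, 0, 1))
```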


Memory dependency (cont’d)

•  Assumes the distance between the store and the load is less than one loop body
 –  sufficient in most cases
 –  a similar method can be used for distances between 1 and 2 bodies
    •  beneficial only for 456.hmmer

•  The ‘check’ micro-op is executed by an I-node and triggers an early loop exit if it fails

•  Must verify that stores between the dependent store-load pair do not conflict

•  Bypassed-Store Table accessed by each store at validation
 –  one entry per bypassed store
 –  if a conflict is detected ➔ early loop exit


Loop body reduction

•  Micro-ops with no input dependencies inside the loop body produce the same result on each iteration

•  ➔ We may remove these redundant micro-ops during build mode

•  Recursive definition: micro-ops depending only on redundant micro-ops are themselves redundant
 –  they produce constant results after a fixed number of iterations

•  Keep redundant stores (memory consistency)

•  Can remove redundant loads, but record their address in a Removed-Loads Table ➔ check conflicts at store validation

•  Typically, between 10% and 20% of micro-ops can be removed
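The recursive definition is a fixpoint computation; a minimal sketch, assuming the body is given as a map from micro-op index to the set of its in-body producers, cross-iteration producers included:

```python
def redundant_microops(producers):
    """producers: dict mapping micro-op index -> set of indices of the
    micro-ops (this iteration or the previous one) that feed it.
    A micro-op is redundant iff all its producers are redundant."""
    red = set()
    changed = True
    while changed:                       # iterate to a fixpoint
        changed = False
        for i, deps in producers.items():
            if i not in red and deps <= red:
                red.add(i)
                changed = True
    return red

# 0 has no in-body inputs; 1 depends only on 0; 2 depends on itself
# (a loop-carried dependency such as i = i + 1, hence not redundant)
print(redundant_microops({0: set(), 1: {0}, 2: {2}}))   # {0, 1}
```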


Static mapping

•  On which EU do we put a static micro-op?

•  Very important for performance


Cyclic register dependency graph

[Figure: the example CRDG over micro-ops A–E again.]

If we put micro-ops A and C on the same EU, we create an artificial CRDG cycle of size 3


Artificial CRDG cycle

[Ring diagram: micro-ops A, B, C mapped onto nodes of the ring.]

A CRDG cycle may increase the loop execution time considerably

When possible, try to avoid creating artificial CRDG cycles


EU sharing

[Ring diagram: micro-ops A, B, C mapped on one EU; D and E on another.]

The iteration rate is limited by the most loaded EU


[Same diagram with a more balanced mapping: A, B / C / D, E.]

➔ Try to balance the number of micro-ops per EU


Lane segment sharing

[Ring diagram: the dependencies between micro-ops A–E travel over the ring's lane segments.]

The iteration rate is also limited by the most loaded lane segment


[Same diagram with the dependencies routed differently.]

➔ Try to balance the number of dependencies per lane segment


Mapping heuristic

•  Try to balance EU sharing
 –  important for loops with a small body

•  Try to detect natural and artificial CRDG cycles
 –  artificial CRDG cycles can be avoided if detected
 –  natural cycles ➔ try to put their micro-ops on the same EU

•  Try to put dependent micro-ops on the same EU
 –  that dependency does not use any lane segment

•  Try to minimize the number of lane segments used
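A greedy Python sketch in the spirit of this heuristic: it keeps a dependent micro-op on its producer's EU when there is room (that dependency then uses no lane segment) and otherwise picks the least-loaded EU; CRDG cycles and lane-segment balance are not modeled (all names hypothetical):

```python
def map_microops(producer_sets, n_eus, max_per_eu=12):
    """producer_sets[i]: indices of the micro-ops feeding micro-op i.
    Returns a dict: micro-op index -> EU index."""
    load = [0] * n_eus
    placement = {}
    for i, producers in enumerate(producer_sets):    # program order
        same_eu = [placement[p] for p in producers
                   if p in placement and load[placement[p]] < max_per_eu]
        eu = same_eu[0] if same_eu else min(range(n_eus), key=load.__getitem__)
        placement[i] = eu
        load[eu] += 1
    return placement

# A->B and C->D->E as two chains, mapped onto 2 EUs of capacity 3
print(map_microops([set(), {0}, set(), {2}, {3}], n_eus=2, max_per_eu=3))
```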


Configuration simulated

•  Loop detector: MaxBodySize = 128

•  Build mode: 5 iterations

•  Transition penalty
 –  loop table hit: 100 cycles
 –  loop table miss: 10000 cycles

•  Loop accelerator
 –  4 lanes, 8 nodes
 –  4 loads / cycle
 –  2 stores validated / cycle
 –  up to 12 static micro-ops per EU
 –  12 operand queues per EU (queue depth = 32)
 –  load issue buffer: 64 loads
 –  unfreeze signal latency = 4 cycles


Performance

[Chart: percentage of performance increase over baseline, for each benchmark.]


Local loop speedup

[Chart: local loop speedup for each benchmark; annotated values range from 1.3 to 2.7.]


Impact of disabling features

[Chart: percentage of performance loss when disabling a single feature.]


Some future research

•  How does performance scale if we increase the number of nodes and/or the number of lanes?

•  Can we over-clock the loop accelerator?

•  Is a pipelined ring the best choice?

•  Is it a good choice not to provide any direct data path between EUs on the same node?

•  What about applications other than the SPECs?

•  How could a compiler exploit a hardware loop accelerator?


Poor man’s loop acceleration?

•  Some ideas may be useful even if you don’t buy the loop accelerator

•  loop detector

•  loop body reduction

•  store-load bypassing

➔ you pay the overhead only once for several thousand instructions

•  Can we improve dynamic instruction scheduling in a superscalar core if we know that we are in a dynamic loop?


Conclusions

•  Long dynamic loops are frequent
 –  > 30% of all instructions executed by the SPEC CPU2006 suite

•  Try doing the repetitive work only once, before executing the loop

•  The proposed loop accelerator avoids some important superscalar bottlenecks
 –  no registers, no register renaming
 –  no dynamic scheduling
 –  huge instruction window ➔ latency tolerance

•  This is preliminary research, many questions still pending


Questions?