UW-Madison Computer Sciences Multifacet Group © 2010 Scalable Cores in Chip Multiprocessors Thesis Defense 2 November 2010 Dan Gibson


Page 1

UW-Madison Computer Sciences Multifacet Group © 2010

Scalable Cores in Chip Multiprocessors

Thesis Defense, 2 November 2010

Dan Gibson

Page 2

D. Gibson Thesis Defense - 2

Executive Summary (1/2)

• “Walls & Laws” suggest future CMPs will need Scalable Cores
  – Scale Up for Performance (e.g., one thread)
  – Scale Down for per-core energy conservation (e.g., many threads)
• Area 1: How to build efficient scalable cores
  – Forwardflow, one scalable core
  – Overprovision rather than borrow

Page 3

Executive Summary (2/2)

• Area 2: How to use scalable cores:
  – Scale at fine granularity:
    • Discover most-efficient configuration
  – Scale for multi-threaded workloads:
    • Scale up for sequential bottlenecks, improve performance
    • Scale down for unimportant executions, improve efficiency
  – Using DVFS as a scalable core proxy

Page 4

Document Outline

1. Introduction
2. Extended Motivation
   1. Scalable Cores
   2. Background
   3. Related Work
3. Methods
4. Serialized Successor Representation
5. Forwardflow
6. Scalable Cores for CMPs
   1. Scalable Forwardflow
   2. Overprovisioning vs. Borrowing
   3. Power-Awareness
   4. Single-Thread Scaling
   5. Multi-Thread Scaling
   6. DVFS as a Proxy for Scalable Cores
7. Conclusions/Future Work/Reflections
A/B. Supplements

[Slide annotations: Of Course · Mostly Old Material: Recap · Mostly New Material: Talk Focus · TALK Outline… · If there’s time and interest]

Page 5

Hello, Software. I am a single x86 processor.

’80s - ’00s: Single-Core Heyday

1. Core and Chip Microarchitecture Changed Enormously
   386, 1985, 20MHz · 486, 1989, 50MHz · P6, 1995, 166MHz · PIV, 2004, 3000MHz
2. Clock Frequency Increased Dramatically
   [Chart: Frequency (MHz, log scale 10-10000) vs. Year (1989-2004) for Intel, IBM, AMD]

Hello, Software. I am still a single x86 processor.

Page 6

Hitting the Power Wall

[Chart: Thermal Design Power (W, log scale 0.15-150) vs. year (1971-2009) for Intel processors: 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium, Pentium MMX, Pentium II, Pentium III, Pentium 4, Pentium D, Core 2, Core i7]

• Kneejerk Reactions:
  • Reduce Clock Frequency (e.g., 3.0 GHz to 2.4-ish GHz)
  • De-Emphasize Pipeline Depth (e.g., Pentium M)
• What about Performance?

Resource borrowed from Yasuko’s WiDGET ISCA 2010 Talk. One example data point represents a range of actual products.

Page 7

Chip Multiprocessors (CMPs)

1. Can’t clock (much) faster…
2. Hard to make uArch faster…
→ Use Die Area for More Cores!

Hello, Software. I am TWO x86 processors. (And my descendants will have more…)

• “Fundamental Turn Toward Concurrency” [Sutter2005]
• Software must now change to see continued performance gains.

This Won’t Be Easy.

Page 8

Moore’s Law in the Multicore Era

“In 1965, Gordon Moore sketched out his prediction of the pace of silicon technology. Decades later, Moore’s Law remains true, driven largely by Intel’s unparalleled silicon expertise.”
Copyright © 2005 Intel Corporation.

• Cost per Device Falls (or “Fell”) Predictably
  – Density rises (Devices/mm²)
  – Device size grows smaller

Rock, 65nm [JSSC2009] · Rock16, 16nm [ITRS2007] (If you want 1024 threads)

Page 9

Amdahl’s Law

Parallel Runtime = (1 - f) + f / N
  f = Parallel Fraction
  N = Number of Cores

[Chart: Normalized Runtime (0-1.2) vs. Parallel Fraction f (0.00-0.99), N = 8.
 Sequential: Not Good · Partially-Parallel: OK · Highly-Parallel: Very Good]
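The runtime expression on this slide is easy to sanity-check in code; a minimal sketch (the function name is mine, not from the thesis):

```python
def parallel_runtime(f, n):
    """Amdahl's Law: runtime normalized to one core, with parallel
    fraction f spread across n cores and the sequential fraction
    (1 - f) unchanged."""
    return (1.0 - f) + f / n

# With N = 8, as on the slide:
# f = 0.00 -> 1.0      (sequential: not good)
# f = 0.50 -> 0.5625   (partially-parallel: OK)
# f = 0.99 -> 0.13375  (highly-parallel: very good)
```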

Page 10

Utilization Wall (aka SAF)

• Simultaneously Active Fraction (SAF): Fraction of devices in a fixed-area design that can be active at the same time, while still remaining within a fixed power budget. [Venkatesh2009]

[Chart: Dynamic SAF (0-1) at 90nm, 65nm, 45nm, and 32nm, for LP Devices and HP Devices; SAF falls with each node. [Chakraborty2008]]
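The trend behind the utilization wall can be sketched with a toy model. The scaling factors below are assumed for illustration, not the cited data: each node roughly doubles devices in a fixed area, while per-device switching power falls more slowly, so under a fixed power budget the usable fraction shrinks.

```python
def saf_trend(nodes, area_shrink=0.5, power_shrink=0.7):
    """Illustrative utilization-wall model: per node, device count in
    fixed area grows by 1/area_shrink while per-device power falls by
    power_shrink, so SAF shrinks by power_shrink * area_shrink."""
    saf = 1.0
    out = {}
    for node in nodes:
        out[node] = saf
        saf *= power_shrink * area_shrink  # fraction usable next node
    return out

# saf_trend(["90nm", "65nm", "45nm", "32nm"]) ->
# 90nm: 1.0, 65nm: 0.35, 45nm: ~0.12, 32nm: ~0.043
```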

Page 11

Architects Boxed In: Walls and Laws

• PW: Cannot clock much faster.
• UW: Cannot use all devices.
• AL: Single threads need help; not all code is parallel.

[Diagram: POWER, UTILIZATION, and AMDAHL walls boxing in → Scalable CMPs]

Page 12

Scalable CMPs → Scalable Cores

• Scale UP for Performance
  – Use more resources for more performance
  – (i.e., 2 Strong Oxen)
• Scale DOWN to Conserve Energy
  – Exploit TLP with many small cores
  – (i.e., 1024 Chickens)

“If you were plowing a field, which would you rather use: Two strong oxen or 1024 chickens?”
  - Attributed to Seymour Cray

Page 13

This Thesis:

Scalable Cores in CMPs

1. How to build a Scalable Core?
   – Should be efficient
   – Should offer a wide power/perf. range
2. How to use Scalable Cores?
   – Optimize single-thread efficiency
   – Detect and ameliorate bottlenecks

Page 14

Area 1: Efficient Scalable Cores

Fear leads to Anger, Anger leads to Hate, Hate leads to Suffering
Naming → Association · Broadcast → Inefficiency

• Forwardflow Core Architecture
  – Raise Average I/MLP, Not Peak
  – Efficient SRAMs, no CAMs
• Serialized Successor Representation (SSR)
  – Use pointers instead of names

→ Basis for a Scalable Core Design

Page 15

Area 2: Scalable Cores in CMPs

• How to scale cores:
  – Overprovision each core?
  – Borrow/merge cores?
• When to scale cores:
  – For one thread?
  – For many threads?
• How to continue:
  – DVFS as a proxy for a scalable core

Page 16

Outline

• Introduction: Scalable Cores
  – Motivation (Why scale in the first place?)
  – Definition
• Scalable Cores for CMPs
  – How to scale:
    • Dynamically-Scalable Core (Forwardflow)
    • Overprovision or Borrow Resources?
  – When to scale: Hardware Scaling Policies
    • For single-thread efficiency
    • For multi-threaded workloads
• Conclusions/Wrap-Up

Page 17

Forwardflow (FF): A Scalable Core

• Forwardflow Core =
  Frontend (L1-I, Decode, ARF) +
  Distributed Execution Logic/Window (DQ) +
  L1-D Cache

[Diagram: FE, DQ bank groups, L1-D.
 Scale Down: Use a Smaller Window · Scale Up: Use a Bigger Window]

Page 18

Window Scaling vs. Core Scaling

• FF: Only scales the instruction window
  – Not width,
  – Not registers,
  – etc.
• How does window scaling scale the core?
  – By affecting demand
  – Analogous to Bernoulli’s Principle

[Diagram: FE, DQ]

P_dyn = a · f · C_L · V_dd²   (a = “Activity Factor”)
Power of Unscaled Components?
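The dynamic-power relation P = a · f · C_L · V_dd² explains why window scaling changes core power without touching frequency or voltage: it moves the activity factor a of the unscaled components. An illustrative sketch; the constants are made up, not measurements from the thesis:

```python
def dynamic_power(activity, freq_hz, cap_farads, vdd_volts):
    """Classic CMOS dynamic power: P = a * f * C_L * Vdd^2."""
    return activity * freq_hz * cap_farads * vdd_volts ** 2

base = dynamic_power(0.10, 3.0e9, 1.0e-9, 1.0)   # nominal: 0.3 W
# Window scaling leaves f and Vdd alone; it raises demand on (activity
# of) the unscaled frontend/memory components:
up = dynamic_power(0.13, 3.0e9, 1.0e-9, 1.0)     # +30% via activity
# DVFS instead raises both f and Vdd, so power grows roughly cubically:
dvfs = dynamic_power(0.10, 3.6e9, 1.0e-9, 1.2)   # 1.2^3 = 1.728x
```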

Page 19

FF Dynamic Configuration Space

CONFIG. | SOBRIQUET           | DESCRIPTION
F-32    | “Fully scaled down” | 32-entry instruction window, single issue (1/4 of a DQ bank group)
F-64    |                     | 64-entry instruction window, dual issue (1/2 of a DQ bank group)
F-128   | “Nominal”           | 128-entry instruction window, quad issue (one full DQ bank group)
F-256   |                     | 256-entry instruction window, “quad” issue (2 BGs)
F-512   |                     | 512-entry instruction window, “quad” issue (4 BGs)
F-1024  | “Fully scaled up”   | 1K-entry instruction window, “quad” issue (8 BGs)

Page 20

Configuration…

Component          | Configuration
Mem. Cons. Mod.    | Sequential Consistency
Coherence Prot.    | MOESI Directory (single chip)
Store Issue Policy | Permissions Prefetch at X
Frequency          | 3.0 GHz
Technology         | 32nm
Window Size        | Varied by experiment
Disambig.          | NoSQ
Branch Prediction  | TAGE + 16-entry RAS + 256-entry BTB
Frontend           | 7 Cyc. Pred-to-dispatch
L1-I Caches        | 32KB 4-way 64B 4-cycle, 2 proc. ports
L1-D Caches        | 32KB 4-way 64B 4-cycle LTU, 4 proc. ports, WI/WT, included by L2
L2 Caches          | 1MB 8-way 64B 11-cycle, WB/WA, Private
L3 Cache           | 8MB 16-way 64B 24-cycle, Shared
Main Memory        | 8GB, 2 DDR2-like controllers (64 GB/s peak BW), 300-cycle latency
Inter-proc network | 2D Mesh, 16B links

Page 21

FF Scalable Core Performance

[Chart: Normalized Runtime (0-2) for Gmean, h264ref, and libquantum under F-32, F-64, F-128, F-256, F-512, F-1024; one F-32 bar clipped at 3.2.
 h264ref, Mostly Compute-Bound: Not much scaling · libquantum, Mostly Memory-Bound: Great scaling]

Page 22

FF Scalable Core Power

[Chart: Normalized Power (0-1.4) for F-32, F-64, F-128, F-256, F-512, F-1024, broken down into FE, DQ/ALU, MEM, and Static components]

WRT Nominal, F-128:
  Scale Up (8x Window):    +27% MEM power · +91% DQ/ALU power · +28% FE power
  Scale Down (1/4 Window): -32% MEM power · -54% DQ/ALU power · -39% FE power

Page 23

FF Recap

• Forwardflow Scalable Core

[Diagram: FE + L1-D with fewer or more DQ bank groups.
 Scale Down: Use a Smaller Window · Scale Up: Use a Bigger Window]

→ More details on Forwardflow

Page 24

Outline

• Introduction: Scalable Cores
  – Motivation (Why scale in the first place?)
  – Definition
• Scalable Cores for CMPs
  – How to scale:
    • Dynamically-Scalable Core (Forwardflow)
    • Overprovision or Borrow Resources?
  – When to scale: Hardware Scaling Policies
    • For single-thread efficiency
    • For multi-threaded workloads
• Conclusions/Wrap-Up

Page 25

Overprovisioning vs. Borrowing

• Scaling Core Performance means Scaling Core Resources
  – From where can a scaled core acquire more resources?
• Option 1: Overprovision All Cores
  – Every core can scale up fully using a core-private resource pool
• Option 2: Borrow From Other Cores
  – Cores share resources with neighbors

Page 26

What Resources?

• Forwardflow:
  – Resources = DQ Bank Groups (i.e., window space, functional units)
• Simple Experiment:
  – Overprovision: Each core has 8 BGs, enough for F-1024. What is the area cost?
  – Borrow: Each core has 4 BGs, enough to scale to F-512, and borrows neighbors’ BGs to reach F-1024. What is the performance cost?

Page 27

Per-Core Overprovisioning (32nm)

[Floorplan: tile with FE, L1-D, L2, and L3 bank.
 Overprovisioned Tile: 4.15mm x 8.96mm = 37.2mm² · Overprovisioned CMP: 16.6mm x 17.9mm = 298mm²]

Scale Up: Activate More Resources

Page 28

Resource Borrowing (32nm)

[Floorplan: two adjacent tiles, each with FE, L1-D, L2, and L3 bank.
 Borrowed Tile: 4.15mm x 8.31mm = 34.5mm² · Borrowing CMP: 16.6mm x 16.6mm = 276mm²]

Scale Up: Borrow Resources from Neighbor

Page 29

Area Cost (32nm)

Per-Core: 12.3mm² → 15.6mm² (+27%)
Per-Tile: 34.5mm² → 37.2mm² (+8%)
Per-CMP:  276mm² → 298mm² (+7%)

Page 30

Performance Cost of Borrowing (32nm)

• Borrowing Slower?
  – Maybe Not: Comparable Wire Delay (in this case)
  – Maybe: Crosses Physical Core Boundary
    • Global vs. Local Wiring?
    • Cross a clock domain?
• Simple Experiment:
  – 2-cycle lag crossing core boundary
  – Slows inter-BG communication
  – Slows dispatch

Page 31

A Loose Loop

[Chart: Normalized Runtime (0.94-1.12) for F-1024O, F-512, and F-1024B]

• 2 cycles lag:
  – 9% Runtime Reduction from Borrowing w.r.t. Overprovisioning
  – Essentially No Performance Improvement From Scaling Up!

Page 32

Overprovisioning vs. Borrowing

• Overprovisioning CAN be cheap
  – FF: 7% CMP area
  – CF: 12.5% area from borrowing [Ipek2007]
• If Borrowing introduces even small delays, it may no longer be worthwhile to scale at all.
  – This effect is worse if borrowing occurs at smaller design points.

Page 33

Outline

• Introduction: Scalable Cores
  – Motivation (Why scale in the first place?)
  – Definition
• Scalable Cores for CMPs
  – How to scale:
    • Dynamically-Scalable Core (Forwardflow)
    • Overprovision or Borrow Resources?
  – When to scale: Hardware Scaling Policies
    • For single-thread efficiency
    • For multi-threaded workloads
• Conclusions/Wrap-Up

Page 34

What to do for f=0.00

• What is important?
  – Performance: Just Scale Up (done)
  – Efficiency: Pick the most efficient configuration?
    • How to find the right configuration?
    • Can we do better?

[Chart: Normalized Runtime vs. Parallel Fraction f (0.00-0.99), f=0.00 case highlighted]

Page 35

What about local efficiency? (i.e., phases)

• Applications may exhibit phases at “micro-scale”
  – Not all phases are equal

# Sum an array
l_array: load [R1+ 0] -> R2
         add  R2 R3   -> R3
         add  R1 64   -> R1
         brnz l_array
...
# Sum a list
l_list:  load [R1+ 8] -> R2
         add  R2 R3   -> R3
         load [R1+ 0] -> R1
         brnz l_list

Array sum: Great for Big Windows (Scale Up?) · List sum: Big Window Makes No Difference (Scale Down?)

Page 36

Prior Art (some of it)

[Chart: E*D², Normalized to Best Static Design (0-2), for POS and PAMRS across benchmarks]

• POS (Positional Adaptation) [Huang03]:
  – Code → Configuration
  – POS: Static Profiling, Measure Efficiency
• PAMRS (Power-Aware uArch Resource Scaling) [Iyer01]:
  – Detect “hot spots”
  – Measure all configurations’ efficiency, pick best

Want: Efficiency of POS, but dynamic response of PAMRS

Page 37

MLP-based Window Size Estimation

• Play to the cards of the uArch:
  – FF: Pursue/measure MLP
  – Something else: Something else
• Find the smallest window that will expose as much MLP as the largest window
• Hardware:
  – Poison bits
  – Register names
  – Load miss bit
  – Counter
  – LFSR

→ Results
→ Explain window size estimation in detail with a gory example
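The goal of finding the smallest window that exposes as much MLP as the largest can be sketched abstractly. This is my own illustrative model, not the poison-bit/LFSR hardware the slide lists: given the (sorted) window positions of missing loads, two misses overlap only if both fit in the window at once.

```python
def mlp_for_window(miss_positions, window_size):
    """Max number of load misses a window of the given size can have
    outstanding at once; miss_positions are sorted instruction indices."""
    best = 0
    for i, p in enumerate(miss_positions):
        # misses at or after p that still fit in a window starting at p
        overlapping = sum(1 for q in miss_positions[i:] if q - p < window_size)
        best = max(best, overlapping)
    return best

def smallest_sufficient_window(miss_positions, sizes=(32, 64, 128, 256, 512, 1024)):
    """Pick the smallest window exposing as much MLP as the largest one."""
    target = mlp_for_window(miss_positions, sizes[-1])
    for s in sizes:
        if mlp_for_window(miss_positions, s) == target:
            return s
    return sizes[-1]

# Array-like phase (independent misses every 16 instructions): big
# windows expose more MLP, so a large window is chosen.
# Sparse, chain-like misses: a small window already suffices.
```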

Page 38

FG Scaling Results

• MLP:
  – No profiling needed
  – Safe: only hurts efficiency >10% for 1 bmk.
  – Compare to: POS, 8 bmks; PAMRS, 20 bmks

[Chart: Normalized E*D² (0-2) for POS, PAMRS, and MLP across benchmarks; MLP shows fewer bad cases than PAMRS, and fewer than POS]

Page 39

Recap: What to do for f=0.00

• Profiling (POS)
  – Can help, might hurt
• Dynamic Response: Seek MLP
  – Seldom hurts, usually finds the most efficient configuration

[Chart: Normalized Runtime vs. Parallel Fraction f]

Page 40

Outline

• Introduction: Scalable Cores
  – Motivation (Why scale in the first place?)
  – Definition
• Scalable Cores for CMPs
  – How to scale:
    • Dynamically-Scalable Core (Forwardflow)
    • Overprovision or Borrow Resources?
  – When to scale: Hardware Scaling Policies
    • For single-thread efficiency
    • For multi-threaded workloads
• Conclusions/Wrap-Up

Page 41

What to do for f=0.25-0.85

• Two Opportunities:
  1. Sequential Bottlenecks
     • Detect, Fix (i.e., Scale Up)
     • Better Performance
  2. Useless Executions
     • Detect, Fix (i.e., Scale Down)
     • Better Efficiency

[Chart: Normalized Runtime vs. Parallel Fraction f, middle range highlighted]

Page 42

What if the OS Knows?

• OS knows about bottlenecks
  – Can scale up a core
• OS knows about useless work
  – Can scale down, or,
  – Can shut off unneeded cores (e.g., OPMS)
• Result: Amdahl’s Law in the Multicore Era [Hill/Marty08]

Runtime = (1 - f)/k + f/N   (k = sequential speedup of the scaled-up core; curve labeled k = 1.5)

[Chart: Normalized Runtime vs. Parallel Fraction f (0.00-0.99)]
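Under this Hill/Marty-style model, a core that the OS scales up by a factor k during sequential phases shrinks only the (1 - f) term. A minimal sketch; k = 1.5 below is an assumed sequential speedup, matching the curve label as I read it:

```python
def scaled_runtime(f, n, k):
    """Normalized runtime when the sequential fraction runs on a core
    scaled up by factor k and the parallel fraction on n cores."""
    return (1.0 - f) / k + f / n

# Scaling up the sequential core helps most at low f:
# f = 0.10, n = 8: runtime 0.9125 with k = 1 -> 0.6125 with k = 1.5
```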

Page 43

If the OS doesn’t know… (Dunce)

• Maybe programmer knows? (Prog)
• SLE-like lock detector to identify critical sections? (Crit)
• Hardware spin detection? [Wells2006] (Spin)
• Holding a lock… except when spinning? (CSpin)
• Every thread spinning except one? (ASpin)
  (limit study: pretend global communication is OK)
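The detector-based policies can be summarized as a per-core decision rule. A hedged sketch of the decision logic as I understand it; the signal names are mine, not the hardware interface:

```python
def choose_config(policy, holds_lock, spinning, all_others_spinning):
    """Return 'up', 'down', or 'nominal' for one core under each policy.
    Crit:  scale up inside detected critical sections.
    Spin:  scale down while spinning.
    CSpin: both, with spinning overriding the lock detector (the lock
           detector may think a lock is held while the thread spins).
    ASpin: like Spin, but scale up when every other thread spins."""
    if policy == "Crit":
        return "up" if holds_lock else "nominal"
    if policy == "Spin":
        return "down" if spinning else "nominal"
    if policy == "CSpin":
        if spinning:
            return "down"
        return "up" if holds_lock else "nominal"
    if policy == "ASpin":
        if all_others_spinning:
            return "up"   # the one useful thread scales up
        return "down" if spinning else "nominal"
    return "nominal"
```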

Page 44

Amdahl Microbenchmark

Bmks x Policies x Configs + Opacity

[Diagram: execution alternates Sequential (1 - f) and Parallel (f) phases across N threads; runs on Real HW, so some Unclear Behavior]

Runtime = (1 - f)/k + f/N

[Chart: Normalized Runtime vs. Parallel Fraction f (0.00-0.99)]

Page 45

Prog (Programmer-Guided Scaling)

[Timeline: P0-P7 configurations over time, F-32 through F-1024, switched by sc_hint(slow)/sc_hint(fast) around sequential sections]

[Chart: Normalized Runtime (0-1) vs. Parallel Fraction f (0.00-0.99) for F-128, F-1024, and Prog]

Page 46

Crit (SLE-Style Lock Detector for Scale-Up)

[Timeline: P0-P7 configurations over time, F-32 through F-1024]

[Chart: Normalized Runtime (0-1) vs. Parallel Fraction f for F-128, F-1024, Prog, and Crit; Crit misbehaves (WTH?)]

Barrier::Arrive() {
  l.Lock();
  …
  l.Unlock();
}

Lock::Lock() {
  CAS(myValue);
  …
}

Lock::Unlock() {
  CAS(myValue);
  …
}

Page 47

Crit: What goes wrong

• Intuition Mismatch:
  – Lock detector implementer’s expectations don’t match pthread library implementer’s expectations.
    1. Critical Section != Sequential Bottleneck
    2. Lock+Unlock != CAS+Temporal Silent Store
• More general lesson:
  – SW is really flexible. Programmers do strange things.
  – HW designer: Be careful, SW may not be doing what you think.

Page 48

Spin (Spin Detector for Scale-Down)

[Timeline: P0-P7 configurations, F-32 through F-1024. Spinning threads Scale Down; a thread that seldom/never spins performs like CSpin (next)]

Page 49

CSpin (Lock Detector for Scale-Up, Spin Detector for Scale-Down)

[Timeline: P0-P7 configurations, F-32 through F-1024. Occasionally the lock detector thinks a lock is held while the thread is also spinning]

[Chart: Normalized Runtime (0-1) vs. Parallel Fraction f for F-128, F-1024, Prog, Crit, and CSpin]

Page 50

ASpin (Spin, but Scale Up if all others Scaled Down)

[Timeline: P0-P7 configurations, F-32 through F-1024. All Spinning: Scale Up; better late than never]

[Chart: Normalized Runtime (0-1) vs. Parallel Fraction f for F-128, F-1024, Prog, Crit, CSpin, and ASpin]

Page 51

Amdahl Efficiency

[Chart: Normalized E*D² (0-1) vs. Parallel Fraction f (0.00-0.99) for F-128, Prog, Spin, and ASpin]

1. Hope of SW Parallelism for Efficiency Seems Sound
2. “Programmer” can help. Psychology? Difficulty for non-toy programs?
3a. Spin-detection helps, by scaling down.
3b. Can scale up when others spin (“others”?)

Page 52

Real Workloads?

Workload | Behavior
Apache   | Spin Det. helps. Synchronization Heavy.
JBB      | Synchronization Heavy.
OLTP     | (Just) Spin hurts a little, ASpin helps. Synchronization Heavy.
Zeus     | (Just) Spin hurts a little, ASpin helps. Synchronization Heavy.

f = 0.90+ by design: graduate students spend a lot of time making this so. No Prog scaling policy.

[Chart: Normalized E*D² (0-1.2) for F-128, Spin, CSpin, ASpin, and F-1024 on each workload]

Page 53

Outline

• Introduction: Scalable Cores
  – Motivation (Why scale in the first place?)
  – Definition
• Scalable Cores for CMPs
  – How to scale:
    • Dynamically-Scalable Core (Forwardflow)
    • Overprovision or Borrow Resources?
  – When to scale: Hardware Scaling Policies
    • For single-thread efficiency
    • For multi-threaded workloads
  – How to continue: DVFS/Models for Future Software Evaluations
• Conclusions/Wrap-Up

Page 54

Conclusions (1/2)

• How to scale cores:
  – Forwardflow: An Energy-Proportional Scalable-Window Core Architecture
    • Scale up for performance
    • Scale down for energy conservation
  – Overprovision Resources when cheap
    • Borrow only when necessary
    • Avoid loose loops

Page 55

Conclusions (2/2)

• When to scale cores:
  – For single-thread efficiency:
    • Seek efficient operation intrinsically (FF: MLP)
    • Profiling can help, if possible.
  – For threaded workloads:
    • Scale up for sequential bottlenecks (if you can find them)
    • Scale down for useless work
• How to emulate scalable cores:
  – Proxy with DVFS, with caveats

[Thumbnails: Amdahl runtime curve; DVFS domain, 1V-0V]

Page 56

Other Contributions

• Side Projects with Collaborators
  – Deconstructing Scalable Cores, Coming Soon
  – “Diamonds are an Architect’s Best Friend”, ISCA 2009
  – To CMP or Not to CMP, TR & ANCS Poster
• Parallel Programming at Wisconsin
  – CS 838, CS 758
• Various Infrastructure Work
  – Ruby, Tourmaline, Lapis, GEM5

Page 57

Fun Facts About This Thesis

• Simulator:
  – C++: 135kl (101kl), Python: 16.7kl
  – 1188 Revs, 17,476 Builds (~15 builds per day since 5 July 2007)
• Forwardflow used to be Fiberflow
  – Watch out, Metamucil
• Est. Simulation Time:
  – 2.9B CPU-Seconds = 95 Cluster-Days (just in support of data in this thesis)

Page 58

Questions/Pointers

In the Document: Overp./Borrowing · FG Uniproc. Scaling · Multiproc. Scaling · DVFS vs. W. Scaling · SSR · All about FF · Estimating Power · LBUS/RBUS · Scalable Scheduling · Seeking MLP · Other Scalable Cores · Related Work · Backward Ptrs. · DVFS vs. Scaling

“Always in motion is the future.”

Page 59

DVFS Instead of Simulation

• So far:
  – “Benchmark” = 1ms-10ms target time
  – Scaling “in the micro” (i.e., much faster than software)
• What about longer runs?
  – “Benchmark” = minutes+
  – Scaling “in the macro” (i.e., at the scale of systems)
• No real hardware scalable core
  – Use DVFS instead, as a proxy.

“You must unlearn what you have learned.”

Page 60

DVFS Effects

[Diagram: DVFS Domain (1V-0V) covering FE, L1-D, L2; L3 and DRAM outside. F-128 @ 3GHz vs. F-128 @ 3.6GHz]

+Freq: Compute operations are faster
+Freq: Memory seems slower
+Freq, +Volt: Dynamic power higher (~cubic)
+Pdyn: Higher temperature leads to higher static power

Page 61

HW Scaling Effects

[Diagram: FE, L1-D, L2, L3, DRAM; Scale Up from F-128 @ 3GHz to F-256 @ 3GHz]

+Window: Compute operations are not (much) faster
+Window: Memory seems faster
+Window: Dynamic power higher (~log)

How do they compare quantitatively?
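The contrast between the two slides, roughly cubic power growth under DVFS versus roughly logarithmic growth under window scaling, can be put side by side. The constants below are illustrative only; the thesis's measured numbers are ~38% vs. ~10% chip power at comparable performance:

```python
import math

def dvfs_power_ratio(freq_ratio):
    """DVFS: voltage scales roughly with frequency, so dynamic power
    grows about cubically (P ~ f * V^2 with V ~ f)."""
    return freq_ratio ** 3

def window_power_ratio(window_ratio, c=0.15):
    """Window scaling: frequency and voltage fixed; activity grows
    roughly with log2 of window size (c is an assumed constant)."""
    return 1.0 + c * math.log2(window_ratio)

# 3.0 GHz -> 3.6 GHz vs. F-128 -> F-256:
# dvfs_power_ratio(1.2)  = 1.728
# window_power_ratio(2)  = 1.15
```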

Page 62

DVFS/HW Scaling Performance

[Chart: Runtime Normalized to F-128 (0-1) for a DVFS/Scaling config pair with comparable performance: F-256 vs. 3.6GHz.
 More CPU-Bound: Prefer DVFS · More Memory-Bound: Prefer Window Scaling]

Page 63

DVFS/HW Scaling Power

• DVFS: +~38% Chip Power
  – +~70% DVFS Domain Dynamic Power
  – +~20% Temp-induced Leakage
• FF Scaling: +~10% Chip Power
  – +~2% Temp-induced Leakage

[Chart: Power Normalized to F-128 (0-1.6) for F-128, 3.6GHz, and F-256, broken into FE, DQ/ALU, MEM, and Static]

Page 64: UW-Madison Computer Sciences Multifacet Group© 2010 Scalable Cores in Chip Multiprocessors Thesis Defense 2 November 2010 Dan Gibson

DVFS Proxying Scalable Cores

• Performance: OK, with caveats
  – CPU-bound workloads: DVFS overestimates scalable core performance
  – Memory-bound workloads: DVFS underestimates scalable core performance
• Power: Not OK
  – DVFS follows the E*D² curve
  – FF/scalable core should be better than the E*D² curve
  – Use a model instead

Page 65:

SSR

• Per-Value Distributed Linked List
  – Starts at producer
  – Visits each successor
  – NULL pointer at last successor
• Amenable to simple hardware
  – Serializes wakeup

ld R4 4 R1
add R1 R3 R3
sub R4 16 R4
st R3 R8
breq R4 R3
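The per-value chain can be modeled in a few lines. A sketch of SSR's serialized wakeup under my own slot-naming convention (the `(instruction, field)` tuples are invented for the example): each operand slot holds one "next use" pointer, and a produced value is delivered by walking the chain one successor at a time.

```python
# Sketch of SSR wakeup: each operand slot stores a single "next use"
# pointer; delivery walks the chain serially until a NULL (None) pointer.
def walk_successors(next_use, producer_slot):
    """Yield successor slots in chain order; None ends the chain."""
    slot = next_use[producer_slot]
    while slot is not None:
        yield slot
        slot = next_use.get(slot)

# R3's chain for the code above: add's dest -> st's first operand
# -> breq's second operand -> NULL.
next_use = {
    ("add", "D"): ("st", "S1"),
    ("st", "S1"): ("breq", "S2"),
    ("breq", "S2"): None,
}
wakeup_order = list(walk_successors(next_use, ("add", "D")))
```

The serialization is visible directly: each successor is reached only after the previous pointer has been chased, which is why wakeup takes multiple cycles per value.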

Page 66:

Effect of Serialized Wakeup


[Chart: normalized runtime for astar, bzip2, gcc, libquantum, and gmean.]

• Compared to idealized window
  – Low mean performance loss from serialized wakeup (+2% runtime)
  – Occasionally noticeable (e.g., bzip2, 50%+)

Page 67:

SSR Compiler Optimization

[Chart: normalized runtime for the long, split, and crit variants, comparing RUU, OoO, and SSR configurations.]

• long
  – Compiler cannot identify dynamic repeated regs
• split
  – Compiler can identify dynamic repeated regs, but cannot identify critical path
• crit
  – Compiler knows both dynamic repeated regs and critical path

Page 68:

Power-Awareness

• How much energy is used by a computation?
  – Measure (e.g., with a multimeter)
  – Detailed simulation (e.g., SPICE)
  – Simple simulation (e.g., WATTCH)
  – Simple model (e.g., 10 W/core)

E = Σᵢ Nᵢ·Eᵢ, where Nᵢ is the number of activations of element i and Eᵢ is the energy per activation of element i

Page 69:

Measuring Energy Online

Σᵢ Nᵢ·Eᵢ ≈ Σⱼ Cⱼ·Eest,ⱼ

Activations (Nᵢ): “hard” to measure. Events (Cⱼ): “easy” to measure — and correlated with activations.

[Iyer01]: MAC in hardware
[Joseph01]: HW performance counters; works for Pentium-era cores
This work: Scalable core — use the core’s own resources to do the computation
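The approximation above — per-element activation counts are hard to obtain, but correlated event counts are cheap — can be sketched numerically. All element names, event names, and energy coefficients below are illustrative, not measured values:

```python
def energy_exact(activations, e_per_activation):
    # Ground truth: E = sum_i N_i * E_i, needs per-element activation counts.
    return sum(n * e_per_activation[i] for i, n in activations.items())

def energy_estimate(event_counts, e_est_per_event):
    # Online proxy: E ~= sum_j C_j * E_est,j. Events are easy to count
    # with performance counters; each carries a pre-fitted energy estimate.
    return sum(c * e_est_per_event[j] for j, c in event_counts.items())

# Illustrative numbers only.
activations = {"alu": 1000, "dq_bank": 4000, "dcache": 600}
e_per_activation = {"alu": 5e-12, "dq_bank": 2e-12, "dcache": 20e-12}
exact = energy_exact(activations, e_per_activation)

events = {"insns_committed": 1200, "dcache_accesses": 600}
e_est = {"insns_committed": 2.0e-11, "dcache_accesses": 2.0e-12}
estimate = energy_estimate(events, e_est)
```

When the event-to-activation correlation holds, the estimate tracks the exact sum closely; the quality of the fit of the per-event coefficients is what the cited schemes differ on.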

Page 70:

DVFS Won’t Cut It

• Near saturation in voltage scaling
• Subthreshold DVFS never energy-efficient [Zhai04]

• Need microarchitectural alternative

[Chart: operating voltage (V), Vmax vs. Vmin, 1996–2010, for the IBM PowerPC 405LP, Transmeta Crusoe TM5800, Intel XScale 80200, Intel Itanium Montecito, and Intel Atom Silverthorne. The Vmax–Vmin scaling range shrinks from ~80% to ~33%.]

Resource borrowed from David’s “Two Cores” Talk

Page 71:

Scalable Interconnect

Logically: a ring. Scale down: a ring with fewer elements.

• Not straightforward
• Overprovisioning won’t work well: the wrap-around link is ugly
• Needs to support 1-, 2-, 4-, and 8-BG operation

Page 72:

Two Unidirectional Busses (gasp!)

[Diagram: per-segment configuration bits for the two busses, F-1024 vs. F-512:
F-1024: 10 01 10 01 10 01 11 / 11 10 01 10 01 10 11
F-512:  10 01 11 00 00 00 00 / 11 10 01 00 00 00 00]

Page 73:

Window Estimation Example

# Sum an array (three iterations in flight)
l_array: load [R1+0] -> R2
         add  R2 R3  -> R3
         add  R1 64  -> R1
         brnz l_array
         load [R1+0] -> R2
         add  R2 R3  -> R3
         add  R1 64  -> R1
         brnz l_array
         load [R1+0] -> R2
         add  R2 R3  -> R3
         add  R1 64  -> R1
         brnz l_array

[Animation: the first load misses — start profiling, poison R2. The dependent add poisons R3; the add to R1 is an antidote. Each subsequent load is an independent miss: it sets an ELMR bit and poisons R2. ELMR buckets shown: 4, 8, 16.]

MSb(ELMR) = 16 → Window size 16 needed.
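One way to read the example: independent misses set ELMR bits at their distance from the profiled miss, and the most-significant set bit bounds the window size needed to expose that MLP. A hypothetical sketch — the bitmask encoding and bucket spacing are my illustration, not the hardware format:

```python
def required_window(elmr, buckets=(1, 2, 4, 8, 16, 32)):
    # ELMR is a small bitmask: bit i set means an independent miss was
    # observed at distance buckets[i] from the profiled miss.
    # The most-significant set bit bounds the window size needed.
    needed = 0
    for i, dist in enumerate(buckets):
        if elmr & (1 << i):
            needed = dist
    return needed

# Independent misses at distances 4, 8, and 16 -> bits 2, 3, 4 set.
elmr = (1 << 2) | (1 << 3) | (1 << 4)
```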

Page 74:

Adding Hysteresis (1/2)

[Charts: runtime, E*D, E*D², and MLP for libquantum and astar as the window scales through F-1024, F-512, F-256, F-128, F-64, and F-32.]

1. Many reconfigs

2. Too small most of the time. Must anticipate, not react.

Page 75:

Adding Hysteresis (2/2)

• Scale down only “occasionally”
  – On full squash

[Charts: runtime, E*D, and E*D² for astar under the UpDown and UpOnly policies, across F-1024 through F-32.]

• Intuition:
  – Assume a big window is not useful
  – Show, occasionally, that a big window IS useful

Page 76:

Leakage Trends

• Leakage starts to dominate
• SOI & DG technology helps (ca. 2010/2013)
• Tradeoffs possible:
  – Low-leakage devices (slower access time)

[Charts: “1MB Cache: Dynamic & Leakage Power” (mW) [HP2008, ITRS2007] and “Leakage Power by Circuit Variant” (normalized power) [ITRS2007], comparing DG and LSP devices.]

Page 77:

Forwardflow Overview

• Design Philosophy:
  – Avoid ‘broadcast’ accesses (e.g., no CAMs)
  – Avoid ‘search’ operations (via pointers)
  – Prefer short wires, tolerate long wires
  – Decouple frontend from backend details
    • Abstract backend as a pipeline

Page 78:

Forwardflow – Scalable Core Design

• Use Pointers to Explicitly Define Data Movement
  – Every operand has a Next Use Pointer
  – Pointers specify where data moves (in log(N) space)
  – Pointers are agnostic of:
    • Implementation
    • Structure sizes
    • Distance
  – No search operation

ld R4 4 R1
add R1 R3 R3
sub R4 16 R4
st R3 R8
breq R4 R3

Page 79:

Forwardflow – Dataflow Queue

• Table of in-flight instructions
• Combination scheduler, ROB, and PRF
  – Manages OOO dependencies
  – Performs scheduling
  – Holds data values for all operands
• Each operand maintains a next use pointer — hence the log(N)
• Implemented as banked RAMs → scalable

1 ld R4 4 R1

2 add R1 R3 R3

3 sub R4 16 R4

4 st R3 R8

5 breq R4 R5

Op1 Op2 Dest — Dataflow Queue

[Figures: Bird’s Eye View of FF | Detailed View of FF]

Page 80:

Forwardflow – DQ +/-’s

+ Explicit, Persistent Dependencies

+ No searching of any kind

- Multi-cycle Wakeup per value *

1 ld R4 4 R1

2 add R1 R3 R3

3 sub R4 16 R4

4 st R3 R8

5 breq R4 R5

Op1 Op2 Dest

* Average Number of Successors is Small [Ramirez04,Sassone07]

Dataflow Queue

Page 81:

DQ: Banks, Groups, and ALUs

[Figure: Logical Organization vs. Physical Organization. DQ Bank Group — the fundamental unit of scaling.]

Page 82:

Forwardflow: Pipeline Tour

• RCT: Identifies successors
• ARF: Provides architected values
• DQ: Chases pointers

[Pipeline diagram: PRED → FETCH (I$) → DECODE (RCT) → DISPATCH → EXECUTE (DQ, D$) → COMMIT (ARF) — a scalable, decoupled backend.]

Page 83:

RCT: Summarizing Pointers

• Want to dispatch: breq R4 R5
• Need to know:
  – Where to get R4?
    • Result of DQ entry 3
  – Where to get R5?
    • From the ARF
• The Register Consumer Table summarizes where the most-recent version of each register can be found

1 ld R4 4 R1

2 add R1 R3 R3

3 sub R4 16 R4

4 st R3 R8

5

Op1 Op2 Dest — Dataflow Queue

Page 84:

RCT: Summarizing Pointers

1 ld R4 4 R1

2 add R1 R3 R3

3 sub R4 16 R4

4 st R3 R8

5 breq R4 7

Op1 Op2 Dest — Dataflow Queue

Register Consumer Table (RCT):
     REF   WR
R1   2-S1  1-D
R2   —     —
R3   4-S1  2-D
R4   3-D   3-D
R5   —     —

Dispatching breq R4 R5: R4 comes from DQ entry 3-D (RCT REF[R4] becomes 5-S1); R5 comes from the ARF.
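The RCT lookup-and-update at decode can be sketched as a toy model. The single-pointer-per-register simplification, the string encoding of slots, and the function names are all my illustration, not the hardware format:

```python
def rct_decode(rct, entry, sources, dest=None):
    # For each source register, the operand's origin is the last slot to
    # reference that register -- or the ARF if there is no in-flight reference.
    origins = {}
    for field, reg in sources.items():
        origins[field] = rct.get(reg, "ARF")
        rct[reg] = f"{entry}-{field}"   # this instruction is now the last reference
    if dest is not None:
        rct[dest] = f"{entry}-D"        # newest value of dest will live here
    return origins

rct = {"R4": "3-D"}                                      # sub at entry 3 produces R4
origins = rct_decode(rct, 5, {"S1": "R4", "S2": "R5"})   # decode breq R4 R5
```

This reproduces the slide's lookup — R4 from DQ entry 3-D, R5 from the ARF — and the table update that makes entry 5's first operand the new tail of R4's chain.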

Page 85:

Wakeup/Issue: Walking Pointers

• Follow Dest pointer when a new result is produced
  – Continue following pointers to subsequent successors
  – At each successor, read the ‘other’ value & try to issue
• NULL pointer → last successor

1 ld R4 4 R1

2 add R1 R3 R3

3 sub R4 16 R4

4 st R3 R8

5 breq R4 7

Op1 Op2 Dest — Dataflow Queue

Page 86:

DQ: Fields and Banks

• Independent fields → independent RAMs
  – i.e., accessed independently, with independent ports, etc.
• Multi-issue ≠ multi-port
  – Multi-issue → multi-bank
  – Dispatch and commit access contiguous DQ regions
    • Bank on low-order bits for dispatch/commit BW
• Port contention + wire delay = more banks
  – Dispatch and commit share a port
    • Bank on a high-order bit to reduce contention
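The two banking choices can be contrasted in a small sketch (the bank count and DQ size are illustrative parameters, not the dissertation's configuration):

```python
def dispatch_commit_bank(entry, n_banks=4):
    # Banking on low-order bits: a contiguous run of entries (as written
    # by dispatch or drained by commit) spreads across all banks,
    # giving full dispatch/commit bandwidth.
    return entry % n_banks

def shared_port_bank(entry, dq_size=128, n_banks=4):
    # Banking on a high-order bit: dispatch (young entries) and commit
    # (old entries) usually fall in different banks, reducing contention
    # when they share a port.
    return entry // (dq_size // n_banks)
```

With low-order banking, entries 0–3 land in banks 0–3 (parallel access); with high-order banking, entries 5 and 120 land in banks 0 and 3, keeping young and old regions apart.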

Page 87:

DQ: Banks, Groups, and ALUs

[Figure: Logical Organization vs. Physical Organization. DQ Bank Group — the fundamental unit of scaling.]

Page 88:

Related Work

• Scalable Schedulers
  – Direct Instruction Wakeup [Ramirez04]:
    • Scheduler has a pointer to the first successor
    • Secondary table for matrix of successors
  – Hybrid Wakeup [Huang02]:
    • Scheduler has a pointer to the first successor
    • Each entry has a broadcast bit for multiple successors
  – Half Price [Kim02]:
    • Slice the scheduler in half
    • Second operand often unneeded

Page 89:

Related Work

• Dataflow & Distributed Machines
  – Tagged-Token [Arvind90]
    • Values (tokens) flow to successors
  – TRIPS [Sankaralingam03]:
    • Discrete execution tiles: X, RF, $, etc.
    • EDGE ISA
  – Clustered Designs [e.g., Palacharla97]
    • Independent execution queues

Page 90:

RW: Scaling, etc.

• CoreFusion [Ipek07]
  – Fuse individual core structures into bigger cores
• Power-aware microarchitecture resource scaling [Iyer01]
  – Varies RUU & width
• Positional Adaptation [Huang03]
  – Adaptively applies low-power techniques:
    • Instruction filtering, sequential cache, reduced ALUs

Page 91:

RW: Scalable Cores

• CoreFusion [Ipek07]

– Fuse individual core structures into bigger cores

• Composable Lightweight Processors [Kim07]

– Many very small cores operate collectively, à la TRIPS

• WiDGET [Watanabe10]

– Scale window via smart steering

Page 92:

RW: Seeking MLP

• Big Windows [Many]
• Runahead Execution [Dundas97][Mutlu06]
  – “Just keep executing”
• WIB [Lebeck02]
  – Defer, re-schedule later
• Continual Flow [Srinivasan04]
  – & friends [Hilton09][Chaudhry09]
  – Defer, re-dispatch later

Page 93:

Operand Networks

[Chart: CDF of pointer span for astar, sjeng, and jbb.]

SPAN = 5 — Observation: ~85% of pointers designate near successors. Intuition: most of these pointers yield IB traffic, some IBG-N, none IBG-D.

SPAN = 16 — Observation: nearly all pointers (>95%) designate successors 16 or fewer entries away. Intuition: there will be very little IBG-D traffic.

Page 94:

Is It Correct?

• Impossible to tell
  – Experiments do not prove; they support or refute
• What support has been observed for the hypothesis “This is correct”?
  – Reasonable agreement with published observations (e.g., consumer fanouts)
  – Few timing-first functional violations
  – Predictable uBenchmark behavior
    • Linked list: no parallelism
    • Streaming: much parallelism

Page 95:

CoreFusion

• Borrow everything
  – Merges multiple discrete elements in multiple discrete cores into larger components
  – Troublesome for N > 2

[Diagram: two cores’ BPRED, Decode, Scheduler, PRF, and I$ structures, fused pairwise into larger components.]

Page 96:

“Vanilla” CMOS

[Diagram: planar CMOS transistor cross-section — N+ source/drain regions in a P– substrate.]

Page 97:

Double-Gate, Tri-Gate, Multigate

Page 98:

ITRS-HP vs. ITRS-LSP Device

[Diagram: device cross-section — N+ source/drain regions in a P– substrate.]

LSP: ~2x Thicker Gate Oxides

LSP: ~2x Longer Gates

LSP: ~4x Vth

Page 99:

OoO Scaling

[Diagram: instruction slots (op, dest, src1, src2) at decode width 2 vs. decode width 4, with full bypassing among the slots.]

Number of comparators ~ O(N²); bypassing complexity ~ O(N²)

Two-way Fully Bypassed

Four-way fully bypassed is beyond my powerpoint skill

Page 100:

OoO Scaling

• ROB complexity: O(N), O(I^(3/2))
• PRF complexity: O(ROB), O(I^(3/2))
• Scheduler complexity:
  – CAM: O(N·log N) (the register tag grows as log N)
  – Matrix: O(N²) (in fairness, the constant in front is small)
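The scheduler-complexity claims can be made concrete with a back-of-the-envelope count (this is a cost sketch under my own simplifying assumptions, not a circuit model):

```python
import math

def cam_comparator_bits(n):
    # CAM scheduler: each of N entries compares an incoming tag of
    # ~log2(N) bits -> O(N log N) comparator bits per broadcast port.
    return n * math.ceil(math.log2(n))

def matrix_cells(n):
    # Matrix scheduler: a full N x N dependence matrix -> O(N^2) cells,
    # though each cell is a single bit (the constant in front is small).
    return n * n
```

At a 128-entry window the matrix already holds ~18x more bits than the CAM tags, and the gap widens super-linearly as the window scales.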

Page 101:

Flavors of “Off”

                     Dynamic Power   Static Power   Response Lag Time
Active (Not Off)     U%              100%           0 cycles
Drowsy (Vdd scaled)  1–5%            40%            1–2 cycles
Clock-Gated          1–5%            100%           ~0 cycles
Vdd-Gated            <1%             <1%            100s of cycles
Freq. Scaled         F%              100%           ~0 cycles
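The table can be read as a small lookup model. The point estimates below pick midpoints of the table's ranges (e.g., 3% for "1–5%", 300 for "100s of cycles") and are illustrative only:

```python
def power_state(state, u=1.0, f=1.0):
    # Returns (dynamic fraction, static fraction, response lag in cycles).
    # "active" dynamic power scales with utilization u; "freq_scaled"
    # dynamic power scales with the frequency fraction f.
    table = {
        "active":      (u,     1.00,  0),
        "drowsy":      (0.03,  0.40,  2),    # midpoint of 1-5%
        "clock_gated": (0.03,  1.00,  0),
        "vdd_gated":   (0.005, 0.005, 300),  # "100s of cycles"
        "freq_scaled": (f,     1.00,  0),
    }
    return table[state]

dyn, stat, lag = power_state("drowsy")
```

The trade-off the table captures falls out directly: only Vdd-gating attacks static power, and it is also the only state with a lag of hundreds of cycles.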

Page 102:

Forwardflow – Resolving Branches

1 ld R4 4 R1

2 add R1 R3 R3

3 sub R4 16 R4

4 st R3 R8

5 breq R4 R5

6 ld R4 4 R1

7 add R1 R3 R3

Op1 Op2 Dest — Dataflow Queue

• On branch prediction:
  – Checkpoint RCT
  – Checkpoint pointer valid bits
• On checkpoint restore:
  – Restore RCT
  – Invalidate bad pointers

Page 103:

A Day in the Life of a Forwardflow Instruction: Decode

Decoding add R1 R3 R3 into DQ entry 8:
[Animation: the Register Consumer History reads R1 → 7-D (the value will arrive from entry 7’s dest; emit pointer “R1@7D”) and finds R3 in the ARF (R3 = 0). The RCT is then updated: R1 → 8-S1, R3 → 8-D.]

Page 104:

A Day in the Life of a Forwardflow Instruction: Dispatch

[Animation: “add R1@7D R3=0” dispatches into DQ entry 8, below entry 7 (ld R4 4 R1). The available operand value R3 = 0 is written into the DQ; the pointer-linked operand is implicit — not actually written as a value.]

Page 105:

A Day in the Life of a Forwardflow Instruction: Wakeup

 7 ld  R4 4  R1
 8 add R1 0  R3
 9 sub R4 16 R4
10 st  R3    R8

Op1 Op2 Dest — Dataflow Queue

[Animation: DQ entry 7’s result is 0. The update hardware writes DestVal(7) = 0 and reads DestPtr(7) → next pointer 8-S1, carrying the value 0 to successor 8-S1.]

Page 106:

A Day in the Life of a Forwardflow Instruction: Issue (…and Execute)

 7 ld  R4 4  R1
 8 add R1 0  R3
 9 sub R4 16 R4
10 st  R3    R8

Op1 Op2 Dest — Dataflow Queue

[Animation: the value 0 arrives at 8-S1. The update hardware writes S1Val(8) = 0 and reads S1Ptr(8) for the next successor, while Meta(8) and S2Val(8) are read; with both operands available, the instruction issues: add 0 + 0 → DQ8.]

Page 107:

A Day in the Life of a Forwardflow Instruction: Writeback

 7
 8 add R1 0  R3
 9 sub R4 16 R4
10 st  R3    R8

Op1 Op2 Dest — Dataflow Queue

[Animation: writeback of entry 8’s result. The update hardware writes DestVal(8) = 0 and reads DestPtr(8) → next pointer 10-S1, forwarding R3: 0 to entry 10’s first operand.]

Page 108:

A Day in the Life of a Forwardflow Instruction: Commit

 7
 8
 9 sub R4 16 R4
10 st  R3    R8

Op1 Op2 Dest — Dataflow Queue

[Animation: commit logic receives entry 8 (add R1 0 → R3, result R3: 0); it reads Meta(8) and DestVal(8), then writes the architected register: ARF.Write(R3, 0).]

Page 109:

DQ Q&A

Register Consumer History (entries 1–5): R1 → 2-S1, R2 → —, R3 → 4-S1, R4 → 5-S1

1 ld R4 4 R1

2 add R1 R3 R3

3 sub R4 16 R4

4 st R3 R8

5 breq R4 R5

6 ld R4 4 R1

7 add R1 R3 R3

8 sub R4 16 R4

9 st R3 R8

Op1 Op2 Dest — Dataflow Queue

Register Consumer History (entries 6–9): R1 → 7-S1, R2 → —, R3 → 9-S1, R4 → 8-D

Page 110:

Forwardflow – Wakeup

1 ld R4 4 R1

2 add R1 R3 R3

3 sub R4 16 R4

4 st R3 R8

5 breq R4 0

Op1 Op2 Dest — Dataflow Queue

[Animation: DQ entry 1’s result is 7. The update hardware writes DestVal(1) = 7 and reads DestPtr(1) → next pointer 2-S1, carrying the value 7 to successor 2-S1.]

Page 111:

Forwardflow – Selection

1 ld R4 4 R1

2 add R1 R3 R3

3 sub R4 16 R4

4 st R3 R8

5 breq R4 0

Op1 Op2 Dest — Dataflow Queue

[Animation: the value 7 arrives at 2-S1. The update hardware writes S1Val(2) = 7 and reads S1Ptr(2) for the next successor, while Meta(2) and S2Val(2) are read; with both operands available, DQ entry 2 issues the add.]

Page 112:

Forwardflow – Building Pointer Chains: Decode

• Decode must determine, for each operand, where the operand’s value will originate
  – Vanilla-OOO: register renaming
  – Forwardflow-OOO: Register Consumer Table
• The RCT records the last instruction to reference a particular architectural register
  – RAM-based table, analogous to a renamer

Page 113:

Decode Example

Register Consumer History: R1 → 7-D, R2 → —, R3 → —, R4 → 7-S1

Dataflow Queue (Op1, Op2, Dest):
5 ld  R4 4  R4
6 add R4 R1 R4
7 ld  R4 16 R1
8
9

Page 114:

Decode Example

Decoding 8: add R1 R3 R3 (add → R3):
[Animation: the Register Consumer History reads R1 → 7-D (emit pointer “R1@7D”) and finds R3 in the ARF (R3 = 0); the RCT is then updated to R1 → 8-S1 and R3 → 8-D.]

Page 115:

Forwardflow – Dispatch

• Dispatch into DQ:
  – Writes metadata and available operands
  – Appends instruction to forward pointer chains

5 ld R4 4 R4

6 add R4 R1 R4

7 ld R4 16 R1

8 add R1 0 R3

9

Op1 Op2 Dest

Dataflow Queue

[Animation: the decode outputs — R3 = 0, add → R3, and R1@7D — flow into DQ entry 8.]