Duke :: March 18, 2010
Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era
Andrew Hilton, University of Pennsylvania, [email protected]

Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era


Page 1: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Duke :: March 18, 2010

Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Andrew Hilton
University of Pennsylvania
[email protected]

Page 2: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Multi-Core Architecture


Single-thread performance growth has diminished
• Clock frequency has hit an energy wall
• Instruction-level parallelism (ILP) has hit energy, memory, and idea walls

Future chips will be heterogeneous multi-cores
• Few high-performance out-of-order cores (Core i7) for serial code
• Many low-power in-order cores (Atom) for parallel code


Page 3: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Multi-Core Performance


Obvious performance key: write more parallel software

Less obvious performance key: speed up existing cores
• Core i7? Keep the serial portion from becoming a bottleneck (Amdahl)
• Atoms? Parallelism is typically not elastic

Key constraint: energy
• Thermal limitations of the chip, cost of energy, cooling costs, …


Page 4: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

“TurboBoost”


Existing technique: Dynamic Voltage/Frequency Scaling (DVFS)
• Increase clock frequency (requires increasing voltage)
+ Simple
+ Applicable to both types of cores
- Not very energy-efficient (energy ≈ frequency²)
- Doesn’t help “memory bound” programs (speedup < frequency increase)

Page 5: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Effectiveness of “TurboBoost”

[Charts: TurboBoost speedup per benchmark (higher is better) and ED² (lower is better)]

Example: TurboBoost 3.2 GHz → 4.0 GHz (+25%)
• Ideal conditions: 25% speedup at constant Energy × Delay²

• Memory bound programs: far from ideal
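The frequency-scaling claims above can be sketched numerically. This is a back-of-envelope model (not the talk’s simulator): it assumes voltage tracks frequency, so energy per unit of work scales ≈ f², and only the non-memory fraction of runtime speeds up. The 50% memory fraction is an invented example value.

```python
# Back-of-envelope DVFS model (a sketch, not from the talk's evaluation).
# Assumption: voltage scales with frequency -> energy per unit work ~ f^2,
# and only the compute (non-memory) fraction of runtime speeds up with f.

def turboboost_ed2(f_ratio, memory_fraction):
    """Relative ED^2 after scaling frequency by f_ratio.

    memory_fraction: fraction of baseline runtime stalled on memory,
    which does not speed up with core frequency.
    """
    delay = memory_fraction + (1.0 - memory_fraction) / f_ratio
    energy = f_ratio ** 2 * (1.0 - memory_fraction) + memory_fraction
    return energy * delay ** 2

# Ideal (fully compute-bound) program: 3.2 GHz -> 4.0 GHz gives ED^2 = 1.0
ideal = turboboost_ed2(4.0 / 3.2, 0.0)
# Memory-bound program (assumed 50% stall time): ED^2 drifts above 1.0
membound = turboboost_ed2(4.0 / 3.2, 0.5)
print(ideal, membound)
```

Under this toy model the ideal program holds ED² constant while the memory-bound one both speeds up less and loses efficiency, matching the slide’s “far from ideal” point.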

Page 6: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

“Memory Bound”

Main memory is slow relative to core (~250 cycles)

Cache hierarchy makes most accesses fast
• “Memory bound” = many L3 misses
• … or in some cases many L2 misses
• … or for in-order cores many L1 misses
• Clock frequency (“TurboBoost”) accelerates only core/L1/L2
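The latencies on this slide (L2 ≈ 10, L3 ≈ 40, memory ≈ 250 cycles) can be turned into a quick average-memory-access-time sketch; the hit rates below are invented for illustration, not measured:

```python
# Rough AMAT sketch using the slide's latencies; hit rates are assumed
# example values, not data from the talk.

def amat(l1_hit, l2_hit, l3_hit, l1=1, l2=10, l3=40, mem=250):
    """Expected cycles per access, walking down the hierarchy."""
    return l1 + (1.0 - l1_hit) * (l2 + (1.0 - l2_hit) * (l3 + (1.0 - l3_hit) * mem))

cache_friendly = amat(0.95, 0.80, 0.90)   # few L3 misses -> ~2 cycles/access
memory_bound   = amat(0.90, 0.50, 0.30)   # many L3 misses -> dominated by the 250
print(cache_friendly, memory_bound)
```

The 250-cycle term dominates as soon as L3 misses are common, which is why frequency scaling (which only shrinks the core/L1/L2 terms) barely helps these programs.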

[Diagram: Core i7 and Atom cores, each with L1$ and L2$ (10 cycles), shared L3$ (40 cycles), main memory (250 cycles)]

Page 7: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Goal: Help Memory Bound Programs

Wanted: complementary technique to TurboBoost

Successful applicants should
• Help “memory bound” programs
• Be at least as energy efficient as TurboBoost (at least ED² constant)
• Work well with both out-of-order and in-order cores

Promising previous idea: latency tolerance
• Helps “memory bound” programs

My work: energy efficient latency tolerance for all cores
• Today: primarily out-of-order (BOLT) [HPCA’10]


Page 8: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Talk Outline

Introduction

Background: memory latency & latency tolerance

My work: energy efficient latency tolerance in BOLT
• Implementation aspects
• Runtime aspects

Other work and future plans


Page 9: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era


LLC (Last-Level Cache) Misses

What is this picture? Loads A & H miss caches

This is an in-order processor
• Misses serialize: latencies add and dominate performance

We want Miss Level Parallelism (MLP): overlap A & H

[Timeline: two serialized 250-cycle misses (not to scale)]

Page 10: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era


Miss-Level Parallelism (MLP)

One option: prefetching
• Requires predicting the address of H at A

Another option: out-of-order execution (Core i7)
• Requires a sufficiently large “window” to do this
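The payoff of overlapping misses can be sketched with trivial arithmetic (a toy model; the 10-cycle compute gap is an assumed value):

```python
# Sketch: why MLP matters. Serialized vs overlapped 250-cycle misses.
MISS = 250

def serialized(n_misses, compute_between=10):
    # Each miss fully exposed: latencies add.
    return n_misses * (MISS + compute_between)

def overlapped(n_misses, compute_between=10):
    # All misses issued together: one memory round trip covers them all.
    return MISS + n_misses * compute_between

print(serialized(2), overlapped(2))   # 520 vs 270 cycles
```

Overlapping just two misses nearly halves the time; that is the entire motivation for growing the window.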

[Timeline: misses A & H overlapped, one 250-cycle latency exposed]

Page 11: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era


Out-of-Order Execution & “Window”

Important “window” structures
• Register file (number of in-flight instructions): 128 insns on Core i7
• Issue queue (number of un-executed instructions): 36 on Core i7
• Sized to “tolerate” (keep the core busy for) ~30-cycle latencies
• To tolerate ~250 cycles, need order-of-magnitude bigger structures
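The sizing argument is essentially Little’s law: to keep a W-wide core busy across a latency of L cycles, roughly W × L instructions must be in flight. A sketch with an assumed 4-wide core:

```python
# Little's-law sketch of window sizing (4-wide core is an assumed example).

def window_needed(issue_width, latency):
    """In-flight instructions needed to cover `latency` at full width."""
    return issue_width * latency

# ~30-cycle latencies fit a Core-i7-sized window (~128 registers)...
assert window_needed(4, 30) <= 128
# ...but ~250-cycle misses need roughly an order of magnitude more.
print(window_needed(4, 250))   # 1000
```

Physically scaling the register file and issue queue to ~1000 entries is the energy problem; scaling the window *virtually* is the latency-tolerance idea.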

Latency tolerance big idea: scale window virtually

[Pipeline diagram: Fetch → Rename → Issue Queue → Register File/FUs → Reorder Buffer, with I$ and D$; A is an LLC miss, B & C completed, D unexecuted]

Page 12: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era


Latency Tolerance

Prelude: Add slice buffer
• New structure (not in conventional processors)
• Can be relatively large: low bandwidth, not in the critical execution core


Page 13: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era


Latency Tolerance

Phase #1: Long-latency cache miss → slice out
• Pseudo-execute: copy to slice buffer, release register & IQ slot


Page 14: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era


Latency Tolerance

Phase #1: Long-latency cache miss → slice out
• Pseudo-execute: copy to slice buffer, release register & IQ slot


Page 15: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era


Latency Tolerance

Phase #1: Long-latency cache miss → slice out
• Pseudo-execute: copy to slice buffer, release register & IQ slot
• Propagate “poison” to identify dependents


Page 16: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era


Latency Tolerance

Phase #1: Long-latency cache miss → slice out
• Pseudo-execute: copy to slice buffer, release register & IQ slot
• Propagate “poison” to identify dependents
• Pseudo-execute them too


Page 17: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era


Latency Tolerance

Phase #1: Long-latency cache miss → slice out
• Pseudo-execute: copy to slice buffer, release register & IQ slot
• Propagate “poison” to identify dependents
• Pseudo-execute them too
• Proceed under the miss
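The poison-propagation step can be sketched as a single pass over the in-flight instructions; the instruction tuples and register names below are invented for illustration:

```python
# Sketch of Phase 1 poison propagation: the miss poisons its output register;
# any instruction reading a poisoned register is pseudo-executed (deferred to
# the slice buffer) and poisons its own output in turn.
# Instruction format (name, source regs, dest reg) is invented for this sketch.

def slice_out(instructions, miss_dest):
    poisoned = {miss_dest}
    slice_buffer = []
    for name, srcs, dest in instructions:
        if poisoned & set(srcs):
            slice_buffer.append(name)   # pseudo-execute: defer, free resources
            poisoned.add(dest)
    return slice_buffer

prog = [
    ("B", ["r1"], "r3"),        # independent of the miss: executes normally
    ("C", ["r3"], "r4"),
    ("D", ["r2"], "r5"),        # reads the missing load's result r2
    ("E", ["r5", "r4"], "r6"),  # transitively dependent via r5
]
print(slice_out(prog, "r2"))    # ['D', 'E']
```

Everything not in the returned slice keeps executing under the miss, which is exactly the “proceed under miss” bullet.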


Page 18: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era


Latency Tolerance

Phase #2: Cache miss returns → slice in


Page 19: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era


Latency Tolerance

Phase #2: Cache miss returns → slice in
• Allocate new registers


Page 20: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era


Latency Tolerance

Phase #2: Cache miss returns → slice in
• Allocate new registers
• Put in issue queue


Page 21: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era


Latency Tolerance

Phase #2: Cache miss returns → slice in
• Allocate new registers
• Put in issue queue
• Re-execute the instruction


Page 22: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era


Latency Tolerance

Phase #2: Cache miss returns → slice in
• Allocate new registers
• Put in issue queue
• Re-execute the instruction
• Problems with sliced-in instructions (exceptions, mis-predictions)?


Page 23: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era


Latency Tolerance

Phase #2: Cache miss returns → slice in
• Allocate new registers
• Put in issue queue
• Re-execute the instruction
• Problems with sliced-in instructions (exceptions, mis-predictions)?
• Recover to checkpoint (taken before A)


Page 24: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Slice Self Containment

Important for latency tolerance: self-contained slices
• A, D, & E have miss-independent inputs
• Capture these values during slice out
• This decouples the slice from the rest of the program
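The capture step extends the slice-out pass: any source that is already available (miss-independent) is copied into the slice-buffer entry, so the slice can re-execute later without touching registers the rest of the program has since recycled. Register names and values are invented for this sketch:

```python
# Sketch of slice self-containment: capture miss-independent input values
# into the slice buffer at slice-out time. Values/names are illustrative.

def slice_out_with_capture(instructions, regfile, miss_dest):
    poisoned = {miss_dest}
    slice_buffer = []
    for name, srcs, dest in instructions:
        if poisoned & set(srcs):
            # copy every already-available (non-poisoned) input value
            captured = {r: regfile[r] for r in srcs if r not in poisoned}
            slice_buffer.append((name, captured))
            poisoned.add(dest)
    return slice_buffer

regfile = {"r1": 7, "r2": None, "r4": 42}     # r2 is the pending miss
prog = [("D", ["r2", "r4"], "r5"), ("E", ["r5", "r1"], "r6")]
for name, captured in slice_out_with_capture(prog, regfile, "r2"):
    print(name, captured)        # D captures r4=42, E captures r1=7
```

After capture, the slice depends only on the miss value itself, which is the decoupling this slide is after.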



Page 25: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Latency Tolerance

Latency tolerance example
• Slice out miss and dependent instructions → “grow” the window
• Slice in after the miss returns


Energy ≈ #boxes in the timeline
• Energy: 1.5×, Delay: 0.5×
• Combine into ED² (ED² < 1.0 = good): ED² = 0.38×
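The slide’s ED² arithmetic checks out directly:

```python
# The ED^2 combination used throughout the talk: energy 1.5x at delay 0.5x
# gives ED^2 = 1.5 * 0.5^2 = 0.375, i.e. the slide's ~0.38x (below 1.0 = win).

def ed2(energy_ratio, delay_ratio):
    return energy_ratio * delay_ratio ** 2

print(round(ed2(1.5, 0.5), 2))   # 0.38
```

The quadratic weight on delay is why a halving of delay can pay for a 50% energy increase several times over.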

Page 26: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Previous Design: CFP


Prior design: Continual Flow Pipelines [Srinivasan’04]

• Obtains speedups, but…


Page 27: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Previous Design: CFP


Prior design: Continual Flow Pipelines [Srinivasan’04]

• Obtains speedups, but also slowdowns


Page 28: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Previous Design: CFP


Prior design: Continual Flow Pipelines [Srinivasan’04]

• Obtains speedups, but also slowdowns• Typically not energy efficient


Page 29: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Energy-Efficient Latency Tolerance?

Efficient implementation
• Re-use existing structures when possible
• New structures must be simple and low-overhead

Runtime efficiency
• Minimize superfluous re-executions

Previous designs have not achieved (or considered) these
• Waiting Instruction Buffer [Lebeck’02]

• Continual Flow Pipeline [Srinivasan’04]

• Decoupled Kilo Instruction Processor [Pericas ’06,’07]


Page 30: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Sneak Preview: Final Results


This talk: my work on efficient latency tolerance
+ Improved performance
+ Performance robustness (do no harm)
+ Performance is energy efficient


Page 31: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Talk Outline

Introduction

Background: memory latency & latency tolerance

My work: energy efficient latency tolerance in BOLT
• Implementation aspects
• Runtime aspects

Other work and future plans


Page 32: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era


Examination of the Problem

Problem with existing design: register management
• Miss-dependent instructions free registers when they execute


Page 33: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era


Examination of the Problem

Problem with existing design: register management
• Miss-dependent instructions free registers when they execute
• Actually, all instructions free registers when they execute

What’s wrong with this?
• No instruction-level precise state → hurts on branch mispredictions
• Execution-order slice buffer → hard to re-rename & re-acquire registers

Page 34: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era


BOLT Register Management

Youngest instructions: keep in the re-order buffer
• Conventional, in-order register freeing

Miss-dependent instructions: in the slice buffer
• Execution-based register freeing


Page 35: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era


BOLT Register Management

In-order speculative retirement stage
• Head of ROB completed or poisoned?


Page 36: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era


BOLT Register Management

In-order speculative retirement stage
• Head of ROB completed or poisoned?
• Release registers


Page 37: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era


BOLT Register Management

In-order speculative retirement stage
• Head of ROB completed or poisoned?
• Release registers


Page 38: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era


BOLT Register Management

In-order speculative retirement stage
• Head of ROB completed or poisoned?
• Release registers
• Poisoned instructions enter the slice buffer


Page 39: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era


BOLT Register Management

In-order speculative retirement stage
• Head of ROB completed or poisoned?
• Release registers
• Poisoned instructions enter the slice buffer
• Completed instructions are done and simply removed


Page 40: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era


BOLT Register Management

In-order speculative retirement stage
• Head of ROB completed or poisoned?
• Release registers
• Poisoned instructions enter the slice buffer
• Completed instructions are done and simply removed


Page 41: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era


BOLT Register Management

Benefits of BOLT’s management
• Youngest instructions (ROB) get conventional recovery (do no harm)


Page 42: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era


BOLT Register Management

Benefits of BOLT’s register management
• Youngest instructions (ROB) get conventional recovery (do no harm)
• Program-order slice buffer allows re-use of SMT (“HyperThreading”)


Page 43: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era


BOLT Register Management

Benefits of BOLT’s register management
• Youngest instructions (ROB) get conventional recovery (do no harm)
• Program-order slice buffer allows re-use of SMT (“HyperThreading”)
• Scale a single, conventionally sized register file


Register File Contribution #1: Hybrid register management, best of both worlds

Page 44: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era


BOLT Register Management

Benefits of BOLT’s register management
• Youngest instructions (ROB) get conventional recovery (do no harm)
• Program-order slice buffer allows re-use of SMT (“HyperThreading”)
• Scale a single, conventionally sized register file

Challenging part: two algorithms, one register file
• Note: two register files is not a good solution


Page 45: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era


Two Algorithms, One Register File

Conventional algorithm (ROB)
• In-order allocation/freeing from a circular queue
• Efficient squashing support by moving a queue pointer


Page 46: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era


Two Algorithms, One Register File

Conventional algorithm (ROB)
• In-order allocation/freeing from a circular queue
• Efficient squashing support by moving a queue pointer

Aggressive algorithm (slice instructions)
• Execution-driven reference counting scheme


Page 47: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era


Two Algorithms, One Register File

How to combine these two algorithms?
• Execution-based algorithm uses reference counting
• Efficiently encode the conventional algorithm as reference counting
• Combine both into one reference count matrix
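The shared invariant is simple: a physical register is freed only when every reference to it has been dropped, whoever holds those references. A minimal sketch of that freeing policy (the matrix encoding of the conventional scheme is simplified away here; this is not BOLT’s actual structure):

```python
# Sketch of reference-counted register freeing for one shared register file.
# A register is freed only when its count reaches zero, whether the references
# came from the conventional ROB path or from sliced-out instructions.

class RefCountedRegfile:
    def __init__(self, n):
        self.counts = [0] * n
        self.free_list = list(range(n))

    def allocate(self):
        reg = self.free_list.pop()
        self.counts[reg] = 1           # reference held by the producer mapping
        return reg

    def add_ref(self, reg):            # a consumer (ROB entry or slice) pins it
        self.counts[reg] += 1

    def drop_ref(self, reg):           # consumer executes / retires / squashes
        self.counts[reg] -= 1
        if self.counts[reg] == 0:
            self.free_list.append(reg)

rf = RefCountedRegfile(4)
r = rf.allocate()
rf.add_ref(r)          # a sliced-out consumer still needs r
rf.drop_ref(r)         # producer mapping retired: slice reference keeps r alive
assert r not in rf.free_list
rf.drop_ref(r)         # slice re-executes and drops its reference
assert r in rf.free_list
```

Encoding the ROB’s in-order freeing as counts is what lets both algorithms share one free list without two register files.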


Register File Contribution #2: Efficient implementation of the new hybrid algorithm

Page 48: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era


Management of Loads and Stores

Large window requires support for many loads and stores
• Window effectively A-V now; what about the loads & stores?
• This could be an hour+ talk by itself… so just a small piece


Page 49: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Store to Load Dependences

Different from register state: cannot capture inputs
• Store → load dependences are determined by addresses
• Cannot “capture” like registers
• Must be able to find the proper (older, matching) stores



Page 50: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Store to Load Dependences

Different from register state: cannot capture inputs
• Store → load dependences are determined by addresses
• Cannot “capture” like registers
• Must be able to find the proper (older, matching) stores
• Must avoid younger matching stores (“write-after-read” hazards)



Page 51: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era


Conventional store buffer (Tail, younger → Head, older):
entry:   86   85   84   83   82   81   80
address: 7B0  2AC  388  1B4  384  1AC  380
value:   90   78   ??   56   ??   34   12
poison:  0    0    1    0    1    0    0

Conventional store queue/store buffer
• Holds stores in program order
• Loads search “associatively” (all entries in parallel)
• Doesn’t scale to the sizes we need

For latency tolerance, we need…
• Poison (easy)
• A scalable way to search, accounting for age (hard)

Page 52: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era


Chained store buffer (Tail, younger → Head, older):
entry:   86   85   84   83   82   81   80
address: 7B0  2AC  388  1B4  384  1AC  380
value:   90   78   ??   56   ??   34   12
poison:  0    0    1    0    1    0    0
link:    44   81   0    15   0    77   0
Root table (64 entries, indexed by low address bits): AC → 85, B0 → 86, B4 → 83, B8 → 21

Replace associative search with iterative indexed search
• Overlay the store buffer with an address-based hash table
• Exploits the in-order nature of speculative retirement to build it
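The chained structure can be sketched in a few lines: stores hash by low address bits into a root table, each new store links to the previous head of its bucket, and a load walks its bucket’s chain, forwarding only from older matching stores. Store IDs, addresses, and the dict-based storage are simplifications of the hardware:

```python
# Sketch of the chained store buffer: root table of youngest store per address
# bucket, per-store link to the next-older store in the same bucket.

class ChainedStoreBuffer:
    def __init__(self, root_entries=64):
        self.root = {}                   # low address bits -> youngest store id
        self.stores = {}                 # id -> (addr, value, link to older)
        self.mask = root_entries - 1

    def insert(self, sid, addr, value):
        bucket = addr & self.mask
        self.stores[sid] = (addr, value, self.root.get(bucket))
        self.root[bucket] = sid

    def search(self, load_id, addr):
        """Return a forwarded value, or None -> go to the D$."""
        sid = self.root.get(addr & self.mask)
        while sid is not None:
            s_addr, value, link = self.stores[sid]
            # skip stores younger than the load (avoids WAR hazards)
            if sid < load_id and s_addr == addr:
                return value
            sid = link
        return None

sb = ChainedStoreBuffer()
sb.insert(81, 0x1AC, 34)
sb.insert(85, 0x2AC, 78)                 # same bucket (low bits ...AC)
print(sb.search(90, 0x1AC))              # 34: skips store 85, matches store 81
print(sb.search(83, 0x1B4))              # None: no older matching store
```

The walk is iterative and indexed rather than fully associative, which is what makes it scale to the slice-buffer-sized windows BOLT needs.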

Page 53: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era



Loads follow the chain starting at the appropriate root table entry
• For example, a load to address 1AC

Example walk: root[AC] = 85; store 85 (2AC) does not match; follow link to 81; store 81 (1AC) matches → forward

Page 54: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era



Loads follow the chain starting at the appropriate root table entry
• For example, a load to address 1AC

Deferred loads ignore younger stores to avoid WAR hazards
• For example, a deferred load to address 1B4 …
• … whose immediately older store is 81 (noted when entering the pipeline)

Example walk: root[B4] = 83; store 83 is younger → ignore; follow link to 15; no older match → go to D$

Page 55: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era


Chained Store Buffer

+ Non-speculative search, scalable
+ Fast
• Most non-forwarding loads access the root table only
• Most forwarding loads find their store on the first shot
• Average number of excess hops < 0.05 with a 64-entry root table


Page 56: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era


BOLT: Implementation Recap

Three key implementation efficiencies in BOLT
1. Re-use of existing renaming hardware
2. Hybrid register management algorithm in a single register file
3. Efficient management of loads and stores


Page 57: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era


Experimental Evaluation

SPEC 2006 benchmarks
• Focus on memory bound programs (TurboBoost gets < 15%)

Performance: detailed cycle-level timing simulation of x86
• Baseline “Core i7” (includes prefetching)

Energy: re-execution overhead + new structures
• Estimate energy of new structures using CACTI-4.1 [Tarjan06]

Page 58: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

CFP vs. BOLT


• Speedups: Overall 5% → 11%, MEM 14% → 18%


Page 59: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

CFP vs. BOLT


• Re-execution: increases due to more latency tolerance


Page 60: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

CFP vs. BOLT


• ED²: overall improvement
• Fewer and simpler new structures (lower energy)
• Increased re-executions typically correspond to higher performance


Page 61: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Talk Outline

Introduction

Background: memory latency & latency tolerance

My work: energy efficient latency tolerance in BOLT
• Implementation aspects
• Runtime aspects

Other work and future plans


Page 62: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Non-Blocking Latency Tolerance

Latency tolerance = non-blocking execution
• Re-execution should not block the pipeline either
• Suppose B & C miss (C depends on B)
• C should also not block the pipeline: reapply latency tolerance


Page 63: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Execution Inefficiency

Dynamic inefficiency: excessive multiple re-execution
• Observe: multiple re-execution comes from dependence on multiple loads
• Two possibilities: loads in parallel or loads in series
• Different approaches to each



Page 64: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Loads in Parallel

Example: accumulating sum

for (i = 0; i < n; i++)
    total += array[i];

Assembly:

loop:
    load [r1] -> r2
    add  r2 + r3 -> r3
    add  r1 + 4 -> r1
    bne  r1, r5 loop


Page 65: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Loads in Parallel


Page 66: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Loads in Parallel


Energy: 3.8×, Delay: 0.4×, ED²: 0.6×
Goal: keep the performance, reduce the re-executions

Page 67: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Join Pruning

A’s miss poisoned B… so A’s return provides its antidote



Page 68: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Join Pruning

A’s miss poisoned B… so A’s return provides its antidote

B now executes correctly, provides antidote to D
• D must capture this input



Page 69: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Join Pruning

A’s miss poisoned B… so A’s return provides its antidote

B now executes correctly, provides antidote to D
• D must capture this input

D is still poisoned by C, cannot provide antidote



Page 70: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Join Pruning

A’s miss poisoned B… so A’s return provides its antidote

B now executes correctly, provides antidote to D
• D must capture this input

D is still poisoned by C, cannot provide antidote

F is not receiving any antidote, no need to re-execute



Page 71: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era


Antidote Vector

BOLT filters re-execution using an antidote bit-vector
• Track (per logical register) whether an antidote is available
• Also through store-to-load dependences (know the poisoning store)
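The join-pruning rule sketched on the previous slides (re-execute only instructions that receive an antidote and carry no remaining poison) can be written as one pass over the slice. Instruction tuples and register names are invented; treating a still-missing load as a poisoned producer is a simplification:

```python
# Sketch of join pruning with an antidote vector: on a miss return, only
# slice instructions that receive an "antidote" (a newly available input)
# and are not still poisoned re-execute; the rest stay deferred.

def prune_reexecution(slice_insns, returned_reg):
    antidote = {returned_reg}
    poisoned = set()
    reexecute, deferred = [], []
    for name, srcs, dest in slice_insns:
        gets_antidote = bool(antidote & set(srcs))
        still_poisoned = bool(poisoned & set(srcs))
        if gets_antidote and not still_poisoned:
            reexecute.append(name)
            antidote.add(dest)          # its result is a fresh antidote
        else:
            deferred.append(name)
            poisoned.add(dest)
    return reexecute, deferred

# A missed to r1 and has returned; C is an independent second miss (r3).
slice_insns = [
    ("B", ["r1"], "r2"),
    ("C", [], "r3"),          # still-missing load, modeled as poisoned producer
    ("D", ["r2", "r3"], "r4"),
    ("F", ["r9"], "r10"),     # receives no antidote: no need to re-execute
]
rex, def_ = prune_reexecution(slice_insns, "r1")
print(rex, def_)              # ['B'] re-executes; C, D, F stay deferred
```

This reproduces the slides’ cases: B gets A’s antidote and runs; D captures B’s value but stays poisoned by C; F, receiving no antidote, is pruned entirely.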


Page 72: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Join Pruning


Without join pruning: Energy 3.8×, Delay 0.4×, ED² 0.6×
With join pruning: Energy 2.8×, Delay 0.4×, ED² 0.45×

Page 73: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Join Pruning Performance


• Performance: strictly better (especially lbm)


Page 74: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Join Pruning Performance


• Performance: strictly better (especially lbm)
• Execution overhead: strictly lower (especially lbm)


Page 75: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Join Pruning Performance


• Performance: strictly better (especially lbm)
• Execution overhead: strictly lower (especially lbm)
• ED²: overall improvements (again, especially lbm)


Page 76: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Loads in Series

Example: count the elements of a linked list

while (node != NULL) {
    count++;
    node = node->next;
}

Assembly:

loop:
    load [r1] -> r1
    add  r2 + 1 -> r2
    bnz  r1, loop



Page 77: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era


Pointer Chasing


Energy: 2.2×, Delay: 1×, ED²: 2.2×


Dr! Dr! It hurts when I apply latency tolerance to pointer chasing…

Page 78: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era


So Don’t Do It…


Skip latency tolerance here: Energy 1×, Delay 1×, ED² 1×
(vs. applying it: Energy 2.2×, Delay 1×, ED² 2.2×)

Page 79: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Loads in Series

Not all dependent loads are bad

for (int i = 0; i < n; i++)
    x += objects[i]->val;

Assembly:

loop:
    load [r1] -> r2
    load [r2] -> r3
    add  r4, r3 -> r4
    add  r1, 4 -> r1
    bne  r1, r5 loop

Important: prune pointer chasing only
• Preserve general indirection



Page 80: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Pointer Chasing

How to distinguish the two?


loop1:
    load [r1] -> r1
    add  r2 + 1 -> r2
    bnz  r1, loop1

loop2:
    load [r1] -> r2
    load [r2] -> r3
    add  r4, r3 -> r4
    add  r1, 4 -> r1
    bne  r1, r5 loop2

Page 81: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Pointer Chasing

How to distinguish the two?
• Pointer chasing: a load poisons younger instances of itself



Page 82: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Pointer Chasing

How to distinguish the two?
• Pointer chasing: a load poisons younger instances of itself
• Benign indirection: poison comes from a different (static) load

[ 82 ][ 82 ]

loop2: load [r1] -> r2 load [r2] -> r3 add r4, r3 -> r4 add r1, 4 -> r1 bne r1, r5 loop2

loop1: load [r1] -> r1 add r2 + 1 -> r2 bnz r1, loop1

Page 83: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Pointer Chasing

How to distinguish the two?
• Pointer chasing: load poisons younger instances of itself
• Benign indirection: poison comes from a different (static) load
• Loop induction is not a chain of poisoned loads

    loop2: load [r1] -> r2
           load [r2] -> r3
           add r4, r3 -> r4
           add r1, 4 -> r1
           bne r1, r5, loop2

    loop1: load [r1] -> r1
           add r2 + 1 -> r2
           bnz r1, loop1
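The classification rule above can be sketched in software. The per-register set of poisoning static-load PCs below is illustrative only (real hardware would use a compact encoding); a load counts as pointer chasing when its address register already carries poison from that same static load:

```c
#include <assert.h>
#include <stdint.h>

#define NREGS 8

/* Per-register poison state: bitmask with one bit per static load PC
 * (illustrative encoding) recording which missing loads poisoned it. */
static uint32_t poison[NREGS];

/* Model "pc: load [src] -> dst" whose data misses in the cache.
 * Returns 1 if classified as pointer chasing: the address register
 * already carries poison from this same static load. */
int miss_load(int pc, int src, int dst) {
    int chasing = (poison[src] >> pc) & 1;
    poison[dst] = poison[src] | (1u << pc);  /* propagate + add own PC */
    return chasing;
}
```

On loop1, the second dynamic instance of `load [r1] -> r1` sees its own PC in r1’s poison and is flagged; on loop2, each load only ever sees poison from the *other* static load, so the benign indirection is preserved.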

Page 84: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Extended Antidote Vector

Idea: extend poison information with low bits of PC
• Poison from the same PC → pointer chasing

One implementation: detect at execution
• Shuts pointer-chasing down immediately
• Complicates latency-critical execution structures

A better one: detect at re-dispatch (extend antidotes)
• Learn the identity of the pointer-chasing PC and shut down future instances

[Pipeline diagram: Fetch → Rename → Issue Queue → Register File → FUs → D$/I$, with Reorder Buffer, Slice Buffer, Chained Store Buffer, and Antidote check]


Page 87: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Pointer Chasing Performance

• Speedups: same (good: no harm)
• Execution overhead: reduced (mcf: 290% → 44%)
• ED2: overall improvement (mcf basically breaks even now)

[Charts: speedup (higher is better); execution overhead and ED2 (lower is better)]

Page 88: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

BOLT vs. TurboBoost

BOLT is able to help performance where TurboBoost cannot…
…and more energy efficiently

[Charts: speedup (higher is better); energy (lower is better)]

Page 89: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

BOLT vs. TurboBoost

BOLT + TurboBoost?
• Synergistic: BOLT “un-memory-bounds” programs
• BOLT + TurboBoost is still an ED2 win!

[Charts: speedup (higher is better); ED2 (lower is better)]

Page 90: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Partial Summary

Latency tolerance
• Scale window virtually under long cache misses
• No good implementations + excessive overhead
• Potentially a good complement to TurboBoost

Energy-efficient latency tolerance
• Low-cost implementation: re-use SMT, registers & load/stores
• Low runtime overhead: prune pointer-chasing and “joins”
• Actually a good complement to TurboBoost
• Applicable to both in-order and out-of-order cores

Page 91: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

iCFP: In-order Latency Tolerance

BOLT – (out-of-order core) + (in-order core) = ?

[Pipeline diagram: full BOLT out-of-order pipeline — Fetch, I$, Rename, Issue Queue, Register File, FUs, D$, Reorder Buffer — plus Slice Buffer, Chained Store Buffer, and Antidote check]

Page 92: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

iCFP: In-order Latency Tolerance

BOLT – (out-of-order core) + (in-order core) = ?

[Diagram: the out-of-order structures removed, leaving the Slice Buffer, Chained Store Buffer, and Antidote check]

Page 93: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

iCFP: In-order Latency Tolerance

BOLT – (out-of-order core) + (in-order core) = iCFP [HPCA’09]
• Some details obviously different due to the in-order pipeline
• Useful for a miss at any cache level (L1, L2, L3)
• Joint work with Santosh Nagarakatte

Other in-order latency tolerant designs
• Sun’s Rock “processor” [Chaudhry’09]
• Simple Latency Tolerant Processor [Nekkalapu’09]

[Diagram: in-order pipeline — Fetch, I$, RF 0, RF 1, FU, D$ — with Slice Buffer, Chained Store Buffer, and Antidote check]

Page 94: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Talk Outline

Introduction

Background: memory latency & latency tolerance

My work: energy efficient latency tolerance in BOLT

Other work and future plans


Page 95: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Other Work / Future Directions

Micro-architecture
• Control independence [ISCA’07]
• Plans for more work related to latency tolerance
  • Store latency tolerance
  • Possibly changes the out-of-order “sweet spot”
• In submission / in progress: energy efficient load/store data path
  • Trident: reduce D$ accesses → improve energy + performance
  • SMT-directory: reduce load queue coherence searches in SMT
• Future work: register reference counting → register file gating
• Generally interested in performance and energy

Page 96: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Other Work / Future Directions

Simulation and workload methodologies
• Multi-programming workload methodology [MobS’09]
• Future plans include adapting these ideas to multi-threaded applications
• Generally interested in research on better simulation

Operating systems and security
• Operating-system-based security project for layered sandboxing
• Provides system calls to restrict the behavior of less trusted code
• Many future plans on this project, most involving hardware support
• Generally interested in how hardware can improve software


Page 98: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Sun’s Rock

Rock [Chaudhry’09] does in-order latency tolerance
• Slice buffer (“Deferral Queues”) divided by multiple checkpoints
• Re-execution limited to the oldest region
• Values from slices reintegrated into the main register file when the DQs empty

[Diagram: in-order pipeline — Fetch, I$, RF 0, RF 1, FU, D$ — with a checkpointed Slice Buffer]

Page 99: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Unrolled Loops

What if the compiler unrolled a loop with pointer chasing?
• Still detectable, it just takes one detection per unrolled copy

    loop1: load [r1] -> r1
           bz r1, endMidLoop
           load [r1] -> r1
           add r2 + 2 -> r2
           bnz r1, loop1



Page 104: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Other Work: FIESTA

Experiments with multiple programs
• Simulation: cannot run an entire program (too slow)
• How do you do this?

Fixed workloads: run all programs for X million insns

Variable workloads: run both until the sum = X million insns

Two very different answers for one question…
Why is this? Which is the right answer?

Page 105: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Traditional Fixed-Workload

Single-program workload × N
• X insns (e.g., 5M/sample) from each program
• Workload composition is fixed across experiments
+ Direct comparisons between experiments
– Load imbalance: time spent executing only the slowest programs

[Timeline: A: 5M insns, B: 5M insns]

Page 106: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Variable-Workload

Multi-program execution defines the workload
• Execute all programs until some condition (e.g., total insns = 10M)
• Normalize to the single-program region defined by this execution
+ Eliminates load imbalance (by construction)
– Naturally oversamples programs which perform better

[Timeline: A: 3M insns, B: 7M insns]
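The oversampling effect can be seen with simple arithmetic: under a variable workload, each program’s share of the instruction budget is proportional to its throughput during the co-scheduled run. The IPC values below are made up for illustration:

```c
#include <assert.h>
#include <math.h>

/* Variable workload: two programs run concurrently until their combined
 * instruction count reaches `total`. Each program's sample size ends up
 * proportional to its IPC during the run, so faster-running programs
 * are naturally oversampled. */
static double sample_size(double my_ipc, double other_ipc, double total) {
    double cycles = total / (my_ipc + other_ipc);  /* when the budget is met */
    return my_ipc * cycles;
}
```

With (hypothetical) co-run IPCs of 0.3 and 0.7 and a 10M-insn budget, the samples come out as 3M and 7M instructions, matching the slide’s picture.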

Page 107: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Deconstructing Load Imbalance

Fixed-workload runs experience two forms of imbalance

Sample imbalance: different standalone runtimes
• Artifact of finite experiments
• Should be eliminated
• Easy: choose samples with the same standalone runtimes

Schedule imbalance: asymmetric (“unfair”) contention
• Characteristic of concurrent execution
• Should be preserved and measured

Page 108: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

FIESTA

FIESTA: Fixed-Instruction with Equal STAndalone runtimes
• Run single programs for C cycles, record the insn count
• Build fixed workloads from time-balanced samples
+ Eliminates sample imbalance
+ Remaining imbalance is schedule imbalance

[Timeline: standalone, A’s 5M-insn sample and B’s 7M-insn sample take equal time; any runtime difference under contention is schedule imbalance]
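The FIESTA construction above can be sketched directly: choosing each sample’s length from a common standalone cycle budget makes all standalone runtimes equal by construction. The IPC values in the test are made up, chosen to reproduce the slide’s 5M/7M samples:

```c
#include <assert.h>

/* FIESTA-style sample: run a program standalone for `cycle_budget`
 * cycles and record the instruction count. Every sample built this
 * way has the same standalone runtime by construction. */
static double fiesta_sample(double standalone_ipc, double cycle_budget) {
    return standalone_ipc * cycle_budget;
}
```

Any leftover imbalance in a co-scheduled run of such samples is then attributable to schedule imbalance, not to unequal sample lengths.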

Page 109: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Other Work: FIESTA

Experiments with multiple programs
• Simulation: cannot run an entire program (too slow)
• How do you do this?

Fixed workloads: run all programs for X million insns

Variable workloads: run both until the sum = X million insns

FIESTA [MobS’09]: create a-priori balanced samples

Joint work with Neeraj Eswaran

Page 110: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Other Work: Paladin

Large software systems: many components, different trust levels
• No way to restrict the behavior of called modules

[Diagram: Trusted Code calling a Junior Developer’s Module, a Plugin, and a Third Party Library]

Page 111: Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Other Work: Paladin

Large software systems: many components, different trust levels
• No way to restrict the behavior of called modules

Paladin [In submission]: OS support for layered sandboxing
• New system calls to restrict system call behavior
• Also ensures restrictions are only removed when the module returns
• Joint work with Jeff Vaughan

[Diagram: Trusted Code calling a Junior Developer’s Module, a Plugin, and a Third Party Library]
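Paladin’s actual interface is not shown in the deck; as a conceptual sketch only, layered sandboxing can be modeled as a stack of permission masks, where an operation is allowed only if every active layer permits it, and a layer is popped only when its module returns:

```c
#include <assert.h>
#include <stdint.h>

#define MAX_LAYERS 8

/* Illustrative operation bits — not Paladin's real interface. */
enum { OP_READ = 1, OP_WRITE = 2, OP_NETWORK = 4 };

/* Stack of permission masks; restrictions only tighten with depth. */
static uint32_t layers[MAX_LAYERS];
static int depth = 0;

void sandbox_push(uint32_t allowed) { layers[depth++] = allowed; }
void sandbox_pop(void) { depth--; }   /* only when the module returns */

/* An operation must be permitted by EVERY active layer. */
int op_allowed(uint32_t op) {
    for (int i = 0; i < depth; i++)
        if ((layers[i] & op) == 0)
            return 0;
    return 1;
}
```

Trusted code would push a tighter mask before calling a plugin or third-party library, so nothing the callee does can widen its own permissions; popping happens only on return to the caller.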