24
1 Coordinated Control of Multiple Prefetchers in Multi-Core Systems Eiman Ebrahimi * Onur Mutlu Chang Joo Lee* Yale N. Patt * * HPS Research Group The University of Texas at Austin ‡ Computer Architecture Laboratory Carnegie Mellon University

1 Coordinated Control of Multiple Prefetchers in Multi-Core Systems Eiman Ebrahimi * Onur Mutlu ‡ Chang Joo Lee * Yale N. Patt * * HPS Research Group The

  • View
    222

  • Download
    1

Embed Size (px)

Citation preview

Page 1: 1 Coordinated Control of Multiple Prefetchers in Multi-Core Systems Eiman Ebrahimi * Onur Mutlu ‡ Chang Joo Lee * Yale N. Patt * * HPS Research Group The

1

Coordinated Control of Multiple Prefetchers in Multi-Core Systems

Eiman Ebrahimi*

Onur Mutlu‡

Chang Joo Lee*

Yale N. Patt*

* HPS Research Group The University of Texas at Austin

‡ Computer Architecture LaboratoryCarnegie Mellon University

Page 2: 1 Coordinated Control of Multiple Prefetchers in Multi-Core Systems Eiman Ebrahimi * Onur Mutlu ‡ Chang Joo Lee * Yale N. Patt * * HPS Research Group The

2

Motivation Aggressive prefetching improves

memory latency tolerance of many applications when they run alone

Prefetching for concurrently-executing applications on a CMP can lead to Significant system performance degradation and

bandwidth waste

Problem:Prefetcher-caused inter-core interference Prefetches of one application contend with

prefetches and demands of other applications

Page 3: 1 Coordinated Control of Multiple Prefetchers in Multi-Core Systems Eiman Ebrahimi * Onur Mutlu ‡ Chang Joo Lee * Yale N. Patt * * HPS Research Group The

3

Potential PerformanceSystem performance improvement of ideally removing all prefetcher-caused inter-core interference in shared resources

00.20.40.60.8

11.21.41.61.8

22.2

WL

1

WL

2

WL

3

WL

4

WL

5

WL

6

WL

7

WL

8

WL

9

WL

10

WL

11

WL

12

WL

13

WL

14

Gm

ean

-32

Per

f. N

orm

aliz

ed t

o N

o T

hro

ttlin

g

56%

Exact workload combinations can be found in paper

Page 4: 1 Coordinated Control of Multiple Prefetchers in Multi-Core Systems Eiman Ebrahimi * Onur Mutlu ‡ Chang Joo Lee * Yale N. Patt * * HPS Research Group The

4

Outline

Background Shortcoming of Prior Approaches to

Prefetcher Control Hierarchical

Prefetcher Aggressiveness Control Evaluation Conclusion

Page 5: 1 Coordinated Control of Multiple Prefetchers in Multi-Core Systems Eiman Ebrahimi * Onur Mutlu ‡ Chang Joo Lee * Yale N. Patt * * HPS Research Group The

5

Increasing Prefetcher Accuracy

Increasing prefetcher accuracy can reduce prefetcher-caused inter-core interference Single-core prefetcher aggressiveness

throttling (e.g., Srinath et al., HPCA ’07) Filtering inaccurate prefetches

(e.g., Zhuang and Lee, ICPP ’03) Dropping inaccurate prefetches at memory

controller (Lee et al., MICRO ’08)

All such techniques operate independently on the prefetches of each application

Page 6: 1 Coordinated Control of Multiple Prefetchers in Multi-Core Systems Eiman Ebrahimi * Onur Mutlu ‡ Chang Joo Lee * Yale N. Patt * * HPS Research Group The

6

Feedback-Directed Prefetching (FDP)(Srinath et al., HPCA ’07) Uses prefetcher feedback information local to the

prefetcher’s core Prefetch accuracy Prefetch timeliness Prefetch cache pollution

Dynamically adapts the prefetcher’s aggressiveness Stream Prefetcher Aggressiveness

Prefetch Distance Prefetch Degree

Access Stream

A

Prefetch Distance P+

1

Prefetch DegreeA+1

P+

2P

+3

P+

4

PShown to perform better than and consume less bandwidth

than static aggressiveness configurations

Page 7: 1 Coordinated Control of Multiple Prefetchers in Multi-Core Systems Eiman Ebrahimi * Onur Mutlu ‡ Chang Joo Lee * Yale N. Patt * * HPS Research Group The

7

Outline

Background Shortcoming of Prior Approaches to

Prefetcher Control Hierarchical

Prefetcher Aggressiveness Control Evaluation Conclusion

Page 8: 1 Coordinated Control of Multiple Prefetchers in Multi-Core Systems Eiman Ebrahimi * Onur Mutlu ‡ Chang Joo Lee * Yale N. Patt * * HPS Research Group The

8

DRAM

Memory Controller

High Interference caused by Accurate Prefetchers

Core2 Core3Core0

Dem 2Addr:A

Dem 2Addr:B Pref 0 Dem 0

Miss

Shared Cache

Pref 1 Pref 3

Dem 2

Bank 0 Bank 1

Pref 3Pref 1

RowBuffers

Pref 1Row Addr.

Pref 3Row Addr.

RequestsBeing

Serviced

Row Buffer

Hit…

Dem 2Addr:A

Core1Dem 1

In a CMP system, accurate prefetchers cancause significant interference with

concurrently-executing applications

Dem XAddr: Y

Demand Request From Core XFor Addr Y

Page 9: 1 Coordinated Control of Multiple Prefetchers in Multi-Core Systems Eiman Ebrahimi * Onur Mutlu ‡ Chang Joo Lee * Yale N. Patt * * HPS Research Group The

9

Set 2

Shortcoming of Per-Core (Local-Only) Prefetcher Aggressiveness Control

Core 0 Core 1 Core 2 Core 3

Dem 2 Dem 2 Dem 3 Dem 3 Dem 2 Dem 2 Dem 3 Dem 3

Dem 3 Dem 3 Dem 2 Dem 2 Dem 3 Dem 3 Dem 3 Dem 3

Dem 2 Dem 2 Dem 2 Dem 2 Dem 3 Dem 3 Dem 3 Dem 3

Pref 0Used_P Pref 0 Pref 1 Pref 1

Prefetcher Degree:

Prefetcher Degree:

Used_P Used_P Used_P

Pref 0Pref 0 Pref 1 Pref 1Used_P Used_P Used_P Used_P

FDP Throttle Up

24 24

Pref 0 Pref 0 Pref 0 Pref 0 Pref 1 Pref 1 Pref 1 Pref 1

Dem 2 Dem 3Dem 2 Dem 3

Local-only prefetcher control techniqueshave no mechanism to detect inter-core interference

Shared Cache

Set 0

Set 1

FDP Throttle Up

Page 10: 1 Coordinated Control of Multiple Prefetchers in Multi-Core Systems Eiman Ebrahimi * Onur Mutlu ‡ Chang Joo Lee * Yale N. Patt * * HPS Research Group The

10

Shortcoming of Local-Only Prefetcher Control

0

0.2

0.4

0.6

0.8

1

lbm

_06

swim

_00

craf

ty_0

0

bzi

p2_

00

Sp

eed

up

ove

r A

lon

e R

un

No PrefetchingPref. + No ThrottlingFeedback-Directed PrefetchingHPAC

0

0.1

0.2

0.3

0.4

0.5

Hspeedup

Our Approach: Use both global and per-core feedback to determine each prefetcher’s aggressiveness

4-core workload example: lbm_06 + swim_00 + crafty_00 + bzip2_00

Page 11: 1 Coordinated Control of Multiple Prefetchers in Multi-Core Systems Eiman Ebrahimi * Onur Mutlu ‡ Chang Joo Lee * Yale N. Patt * * HPS Research Group The

11

Outline

Background Shortcoming of Prior Approaches to

Prefetcher Control Hierarchical

Prefetcher Aggressiveness Control Evaluation Conclusion

Page 12: 1 Coordinated Control of Multiple Prefetchers in Multi-Core Systems Eiman Ebrahimi * Onur Mutlu ‡ Chang Joo Lee * Yale N. Patt * * HPS Research Group The

12

Hierarchical Prefetcher Aggressiveness Control (HPAC)

Memory Controller

Cache Pollution Feedback

Accuracy

Bandwidth Feedback

Local control’s goal: Maximize the prefetching performance of core i independently

Global control’s goal: Keep track of and control prefetcher-caused inter-core interference in shared memory system

GlobalControl

Global Control: accepts or overrides decisions made by local control to improve overall system performance

Core i

LocalControl

Pref. i

Shared Cache

Throttling Decision

LocalThrottling Decision

FinalThrottling Decision

Page 13: 1 Coordinated Control of Multiple Prefetchers in Multi-Core Systems Eiman Ebrahimi * Onur Mutlu ‡ Chang Joo Lee * Yale N. Patt * * HPS Research Group The

13

Terminology Global feedback metrics used in our mechanism:

For each core i: Core i’s prefetcher accuracy – Acc (i)

Core i’s prefetcher caused inter-core cache pollution Pol (i) Demand cache lines of other cores evicted by this core’s

prefetches that are requested subsequent to eviction

Bandwidth consumed by core i - BW (i) Accounts for how long requests from this core tie up DRAM

banks

Bandwidth needed by other cores j != i - BWNO (i) Accounts for how long requests from other cores have to

wait for DRAM banks because of requests from this core

Page 14: 1 Coordinated Control of Multiple Prefetchers in Multi-Core Systems Eiman Ebrahimi * Onur Mutlu ‡ Chang Joo Lee * Yale N. Patt * * HPS Research Group The

14

Calculating Inter-Core Cache Pollution

0

0

0

0

Pollution bit

Coreid

2

0

0

0

1

0

2

2

Hash Function

.

.

.

Prefetch from core i, evicts a core j’s demand from shared cache

Evicted line’sAddressFrom core j

1 j

Core j experiences a demand cache miss

Missing line’sAddressFrom core j

Increment Pol (i)

Pollution Filter of core i

Page 15: 1 Coordinated Control of Multiple Prefetchers in Multi-Core Systems Eiman Ebrahimi * Onur Mutlu ‡ Chang Joo Lee * Yale N. Patt * * HPS Research Group The

15

Hierarchical Prefetcher Aggressiveness Control (HPAC)

Memory Controller

Pol (i)

Acc (i)

BW (i)BWNO (i)

GlobalControl

Core i

LocalControl

Pref. i

Shared Cache

LocalThrottling Decision

FinalThrottling Decision

High Acc (i)

LocalThrottle Up High Pol (i)

High BW (i)

High BWNO (i)

Pol. Filter i

- High accuracy- High pollution- High bandwidth consumedwhile other cores need bandwidth

EnforceThrottle Down

Page 16: 1 Coordinated Control of Multiple Prefetchers in Multi-Core Systems Eiman Ebrahimi * Onur Mutlu ‡ Chang Joo Lee * Yale N. Patt * * HPS Research Group The

16

Heuristics for Global Control

Classification of global control heuristics based on interference severity Severe interference

Action: Reduce the aggressiveness of interfering prefetcher

Borderline interference Action: Prevent prefetcher from transitioning

into severe interference: Allow local-control to only throttle-down

No interference or moderate interference from an accurate prefetcher Action: Allow local control to maximize

local benefits from prefetching

Page 17: 1 Coordinated Control of Multiple Prefetchers in Multi-Core Systems Eiman Ebrahimi * Onur Mutlu ‡ Chang Joo Lee * Yale N. Patt * * HPS Research Group The

17

HPAC Control Policies

Causing LowPollution

Inaccurate

Highly Accurate

Others’ lowBW need

throttle down

Causing HighPollution

ActionInterference ClassBWNO (i)

High BWConsumption

Low BWConsumption Others’ high

BW need

Others’ lowBW need

Inaccuratethrottle down

Highly Accurate

High BWConsumption

Low BWConsumption

Others’ lowBW need

Others’ highBW need

Others’ lowBW need

Others’ highBW need

throttle downSevere interference

Severe interference

Severe interference

Pol (i) Acc (i) BW (i)

Page 18: 1 Coordinated Control of Multiple Prefetchers in Multi-Core Systems Eiman Ebrahimi * Onur Mutlu ‡ Chang Joo Lee * Yale N. Patt * * HPS Research Group The

18

Hardware Cost (4-Core System)

Total hardware cost local-control & global control 15.14 KB

Additional cost on top of FDP 1.55 KB

Additional cost on top of FDP only 1.55 KB

HPAC does not require any structures or logic that are on the processor’s critical path

Page 19: 1 Coordinated Control of Multiple Prefetchers in Multi-Core Systems Eiman Ebrahimi * Onur Mutlu ‡ Chang Joo Lee * Yale N. Patt * * HPS Research Group The

19

Outline

Background Shortcoming of Prior Approaches to

Prefetcher Control Hierarchical

Prefetcher Aggressiveness Control Evaluation Conclusion

Page 20: 1 Coordinated Control of Multiple Prefetchers in Multi-Core Systems Eiman Ebrahimi * Onur Mutlu ‡ Chang Joo Lee * Yale N. Patt * * HPS Research Group The

20

Evaluation Methodology x86 cycle accurate simulator

Baseline processor configuration Per core

4-wide issue, out-of-order, 256-entry ROB Stream prefetcher with 32 streams, prefetch degree:4,

prefetch distance:64 Shared

2MB, 16-way L2 cache (4MB, 32-way for 8-core) DDR3 1333Mhz 8B wide core to memory bus 128, 256 L2 MSHRs for 4-, 8-core Latency of 15ns per command (tRP, tRCD, CL)

HPAC thresholds used

Acc BW Pol BWNO

0.6 50k 90 75k

Page 21: 1 Coordinated Control of Multiple Prefetchers in Multi-Core Systems Eiman Ebrahimi * Onur Mutlu ‡ Chang Joo Lee * Yale N. Patt * * HPS Research Group The

21

Performance Results

0.70.80.9

11.11.21.31.41.5

WL

1

WL

2

WL

3

WL

4

WL

5

WL

6

WL

7

WL

8

WL

9

WL

10

WL

11

WL

12

WL

13

WL

14

AV

G-3

2

Sys

tem

Per

form

ance

No

rmal

ized

to

No

Th

rott

ling No Prefetching FDP HPAC

Class 1 Class 2 Class 3 Class 4

15%

Exact workload combinations can be found in paper

Page 22: 1 Coordinated Control of Multiple Prefetchers in Multi-Core Systems Eiman Ebrahimi * Onur Mutlu ‡ Chang Joo Lee * Yale N. Patt * * HPS Research Group The

22

Summary of Other Results

Further results and analysis are presented in the paper Results with different types of memory controllers

Prefetch-Aware DRAM Controllers (PADC) First-Ready First-Come-First-Served (FR-FCFS)

Effect of HPAC on system fairness

HPAC performance on 8-core systems

Multiple types of prefetchers per core and different local-control policies

Sensitivity to system parameters

Page 23: 1 Coordinated Control of Multiple Prefetchers in Multi-Core Systems Eiman Ebrahimi * Onur Mutlu ‡ Chang Joo Lee * Yale N. Patt * * HPS Research Group The

23

Conclusion Prefetcher-caused inter-core interference can destroy

potential performance of prefetching When prefetching for concurrently executing applications in

CMPs Did not exist in single-application environments

Develop one low-cost hierarchical solution which throttles different cores’ prefetchers in a coordinated manner

The key is to take global feedback into account to determine aggressiveness of each core’s prefetcher Improves system performance by 15% compared to

no throttling on a 4-core system Enables performance improvement from prefetching that is

not possible without it on many workloads

Page 24: 1 Coordinated Control of Multiple Prefetchers in Multi-Core Systems Eiman Ebrahimi * Onur Mutlu ‡ Chang Joo Lee * Yale N. Patt * * HPS Research Group The

24

Thank you!

Questions?