PERFORMANCE MODELING AND CHARACTERIZATION OF MULTICORE COMPUTING SYSTEMS

ANIL KRISHNA
Advisor: Dr. YAN SOLIHIN

PhD Defense Examination, August 6th, 2013

Image Source: http://en.kioskea.net/faq/372-choosing-the-right-cpu

Good Morning!


AGENDA: How this talk is organized

o RESEARCH OVERVIEW
   o Questions I have been researching all these years
o SUMMARY – Motivation, Problem, Contribution
   o Quick overview of my latest research
o DETAILS of ReSHAPE
   o A performance estimation tool
o VALIDATION
   o Does this tool work?
o USE CASES
   o Where can it be used?
o CONCLUSIONS and FUTURE DIRECTION
   o Where are we? Where to next?

RESEARCH OVERVIEW: In the context of processor chip design trends

[Figure: chip design trends: Single Core (core + cache) to Multi Core]

Scaling the bandwidth wall: challenges in and avenues for CMP scaling
Brian Rogers, Anil Krishna, Gordon Bell, Ken Vu, Xiaowei Jiang, Yan Solihin
International Symposium on Computer Architecture, ISCA 2009

Motivation
o Off-chip bandwidth is pin limited, pins are area limited, area not growing

Problem Statement
o To what extent does the bandwidth wall restrict future multi-core scaling?
o To what extent can bandwidth conservation techniques help?

Contributions and Findings
o Developed simple but effective analytical performance model
o Core to cache ratio changes from 50:50 to 10:90 in 4 generations
o Core scaling is only 3x vs. 16x in 4 generations
o Different bandwidth conservation techniques have different benefits
o Combining techniques can delay this problem significantly
   o 3D-stacked DRAM caches + link and cache compression gives >16x scaling

[Figure: chip design trends: Single Core (core + cache) to Multi Core]

Data sharing in multi-threaded applications and its impact on chip design
Anil Krishna, Ahmad Samih, Yan Solihin
Intl. Symp. on Performance Analysis of Systems and Software, ISPASS 2012

Motivation
o Parallel applications moving from SMP to a single chip, but no change in chip design
o No analytical models exist that can capture the effect of data sharing

Problem Statement
o What is the right way to quantify the impact of data sharing on miss rates?
o How can this be incorporated into an analytical performance model?
o Does data sharing impact optimal on-chip core vs. cache ratios?

Contributions and Findings
o Developed novel approach to quantifying the true impact of data sharing
o Developed analytical performance model that incorporates data sharing
o Showed that core area increases 33% to 49%; throughput increases 58%
o Presence of data sharing encourages larger cores over smaller ones

[Figure: chip design trends: Single Core (core + cache), Multi Core Homogeneous, Multi Core Hybrid]

Hardware acceleration in the IBM PowerEN processor: architecture and performance
Anil Krishna, Timothy Heil, Nicholas Lindberg, Farnaz Toussi, Steven VanderWiel
International Conference on Parallel Architectures and Compilation Techniques, PACT 2012

Motivation
o Understand driving forces, architectural tradeoffs and performance advantages of hardware accelerators via a detailed case study

Problem Statement
o How were the hardware accelerators in IBM's PowerEN selected and designed? How well do they perform?
o How did the presence of hardware accelerators impact the architecture of the rest of the chip?

Contributions and Findings
o Analyzed design and performance of each hardware accelerator in PowerEN (Crypto, XML, Compression, RegX, HEA) in detail
o Identified tradeoffs in what to accelerate (vs. execute on general purpose core) and when to accelerate (large vs. small packets)
o Found that reducing communication overhead and easing programmability requires supporting many new features
   o shared memory model between cores and accelerators, direct cache injection of data from accelerators, ISA extensions

[Figure: chip design trends: Single Core (core + cache), Multi Core Homogeneous, Multi Core Hybrid, Multi Core Heterogeneous]

ReSHAPE: Resource Sharing and Heterogeneity-aware Analytical Performance Estimator
Anil Krishna, Ahmad Samih, Yan Solihin
being submitted to Intl. Symposium on High Performance Computer Architecture, HPCA 2013

Large design space
o How many cores/core types?
o What cache hierarchy?
o Heterogeneity in caches too?

Large configuration space
o How to schedule applications?
o What DVFS settings to use?
o What cores and caches to power-gate?

SUMMARY – Motivation

Design and configuration space explosion with multi-core chips
o As the number and types of cores grow, more designs need to be evaluated
o n! static schedules for a single design with n core types
o Very large configuration space with per-core DVFS even in a single design with a single core type

Detailed simulation too slow
o Be it trace or execution driven, be it cycle-by-cycle simulation or discrete-event simulation

Analytical models fast, but existing models lacking
o Too abstract and lacking sufficient fidelity
o Not flexible enough to handle shared caches, heterogeneity across cores, multi-program mixes

SUMMARY – Problem, Contribution

Problem: Need a tool for early design space exploration
o Fast: At least 1000x faster than detailed simulation
o Accurate: < 20% error in performance projection
o Flexible: Able to model shared cache hierarchies, shared memory bandwidth, heterogeneity across cores and caches on chip, and multi-programmed workload mixes

Contribution: ReSHAPE (Resource Sharing and Heterogeneity-aware Analytical Performance Estimator)
o Hybrid tool: detailed simulation for key statistics + analytical model + iterative solver
o Flexible
o Typically runs in under a second (10,000x faster than detailed simulation)
o Accuracy is promising – IPC error < 5% and cache miss rate error < 15% (validated up to 4 cores)

ReSHAPE – Inputs and Outputs
Resource Sharing and Heterogeneity-aware Analytical Performance Estimator

Input: Chip Configuration
o core counts
o core types
o frequencies
o cache hierarchy (sizes, latencies)
o memory bandwidth
o application schedule

Input: App-Core pair profile
o Base IPC
o Cache accesses per Inst.
o Hit Rate Profiles

[Figure: each application is profiled on each core type (L1I + L1D) with an infinite (∞) L2; example target chips combine cores C0, C1, C2 with private or shared L2, L3 and L4 caches]

ReSHAPE: iterative solver of an underlying analytical model

Output
o Throughput (Instructions per Second)

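ReSHAPE – The Analytical Component
Resource Sharing and Heterogeneity-aware Analytical Performance Estimator

Core 0 (L1I, L1D) with an L2. Time per instruction, with s = seconds and i = instructions:

$$\frac{s}{i} = \frac{s_{base}}{i} + \frac{L2_{acc}}{i} \times L2_{lat}$$

where $\frac{i}{s_{base}}$ is the same as $frequency \times baseIPC$, in instructions (i) per second (s).

Adding an L3:

$$\frac{s}{i} = \frac{s_{base}}{i} + \frac{L2_{acc}}{i} \times L2_{lat} + \frac{L3_{acc}}{i} \times L3_{lat}$$

where $\frac{L3_{acc}}{i}$ is the same as $\frac{L2_{acc}}{i} \times L2_{missrate}$.

Adding memory:

$$\frac{s}{i} = \frac{s_{base}}{i} + \frac{L2_{acc}}{i} \times L2_{lat} + \frac{L3_{acc}}{i} \times L3_{lat} + \frac{Mem_{acc}}{i} \times Mem_{lat}$$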
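The additive model above can be evaluated directly. The following is a minimal sketch, not ReSHAPE's actual code: the function and parameter names are hypothetical, latencies are taken in seconds, and the memory access rate is assumed to follow the same pattern as the L3 one (L3 accesses per instruction times an L3 miss rate), which the slides do not state explicitly.

```python
def seconds_per_instruction(freq_hz, base_ipc, l2_acc_per_inst,
                            l2_miss_rate, l3_miss_rate,
                            l2_lat_s, l3_lat_s, mem_lat_s):
    """Additive time-per-instruction model (all latencies in seconds)."""
    # Base term: i/s_base equals frequency x baseIPC, so s_base/i is its inverse.
    base = 1.0 / (freq_hz * base_ipc)
    # L3 accesses per instruction = L2 accesses per instruction x L2 miss rate.
    l3_acc_per_inst = l2_acc_per_inst * l2_miss_rate
    # Assumption: memory accesses per instruction follow the same pattern.
    mem_acc_per_inst = l3_acc_per_inst * l3_miss_rate
    return (base
            + l2_acc_per_inst * l2_lat_s
            + l3_acc_per_inst * l3_lat_s
            + mem_acc_per_inst * mem_lat_s)


# Illustrative numbers only (not taken from the slides).
spi = seconds_per_instruction(
    freq_hz=2e9, base_ipc=1.0, l2_acc_per_inst=0.02,
    l2_miss_rate=0.5, l3_miss_rate=0.5,
    l2_lat_s=5e-9, l3_lat_s=20e-9, mem_lat_s=100e-9)
throughput_ips = 1.0 / spi  # instructions per second for this core
```

Keeping every term in seconds per instruction means the result inverts directly into the throughput figure that ReSHAPE reports.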


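Core 0 (L1I, L1D) with L2, L3 and memory. Memory is modeled as an M/D/1 system:

$$\lambda = \text{request rate} = \frac{Mem_{acc}}{i} \times \frac{i}{s}$$

$$\mu = \text{service rate} = \frac{bytes}{s} \times \frac{1}{MemAccSzInBytes}$$

$$Mem_{queue} = \frac{\lambda}{2\mu(\mu - \lambda)}$$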
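The M/D/1 terms translate into a few lines of code. This is a hedged sketch with hypothetical function names; it computes the mean queueing delay from the formula above and rejects the saturated case, where the formula no longer applies.

```python
def memory_service_rate(bandwidth_bytes_per_s, mem_access_size_bytes):
    """mu: memory requests served per second."""
    return bandwidth_bytes_per_s / mem_access_size_bytes


def memory_request_rate(mem_acc_per_inst, inst_per_s):
    """lambda: memory requests issued per second."""
    return mem_acc_per_inst * inst_per_s


def md1_queue_delay(lam, mu):
    """Mean M/D/1 waiting time in queue: lambda / (2 * mu * (mu - lambda))."""
    if lam >= mu:
        raise ValueError("request rate must stay below the service rate")
    return lam / (2.0 * mu * (mu - lam))


# Illustrative numbers only: 6.4 GB/s of bandwidth and 64-byte accesses.
mu = memory_service_rate(6.4e9, 64)        # 1e8 requests per second
lam = memory_request_rate(0.05, 1e9)       # 5e7 requests per second
delay_s = md1_queue_delay(lam, mu)         # seconds spent waiting in queue
```

At 50% utilization the waiting time is half of one service time, which is the expected behavior of a queue with deterministic service.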


ReSHAPE's Novelty

Novelty 1
o Separate chip into vertical silos

[Figure: Core 0 and Core 1, each with L1I/L1D and an L2, above a shared L3 that ReSHAPE's partition optimizer divides between them]

Novelty 2
o Use newly computed IPC as baseIPC
o Re-evaluate traffic and partitions
o Iterate until convergence (IPC change <1%)

After convergence
o Use final IPCs to compute throughput
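The iterate-until-convergence step can be sketched as a fixed-point loop. Everything below is an assumption made for illustration: the function names, the two-core toy system sharing one M/D/1 memory channel, and the way the previous IPC estimate is fed back into the traffic calculation. The slides do not spell out these details, and cache repartitioning is left out entirely.

```python
def solve_ipcs(initial_ipcs, evaluate, tol=0.01, max_iters=100):
    """Re-evaluate IPCs until no core's IPC changes by more than tol (1%)."""
    ipcs = list(initial_ipcs)
    for iteration in range(1, max_iters + 1):
        new_ipcs = evaluate(ipcs)
        change = max(abs(n - o) / o for n, o in zip(new_ipcs, ipcs))
        ipcs = new_ipcs
        if change < tol:
            return ipcs, iteration
    raise RuntimeError("IPC estimates did not converge")


# Hypothetical two-core system sharing one memory channel.
FREQ_HZ = 1e9
BASE_IPC = [1.0, 0.5]             # per-core IPC with no memory stalls
MEM_ACC_PER_INST = [0.01, 0.02]   # memory accesses per instruction
MEM_LAT_S = 100e-9                # unloaded memory latency
MU = 1e8                          # memory service rate, requests per second


def evaluate(ipcs):
    """Recompute each core's IPC from the traffic implied by current IPCs."""
    lam = sum(m * ipc * FREQ_HZ for m, ipc in zip(MEM_ACC_PER_INST, ipcs))
    if lam >= MU:
        raise ValueError("memory is saturated")
    queue_s = lam / (2.0 * MU * (MU - lam))          # M/D/1 waiting time
    stall_cycles = (MEM_LAT_S + queue_s) * FREQ_HZ   # cycles per memory access
    return [1.0 / (1.0 / b + m * stall_cycles)
            for b, m in zip(BASE_IPC, MEM_ACC_PER_INST)]


final_ipcs, iterations = solve_ipcs(BASE_IPC, evaluate)
throughput_ips = sum(ipc * FREQ_HZ for ipc in final_ipcs)
```

Starting from the stall-free IPCs overestimates memory traffic, so the first pass is pessimistic and later passes relax toward the fixed point.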

ReSHAPE's Cache partitioning strategy
Resource Sharing and Heterogeneity-aware Analytical Performance Estimator

[Figure: a shared L3 to be divided between two sharers, each with its own curve of hits per sec vs. cache size]

Greedy Approach
o O(n·k) for n cache slices and k sharers
o May be sub-optimal, but does quite well in practice
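One plausible greedy scheme with the stated O(n·k) cost hands out slices one at a time to whichever sharer gains the most hits per second from its next slice. The sketch below assumes that scheme; the function name and the sample curves are hypothetical, and the slides do not confirm that ReSHAPE's greedy approach works exactly this way.

```python
def greedy_partition(hit_curves, n_slices):
    """Assign n_slices cache slices among sharers by marginal hit gain.

    hit_curves[s][c] is the hits per second of sharer s when it owns c
    slices, so every curve needs n_slices + 1 entries (index 0 = no cache).
    """
    k = len(hit_curves)
    alloc = [0] * k
    for _ in range(n_slices):                      # n rounds ...
        def gain(s):
            curve = hit_curves[s]
            return curve[alloc[s] + 1] - curve[alloc[s]]
        winner = max(range(k), key=gain)           # ... of k comparisons each
        alloc[winner] += 1
    return alloc


# Illustrative hit-rate curves for two sharers and four slices.
curve_a = [0, 50, 85, 95, 100]
curve_b = [0, 30, 55, 75, 90]
partition = greedy_partition([curve_a, curve_b], n_slices=4)
total_hits = curve_a[partition[0]] + curve_b[partition[1]]
```

The greedy choice is optimal when every curve has diminishing returns; curves with plateaus followed by steep rises are where it can end up sub-optimal.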

Minimize Misses Strategy
o O(log2 n · 2^k) for n cache slices and k sharers
o May be too slow for large k
o We use this strategy for all evaluations presented here

VALIDATION: Comparing ReSHAPE's projections against SIMICS full system simulator

Step 1: Analyze benchmark applications

[Figure: benchmark applications grouped into Loose Locality, Medium Locality and Tight Locality]

Step 2: Construct workload mixes

2 core, 12 mixes
m00  xalan     namd
m01  xalan     xalan
m02  omnetpp   lib
m03  povray    povray
m04  mcf       namd
m05  milc      milc
m06  omnetpp   tonto
m07  leslie3d  omnetpp
m08  xalan     mcf
m09  tonto     namd
m10  milc      tonto
m11  lib       mcf

4 core, 12 mixes
m00  povray   povray    tonto     namd
m01  povray   tonto     tonto     xalan
m02  mcf      tonto     namd      namd
m03  omnetpp  xalan     leslie3d  povray
m04  omnetpp  leslie3d  leslie3d  xalan
m05  omnetpp  leslie3d  xalan     lib
m06  mcf      lib       milc      povray
m07  omnetpp  mcf       milc      lib
m08  mcf      lib       lib       milc
m09  povray   namd      leslie3d  xalan
m10  mcf      milc      tonto     namd
m11  mcf      xalan     leslie3d  lib

9 core, 7 mixes
m00  povray    tonto     namd     deal2     games   astar    leslie3d  xalan   omnetpp
m01  deal2     games     astar    perl      calculix gromacs lib       milc    mcf
m02  perl      calculix  gromacs  leslie3d  xalan   omnetpp  hmmer     soplex  bzip
m03  povray    tonto     namd     leslie3d  xalan   omnetpp  lib       milc    mcf
m04  perl      calculix  gromacs  lib       milc    mcf      lbm       sphinx  gems
m05  leslie3d  xalan     omnetpp  hmmer     soplex  bzip     lbm       sphinx  gems
m06  hmmer     soplex    bzip     lib       milc    mcf      lbm       sphinx  gems



Step 3: Construct configurations to be validated

[Figure: chip configurations used for validation: 1-, 2- and 4-core designs with 32K L1 caches; lower-level caches of 128KB, 256KB, 512KB, 1MB and 2MB; memory bandwidth of 10Gb/s, plus a bandwidth sweep labeled 10Gb/s, 1Gb/s, 100Mb/s, 10MB/s]

Step 4: Set up identical configurations in SIMICS and ReSHAPE

Step 5: Compare projections from SIMICS and ReSHAPE

Each mix is checkpointed (under SIMICS) after running for 100 billion instructions per application. At least 1 billion instructions beyond this are used for the validation run.
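The comparison in Step 5 reduces to an error summary of ReSHAPE's projections against the Simics measurements. The slides report an average error and a standard deviation but do not define them, so the sketch below assumes absolute relative error per data point and a population standard deviation; the function name and sample values are hypothetical.

```python
from statistics import mean, pstdev


def error_summary(reference, projected):
    """Return (mean, std. dev.) of |projected - reference| / reference."""
    errors = [abs(p - r) / r for r, p in zip(reference, projected)]
    return mean(errors), pstdev(errors)


# Illustrative per-core IPC values only (not taken from the slides).
simics_ipc = [0.50, 0.40, 0.25]
reshape_ipc = [0.51, 0.38, 0.25]
avg_err, std_err = error_summary(simics_ipc, reshape_ipc)
print(f"Average IPC error: {avg_err:.1%} (std. dev. = {std_err:.1%})")
```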


[Figure: 1-core validation, 32K/32K L1 caches, 256KB L2, 10Gb/s memory bandwidth. Bar charts of IPC (Simics) and IPC (ReSHAPE) for astar, lbm, tonto, gcc, perl, bzip, lib, zeusmp, gromacs, sjeng, h264, soplex, hmmer and sphinx; scatter plot "IPC Comparison" of ReSHAPE vs. Simics with Ideal and Observed series]

Average 1-core IPC Error: 1.5% (std. dev. = 1.4%)


[Figure: 2-core validation, 32K/32K L1 caches per core, 1MB shared cache, 10Gb/s memory bandwidth. Bar charts of IPC (Simics) and IPC (ReSHAPE) per core (C0, C1) for mixes m00 to m11; scatter plot "IPC Comparison" of ReSHAPE vs. Simics against the Ideal line]

Average 2-core IPC Error: 2.7% (std. dev. = 2.1%)


[Figure: 2-core validation, 32K/32K L1 caches per core, 1MB shared cache, 10Gb/s memory bandwidth. Bar charts of Misses Per Access (Simics) and Misses Per Access (ReSHAPE) per core (C0, C1) for mixes m00 to m11; log-scale scatter plot "Miss Rate Comparison" of ReSHAPE vs. Simics against the Ideal line]

Average miss rate projection error: 13.4% (std. dev. = 12.6%)


[Figure: 2-core validation, 32K/32K L1 caches per core, 1MB shared cache, 10Gb/s memory bandwidth. Stacked bars of cache Partitions (Simics) and Partitions (ReSHAPE) per core (C0, C1) for mixes m00 to m11; scatter plot "Partition Comparison" of ReSHAPE vs. Simics against the Ideal line]

Average partition size projection error: 3.7% (std. dev. = 4.5%)


[Figure: 4-core validation, 32K/32K L1 caches per core, 2MB shared cache, 10Gb/s memory bandwidth. Bar charts of IPC (Simics) and IPC (ReSHAPE) per core (C0 to C3) for mixes m00 to m11; scatter plot "IPC Comparison" of ReSHAPE vs. Simics against the Ideal line]

Average 4-core IPC Error: 2.5% (std. dev. = 1.8%)


[Figure: 4-core validation, 32K/32K L1 caches per core, 2MB shared cache, 10Gb/s memory bandwidth. Bar charts of Misses Per Access (Simics) and Misses Per Access (ReSHAPE) per core (C0 to C3) for mixes m00 to m11; log-scale scatter plot of ReSHAPE vs. Simics miss rates against the Ideal line]


[Figure: 4-core validation, 32K/32K L1 caches per core, 2MB shared cache, 10Gb/s memory bandwidth. Stacked bars of cache Partitions (Simics) and Partitions (ReSHAPE) per core (C0 to C3) for mixes m00 to m11; scatter plot "Partition Comparison" of ReSHAPE vs. Simics against the Ideal line]

Average partition size projection error: 20.9% (std. dev. = 12.8%)


[Figure: memory bandwidth sweep, 32K/32K L1 caches per core, 2MB shared cache, memory bandwidth labeled 10Gb/s, 1Gb/s, 0.1Gb/s and 0.01Gb/s. Four log-scale scatter plots of ReSHAPE vs. Simics IPC against the Ideal line, titled "IPC Comparison" at 10 GBps, 1 GBps, 0.1 GBps and 0.01 GBps]

Average IPC Error: 17.3% (std. dev. = 5.4%)


[Figure: 4-core validation with a deeper cache hierarchy: 32K/32K L1 caches per core, 128KB and 2MB lower-level caches, 10Gb/s memory bandwidth. Bar charts of IPC (Simics) and IPC (ReSHAPE) per core (C0 to C3) for mixes m00, m01, m02a, m03a, m04a, m05a, m06a, m07a, m08a, m09, m10a and m11a; scatter plot "IPC Comparison" of ReSHAPE vs. Simics against the Ideal line]

Private Caches: Average 4-core IPC Error: 3.1% (std. dev. = 1.6%)


[Figure: same 4-core configuration (32K/32K L1 caches per core, 128KB and 2MB lower-level caches, 10Gb/s memory bandwidth). Bar charts of Misses Per Access (Simics) and Misses Per Access (ReSHAPE) per core (C0 to C3) for mixes m00, m01, m02a, m03a, m04a, m05a, m06a, m07a, m08a, m09, m10a and m11a; log-scale scatter plot of ReSHAPE vs. Simics miss rates against the Ideal line]

USE CASES: Putting ReSHAPE to use

Does increasing the sources of heterogeneity buy us performance?

Four designs are compared:
o Homogeneous
o Heterogeneous Caches
o Heterogeneous Cores
o Heterogeneous Both

Up to 4! unique schedules for a 4-application workload mix
o Applications App0, App1, App2, App3 (A, B, C, D) are assigned to cores C0, C1, C2, C3
o ABCD, ABDC, ACBD, ACDB, ADBC, ADCB
o BCDA, BCAD, BDCA, BDAC, BACD, BADC
o CDAB, CDBA, CADB, CABD, CBDA, CBAD
o DABC, DACB, DBAC, DBCA, DCAB, DCBA

What one might expect to see
o Small improvement with heterogeneous caches. Some loss for bad schedules
o Larger improvement with heterogeneous cores
o Even larger improvement with heterogeneous cores + heterogeneous caches

[Figure: weighted speedup normalized to the Homogeneous design (Max, Min, Mean over schedules) for Het. Cache, Het. Core and Het. Both]
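The "up to 4!" bound can be checked by enumeration. The sketch below assumes that two schedules are equivalent when they place the same applications on the same core type, which holds only if cores of one type also see identical cache arrangements; the function name and core-type labels are hypothetical.

```python
from itertools import permutations


def unique_schedules(apps, core_types):
    """Enumerate schedules that differ in which apps run on which core type."""
    seen = set()
    for order in permutations(apps):
        groups = {}
        for app, core_type in zip(order, core_types):
            groups.setdefault(core_type, []).append(app)
        # Canonical form: apps on interchangeable cores are sorted, so
        # swapping two apps between same-type cores yields the same key.
        key = tuple(sorted((t, tuple(sorted(a))) for t, a in groups.items()))
        seen.add(key)
    return seen


apps = ["A", "B", "C", "D"]
all_distinct = unique_schedules(apps, ["t0", "t1", "t2", "t3"])
two_types = unique_schedules(apps, ["big", "big", "small", "small"])
print(len(all_distinct), len(two_types))
```

With four distinct core types all 24 permutations count, while two big and two small cores leave only 6 distinct schedules, and a mix that repeats an application has fewer still.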




o Smaller cores hurting more than the larger cores helping

o Heterogeneous caches better than heterogeneous cores in this case





9-core designs
o > 350,000 ReSHAPE sims; chart represents > 10 million ReSHAPE sims
o As core count scales (4->9) benefit of heterogeneity increases significantly
o Heterogeneous cores better than heterogeneous caches in this case; but schedule still crucial



9-core designs with 3 core/cache types
o 3 core types and 3 cache sizes do not buy any more performance

o How much and what form of heterogeneity needs careful analysis depending on the design being evaluated

Best per-core DVFS setting for each workload mix, under three optimization targets
Configuration: 4 cores, 32K/32K L1 caches per core, 2MB shared cache, 10Gb/s memory bandwidth

Legend: 1 = 250MHz, 0.5W; 2 = 1GHz, 2W; 3 = 4GHz, 16W

        Weighted Speedup    Perf/Watt           1/(Energy*Delay)
Mix     c0  c1  c2  c3      c0  c1  c2  c3      c0  c1  c2  c3
m00     3   3   1   1       1   1   1   1       3   3   1   1
m01     3   3   3   1       1   1   1   1       3   3   3   1
m02     1   3   3   3       1   1   1   1       1   3   3   3
m03     1   1   1   3       1   1   1   2       1   1   1   3
m04     1   3   3   3       1   1   1   1       1   1   1   1
m05     1   3   3   1       1   1   1   1       1   1   1   1
m06     1   1   1   3       1   1   1   2       1   1   1   3
m07     1   3   1   1       1   1   1   1       1   1   1   1
m08     3   1   1   1       1   1   1   1       1   1   1   1
m09     3   3   1   1       2   1   1   1       3   3   1   1
m10     1   1   3   3       1   1   2   1       1   1   3   3
m11     1   3   3   1       1   1   1   1       1   1   1   1

o Different settings for different workload mixes; and not always the fastest setting!
o Not always the slowest setting when optimizing performance/watt
o Somewhere in between when optimizing Energy x Delay product
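A search like the one behind this table can be sketched as an exhaustive sweep over per-core settings. The frequency and power pairs come from the legend; everything else is an assumption made for illustration: the toy per-application performance model, the use of summed throughput in place of weighted speedup, and the 1/(Energy*Delay) score taken as throughput squared over power for a fixed amount of work. In ReSHAPE the score for each setting would come from the analytical model instead.

```python
from itertools import product

# DVFS settings from the legend: id -> (frequency in Hz, power in W).
SETTINGS = {1: (250e6, 0.5), 2: (1e9, 2.0), 3: (4e9, 16.0)}


def inst_per_sec(app, freq_hz):
    """Toy model: app = (core cycles per instruction, memory stall seconds per instruction)."""
    cpi_core, mem_stall_s = app
    return 1.0 / (cpi_core / freq_hz + mem_stall_s)


def best_settings(apps):
    """Try every per-core setting and keep the best one for each metric."""
    best = {}
    for combo in product(SETTINGS, repeat=len(apps)):
        thr = sum(inst_per_sec(a, SETTINGS[s][0]) for a, s in zip(apps, combo))
        power = sum(SETTINGS[s][1] for s in combo)
        scores = {
            "throughput": thr,
            "perf_per_watt": thr / power,
            # For fixed work, energy x delay scales as power / throughput^2.
            "inv_energy_delay": thr * thr / power,
        }
        for name, value in scores.items():
            if name not in best or value > best[name][0]:
                best[name] = (value, combo)
    return {name: combo for name, (_, combo) in best.items()}


# One compute-bound and one memory-bound application (hypothetical).
choice = best_settings([(1.0, 0.0), (1.0, 20e-9)])
print(choice)
```

Even this toy model reproduces the qualitative pattern on the slide: the memory-bound application gains little from a faster clock, so the best setting depends on both the metric and the mix.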

CONCLUSIONS + FUTURE DIRECTION

Rich design/configuration space for multi-core chips

Analytical modeling can be a promising approach to tackling these large search spaces

ReSHAPE extends this classical analytical performance model in novel ways

Accuracy + speed make ReSHAPE a useful tool for early exploration

Future direction – extend ReSHAPE
o Validate across unique microarchitectures
o Extend key parameters and model: memory level parallelism, writeback traffic, prefetching
o Evaluate more use cases
   o best power-gating strategy based on workload mix
   o dynamic schedules based on per-phase application statistics
o Explore the rich constrained optimization problem of cache partitioning

Thank you!

RELATED WORK

Analytical Modeling of multi-core chips
o Wentzlaff et al. (MIT Tech Report 2010), Li et al. (ISPASS 2005), Yavits et al. (CAL 2013) all tackle different aspects of multicore chip design, but only consider homogeneous cores.
o Wu et al. (ISCA 2013) use locality profiles to identify how the application's cache locality degrades as the application is spread across more threads; they consider multi-threaded applications.

Several works related to heterogeneous design/scheduling
o Navada et al. (PACT 2010, PACT 2013) consider simulation based, criticality driven, design space exploration and mechanisms for selecting the best way to schedule a single application across multiple cores.
o Kumar et al. (Micro 2003, PACT 2006, ISCA 2004) did most of the seminal work in the area of heterogeneous multi-core. However, they have typically relied on detailed simulations, private cache hierarchies and single application scheduling.


[Figure: 1-core validation, 32K/32K L1 caches, 256KB L2, 10Gb/s memory bandwidth. Bar charts of Misses Per Access (Simics) and Misses Per Access (ReSHAPE) for astar, lbm, tonto, gcc, perl, bzip, lib, zeusmp, gromacs, sjeng, h264, soplex, hmmer and sphinx; scatter plot "Miss Rate Error" of ReSHAPE vs. Simics against the Ideal line]

Average miss rate projection error: 7.6% (std. dev. = 12.4%)
