PERFORMANCE MODELING AND CHARACTERIZATION OF MULTICORE COMPUTING SYSTEMS
ANIL KRISHNA
Advisor: Dr. YAN SOLIHIN
PhD Defense Examination, August 6th 2013
Image Source: http://en.kioskea.net/faq/372-choosing-the-right-cpu
Good Morning!
AGENDA: How this talk is organized
o RESEARCH OVERVIEW
  o Questions I have been researching all these years
o SUMMARY – Motivation, Problem, Contribution
  o Quick overview of my latest research
o DETAILS of ReSHAPE
  o A performance estimation tool
o VALIDATION
  o Does this tool work?
o USE CASES
  o Where can it be used?
o CONCLUSIONS and FUTURE DIRECTION
  o Where are we? Where to next?
RESEARCH OVERVIEW: In the context of processor chip design trends
[Diagram: chip design trend from Single Core (core + cache) to Multi Core]
Scaling the bandwidth wall: challenges in and avenues for CMP scaling
Brian Rogers, Anil Krishna, Gordon Bell, Ken Vu, Xiaowei Jiang, Yan Solihin
International Symposium on Computer Architecture, ISCA 2009
Motivation
o Off-chip bandwidth is pin limited, pins are area limited, area not growing
Problem Statement
o To what extent does the bandwidth wall restrict future multi-core scaling?
o To what extent can bandwidth conservation techniques help?
Contributions and Findings
o Developed simple but effective analytical performance model
o Core to cache ratio changes from 50:50 to 10:90 in 4 generations
o Core scaling is only 3x vs. 16x in 4 generations
o Different bandwidth conservation techniques have different benefits
o Combining techniques can delay this problem significantly
  o 3D-stacked DRAM caches + link and cache compression gives >16x scaling
Data sharing in multi-threaded applications and its impact on chip design
Anil Krishna, Ahmad Samih, Yan Solihin
Intl. Symp. on Performance Analysis of Systems and Software, ISPASS 2012
Motivation
o Parallel applications moving from SMP to a single chip, but no change in chip design
o No analytical models exist that can capture the effect of data sharing
Problem Statement
o What is the right way to quantify the impact of data sharing on miss rates?
o How can this be incorporated into an analytical performance model?
o Does data sharing impact optimal on-chip core vs. cache ratios?
Contributions and Findings
o Developed novel approach to quantifying the true impact of data sharing
o Developed analytical performance model that incorporates data sharing
o Showed that core area increases 33% to 49%; throughput increases 58%
o Presence of data sharing encourages larger cores over smaller ones
[Diagram: chip design trend: Single Core (core + cache), Multi Core Homogeneous, Multi Core Hybrid]
Hardware acceleration in the IBM PowerEN processor: architecture and performance
Anil Krishna, Timothy Heil, Nicholas Lindberg, Farnaz Toussi, Steven VanderWiel
International conference on Parallel Architectures and Compilation Techniques, PACT 2012
Motivation
o Understand driving forces, architectural tradeoffs and performance advantages of hardware accelerators via a detailed case study
Problem Statement
o How were the hardware accelerators in IBM’s PowerEN selected and designed? How well do they perform?
o How did the presence of hardware accelerators impact the architecture of the rest of the chip?
Contributions and Findings
o Analyzed design and performance of each hardware accelerator in PowerEN (Crypto, XML, Compression, RegX, HEA) in detail
o Identified tradeoffs in what to accelerate (vs. execute on general purpose core) and when to accelerate (large vs. small packets)
o Found that reducing communication overhead and easing programmability requires supporting many new features
  o shared memory model between cores and accelerators, direct cache injection of data from accelerators, ISA extensions
[Diagram: chip design trend: Single Core (core + cache), Multi Core Homogeneous, Multi Core Hybrid, Multi Core Heterogeneous]
ReSHAPE: Resource Sharing and Heterogeneity-aware Analytical Performance Estimator
Anil Krishna, Ahmad Samih, Yan Solihin
being submitted to Intl. Symposium on High Performance Computer Architecture, HPCA 2013
Large design space
o How many cores/core types?
o What cache hierarchy?
o Heterogeneity in caches too?
Large configuration space
o How to schedule applications?
o What DVFS settings to use?
o What cores and caches to power-gate?
SUMMARY – Motivation
Design and configuration space explosion with multi-core chips
o As the number and types of cores grow, more designs need to be evaluated
o n! static schedules for a single design with n core types
o Very large configuration space with per-core DVFS even in a single design with a single core type
Detailed simulation too slow
o Be it trace or execution driven, be it cycle-by-cycle simulation or discrete-event simulation
Analytical models fast, but existing models lacking
o Too abstract and lacking sufficient fidelity
o Not flexible enough to handle shared caches, heterogeneity across cores, multi-program mixes
SUMMARY – Problem, Contribution
Problem: Need a tool for early design space exploration
o Fast: At least 1000x faster than detailed simulation
o Accurate: < 20% error in performance projection
o Flexible: Able to model shared cache hierarchies, shared memory bandwidth, heterogeneity across cores and caches on chip, and multi-programmed workload mixes
Contribution: ReSHAPE (Resource Sharing and Heterogeneity-aware Analytical Performance Estimator)
o Hybrid tool: detailed simulation for key statistics + analytical model + iterative solver
o Flexible
o Typically runs in under a second (10,000x faster than detailed simulation)
o Accuracy is promising – IPC error < 5% and cache miss rate error < 15% (validated up to 4 cores)
ReSHAPE – Inputs and Outputs
Resource Sharing and Heterogeneity-aware Analytical Performance Estimator
Inputs
o Chip Configuration
  o core counts
  o core types
  o Frequencies
  o Cache hierarchy
  o memory bandwidth
  o application schedule
o App-Core pair profile
  o Base IPC
  o Cache accesses per Inst.
  o Hit Rate Profiles
ReSHAPE: Iterative solver of an underlying analytical model
Output
o Throughput (Instructions per Second)
[Diagram: app-core pair profiles collected on Core 0 / Core 1 with L1I/L1D and an infinite (∞) L2; example target chips with cores C0 to C2, private L1I/L1D caches, and L2, L3 and L4 levels]
ReSHAPE – The Analytical Component
Inputs
o Chip Configuration: core counts, core types, frequencies, cache hierarchy (sizes, latencies), memory bandwidth, application schedule
o App-Core pair profile: Base IPC, cache accesses per inst., hit rate profiles
[Diagram: Core 0 with L1I/L1D, extended step by step with an L2, then an L3, then memory]
The model builds up seconds per instruction (s/i) one level at a time. The output is instructions (i) per second (s), which in the base case is the same as frequency × base IPC.
Base case:
\[ \frac{s}{i} = \frac{s_{base}}{i} \]
Adding an L2:
\[ \frac{s}{i} = \frac{s_{base}}{i} + \frac{L2_{acc}}{i} \times L2_{lat} \]
Adding an L3, where \( L3_{acc}/i \) is the same as \( \frac{L2_{acc}}{i} \times L2_{missrate} \):
\[ \frac{s}{i} = \frac{s_{base}}{i} + \frac{L2_{acc}}{i} \times L2_{lat} + \frac{L3_{acc}}{i} \times L3_{lat} \]
Adding memory:
\[ \frac{s}{i} = \frac{s_{base}}{i} + \frac{L2_{acc}}{i} \times L2_{lat} + \frac{L3_{acc}}{i} \times L3_{lat} + \frac{Mem_{acc}}{i} \times Mem_{lat} \]
Memory is modeled as an M/D/1 system:
\[ \lambda = \text{request rate} = \frac{Mem_{acc}}{i} \times \frac{i}{s} \]
\[ \mu = \text{service rate} = \frac{bytes}{s} \times \frac{1}{MemAccSzInBytes} \]
\[ Mem_{queue} = \frac{\lambda}{2\mu(\mu - \lambda)} \]
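The level-by-level model above can be sketched in a few lines of Python. This is a minimal illustration, not ReSHAPE's implementation: the names `AppCoreProfile`, `seconds_per_instruction` and `md1_queue_delay` are hypothetical, and deriving memory accesses per instruction from an L3 miss rate is an assumption made by analogy with the L3 term on the slides.

```python
from dataclasses import dataclass


@dataclass
class AppCoreProfile:
    """Hypothetical container for the per app-core inputs named on the slides."""
    base_ipc: float         # instructions per cycle before L2/L3/memory stalls are added
    l2_acc_per_inst: float  # L2 accesses per instruction
    l2_miss_rate: float     # fraction of L2 accesses that go on to L3
    l3_miss_rate: float     # fraction of L3 accesses that go on to memory (assumed input)


def md1_queue_delay(lam: float, mu: float) -> float:
    """Mean M/D/1 queueing delay: lambda / (2 * mu * (mu - lambda))."""
    if lam >= mu:
        return float("inf")  # offered load meets or exceeds the service rate
    return lam / (2.0 * mu * (mu - lam))


def seconds_per_instruction(p: AppCoreProfile, freq_hz: float,
                            l2_lat_s: float, l3_lat_s: float,
                            mem_lat_s: float) -> float:
    """Sum the base, L2, L3 and memory terms of the s/i equation."""
    s_base = 1.0 / (freq_hz * p.base_ipc)        # i/s in the base case = frequency x base IPC
    l3_acc = p.l2_acc_per_inst * p.l2_miss_rate  # L3acc/i = L2acc/i x L2 miss rate
    mem_acc = l3_acc * p.l3_miss_rate            # assumption: same pattern one level down
    return (s_base
            + p.l2_acc_per_inst * l2_lat_s
            + l3_acc * l3_lat_s
            + mem_acc * mem_lat_s)


# Illustrative numbers only (not taken from the talk).
profile = AppCoreProfile(base_ipc=1.0, l2_acc_per_inst=0.02,
                         l2_miss_rate=0.5, l3_miss_rate=0.5)
spi = seconds_per_instruction(profile, freq_hz=1e9,
                              l2_lat_s=10e-9, l3_lat_s=30e-9, mem_lat_s=100e-9)
throughput = 1.0 / spi  # instructions per second
```

The slides do not say how the queueing delay feeds back into the memory term; one plausible reading is that it is added to the memory latency before the final term is evaluated.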
ReSHAPE’s Novelty
[Diagram: Core 0 and Core 1, each with L1I/L1D and an L2, sharing an L3 that ReSHAPE’s partition optimizer splits between them]
Novelty 1
o Separate chip into vertical silos
Novelty 2
o Use newly computed IPC as baseIPC
o Re-evaluate traffic and partitions
o Iterate until convergence (IPC change <1%)
After convergence
o Use final IPCs to compute throughput
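The iterate-until-convergence loop can be sketched as a fixed-point iteration. The sketch below is an assumption-laden simplification: it re-evaluates only shared memory traffic (not cache partitions), adds the M/D/1 queueing delay to the memory latency, clamps the load just below saturation, and damps each update to avoid oscillation. None of those choices are stated on the slides, and `solve_ips` is a hypothetical name.

```python
def md1_queue_delay(lam: float, mu: float) -> float:
    """Mean M/D/1 queueing delay, with the load clamped just below saturation."""
    lam = min(lam, 0.999 * mu)  # assumption: keep the delay finite near saturation
    return lam / (2.0 * mu * (mu - lam))


def solve_ips(cores, mu, mem_lat_s, tol=0.01, max_iters=1000):
    """Iterate per-core instructions/second until the largest change is below tol.

    cores: list of (s_base, mem_acc_per_inst) pairs, one per silo.
    mu:    memory service rate in requests per second, shared by all cores.
    """
    ips = [1.0 / s_base for s_base, _ in cores]  # start from the uncontended rate
    for _ in range(max_iters):
        # Total request rate offered to memory by all cores at the current IPCs.
        lam = sum(acc * rate for (_, acc), rate in zip(cores, ips))
        delay = mem_lat_s + md1_queue_delay(lam, mu)
        target = [1.0 / (s_base + acc * delay) for s_base, acc in cores]
        # Damped update (assumption): move halfway toward the new estimate.
        new_ips = [0.5 * (old + t) for old, t in zip(ips, target)]
        change = max(abs(n - o) / o for n, o in zip(new_ips, ips))
        ips = new_ips
        if change < tol:  # slide criterion: IPC change < 1%
            break
    return ips


# Two identical hypothetical cores sharing one memory channel.
final_ips = solve_ips(cores=[(1e-9, 0.01), (1e-9, 0.01)], mu=1e8, mem_lat_s=100e-9)
chip_throughput = sum(final_ips)  # use final IPCs to compute throughput
```

Because each core's request rate depends on its own speed, and its speed depends on everyone's request rate, a single pass is not enough. The loop is what lets the model account for contention.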
ReSHAPE’s Cache partitioning strategy
[Figure: a shared L3 to be split between two sharers; each sharer has a curve of hits per sec versus cache size]
Greedy Approach
o O(n·k) for n cache slices and k sharers
o May be sub-optimal, but does quite well in practice
Minimize Misses Strategy
o O(log2 n · 2^k) for n cache slices and k sharers
o May be too slow for large k
o We use this strategy for all evaluations presented here
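A greedy partitioner of the kind described above can be sketched as follows. This is one plausible reading of "greedy" (hand out slices one at a time to the sharer with the largest marginal gain in hits per second), not necessarily the exact rule ReSHAPE uses. The function name and the hit curves are hypothetical.

```python
def greedy_partition(hit_curves, n_slices):
    """Assign n_slices cache slices among sharers, one slice at a time.

    hit_curves[j][s] is sharer j's hits per second when it owns s slices, so
    each curve needs n_slices + 1 entries. Each of the n steps scans all k
    sharers, which gives the O(n*k) cost quoted on the slide.
    """
    alloc = [0] * len(hit_curves)
    for _ in range(n_slices):
        # Marginal gain in hits/sec if each sharer received one more slice.
        gains = [curve[a + 1] - curve[a] for curve, a in zip(hit_curves, alloc)]
        winner = max(range(len(gains)), key=lambda j: gains[j])
        alloc[winner] += 1
    return alloc


# Hypothetical hit-rate curves for two sharers and four slices.
curve_a = [0, 10, 18, 24, 28]
curve_b = [0, 5, 9, 12, 14]
partition = greedy_partition([curve_a, curve_b], n_slices=4)
```

Greedy allocation is optimal when every curve has diminishing returns (is concave). With curves that have plateaus followed by cliffs it can miss the best split, which matches the slide's "may be sub-optimal" caveat.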
VALIDATION: Comparing ReSHAPE’s projections against SIMICS full system simulator
Step 1: Analyze benchmark applications
[Figure: benchmark applications grouped by cache locality: Loose Locality, Medium Locality, Tight Locality]
Step 2: Construct workload mixes
2 core, 12 mixes
Mix   C0        C1
m00   xalan     namd
m01   xalan     xalan
m02   omnetpp   lib
m03   povray    povray
m04   mcf       namd
m05   milc      milc
m06   omnetpp   tonto
m07   leslie3d  omnetpp
m08   xalan     mcf
m09   tonto     namd
m10   milc      tonto
m11   lib       mcf
4 core, 12 mixes
Mix   C0        C1        C2        C3
m00   povray    povray    tonto     namd
m01   povray    tonto     tonto     xalan
m02   mcf       tonto     namd      namd
m03   omnetpp   xalan     leslie3d  povray
m04   omnetpp   leslie3d  leslie3d  xalan
m05   omnetpp   leslie3d  xalan     lib
m06   mcf       lib       milc      povray
m07   omnetpp   mcf       milc      lib
m08   mcf       lib       lib       milc
m09   povray    namd      leslie3d  xalan
m10   mcf       milc      tonto     namd
m11   mcf       xalan     leslie3d  lib
9 core, 7 mixes
m00   povray tonto namd deal2 games astar leslie3d xalan omnetpp
m01   deal2 games astar perl calculix gromacs lib milc mcf
m02   perl calculix gromacs leslie3d xalan omnetpp hmmer soplex bzip
m03   povray tonto namd leslie3d xalan omnetpp lib milc mcf
m04   perl calculix gromacs lib milc mcf lbm sphinx gems
m05   leslie3d xalan omnetpp hmmer soplex bzip lbm sphinx gems
m06   hmmer soplex bzip lib milc mcf lbm sphinx gems
Step 3: Construct configurations to be validated
[Diagram: validated chip configurations, from a single core up to multiple cores, each core with 32K L1 caches; lower-level caches of 128KB, 256KB, 512KB, 1MB and 2MB; memory bandwidth of 10Gb/s, with 1Gb/s, 100Mb/s and 10Mb/s also swept on some configurations]
Step 4: Set up identical configurations in SIMICS and ReSHAPE
Step 5: Compare projections from SIMICS and ReSHAPE
o Each mix is checkpointed (under SIMICS) after running for 100 billion instructions per application
o At least 1 billion instructions beyond this are used for the validation run
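The comparison in Step 5 is summarized on the following slides as an average error with a standard deviation. A minimal sketch of such a summary is shown below. It assumes the error is the absolute relative difference with the simulator as the reference, and it uses the population standard deviation. The slides do not state either choice, and the IPC values are made up for illustration.

```python
from statistics import mean, pstdev


def relative_errors_pct(reference, projected):
    """Absolute error of each projection relative to the reference, in percent."""
    return [abs(p - r) / r * 100.0 for r, p in zip(reference, projected)]


# Hypothetical per-application IPC values (not from the talk).
simics_ipc = [0.50, 0.40, 0.80]
reshape_ipc = [0.55, 0.38, 0.80]

errors = relative_errors_pct(simics_ipc, reshape_ipc)
avg_error = mean(errors)    # reported as "Average IPC Error"
std_error = pstdev(errors)  # reported as "std. dev."
```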
Configuration: 1 core with 32K L1 caches, 256KB L2, 10Gb/s memory bandwidth
Average 1-core IPC Error: 1.5% (std. dev. = 1.4%)
[Figure: IPC (0 to 1) under Simics and under ReSHAPE for astar, lbm, tonto, gcc, perl, bzip, lib, zeusmp, gromacs, sjeng, h264, soplex, hmmer and sphinx, plus an "IPC Comparison" scatter plot of ReSHAPE versus Simics with Ideal and Observed series]
Configuration: 2 cores with 32K L1 caches, shared 1MB L2, 10Gb/s memory bandwidth
Average 2-core IPC Error: 2.7% (std. dev. = 2.1%)
[Figure: per-core IPC (C0, C1; 0 to 0.6) for mixes m00 to m11 under Simics and under ReSHAPE, plus an "IPC Comparison" scatter plot of ReSHAPE versus Simics against the Ideal line]
Configuration: 2 cores with 32K L1 caches, shared 1MB L2, 10Gb/s memory bandwidth
Average miss rate projection error: 13.4% (std. dev. = 12.6%)
[Figure: per-core Misses Per Access (C0, C1; 0 to 1) for mixes m00 to m11 under Simics and under ReSHAPE, plus a "Miss Rate Comparison" scatter plot of ReSHAPE versus Simics on a log scale from 0.001 to 1 against the Ideal line]
Configuration: 2 cores with 32K L1 caches, shared 1MB L2, 10Gb/s memory bandwidth
Average partition size projection error: 3.7% (std. dev. = 4.5%)
[Figure: cache partitions (C0, C1; fractions from 0 to 1) for mixes m00 to m11 under Simics and under ReSHAPE, plus a "Partition Comparison" scatter plot of ReSHAPE versus Simics against the Ideal line]
Configuration: 4 cores with 32K L1 caches, shared 2MB L2, 10Gb/s memory bandwidth
Average 4-core IPC Error: 2.5% (std. dev. = 1.8%)
[Figure: per-core IPC (C0 to C3; 0 to 0.6) for mixes m00 to m11 under Simics and under ReSHAPE, plus an "IPC Comparison" scatter plot of ReSHAPE versus Simics against the Ideal line]
Configuration: 4 cores with 32K L1 caches, shared 2MB L2, 10Gb/s memory bandwidth
Average miss rate projection error: 12.8% (std. dev. = 13.1%)
[Figure: per-core Misses Per Access (C0 to C3; 0 to 1) for mixes m00 to m11 under Simics and under ReSHAPE, plus a "Miss Rate Comparison" scatter plot of ReSHAPE versus Simics on a log scale from 0.001 to 1 against the Ideal line]
Configuration: 4 cores with 32K L1 caches, shared 2MB L2, 10Gb/s memory bandwidth
Average partition size projection error: 20.9% (std. dev. = 12.8%)
[Figure: cache partitions (C0 to C3; fractions from 0 to 1) for mixes m00 to m11 under Simics and under ReSHAPE, plus a "Partition Comparison" scatter plot of ReSHAPE versus Simics against the Ideal line]
Configuration: 32K L1 caches, shared 2MB L2; memory bandwidth swept over 10Gb/s, 1Gb/s, 0.1Gb/s and 0.01Gb/s
Average IPC Error: 17.3% (std. dev. = 5.4%)
[Figure: four "IPC Comparison" scatter plots of ReSHAPE versus Simics on a log scale, one per bandwidth setting (panels labeled 10 GBps, 1 GBps, 0.1 GBps and 0.01 GBps), against the Ideal line]
Configuration: 4 cores with 32K L1 caches, 128KB and 2MB lower-level caches, 10Gb/s memory bandwidth
Private Caches: Average 4-core IPC Error: 3.1% (std. dev. = 1.6%)
[Figure: per-core IPC (C0 to C3; 0 to 0.6) per workload mix under Simics and under ReSHAPE, plus an "IPC Comparison" scatter plot of ReSHAPE versus Simics against the Ideal line]
Configuration: 4 cores with 32K L1 caches, 128KB and 2MB lower-level caches, 10Gb/s memory bandwidth
Average miss rate projection error: 7.5% (std. dev. = 7.1%)
[Figure: per-core Misses Per Access (C0 to C3; 0 to 1) per workload mix under Simics and under ReSHAPE, plus a "Miss Rate Comparison" scatter plot of ReSHAPE versus Simics on a log scale from 0.0001 to 1 against the Ideal line]
USE CASES: Putting ReSHAPE to use
Does increasing the sources of heterogeneity buy us performance?
o Designs compared: Homogeneous, Heterogeneous Caches, Heterogeneous Cores, Heterogeneous Both
o Up to 4! (= 24) unique schedules for a 4-application workload mix (App0 to App3 on cores C0 to C3):
  ABCD ABDC ACBD ACDB ADBC ADCB
  BCDA BCAD BDCA BDAC BACD BADC
  CDAB CDBA CADB CABD CBDA CBAD
  DABC DACB DBAC DBCA DCAB DCBA
What one might expect to see
o Small improvement with heterogeneous caches. Some loss for bad schedules
o Larger improvement with heterogeneous cores
o Even larger improvement with heterogeneous cores + heterogeneous caches
[Figure: expected weighted speedup normalized to the Homogeneous design, showing Max, Min and Mean across schedules for Het. Cache, Het. Core and Het. Both]
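Enumerating the schedules and summarizing them as Max, Min and Mean can be sketched as below. The per-application gains and the choice of which cores are "large" are hypothetical. The score is a deliberately simple stand-in that ignores the cache and bandwidth interference ReSHAPE models.

```python
from itertools import permutations
from statistics import mean

# Hypothetical speedup each application sees on a large core (1.0 on a small core).
GAIN_ON_LARGE_CORE = {"A": 2.0, "B": 1.5, "C": 1.2, "D": 1.0}
LARGE_CORES = {2, 3}  # assumption: C2 and C3 are large, C0 and C1 are small


def weighted_speedup(schedule):
    """Sum of per-application speedups; schedule[i] is the app placed on core i."""
    return sum(GAIN_ON_LARGE_CORE[app] if core in LARGE_CORES else 1.0
               for core, app in enumerate(schedule))


# One score per static schedule: 4! = 24 entries for 4 applications.
scores = {"".join(s): weighted_speedup(s) for s in permutations("ABCD")}
best, worst, average = max(scores.values()), min(scores.values()), mean(scores.values())
```

The gap between the best and worst schedule is why the charts report Max, Min and Mean rather than a single number per design.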
[Figure: weighted speedup for Homogeneous, Heterogeneous Caches, Heterogeneous Cores and Heterogeneous Both designs]
o Smaller cores hurting more than the larger cores helping
o Heterogeneous caches better than heterogeneous cores in this case
[Figure: 9-core designs: Homogeneous, Heterogeneous Caches, Heterogeneous Cores and Heterogeneous Both. Labels: > 350,000 ReSHAPE sims; chart represents > 10 million ReSHAPE sims]
o As core count scales (4->9) benefit of heterogeneity increases significantly
o Heterogeneous cores better than heterogeneous caches in this case; but schedule still crucial
[Figure: 9-core designs with 3 core/cache types: Homogeneous, Heterogeneous Caches, Heterogeneous Cores and Heterogeneous Both]
o Three core types and three cache sizes do not buy any more performance
o How much heterogeneity to use, and in what form, needs careful analysis depending on the design being evaluated
Best per-core setting (c0 to c3) for each workload mix under three optimization targets
Configuration: 4 cores with 32K L1 caches, shared 2MB L2, 10Gb/s memory bandwidth
Legend: 1 = 250MHz, 0.5W; 2 = 1GHz, 2W; 3 = 4GHz, 16W
       Weighted Speedup   Perf/Watt     1/(Energy*Delay)
Mix    c0 c1 c2 c3        c0 c1 c2 c3   c0 c1 c2 c3
m00    3  3  1  1         1  1  1  1    3  3  1  1
m01    3  3  3  1         1  1  1  1    3  3  3  1
m02    1  3  3  3         1  1  1  1    1  3  3  3
m03    1  1  1  3         1  1  1  2    1  1  1  3
m04    1  3  3  3         1  1  1  1    1  1  1  1
m05    1  3  3  1         1  1  1  1    1  1  1  1
m06    1  1  1  3         1  1  1  2    1  1  1  3
m07    1  3  1  1         1  1  1  1    1  1  1  1
m08    3  1  1  1         1  1  1  1    1  1  1  1
m09    3  3  1  1         2  1  1  1    3  3  1  1
m10    1  1  3  3         1  1  2  1    1  1  3  3
m11    1  3  3  1         1  1  1  1    1  1  1  1
o Different settings for different workload mixes; and not always the fastest setting!
o Not always the slowest setting when optimizing performance/watt
o Somewhere in between when optimizing Energy x Delay product
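A search over per-core settings like the one tabulated above can be sketched as an exhaustive sweep: three settings on four cores gives 3^4 = 81 configurations. The frequency and power pairs come from the slide's legend. Everything else is an assumption for illustration: the per-core performance model (a frequency-dependent compute term plus a fixed memory stall), the use of raw throughput in place of weighted speedup, and the function names.

```python
from itertools import product

# Setting id -> (frequency in Hz, power in W), from the slide's legend.
SETTINGS = {1: (250e6, 0.5), 2: (1e9, 2.0), 3: (4e9, 16.0)}


def core_stats(setting, base_ipc, mem_stall_s):
    """Return (instructions/second, watts) for one core under a given setting."""
    freq, watts = SETTINGS[setting]
    # Assumption: memory stall time per instruction does not scale with frequency.
    sec_per_inst = 1.0 / (freq * base_ipc) + mem_stall_s
    return 1.0 / sec_per_inst, watts


def evaluate(config, apps):
    """Compute the three metrics for one per-core configuration."""
    ips, watts = zip(*(core_stats(s, *a) for s, a in zip(config, apps)))
    throughput, power = sum(ips), sum(watts)
    delay = 1.0 / throughput  # chip-wide seconds per instruction
    energy = power * delay    # joules per instruction
    return {"throughput": throughput,
            "perf_per_watt": throughput / power,
            "inv_energy_delay": 1.0 / (energy * delay)}


def best_configs(apps):
    """Exhaustively search all settings and return the best config per metric."""
    configs = list(product(SETTINGS, repeat=len(apps)))
    results = {c: evaluate(c, apps) for c in configs}
    return {m: max(configs, key=lambda c: results[c][m])
            for m in ("throughput", "perf_per_watt", "inv_energy_delay")}


# Four hypothetical memory-bound applications: (base IPC, memory stall s/inst).
best = best_configs([(1.0, 10e-9)] * 4)
```

With a model that evaluates in well under a second, this kind of brute-force sweep per workload mix stays practical. A detailed simulator would make it prohibitively slow.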
CONCLUSIONS + FUTURE DIRECTION
Rich design/configuration space for multi-core chips
Analytical modeling can be a promising approach to tackling these large search spaces
ReSHAPE extends this classical analytical performance model in novel ways
Accuracy + speed make ReSHAPE a useful tool for early exploration
Future direction – extend ReSHAPE
Validate across unique microarchitectures
Extend key parameters and model: memory level parallelism, writeback traffic, prefetching
Evaluate more use cases
o best power-gating strategy based on workload mix
o dynamic schedules based on per-phase application statistics
Explore the rich constrained optimization problem of cache partitioning
Thank you!
RELATED WORK
Analytical Modeling of multi-core chips
o Wentzlaff et al. (MIT Tech Report 2010), Li et al. (ISPASS 2005), Yavits et al. (CAL 2013) all tackle different aspects of multicore chip design, but only consider homogeneous cores.
o Wu et al. (ISCA 2013) use locality profiles to identify how the application’s cache locality degrades as the application is spread across more threads; they consider multi-threaded applications.
Several works related to heterogeneous design/scheduling
o Navada et al. (PACT 2010, PACT 2013) consider simulation based, criticality driven, design space exploration and mechanisms for selecting the best way to schedule a single application across multiple cores.
o Kumar et al. (Micro 2003, PACT 2006, ISCA 2004) did most of the seminal work in the area of heterogeneous multi-core. However, they have typically relied on detailed simulations, private cache hierarchies and single application scheduling.
Configuration: 1 core with 32K L1 caches, 256KB L2, 10Gb/s memory bandwidth
Average miss rate projection error: 7.6% (std. dev. = 12.4%)
[Figure: Misses Per Access (0 to 1) under Simics and under ReSHAPE for astar, lbm, tonto, gcc, perl, bzip, lib, zeusmp, gromacs, sjeng, h264, soplex, hmmer and sphinx, plus a "Miss Rate Error" scatter plot of ReSHAPE versus Simics against the Ideal line]