Multicore Scaling:
The ECM Model
Single-core performance prediction
The saturation point
Three stencil code examples
Assumptions and shortcomings of the roofline model
Assumes one of two bottlenecks
1. In-core execution
2. Bandwidth of a single hierarchy level
Latency effects are not modeled → pure data streaming assumed
In-core execution is sometimes hard to model
Saturation effects in multicore chips are not explained
ECM model gives more insight
A(:)=B(:)+C(:)*D(:)
Roofline predicts full socket BW
(c) RRZE 2014 ECM Model 2
The Execution-Cache-Memory (ECM) model
G. Hager, J. Treibig, J. Habich, and G. Wellein: Exploring performance and power properties of modern multicore chips via simple machine models. Concurrency and Computation: Practice and Experience, DOI: 10.1002/cpe.3180 (2013). Preprint: arXiv:1208.2908
ECM Model
ECM = “Execution-Cache-Memory”
Assumptions:
Single-core execution time is composed of
1. In-core execution
2. Data transfers in the memory hierarchy
Data transfers may or may not overlap with each other or with in-core execution
Scaling is linear until the relevant bottleneck is reached
Input:
Same as for Roofline
+ data transfer times in hierarchy
Example: Schönauer Vector Triad in L2 cache
REPEAT[ A(:) = B(:) + C(:) * D(:)] @ double precision
Analysis for Sandy Bridge core w/ AVX (unit of work: 1 cache line)
Machine characteristics:
Registers ↔ L1: 1 LD/cy + 0.5 ST/cy
L1 ↔ L2: 32 B/cy (2 cy/CL)
Arithmetic: 1 ADD/cy + 1 MULT/cy
Triad analysis (per CL):
Registers ↔ L1: 6 cy/CL (AVX loads, with stores at half rate)
L1 ↔ L2: 10 cy/CL (loads of B, C, D; write-allocate (WA) and store of A)
Arithmetic (AVX): 2 cy/CL (ADD, MULT)
Work: 16 F/CL (AVX)
Roofline prediction: 16/10 F/cy
Measurement: 16 F / ≈17 cy
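The per-cache-line cycle counts of the triad analysis can be re-derived from the machine characteristics. The following is a minimal sketch, not the authors' tooling; it assumes 64-byte cache lines, 32-byte AVX loads and stores, and five cache-line transfers between L1 and L2 (three loads, one write-allocate, one store). All names are illustrative.

```python
CL_BYTES = 64    # cache line size in bytes (assumption)
AVX_BYTES = 32   # width of one AVX load/store (4 doubles)
DP_BYTES = 8     # size of one double-precision element


def triad_per_cacheline():
    """Per-CL cycle counts for A(:) = B(:) + C(:) * D(:) with AVX."""
    vec_per_cl = CL_BYTES // AVX_BYTES      # 2 AVX instructions per CL
    loads = 3 * vec_per_cl                  # B, C, D are loaded
    stores = 1 * vec_per_cl                 # A is stored
    adds = mults = vec_per_cl

    # Registers <-> L1: 1 LD/cy and 0.5 ST/cy, issued concurrently
    t_l1 = max(loads / 1.0, stores / 0.5)
    # Arithmetic: 1 ADD/cy and 1 MULT/cy on separate ports
    t_arith = max(adds / 1.0, mults / 1.0)
    # L1 <-> L2: 3 loaded CLs + write-allocate + store, at 32 B/cy
    cl_transfers = 3 + 1 + 1
    t_l2 = cl_transfers * CL_BYTES / 32.0
    # One ADD and one MULT per element
    flops = 2 * (CL_BYTES // DP_BYTES)
    return {"t_l1": t_l1, "t_l2": t_l2, "t_arith": t_arith, "flops": flops}


result = triad_per_cacheline()
# Roofline with the L2 bandwidth as the only bottleneck
roofline_flops_per_cy = result["flops"] / result["t_l2"]
print(result, roofline_flops_per_cy)
```

The sketch yields 6 cy (L1), 10 cy (L2), 2 cy (arithmetic) and 16 F per cache line, matching the numbers on the slide.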
Example: ECM model for Schönauer Vector Triad A(:)=B(:)+C(:)*D(:) on a Sandy Bridge Core with AVX
Figure legend: CL transfer; write-allocate CL transfer
Full vs. partial vs. no overlap
Results suggest no overlap!
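To see why the measurement points to no overlap, the two limiting cases can be compared with the measured ≈17 cy per cache line. This is a minimal sketch under stated assumptions: the in-core time is taken as 6 cy (LD/ST bound), the L2 transfer time as 10 cy, "no overlap" is modeled as a sum and "full overlap" as a maximum; partial overlap is not modeled. The function name `ecm_cycles` is hypothetical.

```python
def ecm_cycles(t_core, t_transfers, overlap):
    """Predicted cycles per unit of work for a given overlap assumption."""
    if overlap == "none":
        # All contributions are serialized
        return t_core + sum(t_transfers)
    if overlap == "full":
        # The slowest contribution hides all others
        return max(t_core, *t_transfers)
    raise ValueError("overlap must be 'none' or 'full'")


T_CORE = 6          # cy/CL, LD/ST bound in-core time of the triad
T_L2 = [10]         # cy/CL, transfers between L2 and L1
MEASURED = 17       # cy/CL, approximate measurement from the slide

predictions = {mode: ecm_cycles(T_CORE, T_L2, mode) for mode in ("none", "full")}
best = min(predictions, key=lambda mode: abs(predictions[mode] - MEASURED))
print(predictions, "closest to measurement:", best)
```

The no-overlap prediction (16 cy) is within one cycle of the measurement, whereas full overlap (10 cy) is far too optimistic.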
Multicore scaling in the ECM model
Identify relevant bandwidth bottlenecks
L3 cache
Memory interface
Scale single-thread performance until first bottleneck is hit:
$P(t) = \min(t\,P_0,\ P_\mathrm{roof})$, with $P_\mathrm{roof} = \min(P_\mathrm{max},\ I \cdot b_S)$
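The scaling rule translates directly into code. Below is a minimal sketch; the function names and the example numbers are hypothetical and chosen only for illustration, and the saturation point is taken as the smallest core count at which t·P0 reaches the roof.

```python
import math


def p_roof(p_max, intensity, b_s):
    """Roof: minimum of peak performance and intensity times bandwidth."""
    return min(p_max, intensity * b_s)


def scaled_performance(t, p0, roof):
    """P(t) = min(t * P0, P_roof) for t cores."""
    return min(t * p0, roof)


def saturation_point(p0, roof):
    """Smallest number of cores at which the roof is reached."""
    return math.ceil(roof / p0)


# Hypothetical example: P0 = 1.5 GF/s per core, P_max = 20 GF/s,
# I = 0.125 F/B, b_S = 40 GB/s  ->  roof = 5 GF/s
roof = p_roof(20.0, 0.125, 40.0)
n_sat = saturation_point(1.5, roof)
curve = [scaled_performance(t, 1.5, roof) for t in range(1, 9)]
print(roof, n_sat, curve)
```

With these made-up inputs the curve rises linearly for three cores and is flat from the fourth core on.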
Example: Scalable L3 on Sandy Bridge
ECM prediction vs. measurements for A(:)=B(:)+C(:)*D(:)
on a Sandy Bridge socket (no-overlap assumption)
Model: Scales until saturation sets in
Saturation point (# cores) well predicted
Measurement: scaling not perfect
Caveat: This is specific for this architecture and this benchmark!
Check: Use “overlappable” kernel code
ECM prediction vs. measurements for A(:)=B(:)+C(:)/D(:)
on a Sandy Bridge socket (full overlap assumption)
In-core execution is dominated by the divide operation (44 cycles with AVX, 22 scalar)
Almost perfect agreement with ECM model
General observation:
If the L1 cache is 100% occupied by LD/ST, there is no overlap throughout the hierarchy
If there is “slack” at the L1, there is overlap in the hierarchy
Performance Modeling of Stencil Codes
Applying the ECM model to stencil updates:
- 2D Jacobi smoother
- Long-range stencil
(H. Stengel, RRZE)
Example 1: 2D Jacobi in DP with SSE2 on Sandy Bridge
Instruction count:
- 13 LOAD
- 4 STORE
- 12 ADD
- 4 MUL
4-way unrolling → 8 LUP / iteration
Code characteristics (SSE instructions per iteration):
- 13 LOAD
- 4 STORE
- 12 ADD
- 4 MUL
Processor characteristics (SSE instructions per cycle):
- 2 LOAD || (1 LOAD + 1 STORE)
- 1 ADD
- 1 MUL
Port schedule (figure): 4 × (LD + ST), 4 × 2 LD, 1 × LD; 12 ADD; 4 MUL
Core execution: 12 cy
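The 12 cy core time follows from dividing the instruction counts by the per-cycle throughput of each port group and taking the maximum. A minimal sketch, assuming perfect scheduling and no other in-core bottleneck; the function name and dictionary keys are illustrative.

```python
def port_cycles(n_load, n_store, n_add, n_mul):
    """Cycles each port group is busy per loop iteration."""
    # Two LD/ST slots per cycle, but at most one STORE per cycle:
    # 2 LOAD || (1 LOAD + 1 STORE)
    ldst = max((n_load + n_store) / 2.0, float(n_store))
    return {
        "ldst": ldst,
        "add": n_add / 1.0,   # 1 ADD per cycle
        "mul": n_mul / 1.0,   # 1 MUL per cycle
    }


# 2D Jacobi, SSE2, 4-way unrolled: 13 LOAD, 4 STORE, 12 ADD, 4 MUL
cycles = port_cycles(13, 4, 12, 4)
t_core = max(cycles.values())
bottleneck = max(cycles, key=cycles.get)
print(cycles, t_core, bottleneck)
```

The ADD port is the in-core bottleneck (12 cy), while loads and stores keep the L1 busy for only 8.5 cy of each iteration.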
Situation 1: Data set fits into L1 cache
ECM prediction:
(8 LUP / 12 cy) * 3.5 GHz = 2.3 GLUP/s
Measurement: 2.2 GLUP/s
Situation 2: Data set fits into L2 cache (not into L1)
3 additional transfer streams from L2 to L1 (t0, RFO, t1) → data delay of 6 cy
ECM prediction (no overlap):
(8 LUP / (12+6) cy) * 3.5 GHz = 1.6 GLUP/s
Measurement: 1.9 GLUP/s
Overlap?
L1 is “single ported” → no overlap during LD/ST (8.5 cy)
Data delay: 6 cycles
ECM prediction w/ overlap:
(8 LUP / (8.5+6) cy) * 3.5 GHz = 1.9 GLUP/s
Measurement: 1.9 GLUP/s
“If the model fails, we learn something”
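The three Jacobi predictions can be reproduced with a few lines. This is a sketch under stated assumptions: 12 cy core time, 8.5 cy of LD/ST activity at the L1, 6 cy data delay for the L2 case, and, for the overlap variant, the rule that only the LD/ST cycles serialize with the data transfers while the remaining in-core work overlaps. The helper `glups` is a hypothetical name.

```python
CLOCK_GHZ = 3.5   # clock frequency used on the slides
LUP = 8           # lattice updates per loop iteration
T_CORE = 12.0     # cy, in-core execution (ADD bound)
T_LDST = 8.5      # cy, LD/ST activity that blocks the L1
T_L2 = 6.0        # cy, data delay for three streams from L2


def glups(lup, cycles, clock_ghz):
    """Convert cycles per iteration into GLUP/s."""
    return lup / cycles * clock_ghz


predictions = {
    # Data in L1: only the core time counts
    "L1": glups(LUP, T_CORE, CLOCK_GHZ),
    # Data in L2, no overlap: core time plus data delay
    "L2_no_overlap": glups(LUP, T_CORE + T_L2, CLOCK_GHZ),
    # Data in L2, overlap: LD/ST plus data delay, other work hidden
    "L2_overlap": glups(LUP, max(T_CORE, T_LDST + T_L2), CLOCK_GHZ),
}
for name, value in predictions.items():
    print(f"{name}: {value:.2f} GLUP/s")
```

Only the overlap variant reproduces the measured 1.9 GLUP/s for the L2 case, which is what the failed no-overlap prediction teaches about this kernel.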
Example 2: 3D long-range stencil in single precision with AVX on Sandy Bridge
4 neighbors per direction
Operations per update
27 LOAD (25 V, 1 ROC, 1 U)
1 STORE (U)
26 ADD
15 MUL
Core time (IACA)
Interlude: Intel Architecture Code Analyzer (IACA)
Performs architecture-specific code analysis
Prerequisite: Mark start and end of dominant work loop
In high-level code (documented)
In assembly code (see iacaMarks.h)
Does not influence code optimization (e.g. vectorization)
Assembly loop might perform multiple updates per iteration (unrolling, SIMD)
Important reports (throughput mode):
Block throughput: runtime of one loop iteration (→ core time)
Throughput bottleneck: limiting resource for code execution
Port pressure: dominant pipeline port
IACA example output
AVX vectorization, no unrolling: One iteration updates 8 SP (float) elements
Core time (IACA):
34.25 cy / 8 LUP (SP) → 69 cy / CL (16 LUP)
LOAD dominated
Data delay
Spatial blocking for minimal traffic between L3 cache and memory
8 additional streams for smaller caches
Single-core performance
2.7GHz / (134cy / 16LUP) = 322MLUP/s
Measurement: 320MLUP/s
LOAD dominated → no overlap expected
Socket scaling (8 cores)
8*322MLUP/s = 2576MLUP/s
Limit: 40GB/s / 16B/LUP = 2500MLUP/s
Saturation at 8 cores
Measurement: ~2200MLUP/s (88% max.)
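The single-core and socket numbers above can be checked with a short calculation. A minimal sketch, assuming that the 134 cy are the no-overlap sum of the core time (69 cy) and the transfer times shown in the figure (24 cy, 24 cy, 17 cy), and that the socket scales as min(n·P0, bandwidth limit). Function names are illustrative.

```python
import math

CLOCK_HZ = 2.7e9            # clock frequency
LUP_PER_CL = 16             # single-precision updates per 64-byte CL
T_CORE = 69                 # cy/CL, in-core time from IACA
T_TRANSFERS = (24, 24, 17)  # cy/CL, data transfers (assumed breakdown)
MEM_BW = 40e9               # B/s, memory bandwidth of the socket
BYTES_PER_LUP = 16          # memory traffic per update with blocking


def single_core_mlups():
    """No-overlap ECM prediction for one core in MLUP/s."""
    t_ecm = T_CORE + sum(T_TRANSFERS)
    return CLOCK_HZ / (t_ecm / LUP_PER_CL) / 1e6


def bandwidth_limit_mlups():
    """Memory bandwidth ceiling in MLUP/s."""
    return MEM_BW / BYTES_PER_LUP / 1e6


def socket_mlups(n_cores):
    """Linear scaling until the bandwidth ceiling is hit."""
    return min(n_cores * single_core_mlups(), bandwidth_limit_mlups())


p0 = single_core_mlups()
limit = bandwidth_limit_mlups()
n_sat = math.ceil(limit / p0)
print(f"P0 = {p0:.0f} MLUP/s, limit = {limit:.0f} MLUP/s, saturation at {n_sat} cores")
```

Seven cores stay below the ceiling and the eighth reaches it, so the model places saturation exactly at the full socket, which is why the measured 2200 MLUP/s falls somewhat short of the limit.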
Optimization possibilities
Data transfer well optimized (blocking)
Reduce core time (LD); limit: ADD (2*26cy)
Possible speedup: 69cy-52cy=17cy (~15%)
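The quoted ~15% follows from shrinking only the core contribution of the single-core runtime. A minimal sketch, assuming the no-overlap total of 134 cy per cache line so that the data transfer part is 134 − 69 = 65 cy; the function name is hypothetical.

```python
def core_time_speedup(t_core_old, t_core_new, t_data):
    """Single-core speedup when only the core time is reduced (no overlap)."""
    return (t_core_old + t_data) / (t_core_new + t_data)


T_TOTAL = 134            # cy/CL, current single-core ECM time
T_CORE_OLD = 69          # cy/CL, LOAD-dominated core time
T_CORE_NEW = 2 * 26      # cy/CL, ADD limit (26 ADDs per 8 LUP, two iterations per CL)
T_DATA = T_TOTAL - T_CORE_OLD

speedup = core_time_speedup(T_CORE_OLD, T_CORE_NEW, T_DATA)
print(f"saved cycles: {T_CORE_OLD - T_CORE_NEW}, speedup: {speedup:.3f}")
```

This gives 17 saved cycles and a single-core speedup of about 1.145.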
Figure: ECM diagram annotated with per-cycle transfer widths and per-cache-line update times; extracted labels: 69cy, 24cy, 24cy, 24cy, 17cy
Socket scaling
memory bandwidth limit
Example 3: 3D red-black tetrahedral smoother in DP with AVX on Sandy Bridge
“Terra-Neo” DFG project
Stencil inner loop code and pattern
Notes
Stride-two access pattern on the central line (red-black pattern)
15-point stencil
Compiler “vectorizes” inner loop, but loads and stores are scalar
ECM for Terra-Neo kernel
measured manually, optimization target!
8 LUP (1 CL):
layer condition:
Socket scaling of Terra-Neo kernel
no overlap
max. overlap
“some” overlap
What we have learned
Code does not saturate the relevant bottleneck
We have to improve single-core performance
Core analysis: Bottleneck is register-register moves
Reason: Vectorized arithmetic, but scalar loads
Optimization goal: Truly vectorized kernel with SIMD loads
Options:
SIMD-friendly data layout (red-black separation) → # iterations?
Different smoother algorithm (e.g. Jacobi) on central line → convergence?
ECM model: Conclusions & outlook
Saturation effects are ubiquitous; understanding them gives us the opportunity to
Find out about optimization opportunities
Save energy by letting cores idle → see power model later on
Put idle cores to better use → see spMVM case study
Simple models work best. Do not try to complicate things unless it is really necessary!
Possible extensions to the ECM model
Accommodate latency effects
Model simple “architectural hazards”