Autotuning Sparse Matrix and Structured Grid Kernels
Samuel Williams1,2, Richard Vuduc3, Leonid Oliker1,2, John Shalf2, Katherine Yelick1,2, James Demmel1,2, Jonathan Carter2, David Patterson1,2
1 University of California, Berkeley   2 Lawrence Berkeley National Laboratory   3 Georgia Institute of Technology
Overview
Multicore is the de facto performance solution for the next decade.
Examined the Sparse Matrix-Vector Multiplication (SpMV) kernel: an important, common, memory-intensive HPC kernel. Present 2 autotuned threaded implementations and compare with the leading MPI implementation (PETSc) using an autotuned serial kernel (OSKI).
Examined the Lattice-Boltzmann Magneto-Hydrodynamics (LBMHD) application: a memory-intensive HPC application (structured grid). Present 2 autotuned threaded implementations.
Benchmarked performance across 4 diverse multicore architectures: Intel Xeon (Clovertown), AMD Opteron, Sun Niagara2 (Huron), IBM QS20 Cell Blade.
We show that Cell consistently delivers good performance and efficiency, and that Niagara2 delivers good performance and productivity.
Autotuning
Outline: Autotuning, Multicore SMPs, HPC Kernels, SpMV, LBMHD, Summary
Hand-optimizing each architecture/dataset combination is not feasible.
Autotuning finds a good performance solution by heuristics or exhaustive search: a Perl script generates many possible kernels (including SSE-optimized variants), and an autotuning benchmark runs them and reports back the best one for the current architecture/dataset/compiler/…
Performance depends on the optimizations generated; heuristics are often desirable when the search space isn't tractable.
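As a concrete illustration of the search half of such an autotuner, the sketch below (in C) times each generated kernel variant and keeps the fastest. It is only a sketch under assumed names (candidate_kernels, wall_seconds); the actual harness in this work is driven by the Perl code generator.

    /* Hypothetical autotuning search: benchmark every generated SpMV variant
     * and report the best one for this architecture/dataset/compiler. */
    #include <float.h>

    typedef void (*spmv_kernel_t)(int rows, const int *ptr, const int *ind,
                                  const double *val, const double *x, double *y);

    extern spmv_kernel_t candidate_kernels[];  /* emitted by the code generator */
    extern int           num_candidates;
    extern double        wall_seconds(void);   /* assumed platform timer */

    int pick_best_kernel(int rows, const int *ptr, const int *ind,
                         const double *val, const double *x, double *y)
    {
        int best = 0;
        double best_time = DBL_MAX;
        for (int k = 0; k < num_candidates; k++) {
            double t0 = wall_seconds();
            for (int trial = 0; trial < 10; trial++)    /* amortize timer noise */
                candidate_kernels[k](rows, ptr, ind, val, x, y);
            double t = (wall_seconds() - t0) / 10.0;
            if (t < best_time) { best_time = t; best = k; }
        }
        return best;  /* index of the fastest variant on this machine/matrix */
    }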
Multicore SMPs used
Multicore SMP Systems
[Block diagrams of the four systems:
Intel Clovertown: 2 sockets x 4 Core2 cores; each pair of cores shares a 4MB L2; two front-side buses (10.6 GB/s each) feed a chipset with 4x64b memory controllers; 667MHz FBDIMMs, 21.3 GB/s read / 10.6 GB/s write.
AMD Opteron: 2 sockets x 2 cores; 1MB victim cache per core; SRI/crossbar and HyperTransport (4 GB/s each direction) between sockets; one 128b memory controller per socket; 667MHz DDR2 DIMMs, 10.66 GB/s per socket.
Sun Niagara2 (Huron): 8 multithreaded SPARC cores with 8K L1 each; crossbar switch; 4MB shared L2 (16-way, address interleaving via 8x64B banks, 179 GB/s fill / 90 GB/s writethru); 4x128b memory controllers (2 banks each); 667MHz FBDIMMs, 42.66 GB/s read / 21.33 GB/s write.
IBM QS20 Cell Blade: 2 Cell processors, each with a PPE (512KB L2) and 8 SPEs (256K local store plus MFC each) on an EIB ring network; 512MB of XDR DRAM per processor at 25.6 GB/s; BIF link between the two chips, <20 GB/s each direction.]
Multicore SMP Systems (memory hierarchy)
[The same four block diagrams, with the conventional cache-based memory hierarchies highlighted.]
Multicore SMP Systems (memory hierarchy)
[The same diagrams, contrasting the conventional cache-based memory hierarchies with the disjoint local-store memory hierarchy of the Cell SPEs.]
Multicore SMP Systems (memory hierarchy)
[The same diagrams, labeled by programming model: cache + Pthreads implementations on the cache-based machines vs. local store + libspe implementations on the Cell SPEs.]
Multicore SMP Systems (peak flops)
[The same diagrams, annotated with peak double-precision rates: Clovertown 75 Gflop/s (TLP + ILP + SIMD); Opteron 17 Gflop/s (TLP + ILP); Niagara2 11 Gflop/s (TLP); Cell Blade PPEs 13 Gflop/s (TLP + ILP) and SPEs 29 Gflop/s (TLP + ILP + SIMD).]
Multicore SMP Systems (peak DRAM bandwidth)
[The same diagrams, annotated with peak DRAM bandwidth: Clovertown 21 GB/s read / 10 GB/s write; Opteron 21 GB/s; Niagara2 42 GB/s read / 21 GB/s write; Cell Blade 51 GB/s.]
Multicore SMP Systems
[The same diagrams, annotated by memory model: uniform memory access (Clovertown, Niagara2) vs. non-uniform memory access (Opteron, Cell Blade).]
HPC Kernels
Arithmetic Intensity
Arithmetic intensity ~ total compulsory flops / total compulsory bytes.
Many HPC kernels have an arithmetic intensity that scales with problem size (increasing temporal locality), but there are many important and interesting kernels that don't.
Low arithmetic intensity kernels are likely to be memory bound; high arithmetic intensity kernels are likely to be processor bound.
The metric ignores memory addressing complexity.
[Arithmetic intensity spectrum: O(1) - SpMV, BLAS1/2, stencils (PDEs), lattice methods; O(log N) - FFTs; O(N) - dense linear algebra (BLAS3), particle methods.]
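As a rough worked example (assuming CSR storage with 8-byte values and 4-byte column indices, and ignoring vector and row-pointer traffic): SpMV performs 2 flops per nonzero while moving at least 12 bytes of matrix data, so its arithmetic intensity is about 2/12 ≈ 0.17 flops/byte (roughly 6 bytes/flop) regardless of matrix size, which is why it sits at the O(1), memory-bound end of this spectrum.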
Arithmetic Intensity (continued)
[The same spectrum, annotated: the high-intensity end is a good match for Clovertown, eDP Cell, …; the low-intensity end is a good match for Cell and Niagara2.]
Sparse Matrix-Vector Multiplication (SpMV)
Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007.
Sparse Matrix-Vector Multiplication
Sparse matrix: most entries are 0.0. There is a performance advantage in only storing and operating on the nonzeros, but it requires significant metadata.
Evaluate y = Ax, where A is a sparse matrix and x and y are dense vectors.
Challenges: difficult to exploit ILP (bad for superscalar) and DLP (bad for SIMD); irregular memory access to the source vector; difficult to load balance; very low computational intensity (often >6 bytes/flop) = likely memory bound.
Dataset (Matrices)
Pruned the original SPARSITY suite down to 14 matrices; none should fit in cache. Rank ranges from 2K to 1M.
Subdivided them into 4 categories:
Dense: a 2K x 2K dense matrix stored in sparse format.
Well structured (sorted by nonzeros/row): Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship.
Poorly structured (hodgepodge): Economics, Epidemiology, FEM/Accelerator, Circuit, webbase.
Extreme aspect ratio (linear programming): LP.
Naïve Serial Implementation
Vanilla C implementation; matrix stored in CSR (compressed sparse row).
Explored compiler options; only the best is presented here.
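For reference, the naïve CSR kernel has roughly the following shape (a sketch, not necessarily the exact code benchmarked here):

    /* y = A*x with A in compressed sparse row (CSR) format.
     * rowptr has rows+1 entries; colind and val each have nnz entries. */
    void spmv_csr(int rows, const int *rowptr, const int *colind,
                  const double *val, const double *x, double *y)
    {
        for (int r = 0; r < rows; r++) {
            double sum = 0.0;
            for (int k = rowptr[r]; k < rowptr[r + 1]; k++)
                sum += val[k] * x[colind[k]];   /* irregular access to x */
            y[r] = sum;
        }
    }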
x86 core delivers > 10x the performance of a Niagara2 thread.
[Charts: naïve serial GFlop/s for each matrix (Dense, Protein, FEM-Sphr, FEM-Cant, Tunnel, FEM-Harbor, QCD, FEM-Ship, Econ, Epidem, FEM-Accel, Circuit, Webbase, LP, and the median) on Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), and the IBM Cell Blade (PPE); y-axis 0 to 0.6 GFlop/s.]
Naïve Parallel Implementation
SPMD style: partition by rows, load balance by nonzeros.
Niagara2 ~ 2.5x the x86 machines.
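A sketch of the nonzero-balanced row partitioning (each pthread then runs the serial CSR kernel over its own row range; names are illustrative):

    /* Split rows among nthreads so each thread gets ~nnz/nthreads nonzeros. */
    void partition_by_nnz(int rows, const int *rowptr, int nthreads,
                          int *row_start, int *row_end)
    {
        long long nnz = rowptr[rows];
        int r = 0;
        for (int t = 0; t < nthreads; t++) {
            long long target = (nnz * (t + 1)) / nthreads;
            row_start[t] = r;
            while (r < rows && rowptr[r + 1] <= target)
                r++;
            row_end[t] = r;              /* thread t owns rows [start, end) */
        }
        row_end[nthreads - 1] = rows;    /* last thread covers any tail rows */
    }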
[Charts: GFlop/s per matrix on Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), and the IBM Cell Blade (PPEs); bars show Naïve (serial) and Naïve Pthreads; y-axis 0 to 3.0 GFlop/s.]
Naïve Parallel Implementation (scaling)
[The same charts, annotated with speedup over the naïve serial version:]
Clovertown: 8x cores = 1.9x performance
Opteron: 4x cores = 1.5x performance
Niagara2: 64x threads = 41x performance
Cell (PPEs): 4x threads = 3.4x performance
Naïve Parallel Implementation (fraction of peak)
[The same charts, annotated with sustained fractions of peak:]
Clovertown: 1.4% of peak flops, 29% of bandwidth
Opteron: 4% of peak flops, 20% of bandwidth
Niagara2: 25% of peak flops, 39% of bandwidth
Cell (PPEs): 2.7% of peak flops, 4% of bandwidth
Autotuned Performance (+NUMA & SW Prefetching)
Use first touch or libnuma to exploit NUMA; also includes process affinity.
Tag prefetches with temporal-locality hints.
Autotune: search for the optimal prefetch distances.
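The flavor of prefetching the generated kernels use, sketched with SSE intrinsics: the prefetch distance (PF_DIST here) is what the search tunes, and the choice of _MM_HINT_NTA vs. _MM_HINT_T0 is the temporal-locality tag. A sketch only, not the study's actual generated code.

    #include <xmmintrin.h>   /* _mm_prefetch */

    #define PF_DIST 64       /* prefetch distance in elements; set by the autotuner */

    void spmv_csr_pf(int rows, const int *rowptr, const int *colind,
                     const double *val, const double *x, double *y)
    {
        for (int r = 0; r < rows; r++) {
            double sum = 0.0;
            for (int k = rowptr[r]; k < rowptr[r + 1]; k++) {
                /* matrix data is streamed once, so hint it as non-temporal
                 * (prefetching a little past the end of the arrays is harmless) */
                _mm_prefetch((const char *)&val[k + PF_DIST], _MM_HINT_NTA);
                _mm_prefetch((const char *)&colind[k + PF_DIST], _MM_HINT_NTA);
                sum += val[k] * x[colind[k]];
            }
            y[r] = sum;
        }
    }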
[Charts: GFlop/s per matrix on the four systems; stacked bars show Naïve, Naïve Pthreads, +NUMA/Affinity, and +SW Prefetching.]
Autotuned Performance (+Matrix Compression)
If memory bound, the only hope is minimizing memory traffic.
Heuristically compress the parallelized matrix to minimize it; implemented with SSE.
The benefit of prefetching is hidden by the requirement of register blocking.
Options: register blocking, index size, format, etc.
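One point in that search space, a 2x1 register-blocked kernel, halves the per-nonzero column-index traffic by storing two vertically adjacent values per index (a sketch, assuming the matrix has already been converted to 2x1 blocks; the real tuner also explores other block shapes, 16-bit indices, and blocked-COO formats):

    /* y = A*x with A stored as 2x1 blocks: each stored block holds two
     * vertically adjacent values that share one column index. brows = rows/2. */
    void spmv_bcsr_2x1(int brows, const int *browptr, const int *bcolind,
                       const double *bval, const double *x, double *y)
    {
        for (int br = 0; br < brows; br++) {
            double y0 = 0.0, y1 = 0.0;
            for (int k = browptr[br]; k < browptr[br + 1]; k++) {
                double xv = x[bcolind[k]];
                y0 += bval[2 * k    ] * xv;   /* row 2*br     */
                y1 += bval[2 * k + 1] * xv;   /* row 2*br + 1 */
            }
            y[2 * br]     = y0;
            y[2 * br + 1] = y1;
        }
    }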
[Charts: GFlop/s per matrix on the four systems; bars add +Compression on top of +SW Prefetching, +NUMA/Affinity, Naïve Pthreads, and Naïve; y-axis 0 to 7 GFlop/s.]
Autotuned Performance (+Cache/TLB Blocking)
Reorganize the matrix to maximize locality of source-vector accesses.
[Charts: GFlop/s per matrix on the four systems; bars add +Cache/TLB Blocking.]
Autotuned Performance (+DIMMs, Firmware, Padding)
Clovertown was already fully populated with DIMMs.
Gave the Opteron as many DIMMs as the Clovertown.
Firmware update for Niagara2; array padding to avoid inter-thread conflict misses.
The PPEs use ~1/3 of the Cell chip area.
[Charts: GFlop/s per matrix on the four systems; bars add +More DIMMs (Opteron), firmware fix and array padding (Niagara2), etc.]
Autotuned Performance (+DIMMs, Firmware, Padding; fraction of peak)
[The same charts, annotated with sustained fractions of peak:]
Clovertown: 4% of peak flops, 52% of bandwidth
Opteron: 20% of peak flops, 65% of bandwidth
Niagara2: 54% of peak flops, 57% of bandwidth
Cell (PPEs): 10% of peak flops, 10% of bandwidth
Autotuned Performance (+Cell/SPE version)
Wrote a double-precision Cell/SPE version: DMA, local-store blocked, NUMA aware, etc.
Only 2x1 and larger BCOO blocks; only the SpMV-proper routine changed.
About 12x faster (median) than using the PPEs alone.
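On the SPEs, loads become explicit DMA into the 256K local stores. A minimal double-buffering skeleton using the standard spu_mfcio.h interface might look like the following; the buffer size and names are illustrative, and this is not the actual SpMV kernel:

    #include <spu_mfcio.h>

    #define CHUNK 4096                      /* bytes per DMA (<= 16KB per mfc_get) */
    static char buf[2][CHUNK] __attribute__((aligned(128)));

    /* Stream 'bytes' of matrix data from main memory (effective address 'ea'),
     * overlapping the DMA of one buffer with computation on the other.
     * Assumes 'bytes' is a multiple of CHUNK for brevity. */
    void stream_and_compute(unsigned long long ea, unsigned long long bytes)
    {
        int cur = 0;
        mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);            /* prime first buffer */
        for (unsigned long long off = 0; off < bytes; off += CHUNK) {
            int nxt = cur ^ 1;
            if (off + CHUNK < bytes)                        /* start next DMA early */
                mfc_get(buf[nxt], ea + off + CHUNK, CHUNK, nxt, 0, 0);
            mfc_write_tag_mask(1 << cur);                   /* wait for current DMA */
            mfc_read_tag_status_all();
            /* ... run the SpMV inner kernel over buf[cur] here ... */
            cur = nxt;
        }
    }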
[Charts: GFlop/s per matrix, with the Cell panel now showing the SPE version (IBM Cell Blade, SPEs); y-axis 0 to 12 GFlop/s.]
Autotuned Performance (+Cell/SPE version; fraction of peak)
[The same charts, annotated with sustained fractions of peak:]
Clovertown: 4% of peak flops, 52% of bandwidth
Opteron: 20% of peak flops, 65% of bandwidth
Niagara2: 54% of peak flops, 57% of bandwidth
Cell (SPEs): 40% of peak flops, 92% of bandwidth
Autotuned Performance (how much did double precision and 2x1 blocking hurt?)
Model faster cores by commenting out the inner kernel calls, but still performing all DMAs.
Enabled 1x1 BCOO.
~16% improvement.
[Charts: GFlop/s per matrix; the Cell (SPEs) bars add a "+better Cell implementation" increment.]
MPI vs. Threads
On x86 machines, the autotuned (OSKI) shared-memory MPICH implementation rarely scales beyond 2 threads.
Still debugging MPI issues on Niagara2, but so far it rarely scales beyond 8 threads.
[Legend: Autotuned pthreads, Autotuned MPI, Naïve serial.]
[Charts: GFlop/s per matrix on Intel Clovertown, AMD Opteron, and Sun Niagara2 (Huron).]
Lattice-Boltzmann Magneto-Hydrodynamics (LBMHD)
Preliminary results
Samuel Williams, Jonathan Carter, Leonid Oliker, John Shalf, Katherine Yelick, "Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms", International Parallel & Distributed Processing Symposium (IPDPS) (to appear), 2008.
Lattice Methods
Structured grid code with a series of time steps; popular in CFD.
Allows for complex boundary conditions.
Higher-dimensional phase space: a simplified kinetic model that maintains the macroscopic quantities; distribution functions (e.g., 27 velocities per point in space) are used to reconstruct the macroscopic quantities.
Significant memory capacity requirements.
[Diagram: the 27-velocity phase-space lattice (components 0 to 26) with +X, +Y, +Z axes.]
LBMHD (general characteristics)
Plasma turbulence simulation.
Two distributions: a momentum distribution (27 components) and a magnetic distribution (15 vector components).
Three macroscopic quantities: density, momentum (vector), and magnetic field (vector).
Must read 73 doubles and update (write) 79 doubles per point in space.
Requires about 1300 floating-point operations per point in space; just over 1.0 flops/byte (ideal).
No temporal locality between points in space within one time step.
LBMHD (implementation details)
Data structure choices: an array of structures lacks spatial locality; a structure of arrays has a huge number of memory streams per thread, but vectorizes well.
Parallelization: the Fortran version used MPI to communicate between nodes, a bad match for multicore. This version uses pthreads within a node and MPI between nodes; MPI is not used when autotuning.
Two problem sizes: 64^3 (~330MB) and 128^3 (~2.5GB).
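A sketch of the structure-of-arrays layout implied above (one contiguous array per stored phase-space component; names are illustrative):

    #define MOM_COMPONENTS 27   /* momentum distribution            */
    #define MAG_COMPONENTS 15   /* magnetic distribution components */

    typedef struct {
        int     nx, ny, nz;            /* grid dimensions, e.g. 64^3 or 128^3 */
        double *mom[MOM_COMPONENTS];   /* mom[c][i] : component c at cell i   */
        double *mag[MAG_COMPONENTS];   /* one array per magnetic component    */
        double *rho;                   /* macroscopic density                 */
    } lbmhd_grid_t;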
Pthread Implementation
Not naïve: fully unrolled loops, NUMA-aware, 1D parallelization.
Always used 8 threads per core on Niagara2.
The Cell version was not autotuned.
[Charts: GFlop/s vs. core/thread count for the 64^3 and 128^3 problems on Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), and the IBM Cell Blade; y-axis 0 to 8 GFlop/s.]
Pthread Implementation (fraction of peak)
[The same charts, annotated with sustained fractions of peak:]
Clovertown: 4.8% of peak flops, 16% of bandwidth
Opteron: 14% of peak flops, 17% of bandwidth
Niagara2: 54% of peak flops, 14% of bandwidth
Autotuned Performance (+Stencil-aware Padding)
This lattice method is essentially 79 simultaneous 72-point stencils.
They can cause conflict misses even with highly associative L1 caches (not to mention the Opteron's 2-way).
Solution: pad each component so that, when accessed with the corresponding stencil (spatial) offset, the components are uniformly distributed in the cache.
The Cell version was not autotuned.
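One way to realize such padding, sketched below: give each component array a component-dependent offset so that corresponding grid points of different components land in different cache sets. The granularity here is illustrative; in practice the padding amounts are part of the tuned search space.

    #include <stdlib.h>

    #define CACHE_LINE 64
    #define NCOMP      79    /* simultaneous component grids */

    /* Allocate component c with c*CACHE_LINE bytes of leading pad.
     * (Keep the raw pointer elsewhere if you need to free it later.) */
    double *alloc_padded(size_t cells, int c)
    {
        char *raw = malloc(cells * sizeof(double) + NCOMP * CACHE_LINE);
        return (double *)(raw + (size_t)c * CACHE_LINE);
    }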
[Charts: GFlop/s vs. core/thread count; bars show Naïve+NUMA and +Padding.]
Autotuned Performance (+Vectorization)
Each update requires touching ~150 components, each likely to be on a different page, so TLB misses can significantly impact performance.
Solution: vectorization. Fuse the spatial loops, strip-mine into vectors of size VL, and interchange with the phase-dimension loops.
Autotune: search for the optimal vector length.
Significant benefit on some architectures; becomes irrelevant when bandwidth dominates performance.
The Cell version was not autotuned.
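The loop restructuring, sketched for the momentum distribution: instead of touching all components of one cell before moving on, process VL consecutive cells per component. VL is the tuned vector length and the names are illustrative.

    /* Gather a block of VL cells for every momentum component, then run the
     * collision/reconstruction on that block before advancing. */
    void process_blocks(double *mom[27], double *buf[27], int ncells, int VL)
    {
        for (int base = 0; base < ncells; base += VL) {
            int len = (base + VL < ncells) ? VL : ncells - base;
            for (int c = 0; c < 27; c++)          /* phase-space loop outside */
                for (int i = 0; i < len; i++)     /* VL contiguous cells inside */
                    buf[c][i] = mom[c][base + i];
            /* ... reconstruct macroscopics and apply the collision operator
             *     to the VL-cell block here ... */
        }
    }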
[Charts: GFlop/s vs. core/thread count; bars add +Vectorization.]
Autotuned Performance (+Explicit Unrolling/Reordering)
Give the compilers a helping hand with the complex loops.
Code generator: a Perl script generates all power-of-2 possibilities.
Autotune: search for the best unrolling and expression of data-level parallelism.
This is essential when using SIMD intrinsics.
The Cell version was not autotuned.
[Charts: GFlop/s vs. core/thread count; bars add +Unrolling.]
Autotuned Performance (+Software Prefetching)
Expanded the code generator to insert software prefetches in case the compiler doesn't.
Autotune: no prefetch, prefetch 1 line ahead, or prefetch 1 vector ahead.
Relatively little benefit for relatively little work.
The Cell version was not autotuned.
[Charts: GFlop/s vs. core/thread count; bars add +SW Prefetching.]
Autotuned Performance (+SIMDization, including non-temporal stores)
The compilers (gcc and icc) failed at exploiting SIMD, so the code generator was expanded to use SIMD intrinsics.
Explicit unrolling/reordering was extremely valuable here.
Exploited movntpd to minimize memory traffic (the only hope if memory bound).
Significant benefit for significant work.
The Cell version was not autotuned.
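The SSE2 idiom involved, shown on a deliberately simplified update rather than the actual LBMHD kernel: cache-bypassing movntpd stores, issued via _mm_stream_pd, write results to DRAM without first reading the destination lines into the cache.

    #include <emmintrin.h>   /* SSE2: __m128d, _mm_stream_pd */

    /* c[i] = a[i]*s + b[i], written with non-temporal (movntpd) stores.
     * Assumes n is even and all pointers are 16-byte aligned. */
    void scaled_add_stream(double *c, const double *a, const double *b,
                           double s, int n)
    {
        __m128d vs = _mm_set1_pd(s);
        for (int i = 0; i < n; i += 2) {
            __m128d va = _mm_load_pd(&a[i]);
            __m128d vb = _mm_load_pd(&b[i]);
            _mm_stream_pd(&c[i], _mm_add_pd(_mm_mul_pd(va, vs), vb));
        }
        _mm_sfence();   /* make the streamed stores globally visible */
    }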
[Charts: GFlop/s vs. core/thread count; bars add +SIMDization.]
Autotuned Performance (Cell/SPE version)
First attempt at a Cell implementation: VL, unrolling, and reordering fixed.
Exploits DMA and double buffering to load vectors; goes straight to SIMD intrinsics.
Despite the relative performance, Cell's double-precision implementation severely impairs performance.
[Charts: GFlop/s vs. core/thread count, now including the IBM Cell Blade (collision() only); y-axis 0 to 18 GFlop/s.]
Autotuned Performance (Cell/SPE version; fraction of peak)
[The same charts, annotated with sustained fractions of peak:]
Clovertown: 7.5% of peak flops, 17% of bandwidth
Opteron: 42% of peak flops, 35% of bandwidth
Niagara2: 59% of peak flops, 15% of bandwidth
Cell Blade (collision() only): 57% of peak flops, 33% of bandwidth
Summary
Aggregate Performance (Fully optimized)
Cell consistently delivers the best full-system performance; Niagara2 delivers comparable per-socket performance.
The dual-core Opteron delivers far better performance (bandwidth) than Clovertown, but as the flop:byte ratio increases its performance advantage decreases.
Huron has far more bandwidth than it can exploit (too much latency, too few cores).
Clovertown has far too little effective FSB bandwidth.
[Charts: aggregate GFlop/s vs. core count (1 to 16) for SpMV (median) and LBMHD (64^3), comparing Opteron, Clovertown, Niagara2 (Huron), and the Cell Blade.]
Parallel Efficiency (average performance per thread, fully optimized)
Aggregate Mflop/s divided by the number of cores.
Niagara2 and Cell show very good multicore scaling; Clovertown showed very poor multicore scaling on both applications.
For SpMV, Opteron and Clovertown showed good multisocket scaling.
Clovertown runs into bandwidth limits far short of its theoretical peak, even for LBMHD.
Opteron lacks the bandwidth for SpMV, and the FP resources to use its bandwidth for LBMHD.
[Charts: GFlop/s per core vs. core count for LBMHD (64^3) and SpMV (median) on the four systems.]
Power Efficiency (Fully Optimized)
Used a digital power meter to measure sustained power under load; power efficiency = sustained performance / sustained power.
All cache-based machines delivered similar power efficiency.
FBDIMMs (~12W each) drive up sustained power: 8 DIMMs on Clovertown (total of ~330W), 16 DIMMs on the Niagara2 machine (total of ~450W).
[Charts: MFlop/s per watt for LBMHD (64^3) and SpMV (median) on Clovertown, Opteron, Niagara2 (Huron), and the Cell Blade.]
Productivity
Niagara2 required significantly less work to deliver good performance.
For LBMHD, Clovertown, Opteron, and Cell all required SIMD (which hampers productivity) for best performance.
Virtually every optimization was required (sooner or later) for Opteron and Cell.
The cache-based machines required search for some optimizations, while Cell always relied on heuristics.
Summary
Paradoxically, the most complex/advanced architectures required the most tuning and delivered the lowest performance.
Niagara2 delivered both very good performance and productivity.
Cell delivered very good performance and efficiency (processor and power).
Our multicore-specific autotuned SpMV implementation significantly outperformed an autotuned MPI implementation.
Our multicore autotuned LBMHD implementation significantly outperformed the already optimized serial implementation.
Sustainable memory bandwidth is essential even on kernels with moderate computational intensity (flop:byte ratio).
Architectural transparency is invaluable in optimizing code.
Acknowledgements
UC Berkeley: RADLab cluster (Opterons), PSI cluster (Clovertowns).
Sun Microsystems: Niagara2 access.
Forschungszentrum Jülich: Cell blade cluster access.
Questions?
Backup Slides