Matthias Müller ([email protected])
Center for Information Services and High Performance Computing (ZIH)
Lecture: Performance Analysis (Leistungsanalyse)
Parallel SPEC Benchmarks
Regression Models
Summary of Previous Lecture
Experimental Design
3 Holger Brunst, Matthias Müller: Leistungsanalyse
Goal & Terminology
Obtain the maximum information with the minimum number of experiments
Response Variable (Zielgröße)
Factors (Einflussfaktoren), also called: Predictor variables or predictors
Levels, also called: treatment
Primary Factors
Secondary Factors
Replication
Experiment Design (Versuchsplanung)
Experimental Unit
Terminology: Interaction
Interaction (Wechselwirkung)
Two factors A and B are interacting factors if the effect of one depends upon the level of the other
Noninteracting Factors
        A1  A2
  B1     2   4
  B2     5   7

Interacting Factors
        A1  A2
  B1     2   4
  B2     5   8
[Figure: response plotted against the levels of A and B for both tables. Parallel lines indicate noninteracting factors; non-parallel lines indicate interacting factors.]
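The two tables can be checked mechanically: factors interact exactly when the effect of one factor changes with the level of the other. A minimal sketch in Python (the numbers are the table values above; function names are illustrative):

```python
# Response tables from the slide: keys are (level of A, level of B).
noninteracting = {("A1", "B1"): 2, ("A2", "B1"): 4,
                  ("A1", "B2"): 5, ("A2", "B2"): 7}
interacting    = {("A1", "B1"): 2, ("A2", "B1"): 4,
                  ("A1", "B2"): 5, ("A2", "B2"): 8}

def effect_of_A(table, b_level):
    """Change in response when A goes from A1 to A2 at a fixed level of B."""
    return table[("A2", b_level)] - table[("A1", b_level)]

# Noninteracting: the effect of A is 2 at both levels of B.
assert effect_of_A(noninteracting, "B1") == effect_of_A(noninteracting, "B2") == 2
# Interacting: the effect of A depends on the level of B (2 vs. 3).
assert effect_of_A(interacting, "B1") == 2
assert effect_of_A(interacting, "B2") == 3
```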
Common Mistakes
Variation due to experimental error ignored
Important parameters are not controlled
Effects of different factors are not isolated
Simple one-factor-at-a-time designs are used
Interactions are ignored
Too many experiments are conducted
Full Factorial Designs
Uses every possible combination at all levels of all factors, which requires n = n1 · n2 · ... · nk experiments for k factors, where factor i has ni levels
In workstation example: 7 CPUs x 3 Memory sizes x 4 disk drives x 4 workloads x 4 operating systems x 3 educational levels
= 4032 experiments
Advantage: Every possible factor combination is examined. This includes secondary factors and their interactions.
Disadvantage:
– Cost of the study regarding time and money
– Too many experiments to be conducted.
– Also consider replication!
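The experiment count of a full factorial design is simply the product of the level counts, which makes the cost explosion easy to see. A quick sketch reproducing the workstation example above:

```python
from math import prod

# Levels per factor in the workstation example from the slide.
levels = {
    "CPUs": 7,
    "memory sizes": 3,
    "disk drives": 4,
    "workloads": 4,
    "operating systems": 4,
    "educational levels": 3,
}

n = prod(levels.values())
assert n == 4032  # matches the slide

# Replication multiplies the cost again, e.g. 3 repetitions per combination:
assert 3 * n == 12096
```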
2k Factorial Designs
Determines the effect of k factors with 2 levels each
Easy to analyze
Helps to sort performance factors in the order of impact
At beginning of performance study:
– Large number of factors and levels
– Full factorial design most likely not possible
– Reduce the number of factors by selecting the significant ones
Impact of unidirectional factors can be estimated for their minimum and maximum levels
Decide if performance difference is worth further examination (with more levels)
Explanation of the concept: Start with k=2, then generalize
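For k=2 the sign-table analysis is compact enough to write out directly. A sketch with illustrative response values (not from the lecture), using the additive model y = q0 + qA·xA + qB·xB + qAB·xA·xB with coded factor levels -1/+1:

```python
# Four runs of a 2^2 design: (xA, xB, response y); the y values are illustrative.
runs = [
    (-1, -1, 15),
    (+1, -1, 45),
    (-1, +1, 25),
    (+1, +1, 75),
]

# Each model parameter is the signed column sum divided by the number of runs.
q0  = sum(y           for _,  _,  y in runs) / 4   # mean response
qA  = sum(xA * y      for xA, _,  y in runs) / 4   # main effect of A
qB  = sum(xB * y      for _,  xB, y in runs) / 4   # main effect of B
qAB = sum(xA * xB * y for xA, xB, y in runs) / 4   # interaction effect

assert (q0, qA, qB, qAB) == (40.0, 20.0, 10.0, 5.0)
```

Squaring the effects and normalizing gives each factor's share of the total variation, which is how 2^k designs sort performance factors by impact.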
Parallel SPEC Benchmarks
SPEC OMP
SPEC OMP
Benchmark suite developed by SPEC HPG
Benchmark suite for performance testing of shared memory processor systems
Uses OpenMP versions of SPEC CPU2000 benchmarks
SPEC OMP mixes integer and FP in one suite
OMPM is focused on 4-way to 16-way systems
OMPL is targeting 32-way and larger systems
SPEC OMP Applications
Code     Application                         Language  Lines
ammp     Molecular dynamics                  C         13500
applu    CFD, partial LU                     Fortran    4000
apsi     Air pollution                       Fortran    7500
art      Image recognition, neural networks  C          1300
fma3d    Crash simulation                    Fortran   60000
gafort   Genetic algorithm                   Fortran    1500
galgel   CFD, Galerkin FE                    Fortran   15300
equake   Earthquake modeling                 C          1500
mgrid    Multigrid solver                    Fortran     500
swim     Shallow water modeling              Fortran     400
wupwise  Quantum chromodynamics              Fortran    2200
CPU2000 vs OMPL2001
SPEC MPI2007
An application benchmark suite that measures:
– Type of computer processor
– Number of computer processors
– Communication interconnect
– Memory architecture
– Compilers
– MPI library performance
– File system performance
Identifying Candidate Applications
– From SPEC CPU2006
– Via an open call for candidate applications
MPI2007 design goals: benchmark for distributed memory
Comparison of Different Benchmarks using MPI
                                  SPEC MPI        NPB           HPCC
Number of applications            13              8             7
Language                          F77,F90,C,C++   F77,C         C
Code size                         ~530,000 lines  28,000 lines  47,200 lines
#MPI calls in the code            ~2400           ~400          ~600
#different MPI calls in the code  ~59             ~36           ~44
Application Fields
– Computational fluid dynamics
– Quantum chromodynamics
– Climate modeling
– Ray tracing
– Molecular Dynamics
– Weather prediction
– Heat transfer
– Hydrodynamics
– Flow Simulation
MPI2007 Benchmark Goals
–Runs on clusters or SMPs
–Validates for correctness and measures performance
–Supports 32-bit or 64-bit OS/ABI.
–Consists of applications drawn from National Labs and University research centers
–Supports a broad range of MPI implementations and Operating systems including Windows, Linux, Proprietary Unix
–Has a runtime of ~1 hour per benchmark test at 16 ranks using GigE with 1 GB memory footprint per rank
–Scales to 128 ranks
–Is extensible to future large and extreme data sets, planned to cover larger numbers of ranks.
MPI2007 – tested for portability
– Architectures:
• Opteron, Xeon, Itanium2, PA-Risc, Power5, Sparc
– Interconnects:
• Ethernet, Infiniband, Infinipath, SGI NUMAlink, and shared memory.
– Operating systems
• Linux (RH FC3, SLES9/10,Suse 9.3), Windows CCS, HPUX, Solaris, AIX
– MPI implementations
• HP-MPI, MPICH, MPICH2, Open MPI, IBM-MPI, Intel MPI, MPICH-GM, MVAPICH, Fujitsu MPI, InfiniPath MPI, SGI MPT
– Compilers:
• SUN Studio, Fujitsu, Intel, PathScale, PGI, HP, and IBM compilers.
MPI2007 – tested for scalability
– Scalable from 16 to 128 ranks (processes) for medium data set
– Runtime of 1 hour per benchmark test at 16 ranks using GigE on an unspecified reference cluster.
– Memory footprint should be < 1GB per rank at 16 ranks.
– Exhaustively tested for rank counts from 12-15 up to 130-140, and at 160, 180, 200, 225, 256, and 512 ranks
Overview of the applications
Code           LOC     Language  MPI call sites  #MPI calls  Area
104.milc        17987  C          51             18          Lattice QCD
107.leslie3d    10503  F77,F90    43             13          Combustion
113.GemsFDTD    21858  F90       237             16          Electrodynamic simulation
115.fds4        44524  F90,C     239             15          CFD
121.pop2        69203  F90       158             17          Geophysical fluid dynamics
122.tachyon     15512  C          17             16          Ray tracing
126.lammps       6796  C++       625             25          Molecular dynamics
127.wrf2       163462  F90,C     132             23          Weather forecast
128.GAPgeofem   30935  F77,C      58             18          Geophysical FEM
129.tera_tf      6468  F90        42             13          Eulerian hydrodynamics
130.socorro     91585  F90       155             20          Density-functional theory
132.zeusmp2     44441  C,F90     639             21          Astrophysical CFD
137.lu           5671  F90        72             13          SSOR
MPI2007 Benchmark dynamic message call counts
Pt2Pt Communication Statistics: 122.tachyon (ray tracing)
Pt2Pt Communication Statistics: 107.leslie3D (combustion)
Pt2Pt Communication Statistics: 113.GemsFDTD (electrodynamics)
Message Length Statistics (Pt2Pt)
Available Results
Available Results (blind submission)
– AMD A2210 Reference Platform (16 cores)
• Gigabit Ethernet
• Single Core AMD Opteron 848, 2.2 GHz
– SGI Altix 4700 (16-128 cores)
• SGI Numalink, SGI MPT 1.15
• Dual-Core Intel Itanium II 9040, 1.6 GHz
– HP Proliant BL460c Blade Cluster Platform 3000 BL (16-256 cores)
• Infiniband DDR, HP-MPI 2.2.5
• Dual-Core Intel Xeon 5160, 3.0 GHz
– QLogic, U. Cambridge Darwin Cluster (32-512 cores)
• Infinipath, QLogic Infinipath MPI library 2.0
• Dual-Core Intel Xeon 5160, 3.0 GHz
– QLogic, AMD Emerald Cluster (32-512 cores)
• Infinipath, QLogic Infinipath MPI library 2.1
• Dual-Core AMD Opteron 290, 2.8 GHz
Scales to 128 ranks, works on 512
Scalability on U. Cambridge’s Darwin Cluster (II)
Scalability on HP Cluster
Summary and Conclusion
SPEC MPI2007 properties:
– Application benchmark with 13 different codes
– Run and reporting rules for reproducibility
– Tested on a wide range of platforms:
• CPU and Node Architectures
• Interconnects
• Compilers
• MPI implementations
– Available dataset (medium) scales to 128 ranks
– Next steps:
• Large dataset with enhanced scalability for larger systems
• …
Use Cases
Use cases
– Performance trends
– Compiler and performance
– Comparing different Itanium systems
– Comparing different system generations
SPEC performance trends (performance per thread)
Where Does the Performance Go? or Why Should I Care About the Memory Hierarchy?
[Figure: Processor-DRAM memory gap (latency). Processor performance grows ~60%/yr (2x every 1.5 years, "Moore's Law"); DRAM grows ~9%/yr (2x every 10 years). The resulting processor-memory performance gap grows ~50% per year.]
Comparison OMPM base compilers
Influence of compilers on OMPM base 32-way results
Comparison OMPM on 32-way 1.5 GHz Itanium
SMP Performance Gain Itanium/Itanium 2
The history of the NEC SX series
[Figure: performance timeline 1985-2004 for SX-1/2, SX-3, SX-4, SX-5, SX-6/7 and SX-8. Technology moves from bipolar, water-cooled to CMOS, air-cooled; module size shrinks from 45.7cm x 38.6cm to a 2cm x 2cm single-chip vector processor. Architecture evolves from single-module nodes with multiple CPUs to multi-node systems, large-scale clusters (>100 nodes, 2001) and massive-scale clusters (>500 nodes, 2004), with over 1 GFLOP per node.]
Performance Properties of Different SX systems
System  Availability  CPU perf.  Mem. bandw./CPU  Node perf.  Mem. bandw./node
SX-4    1996           2 GF/s    16 GB/s           64 GF/s    512 GB/s
SX-5e   1999           4 GF/s    32 GB/s           64 GF/s    512 GB/s
SX-6    2001           8 GF/s    32 GB/s           64 GF/s    256 GB/s
SX-6+   2002           9 GF/s    36 GB/s           72 GF/s    324 GB/s
SX-8    2004          16 GF/s    64 GB/s          128 GF/s    512 GB/s

Per-CPU performance: factor 2 in two years. Per-node performance: factor 2 in eight years.
Properties of SPEC codes on vector systems
Name     Lang  Vratio (%)  Avg. Vlen  MEM (MB)
Wupwise  F     87.34        58.74     1488
Swim     F     99.75       253.48     1584
Mgrid    F     99.14       211.04      480
Applu    F     81.31        34.17     1520
Galgel   F     92.57        45.14      272
Equake   C      0.06         9.6       464
Apsi     F     76.70        23.02     1648
Gafort   F     40.25        59.60     1680
Fma3d    F     10.29         8.95     1040
Art      C     32.06       242.14      272
Ammp     C     76.67       102.79      176
Expectations
Swim, mgrid and maybe galgel should perform well
Equake, fma3d and art should perform poorly
However, the focus was not on absolute, but relative performance and scalability
SPEC efficiency on SX
Performance measurements
All performance is reported relative to the performance of one thread on SX-4
Number of threads used:
– 1,2,4,8,16,32 on SX-4
– 1,2,4,8,16 on SX-5
– 1,2,4,8 on SX-6+
– 1,2,4,8 on SX-8
Wupwise – expected behavior
Same node performance of SX-4/5/6
Art – improves more than the peak performance ratio suggests
Art benefits from improvements of the scalar unit
Swim – surprisingly improves with every generation
Compute bound on SX-4 and SX-5!
Mgrid – large improvements from SX-6+ to SX-8
Improved stride-2 memory access
Not much improvement from SX-4 to SX-5 and from SX-6 to SX-8
Explanation for ammp improvements
Ammp contains a lot of locks
Lock performance (measured by EPCC microbenchmarks)
        Lock time  Lock ratio  Ammp  Ammp ratio
SX-6+   4.3 µs     1.23        2.82  1
SX-8    3.5 µs     1.00        3.40  1.21
General observations
With the exception of equake and galgel, the applications show good scalability
Peak performance improvements:
– 87% to 96% realized for 1 thread
– 81% to 89% realized for 8 threads
On average an SX-8 CPU is 6.14 times faster than an SX-4 CPU (peak ratio is 8)
No significant difference between scalar and vector codes
Summary for SPEC
Summary – What you should have learned
– There are many different benchmark approaches: microbenchmarks, kernels, applications…
– SPEC benchmarks are application benchmarks, or at least application-oriented benchmarks, designed to represent current workloads
• An update is required after a few years
– SPEC benchmarks are used to:
• Measure and compare performance of systems
• Drive future development
• …
– Different metrics are used (base/peak, speed/throughput)
– Many different factors have an influence on application performance:
• CPU
• Memory system
• Compilers
• OS and runtime environment
• I/O system
• …
Regression Models
Terms
Regression models allow one to estimate or predict a random variable as a function of several other variables
The estimated variable is called the response variable; the variables used to predict the response are called predictor variables, predictors, or factors.
Simple Linear Regression Model
Predictor variable x and predicted response y: y = b_0 + b_1 x + e
Regression parameters: b_0 (intercept), b_1 (slope)
Error: e
[Figure: scatter plot of the measured y over x with the fitted regression line giving the estimated y]
Definitions
n observation pairs: (x_1, y_1), ..., (x_n, y_n)
Error: e_i = y_i - b_0 - b_1 x_i
Sum of Squared Errors: SSE = Σ e_i²
Mean Error: ē = (1/n) Σ e_i
Best linear model minimizes SSE and has a mean error of zero.
Exercise: calculate regression parameters for best linear model
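As a sketch of the exercise, the closed-form least-squares solution (which minimizes the SSE and forces the mean error to zero) can be written out directly; function and variable names are illustrative:

```python
def linear_regression(xs, ys):
    """Closed-form least-squares estimates b0, b1 for y = b0 + b1*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b1 = (sum(x * y for x, y in zip(xs, ys)) - n * mean_x * mean_y) / \
         (sum(x * x for x in xs) - n * mean_x ** 2)
    b0 = mean_y - b1 * mean_x   # this choice makes the mean error zero
    return b0, b1

# Sanity check on exactly linear data y = 3 + 2x:
xs = [1, 2, 3, 4, 5]
ys = [3 + 2 * x for x in xs]
b0, b1 = linear_regression(xs, ys)
assert abs(b0 - 3) < 1e-9 and abs(b1 - 2) < 1e-9
```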
Calculation of Linear Regression Parameters
b_1 = (Σ x_i y_i - n·x̄·ȳ) / (Σ x_i² - n·x̄²)
b_0 = ȳ - b_1·x̄
Coefficient of determination
Sum of Squared Errors: SSE = Σ (y_i - ŷ_i)²
Without regression, the SSE would be the total sum of squares: SST = Σ (y_i - ȳ)²
The difference between SST and SSE is explained by the regression: SSR = SST - SSE
Coefficient of determination: R² = SSR/SST (the higher R², the better the regression)
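A small sketch computing R² from SSE and SST for a fitted line (the data values are illustrative):

```python
def r_squared(xs, ys, b0, b1):
    """Coefficient of determination R^2 = SSR/SST = 1 - SSE/SST."""
    mean_y = sum(ys) / len(ys)
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))  # unexplained
    sst = sum((y - mean_y) ** 2 for y in ys)                     # total variation
    return 1 - sse / sst

# Nearly linear data with a least-squares fit:
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b1 = (sum(x * y for x, y in zip(xs, ys)) - n * mx * my) / \
     (sum(x * x for x in xs) - n * mx * mx)
b0 = my - b1 * mx
r2 = r_squared(xs, ys, b0, b1)
assert 0.99 < r2 <= 1.0   # almost all variation is explained by the regression
```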
Assumptions
The relationship between the response variable y and the predictor variable x is linear
The predictor variable x is measured without any error
The model errors are statistically independent
The errors are normally distributed with zero mean and a constant standard deviation
Visual tests: look at the data
[Figure: four x-y scatter plots - (a) linear, (b) multilinear, (c) outlier, (d) nonlinear]
Residual versus predicted response graph
[Figure: residual vs. predicted response - (a) no trend, (b) and (c) visible trends indicating a systematic model error]
Residual versus experiment number
[Figure: residual vs. experiment number - (a) no trend, (b) trend across experiments]
Example: physical experiment with insufficient initial conditions.
Check for constant standard deviation of errors
[Figure: residual vs. predicted response - (a) no trend, constant spread; (b) increasing spread, i.e. non-constant standard deviation of the errors]
Automatic fitting with gnuplot
gnuplot> f(x)=a*x+b
gnuplot> fit f(x) "data.txt" u 1:2 via a,b
After 4 iterations the fit converged.
final sum of squares of residuals : 1.80841
rel. change during last iteration : -6.64694e-07
degrees of freedom (ndf) : 15
rms of residuals (stdfit) = sqrt(WSSR/ndf) : 0.347218
variance of residuals (reduced chisquare) = WSSR/ndf : 0.120561
Final set of parameters Asymptotic Standard Error
======================= ==========================
a = 0.530196 +/- 0.01719 (3.242%)
b = 3.70353 +/- 0.1761 (4.756%)
Visual test
plot [0:][0:] "data.txt" u 1:2 w p, f(x)
Residual versus predicted response graph
Residual versus experiment number
Fitting with gnuplot: basics
The `fit` command can fit a user-defined function to a set of data points
(x,y), using an implementation of the nonlinear least-squares
(NLLS) Marquardt-Levenberg algorithm. Any user-defined variable occurring in
the function body may serve as a fit parameter, but the return type of the
function must be real.
Syntax: fit {[xrange] {[yrange]}} <function> '<datafile>'
{datafile-modifiers} via '<parameter file>' | <var1>{,<var2>,...}
Fitting with gnuplot: advanced
The default data formats for fitting functions with a single independent
variable, y=f(x), are {x:}y or x:y:s; those formats can be changed with
the datafile `using` qualifier. The third item (a column number or an
expression), if present, is interpreted as the standard deviation of the
corresponding y value and is used to compute a weight for the datum, 1/s**2.
Curvilinear regression
Sometimes life is more difficult than linear dependencies: nonlinear regression is needed
Often it is sufficient to convert the nonlinear function into a linear form with a suitable variable transformation; this is called curvilinear regression
Example:
Examples of curvilinear regression functions
Note: if a predictor variable appears in more than one transformed predictor variable, the transformed variables are likely to be correlated, causing the problem of multicollinearity

Nonlinear        Linear
y = a + b/x      y = a + b(1/x)
y = 1/(a + bx)   (1/y) = a + bx
y = x/(a + bx)   (x/y) = a + bx
y = a·b^x        ln y = ln a + (ln b)·x
y = a + b·x^n    y = a + b·(x^n)
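The exponential case y = a·b^x illustrates the idea: fit a straight line to (x, ln y) and transform the coefficients back. A sketch (function and variable names are illustrative):

```python
from math import exp, log

def fit_exponential(xs, ys):
    """Fit y = a * b**x by linear regression on the transformed pairs (x, ln y)."""
    n = len(xs)
    ls = [log(y) for y in ys]           # transformed responses
    mx, ml = sum(xs) / n, sum(ls) / n
    slope = (sum(x * l for x, l in zip(xs, ls)) - n * mx * ml) / \
            (sum(x * x for x in xs) - n * mx ** 2)
    intercept = ml - slope * mx
    # Back-transform: intercept = ln a, slope = ln b.
    return exp(intercept), exp(slope)

# Exact data y = 2 * 3**x is recovered (up to rounding):
xs = [0, 1, 2, 3]
ys = [2 * 3 ** x for x in xs]
a, b = fit_exponential(xs, ys)
assert abs(a - 2) < 1e-9 and abs(b - 3) < 1e-9
```

Note that least squares on the transformed data minimizes the error in ln y, not in y, which is one reason the visual checks above remain necessary.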
Common mistakes in regression
Not verifying that the relationship is linear
Relying on automated results without visual verification
Not specifying confidence intervals for the regression parameters
Not specifying the coefficient of determination
Confusing the Coefficient of Determination R^2 and the Coefficient of Correlation R
Using regression to predict far beyond the measured range
Coefficient of determination provides wrong indication
[Figure: four x-y scatter plots with similar coefficients of determination but very different underlying relationships]
Short checklist for simple linear regression analysis
1. Visually verified that the relationship is linear?
2. Are all predictors in appropriate units so that the regression coefficients are comparable?
3. Has the coefficient of determination been specified?
4. Is the coefficient of determination high enough?
5. Have the confidence intervals for regression parameters been calculated?
6. Are all regression parameters statistically significant?
7. Is the regression used only for predictions close to the measured range?
Not treated here
Confidence intervals for regression parameters
Confidence intervals for predictions
Multiple linear regression
General transformations
.. and much more…
Jens Doleschal ([email protected])
Internal Timer Synchronization for Parallel Event Tracing
Use Case for Linear Regression
Center for Information Services and High Performance Computing (ZIH)
82 Jens Doleschal
Introduction
• Timers in distributed environments are not sufficiently synchronized for the purpose of event tracing (~2-3 µs = network latency)
Causes for insufficiently synchronized timers in distributed environments:
– Every host typically has its own local timer
– System timers synchronized with NTP are far too imprecise for event tracing (~1 ms)
– Some timers like cycle counters are not synchronized by default
– Fluctuations of the timer speed due to temperature and aging of the quartz oscillator
– Speed Step Technology
Introduction
Use of inaccurately synchronized timers results in an erroneous representation of the program trace data:
Q1 Qualitative error: Violation of the logical order of distributed events.
Q2 Quantitative error: Distorted time measurement of distributed activities. Leads to skewed performance values.
Timer Synchronization: Overview
Two parts of the synchronization scheme:
– Recording synchronization information during runtime
– Subsequent correction, i.e. transformation of asynchronous local time stamps to synchronous global time stamps with a linear interpolation
Due to small fluctuations in the timer drift, the synchronization error accumulates over long intervals
Linear begin-to-end correction insufficient for long trace runs
Synchronize the timers frequently and piecewise interpolate the timer parameters between the synchronization phases
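The piecewise interpolation step can be sketched as follows; this is only an illustration of the idea (the function names and the two-point sync data are invented), not the actual tool implementation:

```python
import bisect

def make_corrector(sync_points):
    """Map local time stamps to global time by piecewise linear
    interpolation between (local, global) synchronization points."""
    local_times = [l for l, _ in sync_points]

    def correct(t_local):
        # Find the segment containing t_local; clamp so time stamps
        # before the first / after the last sync phase are extrapolated.
        i = bisect.bisect_right(local_times, t_local) - 1
        i = max(0, min(i, len(sync_points) - 2))
        l0, g0 = sync_points[i]
        l1, g1 = sync_points[i + 1]
        rate = (g1 - g0) / (l1 - l0)  # local-to-global clock rate on this segment
        return g0 + (t_local - l0) * rate

    return correct

# Example: local clock starts 5 s ahead and runs 1% fast between two sync phases.
correct = make_corrector([(5.0, 0.0), (106.0, 100.0)])
assert abs(correct(5.0) - 0.0) < 1e-9
assert abs(correct(55.5) - 50.0) < 1e-9   # interior points map linearly
```

With more than two sync phases, each segment gets its own drift rate, which is what keeps the accumulated error bounded between resynchronizations.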
Timer Synchronization: Clock Model
Timer correction:
– System of linear equations solved with the least-squares method
– The equation system is based on message-passing relationships between the local timers
– No need for a reference timer
Statistical estimation:
– Message delays determined by a maximum likelihood estimator
– The error is normally distributed
– Best linear unbiased estimator (BLUE)
Timer Synchronization: Resynchronization
Relationships between local timers are established with multiple concurrent ping-pong messages to achieve low uncertainty
Within each synchronization phase a specially designed message pattern monitors timer alignment
Frequent repetition of the synchronization phase
[Figure: timeline with the 1st and 2nd synchronization phases]
Measurement Results: Linear begin-to-end Synchronization
Measurement Results: Resynchronization every 4 Minutes
Measurement Results: Resynchronization every 2 Minutes