Matthias Müller ([email protected])
Center for Information Services and High Performance Computing (ZIH)
Lecture: Performance Analysis (Leistungsanalyse)
Parallel SPEC Benchmarks
Regression Models
Summary of Previous Lecture
Experimental Design
3 Holger Brunst, Matthias Müller: Leistungsanalyse
Goal & Terminology
Obtain the maximum information with the minimum number of experiments
Response Variable (Zielgröße)
Factors (Einflussfaktoren), also called: Predictor variables or predictors
Levels, also called: treatment
Primary Factors
Secondary Factors
Replication
Experiment Design (Versuchsplanung)
Experimental Unit
Terminology: Interaction
Interaction (Wechselwirkung)
Two factors A and B are interacting factors if the effect of one depends upon the level of the other
Noninteracting Factors
        A1  A2
  B1     2   4
  B2     5   7

Interacting Factors
        A1  A2
  B1     2   4
  B2     5   8
[Figure: response plotted against the levels of A and B for both tables. Parallel lines indicate noninteracting factors; non-parallel lines indicate interacting factors.]
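The two tables can be checked mechanically: factors interact exactly when the effect of one factor changes with the level of the other. A minimal sketch in Python (the numbers are the table values above; function names are illustrative):

```python
# Response tables from the slide: keys are (level of A, level of B).
noninteracting = {("A1", "B1"): 2, ("A2", "B1"): 4,
                  ("A1", "B2"): 5, ("A2", "B2"): 7}
interacting    = {("A1", "B1"): 2, ("A2", "B1"): 4,
                  ("A1", "B2"): 5, ("A2", "B2"): 8}

def effect_of_A(table, b_level):
    """Change in response when A goes from A1 to A2 at a fixed level of B."""
    return table[("A2", b_level)] - table[("A1", b_level)]

# Noninteracting: the effect of A is 2 at both levels of B.
assert effect_of_A(noninteracting, "B1") == effect_of_A(noninteracting, "B2") == 2
# Interacting: the effect of A depends on the level of B (2 vs. 3).
assert effect_of_A(interacting, "B1") == 2
assert effect_of_A(interacting, "B2") == 3
```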
Common Mistakes
Variation due to experimental error ignored
Important parameters are not controlled
Effects of different factors are not isolated
Simple one-factor-at-a-time designs are used
Interactions are ignored
Too many experiments are conducted
Full Factorial Designs
Uses every possible combination at all levels of all factors, which requires n = n1 · n2 · ... · nk experiments for k factors, where factor i has ni levels
In workstation example: 7 CPUs x 3 Memory sizes x 4 disk drives x 4 workloads x 4 operating systems x 3 educational levels
= 4032 experiments
Advantage: Every possible factor combination is examined. This includes secondary factors and their interactions.
Disadvantage:
– Cost of the study regarding time and money
– Too many experiments to be conducted.
– Also consider replication!
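The experiment count of a full factorial design is simply the product of the level counts, which makes the cost explosion easy to see. A quick sketch reproducing the workstation example above:

```python
from math import prod

# Levels per factor in the workstation example from the slide.
levels = {
    "CPUs": 7,
    "memory sizes": 3,
    "disk drives": 4,
    "workloads": 4,
    "operating systems": 4,
    "educational levels": 3,
}

n = prod(levels.values())
assert n == 4032  # matches the slide

# Replication multiplies the cost again, e.g. 3 repetitions per combination:
assert 3 * n == 12096
```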
2k Factorial Designs
Determines the effect of k factors with 2 levels each
Easy to analyze
Helps to sort performance factors in the order of impact
At beginning of performance study:
– Large number of factors and levels
– Full factorial design most likely not possible
– Reduce the number of factors by selecting the significant ones
Impact of unidirectional factors can be estimated for their minimum and maximum levels
Decide if performance difference is worth further examination (with more levels)
Explanation of the concept: Start with k=2, then generalize
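For k=2 the sign-table analysis is compact enough to write out directly. A sketch with illustrative response values (not from the lecture), using the additive model y = q0 + qA·xA + qB·xB + qAB·xA·xB with coded factor levels -1/+1:

```python
# Four runs of a 2^2 design: (xA, xB, response y); the y values are illustrative.
runs = [
    (-1, -1, 15),
    (+1, -1, 45),
    (-1, +1, 25),
    (+1, +1, 75),
]

# Each model parameter is the signed column sum divided by the number of runs.
q0  = sum(y           for _,  _,  y in runs) / 4   # mean response
qA  = sum(xA * y      for xA, _,  y in runs) / 4   # main effect of A
qB  = sum(xB * y      for _,  xB, y in runs) / 4   # main effect of B
qAB = sum(xA * xB * y for xA, xB, y in runs) / 4   # interaction effect

assert (q0, qA, qB, qAB) == (40.0, 20.0, 10.0, 5.0)
```

Squaring the effects and normalizing gives each factor's share of the total variation, which is how 2^k designs sort performance factors by impact.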
Parallel SPEC Benchmarks
SPEC OMP
SPEC OMP
Benchmark suite developed by SPEC HPG
Benchmark suite for performance testing of shared memory processor systems
Uses OpenMP versions of SPEC CPU2000 benchmarks
SPEC OMP mixes integer and FP in one suite
OMPM is focused on 4-way to 16-way systems
OMPL is targeting 32-way and larger systems
SPEC OMP Applications
Code     Application                         Language  Lines
ammp     Molecular dynamics                  C         13500
applu    CFD, partial LU                     Fortran    4000
apsi     Air pollution                       Fortran    7500
art      Image recognition, neural networks  C          1300
fma3d    Crash simulation                    Fortran   60000
gafort   Genetic algorithm                   Fortran    1500
galgel   CFD, Galerkin FE                    Fortran   15300
equake   Earthquake modeling                 C          1500
mgrid    Multigrid solver                    Fortran     500
swim     Shallow water modeling              Fortran     400
wupwise  Quantum chromodynamics              Fortran    2200
CPU2000 vs OMPL2001
SPEC MPI2007
An application benchmark suite that measures:
– Type of computer processor
– Number of computer processors
– Communication interconnect
– Memory architecture
– Compilers
– MPI library performance
– File system performance
Identifying Candidate Applications
– From SPEC CPU2006
– Via an open call for candidate applications
MPI2007 design goals: benchmark for distributed memory
Comparison of Different Benchmarks using MPI
                                  SPEC MPI        NPB           HPCC
Number of applications            13              8             7
Language                          F77,F90,C,C++   F77,C         C
Code size                         ~530,000 lines  28,000 lines  47,200 lines
#MPI calls in the code            ~2400           ~400          ~600
#different MPI calls in the code  ~59             ~36           ~44
Application Fields
– Computational fluid dynamics
– Quantum chromodynamics
– Climate modeling
– Ray tracing
– Molecular Dynamics
– Weather prediction
– Heat transfer
– Hydrodynamics
– Flow Simulation
MPI2007 Benchmark Goals
–Runs on clusters or SMPs
–Validates for correctness and measures performance
–Supports 32-bit or 64-bit OS/ABI.
–Consists of applications drawn from National Labs and University research centers
–Supports a broad range of MPI implementations and Operating systems including Windows, Linux, Proprietary Unix
–Has a runtime of ~1 hour per benchmark test at 16 ranks using GigE with 1 GB memory footprint per rank
–Scales to 128 ranks
–Is extensible to future large and extreme data sets, planned to cover larger numbers of ranks.
MPI2007 – tested for portability
– Architectures:
• Opteron, Xeon, Itanium2, PA-Risc, Power5, Sparc
– Interconnects:
• Ethernet, Infiniband, Infinipath, SGI NUMAlink, and shared memory.
– Operating systems
• Linux (RH FC3, SLES9/10,Suse 9.3), Windows CCS, HPUX, Solaris, AIX
– MPI implementations
• HP-MPI, MPICH, MPICH2, Open MPI, IBM-MPI, Intel MPI, MPICH-GM, MVAPICH, Fujitsu MPI, InfiniPath MPI, SGI MPT
– Compilers:
• SUN Studio, Fujitsu, Intel, PathScale, PGI, HP, and IBM compilers.
MPI2007 – tested for scalability
– Scalable from 16 to 128 ranks (processes) for medium data set
– Runtime of 1 hour per benchmark test at 16 ranks using GigE on an unspecified reference cluster.
– Memory footprint should be < 1GB per rank at 16 ranks.
– Exhaustively tested for rank counts from 12-15 up to 130-140, and at 160, 180, 200, 225, 256, and 512 ranks
Overview of the applications
Code           LOC     Language  MPI call sites  #MPI calls  Area
104.milc        17987  C          51             18          Lattice QCD
107.leslie3d    10503  F77,F90    43             13          Combustion
113.GemsFDTD    21858  F90       237             16          Electrodynamic simulation
115.fds4        44524  F90,C     239             15          CFD
121.pop2        69203  F90       158             17          Geophysical fluid dynamics
122.tachyon     15512  C          17             16          Ray tracing
126.lammps       6796  C++       625             25          Molecular dynamics
127.wrf2       163462  F90,C     132             23          Weather forecast
128.GAPgeofem   30935  F77,C      58             18          Geophysical FEM
129.tera_tf      6468  F90        42             13          Eulerian hydrodynamics
130.socorro     91585  F90       155             20          Density-functional theory
132.zeusmp2     44441  C,F90     639             21          Astrophysical CFD
137.lu           5671  F90        72             13          SSOR
MPI2007 Benchmark dynamic message call counts
Pt2Pt Communication Statistics: 122.tachyon (ray tracing)
Pt2Pt Communication Statistics: 107.leslie3D (combustion)
Pt2Pt Communication Statistics: 113.GemsFDTD (electrodynamics)
Message Length Statistics (Pt2Pt)
Available Results
Available Results (blind submission)
– AMD A2210 Reference Platform (16 cores)
• Gigabit Ethernet
• Single Core AMD Opteron 848, 2.2 GHz
– SGI Altix 4700 (16-128 cores)
• SGI Numalink, SGI MPT 1.15
• Dual-Core Intel Itanium II 9040, 1.6 GHz
– HP Proliant BL460c Blade Cluster Platform 3000 BL (16-256 cores)
• Infiniband DDR, HP-MPI 2.2.5
• Dual-Core Intel Xeon 5160, 3.0 GHz
– QLogic, U. Cambridge Darwin Cluster (32-512 cores)
• Infinipath, QLogic Infinipath MPI library 2.0
• Dual-Core Intel Xeon 5160, 3.0 GHz
– QLogic, AMD Emerald Cluster (32-512 cores)
• Infinipath, QLogic Infinipath MPI library 2.1
• Dual-Core AMD Opteron 290, 2.8 GHz
Scales to 128 ranks, works on 512
Scalability on U. Cambridge’s Darwin Cluster (II)
Scalability on HP Cluster
Summary and Conclusion
SPEC MPI2007 properties:
– Application benchmark with 13 different codes
– Run and reporting rules for reproducibility
– Tested on a wide range of platforms:
• CPU and Node Architectures
• Interconnects
• Compilers
• MPI implementations
– Available dataset (medium) scales to 128 ranks
– Next steps:
• Large dataset with enhanced scalability for larger systems
• …
Use Cases
Use cases
– Performance trends
– Compiler and performance
– Comparing different Itanium systems
– Comparing different system generations
SPEC performance trends (performance per thread)
Where Does the Performance Go? or Why Should I Care About the Memory Hierarchy?
[Figure: Processor-DRAM memory gap (latency). Processor performance grows ~60%/yr (2x every 1.5 years, "Moore's Law"); DRAM grows ~9%/yr (2x every 10 years). The resulting processor-memory performance gap grows ~50% per year.]
Comparison OMPM base compilers
Influence of compilers on OMPM base 32-way results
Comparison OMPM on 32-way 1.5 GHz Itanium
SMP Performance Gain Itanium/Itanium 2
The history of the NEC SX series
[Figure: performance timeline 1985-2004 for SX-1/2, SX-3, SX-4, SX-5, SX-6/7 and SX-8. Technology moves from bipolar, water-cooled to CMOS, air-cooled; module size shrinks from 45.7cm x 38.6cm to a 2cm x 2cm single-chip vector processor. Architecture evolves from single-module nodes with multiple CPUs to multi-node systems, large-scale clusters (>100 nodes, 2001) and massive-scale clusters (>500 nodes, 2004), with over 1 GFLOP per node.]
Performance Properties of Different SX systems
System  Availability  CPU perf.  Mem. bandw./CPU  Node perf.  Mem. bandw./node
SX-4    1996           2 GF/s    16 GB/s           64 GF/s    512 GB/s
SX-5e   1999           4 GF/s    32 GB/s           64 GF/s    512 GB/s
SX-6    2001           8 GF/s    32 GB/s           64 GF/s    256 GB/s
SX-6+   2002           9 GF/s    36 GB/s           72 GF/s    324 GB/s
SX-8    2004          16 GF/s    64 GB/s          128 GF/s    512 GB/s

Per-CPU performance: factor 2 in two years. Per-node performance: factor 2 in eight years.
Properties of SPEC codes on vector systems
Name     Lang  Vratio (%)  Avg. Vlen  MEM (MB)
Wupwise  F     87.34        58.74     1488
Swim     F     99.75       253.48     1584
Mgrid    F     99.14       211.04      480
Applu    F     81.31        34.17     1520
Galgel   F     92.57        45.14      272
Equake   C      0.06         9.6       464
Apsi     F     76.70        23.02     1648
Gafort   F     40.25        59.60     1680
Fma3d    F     10.29         8.95     1040
Art      C     32.06       242.14      272
Ammp     C     76.67       102.79      176
Expectations
Swim, mgrid and maybe galgel should perform well
Equake, fma3d and art should perform poorly
However, the focus was not on absolute, but relative performance and scalability
SPEC efficiency on SX
Performance measurements
All performance is reported relative to the performance of one thread on SX-4
Number of threads used:
– 1,2,4,8,16,32 on SX-4
– 1,2,4,8,16 on SX-5
– 1,2,4,8 on SX-6+
– 1,2,4,8 on SX-8
Wupwise – expected behavior
Same node performance of SX-4/5/6
Art – improves more than the peak performance ratio suggests
Art benefits from improvements of the scalar unit
Swim – surprisingly improves with every generation
Compute bound on SX-4 and SX-5!
Mgrid – large improvements from SX-6+ to SX-8
Improved stride-2 memory access
Not much improvement from SX-4 to SX-5 and from SX-6 to SX-8
Explanation for ammp improvements
Ammp contains a lot of locks
Lock performance (measured by EPCC microbenchmarks)
        Lock time  Lock ratio  Ammp  Ammp ratio
SX-6+   4.3 µs     1.23        2.82  1
SX-8    3.5 µs     1.00        3.40  1.21
General observations
With the exception of equake and galgel, the applications show good scalability
Peak performance improvements:
– 87% to 96% realized for 1 thread
– 81% to 89% realized for 8 threads
On average an SX-8 CPU is 6.14 times faster than an SX-4 CPU (peak ratio is 8)
No significant difference between scalar and vector codes
Summary for SPEC
Summary – What you should have learned
– There are many different benchmark approaches: microbenchmarks, kernels, applications…
– SPEC benchmarks are application benchmarks, or at least application-oriented benchmarks, designed to represent current workloads
• An update is required after a few years
– SPEC benchmarks are used to:
• Measure and compare performance of systems
• Drive future development
• …
– Different metrics are used (base/peak, speed/throughput)
– Many different factors have an influence on application performance:
• CPU
• Memory system
• Compilers
• OS and runtime environment
• I/O system
• …
Regression Models
Terms
Regression models allow one to estimate or predict a random variable as a function of several other variables
The estimated variable is called the response variable; the variables used to predict the response are called predictor variables, predictors, or factors.
Simple Linear Regression Model
Predictor variable x and predicted response y: y = b_0 + b_1 x + e
Regression parameters: b_0 (intercept), b_1 (slope)
Error: e
[Figure: scatter plot of the measured y over x with the fitted regression line giving the estimated y]
Definitions
n observation pairs: (x_1, y_1), ..., (x_n, y_n)
Error: e_i = y_i - b_0 - b_1 x_i
Sum of Squared Errors: SSE = Σ e_i²
Mean Error: ē = (1/n) Σ e_i
Best linear model minimizes SSE and has a mean error of zero.
Exercise: calculate regression parameters for best linear model
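As a sketch of the exercise, the closed-form least-squares solution (which minimizes the SSE and forces the mean error to zero) can be written out directly; function and variable names are illustrative:

```python
def linear_regression(xs, ys):
    """Closed-form least-squares estimates b0, b1 for y = b0 + b1*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b1 = (sum(x * y for x, y in zip(xs, ys)) - n * mean_x * mean_y) / \
         (sum(x * x for x in xs) - n * mean_x ** 2)
    b0 = mean_y - b1 * mean_x   # this choice makes the mean error zero
    return b0, b1

# Sanity check on exactly linear data y = 3 + 2x:
xs = [1, 2, 3, 4, 5]
ys = [3 + 2 * x for x in xs]
b0, b1 = linear_regression(xs, ys)
assert abs(b0 - 3) < 1e-9 and abs(b1 - 2) < 1e-9
```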
Calculation of Linear Regression Parameters
b_1 = (Σ x_i y_i - n·x̄·ȳ) / (Σ x_i² - n·x̄²)
b_0 = ȳ - b_1·x̄
Coefficient of determination
Sum of Squared Errors: SSE = Σ (y_i - ŷ_i)²
Without regression, the SSE would be the total sum of squares: SST = Σ (y_i - ȳ)²
The difference between SST and SSE is explained by the regression: SSR = SST - SSE
Coefficient of determination: R² = SSR/SST (the higher R², the better the regression)
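A small sketch computing R² from SSE and SST for a fitted line (the data values are illustrative):

```python
def r_squared(xs, ys, b0, b1):
    """Coefficient of determination R^2 = SSR/SST = 1 - SSE/SST."""
    mean_y = sum(ys) / len(ys)
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))  # unexplained
    sst = sum((y - mean_y) ** 2 for y in ys)                     # total variation
    return 1 - sse / sst

# Nearly linear data with a least-squares fit:
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b1 = (sum(x * y for x, y in zip(xs, ys)) - n * mx * my) / \
     (sum(x * x for x in xs) - n * mx * mx)
b0 = my - b1 * mx
r2 = r_squared(xs, ys, b0, b1)
assert 0.99 < r2 <= 1.0   # almost all variation is explained by the regression
```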
Assumptions
The relationship between the response variable y and the predictor variable x is linear
The predictor variable x is measured without any error
The model errors are statistically independent
The errors are normally distributed with zero mean and a constant standard deviation
Visual tests: look at the data
[Figure: four x-y scatter plots - (a) linear, (b) multilinear, (c) outlier, (d) nonlinear]
Residual versus predicted response graph
[Figure: residual vs. predicted response - (a) no trend, (b) and (c) visible trends indicating a systematic model error]
Residual versus experiment number
[Figure: residual vs. experiment number - (a) no trend, (b) trend across experiments]
Example: physical experiment with insufficient initial conditions.
Check for constant standard deviation of errors
[Figure: residual vs. predicted response - (a) no trend, constant spread; (b) increasing spread, i.e. non-constant standard deviation of the errors]
Automatic fitting with gnuplot
gnuplot> f(x)=a*x+b
gnuplot> fit f(x) "data.txt" u 1:2 via a,b
After 4 iterations the fit converged.
final sum of squares of residuals : 1.80841
rel. change during last iteration : -6.64694e-07
degrees of freedom (ndf) : 15
rms of residuals (stdfit) = sqrt(WSSR/ndf) : 0.347218
variance of residuals (reduced chisquare) = WSSR/ndf : 0.120561
Final set of parameters Asymptotic Standard Error
======================= ==========================
a = 0.530196 +/- 0.01719 (3.242%)
b = 3.70353 +/- 0.1761 (4.756%)
Visual test
plot [0:][0:] "data.txt" u 1:2 w p, f(x)
Residual versus predicted response graph
Residual versus experiment number
Fitting with gnuplot: basics
The `fit` command can fit a user-defined function to a set of data points
(x,y), using an implementation of the nonlinear least-squares
(NLLS) Marquardt-Levenberg algorithm. Any user-defined variable occurring in
the function body may serve as a fit parameter, but the return type of the
function must be real.
Syntax: fit {[xrange] {[yrange]}} <function> '<datafile>'
{datafile-modifiers} via '<parameter file>' | <var1>{,<var2>,...}
Fitting with gnuplot: advanced
The default data formats for fitting functions with a single independent
variable, y=f(x), are {x:}y or x:y:s; those formats can be changed with
the datafile `using` qualifier. The third item (a column number or an
expression), if present, is interpreted as the standard deviation of the
corresponding y value and is used to compute a weight for the datum, 1/s**2.
Curvilinear regression
Sometimes life is more difficult than linear dependencies: nonlinear regression is needed
Often it is sufficient to convert the nonlinear function into a linear form with a suitable variable transformation; this is called curvilinear regression
Example:
Examples of curvilinear regression functions
Note: if a predictor variable appears in more than one transformed predictor variable, the transformed variables are likely to be correlated, causing the problem of multicollinearity

Nonlinear        Linear
y = a + b/x      y = a + b(1/x)
y = 1/(a + bx)   (1/y) = a + bx
y = x/(a + bx)   (x/y) = a + bx
y = a·b^x        ln y = ln a + (ln b)·x
y = a + b·x^n    y = a + b·(x^n)
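The exponential case y = a·b^x illustrates the idea: fit a straight line to (x, ln y) and transform the coefficients back. A sketch (function and variable names are illustrative):

```python
from math import exp, log

def fit_exponential(xs, ys):
    """Fit y = a * b**x by linear regression on the transformed pairs (x, ln y)."""
    n = len(xs)
    ls = [log(y) for y in ys]           # transformed responses
    mx, ml = sum(xs) / n, sum(ls) / n
    slope = (sum(x * l for x, l in zip(xs, ls)) - n * mx * ml) / \
            (sum(x * x for x in xs) - n * mx ** 2)
    intercept = ml - slope * mx
    # Back-transform: intercept = ln a, slope = ln b.
    return exp(intercept), exp(slope)

# Exact data y = 2 * 3**x is recovered (up to rounding):
xs = [0, 1, 2, 3]
ys = [2 * 3 ** x for x in xs]
a, b = fit_exponential(xs, ys)
assert abs(a - 2) < 1e-9 and abs(b - 3) < 1e-9
```

Note that least squares on the transformed data minimizes the error in ln y, not in y, which is one reason the visual checks above remain necessary.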
Common mistakes in regression
Not verifying that the relationship is linear
Relying on automated results without visual verification
Not specifying confidence intervals for the regression parameters
Not specifying the coefficient of determination
Confusing the Coefficient of Determination R^2 and the Coefficient of Correlation R
Using regression to predict far beyond the measured range
Coefficient of determination provides wrong indication
[Figure: four x-y scatter plots with similar coefficients of determination but very different underlying relationships]
Short checklist for simple linear regression analysis
1. Visually verified that the relationship is linear?
2. Are all predictors in appropriate units so that the regression coefficients are comparable?
3. Has the coefficient of determination been specified?
4. Is the coefficient of determination high enough?
5. Have the confidence intervals for regression parameters been calculated?
6. Are all regression parameters statistically significant?
7. Is the regression used only for predictions close to the measured range?
Not treated here
Confidence intervals for regression parameters
Confidence intervals for predictions
Multiple linear regression
General transformations
.. and much more…
Jens Doleschal ([email protected])
Internal Timer Synchronization for Parallel Event Tracing
Use Case for Linear Regression
Center for Information Services and High Performance Computing (ZIH)
82 Jens Doleschal
Introduction
• Timers in distributed environments are not sufficiently synchronized for the purpose of event tracing (~2-3 µs = network latency)
Causes for insufficiently synchronized timers in distributed environments:
– Every host typically has its own local timer
– System timers synchronized with NTP are far too imprecise for event tracing (~1 ms)
– Some timers like cycle counters are not synchronized by default
– Fluctuations of the timer speed due to temperature and aging of the quartz oscillator
– Speed Step Technology
Introduction
Use of inaccurately synchronized timers results in an erroneous representation of the program trace data:
Q1 Qualitative error: Violation of the logical order of distributed events.
Q2 Quantitative error: Distorted time measurement of distributed activities. Leads to skewed performance values.
Timer Synchronization: Overview
Two parts of the synchronization scheme:
– Recording synchronization information during runtime
– Subsequent correction, i.e. transformation of asynchronous local time stamps to synchronous global time stamps with a linear interpolation
Due to small fluctuations in the timer drift, the synchronization error accumulates over long intervals
Linear begin-to-end correction insufficient for long trace runs
Synchronize the timers frequently and piecewise interpolate the timer parameters between the synchronization phases
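The piecewise interpolation step can be sketched as follows; this is only an illustration of the idea (the function names and the two-point sync data are invented), not the actual tool implementation:

```python
import bisect

def make_corrector(sync_points):
    """Map local time stamps to global time by piecewise linear
    interpolation between (local, global) synchronization points."""
    local_times = [l for l, _ in sync_points]

    def correct(t_local):
        # Find the segment containing t_local; clamp so time stamps
        # before the first / after the last sync phase are extrapolated.
        i = bisect.bisect_right(local_times, t_local) - 1
        i = max(0, min(i, len(sync_points) - 2))
        l0, g0 = sync_points[i]
        l1, g1 = sync_points[i + 1]
        rate = (g1 - g0) / (l1 - l0)  # local-to-global clock rate on this segment
        return g0 + (t_local - l0) * rate

    return correct

# Example: local clock starts 5 s ahead and runs 1% fast between two sync phases.
correct = make_corrector([(5.0, 0.0), (106.0, 100.0)])
assert abs(correct(5.0) - 0.0) < 1e-9
assert abs(correct(55.5) - 50.0) < 1e-9   # interior points map linearly
```

With more than two sync phases, each segment gets its own drift rate, which is what keeps the accumulated error bounded between resynchronizations.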
Timer Synchronization: Clock Model
Timer correction:
– System of linear equations solved with the least-squares method
– The equation system is based on message-passing relationships between the local timers
– No need for a reference timer
Statistical estimation:
– Message delays determined by a maximum likelihood estimator
– The error is normally distributed
– Best linear unbiased estimator (BLUE)
Timer Synchronization: Resynchronization
Relationships between local timers are established with multiple concurrent ping-pong messages to achieve low uncertainty
Within each synchronization phase a specially designed message pattern monitors timer alignment
Frequent repetition of the synchronization phase
[Figure: timeline with the 1st and 2nd synchronization phases]
Measurement Results: Linear begin-to-end Synchronization
Measurement Results: Resynchronization every 4 Minutes
Measurement Results: Resynchronization every 2 Minutes