Typical performance bottlenecks and how they can be found

Bert Wesarg, ZIH, Technische Universität Dresden

Page 1:

Typical performance bottlenecks and how they can be found

Bert Wesarg

ZIH, Technische Universität Dresden

Page 2:

SC’13: Hands-on Practical Hybrid Parallel Application Performance Engineering

Outline

• Case I: Finding load imbalances in OpenMP codes
• Case II: Finding communication and computation imbalances in MPI codes

Page 3:

Case I: Sparse Matrix Vector Multiplication

• The matrix has significantly more zero than non-zero elements => sparse matrix
• Only the non-zero elements of A are stored in memory
• Algorithm:

foreach row r in A
    y[r.x] = 0
    foreach non-zero element e in r
        y[r.x] += e.value * x[e.y]

Page 4:

Case I: Sparse Matrix Vector Multiplication

• Naïve OpenMP algorithm:
• Distributes the rows of A evenly across the threads in the parallel region
• The distribution of the non-zero elements may influence the load balance of the parallel application

#pragma omp parallel for
foreach row r in A
    y[r.x] = 0
    foreach non-zero element e in r
        y[r.x] += e.value * x[e.y]

Page 5:

Case I: Load imbalances in OpenMP codes

• Measuring the static OpenMP application:

% cd ~/Bottlenecks/smxv
% make PREP=scorep
scorep gcc -fopenmp -DLITTLE_ENDIAN \
    -DFUNCTION_INC='"y_Ax-omp.inc.c"' -DFUNCTION=y_Ax_omp \
    -o smxv-omp smxv.c -lm
scorep gcc -fopenmp -DLITTLE_ENDIAN \
    -DFUNCTION_INC='"y_Ax-omp-dynamic.inc.c"' \
    -DFUNCTION=y_Ax_omp_dynamic -o smxv-omp-dynamic smxv.c -lm
% OMP_NUM_THREADS=8 scan -t ./smxv-omp yax_large.bin

Page 6:

Case I: Load imbalances in OpenMP codes: Profile

• Two metrics indicate load imbalances:
  – Time spent in OpenMP barriers
  – Computational imbalance
• Open the prepared measurement on the LiveDVD with Cube:

% cube ~/Bottlenecks/smxv/scorep_smxv-omp_large/trace.cubex

[CUBE GUI showing trace analysis report]

Page 7:

Case I: Time spent in OpenMP barriers

These threads spent up to 20% of their running time in the barrier.

Page 8:

Case I: Computational imbalance

Master thread does 66% of the work

Page 9:

Case I: Sparse Matrix Vector Multiplication

• Improved OpenMP algorithm
• Distributes the rows of A dynamically across the threads in the parallel region

#pragma omp parallel for schedule(dynamic, 1000)
foreach row r in A
    y[r.x] = 0
    foreach non-zero element e in r
        y[r.x] += e.value * x[e.y]

Page 10:

Case I: Profile Analysis

• Two metrics indicate load imbalances:
  – Time spent in OpenMP barriers
  – Computational imbalance
• Open the prepared measurement on the LiveDVD with Cube:

% cube ~/Bottlenecks/smxv/scorep_smxv-omp-dynamic_large/trace.cubex

[CUBE GUI showing trace analysis report]

Page 11:

Case I: Time spent in OpenMP barriers

All threads spent similar time in the barrier

Page 12:

Case I: Computational imbalance

Threads do nearly equal work

Page 13:

Case I: Trace Comparison

• Open the prepared measurement on the LiveDVD with Vampir:

% vampir ~/Bottlenecks/smxv/scorep_smxv-omp_large/traces.otf2

[Vampir GUI showing trace]

Page 14:

Case I: Time spent in OpenMP barriers

Improved runtime

Less time in OpenMP barrier

Page 15:

Case I: Computational imbalance

Large imbalance in the time spent in computational code

Page 16:

Outline

• Case I: Finding load imbalances in OpenMP codes
• Case II: Finding communication and computation imbalances in MPI codes

Page 17:

Case II: Heat Conduction Simulation

• Calculates the heat conduction at each timestep
• Discretized formula for space and time

Page 18:

Case II: Heat Conduction Simulation

• The application uses MPI for boundary exchange and OpenMP for computation
• The simulation grid is distributed across MPI ranks

Page 19:

Case II: Heat Conduction Simulation

• Ranks need to exchange boundaries before the next iteration step

Page 20:

Case II: Profile Analysis

• MPI algorithm:

foreach step in [1:nsteps]
    exchangeBoundaries
    computeHeatConduction

• Building and measuring the heat conduction application:

% cd ~/Bottlenecks/heat
% make PREP='scorep --user'
[... make output ...]
% scan -t mpirun -np 16 ./heat-MPI 3072 32

• Open the prepared measurement on the LiveDVD with Cube:

% cube ~/Bottlenecks/heat/scorep_heat-MPI_16/trace.cubex

[CUBE GUI showing trace analysis report]

Page 21:

Case II: Hide MPI communication with computation

• Step 1: Compute heat in the area which is communicated to your neighbors

Page 22:

Case II: Hide MPI communication with computation

• Step 2: Start communicating boundaries with your neighbors

Page 23:

Case II: Hide MPI communication with computation

• Step 3: Compute heat in the interior area

Page 24:

Case II: Profile Analysis

• Improved MPI algorithm:

foreach step in [1:nsteps]
    computeHeatConductionInBoundaries
    startBoundaryExchange
    computeHeatConductionInInterior
    waitForCompletionOfBoundaryExchange

• Measuring the improved heat conduction application:

% scan -t mpirun -np 16 ./heat-MPI-overlap 3072 32

• Open the prepared measurement on the LiveDVD with Cube:

% cube ~/Bottlenecks/heat/scorep_heat-MPI-overlap_16/trace.cubex

[CUBE GUI showing trace analysis report]

Page 25:

Case II: Trace Comparison

• Open the prepared measurement on the LiveDVD with Vampir:

% vampir ~/Bottlenecks/heat/scorep_heat-MPI_16/traces.otf2

[Vampir GUI showing trace]

Page 26:

Acknowledgments

• Thanks to Dirk Schmidl, RWTH Aachen, for providing the sparse matrix vector multiplication code.