Data Partitioning on Heterogeneous Multicore and Multi-GPU Systems Using Functional Performance Models of Data-Parallel Applications

Published in: Cluster Computing (CLUSTER), 2012 IEEE International Conference on

2013/9/11


Outline

• Introduction
• Performance Measurement
• Column-based matrix multiplication
• FPM of multiple cores and GPUs
• Experimental results


Introduction

• Heterogeneous multiprocessor systems
  – Better power efficiency
  – Performance/price ratio
• Multicore and GPU programming techniques
  – OpenMP, MPI
  – Brook+, CUDA, OpenCL


Introduction (cont.)

• Data-parallel scientific applications
  – Linear algebra routines
  – Digital signal processing
  – Computational fluid dynamics
• Data partitioning algorithm
  – Based on performance models of the processors


Introduction (cont.)

• Constant performance model (CPM)
  – Built from a history of performance measurements
  – Represents the absolute speed of each processor/device as a constant
• Functional performance model (FPM)
  – Represents speed as a function of problem size
  – Can be used with any data-parallel application
  – GPU and CPU have separate memory and different programming models
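As a rough illustration of why a constant model can mislead, the Python sketch below splits a workload in proportion to fixed speeds and then evaluates the result against a speed that drops at larger problem sizes. The function names, the speeds, and the 50-unit threshold are hypothetical values chosen for illustration, not measurements from the talk.

```python
def partition_cpm(n, speeds):
    """Split n work units in proportion to constant speeds (CPM-style)."""
    total = sum(speeds)
    shares = [int(n * s / total) for s in speeds]
    # Units lost to rounding down go to the fastest devices first.
    leftover = n - sum(shares)
    fastest_first = sorted(range(len(speeds)), key=lambda i: speeds[i], reverse=True)
    for i in fastest_first[:leftover]:
        shares[i] += 1
    return shares


def cpu_speed(x):
    # Hypothetical: a core whose speed does not depend on problem size.
    return 10.0


def gpu_speed(x):
    # Hypothetical: fast while the data fits in device memory, slower beyond.
    return 40.0 if x <= 50 else 20.0


# CPM sees only one number per device, here the speed measured at a small size.
shares = partition_cpm(100, [cpu_speed(10), gpu_speed(10)])
times = [shares[0] / cpu_speed(shares[0]), shares[1] / gpu_speed(shares[1])]
print(shares, times)
```

Under these assumed numbers the GPU share takes twice as long as the CPU share, which is the kind of imbalance a size-dependent model is meant to avoid.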


Introduction (cont.)

• Load balancing algorithms
  – Static algorithms
    • Also known as predicting-the-future algorithms
    • Do not require data redistribution
    • Cannot balance the load on non-dedicated platforms
  – Dynamic algorithms
    • Do not require a priori information
    • Incur communication overhead
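One common form of dynamic balancing, sketched below in Python, re-derives each device's share from the time it needed in the previous iteration. This is a generic illustration, not the algorithm of the paper; `rebalance` and `units_moved` are hypothetical names, and the timings are made up.

```python
def rebalance(shares, times):
    """Redistribute work in proportion to the speeds observed last iteration."""
    n = sum(shares)
    speeds = [x / t for x, t in zip(shares, times)]
    total = sum(speeds)
    new_shares = [int(n * s / total) for s in speeds]
    # Give the units lost to rounding to the fastest device.
    new_shares[speeds.index(max(speeds))] += n - sum(new_shares)
    return new_shares


def units_moved(old, new):
    """Work units that must be transferred between devices."""
    return sum(abs(a - b) for a, b in zip(old, new)) // 2


old = [50, 50]
new = rebalance(old, [5.0, 1.0])   # device 0 took 5 s, device 1 took 1 s
print(new, units_moved(old, new))
```

No prior knowledge of the devices is needed, but the units counted by `units_moved` have to be sent across the system, which is the communication overhead noted on the slide.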


Performance Measurement

• Hybrid multicore and multi-GPU node with NUMA architecture
  – Multiple identical cores
  – Hierarchical memory
  – Heterogeneous GPUs connected via PCI Express


Performance Measurement

• CPU
  – GEMM kernel from ACML 4.4 (AMD Core Math Library)
• GPU
  – CUBLAS 4.1 (NVIDIA CUDA BLAS)


Performance Measurement (cont.)

• Approach to performance measurement
  – Processes are bound to cores
  – Processes are synchronized
  – Measurements are repeated multiple times
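A minimal Python sketch of two of these steps, binding and repeated timing, is given below. It is an assumption-laden stand-in: `naive_gemm` replaces the vendor GEMM kernels, `bind_to_core` relies on the Linux-only `os.sched_setaffinity`, and synchronization between several measuring processes is left out.

```python
import os
import statistics
import time


def bind_to_core(core):
    """Pin the calling process to one core; returns False where unsupported."""
    if not hasattr(os, "sched_setaffinity"):  # Linux-only call
        return False
    try:
        os.sched_setaffinity(0, {core})
        return True
    except OSError:
        return False


def naive_gemm(a, b):
    """Plain triple-loop C = A x B on lists of lists (stand-in kernel)."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def measure(kernel, flops, repeats=5):
    """Run the kernel several times; return (mean seconds, GFlops)."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        kernel()
        samples.append(time.perf_counter() - start)
    mean = statistics.mean(samples)
    return mean, flops / mean / 1e9


if __name__ == "__main__":
    size = 32
    a = [[1.0] * size for _ in range(size)]
    b = [[2.0] * size for _ in range(size)]
    bind_to_core(0)
    mean, gflops = measure(lambda: naive_gemm(a, b), 2 * size ** 3)
    print(f"mean {mean:.6f} s, {gflops:.4f} GFlops")
```

The flop count `2 * size ** 3` is the usual estimate for a square matrix product; averaging over repeats reduces the effect of one-off disturbances on the reported speed.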


Performance Measurement (cont.)

• CPU
  – The speed of a core depends on the number of cores executing the kernel on the same socket
  – It is not affected by execution on the other socket
• GPU
  – One core is dedicated to the GPU; the other cores are idle
  – Matrices are sent to and received from the GPU


Column-based matrix multiplication


Column-based matrix multiplication (cont.)

• Partitioning algorithm
  – Arranges the submatrices to be as square as possible
  – Minimizes the total volume of communications while balancing the computations
• Blocking factor b
  – A parameter of the application that adjusts the granularity of communications and computations
  – Determined experimentally
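The role of the blocking factor can be sketched in Python as follows: the product is accumulated in steps, and each step uses a pivot column of A and a pivot row of B that are b wide. This is a serial, single-device illustration with a hypothetical function name; in the parallel application each step would also involve communicating the pivot blocks.

```python
def blocked_matmul(a, b_mat, block):
    """C = A x B computed as a sequence of updates of width `block`."""
    rows, inner, cols = len(a), len(b_mat), len(b_mat[0])
    c = [[0.0] * cols for _ in range(rows)]
    for k0 in range(0, inner, block):
        k1 = min(k0 + block, inner)
        # One step: pivot column A[:, k0:k1] times pivot row B[k0:k1, :].
        for i in range(rows):
            for k in range(k0, k1):
                a_ik = a[i][k]
                for j in range(cols):
                    c[i][j] += a_ik * b_mat[k][j]
    return c


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b_mat = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
# The result does not depend on the blocking factor; only the number of steps does.
print(blocked_matmul(a, b_mat, 1) == blocked_matmul(a, b_mat, 2))
```

A larger b means fewer, larger steps (coarser granularity), while a smaller b means more frequent, smaller updates.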


FPM of multiple cores and GPUs

• Speed functions of multiple cores
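One way to turn a handful of measurements into a speed function is piecewise-linear interpolation, sketched below in Python. The measurement points are invented for illustration, and `make_speed_function` is a hypothetical helper, not part of any library used in the talk.

```python
from bisect import bisect_right


def make_speed_function(points):
    """Build s(x) from (problem size, measured speed) pairs.

    Speeds between measured sizes are linearly interpolated; outside the
    measured range the nearest measured speed is used.
    """
    pts = sorted(points)
    sizes = [p[0] for p in pts]
    speeds = [p[1] for p in pts]

    def speed(x):
        if x <= sizes[0]:
            return speeds[0]
        if x >= sizes[-1]:
            return speeds[-1]
        i = bisect_right(sizes, x)
        x0, x1 = sizes[i - 1], sizes[i]
        s0, s1 = speeds[i - 1], speeds[i]
        return s0 + (s1 - s0) * (x - x0) / (x1 - x0)

    return speed


# Hypothetical measurements: (problem size, speed in GFlops).
core_speed = make_speed_function([(100, 4.0), (200, 6.0), (400, 5.0)])
print(core_speed(150), core_speed(300))
```

Because the speed of a core depends on how many cores on the same socket run the kernel, a separate function of this kind would be needed for each such configuration.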


FPM of multiple cores and GPUs (cont.)

• Speed functions of GPUs


FPM of multiple cores and GPUs (cont.)

• Version 1
  – Pivot column A(b), pivot row B(b), and submatrix Ci are stored in host memory
• Version 2
  – Submatrix Ci is stored and accumulated on the device until the device memory is exceeded


FPM of multiple cores and GPUs (cont.)

• Version 3
  – Overlapping communications and computations
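The difference between the three versions can be approximated with a simple per-step time model, sketched in Python below. The model is an assumption made for illustration (constant PCI Express bandwidth, constant GEMM rate, perfect overlap in Version 3, data that fits in device memory); it is not taken from the paper, and all parameter values are hypothetical.

```python
def iteration_times(m, n, b, bandwidth, gemm_rate, elem_bytes=8):
    """Estimated time of one update C += A(b) x B(b) for an m x n block of C.

    bandwidth is in bytes/s and gemm_rate in flop/s; both are assumed constant.
    """
    pivot_transfer = (m * b + b * n) * elem_bytes / bandwidth
    c_transfer = m * n * elem_bytes / bandwidth      # one direction
    compute = 2.0 * m * n * b / gemm_rate

    v1 = pivot_transfer + 2 * c_transfer + compute   # C is sent in and read back
    v2 = pivot_transfer + compute                    # C stays on the device
    v3 = max(pivot_transfer, compute)                # transfer hidden behind compute
    return v1, v2, v3


v1, v2, v3 = iteration_times(4096, 4096, 128, bandwidth=6e9, gemm_rate=5e11)
print(f"v1 {v1:.4f} s, v2 {v2:.4f} s, v3 {v3:.4f} s")
```

Under this model Version 2 removes the cost of moving C at every step, and Version 3 additionally hides whichever of transfer or computation is shorter.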


FPM of multiple cores and GPUs (cont.)

• Speed functions of GPUs
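Given speed functions for each device, a partitioner can look for shares whose predicted execution times are equal. The Python sketch below does this by bisection on the common finishing time. It assumes that each device's time x / s(x) does not decrease as x grows; the speed functions are hypothetical, and the code is a simplified stand-in rather than the authors' implementation.

```python
def time_for(x, speed):
    """Predicted execution time of x work units on a device."""
    return x / speed(x) if x > 0 else 0.0


def max_load_within(t, speed, n):
    """Largest x in [0, n] whose predicted time is at most t."""
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if time_for(mid, speed) <= t:
            lo = mid
        else:
            hi = mid - 1
    return lo


def partition_fpm(n, speed_funcs, iterations=60):
    """Find shares summing to n with approximately equal predicted times."""
    t_lo = 0.0
    # At this time the fastest device alone could process all n units.
    t_hi = min(time_for(n, s) for s in speed_funcs)
    for _ in range(iterations):
        t = (t_lo + t_hi) / 2
        if sum(max_load_within(t, s, n) for s in speed_funcs) >= n:
            t_hi = t
        else:
            t_lo = t
    shares = [max_load_within(t_hi, s, n) for s in speed_funcs]
    # Integer rounding can overshoot; trim from the device that finishes last.
    while sum(shares) > n:
        slowest = max(range(len(shares)),
                      key=lambda i: time_for(shares[i], speed_funcs[i]))
        shares[slowest] -= 1
    return shares


def cpu_speed(x):
    return 10.0                          # hypothetical constant-speed core


def gpu_speed(x):
    return 40.0 if x <= 50 else 20.0     # hypothetical drop at larger sizes


fpm_shares = partition_fpm(100, [cpu_speed, gpu_speed])
print(fpm_shares)
```

With these assumed functions the predicted times of the two shares differ by well under one work unit's worth of time, whereas a split proportional to the small-size speeds (10 and 40) would give the GPU 80 units and leave it finishing long after the CPU.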


Experimental results


Q&A


Thank you for listening


1. Performance modelling
2. The performance of the program
3. Why FPM
4. Problem size
5. Kernel
6. NUMA
7. GEMM
8. BLAS
9. GFlops