GPU acceleration of image processing
Jan Lemeire, 19/06/2022

iMinds The Conference: Jan Lemeire


Page 1: iMinds The Conference: Jan Lemeire


GPU acceleration of image processing
Jan Lemeire

Page 2: iMinds The Conference: Jan Lemeire
Page 3: iMinds The Conference: Jan Lemeire


GPU vs CPU Peak Performance Trends

GPU peak performance has grown aggressively; the hardware has kept up with Moore's law.

Source: NVIDIA

1995: 5,000 triangles/second, 800,000-transistor GPU
2010: 350 million triangles/second, 3-billion-transistor GPU

Page 4: iMinds The Conference: Jan Lemeire


To the rescue: Graphics Processing Units (GPUs)

94 fps (AMD Tahiti Pro)

GPU: 1-3 TeraFlop/s, versus 10-20 GigaFlop/s for a CPU


Figure 1.1. Enlarging performance gap between multi-core CPUs and many-core GPUs. (Courtesy: John Owens)

Page 5: iMinds The Conference: Jan Lemeire
Page 6: iMinds The Conference: Jan Lemeire


GPUs are an alternative to CPUs in offering processing power.


Page 7: iMinds The Conference: Jan Lemeire


Pipeline: pixel rescaling, lens correction, pattern detection

The CPU gives only 4 fps; next-generation machines need 50 fps.

Page 8: iMinds The Conference: Jan Lemeire


CPU: 4 fps GPU: 70 fps
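In terms of speedup that is 70 / 4 = 17.5x, comfortably above the 50 fps required by the next-generation machines mentioned on the previous slide.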


Page 9: iMinds The Conference: Jan Lemeire


Methodology


From application to hardware:
1. Identification of compute-intensive parts
2. Feasibility study of GPU acceleration
3. GPU implementation
4. GPU optimization

Page 10: iMinds The Conference: Jan Lemeire


Obstacle 1: Hard(er) to implement


Page 11: iMinds The Conference: Jan Lemeire

GPU Programming Concepts (OpenCL terminology)

Hardware model (figure): the host/CPU with its RAM is connected to the device/GPU (± 1 TFLOPS) over a 4-8 GB/s link. The GPU contains global memory (1 GB, ~100 GB/s, ~200-cycle latency), constant memory (64 KB) and texture memory (located in global memory), plus a number of multiprocessors. Each multiprocessor has local memory (16/48 KB, ~40 GB/s, a few cycles of latency) and scalar processors running at ± 1 GHz, each with private memory (16K/8).

Execution model (figure): a kernel is launched on a 1D, 2D or 3D grid of work groups; each work group is a block of work items of size Sx × Sy, addressed with get_group_id(), get_local_id() and get_local_size(). Limits: max 1024 work items per work group; execution in warps/wavefronts of 32/64 work items; max 8 work groups simultaneously on a multiprocessor; max 24/48 active warps per multiprocessor.
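To make the terminology concrete, here is a minimal OpenCL C kernel sketch; the kernel name, buffer layout and scaling operation are illustrative assumptions and not taken from the slides. It shows how a work item locates itself in the grid with the indexing functions from the figure.

    __kernel void rescale(__global const float* in,
                          __global float* out,
                          const int width,
                          const int height,
                          const float factor)
    {
        /* Position of this work item in the whole grid (across all work groups) */
        const int x = get_global_id(0);
        const int y = get_global_id(1);

        /* The same position can be reconstructed from the figure's indices:
           x == get_group_id(0) * get_local_size(0) + get_local_id(0) */

        if (x < width && y < height) {
            /* One pixel per work item; pixels stored row-major in global memory */
            out[y * width + x] = factor * in[y * width + x];
        }
    }

Each work group is scheduled on one multiprocessor and executed in warps/wavefronts of 32/64 work items, which is why the limits listed above matter for performance.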

Page 12: iMinds The Conference: Jan Lemeire


Semi-abstract, scalable hardware model

To write effective and efficient code you need to know this model, and more hardware details than for a CPU (on a CPU, the processor itself ensures efficient execution). Because the model is semi-abstract and scalable, code remains compatible and efficient across different GPUs.

Page 13: iMinds The Conference: Jan Lemeire


Increased code complexity

1. Complex index calculations: mapping data elements onto processing elements (at least 2 levels); sometimes it is better to group elements (see the sketch after this list).
2. Optimizations: their impact on performance needs to be tested.
3. A lot of parameters:
   a. Algorithm, implementation
   b. Configuration of the mapping
   c. Hardware parameters (limits)
   d. Optimized versions
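As a sketch of such an index calculation, the kernel below lets each work item process a small tile of pixels instead of a single one; the tile size, kernel name and pixel operation are illustrative assumptions.

    #define TILE 4   /* each work item handles a TILE x TILE block of pixels */

    __kernel void invert_tiled(__global const uchar* in,
                               __global uchar* out,
                               const int width,
                               const int height)
    {
        /* Level 1: map the work item onto its tile in the image */
        const int tx = get_global_id(0) * TILE;
        const int ty = get_global_id(1) * TILE;

        /* Level 2: map tile-local coordinates onto individual pixels */
        for (int dy = 0; dy < TILE; dy++) {
            for (int dx = 0; dx < TILE; dx++) {
                const int x = tx + dx;
                const int y = ty + dy;
                if (x < width && y < height)
                    out[y * width + x] = (uchar)(255 - in[y * width + x]);
            }
        }
    }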


Page 14: iMinds The Conference: Jan Lemeire


Methodology

From application to hardware: identification of compute-intensive parts, feasibility study of GPU acceleration, GPU implementation, GPU optimization.

Approaches for the GPU implementation: skeleton-based, OpenCL, pragma-based, parallelization by compiler.
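As an illustration of the pragma-based route, here is a minimal sketch using OpenACC directives; OpenACC is only one possible directive set (the slides do not name one), and the function name and data layout are assumptions for illustration.

    /* Ask the compiler to offload the loop to the GPU; the data clauses
       describe which array sections must be copied to and from the device. */
    void scale_image(const float *in, float *out, int n, float factor)
    {
        #pragma acc parallel loop copyin(in[0:n]) copyout(out[0:n])
        for (int i = 0; i < n; i++) {
            out[i] = factor * in[i];
        }
    }

Compared with hand-written OpenCL, the compiler handles kernel generation and data transfers, at the cost of less control over the optimizations discussed above.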

Page 15: iMinds The Conference: Jan Lemeire


Obstacle 2: Hard(er) to get efficiency


Page 16: iMinds The Conference: Jan Lemeire


We expect peak performance: a speedup of 100x is possible.

At least, we expect some speedup. But what is 5x worth?

What are the reasons for low efficiency?


Page 17: iMinds The Conference: Jan Lemeire


Roofline model
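As a reminder (standard formulation, not stated explicitly in the slide text), the roofline model bounds attainable performance as

    attainable performance = min( peak compute performance , operational intensity × peak memory bandwidth )

with operational intensity measured in FLOP/byte. With the figures from the hardware slide (± 1 TFLOPS compute, ~100 GB/s global-memory bandwidth), a kernel needs roughly 10 FLOP per byte of global-memory traffic to become compute-bound rather than memory-bound.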

Page 18: iMinds The Conference: Jan Lemeire


Page 19: iMinds The Conference: Jan Lemeire


Methodology: our contribution

From application to hardware: identification of compute-intensive parts, feasibility study of GPU acceleration with performance estimation, GPU implementation with performance analysis, GPU optimization.

Supporting elements from the diagram: algorithm characterization and hardware characterization, exposing bottlenecks & trade-offs; the roofline model & benchmarks; an analytical model; benchmarks; anti-parallel patterns. GPU implementation approaches: skeleton-based, OpenCL, pragma-based, parallelization by compiler.

Page 20: iMinds The Conference: Jan Lemeire


Conclusions


Page 21: iMinds The Conference: Jan Lemeire


Conclusions


Changed into…

Page 22: iMinds The Conference: Jan Lemeire


Conclusions


Page 23: iMinds The Conference: Jan Lemeire


Competence Center for Personal Supercomputing

Offer training courses (to overcome obstacle 1)
Acquire expertise
Take an independent, critical position
Offer feasibility and performance studies (to overcome obstacle 2)

Symposium: Brussels, December 13th 2012

http://parallel.vub.ac.be