GPU acceleration of image processing
Jan Lemeire, 19/06/2022

iMinds The Conference: Jan Lemeire


Page 1: iMinds The Conference: Jan Lemeire


GPU acceleration of image processing
Jan Lemeire

Page 2: iMinds The Conference: Jan Lemeire
Page 3: iMinds The Conference: Jan Lemeire


GPU vs CPU Peak Performance Trends

GPU peak performance has grown aggressively; the hardware has kept up with Moore's law.

Source: NVIDIA

1995: 5,000 triangles/second, 800,000-transistor GPU
2010: 350 million triangles/second, 3-billion-transistor GPU

Page 4: iMinds The Conference: Jan Lemeire


To the rescue: Graphics Processing Units (GPUs)

94 fps (AMD Tahiti Pro)

GPU: 1-3 TeraFlop/s, versus 10-20 GigaFlop/s for a CPU


Figure 1.1. Enlarging performance gap between multi-core CPUs and many-core GPUs. (Courtesy: John Owens)

Page 5: iMinds The Conference: Jan Lemeire
Page 6: iMinds The Conference: Jan Lemeire


GPUs are an alternative to CPUs in offering processing power.


Page 7: iMinds The Conference: Jan Lemeire


Pipeline: pixel rescaling, lens correction, pattern detection

The CPU gives only 4 fps; next-generation machines need 50 fps.

Page 8: iMinds The Conference: Jan Lemeire


CPU: 4 fps GPU: 70 fps
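In terms of speedup that is 70 / 4 = 17.5x, comfortably above the 50 fps required by the next-generation machines mentioned on the previous slide.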


Page 9: iMinds The Conference: Jan Lemeire


Methodology


From application to hardware:
1. Identification of compute-intensive parts
2. Feasibility study of GPU acceleration
3. GPU implementation
4. GPU optimization

Page 10: iMinds The Conference: Jan Lemeire


Obstacle 1: Hard(er) to implement


Page 11: iMinds The Conference: Jan Lemeire

GPU Programming Concepts (OpenCL terminology)

Hardware model (figure): the host/CPU with its RAM is connected to the device/GPU (± 1 TFLOPS) over a 4-8 GB/s link. The GPU contains global memory (1 GB, ~100 GB/s, ~200-cycle latency), constant memory (64 KB) and texture memory (located in global memory), plus a number of multiprocessors. Each multiprocessor has local memory (16/48 KB, ~40 GB/s, a few cycles of latency) and scalar processors running at ± 1 GHz, each with private memory (16K/8).

Execution model (figure): a kernel is launched on a 1D, 2D or 3D grid of work groups; each work group is a block of work items of size Sx × Sy, addressed with get_group_id(), get_local_id() and get_local_size(). Limits: max 1024 work items per work group; execution in warps/wavefronts of 32/64 work items; max 8 work groups simultaneously on a multiprocessor; max 24/48 active warps per multiprocessor.
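To make the terminology concrete, here is a minimal OpenCL C kernel sketch; the kernel name, buffer layout and scaling operation are illustrative assumptions and not taken from the slides. It shows how a work item locates itself in the grid with the indexing functions from the figure.

    __kernel void rescale(__global const float* in,
                          __global float* out,
                          const int width,
                          const int height,
                          const float factor)
    {
        /* Position of this work item in the whole grid (across all work groups) */
        const int x = get_global_id(0);
        const int y = get_global_id(1);

        /* The same position can be reconstructed from the figure's indices:
           x == get_group_id(0) * get_local_size(0) + get_local_id(0) */

        if (x < width && y < height) {
            /* One pixel per work item; pixels stored row-major in global memory */
            out[y * width + x] = factor * in[y * width + x];
        }
    }

Each work group is scheduled on one multiprocessor and executed in warps/wavefronts of 32/64 work items, which is why the limits listed above matter for performance.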

Page 12: iMinds The Conference: Jan Lemeire


Semi-abstract, scalable hardware model

To write effective and efficient code you need to know this model, and more hardware details than for a CPU (on a CPU, the processor itself ensures efficient execution). Because the model is semi-abstract and scalable, code remains compatible and efficient across different GPUs.

Page 13: iMinds The Conference: Jan Lemeire


Increased code complexity

1. Complex index calculations: mapping data elements onto processing elements (at least 2 levels); sometimes it is better to group elements (see the sketch after this list).
2. Optimizations: their impact on performance needs to be tested.
3. A lot of parameters:
   a. Algorithm, implementation
   b. Configuration of the mapping
   c. Hardware parameters (limits)
   d. Optimized versions
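As a sketch of such an index calculation, the kernel below lets each work item process a small tile of pixels instead of a single one; the tile size, kernel name and pixel operation are illustrative assumptions.

    #define TILE 4   /* each work item handles a TILE x TILE block of pixels */

    __kernel void invert_tiled(__global const uchar* in,
                               __global uchar* out,
                               const int width,
                               const int height)
    {
        /* Level 1: map the work item onto its tile in the image */
        const int tx = get_global_id(0) * TILE;
        const int ty = get_global_id(1) * TILE;

        /* Level 2: map tile-local coordinates onto individual pixels */
        for (int dy = 0; dy < TILE; dy++) {
            for (int dx = 0; dx < TILE; dx++) {
                const int x = tx + dx;
                const int y = ty + dy;
                if (x < width && y < height)
                    out[y * width + x] = (uchar)(255 - in[y * width + x]);
            }
        }
    }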


Page 14: iMinds The Conference: Jan Lemeire


Methodology

From application to hardware: identification of compute-intensive parts, feasibility study of GPU acceleration, GPU implementation, GPU optimization.

Approaches for the GPU implementation: skeleton-based, OpenCL, pragma-based, parallelization by compiler.
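As an illustration of the pragma-based route, here is a minimal sketch using OpenACC directives; OpenACC is only one possible directive set (the slides do not name one), and the function name and data layout are assumptions for illustration.

    /* Ask the compiler to offload the loop to the GPU; the data clauses
       describe which array sections must be copied to and from the device. */
    void scale_image(const float *in, float *out, int n, float factor)
    {
        #pragma acc parallel loop copyin(in[0:n]) copyout(out[0:n])
        for (int i = 0; i < n; i++) {
            out[i] = factor * in[i];
        }
    }

Compared with hand-written OpenCL, the compiler handles kernel generation and data transfers, at the cost of less control over the optimizations discussed above.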

Page 15: iMinds The Conference: Jan Lemeire


Obstacle 2: Hard(er) to get efficiency


Page 16: iMinds The Conference: Jan Lemeire


We expect peak performance: a speedup of 100x is possible.

At least, we expect some speedup. But what is 5x worth?

What are the reasons for low efficiency?


Page 17: iMinds The Conference: Jan Lemeire


Roofline model
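As a reminder (standard formulation, not stated explicitly in the slide text), the roofline model bounds attainable performance as

    attainable performance = min( peak compute performance , operational intensity × peak memory bandwidth )

with operational intensity measured in FLOP/byte. With the figures from the hardware slide (± 1 TFLOPS compute, ~100 GB/s global-memory bandwidth), a kernel needs roughly 10 FLOP per byte of global-memory traffic to become compute-bound rather than memory-bound.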

Page 18: iMinds The Conference: Jan Lemeire


Page 19: iMinds The Conference: Jan Lemeire


Methodology: our contribution

From application to hardware: identification of compute-intensive parts, feasibility study of GPU acceleration with performance estimation, GPU implementation with performance analysis, GPU optimization.

Supporting elements from the diagram: algorithm characterization and hardware characterization, exposing bottlenecks & trade-offs; the roofline model & benchmarks; an analytical model; benchmarks; anti-parallel patterns. GPU implementation approaches: skeleton-based, OpenCL, pragma-based, parallelization by compiler.

Page 20: iMinds The Conference: Jan Lemeire


Conclusions


Page 21: iMinds The Conference: Jan Lemeire


Conclusions


Changed into…

Page 22: iMinds The Conference: Jan Lemeire


Conclusions


Page 23: iMinds The Conference: Jan Lemeire


Competence Center for Personal Supercomputing

Offer training courses (to overcome obstacle 1)
Acquire expertise
Take an independent, critical position
Offer feasibility and performance studies (to overcome obstacle 2)

Symposium: Brussels, December 13th 2012

http://parallel.vub.ac.be