Graphics Processing Unit
What is a GPU ?
• Graphics Processing Units (GPUs)
• Highly parallel, multithreaded, many-core processors
• Capable of very high computation and data throughput
• Once designed specifically for computer graphics and programmable only through graphics APIs
• Primarily used to manage and accelerate video and graphics performance
• 2D/3D graphics rendering
• Digital output to display monitors
What is a GPU ?
• Today’s GPUs
• General-purpose parallel processors (GPGPU)
• Support accessible programming interfaces and industry-standard languages such as C
• Developers who port their applications to GPUs often achieve speedups of orders of magnitude vs. optimized CPU implementations
• High floating-point performance
• High peak memory bandwidth
What is a GPU ?
• Today’s GPUs
• Specialized for compute-intensive, highly parallel computation
• More transistors are devoted to data processing rather than data caching and flow control
• Developer’s point of view
• Hardware latencies are not hidden; they must be managed explicitly
• Writing an efficient GPU program is not possible without knowledge of the architecture
What is a GPU ?
• Today’s GPUs
• A GPU is not only used in a PC, on a video card or motherboard; GPUs also appear in
• Mobile phones
• Display adapters
• Workstations and game consoles
• VPU (Visual Processing Unit)
• A class of processor intended to accelerate machine learning and artificial intelligence workloads
• Better suited to running various machine vision algorithms
History
• August 31, 1999 – Nvidia releases the GeForce 256
• The introduction of the Graphics Processing Unit (GPU) to the PC industry
• The technical definition of a GPU is “a single-chip processor with integrated transform, lighting, triangle setup/clipping, and rendering engines that is capable of processing a minimum of 10 million polygons per second.”
John Manning, Graphic Processing Units
History
• In the 1999–2000 timeframe
• Computer scientists and domain scientists from various fields started using GPUs to accelerate a range of scientific applications
• While users achieved unprecedented performance (over 100x compared to CPUs in some cases), the challenge was that GPGPU required the use of graphics programming APIs such as OpenGL and Cg to program the GPU
• This limited access to the tremendous capabilities of GPUs for science
Architecture - concept
[Figure: CPU vs. GPU block diagrams — the CPU devotes die area to control logic, a large cache, and a few ALUs; the GPU devotes most of its area to many small ALUs with minimal control and cache. Both connect to DRAM.]
Multicore CPU
• Optimized for “sequential” programs
• Sophisticated control logic
• Large on-chip caches to reduce long memory-access latencies
• The execution latency of each thread is reduced
• Both consume chip area and power
• Latency-oriented design
• Many applications are limited by the speed at which data can be moved from memory to the processor
Many-core GPU
• Massive numbers of floating-point calculations
• Driven by the video game industry
• Maximize the chip area dedicated to floating-point calculations
• Optimized for the execution of a massive number of threads
• Pipelined memory channels
• More cores on a chip to increase execution throughput
Many-core GPU
• Massive numbers of floating-point calculations
• A large pool of threads lets the hardware find work to do while some threads wait on long-latency memory accesses or arithmetic operations
• Small caches are provided to help control bandwidth requirements, so multiple threads that access the same memory do not all need to go to DRAM
• Throughput-oriented design
• The goal is to maximize total execution throughput
• Individual threads may take much longer to finish their computation
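The latency-hiding idea above can be quantified with a Little's-law-style estimate (a simplified model for illustration, not from the slides): the number of threads needed to keep an execution unit busy is roughly the memory latency times the issue rate.

```python
import math

# Simplified latency-hiding model (illustrative numbers, not a real GPU spec).
# By Little's law: concurrency needed = latency * throughput.

def threads_to_hide_latency(latency_cycles: float, issue_rate_per_cycle: float) -> int:
    """Threads in flight needed so the pipeline never stalls: while one
    thread waits `latency_cycles` cycles for memory, the scheduler must
    have enough other ready threads to keep issuing work every cycle."""
    return math.ceil(latency_cycles * issue_rate_per_cycle)

# Example: a 400-cycle memory latency with 1 operation issued per cycle
# requires ~400 ready threads to keep the unit busy.
print(threads_to_hide_latency(400, 1.0))  # 400
```

This is why a throughput-oriented design wants far more resident threads than execution lanes: the surplus threads are what hide the long latencies.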
CPU vs. GPU
• CPUs can do general-purpose work
• For programs that have one or very few threads, CPUs with lower operation latencies can achieve much higher performance than GPUs
• Is the CPU slower at ray tracing?
• Embree kernel framework
• https://embree.github.io/papers/2014-Siggraph-Embree.pdf
• The GPU is best at focusing all of its computing ability on a specific task
• A GPU uses thousands of smaller, more efficient cores in a massively parallel architecture aimed at handling many operations at the same time
• GPUs can be 50–100 times faster in tasks that require many parallel processes
What problems are GPUs suited to address?
• Games
• Graphics-intensive rendering of the game world
• The workloads of modern games have become too heavy for CPU-based graphics solutions
Shadow of the Tomb Raider (September 2018): minimum an Intel Core i3-3220, 8GB of RAM, and an NVIDIA GeForce GTX 660/1050 or an AMD Radeon HD 7770 graphics card. The company recommends a beefier Core i7-4770K or Ryzen 5 1600 processor and 16GB of RAM for a smoother experience, with the GPU requirement jumping up to a GTX 1060 6GB or RX 480.
Read more: https://www.tweaktown.com/news/63013/shadow-tomb-raider-pc-requirements-released/index.html
https://medium.com/altumea/gpu-vs-cpu-computing-what-to-choose-a9788a2370c4
What problems are GPUs suited to address?
• 3D Visualization
• Computer-aided design (CAD)
• The need to visualize objects in 3D in real time as you rotate or move them
• Workstation graphics cards can manipulate complex geometry that may be in excess of a billion triangles (e.g. bridges, skyscrapers, or a truck)
• AutoCAD 2019
• AMD FirePro W2100
• NVIDIA Quadro K420
What problems are GPUs suited to address?
• Image Processing
• Image-processing algorithms usually consume a lot of computing resources
• GPUs can accurately process millions of images
• This ability is used extensively in industries such as border control, security, and medical X-ray processing
A GPU Simulation Tool for Training and
Optimisation in 2D Digital X-Ray Imaging
The application was developed using CUDA, NVIDIA's GPGPU technology. An NVIDIA GeForce GTX 680 graphics unit was employed.
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0141497
What problems are GPUs suited to address?
• Big Data
• GPUs are used to depict data as interactive visualizations, and they integrate with other datasets to explore the volume and velocity of data
• They power up gene mapping by processing data and analyzing covariances to understand the relationships between different combinations of genes
GPU Accelerated Browser for Neuroimaging
Genomics
The GPU in use is the GeForce GTX Titan X. It has 12 GB of RAM and 3,072 cores.
https://link.springer.com/article/10.1007/s12021-018-9376-y
What problems are GPUs suited to address?
• Deep Machine Learning
• GPUs can process tons of training data and train neural
networks in areas like image and video analytics,
speech recognition and natural language processing,
self-driving cars, computer vision and much more.
NVIDIA Deep Learning Course: Class #1 –
Introduction to Deep Learning
https://www.youtube.com/watch?v=6eBpjEdgSm0
Programming the GPU
• Basic idea:
• GPUs are available as graphics cards, which must be
mounted into computer systems, and a runtime software
package must be available to drive the computations
• A graphics card has programmable processing units,
various types of memory and cache, and fixed-function
units for special graphics tasks
• The hardware's operation must be controlled by a program running on the host computer’s CPU through Application Programming Interfaces (APIs)
Programming the GPU
• Basic idea:
• Programs might be written and compiled from various
programming languages, some originally designed for
graphics (like Cg or HLSL) and some born by the
extension of generic programming languages (like
CUDA C)
• The programming environment also defines a
programming model or virtual parallel architecture
that reflects how programmable and fixed-function units
are interconnected
Programming the GPU
• Basic idea:
• Graphics APIs provide us with the view that the GPU is
a pipeline or a stream-processor since this is natural for
most of the graphics applications
• CUDA or OpenCL gives the illusion that the GPU is a
collection of multiprocessors
• Every multiprocessor is a wide SIMD processor
composed of scalar units, capable of executing the same
operation on different data
Programming the GPU
• Basic idea:
• The total number of scalar processors is the product of
the number of multiprocessors and the number of SIMD
scalar processors per multiprocessor, which can be well
over a thousand
• This huge number of processors can execute the same
program on different data
Programming the GPU
• Basic idea:
• All processors have some fast local memory, which is
only accessible to threads executed on the same
processor, i.e. to a thread block
• There is also global device memory, to and from which the host program can upload or download data
• This memory can be accessed from multiprocessors
through different caching and synchronization strategies
Programming the GPU
• Basic idea:
• The GPU favours the parallel execution of short,
coherent computations on compact pieces of data
• The main challenge of porting algorithms to the GPU is
that of parallelization and decomposition to independent
computational steps
• GPU programs, which perform such a step when
executed by the processing units, are often called
kernels
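The kernel idea above can be sketched in plain Python (not CUDA; the function and names are illustrative): a kernel is a short, coherent computation applied independently to each data element, so every invocation can run in parallel.

```python
# Illustrative sketch of the kernel idea in plain Python (not CUDA):
# one kernel invocation computes exactly one output element, with no
# dependence on other invocations, so all of them can run in parallel.

def saxpy_kernel(i: int, a: float, x: list, y: list, out: list) -> None:
    """One kernel invocation: computes a single output element."""
    out[i] = a * x[i] + y[i]

# The "launch" is just: run the kernel for every index. On a GPU the
# hardware performs these invocations in parallel; here we loop.
n = 4
x, y = [1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]
out = [0.0] * n
for i in range(n):
    saxpy_kernel(i, 2.0, x, y, out)
print(out)  # [12.0, 24.0, 36.0, 48.0]
```

The hard part of porting, as the slide says, is restructuring an algorithm so that each step really is this independent.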
Architecture of a CUDA-capable GPU
• It is organized into an array of highly threaded streaming multiprocessors (SMs)
• Two streaming multiprocessors (SMs) form a building block
• Each SM has a number of streaming processors (SPs) that share control logic and an instruction cache
• Each GPU comes with gigabytes of Graphics Double Data Rate (GDDR) SDRAM, referred to as Global Memory
A high level view of the architecture of a typical CUDA-capable GPU.
Terminology
• Like vector architectures, GPUs work well only
with data-level parallel problems
• A thread is associated with each data element
• Threads are organized into blocks
• A Thread Block is assigned to a processor
that executes the code
• Blocks are organized into a grid
David A Patterson and John L. Hennessy, Computer Architecture: A Quantitative Approach 6th Edition, 2017
Terminology
• The GPU hardware contains a collection of
multithreaded SIMD Processors (Streaming
Multiprocessors; SM) that execute a Grid of
Thread Blocks
• GPU hardware handles thread management,
not applications or OS
Terminology
• A GPU can have from one to several dozen
multithreaded SIMD Processors
• 2009: NVIDIA GeForce 210 (GT218) has 2
• 2016: NVIDIA GeForce GTX 1050 (GP107-300-A1) has 5
• 2018: NVIDIA GeForce RTX 2080 Ti (TU102-300A-K1-A1)
has 68
Terminology
• The machine object that the hardware creates,
manages, schedules, and executes is a thread
of SIMD instructions
• These are running on a multithreaded SIMD
Processor
• The SIMD Thread Scheduler sends them off to
a dispatch unit to be run on the multithreaded
SIMD Processor
Terminology
• Two levels of HW schedulers
• Thread Block Scheduler
• Assigns Thread Blocks to multithreaded SMs
• SIMD Thread Scheduler within an SM
• Schedules when threads of SIMD instructions should run
• Thread scheduling is strictly an implementation concept
Terminology
• The SIMD Processor must have parallel
functional units to perform the operation.
• We call them SIMD Lanes
• A block assigned to an SM is further divided into 32-thread units called warps.
• The warp size is implementation-specific.
• Warps are mapped to the SIMD (physical) lanes
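The mapping from a thread's index within its block to a warp and SIMD lane is simple integer arithmetic; a small Python sketch (32 is the warp size used throughout these slides):

```python
# Map a thread index within a block to its warp and SIMD lane
# (warp size 32, as on the NVIDIA GPUs discussed in these slides).
WARP_SIZE = 32

def warp_and_lane(thread_idx: int) -> tuple:
    """Return (warp id, lane id) for a thread index within its block."""
    return thread_idx // WARP_SIZE, thread_idx % WARP_SIZE

print(warp_and_lane(0))   # (0, 0)  - first thread, lane 0 of warp 0
print(warp_and_lane(95))  # (2, 31) - last lane of the third warp
```

All 32 threads of a warp occupy the SIMD lanes together, which is why they execute the same instruction on different data.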
Example A = B * C
• Suppose we want to multiply two vectors together, each 8192 elements long: A = B * C
• Code that works over all 8192 elements is the grid (the vectorized loop)
• Thread Blocks break this down into manageable sizes
• 512 threads per block
• A SIMD instruction executes 32 elements at a time
• The grid size is 16 blocks (8192/512)
Example A = B * C
• A thread block is assigned to a multithreaded
SIMD processor by the Thread Block
Scheduler
• The programmer tells the Thread Block
Scheduler, which is implemented in hardware,
how many Thread Blocks to run
• In this example, it would send 16 Thread
Blocks to multithreaded SIMD Processors to
compute all 8192 elements of this loop
Example A = B * C
• The SIMD instructions of these threads are 32
wide, so each thread of SIMD instructions in
this example would compute 32 of the
elements of the computation
• In this example, Thread Blocks would contain
512/32=16 SIMD Threads
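The decomposition arithmetic from this example can be checked directly (plain Python; the numbers are the ones stated on these slides):

```python
# Decomposition of the A = B * C example (numbers from the slides).
n_elements = 8192        # vector length
threads_per_block = 512  # Thread Block size chosen by the programmer
simd_width = 32          # elements computed by one SIMD instruction (warp width)

# Thread Blocks in the grid: each block covers 512 elements.
grid_size = n_elements // threads_per_block

# SIMD threads per block: each SIMD thread covers 32 elements.
simd_threads_per_block = threads_per_block // simd_width

print(grid_size)               # 16 Thread Blocks
print(simd_threads_per_block)  # 16 SIMD threads in each block
```

So the Thread Block Scheduler dispatches 16 blocks, and within each block the SIMD Thread Scheduler juggles 16 SIMD threads.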
Example A = B * C
[Figure: the 8192-element vector multiply mapped onto a grid of 16 Thread Blocks, each containing 16 SIMD threads that operate on 32 elements apiece.]
GPU Organization
Simplified block diagram of a multithreaded SIMD Processor.
Example: image blur
• Assume that a CUDA device allows up to 8
blocks and 1024 threads per SM, whichever
becomes a limitation first
• Furthermore, it allows up to 512 threads in
each block
• For image blur, should we use 8 × 8, 16 × 16,
or 32 × 32 thread blocks?
Wen-mei W. Hwu, David B. Kirk, Programming Massively Parallel Processors, 3rd Edition, 2016
Example: image blur
• CASE I: 8 x 8 thread blocks
• Each block would have only 64 threads
• We would need 1024/64 = 16 blocks to fully occupy an SM
• However, each SM allows at most 8 blocks; thus, we end up with only 64 × 8 = 512 threads in each SM
• This limited number implies that the SM execution
resources will likely be underutilized because fewer
warps will be available to schedule around long-latency
operations
Example: image blur
• CASE II: 16 x 16 thread blocks
• The 16 × 16 blocks result in 256 threads per block,
implying that each SM can take 1024/256 = 4 blocks
• This number is within the 8-block limit and is a good configuration, as it gives full thread capacity in each SM and a maximal number of warps to schedule around long-latency operations
Example: image blur
• CASE III: 32 x 32 thread blocks
• The 32 × 32 blocks would give 1024 threads in each
block, which exceeds the 512 threads per block
limitation of this device
• Conclusion: only 16 × 16 blocks allow the maximal number of threads to be assigned to each SM
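The three cases can be evaluated with a small helper (plain Python; the limits are those stated for this hypothetical device):

```python
# Evaluate square thread-block shapes against the device limits from the
# slides: at most 8 blocks and 1024 threads per SM, and 512 threads per block.
MAX_BLOCKS_PER_SM = 8
MAX_THREADS_PER_SM = 1024
MAX_THREADS_PER_BLOCK = 512

def threads_resident_per_sm(block_dim: int) -> int:
    """Threads actually resident on one SM for block_dim x block_dim blocks,
    or 0 if the shape cannot be launched on this device at all."""
    threads_per_block = block_dim * block_dim
    if threads_per_block > MAX_THREADS_PER_BLOCK:
        return 0  # exceeds the per-block limit; not launchable
    blocks = min(MAX_BLOCKS_PER_SM, MAX_THREADS_PER_SM // threads_per_block)
    return blocks * threads_per_block

print(threads_resident_per_sm(8))   # 512  - block-count limited (8 blocks x 64)
print(threads_resident_per_sm(16))  # 1024 - full occupancy (4 blocks x 256)
print(threads_resident_per_sm(32))  # 0    - 1024 threads/block exceeds 512
```

This reproduces the conclusion of the three cases: 16 × 16 is the only shape that fills the SM.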
GeForce 20 series
• It is a family of graphics processing units
developed by Nvidia
• Announced on August 20, 2018
• It is the successor to the GeForce 10 series
• It is based on the Turing microarchitecture and
features real-time ray tracing
https://en.wikipedia.org/wiki/GeForce_20_series
GeForce 20 series
• New features in Turing
• CUDA Compute Capability 7.5
• New Streaming Multiprocessor (SM)
• 50% improvement compared to Pascal
• Turing Tensor Cores
• Deep Learning Super Sampling (DLSS)
• Real-Time Ray Tracing Acceleration
• New Shading Advancements
• Mesh Shading, Variable Rate Shading, Texture-Space Shading, …
• GDDR6 High-Performance Memory Subsystem
• Second-Generation NVIDIA NVLink
https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf
Turing TU102 GPU
• 4,608 CUDA Cores
• 72 RT Cores
• 576 Tensor Cores
• 288 texture units
• Twelve 32-bit GDDR6 memory controllers (384 bits total)
Turing TU102 GPU
• The Turing SM is partitioned into
• Four processing blocks
• Each with 16 FP32 Cores
• 16 INT32 Cores
• Two Tensor Cores
• One warp scheduler
• One dispatch unit
• Each block includes a new L0
instruction cache and a 64 KB register
file
• The four processing blocks share a
combined 96 KB L1 data cache/shared
memory
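The per-SM counts above are consistent with the full-chip totals on the previous slide; a quick cross-check:

```python
# Cross-check the TU102 numbers from these slides.
blocks_per_sm = 4      # processing blocks in each Turing SM
fp32_per_block = 16    # FP32 (CUDA) cores per processing block
tensor_per_block = 2   # Tensor Cores per processing block

fp32_per_sm = blocks_per_sm * fp32_per_block  # 64 CUDA cores per SM
total_cuda_cores = 4608                       # from the TU102 slide
num_sms = total_cuda_cores // fp32_per_sm     # SMs on the full chip

print(fp32_per_sm)  # 64
print(num_sms)      # 72
print(num_sms * blocks_per_sm * tensor_per_block)  # 576 Tensor Cores, matching the slide
```

The 72 RT Cores also follow: one per SM.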
Turing TU102 GPU