Graphics Processing Unit
What is a GPU ?
• Graphics Processing Units (GPUs)
• Highly parallel, multithreaded, many-core processors
• Capable of very high computation and data throughput
• Once designed specifically for computer graphics and programmable only through graphics APIs
• Primarily used to manage and accelerate video and graphics performance
• 2D/3D graphics rendering
• Digital output to display monitors
What is a GPU ?
• Today’s GPUs
• General-purpose parallel processors (GPGPU)
• Support accessible programming interfaces and industry-standard languages such as C
• Developers who port their applications to GPUs often achieve speedups of orders of magnitude vs. optimized CPU implementations
• High floating-point performance
• High peak memory bandwidth
What is a GPU ?
• Today’s GPUs
• Specialized for compute-intensive, highly parallel computation
• More transistors are devoted to data processing rather than data caching and flow control
• Developer’s point of view
• Hardware latencies are not hidden; they must be managed explicitly
• Writing an efficient GPU program is not possible without knowledge of the architecture
What is a GPU ?
• Today’s GPUs
• A GPU is not only used in a PC, on a video card or motherboard; GPUs also appear in
• Mobile phones
• Display adapters
• Workstations and game consoles
• VPU (Visual Processing Unit)
• A class of processor intended to accelerate machine learning and artificial intelligence workloads
• Better suited to running various machine vision algorithms
History
• August 31, 1999 – Nvidia releases the GeForce 256
• The introduction of the Graphics Processing Unit (GPU) to the PC industry
• The technical definition of a GPU is “a single-chip processor with integrated transform, lighting, triangle setup/clipping, and rendering engines that is capable of processing a minimum of 10 million polygons per second.”
John Manning, Graphic Processing Units
History
• In the 1999–2000 timeframe
• Computer scientists and domain scientists from various fields started using GPUs to accelerate a range of scientific applications
• While users achieved unprecedented performance (over 100x compared to CPUs in some cases), the challenge was that GPGPU required the use of graphics programming APIs such as OpenGL and Cg to program the GPU
• This limited access to the tremendous capabilities of GPUs for science
Architecture - concept
[Figure: CPU vs. GPU block diagrams — the CPU devotes die area to control logic, a large cache, and a few ALUs; the GPU devotes most of its area to many small ALUs with minimal control and cache. Both connect to DRAM.]
Multicore CPU
• Optimized for “sequential” programs
• Sophisticated control logic
• Large on-chip caches to reduce long memory-access latencies
• The execution latency of each thread is reduced
• Both consume chip area and power
• Latency-oriented design
• Many applications are limited by the speed at which data can be moved from memory to the processor
Many-core GPU
• Massive numbers of floating-point calculations
• Driven by the video game industry
• Maximize the chip area dedicated to floating-point calculations
• Optimized for the execution of a massive number of threads
• Pipelined memory channels
• More cores on a chip to increase execution throughput
Many-core GPU
• Massive numbers of floating-point calculations
• A large pool of threads lets the hardware find work to do while some threads wait on long-latency memory accesses or arithmetic operations
• Small caches are provided to help control bandwidth requirements, so multiple threads that access the same memory do not all need to go to DRAM
• Throughput-oriented design
• The goal is to maximize total execution throughput
• Individual threads may take much longer to finish their computation
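The latency-hiding idea above can be quantified with a Little's-law-style estimate (a simplified model for illustration, not from the slides): the number of threads needed to keep an execution unit busy is roughly the memory latency times the issue rate.

```python
import math

# Simplified latency-hiding model (illustrative numbers, not a real GPU spec).
# By Little's law: concurrency needed = latency * throughput.

def threads_to_hide_latency(latency_cycles: float, issue_rate_per_cycle: float) -> int:
    """Threads in flight needed so the pipeline never stalls: while one
    thread waits `latency_cycles` cycles for memory, the scheduler must
    have enough other ready threads to keep issuing work every cycle."""
    return math.ceil(latency_cycles * issue_rate_per_cycle)

# Example: a 400-cycle memory latency with 1 operation issued per cycle
# requires ~400 ready threads to keep the unit busy.
print(threads_to_hide_latency(400, 1.0))  # 400
```

This is why a throughput-oriented design wants far more resident threads than execution lanes: the surplus threads are what hide the long latencies.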
CPU vs. GPU
• CPUs can do general-purpose work
• For programs that have one or very few threads, CPUs with lower operation latencies can achieve much higher performance than GPUs
• Is the CPU slower at ray tracing?
• Embree kernel framework
• https://embree.github.io/papers/2014-Siggraph-Embree.pdf
• The GPU is best at focusing all of its computing ability on a specific task
• A GPU uses thousands of smaller, more efficient cores in a massively parallel architecture aimed at handling many operations at the same time
• GPUs can be 50–100 times faster in tasks that require many parallel processes
What problems are GPUs suited to address?
• Games
• Graphics-intensive rendering of the game world
• The workloads of modern games have become too heavy for CPU-based graphics solutions
Shadow of the Tomb Raider (September 2018): minimum an Intel Core i3-3220, 8GB of RAM, and an NVIDIA GeForce GTX 660/1050 or an AMD Radeon HD 7770 graphics card. The company recommends a beefier Core i7-4770K or Ryzen 5 1600 processor and 16GB of RAM for a smoother experience, with the GPU requirement jumping up to a GTX 1060 6GB or RX 480.
Read more: https://www.tweaktown.com/news/63013/shadow-tomb-raider-pc-requirements-released/index.html
https://medium.com/altumea/gpu-vs-cpu-computing-what-to-choose-a9788a2370c4
What problems are GPUs suited to address?
• 3D Visualization
• Computer-aided design (CAD)
• The need to visualize objects in 3D in real time as you rotate or move them
• Workstation graphics cards can manipulate complex geometry that may be in excess of a billion triangles (e.g. bridges, skyscrapers, or a truck)
• AutoCAD 2019
• AMD FirePro W2100
• NVIDIA Quadro K420
What problems are GPUs suited to address?
• Image Processing
• Image-processing algorithms usually consume a lot of computing resources
• GPUs can accurately process millions of images
• This ability is used extensively in industries such as border control, security, and medical X-ray processing
A GPU Simulation Tool for Training and
Optimisation in 2D Digital X-Ray Imaging
The application was developed using CUDA, NVIDIA's GPGPU technology. An NVIDIA GeForce GTX 680 graphics unit was employed.
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0141497
What problems are GPUs suited to address?
• Big Data
• GPUs are used to depict data as interactive visualizations, and they integrate with other datasets to explore the volume and velocity of data
• They power up gene mapping by processing data and analyzing covariances to understand the relationships between different combinations of genes
GPU Accelerated Browser for Neuroimaging
Genomics
The GPU in use is the GeForce GTX Titan X. It has 12 GB of RAM and 3,072 cores.
https://link.springer.com/article/10.1007/s12021-018-9376-y
What problems are GPUs suited to address?
• Deep Machine Learning
• GPUs can process tons of training data and train neural
networks in areas like image and video analytics,
speech recognition and natural language processing,
self-driving cars, computer vision and much more.
NVIDIA Deep Learning Course: Class #1 –
Introduction to Deep Learning
https://www.youtube.com/watch?v=6eBpjEdgSm0
Programming the GPU
• Basic idea:
• GPUs are available as graphics cards, which must be
mounted into computer systems, and a runtime software
package must be available to drive the computations
• A graphics card has programmable processing units,
various types of memory and cache, and fixed-function
units for special graphics tasks
• The hardware's operation must be controlled by a program running on the host computer’s CPU through Application Programming Interfaces (APIs)
Programming the GPU
• Basic idea:
• Programs might be written and compiled from various
programming languages, some originally designed for
graphics (like Cg or HLSL) and some born by the
extension of generic programming languages (like
CUDA C)
• The programming environment also defines a
programming model or virtual parallel architecture
that reflects how programmable and fixed-function units
are interconnected
Programming the GPU
• Basic idea:
• Graphics APIs provide us with the view that the GPU is
a pipeline or a stream-processor since this is natural for
most of the graphics applications
• CUDA or OpenCL gives the illusion that the GPU is a
collection of multiprocessors
• Every multiprocessor is a wide SIMD processor
composed of scalar units, capable of executing the same
operation on different data
Programming the GPU
• Basic idea:
• The total number of scalar processors is the product of
the number of multiprocessors and the number of SIMD
scalar processors per multiprocessor, which can be well
over a thousand
• This huge number of processors can execute the same
program on different data
Programming the GPU
• Basic idea:
• All processors have some fast local memory, which is
only accessible to threads executed on the same
processor, i.e. to a thread block
• There is also global device memory, to and from which the host program can upload or download data
• This memory can be accessed from multiprocessors
through different caching and synchronization strategies
Programming the GPU
• Basic idea:
• The GPU favours the parallel execution of short,
coherent computations on compact pieces of data
• The main challenge of porting algorithms to the GPU is
that of parallelization and decomposition to independent
computational steps
• GPU programs, which perform such a step when
executed by the processing units, are often called
kernels
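The kernel idea above can be sketched in plain Python (not CUDA; the function and names are illustrative): a kernel is a short, coherent computation applied independently to each data element, so every invocation can run in parallel.

```python
# Illustrative sketch of the kernel idea in plain Python (not CUDA):
# one kernel invocation computes exactly one output element, with no
# dependence on other invocations, so all of them can run in parallel.

def saxpy_kernel(i: int, a: float, x: list, y: list, out: list) -> None:
    """One kernel invocation: computes a single output element."""
    out[i] = a * x[i] + y[i]

# The "launch" is just: run the kernel for every index. On a GPU the
# hardware performs these invocations in parallel; here we loop.
n = 4
x, y = [1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]
out = [0.0] * n
for i in range(n):
    saxpy_kernel(i, 2.0, x, y, out)
print(out)  # [12.0, 24.0, 36.0, 48.0]
```

The hard part of porting, as the slide says, is restructuring an algorithm so that each step really is this independent.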
Architecture of a CUDA-capable GPU
• It is organized into an array of highly threaded streaming multiprocessors (SMs)
• Two streaming multiprocessors (SMs) form a building block
• Each SM has a number of streaming processors (SPs) that share control logic and an instruction cache
• Each GPU comes with gigabytes of Graphics Double Data Rate (GDDR) SDRAM, referred to as Global Memory
A high level view of the architecture of a typical CUDA-capable GPU.
Terminology
• Like vector architectures, GPUs work well only
with data-level parallel problems
• A thread is associated with each data element
• Threads are organized into blocks
• A Thread Block is assigned to a processor
that executes the code
• Blocks are organized into a grid
David A Patterson and John L. Hennessy, Computer Architecture: A Quantitative Approach 6th Edition, 2017
Terminology
• The GPU hardware contains a collection of
multithreaded SIMD Processors (Streaming
Multiprocessors; SM) that execute a Grid of
Thread Blocks
• GPU hardware handles thread management,
not applications or OS
Terminology
• A GPU can have from one to several dozen
multithreaded SIMD Processors
• 2009: NVIDIA GeForce 210 (GT218) has 2
• 2016: NVIDIA GeForce GTX 1050 (GP107-300-A1) has 5
• 2018: NVIDIA GeForce RTX 2080 Ti (TU102-300A-K1-A1)
has 68
Terminology
• The machine object that the hardware creates,
manages, schedules, and executes is a thread
of SIMD instructions
• These are running on a multithreaded SIMD
Processor
• The SIMD Thread Scheduler sends them off to
a dispatch unit to be run on the multithreaded
SIMD Processor
Terminology
• Two levels of HW schedulers
• Thread Block Scheduler
• Assigns Thread Blocks to multithreaded SMs
• SIMD Thread Scheduler within an SM
• Schedules when threads of SIMD instructions should run
• Thread scheduling is strictly an implementation concept
Terminology
• The SIMD Processor must have parallel
functional units to perform the operation.
• We call them SIMD Lanes
• A block assigned to an SM is further divided into 32-thread units called warps.
• The warp size is implementation-specific.
• Warps are mapped to the SIMD (physical) lanes
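The mapping from a thread's index within its block to a warp and SIMD lane is simple integer arithmetic; a small Python sketch (32 is the warp size used throughout these slides):

```python
# Map a thread index within a block to its warp and SIMD lane
# (warp size 32, as on the NVIDIA GPUs discussed in these slides).
WARP_SIZE = 32

def warp_and_lane(thread_idx: int) -> tuple:
    """Return (warp id, lane id) for a thread index within its block."""
    return thread_idx // WARP_SIZE, thread_idx % WARP_SIZE

print(warp_and_lane(0))   # (0, 0)  - first thread, lane 0 of warp 0
print(warp_and_lane(95))  # (2, 31) - last lane of the third warp
```

All 32 threads of a warp occupy the SIMD lanes together, which is why they execute the same instruction on different data.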
Example A = B * C
• Suppose we want to multiply two vectors together, each 8192 elements long: A = B * C
• Code that works over all 8192 elements is the grid (the vectorized loop)
• Thread Blocks break this down into manageable sizes
• 512 threads per block
• A SIMD instruction executes 32 elements at a time
• The grid size is 16 blocks (8192/512)
Example A = B * C
• A thread block is assigned to a multithreaded
SIMD processor by the Thread Block
Scheduler
• The programmer tells the Thread Block
Scheduler, which is implemented in hardware,
how many Thread Blocks to run
• In this example, it would send 16 Thread
Blocks to multithreaded SIMD Processors to
compute all 8192 elements of this loop
Example A = B * C
• The SIMD instructions of these threads are 32
wide, so each thread of SIMD instructions in
this example would compute 32 of the
elements of the computation
• In this example, Thread Blocks would contain
512/32=16 SIMD Threads
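The decomposition arithmetic from this example can be checked directly (plain Python; the numbers are the ones stated on these slides):

```python
# Decomposition of the A = B * C example (numbers from the slides).
n_elements = 8192        # vector length
threads_per_block = 512  # Thread Block size chosen by the programmer
simd_width = 32          # elements computed by one SIMD instruction (warp width)

# Thread Blocks in the grid: each block covers 512 elements.
grid_size = n_elements // threads_per_block

# SIMD threads per block: each SIMD thread covers 32 elements.
simd_threads_per_block = threads_per_block // simd_width

print(grid_size)               # 16 Thread Blocks
print(simd_threads_per_block)  # 16 SIMD threads in each block
```

So the Thread Block Scheduler dispatches 16 blocks, and within each block the SIMD Thread Scheduler juggles 16 SIMD threads.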
Example A = B * C
[Figure: the 8192-element vector multiply mapped onto a grid of 16 Thread Blocks, each containing 16 SIMD threads that operate on 32 elements apiece.]
GPU Organization
Simplified block diagram of a multithreaded SIMD Processor.
Example: image blur
• Assume that a CUDA device allows up to 8
blocks and 1024 threads per SM, whichever
becomes a limitation first
• Furthermore, it allows up to 512 threads in
each block
• For image blur, should we use 8 × 8, 16 × 16,
or 32 × 32 thread blocks?
Wen-mei W. Hwu, David B. Kirk, Programming Massively Parallel Processors, 3rd Edition, 2016
Example: image blur
• CASE I: 8 x 8 thread blocks
• Each block would have only 64 threads
• We would need 1024/64 = 16 blocks to fully occupy an SM
• However, each SM allows at most 8 blocks; thus, we end up with only 64 × 8 = 512 threads in each SM
• This limited number implies that the SM execution
resources will likely be underutilized because fewer
warps will be available to schedule around long-latency
operations
Example: image blur
• CASE II: 16 x 16 thread blocks
• The 16 × 16 blocks result in 256 threads per block,
implying that each SM can take 1024/256 = 4 blocks
• This number is within the 8-block limit and is a good configuration, as it gives full thread capacity in each SM and a maximal number of warps to schedule around long-latency operations
Example: image blur
• CASE III: 32 x 32 thread blocks
• The 32 × 32 blocks would give 1024 threads in each
block, which exceeds the 512 threads per block
limitation of this device
• Conclusion: only 16 × 16 blocks allow the maximal number of threads to be assigned to each SM
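The three cases can be evaluated with a small helper (plain Python; the limits are those stated for this hypothetical device):

```python
# Evaluate square thread-block shapes against the device limits from the
# slides: at most 8 blocks and 1024 threads per SM, and 512 threads per block.
MAX_BLOCKS_PER_SM = 8
MAX_THREADS_PER_SM = 1024
MAX_THREADS_PER_BLOCK = 512

def threads_resident_per_sm(block_dim: int) -> int:
    """Threads actually resident on one SM for block_dim x block_dim blocks,
    or 0 if the shape cannot be launched on this device at all."""
    threads_per_block = block_dim * block_dim
    if threads_per_block > MAX_THREADS_PER_BLOCK:
        return 0  # exceeds the per-block limit; not launchable
    blocks = min(MAX_BLOCKS_PER_SM, MAX_THREADS_PER_SM // threads_per_block)
    return blocks * threads_per_block

print(threads_resident_per_sm(8))   # 512  - block-count limited (8 blocks x 64)
print(threads_resident_per_sm(16))  # 1024 - full occupancy (4 blocks x 256)
print(threads_resident_per_sm(32))  # 0    - 1024 threads/block exceeds 512
```

This reproduces the conclusion of the three cases: 16 × 16 is the only shape that fills the SM.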
GeForce 20 series
• It is a family of graphics processing units
developed by Nvidia
• Announced on August 20, 2018
• It is the successor to the GeForce 10 series
• It is based on the Turing microarchitecture and
features real-time ray tracing
https://en.wikipedia.org/wiki/GeForce_20_series
GeForce 20 series
• New features in Turing
• CUDA Compute Capability 7.5
• New Streaming Multiprocessor (SM)
• 50% improvement compared to Pascal
• Turing Tensor Cores
• Deep Learning Super Sampling (DLSS)
• Real-Time Ray Tracing Acceleration
• New Shading Advancements
• Mesh Shading, Variable Rate Shading, Texture-Space Shading, …
• GDDR6 High-Performance Memory Subsystem
• Second-Generation NVIDIA NVLink
https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf
Turing TU102 GPU
• 4,608 CUDA Cores
• 72 RT Cores
• 576 Tensor Cores
• 288 texture units
• Twelve 32-bit GDDR6 memory controllers (384 bits total)
Turing TU102 GPU
• The Turing SM is partitioned into
• Four processing blocks
• Each with 16 FP32 Cores
• 16 INT32 Cores
• Two Tensor Cores
• One warp scheduler
• One dispatch unit
• Each block includes a new L0
instruction cache and a 64 KB register
file
• The four processing blocks share a
combined 96 KB L1 data cache/shared
memory
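The per-SM counts above are consistent with the full-chip totals on the previous slide; a quick cross-check:

```python
# Cross-check the TU102 numbers from these slides.
blocks_per_sm = 4      # processing blocks in each Turing SM
fp32_per_block = 16    # FP32 (CUDA) cores per processing block
tensor_per_block = 2   # Tensor Cores per processing block

fp32_per_sm = blocks_per_sm * fp32_per_block  # 64 CUDA cores per SM
total_cuda_cores = 4608                       # from the TU102 slide
num_sms = total_cuda_cores // fp32_per_sm     # SMs on the full chip

print(fp32_per_sm)  # 64
print(num_sms)      # 72
print(num_sms * blocks_per_sm * tensor_per_block)  # 576 Tensor Cores, matching the slide
```

The 72 RT Cores also follow: one per SM.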
Turing TU102 GPU