NVIDIA Fermi Architecture Joseph Kider University of Pennsylvania CIS 565 - Fall 2011

NVIDIA Fermi Architecture - Penn Engineering · 2011. 11. 4. · Fermi: Unified Address Space 64-bit virtual addresses 40-bit physical addresses (currently) CUDA 4: Shared address

NVIDIA Fermi Architecture

Joseph Kider
University of Pennsylvania
CIS 565 - Fall 2011

Administrivia

Project checkpoint on Monday

Sources

Patrick Cozzi, Spring 2011
NVIDIA CUDA Programming Guide
CUDA by Example
Programming Massively Parallel Processors

G80, GT200, and Fermi

November 2006: G80
June 2008: GT200
March 2010: Fermi (GF100)

Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

New GPU Generation

What are the technical goals for a new GPU generation?

Improve existing application performance. How?
Advance programmability. In what ways?

Fermi: What’s More?

More total cores (SPs) – not SMs though
More registers: 32K per SM
More shared memory: up to 48 KB per SM
More Special Function Units (SFUs)

Fermi: What’s Faster?

Faster double precision – 8x over GT200
Faster atomic operations. What for?
  5-20x
Faster context switches
  Between applications – 10x
  Between graphics and compute, e.g., OpenGL and CUDA

Fermi: What’s New?

L1 and L2 caches
  For compute or graphics?
Dual warp scheduling
Concurrent kernel execution
C++ support
Full IEEE 754-2008 support in hardware
Unified address space
Error Correcting Code (ECC) memory support
Fixed function tessellation for graphics

G80, GT200, and Fermi

Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

G80, GT200, and Fermi

Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

GT200 and Fermi

Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

Fermi Block Diagram

Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

GF100
  16 SMs, each with 32 cores
  512 total cores
Each SM hosts up to 48 warps, or 1,536 threads
In flight, up to 24,576 threads
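
The totals on this slide follow from three per-chip figures. A minimal sketch of the arithmetic, assuming the standard CUDA warp size of 32 threads (not stated on the slide); the constant names are illustrative only:

```python
# Figures from the slide, except WARP_SIZE, which is an assumption
# (32 threads per warp is the standard CUDA warp size).
WARP_SIZE = 32
SMS = 16
CORES_PER_SM = 32
MAX_WARPS_PER_SM = 48

total_cores = SMS * CORES_PER_SM               # cores on the whole chip
threads_per_sm = MAX_WARPS_PER_SM * WARP_SIZE  # resident threads on one SM
threads_in_flight = SMS * threads_per_sm       # resident threads on the chip
threads_per_core = threads_per_sm // CORES_PER_SM

print(total_cores, threads_per_sm, threads_in_flight, threads_per_core)
```

Each SM can hold far more resident threads than it has cores (48 per core under these numbers), which is what gives the schedulers other warps to issue while one warp waits on memory.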

Fermi SM

Why 32 cores per SM instead of 8?
Why not more SMs?

G80 – 8 cores
GT200 – 8 cores
GF100 – 32 cores

Fermi SM

Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

Dual warp scheduling
  Why?
32K registers
32 cores
  Floating point and integer unit per core
16 load/store units
4 SFUs

Fermi SM

Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

16 SMs * 32 cores/SM = 512 floating point operations per cycle
Why not in practice?

Fermi SM

Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

Each SM: 64 KB on-chip memory
  48 KB shared memory / 16 KB L1 cache, or
  16 KB shared memory / 48 KB L1 cache

Configurable by CUDA developer

Fermi Dual Warp Scheduling

Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

Slide from: http://gpgpu.org/wp/wp-content/uploads/2009/11/SC09_CUDA_luebke_Intro.pdf

Fermi Caches

Slide from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

Fermi Caches

Slide from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

Fermi: Unified Address Space

Fermi: Unified Address Space

64-bit virtual addresses
40-bit physical addresses (currently)
CUDA 4: Shared address space with CPU. Why?
  No explicit CPU/GPU copies
  Direct GPU-GPU copies
  Direct I/O device to GPU copies
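
For a rough sense of scale, the two address widths above can be turned into sizes. This is only a sketch and assumes byte addressing; the helper name `addressable_bytes` is hypothetical:

```python
TIB = 1 << 40   # one tebibyte in bytes
EIB = 1 << 60   # one exbibyte in bytes


def addressable_bytes(address_bits):
    """Number of distinct byte addresses an address of this width can name."""
    return 1 << address_bits


physical = addressable_bytes(40)   # 40-bit physical addresses
virtual = addressable_bytes(64)    # 64-bit virtual addresses

print(physical // TIB, "TiB physical")
print(virtual // EIB, "EiB virtual")
```

Under that assumption, 40 physical bits cover 1 TiB and 64 virtual bits cover 16 EiB, which leaves room to map CPU and GPU memory into one address space.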

Fermi ECC

ECC protected
  Register file, L1, L2, DRAM
Uses redundancy to ensure data integrity against cosmic rays flipping bits
  For example, 64 bits are stored as 72 bits
Corrects single-bit errors, detects double-bit errors
What are the applications?
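
To make the 64-to-72-bit example concrete, here is a sketch of one textbook way to get that ratio: an extended Hamming code with 7 check bits plus 1 overall parity bit. The slides do not say which code Fermi's hardware uses, so treat this as an illustration of the idea and not as the GPU's actual scheme; all function names are hypothetical.

```python
DATA_BITS = 64
CHECK_BITS = 7                  # smallest r with 2**r >= 64 + r + 1
N = DATA_BITS + CHECK_BITS      # Hamming positions 1..71; index 0 is overall parity


def _is_check_position(pos):
    """Powers of two (1, 2, 4, ..., 64) hold the Hamming check bits."""
    return (pos & (pos - 1)) == 0


def _syndrome(bits):
    """XOR of the positions (1..N) of all set bits; 0 for a valid codeword."""
    s = 0
    for pos in range(1, N + 1):
        if bits[pos]:
            s ^= pos
    return s


def encode(data):
    """Spread 64 data bits over a 72-entry list of 0/1 values."""
    bits = [0] * (N + 1)
    d = 0
    for pos in range(1, N + 1):
        if not _is_check_position(pos):
            bits[pos] = (data >> d) & 1
            d += 1
    s = _syndrome(bits)                  # check bits are still zero here
    for p in range(CHECK_BITS):
        bits[1 << p] = (s >> p) & 1      # forces the syndrome to zero
    bits[0] = sum(bits[1:]) & 1          # overall even parity
    return bits


def decode(bits):
    """Return (data, status); data is None when the error is uncorrectable."""
    bits = list(bits)
    s = _syndrome(bits)
    odd_parity = sum(bits) & 1
    if s == 0 and not odd_parity:
        status = "ok"
    elif odd_parity and s <= N:
        bits[s] ^= 1                     # s == 0: the parity bit itself flipped
        status = "corrected"
    else:
        # Even parity with a nonzero syndrome means two bits flipped;
        # an out-of-range syndrome means even more did.
        return None, "uncorrectable"
    data, d = 0, 0
    for pos in range(1, N + 1):
        if not _is_check_position(pos):
            data |= bits[pos] << d
            d += 1
    return data, status


word = 0xDEADBEEFCAFEF00D
codeword = encode(word)
print(len(codeword))                     # 72

one_flip = list(codeword)
one_flip[17] ^= 1                        # simulate a single flipped bit
print(decode(one_flip))

two_flips = list(codeword)
two_flips[17] ^= 1
two_flips[40] ^= 1                       # simulate two flipped bits
print(decode(two_flips))
```

A single flipped bit is located by the syndrome and repaired. Two flipped bits leave the overall parity even while the syndrome is nonzero, so the word is flagged as bad but not repaired.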

Fermi Tessellation

Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

Fermi Tessellation

Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

Fermi Tessellation

Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

Fixed function hardware on each SM for graphics

  Texture filtering
  Texture cache
  Tessellation
  Vertex Fetch / Attribute Setup
  Stream Output
  Viewport Transform
Why?

Observations

Becoming easier to port CPU code to the GPU

Recursion, fast atomics, L1/L2 caches, faster global memory

In fact… GPUs are starting to look like CPUs

Beefier SMs, L1 and L2 caches, dual warp scheduling, double precision, fast atomics