
7/14/10


| Heterogeneous Computing -> Fusion | June 2010 1

Heterogeneous Computing -> Fusion

Phil Rogers AMD Corporate Fellow

| Heterogeneous Computing -> Fusion | June 2010 2

Definitions

  Heterogeneous Computing

– A system comprised of two or more compute engines with significant structural differences

– In our case, a low latency x86 CPU and a high throughput Radeon GPU

  Fusion

– Bringing together two or more components and joining them into a single unified whole

– In our case, combining CPUs and GPUs on a single silicon die for higher performance and lower power

| Heterogeneous Computing -> Fusion | June 2010 3

AMD Balanced Platform Advantage

Other Highly Parallel Workloads

Graphics Workloads

Serial/Task-Parallel Workloads

CPU is ideal for scalar processing

  Out of order x86 cores with low latency memory access

  Optimized for sequential and branching algorithms

  Runs existing applications very well

GPU is ideal for parallel processing

  GPU shaders optimized for throughput computing

  Ready for emerging workloads

  Media processing, simulation, natural UI, etc.

| Heterogeneous Computing -> Fusion | June 2010 4

Three Eras of Processor Performance

Single-Core Era

[Chart: Single-thread Performance vs. Time, annotated "?" and "we are here"]

Enabled by: Moore’s Law, Voltage Scaling, Microarchitecture

Constrained by: Power, Complexity

Multi-Core Era

[Chart: Throughput Performance vs. Time (# of Processors), annotated "we are here"]

Enabled by: Moore’s Law, Desire for Throughput, 20 years of SMP architecture

Constrained by: Power, Parallel SW availability, Scalability

Heterogeneous Systems Era

[Chart: Targeted Application Performance vs. Time (Data-parallel exploitation), annotated "we are here"]

Enabled by: Moore’s Law, Abundant data parallelism, Power efficient GPUs

Temporarily constrained by: Programming models, Communication overheads


| Heterogeneous Computing -> Fusion | June 2010 5

Emerging Application Spaces

| Category | Characteristics | Application Examples |
|---|---|---|
| Massive Data Mining | Full 64b addressing; huge data sets; new data types | Image, video, audio processing; pattern analytics and search |
| Natural User Interfaces | Massive “behind-the-scenes” computing | Face and gesture recognition; real time video & audio processing; physical world interpretation |
| Visualization | Advanced rendering; interactive physics | Multi-layered graphics; holographic displays; scientific visualization & CAD; next generation gaming |
| Cloud + Client Applications | Seamless responsiveness; workload partitioning | Next generation browsers; HTML5; apps with native code from JavaScript |

| Heterogeneous Computing -> Fusion | June 2010 6

GPU SP ALU Performance

[Chart: series labeled HD4870, HD5870, and CPU]

| Heterogeneous Computing -> Fusion | June 2010 7

GPU DP ALU Performance

[Chart: series labeled HD4870, HD5870, and CPU]

| Heterogeneous Computing -> Fusion | June 2010 8

GPU BW Performance expectations over time

[Chart: axis ticks from 0 to 300 in steps of 50; data points labeled HD4870 and HD5870]


| Heterogeneous Computing -> Fusion | June 2010 9

GPU Computing Efficiency Trend

[Chart: y-axis GFLOPS/W; labeled data point: 14.47 GFLOPS/W]

| Heterogeneous Computing -> Fusion | June 2010 10

Thread Processors

5-way VLIW Architecture

4 Stream Cores and 1 Special Function Stream Core

Separate Branch Unit

All 5 cores co-issue

Scheduling across the cores is done by the compiler

Each core delivers a 32-bit result per clock

Thread Processor writes 5 results per clock
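To make "scheduling across the cores is done by the compiler" concrete, here is a toy sketch of packing independent operations into 5-wide bundles. The operation names, the dependency format, and the greedy strategy are illustrative assumptions; this is not AMD's shader compiler, and it treats all five cores as interchangeable, which ignores the special function core's distinct role.

```python
SLOTS = 5  # 4 stream cores + 1 special function stream core (from the slide)

def pack_bundles(ops, deps, slots=SLOTS):
    """Greedily pack ops into bundles of at most `slots` ops.

    ops:  list of op names in program order (hypothetical names)
    deps: dict mapping an op to the set of ops whose results it needs
    An op may issue only after all its dependencies issued in an
    earlier bundle.
    """
    done = set()
    remaining = list(ops)
    bundles = []
    while remaining:
        bundle = []
        for op in remaining:
            if len(bundle) == slots:
                break
            if deps.get(op, set()) <= done:
                bundle.append(op)
        if not bundle:
            raise ValueError("cyclic or unsatisfiable dependencies")
        done.update(bundle)
        remaining = [op for op in remaining if op not in bundle]
        bundles.append(bundle)
    return bundles

# Five independent ops, then a two-step dependent chain.
ops = ["a", "b", "c", "d", "e", "f", "g"]
deps = {"f": {"a", "b"}, "g": {"f"}}
bundles = pack_bundles(ops, deps)
utilization = len(ops) / (len(bundles) * SLOTS)
print(bundles, utilization)
```

In this example the five independent operations fill one bundle, while the dependent chain issues one operation per bundle, so slot utilization depends on how much independent work the compiler can find.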

| Heterogeneous Computing -> Fusion | June 2010 11

SIMD Engines

• Diagram shows 2 SIMD Engines

• Each SIMD Unit includes:

  – 16 Thread Processors (80 shader cores) + 32KB Local Data Share

  – Its own Thread Sequencer which operates a shared set of threads

  – A dedicated fetch unit with an 8KB L1 cache

| Heterogeneous Computing -> Fusion | June 2010 12

ATI Radeon™ HD 5870 Compute Architecture

• 20 SIMD Engines

  – 1600 shader cores

• Ultra-Threaded Dispatch Processor

• Instruction and Constant Caches

• Memory Export Buffer

• Fetch path with multi-level caches

• Global Data Store
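The core counts on the Thread Processors, SIMD Engines, and HD 5870 slides multiply out as follows. The engine clock and the flops-per-core figure are assumptions that do not appear on these slides; with them, the result can be compared against the SP GFlops figure in the deck's comparison table.

```python
SIMD_ENGINES = 20             # from the HD 5870 slide
THREAD_PROCS_PER_SIMD = 16    # from the SIMD Engines slide
CORES_PER_THREAD_PROC = 5     # 4 stream cores + 1 special function core

# Assumptions not stated on these slides:
ENGINE_CLOCK_GHZ = 0.85       # assumed HD 5870 engine clock
FLOPS_PER_CORE_PER_CLOCK = 2  # assumed: one multiply-add counted as 2 flops

cores_per_simd = THREAD_PROCS_PER_SIMD * CORES_PER_THREAD_PROC
shader_cores = SIMD_ENGINES * cores_per_simd
peak_sp_gflops = shader_cores * FLOPS_PER_CORE_PER_CLOCK * ENGINE_CLOCK_GHZ

print(cores_per_simd, shader_cores, round(peak_sp_gflops))
```

Under these assumptions the totals agree with the 80 shader cores per SIMD, 1600 shader cores, and 2720 SP GFlops quoted elsewhere in the deck.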


| Heterogeneous Computing -> Fusion | June 2010 13

TeraScale 2 Architecture – Radeon HD 5870

| Heterogeneous Computing -> Fusion | June 2010 14

Memory Hierarchy

• Distributed Memory Controller

• Optimized for latency hiding and memory access efficiency

• GDDR5 memory at 150GB/s

• Up to 272 billion 32-bit fetches/second

• Up to 1 TB/sec L1 texture fetch bandwidth

• Up to 435 GB/sec between L1 & L2
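The fetch-rate and L1 bandwidth figures above can be cross-checked with simple arithmetic. This sketch uses decimal units (1 GB = 10^9 bytes); the 850 MHz engine clock and the count of 20 SIMD engines used for the per-clock figure are assumptions brought in from outside this slide.

```python
FETCHES_PER_SEC = 272e9   # "up to 272 billion 32-bit fetches/second"
BYTES_PER_FETCH = 4       # a 32-bit fetch is 4 bytes

l1_fetch_bw_gb_s = FETCHES_PER_SEC * BYTES_PER_FETCH / 1e9
l1_fetch_bw_tb_s = l1_fetch_bw_gb_s / 1000

# Assumptions not on this slide: 850 MHz engine clock, 20 SIMD engines.
ENGINE_CLOCK_HZ = 850e6
SIMD_ENGINES = 20
fetches_per_simd_per_clock = FETCHES_PER_SEC / (SIMD_ENGINES * ENGINE_CLOCK_HZ)

print(l1_fetch_bw_gb_s, l1_fetch_bw_tb_s, fetches_per_simd_per_clock)
```

The product comes to roughly 1.09 TB/s, consistent with the slide's "up to 1 TB/sec", and under the assumed clock it corresponds to 16 fetches per SIMD engine per clock.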

| Heterogeneous Computing -> Fusion | June 2010 15

Comparative Stats on ATI Radeon HD 5870 GPU

| Metric | AMD Opteron™ Model 2435 | ATI Radeon™ HD 4870 | ATI Radeon™ HD 5870 | One Year Difference |
|---|---|---|---|---|
| Die Size | 346 mm² | 263 mm² | 334 mm² | 1.27x |
| Transistors | 904 million | 956 million | 2.15 billion | 2.25x |
| Memory Bandwidth | 12.8 GB/s | 115 GB/s | 153 GB/s | 1.33x |
| SP GFlops | 124.8 | 1200 | 2720 | 2.25x |
| DP GFlops | 62.4 | 240 | 544 | 2.25x |
| ALUs | 54 | 800 | 1600 | 2x |
| Board Power*, Idle | 15.5 W | 90 W | 27 W | 0.3x |
| Board Power*, Max | 115 W | 160 W | 188 W | 1.17x |

* Based on internal AMD testing
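The "One Year Difference" column and the efficiency figure quoted earlier in the deck can be recomputed from the table values. This is a sketch for checking the arithmetic; the dictionary keys are hypothetical names, and dividing peak SP GFlops by maximum board power is an assumed definition of GFLOPS/W.

```python
hd4870 = {"die_mm2": 263, "transistors_m": 956, "mem_bw_gb_s": 115,
          "sp_gflops": 1200, "dp_gflops": 240, "alus": 800,
          "idle_w": 90, "max_w": 160}
hd5870 = {"die_mm2": 334, "transistors_m": 2150, "mem_bw_gb_s": 153,
          "sp_gflops": 2720, "dp_gflops": 544, "alus": 1600,
          "idle_w": 27, "max_w": 188}
opteron_2435 = {"sp_gflops": 124.8, "max_w": 115}

# HD 5870 relative to HD 4870, metric by metric.
ratios = {key: hd5870[key] / hd4870[key] for key in hd4870}

# Assumed definition: peak SP GFlops divided by maximum board power.
perf_per_watt = {
    "opteron_2435": opteron_2435["sp_gflops"] / opteron_2435["max_w"],
    "hd4870": hd4870["sp_gflops"] / hd4870["max_w"],
    "hd5870": hd5870["sp_gflops"] / hd5870["max_w"],
}

for key, value in ratios.items():
    print(f"{key}: {value:.2f}x")
print({name: round(v, 2) for name, v in perf_per_watt.items()})
```

With this definition the HD 5870 comes out at 14.47 GFLOPS/W, matching the figure on the efficiency-trend slide. The recomputed SP and DP ratios land near 2.27x, slightly above the 2.25x printed in the table, and the max-power ratio is 1.175, which the table prints as 1.17x.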

| Heterogeneous Computing -> Fusion | June 2010 16

Yesterday’s Chip Designs Won’t Do

• 110 million transistors @150nm: 2D and 3D gaming, nascent video processing

• 105 million transistors @130nm: compute tasks including video decode


| Heterogeneous Computing -> Fusion | June 2010 17

Today We Are Evolving

• 2.15 billion transistors @40nm: 3D OS, multi-panel HD gaming, full HD video and audio

• 758 million transistors @45nm: multi-tasking, most compute tasks

| Heterogeneous Computing -> Fusion | June 2010 18

Tomorrow Will Amaze

• Significantly enhances active/resting battery life

• High-bandwidth I/O

• ~1 billion transistors @32nm in one design

• APU: Fusion of CPU & GPU compute power within one processor

| Heterogeneous Computing -> Fusion | June 2010 19

AMD Fusion™ APUs Fill the Need

x86 CPU owns the Software World

• Windows, MacOS and Linux franchises

• Thousands of apps

• Established programming and memory model

• Mature tool chain

• Extensive backward compatibility for applications and OSs

• High barrier to entry

GPU Optimized for Modern Workloads

• Enormous parallel computing capacity

• Outstanding performance-per-watt-per-dollar

• Very efficient hardware threading

• SIMD architecture well matched to modern workloads: video, audio, graphics

| Heterogeneous Computing -> Fusion | June 2010 20

Fusion APUs: Putting it all together

[Chart; recoverable labels only]

• Axes: Microprocessor Advancement; GPU Advancement

• Scales: Throughput Performance; Programmer Accessibility (Unacceptable, Experts Only, Mainstream)

• Plotted labels: Graphics Driver-based programs; OCL/DC Driver-based programs; System-level Programmable; Power-efficient Data Parallel Execution; High Performance Task Parallel Execution


| Heterogeneous Computing -> Fusion | June 2010 21

PC with Discrete GPU

| Heterogeneous Computing -> Fusion | June 2010 22

Fusion APU Based PC

| Heterogeneous Computing -> Fusion | June 2010 23

Two x86 Cores Tuned for Target Markets

“Bulldozer”

“Bobcat”

| Heterogeneous Computing -> Fusion | June 2010 24

Heterogeneous Computing: Next-Generation Software Ecosystem

• Load balance across CPUs and GPUs; leverage AMD Fusion™ performance advantages

• Drive new features into industry standards

• Increase ease of application development
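As a rough illustration of load balancing across a CPU and a GPU, the sketch below splits a batch of independent work items in proportion to each device's peak throughput. The function name `split_work` is hypothetical, and the assumption that runtime scales inversely with peak GFLOPS is a simplification: it ignores transfer costs and measured behavior. The example values are the SP GFlops of the discrete parts in the deck's comparison table, not of a Fusion APU.

```python
def split_work(n_items, cpu_gflops, gpu_gflops):
    """Split n_items so both devices finish at about the same time,
    assuming runtime is inversely proportional to peak GFLOPS."""
    gpu_share = gpu_gflops / (cpu_gflops + gpu_gflops)
    gpu_items = round(n_items * gpu_share)
    return n_items - gpu_items, gpu_items

# SP GFlops from the comparison table: Opteron 2435 vs. Radeon HD 5870.
cpu_items, gpu_items = split_work(1_000_000, 124.8, 2720)
print(cpu_items, gpu_items)
```

A static split like this is only a starting point; a practical scheduler would more likely measure throughput at run time and rebalance.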


| Heterogeneous Computing -> Fusion | June 2010 25

Open Standards: Maximize Developer Freedom and Addressable Market

Vendor specific: cross-platform limiters

• Apple Display Connector

• 3dfx Glide

• Nvidia CUDA

• Nvidia Cg

• Rambus

• Unified Display Interface

Vendor neutral: cross-platform enablers

| Heterogeneous Computing -> Fusion | June 2010 26

OpenCL™ and DirectX® 11 DirectCompute

• How will developers choose?

• DirectX® 11 DirectCompute

  – Easiest path to add compute capabilities to existing DirectX applications

  – Windows Vista® and Windows® 7 only

• OpenCL™

  – Ideal path for new applications porting to the GPU for the first time

  – True multiplatform: Windows®, Linux®, MacOS

  – Natural programming without dealing with a graphics API
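The data-parallel style that both APIs expose can be illustrated without either of them. The sketch below is a pure-Python emulation: `run_ndrange` and `saxpy` are hypothetical names, not OpenCL or DirectCompute calls, and the loop runs sequentially where a real runtime would execute work-items in parallel on the device.

```python
def run_ndrange(kernel, global_size, *args):
    """Emulate a 1-D data-parallel launch: call `kernel` once per
    work-item id. A real runtime would run these in parallel."""
    for gid in range(global_size):
        kernel(gid, *args)

def saxpy(gid, a, x, y, out):
    # Each work-item computes exactly one output element.
    out[gid] = a * x[gid] + y[gid]

x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * len(x)
run_ndrange(saxpy, len(x), 2.0, x, y, out)
print(out)
```

The kernel body describes the work for a single element and contains no loop over the data, which is what lets the hardware spread the work-items across its SIMD engines.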

| Heterogeneous Computing -> Fusion | June 2010 27

The Benefits of Fusion

  Unparalleled processing capabilities in mobile form factors

  Shared memory for the CPU and GPU

  Eliminates copies, increasing performance

  Reduces dispatch overhead

  Lower latency from the GPU to memory

  Power efficient design

  Enables architectural innovations between CPU, GPU and the Memory System

  Scalable architecture that can target a broad range of platforms from mobile to data center
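The claim that shared memory eliminates copies can be put into a small cost model. Every number below is hypothetical and chosen only to show the shape of the trade-off; `offload_time_s` is an illustrative name, and the model ignores dispatch overhead, any overlap of copies with compute, and differences in memory bandwidth between a discrete card and an APU.

```python
def offload_time_s(bytes_in, bytes_out, kernel_s, link_gb_s=None):
    """Estimated time for one offloaded kernel.

    link_gb_s: bandwidth of the CPU<->GPU link used for explicit copies;
    None models shared memory, where no copies are needed.
    """
    copy_s = 0.0
    if link_gb_s is not None:
        copy_s = (bytes_in + bytes_out) / (link_gb_s * 1e9)
    return copy_s + kernel_s

# Hypothetical workload: 64 MB in, 64 MB out, 5 ms kernel, 8 GB/s link.
MB = 1e6
discrete = offload_time_s(64 * MB, 64 * MB, 0.005, link_gb_s=8.0)
fused = offload_time_s(64 * MB, 64 * MB, 0.005)
print(discrete, fused)
```

With these made-up inputs the copies account for most of the discrete-GPU time, which is the situation in which removing them matters most; for long-running kernels on small data the difference shrinks.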

| Heterogeneous Computing -> Fusion | June 2010 28

The Fusion Opportunity

  A new architectural and performance balance point for computing

  A new machine target for research

  A high volume opportunity for new algorithms, new workloads and new applications

  The deployment opportunity is especially strong in the consumer marketplace