PhilRogers - Department of Electrical & Computer Engineering · 2017. 8. 22.

7/14/10

Heterogeneous Computing -> Fusion

Phil Rogers, AMD Corporate Fellow

June 2010


Definitions

  Heterogeneous Computing

– A system comprised of two or more compute engines with significant structural differences

– In our case, a low latency x86 CPU and a high throughput Radeon GPU

  Fusion

– Bringing together two or more components and joining them into a single unified whole

– In our case, combining CPUs and GPUs on a single silicon die for higher performance and lower power


AMD Balanced Platform Advantage

Other Highly Parallel Workloads

Graphics Workloads

Serial/Task-Parallel Workloads

CPU is ideal for scalar processing

  Out of order x86 cores with low latency memory access

  Optimized for sequential and branching algorithms

  Runs existing applications very well

GPU is ideal for parallel processing

  GPU shaders optimized for throughput computing

  Ready for emerging workloads

  Media processing, simulation, natural UI, etc.


Three Eras of Processor Performance

Single-Core Era

[Chart: Single-thread Performance vs. Time, with a "we are here" marker; the curve ends in "?"]

Enabled by: Moore’s Law, Voltage Scaling, MicroArchitecture

Constrained by: Power, Complexity

Multi-Core Era

[Chart: Throughput Performance vs. Time (# of Processors), with a "we are here" marker]

Enabled by: Moore’s Law, Desire for Throughput, 20 years of SMP arch

Constrained by: Power, Parallel SW availability, Scalability

Heterogeneous Systems Era

[Chart: Targeted Application Performance vs. Time (Data-parallel exploitation), with a "we are here" marker]

Enabled by: Moore’s Law, Abundant data parallelism, Power efficient GPUs

Temporarily constrained by: Programming models, Communication overheads



Emerging Application Spaces

Category: Massive Data Mining

  Characteristics: Full 64b addressing; Huge data sets; New data types

  Application Examples: Image, Video, Audio processing; Pattern analytics and search

Category: Natural User Interfaces

  Characteristics: Massive “behind-the-scenes” computing

  Application Examples: Face and gesture recognition; Real time video & audio proc; Physical world interpretation

Category: Visualization

  Characteristics: Advanced rendering; Interactive physics

  Application Examples: Multi-layered Graphics; Holographic Displays; Scientific visualization & CAD; Next generation Gaming

Category: Cloud + Client Applications

  Characteristics: Seamless responsiveness; Workload partitioning

  Application Examples: Next generation browsers; HTML5; Apps with Native Code from JavaScript


GPU SP ALU Performance

[Chart; labels: HD4870, HD5870, CPU]


GPU DP ALU Performance

[Chart; labels: HD4870, HD5870, CPU]


GPU BW Performance expectations over time

[Chart; axis ticks from 0 to 300 in steps of 50; labels: HD4870, HD5870]



GPU Computing Efficiency Trend

GFLOPS/W

14.47 GFLOPS/W


Thread Processors

5-way VLIW Architecture

4 Stream Cores and 1 Special Function Stream Core

Separate Branch Unit

All 5 cores co-issue

Scheduling across the cores is done by the compiler

Each core delivers a 32-bit result per clock

Thread Processor writes 5 results per clock
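The compiler-driven scheduling described above can be illustrated with a toy bundle packer. This is a minimal sketch under stated assumptions, not AMD's shader compiler: the function name `pack_bundles`, the op names, and the dependency format are all hypothetical, and the sketch ignores the difference between the four general stream cores and the special function stream core.

```python
# Toy illustration of compiler-side VLIW packing (hypothetical, not AMD's compiler).
# Each op maps to the set of ops whose results it needs.
BUNDLE_WIDTH = 5  # 4 stream cores + 1 special function stream core


def pack_bundles(deps, width=BUNDLE_WIDTH):
    """Greedily pack ops into bundles of at most `width` independent ops.

    An op may issue only after every op it depends on has issued in an
    earlier bundle, so ops inside one bundle never depend on each other.
    """
    done = set()
    remaining = list(deps)  # keeps the program order of the ops
    bundles = []
    while remaining:
        ready = [op for op in remaining if deps[op] <= done][:width]
        if not ready:
            raise ValueError("cyclic dependency between ops")
        bundles.append(ready)
        done.update(ready)
        remaining = [op for op in remaining if op not in ready]
    return bundles


# Six independent multiplies followed by a dependent chain of adds.
example = {
    "m0": set(), "m1": set(), "m2": set(),
    "m3": set(), "m4": set(), "m5": set(),
    "a0": {"m0", "m1"},
    "a1": {"a0", "m2"},
}
bundles = pack_bundles(example)
for clock, bundle in enumerate(bundles):
    print(f"clock {clock}: {bundle} ({len(bundle)}/{BUNDLE_WIDTH} slots used)")
```

In this toy case only the first bundle fills all five slots. The dependent adds leave slots empty, which suggests why the results written per clock depend on how much independent work the compiler can find.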


SIMD Engines

 Diagram shows 2 SIMD Engines

 Each SIMD Unit includes:

  16 Thread Processors (80 shader cores) + 32KB Local Data Share

  Its own Thread Sequencer which operates a shared set of threads

  A dedicated fetch unit with an 8KB L1 cache


ATI Radeon™ HD 5870 Compute Architecture

 20 SIMD Engines

  1600 shader cores

 Ultra-Threaded Dispatch Processor

 Instruction and Constant Caches

 Memory Export Buffer

 Fetch path with multi-level caches

 Global Data Store
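The core counts on these slides multiply out consistently, which the sketch below checks. The per-level counts come from the slides. The 850 MHz engine clock and the convention of counting one multiply-add as two FLOPs are assumptions not stated here, and the constant names are illustrative only.

```python
# Counts taken from the slides.
CORES_PER_THREAD_PROCESSOR = 5   # 4 stream cores + 1 special function stream core
THREAD_PROCESSORS_PER_SIMD = 16
SIMD_ENGINES = 20

# Assumptions NOT stated on the slides.
ENGINE_CLOCK_GHZ = 0.85          # assumed HD 5870 engine clock
FLOPS_PER_CORE_PER_CLOCK = 2     # assumed: one multiply-add counted as 2 FLOPs

cores_per_simd = CORES_PER_THREAD_PROCESSOR * THREAD_PROCESSORS_PER_SIMD
shader_cores = cores_per_simd * SIMD_ENGINES
peak_sp_gflops = shader_cores * FLOPS_PER_CORE_PER_CLOCK * ENGINE_CLOCK_GHZ

print(cores_per_simd)         # 80
print(shader_cores)           # 1600
print(round(peak_sp_gflops))  # 2720
```

Under those assumptions the result matches the 80 shader cores per SIMD engine and the 1600 shader cores quoted above.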



TeraScale 2 Architecture – Radeon HD 5870


Memory Hierarchy

 Distributed Memory Controller

 Optimized for latency hiding and memory access efficiency

 GDDR5 memory at 150GB/s

 Up to 272 billion 32-bit fetches/second

 Up to 1 TB/sec L1 texture fetch bandwidth

 Up to 435 GB/sec between L1 & L2
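The bandwidth figures above can be cross-checked with simple arithmetic. The numbers come from the slide and the constant names are illustrative. Treating the fetch rate and the L1 texture fetch bandwidth as the same path is an assumption.

```python
# Figures from the slide.
FETCHES_PER_SECOND = 272e9   # 32-bit fetches per second
BYTES_PER_FETCH = 4          # 32 bits = 4 bytes
L1_L2_GB_PER_S = 435
GDDR5_GB_PER_S = 150

# Assumption: the fetch rate and the "1 TB/sec" L1 figure describe the same path.
l1_fetch_tb_per_s = FETCHES_PER_SECOND * BYTES_PER_FETCH / 1e12
print(f"{l1_fetch_tb_per_s:.3f} TB/s")  # 1.088 TB/s

# How much bandwidth drops at each step away from the shader cores.
l1_gb_per_s = l1_fetch_tb_per_s * 1000
print(f"L1 fetch vs. L1<->L2: {l1_gb_per_s / L1_L2_GB_PER_S:.1f}x")    # 2.5x
print(f"L1<->L2 vs. GDDR5:    {L1_L2_GB_PER_S / GDDR5_GB_PER_S:.1f}x")  # 2.9x
```

Read this way, each level of the hierarchy offers roughly 2.5x to 3x the bandwidth of the level below it.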


Comparative Stats on ATI Radeon HD 5870 GPU

                   AMD Opteron™     ATI Radeon™    ATI Radeon™     One Year
                   Model 2435       HD 4870        HD 5870         Difference

Die Size           346 mm²          263 mm²        334 mm²         1.27x
Transistors        904 million      956 million    2.15 billion    2.25x
Memory Bandwidth   12.8 GB/s        115 GB/sec     153 GB/sec      1.33x
SP GFlops          124.8            1200           2720            2.25x
DP GFlops          62.4             240            544             2.25x
ALUs               54               800            1600            2x
Board Power*
  Idle             15.5 W           90 W           27 W            0.3x
  Max              115 W            160 W          188 W           1.17x

* Based on internal AMD testing
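The year-over-year ratios and a performance-per-watt figure can be recomputed from the table values. This is a sketch for checking the arithmetic only. The dictionary keys are illustrative names, and dividing peak SP GFlops by maximum board power is an assumed definition of efficiency. Computed ratios may differ slightly from the rounded figures in the table.

```python
# Values copied from the comparison table.
hd4870 = {"sp_gflops": 1200, "dp_gflops": 240, "bandwidth_gb_s": 115,
          "idle_w": 90, "max_w": 160}
hd5870 = {"sp_gflops": 2720, "dp_gflops": 544, "bandwidth_gb_s": 153,
          "idle_w": 27, "max_w": 188}

# One-year ratio for each metric (HD 5870 relative to HD 4870).
ratios = {key: hd5870[key] / hd4870[key] for key in hd4870}
for key, ratio in ratios.items():
    print(f"{key}: {ratio:.2f}x")

# Assumed efficiency definition: peak SP GFlops per watt of maximum board power.
gflops_per_watt = hd5870["sp_gflops"] / hd5870["max_w"]
print(f"{gflops_per_watt:.2f} GFLOPS/W")  # 14.47 GFLOPS/W
```

Under that definition the HD 5870 result appears consistent with the 14.47 GFLOPS/W figure on the efficiency trend slide.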


Yesterday’s Chip Designs Won’t Do

110 million transistors @150nm: 2D and 3D gaming; Nascent video processing

105 million transistors @130nm: Compute tasks including video decode



Today We Are Evolving

2.15 billion transistors @40nm: 3D OS; Multi-panel HD gaming; Full HD video and audio

758 million transistors @45nm: Multi-tasking; Most compute tasks


Tomorrow Will Amaze

  Significantly enhances active/resting battery life

  High-bandwidth I/O

 ~1 billion transistors @32nm in one design

  APU: Fusion of CPU & GPU compute power within one processor


AMD Fusion™ APUs Fill the Need

x86 CPU owns the Software World

  Windows, MacOS and Linux franchises

  Thousands of apps

  Established programming and memory model

  Mature tool chain

  Extensive backward compatibility for applications and OSs

  High barrier to entry

GPU Optimized for Modern Workloads

  Enormous parallel computing capacity

  Outstanding performance-per-watt-per-dollar

  Very efficient hardware threading

  SIMD architecture well matched to modern workloads: video, audio, graphics


Fusion APUs: Putting it all together

[Chart. Axes: Throughput Performance; Programmer Accessibility. Scale labels: Unacceptable, Experts Only, Mainstream. Arrows: Microprocessor Advancement; GPU Advancement. Regions: System-level Programmable; Graphics Driver-based programs; OCL/DC Driver-based programs; Power-efficient Data Parallel Execution; High Performance Task Parallel Execution]



PC with Discrete GPU


Fusion APU Based PC


Two x86 Cores Tuned for Target Markets

“Bulldozer”

“Bobcat”


Heterogeneous Computing: Next-Generation Software Ecosystem

  Load balance across CPUs and GPUs; leverage AMD Fusion™ performance advantages

  Drive new features into industry standards

  Increase ease of application development



Open Standards: Maximize Developer Freedom and Addressable Market

Vendor specific: Cross-platform limiters

• Apple Display Connector

• 3dfx Glide

• Nvidia CUDA

• Nvidia Cg

• Rambus

• Unified Display Interface

Vendor neutral: Cross-platform enablers


OpenCL™ and DirectX® 11 DirectCompute

  How will developers choose?

  DirectX® 11 DirectCompute

  Easiest path to add compute capabilities to existing DirectX applications

 Windows Vista® and Windows® 7 only

  OpenCL™

  Ideal path for new applications porting to the GPU for the first time

  True multiplatform: Windows®, Linux®, MacOS

 Natural programming without dealing with a graphics API


The Benefits of Fusion

  Unparalleled processing capabilities in mobile form factors

  Shared memory for the CPU and GPU

  Eliminates copies, increasing performance

  Reduces dispatch overhead

  Lower latency from the GPU to memory

  Power efficient design

  Enables architectural innovations between CPU, GPU and the Memory System

  Scalable architecture that can target a broad range of platforms from mobile to data center
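The "eliminates copies" benefit listed above can be made concrete with a toy timing model. This is a deliberately simplified sketch and not a measurement. The function names, the buffer sizes, the 8 GB/s link bandwidth, and the 10 ms kernel time are all hypothetical, and kernel time is held equal on both systems purely for illustration.

```python
def discrete_gpu_time(bytes_in, bytes_out, kernel_s, link_gb_per_s):
    """Toy model: copy inputs over the link, run the kernel, copy results back."""
    copy_s = (bytes_in + bytes_out) / (link_gb_per_s * 1e9)
    return copy_s + kernel_s


def shared_memory_time(kernel_s):
    """Toy model: CPU and GPU share memory, so no explicit copies are modeled."""
    return kernel_s


# Hypothetical workload: 256 MB in, 256 MB out, 10 ms of kernel time,
# and an assumed 8 GB/s link between CPU memory and the discrete GPU.
discrete_s = discrete_gpu_time(256 * 10**6, 256 * 10**6,
                               kernel_s=0.010, link_gb_per_s=8.0)
fused_s = shared_memory_time(kernel_s=0.010)

print(f"discrete: {discrete_s * 1000:.1f} ms")  # 74.0 ms
print(f"shared:   {fused_s * 1000:.1f} ms")     # 10.0 ms
```

With these made-up numbers the copies dominate the total time. The model leaves out dispatch overhead, memory latency, and any difference in GPU size, so it only shows why removing copies can matter for short kernels on large buffers.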


The Fusion Opportunity

  A new architectural and performance balance point for computing

  A new machine target for research

  A high volume opportunity for new algorithms, new workloads and new applications

  The deployment opportunity is especially strong in the consumer marketplace
