| Heterogeneous Computing -> Fusion | June 2010 1
Heterogeneous Computing -> Fusion
Phil Rogers AMD Corporate Fellow
Definitions
Heterogeneous Computing
– A system composed of two or more compute engines with significant structural differences
– In our case, a low latency x86 CPU and a high throughput Radeon GPU
Fusion
– Bringing together two or more components and joining them into a single unified whole
– In our case, combining CPUs and GPUs on a single silicon die for higher performance and lower power
AMD Balanced Platform Advantage
Other Highly Parallel Workloads
Graphics Workloads
Serial/Task-Parallel Workloads
CPU is ideal for scalar processing
Out of order x86 cores with low latency memory access
Optimized for sequential and branching algorithms
Runs existing applications very well
GPU is ideal for parallel processing
GPU shaders optimized for throughput computing
Ready for emerging workloads
Media processing, simulation, natural UI, etc
Three Eras of Processor Performance
Single-Core Era
[Chart: Single-thread Performance vs. Time; the curve ends in a "?" and carries a "we are here" marker]
Enabled by: Moore’s Law, voltage scaling, microarchitecture
Constrained by: power, complexity
Multi-Core Era
[Chart: Throughput Performance vs. Time (# of Processors), with a "we are here" marker]
Enabled by: Moore’s Law, desire for throughput, 20 years of SMP architecture
Constrained by: power, parallel software availability, scalability
Heterogeneous Systems Era
[Chart: Targeted Application Performance vs. Time (Data-parallel exploitation), with a "we are here" marker]
Enabled by: Moore’s Law, abundant data parallelism, power-efficient GPUs
Temporarily constrained by: programming models, communication overheads
Emerging Application Spaces
Massive Data Mining
– Characteristics: Full 64b addressing; Huge data sets; New data types
– Application examples: Image, Video, Audio processing; Pattern analytics and search
Natural User Interfaces
– Characteristics: Massive “behind-the-scenes” computing
– Application examples: Face and gesture recognition; Real time video & audio processing; Physical world interpretation
Visualization
– Characteristics: Advanced rendering; Interactive physics
– Application examples: Multi-layered Graphics; Holographic Displays; Scientific visualization & CAD; Next generation Gaming
Cloud + Client Applications
– Characteristics: Seamless responsiveness; Workload partitioning
– Application examples: Next generation browsers; HTML5 Apps with Native Code from JavaScript
GPU SP ALU Performance
[Chart: data series labeled HD4870, HD5870, and CPU]
GPU DP ALU Performance
[Chart: data series labeled HD4870, HD5870, and CPU]
GPU BW Performance expectations over time
[Chart: axis ticks from 0 to 300 in steps of 50; data points labeled HD4870 and HD5870]
GPU Computing Efficiency Trend
GFLOPS/W
14.47 GFLOPS/W
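The 14.47 GFLOPS/W figure appears consistent with this deck's own Radeon HD 5870 numbers on the comparative stats slide: 2720 single-precision GFLOPS and 188 W maximum board power. A minimal Python check, assuming that is how the figure was derived (the slide does not say so, and the function name is illustrative):

```python
def gflops_per_watt(gflops: float, watts: float) -> float:
    """Return compute efficiency in GFLOPS per watt."""
    return gflops / watts

# Figures taken from the comparative stats slide in this deck.
HD5870_SP_GFLOPS = 2720.0
HD5870_MAX_BOARD_POWER_W = 188.0
HD4870_SP_GFLOPS = 1200.0
HD4870_MAX_BOARD_POWER_W = 160.0

hd5870_eff = gflops_per_watt(HD5870_SP_GFLOPS, HD5870_MAX_BOARD_POWER_W)
hd4870_eff = gflops_per_watt(HD4870_SP_GFLOPS, HD4870_MAX_BOARD_POWER_W)

print(f"HD 5870: {hd5870_eff:.2f} GFLOPS/W")  # 14.47
print(f"HD 4870: {hd4870_eff:.2f} GFLOPS/W")  # 7.50
```

On these numbers, efficiency at maximum board power roughly doubles from one generation to the next.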
Thread Processors
5-way VLIW Architecture
4 Stream Cores and 1 Special Function Stream Core
Separate Branch Unit
All 5 cores co-issue
Scheduling across the cores is done by the compiler
Each core delivers a 32-bit result per clock
Thread Processor writes 5 results per clock
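Because the compiler does the scheduling, how many of the 5 slots are filled each clock depends on how much independent work it can find. The sketch below is a toy greedy packer in Python, not AMD's shader compiler: it treats all 5 cores as interchangeable (ignoring the special function core and the branch unit), and the operation names are hypothetical.

```python
from typing import Dict, List, Set

BUNDLE_WIDTH = 5  # one slot per core in a thread processor, per the slide

def pack_bundles(deps: Dict[str, Set[str]], width: int = BUNDLE_WIDTH) -> List[List[str]]:
    """Greedily pack operations into bundles of at most `width` ops.

    `deps` maps each op name to the set of ops whose results it needs.
    An op may issue only after all of its inputs were produced by an
    earlier bundle, so ops inside one bundle are mutually independent.
    """
    done: Set[str] = set()
    pending = list(deps)  # dict order stands in for program order
    bundles: List[List[str]] = []
    while pending:
        bundle = [op for op in pending if deps[op] <= done][:width]
        if not bundle:
            raise ValueError("dependency cycle or unknown input")
        bundles.append(bundle)
        done.update(bundle)
        pending = [op for op in pending if op not in done]
    return bundles

# Hypothetical straight-line code: six independent multiplies, then a sum tree.
example = {
    "m0": set(), "m1": set(), "m2": set(), "m3": set(), "m4": set(), "m5": set(),
    "a0": {"m0", "m1"}, "a1": {"m2", "m3"}, "a2": {"m4", "m5"},
    "s0": {"a0", "a1"}, "s1": {"s0", "a2"},
}

bundles = pack_bundles(example)
for clock, bundle in enumerate(bundles):
    print(clock, bundle)

used = sum(len(b) for b in bundles)
print(f"slot utilization: {used}/{len(bundles) * BUNDLE_WIDTH}")
```

For this input the packer needs 4 bundles for 11 operations, so only 11 of 20 slots are used: the dependency chain at the end of the sum tree leaves slots empty no matter how wide the machine is.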
SIMD Engines
Diagram shows 2 SIMD Engines
Each SIMD Unit includes:
16 Thread Processors (80 shader cores) + 32KB Local Data Share
Its own Thread Sequencer which operates a shared set of threads
A dedicated fetch unit with an 8KB L1 cache
ATI Radeon™ HD 5870 Compute Architecture
20 SIMD Engines
1600 shader cores
Ultra-Threaded Dispatch Processor
Instruction and Constant Caches
Memory Export Buffer
Fetch path with multi-level caches
Global Data Store
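The core counts on the thread processor, SIMD engine, and compute architecture slides multiply out to the 1600 shader cores quoted above. The sketch below also estimates peak single-precision throughput under two assumptions that this deck does not state: an 850 MHz engine clock and one multiply-add (counted as 2 FLOPs) per core per clock.

```python
# Counts stated on the slides.
CORES_PER_THREAD_PROCESSOR = 5   # 4 stream cores + 1 special function stream core
THREAD_PROCESSORS_PER_SIMD = 16
SIMD_ENGINES = 20

# Assumptions NOT stated in the deck.
ENGINE_CLOCK_MHZ = 850           # assumed engine clock
FLOPS_PER_CORE_PER_CLOCK = 2     # assumed: one multiply-add counts as 2 FLOPs

cores_per_simd = CORES_PER_THREAD_PROCESSOR * THREAD_PROCESSORS_PER_SIMD
total_cores = cores_per_simd * SIMD_ENGINES
peak_sp_gflops = total_cores * FLOPS_PER_CORE_PER_CLOCK * ENGINE_CLOCK_MHZ / 1000

print(cores_per_simd)   # 80
print(total_cores)      # 1600
print(peak_sp_gflops)   # 2720.0
```

Under those assumptions the estimate agrees with the 2720 SP GFlops listed on the comparative stats slide.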
TeraScale 2 Architecture – Radeon HD 5870
Memory Hierarchy
Distributed Memory Controller
Optimized for latency hiding and memory access efficiency
GDDR5 memory at 150GB/s
Up to 272 billion 32-bit fetches/second
Up to 1 TB/sec L1 texture fetch bandwidth
Up to 435 GB/sec between L1 & L2
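The fetch rate and the L1 bandwidth figures above can be cross-checked with simple unit conversion. A short Python sketch, assuming decimal units (1 TB = 10^12 bytes) and that the two bullets describe the same fetch path:

```python
FETCHES_PER_SECOND = 272e9   # "up to 272 billion 32-bit fetches/second"
BYTES_PER_FETCH = 32 // 8    # a 32-bit fetch moves 4 bytes

GDDR5_GB_PER_S = 150.0       # off-chip memory bandwidth from the slide
L1_L2_GB_PER_S = 435.0       # L1 <-> L2 bandwidth from the slide

l1_fetch_tb_per_s = FETCHES_PER_SECOND * BYTES_PER_FETCH / 1e12
l1_to_dram_ratio = l1_fetch_tb_per_s * 1000 / GDDR5_GB_PER_S
l2_to_dram_ratio = L1_L2_GB_PER_S / GDDR5_GB_PER_S

print(f"L1 fetch bandwidth: {l1_fetch_tb_per_s:.3f} TB/s")
print(f"L1 fetch bandwidth vs. GDDR5: {l1_to_dram_ratio:.2f}x")
print(f"L1<->L2 bandwidth vs. GDDR5: {l2_to_dram_ratio:.2f}x")
```

The 272 billion fetches per second work out to about 1.09 TB/s, in line with the "up to 1 TB/sec" bullet, and each level of the hierarchy offers several times the bandwidth of the level below it.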
Comparative Stats on ATI Radeon HD 5870 GPU
                    AMD Opteron™    ATI Radeon™    ATI Radeon™     One Year
                    Model 2435      HD 4870        HD 5870         Difference
Die Size            346 mm²         263 mm²        334 mm²         1.27x
Transistors         904 million     956 million    2.15 billion    2.25x
Memory Bandwidth    12.8 GB/s       115 GB/s       153 GB/s        1.33x
SP GFlops           124.8           1200           2720            2.25x
DP GFlops           62.4            240            544             2.25x
ALUs                54              800            1600            2x
Board Power*
  Idle              15.5 W          90 W           27 W            0.3x
  Max               115 W           160 W          188 W           1.17x
* Based on internal AMD testing
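The "One Year Difference" column can be recomputed from the HD 4870 and HD 5870 columns. A small Python sketch using only the values printed on the slide (the dictionary layout and names are illustrative):

```python
# metric: (HD 4870 value, HD 5870 value, ratio printed on the slide)
stats = {
    "Die size (mm^2)":         (263, 334, 1.27),
    "Transistors (millions)":  (956, 2150, 2.25),
    "Memory bandwidth (GB/s)": (115, 153, 1.33),
    "SP GFLOPS":               (1200, 2720, 2.25),
    "DP GFLOPS":               (240, 544, 2.25),
    "ALUs":                    (800, 1600, 2.0),
    "Idle board power (W)":    (90, 27, 0.3),
    "Max board power (W)":     (160, 188, 1.17),
}

# Generation-over-generation ratio for each metric.
ratios = {name: new / old for name, (old, new, _) in stats.items()}

for name, (_, _, listed) in stats.items():
    print(f"{name:26s} computed {ratios[name]:.3f}x   slide {listed}x")
```

Most rows reproduce the slide to the printed precision. The two GFLOPS rows compute to about 2.27x rather than the listed 2.25x, a small difference that the slide does not explain.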
Yesterday’s Chip Designs Won’t Do
110 million transistors @150nm
2D and 3D gaming
Nascent video processing
105 million transistors @130nm
Compute tasks including video decode
Today We Are Evolving
2.15 billion transistors @40nm
3D OS
Multi-panel HD gaming
Full HD video and audio
758 million transistors @45nm
Multi-tasking
Most compute tasks
Tomorrow Will Amaze
Significantly enhances active/resting battery life
High-bandwidth I/O
~1 billion transistors @32nm in one design
APU: Fusion of CPU & GPU compute power within one processor
AMD Fusion™ APUs Fill the Need
x86 CPU owns the Software World
Windows, MacOS and Linux franchises
Thousands of apps
Established programming and memory model
Mature tool chain
Extensive backward compatibility for applications and OSs
High barrier to entry
GPU Optimized for Modern Workloads
Enormous parallel computing capacity
Outstanding performance-per-watt-per-dollar
Very efficient hardware threading
SIMD architecture well matched to modern workloads: video, audio, graphics
Fusion APUs: Putting it all together
[Diagram: GPU Advancement and Microprocessor Advancement converging]
Axis labels: Throughput Performance; Programmer Accessibility (Unacceptable, Experts Only, Mainstream)
Graphics Driver-based programs
OCL/DC Driver-based programs
System-level Programmable
Power-efficient Data Parallel Execution
High Performance Task Parallel Execution
PC with Discrete GPU
Fusion APU Based PC
Two x86 Cores Tuned for Target Markets
“Bulldozer”
“Bobcat”
Heterogeneous Computing: Next-Generation Software Ecosystem
Load balance across CPUs and GPUs; leverage AMD Fusion™ performance advantages
Drive new features into industry standards
Increase ease of application development
Open Standards: Maximize Developer Freedom and Addressable Market
Vendor specific Cross-platform limiters
• Apple Display Connector
• 3dfx Glide
• Nvidia CUDA
• Nvidia Cg
• Rambus
• Unified Display Interface
Vendor neutral Cross-platform enablers
OpenCL™ and DirectX® 11 DirectCompute
How will developers choose?
DirectX® 11 DirectCompute
Easiest path to add compute capabilities to existing DirectX applications
Windows Vista® and Windows® 7 only
OpenCL™
Ideal path for new applications porting to the GPU for the first time
True multiplatform: Windows®, Linux®, MacOS
Natural programming without dealing with a graphics API
The Benefits of Fusion
Unparalleled processing capabilities in mobile form factors
Shared memory for the CPU and GPU
Eliminates copies, increasing performance
Reduces dispatch overhead
Lower latency from the GPU to memory
Power efficient design
Enables architectural innovations between CPU, GPU and the Memory System
Scalable architecture that can target a broad range of platforms from mobile to data center
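The copy-elimination and dispatch-overhead points can be made concrete with a toy timing model. This is a sketch under loudly stated assumptions: every number below is a hypothetical placeholder rather than a measurement, the function names are illustrative, and the model gives both configurations the same kernel time, which a real discrete GPU and a real APU would not share.

```python
def discrete_offload_time(bytes_in: int, bytes_out: int, kernel_s: float,
                          link_bytes_per_s: float, dispatch_s: float) -> float:
    """Offload time when inputs and outputs are copied over a CPU-GPU link."""
    copy_s = (bytes_in + bytes_out) / link_bytes_per_s
    return dispatch_s + copy_s + kernel_s

def shared_memory_offload_time(kernel_s: float, dispatch_s: float) -> float:
    """Offload time when CPU and GPU share memory, so no copies are needed."""
    return dispatch_s + kernel_s

# Hypothetical workload: all values are placeholders for illustration only.
BYTES_IN = 64 * 1024 * 1024    # 64 MiB of input
BYTES_OUT = 64 * 1024 * 1024   # 64 MiB of output
KERNEL_S = 0.002               # assumed kernel execution time
LINK_BYTES_PER_S = 8e9         # assumed CPU-GPU link bandwidth
DISPATCH_S = 0.0001            # assumed dispatch overhead

discrete_s = discrete_offload_time(BYTES_IN, BYTES_OUT, KERNEL_S,
                                   LINK_BYTES_PER_S, DISPATCH_S)
shared_s = shared_memory_offload_time(KERNEL_S, DISPATCH_S)

print(f"with copies:   {discrete_s * 1e3:.2f} ms")
print(f"shared memory: {shared_s * 1e3:.2f} ms")
```

With these placeholder values the copies dominate the offload time, which is the situation the shared-memory bullets describe; for kernels that run much longer than their transfers, the difference would shrink.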
The Fusion Opportunity
A new architectural and performance balance point for computing
A new machine target for research
A high volume opportunity for new algorithms, new workloads and new applications
The deployment opportunity is especially strong in the consumer market place