Boosting Mobile GPU Performance with a Decoupled Access/Execute Fragment Processor
José-María Arnau, Joan-Manuel Parcerisa (UPC), Polychronis Xekalakis (Intel)
Jose-Maria Arnau, Joan-Manuel Parcerisa, Polychronis Xekalakis 2
Focusing on Mobile GPUs
1. Market demands
2. Technology limitations
→ Energy-efficient mobile GPUs
[1] http://www.digitalversus.com/mobile-phone/samsung-galaxy-note-p11735/test.html (Samsung Galaxy SII vs. Samsung Galaxy Note running the game Shadow Gun 3D)
[2] http://www.ispsd.com/02/battery-psd-templates/
GPU Performance and Memory
Graphical workloads:
- Large working sets not amenable to caching
- Texture memory accesses are fine-grained and unpredictable
Traditional techniques to deal with memory:
- Caches
- Prefetching
- Multithreading
A mobile single-threaded GPU with perfect caches achieves a speedup of 3.2x on a set of commercial Android games
Outline
Background
Methodology
Multithreading & Prefetching
Decoupled Access/Execute
Conclusions
Assumed GPU Architecture
Assumed Fragment Processor
Warp: group of threads executed in lockstep (SIMD group)
- 4 threads per warp
- 4-wide vector registers (16 bytes)
- 36 registers per thread
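The lockstep execution described above can be sketched in a few lines of Python (an illustration only, not the paper's simulator; `execute_warp` and the operand layout are assumed names):

```python
# Illustrative sketch: a warp of 4 threads executing one instruction
# in lockstep, mirroring the assumed fragment processor
# (4 threads/warp, 4-wide vector registers).

WARP_WIDTH = 4

def execute_warp(instruction, thread_operands):
    """Apply the same instruction to every thread's operands in lockstep."""
    assert len(thread_operands) == WARP_WIDTH
    return [instruction(ops) for ops in thread_operands]

# Example: each thread adds its two scalar operands.
operands = [(1, 2), (3, 4), (5, 6), (7, 8)]
results = execute_warp(lambda ops: ops[0] + ops[1], operands)
# results == [3, 7, 11, 15]
```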
Methodology
Power model: CACTI 6.5 and Qsilver
Main memory: latency = 100 cycles, bandwidth = 4 bytes/cycle
Pixel/texture caches: 2 KB, 2-way, 2 cycles
L2 cache: 32 KB, 8-way, 12 cycles
Number of cores: 4 vertex, 4 pixel processors
Warp width: 4 threads
Register file size: 2304 bytes per warp
Number of warps: 1-16 warps/core
Workload Selection

2D games:
- Small/medium sized textures
- Texture filtering: 1 memory access
- Small fragment programs

Simple 3D games:
- Small/medium sized textures
- Texture filtering: 1-4 memory accesses
- Small/medium fragment programs

Complex 3D games:
- Medium/big sized textures
- Texture filtering: 4-8 memory accesses
- Big, memory-intensive fragment programs
Improving Performance Using Multithreading
- Very effective
- High energy cost (25% more energy)
- Huge register file to maintain the state of all the threads: 36 KB MRF for a GPU with 16 warps/core (bigger than the L2 cache)
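The 36 KB figure follows directly from the methodology numbers (4 threads/warp × 36 registers × 16 bytes = 2304 bytes/warp); a quick check:

```python
# Arithmetic behind the 36 KB MRF figure, using the methodology table.
threads_per_warp = 4
registers_per_thread = 36
bytes_per_register = 16        # 4-wide vector registers
bytes_per_warp = threads_per_warp * registers_per_thread * bytes_per_register
warps_per_core = 16
mrf_bytes = warps_per_core * bytes_per_warp
print(bytes_per_warp, mrf_bytes)   # 2304 36864 -> 36 KB, larger than the 32 KB L2
```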
Employing Prefetching
Hardware prefetchers:
- Global History Buffer: K. J. Nesbit and J. E. Smith, "Data Cache Prefetching Using a Global History Buffer", HPCA, 2004.
- Many-Thread Aware: J. Lee, N. B. Lakshminarayana, H. Kim and R. Vuduc, "Many-Thread Aware Prefetching Mechanisms for GPGPU Applications", MICRO, 2010.
Prefetching is effective but there is still ample room for improvement
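As a rough illustration of the stride-detection idea behind the Global History Buffer (a toy sketch, not Nesbit and Smith's full FIFO/linked-list PC/DC design):

```python
# Toy sketch of PC-localized stride prefetching in the spirit of the GHB:
# track recent miss addresses per PC and prefetch when the stride repeats.
class GlobalHistoryBuffer:
    def __init__(self, history_len=16):
        self.history_len = history_len
        self.per_pc = {}               # pc -> recent miss addresses

    def miss(self, pc, addr):
        """Record a cache miss; return an address to prefetch, or None."""
        hist = self.per_pc.setdefault(pc, [])
        hist.append(addr)
        del hist[:-self.history_len]   # bound the per-PC history
        if len(hist) >= 3:
            d1 = hist[-2] - hist[-3]
            d2 = hist[-1] - hist[-2]
            if d1 == d2:               # stable stride detected
                return hist[-1] + d2
        return None

ghb = GlobalHistoryBuffer()
ghb.miss(0x40, 100)    # None: not enough history yet
ghb.miss(0x40, 104)    # None
ghb.miss(0x40, 108)    # returns 112 (stride of 4 detected)
```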
Decoupled Access/Execute
Use the fragment information to compute the addresses that will be requested when processing the fragment
Issue memory requests while the fragments are waiting in the tile queue
Tile queue size:
- Too small: timeliness is not achieved
- Too big: cache conflicts
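The access/execute split described above can be sketched as follows; the fragment fields, `texture_address`, and the single-texel access are illustrative assumptions, not the paper's actual pipeline:

```python
from collections import deque

def texture_address(fragment):
    # Hypothetical address computation from a fragment's texture coordinate.
    return fragment["tex_base"] + 4 * fragment["u"]

def access_stage(tile_queue, prefetch_queue, cache):
    """Access unit: compute addresses for fragments waiting in the tile
    queue and issue them, modeled here as filling the cache."""
    for fragment in tile_queue:
        addr = texture_address(fragment)
        prefetch_queue.append(addr)
        cache.add(addr)

def execute_stage(tile_queue, cache):
    """Execute unit: count how many fragments now hit in the cache."""
    return sum(texture_address(f) in cache for f in tile_queue)

tile_queue = deque({"tex_base": 0x1000, "u": u} for u in range(8))
prefetch_queue, cache = deque(), set()
access_stage(tile_queue, prefetch_queue, cache)
hits = execute_stage(tile_queue, cache)   # all 8 fragments hit
```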
Inter-Core Data Sharing
66.3% of cache misses are requests to data available in the L1 cache of another fragment processor
Use the prefetch queue to detect inter-core data sharing
Saves bandwidth to the L2 cache
Saves power (L1 caches smaller than L2)
Associative comparisons require additional energy
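The prefetch-queue lookup can be sketched as below; `route_miss` is a hypothetical helper name, and the real design would perform the comparison in hardware with associative comparators:

```python
# Sketch: on an L1 miss, compare the address against the other cores'
# prefetch queues; a match means that core's L1 holds (or will hold)
# the line, so fetch from there instead of going to the L2.
def route_miss(addr, my_core, prefetch_queues):
    """Return ('remote_L1', core) on a match, else ('L2', None)."""
    for core, queue in prefetch_queues.items():
        if core != my_core and addr in queue:   # associative comparison
            return ("remote_L1", core)
    return ("L2", None)

queues = {0: {0x100, 0x104}, 1: {0x200}}
route_miss(0x200, 0, queues)   # ('remote_L1', 1): serviced by core 1's L1
route_miss(0x300, 0, queues)   # ('L2', None): no sharer, go to L2
```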
Decoupled Access/Execute
33% faster than hardware prefetchers, 9% energy savings
DAE with 2 warps/core achieves 93% of the performance of a bigger GPU with 16 warps/core, providing 34% energy savings
Benefits of Remote L1 Cache Accesses
Single-threaded GPU, baseline: Global History Buffer
- 30% speedup
- 5.4% energy savings
Conclusions

High-performance, energy-efficient GPUs can be architected based on the decoupled access/execute concept
A combination of decoupled access/execute (to hide memory latency) and multithreading (to hide functional unit latency) provides the most energy-efficient solution
Allowing remote L1 cache accesses saves both L2 cache bandwidth and energy
The decoupled access/execute architecture outperforms hardware prefetchers: 33% speedup, 9% energy savings
Boosting Mobile GPU Performance with a Decoupled Access/Execute Fragment Processor
José-María Arnau (UPC), Joan-Manuel Parcerisa (UPC), Polychronis Xekalakis (Intel)
Thank you! Questions?