Boosting Mobile GPU Performance with a Decoupled Access/Execute Fragment Processor

José-María Arnau, Joan-Manuel Parcerisa (UPC), Polychronis Xekalakis (Intel)




Focusing on Mobile GPUs

Market demands

Technology limitations

Energy-efficient mobile GPUs

1 http://www.digitalversus.com/mobile-phone/samsung-galaxy-note-p11735/test.html Samsung Galaxy S II vs Samsung Galaxy Note when running the game Shadow Gun 3D

2 http://www.ispsd.com/02/battery-psd-templates/


GPU Performance and Memory

Graphical workloads:
  • Large working sets not amenable to caching
  • Texture memory accesses are fine-grained and unpredictable

Traditional techniques to deal with memory:
  • Caches
  • Prefetching
  • Multithreading

A mobile single-threaded GPU with perfect caches achieves a speedup of 3.2x on a set of commercial Android games


Outline

  • Background
  • Methodology
  • Multithreading & Prefetching
  • Decoupled Access/Execute
  • Conclusions


Assumed GPU Architecture


Assumed Fragment Processor

  • 4 threads per warp
  • 4-wide vector registers (16 bytes)
  • 36 registers per thread

Warp: group of threads executed in lockstep mode (SIMD group)


Methodology

Power model: CACTI 6.5 and Qsilver

Main memory: latency = 100 cycles, bandwidth = 4 bytes/cycle
Pixel/texture caches: 2 KB, 2-way, 2 cycles
L2 cache: 32 KB, 8-way, 12 cycles
Number of cores: 4 vertex, 4 pixel processors
Warp width: 4 threads
Register file size: 2304 bytes per warp
Number of warps: 1-16 warps/core


Workload Selection

2D games:
  • Small/medium sized textures
  • Texture filtering: 1 memory access
  • Small fragment programs

Simple 3D games:
  • Small/medium sized textures
  • Texture filtering: 1-4 memory accesses
  • Small/medium fragment programs

Complex 3D games:
  • Medium/big sized textures
  • Texture filtering: 4-8 memory accesses
  • Big, memory intensive fragment programs


Improving Performance Using Multithreading

  • Very effective
  • High energy cost (25% more energy)
  • Huge register file to maintain the state of all the threads
  • 36 KB MRF for a GPU with 16 warps/core (bigger than L2)
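The 36 KB figure follows from the fragment processor parameters given earlier (4 threads per warp, 36 registers per thread, 16 bytes per register). The short sketch below only reproduces that arithmetic; the function and constant names are illustrative and do not come from the slides.

```python
# Parameters taken from the slides.
THREADS_PER_WARP = 4
REGISTERS_PER_THREAD = 36
BYTES_PER_REGISTER = 16       # one 4-wide vector register
L2_CACHE_BYTES = 32 * 1024    # 32 KB L2 cache from the methodology slide


def register_file_bytes(warps_per_core: int) -> int:
    """Register file capacity needed to hold the state of all warps on one core."""
    per_warp = THREADS_PER_WARP * REGISTERS_PER_THREAD * BYTES_PER_REGISTER
    return per_warp * warps_per_core


if __name__ == "__main__":
    for warps in (1, 2, 16):
        size = register_file_bytes(warps)
        bigger = "yes" if size > L2_CACHE_BYTES else "no"
        print(f"{warps:2d} warps/core: {size:6d} bytes "
              f"({size / 1024:.2f} KB), bigger than L2: {bigger}")
```

One warp needs 4 x 36 x 16 = 2304 bytes, which matches the per-warp register file size in the methodology; 16 warps need 36864 bytes, i.e. 36 KB, which exceeds the 32 KB L2 cache.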


Employing Prefetching

Hardware prefetchers:
  • Global History Buffer: K. J. Nesbit and J. E. Smith. “Data Cache Prefetching Using a Global History Buffer”. HPCA, 2004.
  • Many-Thread Aware: J. Lee, N. B. Lakshminarayana, H. Kim and R. Vuduc. “Many-Thread Aware Prefetching Mechanisms for GPGPU Applications”. MICRO, 2010.

Prefetching is effective but there is still ample room for improvement


Decoupled Access/Execute

Use the fragment information to compute the addresses that will be requested when processing the fragment

Issue memory requests while the fragments are waiting in the tile queue

Tile queue size:
  • Too small: timeliness is not achieved
  • Too big: cache conflicts
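The tile queue trade-off can be illustrated with a toy timing model. This is a rough sketch under strong assumptions, not the simulator used for the results: every fragment touches one distinct cache line, the prefetch is issued when the fragment enters the queue, and a queue longer than the cache evicts every prefetched line before it is used. The 100-cycle latency comes from the methodology; `compute_cycles` and `cache_lines` (2 KB with an assumed 64-byte line) are hypothetical values.

```python
def simulate(num_fragments: int, queue_size: int, mem_latency: int = 100,
             compute_cycles: int = 10, cache_lines: int = 32) -> int:
    """Return the cycles needed to process all fragments in the toy model."""
    start = [0] * num_fragments   # cycle at which fragment i starts executing
    time_free = 0                 # cycle at which the execute stage is free
    for i in range(num_fragments):
        # Fragment i enters the tile queue when a slot frees up, i.e. when
        # fragment i - queue_size leaves the queue to start executing.
        enqueue = 0 if i < queue_size else start[i - queue_size]
        if queue_size <= cache_lines:
            # The prefetched line is still cached when the fragment executes.
            data_ready = enqueue + mem_latency
        else:
            # Assumption: more lines in flight than the cache can hold, so the
            # line was evicted and the fragment pays a full demand miss.
            data_ready = time_free + mem_latency
        start[i] = max(time_free, data_ready)
        time_free = start[i] + compute_cycles
    return time_free


if __name__ == "__main__":
    for size in (1, 4, 16, 64):
        print(f"tile queue of {size:2d} fragments: {simulate(100, size)} cycles")
```

In this toy model, 100 fragments take 10010 cycles with a 1-entry queue and 2540 cycles with 4 entries (prefetches are issued too late), 1100 cycles with 16 entries (latency fully hidden), and 11000 cycles with 64 entries (prefetched lines are evicted before use). The absolute numbers depend entirely on the assumed parameters; only the shape of the trade-off is meant to carry over.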


Inter-Core Data Sharing

66.3% of cache misses are requests to data available in the L1 cache of another fragment processor

Use the prefetch queue to detect inter-core data sharing

Saves bandwidth to the L2 cache

Saves power (L1 caches smaller than L2)

Associative comparisons require additional energy
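A minimal sketch of how a prefetch queue could be used to detect inter-core sharing is shown below. The slides do not describe the hardware mechanism in detail, so everything here is an assumption for illustration: the `FragmentProcessor` structure, the `route_request` function, and the rule that a matching entry in another core's prefetch queue means the line can be served from that core's L1.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class FragmentProcessor:
    """Hypothetical per-core state: L1 contents and pending prefetch addresses."""
    core_id: int
    l1_lines: set = field(default_factory=set)
    prefetch_queue: deque = field(default_factory=deque)


def route_request(line_addr: int, requester: FragmentProcessor, cores: list):
    """Decide where a prefetch request issued by `requester` is served.

    Returns ("local", core_id), ("remote", core_id) or ("l2", None).
    """
    if line_addr in requester.l1_lines:
        return ("local", requester.core_id)
    # Associative comparison against the prefetch queues of the other cores.
    # Assumption: a match means that core already requested the line, so it
    # is (or soon will be) in its L1 and the L2 access can be avoided.
    for core in cores:
        if core is not requester and line_addr in core.prefetch_queue:
            return ("remote", core.core_id)
    return ("l2", None)


if __name__ == "__main__":
    cores = [FragmentProcessor(i) for i in range(4)]   # 4 pixel processors
    cores[0].prefetch_queue.append(0x40)
    cores[0].l1_lines.add(0x40)
    print(route_request(0x40, cores[1], cores))   # another core has the line
    print(route_request(0x80, cores[1], cores))   # no sharer, request goes to L2
```

The loop over the other cores' queues stands in for the associative comparisons mentioned above, which is where the additional energy is spent; each "remote" result corresponds to one L2 access saved.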


Decoupled Access/Execute

  • 33% faster than hardware prefetchers, 9% energy savings
  • DAE with 2 warps/core achieves 93% of the performance of a bigger GPU with 16 warps/core, providing 34% energy savings


Benefits of Remote L1 Cache Accesses

Single-threaded GPU
Baseline: Global History Buffer
  • 30% speedup
  • 5.4% energy savings


Conclusions

  • High-performance, energy-efficient GPUs can be architected based on the decoupled access/execute concept
  • A combination of decoupled access/execute (to hide memory latency) and multithreading (to hide functional unit latency) provides the most energy-efficient solution
  • Allowing for remote L1 cache accesses provides L2 cache bandwidth savings and energy savings
  • The decoupled access/execute architecture outperforms hardware prefetchers: 33% speedup, 9% energy savings


Thank you! Questions?