Understanding the Future of Energy Efficiency in Multi-Module GPUs
Akhil Arunkumar*, Evgeny Bolotin#, David Nellans#, Carole-Jean Wu*
*Arizona State University, #NVIDIA
25th IEEE International Symposium on High-Performance Computer Architecture
Multi-Module GPUs
[Figure: a monolithic GPU with its DRAM versus multi-module GPUs built from several GPU modules (GPMs), each with local DRAM, under three integration styles]

• On-package integration: utilize an organic package / interposer
  • Arunkumar et al., ISCA '17; Vijayaraghavan et al., HPCA '17
• On-board integration: utilize the PC board
  • Milic et al., ISCA '17; NVIDIA DGX, HGX
• Hybrid integration: utilize both the package and the PC board
  • Dally et al., VLSI '18
Prior work has focused only on the performance aspects of multi-module scaling.
Energy Cost of Multi-Module Scaling

[Figure: relative energy per task vs. number of GPMs (2 to 32); the ideal curve stays flat, while the measured curve reaches 2x at 32 GPMs]

• Energy cost per task could double: 32 GPMs integrated on-board consume 2x the energy of 1 GPM
• What are the energy efficiency limitations?
• Where are the bottlenecks?
Outline
• Introduction and background
• GPU energy estimation framework – GPUJoule
• Energy efficiency scaling metric – EDPSE
• Energy efficiency trends in future multi-module GPUs
• Conclusion
GPU Energy Estimation – Prior Work
• Bottom-up GPU energy estimation [1][2][3]:
  • Estimate the energy cost of each microarchitectural component
  • Hard to keep current as GPUs evolve
• Top-down, instruction-based energy estimation [4][5][6]:
  • Estimate the energy cost of the instruction operations executed
  • Flexible and agile as the microarchitecture evolves

[1] Hong and Kim, "An integrated GPU power and performance model", ISCA '10
[2] Leng et al., "GPUWattch: Enabling energy optimizations in GPGPUs", ISCA '13
[3] Guerreiro et al., "GPGPU power modeling for multi-domain voltage-frequency scaling", HPCA '18
[4] Kestor et al., "Quantifying the energy cost of data movement in scientific applications", IISWC '13
[5] Pandiyan et al., "Quantifying the energy cost of data movement for emerging smartphone workloads on mobile platforms", IISWC '13
[6] Shao et al., "Energy characterization and instruction-level energy model of Intel's Xeon Phi processor", ISLPED '13
Top-down energy model is well suited for GPUs
Our Contribution: The GPUJoule Framework
• Key idea:
  • Estimate the energy-per-instruction (EPI) for each compute instruction type
  • Estimate the energy-per-transaction (EPT) for each memory transaction type
• Per-application GPU energy:

  GPU-Energy = Σi (Ni × EPIi) + Σj (Txnj × EPTj) + idle_energy

  where the first term is the energy to execute compute instructions and the second is the energy to move data.
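As a concrete illustration of this top-down model, per-application energy can be computed from instruction and transaction counts. The sketch below is a minimal sketch of the model's structure; the instruction mix and the EPI/EPT values are placeholders, not measured numbers.

```python
# Sketch of the GPUJoule top-down energy model:
#   GPU-Energy = sum(N_i * EPI_i) + sum(Txn_j * EPT_j) + idle_energy
def gpu_energy(inst_counts, epi, txn_counts, ept, idle_energy_nj=0.0):
    compute_nj = sum(n * epi[op] for op, n in inst_counts.items())
    movement_nj = sum(t * ept[kind] for kind, t in txn_counts.items())
    return compute_nj + movement_nj + idle_energy_nj

# Hypothetical per-kernel profile (counts and EPT-per-transaction are illustrative).
insts = {"FFMA": 1_000_000, "IADD": 500_000}
epi_nj = {"FFMA": 0.05, "IADD": 0.07}   # nJ per instruction
txns = {"DRAM->L2": 10_000}
ept_nj = {"DRAM->L2": 31.3}             # nJ per transaction (assumed granularity)

total = gpu_energy(insts, epi_nj, txns, ept_nj, idle_energy_nj=1000.0)
```

Note how the model never touches microarchitectural detail: only event counts and per-event energies, which is what keeps it agile as the hardware evolves.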
GPUJoule Energy Modeling Methodology
[Diagram: microbenchmarks stress individual compute and memory instructions to derive per-op EPI and per-transaction EPT values; these feed the model Est. Energy = Σi (Ni × EPIi) + Σj (Txnj × EPTj) + idle_energy, which is validated against measured microbenchmark energy (Error = Measured Energy − Est. Energy) in a loop that improves coverage]
Memory instruction microbenchmarks (dependent pointer chase):
    ptr = (void **)(&array[index]);
    for (i = 0; i < num_iterations; i++)
        ptr = (void **)(*ptr);

Compute instruction microbenchmarks:
    for (i = 0; i < num_iterations; i++)
        asm volatile("fma.rn.f32 %r3, %r1, %r3, %r2;" …);
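The memory microbenchmarks rely on a dependent pointer chase, so each load must wait for the previous one and the targeted memory level dominates the measurement. The actual microbenchmarks are CUDA; purely as a language-neutral illustration of the same idea, here is how the chase array is typically set up, using a random cyclic permutation:

```python
import random

def build_chase(n, seed=0):
    """Build a random cyclic pointer-chase array: a[i] holds the index of
    the next element, so every access depends on the previous load."""
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    a = [0] * n
    for i in range(n):
        a[order[i]] = order[(i + 1) % n]  # each slot points to the next in the cycle
    return a

def chase(a, start, steps):
    """Follow the chain; the final index depends on every intermediate load."""
    idx = start
    for _ in range(steps):
        idx = a[idx]
    return idx

a = build_chase(1024)
# After exactly len(a) steps, the cyclic chain returns to its start.
assert chase(a, 0, len(a)) == 0
```

Randomizing the cycle defeats hardware prefetchers, which is what makes the measured energy attributable to the intended level of the memory hierarchy.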
GPUJoule Validation
• GPU platform
  • NVIDIA Tesla K40 GPU
  • 15 SMs, 16-48 KB L1 cache, 1.5 MB L2 cache
  • 12 GB GDDR5 memory at 280 GB/s
  • On-board power sensors for power measurement
• Workloads
  • Validation microbenchmarks → compute instruction + data movement operations
  • Real GPU applications from the Rodinia, CORAL, and Stream suites
Tesla K40 Energy Characteristics
Inst / Op                EPI (nJ)      EPT (pJ/bit)
DADD, FFMA               0.15, 0.05    -
IADD, IMAD               0.07, 0.15    -
LOG2, SINE               0.03, 0.10    -
Shd Mem → Reg, L1 → Reg  -             5.32, 5.85
L2 → L1                  -             15.48
DRAM → L2                -             30.55

• EPI is influenced by bit width and functional unit
• EPT is influenced by the level of the memory hierarchy
• DRAM → register costs 9x more than L1 → register
• DRAM → register costs 80x more than floating-point compute
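To see where a roughly 9x figure can come from, assume (a simplification on my part, not stated on the slide) that a DRAM-to-register access pays the per-bit EPT of every hop on the way up the hierarchy. The hop costs below are the table's values:

```python
# Per-bit transaction energies from the Tesla K40 table above (pJ/bit).
ept = {"L1->Reg": 5.85, "L2->L1": 15.48, "DRAM->L2": 30.55}

# Assumption: a DRAM->register access pays every hop on the way up.
dram_to_reg = ept["DRAM->L2"] + ept["L2->L1"] + ept["L1->Reg"]  # 51.88 pJ/bit
l1_to_reg = ept["L1->Reg"]

ratio = dram_to_reg / l1_to_reg
print(round(ratio))  # roughly 9x, consistent with the slide's claim
```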
GPUJoule Accuracy
[Figure, top: estimation error (%) for validation microbenchmarks (FADD64 + Shared Memory, + L1D Cache, + L2 Cache, + DRAM, + L2 Cache + DRAM, and their geomean); 98% average accuracy]

[Figure, bottom: estimation error (%) for real applications (RSBench, BFS, CoMD, MiniAMR, BTREE, Srad-v1, Lulesh-190, Lulesh-150, Srad-v2, Nekbone-18, Nekbone-12, LuleshUns, Kmeans, Hotspot, BP, ROP, Stream, MnCtct, PathF, and their geomean); 90% average accuracy]
Average accuracy of 90% across GPU applications compared to real silicon measurements
Outline
• Introduction and background
• GPU energy estimation framework – GPUJoule
• Energy efficiency scaling metric – EDPSE
• Energy efficiency trends in future multi-module GPUs
• Conclusion
Quantifying Energy Efficiency: EDP Scaling Efficiency
• EDP and ED² are well suited for comparing systems with similar resources
• For strong-scaled systems, we propose Energy-Delay-Product Scaling Efficiency (EDPSE)
  • Evaluates performance, energy cost, and resource scaling together
  • Systems can be expected to achieve an EDPSE threshold in the future
  • 50% EDPSE → "energy efficiency scales to 50% of the ideal with strong scaling"
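The slides do not spell out the formula, so the following is only one plausible reading, stated as an assumption: under ideal strong scaling to N modules, delay drops by N while energy per task stays constant, so the ideal EDP is EDP₁/N, and EDPSE compares that ideal to the EDP measured on N modules. The function name and units below are illustrative.

```python
def edpse(energy_1, delay_1, energy_n, delay_n, n_modules):
    """Hypothetical EDPSE: ideal EDP under strong scaling (energy held
    constant, delay divided by N) over the measured EDP on N modules."""
    ideal_edp = energy_1 * (delay_1 / n_modules)
    measured_edp = energy_n * delay_n
    return 100.0 * ideal_edp / measured_edp  # percent of ideal

# Perfect strong scaling -> 100%.
perfect = edpse(energy_1=100.0, delay_1=8.0, energy_n=100.0, delay_n=1.0, n_modules=8)
# 2x energy and only 4x speedup on 8 modules -> 25%.
poor = edpse(energy_1=100.0, delay_1=8.0, energy_n=200.0, delay_n=2.0, n_modules=8)
```

Under this reading, any energy growth or sublinear speedup pulls EDPSE below 100%, which is exactly the combined effect the metric is meant to expose.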
Outline
• Introduction and background
• GPU energy estimation framework – GPUJoule
• Energy efficiency scaling metric – EDPSE
• Energy efficiency trends in future multi-module GPUs
• Conclusion
Methodology

• Performance simulations:
  • Model GPUs with 1-32 GPU modules (GPMs)
  • Distributed CTA scheduling, first-touch page placement, ring interconnect
• Energy modeling:
  • EPI and EPT values from GPUJoule
  • Augmented with HBM memory and inter-GPM data movement energy costs

BW Config  I/O BW    DRAM BW   I/O : DRAM BW  Integration Domain
1x-BW      128 GB/s  256 GB/s  1:2            On-Board
2x-BW      256 GB/s  256 GB/s  1:1            On-Package
4x-BW      512 GB/s  256 GB/s  2:1            On-Package

                              Energy Cost
HBM DRAM → L2 Cache [1]       21.1 pJ/bit
On-Package Inter-GPM [2]      0.54 pJ/bit
On-Board Inter-GPM [3]        10 pJ/bit

[1] O'Connor et al., "Fine-Grained DRAM: Energy-Efficient DRAM for Extreme Bandwidth Systems", MICRO '17
[2] Poulton et al., "A 0.54 pJ/b 20 Gb/s Ground-Referenced Single-Ended Short-Reach Serial Link in 28 nm CMOS for Advanced Packaging Applications", JSSC '13
[3] Dally, W., "Challenges for Future Computing Systems", Keynote, HiPEAC '15
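Using the per-bit link energies above, the cost of moving data between GPMs can be sketched as follows; the function name and the choice of microjoules are illustrative, but the pJ/bit values come from the table:

```python
# Per-bit inter-GPM link energies from the table above (pJ/bit).
LINK_EPT_PJ_PER_BIT = {"on-package": 0.54, "on-board": 10.0}

def inter_gpm_energy_uj(bytes_moved, domain):
    """Energy (microjoules) to move `bytes_moved` over an inter-GPM link."""
    pj = bytes_moved * 8 * LINK_EPT_PJ_PER_BIT[domain]
    return pj / 1e6  # pJ -> uJ

# Moving 1 MB between GPMs: on-board costs ~18.5x more than on-package.
on_pkg = inter_gpm_energy_uj(1 << 20, "on-package")
on_brd = inter_gpm_energy_uj(1 << 20, "on-board")
```

The 10 / 0.54 ≈ 18.5x gap per bit is what makes the integration domain, not just the link bandwidth, a first-order energy concern for remote accesses.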
EDP Scaling Efficiency of Future GPUs

[Figure: EDPSE (%) vs. GPM count (2, 4, 8, 16, 32) for 2x-BW on-package integration, shown against a 50% reference line; EDPSE drops sharply as GPMs increase]

• EDPSE reduces drastically as the number of GPMs increases
• Multi-module GPUs face energy efficiency limitations at scale
Diminishing Trend in Energy Efficiency Scaling

[Figure, left: incremental speedup at 2, 4, 8, 16, and 32 GPMs, with the monolithic speedup for reference; the speedup gained per added module shrinks]
[Figure, right: incremental energy at 2, 4, 8, 16, and 32 GPMs; the energy cost per added module grows]

• Speedup reduces as the number of modules increases
• Energy cost increases as the number of modules increases

NUMA effects lead to performance loss and energy increase
On-Package Integration and Constant Energy Amortization

[Figure: EDPSE (%) under on-package integration with 0%, 25%, and 50% constant-energy amortization at 1x-BW, and 50% amortization at 2x-BW and 4x-BW]

• Multi-module GPUs suffer from high constant energy overheads
  • VRMs, power delivery network, system I/O, etc.
• On-package integration allows amortization of these overheads

Higher link bandwidth and tighter integration yield better energy efficiency scaling
Speedup & Energy Consumption

[Figure: speedup over a 1-GPM GPU (secondary axis) and energy relative to a 1-GPM GPU, for 1x-, 2x-, and 4x-BW at 2, 4, 8, 16, and 32 GPMs]

• Speedup is dependent on bandwidth
• Energy consumption drops with speedup
• Only increasing GPMs might not help:
  • A 16-GPM GPU with 2x-BW has the same performance as a 32-GPM GPU with 1x-BW
  • and consumes only half the energy!
• Path to an efficient 32-GPM GPU:
  • Increase bandwidth to 4x-BW
  • Utilize on-package integration
  • Reduces energy consumption by 45%
Conclusions
• Developed GPUJoule, an instruction-level GPU energy estimation framework
  • Achieves 90% accuracy compared to real-silicon energy measurements
  • Open sourced at github.com/akhilarunkumar/GPUJoule_release
• Identified key energy efficiency trends in future GPUs
  • Energy efficiency scaling reduces as the number of modules increases
  • NUMA effects lead to suboptimal performance and energy consumption
  • Higher inter-module bandwidth and tighter integration of components (on-package integration) lead to higher energy efficiency
Understanding the Future of Energy Efficiency in Multi-Module GPUs
Thank you
25th IEEE International Symposium on High-Performance Computer Architecture
Akhil Arunkumar*, Evgeny Bolotin#, David Nellans#, Carole-Jean Wu*
*Arizona State University, #NVIDIA
Impact of On-Board Switch
[Figure: EDPSE (%) for on-board integration at 2, 4, 8, 16, and 32 GPMs, comparing a ring interconnect (1x-BW) against an on-board switch at 1x-BW and 2x-BW]