New Developments for PAPI 5.6+
Anthony Danalis, Heike Jagode, and Jack Dongarra

University of Tennessee, Knoxville
Vince Weaver and Yan Liu

University of Maine

INTRODUCTION

PART 1: PAPI for Power‐Aware Computing

PART 2: PAPI’s Counter Inspection Toolkit (CIT)

PART 3: Modernizing PAPI Infrastructure

SUPPORTED ARCHITECTURES

3RD PARTY TOOLS APPLYING PAPI

• PAPI provides a consistent interface (and methodology) for hardware performance counters, found across a compute system: i.e., CPUs, GPUs, on- and off-chip memory, interconnects, I/O system, file system, energy/power, etc.

• PAPI enables software engineers to see, in near real time, the relationship between software performance and hardware events across the entire compute system.

Intel Knights Landing: FLAT mode
• The entire MCDRAM is used as addressable memory ➔ memory allocations are treated similarly to DDR4 memory allocations.
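
Under FLAT mode the choice of memory is made at allocation time. Below is a minimal sketch of one common way to steer a buffer into MCDRAM, assuming the memkind library (hbwmalloc interface) is installed; numactl --membind is an alternative for whole-process placement. The buffer size is illustrative.

    /* Sketch only: place a buffer in MCDRAM via memkind's hbwmalloc API.
       Compile with -lmemkind.  Falls back to DDR4 if no HBM is present. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <hbwmalloc.h>

    int main(void)
    {
        size_t n = 1 << 24;                          /* illustrative size */
        int have_hbw = (hbw_check_available() == 0);
        double *a = have_hbw ? hbw_malloc(n * sizeof(double))
                             : malloc(n * sizeof(double));
        if (a == NULL) return 1;

        /* ... use the buffer exactly like any other allocation ... */

        if (have_hbw) hbw_free(a); else free(a);
        return 0;
    }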

PAPI for power-aware computing
• We use PAPI’s latest powercap component for measurement, control, and performance analysis.

• In the past, PAPI power components supported only reading power information.

• The new component exposes RAPL functionality, allowing users to both read and write power (see the sketch after this list).

• Study numerical building blocks of varying computational intensity.

• Use PAPI powercap component to detect power optimization opportunities.

• Cap the power on the architecture to reduce power usage while keeping the execution time constant ➔ energy savings.
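
A minimal sketch of the read/write flow with the powercap component follows. The event name is a placeholder, not guaranteed; the writable power-limit events actually exposed on a given machine are listed by papi_native_avail.

    /* Sketch only: read the current package power limit through PAPI's
       powercap component, then lower it.  The event name is hypothetical --
       check papi_native_avail for the real names on your system. */
    #include <stdio.h>
    #include <papi.h>

    int main(void)
    {
        int es = PAPI_NULL;
        long long limit_uw;

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return 1;
        PAPI_create_eventset(&es);
        PAPI_add_named_event(es, "powercap:::POWER_LIMIT_A_UW:ZONE0");  /* placeholder name */
        PAPI_start(es);

        PAPI_read(es, &limit_uw);
        printf("current cap: %lld uW\n", limit_uw);

        limit_uw = 140LL * 1000000;          /* drop the cap to 140 W */
        PAPI_write(es, &limit_uw);           /* powercap limit events are writable */

        /* ... run the kernel of interest, measure time and energy, repeat ... */

        PAPI_stop(es, &limit_uw);
        return 0;
    }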

Key Concepts
Goal:

• Create a set of micro-benchmarks that illustrate the details of hardware events and how they relate to the behavior of the micro-architecture.

Target audience:

• Performance-conscious application developers

• PAPI developers working on new architectures (think preset events)

• Developers interested in validating hardware-event counters

QUESTION

How many branch instructions will these codes execute per iteration?

The expectation is that both codes will execute 2.

The measured answer is 2 for the first and 2.5 for the second!

Can you guess why?
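
The micro-benchmark codes themselves are not reproduced here; the sketch below only shows how such a per-iteration branch count is obtained with the PAPI_BR_INS preset, using a stand-in loop body.

    /* Sketch only: measure branch instructions per loop iteration.
       The loop body is a placeholder for the CIT micro-benchmarks. */
    #include <stdio.h>
    #include <papi.h>

    #define ITERS 1000000

    int main(void)
    {
        int es = PAPI_NULL;
        long long branches;
        volatile int sink = 0;

        PAPI_library_init(PAPI_VER_CURRENT);
        PAPI_create_eventset(&es);
        PAPI_add_event(es, PAPI_BR_INS);     /* preset: branch instructions */
        PAPI_start(es);

        for (int i = 0; i < ITERS; i++)
            if (i & 1) sink++;               /* stand-in conditional */

        PAPI_stop(es, &branches);
        printf("branches per iteration: %.2f\n", (double)branches / ITERS);
        return 0;
    }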

Improved PAPI Test Infrastructure
• The existing PAPI test suite is used to test the correctness of PAPI before release.

• The hardware and operating systems used by PAPI are always changing, and some of the existing tests were outdated or gave false negatives.

• Existing tests were checked to ensure accurate results on modern hardware.

• New counter validation tests were created which should provide a sanity check when bringing up support for a new processor architecture.
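
The new validation tests follow this general pattern (a simplified sketch, not the actual tests shipped with PAPI): run a kernel whose event count is predictable and flag results that fall outside a tolerance.

    /* Sketch only: sanity-check PAPI_TOT_INS against a loop with a
       predictable instruction count.  Bounds are illustrative. */
    #include <stdio.h>
    #include <papi.h>

    #define ITERS 1000000LL

    int main(void)
    {
        int es = PAPI_NULL;
        long long ins;
        volatile long long sum = 0;

        PAPI_library_init(PAPI_VER_CURRENT);
        PAPI_create_eventset(&es);
        PAPI_add_event(es, PAPI_TOT_INS);
        PAPI_start(es);

        for (long long i = 0; i < ITERS; i++)
            sum += i;                        /* small, predictable body */

        PAPI_stop(es, &ins);

        if (ins < ITERS || ins > 20 * ITERS)
            printf("FAIL: %lld instructions measured\n", ins);
        else
            printf("PASS: %lld instructions measured\n", ins);
        return 0;
    }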

Low-Overhead PAPI_read() support
• Traditionally, PAPI_read() counter reads went through the standard Linux read() system call, which can be slow (around 1,000 cycles).

• x86 hardware supports the userspace rdpmc instruction, which bypasses the kernel and takes roughly 200 cycles per read (a 5× speedup).

• Various bugs in the Linux kernel around this interface were found and fixed so that rdpmc can be enabled by default.
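
A minimal sketch of how the per-read cost can be observed from user code; with a kernel that permits rdpmc and a recent PAPI, the average should drop from roughly 1,000 cycles to a few hundred.

    /* Sketch only: average the cost of PAPI_read() over many calls. */
    #include <stdio.h>
    #include <papi.h>

    #define N 100000

    int main(void)
    {
        int es = PAPI_NULL;
        long long v, t0, t1;

        PAPI_library_init(PAPI_VER_CURRENT);
        PAPI_create_eventset(&es);
        PAPI_add_event(es, PAPI_TOT_INS);
        PAPI_start(es);

        t0 = PAPI_get_real_cyc();
        for (int i = 0; i < N; i++)
            PAPI_read(es, &v);
        t1 = PAPI_get_real_cyc();

        printf("average PAPI_read() latency: %lld cycles\n", (t1 - t0) / N);
        PAPI_stop(es, &v);
        return 0;
    }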

Enhanced Sampling Interface
• PAPI currently has a limited counter sampling interface that only allows gathering the instruction pointer at regular intervals.

• Modern processors support much richer sampling information, including the cause of cache misses, where in the cache hierarchy the miss happened, and the cycles taken.

• We extend the PAPI sampling interface to provide this additional sampling information.
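
For reference, the existing interface is driven through PAPI_overflow(): a handler fires every N occurrences of an event and receives little more than the sampled instruction pointer, as in the sketch below; the enhanced interface enriches what arrives in the handler.

    /* Sketch of the current sampling path: periodic overflow delivers
       only the sampled instruction pointer. */
    #include <stdio.h>
    #include <papi.h>

    static void handler(int es, void *address, long long vec, void *ctx)
    {
        (void)es; (void)vec; (void)ctx;
        printf("sample at IP %p\n", address);
    }

    int main(void)
    {
        int es = PAPI_NULL;
        long long v;

        PAPI_library_init(PAPI_VER_CURRENT);
        PAPI_create_eventset(&es);
        PAPI_add_event(es, PAPI_TOT_CYC);
        PAPI_overflow(es, PAPI_TOT_CYC, 1000000, 0, handler);  /* sample every 1M cycles */
        PAPI_start(es);

        /* ... application work being sampled ... */

        PAPI_stop(es, &v);
        return 0;
    }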

Heike Jagode, Azzam Haidar, Phil Vaccaro, Asim YarKhan, and Stan Tomov

ACKNOWLEDGMENTS
This material is based upon work supported in part by the National Science Foundation (NSF) under award No. 1450429. A portion of this research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration.

Vince Weaver and Yan Liu
Anthony Danalis, Heike Jagode, and Hanumanthrayappa

Power awareness for Level-3 BLAS: DGEMM
68-core KNL, Peak DP = 2,662 Gflop/s; Bandwidth: MCDRAM ~425 GB/s, DDR4 ~90 GB/s

• DGEMM is run repeatedly for a fixed matrix size of 18K per iteration.

• For each iteration, we dropped the power cap in steps of 10 or 20 Watts.

• Powercap achieves its power limiting through DVFS scaling; see the frequency curve, which drops in step with the decrease in power.

Power awareness for Level-2 BLAS: DGEMV
68-core KNL, Peak DP = 2,662 Gflop/s; Bandwidth: MCDRAM ~425 GB/s, DDR4 ~90 GB/s

• DGEMV is run repeatedly for a fixed matrix size of 18K per iteration.
• For each iteration, we dropped the power cap in steps of 10 or 20 Watts.
• Frequency is not affected until the power caps kick in.

Power awareness for applications: Jacobi
• Solving the Helmholtz equation with the Jacobi iterative method (grid size 12,800 x 12,800, 2D 5-point stencil)
  ➔ requires multiple memory accesses per update
  ➔ high memory bandwidth and low computational intensity
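
For reference, a generic 2D 5-point Jacobi sweep looks like the sketch below (the Helmholtz variant adds coefficient terms; this is not the poster's exact kernel). Each update touches five neighboring values and writes one, which is why the kernel is bandwidth bound with very low arithmetic intensity.

    /* Sketch only: generic 5-point Jacobi sweep on an n-by-n grid. */
    void jacobi_sweep(int n, const double *u, double *unew, const double *f, double h)
    {
        for (int i = 1; i < n - 1; i++)
            for (int j = 1; j < n - 1; j++)
                unew[i*n + j] = 0.25 * (u[(i-1)*n + j] + u[(i+1)*n + j] +
                                        u[i*n + j-1]   + u[i*n + j+1] -
                                        h * h * f[i*n + j]);   /* 5 loads, 1 store */
    }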

Lesson for DGEMM-type operations (compute intensive):
• Both data allocation schemes provide roughly the same performance. This is caused by the high ratio of FLOPs to bytes in the compute-intensive routines, which allows for as much cache reuse as possible while hiding the data transfer between memory and cache levels.

• Performance drops with decreasing power caps, which is expected for compute-intensive kernels.

Lesson for DGEMV-type operations (memory bound):
• Performance varies between the two storage options: allocating data in MCDRAM results in much higher performance for memory-bound kernels (MCDRAM bandwidth is ~425 GB/s vs. ~90 GB/s for DDR4).
• DGEMV is bound by the reading of matrix A; depending on its storage location, performance and power may be affected.
• We see lower package power usage but higher DDR4 power usage.
• Capping at 140 W improves energy efficiency by ~10%.
• Capping at 120 W results in a 17% energy savings with some performance degradation (8%).
• Capping at 180 W yields ~15% energy savings with a negligible performance penalty (less than 5%).

Lesson for Jacobi iteration:
• Computation is about 3.5X faster when the data is allocated in MCDRAM compared to DDR4.

• MCDRAM: Capping at 170 Watts improves energy efficiency by ~14% without any loss in time to solution.

• DDR4: capping at 135 Watts improves energy efficiency by ~25% without any loss in time to solution.

Power awareness for applications: Lattice-Boltzmann
• Lattice-Boltzmann simulation of computational fluid dynamics (from the SPEC 2006 benchmark)
  ➔ high memory bandwidth and low computational intensity

Lesson for Lattice-Boltzmann:
• Computation is about 3.6X faster when the data is allocated in MCDRAM compared to DDR4.

• For MCDRAM, capping at 190 Watts results in energy savings of ~6% without loss of time to solution.

• For DDR4, capping at 130 Watts improves energy efficiency by ~12%.

Native Event Characterization

Highlighting non-obvious behavior

ARM Cortex A8, A9, A15, ARM64
Cray: Gemini and Aries interconnect, power
IBM Blue Gene Series, Q: 5-D Torus, I/O system, EMON power, energy
NVIDIA Tesla, Kepler: CUDA support for multiple GPUs; PC Sampling; NVML
Virtual environments
IBM POWER Series
Intel Westmere, Sandy/Ivy Bridge, Haswell, Broadwell, Skylake(-X), Kaby Lake
RAPL (power/energy), power capping

PaRSEC (UTK): http://icl.utk.edu/parsec/
HPCToolkit (Rice University): http://hpctoolkit.org/
TAU (University of Oregon): http://www.cs.uoregon.edu/research/tau/
Scalasca (FZ Juelich, TU Darmstadt): http://scalasca.org/
Vampir (TU Dresden): http://www.vampir.eu/
PerfSuite (NCSA): http://perfsuite.ncsa.uiuc.edu/
Open|Speedshop (SGI): http://oss.sgi.com/projects/openspeedshop/
SvPablo (RENCI at UNC): http://www.renci.org/research/pablo/
ompP (UTK): http://www.ompp-tool.com/
Score-P: http://score-p.org/

KNC, Knights Landing, including power/energy

[Figure: Counter Inspection Toolkit micro-benchmark results. Four panels, each plotting average count per 100 accesses against buffer size in KB: (a) L1/L2/L3 HIT events and (b) L1/L2/L3 MISS events for buffer sizes 16 KB to 65,536 KB; (c) counts for access unit sizes of 64 B, 128 B, and 256 B for buffer sizes 16 KB to 4,096 KB; (d) counts for ppb = 4, 8, 16, 32, 33, 34, and 35 for buffer sizes 256 KB to 3,840 KB.]

TDP = 215W

Boxplot showing read latency for various versions of PAPI and the large improvement by using rdpmc

Comparison of historical performance counter interfaces (perfmon2, perfctr) showing that perf_event rdpmc matches even the best historical interface.

[Figure: DGEMM power-capping runs on KNL, one panel per data placement (DDR4 and MCDRAM). Axes: Time (sec) vs. Average power (Watts); plotted quantities: Accelerator Power Usage (PACKAGE), Memory Power Usage (DDR4), Max power limit set, and Frequency (MHz). Annotations per power-cap step (Performance in Gflop/s, Gflops/Watt, Joules): panel 1: 1991/8.22/1383, 1901/8.36/1341, 1785/8.58/1314, 1615/8.57/1325, 1340/8.02/1419, 1154/7.36/1562, 971/6.66/1715, 785/5.81/1982, 575/4.64/2489; panel 2: 1997/8.82/1303, 1904/8.92/1279, 1741/8.95/1267, 1589/9.04/1253, 1328/8.47/1345, 1137/7.73/1480, 956/6.95/1661, 773/6.03/1904, 560/4.72/2443.]

[Figure: DGEMV power-capping runs on KNL, one panel per data placement. Axes: Time (sec) vs. Average power (Watts); plotted quantities: Accelerator Power Usage (PACKAGE), Memory Power Usage (DDR4), Max power limit set, and Frequency (MHz). Annotations per power-cap step (Performance in Gflop/s, Achieved Bandwidth in GB/s, Gflops/Watt, Joules): DDR4 panel: 21/85/0.12/2588, 21/85/0.12/2623, 21/85/0.12/2639, 21/85/0.12/2664, 21/85/0.12/2661, 21/84/0.13/2558, 21/82/0.13/2432, 20/82/0.14/2305, 19/78/0.14/2240; MCDRAM panel: 84/335/0.39/815, 83/331/0.39/805, 82/328/0.42/717, 79/317/0.45/686, 65/261/0.42/745, 56/225/0.38/822, 47/188/0.34/909, 38/150/0.29/1077, 28/111/0.24/1361.]

[Figure: Jacobi power-capping runs, one panel per data placement. Axes: Time (sec) vs. Average power (Watts). DDR4 panel: curves for caps of 215, 200, 180, 160, 140, and 120 Watts, with annotated total energies of 2770, 2878, 2689, 2438, 2137, and 2731 joules. MCDRAM panel: curves for the same caps, with annotated total energies of 826, 842, 724, 803, 1066, and 1865 joules.]

[Figure: Lattice-Boltzmann power-capping runs, one panel per data placement. Axes: Time (sec) vs. Average power (Watts). DDR4 panel: curves for caps of 270, 215, 200, 180, 160, 140, and 120 Watts, with annotated total energies of 9064, 9196, 9194, 9178, 9005, 8084, and 8467 joules. MCDRAM panel: curves for the same caps, with annotated total energies of 3050, 2981, 2872, 3043, 3383, 4033, and 5734 joules.]