XPDL: Extensible Platform Description Language to Support Energy Modeling and Optimization

Published in ICPP-EMS’15

Christoph Kessler, Lu Li, Ming-jie Yang, Aras Atalar and Alin Dobre
christoph.kessler@liu.se, lu.li@liu.se

Motivation

Optimization:
  Platform-independent: algorithmic improvement
  Platform-dependent: SIMD, GPU, etc.

Platform-dependent optimizations such as SIMD can yield significant performance and energy improvements.
Usually, platform-dependent optimizations are manually tuned or only partly automated.
  This requires understanding of the machine-specific features.
Automating systematic platform-dependent optimizations is both interesting and challenging.
  Adaptivity
  Retargetability
Prerequisite: a formal platform description language that models optimization-relevant platform properties.

Motivation

Platform = Hardware + System Software

Previous work: PDL (Platform Description Language)
  Limitations: flexibility and scalability.

XPDL is designed to overcome those limitations.
Furthermore, new features are added to better support energy optimization, such as power island modeling.

A Hello World XPDL Example

Examples only illustrate XPDL language constructs; they are not meant to be complete.

[Figure (a): A typical CPU structure. Four cores, each with its own L1 cache; two L2 caches, each shared by two cores; one L3 cache.]

    <cpu name="Intel_Xeon_E5_2630L">
      <group prefix="core_group" quantity="2">
        <group prefix="core" quantity="2">
          <!-- Embedded definition -->
          <core frequency="2" frequency_unit="GHz">
            <cache name="L1" size="32" unit="KiB" />
          </core>
        </group>
        <cache name="L2" size="256" unit="KiB" />
      </group>
      <cache name="L3" size="15" unit="MiB" />
      <power_model type="power_model_E5_2630L" />
    </cpu>

(b) The corresponding XPDL code

Modular and extensible

name: defines a meta-model; stores type information
id: defines a model; stores object information

Figure 1: Type reference and inheritance

Control Relation Decoupled From Hardware

    <Master id="0" quantity="1">
      <peppher:PEDescriptor>
        <peppher:Property fixed="true">
          <name>ARCHITECTURE</name>
          <value>x86</value> ...
      <peppher:Worker quantity="1" id="1">
        <peppher:PEDescriptor>
          <peppher:Property fixed="true">
            <name>ARCHITECTURE</name>
            <value>gpu</value> ...
      </peppher:Worker>
    </Master>

Listing 1: PDL example description for x86-core (Master) and gpu (Worker)

    <system id="liu-gpu-server">
      <socket>
        <cpu id="gpu-host" type="Intel-Xeon-E5-2630L" />
      </socket>
      <device id="gpu1" type="Nvidia-K20c" />
      <interconnects>
        <interconnect id="connection1" type="pcie3" head="gpu-host" tail="gpu1" />
      </interconnects>
    </system>

Listing 2: XPDL example description for such a GPU server

System Software Modeling

    <system id="XScluster">
      <cluster>
        <group prefix="n" quantity="4">
          <node>
            <group id="cpu1">
              <socket> <cpu id="PE0" type="Intel-Xeon-..." /> </socket>
            </group>
            <group prefix="main-mem" quantity="4"> <memory type="DDR3-4G" /> </group>
            <device id="gpu1" type="Nvidia-K20c" />
            <interconnects>
              <interconnect id="conn1" type="pcie3" head="cpu1" tail="gpu1" />
            </interconnects>
          </node>
        </group>
        <interconnects>
          <interconnect id="conn3" type="infiniband1" head="n1" tail="n2" />
        </interconnects>
      </cluster>
      <software>
        <hostOS id="linux1" type="Linux-..." />
        <installed type="CUDA-6.0" path="/ext/local/cuda6.0/" />
        <installed type="CUBLAS-..." path="..." />
        <installed type="StarPU-1.0" path="..." />
      </software>
      <properties>
        <property name="ExternalPowerMeter" type="..." command="myscript.sh" />
      </properties>
    </system>

Listing 3: Example of a concrete cluster machine

Power Modeling

A power model in XPDL consists of
  power domains and their power state machines
  microbenchmarks with deployment information

Modeling Power Domains

Power domains: hardware components that must change state together.
E.g., in Movidius Myriad2, each SHAVE core forms a separate power island.

Modeling Power Domains

    <power_domains name="Myriad1-power-domains">
      <!-- this island is the main island -->
      <!-- and cannot be turned off -->
      <power_domain name="main-pd" enableSwitchOff="false">
        <core type="Leon" />
      </power_domain>
      <group name="Shave-pds" quantity="8">
        <power_domain name="Shave-pd">
          <core type="Myriad1-Shave" />
        </power_domain>
      </group>
      <!-- this island can only be turned off -->
      <!-- if all the Shave cores are switched off -->
      <power_domain name="CMX-pd"
                    switchoffCondition="Shave-pds off">
        <memory type="CMX" />
      </power_domain>
    </power_domains>

Listing 4: Example meta-model for power domains of Movidius Myriad1

Modeling Power State Machine

Figure 2: Intel Xscale processor (2000)

    <power_state_machine name="power-state-machine1"
                         power_domain="Intel-Xscale-pd">
      <power_states>
        <power_state name="P1" frequency="150" frequency_unit="MHz"
                     power="60" power_unit="mW" voltage="0.75" voltage_unit="V" />
        <power_state name="P2" frequency="600" frequency_unit="MHz"
                     power="450" power_unit="mW" voltage="1.3" voltage_unit="V" />
        ...
      </power_states>
      <transitions>
        <transition head="P1" tail="P2" time="160" time_unit="us" />
        <transition head="P2" tail="P1" time="160" time_unit="us" />
        ...
      </transitions>
    </power_state_machine>

Listing 5: Example meta-model for a power state machine of Intel Xscale processor (2000)
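
The power state machine of Listing 5 can be read as a table of states plus timed transitions. The following is a minimal C++ sketch of that reading, not the XPDL library API: the types `PowerState` and `PowerStateMachine` and all member names are hypothetical, and treating `head` as the source state and `tail` as the target state is an assumption. The numbers are the ones given in Listing 5.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical in-memory form of Listing 5; not the XPDL C++ API.
struct PowerState {
    double frequency_mhz;
    double power_mw;
    double voltage_v;
};

class PowerStateMachine {
public:
    void add_state(const std::string& name, PowerState s) { states_[name] = s; }

    // Register a transition; head -> tail direction is an assumption.
    void add_transition(const std::string& head, const std::string& tail,
                        double time_us) {
        assert(states_.count(head) && states_.count(tail));
        transitions_[std::make_pair(head, tail)] = time_us;
    }

    bool has_transition(const std::string& head, const std::string& tail) const {
        return transitions_.count(std::make_pair(head, tail)) > 0;
    }

    // Latency of a modeled transition in microseconds (throws if not modeled).
    double transition_time_us(const std::string& head,
                              const std::string& tail) const {
        return transitions_.at(std::make_pair(head, tail));
    }

    // Energy in microjoules for staying `seconds` in state `name`:
    // mW * s = mJ, and 1 mJ = 1000 uJ.
    double residency_energy_uj(const std::string& name, double seconds) const {
        return states_.at(name).power_mw * seconds * 1000.0;
    }

private:
    std::map<std::string, PowerState> states_;
    std::map<std::pair<std::string, std::string>, double> transitions_;
};

// Builds the two states and two transitions shown in Listing 5.
PowerStateMachine make_xscale_example() {
    PowerStateMachine psm;
    psm.add_state("P1", {150.0, 60.0, 0.75});
    psm.add_state("P2", {600.0, 450.0, 1.3});
    psm.add_transition("P1", "P2", 160.0);
    psm.add_transition("P2", "P1", 160.0);
    return psm;
}
```

With such a structure, an optimizer could weigh the energy saved by staying in a lower state against the transition time needed to get there and back.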

Modeling Microbenchmarks With Deployment Information

    <instructions name="x86-base-isa"
                  mb="mb-x86-base-1">
      <inst name="fmul"
            energy="?" energy_unit="pJ" mb="fm1" />
      <inst name="fadd"
            energy="?" energy_unit="pJ" mb="fa1" />
      <inst name="divsd">
        <data frequency="2.8"
              energy="18.625" energy_unit="nJ" />
        <data frequency="2.9"
              energy="19.573" energy_unit="nJ" />
        <data frequency="3.4"
              energy="21.023" energy_unit="nJ" />
      </inst>
    </instructions>

Listing 6: Example meta-model for instruction energy cost
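
Listing 6 stores the energy of `divsd` at a few sampled frequencies. The slides do not say how a value between two samples would be obtained, so the sketch below assumes linear interpolation, clamped at the ends of the measured range. The names `EnergyTable`, `divsd_table`, and `energy_at` are hypothetical, and the frequency is taken in whatever unit the listing uses.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// (frequency, energy in nJ) samples, sorted by frequency.
using EnergyTable = std::vector<std::pair<double, double>>;

// The divsd samples from Listing 6.
EnergyTable divsd_table() {
    return {{2.8, 18.625}, {2.9, 19.573}, {3.4, 21.023}};
}

// Energy at `freq`. Linear interpolation between neighboring samples is an
// assumption; outside the measured range the nearest sample is returned.
double energy_at(const EnergyTable& table, double freq) {
    assert(!table.empty());
    if (freq <= table.front().first) return table.front().second;
    if (freq >= table.back().first) return table.back().second;
    for (std::size_t i = 1; i < table.size(); ++i) {
        if (freq <= table[i].first) {
            const double f0 = table[i - 1].first, e0 = table[i - 1].second;
            const double f1 = table[i].first, e1 = table[i].second;
            return e0 + (e1 - e0) * (freq - f0) / (f1 - f0);
        }
    }
    return table.back().second;  // not reached for a sorted table
}
```

Entries whose energy is still `"?"` in the listing have no samples yet; they would be filled in by running the referenced microbenchmark before a lookup like this makes sense.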

    <microbenchmarks id="mb-x86-base-1"
                     instruction_set="x86-base-isa"
                     path="/usr/local/micr/src"
                     command="mbscript.sh">
      <microbenchmark id="fa1" type="fadd" file="fadd.c"
                      cflags="-O0" lflags="..." />
      <microbenchmark id="mo1" type="mov" file="mov.c"
                      cflags="-O0" lflags="..." />
    </microbenchmarks>

Listing 7: An example model for instruction energy cost

XPDL Toolchain and Microbenchmark Generation

[Figure 3: XPDL Tool Chain Diagram. Components shown: XPDL schema, XPDL XML files, XPDL parser, intermediate representation, XPDL query API, microbenchmark generator, microbenchmark execution, code generator, C++ library, C++ API, applications and tool chain.]

XPDL Runtime Query API

Initialization of the XPDL run-time query library
Functions for browsing the model tree
Functions for looking up attributes of model elements
Model analysis functions for derived attributes
  Static inference, e.g. PCIe bandwidth
  Aggregate numbers, e.g. the total static power or the total number of GPUs
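
The aggregate queries mentioned above amount to a traversal of the model tree. The sketch below illustrates the idea only; `ModelNode`, `count_kind`, and `total_static_power_w` are hypothetical names, the per-element `static_power_w` attribute is an assumption, and the real XPDL query API may look quite different. The power values are placeholders, not measurements.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model-tree node; not the XPDL query API.
struct ModelNode {
    std::string kind;                 // e.g. "system", "cpu", "gpu"
    double static_power_w;            // assumed attribute, 0 if absent
    std::vector<ModelNode> children;
};

// Count all elements of a given kind in the subtree rooted at `node`.
int count_kind(const ModelNode& node, const std::string& kind) {
    int n = (node.kind == kind) ? 1 : 0;
    for (const ModelNode& child : node.children) n += count_kind(child, kind);
    return n;
}

// Sum the static power over the subtree rooted at `node`.
double total_static_power_w(const ModelNode& node) {
    assert(node.static_power_w >= 0.0);
    double sum = node.static_power_w;
    for (const ModelNode& child : node.children) sum += total_static_power_w(child);
    return sum;
}

// A made-up server with one CPU and two GPUs; power values are placeholders.
ModelNode example_system() {
    ModelNode cpu{"cpu", 10.0, {}};
    ModelNode gpu{"gpu", 20.0, {}};
    return ModelNode{"system", 0.0, {cpu, gpu, gpu}};
}
```

For the example tree, `count_kind(example_system(), "gpu")` visits every node once and counts the two GPU elements.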


Summary of Contributions

We propose XPDL, a portable and extensible platform description language
  Modular and extensible
  Software roles decoupled from hardware structure
  System software modeled
  Microbenchmark generation
  Open source tool chain
  http://www.ida.liu.se/labs/pelab/xpdl/


MeterPU:

A Generic Measurement Abstraction API

– Enabling Energy-tuned Skeleton Backend Selection

Published in Journal of Supercomputing, 2016

Lu Li, Christoph Kessler
lu.li@liu.se, christoph.kessler@liu.se

Motivation

Energy measurement
  Necessary for energy modeling and optimization.
  Difficult, especially for GPUs.

[Figure: power visualization (Li et al., 2015 [2]) for a CUDA program, illustrating the "capacitor effect". Data obtained with Nvidia NVML. Curves: power (mW, axis scaled by 10^4), temperature, and fan speed over time (s).]

  Requires numerical postprocessing (Burtscher et al., 2014 [1]) to calculate the real energy cost.

Goal: make energy measurement as simple as measuring time was in the good old days.
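
To see why postprocessing is needed at all: a power sensor delivers samples over time, and energy is the integral of power over the measured interval. The sketch below shows only that basic step with the trapezoidal rule. It is not the sensor-correction method of Burtscher et al. [1], and the names `PowerSample` and `energy_mj` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One sampled reading: timestamp in seconds, power in milliwatts.
struct PowerSample {
    double time_s;
    double power_mw;
};

// Integrate sampled power over time with the trapezoidal rule.
// Returns energy in millijoules (mW * s). Samples must be time-ordered.
double energy_mj(const std::vector<PowerSample>& samples) {
    double energy = 0.0;
    for (std::size_t i = 1; i < samples.size(); ++i) {
        const double dt = samples[i].time_s - samples[i - 1].time_s;
        assert(dt >= 0.0);
        energy += 0.5 * (samples[i].power_mw + samples[i - 1].power_mw) * dt;
    }
    return energy;
}
```

Three samples of 1000 mW, 1000 mW, and 3000 mW taken one second apart integrate to 1000 mJ + 2000 mJ = 3000 mJ.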

MeterPU

A software multimeter
A generic measurement abstraction that hides platform-specific measurement details.
Four simple functions unify the measurement interface for various metrics on different hardware components.
  Time and energy on CPU and GPU
  Easy to extend: FLOPS, cache misses, etc.

Built on top of native measurement libraries (plug-ins).
  CPU time: clock_gettime()
  GPU time: cudaEventRecord()
  CPU and DRAM energy: Intel PCM library
  GPU energy: Nvidia NVML library
  Wattsup_Energy: hardware measurement
  ...

MeterPU API

    template<class Type>  // analogous to the switch on a real multimeter
    class Meter
    {
    public:
        void start();  // start a measurement
        void stop();   // stop a measurement
        void calc();   // calculate a metric value of a code region

        typename Meter_Traits<Type>::ResultType const &
        get_value() const;  // get the calculated metric value

    private:
        // Platform-specific measurement details hidden ...
    };
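
To make the four-function interface concrete, here is a minimal, self-contained sketch of one meter type that follows the same shape. It is not MeterPU's implementation: the tag `Wall_Time` is hypothetical, and using `std::chrono::steady_clock` as the backend is an assumption chosen so the example needs no external library.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical meter tag; MeterPU's own meter types may be built differently.
struct Wall_Time {};

template <class Type>
struct Meter_Traits;

// Traits bind a meter tag to its result type and measurement backend.
template <>
struct Meter_Traits<Wall_Time> {
    using ResultType = double;  // seconds
    using Clock = std::chrono::steady_clock;
};

template <class Type>
class Meter;

template <>
class Meter<Wall_Time> {
    using Traits = Meter_Traits<Wall_Time>;

public:
    void start() { begin_ = Traits::Clock::now(); running_ = true; }
    void stop() {
        assert(running_);  // stop() must follow start()
        end_ = Traits::Clock::now();
        running_ = false;
    }
    // Convert the two recorded time points into elapsed seconds.
    void calc() { value_ = std::chrono::duration<double>(end_ - begin_).count(); }
    Traits::ResultType const& get_value() const { return value_; }

private:
    Traits::Clock::time_point begin_, end_;
    Traits::ResultType value_ = 0.0;
    bool running_ = false;
};
```

Keeping the result type in a traits class is what lets a caller write the same `start`/`stop`/`calc`/`get_value` sequence for meters whose results are not plain doubles.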


A MeterPU Application: Measure CPU Time

    #include <MeterPU.h>

    int main()
    {
        using namespace MeterPU;
        Meter<CPU_Time> meter;

        meter.start();  // Measurement start!

        cpu_func();     // Do something here

        meter.stop();   // Measurement stop!

        meter.calc();
        BUILD_CPU_TIME_MODEL(meter.get_value());
    }

A MeterPU Application: Measure GPU Energy

    #include <MeterPU.h>

    int main()
    {
        using namespace MeterPU;
        // Only one line differs!
        Meter<NVML_Energy<> > meter;

        meter.start();  // Measurement start!

        cuda_func<<<..., ...>>>(...);
        cudaDeviceSynchronize();

        meter.stop();   // Measurement stop!

        meter.calc();
        BUILD_GPU_ENERGY_MODEL(meter.get_value());
    }

Measure Combinations of CPUs and GPUs

    #include <MeterPU.h>

    int main()
    {
        using namespace MeterPU;
        // Only one line differs!
        Meter<System_Energy<0> > meter;

        meter.start();  // Measurement start!

        async_cpu_func();
        cuda_func<<<..., ...>>>(...);
        wait_for_cpu_func_to_finish();
        cudaDeviceSynchronize();

        meter.stop();   // Measurement stop!

        meter.calc();
        BUILD_SYSTEM_ENERGY_MODEL(meter.get_value());
    }

Unification → Reuse Legacy Autotuning Framework

Figure 1: Unification allows an empirical autotuning framework to switch to multiple meter types on different hardware components.

SkePU Reduce Skeleton (A Single Skeleton Call)

[Figure (a): Time tuning for Reduce. Time (us), average of 100 runs, versus problem size for reduce. Series: CPU, OMP, CUDA, Selection.]

[Figure (b): Energy tuning for Reduce. Energy (milliJ), average of 1000 runs, versus problem size for reduce. Series: CPU, OMP, CUDA, Selection.]

For further results, see the paper.
The empirical autotuning framework built for time optimization is reused for energy optimization.
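
The reason the same framework can serve both goals is that selection only needs a scalar cost per backend, whatever the meter behind it. The sketch below is a deliberately naive illustration of that point and not SkePU's tuning algorithm: `Backend` and `select_backend` are hypothetical names, and each callable stands in for "run the backend once and return `meter.get_value()`".

```cpp
#include <cassert>
#include <functional>
#include <limits>
#include <string>
#include <utility>
#include <vector>

// A candidate backend: a name plus a callable that runs the backend once and
// returns the measured cost (seconds, joules, ... depending on the meter).
using Backend = std::pair<std::string, std::function<double()>>;

// Run every backend `runs` times and return the name of the backend with the
// lowest average cost. The cost metric is opaque to this function.
std::string select_backend(const std::vector<Backend>& backends, int runs) {
    assert(!backends.empty() && runs > 0);
    std::string best;
    double best_cost = std::numeric_limits<double>::infinity();
    for (const Backend& b : backends) {
        double total = 0.0;
        for (int r = 0; r < runs; ++r) total += b.second();
        const double avg = total / runs;
        if (avg < best_cost) {
            best_cost = avg;
            best = b.first;
        }
    }
    return best;
}
```

Switching from time tuning to energy tuning then means changing what the callables measure, while `select_backend` stays untouched.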


LU decomposition (Multiple Skeleton Calls)

[Figure (a): Time tuning for LU decomposition. Time (us), average of 3000 runs, versus problem size for LU decomposition. Series: CPU, OMP, CUDA, Selection.]

[Figure (b): Energy tuning for LU decomposition. Energy (milliJ), average of 1000 runs, versus problem size for LU decomposition. Series: CPU, OMP, CUDA, Selection.]

Easy switching between optimization goals.

Summary of Contributions

MeterPU hides the complexity of platform-specific energy measurement, especially for GPUs.
MeterPU enables reuse of legacy empirical autotuning frameworks, such as the one in SkePU.
With MeterPU, SkePU offers the first energy-tuned skeletons, as far as we know.
Switching the optimization goal becomes easy, which facilitates building time and energy models for multi-objective optimization.

Open source, download at: http://www.ida.liu.se/labs/pelab/meterpu

Bibliography

[1] Martin Burtscher, Ivan Zecena, and Ziliang Zong. Measuring GPU power with the K20 built-in sensor. In Proc. Workshop on General Purpose Processing Using GPUs (GPGPU-7). ACM, March 2014.

[2] Lu Li and Christoph Kessler. Validating Energy Compositionality of GPU Computations. In Proc. HIPEAC Workshop on Energy Efficiency with Heterogeneous Computing (EEHCO-2015), January 2015.

VectorPU: A Generic and Efficient Data-Container and Component Model for Transparent Data Transfer on GPU-based Heterogeneous Systems

Lu Li, Christoph Kessler
lu.li@liu.se, christoph.kessler@liu.se

Benchmark Results

[Figure 1: Conjugate Gradient Benchmark. A comparison with Nvidia's UM. Y-axis: speedup to unified memory by VectorPU (0 to 4). X-axis labels as extracted: Laptop A AGC Triolith.]