
Comput Sci Res Dev (2010) 25: 165–175
DOI 10.1007/s00450-010-0120-6

SPECIAL ISSUE PAPER

Simulation of power consumption of energy efficient cluster hardware

Timo Minartz · Julian M. Kunkel · Thomas Ludwig

Published online: 29 July 2010
© Springer-Verlag 2010

Abstract In recent years the power consumption of high-performance computing clusters has become a growing problem because the number and size of cluster installations have been rising. The high power consumption of clusters is a consequence of their design goal: high performance. With low utilization, cluster hardware consumes nearly as much energy as when it is fully utilized. Theoretically, in these low utilization phases cluster hardware can be turned off or switched to a lower power consuming state.

We designed a model to estimate the power consumption of hardware based on its utilization. Applications are instrumented to create utilization trace files for a simulator realizing this model. Different hardware components can be simulated using multiple estimation strategies. An optimal strategy determines an upper bound of energy savings for existing hardware without affecting the time-to-solution. Additionally, the simulator can estimate the power consumption of efficient hardware which is energy-proportional. This way the minimum power consumption can be determined for a given application. Naturally, this minimal power consumption provides an upper bound for any power saving strategy.

After evaluating the correctness of the simulator, several different strategies and energy-proportional hardware are compared.

T. Minartz (✉) · T. Ludwig
Department of Informatics, University of Hamburg, 22527 Hamburg, Germany
e-mail: [email protected]

T. Ludwig
e-mail: [email protected]

J.M. Kunkel
Deutsches Klimarechenzentrum GmbH, 20146 Hamburg, Germany
e-mail: [email protected]

Keywords Simulation · Energy-to-solution · Power consumption · HPC

1 Introduction

The high-performance design goal of clusters leads to hardware without power saving mechanisms and to worst case cooling scenarios. If cluster components are fully utilized, this is the most energy efficient way to obtain this computing power. But with a low utilization the hardware consumes nearly as much energy as when it is fully utilized. Therefore, there exist multiple approaches to reduce low utilization phases, but these phases still arise due to hardware bottlenecks, unbalanced load or sequential phases of a program.

In these low utilization phases cluster hardware can be turned off or switched to a low-power mode if it supports such modes. This approach is not new and is discussed in the related work (Sect. 2).

Based on this approach, a model is designed in Sect. 3 to estimate the power consumption of the hardware with and without energy saving mechanisms. With trace files containing the component utilization and the hardware power characteristics, the power consumption of each component can be estimated.

Furthermore, energy aware hardware is simulated to determine upper bounds for energy savings without performance degradation. This simulation is based on several strategies for switching to modes with different power consumption in low utilization phases. The power consumption estimated under the different strategies is analyzed for different program and hardware configurations to show the potential of this approach.

The model and the resulting software are evaluated in Sect. 5 after describing the methodology for generating trace files in Sect. 4.





2 Related work

There has been much research performed in the area of energy consumption in high-performance computing. This section gives an overview and discusses the most closely related research.

The general facility to reduce energy consumption using hardware supporting multiple operating states (such as DVFS) is introduced in [21] (shutting down devices based on information about tasks from the kernel). Ge et al. classified the impact of using DVFS in [12] for different application types and focused on general characteristics of power-performance metrics and micro benchmarks [11].

There are multiple approaches to reduce the energy consumption of HPC applications; only a few are characterized in the following. When phases with low utilization are detected, e.g. using load balancing [24], exploiting the inter-node slack [18] or bottleneck detection in general [9], the hardware can be turned off to save energy. For specific applications it is also possible that executing the application with a larger number of power scalable nodes at a lower frequency saves energy [9]. A similar approach is to build a cluster from more energy efficient hardware (like hardware for mobile devices), which increases TTS (time-to-solution) but can decrease overall energy consumption [26]. An adequate metric would be ETS (energy-to-solution).

The main problem when using power scalable hardware is to identify the phases of low utilization. There exist multiple approaches, partly on-line (just-in-time) and partly off-line.

Traces have been divided into phases using performance counters, and a solution with specific gears (DVFS settings) for the program phases has been calculated via a heuristic, both off-line [10] and on-line [13].

Hsu et al. use the current MIPS of the processor to identify the low utilization phases [14], while Huang et al. characterize the actual workload [15] (both on-line).

Scaling down the processors on-line during the communication phases of an MPI program has been analyzed [20], as has finding off-line a schedule that realizes energy bounds by dividing the application trace into sections according to its communication phases [25].

Naturally, the on-line (and some of the off-line) algorithms have an impact on the applications’ performance. Each of the algorithms is rated by its energy consumption and its performance impact; an upper bound for energy savings has, as far as we know, not yet been specified.

To estimate the energy consumption, most of the existing simulators (such as [4] and [5]) use the performance counters of the processor.

Freeh et al. developed a model to predict the energy-time trade-off of larger clusters, estimating the idle times using regression to fit a curve to the measured communication [9].

Etinski et al. use a trace file as input for Dimemas (a performance simulator) and scale the processors depending on their load (using DVFS and over-clocking to prevent load imbalance) [7]. They do not model the energy consumption; they argue from the reduced overall run time of the application and therefore cannot quantify the energy savings.

Feng et al. already developed a power profiling tool on component level [8], but for clusters in a production environment the measuring points may not be applicable because of the high hardware density of the nodes and components. Further, measuring the power consumption of a larger number of nodes does not seem practical.

This simulator differs from the mentioned simulators because the component utilization for estimating the power consumption is extracted from the system’s kernel using Libgtop, not from performance counters. The component power characteristics are determined using micro benchmarks and vendor information (as in most of the mentioned simulators).

With our analytical model, power estimation for the whole cluster (and a breakdown to its components) is possible, which allows the ETS of parallel applications to be identified.

Because the components’ future utilization is known from trace files, several look-ahead strategies are designed to calculate upper bounds for energy savings under multiple constraints. By simulating hardware which is energy-proportional [3], it is possible to estimate the power consumption of specific applications without hardware overhead.

3 Model of cluster power consumption

This section describes the designed model of cluster power consumption. After analyzing some general assumptions, each of the mentioned strategies is described in detail.

In general, the power consumption of a cluster environment is the sum of the power consumption of its components (nodes, switches and HVAC¹). The power consumption of the HVAC depends on the power consumption of the other components: every watt needed by a component results in waste heat, which has to be removed by cooling. Therefore, total power consumption can be estimated by multiplying the power consumption of the cluster with the PUE² of the whole environment. For future installations the PUE will decrease through approaches like server housing or free cooling, which are already established in datacenters (e.g. Google’s datacenters reach a PUE of about 1.2, while non-HPC compute centers have PUEs of 2 and more).

Power consumption of today’s switches used in the HPC environment is nearly independent of the network utilization. The switch consumes as much power when active as when it is inactive. Hence, the power consumption of a switch is a constant, which can be added to the node’s power consumption. For this reason, the estimation of the node’s consumption is the main challenge. The power consumption of a node is the sum of its components’ power consumption. Some of these hardware components (like the CPU or the power supply) have a bigger effect on the power consumption than other components. For most components the power consumption depends on the utilization (a calculating CPU uses more energy than an idle one).

¹ Heating, Ventilating and Air Conditioning.
² Power Usage Effectiveness.

The chosen model contains abstractions for the following node components: CPU, memory, disk, NIC and the power supply. It is possible to include a percentage overhead for other components, such as the main board, if known. To estimate the power consumption for a given utilization, the power consumption of the component is interpolated linearly between two values: zero utilization and full utilization.

In reality, the power consumption/utilization graph is not linear and the characteristics are different for each component. But these characteristics are not easy to obtain: the majority of hardware vendors do not publish this information. Measuring such values for a specific component requires special equipment. The power consumption of a CPU (without main board) at different utilization values is not easily measurable. Hence, only the two values for high and low utilization are used. This model will be extended in the future using a power profiling tool like PowerPack [8].

Further, the power consumption of the different states is needed for the model. The modeling of these states is based on the ACPI standard [6]. Moving to a higher state decreases the energy consumption and increases the wake-up time (latency) of the specified component.

The power consumption for ACPI Device Power State 0 is, as already mentioned, interpolated linearly. For all other, non-working states (1–3) no interpolation has to be done, because the components are in a sleeping mode in which the component is not utilized.

Different working states, in which for example the CPU is used at a lower frequency (ACPI performance states, P-States), are not modeled here, because the trace file does not contain this information and the cluster’s hardware does not support this feature. To switch to a different ACPI Device Power State two further values are needed: the duration and the energy consumption. If a disk is switched from ACPI Device Power State 3 (deeper sleep) to ACPI Device Power State 0 (working), this takes a few seconds. Additionally, some energy is needed, e.g. to spin up the disk to the required revolutions per minute. These values depend on the component and the concrete change, so individual values can be specified for every state change.

Switching between two states is realized by sequentially incrementing or decrementing states until the desired state is reached.
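The stepwise switching described above can be illustrated with a minimal sketch. The function name `transition_cost`, the dictionary layout and all cost values are hypothetical and chosen only for illustration; they are not taken from the simulator.

```python
def transition_cost(costs, start, target):
    """Accumulate (seconds, joules) while stepping one ACPI state at a time."""
    step = 1 if target > start else -1
    total_time, total_energy = 0.0, 0.0
    state = start
    while state != target:
        seconds, joules = costs[(state, state + step)]
        total_time += seconds
        total_energy += joules
        state += step
    return total_time, total_energy


# Hypothetical per-step costs: (from_state, to_state) -> (seconds, joules).
example_costs = {
    (0, 1): (0.25, 0.5), (1, 2): (0.25, 1.0), (2, 3): (0.5, 2.0),
    (3, 2): (2.0, 40.0), (2, 1): (1.0, 30.0), (1, 0): (1.0, 20.0),
}

sleep_cost = transition_cost(example_costs, 0, 3)  # three increments
wake_cost = transition_cost(example_costs, 3, 0)   # three decrements
```

A change from state 0 to state 3 is therefore charged as the sum of the three intermediate changes, which is why a value pair can be given for every single state change.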

Based on this model (visualized in Fig. 1) and the utilization from the trace file, the power consumption can be estimated. For each fixed step (defined by the step size of the utilization values in the trace file), the power consumption of each component is calculated. This process (without usage of the ACPI Device Power States) is the simplest strategy for power estimation and corresponds to the real power consumption, because the tested cluster hardware is not ACPI capable. However, with the energy consumptions of the different ACPI Device Power States, the potential power savings of putting components into a higher ACPI Device Power State can be calculated.

Fig. 1 Linear energy consumption model compared to energy-proportional model. The estimation of the energy consumption for utilized hardware is the same for all strategies, while the ACPI model uses different power consumptions in case of unutilized hardware

To estimate the power consumption of a cluster, four different strategies have been implemented; every component in a node can use a different strategy. The four strategies are named Simple Strategy, Optimal Strategy, Approach Strategy and Multiple State Strategy.

The first strategy estimates the power consumption without using any power saving mechanism. For the Simple Strategy, every single timestep is evaluated independently. The other strategies are look-ahead strategies, which take different decisions based on future utilization values. The Optimal Strategy decides to put a component into sleep mode if the future utilization is zero for sufficiently many timesteps and wakes it up before it is used again. Because such zero utilization times are not frequent for some components such as a CPU, the Approach Strategy puts a component to sleep if the utilization is under a specified level, e.g. 5%. To compensate for the utilization that was skipped, an equivalent phase with high utilization is performed at the end of the low utilization phase. The last strategy assigns different levels of utilization to different ACPI Device Power States. This is useful for components like the main memory. Theoretically, if the main memory utilization is under 50%, half of the banks can be switched off to save energy.


These four strategies are explained and discussed in detail in the next subsections.

3.1 Simple strategy

To interpolate the power consumption for each hardware component, the consumption values for low and peak utilization have to be defined. The power consumption $P_{\text{curr}}$ for a given utilization $u_{\text{curr}}$, with the power consumption at ACPI Device Power State 0 given by $P_{0\%}$ (zero utilization) and $P_{100\%}$ (full utilization), is calculated in Eq. 1.

$$P_{\text{curr}} = (P_{100\%} - P_{0\%}) \cdot u_{\text{curr}} + P_{0\%} \qquad (1)$$

Each timestep of the given utilization is evaluated independently (the step power consumption is calculated with Eq. 1), hence the power consumption values are as granular as the utilization values. The power consumption of the component is the accumulated sum of the step power consumptions; the power consumption of the node is the accumulated sum of its components’ power consumption, adjusted for the efficiency of the power supply at this power level.

Because the calculated power consumption only depends on the utilization and the power consumption in ACPI Device Power State 0, this strategy can also be used to estimate the power consumption with different hardware (e.g. energy-proportional hardware). The model is visualized in Fig. 1.
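A minimal sketch of the Simple Strategy for a single component follows. It applies Eq. 1 to every step of a utilization trace and accumulates the result as energy. The function names and the four-sample trace are illustrative assumptions; the CPU power values are those of Table 1, and the energy-proportional device is modeled here by assuming an idle power of zero.

```python
def step_power(p_idle, p_full, utilization):
    """Eq. 1: linear interpolation between idle and full-load power (watts)."""
    return (p_full - p_idle) * utilization + p_idle


def simple_strategy_energy_wh(utilizations, p_idle, p_full, step_seconds):
    """Sum the per-step energy over the trace and convert joules to W h."""
    joules = sum(step_power(p_idle, p_full, u) * step_seconds for u in utilizations)
    return joules / 3600.0


# Made-up trace with a 100 ms step size; utilization is a fraction in [0, 1].
cpu_trace = [0.0, 0.5, 1.0, 1.0]

# CPU values P0% = 28.1 W and P100% = 58 W as listed in Table 1.
cpu_energy = simple_strategy_energy_wh(cpu_trace, 28.1, 58.0, 0.1)

# Assumption: an energy-proportional CPU draws 0 W when idle.
proportional_energy = simple_strategy_energy_wh(cpu_trace, 0.0, 58.0, 0.1)
```

Because each step is evaluated independently, swapping the two power values is all that is needed to evaluate the same trace for different hardware.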

3.2 Optimal strategy

The Optimal Strategy uses the different ACPI Device Power States of the component. This strategy calculates the minimal power consumption without reducing the performance, because a component only switches to sleep mode if it is not used and wakes up before it is used again. Only the ACPI Device Power States 0 and 3 are used. The decision whether a component switches to sleep mode depends on different factors: the duration of the zero utilization phase, the power saving potential of the state change ($P_{\text{diff}}$, the difference in power consumption between states ACPI0 and ACPI3), the duration of the state change ($t_{\text{change}}$, the sum of the durations ACPI0 to ACPI3 and ACPI3 to ACPI0) and the energy of the state change ($E_{\text{change}}$, also the sum of both changes). With these values the minimal duration of a zero utilization phase $t_{\text{optimal}}$ can be calculated, for which sleeping reduces the power consumption without reducing performance (see Eq. 2).

$$t_{\text{optimal}} = \frac{E_{\text{change}}}{P_{\text{diff}}} + t_{\text{change}} \qquad (2)$$

In contrast to the Simple Strategy, power consumption is decreased (without increasing the calculation time). Because the future utilization is in general not known, the output of the Optimal Strategy is an approximated upper bound for the power saving achievable by shutting down hardware components.
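As a worked example of Eq. 2, the following sketch computes the break-even duration for the disk using the values from Table 1 and then scans a trace for zero utilization phases that are long enough. The function names and the sample trace are assumptions made for illustration; the look-ahead shown here is a simplified reading of the strategy, not the simulator's implementation.

```python
WH_TO_J = 3600.0


def break_even_seconds(e_change_j, p_diff_w, t_change_s):
    """Eq. 2: shortest zero-utilization phase for which sleeping pays off."""
    return e_change_j / p_diff_w + t_change_s


def sleep_phases(utilizations, step_seconds, t_optimal):
    """Return (start, end) index pairs of zero-utilization runs long enough to sleep."""
    phases, start = [], None
    # A sentinel step with load closes a run that reaches the end of the trace.
    for i, u in enumerate(list(utilizations) + [1.0]):
        if u == 0.0 and start is None:
            start = i
        elif u != 0.0 and start is not None:
            if (i - start) * step_seconds >= t_optimal:
                phases.append((start, i))
            start = None
    return phases


# Disk values from Table 1.
p_diff = 4.42 - 2.0                      # P0% minus P3, in W
e_change = (0.001 + 0.026) * WH_TO_J     # E0-3 plus E3-0, in J
t_change = 1.0 + 4.0                     # t0-3 plus t3-0, in s
t_optimal = break_even_seconds(e_change, p_diff, t_change)  # roughly 45 s

# Made-up trace with a 10 s step: one idle run of 50 s and one of 20 s.
disk_trace = [0.3, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2, 0.0, 0.0, 0.1]
phases = sleep_phases(disk_trace, 10.0, t_optimal)
```

With these inputs only the 50 s idle run qualifies; the 20 s run is shorter than the break-even duration, so sleeping there would cost more energy than it saves.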

3.3 Approach strategy

Because some components do not have zero utilization phases but low utilization phases, the Approach Strategy has been developed. The intention is to simulate smart utilization of each component. If a component works, for example, 10 seconds with a low utilization of 10%, the same work could be done by working 1 second with a utilization of 100%. This results in 9 seconds with zero utilization in which the component can switch to sleep mode. The rearrangement of load can be thought of as reordering the code. However, in reality this might not be possible due to dependencies between calls et cetera. Consequently, the output produced by the Approach Strategy shows scopes where power saving could be scheduled.

The calculation of the minimum time for an efficient state change now depends on a tolerance value $\delta$. The timesteps with a utilization lower than $\delta$ will be rearranged if the count of timesteps under the specified tolerance is greater than or equal to $t_{\text{approach}}$ (see Eq. 3). The rearrangement of load results in a zero utilization phase ($t_{\text{optimal}}$) and a utilization phase containing the equivalent load ($t_{\text{load}}$). The equivalent load is the aggregated utilization for the duration of $t_{\text{optimal}}$ before the rearrangement.

$$t_{\text{approach}} = t_{\text{optimal}} + t_{\text{load}} \qquad (3)$$

This strategy also provides an upper bound for power saving. In contrast to the Optimal Strategy this strategy rearranges the load, which more likely than not affects the calculation time of the parallel program because of the dependencies inside the code and/or the hardware.
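The load rearrangement can be sketched as follows, under the simplifying assumption that the aggregated load of a low utilization phase is packed into whole timesteps at 100% utilization. The function names, the tolerance of 15% and the step counts are hypothetical; the sketch reproduces the 10 second example from the text rather than the simulator's exact procedure.

```python
import math


def rearrange_low_phase(utilizations, delta):
    """Pack a low-utilization phase into full-load steps followed by idle steps."""
    if any(u >= delta for u in utilizations):
        raise ValueError("phase contains a step at or above the tolerance")
    load = sum(utilizations)               # aggregated work in full-load steps
    busy_steps = math.ceil(load - 1e-9)    # t_load, guarded against float noise
    idle_steps = len(utilizations) - busy_steps
    return busy_steps, idle_steps


def worth_rearranging(utilizations, delta, t_optimal_steps):
    """Eq. 3: the phase must cover t_optimal plus the time for the packed load."""
    busy_steps, idle_steps = rearrange_low_phase(utilizations, delta)
    return idle_steps >= t_optimal_steps


# Ten 1 s steps at 10% utilization, tolerance of 15% (illustrative values).
busy, idle = rearrange_low_phase([0.1] * 10, 0.15)
```

Checking that the idle remainder is at least the break-even length is the same condition as requiring the phase length to reach the sum of both terms in Eq. 3.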

3.4 Multiple state strategy

The last strategy is an extension of the Optimal Strategy that uses multiple ACPI Device Power States to model another use case: it is possible for a component to use different states with different power consumptions and wake-up times.

A component suited to this strategy is the main memory, because the main memory will never reach a zero utilization phase. Neither could it be put to sleep if utilized at only 10%. But if the memory is split into multiple banks on the main board, some of these banks can be turned off if not utilized. It is also possible to use this strategy for the ACPI processor performance states (P-States): for example, if the CPU is utilized at about 50%, it is possible to lower the frequency, but it is not advisable to go into sleep mode.


For the Multiple State Strategy multiple utilization levels have to be defined to trigger a state change. If the utilization is under the specified value, the state change from ACPI Device Power State 0 to ACPI Device Power State 1 is triggered. For increasing the ACPI Device Power State (and decreasing the power consumption) the time $t_{\min}$ can be calculated based on the future utilization steps and the next higher utilization level.
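The mapping from utilization levels to states can be sketched as a simple threshold lookup. The thresholds below are hypothetical and only loosely follow the main memory example (below 50% utilization, a deeper state becomes possible); the check against $t_{\min}$ is deliberately left out of this sketch.

```python
def target_state(utilization, thresholds):
    """Pick the deepest ACPI state whose utilization threshold is undercut."""
    state = 0
    for level, limit in sorted(thresholds.items()):
        if utilization < limit:
            state = level
    return state


# Hypothetical levels: state -> utilization below which the state may be used.
memory_thresholds = {1: 0.5, 2: 0.25, 3: 0.05}

states = [target_state(u, memory_thresholds) for u in (0.6, 0.3, 0.1, 0.0)]
```

In a full implementation each proposed change would still have to be checked against the break-even time for that state, in the same way as for the Optimal Strategy.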

Based on the input values a quantification of the energy saving potential is still not possible, but it becomes clear that an energy saving potential exists. The power saving varies with the choice of strategy: in most cases the Approach Strategy saves more energy than the Optimal Strategy. But the Approach Strategy recalculates the utilization, thus its state changes are hints for the developer of the algorithm to review possible code modifications that change the hardware utilization. The Optimal Strategy is an upper bound for energy saving because the hardware switches to power saving states wherever it is efficient. With the Simple Strategy it is possible to estimate the power consumption of the hardware as granularly as the input values (for the utilization and the hardware power characteristics) are specified.

4 Methodology

The tracing is realized by an MPI wrapper library called HD-MPIwrapper, which intercepts the MPI library calls and logs information on their invocation. The ResourcesUtilizationTracingLibrary logs the utilization of various components of a node (CPU utilization in percent, memory usage in bytes, network usage in bytes, disk usage in blocks). The PowerTracingLibrary logs the current in A, the voltage in V and the resulting power consumption in W, also on node level. To trace the power consumption an external power meter (LMG-450 of ZES ZIMMER Electronic Systems) is used. This power meter has a high accuracy and allows the power consumption of four cluster nodes to be traced. The ResourcesUtilizationTracingLibrary and the PowerTracingLibrary are further described in [19]. For these experiments, the tracing periodicity is set to 100 ms. A graphical view of the tracing environment is given in Fig. 2.

The simulator realizing the model (named HDPowerEstimation) is implemented in Java. Tracing the hardware utilization of one node with a periodicity of 100 ms for about 1.5 h results in a trace file of about 11 MB. To process the trace file, the preprocessing (reading the trace file) needs about 5 seconds, while the simulation itself takes about 15 seconds, which corresponds to a ratio of 360 : 1 between traced time and simulation time. Consequently, the execution time is negligible for small and medium-scale simulations.

Fig. 2 Cluster tracing environment

5 Evaluation

To evaluate the model, the accuracy of the simulator has to be validated for the chosen hardware. After describing the hardware and software environment, the component power consumption is determined using micro benchmarks. In the following the simulator is used for two test sets: a comparison of multiple environment configurations for a specific application and a simulation of energy-proportional hardware devices.

5.1 Hard- and software environment

A cluster node has two Intel Xeon Northwood CPUs (single core) with 1024 MB main memory (two 512 MB memory modules). Each node has a local P-ATA disk and GigE. The specific hardware of a cluster node is ACPI aware. The concrete energy saving features are not in use, because they are not effectively manageable.

The cluster nodes run Ubuntu 8.04 (SMP-Kernel 2.6.30) via NFS with mpich2 version 1.0.8 and PVFS (version 2.8.1) as parallel filesystem.

As a benchmark the program partdiff-par (a parallel PDE solver implemented using MPICH) has been chosen due to the following features:

• the calculation-communication relation is flexible, based on the input values for the boundary values of the matrix
• the component utilization depends on the checkpointing frequency and the calculation-communication relation
• it supports parallel I/O
• partdiff-par is a real application and not a synthetic benchmark

Fig. 3 Power consumption (without power supply overhead)

Table 1 Power characteristics of a node

| Value | CPU | Main memory | Disk | NIC |
|---|---|---|---|---|
| P100% | 58 W | 13.585 W | 7.02 W | 2 W |
| P0% | 28.1 W | 10.985 W | 4.42 W | 0.78 W |
| P3 | 4 W | 0.1 W | 2 W | 0.2 W |
| E0−3 | 2.74 × 10⁻⁷ W h | 1.1 × 10⁻⁸ W h | 0.001 W h | 1 × 10⁻⁶ W h |
| E3−0 | 2.74 × 10⁻⁷ W h | 1.1 × 10⁻⁸ W h | 0.026 W h | 0.001 W h |
| t0−3 | 0.017 ms | 0.006 ms | 1000 ms | 0.1 ms |
| t3−0 | 0.017 ms | 0.006 ms | 4000 ms | 0.1 ms |

5.2 Determining component power consumption

To determine the component power consumption, the zero utilization power consumption is measured with different hardware configurations. To estimate the power consumption of the utilized components a micro benchmark is used (see [22] for a further description of the micro benchmark). The power consumption in ACPI Device Power State 3 is also estimated for each component in this section.

For estimating the power consumption, the ACPI Device Power State 3 consumptions must be specified. Table 1 shows the power characteristics of the modeled components for one cluster node (typical values from the desktop and mobile components that were available, because the concrete hardware does not support multiple ACPI states). P3 labels the power consumption in ACPI Device Power State 3, while E0−3 and t0−3 label the energy consumption and duration for changing from ACPI Device Power State 0 to ACPI Device Power State 3.

The ACPI values for the CPU are based on the values for an Intel Centrino Core Duo 1.8 GHz CPU (mobile edition). This CPU consumes about 53 W when fully utilized and 12 W, 7 W and 1 W in different ACPI Device Power States [17]. Hence, a value of 4 W has been chosen for the power consumption in ACPI Device Power State 3. The duration for decreasing the ACPI Device Power State is taken from the ACPI table of the operating system, which gives 0.017 ms for the Core Duo processor.

The power consumption values of the main memory are estimated based on Moona’s assumptions [23]. The ACPI Device Power State consumption values for the disk are taken from Hylic’s work [16], while the NIC values are taken from Agarwal’s work [2].

Missing input values for the model have been approximated using the maximum power consumption. For example, the energy consumption for increasing the CPU ACPI Device Power State has been determined by multiplying the duration with the maximum power consumption of 58 W.
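The CPU entry of Table 1 can be reproduced from this rule with a short calculation. The function name is an assumption made for illustration; the inputs are the duration of 0.017 ms and the maximum power of 58 W stated above.

```python
def state_change_energy_wh(duration_ms, max_power_w):
    """Approximate the state-change energy as duration times maximum power."""
    joules = (duration_ms / 1000.0) * max_power_w
    return joules / 3600.0  # 1 W h = 3600 J


cpu_change_energy = state_change_energy_wh(0.017, 58.0)  # about 2.74e-7 W h
```

The result, roughly 2.74 × 10⁻⁷ W h, matches the E0−3 and E3−0 values listed for the CPU in Table 1.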

5.3 Assessing the power consumption of traces

Table 2 shows the results with different strategies for the different traces already discussed in the last sections. The deviance between the measured energy consumption values and the ones estimated with the Simple Strategy (without usage of ACPI states) ranges from 1.2% to 3.2% for the specific traces. Due to the non-determinism of the parallel program, the measured energy consumption values have a deviance of about 1% over multiple runs. Therefore, a deviance of about 3% is acceptable.

Table 2 Measured and estimated ETS for different setups

| Setup | Measured (W h) | Estimated (W h) | Deviance (%) |
|---|---|---|---|
| mb | 16.697 | 16.468 | 1.370 |
| calculation intensive | 21.202 | 20.571 | 2.977 |
| communication intensive (50 iterations) | 133.845 | 138.126 | 3.198 |
| communication intensive | 12.831 | 12.620 | 1.638 |
| calculation intensive (×2) | 18.476 | 18.251 | 1.187 |
| communication intensive (×2) | 13.362 | 13.557 | 1.453 |

Table 3 Estimated ETS with different strategies

| Config | Intensive | Simple (W h) | Optimal (W h) | Approach (W h) | Savings Optimal (%) | Savings Approach (%) |
|---|---|---|---|---|---|---|
| mb | – | 16.468 | 14.852 | 14.083 | 9.813 | 14.482 |
| 4,calc | calc | 20.571 | 19.071 | 18.260 | 7.290 | 11.230 |
| 4,comm | comm | 12.620 | 12.183 | 11.719 | 3.460 | 7.142 |
| 8,calc | calc (×2) | 18.251 | 17.394 | 17.027 | 4.723 | 6.731 |
| 8,comm | comm (×2) | 13.557 | 12.957 | 12.591 | 4.420 | 7.116 |

Fig. 4 Estimated ETS for different traces
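The deviance column of Table 2 can be recomputed from the measured and estimated values. The sketch below assumes that the deviance is the absolute difference relative to the measured value, which reproduces the first three rows of the table to within rounding; the function and variable names are illustrative.

```python
def deviance_percent(measured_wh, estimated_wh):
    """Absolute difference between estimate and measurement, relative to the measurement."""
    return abs(measured_wh - estimated_wh) / measured_wh * 100.0


# (setup, measured W h, estimated W h) taken from Table 2.
table2_rows = [
    ("mb", 16.697, 16.468),
    ("calculation intensive", 21.202, 20.571),
    ("communication intensive (50 iterations)", 133.845, 138.126),
]

recomputed = {name: deviance_percent(m, e) for name, m, e in table2_rows}
```

Note that the estimate for the 50-iteration trace lies above the measurement, so the absolute value is needed to obtain the positive deviance listed in the table.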

The Optimal Strategy reaches power savings of nearly 9.8% for the micro benchmark trace (see Table 3). Because this replay does not change the component utilization, the value of 9.8% is an upper bound for power saving without changes to the parallel program.

With the Approach Strategy a decrease in energy consumption of about 14.5% is possible. But this strategy assumes that the parallel program utilizes its components better, so that longer and more frequent low utilization phases arise in which the more energy efficient ACPI states can be used. In general, savings of 3.5% to 9.8% are possible for different parameter sets for partdiff-par using the Optimal Strategy; in case of the Approach Strategy savings of 6.5% up to 14.5% are possible (see Fig. 4).

5.4 Energy efficient sleeping

In this test set multiple program configurations without checkpointing and with a varying count of MPI processes are traced. Based on these traces the proportion between calculating and sleeping devices is analyzed. Table 4 shows the estimated energy consumption and the measured run time for different setups. The program partdiff-par is configured with a varying count of iterations; each of these configurations runs with 4 and 8 MPI processes respectively.

Table 4 Estimated ETS on four cluster nodes

| Iter. | MPI proc. | Simple (W h) | Optimal (W h) | Approach (W h) | Savings Optimal (%) | Savings Approach (%) | Time (s) |
|---|---|---|---|---|---|---|---|
| 5 | 4 | 2.153 | 1.554 | 1.432 | 27.8 | 33.5 | 6 |
| 5 | 8 | 2.141 | 1.610 | 1.598 | 24.8 | 25.4 | 4 |
| 100 | 4 | 18.663 | 13.186 | 14.121 | 29.4 | 24.3 | 105 |
| 100 | 8 | 15.262 | 13.983 | 14.105 | 8.4 | 8.6 | 67 |
| 200 | 4 | 36.142 | 26.429 | 27.467 | 26.9 | 24.0 | 209 |
| 200 | 8 | 29.048 | 27.007 | 27.101 | 7.0 | 6.7 | 133 |
| 500 | 4 | 90.615 | 71.330 | 69.193 | 21.3 | 24.6 | 536 |
| 500 | 8 | 70.592 | 66.253 | 66.218 | 6.1 | 6.2 | 334 |

Fig. 5 Estimated ETS for different setups

The setups with 5 iterations have a run time of only 6 and 4 seconds respectively, depending on the count of MPI processes. With 8 MPI processes the run time is 2 seconds shorter and the estimated energy consumption with the Simple Strategy is about 0.05 W h smaller, mainly because of the shorter run time: all 8 CPUs are used for calculating and fewer CPU idle times appear. When comparing the estimated energy consumption with the Optimal Strategy, the saving is decreased to 0.008 W h compared to the setup with 4 MPI processes, but the run time is still constant for both setups.

The 4 idle CPUs of the 4 MPI process setup can switch to ACPI Device Power State 3 and sleep, while the other 4 CPUs are calculating. Both the parallelization overhead of the 8 MPI process setup (especially the additional communication overhead, which increases the CPU utilization) and the energy savings of the 4 MPI process setup decrease the total power consumption difference between the two setups. Using the Approach Strategy the estimated energy consumption for 4 MPI processes is even lower than for 8 MPI processes. Figure 5 visualizes the estimated energy consumption of each setup.

For this experiment the configuration of four calculating CPUs and four sleeping CPUs can be more energy efficient than using eight CPUs for calculating. This is due to the parallelization overhead, which exceeds the performance growth. Using 100 and 200 iterations respectively, the run time is increased by a multiple of the run time with 5 iterations. Even with those setups the configuration with four calculating CPUs (and the remaining four CPUs sleeping) is more power efficient than using eight CPUs for calculating, although the total power saving decreases.

Table 5 Estimated ETS with energy-proportional devices

| Config | Workload | Simple (W h) | Optimal (W h) | Energy-proportional (W h) | Savings |
|---|---|---|---|---|---|
| mb | benchmark | 16.468 | 14.852 | 11.554 | 29.83% |
| 4,calc | calc | 20.571 | 19.071 | 14.662 | 28.72% |
| 4,comm | comm | 12.620 | 12.187 | 9.1503 | 27.49% |
| 8,calc | calc (×2) | 18.251 | 17.394 | 13.973 | 23.46% |
| 8,comm | comm (×2) | 13.557 | 12.957 | 9.9154 | 26.85% |
| 1 PVFS | comm, 1 PVFS | 6.2007 | 4.6752 | 2.2676 | 63.430% |
| no IO | comm (×2), no IO | 70.383 | 68.761 | 63.826 | 9.3157% |

With 500 iterations the performance growth exceeds the overhead of the parallelization, and using eight CPUs for calculating decreases the total energy consumption in comparison to the setup with four CPUs. To conclude this test set: depending on the properties of the specific algorithm and the corresponding parallelization overhead, energy can be saved by using fewer calculating units, even if the run time increases. This is not surprising, but in general these savings can only be reached when using power aware hardware.

5.5 Energy-proportional devices

This test set is intended to show the capability of the tool to simulate different hardware characteristics. On the one hand this motivates research on more efficient devices; on the other hand it allows the savings with different hardware to be investigated (TCO against acquisition costs).

This test set characterizes each node device as an energy-proportional device for a specific trace file. As already discussed, the power consumption for zero utilization is relatively high for each device. If this overhead can be technically reduced by the hardware vendors, power efficiency can be increased. A CPU with the optimal specification would consume zero watts if not utilized, so the mentioned overhead would vanish (see [3] for a case study on energy-proportional computing).

Table 5 shows the estimated power consumption with the Simple Strategy, with the Optimal Strategy, and with energy-proportional devices (CPUs, NIC, main memory and hard disk), together with the percentage savings of energy-proportional devices compared to the Simple Strategy.

For the micro benchmark trace, savings of about 30% are estimated for the energy-proportional devices; the minimal saving of the traces with checkpointing is about 23%. In these setups the savings mainly depend on the hardware utilization: the setups with four MPI processes have a greater energy saving potential, because the other four CPUs do not consume any energy when idle.

For the trace without checkpointing and better utilization of the hardware (using eight MPI processes for four nodes), potential savings of only 9% have been estimated. For the trace with many idle times (using one PVFS server and four MPI processes on four nodes with checkpointing), the estimated saving with energy-proportional devices is about 63%.

Comparing the power estimation with Optimal Strategy and the estimation with energy-proportional devices, it becomes clear that the energy savings with energy-proportional devices are greater than the savings with Optimal Strategy. For these specific test cases the additional savings compared to Optimal Strategy are between 7% and 51%. The smaller this additional saving, the smaller the hardware overhead for this specific application. This conclusion is verified in further experiments in [22].
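The percentages quoted in this section can be recomputed from the values in Table 5. The following sketch does so; the dictionary and function names are illustrative. Because the table values are rounded, the recomputed savings may differ from the printed ones in the last digit.

```python
# Table 5 values per configuration: (Simple, Optimal, Energy-proportional).
TABLE_5 = {
    "mb":     (16.468, 14.852, 11.554),
    "4,calc": (20.571, 19.071, 14.662),
    "4,comm": (12.620, 12.187, 9.1503),
    "8,calc": (18.251, 17.394, 13.973),
    "8,comm": (13.557, 12.957, 9.9154),
    "1 PVFS": (6.2007, 4.6752, 2.2676),
    "no IO":  (70.383, 68.761, 63.826),
}


def saving_percent(baseline, improved):
    """Relative saving of `improved` against `baseline`, in percent."""
    return (baseline - improved) / baseline * 100.0


# Savings of energy-proportional devices compared to the Simple Strategy.
vs_simple = {cfg: saving_percent(simple, prop)
             for cfg, (simple, _, prop) in TABLE_5.items()}

# Additional savings of energy-proportional devices over the Optimal Strategy.
vs_optimal = {cfg: saving_percent(optimal, prop)
              for cfg, (_, optimal, prop) in TABLE_5.items()}

for cfg in TABLE_5:
    print(f"{cfg:8s} vs Simple: {vs_simple[cfg]:6.2f}%  "
          f"vs Optimal: {vs_optimal[cfg]:6.2f}%")
```

The smallest and largest entries of `vs_optimal` correspond to the range of roughly 7% to 51% stated above.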

Of course the savings (and the hardware energy overhead) for well-utilized hardware are much smaller than for idle hardware. Hence, these values are heavily dependent on the traced program and the configuration.

In the current configuration the calculation includes a percentage overhead of the power supply of 35%. It is possible to reduce this overhead, for example by using a switched-mode power supply (SMPS). With such a power-efficient supply the percentage overhead can be reduced to about 5% (device specific, see [1]). Hence, the resulting power consumption can simply be calculated for specific power supplies (see [22]).
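A minimal sketch of such a recalculation is given below. It assumes that the power supply overhead is applied as a multiplicative surcharge on the consumption of the components, i.e. total = components × (1 + overhead); this interpretation and the function names are assumptions made for illustration, and the exact definition used by the simulator is given in [22].

```python
def with_supply_overhead(component_energy, overhead):
    """Total energy including the power supply overhead.

    Assumes (hypothetically) total = components * (1 + overhead).
    """
    return component_energy * (1.0 + overhead)


def swap_supply(total_energy, old_overhead, new_overhead):
    """Rescale a total estimated with one supply overhead to another."""
    component_energy = total_energy / (1.0 + old_overhead)
    return with_supply_overhead(component_energy, new_overhead)


# Example with a made-up total of 135 energy units at 35% overhead,
# recalculated for a supply with 5% overhead.
rescaled = swap_supply(135.0, old_overhead=0.35, new_overhead=0.05)
print(rescaled)
```

Under this assumption the relative reduction from replacing the supply is the same for every trace, since it depends only on the two overhead values.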

Figure 6 visualizes the energy estimations with Simple Strategy, with Optimal Strategy, with energy-proportional devices, and with energy-proportional devices combined with an SMPS with 95% efficiency.

6 Summary and conclusions

The existing work in the field of Green IT in HPC has already shown that switching the hardware to power saving states is a promising approach to decrease power consumption with small effects on time-to-solution.

Fig. 6 Estimated ETS with energy-proportional devices

The analytical model designed in this work allows us to identify an upper bound for these energy savings. Using the different look-ahead strategies developed here, this is possible with and without affecting the time-to-solution. The capability to show application chunks where rearrangement of load would increase energy efficiency can be used as a kind of debugger for energy efficiency.

Furthermore, the simulation of energy-proportional hardware offers the capability to determine the total energy consumed by a specific application without the hardware overhead.

The power consumption for the hardware overhead estimated in this work is not surprising, but the capability to monetarily evaluate concrete application traces shows the potential of this approach.

References

1. Aebischer B, Huser A (2003) Energy efficiency of computer power supplies. In: EEDAL ’03: proceedings of the 3rd international conference on energy efficiency in domestic appliances and lighting

2. Agarwal Y, Hodges S, Chandra R, Scott J, Bahl P, Gupta R (2009) Somniloquy: augmenting network interfaces to reduce PC energy usage. In: NSDI ’09: proceedings of the 6th USENIX symposium on networked systems design and implementation. USENIX Association, Berkeley, pp 365–380

3. Barroso LA, Hölzle U (2007) The case for energy-proportional computing. Computer 40(12):33–37. doi:10.1109/MC.2007.443

4. Bircher W, John L (2007) Complete system power estimation: a trickle-down approach based on performance events. In: ISPASS ’07: proceedings of the 2007 IEEE international symposium on performance analysis of systems and software. IEEE Computer Society, Los Alamitos, pp 158–168

5. Contreras G, Martonosi M (2005) Power prediction for Intel XScale processors using performance monitoring unit events. In: ISLPED ’05: proceedings of the 2005 international symposium on low power electronics and design. ACM, New York, pp 221–226. doi:10.1145/1077603.1077657

6. Hewlett-Packard Corporation, Intel Corporation, Microsoft Corporation, Phoenix Technologies Ltd., Toshiba Corporation (2005) Advanced configuration and power interface specification

7. Etinski M, Corbalan J, Labarta J, Valero M, Veidenbaum A (2009) Power-aware load balancing of large scale MPI applications. In: IPDPS ’09: proceedings of the 2009 IEEE international symposium on parallel and distributed processing. IEEE Computer Society, Washington, pp 1–8. doi:10.1109/IPDPS.2009.5160973

8. Feng X, Ge R, Cameron KW (2005) Power and energy profiling of scientific applications on distributed systems. In: IPDPS ’05: proceedings of the 19th IEEE international parallel and distributed processing symposium (IPDPS ’05) papers. IEEE Computer Society, Washington, p 34. doi:10.1109/IPDPS.2005.346

9. Freeh V, Pan F, Kappiah N, Lowenthal D, Springer R (2005) Exploring the energy-time tradeoff in MPI programs on a power-scalable cluster. In: IPDPS ’05: proceedings of parallel and distributed processing symposium. doi:10.1109/IPDPS.2005.214

10. Freeh VW, Lowenthal DK (2005) Using multiple energy gears in MPI programs on a power-scalable cluster. In: PPoPP ’05: proceedings of the tenth ACM SIGPLAN symposium on principles and practice of parallel programming. ACM, New York, pp 164–173. doi:10.1145/1065944.1065967

11. Ge R, Feng X, Cameron KW (2005) Improvement of power-performance efficiency for high-end computing. In: IPDPS ’05: proceedings of the 19th IEEE international parallel and distributed processing symposium. IEEE Computer Society, Washington, p 233. doi:10.1109/IPDPS.2005.251

12. Ge R, Feng X, Cameron KW (2005) Performance-constrained distributed DVS scheduling for scientific applications on power-aware clusters. In: SC ’05: proceedings of the 2005 ACM/IEEE conference on supercomputing. IEEE Computer Society, Washington, p 34. doi:10.1109/SC.2005.57

13. Hotta Y, Sato M, Kimura H, Matsuoka S, Boku T, Takahashi D (2006) Profile-based optimization of power performance by using dynamic voltage scaling on a PC cluster. In: IPDPS ’06: proceedings of the 20th international parallel and distributed processing symposium. doi:10.1109/IPDPS.2006.1639597

14. Hsu Ch, Feng Wc (2005) A power-aware run-time system for high-performance computing. In: SC ’05: proceedings of the 2005 ACM/IEEE conference on supercomputing. IEEE Computer Society, Washington, p 1. doi:10.1109/SC.2005.3

15. Huang S, Feng W (2009) Energy-efficient cluster computing via accurate workload characterization. In: CCGRID ’09: proceedings of the 2009 9th IEEE/ACM international symposium on cluster computing and the grid. IEEE Computer Society, Washington, pp 68–75. doi:10.1109/CCGRID.2009.88

16. Hylick A, Sohan R, Rice A, Jones B (2008) An analysis of hard drive energy consumption. In: MASCOTS 2008: IEEE international symposium on modeling, analysis and simulation of computers and telecommunication systems, pp 1–10. doi:10.1109/MASCOT.2008.4770567

17. Intel MTP (2006) Intel Core2 duo mobile processor for Intel centrino duo mobile processor technology datasheet

18. Kappiah N, Freeh VW, Lowenthal DK (2005) Just in time dynamic voltage scaling: exploiting inter-node slack to save energy in MPI programs. In: SC ’05: proceedings of the 2005 ACM/IEEE conference on supercomputing. IEEE Computer Society, Washington, p 33. doi:10.1109/SC.2005.39

19. Krempel S, Kunkel J, Ludwig T (2009) Design and implementation of a profiling environment for trace based analysis of energy efficiency benchmarks in high performance computing. Master’s thesis, Institute of Computer Science, University of Heidelberg

20. Lim MY, Freeh VW, Lowenthal DK (2006) Adaptive, transparent frequency and voltage scaling of communication phases in MPI programs. In: SC ’06: proceedings of the 2006 ACM/IEEE conference on supercomputing. ACM, New York, p 107. doi:10.1145/1188455.1188567

21. Lu YH, Benini L, De Micheli G (2000) Operating-system directed power reduction. In: ISLPED ’00: proceedings of the 2000 international symposium on low power electronics and design. ACM, New York, pp 37–42. doi:10.1145/344166.344189

22. Minartz T, Kunkel J, Ludwig T (2009) Model and simulation of power consumption and power saving potential of energy efficient cluster hardware. Master’s thesis, Institute of Computer Science, University of Heidelberg

23. Moona PR, Chole S, Harneja S (2007) Memory management using dynamic memory switching. Project report, Department of Computer Science and Engineering, Indian Institute of Technology Kanpur

24. Pinheiro E, Bianchini R, Carrera E, Heath T (2001) Load balancing and unbalancing for power and performance in cluster-based systems. In: COLP ’01: workshop on compilers and operating systems for low power

25. Rountree B, Lowenthal DK, Funk S, Freeh VW, de Supinski BR, Schulz M (2007) Bounding energy consumption in large-scale MPI programs. In: SC ’07: proceedings of the 2007 ACM/IEEE conference on supercomputing. ACM, New York, pp 1–9. doi:10.1145/1362622.1362688

26. Vasudevan V, Franklin J, Andersen D, Phanishayee A, Tan L, Kaminsky M, Moraru I (2009) FAWNdamentally power-efficient clusters. In: HotOS XII: 12th workshop on hot topics in operating systems

Timo Minartz is a PhD student at the University of Hamburg. His major research interests are high-performance computing and energy efficiency. He received his MSc degree at the University of Heidelberg in 2009.

Julian M. Kunkel received his MSc degree in computer science at the University of Heidelberg in 2007. Currently he is employed at the German High Performance Computing Centre for Climate- and Earth System Research. In his leisure time he works on his PhD. His interests cover high performance file systems and modeling of cluster systems’ performance.

Thomas Ludwig became Professor at the Ruprecht-Karls-Universität Heidelberg and led the research group Parallel and Distributed Systems in 2001. Since 2009 he has been Professor at the University of Hamburg and CEO of the German High Performance Computing Centre for Climate- and Earth System Research. His major research interests are high performance storage and energy efficiency in HPC.

