E3C: Exploring Energy Efficient Computing
Dawn Geatches, Science & Technology Facilities Council, Daresbury Laboratory,
Warrington WA4 4AD, dawn.geatches@stfc.ac.uk
This scoping project was funded under the Environmental Sustainability Concept Fund (ESCF) within the Business Innovation Department of STFC.
This document is a first attempt to demonstrate how users of the quantum mechanics-based software code CASTEP [1] can run their simulations on high performance computing (HPC) architectures efficiently. Whatever a user's level of experience, the climate crisis we are facing dictates that we need to: (i) become aware of the computational resources our simulations consume; (ii) understand how we, as users, can reduce this consumption; (iii) actively develop energy efficient computing habits. This document provides some small insight to help users progress through stages (i) and (ii), empowering them to adopt stage (iii) with confidence.
This document is not a guide to setting up and running simulations using CASTEP; these already exist (see, for example, the CASTEP documentation). It is assumed throughout that the user has a basic familiarity with the software and its terminology. This document does not exhaust all of the possible ways to reduce computational cost; much is left for users to discover for themselves and to share with the wider CASTEP community (e.g. via the JISCMAIL CASTEP Users Mailing List). Thank you.
Sections
1 Computational cost of simulations
2 Reducing the energy used by your simulation
A Cell file
B Param file
C Submission script
D An (extreme) example
3 Developing energy efficient computing habits: A recipe
4 What else can a user do?
5 What are the developers doing?
1 Computational cost of simulations
'Computational cost' in the context of this project is synonymous with 'energy used'. As a user of high performance computing (HPC) resources, have you ever wondered what effect your simulations have on the environment through the energy they consume? You might be working on some great new renewable energy material and running hundreds or thousands of simulations over the lifetime of the research. How does the energy consumed by the research stack up against the energy that will be generated/saved/stored etc. by the new material? Hopefully the stacking is gigantically in favour of the new material and its promised benefits.
Fortunately, we can do more than hope that that is the case: we can actively reduce the energy consumed by our simulations; indeed, it's the responsibility of every single computational modeller to do exactly that. Wouldn't it be great (not to say impressive) if, when you write your next funding application, you could give a ballpark figure for the amount of energy your computational research will consume over the lifetime of the project?
As a user you might be thinking 'but what effect can I have, when surely the HPC architecture is responsible for energy usage?' and 'then there's the code itself, which should be as efficient as possible, but if it's not I can't do anything about that'. Both of these thoughts are grounded in truth: the HPC architecture is fixed - but we can use it efficiently; the software we're using is structurally fixed - but we can run it efficiently.
The energy cost (E) of a simulation is the total power per core (P) consumed over the length of time (T) of the simulation, which for parallelised simulations run on N cores is E = N*P*T. From this it is logical to think that reducing N, P and/or T will reduce E, which is theoretically true. Practically, though, let's assume that the power consumed by each core is a fixed property of the HPC architecture; we then have E ∝ N*T. This effectively encapsulates where we, as users of HPC, can control the amount of energy our simulations consume, and it seems simple: all we need to do is learn how to optimize the number of cores and the length of time of our simulations.
We use multiple cores to share the memory load and to speed up a calculation, giving us three calculation properties to optimise: number of cores, memory per core, and time. To reduce the calculation time we might first increase the number of cores. Many users will already know that the relationship between core count and calculation time is non-linear, thanks to the required increase in core-to-core and node-to-node communication time. Taking the latter into account means the total energy used is E = N*T + f(N, T), where f(N, T) captures the energy cost of the core-core/node-node communication time.
To optimise energy efficiency, any speed-up in calculation time gained by increasing the number of cores needs to balance the increased energy cost of using additional cores. Therefore the speed-up factor needs to be more than the factor by which the number of cores increases, as shown in the equations below for a 2-core vs serial example:

E_s = T_s, with f(T_s) = 0        (energy of the serial, i.e. 1-core, calculation)
E_2N = 2*T_2N + f(2, T_2N)        (energy of the 2-core calculation)
E_2N <= E_s                       (for the energy cost of using 2 cores to be no greater than the energy cost of the serial calculation)

so 2*T_2N + f(2, T_2N) <= T_s, i.e. T_2N + (1/2)*f(2, T_2N) <= (1/2)*T_s,

which means that the total calculation time using 2 cores needs to be less than half of the serial time. So for users to run simulations efficiently in parallel, they need to balance the number of cores, the associated memory load per core, and the total calculation time. The following section shows how some of the more commonly used parameters within CASTEP affect these three properties.
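The break-even condition above is easy to explore numerically. The sketch below is purely illustrative (the numbers and the form of the communication term f are assumptions, not measurements); it works in units where the per-core power P = 1, so energy is counted in core-seconds:

```python
# E_serial = T_s ; E_parallel = N * T_N + f(N, T_N), with per-core power P = 1.

def parallel_energy(n_cores, t_parallel, comm_energy):
    """Energy of an N-core run: compute term plus communication term f."""
    return n_cores * t_parallel + comm_energy

def saves_energy(t_serial, n_cores, t_parallel, comm_energy):
    """True if the N-core run uses no more energy than the serial run."""
    return parallel_energy(n_cores, t_parallel, comm_energy) <= t_serial

# A 100 s serial job run on 2 cores with an assumed 5 core-second
# communication cost must finish in under (100 - 5) / 2 = 47.5 s to break even:
print(saves_energy(100.0, 2, 45.0, 5.0))  # True:  2*45 + 5 = 95 <= 100
print(saves_energy(100.0, 2, 48.0, 5.0))  # False: 2*48 + 5 = 101 > 100
```

The same check generalises to any core count N: the run must be more than N times faster than the serial run, by a margin that covers f(N, T).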
NB: The main purpose of the following examples is to illustrate the impact of different user-choices on the total energy cost of simulations. These examples do not indicate the level of 'accuracy' attained, because 'accuracy' is determined by the user according to the type, contents, and aims of their simulations.
2 Reducing the energy used by your simulation
This section uses an example of a small model of a clay mineral (and later a carbon nanotube) to illustrate how a user can change the total energy their simulation uses by a judicious choice of CASTEP input parameters.
Figure 1: Unit cell of a generic silicate clay mineral comprising 41 atoms.
A Cell file
Pseudopotentials
Choose the pseudopotential according to the type of simulation; e.g. for simulations of cell structures, ultrasofts [2] are often sufficient, although if the pseudopotential library does not contain an ultrasoft version for a particular element, the on-the-fly-generated (OTFG) ultrasofts [3] might suffice. If a user is running a spectroscopic simulation such as infrared using density functional perturbation theory [4], then norm-conserving [5] or OTFG norm-conserving [3] pseudopotentials could be the better choice. The impact of pseudopotential type on the computational cost is shown in Table 1 through the total (calculation) time.
Type of pseudopotential            | Ultrasoft | Norm-conserving | OTFG Ultrasoft | OTFG Ultrasoft QC5 set(b) | OTFG Norm-conserving
Cut-off energy (eV)                | 370       | 900             | 598            | 340                       | 925
# cores(a)                         | 5         | 5               | 5              | 5                         | 5
Memory/process (MB)                | 666       | 681             | 2072           | 1007                      | 681
Peak memory use (MB)               | 777       | 802             | 2785           | 1590                      | 791
Total time (secs)                  | 55        | 89              | 250            | 109                       | 136

Table 1: Pseudopotential and size of planewave set required on the 'fine' setting of Materials Studio 2020 [6], and an example of the memory and time required for a single point energy calculation using the recorded number of cores on a single node. Unless otherwise stated, the same cut-off energy per type of pseudopotential is implied throughout this document. (a) Using Sunbird (CPU: 2x Intel(R) Xeon(R) Gold 6148 CPU @ 2.40 GHz with 20 cores each); unless stated otherwise, all calculations were performed on this HPC cluster. (b) Designed to be used at the same modest (340 eV) kinetic energy cut-off across the periodic table; ideal for moderate-accuracy, high-throughput calculations, e.g. ab initio random structure searching (AIRSS).
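In the cell file, the pseudopotential choice can be made explicit with a SPECIES_POT block. A minimal sketch follows; the filenames are illustrative placeholders for files from your chosen library, and omitting the block entirely leaves CASTEP to generate OTFG potentials from its default library:

```
! Example SPECIES_POT block; filenames below are placeholders
%BLOCK SPECIES_POT
Si  Si_00.usp      ! ultrasoft pseudopotential file
O   O_00.usp
%ENDBLOCK SPECIES_POT
```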
K-points
Changing the number of Brillouin zone sampling points can have a dramatic effect on computational time, as shown in Table 2. Bear in mind that increasing the number of k-points increases the memory requirements, often tempting users to increase the number of cores, further increasing the overall computational cost. Remember, though, it's important to use the number of k-points that provides the level of accuracy your simulations need.
Type of pseudopotential      | Ultrasoft |           |           | OTFG Norm-conserving |           |
kpoints_mp_grid (# k-points) | 2 1 1 (1) | 3 2 1 (3) | 4 3 2 (12)| 2 1 1 (1)            | 3 2 1 (3) | 4 3 2 (12)
Memory/process (MB)          | 652       | 666       | 1249      | 630                  | 681       | 1287
Peak memory use (MB)         | 768       | 777       | 1580      | 764                  | 791       | 1296
Total time (secs)            | 32        | 55        | 222       | 85                   | 136       | 477

Table 2: Single point energy calculations run on 5 cores using different numbers of k-points (in brackets), showing the effects for different pseudopotentials.
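The k-point grids compared in Table 2 are requested in the cell file with a single keyword, e.g. (the bracketed counts in Table 2 are the resulting numbers of k-points for this particular cell):

```
! Monkhorst-Pack grid used for Brillouin zone sampling
kpoints_mp_grid 3 2 1
```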
Vacuum space
When building a material surface it is necessary to add vacuum space to a cell (see Figure 2 for an example), and this adds to the memory requirements and calculation time because the 'empty space' (as well as the atoms) is 'filled' by planewaves. Table 3 shows that doubling the volume of vacuum space doubles the total calculation time (using the same number of cores).
Vacuum space (Å)                   | 0   | 5   | 10   | 20
Memory/process (MB)                | 666 | 766 | 834  | 1078
Peak memory use (MB)               | 777 | 928 | 1066 | 1372
Total time (secs)                  | 55  | 102 | 202  | 406
Overall parallel efficiency (%)(a) | 69  | 66  | 67   | 61

Figure 2: Vacuum space added to create a clay mineral surface (to study adsorbate-surface interactions, for example; adsorbate not included in the above).
Table 3: Single point energy calculations using ultrasoft pseudopotentials and 3 k-points, run on 5 cores, showing the effects of vacuum space. (a) Calculated automatically by CASTEP.
Supercell size
The size of a system is one of the more obvious choices affecting the demands on computational resources; nevertheless it is interesting to see (from Table 4) that, for the same number of k-points, doubling the number of atoms increases the memory load per process by between 35% (41 to 82 atoms) and 72% (82 to 164 atoms), and the corresponding calculation times increase by factors of 11 and 8.5 respectively. In good practice the number of k-points is scaled according to the supercell size, increasing the computational cost more modestly.
Supercell size (# atoms)           | 1 x 1 x 1 (41) | 2 x 1 x 1 (82) | 2 x 1 x 1 (82) | 2 x 2 x 1 (164) | 2 x 2 x 1 (164)
Kpoints mp grid (# kpoints)        | 3 2 1 (3)      | 3 2 1 (3)      | 2 1 1 (1)(b)   | 3 2 1 (3)       | 2 1 1 (1)(b)
Memory/process (MB)                | 666            | 897            | 732            | 1547            | 1315
Peak memory use (MB)               | 777            | 1175           | 1025           | 2330            | 2177
Total time (secs)                  | 55             | 631            | 329            | 5416            | 1660
Overall parallel efficiency (%)(a) | 69             | 69             | 74             | 67              | 72

Table 4: Single point energy calculations using ultrasoft pseudopotentials, run on 5 cores, showing the effects of supercells. (a) Calculated automatically by CASTEP. (b) K-points scaled for the 2x1x1 and 2x2x1 supercells.
Figure 3: Example of a 2 x 2 x 1 supercell.
Orientation of axes
This might be one of the more surprising and unexpected properties of a model that affects computational efficiency. The effect becomes significant when a system is large, disproportionately longer along one of its lengths, and misaligned with the x-, y-, z-axes; see Figure 4 and Table 5 for exaggerated examples of misalignment. This effect is due to the way CASTEP transforms properties between real space and reciprocal space: it converts the 3-d fast Fourier transforms (FFT) into three sets of 1-d FFTs along columns that lie parallel to the x-, y-, z-axes.
Figure 4: Top row: a capped carbon nanotube (160 atoms); bottom row: a long carbon nanotube (1000 atoms); long axes aligned in the x-direction (left), z-direction (middle), and skewed (right).

Orientation (# atoms)                                      | X (160) | Z (160) | Skewed (160) | X (1000) | Z (1000) | Skewed (1000)
# cores                                                    | 5       | 5       | 5            | 60       | 60       | 60
Memory/process (MB)                                        | 884     | 882     | 882          | 2870     | 2870     | 2870
Peak memory use (MB)                                       | 1893    | 1885    | 1838         | 7077     | 7077     | 7077
Total time (secs)                                          | 392     | 359     | 409          | 3906     | 3908     | 5232
Overall parallel efficiency (%)(a)                         | 79      | 84      | 82           | 78       | 78       | 75
Relative total energy (# cores x total time, core-seconds) | 1960    | 1795    | 2045         | 234360   | 234480   | 313920

Table 5: Single point energy calculations of carbon nanotubes, as oriented in Figure 4, using ultrasoft pseudopotentials (280 eV cut-off energy) and 1 k-point. (a) Calculated automatically by CASTEP.
B Param file
Grid-scale
Although the ultrasofts require a smaller planewave basis set than the norm-conserving pseudopotentials, they do need a finer electron density grid, set via the 'grid_scale' and 'fine_grid_scale' parameters. As shown in Table 6, the denser grid scale setting needed by the OTFG ultrasofts (with the exception of the QC5 set) can almost double the calculation time compared with the more planewave-hungry OTFG norm-conserving pseudopotentials, which converge well under a less dense grid.
Type of pseudopotential       | Norm-conserving | Norm-conserving | Ultrasoft | OTFG Norm-conserving | OTFG Norm-conserving | OTFG Ultrasoft | OTFG Ultrasoft QC5 set
grid_scale / fine_grid_scale  | 1.5 / 1.75      | 2.0 / 3.0       | 2.0 / 3.0 | 1.5 / 1.75           | 2.0 / 3.0            | 2.0 / 3.0      | 2.0 / 3.0
Memory/process (MB)           | 792             | 681             | 666       | 680                  | 731                  | 2072           | 1007
Peak memory use (MB)          | 803             | 1070            | 777       | 791                  | 956                  | 2785           | 1590
Total time (secs)             | 89              | 150             | 55        | 136                  | 221                  | 250            | 109

Table 6: Single point energy calculations run on 5 cores, showing the effects of different electron density grid settings.
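The grid settings compared above are param file keywords; e.g. the denser pair used for the ultrasofts:

```
grid_scale      : 2.0
fine_grid_scale : 3.0
```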
Data Distribution
Parallelizing over plane wave vectors ('G-vectors'), k-points, or a mix of the two has an impact on computational efficiency, as shown in Table 7.

The default for a param file without the keyword 'data_distribution' is to prioritize k-point distribution across a number of cores (less than or equal to the number requested in the submission script) that is a factor of the number of k-points; see, for example, Table 7, columns 2 and 3. Inserting 'data_distribution : kpoint' into the param file prioritizes and optimizes the k-point distribution across the number of cores requested in the script. In the example tested, selecting data distribution over k-points increased the calculation time over the default of no data distribution; compare columns 3 and 5 of Table 7.

Requesting G-vector distribution has the largest impact on calculation time, and combining this with requesting a number of cores that is also a factor of the number of k-points has the overall largest impact on reducing calculation time; see columns 6 and 7 of Table 7. Requesting mixed data distribution has a similar impact on calculation time as not requesting any data distribution for 5 cores, but not for 6 cores: the 'mixed' distribution used 4-way k-point distribution rather than the 6-way distribution applied by the default (no request); compare columns 2 and 3 with 8 and 9.

For the small clay model system, the optimal efficiency was obtained using G-vector data distribution over 6 cores (852 core-seconds) and the least efficient choice was mixed data distribution over 6 cores (1584 core-seconds). These results are system-specific and need careful testing to tailor to different systems.
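Each strategy in Table 7 corresponds to a single param file line, e.g.:

```
! one of: kpoint, gvector, mixed (omit the line entirely for the default behaviour)
data_distribution : gvector
```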
Number of tasks per node
This is invoked by adding 'num_proc_in_smp' to the param file, and controls the number of message passing interface (MPI) tasks placed in a specific OpenMP (SMP) group. This means that the 'all-to-all' communication is done in three phases instead of one: (1) tasks within an SMP group collect their data together on a chosen 'controller' task within their group; (2) the all-to-all is done between the controller tasks; (3) the controllers all distribute the data back to the tasks in their SMP groups. For small core counts the overhead of the two extra phases makes this method slower than just doing an all-to-all; for large core counts the reduction in the all-to-all time more than compensates for the extra overhead, so it's faster. Indeed, the tests (shown in Table 8) reveal that invoking this flag fails to produce as large a speed-up as the flag 'data_distribution : gvector' (compare columns 3 and 9) on the test HPC cluster, Sunbird, reflecting the small core count requested. Generally speaking, the more cores in the G-vector group, the higher you want to set 'num_proc_in_smp' (up to the physical number of cores on a node).
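As a sketch, the two flags discussed in this and the previous subsection might be combined in the param file as follows (the value 4 is purely illustrative; tune it towards the physical cores per node as the G-vector group grows):

```
data_distribution : gvector
num_proc_in_smp   : 4        ! MPI tasks per SMP group
```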
Column                                                     | 2            | 3            | 4            | 5            | 6             | 7             | 8            | 9
Requested data distribution + # cores in submission script | None, 5      | None, 6      | Kpoints, 5   | Kpoints, 6   | Gvector, 5    | Gvector, 6    | Mixed, 5     | Mixed, 6
Actual data distribution                                   | kpoint 4-way | kpoint 6-way | kpoint 5-way | kpoint 6-way | Gvector 5-way | Gvector 6-way | kpoint 4-way | kpoint 4-way
Memory/process (MB)                                        | 1249         | 1219         | 1249         | 1219         | 728           | 698           | 1249         | 1253
Peak memory use (MB)                                       | 1581         | 1561         | 1581         | 1561         | 839           | 804           | 1581         | 1585
Total time (secs)                                          | 295          | 199          | 292          | 226          | 191           | 142           | 294          | 264
Overall parallel efficiency (%)(a)                         | 99           | 96           | 98           | 96           | 66            | 71            | 98           | 96
Relative total energy (# cores x total time, core-seconds) | 1475         | 1194         | 1460         | 1356         | 955           | 852           | 1470         | 1584

Table 7: Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, showing the effects of data distribution across different numbers of cores requested in the script file. 'Actual data distribution' means that reported by CASTEP on completion, in this and (where applicable) all following Tables. 'Relative total energy' assumes that each core requested by the script consumes X amount of electricity. (a) Calculated automatically by CASTEP.
Column                             | 2            | 3             | 4            | 5             | 6            | 7             | 8            | 9
num_proc_in_smp                    | Default      | Default       | 2            | 2             | 4            | 4             | 5            | 5
Requested data_distribution        | None         | Gvector       | None         | Gvector       | None         | Gvector       | None         | Gvector
Actual data distribution           | kpoint 4-way | Gvector 5-way | kpoint 4-way | Gvector 5-way | kpoint 4-way | Gvector 5-way | kpoint 4-way | Gvector 5-way
Memory/process (MB)                | 1249         | 728           | 1249         | 728           | 1249         | 728           | 1249         | 728
Peak memory use (MB)               | 1580         | 837           | 1581         | 839           | 1581         | 844           | 1581         | 846
Total time (secs)                  | 222          | 156           | 231          | 171           | 230          | 182           | 237          | 183
Overall parallel efficiency (%)(a) | 96           | 66            | 98           | 60            | 98           | 56            | 96           | 56

Table 8: Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, run on 5 cores, showing the effects of setting 'num_proc_in_smp' to 2, 4, and 5, both with and without the 'data_distribution : gvector' flag. 'Default' means 'num_proc_in_smp' absent from the param file. (a) Calculated automatically by CASTEP.
Optimization strategy
This parameter has three settings, and is invoked through the 'opt_strategy' flag in the param file:

Default - Balances speed and memory use. Wavefunction coefficients for all k-points in a calculation will be kept in memory rather than paged to disk; some large work arrays will be paged to disk.
Memory - Minimizes memory use. All wavefunctions and large work arrays are paged to disk.
Speed - Maximizes speed by not paging to disk.

This means that if a user runs a large-memory calculation, optimizing for memory could obviate the need to request additional cores, although the calculation will take longer; see Table 9 for comparisons.
opt_strategy                       | Default | Memory | Speed
Memory/process (MB)                | 793     | 750    | 1249
Peak memory use (MB)               | 1566    | 1092   | 1581
Total time (secs)                  | 232     | 290    | 221
Overall parallel efficiency (%)(a) | 94      | 97     | 96

Table 9: Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, run on 5 cores, showing the effects of optimizing for speed or memory. 'Default' means either omitting the 'opt_strategy' flag from the param file or adding it as 'opt_strategy : default'. (a) Calculated automatically by CASTEP.
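The corresponding param file entry is a single line, e.g.:

```
! one of: default, memory, speed
opt_strategy : speed
```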
Spin polarization
If a system comprises an odd number of electrons, it might be important to differentiate between the spin-up and spin-down states of the odd electron. This directly affects the calculation time, effectively doubling it, as shown in Table 10.
param flag: spin_polarization      | false | true
Memory/process (MB)                | 1249  | 1415
Peak memory use (MB)               | 1581  | 1710
Total time (secs)                  | 222   | 455
Overall parallel efficiency (%)(a) | 96    | 98

Table 10: Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, run on 5 cores, showing the effects of spin polarization. (a) Calculated automatically by CASTEP.
Electronic energy minimizer
Insulating systems often behave well during the self-consistent field (SCF) minimizations and converge smoothly using density mixing ('DM'). When SCF convergence is problematic and all attempts to tweak DM-related parameters have failed, it is necessary to turn to ensemble density functional theory [7] and accept the consequent (and considerable) increase in computational cost; see Table 11.
param flag: metals_method (electron minimization) | DM   | EDFT
Memory/process (MB)                               | 1249 | 1289
Peak memory use (MB)                              | 1581 | 1650
Total time (secs)                                 | 222  | 370
Overall parallel efficiency (%)(a)                | 96   | 97

Table 11: Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, run on 5 cores, showing the effects of the electronic minimization method. 'DM' means density mixing and 'EDFT' ensemble density functional theory. (a) Calculated automatically by CASTEP.
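Both settings are single param file lines. A sketch follows; the keyword spellings reflect common CASTEP usage (Table 10 calls the first 'spin_polarization') and should be checked against your version's documentation:

```
spin_polarized : true    ! needed to resolve an odd electron count
metals_method  : dm      ! density mixing; switch to edft only when DM will not converge
```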
C Script submission file
Figure 5 An example HPC batch submission script
Figure 5 captures the script variables that affect HPC computational energy and usage efficiency:

(i) The variable familiar to most HPC users describes the number of cores ('tasks') requested for the simulation. Unless the calculation is memory hungry, configure the requested number of cores to sit on the fewest nodes, because this reduces expensive node-to-node communication time.
(ii) Choosing the shortest realistic job run time gives the calculation a better chance of progressing through the job queue swiftly.
(iii) When not requesting use of all cores on a single node, remove the 'exclusive' flag to accelerate progress through the job queue.
(iv) Using the most recent version of the software captures the latest upgrades and bug-fixes that might otherwise slow down a calculation.
(v) Using the '--dryrun' tag provides a (very) broad estimate of the memory requirements. In one example the estimate of peak memory use was ¼ of that actually used during the simulation proper.
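Figure 5 itself is not reproduced here, but a SLURM script embodying points (i)-(iv) might look like the following sketch; the job, module and seed names are assumptions to adapt to your own cluster:

```
#!/bin/bash
#SBATCH --job-name=clay_spe
#SBATCH --ntasks=5           # (i) modest core count, placed on as few nodes as possible
#SBATCH --nodes=1
#SBATCH --time=00:30:00      # (ii) shortest realistic run time
# (iii) no --exclusive flag, since the whole node is not needed

module load castep/latest    # (iv) most recent installed version (module name is an example)

# (v) optional pre-check of memory requirements:  castep --dryrun mineral
srun castep.mpi mineral      # reads mineral.cell and mineral.param
```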
D An (extreme) example
Clay mineral (Figure 2)                             | Careful: optimised for energy efficiency | Careless: no optimisation for energy efficiency
Vacuum space (Å)                                    | 10                 | 10
Pseudopotential, cut-off energy (eV)                | Ultrasoft, 370     | OTFG-Ultrasoft, 599
K-points                                            | 3                  | 12
grid_scale, fine_grid_scale                         | 2, 3               | 3, 4
num_proc_in_smp / requested data distribution       | default / Gvector  | 20 / none
Actual data distribution                            | 5-way Gvector only | 3-way Gvector, 12-way kpoint, 3-way (Gvector) smp
Optimization strategy                               | Speed              | Default
Spin polarization                                   | False              | True
Electronic energy minimizer                         | Density mixing     | EDFT
Number of cores requested                           | 5                  | 40
RESULTS
Memory/process (MB), scratch disk (MB)              | 834, 0             | 1461, 6518
Peak memory use (MB)                                | 1066               | 9107
Total time (seconds)                                | 215                | 45302
Overall parallel efficiency (%)(a)                  | 69                 | 96
Relative total energy (core-seconds; core-hours)    | 1075; 0.30         | 1,812,080; 503.36
kiloJoules used (approx.)                           | 202                | 52,000

Table 12: One clay mineral model (Figure 2) with vacuum space of 10 Å. Single point energy calculations showing the difference between carefully optimizing for energy efficiency and carelessly running without pre-testing. (a) Calculated automatically by CASTEP.
Table 12 illustrates the combined effects on the total time and overall use of computational resources of many of the model properties and parameters discussed in the previous section. It's unlikely a user would choose the whole combination of model properties and parameters shown in the 'careless' column, but it nevertheless gives an idea of the impact a user can have on the energy consumption of their simulations. For comparison, the cheapest electric car listed in 2021 consumes 26.8 kWh per 100 miles, or 603 kJ/km, which means that the carelessly run simulation used the equivalent energy of driving this car about 86 km, whereas the efficiently run simulation 'drove' it 0.33 km.
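The 'kiloJoules used' row and the car comparison follow from simple arithmetic once an effective power per core is assumed. In the sketch below, the 188 W figure is back-calculated from the careful run in Table 12 (202 kJ over 1075 core-seconds); it is an assumption about that particular cluster, not a general constant:

```python
# Rough job-energy estimate, E = N * P * T, plus the electric-car comparison.

CAR_KJ_PER_KM = 603.0    # cheapest 2021 electric car, from the text above
WATTS_PER_CORE = 188.0   # assumed effective power draw per requested core

def job_energy_kj(n_cores, time_s, watts_per_core=WATTS_PER_CORE):
    """Energy in kilojoules for a job on n_cores lasting time_s seconds."""
    return n_cores * time_s * watts_per_core / 1000.0

def equivalent_car_km(energy_kj):
    """Distance the reference electric car could travel on the same energy."""
    return energy_kj / CAR_KJ_PER_KM

careful = job_energy_kj(5, 215)   # ~202 kJ, the 'careful' run of Table 12
print(careful, equivalent_car_km(careful))
```

Swapping in your own cluster's measured per-core power (e.g. from SLURM's energy accounting, Section 4) makes this a quick ballpark estimator for funding applications.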
For computational scientists and modellers, applying good energy efficiency practices needs to become second nature; following an energy efficiency 'recipe' or procedure is a route to embedding this practice as a habit.
3 Developing energy efficient computing habits: A recipe
1) Build a model of a system that contains only the essential ingredients that allow exploration of the scientific question. This is one of the key factors that determine the size of a model.
2) Find out how many cores per node there are on the available HPC cluster. This enables users to request the number of cores/tasks that minimizes inter-node communication during a simulation.
3) Choose the pseudopotentials to match the science. This ensures users don't use pseudopotentials that are unnecessarily computationally expensive.
4) Carry out extensive convergence testing based on the minimum accuracy required for the production run results, e.g.:
(i) kinetic energy cut-off (depends on pseudopotential choice);
(ii) grid scale and fine grid scale (depend on pseudopotential choice);
(iii) size and orientation of the model, including e.g. number of bulk atoms, number of layers, size of surface vacuum space, etc.;
(iv) number of k-points.
These decrease the possibility of over-convergence and its associated computational cost.
5) Spend time optimising the param file properties described in Section B, using a small number of SCF cycles:
a. Data distribution: Gvector, k-points or mixed
b. Number of tasks per node
c. Optimization strategy
d. Spin polarization
e. Electronic energy (SCF) minimization method
This increases the chances of using resources efficiently, due to matching the model and material requirements to the simulation parameters.
6) Optimise the script file. This increases the efficient use of HPC resources.
7) Submit the calculation and initially monitor it to check it's progressing as expected. This reduces the chances of wasting computational time due to trivial ('Friday afternoon') mistakes.
8) Carry out your own energy efficient computing tests (and send your findings to the JISCMAIL CASTEP mailing list).
9) Sit back and wait for the simulation to complete, basking in the knowledge that the simulation is running as energy efficiently (see footnote 1) as a user can possibly make it.
4 What else can a user do?
In addition to using the above recipe to embed energy-efficient computing habits, a user can take a number of actions to encourage wider awareness and adoption of energy efficient computing:

a. If the HPC cluster uses SLURM, use the 'sacct' command to check the amount of energy consumed (in Joules) by a job (see footnote 2), as shown in Figure 6.
b. If your local cluster uses a different job-scheduler, ask your local IT helpdesk whether it has the facility to monitor the energy consumed by each HPC job.
c. Include the energy consumption of simulations in all forms of reports and presentations, e.g. informal talks, posters, peer-reviewed journal articles, social media posts, etc. This will increase awareness of our role as environmentally aware and conscientious computational scientists and users of HPC resources.
Footnote 1: It's highly probable that users can expand on the list of model properties and parameters described within this document to further optimise energy efficient computing.
Footnote 2: 'Note: Only in case of exclusive job allocation this value reflects the job's real energy consumption' - see https://slurm.schedmd.com/sacct.html
Figure 6: Examples of information about jobs output through SLURM's 'sacct' command (plus flags). Top: list of details about several jobs run from 20/03/2021; bottom: details for a specific job ID via the 'seff <jobID>' command.
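For example, the kind of query behind Figure 6 can be run as follows (the date and job ID are placeholders; the ConsumedEnergy field is only populated when SLURM's energy accounting plugin is enabled, and per the note above is only reliable for exclusive allocations):

```
# details of jobs run since a given date, including recorded energy use
sacct --starttime=2021-03-20 --format=JobID,JobName,Elapsed,NNodes,ConsumedEnergy

# efficiency summary for a single completed job
seff 123456
```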
d. Include estimates of the energy consumption of simulations in applications for funding. Although not yet explicitly requested in EPSRC funding applications, there is the expectation that UKRI's 2020 commitment to Environmental Sustainability will filter down to all activities of its research councils, including funding. This will mean that funding applicants will need to demonstrate their awareness of the environmental impact of their proposed work. Become an impressive pioneer and include environmental impact through energy consumption in your next application.
5 What are the developers doing?
The compilation of this document included a chat with several of the developers of CASTEP, who are keen to help users run their software energy efficiently; they shared their plans and projects in this field.
Parts of CASTEP have been programmed to run on GPUs, with up to a 15-fold speed-up (for non-local functionals).

Work on a CASTEP simulator is underway that should reduce the number of CASTEP calculations required per simulation by choosing an optimal parallel domain decomposition and implementing timings for FFTs (the big parallel cost); it will also estimate compute usage. This simulator will go a long way to providing the structure needed to add energy efficiency to CASTEP, and will be accessible through the '--dryrun' command. The toy code is available in Bitbucket.
The developers recognise the need for energy consumption to be acknowledged as an additional factor to be included in the cost of computational simulations. They are planning their approach beyond the software itself, such as including energy efficient computing in their training courses.
Acknowledgements
I acknowledge the support of the Supercomputing Wales project which is part-funded by the
European Regional Development Fund (ERDF) via Welsh Government
Thank you to the following CASTEP developers for their invaluable input and support for this small project: Dr Phil Hasnip and Prof Matt Probert (University of York), Prof Chris Pickard (University of Cambridge), Dr Dominik Jochym (STFC), and Prof Stewart Clark (University of Durham). Thanks also to Dr Sue Thorne (STFC) and Dr Ed Bennett (Supercomputing Wales) for sharing their research engineering perspectives.
References
[1] Clark, S. J.; Segall, M. D.; Pickard, C. J.; Hasnip, P. J.; Probert, M. I. J.; Refson, K.; Payne, M. C. First Principles Methods Using CASTEP. Z. Krist. 2005, 220, 567-570.
[2] Vanderbilt, D. Soft Self-Consistent Pseudopotentials in a Generalized Eigenvalue Formalism. Phys. Rev. B 1990, 41, 7892-7895.
[3] Pickard, C. J. On-the-Fly Pseudopotential Generation in CASTEP. 2006.
[4] Refson, K.; Clark, S. J.; Tulip, P. Variational Density Functional Perturbation Theory for Dielectrics and Lattice Dynamics. Phys. Rev. B 2006, 73, 155114.
[5] Hamann, D. R.; Schlüter, M.; Chiang, C. Norm-Conserving Pseudopotentials. Phys. Rev. Lett. 1979, 43 (20), 1494-1497.
[6] BIOVIA, Dassault Systèmes. Materials Studio 2020; Dassault Systèmes: San Diego, 2019.
[7] Marzari, N.; Vanderbilt, D.; Payne, M. C. Ensemble Density Functional Theory for Ab Initio Molecular Dynamics of Metals and Finite-Temperature Insulators. Phys. Rev. Lett. 1997, 79, 1337-1340.
As a user you might be thinking lsquobut what effect can I have when surely the HPC architecture
is responsible for energy usagersquo and lsquothen therersquos the code itself which should be as
efficient as possible but if itrsquos not I canrsquot do anything about thatrsquo Both of these thoughts are
grounded in truth the HPC architecture is fixed - but we can use it efficiently the software
wersquore using is structurally fixed ndash but we can run it efficiently
The energy cost (E) of a simulation is the total power per core (P) consumed over the length
of time (T ) of the simulation which for parallelised simulations run on (N) cores is 119864 = 119873119875119879
From this it is logical to think that reducing N P andor T will reduce E which is theoretically
true Practically though letrsquos assume that the power consumed by each core is a fixed
property of the HPC architecture we now have 119864 prop 119873119879 This effectively encapsulates where
we as users of HPC can control the amount of energy our simulations consume and seems
simple All we need to do is learn how to optimize the number of cores and the length of time
of our simulations
We use multiple cores to share the memory load and to speed up a calculation, giving us three calculation properties to optimise: number of cores, memory per core, and time. To reduce the calculation time we might first increase the number of cores. Many users might already know that the relationship between core count and calculation time is non-linear, thanks to the required increase in core-to-core and node-to-node communication time. Taking the latter into account, the total energy used is E = NT + f(N, T), where f(N, T) captures the energy cost of the core-core/node-node communication time.
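The trade-off encoded in E = NT + f(N, T) can be sketched numerically. The model below is purely illustrative: it assumes an Amdahl-style runtime T(N) and an invented communication term, not CASTEP measurements, to show that wall-time can fall while total energy rises.

```python
# Illustrative only: a hypothetical Amdahl-style runtime T(N) and an invented
# communication term f(N, T), to show that wall-time can fall while the
# total energy E = N*T + f(N, T) rises.

def runtime(n_cores, t_serial=1000.0, parallel_fraction=0.9):
    """Hypothetical Amdahl's-law wall-time for a run on n_cores."""
    return t_serial * ((1.0 - parallel_fraction) + parallel_fraction / n_cores)

def energy(n_cores, comm_cost=0.5):
    """Relative energy: cores x time, plus a toy communication term f(N, T)."""
    t = runtime(n_cores)
    return n_cores * t + comm_cost * (n_cores - 1) * t

if __name__ == "__main__":
    for n in (1, 2, 5, 10, 20):
        print(f"{n:2d} cores: T = {runtime(n):6.1f}, E = {energy(n):7.1f}")
```

In this toy model the 20-core run is roughly seven times faster than serial but consumes about four times the energy; real behaviour is system- and cluster-specific.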
To optimise energy efficiency, any speed-up in calculation time gained by increasing the number of cores needs to balance the increased energy cost of using additional cores. Therefore the speed-up factor needs to be more than the factor by which the number of cores increases, as shown in the equations below for a 2-core vs serial example:

  E_S = T_S (with f(T_S) = 0)        Energy of serial (i.e. 1-core) calculation
  E_2N = 2T_2N + f(2, T_2N)          Energy of 2-core calculation
  E_2N ≤ E_S                         For the energy cost of using 2 cores to be no greater
                                     than the energy cost of the serial calculation
  2T_2N + f(2, T_2N) ≤ T_S,  i.e.  T_2N + (1/2)f(2, T_2N) ≤ (1/2)T_S

which means that the total calculation time using 2 cores needs to be less than half of the serial time. So, for users to run simulations efficiently in parallel, they need to balance the number of cores, the associated memory load per core, and the total calculation time. The following section shows how some of the more commonly used parameters within CASTEP affect these three properties.
NB: The main purpose of the following examples is to illustrate the impact of different user-choices on the total energy cost of simulations. These examples do not indicate the level of 'accuracy' attained, because 'accuracy' is determined by the user according to the type, contents and aims of their simulations.
2 Reducing the energy used by your simulation
This section uses an example of a small model of a clay mineral (and later a carbon nanotube) to illustrate how a user can change the total energy their simulation uses by a judicious choice of CASTEP input parameters.

Figure 1: Unit cell of a generic silicate clay mineral comprising 41 atoms.
A Cell file
Pseudopotentials
Choose the pseudopotential according to the type of simulation: e.g. for simulations of cell structures, ultrasofts2 are often sufficient, although if the pseudopotential library does not contain an ultrasoft version for a particular element, the on-the-fly-generated (OTFG) ultrasofts3 might suffice. If a user is running a spectroscopic simulation, such as infrared using density functional perturbation theory4, then norm-conserving5 or OTFG norm-conserving3 could be the better choices. The impact of pseudopotential type on the computational cost is shown in Table 1 through the total (calculation) time.
Type of pseudopotential   Ultrasoft   Norm-conserving   OTFG Ultrasoft   OTFG Ultrasoft QC5 set(b)   OTFG Norm-conserving
Cut-off energy (eV)       370         900               598              340                         925
# cores(a)                5           5                 5                5                           5
Memory/process (MB)       666         681               2072             1007                        681
Peak memory use (MB)      777         802               2785             1590                        791
Total time (secs)         55          89                250              109                         136
Table 1: Pseudopotential and size of planewave set required on the 'fine' setting of Materials Studio 2020,6 with an example of the memory requirements and time required for a single point energy calculation using the recorded number of cores on a single node. Unless otherwise stated, the same cut-off energy per type of pseudopotential is implied throughout this document. (a) Using Sunbird (CPU: 2x Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz with 20 cores each); unless stated otherwise, all calculations were performed on this HPC cluster. (b) Designed to be used at the same modest (340 eV) kinetic energy cut-off across the periodic table; ideal for moderate-accuracy, high-throughput calculations, e.g. ab initio random structure searching (AIRSS).
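In the cell file, the pseudopotential choice can be made explicit in a SPECIES_POT block; a hedged sketch is below. The file names are placeholders, and the exact library names depend on your CASTEP installation; omitting the block entirely makes CASTEP generate OTFG potentials on the fly.

```
%BLOCK SPECIES_POT
Si  Si_00PBE.usp    ! placeholder ultrasoft potential file
O   O_00PBE.usp     ! swap for norm-conserving files for e.g. DFPT runs
%ENDBLOCK SPECIES_POT
```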
K-points
Changing the number of Brillouin zone sampling points can have a dramatic effect on computational time, as shown in Table 2. Bear in mind that increasing the number of k-points increases the memory requirements, often tempting users to increase the number of cores, further increasing overall computational cost. Remember, though: it's important to use the number of k-points that provides the level of accuracy your simulations need.
Type of pseudopotential   Ultrasoft                              OTFG Norm-conserving
kpoints_mp_grid           2 1 1 (1)   3 2 1 (3)   4 3 2 (12)     2 1 1 (1)   3 2 1 (3)   4 3 2 (12)
Memory/process (MB)       652         666         1249           630         681         1287
Peak memory use (MB)      768         777         1580           764         791         1296
Total time (secs)         32          55          222            85          136         477
Table 2: Single point energy calculations run on 5 cores using different numbers of k-points (in brackets), showing the effects for different pseudopotentials.
Vacuum space
When building a material surface it is necessary to add vacuum space to a cell (see Figure 2
for example) and this adds to the memory requirements and calculation time because the
lsquoempty spacersquo (as well as the atoms) is lsquofilledrsquo by planewaves Table 3 shows that doubling
the volume of vacuum space doubles the total calculation time (using the same number of
cores)
Vacuum space (Å)                 0      5      10     20
Memory/process (MB)              666    766    834    1078
Peak memory use (MB)             777    928    1066   1372
Total time (secs)                55     102    202    406
Overall parallel efficiency(a)   69%    66%    67%    61%
Figure 2: Vacuum space added to create a clay mineral surface (to study adsorbate-surface interactions, for example; adsorbate not included in the above).
Table 3: Single point energy calculations using ultrasoft pseudopotentials and 3 k-points, run on 5 cores, showing the effects of vacuum space. (a) Calculated automatically by CASTEP.
Supercell size
The size of a system is one of the more obvious choices that affects the demands on computational resources; nevertheless it is interesting to see (from Table 4) that, for the same number of k-points, doubling the number of atoms increases the memory load per process by between 35% (41 to 82 atoms) and 72% (82 to 164 atoms), and the corresponding calculation times increase by factors of 11 and 8.5 respectively. In good practice the number of k-points is scaled according to the supercell size, increasing the computational cost more modestly.
Supercell size (# atoms)         1 x 1 x 1 (41)   2 x 1 x 1 (82)             2 x 2 x 1 (164)
Kpoints (mp grid)                3 2 1 (3)        3 2 1 (3)   2 1 1 (1)*     3 2 1 (3)   2 1 1 (1)*
Memory/process (MB)              666              897         732            1547        1315
Peak memory use (MB)             777              1175        1025           2330        2177
Total time (secs)                55               631         329            5416        1660
Overall parallel efficiency(a)   69%              69%         74%            67%         72%

*K-points scaled for supercells 2x1x1 and 2x2x1.

Table 4: Single point energy calculations using ultrasoft pseudopotentials, run on 5 cores, showing the effects of supercells. (a) Calculated automatically by CASTEP.
Figure 3: Example of a 2 x 2 x 1 supercell.
Orientation of axes
This might be one of the more surprising and unexpected properties of a model that affects
computational efficiency The effect becomes significant when a system is large
disproportionately longer along one of its lengths and is misaligned with the x- y- z-axes
see Figure 4 and Table 5 for exaggerated examples of misalignment This effect is due to
the way CASTEP transforms real-space properties between real-space and reciprocal-
space it converts the 3-d fast Fourier transforms (FFT) to three 1-d FFT columns that lie
parallel to the x- y z-axes
Figure 4: Top row: a capped carbon nanotube (160 atoms); bottom row: a long carbon nanotube (1000 atoms), showing long axes aligned in the x-direction (left), z-direction (middle), and skewed (right).
Orientation (# atoms)            X (160)   Z (160)   Skewed (160)   X (1000)   Z (1000)   Skewed (1000)
Cores                            5         5         5              60         60         60
Memory/process (MB)              884       882       882            2870       2870       2870
Peak memory use (MB)             1893      1885      1838           7077       7077       7077
Total time (secs)                392       359       409            3906       3908       5232
Overall parallel efficiency(a)   79%       84%       82%            78%        78%        75%
Relative total energy
(# cores x total time,
core-seconds)                    1960      1795      2045           234360     234480     313920
Table 5: Single point energy calculations of carbon nanotubes as oriented in Fig. 4, using ultrasoft pseudopotentials (280 eV cut-off energy) and 1 k-point. (a) Calculated automatically by CASTEP.
B Param file
Grid-scale
Although the ultrasofts require a smaller planewave basis set than the norm-conserving pseudopotentials, they do need a finer electron density grid, set via 'grid_scale' and 'fine_grid_scale'. As shown in Table 6, the denser grid scale setting for the OTFG ultrasofts (with the exception of the QC5 set) can almost double the calculation time compared with the more planewave-hungry OTFG norm-conserving pseudopotentials, which converge well on a less dense grid.
Type of pseudopotential        Norm-conserving        Ultrasoft   OTFG Norm-conserving   OTFG Ultrasoft   OTFG Ultrasoft QC5 set
grid_scale / fine_grid_scale   1.5/1.75   2.0/3.0     2.0/3.0     1.5/1.75   2.0/3.0     2.0/3.0          2.0/3.0
Memory/process (MB)            792        681         666         680        731         2072             1007
Peak memory use (MB)           803        1070        777         791        956         2785             1590
Total time (secs)              89         150         55          136        221         250              109
Table 6: Single point energy calculations run on 5 cores, showing the effects of different electron density grid settings.
Data Distribution
Parallelizing over plane wave vectors ('G-vectors'), k-points, or a mix of the two has an impact on computational efficiency, as shown in Table 7.

The default for a param file without the keyword 'data_distribution' is to prioritize k-point distribution across a number of cores (less than or equal to the number requested in the submission script) that is a factor of the number of k-points; see for example Table 7, columns 2 and 3. Inserting 'data_distribution : kpoint' into the param file prioritizes and optimizes the k-point distribution across the number of cores requested in the script. In the example tested, selecting data distribution over k-points increased the calculation time over the default of no data distribution; compare columns 3 and 5 of Table 7.

Requesting G-vector distribution has the largest impact on calculation time, and combining this with requesting a number of cores that is also a factor of the number of k-points has the overall largest impact on reducing calculation time; see columns 6 and 7 of Table 7.

Requesting mixed data distribution has a similar impact on calculation time as not requesting any data distribution for 5 cores, but not for 6 cores: the 'mixed' distribution used 4-way k-point distribution rather than the 6-way distribution applied by the default (no request); compare columns 2 and 3 with 8 and 9.

For the small clay model system the optimal efficiency was obtained using G-vector data distribution over 6 cores (852 core-seconds), and the least efficient choice was mixed data distribution over 6 cores (1584 core-seconds). These results are system-specific and need careful testing to tailor to different systems.
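In the param file this is a one-line setting; a sketch using the keyword spelling above (the colon separator is optional in CASTEP input):

```
data_distribution : gvector    ! or: kpoint, mixed
```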
Number of tasks per node
This is invoked by adding 'num_proc_in_smp' to the param file and controls the number of message passing interface (MPI) tasks that are placed in a shared-memory (SMP) group. This means that the 'all-to-all' communications are then done in three phases instead of one: (1) tasks within an SMP group collect their data together on a chosen 'controller' task within their group; (2) the all-to-all is done between the controller tasks; (3) the controllers all distribute the data back to the tasks in their SMP groups. For small core counts the overhead of the two extra phases makes this method slower than just doing an all-to-all; for large core counts the reduction in the all-to-all time more than compensates for the extra overhead, so it's faster. Indeed, the tests (shown in Table 8) reveal that invoking this flag fails to produce as large a speed-up as the flag 'data_distribution : gvector' (compare columns 3 and 9) on the test HPC cluster, Sunbird, reflecting the requested small core count. Generally speaking, the more cores in the G-vector group, the higher you want to set 'num_proc_in_smp' (up to the physical number of cores on a node).
Column                                 2         3         4         5         6         7         8         9
Requested data distribution + cores    None      None      Kpoints   Kpoints   Gvector   Gvector   Mixed     Mixed
in HPC submission script               5 cores   6 cores   5 cores   6 cores   5 cores   6 cores   5 cores   6 cores
Actual data distribution               kpoint    kpoint    kpoint    kpoint    Gvector   Gvector   kpoint    kpoint
                                       4-way     6-way     5-way     6-way     5-way     6-way     4-way     4-way
Memory/process (MB)                    1249      1219      1249      1219      728       698       1249      1253
Peak memory use (MB)                   1581      1561      1581      1561      839       804       1581      1585
Total time (secs)                      295       199       292       226       191       142       294       264
Overall parallel efficiency(a)         99%       96%       98%       96%       66%       71%       98%       96%
Relative total energy
(# cores x total time, core-seconds)   1475      1194      1460      1356      955       852       1470      1584
Table 7: Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, showing the effects of data distribution across different numbers of cores requested in the script file. 'Actual data distribution' means that reported by CASTEP on completion, in this and (where applicable) all following Tables. 'Relative total energy' assumes that each core requested by the script consumes X amount of electricity. (a) Calculated automatically by CASTEP.
Column                        2         3         4         5         6         7         8         9
num_proc_in_smp               Default   Default   2         2         4         4         5         5
Requested data_distribution   None      Gvector   None      Gvector   None      Gvector   None      Gvector
Actual data distribution      kpoint    Gvector   kpoint    Gvector   kpoint    Gvector   kpoint    Gvector
                              4-way     5-way     4-way     5-way     4-way     5-way     4-way     5-way
Memory/process (MB)           1249      728       1249      728       1249      728       1249      728
Peak memory use (MB)          1580      837       1581      839       1581      844       1581      846
Total time (secs)             222       156       231       171       230       182       237       183
Overall parallel eff.(a)      96%       66%       98%       60%       98%       56%       96%       56%
Table 8: Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, run on 5 cores, showing the effects of setting 'num_proc_in_smp' to 2, 4 and 5, both with and without the 'data_distribution : gvector' flag. 'Default' means 'num_proc_in_smp' absent from the param file. (a) Calculated automatically by CASTEP.
Optimization strategy
This parameter has three settings and is invoked through the 'opt_strategy' flag in the param file:

Default - Balances speed and memory use. Wavefunction coefficients for all k-points in a calculation will be kept in memory rather than be paged to disk. Some large work arrays will be paged to disk.

Memory - Minimizes memory use. All wavefunctions and large work arrays are paged to disk.

Speed - Maximizes speed by not paging to disk.

This means that if a user runs a large-memory calculation, optimizing for memory could obviate the need to request additional cores, although the calculation will take longer; see Table 9 for comparisons.
opt_strategy                     Default   Memory   Speed
Memory/process (MB)              793       750      1249
Peak memory use (MB)             1566      1092     1581
Total time (secs)                232       290      221
Overall parallel efficiency(a)   94%       97%      96%
Table 9: Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, run on 5 cores, showing the effects of optimizing for speed or memory. 'Default' means either omitting the 'opt_strategy' flag from the param file or adding it as 'opt_strategy : default'. (a) Calculated automatically by CASTEP.
Spin polarization
If a system comprises an odd number of electrons, it might be important to differentiate between the spin-up and spin-down states of the odd electron. This directly affects the calculation time, effectively doubling it, as shown in Table 10.
spin_polarization                false   true
Memory/process (MB)              1249    1415
Peak memory use (MB)             1581    1710
Total time (secs)                222     455
Overall parallel efficiency(a)   96%     98%
Table 10: Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, run on 5 cores, showing the effects of spin polarization. (a) Calculated automatically by CASTEP.
Electronic energy minimizer
Insulating systems often behave well during the self-consistent field (SCF) minimization and converge smoothly using density mixing ('DM'). When SCF convergence is problematic and all attempts to tweak DM-related parameters have failed, it is necessary to turn to ensemble density functional theory7 (EDFT) and accept the consequent (and considerable) increase in computational cost; see Table 11.
metals_method (electron minimization)   DM     EDFT
Memory/process (MB)                     1249   1289
Peak memory use (MB)                    1581   1650
Total time (secs)                       222    370
Overall parallel efficiency(a)          96%    97%
Table 11: Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, run on 5 cores, showing the effects of the electronic minimization method. 'DM' means density mixing and 'EDFT' ensemble density functional theory. (a) Calculated automatically by CASTEP.
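Pulling the param-file settings of this section together, an energy-conscious starting point for a well-behaved insulating system might look like the sketch below. The values are those that performed well in the tests above, not universal recommendations; always convergence-test for your own system.

```
grid_scale        : 2.0
fine_grid_scale   : 3.0
data_distribution : gvector
opt_strategy      : speed
spin_polarization : false
metals_method     : dm       ! switch to edft only if density mixing fails
```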
C Script submission file
Figure 5: An example HPC batch submission script.

Figure 5 captures the script variables that affect HPC computational energy and usage efficiency:

(i) The variable familiar to most HPC users describes the number of cores ('tasks') requested for the simulation. Unless the calculation is memory-hungry, configure the requested number of cores to sit on the fewest nodes, because this reduces expensive node-to-node communication time.
(ii) Choosing the shortest job run time gives the calculation a better chance of progressing through the job queue swiftly.
(iii) When not requesting use of all cores on a single node, remove the 'exclusive' flag to accelerate progress through the job queue.
(iv) Using the most recent version of the software captures the latest upgrades and bug-fixes that might otherwise slow down a calculation.
(v) Using the '--dryrun' tag provides a (very) broad estimate of the memory requirements. In one example the estimate of peak memory use was a quarter of that actually used during the simulation proper.
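Figure 5 appears as an image in the original; a representative SLURM script along the same lines is sketched below. The module name, CASTEP binary name and seedname are site-specific assumptions to adapt to your own cluster.

```
#!/bin/bash
#SBATCH --job-name=clay_spe
#SBATCH --ntasks=5             # (i) cores: fit on the fewest nodes possible
#SBATCH --nodes=1
#SBATCH --time=00:30:00        # (ii) shortest realistic run time
##SBATCH --exclusive           # (iii) leave disabled unless using whole nodes

module load castep/21.11       # (iv) most recent version available (site-specific)

# (v) optional: estimate memory first with CASTEP's dryrun mode
# castep.mpi --dryrun mymodel

srun castep.mpi mymodel        # reads mymodel.cell and mymodel.param
```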
D An (extreme) example
Clay mineral (Figure 2)                   Careful (optimised for      Careless (no optimisation
                                          energy efficiency)          for energy efficiency)
Vacuum space                              10 Å                        10 Å
Pseudopotential and cut-off energy (eV)   Ultrasoft, 370              OTFG Ultrasoft, 599
K-points                                  3                           12
grid_scale / fine_grid_scale              2 / 3                       3 / 4
num_proc_in_smp / requested
data distribution                         default / Gvector           20 / none
Actual data distribution                  5-way Gvector only          3-way Gvector, 12-way kpoint,
                                                                      3-way (Gvector) SMP
Optimization strategy                     Speed                       Default
Spin polarization                         False                       True
Electronic energy minimizer               Density mixing              EDFT
Number of cores requested                 5                           40

RESULTS
Memory/process (MB) / scratch disk (MB)   834 / 0                     1461 / 6518
Peak memory use (MB)                      1066                        9107
Total time (seconds)                      215                         45302
Overall parallel efficiency(a)            69%                         96%
Relative total energy (# cores x total
time): core-seconds / core-hours          1075 / 0.30                 1,812,080 / 503.36
kiloJoules used (approx.)                 202                         52000

Table 12: One clay mineral model (Figure 2) with vacuum space of 10 Å; single point energy calculations showing the difference between carefully optimizing for energy efficiency and carelessly running without pre-testing. (a) Calculated automatically by CASTEP.
Table 12 illustrates the combined effects of many of the model properties and parameters discussed in the previous section on the total time and overall use of computational resources. It's unlikely a user would choose the whole combination of model properties and parameters shown in the 'careless' column, but it nevertheless gives an idea of the impact a user can have on the energy consumption of their simulations. For comparison, the cheapest electric car listed in 2021 consumes 26.8 kWh per 100 miles, or 603 kJ/km, which means that the carelessly run simulation used the equivalent energy of driving this car about 86 km, whereas the efficiently run simulation 'drove' it 0.33 km.
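The car comparison is simple arithmetic; a quick check, using the 603 kJ/km figure quoted above and the kiloJoule totals from Table 12:

```python
# Convert a simulation's energy use (kJ) into km driven by the reference
# electric car (26.8 kWh per 100 miles, i.e. ~603 kJ/km, as quoted above).

CAR_KJ_PER_KM = 603

def simulation_to_car_km(energy_kj):
    """Distance the reference car could drive on the simulation's energy."""
    return energy_kj / CAR_KJ_PER_KM

print(f"careless run: {simulation_to_car_km(52000):.1f} km")  # ~86 km
print(f"careful run:  {simulation_to_car_km(202):.2f} km")    # ~0.33 km
```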
For computational scientists and modellers, applying good energy efficiency practices needs to become second nature; following an energy efficiency 'recipe' or procedure is a route to embedding this practice as a habit.
3 Developing energy efficient computing habits A recipe
1) Build a model of a system that contains only the essential ingredients that allow exploration of the scientific question. This is one of the key factors that determine the size of a model.
2) Find out how many cores per node there are on the available HPC cluster. This enables users to request the number of cores/tasks that minimizes inter-node communication during a simulation.
3) Choose the pseudopotentials to match the science. This ensures users don't use pseudopotentials that are unnecessarily computationally expensive.
4) Carry out extensive convergence testing based on the minimum accuracy required for the production run results, e.g.:
(i) Kinetic energy cut-off (depends on pseudopotential choice)
(ii) Grid scale and fine grid scale (depend on pseudopotential choice)
(iii) Size and orientation of the model, including e.g. number of bulk atoms, number of layers, size of surface vacuum space, etc.
(iv) Number of k-points
These decrease the possibility of over-convergence and its associated computational cost.
5) Spend time optimising the param file properties described in Section B using a small number of SCF cycles:
a. Data distribution: Gvector, k-points or mixed
b. Number of tasks per node
c. Optimization strategy
d. Spin polarization
e. Electronic energy (SCF) minimization method
This increases the chances of using resources efficiently, by matching the model and material requirements to the simulation parameters.
6) Optimise the script file. This increases the efficient use of HPC resources.
7) Submit the calculation and initially monitor it to check it's progressing as expected. This reduces the chances of wasting computational time due to trivial ('Friday afternoon') mistakes.
8) Carry out your own energy efficient computing tests (and send your findings to the JISCMAIL CASTEP mailing list).
9) Sit back and wait for the simulation to complete, basking in the knowledge that the simulation is running as energy efficiently(1) as a user can possibly make it.
4 What else can a user do
In addition to using the above recipe to embed energy-efficient computing habits, a user can take a number of actions to encourage the wider awareness and adoption of energy efficient computing:

a. If the HPC cluster uses SLURM, use the 'sacct' command to check the amount of energy consumed(2) (in Joules) by a job; see Figure 6.

b. If your local cluster uses a different job-scheduler, ask your local IT helpdesk if it has the facility to monitor the energy consumed by each HPC job.

c. Include the energy consumption of simulations in all forms of reports and presentations, e.g. informal talks, posters, peer reviewed journal articles, social media posts, etc. This will increase awareness of our role as environmentally aware and conscientious computational scientists and users of HPC resources.
(1) It's highly probable that users can expand on the list of model properties and parameters described within this document to further optimise energy efficient computing. (2) 'Note: Only in case of exclusive job allocation this value reflects the jobs real energy consumption' - see https://slurm.schedmd.com/sacct.html
Figure 6: Examples of information about jobs output through SLURM's 'sacct' command (plus flags). Top: list of details about several jobs run from 20/03/2021; bottom: details for a specific job ID via the 'seff <jobID>' command.
d. Include estimates of the energy consumption of simulations in applications for funding. Although not yet explicitly requested in EPSRC funding applications, there is the expectation that UKRI's 2020 commitment to Environmental Sustainability will filter down to all activities of its research councils, including funding. This will mean that funding applicants will need to demonstrate their awareness of the environmental impact of their proposed work. Become an impressive pioneer and include environmental impact through energy consumption in your next application.
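SLURM reports energy in Joules; a small helper to turn a parsable sacct line into kWh might look like the sketch below. The sample job line is invented for illustration; on a real cluster you would generate the line with something like `sacct -j <jobID> --format=JobID,Elapsed,ConsumedEnergyRaw -P` (field availability depends on the site's energy-accounting plugin).

```python
# Hedged sketch: convert the Joules reported by SLURM's sacct into kWh.
# The sample line is hypothetical; real output would come from e.g.
#   sacct -j <jobID> --format=JobID,Elapsed,ConsumedEnergyRaw -P

def joules_to_kwh(joules):
    """1 kWh = 3.6 MJ."""
    return joules / 3.6e6

sample = "1234567|00:03:35|202000"  # JobID|Elapsed|ConsumedEnergyRaw (made up)
job_id, elapsed, joules = sample.split("|")
print(f"job {job_id}: {joules_to_kwh(float(joules)):.4f} kWh")
```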
5 What are the developers doing
The compilation of this document included a chat with several of the developers of CASTEP, who are keen to help users run their software energy efficiently; they shared their plans and projects in this field.
- Parts of CASTEP have been programmed to run on GPUs, with up to a 15-fold speed-up (for non-local functionals).
- Work on a CASTEP simulator is underway that should reduce the number of CASTEP calculations required per simulation by choosing an optimal parallel domain decomposition and implementing timings for FFTs, the big parallel cost; it will also estimate compute usage. This simulator will go a long way to providing the structure needed to add energy efficiency to CASTEP and will be accessible through the '--dryrun' command. The toy code is available in Bitbucket.
- The developers recognise the need for energy consumption to be acknowledged as an additional factor to be included in the cost of computational simulations. They are planning their approach beyond the software itself, such as including energy efficient computing in their training courses.
Acknowledgements
I acknowledge the support of the Supercomputing Wales project, which is part-funded by the European Regional Development Fund (ERDF) via the Welsh Government.
Thank you to the following CASTEP developers for their invaluable input and support for this small project: Dr Phil Hasnip and Prof Matt Probert (University of York), Prof Chris Pickard (University of Cambridge), Dr Dominik Jochym (STFC), and Prof Stewart Clark (University of Durham). Thanks also to Dr Sue Thorne (STFC) and Dr Ed Bennett (Supercomputing Wales) for sharing their research engineering perspectives.
References
(1) Clark, S. J.; Segall, M. D.; Pickard, C. J.; Hasnip, P. J.; Probert, M. I. J.; Refson, K.; Payne, M. C. First Principles Methods Using CASTEP. Z. Krist. 2005, 220, 567-570.
(2) Vanderbilt, D. Soft Self-Consistent Pseudopotentials in a Generalized Eigenvalue Formalism. Phys. Rev. B 1990, 41, 7892-7895.
(3) Pickard, C. J. On-the-Fly Pseudopotential Generation in CASTEP. 2006.
(4) Refson, K.; Clark, S. J.; Tulip, P. Variational Density Functional Perturbation Theory for Dielectrics and Lattice Dynamics. Phys. Rev. B 2006, 73, 155114.
(5) Hamann, D. R.; Schlüter, M.; Chiang, C. Norm-Conserving Pseudopotentials. Phys. Rev. Lett. 1979, 43 (20), 1494-1497.
(6) BIOVIA, Dassault Systèmes. Materials Studio 2020; Dassault Systèmes: San Diego, 2019.
(7) Marzari, N.; Vanderbilt, D.; Payne, M. C. Ensemble Density Functional Theory for Ab Initio Molecular Dynamics of Metals and Finite-Temperature Insulators. Phys. Rev. Lett. 1997, 79, 1337-1340.
Figure 1 unit cell of generic silicate clay mineral comprising 41 atoms
A Cell file
Pseudopotentials
Choose the pseudopotential according to the type of simulation eg for simulations of cell
structures ultrasofts2 are often sufficient although if the pseudopotential library does not
contain an ultrasoft version for a particular element the on-the-fly-generated (OTFG)
ultrasofts3 might suffice If a user is running a spectroscopic simulation such as infrared
using density functional perturbation theory4 then norm-conserving5 or OTFG norm-
conserving3 could be the better choices The impact of pseudopotential type on the
computational cost is shown in Table 1 through the total (calculation) time
Type of pseudopotential
Ultrasoft Norm-conserving
OTFG Ultrasoft
OTFG Ultrasoft QC5 setb
OTFG Norm-conserving
Cut-off energy (eV)
370 900 598 340 925
coresa 5 5 5 5 5
Memoryprocess (MB)
666 681 2072 1007 681
Peak memory use (MB)
777 802 2785 1590 791
Total time (secs) 55 89 250 109 136
Table 1 Pseudopotential and size of planewave set required on lsquofinersquo setting of Materials Studio 20206 and an example of memory requirements and time required for a single point energy calculation using the recorded number of cores on a single node Unless otherwise stated the same cut-off energy per type of pseudopotential is implied throughout this document aUsing Sunbird (CPU 2x Intel(R) Xeon(R) Gold 6148 CPU 240GHz with 20 cores each) unless stated otherwise all calculations were performed on this HPC cluster bDesigned to be used at the same modest (340 eV) kinetic energy cut-off across the periodic table They are ideal for moderate accuracy high throughout calculations eg ab initio random structure searching (AIRSS)
K-points
Changing the number of Brillouin zone sampling points can have a dramatic effect on
computational time as shown in Table 2 Bear in mind that increasing the number of k-points
increases the memory requirements often tempting users to increase the number of cores
further increasing overall computational cost Remember though itrsquos important to use the
number of k-points that provide the level of accuracy your simulations need
Type of pseudopotential
Ultrasoft
OTFG Norm-conserving
kpoints_mp_grid 2 1 1 (1) 3 2 1 (3) 4 3 2 (12) 2 1 1 (1) 3 2 1 (3) 4 3 2 (12)
Memoryprocess (MB)
652 666 1249 630 681 1287
Peak memory use (MB)
768 777 1580 764 791 1296
Total time (secs) 32 55 222 85 136 477
Table 2 Single point energy calculations run on 5 cores using different numbers of k-points (in brackets) showing the
effects for different pseudopotentials
Vacuum space
When building a material surface it is necessary to add vacuum space to a cell (see Figure 2
for example) and this adds to the memory requirements and calculation time because the
lsquoempty spacersquo (as well as the atoms) is lsquofilledrsquo by planewaves Table 3 shows that doubling
the volume of vacuum space doubles the total calculation time (using the same number of
cores)
Vacuum space (Aring)
0 5 10 20
Memoryprocess (MB)
666 766 834 1078
Peak memory use (MB)
777 928 1066 1372
Total time (secs) 55 102 202 406
Overall parallel efficiencya
69 66 67 61
Figure 2 Vacuum space added to create clay mineral surface (to study adsorbate-surface interactions for example ndashadsorbate not included in the above)
Table 3 Single point energy calculations using ultrasoft pseudopotentials and 3 k-points run on 5 cores showing the effects of vacuum space aCalculated automatically by CASTEP
Supercell size
The size of a system is one of the more obvious choices that affects the demands on
computational resources nevertheless it is interesting to see (from Table 4) that for the
same number of kpoints doubling the number of atoms increases the memory load per
process between 35 (41 to 82 atoms) to 72 (82 to 164 atoms) and the corresponding
calculation times increase by factors 11 and 85 respectively In good practice the number
of kpoints is scaled according to the supercell size increasing the computational cost more
modestly
Supercell size ( atoms) 1 x 1 x 1 (41)
2 x 1 x 1 (82) 2 x 2 x 1 (164)
Kpoints (mp grid)
Kpoints scaled for supercells 2x1x1 and 2x2x1
3 2 1 (3) 3 2 1 (3)
2 1 1 (1)
3 2 1 (3)
2 1 1 (1)
Memoryprocess (MB) 666 897 732 1547 1315
Peak memory use (MB) 777 1175 1025 2330 2177
Total time (secs) 55 631 329 5416 1660
Overall parallel efficiencya 69 69 74 67 72 Table 4 Single point energy calculations using ultrasoft pseudo-potentials run on 5 cores showing the effects of supercells aCalculated automatically by CASTEP
Figure 3 Example of 2 x 2 x 1 supercell
Orientation of axes
This might be one of the more surprising and unexpected properties of a model that affects
computational efficiency The effect becomes significant when a system is large
disproportionately longer along one of its lengths and is misaligned with the x- y- z-axes
see Figure 4 and Table 5 for exaggerated examples of misalignment This effect is due to
the way CASTEP transforms real-space properties between real-space and reciprocal-
space it converts the 3-d fast Fourier transforms (FFT) to three 1-d FFT columns that lie
parallel to the x- y z-axes
Figure 4 Top row A capped carbon nanotube (160 atoms) and bottom row a long carbon nanotube (1000 atoms) showing
long axes aligned in the x-direction (left) z-direction (middle) skewed (right)
Orientation ( atoms)
X (160)
Z (160)
Skewed (160)
X (1000)
Z (1000)
Skewed (1000)
Cores 5 5 5 60 60 60
Memoryprocess (MB) 884 882 882 2870 2870 2870
Peak memory use (MB) 1893 1885 1838 7077 7077 7077
Total time (secs) 392 359 409 3906 3908 5232
Overall parallel efficiencya
79 84 82 78 78 75
Relative total energy ( cores total time core-seconds)
1960 1795 2045 234360 234480 313920
Table 5 Single point energy calculations of carbon nanotubes shown as oriented in Fig 4 using ultrasoft pseudopotentials (280 eV cut-off energy) and 1 k-point aCalculated automatically by CASTEP
B Param file
Grid-scale
Although the ultrasofts require a smaller planewave basis set than the norm-conserving
pseudopotentials, they do need a finer electron density grid, set via 'grid_scale' and
'fine_grid_scale'. As shown in Table 6, the denser grid settings needed by the OTFG
ultrasofts (with the exception of the QC5 set) can almost double the calculation time
compared with the more planewave-hungry OTFG norm-conserving pseudopotentials, which
converge well on a less dense grid.
Type of pseudopotential       Norm-cons.  Norm-cons.  Ultrasoft  OTFG NC    OTFG NC   OTFG US   OTFG US QC5
grid_scale / fine_grid_scale  1.5 / 1.75  2.0 / 3.0   2.0 / 3.0  1.5 / 1.75 2.0 / 3.0 2.0 / 3.0 2.0 / 3.0
Memory/process (MB)           792         681         666        680        731       2072      1007
Peak memory use (MB)          803         1070        777        791        956       2785      1590
Total time (secs)             89          150         55         136        221       250       109
Table 6: Single point energy calculations run on 5 cores showing the effects of different electron density grid settings ('NC' = norm-conserving, 'US' = ultrasoft).
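As a concrete sketch, the grid settings compared in Table 6 are single keywords in the param file (values below are the ones from the table's ultrasoft columns; the right values for your own system come from convergence testing):

```
# Fragment of a CASTEP param file: electron density grid settings
grid_scale      : 2.0    # density grid, relative to the planewave grid
fine_grid_scale : 3.0    # finer grid used for ultrasoft augmentation charges
```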
Data Distribution
Parallelizing over plane wave vectors ('G-vectors'), k-points, or a mix of the two has an
impact on computational efficiency, as shown in Table 7.
The default for a param file without the keyword 'data_distribution' is to prioritize k-point
distribution across a number of cores (less than or equal to the number requested in the
submission script) that is a factor of the number of k-points; see, for example, Table 7
columns 2 and 3. Inserting 'data_distribution : kpoint' into the param file prioritizes and
optimizes the k-point distribution across the number of cores requested in the script. In the
example tested, selecting data distribution over k-points increased the calculation time over
the default of no data distribution; compare columns 3 and 5 of Table 7.
Requesting G-vector distribution has the largest impact on calculation time, and combining
this with requesting a number of cores that is also a factor of the number of k-points has the
overall largest impact on reducing calculation time; see columns 6 and 7 of Table 7.
Requesting mixed data distribution has a similar impact on calculation time as requesting
no data distribution for 5 cores, but not for 6 cores: the 'mixed' distribution used 4-way
k-point distribution rather than the 6-way distribution applied by the default (no request);
compare columns 2 and 3 with 8 and 9.
For the small clay model system, the optimal efficiency was obtained using G-vector data
distribution over 6 cores (852 core-seconds) and the least efficient choice was mixed data
distribution over 6 cores (1584 core-seconds). These results are system-specific; careful
testing is needed to tailor the settings to other systems.
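For reference, the distribution is requested with a single param-file keyword; a minimal fragment might read:

```
# Fragment of a CASTEP param file: choose the parallel data distribution
data_distribution : gvector    # alternatives: kpoint, mixed, default
```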
Number of tasks per node
This is invoked by adding 'num_proc_in_smp' to the param file, and controls the number of message passing interface (MPI) tasks that are placed in a shared-memory (SMP) group. The "all-to-all" communications are then done in three phases instead of one: (1) tasks within an SMP group collect their data together on a chosen "controller" task within their group; (2) the all-to-all is done between the controller tasks; (3) the controllers distribute the data back to the tasks in their SMP groups. For small core counts the overhead of the two extra phases makes this method slower than just doing an all-to-all; for large core counts the reduction in the all-to-all time more than
compensates for the extra overhead, so it is faster. Indeed, the tests on the test HPC cluster, Sunbird (shown in Table 8), reveal that invoking this flag fails to produce as large a speed-up as the flag 'data_distribution : gvector' alone (compare columns 3 and 9), reflecting the small core count requested. Generally speaking, the more cores in the G-vector group, the higher 'num_proc_in_smp' should be set (up to the physical number of cores on a node).
Column                       2        3        4        5        6        7        8        9
Requested data distribution  None     None     Kpoints  Kpoints  Gvector  Gvector  Mixed    Mixed
Cores in submission script   5        6        5        6        5        6        5        6
Actual data distribution     kpoint   kpoint   kpoint   kpoint   Gvector  Gvector  kpoint   kpoint
                             4-way    6-way    5-way    6-way    5-way    6-way    4-way    4-way
Memory/process (MB)          1249     1219     1249     1219     728      698      1249     1253
Peak memory use (MB)         1581     1561     1581     1561     839      804      1581     1585
Total time (secs)            295      199      292      226      191      142      294      264
Overall parallel eff.^a      99%      96%      98%      96%      66%      71%      98%      96%
Relative total energy
(# cores x total time;
core-seconds)                1475     1194     1460     1356     955      852      1470     1584
Table 7: Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, showing the effects of data distribution across different numbers of cores requested in the script file. 'Actual data distribution' means that reported by CASTEP on completion, in this and (where applicable) all following tables. 'Relative total energy' assumes that each core requested by the script consumes X amount of electricity. ^a Calculated automatically by CASTEP.
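The 'relative total energy' rows in Tables 5 and 7 are simple core-seconds products (cores requested multiplied by wall-clock time), which makes them easy to reproduce for your own runs; a minimal helper (the function name is illustrative, not part of CASTEP):

```python
def relative_energy(cores: int, total_time_s: float) -> float:
    """Relative energy proxy: cores requested x wall-clock seconds."""
    return cores * total_time_s

# Best and worst cases from Table 7:
best = relative_energy(6, 142)   # G-vector distribution over 6 cores
worst = relative_energy(6, 264)  # mixed distribution over 6 cores
print(best, worst)  # 852 1584
```

The same product underlies the core-hours figures in Table 12 (divide core-seconds by 3600).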
Column                       2       3         4       5         6       7         8       9
num_proc_in_smp              Default           2                 4                 5
Requested data_distribution  None    Gvector   None    Gvector   None    Gvector   None    Gvector
Actual data distribution     kpoint  Gvector   kpoint  Gvector   kpoint  Gvector   kpoint  Gvector
                             4-way   5-way     4-way   5-way     4-way   5-way     4-way   5-way
Memory/process (MB)          1249    728       1249    728       1249    728       1249    728
Peak memory use (MB)         1580    837       1581    839       1581    844       1581    846
Total time (secs)            222     156       231     171       230     182       237     183
Overall parallel eff.^a      96%     66%       98%     60%       98%     56%       96%     56%
Table 8: Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, run on 5 cores, showing the effects of setting 'num_proc_in_smp' to 2, 4 or 5, both with and without the 'data_distribution : gvector' flag. 'Default' means 'num_proc_in_smp' is absent from the param file. ^a Calculated automatically by CASTEP.
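A sketch of how the two flags tested in Table 8 combine in a param file (the value 2 is illustrative; on larger core counts it would be tuned up towards the cores-per-node of the cluster):

```
# Fragment of a CASTEP param file: G-vector distribution with SMP grouping
data_distribution : gvector
num_proc_in_smp   : 2    # MPI tasks per SMP group
```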
Optimization strategy
This parameter has three settings and is invoked through the 'opt_strategy' flag in the
param file:
Default - Balances speed and memory use. Wavefunction coefficients for all k-points
in a calculation will be kept in memory rather than be paged to disk; some large
work arrays will be paged to disk.
Memory - Minimizes memory use. All wavefunctions and large work arrays are paged
to disk.
Speed - Maximizes speed by not paging to disk.
This means that if a user runs a large memory calculation optimizing for memory could
obviate the need to request additional cores although the calculation will take longer - see
Table 9 for comparisons
opt_strategy                   Default  Memory  Speed
Memory/process (MB)            793      750     1249
Peak memory use (MB)           1566     1092    1581
Total time (secs)              232      290     221
Overall parallel efficiency^a  94%      97%     96%
Table 9: Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, run on 5 cores, showing the effects of optimizing for speed or memory. 'Default' means either omitting the 'opt_strategy' flag from the param file or adding it as 'opt_strategy : default'. ^a Calculated automatically by CASTEP.
Spin polarization
If a system comprises an odd number of electrons, it might be important to differentiate
between the spin-up and spin-down states of the odd electron. This directly affects the
calculation time, effectively doubling it, as shown in Table 10.
param flag: spin_polarization   false  true
Memory/process (MB)             1249   1415
Peak memory use (MB)            1581   1710
Total time (secs)               222    455
Overall parallel efficiency^a   96%    98%
Table 10: Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, run on 5 cores, showing the effects of spin polarization. ^a Calculated automatically by CASTEP.
Electronic energy minimizer
Insulating systems often behave well during the self-consistent field (SCF) minimization and
converge smoothly using density mixing ('DM'). When SCF convergence is problematic and
all attempts to tweak DM-related parameters have failed, it is necessary to turn to ensemble
density functional theory^7 and accept the consequent (and considerable) increase in
computational cost; see Table 11.
param flag: metals_method
(electron minimization)         DM     EDFT
Memory/process (MB)             1249   1289
Peak memory use (MB)            1581   1650
Total time (secs)               222    370
Overall parallel efficiency^a   96%    97%
Table 11: Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, run on 5 cores, showing the effects of the electronic minimization method. 'DM' means density mixing and 'EDFT' ensemble density functional theory. ^a Calculated automatically by CASTEP.
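Pulling the Section B settings together, a param file tuned along the lines of the 'efficient' choices tested above might contain the following. The values are those used for the clay test system in this document, not general recommendations; every entry should be convergence-tested for your own system:

```
# Illustrative CASTEP param settings for the clay test system
task              : singlepoint
cut_off_energy    : 370          # eV; matched to ultrasoft pseudopotentials
grid_scale        : 2.0
fine_grid_scale   : 3.0
data_distribution : gvector
opt_strategy      : speed        # use 'memory' for large-memory jobs
spin_polarization : false
metals_method     : dm           # density mixing; switch to edft only if DM fails
```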
C Submission script
Figure 5: An example HPC batch submission script.
Figure 5 captures the script variables that affect HPC computational energy and usage
efficiency:
(i) The variable familiar to most HPC users describes the number of cores ('tasks')
requested for the simulation. Unless the calculation is memory hungry, configure
the requested number of cores to sit on the fewest nodes, because this reduces
expensive node-to-node communication time.
(ii) Choosing the shortest job run time gives the calculation a better chance of
progressing through the job queue swiftly.
(iii) When not requesting use of all cores on a single node, remove the 'exclusive'
flag to accelerate progress through the job queue.
(iv) Using the most recent version of the software captures the latest upgrades and bug-
fixes that might otherwise slow down a calculation.
(v) Using the '--dryrun' tag provides a (very) broad estimate of the memory
requirements. In one example the estimate of peak memory use was ¼ of that
actually used during the simulation proper.
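As Figure 5 is not reproduced here, a minimal SLURM script covering points (i)-(v) might look like the following sketch. The module name and seed name are placeholders for your own cluster and job:

```
#!/bin/bash
#SBATCH --job-name=castep_test
#SBATCH --ntasks=6            # (i) match cores to the k-point/G-vector factors
#SBATCH --nodes=1             # (i) keep tasks on the fewest nodes
#SBATCH --time=00:30:00       # (ii) shortest realistic run time
# (iii) no --exclusive flag unless the whole node is genuinely needed

module load castep/21.1       # (iv) most recent installed version

castep.mpi --dryrun myseed    # (v) broad estimate of memory requirements

srun castep.mpi myseed
```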
D An (extreme) example
Clay mineral (Figure 2)                Careful optimisation      Careless: no optimisation
                                       for energy efficiency     for energy efficiency
Vacuum space (Å)                       10                        10
Pseudopotential, cut-off energy (eV)   Ultrasoft, 370            OTFG-Ultrasoft, 599
K-points                               3                         12
grid_scale, fine_grid_scale            2, 3                      3, 4
num_proc_in_smp / requested
data distribution                      default / Gvector         20 / none
Actual data distribution               5-way Gvector only        3-way Gvector, 12-way kpoint,
                                                                 3-way (Gvector) SMP
Optimization strategy                  Speed                     Default
Spin polarization                      False                     True
Electronic energy minimizer            Density mixing            EDFT
Number of cores requested              5                         40
RESULTS
Memory/process (MB) / scratch
disk (MB)                              834 / 0                   1461 / 6518
Peak memory use (MB)                   1066                      9107
Total time (seconds)                   215                       45302
Overall parallel efficiency^a          69%                       96%
Relative total energy (# cores x
total time; core-seconds / core-hours) 1075 / 0.30               1,812,080 / 503.36
Kilojoules used (approx.)              202                       52,000
Table 12: One clay mineral model (Figure 2) with vacuum space of 10 Å; single point energy calculations showing the difference between carefully optimizing for energy efficiency and carelessly running without pre-testing. ^a Calculated automatically by CASTEP.
Table 12 illustrates the combined effects, on the total time and overall use of computational
resources, of many of the model properties and parameters discussed in the previous
section. It's unlikely a user would choose the whole combination of model properties and
parameters shown in the 'careless' column, but it nevertheless gives an idea of the impact a
user can have on the energy consumption of their simulations. For comparison, the cheapest
electric car listed in 2021 consumes 26.8 kWh per 100 miles, or 603 kJ/km, which means
that the carelessly run simulation used the equivalent energy of driving this car about 86 km,
whereas the efficiently run simulation 'drove' it 0.33 km.
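The arithmetic behind the car comparison is a one-liner worth keeping to hand for your own jobs (the constant and function name are just for this illustration):

```python
CAR_KJ_PER_KM = 603  # cheapest 2021 electric car: 26.8 kWh per 100 miles

def km_equivalent(sim_kj: float) -> float:
    """Distance the car could drive on the energy a simulation consumed."""
    return sim_kj / CAR_KJ_PER_KM

print(round(km_equivalent(52000)))   # careless run: ~86 km
print(round(km_equivalent(202), 2))  # careful run: ~0.33 km
```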
For computational scientists and modellers, applying good energy efficiency practices needs
to become second nature; following an energy efficiency 'recipe' or procedure is a route to
embedding this practice as a habit.
3 Developing energy efficient computing habits A recipe
1) Build a model of a system that contains only the essential ingredients that
allow exploration of the scientific question. This is one of the key factors that
determine the size of a model.
2) Find out how many cores per node there are on the available HPC cluster. This
enables users to request the number of cores/tasks that minimizes inter-node
communication during a simulation.
3) Choose the pseudopotentials to match the science. This ensures users don't use
pseudopotentials that are unnecessarily computationally expensive.
4) Carry out extensive convergence testing based on the minimum accuracy
required for the production run results, e.g.:
(i) Kinetic energy cut-off (depends on pseudopotential choice)
(ii) Grid scale and fine grid scale (depends on pseudopotential choice)
(iii) Size and orientation of model, including e.g. number of bulk atoms,
number of layers, size of surface vacuum space, etc.
(iv) Number of k-points
These decrease the possibility of over-convergence and its associated
computational cost.
5) Spend time optimising the param file properties described in Section B, using a
small number of SCF cycles:
a. Data distribution: Gvector, k-points or mixed
b. Number of tasks per node
c. Optimization strategy
d. Spin polarization
e. Electronic energy (SCF) minimization method
This increases the chances of using resources efficiently due to matching the
model and material requirements to the simulation parameters.
6) Optimise the script file. This increases the efficient use of HPC resources.
7) Submit the calculation and initially monitor it to check it's progressing as
expected. This reduces the chances of wasting computational time due to trivial
('Friday afternoon') mistakes.
8) Carry out your own energy efficient computing tests (and send your findings to
the JISCMAIL CASTEP mailing list).
9) Sit back and wait for the simulation to complete, basking in the knowledge that
the simulation is running as energy efficiently^1 as a user can possibly make it.
4 What else can a user do
In addition to using the above recipe to embed energy-efficient computing habits, a user
can take a number of actions to encourage the wider awareness and adoption of energy
efficient computing:
a. If the HPC cluster uses SLURM, use the 'sacct' command to check the
amount of energy consumed^2 (in Joules) by a job; see Figure 6.
b. If your local cluster uses a different job-scheduler, ask your local IT helpdesk if
it has the facility to monitor the energy consumed by each HPC job.
c. Include the energy consumption of simulations in all forms of reports and
presentations, e.g. informal talks, posters, peer reviewed journal articles, social
media posts, etc. This will increase awareness of our role as environmentally
aware and conscientious computational scientists and users of HPC resources.
1 It's highly probable that users can expand on the list of model properties and parameters described within this document to further optimise energy efficient computing. 2 'Note: Only in case of exclusive job allocation this value reflects the job's real energy consumption' - see https://slurm.schedmd.com/sacct.html
Figure 6: Examples of information about jobs output through SLURM's 'sacct' command (plus flags). Top: a list of details about several jobs run from 20/03/2021; bottom: details for a specific job ID via the 'seff <jobID>' command.
d. Include estimates of the energy consumption of simulations in applications for
funding. Although not yet explicitly requested in EPSRC funding applications,
there is the expectation that UKRI's 2020 commitment to Environmental
Sustainability will filter down to all activities of its research councils, including
funding. This will mean that funding applicants will need to demonstrate their
awareness of the environmental impact of their proposed work. Become an
impressive pioneer and include environmental impact through energy
consumption in your next application.
5 What are the developers doing
The compilation of this document included a chat with several of the developers of CASTEP,
who are keen to help users run their software energy efficiently; they shared their plans and
projects in this field:
Parts of CASTEP have been programmed to run on GPUs, with up to a 15-fold
speed-up (for non-local functionals).
Work on a CASTEP simulator is underway that should reduce the number of
CASTEP calculations required per simulation by choosing an optimal parallel domain
decomposition and implementing timings for FFTs (the big parallel cost); it will also
estimate compute usage. This simulator will go a long way to providing the structure
needed to add energy efficiency to CASTEP, and will be accessible through the
'--dryrun' command. The toy code is available in Bitbucket.
The developers recognise the need for energy consumption to be acknowledged as
an additional factor to be included in the cost of computational simulations. They are
planning their approach beyond the software itself, such as including energy
efficient computing in their training courses.
Acknowledgements
I acknowledge the support of the Supercomputing Wales project which is part-funded by the
European Regional Development Fund (ERDF) via Welsh Government
Thank you to the following CASTEP developers for their invaluable input and support for this
small project Dr Phil Hasnip and Prof Matt Probert (University of York) Prof Chris Pickard
(University of Cambridge) Dr Dominik Jochym (STFC) Prof Stewart Clark (University of
Durham) Thanks also to Dr Sue Thorne (STFC) and Dr Ed Bennett (Supercomputing
Wales) for sharing their research engineering perspectives
References
(1) Clark, S. J.; Segall, M. D.; Pickard, C. J.; Hasnip, P. J.; Probert, M. I. J.; Refson, K.; Payne, M. C. First Principles Methods Using CASTEP. Z. Krist. 2005, 220, 567-570.
(2) Vanderbilt, D. Soft Self-Consistent Pseudopotentials in a Generalized Eigenvalue Formalism. Phys. Rev. B 1990, 41, 7892-7895.
(3) Pickard, C. J. On-the-Fly Pseudopotential Generation in CASTEP. 2006.
(4) Refson, K.; Clark, S. J.; Tulip, P. Variational Density Functional Perturbation Theory for Dielectrics and Lattice Dynamics. Phys. Rev. B 2006, 73, 155114.
(5) Hamann, D. R.; Schlüter, M.; Chiang, C. Norm-Conserving Pseudopotentials. Phys. Rev. Lett. 1979, 43 (20), 1494-1497.
(6) BIOVIA, Dassault Systèmes. Materials Studio 2020; Dassault Systèmes: San Diego, 2019.
(7) Marzari, N.; Vanderbilt, D.; Payne, M. C. Ensemble Density Functional Theory for Ab Initio Molecular Dynamics of Metals and Finite-Temperature Insulators. Phys. Rev. Lett. 1997, 79, 1337-1340.
embedding this practice as a habit
3 Developing energy efficient computing habits A recipe
1) Build a model of a system that contains only the essential ingredients that
allows exploration of the scientific question This is one of the key factors that
determines the size of a model
2) Find out how many cores per node there are on the available HPC cluster This
enables users to request the number of corestasks that minimizes inter-node
communication during a simulation
3) Choose the pseudopotentials to match the science This ensures users donrsquot use
pseudopotentials that are unnecessarily computationally expensive
4) Carry out extensive convergence testing based on the minimum accuracy
required for the production run results eg
(i) Kinetic energy cut-off (depends on pseudopotential choice)
(ii) Grid scale and fine grid scale (depends on pseudopotential choice)
(iii) Size and orientation of model including eg number of bulk atoms
number of layers size of surface vacuum space etc
(iv) Number of k-points
These decrease the possibility of over-convergence and its associated
computational cost
5) Spend time optimising the param file properties described in Section B using a
small number of SCF cycles
a Data distribution Gvector k-points or mixed
b Number of tasks per node
c Optimization strategy
d Spin polarization
e Electronic energy (SCF) minimization method
This increases the chances of using resources efficiently due to matching the
model and material requirements to the simulation parameters
6) Optimise the script file This increases the efficient use of HPC resources
7) Submit the calculation and initially monitor it to check itrsquos progressing as
expected This reduces the chances of wasting computational time due to trivial
(lsquoFriday afternoonrsquo) mistakes
8) Carry out your own energy efficient computing tests (and send your findings to
the JISCMAIL CASTEP mailing list)
9) Sit back and wait for the simulation to complete basking in the knowledge that
the simulation is running as energy efficiently1 as a user can possibly make it
4 What else can a user do
In addition to using the above recipe to embed energy-efficient computing habits a user
can take a number of actions to encourage the wider awareness and adoption of energy
efficient computing
a If the HPC cluster uses SLURM use the lsquosacctrsquo command to check the
amount of energy consumed2 (in Joules) by a job -see Figure 6
b If your local cluster uses a different job-scheduler ask your local IT helpdesk if
it has the facility to monitor the energy consumed by each HPC job
c Include the energy consumption of simulations in all forms of reports and
presentations eg informal talks posters peer reviewed journal articles social
media posts etc This will increase awareness of our role as environmentally
aware and conscientious computational scientists and users of HPC resources
1 Itrsquos highly probable that users can expand on the list of model properties and parameters described within this document to further optimise energy efficient computing 2 lsquoNote Only in case of exclusive job allocation this value reflects the jobs real energy consumptionrsquo - see httpsslurmschedmdcomsaccthtml
Figure 6 Examples of information about jobs output through SLURMrsquos lsquosacctrsquo command (plus flags) Top list of details about several jobs run from 20032021 bottom details for a specific job ID via the lsquoseff ltjobIDgtrsquo command
d Include estimates of the energy consumption of simulations in applications for
funding Although not yet explicitly requested in EPSRC funding applications
there is the expectation that UKRIrsquos 2020 commitment to Environmental
Sustainability will filter down to all activities of its research councils including
funding This will mean that funding applicants will need to demonstrate their
awareness of the environmental impact of their proposed work Become an
impressive pioneer and include environmental impact through energy
consumption in your next application
5 What are the developers doing
The compilation of this document included a chat with several of the developers of CASTEP
who are keen to help users run their software energy efficiently they shared their plans and
projects in this field
Parts of CASTEP have been programmed to run on GPUs with up to a 15-fold
speed-up (for non-local functionals)
Work on a CASTEP simulator is underway that should reduce the number of
CASTEP calculations required per simulation by choosing an optimal parallel domain
decomposition and implementing timings for FFTs ndash the big parallel cost also it will
estimate compute usage This simulator will go a long way to providing the structure
needed to add energy efficiency to CASTEP and will be accessible through the
rsquo- -dryrunrsquo command The toy code is available in bitbucket
The developers recognise the need for energy consumption to be acknowledged as
an additional factor to be included in the cost of computational simulations They will
be planning their approach beyond the software itself such as including energy
efficient computing in their training courses
Acknowledgements
I acknowledge the support of the Supercomputing Wales project which is part-funded by the
European Regional Development Fund (ERDF) via Welsh Government
Thank you to the following CASTEP developers for their invaluable input and support for this
small project Dr Phil Hasnip and Prof Matt Probert (University of York) Prof Chris Pickard
(University of Cambridge) Dr Dominik Jochym (STFC) Prof Stewart Clark (University of
Durham) Thanks also to Dr Sue Thorne (STFC) and Dr Ed Bennett (Supercomputing
Wales) for sharing their research engineering perspectives
References
(1) Clark S J Segall M D Pickard C J Hasnip P J Probert M I J Refson K Payne M C First Principles Methods Using CASTEP Z Krist 2005 220 567ndash570
(2) Vanderbilt D Soft Self-Consistent Pseudopotentials in a Generalized Eigenvalue Formalism Phys Rev B 1990 41 7892ndash7895
(3) Pickard C J On-the-Fly Pseudopotential Generation in CASTEP 2006 (4) Refson K Clark S J Tulip P Variational Density Functional Perturbation Theory for
Dielectrics and Lattice Dynamics Phys Rev B 2006 73 155114 (5) Hamann D R Schluumlter M Chiang C Norm-Conserving Pseudopotentials Phys
Rev Lett 1979 43 (20) 1494ndash1497 (6) BIOVIA Dassault Systegravemes Materials Studio 2020 Dassault Systegravemes San Diego
2019 (7) Marzari N Vanderbilt D Payne M C Ensemble Density Functional Theory for Ab
Initio Molecular Dynamics of Metals and Finite-Temperature Insulators Phys Rev Lett 1997 79 1337ndash1340
Supercell size
The size of a system is one of the more obvious choices affecting the demands on
computational resources. Nevertheless, it is interesting to see (from Table 4) that, for the
same number of k-points, doubling the number of atoms increases the memory load per
process by between 35% (41 to 82 atoms) and 72% (82 to 164 atoms), while the corresponding
calculation times increase by factors of roughly 11 and 8.5 respectively. In good practice the
number of k-points is scaled down with the supercell size, increasing the computational cost
more modestly.
Supercell size (# atoms):        1x1x1 (41)   2x1x1 (82)   2x1x1 (82)   2x2x1 (164)  2x2x1 (164)
K-points, MP grid (# k-points):  3 2 1 (3)    3 2 1 (3)    2 1 1 (1)    3 2 1 (3)    2 1 1 (1)
                                              unscaled     scaled       unscaled     scaled
Memory/process (MB):             666          897          732          1547         1315
Peak memory use (MB):            777          1175         1025         2330         2177
Total time (secs):               55           631          329          5416         1660
Overall parallel efficiencya:    69           69           74           67           72
Table 4 Single point energy calculations using ultrasoft pseudopotentials run on 5 cores showing the effects of supercells, with and without scaling the k-point grid to the supercell. aCalculated automatically by CASTEP
Figure 3 Example of 2 x 2 x 1 supercell
Orientation of axes
This might be one of the more surprising and unexpected properties of a model that affects
computational efficiency. The effect becomes significant when a system is large,
disproportionately long along one of its axes, and misaligned with the x-, y-, z-axes;
see Figure 4 and Table 5 for exaggerated examples of misalignment. The effect is due to
the way CASTEP transforms properties between real space and reciprocal space: it
decomposes the 3-d fast Fourier transform (FFT) into three sets of 1-d FFTs along columns
that lie parallel to the x-, y- and z-axes.
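This axis-by-axis decomposition is a general property of the FFT, not something unique to CASTEP; a minimal NumPy sketch (array size and seed are arbitrary) shows that a 3-d FFT is exactly equivalent to 1-d FFTs applied along each axis in turn:

```python
import numpy as np

# A 3-D FFT equals successive 1-D FFTs along the x-, y- and z-columns,
# which is why column alignment with the axes matters for efficiency.
rng = np.random.default_rng(0)
a = rng.standard_normal((8, 8, 8))

full_3d = np.fft.fftn(a)           # one 3-D transform

step = np.fft.fft(a, axis=0)       # 1-D FFTs along x-columns...
step = np.fft.fft(step, axis=1)    # ...then y-columns...
step = np.fft.fft(step, axis=2)    # ...then z-columns

print(np.allclose(full_3d, step))  # True
```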
Figure 4 Top row A capped carbon nanotube (160 atoms) and bottom row a long carbon nanotube (1000 atoms) showing
long axes aligned in the x-direction (left) z-direction (middle) skewed (right)
Orientation (# atoms):           X (160)   Z (160)   Skewed (160)   X (1000)   Z (1000)   Skewed (1000)
Cores:                           5         5         5              60         60         60
Memory/process (MB):             884       882       882            2870       2870       2870
Peak memory use (MB):            1893      1885      1838           7077       7077       7077
Total time (secs):               392       359       409            3906       3908       5232
Overall parallel efficiencya:    79        84        82             78         78         75
Relative total energy
(# cores x total time, core-secs): 1960    1795      2045           234360     234480     313920
Table 5 Single point energy calculations of carbon nanotubes shown as oriented in Fig 4, using ultrasoft pseudopotentials (280 eV cut-off energy) and 1 k-point. aCalculated automatically by CASTEP
B Param file
Grid-scale
Although the ultrasoft pseudopotentials require a smaller planewave basis set than the
norm-conserving ones, they do need a finer electron density grid, set via 'grid_scale' and
'fine_grid_scale'. As shown in Table 6, the denser grid setting needed by the OTFG ultrasofts
(with the exception of the QC5 set) can almost double the calculation time compared with the
more planewave-hungry OTFG norm-conserving pseudopotentials, which converge well on a less
dense grid.
Type of pseudopotential     grid_scale / fine_grid_scale   Memory/process (MB)   Peak memory use (MB)   Total time (secs)
Norm-conserving             1.5 / 1.75                     792                   803                    89
Norm-conserving             2.0 / 3.0                      681                   1070                   150
Ultrasoft                   2.0 / 3.0                      666                   777                    55
OTFG Norm-conserving        1.5 / 1.75                     680                   791                    136
OTFG Norm-conserving        2.0 / 3.0                      731                   956                    221
OTFG Ultrasoft              2.0 / 3.0                      2072                  2785                   250
OTFG Ultrasoft (QC5 set)    2.0 / 3.0                      1007                  1590                   109
Table 6 Single point energy calculations run on 5 cores showing the effects of different electron density grid settings
Data distribution
Parallelizing over plane wave vectors ('G-vectors'), k-points, or a mix of the two has an
impact on computational efficiency, as shown in Table 7.
The default for a param file without the keyword 'data_distribution' is to prioritize k-point
distribution across a number of cores (less than or equal to the number requested in the
submission script) that is a factor of the number of k-points; see for example Table 7,
columns 2 and 3. Inserting 'data_distribution : kpoint' into the param file prioritizes and
optimizes the k-point distribution across the number of cores requested in the script. In the
example tested, selecting data distribution over k-points increased the calculation time over
the default of no data distribution; compare columns 3 and 5 of Table 7.
Requesting G-vector distribution has the largest impact on calculation time, and combining
this with requesting a number of cores that is also a factor of the number of k-points has the
overall largest impact on reducing calculation time; see columns 6 and 7 of Table 7.
Requesting mixed data distribution has a similar impact on calculation time as not requesting
any data distribution for 5 cores, but not for 6 cores: the 'mixed' distribution used 4-way
k-point distribution rather than the 6-way distribution applied by the default (no request);
compare columns 2 and 3 with 8 and 9.
For the small clay model system the optimal efficiency was obtained using G-vector data
distribution over 6 cores (852 core-seconds) and the least efficient choice was mixed data
distribution over 6 cores (1584 core-seconds). The results are system-specific and need
careful testing to tailor to different systems.
Number of tasks per node
This is invoked by adding 'num_proc_in_smp' to the param file and controls the number of
message passing interface (MPI) tasks that are placed in a shared-memory (SMP) group. The
"all-to-all" communications are then done in three phases instead of one:
(1) tasks within an SMP group collect their data together on a chosen "controller" task
within their group; (2) the "all-to-all" is done between the controller tasks only;
(3) the controllers distribute the data back to the tasks in their SMP groups.
For small core counts the overhead of the two extra phases makes this method slower than
just doing an all-to-all; for large core counts the reduction in the all-to-all time more than
compensates for the extra overhead, so it is faster. Indeed, the tests (shown in Table 8)
reveal that invoking this flag fails to produce as large a speed-up as the flag
'data_distribution : gvector' (compare columns 3 and 9) on the test HPC cluster, Sunbird,
reflecting the small core count requested. Generally speaking, the more cores in the G-vector
group, the higher 'num_proc_in_smp' should be set (up to the physical number of cores on a node).
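The three phases can be illustrated with a toy Python sketch (simplified here to an all-gather, with plain lists standing in for MPI buffers; no real MPI is involved):

```python
# Toy model of the three-phase communication described above: tasks are
# split into SMP groups and data funnels through one "controller" per
# group instead of every task exchanging with every other task.

def three_phase_all_to_all(groups):
    """groups: list of SMP groups, each a list of per-task data lists."""
    # (1) gather within each group onto its controller task
    controllers = [sum(group, []) for group in groups]
    # (2) exchange between the controller tasks only
    exchanged = sum(controllers, [])
    # (3) controllers distribute the combined data back to their group
    return [[exchanged for _ in group] for group in groups]

# Two SMP groups of two tasks each, one data item per task
groups = [[["a1"], ["a2"]], [["b1"], ["b2"]]]
result = three_phase_all_to_all(groups)
print(result[0][0])  # every task ends with ['a1', 'a2', 'b1', 'b2']
```

With G groups the controller exchange involves G participants instead of the full task count, which is where the saving at large core counts comes from.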
Column:                             2          3          4          5          6          7          8          9
Requested distribution (cores):     None (5)   None (6)   Kpoint (5) Kpoint (6) Gvector (5) Gvector (6) Mixed (5) Mixed (6)
Actual data distribution:           kpoint     kpoint     kpoint     kpoint     Gvector    Gvector    kpoint     kpoint
                                    4-way      6-way      5-way      6-way      5-way      6-way      4-way      4-way
Memory/process (MB):                1249       1219       1249       1219       728        698        1249       1253
Peak memory use (MB):               1581       1561       1581       1561       839        804        1581       1585
Total time (secs):                  295        199        292        226        191        142        294        264
Overall parallel efficiencya:       99         96         98         96         66         71         98         96
Relative total energy
(# cores x total time, core-secs):  1475       1194       1460       1356       955        852        1470       1584
Table 7 Single point energy calculations using ultrasoft pseudopotentials and 12 k-points showing the effects of data distribution across different numbers of cores requested in the script file. 'Actual data distribution' means that reported by CASTEP on completion, in this and (where applicable) all following Tables. 'Relative total energy' assumes that each core requested by the script consumes X amount of electricity. aCalculated automatically by CASTEP
Column:                        2         3         4         5         6         7         8         9
num_proc_in_smp:               Default   Default   2         2         4         4         5         5
Requested data_distribution:   None      Gvector   None      Gvector   None      Gvector   None      Gvector
Actual data distribution:      kpoint    Gvector   kpoint    Gvector   kpoint    Gvector   kpoint    Gvector
                               4-way     5-way     4-way     5-way     4-way     5-way     4-way     5-way
Memory/process (MB):           1249      728       1249      728       1249      728       1249      728
Peak memory use (MB):          1580      837       1581      839       1581      844       1581      846
Total time (secs):             222       156       231       171       230       182       237       183
Overall parallel efficiencya:  96        66        98        60        98        56        96        56
Table 8 Single point energy calculations using ultrasoft pseudopotentials and 12 k-points run on 5 cores showing the effects of setting 'num_proc_in_smp' to 2, 4 or 5, both with and without the 'data_distribution : gvector' flag. 'Default' means 'num_proc_in_smp' absent from the param file. aCalculated automatically by CASTEP
Optimization strategy
This parameter has three settings and is invoked through the 'opt_strategy' flag in the
param file:
Default - Balances speed and memory use. Wavefunction coefficients for all k-points
in a calculation will be kept in memory rather than be paged to disk; some large
work arrays will be paged to disk.
Memory - Minimizes memory use. All wavefunctions and large work arrays are paged
to disk.
Speed - Maximizes speed by not paging to disk.
This means that if a user runs a large-memory calculation, optimizing for memory could
obviate the need to request additional cores, although the calculation will take longer; see
Table 9 for comparisons.
opt_strategy:                  Default   Memory   Speed
Memory/process (MB):           793       750      1249
Peak memory use (MB):          1566      1092     1581
Total time (secs):             232       290      221
Overall parallel efficiencya:  94        97       96
Table 9 Single point energy calculations using ultrasoft pseudopotentials and 12 k-points run on 5 cores showing the effects of optimizing for speed or memory. 'Default' means either omitting the 'opt_strategy' flag from the param file or adding it as 'opt_strategy : default'. aCalculated automatically by CASTEP
Spin polarization
If a system comprises an odd number of electrons, it might be important to differentiate
between the spin-up and spin-down states of the unpaired electron. This directly affects the
calculation time, effectively doubling it, as shown in Table 10.
param flag and setting
spin_polarization:             false   true
Memory/process (MB):           1249    1415
Peak memory use (MB):          1581    1710
Total time (secs):             222     455
Overall parallel efficiencya:  96      98
Table 10 Single point energy calculations using ultrasoft pseudopotentials and 12 k-points run on 5 cores showing the effects of spin polarization. aCalculated automatically by CASTEP
Electronic energy minimizer
Insulating systems often behave well during the self-consistent field (SCF) minimization and
converge smoothly using density mixing ('DM'). When SCF convergence is problematic and
all attempts to tweak DM-related parameters have failed, it is necessary to turn to ensemble
density functional theory7 ('EDFT') and accept the consequent (and considerable) increase in
computational cost; see Table 11.
param flag and setting
metals_method (electron minimization):  DM     EDFT
Memory/process (MB):                    1249   1289
Peak memory use (MB):                   1581   1650
Total time (secs):                      222    370
Overall parallel efficiencya:           96     97
Table 11 Single point energy calculations using ultrasoft pseudopotentials and 12 k-points run on 5 cores showing the effects of the electronic minimization method. 'DM' means density mixing and 'EDFT' ensemble density functional theory.
aCalculated automatically by CASTEP
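Pulling the Section B flags together, a param file exercising these settings might contain something like the fragment below (keyword spellings follow the tables above; the values simply mirror the 'efficient' choices tested here, not universal recommendations):

```
! Illustrative .param fragment - values from the tests in Section B
task               : SinglePoint
cut_off_energy     : 370 eV
grid_scale         : 2.0
fine_grid_scale    : 3.0
data_distribution  : gvector
opt_strategy       : speed
spin_polarization  : false
metals_method      : dm
```

The same file is where 'num_proc_in_smp' would be added, should testing on your cluster show it helps at your core count.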
C Script submission file
Figure 5 An example HPC batch submission script
Figure 5 captures the script variables that affect HPC computational energy and usage
efficiency:
(i) The variable familiar to most HPC users describes the number of cores ('tasks')
requested for the simulation. Unless the calculation is memory hungry, configure
the requested number of cores to sit on the fewest nodes, because this reduces
expensive node-to-node communication time.
(ii) Choosing the shortest job run time gives the calculation a better chance of
progressing through the job queue swiftly.
(iii) When not requesting use of all cores on a single node, remove the 'exclusive'
flag to accelerate progress through the job queue.
(iv) Using the most recent version of the software captures the latest upgrades and bug-
fixes that might otherwise slow down a calculation.
(v) Using the 'dryrun' tag provides a (very) broad estimate of the memory
requirements. In one example the estimate of peak memory use was 1/4 of that
actually used during the simulation proper.
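Since Figure 5 is an image, a minimal SLURM script of the kind it depicts is sketched below; the partition, account, module version and seed name are placeholders for your own cluster's values:

```shell
#!/bin/bash --login
# Illustrative SLURM submission script only - names are placeholders.
#SBATCH --job-name=clay_spe
#SBATCH --ntasks=5                # (i) modest core count, placed on...
#SBATCH --nodes=1                 # ...the fewest nodes possible
#SBATCH --time=00:30:00           # (ii) shortest realistic run time
##SBATCH --exclusive              # (iii) leave commented out unless filling a node
#SBATCH --partition=compute
#SBATCH --account=my_project

module load castep/21.11          # (iv) most recent installed version

# (v) cheap estimate of memory requirements before the real run:
#     castep.mpi --dryrun clay
mpirun -n ${SLURM_NTASKS} castep.mpi clay
```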
D An (extreme) example
Clay mineral (Figure 2)            Careful: optimised for        Careless: no optimisation for
                                   energy efficiency             energy efficiency
Vacuum space                       10 Å                          10 Å
Pseudopotential, cut-off (eV)      Ultrasoft, 370                OTFG-Ultrasoft, 599
K-points                           3                             12
grid_scale / fine_grid_scale       2 / 3                         3 / 4
num_proc_in_smp / requested
  data distribution                default / Gvector             20 / none
Actual data distribution           5-way Gvector only            3-way Gvector, 12-way kpoint,
                                                                 3-way (Gvector) smp
Optimization strategy              Speed                         Default
Spin polarization                  False                         True
Electronic energy minimizer        Density mixing                EDFT
Number of cores requested          5                             40
RESULTS
Memory/process (MB)                834                           1461
Scratch disk (MB)                  0                             6518
Peak memory use (MB)               1066                          9107
Total time (seconds)               215                           45302
Overall parallel efficiencya       69                            96
Relative total energy
(# cores x total time):
  core-seconds                     1075                          1812080
  core-hours                       0.30                          503.36
kiloJoules used (approx.)          202                           52000
Table 12 One clay mineral model (Figure 2) with vacuum spaces of 10 Å. Single point energy calculations showing the difference between carefully optimizing for energy efficiency and carelessly running without pre-testing. aCalculated automatically by CASTEP
Table 12 illustrates the combined effects, on the total time and overall use of computational
resources, of many of the model properties and parameters discussed in the previous
sections. It is unlikely a user would choose the whole combination of model properties and
parameters shown in the 'careless' column, but it nevertheless gives an idea of the impact a
user can have on the energy consumption of their simulations. For comparison, the cheapest
electric car listed in 2021 consumes 26.8 kWh per 100 miles, or 603 kJ/km, which means
that the carelessly run simulation used the equivalent energy of driving this car about 86 km,
whereas the efficiently run simulation 'drove' it 0.33 km.
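The arithmetic behind the comparison can be sketched in a few lines (the kJ figures and the 603 kJ/km car figure come from the text above):

```python
# Back-of-envelope check of the Table 12 comparison.
CAR_KJ_PER_KM = 603  # cheapest 2021 EV: 26.8 kWh per 100 miles

def core_seconds(cores, seconds):
    """Total compute consumed: cores requested x wall-clock seconds."""
    return cores * seconds

careful = core_seconds(5, 215)         # 1075 core-seconds
careless = core_seconds(40, 45302)     # 1812080 core-seconds

print(round(careless / 3600, 2))       # 503.36 core-hours for the careless run
print(round(202 / CAR_KJ_PER_KM, 2))   # careful run: ~0.33 km of driving
print(round(52000 / CAR_KJ_PER_KM))    # careless run: ~86 km
```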
For computational scientists and modellers, applying good energy efficiency practices needs
to become second nature; following an energy efficiency 'recipe' or procedure is a route to
embedding this practice as a habit.
3 Developing energy efficient computing habits A recipe
1) Build a model of a system that contains only the essential ingredients that
allow exploration of the scientific question. This is one of the key factors that
determine the size of a model.
2) Find out how many cores per node there are on the available HPC cluster This
enables users to request the number of corestasks that minimizes inter-node
communication during a simulation
3) Choose the pseudopotentials to match the science. This ensures users don't use
pseudopotentials that are unnecessarily computationally expensive.
4) Carry out extensive convergence testing based on the minimum accuracy
required for the production run results eg
(i) Kinetic energy cut-off (depends on pseudopotential choice)
(ii) Grid scale and fine grid scale (depends on pseudopotential choice)
(iii) Size and orientation of model including eg number of bulk atoms
number of layers size of surface vacuum space etc
(iv) Number of k-points
These decrease the possibility of over-convergence and its associated
computational cost
5) Spend time optimising the param file properties described in Section B using a
small number of SCF cycles
a Data distribution Gvector k-points or mixed
b Number of tasks per node
c Optimization strategy
d Spin polarization
e Electronic energy (SCF) minimization method
This increases the chances of using resources efficiently due to matching the
model and material requirements to the simulation parameters
6) Optimise the script file This increases the efficient use of HPC resources
7) Submit the calculation and initially monitor it to check it's progressing as
expected. This reduces the chances of wasting computational time due to trivial
('Friday afternoon') mistakes.
8) Carry out your own energy efficient computing tests (and send your findings to
the JISCMAIL CASTEP mailing list)
9) Sit back and wait for the simulation to complete, basking in the knowledge that
the simulation is running as energy efficiently1 as a user can possibly make it.
4 What else can a user do
In addition to using the above recipe to embed energy-efficient computing habits a user
can take a number of actions to encourage the wider awareness and adoption of energy
efficient computing
a If the HPC cluster uses SLURM, use the 'sacct' command to check the
amount of energy consumed2 (in Joules) by a job; see Figure 6.
b If your local cluster uses a different job-scheduler ask your local IT helpdesk if
it has the facility to monitor the energy consumed by each HPC job
c Include the energy consumption of simulations in all forms of reports and
presentations eg informal talks posters peer reviewed journal articles social
media posts etc This will increase awareness of our role as environmentally
aware and conscientious computational scientists and users of HPC resources
1 It's highly probable that users can expand on the list of model properties and parameters described within this document to further optimise energy efficient computing.
2 'Note: Only in case of exclusive job allocation this value reflects the job's real energy consumption' - see https://slurm.schedmd.com/sacct.html
Figure 6 Examples of information about jobs output through SLURM's 'sacct' command (plus flags). Top: details of several jobs run from 20/03/2021; bottom: details for a specific job ID via the 'seff <jobID>' command.
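For reference, invocations of the kind shown in Figure 6 are sketched below; the job ID is a placeholder, and the ConsumedEnergy field is only populated where the cluster's accounting plugin records energy:

```shell
# Energy (Joules) and timing for one job - job ID is illustrative
sacct -j 1234567 --format=JobID,JobName,Elapsed,NNodes,ConsumedEnergy

# Jobs since a given date (cf. Figure 6, top)
sacct --starttime=2021-03-20 --format=JobID,JobName,Elapsed,State,ConsumedEnergy

# Per-job efficiency summary, where seff is installed (cf. Figure 6, bottom)
seff 1234567
```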
d Include estimates of the energy consumption of simulations in applications for
funding. Although not yet explicitly requested in EPSRC funding applications,
there is the expectation that UKRI's 2020 commitment to Environmental
Sustainability will filter down to all activities of its research councils, including
funding. This will mean that funding applicants will need to demonstrate their
awareness of the environmental impact of their proposed work. Become an
impressive pioneer and include environmental impact through energy
consumption in your next application.
5 What are the developers doing
The compilation of this document included a chat with several of the developers of CASTEP,
who are keen to help users run their software energy efficiently; they shared their plans and
projects in this field:
Parts of CASTEP have been programmed to run on GPUs, with up to a 15-fold
speed-up (for non-local functionals).
Work on a CASTEP simulator is underway that should reduce the number of
CASTEP calculations required per simulation by choosing an optimal parallel domain
decomposition and implementing timings for FFTs (the big parallel cost); it will also
estimate compute usage. This simulator will go a long way to providing the structure
needed to add energy efficiency to CASTEP and will be accessible through the
'--dryrun' command. The toy code is available on Bitbucket.
The developers recognise the need for energy consumption to be acknowledged as
an additional factor to be included in the cost of computational simulations They will
be planning their approach beyond the software itself such as including energy
efficient computing in their training courses
Acknowledgements
I acknowledge the support of the Supercomputing Wales project which is part-funded by the
European Regional Development Fund (ERDF) via Welsh Government
Thank you to the following CASTEP developers for their invaluable input and support for this
small project Dr Phil Hasnip and Prof Matt Probert (University of York) Prof Chris Pickard
(University of Cambridge) Dr Dominik Jochym (STFC) Prof Stewart Clark (University of
Durham) Thanks also to Dr Sue Thorne (STFC) and Dr Ed Bennett (Supercomputing
Wales) for sharing their research engineering perspectives
References
(1) Clark, S. J.; Segall, M. D.; Pickard, C. J.; Hasnip, P. J.; Probert, M. I. J.; Refson, K.; Payne, M. C. First Principles Methods Using CASTEP. Z. Krist. 2005, 220, 567-570.
(2) Vanderbilt, D. Soft Self-Consistent Pseudopotentials in a Generalized Eigenvalue Formalism. Phys. Rev. B 1990, 41, 7892-7895.
(3) Pickard, C. J. On-the-Fly Pseudopotential Generation in CASTEP. 2006.
(4) Refson, K.; Clark, S. J.; Tulip, P. Variational Density Functional Perturbation Theory for Dielectrics and Lattice Dynamics. Phys. Rev. B 2006, 73, 155114.
(5) Hamann, D. R.; Schlüter, M.; Chiang, C. Norm-Conserving Pseudopotentials. Phys. Rev. Lett. 1979, 43 (20), 1494-1497.
(6) BIOVIA, Dassault Systèmes. Materials Studio 2020; Dassault Systèmes: San Diego, 2019.
(7) Marzari, N.; Vanderbilt, D.; Payne, M. C. Ensemble Density Functional Theory for Ab Initio Molecular Dynamics of Metals and Finite-Temperature Insulators. Phys. Rev. Lett. 1997, 79, 1337-1340.
Figure 4 Top row A capped carbon nanotube (160 atoms) and bottom row a long carbon nanotube (1000 atoms) showing
long axes aligned in the x-direction (left) z-direction (middle) skewed (right)
Orientation ( atoms)
X (160)
Z (160)
Skewed (160)
X (1000)
Z (1000)
Skewed (1000)
Cores 5 5 5 60 60 60
Memoryprocess (MB) 884 882 882 2870 2870 2870
Peak memory use (MB) 1893 1885 1838 7077 7077 7077
Total time (secs) 392 359 409 3906 3908 5232
Overall parallel efficiencya
79 84 82 78 78 75
Relative total energy ( cores total time core-seconds)
1960 1795 2045 234360 234480 313920
Table 5 Single point energy calculations of carbon nanotubes shown as oriented in Fig 4 using ultrasoft pseudopotentials (280 eV cut-off energy) and 1 k-point aCalculated automatically by CASTEP
B Param file
Grid-scale
Although the ultrasofts require a smaller size of planewave basis set than the norm-
conserving they do need a finer electron density grid scale in the settings lsquogrid_scalersquo and
lsquofine_grid_scalersquo As shown in Table 6 the denser grid scale setting for the OTFG ultrasofts
(with the exception of the QC5 set) can almost double the calculation time over the larger
planewave hungry OTFG norm-conserving pseudopotentials that converge well under a less
dense grid
Type of pseudopotential
Norm-conserving
Ultrasoft OTFG Norm-conserving
OTFG Ultrasoft
OTFG Ultrasoft QC5 set
grid_scale fine_grid_scale
15175 2030 2030 15175 2030 2030 2030
Memoryprocess (MB)
792 681 666 680 731 2072 1007
Peak memory use (MB)
803 1070 777 791 956 2785 1590
Total time (secs) 89 150 55 136 221 250 109
Table 6 Single point energy calculations run on 5 cores showing the effects of different electron density grid settings
Data Distribution
Parallelizing over plane wave vectors (lsquoG-vectorsrsquo) k-points or a mix of the two has an
impact on computational efficiency as shown in Table 7
The default for a param file without the keyword lsquodata_distributionrsquo is to prioritize k-point
distribution across a number of cores (less than or equal to the number requested in the
submission script) that is a factor of the number of k-points see for example Table 7
columns 2 and 3 Inserting lsquodata_distribution kpointrsquo into the param file prioritizes and
optimizes the k-point distribution across the number of cores requested in the script In the
example tested selecting data distribution over kpoints increased the calculation time over
the default of no data distribution compare columns 3 and 5 of Table 7
Requesting G-vector distribution has the largest impact on calculation time, and combining this with requesting a number of cores that is also a factor of the number of k-points has the overall largest impact on reducing calculation time; see columns 6 and 7 of Table 7.
Requesting mixed data distribution has a similar impact on calculation time as not requesting any data distribution for 5 cores, but not for 6 cores: the 'mixed' distribution used 4-way k-point distribution, whereas the default (no request) applied 6-way distribution; compare columns 2 and 3 with 8 and 9.
For the small clay model system, the optimal efficiency was obtained using G-vector data distribution over 6 cores (852 core-seconds), and the least efficient choice was mixed data distribution over 6 cores (1584 core-seconds). These results are system-specific, and careful testing is needed to tailor the settings to different systems.
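Since the 'relative total energy' in Table 7 is simply requested cores multiplied by wall-clock time, the comparison can be sketched in a few lines (numbers taken from Table 7; the core-seconds metric is this document's proxy for energy consumed, not a CASTEP output):

```python
# Core-seconds as a proxy for energy: requested cores x wall-clock time.
# Numbers from Table 7 (ultrasoft pseudopotentials, 12 k-points).
runs = {
    "none/5":    (5, 295),
    "none/6":    (6, 199),
    "gvector/5": (5, 191),
    "gvector/6": (6, 142),
    "mixed/6":   (6, 264),
}

core_seconds = {name: cores * secs for name, (cores, secs) in runs.items()}
best = min(core_seconds, key=core_seconds.get)   # cheapest run
worst = max(core_seconds, key=core_seconds.get)  # most expensive run

print(best, core_seconds[best])    # gvector/6 852
print(worst, core_seconds[worst])  # mixed/6 1584
```

Note that the fastest wall-clock run is not automatically the cheapest: adding cores only pays off if the speed-up outweighs the extra cores consuming power.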
Number of tasks per node
This is invoked by adding 'num_proc_in_smp' to the param file; it controls the number of message passing interface (MPI) tasks placed in a shared-memory (SMP) group. The "all-to-all" communications are then done in three phases instead of one: (1) tasks within an SMP group collect their data together on a chosen "controller" task within their group; (2) the all-to-all is done between the controller tasks; (3) the controllers distribute the data back to the tasks in their SMP groups. For small core counts the overhead of the two extra phases makes this method slower than just doing an all-to-all; for large core counts the reduction in the all-to-all time more than compensates for the extra overhead, so it is faster. Indeed, the tests (shown in Table 8) reveal that invoking this flag fails to produce as large a speed-up as the flag 'data_distribution : gvector' (compare columns 3 and 9) on the test HPC cluster, Sunbird, reflecting the small core count requested. Generally speaking, the more cores in the G-vector group, the higher 'num_proc_in_smp' should be set (up to the physical number of cores on a node).
Column   Requested distribution (cores)   Actual data distribution   Memory/process (MB)   Peak memory use (MB)   Total time (secs)   Overall parallel efficiencya (%)   Relative total energy (core-seconds)
2        None (5 cores)       kpoint 4-way    1249   1581   295   99   1475
3        None (6 cores)       kpoint 6-way    1219   1561   199   96   1194
4        Kpoints (5 cores)    kpoint 5-way    1249   1581   292   98   1460
5        Kpoints (6 cores)    kpoint 6-way    1219   1561   226   96   1356
6        Gvector (5 cores)    Gvector 5-way   728    839    191   66   955
7        Gvector (6 cores)    Gvector 6-way   698    804    142   71   852
8        Mixed (5 cores)      kpoint 4-way    1249   1581   294   98   1470
9        Mixed (6 cores)      kpoint 4-way    1253   1585   264   96   1584
Table 7 Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, showing the effects of data distribution across different numbers of cores requested in the script file. 'Actual data distribution' means that reported by CASTEP on completion, in this and (where applicable) all following tables. 'Relative total energy' (cores x total time) assumes that each core requested by the script consumes X amount of electricity. aCalculated automatically by CASTEP
Column   num_proc_in_smp   Requested data_distribution   Actual data distribution   Memory/process (MB)   Peak memory use (MB)   Total time (secs)   Overall parallel efficiencya (%)
2        Default           None                          kpoint 4-way               1249                  1580                   222                 96
3        Default           Gvector                       Gvector 5-way              728                   837                    156                 66
4        2                 None                          kpoint 4-way               1249                  1581                   231                 98
5        2                 Gvector                       Gvector 5-way              728                   839                    171                 60
6        4                 None                          kpoint 4-way               1249                  1581                   230                 98
7        4                 Gvector                       Gvector 5-way              728                   844                    182                 56
8        5                 None                          kpoint 4-way               1249                  1581                   237                 98
9        5                 Gvector                       Gvector 5-way              728                   846                    183                 56
Table 8 Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, run on 5 cores, showing the effects of setting 'num_proc_in_smp' to 2, 4 and 5, both with and without the 'data_distribution : gvector' flag. 'Default' means 'num_proc_in_smp' absent from the param file. aCalculated automatically by CASTEP
Optimization strategy
This parameter has three settings and is invoked through the 'opt_strategy' flag in the param file:
Default - Balances speed and memory use. Wavefunction coefficients for all k-points in a calculation are kept in memory rather than paged to disk; some large work arrays are paged to disk.
Memory - Minimizes memory use. All wavefunctions and large work arrays are paged to disk.
Speed - Maximizes speed by not paging to disk.
This means that if a user runs a large-memory calculation, optimizing for memory could obviate the need to request additional cores, although the calculation will take longer; see Table 9 for comparisons.
opt_strategy                       Default   Memory   Speed
Memory/process (MB)                793       750      1249
Peak memory use (MB)               1566      1092     1581
Total time (secs)                  232       290      221
Overall parallel efficiencya (%)   94        97       96
Table 9 Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, run on 5 cores, showing the effects of optimizing for speed or memory. 'Default' means either omitting the 'opt_strategy' flag from the param file or adding it as 'opt_strategy : default'. aCalculated automatically by CASTEP
Spin polarization
If a system comprises an odd number of electrons, it might be important to differentiate between the spin-up and spin-down states of the odd electron. This directly affects the calculation time, effectively doubling it, as shown in Table 10.
param flag: spin_polarization      false   true
Memory/process (MB)                1249    1415
Peak memory use (MB)               1581    1710
Total time (secs)                  222     455
Overall parallel efficiencya (%)   96      98
Table 10 Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, run on 5 cores, showing the effects of spin polarization. aCalculated automatically by CASTEP
Electronic energy minimizer
Insulating systems often behave well during the self-consistent field (SCF) minimizations and converge smoothly using density mixing ('DM'). When SCF convergence is problematic and all attempts to tweak DM-related parameters have failed, it is necessary to turn to ensemble density functional theory (7) and accept the consequent (and considerable) increase in computational cost; see Table 11.
param flag: metals_method (electron minimization)   DM     EDFT
Memory/process (MB)                                 1249   1289
Peak memory use (MB)                                1581   1650
Total time (secs)                                   222    370
Overall parallel efficiencya (%)                    96     97
Table 11 Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, run on 5 cores, showing the effects of the electronic minimization method. 'DM' means density mixing and 'EDFT' ensemble density functional theory. aCalculated automatically by CASTEP
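Drawing the Section B flags together, an efficiency-minded .param file might look like the following sketch (the keywords are CASTEP's own; the values are illustrative, reflect the tests above, and should be convergence-tested against your system):

```text
task               : SinglePoint
cut_off_energy     : 370 eV        ! from convergence testing
grid_scale         : 2.0
fine_grid_scale    : 3.0
data_distribution  : gvector       ! largest effect on calculation time in these tests
! num_proc_in_smp  : 2             ! only helps at larger core counts
opt_strategy       : speed
spin_polarization  : false
metals_method      : dm            ! EDFT only if density mixing fails to converge
```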
C Submission script
Figure 5 An example HPC batch submission script
Figure 5 captures the script variables that affect HPC computational energy and usage
efficiency:
(i) The variable familiar to most HPC users describes the number of cores ('tasks')
requested for the simulation. Unless the calculation is memory-hungry, configure
the requested number of cores to sit on the fewest nodes, because this reduces
expensive node-to-node communication time.
(ii) Choosing the shortest job run time gives the calculation a better chance of
progressing through the job queue swiftly.
(iii) When not requesting use of all cores on a single node, remove the 'exclusive'
flag to accelerate progress through the job queue.
(iv) Using the most recent version of the software captures the latest upgrades and bug-
fixes that might otherwise slow down a calculation.
(v) Using the 'dryrun' tag provides a (very) broad estimate of the memory
requirements. In one example the estimate of peak memory use was a quarter of that
actually used during the simulation proper.
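Points (i)-(v) can be illustrated with a minimal SLURM batch script of the kind shown in Figure 5 (a sketch only: directive values, the module name and the executable name are assumptions to be adapted to your cluster):

```text
#!/bin/bash
#SBATCH --job-name=castep_run
#SBATCH --ntasks=6              # (i) request few cores...
#SBATCH --nodes=1               # ...placed on the fewest nodes
#SBATCH --time=00:30:00         # (ii) shortest realistic run time
                                # (iii) no --exclusive flag when not filling a node
module load castep/22.11        # (iv) most recent installed version (name assumed)

# (v) optional first step: broad memory estimate via a dry run
# castep.mpi --dryrun mymodel
srun castep.mpi mymodel
```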
D An (extreme) example
Clay mineral (Figure 2)                         Careful optimisation for energy efficiency   Careless - no optimisation for energy efficiency
Vacuum space                                    10 Å                                         10 Å
Pseudopotential, cut-off energy (eV)            Ultrasoft, 370                               OTFG-Ultrasoft, 599
K-points                                        3                                            12
grid_scale / fine_grid_scale                    2 / 3                                        3 / 4
num_proc_in_smp / requested data distribution   default / Gvector                            20 / none
Actual data distribution                        5-way Gvector only                           3-way Gvector, 12-way kpoint, 3-way (Gvector) smp
Optimization strategy                           Speed                                        Default
Spin polarization                               False                                        True
Electronic energy minimizer                     Density mixing                               EDFT
Number of cores requested                       5                                            40
RESULTS
Memory/process (MB), scratch disk (MB)          834, 0                                       1461, 6518
Peak memory use (MB)                            1066                                         9107
Total time (seconds)                            215                                          45302
Overall parallel efficiencya (%)                69                                           96
Relative total energy (cores x total time)      1075 core-seconds (0.30 core-hours)          1,812,080 core-seconds (503.36 core-hours)
kiloJoules used (approx)                        202                                          52,000
Table 12 One clay mineral model (Figure 2) with vacuum spaces of 10 Å. Single point energy calculations showing the difference between carefully optimizing for energy efficiency and carelessly running without pre-testing. aCalculated automatically by CASTEP
Table 12 illustrates the combined effects of many of the model properties and parameters
discussed in the previous section on the total time and overall use of computational
resources. It's unlikely a user would choose the whole combination of model properties and
parameters shown in the 'careless' column, but it nevertheless gives an idea of the impact a
user can have on the energy consumption of their simulations. For comparison, the cheapest
electric car listed in 2021 consumes 26.8 kWh per 100 miles, or roughly 603 kJ/km, which means
that the carelessly run simulation used the equivalent energy of driving this car about 86 km,
whereas the efficiently run simulation 'drove' it 0.33 km.
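The car-distance figures follow from simple arithmetic; a sketch using the numbers quoted above:

```python
# Convert the energy used by each simulation into an equivalent driving
# distance for the cheapest 2021 electric car (26.8 kWh per 100 miles,
# i.e. roughly 603 kJ/km, as quoted in the text).
CAR_KJ_PER_KM = 603

careful_kj = 202       # carefully optimized run (Table 12)
careless_kj = 52_000   # careless run (Table 12)

print(round(careless_kj / CAR_KJ_PER_KM, 1))  # -> 86.2 (km)
print(round(careful_kj / CAR_KJ_PER_KM, 2))   # -> 0.33 (km)
```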
For computational scientists and modellers, applying good energy efficiency practices needs
to become second nature; following an energy efficiency 'recipe' or procedure is a route to
embedding this practice as a habit.
3 Developing energy efficient computing habits A recipe
1) Build a model of a system that contains only the essential ingredients that
allow exploration of the scientific question. This is one of the key factors that
determines the size of a model.
2) Find out how many cores per node there are on the available HPC cluster This
enables users to request the number of corestasks that minimizes inter-node
communication during a simulation
3) Choose the pseudopotentials to match the science. This ensures users don't use
pseudopotentials that are unnecessarily computationally expensive.
4) Carry out extensive convergence testing based on the minimum accuracy
required for the production run results eg
(i) Kinetic energy cut-off (depends on pseudopotential choice)
(ii) Grid scale and fine grid scale (depends on pseudopotential choice)
(iii) Size and orientation of model including eg number of bulk atoms
number of layers size of surface vacuum space etc
(iv) Number of k-points
These decrease the possibility of over-convergence and its associated
computational cost
5) Spend time optimising the param file properties described in Section B using a
small number of SCF cycles
a Data distribution Gvector k-points or mixed
b Number of tasks per node
c Optimization strategy
d Spin polarization
e Electronic energy (SCF) minimization method
This increases the chances of using resources efficiently due to matching the
model and material requirements to the simulation parameters
6) Optimise the script file This increases the efficient use of HPC resources
7) Submit the calculation and initially monitor it to check it's progressing as
expected. This reduces the chances of wasting computational time due to trivial
('Friday afternoon') mistakes.
8) Carry out your own energy efficient computing tests (and send your findings to
the JISCMAIL CASTEP mailing list)
9) Sit back and wait for the simulation to complete, basking in the knowledge that
the simulation is running as energy efficiently1 as a user can possibly make it.
4 What else can a user do
In addition to using the above recipe to embed energy-efficient computing habits a user
can take a number of actions to encourage the wider awareness and adoption of energy
efficient computing
a If the HPC cluster uses SLURM, use the 'sacct' command to check the
amount of energy consumed2 (in joules) by a job; see Figure 6.
b If your local cluster uses a different job-scheduler ask your local IT helpdesk if
it has the facility to monitor the energy consumed by each HPC job
c Include the energy consumption of simulations in all forms of reports and
presentations eg informal talks posters peer reviewed journal articles social
media posts etc This will increase awareness of our role as environmentally
aware and conscientious computational scientists and users of HPC resources
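As a concrete sketch of point (a), the commands behind Figure 6 might look like the following (the 'ConsumedEnergy' field name is from the SLURM sacct documentation; whether it is populated depends on your cluster's energy-accounting configuration):

```text
# List recent jobs with elapsed time and consumed energy (joules)
sacct --starttime=2021-03-20 --format=JobID,JobName,Elapsed,ConsumedEnergy

# Efficiency summary for a single job
seff <jobID>
```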
1 It's highly probable that users can expand on the list of model properties and parameters described within this document to further optimise energy efficient computing. 2 'Note: Only in case of exclusive job allocation this value reflects the job's real energy consumption' - see https://slurm.schedmd.com/sacct.html
Figure 6 Examples of information about jobs output through SLURM's 'sacct' command (plus flags). Top: list of details about several jobs run from 20/03/2021; bottom: details for a specific job ID via the 'seff <jobID>' command
d Include estimates of the energy consumption of simulations in applications for
funding Although not yet explicitly requested in EPSRC funding applications
there is the expectation that UKRI's 2020 commitment to Environmental
Sustainability will filter down to all activities of its research councils including
funding This will mean that funding applicants will need to demonstrate their
awareness of the environmental impact of their proposed work Become an
impressive pioneer and include environmental impact through energy
consumption in your next application
5 What are the developers doing
The compilation of this document included a chat with several of the developers of CASTEP,
who are keen to help users run their software energy efficiently; they shared their plans and
projects in this field:
Parts of CASTEP have been programmed to run on GPUs with up to a 15-fold
speed-up (for non-local functionals)
Work on a CASTEP simulator is underway that should reduce the number of
CASTEP calculations required per simulation by choosing an optimal parallel domain
decomposition and implementing timings for FFTs (the big parallel cost); it will also
estimate compute usage. This simulator will go a long way to providing the structure
needed to add energy efficiency to CASTEP and will be accessible through the
'--dryrun' command. The toy code is available on Bitbucket.
The developers recognise the need for energy consumption to be acknowledged as
an additional factor to be included in the cost of computational simulations They will
be planning their approach beyond the software itself such as including energy
efficient computing in their training courses
Acknowledgements
I acknowledge the support of the Supercomputing Wales project which is part-funded by the
European Regional Development Fund (ERDF) via Welsh Government
Thank you to the following CASTEP developers for their invaluable input and support for this
small project Dr Phil Hasnip and Prof Matt Probert (University of York) Prof Chris Pickard
(University of Cambridge) Dr Dominik Jochym (STFC) Prof Stewart Clark (University of
Durham) Thanks also to Dr Sue Thorne (STFC) and Dr Ed Bennett (Supercomputing
Wales) for sharing their research engineering perspectives
References
(1) Clark, S. J.; Segall, M. D.; Pickard, C. J.; Hasnip, P. J.; Probert, M. I. J.; Refson, K.; Payne, M. C. First Principles Methods Using CASTEP. Z. Krist. 2005, 220, 567-570.
(2) Vanderbilt, D. Soft Self-Consistent Pseudopotentials in a Generalized Eigenvalue Formalism. Phys. Rev. B 1990, 41, 7892-7895.
(3) Pickard, C. J. On-the-Fly Pseudopotential Generation in CASTEP. 2006.
(4) Refson, K.; Clark, S. J.; Tulip, P. Variational Density Functional Perturbation Theory for Dielectrics and Lattice Dynamics. Phys. Rev. B 2006, 73, 155114.
(5) Hamann, D. R.; Schlüter, M.; Chiang, C. Norm-Conserving Pseudopotentials. Phys. Rev. Lett. 1979, 43 (20), 1494-1497.
(6) BIOVIA, Dassault Systèmes. Materials Studio 2020; Dassault Systèmes: San Diego, 2019.
(7) Marzari, N.; Vanderbilt, D.; Payne, M. C. Ensemble Density Functional Theory for Ab Initio Molecular Dynamics of Metals and Finite-Temperature Insulators. Phys. Rev. Lett. 1997, 79, 1337-1340.
Type of pseudopotential
Norm-conserving
Ultrasoft OTFG Norm-conserving
OTFG Ultrasoft
OTFG Ultrasoft QC5 set
grid_scale fine_grid_scale
15175 2030 2030 15175 2030 2030 2030
Memoryprocess (MB)
792 681 666 680 731 2072 1007
Peak memory use (MB)
803 1070 777 791 956 2785 1590
Total time (secs) 89 150 55 136 221 250 109
Table 6 Single point energy calculations run on 5 cores showing the effects of different electron density grid settings
Data Distribution
Parallelizing over plane wave vectors (lsquoG-vectorsrsquo) k-points or a mix of the two has an
impact on computational efficiency as shown in Table 7
The default for a param file without the keyword lsquodata_distributionrsquo is to prioritize k-point
distribution across a number of cores (less than or equal to the number requested in the
submission script) that is a factor of the number of k-points see for example Table 7
columns 2 and 3 Inserting lsquodata_distribution kpointrsquo into the param file prioritizes and
optimizes the k-point distribution across the number of cores requested in the script In the
example tested selecting data distribution over kpoints increased the calculation time over
the default of no data distribution compare columns 3 and 5 of Table 7
Requesting G-vector distribution has the largest impact on calculation time and combining
this with requesting a number of cores that is also a factor of the number of k-points has the
overall largest impact on reducing calculation time ndashsee columns 6 and 7 of Table 7
Requesting mixed data distribution has a similar impact on calculation time as not requesting
any data distribution for 5 cores but not for 6 cores the lsquomixedrsquo distribution used 4-way
kpoint distribution rather than the default (non-) request that applied 6-way distribution ndash
compare columns 2 and 3 with 8 and 9
For the small clay model system the optimal efficiency was obtained using G-vector data
distribution over 6 cores (852 core-seconds) and the least efficient choice was mixed data
distribution over 6 cores (1584 core-seconds) The results are system-specific and need
careful testing to tailor to different systems
Number of tasks per node
This is invoked by adding lsquonum_proc_in_smprsquo to the param file and controls the number of message parsing interface (MPI) tasks that are placed in a specifically OpenMP (SMP) group This means that the ldquoall-to-allrdquo communications is then done in three phases instead of one (1) tasks within an SMP collect their data together on a chosen ldquocontrollerrdquo task within their group (2) the ldquoall-to-allrdquo is done between the controller tasks (3) the controllers all distribute the data back to the tasks in their SMP groups For small core counts the overhead of the two extra phases makes this method slower than just doing an all-to-all for large core counts the reduction in the all-to-all time more than
compensates for the extra overhead so itrsquos faster Indeed the tests (shown in Table 8) reveal that invoking this flag fails to produce as large a speed-up as the flag lsquodata_distribution gvectorrsquo (compare columns 3 and 9) for the test HPC cluster ndash Sunbird reflecting the requested small core count Generally speaking the more cores in the G-vector group the higher you want to set ldquonum_proc_in_smprdquo (up to the physical number of cores on a node)
Column 1 2 3 4 5 6 7 8 9
Requested data distribution + cores in HPC submission script
None 5 cores
None 6 cores
Kpoints 5 cores
Kpoints 6 cores
Gvector 5 cores
Gvector 6 cores
Mixed 5 cores
Mixed 6 cores
Actual data distribution
kpoint 4-way
kpoint 6-way
kpoint 5-way
kpoint 6-way
Gvector 5-way
Gvector 6-way
kpoint 4-way
kpoint 4-way
Memoryprocess (MB)
1249 1219 1249 1219 728 698 1249 1253
Peak memory use (MB)
1581 1561 1581 1561 839 804 1581 1585
Total time (secs) 295 199 292 226 191 142 294 264
Overall parallel efficiencya
99 96 98 96 66 71 98 96
Relative total energy ( cores total time core-seconds)
1475 1194 1460 1356 955 852 1470 1584
Table 7 Single point energy calculations using ultrasoft pseudopotentials and 12 k-points showing the effects of data distribution across different numbers of cores requested in the script file lsquoActual data distributionrsquo means that reported by CASTEP on completion in this and (where applicable) all following Tables lsquoRelative total energyrsquo assumes that each core requested by the script consumes X amount of electricity aCalculated automatically by CASTEP
num_proc_in_smp Default 2 4 5
Requested data_distribution None Gvector None Gvector None Gvector None Gvector
Actual data distribution kpoint 4-way
Gvector 5-way
kpoint 4-way
Gvector 5-way
kpoint 4-way
Gvector 5-way
kpoint 4-way
Gvector 5-way
Memoryprocess (MB) 1249 728 1249 728 1249 728 1249 728
Peak memory use (MB) 1580 837 1581 839 1581 844 1581 846
Total time (secs) 222 156 231 171 230 182 237 183
Overall parallel efficiencya 96 66 98 60 98 56 96 56
Column 1 2 3 4 5 6 7 8 9
Table 8 Single point energy calculations using ultrasoft pseudopotentials and 12 k-points run on 5 cores showing the effects of setting lsquonum_proc_in_smp 2 4 5rsquo both with and without the lsquodata_distribution gvectorrsquo flag lsquoDefaultrsquo means lsquonum_proc_in_smprsquo absent from param file aCalculated automatically by CASTEP
Optimization strategy
This parameter has three settings and is invoked through the lsquoopt_strategyrsquo flag in the
param file
Default - Balances speed and memory use Wavefunction coefficients for all k-points
in a calculation will be kept in memory rather than be paged to disk Some large
work arrays will be paged to disk
Memory - Minimizes memory use All wavefunctions and large work arrays are paged
to disk
Speed - Maximizes speed by not paging to disk
This means that if a user runs a large memory calculation optimizing for memory could
obviate the need to request additional cores although the calculation will take longer - see
Table 9 for comparisons
opt_strategy Default Memory Speed
Memoryprocess (MB) 793 750 1249
Peak memory use (MB)
1566 1092 1581
Total time (secs) 232 290 221
Overall parallel efficiencya
94 97 96
Table 9 Single point energy calculations using ultrasoft pseudopotentials and 12 k-points run on 5 cores showing the effects of optimizing for speed or memory lsquoDefaultrsquo means either omitting the lsquoopt_strategyrsquo flag from the param file or adding it as lsquoopt_strategy defaultrsquo aCalculated automatically by CASTEP
Spin polarization
If a system comprises an odd number of electrons it might be important to differentiate
between the spin-up and spin-down states of the odd electron This directly affects the
calculation time effectively doubling it as shown in Table 10
param flag and setting
spin_polarization
false true
Memoryprocess (MB)
1249 1415
Peak memory use (MB)
1581 1710
Total time (secs) 222 455
Overall parallel efficiencya
96 98
Table 10 Single point energy calculations using ultrasoft pseudopotentials and 12 k-points run on 5 cores showing the effects of spin polarization aCalculated automatically by CASTEP
Electronic energy minimizer
Insulating systems often behave well during the self-consistent field (SCF) minimizations and
converge smoothly using density mixing (lsquoDMrsquo) When SCF convergence is problematic and
all attempts to tweak DM-related parameters have failed it is necessary to turn to ensemble
density functional theory7 and accept the consequent (and considerable) increase in
computational cost ndashsee Table 11
param flag and setting
metals_method (Electron minimization) DM EDFT
Memoryprocess (MB) 1249 1289 Peak memory use (MB) 1581 1650 Total time (secs) 222 370 Overall parallel efficiencya 96 97
Table 11 Single point energy calculations using ultrasoft pseudopotentials and 12 k-points run on 5 cores showing the effects of the electronic minimization method lsquoDMrsquo means density mixing and lsquoEDFTrsquo ensemble density functional theory
aCalculated automatically by CASTEP
C Script submission file
Figure 5 An example HPC batch submission script
Figure 5 captures the script variables that affect HPC computational energy and usage
efficiency
(i) The variable familiar to most HPC users describes the number of cores (lsquotasksrsquo)
requested for the simulation Unless the calculation is memory hungry configure
the requested number of cores to sit on the fewest nodes because this reduces
expensive node-to-node communication time
(ii) Choosing the shortest job run time gives the calculation a better chance of
progressing through the job queue swiftly
(iii) When not requesting use of all cores on a single node remove the lsquoexclusiversquo
flag to accelerate progress through the job queue
(iv) Using the most recent version of software captures the latest upgrades and bug-
fixes that might otherwise slow down a calculation
(v) Using the lsquodryrunrsquo tag provides a (very) broad estimate of the memory
requirements In one example the estimate of peak memory use was frac14 of that
actually used during the simulation proper
D An (extreme) example
Clay mineral (Figure 2) Careful optimisation for energy efficiency
Careless ndash no optimisation for energy efficiency
Vacuum space 10Aring Vacuum space 10 Aring
Pseudopotential and cut-off energy (eV)
Ultrasoft 370 OTFG-Ultrasoft 599
K-points 3 12
Grid-scale fine-grid-scale 2 3 3 4
num_proc_in_smprequested data distribution
default Gvector 20 none
Actual data distribution
5-way Gvector only 3-way Gvector12-way kpoint 3-way (Gvector) smp
Optimization strategy Speed Default
Spin polarization False True
Electronic energy minimizer Density mixing EDFT
Number of cores requested 5 40
RESULTS
Memoryprocess (MB) Scratch disk (MB)
834 0 1461 6518
Peak memory use (MB) 1066 9107
Total time (seconds) 215 45302
Overall parallel efficiencya 69 96
Relative total energy ( cores total time core-seconds core-hours)
1075 030
1 812080 50336
kiloJoules used (approx) 202 52000 Table 12 One clay mineral model (Figure 2) with vacuum spaces of 10Aring - Single point energy calculations showing the difference between carefully optimizing for energy efficiency and carelessly running without pre-testing aCalculated automatically by CASTEP
Table 12 illustrates the combined effects of many of the model properties and parameters
discussed in the previous section on the total time and overall use of computational
resources Itrsquos unlikely a user would choose the whole combination of model properties and
parameters shown in the lsquocarelessrsquo column but it nevertheless gives an idea of the impact a
user can have on the energy consumption of their simulations For comparison the cheapest
electric car listed in 2021 consumes 268 kWh per 100 miles or 603 kJkm which means
that the carelessly run simulation used the equivalent energy of driving this car about 86 km
whereas the efficiently run simulation lsquodroversquo it 033 km
For computational scientists and modellers applying good energy efficiency practices needs
to become second nature following an energy efficiency lsquorecipersquo or procedure is a route to
embedding this practice as a habit
3 Developing energy efficient computing habits A recipe
1) Build a model of a system that contains only the essential ingredients that
allows exploration of the scientific question This is one of the key factors that
determines the size of a model
2) Find out how many cores per node there are on the available HPC cluster This
enables users to request the number of corestasks that minimizes inter-node
communication during a simulation
3) Choose the pseudopotentials to match the science This ensures users donrsquot use
pseudopotentials that are unnecessarily computationally expensive
4) Carry out extensive convergence testing based on the minimum accuracy
required for the production run results eg
(i) Kinetic energy cut-off (depends on pseudopotential choice)
(ii) Grid scale and fine grid scale (depends on pseudopotential choice)
(iii) Size and orientation of model including eg number of bulk atoms
number of layers size of surface vacuum space etc
(iv) Number of k-points
These decrease the possibility of over-convergence and its associated
computational cost
5) Spend time optimising the param file properties described in Section B using a
small number of SCF cycles
a Data distribution Gvector k-points or mixed
b Number of tasks per node
c Optimization strategy
d Spin polarization
e Electronic energy (SCF) minimization method
This increases the chances of using resources efficiently due to matching the
model and material requirements to the simulation parameters
6) Optimise the script file This increases the efficient use of HPC resources
7) Submit the calculation and initially monitor it to check itrsquos progressing as
expected This reduces the chances of wasting computational time due to trivial
(lsquoFriday afternoonrsquo) mistakes
8) Carry out your own energy efficient computing tests (and send your findings to
the JISCMAIL CASTEP mailing list)
9) Sit back and wait for the simulation to complete basking in the knowledge that
the simulation is running as energy efficiently1 as a user can possibly make it
4 What else can a user do
In addition to using the above recipe to embed energy-efficient computing habits a user
can take a number of actions to encourage the wider awareness and adoption of energy
efficient computing
a If the HPC cluster uses SLURM use the lsquosacctrsquo command to check the
amount of energy consumed2 (in Joules) by a job -see Figure 6
b If your local cluster uses a different job-scheduler ask your local IT helpdesk if
it has the facility to monitor the energy consumed by each HPC job
c Include the energy consumption of simulations in all forms of reports and
compensates for the extra overhead, so it's faster. Indeed, the tests (shown in Table 8) reveal that invoking this flag fails to produce as large a speed-up as the flag 'data_distribution : gvector' (compare columns 3 and 9) on the test HPC cluster, Sunbird, reflecting the small core count requested. Generally speaking, the more cores there are in the G-vector group, the higher 'num_proc_in_smp' should be set (up to the physical number of cores on a node).
| Column | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Requested data distribution + cores in HPC submission script | None, 5 cores | None, 6 cores | Kpoints, 5 cores | Kpoints, 6 cores | Gvector, 5 cores | Gvector, 6 cores | Mixed, 5 cores | Mixed, 6 cores |
| Actual data distribution | kpoint 4-way | kpoint 6-way | kpoint 5-way | kpoint 6-way | Gvector 5-way | Gvector 6-way | kpoint 4-way | kpoint 4-way |
| Memory/process (MB) | 1249 | 1219 | 1249 | 1219 | 728 | 698 | 1249 | 1253 |
| Peak memory use (MB) | 1581 | 1561 | 1581 | 1561 | 839 | 804 | 1581 | 1585 |
| Total time (secs) | 295 | 199 | 292 | 226 | 191 | 142 | 294 | 264 |
| Overall parallel efficiency (%)(a) | 99 | 96 | 98 | 96 | 66 | 71 | 98 | 96 |
| Relative total energy (no. cores × total time, core-seconds) | 1475 | 1194 | 1460 | 1356 | 955 | 852 | 1470 | 1584 |

Table 7: Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, showing the effects of data distribution across different numbers of cores requested in the script file. 'Actual data distribution' means that reported by CASTEP on completion, in this and (where applicable) all following tables. 'Relative total energy' assumes that each core requested by the script consumes some fixed amount X of electricity. (a) Calculated automatically by CASTEP.
| Column | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| num_proc_in_smp | Default | Default | 2 | 2 | 4 | 4 | 5 | 5 |
| Requested data_distribution | None | Gvector | None | Gvector | None | Gvector | None | Gvector |
| Actual data distribution | kpoint 4-way | Gvector 5-way | kpoint 4-way | Gvector 5-way | kpoint 4-way | Gvector 5-way | kpoint 4-way | Gvector 5-way |
| Memory/process (MB) | 1249 | 728 | 1249 | 728 | 1249 | 728 | 1249 | 728 |
| Peak memory use (MB) | 1580 | 837 | 1581 | 839 | 1581 | 844 | 1581 | 846 |
| Total time (secs) | 222 | 156 | 231 | 171 | 230 | 182 | 237 | 183 |
| Overall parallel efficiency (%)(a) | 96 | 66 | 98 | 60 | 98 | 56 | 96 | 56 |

Table 8: Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, run on 5 cores, showing the effects of setting 'num_proc_in_smp' to 2, 4 and 5, both with and without the 'data_distribution : gvector' flag. 'Default' means 'num_proc_in_smp' is absent from the param file. (a) Calculated automatically by CASTEP.
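Concretely, the fastest combination in Tables 7 and 8 amounts to a single line in the param file; the sketch below is illustrative, and the best settings depend on your own cluster and requested core count, so re-test locally:

```
! Illustrative .param fragment
data_distribution : gvector   ! fastest setting in Tables 7 and 8
! num_proc_in_smp left at its default; setting it gave no further gain here
```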
Optimization strategy

This parameter has three settings and is invoked through the 'opt_strategy' flag in the param file:

- Default: balances speed and memory use. Wavefunction coefficients for all k-points in a calculation are kept in memory rather than paged to disk; some large work arrays are paged to disk.
- Memory: minimizes memory use. All wavefunctions and large work arrays are paged to disk.
- Speed: maximizes speed by not paging to disk.

This means that if a user runs a large-memory calculation, optimizing for memory could obviate the need to request additional cores, although the calculation will take longer; see Table 9 for comparisons.
| opt_strategy | Default | Memory | Speed |
| --- | --- | --- | --- |
| Memory/process (MB) | 793 | 750 | 1249 |
| Peak memory use (MB) | 1566 | 1092 | 1581 |
| Total time (secs) | 232 | 290 | 221 |
| Overall parallel efficiency (%)(a) | 94 | 97 | 96 |

Table 9: Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, run on 5 cores, showing the effects of optimizing for speed or memory. 'Default' means either omitting the 'opt_strategy' flag from the param file or adding it as 'opt_strategy : default'. (a) Calculated automatically by CASTEP.
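Invoking the memory-lean setting is a single param-file line (a sketch; choose the value that matches the job's actual constraints):

```
! Illustrative .param fragment
opt_strategy : memory   ! alternatives: default, speed
```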
Spin polarization

If a system comprises an odd number of electrons, it might be important to differentiate between the spin-up and spin-down states of the odd electron. This directly affects the calculation time, effectively doubling it, as shown in Table 10.
| param flag and setting | spin_polarization : false | spin_polarization : true |
| --- | --- | --- |
| Memory/process (MB) | 1249 | 1415 |
| Peak memory use (MB) | 1581 | 1710 |
| Total time (secs) | 222 | 455 |
| Overall parallel efficiency (%)(a) | 96 | 98 |

Table 10: Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, run on 5 cores, showing the effects of spin polarization. (a) Calculated automatically by CASTEP.
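In the param file this is a single switch, so it is worth confirming that the science genuinely needs the unpaired-electron treatment before accepting the roughly doubled run time (a sketch):

```
! Illustrative .param fragment
spin_polarization : true   ! omit, or set to false, for closed-shell systems
```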
Electronic energy minimizer

Insulating systems often behave well during self-consistent field (SCF) minimization and converge smoothly using density mixing ('DM'). When SCF convergence is problematic and all attempts to tweak DM-related parameters have failed, it is necessary to turn to ensemble density functional theory (7) ('EDFT') and accept the consequent (and considerable) increase in computational cost; see Table 11.
| param flag and setting | metals_method : dm | metals_method : edft |
| --- | --- | --- |
| Memory/process (MB) | 1249 | 1289 |
| Peak memory use (MB) | 1581 | 1650 |
| Total time (secs) | 222 | 370 |
| Overall parallel efficiency (%)(a) | 96 | 97 |

Table 11: Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, run on 5 cores, showing the effects of the electronic minimization method. 'DM' means density mixing and 'EDFT' ensemble density functional theory. (a) Calculated automatically by CASTEP.
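The minimizer, too, is a one-line change in the param file (a sketch; DM remains the cheaper first choice):

```
! Illustrative .param fragment
metals_method : edft   ! fall back to this only when density mixing (dm) fails
```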
C Script submission file

Figure 5: An example HPC batch submission script.

Figure 5 captures the script variables that affect HPC computational energy and usage efficiency:

(i) The variable familiar to most HPC users describes the number of cores ('tasks') requested for the simulation. Unless the calculation is memory-hungry, configure the requested number of cores to sit on the fewest nodes, because this reduces expensive node-to-node communication time.
(ii) Choosing the shortest realistic job run time gives the calculation a better chance of progressing through the job queue swiftly.
(iii) When not requesting all the cores on a single node, remove the 'exclusive' flag to accelerate progress through the job queue.
(iv) Using the most recent version of the software captures the latest upgrades and bug-fixes that might otherwise slow down a calculation.
(v) Using the '--dryrun' flag provides a (very) broad estimate of the memory requirements. In one example, the estimate of peak memory use was a quarter of that actually used during the simulation proper.
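For readers without access to Figure 5, a minimal sketch of such a submission script is shown below; the partition, account, module version and seed name are all assumptions to be replaced with local values:

```
#!/bin/bash
#SBATCH --job-name=clay_spe
#SBATCH --ntasks=5            # (i) few cores, placed on as few nodes as possible
#SBATCH --nodes=1
#SBATCH --time=00:30:00       # (ii) shortest realistic run time
                              # (iii) no --exclusive flag: share the node
#SBATCH --partition=compute   # assumption: local partition name
#SBATCH --account=my_project  # assumption: local account name

module load castep/21.11      # (iv) most recent locally installed version

# (v) optional: run 'castep.mpi --dryrun clay' first for a rough memory estimate
srun castep.mpi clay          # 'clay' is an illustrative seed name
```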
D An (extreme) example

| Clay mineral (Figure 2) | Careful: optimised for energy efficiency | Careless: no optimisation for energy efficiency |
| --- | --- | --- |
| Vacuum space | 10 Å | 10 Å |
| Pseudopotential and cut-off energy (eV) | Ultrasoft, 370 | OTFG-Ultrasoft, 599 |
| K-points | 3 | 12 |
| Grid-scale / fine-grid-scale | 2 / 3 | 3 / 4 |
| num_proc_in_smp / requested data distribution | default / Gvector | 20 / none |
| Actual data distribution | 5-way Gvector only | 3-way Gvector, 12-way kpoint, 3-way (Gvector) smp |
| Optimization strategy | Speed | Default |
| Spin polarization | False | True |
| Electronic energy minimizer | Density mixing | EDFT |
| Number of cores requested | 5 | 40 |
| RESULTS | | |
| Memory/process (MB); scratch disk (MB) | 834; 0 | 1461; 6518 |
| Peak memory use (MB) | 1066 | 9107 |
| Total time (seconds) | 215 | 45302 |
| Overall parallel efficiency (%)(a) | 69 | 96 |
| Relative total energy (no. cores × total time) | 1075 core-seconds (0.30 core-hours) | 1,812,080 core-seconds (503.36 core-hours) |
| kiloJoules used (approx.) | 202 | 52,000 |

Table 12: One clay mineral model (Figure 2) with vacuum spaces of 10 Å. Single point energy calculations showing the difference between carefully optimizing for energy efficiency and carelessly running without pre-testing. (a) Calculated automatically by CASTEP.
Table 12 illustrates the combined effects, on the total time and overall use of computational resources, of many of the model properties and parameters discussed in the previous section. It's unlikely a user would choose the whole combination of model properties and parameters shown in the 'careless' column, but it nevertheless gives an idea of the impact a user can have on the energy consumption of their simulations. For comparison, the cheapest electric car listed in 2021 consumes 26.8 kWh per 100 miles, or 603 kJ/km, which means that the carelessly run simulation used the equivalent energy of driving this car about 86 km, whereas the efficiently run simulation 'drove' it 0.33 km.
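The arithmetic behind Table 12 and the car comparison can be reproduced in a few lines (the 603 kJ/km figure is taken from the text above):

```python
# Reproduce the 'relative total energy' and car-equivalent figures of Table 12.
def core_seconds(cores, wall_time_s):
    """Proxy for energy use: number of requested cores x wall-clock time."""
    return cores * wall_time_s

careful = core_seconds(5, 215)        # careful run: 5 cores for 215 s
careless = core_seconds(40, 45302)    # careless run: 40 cores for 45302 s

KJ_PER_KM = 603                       # cheapest 2021 EV: 26.8 kWh per 100 miles
careful_km = 202 / KJ_PER_KM          # km driven on 202 kJ
careless_km = 52000 / KJ_PER_KM       # km driven on 52,000 kJ

print(careful, careless, round(careful_km, 2), round(careless_km))
# prints: 1075 1812080 0.33 86
```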
For computational scientists and modellers, applying good energy efficiency practices needs to become second nature; following an energy efficiency 'recipe' or procedure is a route to embedding this practice as a habit.
3 Developing energy efficient computing habits: a recipe

1) Build a model of a system that contains only the essential ingredients that allow exploration of the scientific question. This is one of the key factors that determine the size of a model.
2) Find out how many cores per node there are on the available HPC cluster. This enables users to request the number of cores/tasks that minimizes inter-node communication during a simulation.
3) Choose the pseudopotentials to match the science. This ensures users don't use pseudopotentials that are unnecessarily computationally expensive.
4) Carry out extensive convergence testing based on the minimum accuracy required for the production-run results, e.g.:
   (i) kinetic energy cut-off (depends on pseudopotential choice);
   (ii) grid scale and fine grid scale (depend on pseudopotential choice);
   (iii) size and orientation of the model, including e.g. number of bulk atoms, number of layers, size of surface vacuum space, etc.;
   (iv) number of k-points.
   These decrease the possibility of over-convergence and its associated computational cost.
5) Spend time optimising the param file properties described in Section B, using a small number of SCF cycles:
   a. data distribution: Gvector, k-points or mixed;
   b. number of tasks per node;
   c. optimization strategy;
   d. spin polarization;
   e. electronic energy (SCF) minimization method.
   This increases the chances of using resources efficiently, by matching the model and material requirements to the simulation parameters.
6) Optimise the script file. This increases the efficient use of HPC resources.
7) Submit the calculation and initially monitor it to check it's progressing as expected. This reduces the chances of wasting computational time due to trivial ('Friday afternoon') mistakes.
8) Carry out your own energy efficient computing tests (and send your findings to the JISCMAIL CASTEP mailing list).
9) Sit back and wait for the simulation to complete, basking in the knowledge that the simulation is running as energy efficiently(1) as a user can possibly make it.
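Step 4(i) lends itself to light scripting. The sketch below generates a family of param-file texts spanning candidate cut-off energies; the seed name, task and functional are illustrative assumptions:

```python
# Generate .param file contents for a kinetic-energy cut-off convergence scan.
CUTOFFS_EV = range(300, 701, 50)  # candidate cut-offs (eV): 300, 350, ..., 700

def param_text(cutoff_ev):
    """Return a minimal single-point .param file for one cut-off value."""
    return (
        "task           : singlepoint\n"
        f"cut_off_energy : {cutoff_ev} eV\n"
        "xc_functional  : PBE\n"
    )

# One file per cut-off, e.g. clay_300eV.param, clay_350eV.param, ...
files = {f"clay_{ec}eV.param": param_text(ec) for ec in CUTOFFS_EV}
print(len(files))  # prints: 9
```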
4 What else can a user do?

In addition to using the above recipe to embed energy-efficient computing habits, a user can take a number of actions to encourage wider awareness and adoption of energy efficient computing:

a. If the HPC cluster uses SLURM, use the 'sacct' command to check the amount of energy consumed(2) (in Joules) by a job; see Figure 6.
b. If your local cluster uses a different job scheduler, ask your local IT helpdesk whether it has the facility to monitor the energy consumed by each HPC job.
c. Include the energy consumption of simulations in all forms of reports and presentations, e.g. informal talks, posters, peer-reviewed journal articles, social media posts, etc. This will increase awareness of our role as environmentally aware and conscientious computational scientists and users of HPC resources.

(1) It's highly probable that users can expand on the list of model properties and parameters described within this document to further optimise energy efficient computing.
(2) 'Note: Only in case of exclusive job allocation this value reflects the job's real energy consumption' - see https://slurm.schedmd.com/sacct.html

Figure 6: Examples of information about jobs output through SLURM's 'sacct' command (plus flags). Top: details of several jobs run from 20/03/2021; bottom: details for a specific job ID via the 'seff <jobID>' command.
d. Include estimates of the energy consumption of simulations in applications for funding. Although not yet explicitly requested in EPSRC funding applications, there is the expectation that UKRI's 2020 commitment to Environmental Sustainability will filter down to all activities of its research councils, including funding. This will mean that funding applicants will need to demonstrate their awareness of the environmental impact of their proposed work. Become an impressive pioneer and include environmental impact through energy consumption in your next application.
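For item (a), the relevant SLURM queries look like the following; the job ID is a placeholder, and 'ConsumedEnergy' is only meaningful on clusters with energy accounting enabled (and, per the footnote, only for exclusive allocations):

```
# Energy (Joules) and elapsed time for a finished job:
sacct -j 123456 --format=JobID,JobName,Elapsed,ConsumedEnergy

# Per-job efficiency summary, where the seff utility is installed:
seff 123456
```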
5 What are the developers doing?

The compilation of this document included a chat with several of the developers of CASTEP, who are keen to help users run their software energy efficiently; they shared their plans and projects in this field:

- Parts of CASTEP have been programmed to run on GPUs, with up to a 15-fold speed-up (for non-local functionals).
- Work is underway on a CASTEP simulator that should reduce the number of CASTEP calculations required per simulation by choosing an optimal parallel domain decomposition and implementing timings for FFTs (the big parallel cost); it will also estimate compute usage. This simulator will go a long way towards providing the structure needed to add energy efficiency to CASTEP, and will be accessible through the '--dryrun' command. The toy code is available on Bitbucket.
- The developers recognise the need for energy consumption to be acknowledged as an additional factor in the cost of computational simulations. They are planning their approach beyond the software itself, such as including energy efficient computing in their training courses.
Acknowledgements

I acknowledge the support of the Supercomputing Wales project, which is part-funded by the European Regional Development Fund (ERDF) via the Welsh Government.

Thank you to the following CASTEP developers for their invaluable input and support for this small project: Dr Phil Hasnip and Prof Matt Probert (University of York), Prof Chris Pickard (University of Cambridge), Dr Dominik Jochym (STFC) and Prof Stewart Clark (University of Durham). Thanks also to Dr Sue Thorne (STFC) and Dr Ed Bennett (Supercomputing Wales) for sharing their research engineering perspectives.
References

(1) Clark, S. J.; Segall, M. D.; Pickard, C. J.; Hasnip, P. J.; Probert, M. I. J.; Refson, K.; Payne, M. C. First Principles Methods Using CASTEP. Z. Krist. 2005, 220, 567-570.
(2) Vanderbilt, D. Soft Self-Consistent Pseudopotentials in a Generalized Eigenvalue Formalism. Phys. Rev. B 1990, 41, 7892-7895.
(3) Pickard, C. J. On-the-Fly Pseudopotential Generation in CASTEP. 2006.
(4) Refson, K.; Clark, S. J.; Tulip, P. Variational Density Functional Perturbation Theory for Dielectrics and Lattice Dynamics. Phys. Rev. B 2006, 73, 155114.
(5) Hamann, D. R.; Schlüter, M.; Chiang, C. Norm-Conserving Pseudopotentials. Phys. Rev. Lett. 1979, 43 (20), 1494-1497.
(6) BIOVIA, Dassault Systèmes. Materials Studio 2020; Dassault Systèmes: San Diego, 2019.
(7) Marzari, N.; Vanderbilt, D.; Payne, M. C. Ensemble Density Functional Theory for Ab Initio Molecular Dynamics of Metals and Finite-Temperature Insulators. Phys. Rev. Lett. 1997, 79, 1337-1340.
Optimization strategy
This parameter has three settings and is invoked through the lsquoopt_strategyrsquo flag in the
param file
Default - Balances speed and memory use Wavefunction coefficients for all k-points
in a calculation will be kept in memory rather than be paged to disk Some large
work arrays will be paged to disk
Memory - Minimizes memory use All wavefunctions and large work arrays are paged
to disk
Speed - Maximizes speed by not paging to disk
This means that if a user runs a large memory calculation optimizing for memory could
obviate the need to request additional cores although the calculation will take longer - see
Table 9 for comparisons
opt_strategy Default Memory Speed
Memoryprocess (MB) 793 750 1249
Peak memory use (MB)
1566 1092 1581
Total time (secs) 232 290 221
Overall parallel efficiencya
94 97 96
Table 9 Single point energy calculations using ultrasoft pseudopotentials and 12 k-points run on 5 cores showing the effects of optimizing for speed or memory lsquoDefaultrsquo means either omitting the lsquoopt_strategyrsquo flag from the param file or adding it as lsquoopt_strategy defaultrsquo aCalculated automatically by CASTEP
Spin polarization
If a system comprises an odd number of electrons it might be important to differentiate
between the spin-up and spin-down states of the odd electron This directly affects the
calculation time effectively doubling it as shown in Table 10
param flag and setting
spin_polarization
false true
Memoryprocess (MB)
1249 1415
Peak memory use (MB)
1581 1710
Total time (secs) 222 455
Overall parallel efficiencya
96 98
Table 10 Single point energy calculations using ultrasoft pseudopotentials and 12 k-points run on 5 cores showing the effects of spin polarization aCalculated automatically by CASTEP
Electronic energy minimizer
Insulating systems often behave well during the self-consistent field (SCF) minimizations and
converge smoothly using density mixing (lsquoDMrsquo) When SCF convergence is problematic and
all attempts to tweak DM-related parameters have failed it is necessary to turn to ensemble
density functional theory7 and accept the consequent (and considerable) increase in
computational cost ndashsee Table 11
param flag and setting
metals_method (Electron minimization) DM EDFT
Memoryprocess (MB) 1249 1289 Peak memory use (MB) 1581 1650 Total time (secs) 222 370 Overall parallel efficiencya 96 97
Table 11 Single point energy calculations using ultrasoft pseudopotentials and 12 k-points run on 5 cores showing the effects of the electronic minimization method lsquoDMrsquo means density mixing and lsquoEDFTrsquo ensemble density functional theory
aCalculated automatically by CASTEP
C Script submission file
Figure 5 An example HPC batch submission script
Figure 5 captures the script variables that affect HPC computational energy and usage
efficiency
(i) The variable familiar to most HPC users describes the number of cores (lsquotasksrsquo)
requested for the simulation Unless the calculation is memory hungry configure
the requested number of cores to sit on the fewest nodes because this reduces
expensive node-to-node communication time
(ii) Choosing the shortest job run time gives the calculation a better chance of
progressing through the job queue swiftly
(iii) When not requesting use of all cores on a single node remove the lsquoexclusiversquo
flag to accelerate progress through the job queue
(iv) Using the most recent version of software captures the latest upgrades and bug-
fixes that might otherwise slow down a calculation
(v) Using the lsquodryrunrsquo tag provides a (very) broad estimate of the memory
requirements In one example the estimate of peak memory use was frac14 of that
actually used during the simulation proper
D An (extreme) example
Clay mineral (Figure 2) Careful optimisation for energy efficiency
Careless ndash no optimisation for energy efficiency
Vacuum space 10Aring Vacuum space 10 Aring
Pseudopotential and cut-off energy (eV)
Ultrasoft 370 OTFG-Ultrasoft 599
K-points 3 12
Grid-scale fine-grid-scale 2 3 3 4
num_proc_in_smprequested data distribution
default Gvector 20 none
Actual data distribution
5-way Gvector only 3-way Gvector12-way kpoint 3-way (Gvector) smp
Optimization strategy Speed Default
Spin polarization False True
Electronic energy minimizer Density mixing EDFT
Number of cores requested 5 40
RESULTS
Memoryprocess (MB) Scratch disk (MB)
834 0 1461 6518
Peak memory use (MB) 1066 9107
Total time (seconds) 215 45302
Overall parallel efficiencya 69 96
Relative total energy ( cores total time core-seconds core-hours)
1075 030
1 812080 50336
kiloJoules used (approx) 202 52000 Table 12 One clay mineral model (Figure 2) with vacuum spaces of 10Aring - Single point energy calculations showing the difference between carefully optimizing for energy efficiency and carelessly running without pre-testing aCalculated automatically by CASTEP
Table 12 illustrates the combined effects of many of the model properties and parameters
discussed in the previous section on the total time and overall use of computational
resources Itrsquos unlikely a user would choose the whole combination of model properties and
parameters shown in the lsquocarelessrsquo column but it nevertheless gives an idea of the impact a
user can have on the energy consumption of their simulations For comparison the cheapest
electric car listed in 2021 consumes 268 kWh per 100 miles or 603 kJkm which means
that the carelessly run simulation used the equivalent energy of driving this car about 86 km
whereas the efficiently run simulation lsquodroversquo it 033 km
For computational scientists and modellers applying good energy efficiency practices needs
to become second nature following an energy efficiency lsquorecipersquo or procedure is a route to
embedding this practice as a habit
3 Developing energy efficient computing habits A recipe
1) Build a model of a system that contains only the essential ingredients that
allows exploration of the scientific question This is one of the key factors that
determines the size of a model
2) Find out how many cores per node there are on the available HPC cluster This
enables users to request the number of corestasks that minimizes inter-node
communication during a simulation
3) Choose the pseudopotentials to match the science This ensures users donrsquot use
pseudopotentials that are unnecessarily computationally expensive
4) Carry out extensive convergence testing based on the minimum accuracy
required for the production run results eg
(i) Kinetic energy cut-off (depends on pseudopotential choice)
(ii) Grid scale and fine grid scale (depends on pseudopotential choice)
(iii) Size and orientation of model including eg number of bulk atoms
number of layers size of surface vacuum space etc
(iv) Number of k-points
These decrease the possibility of over-convergence and its associated
computational cost
5) Spend time optimising the param file properties described in Section B using a
small number of SCF cycles
a Data distribution Gvector k-points or mixed
b Number of tasks per node
c Optimization strategy
d Spin polarization
e Electronic energy (SCF) minimization method
This increases the chances of using resources efficiently due to matching the
model and material requirements to the simulation parameters
6) Optimise the script file This increases the efficient use of HPC resources
7) Submit the calculation and initially monitor it to check itrsquos progressing as
expected This reduces the chances of wasting computational time due to trivial
(lsquoFriday afternoonrsquo) mistakes
8) Carry out your own energy efficient computing tests (and send your findings to
the JISCMAIL CASTEP mailing list)
9) Sit back and wait for the simulation to complete basking in the knowledge that
the simulation is running as energy efficiently1 as a user can possibly make it
4 What else can a user do
In addition to using the above recipe to embed energy-efficient computing habits a user
can take a number of actions to encourage the wider awareness and adoption of energy
efficient computing
a If the HPC cluster uses SLURM use the lsquosacctrsquo command to check the
amount of energy consumed2 (in Joules) by a job -see Figure 6
b If your local cluster uses a different job-scheduler ask your local IT helpdesk if
it has the facility to monitor the energy consumed by each HPC job
c Include the energy consumption of simulations in all forms of reports and
presentations eg informal talks posters peer reviewed journal articles social
media posts etc This will increase awareness of our role as environmentally
aware and conscientious computational scientists and users of HPC resources
1 Itrsquos highly probable that users can expand on the list of model properties and parameters described within this document to further optimise energy efficient computing 2 lsquoNote Only in case of exclusive job allocation this value reflects the jobs real energy consumptionrsquo - see httpsslurmschedmdcomsaccthtml
Figure 6 Examples of information about jobs output through SLURMrsquos lsquosacctrsquo command (plus flags) Top list of details about several jobs run from 20032021 bottom details for a specific job ID via the lsquoseff ltjobIDgtrsquo command
d Include estimates of the energy consumption of simulations in applications for
funding Although not yet explicitly requested in EPSRC funding applications
there is the expectation that UKRIrsquos 2020 commitment to Environmental
Sustainability will filter down to all activities of its research councils including
funding This will mean that funding applicants will need to demonstrate their
awareness of the environmental impact of their proposed work Become an
impressive pioneer and include environmental impact through energy
consumption in your next application
5 What are the developers doing
The compilation of this document included a chat with several of the developers of CASTEP
who are keen to help users run their software energy efficiently they shared their plans and
projects in this field
Parts of CASTEP have been programmed to run on GPUs with up to a 15-fold
speed-up (for non-local functionals)
Work on a CASTEP simulator is underway that should reduce the number of
CASTEP calculations required per simulation by choosing an optimal parallel domain
decomposition and implementing timings for FFTs ndash the big parallel cost also it will
estimate compute usage This simulator will go a long way to providing the structure
needed to add energy efficiency to CASTEP and will be accessible through the
rsquo- -dryrunrsquo command The toy code is available in bitbucket
The developers recognise the need for energy consumption to be acknowledged as
an additional factor to be included in the cost of computational simulations They will
be planning their approach beyond the software itself such as including energy
efficient computing in their training courses
Acknowledgements
I acknowledge the support of the Supercomputing Wales project which is part-funded by the
European Regional Development Fund (ERDF) via Welsh Government
Thank you to the following CASTEP developers for their invaluable input and support for this
small project Dr Phil Hasnip and Prof Matt Probert (University of York) Prof Chris Pickard
(University of Cambridge) Dr Dominik Jochym (STFC) Prof Stewart Clark (University of
Durham) Thanks also to Dr Sue Thorne (STFC) and Dr Ed Bennett (Supercomputing
Wales) for sharing their research engineering perspectives
References
(1) Clark S J Segall M D Pickard C J Hasnip P J Probert M I J Refson K Payne M C First Principles Methods Using CASTEP Z Krist 2005 220 567ndash570
(2) Vanderbilt D Soft Self-Consistent Pseudopotentials in a Generalized Eigenvalue Formalism Phys Rev B 1990 41 7892ndash7895
(3) Pickard C J On-the-Fly Pseudopotential Generation in CASTEP 2006 (4) Refson K Clark S J Tulip P Variational Density Functional Perturbation Theory for
Dielectrics and Lattice Dynamics Phys Rev B 2006 73 155114 (5) Hamann D R Schluumlter M Chiang C Norm-Conserving Pseudopotentials Phys
Rev Lett 1979 43 (20) 1494ndash1497 (6) BIOVIA Dassault Systegravemes Materials Studio 2020 Dassault Systegravemes San Diego
2019 (7) Marzari N Vanderbilt D Payne M C Ensemble Density Functional Theory for Ab
Initio Molecular Dynamics of Metals and Finite-Temperature Insulators Phys Rev Lett 1997 79 1337ndash1340
param flag and setting
metals_method (Electron minimization) DM EDFT
Memoryprocess (MB) 1249 1289 Peak memory use (MB) 1581 1650 Total time (secs) 222 370 Overall parallel efficiencya 96 97
Table 11 Single point energy calculations using ultrasoft pseudopotentials and 12 k-points run on 5 cores showing the effects of the electronic minimization method lsquoDMrsquo means density mixing and lsquoEDFTrsquo ensemble density functional theory
aCalculated automatically by CASTEP
C Script submission file
Figure 5 An example HPC batch submission script
Figure 5 captures the script variables that affect HPC computational energy and usage
efficiency
(i) The variable familiar to most HPC users describes the number of cores (lsquotasksrsquo)
requested for the simulation Unless the calculation is memory hungry configure
the requested number of cores to sit on the fewest nodes because this reduces
expensive node-to-node communication time
(ii) Choosing the shortest job run time gives the calculation a better chance of
progressing through the job queue swiftly
(iii) When not requesting use of all cores on a single node remove the lsquoexclusiversquo
flag to accelerate progress through the job queue
(iv) Using the most recent version of software captures the latest upgrades and bug-
fixes that might otherwise slow down a calculation
(v) Using the lsquodryrunrsquo tag provides a (very) broad estimate of the memory
requirements In one example the estimate of peak memory use was frac14 of that
actually used during the simulation proper
D An (extreme) example
Clay mineral (Figure 2) Careful optimisation for energy efficiency
Careless ndash no optimisation for energy efficiency
Vacuum space 10Aring Vacuum space 10 Aring
Pseudopotential and cut-off energy (eV)
Ultrasoft 370 OTFG-Ultrasoft 599
K-points 3 12
Grid-scale fine-grid-scale 2 3 3 4
num_proc_in_smprequested data distribution
default Gvector 20 none
Actual data distribution
5-way Gvector only 3-way Gvector12-way kpoint 3-way (Gvector) smp
Optimization strategy Speed Default
Spin polarization False True
Electronic energy minimizer Density mixing EDFT
Number of cores requested 5 40
RESULTS
Memoryprocess (MB) Scratch disk (MB)
834 0 1461 6518
Peak memory use (MB) 1066 9107
Total time (seconds) 215 45302
Overall parallel efficiencya 69 96
Relative total energy ( cores total time core-seconds core-hours)
1075 030
1 812080 50336
kiloJoules used (approx) 202 52000 Table 12 One clay mineral model (Figure 2) with vacuum spaces of 10Aring - Single point energy calculations showing the difference between carefully optimizing for energy efficiency and carelessly running without pre-testing aCalculated automatically by CASTEP
Table 12 illustrates the combined effects, on the total time and overall use of computational
resources, of many of the model properties and parameters discussed in the previous section.
It's unlikely that a user would choose the whole combination of model properties and
parameters shown in the 'careless' column, but it nevertheless gives an idea of the impact a
user can have on the energy consumption of their simulations. For comparison, the cheapest
electric car listed in 2021 consumes 26.8 kWh per 100 miles, or 603 kJ/km, which means
that the carelessly run simulation used the equivalent energy of driving this car about 86 km,
whereas the efficiently run simulation 'drove' it 0.33 km.
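The arithmetic behind these comparisons is easy to reproduce. A minimal sketch, using the figures from Table 12 and the report's 603 kJ/km figure for the car:

```python
# Reproduce the cost figures in Table 12 and the electric-car comparison.
# Core counts, run times and kJ values are taken from the table;
# 603 kJ/km is the report's figure for the cheapest 2021 electric car.

def cost(cores, seconds, kilojoules, kj_per_km=603.0):
    core_seconds = cores * seconds          # total compute consumed
    core_hours = core_seconds / 3600.0      # the usual HPC accounting unit
    km_equivalent = kilojoules / kj_per_km  # distance the car could drive
    return core_seconds, core_hours, km_equivalent

careful = cost(cores=5, seconds=215, kilojoules=202)
careless = cost(cores=40, seconds=45302, kilojoules=52000)

print(careful)    # (1075, ~0.30 core-hours, ~0.33 km)
print(careless)   # (1812080, ~503.36 core-hours, ~86 km)
```

The careless run therefore costs roughly 1700 times the compute and 250 times the energy of the careful one.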
For computational scientists and modellers, applying good energy-efficiency practices needs
to become second nature; following an energy-efficiency 'recipe', or procedure, is a route to
embedding this practice as a habit.
3 Developing energy efficient computing habits: A recipe
1) Build a model of a system that contains only the essential ingredients that
allow exploration of the scientific question. This is one of the key factors that
determines the size of a model.
2) Find out how many cores per node there are on the available HPC cluster. This
enables users to request the number of cores/tasks that minimizes inter-node
communication during a simulation.
3) Choose the pseudopotentials to match the science. This ensures users don't use
pseudopotentials that are unnecessarily computationally expensive.
4) Carry out extensive convergence testing based on the minimum accuracy
required for the production-run results, e.g.:
(i) kinetic energy cut-off (depends on pseudopotential choice);
(ii) grid scale and fine grid scale (depend on pseudopotential choice);
(iii) size and orientation of the model, including e.g. number of bulk atoms,
number of layers, size of surface vacuum space, etc.;
(iv) number of k-points.
These decrease the possibility of over-convergence and its associated
computational cost.
5) Spend time optimising the param-file properties described in Section B, using a
small number of SCF cycles:
a. data distribution: Gvector, k-points or mixed;
b. number of tasks per node;
c. optimization strategy;
d. spin polarization;
e. electronic energy (SCF) minimization method.
This increases the chances of using resources efficiently, by matching the
simulation parameters to the model and material requirements.
6) Optimise the script file. This increases the efficient use of HPC resources.
7) Submit the calculation and initially monitor it to check it's progressing as
expected. This reduces the chances of wasting computational time due to trivial
('Friday afternoon') mistakes.
8) Carry out your own energy efficient computing tests (and send your findings to
the JISCMAIL CASTEP mailing list).
9) Sit back and wait for the simulation to complete, basking in the knowledge that
the simulation is running as energy efficiently[1] as a user can possibly make it.
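The convergence testing in step 4 is straightforward to script. The sketch below generates a family of .param files over a range of cut-off energies; the seed name, template contents and chosen values are illustrative only, and in practice k-point grids, grid scales and model sizes deserve the same sweep treatment:

```python
# Generate a series of CASTEP .param files for a cut-off-energy
# convergence sweep. Seed name, template and values are illustrative.
from pathlib import Path

TEMPLATE = """task            : SinglePoint
cut_off_energy  : {cutoff} eV
xc_functional   : PBE
"""

def write_sweep(seed, cutoffs, outdir="convergence"):
    """Write one .param file per cut-off; return the file names created."""
    Path(outdir).mkdir(exist_ok=True)
    files = []
    for ec in cutoffs:
        p = Path(outdir) / f"{seed}_ec{ec}.param"
        p.write_text(TEMPLATE.format(cutoff=ec))
        files.append(p.name)
    return files

# e.g. test cut-offs in 50 eV steps around the pseudopotential's suggestion
made = write_sweep("clay", range(300, 501, 50))
print(made)   # ['clay_ec300.param', ..., 'clay_ec500.param']
```

Each file would then be run (with the matching .cell file) and the total energies compared until the change per step falls below the required accuracy.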
4 What else can a user do
In addition to using the above recipe to embed energy-efficient computing habits a user
can take a number of actions to encourage the wider awareness and adoption of energy
efficient computing
a. If the HPC cluster uses SLURM, use the 'sacct' command to check the
amount of energy consumed[2] (in Joules) by a job (see Figure 6).
b. If your local cluster uses a different job scheduler, ask your local IT helpdesk if
it has the facility to monitor the energy consumed by each HPC job.
c. Include the energy consumption of simulations in all forms of reports and
presentations, e.g. informal talks, posters, peer-reviewed journal articles, social
media posts, etc. This will increase awareness of our role as environmentally
aware and conscientious computational scientists and users of HPC resources.
[1] It's highly probable that users can expand on the list of model properties and parameters described within this document to further optimise energy efficient computing.
[2] 'Note: Only in case of exclusive job allocation this value reflects the job's real energy consumption' (see https://slurm.schedmd.com/sacct.html).
Figure 6: Examples of information about jobs output through SLURM's 'sacct' command (plus flags). Top: details of several jobs run from 20/03/2021; bottom: details for a specific job ID via the 'seff <jobID>' command.
d. Include estimates of the energy consumption of simulations in applications for
funding. Although not yet explicitly requested in EPSRC funding applications,
there is the expectation that UKRI's 2020 commitment to Environmental
Sustainability will filter down to all activities of its research councils, including
funding. This will mean that funding applicants will need to demonstrate their
awareness of the environmental impact of their proposed work. Become an
impressive pioneer and include environmental impact, through energy
consumption, in your next application.
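The energy check in item (a) can be scripted. A minimal sketch, assuming the cluster records SLURM's `ConsumedEnergyRaw` accounting field (in Joules); the job ID and the sample output line are placeholders, so the snippet is self-contained:

```shell
# Query SLURM's accounting database for a job's energy use, then convert
# Joules to kJ and kWh with awk. On a real cluster the query would be:
#   sacct -j 123456 --format=JobID,Elapsed,ConsumedEnergyRaw
# Here we parse an illustrative sacct output line instead (202000 J is the
# 'careful' run from Table 12).
sample="123456   00:03:35   202000"
echo "$sample" | awk '{printf "job %s used %.0f kJ (%.3f kWh)\n", $1, $3/1000, $3/3.6e6}'
```

Remember footnote [2]: the reported value only reflects the job's real energy consumption for exclusive node allocations.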
5 What are the developers doing
The compilation of this document included a chat with several of the developers of CASTEP,
who are keen to help users run their software energy efficiently; they shared their plans and
projects in this field:
- Parts of CASTEP have been programmed to run on GPUs, with up to a 15-fold
speed-up (for non-local functionals).
- Work on a CASTEP simulator is underway that should reduce the number of
CASTEP calculations required per simulation by choosing an optimal parallel domain
decomposition and implementing timings for FFTs (the big parallel cost); it will also
estimate compute usage. This simulator will go a long way towards providing the structure
needed to add energy efficiency to CASTEP, and will be accessible through the
'--dryrun' command. The toy code is available on Bitbucket.
- The developers recognise the need for energy consumption to be acknowledged as
an additional factor to be included in the cost of computational simulations. They will
be planning their approach beyond the software itself, such as including energy
efficient computing in their training courses.
Acknowledgements
I acknowledge the support of the Supercomputing Wales project, which is part-funded by the
European Regional Development Fund (ERDF) via the Welsh Government.
Thank you to the following CASTEP developers for their invaluable input and support for this
small project: Dr Phil Hasnip and Prof Matt Probert (University of York), Prof Chris Pickard
(University of Cambridge), Dr Dominik Jochym (STFC) and Prof Stewart Clark (University of
Durham). Thanks also to Dr Sue Thorne (STFC) and Dr Ed Bennett (Supercomputing
Wales) for sharing their research engineering perspectives.
References
(1) Clark S J Segall M D Pickard C J Hasnip P J Probert M I J Refson K Payne M C First Principles Methods Using CASTEP Z Krist 2005 220 567ndash570
(2) Vanderbilt D Soft Self-Consistent Pseudopotentials in a Generalized Eigenvalue Formalism Phys Rev B 1990 41 7892ndash7895
(3) Pickard C J On-the-Fly Pseudopotential Generation in CASTEP 2006 (4) Refson K Clark S J Tulip P Variational Density Functional Perturbation Theory for
Dielectrics and Lattice Dynamics Phys Rev B 2006 73 155114 (5) Hamann D R Schluumlter M Chiang C Norm-Conserving Pseudopotentials Phys
Rev Lett 1979 43 (20) 1494ndash1497 (6) BIOVIA Dassault Systegravemes Materials Studio 2020 Dassault Systegravemes San Diego
2019 (7) Marzari N Vanderbilt D Payne M C Ensemble Density Functional Theory for Ab
Initio Molecular Dynamics of Metals and Finite-Temperature Insulators Phys Rev Lett 1997 79 1337ndash1340