E3C: Exploring Energy Efficient Computing
Dawn Geatches, Science & Technology Facilities Council, Daresbury Laboratory,
Warrington WA4 4AD, dawn.geatches@stfc.ac.uk
This scoping project was funded under the Environmental Sustainability Concept Fund (ESCF) within the Business Innovation Department of STFC.
This document is a first attempt to demonstrate how users of the quantum mechanics-based software code CASTEP [1] can run their simulations on high performance computing (HPC) architectures efficiently. Whatever a user's level of experience, the climate crisis we are facing dictates that we need to: (i) become aware of the computational resources our simulations consume; (ii) understand how we, as users, can reduce this consumption; (iii) actively develop energy efficient computing habits. This document provides some small insight to help users progress through stages (i) and (ii), empowering them to adopt stage (iii) with confidence.
This document is not a guide to setting up and running simulations using CASTEP; these already exist (see, for example, the CASTEP documentation). It is assumed throughout that the user has a basic familiarity with the software and its terminology. This document does not exhaust all of the possible ways to reduce computational cost; much is left for users to discover for themselves and to share with the wider CASTEP community (e.g. via the JISCMAIL CASTEP Users Mailing List). Thank you.
Sections
1 Computational cost of simulations
2 Reducing the energy used by your simulation
A Cell file
B Param file
C Submission script
D An (extreme) example
3 Developing energy efficient computing habits: A recipe
4 What else can a user do?
5 What are the developers doing?
1 Computational cost of simulations
'Computational cost' in the context of this project is synonymous with 'energy used'. As a user of high performance computing (HPC) resources, have you ever wondered what effect your simulations have on the environment through the energy they consume? You might be working on some great new renewable energy material and running hundreds or thousands of simulations over the lifetime of the research. How does the energy consumed by the research stack up against the energy that will be generated/saved/stored etc. by the new material? Hopefully the stacking is gigantically in favour of the new material and its promised benefits.
Fortunately, we can do more than hope that that is the case: we can actively reduce the energy consumed by our simulations; indeed, it's the responsibility of every single computational modeller to do exactly that. Wouldn't it be great (not to say impressive) if, when you write your next funding application, you could give a ballpark figure for the amount of energy your computational research will consume over the lifetime of the project?
As a user you might be thinking 'but what effect can I have, when surely the HPC architecture is responsible for energy usage?' and 'then there's the code itself, which should be as efficient as possible, but if it's not I can't do anything about that'. Both of these thoughts are grounded in truth: the HPC architecture is fixed - but we can use it efficiently; the software we're using is structurally fixed - but we can run it efficiently.
The energy cost (E) of a simulation is the total power per core (P) consumed over the length of time (T) of the simulation, which for parallelised simulations run on N cores is E = N*P*T. From this it is logical to think that reducing N, P and/or T will reduce E, which is theoretically true. Practically, though, let's assume that the power consumed by each core is a fixed property of the HPC architecture; we then have E ∝ N*T. This effectively encapsulates where we, as users of HPC, can control the amount of energy our simulations consume, and it seems simple: all we need to do is learn how to optimize the number of cores and the length of time of our simulations.
We use multiple cores to share the memory load and to speed up a calculation, giving us three calculation properties to optimise: number of cores, memory per core, and time. To reduce the calculation time we might first increase the number of cores. Many users will already know that the relationship between core count and calculation time is non-linear, thanks to the required increase in core-to-core and node-to-node communication time. Taking the latter into account means the total energy used is E = N*T + f(N, T), where f(N, T) captures the energy cost of the core-core/node-node communication time.
To optimise energy efficiency, any speed-up in calculation time gained by increasing the number of cores needs to balance the increased energy cost of using additional cores. Therefore the speed-up factor needs to be more than the factor by which the number of cores increases, as shown in the equations below for a 2-core vs serial example:

E_s = T_s, with f(T_s) = 0        (energy of the serial, i.e. 1-core, calculation)
E_2N = 2*T_2N + f(2, T_2N)        (energy of the 2-core calculation)
E_2N <= E_s                       (for the energy cost of using 2 cores to be no greater than the energy cost of the serial calculation)

so 2*T_2N + f(2, T_2N) <= T_s, i.e. T_2N + (1/2)*f(2, T_2N) <= (1/2)*T_s,

which means that the total calculation time using 2 cores needs to be less than half of the serial time. So for users to run simulations efficiently in parallel, they need to balance the number of cores, the associated memory load per core, and the total calculation time. The following section shows how some of the more commonly used parameters within CASTEP affect these three properties.
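The break-even condition above is easy to explore numerically. The sketch below is purely illustrative (the numbers and the form of the communication term f are assumptions, not measurements); it works in units where the per-core power P = 1, so energy is counted in core-seconds:

```python
# E_serial = T_s ; E_parallel = N * T_N + f(N, T_N), with per-core power P = 1.

def parallel_energy(n_cores, t_parallel, comm_energy):
    """Energy of an N-core run: compute term plus communication term f."""
    return n_cores * t_parallel + comm_energy

def saves_energy(t_serial, n_cores, t_parallel, comm_energy):
    """True if the N-core run uses no more energy than the serial run."""
    return parallel_energy(n_cores, t_parallel, comm_energy) <= t_serial

# A 100 s serial job run on 2 cores with an assumed 5 core-second
# communication cost must finish in under (100 - 5) / 2 = 47.5 s to break even:
print(saves_energy(100.0, 2, 45.0, 5.0))  # True:  2*45 + 5 = 95 <= 100
print(saves_energy(100.0, 2, 48.0, 5.0))  # False: 2*48 + 5 = 101 > 100
```

The same check generalises to any core count N: the run must be more than N times faster than the serial run, by a margin that covers f(N, T).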
NB: The main purpose of the following examples is to illustrate the impact of different user-choices on the total energy cost of simulations. These examples do not indicate the level of 'accuracy' attained, because 'accuracy' is determined by the user according to the type, contents, and aims of their simulations.
2 Reducing the energy used by your simulation
This section uses an example of a small model of a clay mineral (and later a carbon nanotube) to illustrate how a user can change the total energy their simulation uses by a judicious choice of CASTEP input parameters.
Figure 1: Unit cell of a generic silicate clay mineral comprising 41 atoms.
A Cell file
Pseudopotentials
Choose the pseudopotential according to the type of simulation; e.g. for simulations of cell structures, ultrasofts [2] are often sufficient, although if the pseudopotential library does not contain an ultrasoft version for a particular element, the on-the-fly-generated (OTFG) ultrasofts [3] might suffice. If a user is running a spectroscopic simulation such as infrared using density functional perturbation theory [4], then norm-conserving [5] or OTFG norm-conserving [3] pseudopotentials could be the better choice. The impact of pseudopotential type on the computational cost is shown in Table 1 through the total (calculation) time.
Type of pseudopotential            | Ultrasoft | Norm-conserving | OTFG Ultrasoft | OTFG Ultrasoft QC5 set(b) | OTFG Norm-conserving
Cut-off energy (eV)                | 370       | 900             | 598            | 340                       | 925
# cores(a)                         | 5         | 5               | 5              | 5                         | 5
Memory/process (MB)                | 666       | 681             | 2072           | 1007                      | 681
Peak memory use (MB)               | 777       | 802             | 2785           | 1590                      | 791
Total time (secs)                  | 55        | 89              | 250            | 109                       | 136

Table 1: Pseudopotential and size of planewave set required on the 'fine' setting of Materials Studio 2020 [6], and an example of the memory and time required for a single point energy calculation using the recorded number of cores on a single node. Unless otherwise stated, the same cut-off energy per type of pseudopotential is implied throughout this document. (a) Using Sunbird (CPU: 2x Intel(R) Xeon(R) Gold 6148 CPU @ 2.40 GHz with 20 cores each); unless stated otherwise, all calculations were performed on this HPC cluster. (b) Designed to be used at the same modest (340 eV) kinetic energy cut-off across the periodic table; ideal for moderate-accuracy, high-throughput calculations, e.g. ab initio random structure searching (AIRSS).
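In the cell file, the pseudopotential choice can be made explicit with a SPECIES_POT block. A minimal sketch follows; the filenames are illustrative placeholders for files from your chosen library, and omitting the block entirely leaves CASTEP to generate OTFG potentials from its default library:

```
! Example SPECIES_POT block; filenames below are placeholders
%BLOCK SPECIES_POT
Si  Si_00.usp      ! ultrasoft pseudopotential file
O   O_00.usp
%ENDBLOCK SPECIES_POT
```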
K-points
Changing the number of Brillouin zone sampling points can have a dramatic effect on computational time, as shown in Table 2. Bear in mind that increasing the number of k-points increases the memory requirements, often tempting users to increase the number of cores, further increasing the overall computational cost. Remember, though, it's important to use the number of k-points that provides the level of accuracy your simulations need.
Type of pseudopotential      | Ultrasoft |           |           | OTFG Norm-conserving |           |
kpoints_mp_grid (# k-points) | 2 1 1 (1) | 3 2 1 (3) | 4 3 2 (12)| 2 1 1 (1)            | 3 2 1 (3) | 4 3 2 (12)
Memory/process (MB)          | 652       | 666       | 1249      | 630                  | 681       | 1287
Peak memory use (MB)         | 768       | 777       | 1580      | 764                  | 791       | 1296
Total time (secs)            | 32        | 55        | 222       | 85                   | 136       | 477

Table 2: Single point energy calculations run on 5 cores using different numbers of k-points (in brackets), showing the effects for different pseudopotentials.
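The k-point grids compared in Table 2 are requested in the cell file with a single keyword, e.g. (the bracketed counts in Table 2 are the resulting numbers of k-points for this particular cell):

```
! Monkhorst-Pack grid used for Brillouin zone sampling
kpoints_mp_grid 3 2 1
```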
Vacuum space
When building a material surface it is necessary to add vacuum space to a cell (see Figure 2 for an example), and this adds to the memory requirements and calculation time because the 'empty space' (as well as the atoms) is 'filled' by planewaves. Table 3 shows that doubling the volume of vacuum space doubles the total calculation time (using the same number of cores).
Vacuum space (Å)                   | 0   | 5   | 10   | 20
Memory/process (MB)                | 666 | 766 | 834  | 1078
Peak memory use (MB)               | 777 | 928 | 1066 | 1372
Total time (secs)                  | 55  | 102 | 202  | 406
Overall parallel efficiency (%)(a) | 69  | 66  | 67   | 61

Figure 2: Vacuum space added to create a clay mineral surface (to study adsorbate-surface interactions, for example; adsorbate not included in the above).
Table 3: Single point energy calculations using ultrasoft pseudopotentials and 3 k-points, run on 5 cores, showing the effects of vacuum space. (a) Calculated automatically by CASTEP.
Supercell size
The size of a system is one of the more obvious choices affecting the demands on computational resources; nevertheless it is interesting to see (from Table 4) that, for the same number of k-points, doubling the number of atoms increases the memory load per process by between 35% (41 to 82 atoms) and 72% (82 to 164 atoms), and the corresponding calculation times increase by factors of 11 and 8.5 respectively. In good practice the number of k-points is scaled according to the supercell size, increasing the computational cost more modestly.
Supercell size (# atoms)           | 1 x 1 x 1 (41) | 2 x 1 x 1 (82) | 2 x 1 x 1 (82) | 2 x 2 x 1 (164) | 2 x 2 x 1 (164)
Kpoints mp grid (# kpoints)        | 3 2 1 (3)      | 3 2 1 (3)      | 2 1 1 (1)(b)   | 3 2 1 (3)       | 2 1 1 (1)(b)
Memory/process (MB)                | 666            | 897            | 732            | 1547            | 1315
Peak memory use (MB)               | 777            | 1175           | 1025           | 2330            | 2177
Total time (secs)                  | 55             | 631            | 329            | 5416            | 1660
Overall parallel efficiency (%)(a) | 69             | 69             | 74             | 67              | 72

Table 4: Single point energy calculations using ultrasoft pseudopotentials, run on 5 cores, showing the effects of supercells. (a) Calculated automatically by CASTEP. (b) K-points scaled for the 2x1x1 and 2x2x1 supercells.
Figure 3: Example of a 2 x 2 x 1 supercell.
Orientation of axes
This might be one of the more surprising and unexpected properties of a model that affects computational efficiency. The effect becomes significant when a system is large, disproportionately longer along one of its lengths, and misaligned with the x-, y-, z-axes; see Figure 4 and Table 5 for exaggerated examples of misalignment. This effect is due to the way CASTEP transforms properties between real space and reciprocal space: it converts the 3-d fast Fourier transforms (FFT) into three sets of 1-d FFTs along columns that lie parallel to the x-, y-, z-axes.
Figure 4: Top row: a capped carbon nanotube (160 atoms); bottom row: a long carbon nanotube (1000 atoms); long axes aligned in the x-direction (left), z-direction (middle), and skewed (right).

Orientation (# atoms)                                      | X (160) | Z (160) | Skewed (160) | X (1000) | Z (1000) | Skewed (1000)
# cores                                                    | 5       | 5       | 5            | 60       | 60       | 60
Memory/process (MB)                                        | 884     | 882     | 882          | 2870     | 2870     | 2870
Peak memory use (MB)                                       | 1893    | 1885    | 1838         | 7077     | 7077     | 7077
Total time (secs)                                          | 392     | 359     | 409          | 3906     | 3908     | 5232
Overall parallel efficiency (%)(a)                         | 79      | 84      | 82           | 78       | 78       | 75
Relative total energy (# cores x total time, core-seconds) | 1960    | 1795    | 2045         | 234360   | 234480   | 313920

Table 5: Single point energy calculations of carbon nanotubes, as oriented in Figure 4, using ultrasoft pseudopotentials (280 eV cut-off energy) and 1 k-point. (a) Calculated automatically by CASTEP.
B Param file
Grid-scale
Although the ultrasofts require a smaller planewave basis set than the norm-conserving pseudopotentials, they do need a finer electron density grid, set via the 'grid_scale' and 'fine_grid_scale' parameters. As shown in Table 6, the denser grid scale setting needed by the OTFG ultrasofts (with the exception of the QC5 set) can almost double the calculation time compared with the more planewave-hungry OTFG norm-conserving pseudopotentials, which converge well under a less dense grid.
Type of pseudopotential       | Norm-conserving | Norm-conserving | Ultrasoft | OTFG Norm-conserving | OTFG Norm-conserving | OTFG Ultrasoft | OTFG Ultrasoft QC5 set
grid_scale / fine_grid_scale  | 1.5 / 1.75      | 2.0 / 3.0       | 2.0 / 3.0 | 1.5 / 1.75           | 2.0 / 3.0            | 2.0 / 3.0      | 2.0 / 3.0
Memory/process (MB)           | 792             | 681             | 666       | 680                  | 731                  | 2072           | 1007
Peak memory use (MB)          | 803             | 1070            | 777       | 791                  | 956                  | 2785           | 1590
Total time (secs)             | 89              | 150             | 55        | 136                  | 221                  | 250            | 109

Table 6: Single point energy calculations run on 5 cores, showing the effects of different electron density grid settings.
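The grid settings compared above are param file keywords; e.g. the denser pair used for the ultrasofts:

```
grid_scale      : 2.0
fine_grid_scale : 3.0
```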
Data Distribution
Parallelizing over plane wave vectors ('G-vectors'), k-points, or a mix of the two has an impact on computational efficiency, as shown in Table 7.

The default for a param file without the keyword 'data_distribution' is to prioritize k-point distribution across a number of cores (less than or equal to the number requested in the submission script) that is a factor of the number of k-points; see, for example, Table 7, columns 2 and 3. Inserting 'data_distribution : kpoint' into the param file prioritizes and optimizes the k-point distribution across the number of cores requested in the script. In the example tested, selecting data distribution over k-points increased the calculation time over the default of no data distribution; compare columns 3 and 5 of Table 7.

Requesting G-vector distribution has the largest impact on calculation time, and combining this with requesting a number of cores that is also a factor of the number of k-points has the overall largest impact on reducing calculation time; see columns 6 and 7 of Table 7. Requesting mixed data distribution has a similar impact on calculation time as not requesting any data distribution for 5 cores, but not for 6 cores: the 'mixed' distribution used 4-way k-point distribution rather than the 6-way distribution applied by the default (no request); compare columns 2 and 3 with 8 and 9.

For the small clay model system, the optimal efficiency was obtained using G-vector data distribution over 6 cores (852 core-seconds) and the least efficient choice was mixed data distribution over 6 cores (1584 core-seconds). These results are system-specific and need careful testing to tailor to different systems.
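Each strategy in Table 7 corresponds to a single param file line, e.g.:

```
! one of: kpoint, gvector, mixed (omit the line entirely for the default behaviour)
data_distribution : gvector
```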
Number of tasks per node
This is invoked by adding 'num_proc_in_smp' to the param file, and controls the number of message passing interface (MPI) tasks placed in a specific OpenMP (SMP) group. This means that the 'all-to-all' communication is done in three phases instead of one: (1) tasks within an SMP group collect their data together on a chosen 'controller' task within their group; (2) the all-to-all is done between the controller tasks; (3) the controllers all distribute the data back to the tasks in their SMP groups. For small core counts the overhead of the two extra phases makes this method slower than just doing an all-to-all; for large core counts the reduction in the all-to-all time more than compensates for the extra overhead, so it's faster. Indeed, the tests (shown in Table 8) reveal that invoking this flag fails to produce as large a speed-up as the flag 'data_distribution : gvector' (compare columns 3 and 9) on the test HPC cluster, Sunbird, reflecting the small core count requested. Generally speaking, the more cores in the G-vector group, the higher you want to set 'num_proc_in_smp' (up to the physical number of cores on a node).
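As a sketch, the two flags discussed in this and the previous subsection might be combined in the param file as follows (the value 4 is purely illustrative; tune it towards the physical cores per node as the G-vector group grows):

```
data_distribution : gvector
num_proc_in_smp   : 4        ! MPI tasks per SMP group
```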
Column                                                     | 2            | 3            | 4            | 5            | 6             | 7             | 8            | 9
Requested data distribution + # cores in submission script | None, 5      | None, 6      | Kpoints, 5   | Kpoints, 6   | Gvector, 5    | Gvector, 6    | Mixed, 5     | Mixed, 6
Actual data distribution                                   | kpoint 4-way | kpoint 6-way | kpoint 5-way | kpoint 6-way | Gvector 5-way | Gvector 6-way | kpoint 4-way | kpoint 4-way
Memory/process (MB)                                        | 1249         | 1219         | 1249         | 1219         | 728           | 698           | 1249         | 1253
Peak memory use (MB)                                       | 1581         | 1561         | 1581         | 1561         | 839           | 804           | 1581         | 1585
Total time (secs)                                          | 295          | 199          | 292          | 226          | 191           | 142           | 294          | 264
Overall parallel efficiency (%)(a)                         | 99           | 96           | 98           | 96           | 66            | 71            | 98           | 96
Relative total energy (# cores x total time, core-seconds) | 1475         | 1194         | 1460         | 1356         | 955           | 852           | 1470         | 1584

Table 7: Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, showing the effects of data distribution across different numbers of cores requested in the script file. 'Actual data distribution' means that reported by CASTEP on completion, in this and (where applicable) all following Tables. 'Relative total energy' assumes that each core requested by the script consumes X amount of electricity. (a) Calculated automatically by CASTEP.
Column                             | 2            | 3             | 4            | 5             | 6            | 7             | 8            | 9
num_proc_in_smp                    | Default      | Default       | 2            | 2             | 4            | 4             | 5            | 5
Requested data_distribution        | None         | Gvector       | None         | Gvector       | None         | Gvector       | None         | Gvector
Actual data distribution           | kpoint 4-way | Gvector 5-way | kpoint 4-way | Gvector 5-way | kpoint 4-way | Gvector 5-way | kpoint 4-way | Gvector 5-way
Memory/process (MB)                | 1249         | 728           | 1249         | 728           | 1249         | 728           | 1249         | 728
Peak memory use (MB)               | 1580         | 837           | 1581         | 839           | 1581         | 844           | 1581         | 846
Total time (secs)                  | 222          | 156           | 231          | 171           | 230          | 182           | 237          | 183
Overall parallel efficiency (%)(a) | 96           | 66            | 98           | 60            | 98           | 56            | 96           | 56

Table 8: Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, run on 5 cores, showing the effects of setting 'num_proc_in_smp' to 2, 4, and 5, both with and without the 'data_distribution : gvector' flag. 'Default' means 'num_proc_in_smp' absent from the param file. (a) Calculated automatically by CASTEP.
Optimization strategy
This parameter has three settings, and is invoked through the 'opt_strategy' flag in the param file:

Default - Balances speed and memory use. Wavefunction coefficients for all k-points in a calculation will be kept in memory rather than paged to disk; some large work arrays will be paged to disk.
Memory - Minimizes memory use. All wavefunctions and large work arrays are paged to disk.
Speed - Maximizes speed by not paging to disk.

This means that if a user runs a large-memory calculation, optimizing for memory could obviate the need to request additional cores, although the calculation will take longer; see Table 9 for comparisons.
opt_strategy                       | Default | Memory | Speed
Memory/process (MB)                | 793     | 750    | 1249
Peak memory use (MB)               | 1566    | 1092   | 1581
Total time (secs)                  | 232     | 290    | 221
Overall parallel efficiency (%)(a) | 94      | 97     | 96

Table 9: Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, run on 5 cores, showing the effects of optimizing for speed or memory. 'Default' means either omitting the 'opt_strategy' flag from the param file or adding it as 'opt_strategy : default'. (a) Calculated automatically by CASTEP.
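The corresponding param file entry is a single line, e.g.:

```
! one of: default, memory, speed
opt_strategy : speed
```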
Spin polarization
If a system comprises an odd number of electrons, it might be important to differentiate between the spin-up and spin-down states of the odd electron. This directly affects the calculation time, effectively doubling it, as shown in Table 10.
param flag: spin_polarization      | false | true
Memory/process (MB)                | 1249  | 1415
Peak memory use (MB)               | 1581  | 1710
Total time (secs)                  | 222   | 455
Overall parallel efficiency (%)(a) | 96    | 98

Table 10: Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, run on 5 cores, showing the effects of spin polarization. (a) Calculated automatically by CASTEP.
Electronic energy minimizer
Insulating systems often behave well during the self-consistent field (SCF) minimizations and converge smoothly using density mixing ('DM'). When SCF convergence is problematic and all attempts to tweak DM-related parameters have failed, it is necessary to turn to ensemble density functional theory [7] and accept the consequent (and considerable) increase in computational cost; see Table 11.
param flag: metals_method (electron minimization) | DM   | EDFT
Memory/process (MB)                               | 1249 | 1289
Peak memory use (MB)                              | 1581 | 1650
Total time (secs)                                 | 222  | 370
Overall parallel efficiency (%)(a)                | 96   | 97

Table 11: Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, run on 5 cores, showing the effects of the electronic minimization method. 'DM' means density mixing and 'EDFT' ensemble density functional theory. (a) Calculated automatically by CASTEP.
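Both settings are single param file lines. A sketch follows; the keyword spellings reflect common CASTEP usage (Table 10 calls the first 'spin_polarization') and should be checked against your version's documentation:

```
spin_polarized : true    ! needed to resolve an odd electron count
metals_method  : dm      ! density mixing; switch to edft only when DM will not converge
```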
C Script submission file
Figure 5 An example HPC batch submission script
Figure 5 captures the script variables that affect HPC computational energy and usage efficiency:

(i) The variable familiar to most HPC users describes the number of cores ('tasks') requested for the simulation. Unless the calculation is memory hungry, configure the requested number of cores to sit on the fewest nodes, because this reduces expensive node-to-node communication time.
(ii) Choosing the shortest realistic job run time gives the calculation a better chance of progressing through the job queue swiftly.
(iii) When not requesting use of all cores on a single node, remove the 'exclusive' flag to accelerate progress through the job queue.
(iv) Using the most recent version of the software captures the latest upgrades and bug-fixes that might otherwise slow down a calculation.
(v) Using the '--dryrun' tag provides a (very) broad estimate of the memory requirements. In one example the estimate of peak memory use was ¼ of that actually used during the simulation proper.
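Figure 5 itself is not reproduced here, but a SLURM script embodying points (i)-(iv) might look like the following sketch; the job, module and seed names are assumptions to adapt to your own cluster:

```
#!/bin/bash
#SBATCH --job-name=clay_spe
#SBATCH --ntasks=5           # (i) modest core count, placed on as few nodes as possible
#SBATCH --nodes=1
#SBATCH --time=00:30:00      # (ii) shortest realistic run time
# (iii) no --exclusive flag, since the whole node is not needed

module load castep/latest    # (iv) most recent installed version (module name is an example)

# (v) optional pre-check of memory requirements:  castep --dryrun mineral
srun castep.mpi mineral      # reads mineral.cell and mineral.param
```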
D An (extreme) example
Clay mineral (Figure 2)                             | Careful: optimised for energy efficiency | Careless: no optimisation for energy efficiency
Vacuum space (Å)                                    | 10                 | 10
Pseudopotential, cut-off energy (eV)                | Ultrasoft, 370     | OTFG-Ultrasoft, 599
K-points                                            | 3                  | 12
grid_scale, fine_grid_scale                         | 2, 3               | 3, 4
num_proc_in_smp / requested data distribution       | default / Gvector  | 20 / none
Actual data distribution                            | 5-way Gvector only | 3-way Gvector, 12-way kpoint, 3-way (Gvector) smp
Optimization strategy                               | Speed              | Default
Spin polarization                                   | False              | True
Electronic energy minimizer                         | Density mixing     | EDFT
Number of cores requested                           | 5                  | 40
RESULTS
Memory/process (MB), scratch disk (MB)              | 834, 0             | 1461, 6518
Peak memory use (MB)                                | 1066               | 9107
Total time (seconds)                                | 215                | 45302
Overall parallel efficiency (%)(a)                  | 69                 | 96
Relative total energy (core-seconds; core-hours)    | 1075; 0.30         | 1,812,080; 503.36
kiloJoules used (approx.)                           | 202                | 52,000

Table 12: One clay mineral model (Figure 2) with vacuum space of 10 Å. Single point energy calculations showing the difference between carefully optimizing for energy efficiency and carelessly running without pre-testing. (a) Calculated automatically by CASTEP.
Table 12 illustrates the combined effects on the total time and overall use of computational resources of many of the model properties and parameters discussed in the previous section. It's unlikely a user would choose the whole combination of model properties and parameters shown in the 'careless' column, but it nevertheless gives an idea of the impact a user can have on the energy consumption of their simulations. For comparison, the cheapest electric car listed in 2021 consumes 26.8 kWh per 100 miles, or 603 kJ/km, which means that the carelessly run simulation used the equivalent energy of driving this car about 86 km, whereas the efficiently run simulation 'drove' it 0.33 km.
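The 'kiloJoules used' row and the car comparison follow from simple arithmetic once an effective power per core is assumed. In the sketch below, the 188 W figure is back-calculated from the careful run in Table 12 (202 kJ over 1075 core-seconds); it is an assumption about that particular cluster, not a general constant:

```python
# Rough job-energy estimate, E = N * P * T, plus the electric-car comparison.

CAR_KJ_PER_KM = 603.0    # cheapest 2021 electric car, from the text above
WATTS_PER_CORE = 188.0   # assumed effective power draw per requested core

def job_energy_kj(n_cores, time_s, watts_per_core=WATTS_PER_CORE):
    """Energy in kilojoules for a job on n_cores lasting time_s seconds."""
    return n_cores * time_s * watts_per_core / 1000.0

def equivalent_car_km(energy_kj):
    """Distance the reference electric car could travel on the same energy."""
    return energy_kj / CAR_KJ_PER_KM

careful = job_energy_kj(5, 215)   # ~202 kJ, the 'careful' run of Table 12
print(careful, equivalent_car_km(careful))
```

Swapping in your own cluster's measured per-core power (e.g. from SLURM's energy accounting, Section 4) makes this a quick ballpark estimator for funding applications.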
For computational scientists and modellers, applying good energy efficiency practices needs to become second nature; following an energy efficiency 'recipe' or procedure is a route to embedding this practice as a habit.
3 Developing energy efficient computing habits: A recipe
1) Build a model of a system that contains only the essential ingredients that allow exploration of the scientific question. This is one of the key factors that determine the size of a model.
2) Find out how many cores per node there are on the available HPC cluster. This enables users to request the number of cores/tasks that minimizes inter-node communication during a simulation.
3) Choose the pseudopotentials to match the science. This ensures users don't use pseudopotentials that are unnecessarily computationally expensive.
4) Carry out extensive convergence testing based on the minimum accuracy required for the production run results, e.g.:
(i) kinetic energy cut-off (depends on pseudopotential choice);
(ii) grid scale and fine grid scale (depend on pseudopotential choice);
(iii) size and orientation of the model, including e.g. number of bulk atoms, number of layers, size of surface vacuum space, etc.;
(iv) number of k-points.
These decrease the possibility of over-convergence and its associated computational cost.
5) Spend time optimising the param file properties described in Section B, using a small number of SCF cycles:
a. Data distribution: Gvector, k-points or mixed
b. Number of tasks per node
c. Optimization strategy
d. Spin polarization
e. Electronic energy (SCF) minimization method
This increases the chances of using resources efficiently, due to matching the model and material requirements to the simulation parameters.
6) Optimise the script file. This increases the efficient use of HPC resources.
7) Submit the calculation and initially monitor it to check it's progressing as expected. This reduces the chances of wasting computational time due to trivial ('Friday afternoon') mistakes.
8) Carry out your own energy efficient computing tests (and send your findings to the JISCMAIL CASTEP mailing list).
9) Sit back and wait for the simulation to complete, basking in the knowledge that the simulation is running as energy efficiently (see footnote 1) as a user can possibly make it.
4 What else can a user do?
In addition to using the above recipe to embed energy-efficient computing habits, a user can take a number of actions to encourage wider awareness and adoption of energy efficient computing:

a. If the HPC cluster uses SLURM, use the 'sacct' command to check the amount of energy consumed (in Joules) by a job (see footnote 2), as shown in Figure 6.
b. If your local cluster uses a different job-scheduler, ask your local IT helpdesk whether it has the facility to monitor the energy consumed by each HPC job.
c. Include the energy consumption of simulations in all forms of reports and presentations, e.g. informal talks, posters, peer-reviewed journal articles, social media posts, etc. This will increase awareness of our role as environmentally aware and conscientious computational scientists and users of HPC resources.
Footnote 1: It's highly probable that users can expand on the list of model properties and parameters described within this document to further optimise energy efficient computing.
Footnote 2: 'Note: Only in case of exclusive job allocation this value reflects the job's real energy consumption' - see https://slurm.schedmd.com/sacct.html
Figure 6: Examples of information about jobs output through SLURM's 'sacct' command (plus flags). Top: list of details about several jobs run from 20/03/2021; bottom: details for a specific job ID via the 'seff <jobID>' command.
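For example, the kind of query behind Figure 6 can be run as follows (the date and job ID are placeholders; the ConsumedEnergy field is only populated when SLURM's energy accounting plugin is enabled, and per the note above is only reliable for exclusive allocations):

```
# details of jobs run since a given date, including recorded energy use
sacct --starttime=2021-03-20 --format=JobID,JobName,Elapsed,NNodes,ConsumedEnergy

# efficiency summary for a single completed job
seff 123456
```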
d. Include estimates of the energy consumption of simulations in applications for funding. Although not yet explicitly requested in EPSRC funding applications, there is the expectation that UKRI's 2020 commitment to Environmental Sustainability will filter down to all activities of its research councils, including funding. This will mean that funding applicants will need to demonstrate their awareness of the environmental impact of their proposed work. Become an impressive pioneer and include environmental impact through energy consumption in your next application.
5 What are the developers doing?
The compilation of this document included a chat with several of the developers of CASTEP, who are keen to help users run their software energy efficiently; they shared their plans and projects in this field.
Parts of CASTEP have been programmed to run on GPUs, with up to a 15-fold speed-up (for non-local functionals).

Work on a CASTEP simulator is underway that should reduce the number of CASTEP calculations required per simulation by choosing an optimal parallel domain decomposition and implementing timings for FFTs (the big parallel cost); it will also estimate compute usage. This simulator will go a long way to providing the structure needed to add energy efficiency to CASTEP, and will be accessible through the '--dryrun' command. The toy code is available in Bitbucket.
The developers recognise the need for energy consumption to be acknowledged as an additional factor to be included in the cost of computational simulations. They are planning their approach beyond the software itself, such as including energy efficient computing in their training courses.
Acknowledgements
I acknowledge the support of the Supercomputing Wales project which is part-funded by the
European Regional Development Fund (ERDF) via Welsh Government
Thank you to the following CASTEP developers for their invaluable input and support for this small project: Dr Phil Hasnip and Prof Matt Probert (University of York), Prof Chris Pickard (University of Cambridge), Dr Dominik Jochym (STFC), and Prof Stewart Clark (University of Durham). Thanks also to Dr Sue Thorne (STFC) and Dr Ed Bennett (Supercomputing Wales) for sharing their research engineering perspectives.
References
[1] Clark, S. J.; Segall, M. D.; Pickard, C. J.; Hasnip, P. J.; Probert, M. I. J.; Refson, K.; Payne, M. C. First Principles Methods Using CASTEP. Z. Krist. 2005, 220, 567-570.
[2] Vanderbilt, D. Soft Self-Consistent Pseudopotentials in a Generalized Eigenvalue Formalism. Phys. Rev. B 1990, 41, 7892-7895.
[3] Pickard, C. J. On-the-Fly Pseudopotential Generation in CASTEP. 2006.
[4] Refson, K.; Clark, S. J.; Tulip, P. Variational Density Functional Perturbation Theory for Dielectrics and Lattice Dynamics. Phys. Rev. B 2006, 73, 155114.
[5] Hamann, D. R.; Schlüter, M.; Chiang, C. Norm-Conserving Pseudopotentials. Phys. Rev. Lett. 1979, 43 (20), 1494-1497.
[6] BIOVIA, Dassault Systèmes. Materials Studio 2020; Dassault Systèmes: San Diego, 2019.
[7] Marzari, N.; Vanderbilt, D.; Payne, M. C. Ensemble Density Functional Theory for Ab Initio Molecular Dynamics of Metals and Finite-Temperature Insulators. Phys. Rev. Lett. 1997, 79, 1337-1340.
As a user you might be thinking lsquobut what effect can I have when surely the HPC architecture
is responsible for energy usagersquo and lsquothen therersquos the code itself which should be as
efficient as possible but if itrsquos not I canrsquot do anything about thatrsquo Both of these thoughts are
grounded in truth the HPC architecture is fixed - but we can use it efficiently the software
wersquore using is structurally fixed ndash but we can run it efficiently
The energy cost (E) of a simulation is the total power per core (P) consumed over the length
of time (T ) of the simulation which for parallelised simulations run on (N) cores is 119864 = 119873119875119879
From this it is logical to think that reducing N P andor T will reduce E which is theoretically
true Practically though letrsquos assume that the power consumed by each core is a fixed
property of the HPC architecture we now have 119864 prop 119873119879 This effectively encapsulates where
we as users of HPC can control the amount of energy our simulations consume and seems
simple All we need to do is learn how to optimize the number of cores and the length of time
of our simulations
We use multiple cores to share the memory load and to speed up a calculation, giving us three calculation properties to optimise: number of cores, memory per core, and time. To reduce the calculation time we might first increase the number of cores. Many users might already know that the relationship between core count and calculation time is non-linear, thanks to the required increase in core-to-core and node-to-node communication time. Taking the latter into account, the total energy used is E = NT + f(N, T), where f(N, T) captures the energy cost of the core-core/node-node communication time.
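The trade-off encoded in E = NT + f(N, T) can be sketched numerically. The model below is purely illustrative: it assumes an Amdahl-style runtime T(N) and an invented communication term, not CASTEP measurements, to show that wall-time can fall while total energy rises.

```python
# Illustrative only: a hypothetical Amdahl-style runtime T(N) and an invented
# communication term f(N, T), to show that wall-time can fall while the
# total energy E = N*T + f(N, T) rises.

def runtime(n_cores, t_serial=1000.0, parallel_fraction=0.9):
    """Hypothetical Amdahl's-law wall-time for a run on n_cores."""
    return t_serial * ((1.0 - parallel_fraction) + parallel_fraction / n_cores)

def energy(n_cores, comm_cost=0.5):
    """Relative energy: cores x time, plus a toy communication term f(N, T)."""
    t = runtime(n_cores)
    return n_cores * t + comm_cost * (n_cores - 1) * t

if __name__ == "__main__":
    for n in (1, 2, 5, 10, 20):
        print(f"{n:2d} cores: T = {runtime(n):6.1f}, E = {energy(n):7.1f}")
```

In this toy model the 20-core run is roughly seven times faster than serial but consumes about four times the energy; real behaviour is system- and cluster-specific.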
To optimise energy efficiency, any speed-up in calculation time gained by increasing the number of cores needs to balance the increased energy cost of using additional cores. Therefore the speed-up factor needs to be more than the factor by which the number of cores increases, as shown in the equations below for a 2-core vs serial example:

  E_S = T_S (with f(T_S) = 0)        Energy of serial (i.e. 1-core) calculation
  E_2N = 2T_2N + f(2, T_2N)          Energy of 2-core calculation
  E_2N ≤ E_S                         For the energy cost of using 2 cores to be no greater
                                     than the energy cost of the serial calculation
  2T_2N + f(2, T_2N) ≤ T_S,  i.e.  T_2N + (1/2)f(2, T_2N) ≤ (1/2)T_S

which means that the total calculation time using 2 cores needs to be less than half of the serial time. So, for users to run simulations efficiently in parallel, they need to balance the number of cores, the associated memory load per core, and the total calculation time. The following section shows how some of the more commonly used parameters within CASTEP affect these three properties.
NB: The main purpose of the following examples is to illustrate the impact of different user-choices on the total energy cost of simulations. These examples do not indicate the level of 'accuracy' attained, because 'accuracy' is determined by the user according to the type, contents and aims of their simulations.
2 Reducing the energy used by your simulation
This section uses an example of a small model of a clay mineral (and later a carbon nanotube) to illustrate how a user can change the total energy their simulation uses by a judicious choice of CASTEP input parameters.

Figure 1: Unit cell of a generic silicate clay mineral comprising 41 atoms.
A Cell file
Pseudopotentials
Choose the pseudopotential according to the type of simulation: e.g. for simulations of cell structures, ultrasofts2 are often sufficient, although if the pseudopotential library does not contain an ultrasoft version for a particular element, the on-the-fly-generated (OTFG) ultrasofts3 might suffice. If a user is running a spectroscopic simulation, such as infrared using density functional perturbation theory4, then norm-conserving5 or OTFG norm-conserving3 could be the better choices. The impact of pseudopotential type on the computational cost is shown in Table 1 through the total (calculation) time.
Type of pseudopotential   Ultrasoft   Norm-conserving   OTFG Ultrasoft   OTFG Ultrasoft QC5 set(b)   OTFG Norm-conserving
Cut-off energy (eV)       370         900               598              340                         925
# cores(a)                5           5                 5                5                           5
Memory/process (MB)       666         681               2072             1007                        681
Peak memory use (MB)      777         802               2785             1590                        791
Total time (secs)         55          89                250              109                         136
Table 1: Pseudopotential and size of planewave set required on the 'fine' setting of Materials Studio 2020,6 with an example of the memory requirements and time required for a single point energy calculation using the recorded number of cores on a single node. Unless otherwise stated, the same cut-off energy per type of pseudopotential is implied throughout this document. (a) Using Sunbird (CPU: 2x Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz with 20 cores each); unless stated otherwise, all calculations were performed on this HPC cluster. (b) Designed to be used at the same modest (340 eV) kinetic energy cut-off across the periodic table; ideal for moderate-accuracy, high-throughput calculations, e.g. ab initio random structure searching (AIRSS).
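In the cell file, the pseudopotential choice can be made explicit in a SPECIES_POT block; a hedged sketch is below. The file names are placeholders, and the exact library names depend on your CASTEP installation; omitting the block entirely makes CASTEP generate OTFG potentials on the fly.

```
%BLOCK SPECIES_POT
Si  Si_00PBE.usp    ! placeholder ultrasoft potential file
O   O_00PBE.usp     ! swap for norm-conserving files for e.g. DFPT runs
%ENDBLOCK SPECIES_POT
```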
K-points
Changing the number of Brillouin zone sampling points can have a dramatic effect on computational time, as shown in Table 2. Bear in mind that increasing the number of k-points increases the memory requirements, often tempting users to increase the number of cores, further increasing overall computational cost. Remember, though: it's important to use the number of k-points that provides the level of accuracy your simulations need.
Type of pseudopotential   Ultrasoft                              OTFG Norm-conserving
kpoints_mp_grid           2 1 1 (1)   3 2 1 (3)   4 3 2 (12)     2 1 1 (1)   3 2 1 (3)   4 3 2 (12)
Memory/process (MB)       652         666         1249           630         681         1287
Peak memory use (MB)      768         777         1580           764         791         1296
Total time (secs)         32          55          222            85          136         477
Table 2: Single point energy calculations run on 5 cores using different numbers of k-points (in brackets), showing the effects for different pseudopotentials.
Vacuum space
When building a material surface it is necessary to add vacuum space to a cell (see Figure 2
for example) and this adds to the memory requirements and calculation time because the
lsquoempty spacersquo (as well as the atoms) is lsquofilledrsquo by planewaves Table 3 shows that doubling
the volume of vacuum space doubles the total calculation time (using the same number of
cores)
Vacuum space (Å)                 0      5      10     20
Memory/process (MB)              666    766    834    1078
Peak memory use (MB)             777    928    1066   1372
Total time (secs)                55     102    202    406
Overall parallel efficiency(a)   69%    66%    67%    61%
Figure 2: Vacuum space added to create a clay mineral surface (to study adsorbate-surface interactions, for example; adsorbate not included in the above).
Table 3: Single point energy calculations using ultrasoft pseudopotentials and 3 k-points, run on 5 cores, showing the effects of vacuum space. (a) Calculated automatically by CASTEP.
Supercell size
The size of a system is one of the more obvious choices that affects the demands on computational resources; nevertheless it is interesting to see (from Table 4) that, for the same number of k-points, doubling the number of atoms increases the memory load per process by between 35% (41 to 82 atoms) and 72% (82 to 164 atoms), and the corresponding calculation times increase by factors of 11 and 8.5 respectively. In good practice the number of k-points is scaled according to the supercell size, increasing the computational cost more modestly.
Supercell size (# atoms)         1 x 1 x 1 (41)   2 x 1 x 1 (82)             2 x 2 x 1 (164)
Kpoints (mp grid)                3 2 1 (3)        3 2 1 (3)   2 1 1 (1)*     3 2 1 (3)   2 1 1 (1)*
Memory/process (MB)              666              897         732            1547        1315
Peak memory use (MB)             777              1175        1025           2330        2177
Total time (secs)                55               631         329            5416        1660
Overall parallel efficiency(a)   69%              69%         74%            67%         72%

*K-points scaled for supercells 2x1x1 and 2x2x1.

Table 4: Single point energy calculations using ultrasoft pseudopotentials, run on 5 cores, showing the effects of supercells. (a) Calculated automatically by CASTEP.
Figure 3: Example of a 2 x 2 x 1 supercell.
Orientation of axes
This might be one of the more surprising and unexpected properties of a model that affects
computational efficiency The effect becomes significant when a system is large
disproportionately longer along one of its lengths and is misaligned with the x- y- z-axes
see Figure 4 and Table 5 for exaggerated examples of misalignment This effect is due to
the way CASTEP transforms real-space properties between real-space and reciprocal-
space it converts the 3-d fast Fourier transforms (FFT) to three 1-d FFT columns that lie
parallel to the x- y z-axes
Figure 4: Top row: a capped carbon nanotube (160 atoms); bottom row: a long carbon nanotube (1000 atoms), showing long axes aligned in the x-direction (left), z-direction (middle), and skewed (right).
Orientation (# atoms)            X (160)   Z (160)   Skewed (160)   X (1000)   Z (1000)   Skewed (1000)
Cores                            5         5         5              60         60         60
Memory/process (MB)              884       882       882            2870       2870       2870
Peak memory use (MB)             1893      1885      1838           7077       7077       7077
Total time (secs)                392       359       409            3906       3908       5232
Overall parallel efficiency(a)   79%       84%       82%            78%        78%        75%
Relative total energy
(# cores x total time,
core-seconds)                    1960      1795      2045           234360     234480     313920
Table 5: Single point energy calculations of carbon nanotubes as oriented in Fig. 4, using ultrasoft pseudopotentials (280 eV cut-off energy) and 1 k-point. (a) Calculated automatically by CASTEP.
B Param file
Grid-scale
Although the ultrasofts require a smaller planewave basis set than the norm-conserving pseudopotentials, they do need a finer electron density grid, set via 'grid_scale' and 'fine_grid_scale'. As shown in Table 6, the denser grid scale setting for the OTFG ultrasofts (with the exception of the QC5 set) can almost double the calculation time compared with the more planewave-hungry OTFG norm-conserving pseudopotentials, which converge well on a less dense grid.
Type of pseudopotential        Norm-conserving        Ultrasoft   OTFG Norm-conserving   OTFG Ultrasoft   OTFG Ultrasoft QC5 set
grid_scale / fine_grid_scale   1.5/1.75   2.0/3.0     2.0/3.0     1.5/1.75   2.0/3.0     2.0/3.0          2.0/3.0
Memory/process (MB)            792        681         666         680        731         2072             1007
Peak memory use (MB)           803        1070        777         791        956         2785             1590
Total time (secs)              89         150         55          136        221         250              109
Table 6: Single point energy calculations run on 5 cores, showing the effects of different electron density grid settings.
Data Distribution
Parallelizing over plane wave vectors ('G-vectors'), k-points, or a mix of the two has an impact on computational efficiency, as shown in Table 7.

The default for a param file without the keyword 'data_distribution' is to prioritize k-point distribution across a number of cores (less than or equal to the number requested in the submission script) that is a factor of the number of k-points; see for example Table 7, columns 2 and 3. Inserting 'data_distribution : kpoint' into the param file prioritizes and optimizes the k-point distribution across the number of cores requested in the script. In the example tested, selecting data distribution over k-points increased the calculation time over the default of no data distribution; compare columns 3 and 5 of Table 7.

Requesting G-vector distribution has the largest impact on calculation time, and combining this with requesting a number of cores that is also a factor of the number of k-points has the overall largest impact on reducing calculation time; see columns 6 and 7 of Table 7.

Requesting mixed data distribution has a similar impact on calculation time as not requesting any data distribution for 5 cores, but not for 6 cores: the 'mixed' distribution used 4-way k-point distribution rather than the 6-way distribution applied by the default (no request); compare columns 2 and 3 with 8 and 9.

For the small clay model system the optimal efficiency was obtained using G-vector data distribution over 6 cores (852 core-seconds), and the least efficient choice was mixed data distribution over 6 cores (1584 core-seconds). These results are system-specific and need careful testing to tailor to different systems.
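In the param file this is a one-line setting; a sketch using the keyword spelling above (the colon separator is optional in CASTEP input):

```
data_distribution : gvector    ! or: kpoint, mixed
```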
Number of tasks per node
This is invoked by adding 'num_proc_in_smp' to the param file and controls the number of message passing interface (MPI) tasks that are placed in a shared-memory (SMP) group. This means that the 'all-to-all' communications are then done in three phases instead of one: (1) tasks within an SMP group collect their data together on a chosen 'controller' task within their group; (2) the all-to-all is done between the controller tasks; (3) the controllers all distribute the data back to the tasks in their SMP groups. For small core counts the overhead of the two extra phases makes this method slower than just doing an all-to-all; for large core counts the reduction in the all-to-all time more than compensates for the extra overhead, so it's faster. Indeed, the tests (shown in Table 8) reveal that invoking this flag fails to produce as large a speed-up as the flag 'data_distribution : gvector' (compare columns 3 and 9) on the test HPC cluster, Sunbird, reflecting the requested small core count. Generally speaking, the more cores in the G-vector group, the higher you want to set 'num_proc_in_smp' (up to the physical number of cores on a node).
Column                                 2         3         4         5         6         7         8         9
Requested data distribution + cores    None      None      Kpoints   Kpoints   Gvector   Gvector   Mixed     Mixed
in HPC submission script               5 cores   6 cores   5 cores   6 cores   5 cores   6 cores   5 cores   6 cores
Actual data distribution               kpoint    kpoint    kpoint    kpoint    Gvector   Gvector   kpoint    kpoint
                                       4-way     6-way     5-way     6-way     5-way     6-way     4-way     4-way
Memory/process (MB)                    1249      1219      1249      1219      728       698       1249      1253
Peak memory use (MB)                   1581      1561      1581      1561      839       804       1581      1585
Total time (secs)                      295       199       292       226       191       142       294       264
Overall parallel efficiency(a)         99%       96%       98%       96%       66%       71%       98%       96%
Relative total energy
(# cores x total time, core-seconds)   1475      1194      1460      1356      955       852       1470      1584
Table 7: Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, showing the effects of data distribution across different numbers of cores requested in the script file. 'Actual data distribution' means that reported by CASTEP on completion, in this and (where applicable) all following Tables. 'Relative total energy' assumes that each core requested by the script consumes X amount of electricity. (a) Calculated automatically by CASTEP.
Column                        2         3         4         5         6         7         8         9
num_proc_in_smp               Default   Default   2         2         4         4         5         5
Requested data_distribution   None      Gvector   None      Gvector   None      Gvector   None      Gvector
Actual data distribution      kpoint    Gvector   kpoint    Gvector   kpoint    Gvector   kpoint    Gvector
                              4-way     5-way     4-way     5-way     4-way     5-way     4-way     5-way
Memory/process (MB)           1249      728       1249      728       1249      728       1249      728
Peak memory use (MB)          1580      837       1581      839       1581      844       1581      846
Total time (secs)             222       156       231       171       230       182       237       183
Overall parallel eff.(a)      96%       66%       98%       60%       98%       56%       96%       56%
Table 8: Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, run on 5 cores, showing the effects of setting 'num_proc_in_smp' to 2, 4 and 5, both with and without the 'data_distribution : gvector' flag. 'Default' means 'num_proc_in_smp' absent from the param file. (a) Calculated automatically by CASTEP.
Optimization strategy
This parameter has three settings and is invoked through the 'opt_strategy' flag in the param file:

Default - Balances speed and memory use. Wavefunction coefficients for all k-points in a calculation will be kept in memory rather than be paged to disk. Some large work arrays will be paged to disk.

Memory - Minimizes memory use. All wavefunctions and large work arrays are paged to disk.

Speed - Maximizes speed by not paging to disk.

This means that if a user runs a large-memory calculation, optimizing for memory could obviate the need to request additional cores, although the calculation will take longer; see Table 9 for comparisons.
opt_strategy                     Default   Memory   Speed
Memory/process (MB)              793       750      1249
Peak memory use (MB)             1566      1092     1581
Total time (secs)                232       290      221
Overall parallel efficiency(a)   94%       97%      96%
Table 9: Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, run on 5 cores, showing the effects of optimizing for speed or memory. 'Default' means either omitting the 'opt_strategy' flag from the param file or adding it as 'opt_strategy : default'. (a) Calculated automatically by CASTEP.
Spin polarization
If a system comprises an odd number of electrons, it might be important to differentiate between the spin-up and spin-down states of the odd electron. This directly affects the calculation time, effectively doubling it, as shown in Table 10.
spin_polarization                false   true
Memory/process (MB)              1249    1415
Peak memory use (MB)             1581    1710
Total time (secs)                222     455
Overall parallel efficiency(a)   96%     98%
Table 10: Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, run on 5 cores, showing the effects of spin polarization. (a) Calculated automatically by CASTEP.
Electronic energy minimizer
Insulating systems often behave well during the self-consistent field (SCF) minimization and converge smoothly using density mixing ('DM'). When SCF convergence is problematic and all attempts to tweak DM-related parameters have failed, it is necessary to turn to ensemble density functional theory7 (EDFT) and accept the consequent (and considerable) increase in computational cost; see Table 11.
metals_method (electron minimization)   DM     EDFT
Memory/process (MB)                     1249   1289
Peak memory use (MB)                    1581   1650
Total time (secs)                       222    370
Overall parallel efficiency(a)          96%    97%
Table 11: Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, run on 5 cores, showing the effects of the electronic minimization method. 'DM' means density mixing and 'EDFT' ensemble density functional theory. (a) Calculated automatically by CASTEP.
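Pulling the param-file settings of this section together, an energy-conscious starting point for a well-behaved insulating system might look like the sketch below. The values are those that performed well in the tests above, not universal recommendations; always convergence-test for your own system.

```
grid_scale        : 2.0
fine_grid_scale   : 3.0
data_distribution : gvector
opt_strategy      : speed
spin_polarization : false
metals_method     : dm       ! switch to edft only if density mixing fails
```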
C Script submission file
Figure 5: An example HPC batch submission script.

Figure 5 captures the script variables that affect HPC computational energy and usage efficiency:

(i) The variable familiar to most HPC users describes the number of cores ('tasks') requested for the simulation. Unless the calculation is memory-hungry, configure the requested number of cores to sit on the fewest nodes, because this reduces expensive node-to-node communication time.
(ii) Choosing the shortest job run time gives the calculation a better chance of progressing through the job queue swiftly.
(iii) When not requesting use of all cores on a single node, remove the 'exclusive' flag to accelerate progress through the job queue.
(iv) Using the most recent version of the software captures the latest upgrades and bug-fixes that might otherwise slow down a calculation.
(v) Using the '--dryrun' tag provides a (very) broad estimate of the memory requirements. In one example the estimate of peak memory use was a quarter of that actually used during the simulation proper.
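Figure 5 appears as an image in the original; a representative SLURM script along the same lines is sketched below. The module name, CASTEP binary name and seedname are site-specific assumptions to adapt to your own cluster.

```
#!/bin/bash
#SBATCH --job-name=clay_spe
#SBATCH --ntasks=5             # (i) cores: fit on the fewest nodes possible
#SBATCH --nodes=1
#SBATCH --time=00:30:00        # (ii) shortest realistic run time
##SBATCH --exclusive           # (iii) leave disabled unless using whole nodes

module load castep/21.11       # (iv) most recent version available (site-specific)

# (v) optional: estimate memory first with CASTEP's dryrun mode
# castep.mpi --dryrun mymodel

srun castep.mpi mymodel        # reads mymodel.cell and mymodel.param
```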
D An (extreme) example
Clay mineral (Figure 2)                   Careful (optimised for      Careless (no optimisation
                                          energy efficiency)          for energy efficiency)
Vacuum space                              10 Å                        10 Å
Pseudopotential and cut-off energy (eV)   Ultrasoft, 370              OTFG Ultrasoft, 599
K-points                                  3                           12
grid_scale / fine_grid_scale              2 / 3                       3 / 4
num_proc_in_smp / requested
data distribution                         default / Gvector           20 / none
Actual data distribution                  5-way Gvector only          3-way Gvector, 12-way kpoint,
                                                                      3-way (Gvector) SMP
Optimization strategy                     Speed                       Default
Spin polarization                         False                       True
Electronic energy minimizer               Density mixing              EDFT
Number of cores requested                 5                           40

RESULTS
Memory/process (MB) / scratch disk (MB)   834 / 0                     1461 / 6518
Peak memory use (MB)                      1066                        9107
Total time (seconds)                      215                         45302
Overall parallel efficiency(a)            69%                         96%
Relative total energy (# cores x total
time): core-seconds / core-hours          1075 / 0.30                 1,812,080 / 503.36
kiloJoules used (approx.)                 202                         52000

Table 12: One clay mineral model (Figure 2) with vacuum space of 10 Å; single point energy calculations showing the difference between carefully optimizing for energy efficiency and carelessly running without pre-testing. (a) Calculated automatically by CASTEP.
Table 12 illustrates the combined effects of many of the model properties and parameters discussed in the previous section on the total time and overall use of computational resources. It's unlikely a user would choose the whole combination of model properties and parameters shown in the 'careless' column, but it nevertheless gives an idea of the impact a user can have on the energy consumption of their simulations. For comparison, the cheapest electric car listed in 2021 consumes 26.8 kWh per 100 miles, or 603 kJ/km, which means that the carelessly run simulation used the equivalent energy of driving this car about 86 km, whereas the efficiently run simulation 'drove' it 0.33 km.
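The car comparison is simple arithmetic; a quick check, using the 603 kJ/km figure quoted above and the kiloJoule totals from Table 12:

```python
# Convert a simulation's energy use (kJ) into km driven by the reference
# electric car (26.8 kWh per 100 miles, i.e. ~603 kJ/km, as quoted above).

CAR_KJ_PER_KM = 603

def simulation_to_car_km(energy_kj):
    """Distance the reference car could drive on the simulation's energy."""
    return energy_kj / CAR_KJ_PER_KM

print(f"careless run: {simulation_to_car_km(52000):.1f} km")  # ~86 km
print(f"careful run:  {simulation_to_car_km(202):.2f} km")    # ~0.33 km
```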
For computational scientists and modellers, applying good energy efficiency practices needs to become second nature; following an energy efficiency 'recipe' or procedure is a route to embedding this practice as a habit.
3 Developing energy efficient computing habits A recipe
1) Build a model of a system that contains only the essential ingredients that allow exploration of the scientific question. This is one of the key factors that determine the size of a model.
2) Find out how many cores per node there are on the available HPC cluster. This enables users to request the number of cores/tasks that minimizes inter-node communication during a simulation.
3) Choose the pseudopotentials to match the science. This ensures users don't use pseudopotentials that are unnecessarily computationally expensive.
4) Carry out extensive convergence testing based on the minimum accuracy required for the production run results, e.g.:
(i) Kinetic energy cut-off (depends on pseudopotential choice)
(ii) Grid scale and fine grid scale (depend on pseudopotential choice)
(iii) Size and orientation of the model, including e.g. number of bulk atoms, number of layers, size of surface vacuum space, etc.
(iv) Number of k-points
These decrease the possibility of over-convergence and its associated computational cost.
5) Spend time optimising the param file properties described in Section B using a small number of SCF cycles:
a. Data distribution: Gvector, k-points or mixed
b. Number of tasks per node
c. Optimization strategy
d. Spin polarization
e. Electronic energy (SCF) minimization method
This increases the chances of using resources efficiently, by matching the model and material requirements to the simulation parameters.
6) Optimise the script file. This increases the efficient use of HPC resources.
7) Submit the calculation and initially monitor it to check it's progressing as expected. This reduces the chances of wasting computational time due to trivial ('Friday afternoon') mistakes.
8) Carry out your own energy efficient computing tests (and send your findings to the JISCMAIL CASTEP mailing list).
9) Sit back and wait for the simulation to complete, basking in the knowledge that the simulation is running as energy efficiently(1) as a user can possibly make it.
4 What else can a user do
In addition to using the above recipe to embed energy-efficient computing habits, a user can take a number of actions to encourage the wider awareness and adoption of energy efficient computing:

a. If the HPC cluster uses SLURM, use the 'sacct' command to check the amount of energy consumed(2) (in Joules) by a job; see Figure 6.

b. If your local cluster uses a different job-scheduler, ask your local IT helpdesk if it has the facility to monitor the energy consumed by each HPC job.

c. Include the energy consumption of simulations in all forms of reports and presentations, e.g. informal talks, posters, peer reviewed journal articles, social media posts, etc. This will increase awareness of our role as environmentally aware and conscientious computational scientists and users of HPC resources.
(1) It's highly probable that users can expand on the list of model properties and parameters described within this document to further optimise energy efficient computing. (2) 'Note: Only in case of exclusive job allocation this value reflects the jobs real energy consumption' - see https://slurm.schedmd.com/sacct.html
Figure 6: Examples of information about jobs output through SLURM's 'sacct' command (plus flags). Top: list of details about several jobs run from 20/03/2021; bottom: details for a specific job ID via the 'seff <jobID>' command.
d. Include estimates of the energy consumption of simulations in applications for funding. Although not yet explicitly requested in EPSRC funding applications, there is the expectation that UKRI's 2020 commitment to Environmental Sustainability will filter down to all activities of its research councils, including funding. This will mean that funding applicants will need to demonstrate their awareness of the environmental impact of their proposed work. Become an impressive pioneer and include environmental impact through energy consumption in your next application.
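SLURM reports energy in Joules; a small helper to turn a parsable sacct line into kWh might look like the sketch below. The sample job line is invented for illustration; on a real cluster you would generate the line with something like `sacct -j <jobID> --format=JobID,Elapsed,ConsumedEnergyRaw -P` (field availability depends on the site's energy-accounting plugin).

```python
# Hedged sketch: convert the Joules reported by SLURM's sacct into kWh.
# The sample line is hypothetical; real output would come from e.g.
#   sacct -j <jobID> --format=JobID,Elapsed,ConsumedEnergyRaw -P

def joules_to_kwh(joules):
    """1 kWh = 3.6 MJ."""
    return joules / 3.6e6

sample = "1234567|00:03:35|202000"  # JobID|Elapsed|ConsumedEnergyRaw (made up)
job_id, elapsed, joules = sample.split("|")
print(f"job {job_id}: {joules_to_kwh(float(joules)):.4f} kWh")
```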
5 What are the developers doing
The compilation of this document included a chat with several of the developers of CASTEP, who are keen to help users run their software energy efficiently; they shared their plans and projects in this field.
- Parts of CASTEP have been programmed to run on GPUs, with up to a 15-fold speed-up (for non-local functionals).
- Work on a CASTEP simulator is underway that should reduce the number of CASTEP calculations required per simulation by choosing an optimal parallel domain decomposition and implementing timings for FFTs, the big parallel cost; it will also estimate compute usage. This simulator will go a long way to providing the structure needed to add energy efficiency to CASTEP and will be accessible through the '--dryrun' command. The toy code is available in Bitbucket.
- The developers recognise the need for energy consumption to be acknowledged as an additional factor to be included in the cost of computational simulations. They are planning their approach beyond the software itself, such as including energy efficient computing in their training courses.
Acknowledgements
I acknowledge the support of the Supercomputing Wales project, which is part-funded by the European Regional Development Fund (ERDF) via the Welsh Government.
Thank you to the following CASTEP developers for their invaluable input and support for this small project: Dr Phil Hasnip and Prof Matt Probert (University of York), Prof Chris Pickard (University of Cambridge), Dr Dominik Jochym (STFC), and Prof Stewart Clark (University of Durham). Thanks also to Dr Sue Thorne (STFC) and Dr Ed Bennett (Supercomputing Wales) for sharing their research engineering perspectives.
References
(1) Clark, S. J.; Segall, M. D.; Pickard, C. J.; Hasnip, P. J.; Probert, M. I. J.; Refson, K.; Payne, M. C. First Principles Methods Using CASTEP. Z. Krist. 2005, 220, 567-570.
(2) Vanderbilt, D. Soft Self-Consistent Pseudopotentials in a Generalized Eigenvalue Formalism. Phys. Rev. B 1990, 41, 7892-7895.
(3) Pickard, C. J. On-the-Fly Pseudopotential Generation in CASTEP. 2006.
(4) Refson, K.; Clark, S. J.; Tulip, P. Variational Density Functional Perturbation Theory for Dielectrics and Lattice Dynamics. Phys. Rev. B 2006, 73, 155114.
(5) Hamann, D. R.; Schlüter, M.; Chiang, C. Norm-Conserving Pseudopotentials. Phys. Rev. Lett. 1979, 43 (20), 1494-1497.
(6) BIOVIA, Dassault Systèmes. Materials Studio 2020; Dassault Systèmes: San Diego, 2019.
(7) Marzari, N.; Vanderbilt, D.; Payne, M. C. Ensemble Density Functional Theory for Ab Initio Molecular Dynamics of Metals and Finite-Temperature Insulators. Phys. Rev. Lett. 1997, 79, 1337-1340.
Figure 1 unit cell of generic silicate clay mineral comprising 41 atoms
A Cell file
Pseudopotentials
Choose the pseudopotential according to the type of simulation eg for simulations of cell
structures ultrasofts2 are often sufficient although if the pseudopotential library does not
contain an ultrasoft version for a particular element the on-the-fly-generated (OTFG)
ultrasofts3 might suffice If a user is running a spectroscopic simulation such as infrared
using density functional perturbation theory4 then norm-conserving5 or OTFG norm-
conserving3 could be the better choices The impact of pseudopotential type on the
computational cost is shown in Table 1 through the total (calculation) time
Type of pseudopotential
Ultrasoft Norm-conserving
OTFG Ultrasoft
OTFG Ultrasoft QC5 setb
OTFG Norm-conserving
Cut-off energy (eV)
370 900 598 340 925
coresa 5 5 5 5 5
Memoryprocess (MB)
666 681 2072 1007 681
Peak memory use (MB)
777 802 2785 1590 791
Total time (secs) 55 89 250 109 136
Table 1 Pseudopotential and size of planewave set required on lsquofinersquo setting of Materials Studio 20206 and an example of memory requirements and time required for a single point energy calculation using the recorded number of cores on a single node Unless otherwise stated the same cut-off energy per type of pseudopotential is implied throughout this document aUsing Sunbird (CPU 2x Intel(R) Xeon(R) Gold 6148 CPU 240GHz with 20 cores each) unless stated otherwise all calculations were performed on this HPC cluster bDesigned to be used at the same modest (340 eV) kinetic energy cut-off across the periodic table They are ideal for moderate accuracy high throughout calculations eg ab initio random structure searching (AIRSS)
K-points
Changing the number of Brillouin zone sampling points can have a dramatic effect on
computational time as shown in Table 2 Bear in mind that increasing the number of k-points
increases the memory requirements often tempting users to increase the number of cores
further increasing overall computational cost Remember though itrsquos important to use the
number of k-points that provide the level of accuracy your simulations need
Type of pseudopotential
Ultrasoft
OTFG Norm-conserving
kpoints_mp_grid 2 1 1 (1) 3 2 1 (3) 4 3 2 (12) 2 1 1 (1) 3 2 1 (3) 4 3 2 (12)
Memoryprocess (MB)
652 666 1249 630 681 1287
Peak memory use (MB)
768 777 1580 764 791 1296
Total time (secs) 32 55 222 85 136 477
Table 2 Single point energy calculations run on 5 cores using different numbers of k-points (in brackets) showing the
effects for different pseudopotentials
Vacuum space
When building a material surface it is necessary to add vacuum space to a cell (see Figure 2
for example) and this adds to the memory requirements and calculation time because the
lsquoempty spacersquo (as well as the atoms) is lsquofilledrsquo by planewaves Table 3 shows that doubling
the volume of vacuum space doubles the total calculation time (using the same number of
cores)
Vacuum space (Aring)
0 5 10 20
Memoryprocess (MB)
666 766 834 1078
Peak memory use (MB)
777 928 1066 1372
Total time (secs) 55 102 202 406
Overall parallel efficiencya
69 66 67 61
Figure 2 Vacuum space added to create clay mineral surface (to study adsorbate-surface interactions for example ndashadsorbate not included in the above)
Table 3 Single point energy calculations using ultrasoft pseudopotentials and 3 k-points run on 5 cores showing the effects of vacuum space aCalculated automatically by CASTEP
Supercell size
The size of a system is one of the more obvious choices that affects the demands on
computational resources nevertheless it is interesting to see (from Table 4) that for the
same number of kpoints doubling the number of atoms increases the memory load per
process between 35 (41 to 82 atoms) to 72 (82 to 164 atoms) and the corresponding
calculation times increase by factors 11 and 85 respectively In good practice the number
of kpoints is scaled according to the supercell size increasing the computational cost more
modestly
Supercell size ( atoms) 1 x 1 x 1 (41)
2 x 1 x 1 (82) 2 x 2 x 1 (164)
Kpoints (mp grid)
Kpoints scaled for supercells 2x1x1 and 2x2x1
3 2 1 (3) 3 2 1 (3)
2 1 1 (1)
3 2 1 (3)
2 1 1 (1)
Memoryprocess (MB) 666 897 732 1547 1315
Peak memory use (MB) 777 1175 1025 2330 2177
Total time (secs) 55 631 329 5416 1660
Overall parallel efficiencya 69 69 74 67 72 Table 4 Single point energy calculations using ultrasoft pseudo-potentials run on 5 cores showing the effects of supercells aCalculated automatically by CASTEP
Figure 3 Example of 2 x 2 x 1 supercell
Orientation of axes
This might be one of the more surprising and unexpected properties of a model that affects
computational efficiency The effect becomes significant when a system is large
disproportionately longer along one of its lengths and is misaligned with the x- y- z-axes
see Figure 4 and Table 5 for exaggerated examples of misalignment This effect is due to
the way CASTEP transforms real-space properties between real-space and reciprocal-
space it converts the 3-d fast Fourier transforms (FFT) to three 1-d FFT columns that lie
parallel to the x- y z-axes
Figure 4 Top row A capped carbon nanotube (160 atoms) and bottom row a long carbon nanotube (1000 atoms) showing
long axes aligned in the x-direction (left) z-direction (middle) skewed (right)
Orientation ( atoms)
X (160)
Z (160)
Skewed (160)
X (1000)
Z (1000)
Skewed (1000)
Cores 5 5 5 60 60 60
Memoryprocess (MB) 884 882 882 2870 2870 2870
Peak memory use (MB) 1893 1885 1838 7077 7077 7077
Total time (secs) 392 359 409 3906 3908 5232
Overall parallel efficiencya
79 84 82 78 78 75
Relative total energy ( cores total time core-seconds)
1960 1795 2045 234360 234480 313920
Table 5 Single point energy calculations of carbon nanotubes shown as oriented in Fig 4 using ultrasoft pseudopotentials (280 eV cut-off energy) and 1 k-point aCalculated automatically by CASTEP
B Param file
Grid-scale
Although the ultrasofts require a smaller planewave basis set than the norm-conserving
pseudopotentials, they do need a finer electron density grid, set via 'grid_scale' and
'fine_grid_scale'. As shown in Table 6, the denser grid settings needed by the OTFG
ultrasofts (with the exception of the QC5 set) can almost double the calculation time
compared with the more planewave-hungry OTFG norm-conserving pseudopotentials, which
converge well on a less dense grid.
Type of pseudopotential       Norm-cons.  Norm-cons.  Ultrasoft  OTFG NC    OTFG NC   OTFG US   OTFG US QC5
grid_scale / fine_grid_scale  1.5 / 1.75  2.0 / 3.0   2.0 / 3.0  1.5 / 1.75 2.0 / 3.0 2.0 / 3.0 2.0 / 3.0
Memory/process (MB)           792         681         666        680        731       2072      1007
Peak memory use (MB)          803         1070        777        791        956       2785      1590
Total time (secs)             89          150         55         136        221       250       109
Table 6: Single point energy calculations run on 5 cores showing the effects of different electron density grid settings ('NC' = norm-conserving, 'US' = ultrasoft).
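As a concrete sketch, the grid settings compared in Table 6 are single keywords in the param file (values below are the ones from the table's ultrasoft columns; the right values for your own system come from convergence testing):

```
# Fragment of a CASTEP param file: electron density grid settings
grid_scale      : 2.0    # density grid, relative to the planewave grid
fine_grid_scale : 3.0    # finer grid used for ultrasoft augmentation charges
```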
Data Distribution
Parallelizing over plane wave vectors ('G-vectors'), k-points, or a mix of the two has an
impact on computational efficiency, as shown in Table 7.
The default for a param file without the keyword 'data_distribution' is to prioritize k-point
distribution across a number of cores (less than or equal to the number requested in the
submission script) that is a factor of the number of k-points; see, for example, Table 7
columns 2 and 3. Inserting 'data_distribution : kpoint' into the param file prioritizes and
optimizes the k-point distribution across the number of cores requested in the script. In the
example tested, selecting data distribution over k-points increased the calculation time over
the default of no data distribution; compare columns 3 and 5 of Table 7.
Requesting G-vector distribution has the largest impact on calculation time, and combining
this with requesting a number of cores that is also a factor of the number of k-points has the
overall largest impact on reducing calculation time; see columns 6 and 7 of Table 7.
Requesting mixed data distribution has a similar impact on calculation time as requesting
no data distribution for 5 cores, but not for 6 cores: the 'mixed' distribution used 4-way
k-point distribution rather than the 6-way distribution applied by the default (no request);
compare columns 2 and 3 with 8 and 9.
For the small clay model system, the optimal efficiency was obtained using G-vector data
distribution over 6 cores (852 core-seconds) and the least efficient choice was mixed data
distribution over 6 cores (1584 core-seconds). These results are system-specific; careful
testing is needed to tailor the settings to other systems.
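For reference, the distribution is requested with a single param-file keyword; a minimal fragment might read:

```
# Fragment of a CASTEP param file: choose the parallel data distribution
data_distribution : gvector    # alternatives: kpoint, mixed, default
```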
Number of tasks per node
This is invoked by adding 'num_proc_in_smp' to the param file, and controls the number of message passing interface (MPI) tasks that are placed in a shared-memory (SMP) group. The "all-to-all" communications are then done in three phases instead of one: (1) tasks within an SMP group collect their data together on a chosen "controller" task within their group; (2) the all-to-all is done between the controller tasks; (3) the controllers distribute the data back to the tasks in their SMP groups. For small core counts the overhead of the two extra phases makes this method slower than just doing an all-to-all; for large core counts the reduction in the all-to-all time more than
compensates for the extra overhead, so it is faster. Indeed, the tests on the test HPC cluster, Sunbird (shown in Table 8), reveal that invoking this flag fails to produce as large a speed-up as the flag 'data_distribution : gvector' alone (compare columns 3 and 9), reflecting the small core count requested. Generally speaking, the more cores in the G-vector group, the higher 'num_proc_in_smp' should be set (up to the physical number of cores on a node).
Column                       2        3        4        5        6        7        8        9
Requested data distribution  None     None     Kpoints  Kpoints  Gvector  Gvector  Mixed    Mixed
Cores in submission script   5        6        5        6        5        6        5        6
Actual data distribution     kpoint   kpoint   kpoint   kpoint   Gvector  Gvector  kpoint   kpoint
                             4-way    6-way    5-way    6-way    5-way    6-way    4-way    4-way
Memory/process (MB)          1249     1219     1249     1219     728      698      1249     1253
Peak memory use (MB)         1581     1561     1581     1561     839      804      1581     1585
Total time (secs)            295      199      292      226      191      142      294      264
Overall parallel eff.^a      99%      96%      98%      96%      66%      71%      98%      96%
Relative total energy
(# cores x total time;
core-seconds)                1475     1194     1460     1356     955      852      1470     1584
Table 7: Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, showing the effects of data distribution across different numbers of cores requested in the script file. 'Actual data distribution' means that reported by CASTEP on completion, in this and (where applicable) all following tables. 'Relative total energy' assumes that each core requested by the script consumes X amount of electricity. ^a Calculated automatically by CASTEP.
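The 'relative total energy' rows in Tables 5 and 7 are simple core-seconds products (cores requested multiplied by wall-clock time), which makes them easy to reproduce for your own runs; a minimal helper (the function name is illustrative, not part of CASTEP):

```python
def relative_energy(cores: int, total_time_s: float) -> float:
    """Relative energy proxy: cores requested x wall-clock seconds."""
    return cores * total_time_s

# Best and worst cases from Table 7:
best = relative_energy(6, 142)   # G-vector distribution over 6 cores
worst = relative_energy(6, 264)  # mixed distribution over 6 cores
print(best, worst)  # 852 1584
```

The same product underlies the core-hours figures in Table 12 (divide core-seconds by 3600).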
Column                       2       3         4       5         6       7         8       9
num_proc_in_smp              Default           2                 4                 5
Requested data_distribution  None    Gvector   None    Gvector   None    Gvector   None    Gvector
Actual data distribution     kpoint  Gvector   kpoint  Gvector   kpoint  Gvector   kpoint  Gvector
                             4-way   5-way     4-way   5-way     4-way   5-way     4-way   5-way
Memory/process (MB)          1249    728       1249    728       1249    728       1249    728
Peak memory use (MB)         1580    837       1581    839       1581    844       1581    846
Total time (secs)            222     156       231     171       230     182       237     183
Overall parallel eff.^a      96%     66%       98%     60%       98%     56%       96%     56%
Table 8: Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, run on 5 cores, showing the effects of setting 'num_proc_in_smp' to 2, 4 or 5, both with and without the 'data_distribution : gvector' flag. 'Default' means 'num_proc_in_smp' is absent from the param file. ^a Calculated automatically by CASTEP.
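A sketch of how the two flags tested in Table 8 combine in a param file (the value 2 is illustrative; on larger core counts it would be tuned up towards the cores-per-node of the cluster):

```
# Fragment of a CASTEP param file: G-vector distribution with SMP grouping
data_distribution : gvector
num_proc_in_smp   : 2    # MPI tasks per SMP group
```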
Optimization strategy
This parameter has three settings and is invoked through the 'opt_strategy' flag in the
param file:
Default - Balances speed and memory use. Wavefunction coefficients for all k-points
in a calculation will be kept in memory rather than be paged to disk; some large
work arrays will be paged to disk.
Memory - Minimizes memory use. All wavefunctions and large work arrays are paged
to disk.
Speed - Maximizes speed by not paging to disk.
This means that if a user runs a large memory calculation optimizing for memory could
obviate the need to request additional cores although the calculation will take longer - see
Table 9 for comparisons
opt_strategy                   Default  Memory  Speed
Memory/process (MB)            793      750     1249
Peak memory use (MB)           1566     1092    1581
Total time (secs)              232      290     221
Overall parallel efficiency^a  94%      97%     96%
Table 9: Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, run on 5 cores, showing the effects of optimizing for speed or memory. 'Default' means either omitting the 'opt_strategy' flag from the param file or adding it as 'opt_strategy : default'. ^a Calculated automatically by CASTEP.
Spin polarization
If a system comprises an odd number of electrons, it might be important to differentiate
between the spin-up and spin-down states of the odd electron. This directly affects the
calculation time, effectively doubling it, as shown in Table 10.
param flag: spin_polarization   false  true
Memory/process (MB)             1249   1415
Peak memory use (MB)            1581   1710
Total time (secs)               222    455
Overall parallel efficiency^a   96%    98%
Table 10: Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, run on 5 cores, showing the effects of spin polarization. ^a Calculated automatically by CASTEP.
Electronic energy minimizer
Insulating systems often behave well during the self-consistent field (SCF) minimization and
converge smoothly using density mixing ('DM'). When SCF convergence is problematic and
all attempts to tweak DM-related parameters have failed, it is necessary to turn to ensemble
density functional theory^7 and accept the consequent (and considerable) increase in
computational cost; see Table 11.
param flag: metals_method
(electron minimization)         DM     EDFT
Memory/process (MB)             1249   1289
Peak memory use (MB)            1581   1650
Total time (secs)               222    370
Overall parallel efficiency^a   96%    97%
Table 11: Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, run on 5 cores, showing the effects of the electronic minimization method. 'DM' means density mixing and 'EDFT' ensemble density functional theory. ^a Calculated automatically by CASTEP.
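Pulling the Section B settings together, a param file tuned along the lines of the 'efficient' choices tested above might contain the following. The values are those used for the clay test system in this document, not general recommendations; every entry should be convergence-tested for your own system:

```
# Illustrative CASTEP param settings for the clay test system
task              : singlepoint
cut_off_energy    : 370          # eV; matched to ultrasoft pseudopotentials
grid_scale        : 2.0
fine_grid_scale   : 3.0
data_distribution : gvector
opt_strategy      : speed        # use 'memory' for large-memory jobs
spin_polarization : false
metals_method     : dm           # density mixing; switch to edft only if DM fails
```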
C Submission script
Figure 5: An example HPC batch submission script.
Figure 5 captures the script variables that affect HPC computational energy and usage
efficiency:
(i) The variable familiar to most HPC users describes the number of cores ('tasks')
requested for the simulation. Unless the calculation is memory hungry, configure
the requested number of cores to sit on the fewest nodes, because this reduces
expensive node-to-node communication time.
(ii) Choosing the shortest job run time gives the calculation a better chance of
progressing through the job queue swiftly.
(iii) When not requesting use of all cores on a single node, remove the 'exclusive'
flag to accelerate progress through the job queue.
(iv) Using the most recent version of the software captures the latest upgrades and bug-
fixes that might otherwise slow down a calculation.
(v) Using the '--dryrun' tag provides a (very) broad estimate of the memory
requirements. In one example the estimate of peak memory use was ¼ of that
actually used during the simulation proper.
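As Figure 5 is not reproduced here, a minimal SLURM script covering points (i)-(v) might look like the following sketch. The module name and seed name are placeholders for your own cluster and job:

```
#!/bin/bash
#SBATCH --job-name=castep_test
#SBATCH --ntasks=6            # (i) match cores to the k-point/G-vector factors
#SBATCH --nodes=1             # (i) keep tasks on the fewest nodes
#SBATCH --time=00:30:00       # (ii) shortest realistic run time
# (iii) no --exclusive flag unless the whole node is genuinely needed

module load castep/21.1       # (iv) most recent installed version

castep.mpi --dryrun myseed    # (v) broad estimate of memory requirements

srun castep.mpi myseed
```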
D An (extreme) example
Clay mineral (Figure 2)                Careful optimisation      Careless: no optimisation
                                       for energy efficiency     for energy efficiency
Vacuum space (Å)                       10                        10
Pseudopotential, cut-off energy (eV)   Ultrasoft, 370            OTFG-Ultrasoft, 599
K-points                               3                         12
grid_scale, fine_grid_scale            2, 3                      3, 4
num_proc_in_smp / requested
data distribution                      default / Gvector         20 / none
Actual data distribution               5-way Gvector only        3-way Gvector, 12-way kpoint,
                                                                 3-way (Gvector) SMP
Optimization strategy                  Speed                     Default
Spin polarization                      False                     True
Electronic energy minimizer            Density mixing            EDFT
Number of cores requested              5                         40
RESULTS
Memory/process (MB) / scratch
disk (MB)                              834 / 0                   1461 / 6518
Peak memory use (MB)                   1066                      9107
Total time (seconds)                   215                       45302
Overall parallel efficiency^a          69%                       96%
Relative total energy (# cores x
total time; core-seconds / core-hours) 1075 / 0.30               1,812,080 / 503.36
Kilojoules used (approx.)              202                       52,000
Table 12: One clay mineral model (Figure 2) with vacuum space of 10 Å; single point energy calculations showing the difference between carefully optimizing for energy efficiency and carelessly running without pre-testing. ^a Calculated automatically by CASTEP.
Table 12 illustrates the combined effects, on the total time and overall use of computational
resources, of many of the model properties and parameters discussed in the previous
section. It's unlikely a user would choose the whole combination of model properties and
parameters shown in the 'careless' column, but it nevertheless gives an idea of the impact a
user can have on the energy consumption of their simulations. For comparison, the cheapest
electric car listed in 2021 consumes 26.8 kWh per 100 miles, or 603 kJ/km, which means
that the carelessly run simulation used the equivalent energy of driving this car about 86 km,
whereas the efficiently run simulation 'drove' it 0.33 km.
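The arithmetic behind the car comparison is a one-liner worth keeping to hand for your own jobs (the constant and function name are just for this illustration):

```python
CAR_KJ_PER_KM = 603  # cheapest 2021 electric car: 26.8 kWh per 100 miles

def km_equivalent(sim_kj: float) -> float:
    """Distance the car could drive on the energy a simulation consumed."""
    return sim_kj / CAR_KJ_PER_KM

print(round(km_equivalent(52000)))   # careless run: ~86 km
print(round(km_equivalent(202), 2))  # careful run: ~0.33 km
```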
For computational scientists and modellers, applying good energy efficiency practices needs
to become second nature; following an energy efficiency 'recipe' or procedure is a route to
embedding this practice as a habit.
3 Developing energy efficient computing habits A recipe
1) Build a model of a system that contains only the essential ingredients that
allow exploration of the scientific question. This is one of the key factors that
determine the size of a model.
2) Find out how many cores per node there are on the available HPC cluster. This
enables users to request the number of cores/tasks that minimizes inter-node
communication during a simulation.
3) Choose the pseudopotentials to match the science. This ensures users don't use
pseudopotentials that are unnecessarily computationally expensive.
4) Carry out extensive convergence testing based on the minimum accuracy
required for the production run results, e.g.:
(i) Kinetic energy cut-off (depends on pseudopotential choice)
(ii) Grid scale and fine grid scale (depends on pseudopotential choice)
(iii) Size and orientation of model, including e.g. number of bulk atoms,
number of layers, size of surface vacuum space, etc.
(iv) Number of k-points
These decrease the possibility of over-convergence and its associated
computational cost.
5) Spend time optimising the param file properties described in Section B, using a
small number of SCF cycles:
a. Data distribution: Gvector, k-points or mixed
b. Number of tasks per node
c. Optimization strategy
d. Spin polarization
e. Electronic energy (SCF) minimization method
This increases the chances of using resources efficiently due to matching the
model and material requirements to the simulation parameters.
6) Optimise the script file. This increases the efficient use of HPC resources.
7) Submit the calculation and initially monitor it to check it's progressing as
expected. This reduces the chances of wasting computational time due to trivial
('Friday afternoon') mistakes.
8) Carry out your own energy efficient computing tests (and send your findings to
the JISCMAIL CASTEP mailing list).
9) Sit back and wait for the simulation to complete, basking in the knowledge that
the simulation is running as energy efficiently^1 as a user can possibly make it.
4 What else can a user do
In addition to using the above recipe to embed energy-efficient computing habits, a user
can take a number of actions to encourage the wider awareness and adoption of energy
efficient computing:
a. If the HPC cluster uses SLURM, use the 'sacct' command to check the
amount of energy consumed^2 (in Joules) by a job; see Figure 6.
b. If your local cluster uses a different job-scheduler, ask your local IT helpdesk if
it has the facility to monitor the energy consumed by each HPC job.
c. Include the energy consumption of simulations in all forms of reports and
presentations, e.g. informal talks, posters, peer reviewed journal articles, social
media posts, etc. This will increase awareness of our role as environmentally
aware and conscientious computational scientists and users of HPC resources.
1 It's highly probable that users can expand on the list of model properties and parameters described within this document to further optimise energy efficient computing. 2 'Note: Only in case of exclusive job allocation this value reflects the job's real energy consumption' - see https://slurm.schedmd.com/sacct.html
Figure 6: Examples of information about jobs output through SLURM's 'sacct' command (plus flags). Top: a list of details about several jobs run from 20/03/2021; bottom: details for a specific job ID via the 'seff <jobID>' command.
d. Include estimates of the energy consumption of simulations in applications for
funding. Although not yet explicitly requested in EPSRC funding applications,
there is the expectation that UKRI's 2020 commitment to Environmental
Sustainability will filter down to all activities of its research councils, including
funding. This will mean that funding applicants will need to demonstrate their
awareness of the environmental impact of their proposed work. Become an
impressive pioneer and include environmental impact through energy
consumption in your next application.
5 What are the developers doing
The compilation of this document included a chat with several of the developers of CASTEP,
who are keen to help users run their software energy efficiently; they shared their plans and
projects in this field:
Parts of CASTEP have been programmed to run on GPUs, with up to a 15-fold
speed-up (for non-local functionals).
Work on a CASTEP simulator is underway that should reduce the number of
CASTEP calculations required per simulation by choosing an optimal parallel domain
decomposition and implementing timings for FFTs (the big parallel cost); it will also
estimate compute usage. This simulator will go a long way to providing the structure
needed to add energy efficiency to CASTEP, and will be accessible through the
'--dryrun' command. The toy code is available in Bitbucket.
The developers recognise the need for energy consumption to be acknowledged as
an additional factor to be included in the cost of computational simulations. They are
planning their approach beyond the software itself, such as including energy
efficient computing in their training courses.
Acknowledgements
I acknowledge the support of the Supercomputing Wales project which is part-funded by the
European Regional Development Fund (ERDF) via Welsh Government
Thank you to the following CASTEP developers for their invaluable input and support for this
small project Dr Phil Hasnip and Prof Matt Probert (University of York) Prof Chris Pickard
(University of Cambridge) Dr Dominik Jochym (STFC) Prof Stewart Clark (University of
Durham) Thanks also to Dr Sue Thorne (STFC) and Dr Ed Bennett (Supercomputing
Wales) for sharing their research engineering perspectives
References
(1) Clark, S. J.; Segall, M. D.; Pickard, C. J.; Hasnip, P. J.; Probert, M. I. J.; Refson, K.; Payne, M. C. First Principles Methods Using CASTEP. Z. Krist. 2005, 220, 567-570.
(2) Vanderbilt, D. Soft Self-Consistent Pseudopotentials in a Generalized Eigenvalue Formalism. Phys. Rev. B 1990, 41, 7892-7895.
(3) Pickard, C. J. On-the-Fly Pseudopotential Generation in CASTEP. 2006.
(4) Refson, K.; Clark, S. J.; Tulip, P. Variational Density Functional Perturbation Theory for Dielectrics and Lattice Dynamics. Phys. Rev. B 2006, 73, 155114.
(5) Hamann, D. R.; Schlüter, M.; Chiang, C. Norm-Conserving Pseudopotentials. Phys. Rev. Lett. 1979, 43 (20), 1494-1497.
(6) BIOVIA, Dassault Systèmes. Materials Studio 2020; Dassault Systèmes: San Diego, 2019.
(7) Marzari, N.; Vanderbilt, D.; Payne, M. C. Ensemble Density Functional Theory for Ab Initio Molecular Dynamics of Metals and Finite-Temperature Insulators. Phys. Rev. Lett. 1997, 79, 1337-1340.
embedding this practice as a habit
3 Developing energy efficient computing habits A recipe
1) Build a model of a system that contains only the essential ingredients that
allows exploration of the scientific question This is one of the key factors that
determines the size of a model
2) Find out how many cores per node there are on the available HPC cluster This
enables users to request the number of corestasks that minimizes inter-node
communication during a simulation
3) Choose the pseudopotentials to match the science This ensures users donrsquot use
pseudopotentials that are unnecessarily computationally expensive
4) Carry out extensive convergence testing based on the minimum accuracy
required for the production run results eg
(i) Kinetic energy cut-off (depends on pseudopotential choice)
(ii) Grid scale and fine grid scale (depends on pseudopotential choice)
(iii) Size and orientation of model including eg number of bulk atoms
number of layers size of surface vacuum space etc
(iv) Number of k-points
These decrease the possibility of over-convergence and its associated
computational cost
5) Spend time optimising the param file properties described in Section B using a
small number of SCF cycles
a Data distribution Gvector k-points or mixed
b Number of tasks per node
c Optimization strategy
d Spin polarization
e Electronic energy (SCF) minimization method
This increases the chances of using resources efficiently due to matching the
model and material requirements to the simulation parameters
6) Optimise the script file This increases the efficient use of HPC resources
7) Submit the calculation and initially monitor it to check itrsquos progressing as
expected This reduces the chances of wasting computational time due to trivial
(lsquoFriday afternoonrsquo) mistakes
8) Carry out your own energy efficient computing tests (and send your findings to
the JISCMAIL CASTEP mailing list)
9) Sit back and wait for the simulation to complete basking in the knowledge that
the simulation is running as energy efficiently1 as a user can possibly make it
4 What else can a user do
In addition to using the above recipe to embed energy-efficient computing habits a user
can take a number of actions to encourage the wider awareness and adoption of energy
efficient computing
a If the HPC cluster uses SLURM use the lsquosacctrsquo command to check the
amount of energy consumed2 (in Joules) by a job -see Figure 6
b If your local cluster uses a different job-scheduler ask your local IT helpdesk if
it has the facility to monitor the energy consumed by each HPC job
c Include the energy consumption of simulations in all forms of reports and
presentations eg informal talks posters peer reviewed journal articles social
media posts etc This will increase awareness of our role as environmentally
aware and conscientious computational scientists and users of HPC resources
1 Itrsquos highly probable that users can expand on the list of model properties and parameters described within this document to further optimise energy efficient computing 2 lsquoNote Only in case of exclusive job allocation this value reflects the jobs real energy consumptionrsquo - see httpsslurmschedmdcomsaccthtml
Figure 6 Examples of information about jobs output through SLURMrsquos lsquosacctrsquo command (plus flags) Top list of details about several jobs run from 20032021 bottom details for a specific job ID via the lsquoseff ltjobIDgtrsquo command
d Include estimates of the energy consumption of simulations in applications for
funding Although not yet explicitly requested in EPSRC funding applications
there is the expectation that UKRIrsquos 2020 commitment to Environmental
Sustainability will filter down to all activities of its research councils including
funding This will mean that funding applicants will need to demonstrate their
awareness of the environmental impact of their proposed work Become an
impressive pioneer and include environmental impact through energy
consumption in your next application
5 What are the developers doing
The compilation of this document included a chat with several of the developers of CASTEP
who are keen to help users run their software energy efficiently they shared their plans and
projects in this field
Parts of CASTEP have been programmed to run on GPUs with up to a 15-fold
speed-up (for non-local functionals)
Work on a CASTEP simulator is underway that should reduce the number of
CASTEP calculations required per simulation by choosing an optimal parallel domain
decomposition and implementing timings for FFTs ndash the big parallel cost also it will
estimate compute usage This simulator will go a long way to providing the structure
needed to add energy efficiency to CASTEP and will be accessible through the
rsquo- -dryrunrsquo command The toy code is available in bitbucket
The developers recognise the need for energy consumption to be acknowledged as
an additional factor to be included in the cost of computational simulations They will
be planning their approach beyond the software itself such as including energy
efficient computing in their training courses
Acknowledgements
I acknowledge the support of the Supercomputing Wales project which is part-funded by the
European Regional Development Fund (ERDF) via Welsh Government
Thank you to the following CASTEP developers for their invaluable input and support for this
small project Dr Phil Hasnip and Prof Matt Probert (University of York) Prof Chris Pickard
(University of Cambridge) Dr Dominik Jochym (STFC) Prof Stewart Clark (University of
Durham) Thanks also to Dr Sue Thorne (STFC) and Dr Ed Bennett (Supercomputing
Wales) for sharing their research engineering perspectives
References
(1) Clark S J Segall M D Pickard C J Hasnip P J Probert M I J Refson K Payne M C First Principles Methods Using CASTEP Z Krist 2005 220 567ndash570
(2) Vanderbilt D Soft Self-Consistent Pseudopotentials in a Generalized Eigenvalue Formalism Phys Rev B 1990 41 7892ndash7895
(3) Pickard C J On-the-Fly Pseudopotential Generation in CASTEP 2006 (4) Refson K Clark S J Tulip P Variational Density Functional Perturbation Theory for
Dielectrics and Lattice Dynamics Phys Rev B 2006 73 155114 (5) Hamann D R Schluumlter M Chiang C Norm-Conserving Pseudopotentials Phys
Rev Lett 1979 43 (20) 1494ndash1497 (6) BIOVIA Dassault Systegravemes Materials Studio 2020 Dassault Systegravemes San Diego
2019 (7) Marzari N Vanderbilt D Payne M C Ensemble Density Functional Theory for Ab
Initio Molecular Dynamics of Metals and Finite-Temperature Insulators Phys Rev Lett 1997 79 1337ndash1340
Supercell size
The size of a system is one of the more obvious choices affecting the demands on
computational resources. Nevertheless, it is interesting to see (from Table 4) that, for the
same number of k-points, doubling the number of atoms increases the memory load per
process by between 35% (41 to 82 atoms) and 72% (82 to 164 atoms), while the corresponding
calculation times increase by factors of roughly 11 and 8.5 respectively. In good practice the
number of k-points is scaled down with the supercell size, increasing the computational cost
more modestly.
Supercell size (# atoms):        1x1x1 (41)   2x1x1 (82)   2x1x1 (82)   2x2x1 (164)  2x2x1 (164)
K-points, MP grid (# k-points):  3 2 1 (3)    3 2 1 (3)    2 1 1 (1)    3 2 1 (3)    2 1 1 (1)
                                              unscaled     scaled       unscaled     scaled
Memory/process (MB):             666          897          732          1547         1315
Peak memory use (MB):            777          1175         1025         2330         2177
Total time (secs):               55           631          329          5416         1660
Overall parallel efficiencya:    69           69           74           67           72
Table 4 Single point energy calculations using ultrasoft pseudopotentials run on 5 cores showing the effects of supercells, with and without scaling the k-point grid to the supercell. aCalculated automatically by CASTEP
Figure 3 Example of 2 x 2 x 1 supercell
Orientation of axes
This might be one of the more surprising and unexpected properties of a model that affects
computational efficiency. The effect becomes significant when a system is large,
disproportionately long along one of its axes, and misaligned with the x-, y-, z-axes;
see Figure 4 and Table 5 for exaggerated examples of misalignment. The effect is due to
the way CASTEP transforms properties between real space and reciprocal space: it
decomposes the 3-d fast Fourier transform (FFT) into three sets of 1-d FFTs along columns
that lie parallel to the x-, y- and z-axes.
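This axis-by-axis decomposition is a general property of the FFT, not something unique to CASTEP; a minimal NumPy sketch (array size and seed are arbitrary) shows that a 3-d FFT is exactly equivalent to 1-d FFTs applied along each axis in turn:

```python
import numpy as np

# A 3-D FFT equals successive 1-D FFTs along the x-, y- and z-columns,
# which is why column alignment with the axes matters for efficiency.
rng = np.random.default_rng(0)
a = rng.standard_normal((8, 8, 8))

full_3d = np.fft.fftn(a)           # one 3-D transform

step = np.fft.fft(a, axis=0)       # 1-D FFTs along x-columns...
step = np.fft.fft(step, axis=1)    # ...then y-columns...
step = np.fft.fft(step, axis=2)    # ...then z-columns

print(np.allclose(full_3d, step))  # True
```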
Figure 4 Top row A capped carbon nanotube (160 atoms) and bottom row a long carbon nanotube (1000 atoms) showing
long axes aligned in the x-direction (left) z-direction (middle) skewed (right)
Orientation (# atoms):           X (160)   Z (160)   Skewed (160)   X (1000)   Z (1000)   Skewed (1000)
Cores:                           5         5         5              60         60         60
Memory/process (MB):             884       882       882            2870       2870       2870
Peak memory use (MB):            1893      1885      1838           7077       7077       7077
Total time (secs):               392       359       409            3906       3908       5232
Overall parallel efficiencya:    79        84        82             78         78         75
Relative total energy
(# cores x total time, core-secs): 1960    1795      2045           234360     234480     313920
Table 5 Single point energy calculations of carbon nanotubes shown as oriented in Fig 4, using ultrasoft pseudopotentials (280 eV cut-off energy) and 1 k-point. aCalculated automatically by CASTEP
B Param file
Grid-scale
Although the ultrasoft pseudopotentials require a smaller planewave basis set than the
norm-conserving ones, they do need a finer electron density grid, set via 'grid_scale' and
'fine_grid_scale'. As shown in Table 6, the denser grid setting needed by the OTFG ultrasofts
(with the exception of the QC5 set) can almost double the calculation time compared with the
more planewave-hungry OTFG norm-conserving pseudopotentials, which converge well on a less
dense grid.
Type of pseudopotential     grid_scale / fine_grid_scale   Memory/process (MB)   Peak memory use (MB)   Total time (secs)
Norm-conserving             1.5 / 1.75                     792                   803                    89
Norm-conserving             2.0 / 3.0                      681                   1070                   150
Ultrasoft                   2.0 / 3.0                      666                   777                    55
OTFG Norm-conserving        1.5 / 1.75                     680                   791                    136
OTFG Norm-conserving        2.0 / 3.0                      731                   956                    221
OTFG Ultrasoft              2.0 / 3.0                      2072                  2785                   250
OTFG Ultrasoft (QC5 set)    2.0 / 3.0                      1007                  1590                   109
Table 6 Single point energy calculations run on 5 cores showing the effects of different electron density grid settings
Data distribution
Parallelizing over plane wave vectors ('G-vectors'), k-points, or a mix of the two has an
impact on computational efficiency, as shown in Table 7.
The default for a param file without the keyword 'data_distribution' is to prioritize k-point
distribution across a number of cores (less than or equal to the number requested in the
submission script) that is a factor of the number of k-points; see for example Table 7,
columns 2 and 3. Inserting 'data_distribution : kpoint' into the param file prioritizes and
optimizes the k-point distribution across the number of cores requested in the script. In the
example tested, selecting data distribution over k-points increased the calculation time over
the default of no data distribution; compare columns 3 and 5 of Table 7.
Requesting G-vector distribution has the largest impact on calculation time, and combining
this with requesting a number of cores that is also a factor of the number of k-points has the
overall largest impact on reducing calculation time; see columns 6 and 7 of Table 7.
Requesting mixed data distribution has a similar impact on calculation time as not requesting
any data distribution for 5 cores, but not for 6 cores: the 'mixed' distribution used 4-way
k-point distribution rather than the 6-way distribution applied by the default (no request);
compare columns 2 and 3 with 8 and 9.
For the small clay model system the optimal efficiency was obtained using G-vector data
distribution over 6 cores (852 core-seconds) and the least efficient choice was mixed data
distribution over 6 cores (1584 core-seconds). The results are system-specific and need
careful testing to tailor to different systems.
Number of tasks per node
This is invoked by adding 'num_proc_in_smp' to the param file and controls the number of
message passing interface (MPI) tasks that are placed in a shared-memory (SMP) group. The
"all-to-all" communications are then done in three phases instead of one:
(1) tasks within an SMP group collect their data together on a chosen "controller" task
within their group; (2) the "all-to-all" is done between the controller tasks only;
(3) the controllers distribute the data back to the tasks in their SMP groups.
For small core counts the overhead of the two extra phases makes this method slower than
just doing an all-to-all; for large core counts the reduction in the all-to-all time more than
compensates for the extra overhead, so it is faster. Indeed, the tests (shown in Table 8)
reveal that invoking this flag fails to produce as large a speed-up as the flag
'data_distribution : gvector' (compare columns 3 and 9) on the test HPC cluster, Sunbird,
reflecting the small core count requested. Generally speaking, the more cores in the G-vector
group, the higher 'num_proc_in_smp' should be set (up to the physical number of cores on a node).
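The three phases can be illustrated with a toy Python sketch (simplified here to an all-gather, with plain lists standing in for MPI buffers; no real MPI is involved):

```python
# Toy model of the three-phase communication described above: tasks are
# split into SMP groups and data funnels through one "controller" per
# group instead of every task exchanging with every other task.

def three_phase_all_to_all(groups):
    """groups: list of SMP groups, each a list of per-task data lists."""
    # (1) gather within each group onto its controller task
    controllers = [sum(group, []) for group in groups]
    # (2) exchange between the controller tasks only
    exchanged = sum(controllers, [])
    # (3) controllers distribute the combined data back to their group
    return [[exchanged for _ in group] for group in groups]

# Two SMP groups of two tasks each, one data item per task
groups = [[["a1"], ["a2"]], [["b1"], ["b2"]]]
result = three_phase_all_to_all(groups)
print(result[0][0])  # every task ends with ['a1', 'a2', 'b1', 'b2']
```

With G groups the controller exchange involves G participants instead of the full task count, which is where the saving at large core counts comes from.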
Column:                             2          3          4          5          6          7          8          9
Requested distribution (cores):     None (5)   None (6)   Kpoint (5) Kpoint (6) Gvector (5) Gvector (6) Mixed (5) Mixed (6)
Actual data distribution:           kpoint     kpoint     kpoint     kpoint     Gvector    Gvector    kpoint     kpoint
                                    4-way      6-way      5-way      6-way      5-way      6-way      4-way      4-way
Memory/process (MB):                1249       1219       1249       1219       728        698        1249       1253
Peak memory use (MB):               1581       1561       1581       1561       839        804        1581       1585
Total time (secs):                  295        199        292        226        191        142        294        264
Overall parallel efficiencya:       99         96         98         96         66         71         98         96
Relative total energy
(# cores x total time, core-secs):  1475       1194       1460       1356       955        852        1470       1584
Table 7 Single point energy calculations using ultrasoft pseudopotentials and 12 k-points showing the effects of data distribution across different numbers of cores requested in the script file. 'Actual data distribution' means that reported by CASTEP on completion, in this and (where applicable) all following Tables. 'Relative total energy' assumes that each core requested by the script consumes X amount of electricity. aCalculated automatically by CASTEP
Column:                        2         3         4         5         6         7         8         9
num_proc_in_smp:               Default   Default   2         2         4         4         5         5
Requested data_distribution:   None      Gvector   None      Gvector   None      Gvector   None      Gvector
Actual data distribution:      kpoint    Gvector   kpoint    Gvector   kpoint    Gvector   kpoint    Gvector
                               4-way     5-way     4-way     5-way     4-way     5-way     4-way     5-way
Memory/process (MB):           1249      728       1249      728       1249      728       1249      728
Peak memory use (MB):          1580      837       1581      839       1581      844       1581      846
Total time (secs):             222       156       231       171       230       182       237       183
Overall parallel efficiencya:  96        66        98        60        98        56        96        56
Table 8 Single point energy calculations using ultrasoft pseudopotentials and 12 k-points run on 5 cores showing the effects of setting 'num_proc_in_smp' to 2, 4 or 5, both with and without the 'data_distribution : gvector' flag. 'Default' means 'num_proc_in_smp' absent from the param file. aCalculated automatically by CASTEP
Optimization strategy
This parameter has three settings and is invoked through the 'opt_strategy' flag in the
param file:
Default - Balances speed and memory use. Wavefunction coefficients for all k-points
in a calculation will be kept in memory rather than be paged to disk; some large
work arrays will be paged to disk.
Memory - Minimizes memory use. All wavefunctions and large work arrays are paged
to disk.
Speed - Maximizes speed by not paging to disk.
This means that if a user runs a large-memory calculation, optimizing for memory could
obviate the need to request additional cores, although the calculation will take longer; see
Table 9 for comparisons.
opt_strategy:                  Default   Memory   Speed
Memory/process (MB):           793       750      1249
Peak memory use (MB):          1566      1092     1581
Total time (secs):             232       290      221
Overall parallel efficiencya:  94        97       96
Table 9 Single point energy calculations using ultrasoft pseudopotentials and 12 k-points run on 5 cores showing the effects of optimizing for speed or memory. 'Default' means either omitting the 'opt_strategy' flag from the param file or adding it as 'opt_strategy : default'. aCalculated automatically by CASTEP
Spin polarization
If a system comprises an odd number of electrons, it might be important to differentiate
between the spin-up and spin-down states of the unpaired electron. This directly affects the
calculation time, effectively doubling it, as shown in Table 10.
param flag and setting
spin_polarization:             false   true
Memory/process (MB):           1249    1415
Peak memory use (MB):          1581    1710
Total time (secs):             222     455
Overall parallel efficiencya:  96      98
Table 10 Single point energy calculations using ultrasoft pseudopotentials and 12 k-points run on 5 cores showing the effects of spin polarization. aCalculated automatically by CASTEP
Electronic energy minimizer
Insulating systems often behave well during the self-consistent field (SCF) minimization and
converge smoothly using density mixing ('DM'). When SCF convergence is problematic and
all attempts to tweak DM-related parameters have failed, it is necessary to turn to ensemble
density functional theory7 ('EDFT') and accept the consequent (and considerable) increase in
computational cost; see Table 11.
param flag and setting
metals_method (electron minimization):  DM     EDFT
Memory/process (MB):                    1249   1289
Peak memory use (MB):                   1581   1650
Total time (secs):                      222    370
Overall parallel efficiencya:           96     97
Table 11 Single point energy calculations using ultrasoft pseudopotentials and 12 k-points run on 5 cores showing the effects of the electronic minimization method. 'DM' means density mixing and 'EDFT' ensemble density functional theory.
aCalculated automatically by CASTEP
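Pulling the Section B flags together, a param file exercising these settings might contain something like the fragment below (keyword spellings follow the tables above; the values simply mirror the 'efficient' choices tested here, not universal recommendations):

```
! Illustrative .param fragment - values from the tests in Section B
task               : SinglePoint
cut_off_energy     : 370 eV
grid_scale         : 2.0
fine_grid_scale    : 3.0
data_distribution  : gvector
opt_strategy       : speed
spin_polarization  : false
metals_method      : dm
```

The same file is where 'num_proc_in_smp' would be added, should testing on your cluster show it helps at your core count.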
C Script submission file
Figure 5 An example HPC batch submission script
Figure 5 captures the script variables that affect HPC computational energy and usage
efficiency:
(i) The variable familiar to most HPC users describes the number of cores ('tasks')
requested for the simulation. Unless the calculation is memory hungry, configure
the requested number of cores to sit on the fewest nodes, because this reduces
expensive node-to-node communication time.
(ii) Choosing the shortest job run time gives the calculation a better chance of
progressing through the job queue swiftly.
(iii) When not requesting use of all cores on a single node, remove the 'exclusive'
flag to accelerate progress through the job queue.
(iv) Using the most recent version of the software captures the latest upgrades and bug-
fixes that might otherwise slow down a calculation.
(v) Using the 'dryrun' tag provides a (very) broad estimate of the memory
requirements. In one example the estimate of peak memory use was 1/4 of that
actually used during the simulation proper.
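Since Figure 5 is an image, a minimal SLURM script of the kind it depicts is sketched below; the partition, account, module version and seed name are placeholders for your own cluster's values:

```shell
#!/bin/bash --login
# Illustrative SLURM submission script only - names are placeholders.
#SBATCH --job-name=clay_spe
#SBATCH --ntasks=5                # (i) modest core count, placed on...
#SBATCH --nodes=1                 # ...the fewest nodes possible
#SBATCH --time=00:30:00           # (ii) shortest realistic run time
##SBATCH --exclusive              # (iii) leave commented out unless filling a node
#SBATCH --partition=compute
#SBATCH --account=my_project

module load castep/21.11          # (iv) most recent installed version

# (v) cheap estimate of memory requirements before the real run:
#     castep.mpi --dryrun clay
mpirun -n ${SLURM_NTASKS} castep.mpi clay
```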
D An (extreme) example
Clay mineral (Figure 2)            Careful: optimised for        Careless: no optimisation for
                                   energy efficiency             energy efficiency
Vacuum space                       10 Å                          10 Å
Pseudopotential, cut-off (eV)      Ultrasoft, 370                OTFG-Ultrasoft, 599
K-points                           3                             12
grid_scale / fine_grid_scale       2 / 3                         3 / 4
num_proc_in_smp / requested
  data distribution                default / Gvector             20 / none
Actual data distribution           5-way Gvector only            3-way Gvector, 12-way kpoint,
                                                                 3-way (Gvector) smp
Optimization strategy              Speed                         Default
Spin polarization                  False                         True
Electronic energy minimizer        Density mixing                EDFT
Number of cores requested          5                             40
RESULTS
Memory/process (MB)                834                           1461
Scratch disk (MB)                  0                             6518
Peak memory use (MB)               1066                          9107
Total time (seconds)               215                           45302
Overall parallel efficiencya       69                            96
Relative total energy
(# cores x total time):
  core-seconds                     1075                          1812080
  core-hours                       0.30                          503.36
kiloJoules used (approx.)          202                           52000
Table 12 One clay mineral model (Figure 2) with vacuum spaces of 10 Å. Single point energy calculations showing the difference between carefully optimizing for energy efficiency and carelessly running without pre-testing. aCalculated automatically by CASTEP
Table 12 illustrates the combined effects, on the total time and overall use of computational
resources, of many of the model properties and parameters discussed in the previous
sections. It is unlikely a user would choose the whole combination of model properties and
parameters shown in the 'careless' column, but it nevertheless gives an idea of the impact a
user can have on the energy consumption of their simulations. For comparison, the cheapest
electric car listed in 2021 consumes 26.8 kWh per 100 miles, or 603 kJ/km, which means
that the carelessly run simulation used the equivalent energy of driving this car about 86 km,
whereas the efficiently run simulation 'drove' it 0.33 km.
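The arithmetic behind the comparison can be sketched in a few lines (the kJ figures and the 603 kJ/km car figure come from the text above):

```python
# Back-of-envelope check of the Table 12 comparison.
CAR_KJ_PER_KM = 603  # cheapest 2021 EV: 26.8 kWh per 100 miles

def core_seconds(cores, seconds):
    """Total compute consumed: cores requested x wall-clock seconds."""
    return cores * seconds

careful = core_seconds(5, 215)         # 1075 core-seconds
careless = core_seconds(40, 45302)     # 1812080 core-seconds

print(round(careless / 3600, 2))       # 503.36 core-hours for the careless run
print(round(202 / CAR_KJ_PER_KM, 2))   # careful run: ~0.33 km of driving
print(round(52000 / CAR_KJ_PER_KM))    # careless run: ~86 km
```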
For computational scientists and modellers, applying good energy efficiency practices needs
to become second nature; following an energy efficiency 'recipe' or procedure is a route to
embedding this practice as a habit.
3 Developing energy efficient computing habits A recipe
1) Build a model of a system that contains only the essential ingredients that
allow exploration of the scientific question. This is one of the key factors that
determine the size of a model.
2) Find out how many cores per node there are on the available HPC cluster This
enables users to request the number of corestasks that minimizes inter-node
communication during a simulation
3) Choose the pseudopotentials to match the science. This ensures users don't use
pseudopotentials that are unnecessarily computationally expensive.
4) Carry out extensive convergence testing based on the minimum accuracy
required for the production run results eg
(i) Kinetic energy cut-off (depends on pseudopotential choice)
(ii) Grid scale and fine grid scale (depends on pseudopotential choice)
(iii) Size and orientation of model including eg number of bulk atoms
number of layers size of surface vacuum space etc
(iv) Number of k-points
These decrease the possibility of over-convergence and its associated
computational cost
5) Spend time optimising the param file properties described in Section B using a
small number of SCF cycles
a Data distribution Gvector k-points or mixed
b Number of tasks per node
c Optimization strategy
d Spin polarization
e Electronic energy (SCF) minimization method
This increases the chances of using resources efficiently due to matching the
model and material requirements to the simulation parameters
6) Optimise the script file This increases the efficient use of HPC resources
7) Submit the calculation and initially monitor it to check it's progressing as
expected. This reduces the chances of wasting computational time due to trivial
('Friday afternoon') mistakes.
8) Carry out your own energy efficient computing tests (and send your findings to
the JISCMAIL CASTEP mailing list)
9) Sit back and wait for the simulation to complete, basking in the knowledge that
the simulation is running as energy efficiently1 as a user can possibly make it.
4 What else can a user do
In addition to using the above recipe to embed energy-efficient computing habits a user
can take a number of actions to encourage the wider awareness and adoption of energy
efficient computing
a If the HPC cluster uses SLURM, use the 'sacct' command to check the
amount of energy consumed2 (in Joules) by a job; see Figure 6.
b If your local cluster uses a different job-scheduler ask your local IT helpdesk if
it has the facility to monitor the energy consumed by each HPC job
c Include the energy consumption of simulations in all forms of reports and
presentations eg informal talks posters peer reviewed journal articles social
media posts etc This will increase awareness of our role as environmentally
aware and conscientious computational scientists and users of HPC resources
1 It's highly probable that users can expand on the list of model properties and parameters described within this document to further optimise energy efficient computing.
2 'Note: Only in case of exclusive job allocation this value reflects the job's real energy consumption' - see https://slurm.schedmd.com/sacct.html
Figure 6 Examples of information about jobs output through SLURM's 'sacct' command (plus flags). Top: details of several jobs run from 20/03/2021; bottom: details for a specific job ID via the 'seff <jobID>' command.
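For reference, invocations of the kind shown in Figure 6 are sketched below; the job ID is a placeholder, and the ConsumedEnergy field is only populated where the cluster's accounting plugin records energy:

```shell
# Energy (Joules) and timing for one job - job ID is illustrative
sacct -j 1234567 --format=JobID,JobName,Elapsed,NNodes,ConsumedEnergy

# Jobs since a given date (cf. Figure 6, top)
sacct --starttime=2021-03-20 --format=JobID,JobName,Elapsed,State,ConsumedEnergy

# Per-job efficiency summary, where seff is installed (cf. Figure 6, bottom)
seff 1234567
```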
d Include estimates of the energy consumption of simulations in applications for
funding. Although not yet explicitly requested in EPSRC funding applications,
there is the expectation that UKRI's 2020 commitment to Environmental
Sustainability will filter down to all activities of its research councils, including
funding. This will mean that funding applicants will need to demonstrate their
awareness of the environmental impact of their proposed work. Become an
impressive pioneer and include environmental impact through energy
consumption in your next application.
5 What are the developers doing
The compilation of this document included a chat with several of the developers of CASTEP,
who are keen to help users run their software energy efficiently; they shared their plans and
projects in this field:
Parts of CASTEP have been programmed to run on GPUs, with up to a 15-fold
speed-up (for non-local functionals).
Work on a CASTEP simulator is underway that should reduce the number of
CASTEP calculations required per simulation by choosing an optimal parallel domain
decomposition and implementing timings for FFTs (the big parallel cost); it will also
estimate compute usage. This simulator will go a long way to providing the structure
needed to add energy efficiency to CASTEP and will be accessible through the
'--dryrun' command. The toy code is available on Bitbucket.
The developers recognise the need for energy consumption to be acknowledged as
an additional factor to be included in the cost of computational simulations They will
be planning their approach beyond the software itself such as including energy
efficient computing in their training courses
Acknowledgements
I acknowledge the support of the Supercomputing Wales project which is part-funded by the
European Regional Development Fund (ERDF) via Welsh Government
Thank you to the following CASTEP developers for their invaluable input and support for this
small project Dr Phil Hasnip and Prof Matt Probert (University of York) Prof Chris Pickard
(University of Cambridge) Dr Dominik Jochym (STFC) Prof Stewart Clark (University of
Durham) Thanks also to Dr Sue Thorne (STFC) and Dr Ed Bennett (Supercomputing
Wales) for sharing their research engineering perspectives
References
(1) Clark, S. J.; Segall, M. D.; Pickard, C. J.; Hasnip, P. J.; Probert, M. I. J.; Refson, K.; Payne, M. C. First Principles Methods Using CASTEP. Z. Krist. 2005, 220, 567-570.
(2) Vanderbilt, D. Soft Self-Consistent Pseudopotentials in a Generalized Eigenvalue Formalism. Phys. Rev. B 1990, 41, 7892-7895.
(3) Pickard, C. J. On-the-Fly Pseudopotential Generation in CASTEP. 2006.
(4) Refson, K.; Clark, S. J.; Tulip, P. Variational Density Functional Perturbation Theory for Dielectrics and Lattice Dynamics. Phys. Rev. B 2006, 73, 155114.
(5) Hamann, D. R.; Schlüter, M.; Chiang, C. Norm-Conserving Pseudopotentials. Phys. Rev. Lett. 1979, 43 (20), 1494-1497.
(6) BIOVIA, Dassault Systèmes. Materials Studio 2020; Dassault Systèmes: San Diego, 2019.
(7) Marzari, N.; Vanderbilt, D.; Payne, M. C. Ensemble Density Functional Theory for Ab Initio Molecular Dynamics of Metals and Finite-Temperature Insulators. Phys. Rev. Lett. 1997, 79, 1337-1340.
Figure 4 Top row A capped carbon nanotube (160 atoms) and bottom row a long carbon nanotube (1000 atoms) showing
long axes aligned in the x-direction (left) z-direction (middle) skewed (right)
Orientation ( atoms)
X (160)
Z (160)
Skewed (160)
X (1000)
Z (1000)
Skewed (1000)
Cores 5 5 5 60 60 60
Memoryprocess (MB) 884 882 882 2870 2870 2870
Peak memory use (MB) 1893 1885 1838 7077 7077 7077
Total time (secs) 392 359 409 3906 3908 5232
Overall parallel efficiencya
79 84 82 78 78 75
Relative total energy ( cores total time core-seconds)
1960 1795 2045 234360 234480 313920
Table 5 Single point energy calculations of carbon nanotubes shown as oriented in Fig 4 using ultrasoft pseudopotentials (280 eV cut-off energy) and 1 k-point aCalculated automatically by CASTEP
B Param file
Grid-scale
Although the ultrasofts require a smaller size of planewave basis set than the norm-
conserving they do need a finer electron density grid scale in the settings lsquogrid_scalersquo and
lsquofine_grid_scalersquo As shown in Table 6 the denser grid scale setting for the OTFG ultrasofts
(with the exception of the QC5 set) can almost double the calculation time over the larger
planewave hungry OTFG norm-conserving pseudopotentials that converge well under a less
dense grid
Type of pseudopotential
Norm-conserving
Ultrasoft OTFG Norm-conserving
OTFG Ultrasoft
OTFG Ultrasoft QC5 set
grid_scale fine_grid_scale
15175 2030 2030 15175 2030 2030 2030
Memoryprocess (MB)
792 681 666 680 731 2072 1007
Peak memory use (MB)
803 1070 777 791 956 2785 1590
Total time (secs) 89 150 55 136 221 250 109
Table 6 Single point energy calculations run on 5 cores showing the effects of different electron density grid settings
Data Distribution
Parallelizing over plane wave vectors (lsquoG-vectorsrsquo) k-points or a mix of the two has an
impact on computational efficiency as shown in Table 7
The default for a param file without the keyword lsquodata_distributionrsquo is to prioritize k-point
distribution across a number of cores (less than or equal to the number requested in the
submission script) that is a factor of the number of k-points see for example Table 7
columns 2 and 3 Inserting lsquodata_distribution kpointrsquo into the param file prioritizes and
optimizes the k-point distribution across the number of cores requested in the script In the
example tested selecting data distribution over kpoints increased the calculation time over
the default of no data distribution compare columns 3 and 5 of Table 7
Requesting G-vector distribution has the largest impact on calculation time, and combining this with requesting a number of cores that is also a factor of the number of k-points has the overall largest impact on reducing calculation time; see columns 6 and 7 of Table 7.
Requesting mixed data distribution has a similar impact on calculation time as not requesting any data distribution for 5 cores, but not for 6 cores: the 'mixed' distribution used 4-way k-point distribution, whereas the default (no request) applied 6-way distribution; compare columns 2 and 3 with 8 and 9.
For the small clay model system, the optimal efficiency was obtained using G-vector data distribution over 6 cores (852 core-seconds), and the least efficient choice was mixed data distribution over 6 cores (1584 core-seconds). These results are system-specific, and careful testing is needed to tailor the settings to different systems.
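Since the 'relative total energy' in Table 7 is simply requested cores multiplied by wall-clock time, the comparison can be sketched in a few lines (numbers taken from Table 7; the core-seconds metric is this document's proxy for energy consumed, not a CASTEP output):

```python
# Core-seconds as a proxy for energy: requested cores x wall-clock time.
# Numbers from Table 7 (ultrasoft pseudopotentials, 12 k-points).
runs = {
    "none/5":    (5, 295),
    "none/6":    (6, 199),
    "gvector/5": (5, 191),
    "gvector/6": (6, 142),
    "mixed/6":   (6, 264),
}

core_seconds = {name: cores * secs for name, (cores, secs) in runs.items()}
best = min(core_seconds, key=core_seconds.get)   # cheapest run
worst = max(core_seconds, key=core_seconds.get)  # most expensive run

print(best, core_seconds[best])    # gvector/6 852
print(worst, core_seconds[worst])  # mixed/6 1584
```

Note that the fastest wall-clock run is not automatically the cheapest: adding cores only pays off if the speed-up outweighs the extra cores consuming power.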
Number of tasks per node
This is invoked by adding 'num_proc_in_smp' to the param file; it controls the number of message passing interface (MPI) tasks placed in a shared-memory (SMP) group. The "all-to-all" communications are then done in three phases instead of one: (1) tasks within an SMP group collect their data together on a chosen "controller" task within their group; (2) the all-to-all is done between the controller tasks; (3) the controllers distribute the data back to the tasks in their SMP groups. For small core counts the overhead of the two extra phases makes this method slower than just doing an all-to-all; for large core counts the reduction in the all-to-all time more than compensates for the extra overhead, so it is faster. Indeed, the tests (shown in Table 8) reveal that invoking this flag fails to produce as large a speed-up as the flag 'data_distribution : gvector' (compare columns 3 and 9) on the test HPC cluster, Sunbird, reflecting the small core count requested. Generally speaking, the more cores in the G-vector group, the higher 'num_proc_in_smp' should be set (up to the physical number of cores on a node).
Column   Requested distribution (cores)   Actual data distribution   Memory/process (MB)   Peak memory use (MB)   Total time (secs)   Overall parallel efficiencya (%)   Relative total energy (core-seconds)
2        None (5 cores)       kpoint 4-way    1249   1581   295   99   1475
3        None (6 cores)       kpoint 6-way    1219   1561   199   96   1194
4        Kpoints (5 cores)    kpoint 5-way    1249   1581   292   98   1460
5        Kpoints (6 cores)    kpoint 6-way    1219   1561   226   96   1356
6        Gvector (5 cores)    Gvector 5-way   728    839    191   66   955
7        Gvector (6 cores)    Gvector 6-way   698    804    142   71   852
8        Mixed (5 cores)      kpoint 4-way    1249   1581   294   98   1470
9        Mixed (6 cores)      kpoint 4-way    1253   1585   264   96   1584
Table 7 Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, showing the effects of data distribution across different numbers of cores requested in the script file. 'Actual data distribution' means that reported by CASTEP on completion, in this and (where applicable) all following tables. 'Relative total energy' (cores x total time) assumes that each core requested by the script consumes X amount of electricity. aCalculated automatically by CASTEP
Column   num_proc_in_smp   Requested data_distribution   Actual data distribution   Memory/process (MB)   Peak memory use (MB)   Total time (secs)   Overall parallel efficiencya (%)
2        Default           None                          kpoint 4-way               1249                  1580                   222                 96
3        Default           Gvector                       Gvector 5-way              728                   837                    156                 66
4        2                 None                          kpoint 4-way               1249                  1581                   231                 98
5        2                 Gvector                       Gvector 5-way              728                   839                    171                 60
6        4                 None                          kpoint 4-way               1249                  1581                   230                 98
7        4                 Gvector                       Gvector 5-way              728                   844                    182                 56
8        5                 None                          kpoint 4-way               1249                  1581                   237                 98
9        5                 Gvector                       Gvector 5-way              728                   846                    183                 56
Table 8 Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, run on 5 cores, showing the effects of setting 'num_proc_in_smp' to 2, 4 and 5, both with and without the 'data_distribution : gvector' flag. 'Default' means 'num_proc_in_smp' absent from the param file. aCalculated automatically by CASTEP
Optimization strategy
This parameter has three settings and is invoked through the 'opt_strategy' flag in the param file:
Default - Balances speed and memory use. Wavefunction coefficients for all k-points in a calculation are kept in memory rather than paged to disk; some large work arrays are paged to disk.
Memory - Minimizes memory use. All wavefunctions and large work arrays are paged to disk.
Speed - Maximizes speed by not paging to disk.
This means that if a user runs a large-memory calculation, optimizing for memory could obviate the need to request additional cores, although the calculation will take longer; see Table 9 for comparisons.
opt_strategy                       Default   Memory   Speed
Memory/process (MB)                793       750      1249
Peak memory use (MB)               1566      1092     1581
Total time (secs)                  232       290      221
Overall parallel efficiencya (%)   94        97       96
Table 9 Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, run on 5 cores, showing the effects of optimizing for speed or memory. 'Default' means either omitting the 'opt_strategy' flag from the param file or adding it as 'opt_strategy : default'. aCalculated automatically by CASTEP
Spin polarization
If a system comprises an odd number of electrons, it might be important to differentiate between the spin-up and spin-down states of the odd electron. This directly affects the calculation time, effectively doubling it, as shown in Table 10.
param flag: spin_polarization      false   true
Memory/process (MB)                1249    1415
Peak memory use (MB)               1581    1710
Total time (secs)                  222     455
Overall parallel efficiencya (%)   96      98
Table 10 Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, run on 5 cores, showing the effects of spin polarization. aCalculated automatically by CASTEP
Electronic energy minimizer
Insulating systems often behave well during the self-consistent field (SCF) minimizations and converge smoothly using density mixing ('DM'). When SCF convergence is problematic and all attempts to tweak DM-related parameters have failed, it is necessary to turn to ensemble density functional theory (7) and accept the consequent (and considerable) increase in computational cost; see Table 11.
param flag: metals_method (electron minimization)   DM     EDFT
Memory/process (MB)                                 1249   1289
Peak memory use (MB)                                1581   1650
Total time (secs)                                   222    370
Overall parallel efficiencya (%)                    96     97
Table 11 Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, run on 5 cores, showing the effects of the electronic minimization method. 'DM' means density mixing and 'EDFT' ensemble density functional theory. aCalculated automatically by CASTEP
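Drawing the Section B flags together, an efficiency-minded .param file might look like the following sketch (the keywords are CASTEP's own; the values are illustrative, reflect the tests above, and should be convergence-tested against your system):

```text
task               : SinglePoint
cut_off_energy     : 370 eV        ! from convergence testing
grid_scale         : 2.0
fine_grid_scale    : 3.0
data_distribution  : gvector       ! largest effect on calculation time in these tests
! num_proc_in_smp  : 2             ! only helps at larger core counts
opt_strategy       : speed
spin_polarization  : false
metals_method      : dm            ! EDFT only if density mixing fails to converge
```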
C Submission script
Figure 5 An example HPC batch submission script
Figure 5 captures the script variables that affect HPC computational energy and usage
efficiency:
(i) The variable familiar to most HPC users describes the number of cores ('tasks')
requested for the simulation. Unless the calculation is memory-hungry, configure
the requested number of cores to sit on the fewest nodes, because this reduces
expensive node-to-node communication time.
(ii) Choosing the shortest job run time gives the calculation a better chance of
progressing through the job queue swiftly.
(iii) When not requesting use of all cores on a single node, remove the 'exclusive'
flag to accelerate progress through the job queue.
(iv) Using the most recent version of the software captures the latest upgrades and bug-
fixes that might otherwise slow down a calculation.
(v) Using the 'dryrun' tag provides a (very) broad estimate of the memory
requirements. In one example the estimate of peak memory use was a quarter of that
actually used during the simulation proper.
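Points (i)-(v) can be illustrated with a minimal SLURM batch script of the kind shown in Figure 5 (a sketch only: directive values, the module name and the executable name are assumptions to be adapted to your cluster):

```text
#!/bin/bash
#SBATCH --job-name=castep_run
#SBATCH --ntasks=6              # (i) request few cores...
#SBATCH --nodes=1               # ...placed on the fewest nodes
#SBATCH --time=00:30:00         # (ii) shortest realistic run time
                                # (iii) no --exclusive flag when not filling a node
module load castep/22.11        # (iv) most recent installed version (name assumed)

# (v) optional first step: broad memory estimate via a dry run
# castep.mpi --dryrun mymodel
srun castep.mpi mymodel
```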
D An (extreme) example
Clay mineral (Figure 2)                         Careful optimisation for energy efficiency   Careless - no optimisation for energy efficiency
Vacuum space                                    10 Å                                         10 Å
Pseudopotential, cut-off energy (eV)            Ultrasoft, 370                               OTFG-Ultrasoft, 599
K-points                                        3                                            12
grid_scale / fine_grid_scale                    2 / 3                                        3 / 4
num_proc_in_smp / requested data distribution   default / Gvector                            20 / none
Actual data distribution                        5-way Gvector only                           3-way Gvector, 12-way kpoint, 3-way (Gvector) smp
Optimization strategy                           Speed                                        Default
Spin polarization                               False                                        True
Electronic energy minimizer                     Density mixing                               EDFT
Number of cores requested                       5                                            40
RESULTS
Memory/process (MB), scratch disk (MB)          834, 0                                       1461, 6518
Peak memory use (MB)                            1066                                         9107
Total time (seconds)                            215                                          45302
Overall parallel efficiencya (%)                69                                           96
Relative total energy (cores x total time)      1075 core-seconds (0.30 core-hours)          1,812,080 core-seconds (503.36 core-hours)
kiloJoules used (approx)                        202                                          52,000
Table 12 One clay mineral model (Figure 2) with vacuum spaces of 10 Å. Single point energy calculations showing the difference between carefully optimizing for energy efficiency and carelessly running without pre-testing. aCalculated automatically by CASTEP
Table 12 illustrates the combined effects of many of the model properties and parameters
discussed in the previous section on the total time and overall use of computational
resources. It's unlikely a user would choose the whole combination of model properties and
parameters shown in the 'careless' column, but it nevertheless gives an idea of the impact a
user can have on the energy consumption of their simulations. For comparison, the cheapest
electric car listed in 2021 consumes 26.8 kWh per 100 miles, or roughly 603 kJ/km, which means
that the carelessly run simulation used the equivalent energy of driving this car about 86 km,
whereas the efficiently run simulation 'drove' it 0.33 km.
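The car-distance figures follow from simple arithmetic; a sketch using the numbers quoted above:

```python
# Convert the energy used by each simulation into an equivalent driving
# distance for the cheapest 2021 electric car (26.8 kWh per 100 miles,
# i.e. roughly 603 kJ/km, as quoted in the text).
CAR_KJ_PER_KM = 603

careful_kj = 202       # carefully optimized run (Table 12)
careless_kj = 52_000   # careless run (Table 12)

print(round(careless_kj / CAR_KJ_PER_KM, 1))  # -> 86.2 (km)
print(round(careful_kj / CAR_KJ_PER_KM, 2))   # -> 0.33 (km)
```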
For computational scientists and modellers, applying good energy efficiency practices needs
to become second nature; following an energy efficiency 'recipe' or procedure is a route to
embedding this practice as a habit.
3 Developing energy efficient computing habits A recipe
1) Build a model of a system that contains only the essential ingredients that
allow exploration of the scientific question. This is one of the key factors that
determines the size of a model.
2) Find out how many cores per node there are on the available HPC cluster This
enables users to request the number of corestasks that minimizes inter-node
communication during a simulation
3) Choose the pseudopotentials to match the science. This ensures users don't use
pseudopotentials that are unnecessarily computationally expensive.
4) Carry out extensive convergence testing based on the minimum accuracy
required for the production run results eg
(i) Kinetic energy cut-off (depends on pseudopotential choice)
(ii) Grid scale and fine grid scale (depends on pseudopotential choice)
(iii) Size and orientation of model including eg number of bulk atoms
number of layers size of surface vacuum space etc
(iv) Number of k-points
These decrease the possibility of over-convergence and its associated
computational cost
5) Spend time optimising the param file properties described in Section B using a
small number of SCF cycles
a Data distribution Gvector k-points or mixed
b Number of tasks per node
c Optimization strategy
d Spin polarization
e Electronic energy (SCF) minimization method
This increases the chances of using resources efficiently due to matching the
model and material requirements to the simulation parameters
6) Optimise the script file This increases the efficient use of HPC resources
7) Submit the calculation and initially monitor it to check it's progressing as
expected. This reduces the chances of wasting computational time due to trivial
('Friday afternoon') mistakes.
8) Carry out your own energy efficient computing tests (and send your findings to
the JISCMAIL CASTEP mailing list)
9) Sit back and wait for the simulation to complete, basking in the knowledge that
the simulation is running as energy efficiently1 as a user can possibly make it.
4 What else can a user do
In addition to using the above recipe to embed energy-efficient computing habits a user
can take a number of actions to encourage the wider awareness and adoption of energy
efficient computing
a If the HPC cluster uses SLURM, use the 'sacct' command to check the
amount of energy consumed2 (in joules) by a job; see Figure 6.
b If your local cluster uses a different job-scheduler ask your local IT helpdesk if
it has the facility to monitor the energy consumed by each HPC job
c Include the energy consumption of simulations in all forms of reports and
presentations eg informal talks posters peer reviewed journal articles social
media posts etc This will increase awareness of our role as environmentally
aware and conscientious computational scientists and users of HPC resources
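As a concrete sketch of point (a), the commands behind Figure 6 might look like the following (the 'ConsumedEnergy' field name is from the SLURM sacct documentation; whether it is populated depends on your cluster's energy-accounting configuration):

```text
# List recent jobs with elapsed time and consumed energy (joules)
sacct --starttime=2021-03-20 --format=JobID,JobName,Elapsed,ConsumedEnergy

# Efficiency summary for a single job
seff <jobID>
```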
1 It's highly probable that users can expand on the list of model properties and parameters described within this document to further optimise energy efficient computing. 2 'Note: Only in case of exclusive job allocation this value reflects the job's real energy consumption' - see https://slurm.schedmd.com/sacct.html
Figure 6 Examples of information about jobs output through SLURM's 'sacct' command (plus flags). Top: list of details about several jobs run from 20/03/2021; bottom: details for a specific job ID via the 'seff <jobID>' command
d Include estimates of the energy consumption of simulations in applications for
funding Although not yet explicitly requested in EPSRC funding applications
there is the expectation that UKRI's 2020 commitment to Environmental
Sustainability will filter down to all activities of its research councils including
funding This will mean that funding applicants will need to demonstrate their
awareness of the environmental impact of their proposed work Become an
impressive pioneer and include environmental impact through energy
consumption in your next application
5 What are the developers doing
The compilation of this document included a chat with several of the developers of CASTEP,
who are keen to help users run their software energy efficiently; they shared their plans and
projects in this field:
Parts of CASTEP have been programmed to run on GPUs with up to a 15-fold
speed-up (for non-local functionals)
Work on a CASTEP simulator is underway that should reduce the number of
CASTEP calculations required per simulation by choosing an optimal parallel domain
decomposition and implementing timings for FFTs (the big parallel cost); it will also
estimate compute usage. This simulator will go a long way to providing the structure
needed to add energy efficiency to CASTEP and will be accessible through the
'--dryrun' command. The toy code is available on Bitbucket.
The developers recognise the need for energy consumption to be acknowledged as
an additional factor to be included in the cost of computational simulations They will
be planning their approach beyond the software itself such as including energy
efficient computing in their training courses
Acknowledgements
I acknowledge the support of the Supercomputing Wales project which is part-funded by the
European Regional Development Fund (ERDF) via Welsh Government
Thank you to the following CASTEP developers for their invaluable input and support for this
small project Dr Phil Hasnip and Prof Matt Probert (University of York) Prof Chris Pickard
(University of Cambridge) Dr Dominik Jochym (STFC) Prof Stewart Clark (University of
Durham) Thanks also to Dr Sue Thorne (STFC) and Dr Ed Bennett (Supercomputing
Wales) for sharing their research engineering perspectives
References
(1) Clark, S. J.; Segall, M. D.; Pickard, C. J.; Hasnip, P. J.; Probert, M. I. J.; Refson, K.; Payne, M. C. First Principles Methods Using CASTEP. Z. Krist. 2005, 220, 567-570.
(2) Vanderbilt, D. Soft Self-Consistent Pseudopotentials in a Generalized Eigenvalue Formalism. Phys. Rev. B 1990, 41, 7892-7895.
(3) Pickard, C. J. On-the-Fly Pseudopotential Generation in CASTEP. 2006.
(4) Refson, K.; Clark, S. J.; Tulip, P. Variational Density Functional Perturbation Theory for Dielectrics and Lattice Dynamics. Phys. Rev. B 2006, 73, 155114.
(5) Hamann, D. R.; Schlüter, M.; Chiang, C. Norm-Conserving Pseudopotentials. Phys. Rev. Lett. 1979, 43 (20), 1494-1497.
(6) BIOVIA, Dassault Systèmes. Materials Studio 2020; Dassault Systèmes: San Diego, 2019.
(7) Marzari, N.; Vanderbilt, D.; Payne, M. C. Ensemble Density Functional Theory for Ab Initio Molecular Dynamics of Metals and Finite-Temperature Insulators. Phys. Rev. Lett. 1997, 79, 1337-1340.
Type of pseudopotential
Norm-conserving
Ultrasoft OTFG Norm-conserving
OTFG Ultrasoft
OTFG Ultrasoft QC5 set
grid_scale fine_grid_scale
15175 2030 2030 15175 2030 2030 2030
Memoryprocess (MB)
792 681 666 680 731 2072 1007
Peak memory use (MB)
803 1070 777 791 956 2785 1590
Total time (secs) 89 150 55 136 221 250 109
Table 6 Single point energy calculations run on 5 cores showing the effects of different electron density grid settings
Data Distribution
Parallelizing over plane wave vectors (lsquoG-vectorsrsquo) k-points or a mix of the two has an
impact on computational efficiency as shown in Table 7
The default for a param file without the keyword lsquodata_distributionrsquo is to prioritize k-point
distribution across a number of cores (less than or equal to the number requested in the
submission script) that is a factor of the number of k-points see for example Table 7
columns 2 and 3 Inserting lsquodata_distribution kpointrsquo into the param file prioritizes and
optimizes the k-point distribution across the number of cores requested in the script In the
example tested selecting data distribution over kpoints increased the calculation time over
the default of no data distribution compare columns 3 and 5 of Table 7
Requesting G-vector distribution has the largest impact on calculation time and combining
this with requesting a number of cores that is also a factor of the number of k-points has the
overall largest impact on reducing calculation time ndashsee columns 6 and 7 of Table 7
Requesting mixed data distribution has a similar impact on calculation time as not requesting
any data distribution for 5 cores but not for 6 cores the lsquomixedrsquo distribution used 4-way
kpoint distribution rather than the default (non-) request that applied 6-way distribution ndash
compare columns 2 and 3 with 8 and 9
For the small clay model system the optimal efficiency was obtained using G-vector data
distribution over 6 cores (852 core-seconds) and the least efficient choice was mixed data
distribution over 6 cores (1584 core-seconds) The results are system-specific and need
careful testing to tailor to different systems
Number of tasks per node
This is invoked by adding lsquonum_proc_in_smprsquo to the param file and controls the number of message parsing interface (MPI) tasks that are placed in a specifically OpenMP (SMP) group This means that the ldquoall-to-allrdquo communications is then done in three phases instead of one (1) tasks within an SMP collect their data together on a chosen ldquocontrollerrdquo task within their group (2) the ldquoall-to-allrdquo is done between the controller tasks (3) the controllers all distribute the data back to the tasks in their SMP groups For small core counts the overhead of the two extra phases makes this method slower than just doing an all-to-all for large core counts the reduction in the all-to-all time more than
compensates for the extra overhead so itrsquos faster Indeed the tests (shown in Table 8) reveal that invoking this flag fails to produce as large a speed-up as the flag lsquodata_distribution gvectorrsquo (compare columns 3 and 9) for the test HPC cluster ndash Sunbird reflecting the requested small core count Generally speaking the more cores in the G-vector group the higher you want to set ldquonum_proc_in_smprdquo (up to the physical number of cores on a node)
Column 1 2 3 4 5 6 7 8 9
Requested data distribution + cores in HPC submission script
None 5 cores
None 6 cores
Kpoints 5 cores
Kpoints 6 cores
Gvector 5 cores
Gvector 6 cores
Mixed 5 cores
Mixed 6 cores
Actual data distribution
kpoint 4-way
kpoint 6-way
kpoint 5-way
kpoint 6-way
Gvector 5-way
Gvector 6-way
kpoint 4-way
kpoint 4-way
Memoryprocess (MB)
1249 1219 1249 1219 728 698 1249 1253
Peak memory use (MB)
1581 1561 1581 1561 839 804 1581 1585
Total time (secs) 295 199 292 226 191 142 294 264
Overall parallel efficiencya
99 96 98 96 66 71 98 96
Relative total energy ( cores total time core-seconds)
1475 1194 1460 1356 955 852 1470 1584
Table 7 Single point energy calculations using ultrasoft pseudopotentials and 12 k-points showing the effects of data distribution across different numbers of cores requested in the script file lsquoActual data distributionrsquo means that reported by CASTEP on completion in this and (where applicable) all following Tables lsquoRelative total energyrsquo assumes that each core requested by the script consumes X amount of electricity aCalculated automatically by CASTEP
num_proc_in_smp Default 2 4 5
Requested data_distribution None Gvector None Gvector None Gvector None Gvector
Actual data distribution kpoint 4-way
Gvector 5-way
kpoint 4-way
Gvector 5-way
kpoint 4-way
Gvector 5-way
kpoint 4-way
Gvector 5-way
Memoryprocess (MB) 1249 728 1249 728 1249 728 1249 728
Peak memory use (MB) 1580 837 1581 839 1581 844 1581 846
Total time (secs) 222 156 231 171 230 182 237 183
Overall parallel efficiencya 96 66 98 60 98 56 96 56
Column 1 2 3 4 5 6 7 8 9
Table 8 Single point energy calculations using ultrasoft pseudopotentials and 12 k-points run on 5 cores showing the effects of setting lsquonum_proc_in_smp 2 4 5rsquo both with and without the lsquodata_distribution gvectorrsquo flag lsquoDefaultrsquo means lsquonum_proc_in_smprsquo absent from param file aCalculated automatically by CASTEP
Optimization strategy
This parameter has three settings and is invoked through the lsquoopt_strategyrsquo flag in the
param file
Default - Balances speed and memory use Wavefunction coefficients for all k-points
in a calculation will be kept in memory rather than be paged to disk Some large
work arrays will be paged to disk
Memory - Minimizes memory use All wavefunctions and large work arrays are paged
to disk
Speed - Maximizes speed by not paging to disk
This means that if a user runs a large memory calculation optimizing for memory could
obviate the need to request additional cores although the calculation will take longer - see
Table 9 for comparisons
opt_strategy Default Memory Speed
Memoryprocess (MB) 793 750 1249
Peak memory use (MB)
1566 1092 1581
Total time (secs) 232 290 221
Overall parallel efficiencya
94 97 96
Table 9 Single point energy calculations using ultrasoft pseudopotentials and 12 k-points run on 5 cores showing the effects of optimizing for speed or memory lsquoDefaultrsquo means either omitting the lsquoopt_strategyrsquo flag from the param file or adding it as lsquoopt_strategy defaultrsquo aCalculated automatically by CASTEP
Spin polarization
If a system comprises an odd number of electrons it might be important to differentiate
between the spin-up and spin-down states of the odd electron This directly affects the
calculation time effectively doubling it as shown in Table 10
param flag and setting
spin_polarization
false true
Memoryprocess (MB)
1249 1415
Peak memory use (MB)
1581 1710
Total time (secs) 222 455
Overall parallel efficiencya
96 98
Table 10 Single point energy calculations using ultrasoft pseudopotentials and 12 k-points run on 5 cores showing the effects of spin polarization aCalculated automatically by CASTEP
Electronic energy minimizer
Insulating systems often behave well during the self-consistent field (SCF) minimizations and
converge smoothly using density mixing (lsquoDMrsquo) When SCF convergence is problematic and
all attempts to tweak DM-related parameters have failed it is necessary to turn to ensemble
density functional theory7 and accept the consequent (and considerable) increase in
computational cost ndashsee Table 11
param flag and setting
metals_method (Electron minimization) DM EDFT
Memoryprocess (MB) 1249 1289 Peak memory use (MB) 1581 1650 Total time (secs) 222 370 Overall parallel efficiencya 96 97
Table 11 Single point energy calculations using ultrasoft pseudopotentials and 12 k-points run on 5 cores showing the effects of the electronic minimization method lsquoDMrsquo means density mixing and lsquoEDFTrsquo ensemble density functional theory
aCalculated automatically by CASTEP
C Script submission file
Figure 5 An example HPC batch submission script
Figure 5 captures the script variables that affect HPC computational energy and usage
efficiency
(i) The variable familiar to most HPC users describes the number of cores (lsquotasksrsquo)
requested for the simulation Unless the calculation is memory hungry configure
the requested number of cores to sit on the fewest nodes because this reduces
expensive node-to-node communication time
(ii) Choosing the shortest job run time gives the calculation a better chance of
progressing through the job queue swiftly
(iii) When not requesting use of all cores on a single node remove the lsquoexclusiversquo
flag to accelerate progress through the job queue
(iv) Using the most recent version of software captures the latest upgrades and bug-
fixes that might otherwise slow down a calculation
(v) Using the lsquodryrunrsquo tag provides a (very) broad estimate of the memory
requirements In one example the estimate of peak memory use was frac14 of that
actually used during the simulation proper
D An (extreme) example
Clay mineral (Figure 2) Careful optimisation for energy efficiency
Careless ndash no optimisation for energy efficiency
Vacuum space 10Aring Vacuum space 10 Aring
Pseudopotential and cut-off energy (eV)
Ultrasoft 370 OTFG-Ultrasoft 599
K-points 3 12
Grid-scale fine-grid-scale 2 3 3 4
num_proc_in_smprequested data distribution
default Gvector 20 none
Actual data distribution
5-way Gvector only 3-way Gvector12-way kpoint 3-way (Gvector) smp
Optimization strategy Speed Default
Spin polarization False True
Electronic energy minimizer Density mixing EDFT
Number of cores requested 5 40
RESULTS
Memoryprocess (MB) Scratch disk (MB)
834 0 1461 6518
Peak memory use (MB) 1066 9107
Total time (seconds) 215 45302
Overall parallel efficiencya 69 96
Relative total energy ( cores total time core-seconds core-hours)
1075 030
1 812080 50336
kiloJoules used (approx) 202 52000 Table 12 One clay mineral model (Figure 2) with vacuum spaces of 10Aring - Single point energy calculations showing the difference between carefully optimizing for energy efficiency and carelessly running without pre-testing aCalculated automatically by CASTEP
Table 12 illustrates the combined effects of many of the model properties and parameters
discussed in the previous section on the total time and overall use of computational
resources Itrsquos unlikely a user would choose the whole combination of model properties and
parameters shown in the lsquocarelessrsquo column but it nevertheless gives an idea of the impact a
user can have on the energy consumption of their simulations For comparison the cheapest
electric car listed in 2021 consumes 268 kWh per 100 miles or 603 kJkm which means
that the carelessly run simulation used the equivalent energy of driving this car about 86 km
whereas the efficiently run simulation lsquodroversquo it 033 km
For computational scientists and modellers applying good energy efficiency practices needs
to become second nature following an energy efficiency lsquorecipersquo or procedure is a route to
embedding this practice as a habit
3 Developing energy efficient computing habits A recipe
1) Build a model of a system that contains only the essential ingredients that
allows exploration of the scientific question This is one of the key factors that
determines the size of a model
2) Find out how many cores per node there are on the available HPC cluster This
enables users to request the number of corestasks that minimizes inter-node
communication during a simulation
3) Choose the pseudopotentials to match the science This ensures users donrsquot use
pseudopotentials that are unnecessarily computationally expensive
4) Carry out extensive convergence testing based on the minimum accuracy
required for the production run results eg
(i) Kinetic energy cut-off (depends on pseudopotential choice)
(ii) Grid scale and fine grid scale (depends on pseudopotential choice)
(iii) Size and orientation of model including eg number of bulk atoms
number of layers size of surface vacuum space etc
(iv) Number of k-points
These decrease the possibility of over-convergence and its associated
computational cost
5) Spend time optimising the param file properties described in Section B using a
small number of SCF cycles
a Data distribution Gvector k-points or mixed
b Number of tasks per node
c Optimization strategy
d Spin polarization
e Electronic energy (SCF) minimization method
This increases the chances of using resources efficiently due to matching the
model and material requirements to the simulation parameters
6) Optimise the script file This increases the efficient use of HPC resources
7) Submit the calculation and initially monitor it to check itrsquos progressing as
expected This reduces the chances of wasting computational time due to trivial
(lsquoFriday afternoonrsquo) mistakes
8) Carry out your own energy efficient computing tests (and send your findings to
the JISCMAIL CASTEP mailing list)
9) Sit back and wait for the simulation to complete basking in the knowledge that
the simulation is running as energy efficiently1 as a user can possibly make it
4 What else can a user do
In addition to using the above recipe to embed energy-efficient computing habits a user
can take a number of actions to encourage the wider awareness and adoption of energy
efficient computing
a If the HPC cluster uses SLURM use the lsquosacctrsquo command to check the
amount of energy consumed2 (in Joules) by a job -see Figure 6
b If your local cluster uses a different job-scheduler ask your local IT helpdesk if
it has the facility to monitor the energy consumed by each HPC job
c Include the energy consumption of simulations in all forms of reports and
compensates for the extra overhead, so it's faster. Indeed, the tests (shown in Table 8) reveal that invoking this flag fails to produce as large a speed-up as the flag 'data_distribution : gvector' (compare columns 3 and 9) on the test HPC cluster, Sunbird, reflecting the small core count requested. Generally speaking, the more cores there are in the G-vector group, the higher 'num_proc_in_smp' should be set (up to the physical number of cores on a node).
| Column | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Requested data distribution + cores in HPC submission script | None, 5 cores | None, 6 cores | Kpoints, 5 cores | Kpoints, 6 cores | Gvector, 5 cores | Gvector, 6 cores | Mixed, 5 cores | Mixed, 6 cores |
| Actual data distribution | kpoint 4-way | kpoint 6-way | kpoint 5-way | kpoint 6-way | Gvector 5-way | Gvector 6-way | kpoint 4-way | kpoint 4-way |
| Memory/process (MB) | 1249 | 1219 | 1249 | 1219 | 728 | 698 | 1249 | 1253 |
| Peak memory use (MB) | 1581 | 1561 | 1581 | 1561 | 839 | 804 | 1581 | 1585 |
| Total time (secs) | 295 | 199 | 292 | 226 | 191 | 142 | 294 | 264 |
| Overall parallel efficiency (%)(a) | 99 | 96 | 98 | 96 | 66 | 71 | 98 | 96 |
| Relative total energy (no. cores × total time, core-seconds) | 1475 | 1194 | 1460 | 1356 | 955 | 852 | 1470 | 1584 |

Table 7: Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, showing the effects of data distribution across different numbers of cores requested in the script file. 'Actual data distribution' means that reported by CASTEP on completion, in this and (where applicable) all following tables. 'Relative total energy' assumes that each core requested by the script consumes some fixed amount X of electricity. (a) Calculated automatically by CASTEP.
| Column | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| num_proc_in_smp | Default | Default | 2 | 2 | 4 | 4 | 5 | 5 |
| Requested data_distribution | None | Gvector | None | Gvector | None | Gvector | None | Gvector |
| Actual data distribution | kpoint 4-way | Gvector 5-way | kpoint 4-way | Gvector 5-way | kpoint 4-way | Gvector 5-way | kpoint 4-way | Gvector 5-way |
| Memory/process (MB) | 1249 | 728 | 1249 | 728 | 1249 | 728 | 1249 | 728 |
| Peak memory use (MB) | 1580 | 837 | 1581 | 839 | 1581 | 844 | 1581 | 846 |
| Total time (secs) | 222 | 156 | 231 | 171 | 230 | 182 | 237 | 183 |
| Overall parallel efficiency (%)(a) | 96 | 66 | 98 | 60 | 98 | 56 | 96 | 56 |

Table 8: Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, run on 5 cores, showing the effects of setting 'num_proc_in_smp' to 2, 4 and 5, both with and without the 'data_distribution : gvector' flag. 'Default' means 'num_proc_in_smp' is absent from the param file. (a) Calculated automatically by CASTEP.
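Concretely, the fastest combination in Tables 7 and 8 amounts to a single line in the param file; the sketch below is illustrative, and the best settings depend on your own cluster and requested core count, so re-test locally:

```
! Illustrative .param fragment
data_distribution : gvector   ! fastest setting in Tables 7 and 8
! num_proc_in_smp left at its default; setting it gave no further gain here
```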
Optimization strategy

This parameter has three settings and is invoked through the 'opt_strategy' flag in the param file:

- Default: balances speed and memory use. Wavefunction coefficients for all k-points in a calculation are kept in memory rather than paged to disk; some large work arrays are paged to disk.
- Memory: minimizes memory use. All wavefunctions and large work arrays are paged to disk.
- Speed: maximizes speed by not paging to disk.

This means that if a user runs a large-memory calculation, optimizing for memory could obviate the need to request additional cores, although the calculation will take longer; see Table 9 for comparisons.
| opt_strategy | Default | Memory | Speed |
| --- | --- | --- | --- |
| Memory/process (MB) | 793 | 750 | 1249 |
| Peak memory use (MB) | 1566 | 1092 | 1581 |
| Total time (secs) | 232 | 290 | 221 |
| Overall parallel efficiency (%)(a) | 94 | 97 | 96 |

Table 9: Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, run on 5 cores, showing the effects of optimizing for speed or memory. 'Default' means either omitting the 'opt_strategy' flag from the param file or adding it as 'opt_strategy : default'. (a) Calculated automatically by CASTEP.
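Invoking the memory-lean setting is a single param-file line (a sketch; choose the value that matches the job's actual constraints):

```
! Illustrative .param fragment
opt_strategy : memory   ! alternatives: default, speed
```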
Spin polarization

If a system comprises an odd number of electrons, it might be important to differentiate between the spin-up and spin-down states of the odd electron. This directly affects the calculation time, effectively doubling it, as shown in Table 10.
| param flag and setting | spin_polarization : false | spin_polarization : true |
| --- | --- | --- |
| Memory/process (MB) | 1249 | 1415 |
| Peak memory use (MB) | 1581 | 1710 |
| Total time (secs) | 222 | 455 |
| Overall parallel efficiency (%)(a) | 96 | 98 |

Table 10: Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, run on 5 cores, showing the effects of spin polarization. (a) Calculated automatically by CASTEP.
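In the param file this is a single switch, so it is worth confirming that the science genuinely needs the unpaired-electron treatment before accepting the roughly doubled run time (a sketch):

```
! Illustrative .param fragment
spin_polarization : true   ! omit, or set to false, for closed-shell systems
```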
Electronic energy minimizer

Insulating systems often behave well during self-consistent field (SCF) minimization and converge smoothly using density mixing ('DM'). When SCF convergence is problematic and all attempts to tweak DM-related parameters have failed, it is necessary to turn to ensemble density functional theory (7) ('EDFT') and accept the consequent (and considerable) increase in computational cost; see Table 11.
| param flag and setting | metals_method : dm | metals_method : edft |
| --- | --- | --- |
| Memory/process (MB) | 1249 | 1289 |
| Peak memory use (MB) | 1581 | 1650 |
| Total time (secs) | 222 | 370 |
| Overall parallel efficiency (%)(a) | 96 | 97 |

Table 11: Single point energy calculations using ultrasoft pseudopotentials and 12 k-points, run on 5 cores, showing the effects of the electronic minimization method. 'DM' means density mixing and 'EDFT' ensemble density functional theory. (a) Calculated automatically by CASTEP.
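The minimizer, too, is a one-line change in the param file (a sketch; DM remains the cheaper first choice):

```
! Illustrative .param fragment
metals_method : edft   ! fall back to this only when density mixing (dm) fails
```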
C Script submission file

Figure 5: An example HPC batch submission script.

Figure 5 captures the script variables that affect HPC computational energy and usage efficiency:

(i) The variable familiar to most HPC users describes the number of cores ('tasks') requested for the simulation. Unless the calculation is memory-hungry, configure the requested number of cores to sit on the fewest nodes, because this reduces expensive node-to-node communication time.
(ii) Choosing the shortest realistic job run time gives the calculation a better chance of progressing through the job queue swiftly.
(iii) When not requesting all the cores on a single node, remove the 'exclusive' flag to accelerate progress through the job queue.
(iv) Using the most recent version of the software captures the latest upgrades and bug-fixes that might otherwise slow down a calculation.
(v) Using the '--dryrun' flag provides a (very) broad estimate of the memory requirements. In one example, the estimate of peak memory use was a quarter of that actually used during the simulation proper.
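For readers without access to Figure 5, a minimal sketch of such a submission script is shown below; the partition, account, module version and seed name are all assumptions to be replaced with local values:

```
#!/bin/bash
#SBATCH --job-name=clay_spe
#SBATCH --ntasks=5            # (i) few cores, placed on as few nodes as possible
#SBATCH --nodes=1
#SBATCH --time=00:30:00       # (ii) shortest realistic run time
                              # (iii) no --exclusive flag: share the node
#SBATCH --partition=compute   # assumption: local partition name
#SBATCH --account=my_project  # assumption: local account name

module load castep/21.11      # (iv) most recent locally installed version

# (v) optional: run 'castep.mpi --dryrun clay' first for a rough memory estimate
srun castep.mpi clay          # 'clay' is an illustrative seed name
```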
D An (extreme) example

| Clay mineral (Figure 2) | Careful: optimised for energy efficiency | Careless: no optimisation for energy efficiency |
| --- | --- | --- |
| Vacuum space | 10 Å | 10 Å |
| Pseudopotential and cut-off energy (eV) | Ultrasoft, 370 | OTFG-Ultrasoft, 599 |
| K-points | 3 | 12 |
| Grid-scale / fine-grid-scale | 2 / 3 | 3 / 4 |
| num_proc_in_smp / requested data distribution | default / Gvector | 20 / none |
| Actual data distribution | 5-way Gvector only | 3-way Gvector, 12-way kpoint, 3-way (Gvector) smp |
| Optimization strategy | Speed | Default |
| Spin polarization | False | True |
| Electronic energy minimizer | Density mixing | EDFT |
| Number of cores requested | 5 | 40 |
| RESULTS | | |
| Memory/process (MB); scratch disk (MB) | 834; 0 | 1461; 6518 |
| Peak memory use (MB) | 1066 | 9107 |
| Total time (seconds) | 215 | 45302 |
| Overall parallel efficiency (%)(a) | 69 | 96 |
| Relative total energy (no. cores × total time) | 1075 core-seconds (0.30 core-hours) | 1,812,080 core-seconds (503.36 core-hours) |
| kiloJoules used (approx.) | 202 | 52,000 |

Table 12: One clay mineral model (Figure 2) with vacuum spaces of 10 Å. Single point energy calculations showing the difference between carefully optimizing for energy efficiency and carelessly running without pre-testing. (a) Calculated automatically by CASTEP.
Table 12 illustrates the combined effects, on the total time and overall use of computational resources, of many of the model properties and parameters discussed in the previous section. It's unlikely a user would choose the whole combination of model properties and parameters shown in the 'careless' column, but it nevertheless gives an idea of the impact a user can have on the energy consumption of their simulations. For comparison, the cheapest electric car listed in 2021 consumes 26.8 kWh per 100 miles, or 603 kJ/km, which means that the carelessly run simulation used the equivalent energy of driving this car about 86 km, whereas the efficiently run simulation 'drove' it 0.33 km.
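The arithmetic behind Table 12 and the car comparison can be reproduced in a few lines (the 603 kJ/km figure is taken from the text above):

```python
# Reproduce the 'relative total energy' and car-equivalent figures of Table 12.
def core_seconds(cores, wall_time_s):
    """Proxy for energy use: number of requested cores x wall-clock time."""
    return cores * wall_time_s

careful = core_seconds(5, 215)        # careful run: 5 cores for 215 s
careless = core_seconds(40, 45302)    # careless run: 40 cores for 45302 s

KJ_PER_KM = 603                       # cheapest 2021 EV: 26.8 kWh per 100 miles
careful_km = 202 / KJ_PER_KM          # km driven on 202 kJ
careless_km = 52000 / KJ_PER_KM       # km driven on 52,000 kJ

print(careful, careless, round(careful_km, 2), round(careless_km))
# prints: 1075 1812080 0.33 86
```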
For computational scientists and modellers, applying good energy efficiency practices needs to become second nature; following an energy efficiency 'recipe' or procedure is a route to embedding this practice as a habit.
3 Developing energy efficient computing habits: a recipe

1) Build a model of a system that contains only the essential ingredients that allow exploration of the scientific question. This is one of the key factors that determine the size of a model.
2) Find out how many cores per node there are on the available HPC cluster. This enables users to request the number of cores/tasks that minimizes inter-node communication during a simulation.
3) Choose the pseudopotentials to match the science. This ensures users don't use pseudopotentials that are unnecessarily computationally expensive.
4) Carry out extensive convergence testing based on the minimum accuracy required for the production-run results, e.g.:
   (i) kinetic energy cut-off (depends on pseudopotential choice);
   (ii) grid scale and fine grid scale (depend on pseudopotential choice);
   (iii) size and orientation of the model, including e.g. number of bulk atoms, number of layers, size of surface vacuum space, etc.;
   (iv) number of k-points.
   These decrease the possibility of over-convergence and its associated computational cost.
5) Spend time optimising the param file properties described in Section B, using a small number of SCF cycles:
   a. data distribution: Gvector, k-points or mixed;
   b. number of tasks per node;
   c. optimization strategy;
   d. spin polarization;
   e. electronic energy (SCF) minimization method.
   This increases the chances of using resources efficiently, by matching the model and material requirements to the simulation parameters.
6) Optimise the script file. This increases the efficient use of HPC resources.
7) Submit the calculation and initially monitor it to check it's progressing as expected. This reduces the chances of wasting computational time due to trivial ('Friday afternoon') mistakes.
8) Carry out your own energy efficient computing tests (and send your findings to the JISCMAIL CASTEP mailing list).
9) Sit back and wait for the simulation to complete, basking in the knowledge that the simulation is running as energy efficiently(1) as a user can possibly make it.
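Step 4(i) lends itself to light scripting. The sketch below generates a family of param-file texts spanning candidate cut-off energies; the seed name, task and functional are illustrative assumptions:

```python
# Generate .param file contents for a kinetic-energy cut-off convergence scan.
CUTOFFS_EV = range(300, 701, 50)  # candidate cut-offs (eV): 300, 350, ..., 700

def param_text(cutoff_ev):
    """Return a minimal single-point .param file for one cut-off value."""
    return (
        "task           : singlepoint\n"
        f"cut_off_energy : {cutoff_ev} eV\n"
        "xc_functional  : PBE\n"
    )

# One file per cut-off, e.g. clay_300eV.param, clay_350eV.param, ...
files = {f"clay_{ec}eV.param": param_text(ec) for ec in CUTOFFS_EV}
print(len(files))  # prints: 9
```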
4 What else can a user do?

In addition to using the above recipe to embed energy-efficient computing habits, a user can take a number of actions to encourage wider awareness and adoption of energy efficient computing:

a. If the HPC cluster uses SLURM, use the 'sacct' command to check the amount of energy consumed(2) (in Joules) by a job; see Figure 6.
b. If your local cluster uses a different job scheduler, ask your local IT helpdesk whether it has the facility to monitor the energy consumed by each HPC job.
c. Include the energy consumption of simulations in all forms of reports and presentations, e.g. informal talks, posters, peer-reviewed journal articles, social media posts, etc. This will increase awareness of our role as environmentally aware and conscientious computational scientists and users of HPC resources.

(1) It's highly probable that users can expand on the list of model properties and parameters described within this document to further optimise energy efficient computing.
(2) 'Note: Only in case of exclusive job allocation this value reflects the job's real energy consumption' - see https://slurm.schedmd.com/sacct.html

Figure 6: Examples of information about jobs output through SLURM's 'sacct' command (plus flags). Top: details of several jobs run from 20/03/2021; bottom: details for a specific job ID via the 'seff <jobID>' command.
d. Include estimates of the energy consumption of simulations in applications for funding. Although not yet explicitly requested in EPSRC funding applications, there is the expectation that UKRI's 2020 commitment to Environmental Sustainability will filter down to all activities of its research councils, including funding. This will mean that funding applicants will need to demonstrate their awareness of the environmental impact of their proposed work. Become an impressive pioneer and include environmental impact through energy consumption in your next application.
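For item (a), the relevant SLURM queries look like the following; the job ID is a placeholder, and 'ConsumedEnergy' is only meaningful on clusters with energy accounting enabled (and, per the footnote, only for exclusive allocations):

```
# Energy (Joules) and elapsed time for a finished job:
sacct -j 123456 --format=JobID,JobName,Elapsed,ConsumedEnergy

# Per-job efficiency summary, where the seff utility is installed:
seff 123456
```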
5 What are the developers doing?

The compilation of this document included a chat with several of the developers of CASTEP, who are keen to help users run their software energy efficiently; they shared their plans and projects in this field:

- Parts of CASTEP have been programmed to run on GPUs, with up to a 15-fold speed-up (for non-local functionals).
- Work is underway on a CASTEP simulator that should reduce the number of CASTEP calculations required per simulation by choosing an optimal parallel domain decomposition and implementing timings for FFTs (the big parallel cost); it will also estimate compute usage. This simulator will go a long way towards providing the structure needed to add energy efficiency to CASTEP, and will be accessible through the '--dryrun' command. The toy code is available on Bitbucket.
- The developers recognise the need for energy consumption to be acknowledged as an additional factor in the cost of computational simulations. They are planning their approach beyond the software itself, such as including energy efficient computing in their training courses.
Acknowledgements

I acknowledge the support of the Supercomputing Wales project, which is part-funded by the European Regional Development Fund (ERDF) via the Welsh Government.

Thank you to the following CASTEP developers for their invaluable input and support for this small project: Dr Phil Hasnip and Prof Matt Probert (University of York), Prof Chris Pickard (University of Cambridge), Dr Dominik Jochym (STFC) and Prof Stewart Clark (University of Durham). Thanks also to Dr Sue Thorne (STFC) and Dr Ed Bennett (Supercomputing Wales) for sharing their research engineering perspectives.
References

(1) Clark, S. J.; Segall, M. D.; Pickard, C. J.; Hasnip, P. J.; Probert, M. I. J.; Refson, K.; Payne, M. C. First Principles Methods Using CASTEP. Z. Krist. 2005, 220, 567-570.
(2) Vanderbilt, D. Soft Self-Consistent Pseudopotentials in a Generalized Eigenvalue Formalism. Phys. Rev. B 1990, 41, 7892-7895.
(3) Pickard, C. J. On-the-Fly Pseudopotential Generation in CASTEP. 2006.
(4) Refson, K.; Clark, S. J.; Tulip, P. Variational Density Functional Perturbation Theory for Dielectrics and Lattice Dynamics. Phys. Rev. B 2006, 73, 155114.
(5) Hamann, D. R.; Schlüter, M.; Chiang, C. Norm-Conserving Pseudopotentials. Phys. Rev. Lett. 1979, 43 (20), 1494-1497.
(6) BIOVIA, Dassault Systèmes. Materials Studio 2020; Dassault Systèmes: San Diego, 2019.
(7) Marzari, N.; Vanderbilt, D.; Payne, M. C. Ensemble Density Functional Theory for Ab Initio Molecular Dynamics of Metals and Finite-Temperature Insulators. Phys. Rev. Lett. 1997, 79, 1337-1340.
Optimization strategy
This parameter has three settings and is invoked through the lsquoopt_strategyrsquo flag in the
param file
Default - Balances speed and memory use Wavefunction coefficients for all k-points
in a calculation will be kept in memory rather than be paged to disk Some large
work arrays will be paged to disk
Memory - Minimizes memory use All wavefunctions and large work arrays are paged
to disk
Speed - Maximizes speed by not paging to disk
This means that if a user runs a large memory calculation optimizing for memory could
obviate the need to request additional cores although the calculation will take longer - see
Table 9 for comparisons
opt_strategy Default Memory Speed
Memoryprocess (MB) 793 750 1249
Peak memory use (MB)
1566 1092 1581
Total time (secs) 232 290 221
Overall parallel efficiencya
94 97 96
Table 9 Single point energy calculations using ultrasoft pseudopotentials and 12 k-points run on 5 cores showing the effects of optimizing for speed or memory lsquoDefaultrsquo means either omitting the lsquoopt_strategyrsquo flag from the param file or adding it as lsquoopt_strategy defaultrsquo aCalculated automatically by CASTEP
Spin polarization
If a system comprises an odd number of electrons it might be important to differentiate
between the spin-up and spin-down states of the odd electron This directly affects the
calculation time effectively doubling it as shown in Table 10
param flag and setting
spin_polarization
false true
Memoryprocess (MB)
1249 1415
Peak memory use (MB)
1581 1710
Total time (secs) 222 455
Overall parallel efficiencya
96 98
Table 10 Single point energy calculations using ultrasoft pseudopotentials and 12 k-points run on 5 cores showing the effects of spin polarization aCalculated automatically by CASTEP
Electronic energy minimizer
Insulating systems often behave well during the self-consistent field (SCF) minimizations and
converge smoothly using density mixing (lsquoDMrsquo) When SCF convergence is problematic and
all attempts to tweak DM-related parameters have failed it is necessary to turn to ensemble
density functional theory7 and accept the consequent (and considerable) increase in
computational cost ndashsee Table 11
param flag and setting
metals_method (Electron minimization) DM EDFT
Memoryprocess (MB) 1249 1289 Peak memory use (MB) 1581 1650 Total time (secs) 222 370 Overall parallel efficiencya 96 97
Table 11 Single point energy calculations using ultrasoft pseudopotentials and 12 k-points run on 5 cores showing the effects of the electronic minimization method lsquoDMrsquo means density mixing and lsquoEDFTrsquo ensemble density functional theory
aCalculated automatically by CASTEP
C Script submission file
Figure 5 An example HPC batch submission script
Figure 5 captures the script variables that affect HPC computational energy and usage
efficiency
(i) The variable familiar to most HPC users describes the number of cores (lsquotasksrsquo)
requested for the simulation Unless the calculation is memory hungry configure
the requested number of cores to sit on the fewest nodes because this reduces
expensive node-to-node communication time
(ii) Choosing the shortest job run time gives the calculation a better chance of
progressing through the job queue swiftly
(iii) When not requesting use of all cores on a single node remove the lsquoexclusiversquo
flag to accelerate progress through the job queue
(iv) Using the most recent version of software captures the latest upgrades and bug-
fixes that might otherwise slow down a calculation
(v) Using the lsquodryrunrsquo tag provides a (very) broad estimate of the memory
requirements In one example the estimate of peak memory use was frac14 of that
actually used during the simulation proper
D An (extreme) example
Clay mineral (Figure 2) Careful optimisation for energy efficiency
Careless ndash no optimisation for energy efficiency
Vacuum space 10Aring Vacuum space 10 Aring
Pseudopotential and cut-off energy (eV)
Ultrasoft 370 OTFG-Ultrasoft 599
K-points 3 12
Grid-scale fine-grid-scale 2 3 3 4
num_proc_in_smprequested data distribution
default Gvector 20 none
Actual data distribution
5-way Gvector only 3-way Gvector12-way kpoint 3-way (Gvector) smp
Optimization strategy Speed Default
Spin polarization False True
Electronic energy minimizer Density mixing EDFT
Number of cores requested 5 40
RESULTS
Memoryprocess (MB) Scratch disk (MB)
834 0 1461 6518
Peak memory use (MB) 1066 9107
Total time (seconds) 215 45302
Overall parallel efficiencya 69 96
Relative total energy ( cores total time core-seconds core-hours)
1075 030
1 812080 50336
kiloJoules used (approx) 202 52000 Table 12 One clay mineral model (Figure 2) with vacuum spaces of 10Aring - Single point energy calculations showing the difference between carefully optimizing for energy efficiency and carelessly running without pre-testing aCalculated automatically by CASTEP
Table 12 illustrates the combined effects of many of the model properties and parameters
discussed in the previous section on the total time and overall use of computational
resources Itrsquos unlikely a user would choose the whole combination of model properties and
parameters shown in the lsquocarelessrsquo column but it nevertheless gives an idea of the impact a
user can have on the energy consumption of their simulations For comparison the cheapest
electric car listed in 2021 consumes 268 kWh per 100 miles or 603 kJkm which means
that the carelessly run simulation used the equivalent energy of driving this car about 86 km
whereas the efficiently run simulation lsquodroversquo it 033 km
For computational scientists and modellers applying good energy efficiency practices needs
to become second nature following an energy efficiency lsquorecipersquo or procedure is a route to
embedding this practice as a habit
3 Developing energy efficient computing habits A recipe
1) Build a model of a system that contains only the essential ingredients that
allows exploration of the scientific question This is one of the key factors that
determines the size of a model
2) Find out how many cores per node there are on the available HPC cluster This
enables users to request the number of corestasks that minimizes inter-node
communication during a simulation
3) Choose the pseudopotentials to match the science This ensures users donrsquot use
pseudopotentials that are unnecessarily computationally expensive
4) Carry out extensive convergence testing based on the minimum accuracy
required for the production run results eg
(i) Kinetic energy cut-off (depends on pseudopotential choice)
(ii) Grid scale and fine grid scale (depends on pseudopotential choice)
(iii) Size and orientation of model including eg number of bulk atoms
number of layers size of surface vacuum space etc
(iv) Number of k-points
These decrease the possibility of over-convergence and its associated
computational cost
5) Spend time optimising the param file properties described in Section B using a
small number of SCF cycles
a Data distribution Gvector k-points or mixed
b Number of tasks per node
c Optimization strategy
d Spin polarization
e Electronic energy (SCF) minimization method
This increases the chances of using resources efficiently due to matching the
model and material requirements to the simulation parameters
6) Optimise the script file This increases the efficient use of HPC resources
7) Submit the calculation and initially monitor it to check itrsquos progressing as
expected This reduces the chances of wasting computational time due to trivial
(lsquoFriday afternoonrsquo) mistakes
8) Carry out your own energy efficient computing tests (and send your findings to
the JISCMAIL CASTEP mailing list)
9) Sit back and wait for the simulation to complete basking in the knowledge that
the simulation is running as energy efficiently1 as a user can possibly make it
4 What else can a user do
In addition to using the above recipe to embed energy-efficient computing habits a user
can take a number of actions to encourage the wider awareness and adoption of energy
efficient computing
a If the HPC cluster uses SLURM use the lsquosacctrsquo command to check the
amount of energy consumed2 (in Joules) by a job -see Figure 6
b If your local cluster uses a different job-scheduler ask your local IT helpdesk if
it has the facility to monitor the energy consumed by each HPC job
c Include the energy consumption of simulations in all forms of reports and
presentations eg informal talks posters peer reviewed journal articles social
media posts etc This will increase awareness of our role as environmentally
aware and conscientious computational scientists and users of HPC resources
1 Itrsquos highly probable that users can expand on the list of model properties and parameters described within this document to further optimise energy efficient computing 2 lsquoNote Only in case of exclusive job allocation this value reflects the jobs real energy consumptionrsquo - see httpsslurmschedmdcomsaccthtml
Figure 6 Examples of information about jobs output through SLURMrsquos lsquosacctrsquo command (plus flags) Top list of details about several jobs run from 20032021 bottom details for a specific job ID via the lsquoseff ltjobIDgtrsquo command
d Include estimates of the energy consumption of simulations in applications for
funding Although not yet explicitly requested in EPSRC funding applications
there is the expectation that UKRIrsquos 2020 commitment to Environmental
Sustainability will filter down to all activities of its research councils including
funding This will mean that funding applicants will need to demonstrate their
awareness of the environmental impact of their proposed work Become an
impressive pioneer and include environmental impact through energy
consumption in your next application
5 What are the developers doing
The compilation of this document included a chat with several of the developers of CASTEP
who are keen to help users run their software energy efficiently they shared their plans and
projects in this field
Parts of CASTEP have been programmed to run on GPUs with up to a 15-fold
speed-up (for non-local functionals)
Work on a CASTEP simulator is underway that should reduce the number of
CASTEP calculations required per simulation by choosing an optimal parallel domain
decomposition and implementing timings for FFTs ndash the big parallel cost also it will
estimate compute usage This simulator will go a long way to providing the structure
needed to add energy efficiency to CASTEP and will be accessible through the
rsquo- -dryrunrsquo command The toy code is available in bitbucket
The developers recognise the need for energy consumption to be acknowledged as
an additional factor to be included in the cost of computational simulations They will
be planning their approach beyond the software itself such as including energy
efficient computing in their training courses
Acknowledgements
I acknowledge the support of the Supercomputing Wales project which is part-funded by the
European Regional Development Fund (ERDF) via Welsh Government
Thank you to the following CASTEP developers for their invaluable input and support for this
small project Dr Phil Hasnip and Prof Matt Probert (University of York) Prof Chris Pickard
(University of Cambridge) Dr Dominik Jochym (STFC) Prof Stewart Clark (University of
Durham) Thanks also to Dr Sue Thorne (STFC) and Dr Ed Bennett (Supercomputing
Wales) for sharing their research engineering perspectives
References
(1) Clark S J Segall M D Pickard C J Hasnip P J Probert M I J Refson K Payne M C First Principles Methods Using CASTEP Z Krist 2005 220 567ndash570
(2) Vanderbilt D Soft Self-Consistent Pseudopotentials in a Generalized Eigenvalue Formalism Phys Rev B 1990 41 7892ndash7895
(3) Pickard C J On-the-Fly Pseudopotential Generation in CASTEP 2006 (4) Refson K Clark S J Tulip P Variational Density Functional Perturbation Theory for
Dielectrics and Lattice Dynamics Phys Rev B 2006 73 155114 (5) Hamann D R Schluumlter M Chiang C Norm-Conserving Pseudopotentials Phys
Rev Lett 1979 43 (20) 1494ndash1497 (6) BIOVIA Dassault Systegravemes Materials Studio 2020 Dassault Systegravemes San Diego
2019 (7) Marzari N Vanderbilt D Payne M C Ensemble Density Functional Theory for Ab
Initio Molecular Dynamics of Metals and Finite-Temperature Insulators Phys Rev Lett 1997 79 1337ndash1340
param flag and setting
metals_method (Electron minimization) DM EDFT
Memoryprocess (MB) 1249 1289 Peak memory use (MB) 1581 1650 Total time (secs) 222 370 Overall parallel efficiencya 96 97
Table 11 Single point energy calculations using ultrasoft pseudopotentials and 12 k-points run on 5 cores showing the effects of the electronic minimization method lsquoDMrsquo means density mixing and lsquoEDFTrsquo ensemble density functional theory
aCalculated automatically by CASTEP
C Script submission file
Figure 5 An example HPC batch submission script
Figure 5 captures the script variables that affect HPC computational energy and usage
efficiency
(i) The variable familiar to most HPC users describes the number of cores (lsquotasksrsquo)
requested for the simulation Unless the calculation is memory hungry configure
the requested number of cores to sit on the fewest nodes because this reduces
expensive node-to-node communication time
(ii) Choosing the shortest job run time gives the calculation a better chance of
progressing through the job queue swiftly
(iii) When not requesting use of all cores on a single node remove the lsquoexclusiversquo
flag to accelerate progress through the job queue
(iv) Using the most recent version of software captures the latest upgrades and bug-
fixes that might otherwise slow down a calculation
(v) Using the lsquodryrunrsquo tag provides a (very) broad estimate of the memory
requirements In one example the estimate of peak memory use was frac14 of that
actually used during the simulation proper
D An (extreme) example
Clay mineral (Figure 2) Careful optimisation for energy efficiency
Careless ndash no optimisation for energy efficiency
Vacuum space 10Aring Vacuum space 10 Aring
Pseudopotential and cut-off energy (eV)
Ultrasoft 370 OTFG-Ultrasoft 599
K-points 3 12
Grid-scale fine-grid-scale 2 3 3 4
num_proc_in_smprequested data distribution
default Gvector 20 none
Actual data distribution
5-way Gvector only 3-way Gvector12-way kpoint 3-way (Gvector) smp
Optimization strategy Speed Default
Spin polarization False True
Electronic energy minimizer Density mixing EDFT
Number of cores requested 5 40
RESULTS
Memoryprocess (MB) Scratch disk (MB)
834 0 1461 6518
Peak memory use (MB) 1066 9107
Total time (seconds) 215 45302
Overall parallel efficiencya 69 96
Relative total energy ( cores total time core-seconds core-hours)
1075 030
1 812080 50336
kiloJoules used (approx) 202 52000 Table 12 One clay mineral model (Figure 2) with vacuum spaces of 10Aring - Single point energy calculations showing the difference between carefully optimizing for energy efficiency and carelessly running without pre-testing aCalculated automatically by CASTEP
Table 12 illustrates the combined effects, on the total time and overall use of computational
resources, of many of the model properties and parameters discussed in the previous section.
It's unlikely that a user would choose the whole combination of model properties and
parameters shown in the 'careless' column, but it nevertheless gives an idea of the impact a
user can have on the energy consumption of their simulations. For comparison, the cheapest
electric car listed in 2021 consumes 26.8 kWh per 100 miles, or 603 kJ/km, which means
that the carelessly run simulation used the equivalent energy of driving this car about 86 km,
whereas the efficiently run simulation 'drove' it 0.33 km.
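The arithmetic behind these comparisons is easy to reproduce. A minimal sketch, using the figures from Table 12 and the report's 603 kJ/km figure for the car:

```python
# Reproduce the cost figures in Table 12 and the electric-car comparison.
# Core counts, run times and kJ values are taken from the table;
# 603 kJ/km is the report's figure for the cheapest 2021 electric car.

def cost(cores, seconds, kilojoules, kj_per_km=603.0):
    core_seconds = cores * seconds          # total compute consumed
    core_hours = core_seconds / 3600.0      # the usual HPC accounting unit
    km_equivalent = kilojoules / kj_per_km  # distance the car could drive
    return core_seconds, core_hours, km_equivalent

careful = cost(cores=5, seconds=215, kilojoules=202)
careless = cost(cores=40, seconds=45302, kilojoules=52000)

print(careful)    # (1075, ~0.30 core-hours, ~0.33 km)
print(careless)   # (1812080, ~503.36 core-hours, ~86 km)
```

The careless run therefore costs roughly 1700 times the compute and 250 times the energy of the careful one.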
For computational scientists and modellers, applying good energy-efficiency practices needs
to become second nature; following an energy-efficiency 'recipe', or procedure, is a route to
embedding this practice as a habit.
3 Developing energy efficient computing habits: A recipe
1) Build a model of a system that contains only the essential ingredients that
allow exploration of the scientific question. This is one of the key factors that
determines the size of a model.
2) Find out how many cores per node there are on the available HPC cluster. This
enables users to request the number of cores/tasks that minimizes inter-node
communication during a simulation.
3) Choose the pseudopotentials to match the science. This ensures users don't use
pseudopotentials that are unnecessarily computationally expensive.
4) Carry out extensive convergence testing based on the minimum accuracy
required for the production-run results, e.g.:
(i) kinetic energy cut-off (depends on pseudopotential choice);
(ii) grid scale and fine grid scale (depend on pseudopotential choice);
(iii) size and orientation of the model, including e.g. number of bulk atoms,
number of layers, size of surface vacuum space, etc.;
(iv) number of k-points.
These decrease the possibility of over-convergence and its associated
computational cost.
5) Spend time optimising the param-file properties described in Section B, using a
small number of SCF cycles:
a. data distribution: Gvector, k-points or mixed;
b. number of tasks per node;
c. optimization strategy;
d. spin polarization;
e. electronic energy (SCF) minimization method.
This increases the chances of using resources efficiently, by matching the
simulation parameters to the model and material requirements.
6) Optimise the script file. This increases the efficient use of HPC resources.
7) Submit the calculation and initially monitor it to check it's progressing as
expected. This reduces the chances of wasting computational time due to trivial
('Friday afternoon') mistakes.
8) Carry out your own energy efficient computing tests (and send your findings to
the JISCMAIL CASTEP mailing list).
9) Sit back and wait for the simulation to complete, basking in the knowledge that
the simulation is running as energy efficiently[1] as a user can possibly make it.
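The convergence testing in step 4 is straightforward to script. The sketch below generates a family of .param files over a range of cut-off energies; the seed name, template contents and chosen values are illustrative only, and in practice k-point grids, grid scales and model sizes deserve the same sweep treatment:

```python
# Generate a series of CASTEP .param files for a cut-off-energy
# convergence sweep. Seed name, template and values are illustrative.
from pathlib import Path

TEMPLATE = """task            : SinglePoint
cut_off_energy  : {cutoff} eV
xc_functional   : PBE
"""

def write_sweep(seed, cutoffs, outdir="convergence"):
    """Write one .param file per cut-off; return the file names created."""
    Path(outdir).mkdir(exist_ok=True)
    files = []
    for ec in cutoffs:
        p = Path(outdir) / f"{seed}_ec{ec}.param"
        p.write_text(TEMPLATE.format(cutoff=ec))
        files.append(p.name)
    return files

# e.g. test cut-offs in 50 eV steps around the pseudopotential's suggestion
made = write_sweep("clay", range(300, 501, 50))
print(made)   # ['clay_ec300.param', ..., 'clay_ec500.param']
```

Each file would then be run (with the matching .cell file) and the total energies compared until the change per step falls below the required accuracy.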
4 What else can a user do
In addition to using the above recipe to embed energy-efficient computing habits a user
can take a number of actions to encourage the wider awareness and adoption of energy
efficient computing
a. If the HPC cluster uses SLURM, use the 'sacct' command to check the
amount of energy consumed[2] (in Joules) by a job (see Figure 6).
b. If your local cluster uses a different job scheduler, ask your local IT helpdesk if
it has the facility to monitor the energy consumed by each HPC job.
c. Include the energy consumption of simulations in all forms of reports and
presentations, e.g. informal talks, posters, peer-reviewed journal articles, social
media posts, etc. This will increase awareness of our role as environmentally
aware and conscientious computational scientists and users of HPC resources.
[1] It's highly probable that users can expand on the list of model properties and parameters described within this document to further optimise energy efficient computing.
[2] 'Note: Only in case of exclusive job allocation this value reflects the job's real energy consumption' (see https://slurm.schedmd.com/sacct.html).
Figure 6: Examples of information about jobs output through SLURM's 'sacct' command (plus flags). Top: details of several jobs run from 20/03/2021; bottom: details for a specific job ID via the 'seff <jobID>' command.
d. Include estimates of the energy consumption of simulations in applications for
funding. Although not yet explicitly requested in EPSRC funding applications,
there is the expectation that UKRI's 2020 commitment to Environmental
Sustainability will filter down to all activities of its research councils, including
funding. This will mean that funding applicants will need to demonstrate their
awareness of the environmental impact of their proposed work. Become an
impressive pioneer and include environmental impact, through energy
consumption, in your next application.
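The energy check in item (a) can be scripted. A minimal sketch, assuming the cluster records SLURM's `ConsumedEnergyRaw` accounting field (in Joules); the job ID and the sample output line are placeholders, so the snippet is self-contained:

```shell
# Query SLURM's accounting database for a job's energy use, then convert
# Joules to kJ and kWh with awk. On a real cluster the query would be:
#   sacct -j 123456 --format=JobID,Elapsed,ConsumedEnergyRaw
# Here we parse an illustrative sacct output line instead (202000 J is the
# 'careful' run from Table 12).
sample="123456   00:03:35   202000"
echo "$sample" | awk '{printf "job %s used %.0f kJ (%.3f kWh)\n", $1, $3/1000, $3/3.6e6}'
```

Remember footnote [2]: the reported value only reflects the job's real energy consumption for exclusive node allocations.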
5 What are the developers doing
The compilation of this document included a chat with several of the developers of CASTEP,
who are keen to help users run their software energy efficiently; they shared their plans and
projects in this field:
- Parts of CASTEP have been programmed to run on GPUs, with up to a 15-fold
speed-up (for non-local functionals).
- Work on a CASTEP simulator is underway that should reduce the number of
CASTEP calculations required per simulation by choosing an optimal parallel domain
decomposition and implementing timings for FFTs (the big parallel cost); it will also
estimate compute usage. This simulator will go a long way towards providing the structure
needed to add energy efficiency to CASTEP, and will be accessible through the
'--dryrun' command. The toy code is available on Bitbucket.
- The developers recognise the need for energy consumption to be acknowledged as
an additional factor to be included in the cost of computational simulations. They will
be planning their approach beyond the software itself, such as including energy
efficient computing in their training courses.
Acknowledgements
I acknowledge the support of the Supercomputing Wales project, which is part-funded by the
European Regional Development Fund (ERDF) via the Welsh Government.
Thank you to the following CASTEP developers for their invaluable input and support for this
small project: Dr Phil Hasnip and Prof Matt Probert (University of York), Prof Chris Pickard
(University of Cambridge), Dr Dominik Jochym (STFC) and Prof Stewart Clark (University of
Durham). Thanks also to Dr Sue Thorne (STFC) and Dr Ed Bennett (Supercomputing
Wales) for sharing their research engineering perspectives.
References
(1) Clark S J Segall M D Pickard C J Hasnip P J Probert M I J Refson K Payne M C First Principles Methods Using CASTEP Z Krist 2005 220 567ndash570
(2) Vanderbilt D Soft Self-Consistent Pseudopotentials in a Generalized Eigenvalue Formalism Phys Rev B 1990 41 7892ndash7895
(3) Pickard C J On-the-Fly Pseudopotential Generation in CASTEP 2006 (4) Refson K Clark S J Tulip P Variational Density Functional Perturbation Theory for
Dielectrics and Lattice Dynamics Phys Rev B 2006 73 155114 (5) Hamann D R Schluumlter M Chiang C Norm-Conserving Pseudopotentials Phys
Rev Lett 1979 43 (20) 1494ndash1497 (6) BIOVIA Dassault Systegravemes Materials Studio 2020 Dassault Systegravemes San Diego
2019 (7) Marzari N Vanderbilt D Payne M C Ensemble Density Functional Theory for Ab
Initio Molecular Dynamics of Metals and Finite-Temperature Insulators Phys Rev Lett 1997 79 1337ndash1340