Computational Chemistry at Daresbury 16-22 November 2002
Computational Science and Engineering Department Daresbury Laboratory
Computational Chemistry at Daresbury Laboratory
Quantum Chemistry Group
Martyn F. Guest, Paul Sherwood, and Huub J.J. van Dam
http://www.dl.ac.uk/CFS
http://www.cse.clrc.ac.uk/Activity/QUASI
Molecular Simulation Group
Bill Smith and Maurice Leslie
http://www.dl.ac.uk/TCSC/Software/DL_POLY

Outline: Performance of Computational Chemistry Codes
• Serial Applications Benchmarks
  - GAMESS-UK and DL_POLY
• Parallel Performance on High-end and Commodity-class systems
  - NWChem: Global Array (GA) Tools; Parallel Eigensolver (PeIGS)
  - GAMESS-UK: SCF, DFT, MP2 and 2nd Derivatives
  - DL_POLY: Version 2 (replicated data); Version 3 (distributed data, domain decomposition)
  - CHARMM and QM/MM Calculations: Thrombin and TIM benchmarks

GAMESS-UK and DL_POLY Serial Benchmarks

12 Typical QC Calculations
Module                 Basis (GTOs)    Species
1. SCF                 STO-3G (124)    Morphine
2. SCF                 6-31G (154)     C6H3(NO2)3
3. ECP Geometry        ECPDZ (70)      Na7Mg+
4. Direct-SCF          6-31G (82)      Cytosine
5. CAS-geometry        TZVP (52)       H2CO
6. MCSCF               EXT1 (74)       H2CO
7. Direct-CI           EXT2 (64)       H2CO/H2+CO
8. MRD-CI (26M)        ECP (59)        TiCl4
9. MP2-geometry        6-31G* (70)     H3SiNCO
10. SCF 2nd derivs.    6-31G (64)      C5H5N
11. MP2 2nd derivs.    6-31G* (60)     C4
12. Direct-MP2         DZP (76)        C5H5N

Six Typical Simulations
Simulation                                                                       Atoms   Time steps
1. Na-K disilicate glass                                                         1080    300
2. Metallic Al with Sutton-Chen potential                                        256     8000
3. Valinomycin in 1223 water molecules                                           3837    100
4. Dynamic shell model water with 1024 sites                                     768     1000
5. Dynamic shell model MgCl2 with 1280 sites                                     768     1000
6. Model membrane, 2 membrane chains, 202 solute and 2746 solvent molecules      3148    1000

The GAMESS-UK Benchmark
Performance relative to the Compaq Alpha ES45/68-1000
[Figure: horizontal bar chart, scale 0 to 150; annotation: 3.6 minutes. Systems compared: IBM RS/6000 44P-270; IBM 690 Turbo/POWER4 1.3 GHz; HP PA-9000/J6000-552; HP PA-9000/J6700-750; SGI Origin3800/R14k-500; SGI O2 R12k/270; SUN Fire 6800 / 900 Cu; SUN Blade 1000 / 750; Compaq Alpha ES45/1000; Compaq Alpha ES40/833; Intel Tiger Itanium 2/1000; HP RX2600 Itanium 2/1000; IBM IA64 Itanium 800/4MB; HP RX4610 Itanium 733/2MB; Pentium 4 / 2000; AMD MP1800+ / 1533; Pentium III / 1000.]

The DL_POLY Benchmark
Performance relative to the Compaq Alpha ES45/68-1000
[Figure: horizontal bar chart, scale 0 to 160. Systems compared: IBM RS/6000 44P-270; IBM 690 Turbo/POWER4 1.3 GHz; HP PA-9000/J6000-552; HP PA-9000/J6700-750; Cray T3E/1200; SGI Origin3800/R14k-500; SGI O2 R12k/270; SUN Fire 6800 / 900 Cu; SUN Blade 1000 / 750; Compaq Alpha ES45/1000; Compaq Alpha ES40/833; Intel Tiger Itanium 2/1000; HP RX2600 Itanium 2/1000; IBM IA64 Itanium 800/4MB; HP RX4610 Itanium 733/2MB; Pentium 4 / 2000; AMD MP1800+ / 1533; Pentium III / 1000.]

High-End Systems Evaluated
• Cray T3E/1200E
  - 816 processor system at Manchester (CSAR service)
  - 600 MHz EV56 Alpha processor with 256 MB memory
• IBM SP/WH2-375 and SP/Regatta-H
  - 32 CPU system at DL, 4-way Winterhawk2 SMP "thin nodes" with 2 GB memory, 375 MHz Power3-II processors with 8 MB L2 cache
  - IBM Regatta-H (32-way node, 1.3 GHz Power4 CPUs) at Montpellier
  - IBM SP/Regatta-H (8-way LPAR'd nodes, 1.3 GHz) at ORNL
• Compaq AlphaServer SC
  - 4-way ES40/667 A21264A (APAC) and 833 MHz SMP nodes (2 GB RAM)
  - TCS1 system at PSC, comprising 750 4-way ES45 nodes (3,000 EV68 CPUs) with 4 GB memory per node, 8 MB L2 cache
  - Quadrics "fat tree" interconnect (5 usec latency, 250 MB/sec bandwidth)
• SGI Origin 3800
  - SARA (1000 CPUs): Numalink with R14k/500 and R12k/400 CPUs
• Cray Supercluster at Eagan
  - Linux Alpha Cluster (96 x API CS20s: dual 833 MHz EV67 CPUs, Myrinet)

Commodity Systems (CSx): Prototype / Evaluation Hardware

System  Location   CPUs  Configuration
CS1     Daresbury  32    Pentium III / 450 MHz + FE (EPSRC)
CS2     Daresbury  64    24 x dual UP2000/EV67-667, QSNet Alpha/LINUX cluster, 8 x dual CS20/EV67-833 ("loki")
CS3     RAL        16    Athlon K7 850 MHz + Myrinet
CS4     Sara       32    Athlon K7 1.2 GHz + FE
CS6     CLiC       528   Pentium III / 800 MHz; fast ethernet (Chemnitzer Cluster)
CS7     Daresbury  64    AMD K7/1000 MP + SCALI/SCI ("ukcp")
CS8     NCSA       320   160 dual IBM Itanium/800 + Myrinet 2k ("titan")
CS9     Bristol    96    Pentium4 Xeon/2000 + Myrinet 2k ("dirac")

Prototype Systems
CS0     Daresbury  10    10 CPUs, Pentium II/266
CS5     Daresbury  16    8 x dual Pentium III/933, SCALI

www.cse.clrc.ac.uk/Activity/DisCo

Performance Metrics: 1999-2001
Attempted to quantify delivered performance from the commodity-based systems against current MPP (CSAR Cray T3E/1200E) and ASCI-style SMP-node platforms (e.g. SGI Origin 3800), i.e.
Performance Metric (% 32-node Cray T3E) = T (32-node Cray T3E/1200E) / T (32 CPUs, CSx)
  [ T 32-node T3E / T 32-node CS1 Pentium III/450 + FE ]
  T 32-node T3E / T 32-node CS6 Pentium III/800 + FE
  T 32-node T3E / T 32-CPU CS2 Alpha Linux Cluster + Quadrics

Performance Metrics: 2002
Performance Metric (% 32-node AlphaServer SC [PSC]) = T (32-CPU AlphaServer SC ES45/1000) / T (32 CPUs, CSx)
  T 32-CPU AlphaServer ES45 / T 32-CPU CS9 Pentium 4 Xeon / 2000 + Myrinet 2k
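
The metric defined on this slide is a ratio of elapsed times on equal processor counts, expressed as a percentage of the reference machine. A minimal sketch follows; the function name and the timings are hypothetical and chosen only for illustration, they are not measurements from the talk.

```python
def relative_performance(t_reference: float, t_system: float) -> float:
    """Return T(reference) / T(system) as a percentage.

    Values above 100 mean the system under test finished sooner than
    the reference machine on the same number of CPUs.
    """
    if t_reference <= 0 or t_system <= 0:
        raise ValueError("elapsed times must be positive")
    return 100.0 * t_reference / t_system


# Hypothetical elapsed times in seconds (illustration only).
t_reference_32 = 1200.0   # e.g. 32 nodes of the reference machine
t_csx_32 = 800.0          # e.g. 32 CPUs of a commodity cluster

metric = relative_performance(t_reference_32, t_csx_32)
print(f"{metric:.0f}% of the 32-node reference")
```

Because the ratio puts the reference time in the numerator, a commodity system that is slower than the reference scores below 100%.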

Beowulf Comparisons with the T3E & O3800/R14k-500

CSx - Pentium III + FE: % of 32-node Cray T3E/1200E
(where two values are given they are CS1, then CS6)
GAMESS-UK
  SCF                  53-69%     96%
  DFT                  65-85%     130-178%
  DFT (Jfit)           44-77%     65-131%
  DFT Gradient         90%        130%
  MP2 Gradient         44%        73%
  SCF Forces           80%        127%
NWChem (DFT Jfit)      50-60%
REALC                  67%
CRYSTAL                79-145%
DL_POLY
  Ewald-based          95-107%    151-184%
  bond constraints     34-56%     69%
CHARMM                 96%        172%
CASTEP                 33%        42%
CPMD                   62%
ANGUS                  60%        68%
FLITE3D                104%

CS2 - QSNet Alpha Linux Cluster: % of 32-node Cray T3E and O3800/R14k-500
(where two values are given they are % of T3E, then % of O3800)
GAMESS-UK
  SCF                  256%       99%
  DFT †                301-361%   99%
  DFT (Jfit)           219-379%   89-100%
  DFT Gradient †       289%       89%
  MP2 Gradient         228%       87%
  SCF Forces           154%       86%
NWChem (DFT Jfit) †    150-288%   74-135%
CRYSTAL †              349%
DL_POLY
  Ewald-based †        363-470%   95%
  bond constraints     143-260%   82%
CHARMM †               404%       78%
CASTEP                 166%       78%
ANGUS                  145%
FLITE3D †              480%

High-End Computational Chemistry: The NWChem Software
Capabilities (Direct, Semi-direct and conventional):
• RHF, UHF, ROHF using up to 10,000 basis functions; analytic 1st and 2nd derivatives.
• DFT with a wide variety of local and non-local XC potentials, using up to 10,000 basis functions; analytic 1st and 2nd derivatives.
• CASSCF; analytic 1st and numerical 2nd derivatives.
• Semi-direct and RI-based MP2 calculations for RHF and UHF wave functions using up to 3,000 basis functions; analytic 1st derivatives and numerical 2nd derivatives.
• Coupled cluster, CCSD and CCSD(T) using up to 3,000 basis functions; numerical 1st and 2nd derivatives of the CC energy.
• Classical molecular dynamics and free energy simulations with the forces obtainable from a variety of sources.

Global Arrays
Single, shared data structure; physically distributed data
• Shared-memory-like model
  - Fast local access
  - NUMA aware and easy to use
  - MIMD and data-parallel modes
  - Inter-operates with MPI, …
• BLAS and linear algebra interface
• Ported to major parallel machines
  - IBM, Cray, SGI, clusters, ...
• Originated in an HPCC project
• Used by 5 major chemistry codes, financial futures forecasting, astrophysics, computer graphics
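
The idea of a single logical array whose storage is physically split across processes can be illustrated with a toy model. The sketch below is not the Global Arrays API: the class `ToyGlobalArray` and its methods are hypothetical, everything runs in one Python process, and the "processes" are simply separate lists.

```python
class ToyGlobalArray:
    """A 1-D array split into contiguous blocks, one block per 'process'.

    Hypothetical illustration of a shared-memory-like view over
    physically distributed data; not the real Global Arrays interface.
    """

    def __init__(self, length: int, nproc: int):
        base, extra = divmod(length, nproc)
        self.bounds = []            # (lo, hi) global index range per rank
        start = 0
        for rank in range(nproc):
            size = base + (1 if rank < extra else 0)
            self.bounds.append((start, start + size))
            start += size
        # Each rank holds only its own block.
        self.blocks = [[0.0] * (hi - lo) for lo, hi in self.bounds]

    def owner(self, index: int) -> int:
        """Return the rank whose block contains the global index."""
        for rank, (lo, hi) in enumerate(self.bounds):
            if lo <= index < hi:
                return rank
        raise IndexError(index)

    def put(self, lo: int, values) -> None:
        """Write values starting at global index lo, wherever they live."""
        for offset, value in enumerate(values):
            rank = self.owner(lo + offset)
            self.blocks[rank][lo + offset - self.bounds[rank][0]] = value

    def get(self, lo: int, hi: int):
        """Read the global range [lo, hi) as one list."""
        result = []
        for index in range(lo, hi):
            rank = self.owner(index)
            result.append(self.blocks[rank][index - self.bounds[rank][0]])
        return result


ga = ToyGlobalArray(length=10, nproc=3)
ga.put(2, [1.0, 2.0, 3.0, 4.0])   # spans the blocks of ranks 0 and 1
print(ga.get(0, 10))
```

The caller addresses the array by global index only; which block holds a given element is resolved inside `put` and `get`, which is the sense in which the model is shared-memory-like.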

PeIGS 3.0 Parallel Performance
(Solution of real symmetric generalized and standard eigensystem problems)
Full eigensolution performed on a matrix generated in a charge density fitting procedure (966 fitting functions for a fluorinated biphenyl).
[Figure: time (sec), 0 to 60, versus number of processors, 0 to 64; series: IBM SP, Cray T3E.]
Features (not available elsewhere):
• Inverse iteration using Dhillon-Fann-Parlett's parallel algorithm (fastest uniprocessor performance and good parallel scaling)
• Guaranteed orthonormal eigenvectors in the presence of large clusters of degenerate eigenvalues
• Packed Storage
• Smaller scratch space requirements
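
Packed storage of a real symmetric matrix keeps only one triangle, so an N x N matrix needs N(N+1)/2 values rather than N squared. A minimal sketch, assuming a row-wise lower-triangle layout; the slide does not state which layout PeIGS uses, and the helper names are hypothetical.

```python
def packed_index(i: int, j: int) -> int:
    """Map (i, j) of a symmetric matrix to its position in packed storage.

    Assumes the lower triangle is stored row by row:
    (0,0), (1,0), (1,1), (2,0), (2,1), (2,2), ...
    """
    if i < j:
        i, j = j, i               # use symmetry: A[i][j] == A[j][i]
    return i * (i + 1) // 2 + j


def pack_symmetric(matrix):
    """Return the lower triangle of a square symmetric matrix as a flat list."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1)]


def packed_get(packed, i: int, j: int):
    """Read element (i, j) from packed storage."""
    return packed[packed_index(i, j)]


a = [
    [4.0, 1.0, 2.0],
    [1.0, 5.0, 3.0],
    [2.0, 3.0, 6.0],
]
packed = pack_symmetric(a)
print(len(packed), packed_get(packed, 0, 2))
```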

Scalability of Numerical Algorithms: Real symmetric eigenvalue problems
SGI Origin 3800/R12k-400 ("green")
[Figure: two panels of time (sec) versus number of processors.
 Panel 1: Fock matrix (N = 1152); 2 to 32 processors; time axis 0 to 50; series: PeIGS 2.1, PeIGS 2.1 - Cray T3E/1200, PeIGS 3.0, PDSYEV (Scpk 1.5), PDSYEVD (Scpk 1.7).
 Panel 2: Fock matrix (N = 3888); 16 to 512 processors; time axis 0 to 120; series: PeIGS 2.1, PeIGS 3.0, PDSYEV (Scpk 1.5), PDSYEVD (Scpk 1.7), BFG-Jacobi (DL).]

Parallel Eigensolvers: Real symmetric eigenvalue problems
[Figure: two panels of measured time (seconds) versus number of processors, N = 3888.
 Panel 1: CS7 AMD K7/1000 MP + SCALI; 4 to 64 processors; time axis 0 to 600; series: ScaLAPACK (PDSYEV) with 2 CPUs per node and with 1 CPU per node.
 Panel 2: CS9 P4/2000 Xeon + Myrinet; 8 to 64 processors; time axis 0 to 400; series: PDSYEV, PDSYEVD and PeIGS 2.1, each with 2 CPUs per node and with 1 CPU per node.]

Case Studies - Zeolite Fragments
Si8O7H18      347/832
Si8O25H18     617/1444
Si26O37H36    1199/2818
Si28O67H30    1687/3928
• DFT Calculations with Coulomb Fitting
  - Basis (Godbout et al.): DZVP - O, Si; DZVP2 - H
  - Fitting Basis: DGAUSS-A1 - O, Si; DGAUSS-A2 - H
• NWChem & GAMESS-UK
  - Both codes use an auxiliary fitting basis for the Coulomb energy, with 3-centre 2-electron integrals held in core.

DFT Coulomb Fit - NWChem
[Figure: two bar-chart panels of measured time (seconds) versus number of CPUs (16 and 32), with time axes 0 to 400 and 0 to 800.
 Panels: Si8O7H18 (347/832) and Si8O25H18 (617/1444).
 Systems: CS6 PIII/800 + FE; CS7 AMD K7/1000 + SCI; CS9 P4/2000 + Myrinet; CS2 QSNet Alpha/667; Cray T3E/1200E; SGI O3800/R14k-500; AlphaServer SC ES45/1000.
 Annotations: 76%, 95%; 88%, 104%.]

DFT Coulomb Fit - NWChem
[Figure: two bar-chart panels of measured time (seconds) versus number of CPUs (32, 64 and 128), with time axes 0 to 4500 and 0 to 6000.
 Panels: Si26O37H36 (1199/2818) and Si28O67H30 (1687/3928).
 Systems: CS7 AMD K7/1000 + SCI; CS9 P4/2000 + Myrinet 2k; CS2 QSNet Alpha Cluster/667; Cray T3E/1200E; SGI Origin 3800/R14k-500; AlphaServer SC ES45/1000.
 Annotations: T(IBM-SP/P2SC-120, 256 CPUs) = 1137 and 2766; 79%, 210%; 85%, 227%.]

Memory-driven Approaches: NWChem - DFT (LDA)
1. Performance on the SGI Origin 3800
Zeolite ZSM-5
• DZVP Basis (DZV_A2) and Dgauss A1_DFT Fitting basis:
  - AO basis: 3554
  - CD basis: 12713
• MIPS R14k-500 CPUs (Teras)
  - Wall time (13 SCF iterations): 64 CPUs = 5,242 seconds; 128 CPUs = 3,951 seconds
  - Est. time on 32 CPUs = 40,000 secs
• 3-centre 2e-integrals = 1.00 x 10^12
• Schwarz screening = 5.95 x 10^10
• % 3c 2e-ints. in core = 100%
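
The wall times and integral counts quoted on this slide can be turned into a relative speed-up, a parallel efficiency and a screening ratio. The sketch below only does arithmetic on the slide's own numbers; the function names are hypothetical, and reading the Schwarz figure as the number of integrals that survive screening is an assumption.

```python
def relative_speedup(t_small: float, t_large: float) -> float:
    """Speed-up when moving from the smaller to the larger CPU count."""
    return t_small / t_large


def parallel_efficiency(p_small: int, t_small: float,
                        p_large: int, t_large: float) -> float:
    """Observed speed-up divided by the ideal speed-up p_large / p_small."""
    return relative_speedup(t_small, t_large) / (p_large / p_small)


# Wall times for 13 SCF iterations, taken from the slide.
t_64, t_128 = 5242.0, 3951.0

speedup = relative_speedup(t_64, t_128)
efficiency = parallel_efficiency(64, t_64, 128, t_128)

# Integral counts from the slide; interpretation as "before" and
# "after" Schwarz screening is an assumption.
n_total = 1.00e12
n_after_screening = 5.95e10
surviving_fraction = n_after_screening / n_total

print(f"speed-up 64 -> 128 CPUs: {speedup:.2f}")
print(f"efficiency relative to 64 CPUs: {efficiency:.0%}")
print(f"fraction surviving screening: {surviving_fraction:.2%}")
```

Doubling the processor count here gives a speed-up of about 1.33, that is roughly 66% efficiency relative to the 64-CPU run.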

Memory-driven Approaches: NWChem - DFT (LDA)
2. Performance on the HP/Compaq AlphaServer SC
Pyridine in Zeolite ZSM-5
• DZVP Basis (DZV_A2) and Dgauss A1_DFT Fitting basis:
  - AO basis: 5457
  - CD basis: 12713
• 256 EV67/6-667 CPUs (64 Compaq AlphaServer SC nodes)
  - Wall time (10 SCF iterations) on 256 CPUs = 11,960 seconds (60% efficiency)
• 3-centre 2e-integrals = 3.79 x 10^12
• Schwarz screening = 2.81 x 10^11
• % 3c 2e-ints. in core = 1.66%

Parallel Implementations of GAMESS-UK
Extensive use of Global Array (GA) Tools and Parallel Linear Algebra from NWChem Project (EMSL)
• SCF and DFT
  - Replicated data, but …
  - GA Tools for caching of I/O for restart and checkpoint files
  - Storage of 2-centre 2-e integrals in DFT Jfit
  - Linear Algebra (via PeIGS, DIIS/MMOs, Inversion of 2c-2e matrix)
• SCF second derivatives
  - Distribution of <vvoo> and <vovo> integrals via GAs
• MP2 gradients
  - Distribution of <vvoo> and <vovo> integrals via GAs

GAMESS-UK SCF Performance: Cray T3E/1200E, High-end and Commodity-based Systems
Cyclosporin (3-21G Basis, 1000 GTOs)
[Figure: bar chart of elapsed time (seconds), 0 to 2500, versus number of CPUs (16 and 32).
 Systems: CS6 PIII/800 + FE; CS7 AMD K7/1000 MP + SCI; CS2 QSNet Alpha Cluster/667; CS9 P4 Xeon/2000 + Myrinet; Cray T3E/1200E; IBM SP/WH2-375; Compaq AlphaServer SC ES40/667; SGI Origin3800/R14k-500; Compaq AlphaServer SC ES45/1000.
 Annotations: 95%, 135%; T3E on 128 CPUs = 436.]
Impact of Serial Linear Algebra:
  T(IBM-SP, 16) = 2656 [1289]
  T(IBM-SP, 32) = 2184 [821]

GAMESS-UK DFT B3LYP Performance: Cray T3E/1200, High-end and Commodity-based Systems
Cyclosporin, 1000 GTOs; Basis: 6-31G
[Figure: bar chart of elapsed time (seconds), 0 to 8000, versus number of CPUs (16 and 32).
 Systems: CS6 PIII/800 + FE; CS4 AMD K7/1200 + FE; CS7 AMD K7/1000 MP + SCI; CS2 QSNet Alpha Cluster/667; CS9 P4 Xeon/2000 + Myrinet; Cray T3E/1200E; IBM SP WH2/375; SGI Origin 3800/R14k-500; Cray SuperCluster/833; AlphaServer SC ES45/1000.
 Annotation: 70%, 117%.]

GAMESS-UK DFT B3LYP Performance: The Cray T3E/1200 and High-end Systems
Cyclosporin, 1000 GTOs; Basis: 6-31G
[Figure: two panels versus number of CPUs.
 Panel 1: elapsed time (seconds), 0 to 4000, at 32, 64 and 128 CPUs; systems: Cray T3E/1200E; Compaq AlphaServer SC ES40/667; SGI Origin 3800/R14k-500; Cray SuperCluster/833; Compaq AlphaServer SC ES45/1000.
 Panel 2: speed-up, 16 to 128 CPUs; series: Linear; Cray T3E/1200E; Compaq AlphaServer SC/667; Cray SuperCluster/833; SGI Origin 3800/R14k-500; AlphaServer SC ES45/1000.
 Annotations: S = 106.7; 73.4.]

DFT BLYP Gradient: Cray T3E/1200, High-end and Commodity-based Systems
Geometry optimisation of polymerisation catalyst Cl(C3H5O).Pd[(P(CMe3)2)2.C6H4]
Basis 3-21G* (446 GTOs): 10 energy + gradient evaluations
[Figure: two panels versus number of CPUs.
 Panel 1: elapsed time (seconds), 0 to 12000, at 32 and 64 CPUs; systems: CS6 PIII/800 + FE; CS7 AMD K7/1000 MP + SCI; CS2 QSNet Alpha Cluster/667; CS9 P4 Xeon/2000 + Myrinet; Cray T3E/1200E; IBM SP/WH2-375; SGI Origin 3800 R14k/500; AlphaServer SC ES45/1000.
 Panel 2: speed-up, 16 to 64 CPUs; series: Linear; Cray T3E/1200E; IBM SP/WH2-375; CS1 PIII/450 + FE; SGI Origin 3800/R14k-500; CS2 QSNet Alpha Cluster/667.
 Annotation: 69%, 114%.]

Auxiliary Basis Coulomb Fit (I)
The approach is based on the expansion of the charge density in an auxiliary basis of Gaussian functions:
$$\rho(\mathbf{r}) = \sum_{pq} D_{pq}\, p(\mathbf{r})\, q(\mathbf{r}) \approx \sum_{u} d_u\, u(\mathbf{r}), \qquad d_u = \sum_{pq} D_{pq}\, C_{u,pq}$$
As suggested by Dunlap, a variational choice of the fitting coefficients C can be obtained as follows:
$$C_{pq} = V^{-1} b_{pq}$$
where V is the matrix of 2-centre 2-electron repulsion integrals in the charge density basis and b are the three-centre electron repulsion integrals between the wavefunction basis set and the charge density basis.
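
Obtaining the fitting coefficients C from V and b amounts to solving one linear system per pair pq. A toy sketch in pure Python follows; the 2 x 2 matrix and the right-hand side are made-up numbers rather than real integrals, and the helper `solve_linear` is hypothetical. A production code would use a factorisation of V from a linear algebra library instead.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-14:
            raise ValueError("matrix is singular to working precision")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        s = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - s) / m[row][row]
    return x


# Made-up values standing in for (u|v) and (pq|u); illustration only.
V = [[2.0, 0.5],
     [0.5, 1.0]]
b_pq = [1.0, 0.25]

C_pq = solve_linear(V, b_pq)   # equivalent to V^{-1} b for this pair pq
print(C_pq)
```

Solving the system directly avoids forming the inverse of V explicitly, which is the usual choice when V is only needed to act on right-hand sides.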

Auxiliary Basis Coulomb Fit (II)
The number of 3-centre integrals is significantly smaller than the number of 4-centre integrals used in the conventional Coulomb evaluation, but for large molecules additional screening is required.
We make use of the Schwarz inequality:
$$|(pq|u)| \le \sqrt{(pq|pq)}\,\sqrt{(u|u)}$$
where p and q are AO basis functions and u are the fitting functions. Since screening is applied on a shell basis, the maximal integrals for each shell quartet are stored.
Using this screening, and exploiting the aggregate memory of a parallel machine, it is possible to hold a significant fraction of the 3-centre integrals in core.
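
Schwarz screening of this kind skips any 3-centre integral whose upper bound, the product of the square roots of the two diagonal integrals, falls below a threshold. A minimal sketch with made-up diagonal values follows; the labels, the numbers and the threshold are hypothetical, and the loop works on individual functions rather than shells for brevity.

```python
import math


def schwarz_bound(pq_pq: float, u_u: float) -> float:
    """Upper bound on |(pq|u)| from the diagonal integrals (pq|pq) and (u|u)."""
    return math.sqrt(pq_pq) * math.sqrt(u_u)


def screen(pair_diagonals, fit_diagonals, threshold: float):
    """Return the (pq, u) combinations whose bound reaches the threshold."""
    kept = []
    for pq, d_pq in pair_diagonals.items():
        for u, d_u in fit_diagonals.items():
            if schwarz_bound(d_pq, d_u) >= threshold:
                kept.append((pq, u))
    return kept


# Made-up diagonal values (illustration only).
pair_diagonals = {("p0", "q0"): 1.0, ("p0", "q1"): 1e-12, ("p1", "q1"): 0.25}
fit_diagonals = {"u0": 4.0, "u1": 1e-16}

kept = screen(pair_diagonals, fit_diagonals, threshold=1e-7)
total = len(pair_diagonals) * len(fit_diagonals)
print(f"kept {len(kept)} of {total} integrals")
```

Only the integrals in `kept` would need to be evaluated and stored, which is what makes holding them in aggregate memory feasible for larger molecules.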

GAMESS-UK: DFT HCTH on Valinomycin. Impact of Coulomb Fitting
Basis: DZV_A2 (Dgauss) A1_DFT Fit: 882/3012
[Figure: two bar-chart panels, J(EXPLICIT) and J(FIT), of measured time (seconds) versus number of CPUs (16 and 32), with time axes 0 to 6000 and 0 to 15000.
 Systems: CS6 PIII/800 + FE; CS4 AMD K7/1200 + FE; CS7 AMD K7/1000 MP + SCALI; CS2 QSNet Alpha Cluster/667; CS9 P4 Xeon/2000 + Myrinet; Cray T3E/1200E; IBM SP/WH2-375; Cray SuperCluster/833; AlphaServer SC ES45/1000.
 Annotations: 73%, 161%; 61%, 93%; T(T3E/1200E, 128 CPUs) = 2139 and 995.]

GAMESS-UK: DFT HCTH on Valinomycin. Impact of Coulomb Fitting: Cray T3E/1200, Cray SuperCluster/833, Compaq AlphaServer SC/667 and SGI Origin R14k/500
Basis: DZV_A2 (Dgauss) A1_DFT Fit: 882/3012
[Figure: two bar-chart panels, J(EXPLICIT) and J(FIT), of measured time (seconds) versus number of CPUs (32, 64 and 128), with time axes 0 to 3000 and 0 to 8000.
 Systems: Cray T3E/1200E; Compaq AlphaServer SC/667; SGI Origin 3800/R14k-500; Cray Alpha Cluster/833.]

GAMESS-UK: DFT HCTH on Valinomycin. Speedups for both Explicit and Coulomb Fitting
[Figure: two speed-up panels, J(EXPLICIT) and J(FIT), versus number of CPUs (16 to 128).
 Series: Linear; Cray T3E/1200E; CS2 QSNet Alpha Cluster/667; Compaq AlphaServer SC/667; Cray SuperCluster/833; SGI Origin 3800/R14k-500; SGI Origin 3800/R12k-400.
 Annotations: 112; 90.6; 100.1; 65.4; 105+.]

Memory-driven Approaches - SCF and DFT
HF and DFT energy and gradient calculation for NiSiMe2CH2CH2C6F13: Basis Ahlrichs DZ (1485 GTOs)
SGI Origin 3800/R14k-500
Integrals are written directly to memory, rather than to disk, and are not re-calculated.
[Figure: elapsed time (seconds), 0 to 5000, versus number of CPUs (28, 60 and 124); series: HF - Direct SCF; HF - In-core; DFT - Direct-SCF; DFT - In-core.]
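
The difference between a direct calculation, which recomputes integrals in every iteration, and an in-core one, which keeps them in memory after the first evaluation, can be sketched with a small cache. Everything below is hypothetical: the class name, the placeholder formula standing in for an integral, and the iteration count are illustrative and are not taken from GAMESS-UK.

```python
class IntegralSource:
    """Serve 'integrals' either by recomputation (direct) or from memory (in-core)."""

    def __init__(self, in_core: bool):
        self.in_core = in_core
        self.cache = {}          # integrals held in memory when in_core is True
        self.evaluations = 0     # counts how often the expensive step runs

    def _evaluate(self, key):
        self.evaluations += 1
        p, q, r, s = key
        return 1.0 / (1 + p + q + r + s)   # placeholder, not a real integral

    def get(self, key):
        if not self.in_core:
            return self._evaluate(key)     # direct: recompute every time
        if key not in self.cache:
            self.cache[key] = self._evaluate(key)
        return self.cache[key]


def run_iterations(source: IntegralSource, keys, n_iter: int) -> float:
    """Touch every integral once per iteration, as an SCF cycle would."""
    total = 0.0
    for _ in range(n_iter):
        total += sum(source.get(key) for key in keys)
    return total


keys = [(p, q, r, s) for p in range(3) for q in range(3)
        for r in range(3) for s in range(3)]

direct = IntegralSource(in_core=False)
in_core = IntegralSource(in_core=True)
total_direct = run_iterations(direct, keys, n_iter=10)
total_in_core = run_iterations(in_core, keys, n_iter=10)

print(direct.evaluations, in_core.evaluations)
```

Both runs return the same total, but the in-core source evaluates each integral once while the direct source evaluates it once per iteration, at the cost of holding the whole set in memory.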

MP2 Gradient Algorithms
Serial
• Conventional
  - integrals written to disk
  - read back, transformed, written out, resorted etc.
  - heavy I/O demands
• Direct/Semi-direct (Frisch, Head-Gordon & Pople; Haase and Ahlrichs)
  - replace all/some I/O with batched integral recomputation
Parallel
• Poor I/O-to-compute performance of MPPs
  - direct approach
• Current MPPs have large global memories
  - store subset of MO integrals
  - reduce number of integral recomputations
  - increase communication overhead
• Subset includes VOVO, VVOO, VOOO
• VVVO-class too large to store
  - compute VVVO-terms in separate step
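
Why the VVVO class is the one left out of global memory follows from simple counting: with o occupied and v virtual orbitals, it scales as v cubed times o. The sketch below ignores permutational symmetry, assumes 8-byte values, and uses hypothetical orbital counts chosen only to show the relative sizes.

```python
def mo_integral_class_sizes(n_occ: int, n_virt: int) -> dict:
    """Count the elements in each MO integral class, ignoring symmetry."""
    o, v = n_occ, n_virt
    return {
        "VOVO": v * o * v * o,
        "VVOO": v * v * o * o,
        "VOOO": v * o ** 3,
        "VVVO": v ** 3 * o,
    }


# Hypothetical orbital counts (illustration only).
n_occ, n_virt = 20, 200
sizes = mo_integral_class_sizes(n_occ, n_virt)

for name, count in sizes.items():
    gigabytes = count * 8 / 1e9          # assumes 8-byte reals
    print(f"{name}: {count:>12,d} elements, {gigabytes:.3f} GB")
```

The VVVO class is larger than VOVO by the factor v/o, so it grows fastest as the basis set is enlarged for a fixed molecule.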

Performance of MP2 Gradient Module: Cray T3E/1200, High-end and Commodity-based Systems
Mn(CO)5H - MP2 geometry optimisation
Basis: TZVP + f (217 GTOs)
[Figure: three panels versus number of CPUs.
 Panel 1: elapsed time (seconds), 0 to 6000, at 16 and 32 CPUs; systems: CS6 PIII/800 + FE; CS7 AMD K7/1000 MP + SCI; CS2 QSNet Alpha Cluster/667; CS9 P4 Xeon/2000 + Myrinet; Cray T3E/1200E; IBM SP/WH2-375; Compaq AlphaServer SC/667; SGI Origin3800/R14k-500; AlphaServer SC ES45/1000.
 Panel 2: speed-up, 16 to 128 CPUs; series: Linear; Cray T3E/1200E; AlphaServer SC ES45/1000; SGI Origin3800/R14k-500.
 Panel 3: elapsed time (seconds), 0 to 6000, at 16, 32, 64 and 128 CPUs; systems: Cray T3E/1200E; SGI Origin3800/R14k-500; AlphaServer SC ES45/1000.
 Annotation: 59%, 78%.]

SCF Analytic 2nd Derivatives Performance: Cray T3E/1200, High-end and Commodity-based Systems
(C6H4(CF3))2: Basis 6-31G (196 GTO)
• Terms from MO 2e-integrals in GA storage (CPHF & pert. Fock matrices); calculation dominated by CPHF
• Gaussian98 - L1002 (CPU): 32 nodes: 1181 secs; 64 nodes: 1058 secs
• GAMESS-UK (total job time): 128 nodes: 499 secs
[Figure: Elapsed Time (seconds) vs. CPUs (16, 32, 64, 128). Systems: Cray T3E/1200E; SGI Origin 3800/R12k-400; SGI Origin 3800/R14k-500; AlphaServer SC ES40/667. Chart annotations: G98: 2271; G98: 1706; G98: 1490]
[Figure: Elapsed Time (seconds) vs. CPUs (16, 32). Systems: CS2 QSNet Alpha Cluster/667; CS9 P4 Xeon/2000 + Myrinet; Cray T3E/1200E; IBM SP/WH2-375; SGI Origin 3800/R14k-500; AlphaServer SC ES40/667; AlphaServer SC ES45/1000]
CS9 as % of 32-CPU AlphaServer SC, Origin 3800: 92%, 148%

Performance Analysis of GA-based Applications using Vampir
Vampir 2.5: Visualization and Analysis of MPI Programs
GAMESS-UK on High-end and Commodity class machines
• extensions to handle GA applications

GAMESS-UK / Si8O25H18: 8 CPUs: One DFT Cycle

GAMESS-UK / Si8O25H18: 8 CPUs: Q†HQ (GAMULT2) and PEIGS

Materials Simulation Codes
Plane Wave DFT Codes:
• CASTEP
• VASP
• CPMD
These codes have similar functionality, power and problems. CASTEP is the flagship code of UKCP and hence subsequent discussions will focus on this.
Local Gaussian Basis Set Codes:
• CRYSTAL
This code presents a different set of problems when considering performance on HPC(x).
SIESTA and CONQUEST:
• O(n) scaling codes which will be extremely attractive to users.
• Both are currently development rather than production codes.

CRYSTAL - 2000: Distributed Data implementation
Benchmark: An Acid Centre in Zeolite-Y (Faujasite)
• Single point energy
• 145 atoms / cell, no symmetry / 8 k-points
• 2208 basis functions (6-21G*)
[Figure: Elapsed Time (seconds) vs. Number of CPUs (16, 32, 64, 128). Systems: CS1 PIII/450 + FE/LAM; CS2 Alpha Linux Cluster/667; Cray T3E/1200E; SGI Origin 3800/R12k-400; IBM SP/WH2-375]
[Figure: Speed-up vs. Number of CPUs (0-256). Series: Linear; Cray T3E/1200E; CS1 PIII/450 + FE; SGI Origin 3800/R12k-400. Data labels: 144, 128.8]
T256 SGI Origin 3800 = 945 secs

Materials Simulation. Plane Wave Methods: CASTEP
$$\psi_j^{\mathbf{k}}(\mathbf{r}) = \sum_{\mathbf{G}} C^{\mathbf{k}}_{j,\mathbf{G}}\, e^{i(\mathbf{k}+\mathbf{G})\cdot\mathbf{r}}, \qquad |\mathbf{k}+\mathbf{G}|^{2} \le E_{cut}$$
• Direct minimisation of the total energy (avoiding diagonalisation)
• Pseudopotentials must be used to keep the number of plane waves manageable
• Large number of basis functions, N ~ 10^6 (especially for heavy atoms).
The plane wave expansion means that the bulk of the computation comprises large 3D Fast Fourier Transforms (FFTs) between real and momentum space.
• These are distributed across the processors in various ways.
• The actual FFT routines are optimized for the cache size of the processor.
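How quickly the plane-wave basis grows with the cutoff can be sketched by counting reciprocal lattice vectors inside the cutoff sphere. The example below is a minimal illustration, not CASTEP's procedure: it assumes a cubic cell, the k = 0 point only, and Hartree atomic units with kinetic energy 0.5*|G|^2; the function name and inputs are hypothetical.

```python
import math

def count_plane_waves(box_length, e_cut):
    """Count G-vectors with 0.5*|G|^2 <= e_cut for a cubic cell at k = 0.

    box_length is in bohr and e_cut in hartree (atomic units, an assumption).
    """
    b = 2.0 * math.pi / box_length            # reciprocal lattice spacing
    n_max = int(math.sqrt(2.0 * e_cut) / b) + 1
    count = 0
    for n1 in range(-n_max, n_max + 1):
        for n2 in range(-n_max, n_max + 1):
            for n3 in range(-n_max, n_max + 1):
                g2 = b * b * (n1 * n1 + n2 * n2 + n3 * n3)
                if 0.5 * g2 <= e_cut:
                    count += 1
    return count

# Toy cell of side 2*pi bohr, so that the reciprocal spacing is exactly 1.
small = count_plane_waves(box_length=2.0 * math.pi, e_cut=0.5)
larger = count_plane_waves(box_length=2.0 * math.pi, e_cut=1.0)
print(small, larger)
```

For a fixed cell the count grows roughly as the cutoff to the power 3/2, which is why hard pseudopotentials (heavy atoms) push the basis size up so sharply.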

CASTEP 4.2 - Parallel Benchmark
Chabazite
• Acid sites in a zeolite (Si11 O24 Al H)
• Vanderbilt ultrasoft pseudo-potential
• Pulay density mixing minimiser scheme
• single k point total energy, 96 bands
• 15045 plane waves on 3D FFT grid size = 54x54x54; convergence in 17 SCF cycles
Time (comms):
IBM SP/WH2-375            157
Cray T3E/1200E             90
CS6 PIII/800 + FE         600
CS7 AMD K7/1000 + SCI     242
CS9 P4/2000 + Myrinet     115
CS2 QSNet Alpha           111
SGI Origin 3800/R14k       71
[Figure: Measured Time (seconds) vs. CPUs (16, 32). Systems: CS6 PIII/800 FE/MPICH; CS2 QSNet Alpha Cluster/667; CS9 P4/2000 Xeon + Myrinet; CS8 Itanium/800 + Myrinet; Cray T3E/1200E; IBM SP/WH2-375; SGI Origin 3800/R14k-500; AlphaServer SC ES45/1000; IBM Regatta-H]
CS9 as % of 32-CPU AlphaServer SC, Origin 3800: 67%, 75%

CASTEP 4.2 - Parallel Benchmark II.
TiN: A TiN 32 atom slab, 8 k-points, single point energy calculation with Mulliken analysis.
[Figure: Measured Time (seconds) vs. CPUs (32, 64, 128). Systems: CS9 P4/2000 + Myrinet 2k; CS8 Itanium/800 + Myrinet 2k; Cray T3E/1200E; SGI Origin 3800/R14k-500; AlphaServer SC ES45/1000; IBM SP/Regatta-H]
CS9 as % of 32-CPU AlphaServer SC, Origin 3800: 47%, 81%

CPMD - Car-Parrinello Molecular Dynamics
CPMD Version 3.5.1: Hutter, Alavi, Deutsch, Bernasconi, St. Goedecker, Marx, Tuckerman and Parrinello (1995-2001)
• DFT, plane-waves, pseudo-potentials and FFTs
• CPMD is the base code for the new CCP1 flagship project
Benchmark Example: Liquid Water (Sprik and Vuilleumier, Cambridge)
• Physical Specifications: 32 molecules, simple cubic periodic box of length 9.86 Å, temperature 300 K
• Electronic Structure: BLYP functional, Troullier-Martins pseudopotential, reciprocal space cutoff 70 Ry = 952 eV
[Figure: Elapsed Time (secs.) vs. CPUs (16, 32). Systems: CS7 dual-Athlon K7/1000; CS9 P4/2000 Xeon + Myrinet (2 CPU); CS9 P4/2000 Xeon + Myrinet (1 CPU); CS2 Alpha Linux Cluster; IBM SP/WH2-375; SGI Origin 3800/R14k-500; AlphaServer SC ES40/667; AlphaServer SC ES45/1000]
CS9 as % of 32-CPU AlphaServer SC, Origin 3800: 30%, 53%
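The liquid-water benchmark parameters quoted above can be cross-checked with a few lines of arithmetic. This sketch assumes standard values for Avogadro's number, the Rydberg energy and the molar mass of water (none are stated on the slide), and simply confirms that 32 molecules in a 9.86 Å box sit near ambient density and that 70 Ry is about 952 eV.

```python
# Cross-check of the liquid-water benchmark parameters.
# The physical constants below are standard values, not taken from the slide.
AVOGADRO = 6.02214076e23        # 1/mol
RYDBERG_EV = 13.605693          # eV per Rydberg
WATER_MOLAR_MASS = 18.015       # g/mol

def box_density(n_molecules, box_length_angstrom, molar_mass):
    """Mass density in g/cm^3 of n_molecules in a cubic box."""
    mass_g = n_molecules * molar_mass / AVOGADRO
    volume_cm3 = (box_length_angstrom * 1.0e-8) ** 3   # 1 Angstrom = 1e-8 cm
    return mass_g / volume_cm3

density = box_density(32, 9.86, WATER_MOLAR_MASS)
cutoff_ev = 70.0 * RYDBERG_EV
print(f"density = {density:.3f} g/cm^3, cutoff = {cutoff_ev:.1f} eV")
```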

DL_POLY Parallel Benchmarks (Cray T3E/1200)
V2: Replicated Data
[Figure: Speed-up vs. Number of Nodes (0-256), with Linear reference. Benchmarks shown:]
4. NaCl; Ewald, 27,000 ions
5. NaK-disilicate glass; 8,640 atoms, MTS + Ewald
8. MgO microcrystal; 5,416 atoms
[Figure: Speed-up vs. Number of Nodes (0-64), with Linear reference. Benchmarks shown:]
9. Model membrane/Valinomycin (MTS, 18,886)
7. Gramicidin in water (SHAKE, 12,390)
6. K/valinomycin in water (SHAKE, AMBER, 3,838)
1. Metallic Al (19,652 atoms, Sutton Chen)
3. Transferrin in Water (neutral groups + SHAKE, 27,593)
2. Peptide in water (neutral groups + SHAKE, 3,993)

DL_POLY: Cray/T3E, High-end and Commodity-based Systems
Bench 4. NaCl; 27,000 ions, Ewald, 75 time steps, Cutoff=24Å
[Figure: Measured Time (seconds) vs. Number of CPUs (16, 32). Systems: CS6 PIII/800 + FE/LAM; CS7 AMD K7/1000 MP + SCI; CS4 AMD K7/1200 + FE; CS9 P4/2000 + Myrinet; CS2 QSNet Alpha Cluster/667; CS8 Itanium/800 + Myrinet; Cray T3E/1200E; IBM SP/WH2-375; SGI Origin 3800/R14k-500; Cray SuperCluster/833; IBM Regatta-H; AlphaServer SC ES45/1000]
T3E (128 CPUs) = 94
CS9 as % of 32-CPU AlphaServer SC, Origin 3800: 44%, 71%

DL_POLY: Scalability on the T3E, High-end & Commodity Systems
Bench 5: NaK-disilicate glass; 8,640 atoms, MTS + Ewald: 270 time steps
[Figure: Measured Time (seconds) vs. Number of CPUs (16, 32, 64). Systems: CS6 PIII/800 + FE/LAM; CS7 AMD K7/1000 MP + SCI; CS9 P4/2000 + Myrinet; CS8 Itanium/800 + Myrinet; CS2 QSNet Alpha Cluster/667; SGI Origin3800/R14k-500; Cray SuperCluster/833; AlphaServer SC ES45/1000]
T3E (128 CPUs) = 75
[Figure: Speed-up vs. Number of CPUs (0-128). Series: Linear; Cray T3E/1200E; SGI Origin 3800/R14k-500; Cray SuperCluster/833; CS1 Pentium III/800 + FE; CS7 AMD K7/1000 MP + SCI; Compaq AlphaServer ES45/1000. Data labels: 98, 50.1]
CS9 as % of 32-CPU AlphaServer SC, Origin 3800: 53%, 84%

DL_POLY: Macromolecular Simulations
Bench 7. Gramicidin in water; rigid bonds and SHAKE, 12,390 atoms, 500 time steps
[Figure: Measured Time (seconds) vs. Number of CPUs (16, 32). Systems: CS6 PIII/800 + FE/LAM; CS4 AMD K7/1200 + FE; CS7 AMD K7/1000 MP + SCI; CS9 P4/2000 + Myrinet; CS2 QSNet Alpha Cluster/667; CS8 Itanium/800 + Myrinet; Cray T3E/1200E; IBM SP/WH2-375; Cray SuperCluster/833; SGI Origin 3800/R14k-500; IBM Regatta-H; AlphaServer SC ES45/1000]
T3E (128 CPUs) = 166
CS9 as % of 32-CPU AlphaServer SC, Origin 3800: 41%, 64%

Migration from Replicated to Distributed data
DL_POLY-3: Domain Decomposition
• Distribute atoms, forces across the nodes
  – More memory efficient, can address much larger cases (10^5-10^7)
• SHAKE and short-range forces require only neighbour communication
  – communications scale linearly with number of nodes
• Coulombic energy remains global
  – strategy depends on problem and machine characteristics
  – Adopt Particle Mesh Ewald scheme: includes Fourier transform of smoothed charge density (reciprocal space grid typically 64x64x64 - 128x128x128)
[Figure: simulation cell divided into domains labelled A, B, C, D]
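The idea of distributing atoms across nodes by spatial domain can be sketched as follows. This is a minimal illustration under stated assumptions (an orthorhombic periodic cell, equal-sized domains, hypothetical function names and coordinates), not the DL_POLY-3 implementation.

```python
import math
from collections import defaultdict

def domain_of(position, box, grid):
    """Return the (ix, iy, iz) index of the domain that owns a position.

    box and grid are 3-tuples: cell edge lengths and domains per axis.
    Positions outside the cell are wrapped periodically.
    """
    return tuple(
        math.floor(x / length * n) % n
        for x, length, n in zip(position, box, grid)
    )

def distribute(positions, box, grid):
    """Group atom indices by owning domain."""
    owners = defaultdict(list)
    for index, position in enumerate(positions):
        owners[domain_of(position, box, grid)].append(index)
    return dict(owners)

# Hypothetical 4-atom system in a 10 x 10 x 10 cell split into 2 x 2 x 2 domains.
atoms = [(1.0, 9.0, 5.0), (6.0, 2.0, 2.0), (1.5, 8.0, 7.0), (-0.5, 0.5, 0.5)]
owners = distribute(atoms, box=(10.0, 10.0, 10.0), grid=(2, 2, 2))
print(owners)
```

Each node then holds only the atoms of its own domain, so short-range forces need data from adjacent domains only, while the Coulomb sum still couples every domain.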

Migration from Replicated to Distributed data
DL_POLY-3: Coulomb Energy Evaluation
• Conventional routines (e.g. fftw) assume plane or column distributions
• A global transpose of the data is required to complete the 3D FFT, and additional costs are incurred re-organising the data from the natural block domain decomposition.
• An alternative FFT algorithm has been designed to reduce communication costs.
  – the 3D FFT is performed as a series of 1D FFTs, each involving communications only between blocks in a given column
  – More data is transferred, but in far fewer messages
  – Rather than all-to-all, the communications are column-wise only
[Figure: Plane and Block data distributions]
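The difference between all-to-all and column-wise communication can be illustrated by counting how many distinct peers each process has to talk to. The sketch below is a simplified model, assuming a px x py x pz block decomposition; it counts peers only, says nothing about message sizes, and is not the actual DL_POLY-3 FFT.

```python
def all_to_all_partners(px, py, pz):
    """Peers each process talks to in a global transpose over all processes."""
    return px * py * pz - 1

def column_wise_partners(px, py, pz):
    """Peers reached when each 1D FFT pass stays within one column of blocks."""
    return (px - 1) + (py - 1) + (pz - 1)

# Hypothetical processor grids, chosen for illustration.
for grid in [(2, 2, 2), (4, 4, 4), (8, 8, 4)]:
    print(grid, all_to_all_partners(*grid), column_wise_partners(*grid))
```

In this model the number of peers grows with the total process count for the transpose, but only with the sum of the grid dimensions for the column-wise scheme.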

Migration from Replicated to Distributed data
DL_POLY-3: Coulomb Energy Performance
NaCl Simulation: DLPOLY_2.11, Ewald summation; DLPOLY_3, PMES
• DLPOLY_2.11: 27,000 ions, 500 time steps, Cutoff=24Å
• DLPOLY_3: 27,000 ions, 500 time steps, Cutoff=12Å
• DLPOLY_3: 216,000 ions, 200 time steps, Cutoff=12Å
[Figure: Speed-up vs. Number of CPUs (0-256). Series: Linear; AlphaServer ES45/1000; CS9 P4/2000 + Myrinet 2k; SGI Origin 3800/R14k-500. Data labels: 171, 186]
[Figure: Speed-up vs. Number of CPUs (0-128). Series: Linear; DL_POLY-2: AlphaServer; AlphaServer SC ES45/1000; CS9 P4/2000 + Myrinet; SGI Origin 3800/R14k-500. Data labels: 74.4, 79.1, 94.5]

Migration from Replicated to Distributed data
DL_POLY-3: Macromolecular Simulations
Gramicidin in water; rigid bonds + SHAKE
• DL_POLY 2.11: 12,390 atoms, 500 time steps
• DL_POLY 3: 99,120 atoms, 100 time steps
• DLPOLY_3: 792,960 ions, 50 time steps
[Figure: Speed-up vs. Number of CPUs (0-256). Series: Linear; AlphaServer SC ES45/1000; CS9 P4/2000 + Myrinet 2k; SGI Origin 3800/R14k-500. Data labels: 174.7, 208.4]
[Figure: Speed-up vs. Number of CPUs (0-64). Series: Linear; DL_POLY-2: 12,390 atoms - AlphaServer; AlphaServer SC ES45/1000; CS9 P4/2000 + Myrinet 2k; SGI Origin 3800/R14k-500. Data labels: 26.4, 31, 43]

Parallel CHARMM Benchmark
Benchmark MD Calculation of Carboxy Myoglobin (MbCO) with 3830 Water Molecules: 14026 atoms, 1000 steps (1 ps), 12-14 Å shift.
[Figure: Speed-up vs. Number of CPUs (4-52). Series: Linear; CS1 PIII/450 + FE/LAM; Cray T3E/1200E; CS2 QSNet Alpha Cluster/667; SGI Origin 3800/R14k-500]
[Figure: Measured Time (seconds) vs. Number of CPUs (16, 32, 64). Systems: CS6 PIII/800 + FE/LAM; CS2 QSNet Alpha Cluster/667; CS7 AMD K7/1000 MP + SCI; CS9 P4/2000 + Myrinet 2k; SGI Origin 3800/R14k-500; AlphaServer SC ES45/1000]
T3E (16 CPUs) = 665, T3E (128 CPUs) = 106
CS9 as % of 32-CPU AlphaServer SC, Origin 3800: 95%, 103%
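The two Cray T3E timings quoted on this slide (665 seconds on 16 CPUs, 106 seconds on 128 CPUs) are enough for a small worked example of relative speed-up and parallel efficiency. The function names are illustrative, and efficiency is measured here against the 16-CPU run rather than a single CPU, which is an assumption of this sketch.

```python
def relative_speedup(t_base, t_scaled):
    """Speed-up of the larger run relative to the baseline run."""
    return t_base / t_scaled

def parallel_efficiency(t_base, n_base, t_scaled, n_scaled):
    """Efficiency relative to the baseline run rather than to a single CPU."""
    return relative_speedup(t_base, t_scaled) / (n_scaled / n_base)

# Cray T3E timings from the slide: 665 s on 16 CPUs, 106 s on 128 CPUs.
speedup = relative_speedup(665.0, 106.0)
efficiency = parallel_efficiency(665.0, 16, 106.0, 128)
print(f"speed-up 16 -> 128 CPUs: {speedup:.2f}x, efficiency: {efficiency:.0%}")
```

An eightfold increase in CPUs therefore buys a speed-up of a little over six on this benchmark.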

Parallel CHARMM Benchmark: LAM MPI vs. MPICH
Benchmark MD Calculation of Carboxy Myoglobin (MbCO) with 3830 Water Molecules
[Figure: Measured Time (seconds) vs. Number of CPUs (4, 8, 16, 32) on CS6 PIII/800 + FE. Series: LAM 6.3-b2; MPICH-1.2.1. Data labels: 198.9, 440.1]

QM/MM Thrombin Benchmark
QM/MM calculation on thrombin pocket:
The QM region (69 atoms) is modelled by an SCF 6-31G calculation (401 GTOs); the total system size for the classical calculation is 16,659 atoms. The MM calculation was performed using all non-bonded interactions (i.e. without a pairlist), and the QM/MM interaction was cut off at 15 angstroms using a neutral group scheme (leading to the inclusion in the QM calculation of 1507 classical centres).
Scaling for Thrombin + Inhibitor (NAPAP) + Water
[Figure: QM time and MM time vs. number of nodes (1, 2, 4, 8, 16, 32, 64, 128, 256)]
[Figure: Speedup (Cray T3E) vs. Number of Nodes (0-128)]

QM/MM Applications
Triosephosphate isomerase (TIM)
• Central reaction in glycolysis, catalytic interconversion of DHAP to GAP
• Demonstration case within QUASI (Partners UZH, and BASF)
• QM region 35 atoms (DFT BLYP)
  – include residues with possible proton donor/acceptor roles
  – GAMESS-UK, MNDO, TURBOMOLE
• MM region (4,180 atoms + 2 link)
  – CHARMM force-field, implemented in CHARMM, DL_POLY
[Figure: Measured Time (seconds) vs. Number of CPUs (8, 16, 32, 64). Systems: CS7 AMD K7/1000 + SCI; CS9 P4/2000 + Myrinet 2k; SGI Origin3800/R14k-500; AlphaServer SC ES45/1000]
T128 (O3800/R14k-500) = 181 secs
CS9 as % of 32-CPU AlphaServer SC, Origin 3800: 89%, 139%

Commodity Comparisons with High-end Systems: Compaq AlphaServer SC ES45/1000 and the SGI O3800/R14k-500
CS9 - Pentium4/2000 Xeon + Myrinet 2k, as % of 32-CPU system

Code                        AlphaServer SC    Origin 3800
GAMESS-UK
  SCF                       95%               135%
  DFT                       70-73%            117-161%
  DFT (Jfit)                59-70%            92-111%
  DFT Gradient              69%               114%
  MP2 Gradient              59%               78%
  SCF Forces                92%               148%
NWChem (DFT Jfit)           65-88%            94-227%
GAMESS / CHARMM             89%               139%
DL_POLY
  Ewald-based               44-53%            71-84%
  bond constraints          41%               64%
CHARMM                      95%               103%
CASTEP                      47-67%            75-81%
CPMD                        30%               53%
ANGUS                       35-69%            51-147%
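One plausible reading of the percentages in this comparison is the ratio of elapsed times at the same CPU count, so that values above 100% mean the commodity cluster was faster. The sketch below assumes that definition; the function name and the timings are hypothetical and chosen only for illustration.

```python
def percent_of_reference(t_commodity, t_reference):
    """Performance of the commodity system as a percentage of the reference.

    Assumes performance is inversely proportional to elapsed time.
    """
    return 100.0 * t_reference / t_commodity

# Hypothetical timings in seconds, not taken from the benchmarks.
slower = percent_of_reference(t_commodity=200.0, t_reference=118.0)
faster = percent_of_reference(t_commodity=100.0, t_reference=135.0)
print(f"{slower:.0f}% {faster:.0f}%")
```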