Clusters in Molecular Sciences Applications

Page 1: Clusters in Molecular Sciences Applications

"Clusters in Molecular Sciences Applications", 2nd Annual iHPC Cluster Workshop, Ottawa, Jan 11, 2002. p. 1

Clusters in Molecular Sciences Applications

Serguei Patchkovskii@#, Rochus Schmid@, Tom Ziegler@, Siu Pang Chan#, Andrew McCormack#, Roger Rousseau#, Ian Skanes#

@ Department of Chemistry, University of Calgary, 2500 University Dr. NW, Calgary, Alberta, T2N 1N4 Canada

# Theory and Computation Group, SIMS, NRC, 100 Sussex Dr., Ottawa, Ontario, K1A 0R6

Page 2: Clusters in Molecular Sciences Applications

Overview

• Beowulf-style clusters have entered the mainstream
• Are clusters a lasting, efficient investment?
• Odysseus: an internal cluster at the SIMS theory group
• Clusters in molecular science applications: software availability and performance
• Three war stories, and a cautionary message
• Summary and conclusions

Page 3: Clusters in Molecular Sciences Applications

Shared, Academic Clusters in Canada

Location               CPUs            URL or other info
Carleton U.            8xPII-400       www.scs.carleton.ca/~gis/
UBC                    256xPIII-1000   www.gdcfd.ubc.ca/Monster
U of Calgary           179xAlpha       www.maci-cluster.ucalgary.ca
U of Western Ontario   144xAlpha       GreatWhite.sharcnet.ca
U of Western Ontario   48xAlpha        DeepPurple.sharcnet.ca
McMaster U             106xAlpha       Idra.physics.mcmaster.ca
U of Guelph            120xAlpha       Hammerhead.uoguelph.ca
U of Windsor           8xAlpha
Wilfrid Laurier U      8xAlpha

Page 4: Clusters in Molecular Sciences Applications

Canadian top-500 facilities

[Figure: Canadian top-500 facilities; surviving label: Cluster]

Page 5: Clusters in Molecular Sciences Applications

Internal, "workhorse" clusters

Location                              CPUs             URL or other
U of Alberta                          98xPIII-450      www.phys.ualberta.ca/THOR
U of Calgary                          94x21164-500     www.cobalt.chem.ucalgary.ca
U of Calgary                          120xPIII-1000    www.ucalgary.ca/~tieleman/elk.html
U of Calgary                          32xPIII
Memorial U                            32xPII-300       weland.esd.mun.ca
MDS Proteomics                        400xPIII-1000    www.mdsproteomics.com
ICPET, NRC                            80xPIII-800
DRAO, NRC                             16xPII-450
SIMS, NRC                             32xPIII-933
Samuel Lunenfeld Research Institute   224xPIII-450     Bioinfo.mshri.on.ca/yac/
Sherbrooke U                          64xPII-400
U of Saskatchewan                     12xAthlon-800    Sasquatch.usask.ca
Simon Fraser U                        16xPIII-500      www.sfu.ca/acs/cluster/
U of Victoria                         39xPIII-450      Pingu.phys.uvic.ca/muse/ (?)
McMaster U                            32xPIII-700      www.cim.mcgill.ca/~cvr/beowulf/
CERCA, Montreal                       16xAthlon-1200   www.cerca.umontreal.ca/~fourmano/
U of Western Ontario                  various          www.baldric.uwo.ca

Page 6: Clusters in Molecular Sciences Applications

Clusters are everywhere

Lemma 1: A computationally-intensive research group in Canada can be in one of three states:

a) It owns a cluster, or
b) It is building a cluster, or
c) It plans to build a cluster RSN

Clusters have become a mainstream research tool – useful, but not automatically worthy of a separate mention

Page 7: Clusters in Molecular Sciences Applications

Cobalt: Hardware

[Diagram labels: Node 1 ... Node 93; Switch, 93x100BaseTx; World, 100BaseTx (half-duplex); 2x100BaseTx; 128Mb memory; 18Gbytes RAID-1 (4 spindles)]

COBALT: Computers On Benches All Linked Together

Page 8: Clusters in Molecular Sciences Applications

Cobalt: Nodes and Network

Nodes: Digital/Compaq Personal Workstation 500au
CPU          Alpha 21164A, 500 MHz
Cache        96Kb on-chip (L1 and L2)
Peak flops   10^9 Flop/second
SpecInt 95   15.7 (estimate)
SpecFP 95    19.5 (estimate)

Network: 4 x 3COM SuperStack II 3300
Peak aggregate b/w         500.0 MB/s
Peak internode b/w (TCP)   11.2 MB/s
NFS read/write             3.4/4.1 MB/s
Round-trip (TCP)           360 μs
Round-trip (UDP)           354 μs
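The TCP figures above can be combined into a rough message-cost estimate. The sketch below uses the common "latency plus size over bandwidth" model, which is not from the slides; it also assumes that the one-way latency is half of the quoted round trip and that MB means 10^6 bytes. The function and constant names are made up for illustration.

```python
# Rough message-cost model: time = latency + size / bandwidth.
# Assumptions (not from the slides): one-way latency is half the
# measured TCP round trip, and 1 MB = 10**6 bytes.

TCP_ROUND_TRIP_S = 360e-6            # round-trip (TCP), from the slide
TCP_BANDWIDTH_BYTES_PER_S = 11.2e6   # peak internode b/w (TCP), from the slide
ONE_WAY_LATENCY_S = TCP_ROUND_TRIP_S / 2


def transfer_time(nbytes, latency_s, bandwidth_bytes_per_s):
    """Estimated time to deliver one message of nbytes."""
    return latency_s + nbytes / bandwidth_bytes_per_s


def half_bandwidth_size(latency_s, bandwidth_bytes_per_s):
    """Message size at which latency and transmission time are equal."""
    return latency_s * bandwidth_bytes_per_s


if __name__ == "__main__":
    for size in (64, 1024, 64 * 1024, 1024 * 1024):
        t = transfer_time(size, ONE_WAY_LATENCY_S, TCP_BANDWIDTH_BYTES_PER_S)
        print(f"{size:>8} bytes: {t * 1e6:10.1f} us")
```

Under these assumptions, messages of a couple of kilobytes or less are dominated by latency rather than bandwidth, which is one reason fine-grained parallel codes tend to scale poorly on 100BaseTx.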

Page 9: Clusters in Molecular Sciences Applications

Cobalt: Software

OS, communications, and cluster management:
Base OS: Tru64, using DMS, NIS, and NFS
Compilers: Digital/Compaq C, C++, Fortran
Communications: PVM, MPICH
Batch queuing: DQS

Application software:
ADF: Amsterdam Density Functional (PVM)
PAW: Projector-Augmented Wave (MPI)

Page 10: Clusters in Molecular Sciences Applications

Cobalt: Return on the Investment

Investment: Dollars
Total cost              390,800
… including:
Initial purchase        346,000
Operating (’98-’01):
  power (6¢/kWh)         15,800
  admin (20% PDF)        24,000
  spare parts             5,000

Payback: Research Articles
Total publications      92
… including:
Organometallics         21
J. Am. Chem. Soc.       12
J. Phys. Chem.          11
J. Chem. Phys.          10
Inorg. Chem.             6

ROI: 1 publication / $4,250
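The ROI figure follows directly from the numbers on this slide. A minimal check, with variable names chosen here for illustration:

```python
# Cost items and publication count as listed on the slide (dollars).
costs = {
    "initial purchase": 346_000,
    "power": 15_800,
    "admin": 24_000,
    "spare parts": 5_000,
}
publications = 92

total_cost = sum(costs.values())
cost_per_publication = total_cost / publications

print(total_cost)                    # 390800
print(round(cost_per_publication))   # 4248, quoted on the slide as $4,250
```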

Page 11: Clusters in Molecular Sciences Applications

Odysseus: Low-tech solution for high-tech problems (1)

Page 12: Clusters in Molecular Sciences Applications

Odysseus: Low-tech solution for high-tech problems (2)

Nodes (16+1)
ABIT VP6 motherboard
2xPIII-933, 133MHz FSB
4x256Mbytes RAM
3COM 3C905C
36Gb 7200rpm IDE

… plus, on the front end:
Intel PRO/1000
Adaptec AHA-2940UW
60Gb 7200rpm IDE

Page 13: Clusters in Molecular Sciences Applications

Odysseus: Low-tech solution for high-tech problems (3)

Network: SCI + 100Mbit
Dolphin D339 (2D SCI)
[Diagram labels: H ring; V ring]
HP Procurve 2524 + 1Gig

Page 14: Clusters in Molecular Sciences Applications

Odysseus: Low-tech solution for high-tech problems (4)

Backup unit:
VXAtape (www.ecrix.com)
35Gbytes/cartridge (physical)
TreeFrog autoloader (www.spectralogic.com)
16 cartridge capacity

UPS Unit:
Powerware 5119
2880VA

Page 15: Clusters in Molecular Sciences Applications

Odysseus: Low-tech solution for high-tech problems (5)

[Photo label: Four little wheels]

Odysseus at a glance
Processors:   32 (+2)
Memory:       16 Gbytes
Disk:         636 Gbytes
Peak flops:   29.9 GFlop/s
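The at-a-glance totals can be reproduced from the per-node specification on the earlier hardware slide. The sketch below is one reading that matches the quoted figures: it assumes one floating-point operation per cycle per PIII CPU, counts memory on the 16 compute nodes only, and counts the node disks plus the 60Gb front-end disk. None of these assumptions is stated on the slides.

```python
# Per-node figures from the Odysseus hardware slide.
COMPUTE_NODES = 16
CPUS_PER_NODE = 2
CLOCK_GHZ = 0.933
DIMMS_PER_NODE = 4
DIMM_MBYTES = 256
NODE_DISK_GBYTES = 36
FRONT_END_DISK_GBYTES = 60

# Assumption: one floating-point operation per clock cycle per CPU.
FLOPS_PER_CYCLE = 1

processors = COMPUTE_NODES * CPUS_PER_NODE
memory_gbytes = COMPUTE_NODES * DIMMS_PER_NODE * DIMM_MBYTES / 1024
disk_gbytes = COMPUTE_NODES * NODE_DISK_GBYTES + FRONT_END_DISK_GBYTES
peak_gflops = processors * CLOCK_GHZ * FLOPS_PER_CYCLE

print(processors, memory_gbytes, disk_gbytes, round(peak_gflops, 1))
```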

Page 16: Clusters in Molecular Sciences Applications

Odysseus: cost overview

Expense                                         Dollars
Nodes                                            40,640
SCI network (cards & cables)                     26,771
Backup unit (tape+robot)                          5,860
Spare parts in stock                              5,024
Ethernet (switch, cables, and head node link)     4,190
Compiler (PGI)                                    3,780
UPS                                               2,265
Backup tapes (16+1)                               1,911
Total:                                           90,441
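As a quick consistency check, the line items can be summed and ranked by their share of the total. This is only a sketch over the figures in the table; the cost per CPU uses the 32 compute processors quoted on the previous slide, and the variable names are illustrative.

```python
# Line items from the cost overview (dollars).
expenses = {
    "Nodes": 40_640,
    "SCI network (cards & cables)": 26_771,
    "Backup unit (tape+robot)": 5_860,
    "Spare parts in stock": 5_024,
    "Ethernet (switch, cables, and head node link)": 4_190,
    "Compiler (PGI)": 3_780,
    "UPS": 2_265,
    "Backup tapes (16+1)": 1_911,
}

total = sum(expenses.values())

# Share of the total for each item, largest first.
shares = sorted(
    ((name, cost / total) for name, cost in expenses.items()),
    key=lambda item: item[1],
    reverse=True,
)

COMPUTE_CPUS = 32
cost_per_cpu = total / COMPUTE_CPUS

if __name__ == "__main__":
    for name, share in shares:
        print(f"{name:<48} {share:6.1%}")
    print(f"Total: {total}  Cost per CPU: {cost_per_cpu:.0f}")
```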

Page 17: Clusters in Molecular Sciences Applications

Clusters in molecular science – software availability

• Gaussian
• Turbomole
• GAMESS
• NWChem
• GROMOS
• ADF
• PAW
• CPMD
• AMBER
• VASP
• PWSCF
• ABINIT

Page 18: Clusters in Molecular Sciences Applications

Software: ADF

ADF – Amsterdam Density Functional (www.scm.com)

Example: Cr(N)Porph
Full geometry optimization
38 atoms
580 basis functions
C4v symmetry
45Mbytes of memory
Serial time: 683 minutes

[Plot: speedup vs. number of Cobalt nodes; ideal and observed curves]
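The measured points of the speedup plot are not recoverable here, but the quantities on its axes are standard. The sketch below defines speedup and parallel efficiency in the usual way and tabulates the "ideal" line for the 683-minute serial run; the function names are illustrative, and any observed timing has to be supplied by the reader.

```python
SERIAL_MINUTES = 683  # serial time for the Cr(N)Porph optimization, from the slide


def speedup(t_serial, t_parallel):
    """Speedup = serial wall time / parallel wall time."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_nodes):
    """Parallel efficiency = speedup / number of nodes (1.0 is ideal)."""
    return speedup(t_serial, t_parallel) / n_nodes


def ideal_time(t_serial, n_nodes):
    """Wall time on n_nodes if the job scaled perfectly."""
    return t_serial / n_nodes


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        print(f"{n:>2} nodes: ideal time {ideal_time(SERIAL_MINUTES, n):7.1f} min")
```

An observed run is then compared against the ideal line by calling `efficiency(SERIAL_MINUTES, measured_minutes, n_nodes)` with the measured wall time.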

Page 19: Clusters in Molecular Sciences Applications

Software: PAW

PAW – "Projector-Augmented Wave" (www.pt.tu-clausthal.de/~ptpb/PAW/pawmain.html)

Example: SN2 reaction
CH3I + [Rh(CO)2I2]-
11 Å unit cell
Serial time per step: 83 seconds
Memory: 231Mbytes

[Plot: speedup vs. Cobalt nodes; ideal and observed curves]

Page 20: Clusters in Molecular Sciences Applications

Software: CPMD

CPMD – Car-Parrinello Molecular Dynamics (www.mpi-stuttgart.mpg.de/parinello/)

Example: H in Si64
65 atoms, periodic
40 Ryd cut-off
Geometry opt (2 steps) + free MD (70 steps)

[Plot; surviving label: odysseus]

Page 21: Clusters in Molecular Sciences Applications

Software: AMBER

AMBER – "Assisted Model Building with Energy Refinement" (www.amber.ucsf.edu/amber/)

Example:
22-residue polypeptide + 4 K+ + 2500 H2O
1ns MD

[Plot: time (hour) vs. Ncpu]

Page 22: Clusters in Molecular Sciences Applications

Software: VASP

VASP – Vienna Ab-initio Simulation Package (cms.mpi.univie.ac.at/vasp/)

Example: Li198
1000GPa
300 eV cutoff
9 K-points
10 WF optimization steps + stress tensor

[Plot; surviving label: odysseus]

Page 23: Clusters in Molecular Sciences Applications

Software: PWSCF

PWSCF and PHONON – Plane wave pseudopotential codes, optimized for phonon spectra calculations (www.pwscf.org/)

Example: MgB2 solid
Geometry opt.
40 Ryd cut-off
60 K-points

[Plot; surviving label: odysseus]

Page 24: Clusters in Molecular Sciences Applications

Software: ABINIT

ABINIT (www.mapr.ucl.ac.be/ABINIT/)

Example:
SiO2 (stishovite)
70 Ryd cut-off
6 K-points
12 SCF iterations

Page 25: Clusters in Molecular Sciences Applications

War Story #1

Odysseus hardware maintenance log, Oct 19, 2001:
Overnight, node 6 had a kernel OOPS … it responds to network pings and keyboard, but no new processes can be started …

Reason:        Heat sink on CPU#1 became loose, resulting in overheating under heavy load.
Resolution:    Reinstall the heat sink
Detected by:   Elevated temperature readings for CPU#1 (lm_sensors)
Downtime:      20 minutes (the affected node)

Page 26: Clusters in Molecular Sciences Applications

War Story #2

Odysseus hardware maintenance log, Nov 12, 2001:
A large, 16-CPU VASP job fails with "LAPACK: Routine ZPOTRF failed", or produces a random total energy

Reason:        DIMM in bank #0 on node 17 developed a single-bit failure at the address 0xfd9f0c
Resolution:    Replace memory module in bank #0
Detected by:   Rerunning the failing job with different sets of nodes, followed by the memory diagnostic on the affected node (memtest32)
Downtime:      1 day (the whole cluster) + 2 days (the affected node)
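The detection method in this log entry, rerunning the failing job on different sets of nodes, amounts to a process of elimination. A minimal sketch of that bookkeeping follows; it assumes a single faulty node that makes every job touching it fail, which real intermittent faults may not satisfy. The function name and the run outcomes in the example are invented for illustration and are not taken from the maintenance log.

```python
def suspect_nodes(runs):
    """Narrow down faulty nodes from job outcomes.

    runs: iterable of (nodes, passed) pairs, where nodes is a collection
    of node numbers and passed is True if the job ran correctly.
    Returns the nodes present in every failing run and in no passing run.
    """
    runs = list(runs)
    failing = [set(nodes) for nodes, passed in runs if not passed]
    passing = [set(nodes) for nodes, passed in runs if passed]
    if not failing:
        return set()
    suspects = set.intersection(*failing)
    for nodes in passing:
        suspects -= nodes
    return suspects


if __name__ == "__main__":
    # Hypothetical outcomes for four 8-node (16-CPU) reruns.
    example_runs = [
        (range(1, 9), True),
        (range(10, 18), False),
        (range(9, 17), True),
        ([1, 2, 3, 4, 14, 15, 16, 17], False),
    ]
    print(suspect_nodes(example_runs))
```

A node singled out this way would still need confirmation with a memory diagnostic, as the log describes.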

Page 27: Clusters in Molecular Sciences Applications

War Story #3

Odysseus hardware maintenance log, Dec 10, 2001:
Apparently random application failures are observed

Reason:        Multiple single-bit memory failures, on the nodes (bank #): 6 (#2), 7 (#2,#3), 8 (#0), 10 (#0), 11 (#0)
Resolution:    Replace memory modules
Detected by:   Cluster-wide memory diagnostic (memtest32)
Downtime:      3 days (the whole cluster)

Page 28: Clusters in Molecular Sciences Applications

Cautionary Note

• Using inexpensive, consumer-grade hardware potentially exposes you to low-quality components
• Never use components which have no built-in hardware monitoring and error detection capability
• Always configure your clusters to report corrected errors and out-of-range hardware sensor readings
• Act on the early warnings
• Otherwise, you run the risk of producing garbage science, and never knowing it

Page 29: Clusters in Molecular Sciences Applications

Hardware Monitoring with Linux

Category      Parameter                                              Package
Motherboard   Temperature; Power supply voltage; Fan status          lm_sensors #
Hard drives   Corrected error counts; Impending failure indicators   ide-smart $; S.M.A.R.T. Suite %
Memory        Corrected error counts                                 ecc.o ^
Network       Hardware-dependent

# http://www2.lm-sensors.nu/~lm78/
$ http://www.linux-ide.org/smart.html
% http://csl.cse.ucsc.edu/smart.shtml
^ http://www.anime.net/~goemon/linux-ecc/ (2.2 kernels only)
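Whatever package collects the readings, the reporting step recommended in the cautionary note reduces to comparing each value against an allowed range. The sketch below shows only that comparison; it does not call lm_sensors or any other package from the table, and the sensor names and limits are hypothetical placeholders that would have to come from the hardware vendor's specifications.

```python
from typing import NamedTuple


class Limit(NamedTuple):
    low: float
    high: float


def out_of_range(readings, limits):
    """Return one alert string per reading outside its allowed range.

    readings: dict of sensor name -> current value
    limits:   dict of sensor name -> Limit(low, high)
    Sensors without a configured limit are skipped.
    """
    alerts = []
    for name, value in sorted(readings.items()):
        limit = limits.get(name)
        if limit is None:
            continue
        if not (limit.low <= value <= limit.high):
            alerts.append(f"{name}={value} outside [{limit.low}, {limit.high}]")
    return alerts


if __name__ == "__main__":
    # Hypothetical sensor names, limits, and readings, for illustration only.
    limits = {
        "cpu1_temp_c": Limit(10.0, 60.0),
        "fan1_rpm": Limit(3000.0, 8000.0),
        "vcore_v": Limit(1.60, 1.80),
    }
    readings = {"cpu1_temp_c": 71.5, "fan1_rpm": 4500.0, "vcore_v": 1.70}
    for alert in out_of_range(readings, limits):
        print(alert)
```

Run periodically on every node with the alerts mailed to the administrator, a check of this kind is one way to catch a problem such as the loose heat sink in War Story #1 before jobs start failing.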

Page 30: Clusters in Molecular Sciences Applications

Summary and Conclusions

• Clusters are no longer a techno-geek's toy, and will remain the primary workhorse of many research groups, at least for a while
• Clusters give an impressive return on the investment, and may remain useful longer than expected
• Many (most?) useful research codes in molecular sciences are readily available on clusters
• Configuring and operating PC clusters can be tricky. Consider a reputable system integrator with Beowulf hardware and software experience