
Karsten Kramer @ AEI Cluster Day 02/19/09

Potsdam Scientists to Tackle New Type of Weather Simulations with IBM iDataPlex

Jan 21, 2009 - The Potsdam Institute for Climate Impact Research (PIK) is rolling out a new IBM supercomputer that will increase its computing capacity more than 30-fold.

Potsdam researchers plan to employ IBM’s high-performance iDataPlex servers to more precisely predict weather events that have so far proven to be incalculable – extreme, short-term phenomena such as torrential rain or drought.

[Joint press release by Potsdam-Institute for Climate-Impact Research (PIK) and IBM Germany Ltd.]


Outline

➢ Acknowledgments
➢ Requirements
➢ Procurement and Installation
➢ Hardware, Software Overview
➢ System Architecture
➢ Compute Subsystem IBM iDataPlex ™
➢ Voltaire Infiniband Interconnect
➢ Benchmarks
➢ Electrical Power Consumption & Cooling
➢ I/O Subsystem
➢ Summary


Acknowledgments

➢ Dr. Werner von Bloh

➢ Roger Grzondziel

➢ Achim Glauer

➢ Karsten Kramer

➢ Dr. Ciaron Linstead

➢ Kerstin Heuer

➢ Frauke Haneberg

➢ Ingo Deutsch

➢ Carsten Goldhan

➢ Torsten Klietsch

➢ Torsten Kurz

➢ Klaus Hassels

➢ Martin Hiegl

➢ Christoph Pospiech

➢ Klaus Gottschalk

➢ Michael Lauffs

➢ Torsten Bloth

➢ Michael Julien

➢ Steffen Schwab

➢ Mike Kruse-Heidler

➢ Maik Bornhardt, Axel Jahn and colleagues

PIK

Gneise 66

IBM


Requirements

➢ Reliable, high performance general purpose compute facility
  ➢ Application benchmarks: LPJ (C+MPI2), CLIMBER (F77), MOM4 (F90+MPI), S (C++/MPI)
  ➢ Power uptake limited to 100 kW (compute) / 150 kW (system) maximum
➢ Parallel file system, Backup/Restore and Hierarchical Storage Management
  ➢ File system with 4 GB/s, extension of existing tape library and TSM (LAN free) storage infrastructure
➢ Integration into building infrastructure (UPS, Cooling)
  ➢ Replacement of old cluster, installation of new UPS with 30 minutes / 250 kVA
  ➢ Control engineering, limited airflow of about 17,000 m³/h (max. 100 kW by air)
➢ Service and technical support
  ➢ 4 yrs. hardware maintenance and software support (OS, Management, Compilers) + microcode support
➢ Financing
  ➢ 3 installments over 3 years
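The airflow limit and the "max. 100 kW by air" figure can be related through the sensible-heat equation P = ρ · V · cp · ΔT. The sketch below estimates the air temperature rise this would imply. The air density and specific heat are textbook values assumed here, not figures from the slides.

```python
# Sensible heat carried by an air stream: P = rho * V * cp * dT
RHO_AIR = 1.2      # kg/m^3, assumed density of air near 20 °C
CP_AIR = 1005.0    # J/(kg*K), assumed specific heat of air


def air_delta_t(power_w: float, airflow_m3_per_h: float) -> float:
    """Temperature rise (K) of the air stream that removes power_w."""
    flow_m3_per_s = airflow_m3_per_h / 3600.0
    return power_w / (RHO_AIR * flow_m3_per_s * CP_AIR)


# Values from the requirements: 100 kW removed by about 17,000 m^3/h of air
delta_t = air_delta_t(100_000, 17_000)
print(f"Air temperature rise: {delta_t:.1f} K")
```

Under these assumptions the air would warm by roughly 17 to 18 K across the machines, which gives a feel for why 100 kW is stated as the upper limit for air cooling.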


Procurement and Installation

➢ Tender and negotiation process (EU)
  ➢ June 11th - October 30th 2008.
➢ IBM awarded contract for secondary offer (Nebenangebot) based on iDataPlex ™ systems
  ➢ October 31st 2008.
➢ Preparation of raised floor, electricity and cooling
  ➢ November 3rd - 14th 2008.
➢ Hardware delivery and on-site installation
  ➢ November 17th - December 3rd 2008.
➢ Software Installation (OS, LAN, IB, SAN, Management)
  ➢ December 3rd - 17th 2008.
➢ Performance Tests
  ➢ December 17th - 31st 2008.


Installation (cont.)

➢ System accepted with conditions
  ➢ December 31st 2008.
➢ Annual maintenance of facilities, extended by UPS installation and control engineering
  ➢ January 2nd - 9th 2009.
➢ I/O tuning
  ➢ January 12th - 23rd 2009.
➢ TSM/HSM Installation and Integration
  ➢ February 2nd - 6th 2009.
➢ Additional Software (Compilers, Debugger, Batch Queuing, etc.)
  ➢ February 9th - on-going!
➢ Scheduled production:
  ➢ March 2nd 2009.


Hardware

➢ 2560 Cores in 320 Nodes with two Intel Xeon E5472 3GHz/1600 MHz QC CPU each.

➢ 10 TByte RAM 800 MHz FB-DIMM.

➢ 4x DDR Voltaire Infiniband Interconnect (two fabrics).

➢ 200 TByte GPFS / 15krpm FC.

➢ 1 PByte tape cartridges and 8 x IBM E06 Drives.
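As a quick consistency check, the core and memory totals can be derived from the per-node configuration. The sketch below also estimates a theoretical peak, assuming 4 double-precision floating-point operations per core per cycle (SSE add plus multiply). That factor and the 32 GB per node (taken from the memory bandwidth slide) are inputs chosen here, and the peak figure itself is not stated in the talk.

```python
# Per-node configuration from the slides
NODES = 320
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 4          # quad-core Xeon E5472
CLOCK_HZ = 3.0e9
RAM_PER_NODE_GB = 32          # 16 x 2 GB FB-DIMM per node

# Assumption (not from the slides): 4 double-precision flops per cycle per core
FLOPS_PER_CYCLE = 4

total_cores = NODES * SOCKETS_PER_NODE * CORES_PER_SOCKET
total_ram_tb = NODES * RAM_PER_NODE_GB / 1024
peak_tflops = total_cores * CLOCK_HZ * FLOPS_PER_CYCLE / 1e12

print(f"Cores: {total_cores}")
print(f"RAM:   {total_ram_tb:.1f} TByte")
print(f"Peak:  {peak_tflops:.2f} TFLOPS (theoretical, under the stated assumption)")
```

The first two results reproduce the 2560 cores and 10 TByte quoted on the slide.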


Software

Control
➢ Novell SLES10 SP2 x86-64
➢ IBM AIX 6.1
➢ OFED 1.3
➢ Xcat 2.1
➢ IBM GPFS 3.2.1 (RDMA)
➢ IBM LoadLeveler 3.5.3
➢ Tivoli Storage Manager 5.5.3, incl. HSM
➢ Cisco IOS 12.2(18)SXF7
➢ Voltaire Fab. Manager 5.2.0
➢ Brocade Fab. Manager 5.3.1

Applications
➢ Intel Cluster Toolkit Compiler Edition 3.2 (C++, FTN, MPI)
➢ Intel Vtune Performance Analyzer 9.1
➢ GCC 4.1.2 / Open MPI
➢ Total View Debugger 8.6.2
➢ Matlab
  ➢ Mapping
  ➢ Optimization
  ➢ Signal Processing
  ➢ Statistics
➢ Tivoli Storage Manager 5.5.2


System Architecture [Klaus Gottschalk, IBM]


Compute Subsystem: IBM iDataPlex ™

Model IBM dx360
➢ Diskless
➢ 16 x memory slots
➢ eth0 + bmc/ipmi 2.0
➢ Mellanox ConnectX Dual-Port 4X DDR IB PCI-E 2.0 x8 5.0GT/s
➢ Emcore Connects Optical Cables

Two 3-phase PDU+
➢ 12 x C13 outlets
➢ Webserver

Two Cisco 3750G-48TS
➢ 4 x 1000Base-TX uplinks

Rear Door Heat Exchanger
➢ ~ 25 kW


MEMORY BANDWIDTH

Intel Harpertown/Seaburg (E5472/5400B)

➢ Dual QC Xeon 3 GHz
➢ 32 GB RAM - 16 x 2 GB DDR2-800 FBDIMM AMB+
➢ 4 memory channels x 800 MHz
➢ UMA
➢ 64 MByte Snoop-Filter
➢ 25.6 GB/s max. Bandwidth (4 x 800 MHz x 8 Byte)

STREAM (MB/s)

           COPY       SCALE      ADD        TRIAD
dx360/8    11242.36   11273.16    9721.32    9742.68
js22/4     13606.58   13589.12   15416.37   15456.58
p655/8     12059.0    12072.0    14925.0    15090.0
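The peak bandwidth formula on this slide (4 channels x 800 MHz x 8 Byte) can be set against the measured STREAM numbers for the dx360 node. The following sketch computes what fraction of the theoretical peak each kernel reaches. It assumes STREAM's convention of 1 MB = 10^6 bytes, and it only covers the dx360 because no peak figure is given for the other two systems.

```python
# Theoretical peak from the slide: 4 channels x 800 MHz x 8 bytes per transfer
CHANNELS = 4
TRANSFERS_PER_S = 800e6
BYTES_PER_TRANSFER = 8

peak_mb_per_s = CHANNELS * TRANSFERS_PER_S * BYTES_PER_TRANSFER / 1e6  # 25600 MB/s

# Measured STREAM results for the dx360 (8 cores), in MB/s
stream_dx360 = {
    "COPY": 11242.36,
    "SCALE": 11273.16,
    "ADD": 9721.32,
    "TRIAD": 9742.68,
}

# Fraction of theoretical peak reached by each kernel
fraction_of_peak = {k: v / peak_mb_per_s for k, v in stream_dx360.items()}

for kernel, frac in fraction_of_peak.items():
    print(f"{kernel:6s} {stream_dx360[kernel]:9.2f} MB/s  = {frac:5.1%} of peak")
```

On these numbers the node sustains somewhat under half of its nominal memory bandwidth.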


FIRST MEMORY TEST [Ciaron Linstead, PIK]


Voltaire Interconnect

➢ Two Voltaire 2012 x 192 Ports (8 of 12 linecards; 2 x 5 ports free).

➢ Mellanox ConnectX Dual-Port 4X DDR IB PCI-E 2.0 x8

➢ Optical cables highly recommended!

➢ Netpipe MPI Latency (Intra/Interswitch) 1.48 µs / 2.25 µs.

➢ Netpipe MPI Bandwidth (Intra/Interswitch) 14.9 Gbps / 14.6 Gbps.

root@nsds01[0]:~# ibstatus
Infiniband device 'mlx4_0' port 1 status:
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            20 Gb/sec (4X DDR)

Infiniband device 'mlx4_0' port 2 status:
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            20 Gb/sec (4X DDR)
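The measured Netpipe bandwidth can be put in context against the 20 Gb/sec link rate reported by ibstatus. DDR InfiniBand uses 8b/10b line encoding, so only 80% of the signalling rate carries payload. The sketch below applies that factor and computes the efficiency of the measured MPI bandwidth. The variable names are illustrative.

```python
# Link rate as reported by ibstatus (signalling rate of a 4X DDR port)
SIGNAL_RATE_GBPS = 20.0

# 8b/10b encoding: 8 payload bits per 10 line bits
ENCODING_EFFICIENCY = 8 / 10

data_rate_gbps = SIGNAL_RATE_GBPS * ENCODING_EFFICIENCY  # 16 Gb/s usable

# Netpipe MPI bandwidth from the slide (Gbps)
measured = {"intra-switch": 14.9, "inter-switch": 14.6}

efficiency = {path: bw / data_rate_gbps for path, bw in measured.items()}

for path, eff in efficiency.items():
    print(f"{path}: {measured[path]} Gbps = {eff:.1%} of the {data_rate_gbps:.0f} Gb/s data rate")
```

Both paths reach over 90% of the usable link rate, so the MPI stack leaves little bandwidth on the table.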


Voltaire Interconnect - ½ on-site installation -


Application Benchmarks

Runtime/s         HS20              DX360             JS22
                  2 DC Woodcrest    2 QC Harpertown   2 DC Power6
                  3 GHz             3 GHz             4 GHz
CLIMBER (1)       1120.00           785.68            1407.56 *
MOM-4 (45)        15259.13          9038.66           4378.00
LPJ spinup (32)   2984.00           2256.00           2408.00
LPJ output (32)   698.5             498.00            637.00
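The runtimes are easier to compare as speedups relative to the HS20 baseline (baseline runtime divided by runtime on the other system). This is a small sketch of that calculation using the figures from the table. Choosing HS20 as the reference is a presentation choice made here, not something the slide prescribes.

```python
# Runtimes in seconds, in the order (HS20, DX360, JS22), taken from the table
runtimes = {
    "CLIMBER (1)":     (1120.00, 785.68, 1407.56),
    "MOM-4 (45)":      (15259.13, 9038.66, 4378.00),
    "LPJ spinup (32)": (2984.00, 2256.00, 2408.00),
    "LPJ output (32)": (698.5, 498.00, 637.00),
}

# Speedup over HS20: values above 1 mean faster than the baseline
speedup = {
    name: {"DX360": hs20 / dx360, "JS22": hs20 / js22}
    for name, (hs20, dx360, js22) in runtimes.items()
}

for name, s in speedup.items():
    print(f"{name:16s} DX360 x{s['DX360']:.2f}   JS22 x{s['JS22']:.2f}")
```

The DX360 is faster than the HS20 on every benchmark, while the JS22 is much faster on MOM-4 but slower on CLIMBER.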


Electrical Power Consumption

PDU+ x 8

➢ Node
  ➢ 2 x 80 W Chip TDP
  ➢ 32 GB FBDIMM
  ➢ 190 W (idle)
  ➢ 312 W (busy)
➢ PDU+: 40 x DX360 = ½ Rack
  ➢ 7.5 kW (idle)
  ➢ 12.5 kW (busy)
➢ System
  ➢ 60 kW (idle)
  ➢ 100 kW (busy)

+/- 40 kW in only two minutes!
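The per-PDU and system figures follow from the per-node measurements by simple multiplication. The sketch below rolls the node values up to one PDU+ (40 nodes) and to the system (8 PDU+), which also shows where the roughly 40 kW swing between idle and busy comes from. It counts compute nodes only, so switches and storage are not included.

```python
# Per-node power draw from the slide (watts)
NODE_IDLE_W = 190
NODE_BUSY_W = 312

NODES_PER_PDU = 40   # one PDU+ feeds half a rack
PDU_COUNT = 8        # "PDU+ x 8"

pdu_idle_kw = NODE_IDLE_W * NODES_PER_PDU / 1000
pdu_busy_kw = NODE_BUSY_W * NODES_PER_PDU / 1000

system_idle_kw = pdu_idle_kw * PDU_COUNT
system_busy_kw = pdu_busy_kw * PDU_COUNT
swing_kw = system_busy_kw - system_idle_kw

print(f"PDU+:   {pdu_idle_kw:.1f} kW idle, {pdu_busy_kw:.1f} kW busy")
print(f"System: {system_idle_kw:.1f} kW idle, {system_busy_kw:.1f} kW busy")
print(f"Swing:  {swing_kw:.1f} kW")
```

The computed values (about 7.6 / 12.5 kW per PDU+ and 61 / 100 kW for the system) agree with the rounded figures on the slide.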


Rear Door Heat Exchangers

Secondary cold water circuit
➢ 16° C
➢ 14 m³/h

Water pumps adjust every 30 minutes!
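To get a feel for the water circuit, the flow rate can be converted into a temperature rise for a given heat load using P = m · cp · ΔT. The sketch below assumes, purely for illustration, that the full 100 kW busy load is carried by this circuit. The water density and specific heat are standard values assumed here, not figures from the slides.

```python
RHO_WATER = 1000.0   # kg/m^3, assumed density of water
CP_WATER = 4186.0    # J/(kg*K), assumed specific heat of water


def water_delta_t(power_w: float, flow_m3_per_h: float) -> float:
    """Temperature rise (K) of the water stream that absorbs power_w."""
    mass_flow_kg_per_s = flow_m3_per_h / 3600.0 * RHO_WATER
    return power_w / (mass_flow_kg_per_s * CP_WATER)


# 14 m^3/h from the slide; 100 kW is the busy system load (illustrative assumption)
water_dt = water_delta_t(100_000, 14)
print(f"Water temperature rise: {water_dt:.1f} K (16 °C supply -> about {16 + water_dt:.0f} °C return)")
```

Under these assumptions the water would warm by roughly 6 K, and a 40 kW load swing would shift the return temperature by a few kelvin, which is why the pump control interval matters.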


… by the way … iDataplex has no redundant power!

“Autonomiezeit” (UPS autonomy time): 30 minutes with 200 kW load.

How would you cool 200 kW without electrical power?


GPFS I/O Subsystem [Klaus Gottschalk, Torsten Bloth, IBM]

NSDS:
➢ 2 x PCIe 8x: Dual Port HBA, Dual Port HCA
➢ 2 x PCIe 4x: Single Port HBA, Single Port HBA

Maximum transfer rates for one 200 TByte filesystem using 16 clients and one 1 TB file:
➢ 5.3 Gbps write
➢ 4.8 Gbps read


I/O Tuning Considerations

➢ Storage Servers (DS)
  ➢ Enclosure Cabling, LUN Layout + LUN Mapping
  ➢ Block + Cache Settings (i.e. dynamic prefetch disabled)
➢ Network Shared Disk Servers (NSDS)
  ➢ Drivers (!) MPP vs. RDAC (Downgrade to SLES10/SP1)
  ➢ Max. Sectors (disk) + Max. Depth (hba)
➢ Filesystem
  ➢ Disk Layout, Blocksize (256 KB x 8 = 2 MB), number of threads, pinned memory, prefetch, etc.
➢ Storage Area Network
  ➢ Zoning (!), WWPN vs. Port Zoning, Switch Firmware, Port to ASIC
➢ Interconnect
  ➢ Haven't looked into IB tuning … yet. Experiences?
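The blocksize note (256 KB x 8 = 2 MB) reads as a segment size multiplied by the number of data disks per LUN, so that one filesystem block fills exactly one RAID stripe. The sketch below illustrates that relationship under this interpretation. The helper `is_full_stripe_write` is hypothetical and only shows why alignment matters. It is not part of GPFS or of any storage tool.

```python
# Interpretation assumed here: 256 KB segment per disk, 8 data disks per LUN
SEGMENT_KB = 256
DATA_DISKS = 8

FULL_STRIPE_KB = SEGMENT_KB * DATA_DISKS   # 2048 KB = 2 MB


def is_full_stripe_write(offset_kb: int, length_kb: int) -> bool:
    """True if a write covers only whole stripes (hypothetical helper).

    Full-stripe writes let a parity RAID compute parity from the new data
    alone, avoiding a read-modify-write cycle on the disks.
    """
    return offset_kb % FULL_STRIPE_KB == 0 and length_kb % FULL_STRIPE_KB == 0


print(f"Full stripe: {FULL_STRIPE_KB} KB")
print(is_full_stripe_write(0, 2048))      # aligned 2 MB block
print(is_full_stripe_write(0, 1024))      # half a stripe
print(is_full_stripe_write(512, 2048))    # right size, wrong offset
```

Matching the filesystem block size to the stripe width is one way to keep large sequential writes on the fast path.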


New LUN Disk Layout (IBM Proposal, Torsten Bloth)

LUNs (DataAndMetadata):

CTRL A: A = dam1, B = dam2, C = dam3, D = dam4, E = dam5, F = dam6, G = dam7
CTRL B: h = dam8, i = dam9, j = dam10, k = dam11, l = dam12, m = dam13, n = dam14

slot    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Enc 1   A  A  A  B  B  C  C  D  D  E  E  E  F  F  G  G
Enc 2   A  A  B  B  B  C  C  D  D  E  E  F  F  F  G  G
Enc 3   A  A  B  B  C  C  C  D  D  E  E  F  F  G  G  G
Enc 4   A  A  B  B  C  C  D  D  D  E  E  F  F  G  G  HS
Enc 6   h  h  i  i  j  j  k  k  k  l  l  m  m  n  n  HS
Enc 7   h  h  i  i  j  j  j  k  k  l  l  m  m  n  n  n
Enc 8   h  h  i  i  i  j  j  k  k  l  l  m  m  m  n  n
Enc 9   h  h  h  i  i  j  j  k  k  l  l  l  m  m  n  n
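The layout can be checked mechanically: counting how many slots each LUN letter occupies across the enclosures shows how many disks make up one LUN. The sketch below does that count. Reading "HS" as a hot-spare slot and the enclosure labels as Enc 1 to 4 and 6 to 9 is an interpretation of the slide, not something it states explicitly.

```python
from collections import Counter

# Slot assignments per enclosure, transcribed from the layout (16 slots each)
LAYOUT = {
    "Enc 1": "A A A B B C C D D E E E F F G G",
    "Enc 2": "A A B B B C C D D E E F F F G G",
    "Enc 3": "A A B B C C C D D E E F F G G G",
    "Enc 4": "A A B B C C D D D E E F F G G HS",
    "Enc 6": "h h i i j j k k k l l m m n n HS",
    "Enc 7": "h h i i j j j k k l l m m n n n",
    "Enc 8": "h h i i i j j k k l l m m m n n",
    "Enc 9": "h h h i i j j k k l l l m m n n",
}

# Count disks per LUN letter across all enclosures
disks_per_lun = Counter(label for row in LAYOUT.values() for label in row.split())
hot_spares = disks_per_lun.pop("HS")   # assumed meaning: hot spare

for lun, count in sorted(disks_per_lun.items()):
    print(f"LUN {lun}: {count} disks")
print(f"Hot spares: {hot_spares}")
```

Every LUN ends up with 9 disks spread over four enclosures, which would fit an 8+1 parity arrangement and the "x 8" in the blocksize calculation, although the slides do not name the RAID level.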


SAN Zoning (IBM Proposal, Torsten Bloth)

[Figure: SAN zoning diagram. Recoverable labels: Server 1 to Server 4 with ports 1 to 4 each; controllers A and B; switches san03 and san04; storage systems ds12, ds13, ds14 and ds15.]


TSM/HSM Subsystem [Klaus Gottschalk, IBM]

Restore of one file system from new tapes/tape drives into pre-production GPFS. One IBM E06 drive used!

Restore processing finished
Total number of objects restored:   2 661 756
Total number of objects failed:     62
Total number of bytes transferred:  12.34 TB
Data transfer time:                 77 943.60 sec
Network data transfer rate:         170 093.47 KB/sec
Aggregate data transfer rate:       124 869.86 KB/sec
Elapsed processing time:            29:29:32
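The two transfer rates in the restore summary differ only in the time base: the network rate divides the bytes by the pure data transfer time, the aggregate rate by the total elapsed time. The sketch below recomputes both from the reported totals. It assumes binary units (1 TB = 1024^3 KB), and small deviations are expected because the byte count is rounded to 12.34 TB.

```python
# Figures from the restore summary
BYTES_TRANSFERRED_TB = 12.34
DATA_TRANSFER_TIME_S = 77943.60
ELAPSED_H, ELAPSED_M, ELAPSED_S = 29, 29, 32

# Assumption: binary units, 1 TB = 1024**3 KB
transferred_kb = BYTES_TRANSFERRED_TB * 1024**3
elapsed_seconds = ELAPSED_H * 3600 + ELAPSED_M * 60 + ELAPSED_S

network_rate_kb_s = transferred_kb / DATA_TRANSFER_TIME_S
aggregate_rate_kb_s = transferred_kb / elapsed_seconds

print(f"Elapsed time:   {elapsed_seconds} s")
print(f"Network rate:   {network_rate_kb_s:,.0f} KB/sec (reported: 170,093.47)")
print(f"Aggregate rate: {aggregate_rate_kb_s:,.0f} KB/sec (reported: 124,869.86)")
```

Both recomputed rates land within about 0.1% of the reported values, so the aggregate figure of roughly 120 MB/sec is the one to use when estimating how long a restore from a single drive will take.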


LoadL


Summary

➢ IBM iDataplex promised best price/performance and best electrical efficiency for PIK application benchmarks.
➢ Detailed technical planning is required before installation; authorization recommended.
➢ UPS is mandatory, but mind the cooling after power is interrupted!
➢ Though it (still) looks like an easy set-up, the devil is in the detail:
  ➢ Electrical power uptake varies significantly with application.
  ➢ Tight control of secondary chilled water circuit required (a good facility manager, that is).
  ➢ Would have liked a Cisco NAM2 or equivalent installed in the central ADMIN LAN switch for debugging.
  ➢ I/O performance tuning is still challenging.
  ➢ Xcat2 is straightforward to use, but mind that this is just a basic tool.