Karsten Kramer @ AEI Cluster Day 02/19/09
Potsdam Scientists to Tackle New Type of Weather Simulations with IBM iDataPlex
Jan 21, 2009 - The Potsdam Institute for Climate Impact Research (PIK) is rolling out a new IBM supercomputer that will increase its computing capacity more than 30-fold.
Potsdam researchers plan to employ IBM’s high-performance iDataPlex servers to more precisely predict weather events that have so far proven to be incalculable – extreme, short-term phenomena such as torrential rain or drought.
[Joint press release by the Potsdam Institute for Climate Impact Research (PIK) and IBM Germany Ltd.]
Outline
➢ Acknowledgments
➢ Requirements
➢ Procurement and Installation
➢ Hardware, Software Overview
➢ System Architecture
➢ Compute Subsystem IBM iDataPlex™
➢ Voltaire Infiniband Interconnect
➢ Benchmarks
➢ Electrical Power Consumption & Cooling
➢ I/O Subsystem
➢ Summary
Acknowledgments
➢ Dr. Werner von Bloh
➢ Roger Grzondziel
➢ Achim Glauer
➢ Karsten Kramer
➢ Dr. Ciaron Linstead
➢ Kerstin Heuer
➢ Frauke Haneberg
➢ Ingo Deutsch
➢ Carsten Goldhan
➢ Torsten Klietsch
➢ Torsten Kurz
➢ Klaus Hassels
➢ Martin Hiegl
➢ Christoph Pospiech
➢ Klaus Gottschalk
➢ Michael Lauffs
➢ Torsten Bloth
➢ Michael Julien
➢ Steffen Schwab
➢ Mike Kruse-Heidler
➢ Maik Bornhardt, Axel Jahn and colleagues
PIK
Gneise 66
IBM
Requirements
➢ Reliable, high-performance general-purpose compute facility
  ➢ Application benchmarks: LPJ (C+MPI2), CLIMBER (F77), MOM4 (F90+MPI), S (C++/MPI)
➢ Power uptake limited to 100 kW (compute) / 150 kW (system) maximum
➢ Parallel file system, backup/restore and Hierarchical Storage Management
  ➢ File system with 4 GB/s, extension of the existing tape library and TSM (LAN-free) storage infrastructure
➢ Integration into the building infrastructure (UPS, cooling)
  ➢ Replacement of the old cluster, installation of a new UPS with 30 minutes / 250 kVA
  ➢ Control engineering, limited airflow of about 17,000 m³/h (max. 100 kW by air)
➢ Service and technical support
  ➢ 4 yrs. hardware maintenance and software support (OS, management, compilers) + microcode support
➢ Financing
  ➢ 3 installments over 3 years
Procurement and Installation
➢ Tender and negotiation process (EU)
  ➢ June 11th - October 30th 2008
➢ IBM awarded contract for secondary offer (Nebenangebot) based on iDataPlex™ systems
  ➢ October 31st 2008
➢ Preparation of raised floor – electricity and cooling
  ➢ November 3rd - 14th 2008
➢ Hardware delivery and on-site installation
  ➢ November 17th – December 3rd 2008
➢ Software installation (OS, LAN, IB, SAN, management)
  ➢ December 3rd - 17th 2008
➢ Performance tests
  ➢ December 17th - 31st 2008
Installation (cont.)
➢ System accepted with conditions
  ➢ December 31st 2008
➢ Annual maintenance facilities – extended by UPS installation and control engineering
  ➢ January 2nd - 9th 2009
➢ I/O tuning
  ➢ January 12th - 23rd 2009
➢ TSM/HSM installation and integration
  ➢ February 2nd - 6th 2009
➢ Additional software (compilers, debugger, batch queuing, etc.)
  ➢ February 9th – ongoing!
➢ Scheduled production:
  ➢ March 2nd 2009
Hardware
➢ 2560 cores in 320 nodes, each with two quad-core Intel Xeon E5472 CPUs (3 GHz / 1600 MHz FSB).
➢ 10 TByte RAM, 800 MHz FB-DIMM.
➢ 4x DDR Voltaire InfiniBand interconnect (two fabrics).
➢ 200 TByte GPFS on 15k rpm FC disks.
➢ 1 PByte of tape cartridges and 8 x IBM E06 drives.
Software
Control
➢ Novell SLES10 SP2 x86-64
➢ IBM AIX 6.1
➢ OFED 1.3
➢ xCAT 2.1
➢ IBM GPFS 3.2.1 (RDMA)
➢ IBM LoadLeveler 3.5.3
➢ Tivoli Storage Manager 5.5.3, incl. HSM
➢ Cisco IOS 12.2(18)SXF7
➢ Voltaire Fabric Manager 5.2.0
➢ Brocade Fabric Manager 5.3.1

Applications
➢ Intel Cluster Toolkit Compiler Edition 3.2 (C++, Fortran, MPI)
➢ Intel VTune Performance Analyzer 9.1
➢ GCC 4.1.2 / Open MPI
➢ TotalView Debugger 8.6.2
➢ Matlab
  ➢ Mapping
  ➢ Optimization
  ➢ Signal Processing
  ➢ Statistics
➢ Tivoli Storage Manager 5.5.2
System Architecture [Klaus Gottschalk, IBM]
Compute Subsystem: IBM iDataPlex™

IBM dx360 node:
➢ Diskless
➢ 16 x memory slots
➢ eth0 + BMC/IPMI 2.0
➢ Mellanox ConnectX dual-port 4X DDR IB, PCI-E 2.0 x8, 5.0 GT/s
➢ Emcore optical cables

Two 3-phase PDU+:
➢ 12 x C13 outlets
➢ Web server

Two Cisco 3750G-48TS, 4 x 1000Base-TX uplinks

Rear-door heat exchanger: ~25 kW
MEMORY BANDWIDTH
Intel Harpertown / Seaburg (E5472 / 5400B)
➢ Dual QC Xeon 3 GHz
➢ 32 GB RAM – 16 x 2 GB DDR2-800 FBDIMM, AMB+
➢ 4 memory channels x 800 MHz
➢ UMA
➢ 64 MByte snoop filter
➢ 25.6 GB/s max. bandwidth (4 x 800 MHz x 8 Byte)

STREAM (MB/s)
            COPY       SCALE      ADD        TRIAD
dx360/8     11242.36   11273.16    9721.32    9742.68
js22/4      13606.58   13589.12   15416.37   15456.58
p655/8      12059.0    12072.0    14925.0    15090.0
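For reference, the four kernels behind these numbers are plain array loops. Below is a minimal, illustrative C sketch of COPY, SCALE, ADD and TRIAD – not the official STREAM benchmark (which repeats each kernel and reports the best rate); the array size, the single pass per kernel and the simple wall-clock timer are assumptions of this sketch.

/*
 * Minimal sketch of the four STREAM kernels (COPY, SCALE, ADD, TRIAD).
 * Not the official benchmark. Compile with optimization, e.g. -O3.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define N 20000000L   /* 20 M doubles per array, large enough to defeat the caches */

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    double t, scalar = 3.0;
    long i;

    if (!a || !b || !c) { fprintf(stderr, "out of memory\n"); return 1; }
    for (i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    t = now();                                           /* COPY:  c = a       */
    for (i = 0; i < N; i++) c[i] = a[i];
    printf("COPY : %10.2f MB/s\n", 2.0 * N * sizeof(double) / (now() - t) / 1e6);

    t = now();                                           /* SCALE: b = s*c     */
    for (i = 0; i < N; i++) b[i] = scalar * c[i];
    printf("SCALE: %10.2f MB/s\n", 2.0 * N * sizeof(double) / (now() - t) / 1e6);

    t = now();                                           /* ADD:   c = a + b   */
    for (i = 0; i < N; i++) c[i] = a[i] + b[i];
    printf("ADD  : %10.2f MB/s\n", 3.0 * N * sizeof(double) / (now() - t) / 1e6);

    t = now();                                           /* TRIAD: a = b + s*c */
    for (i = 0; i < N; i++) a[i] = b[i] + scalar * c[i];
    printf("TRIAD: %10.2f MB/s\n", 3.0 * N * sizeof(double) / (now() - t) / 1e6);

    printf("checksum: %f\n", a[0] + b[0] + c[0]);        /* keep the compiler honest */
    free(a); free(b); free(c);
    return 0;
}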
FIRST MEMORY TEST [Ciaron Linstead, PIK]
Voltaire Interconnect
➢ Two Voltaire 2012 switches, 192 ports each (8 of 12 line cards; 2 x 5 ports free).
➢ Mellanox ConnectX dual-port 4X DDR IB, PCI-E 2.0 x8.
➢ Optical cables highly recommended!
➢ NetPIPE MPI latency (intra-/inter-switch): 1.48 µs / 2.25 µs.
➢ NetPIPE MPI bandwidth (intra-/inter-switch): 14.9 Gbps / 14.6 Gbps.
root@nsds01[0]:~# ibstatus
Infiniband device 'mlx4_0' port 1 status:
        state:       4: ACTIVE
        phys state:  5: LinkUp
        rate:        20 Gb/sec (4X DDR)

Infiniband device 'mlx4_0' port 2 status:
        state:       4: ACTIVE
        phys state:  5: LinkUp
        rate:        20 Gb/sec (4X DDR)
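The NetPIPE figures above boil down to a ping-pong measurement between two ranks: small messages expose latency, large messages expose bandwidth. Below is a minimal, illustrative MPI sketch of that idea – it is not NetPIPE itself, and the message sizes and repetition count are arbitrary choices of the sketch. Run it with two ranks placed on different nodes (or on the same node to see the intra-node path).

/*
 * Minimal MPI ping-pong sketch: one-way latency and bandwidth per message size.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, reps = 1000, s, i;
    size_t sizes[3] = { 1, 1024, 1024 * 1024 };   /* message sizes in bytes */
    char *buf = malloc(sizes[2]);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (s = 0; s < 3; s++) {
        int n = (int)sizes[s];
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (i = 0; i < reps; i++) {
            if (rank == 0) {          /* ping */
                MPI_Send(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {   /* pong */
                MPI_Recv(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double dt = (MPI_Wtime() - t0) / (2.0 * reps);   /* one-way time per message */
        if (rank == 0)
            printf("%8d bytes: %8.2f us one-way, %6.2f Gbps\n",
                   n, dt * 1e6, 8.0 * n / dt / 1e9);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}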
Voltaire Interconnect - ½ on-site installation -
Application Benchmarks
Runtime/s          HS20               DX360              JS22
                   2 DC Woodcrest     2 QC Harpertown    2 DC Power6
                   3 GHz              3 GHz              4 GHz
CLIMBER (1)         1120.00             785.68            1407.56 *
MOM-4 (45)         15259.13            9038.66            4378.00
LPJ spinup (32)     2984.00            2256.00            2408.00
LPJ output (32)      698.5              498.00             637.00
Electrical Power Consumption – PDU+ x 8
➢ Node
  ➢ 2 x 80 W chip TDP
  ➢ 32 GB FBDIMM
  ➢ 190 W (idle)
  ➢ 312 W (busy)
➢ PDU+ – 40 x dx360 = ½ rack
  ➢ 7.5 kW (idle)
  ➢ 12.5 kW (busy)
➢ System
  ➢ 60 kW (idle)
  ➢ 100 kW (busy)

(The figures are consistent: 40 nodes x 190/312 W ≈ 7.5/12.5 kW per PDU+, and 8 PDU+ ≈ 60/100 kW for the whole system.)

+/- 40 kW in only two minutes!
Rear-Door Heat Exchangers

Secondary cold-water circuit
➢ 16 °C
➢ 14 m³/h

Water pumps adjust every 30 minutes!
… by the way … iDataPlex has no redundant power!

"Autonomiezeit" (UPS autonomy time): 30 minutes at 200 kW load.

…

How would you cool 200 kW without electrical power?
GPFS I/O Subsystem [Klaus Gottschalk, Torsten Bloth, IBM]

NSD servers: 2 x PCIe x8 with dual-port HBA and dual-port HCA; 2 x PCIe x4 with single-port HBAs

Maximum transfer rates for one 200 TByte file system, using 16 clients and one 1 TB file:
➢ 5.3 Gbps write
➢ 4.8 Gbps read
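As a rough idea of how such a rate can be probed from a single client, the sketch below times a large sequential write in 2 MB chunks (matching the GPFS block size). It is only illustrative: the file path, the 8 GB size and the single-client approach are assumptions of this sketch, not the benchmark actually used for the 16-client figures above.

/*
 * Minimal single-client sketch of timing a large sequential write.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

#define BLOCK   (2UL * 1024 * 1024)   /* 2 MB write size             */
#define NBLOCKS 4096UL                /* 4096 x 2 MB = 8 GB in total */

int main(void)
{
    const char *path = "/gpfs/scratch/iotest.dat";   /* hypothetical GPFS path */
    char *buf = malloc(BLOCK);
    struct timeval t0, t1;
    unsigned long i;
    int fd;

    if (!buf) { perror("malloc"); return 1; }
    memset(buf, 0xAB, BLOCK);

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    gettimeofday(&t0, NULL);
    for (i = 0; i < NBLOCKS; i++)
        if (write(fd, buf, BLOCK) != (ssize_t)BLOCK) { perror("write"); return 1; }
    fsync(fd);                          /* flush to disk before stopping the clock */
    gettimeofday(&t1, NULL);
    close(fd);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) * 1e-6;
    double gb   = BLOCK * NBLOCKS / 1e9;
    printf("wrote %.1f GB in %.1f s -> %.2f Gbps\n", gb, secs, 8.0 * gb / secs);

    free(buf);
    return 0;
}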
I/O Tuning Considerations
➢ Storage servers (DS)
  ➢ Enclosure cabling, LUN layout + LUN mapping
  ➢ Block + cache settings (e.g. dynamic prefetch disabled)
➢ Network Shared Disk servers (NSDS)
  ➢ Drivers (!) MPP vs. RDAC (downgrade to SLES10/SP1)
  ➢ Max. sectors (disk) + max. depth (HBA)
➢ File system
  ➢ Disk layout, block size (256 KB x 8 = 2 MB), number of threads, pinned memory, prefetch, etc.
➢ Storage Area Network
  ➢ Zoning (!), WWPN vs. port zoning, switch firmware, port to ASIC
➢ Interconnect
  ➢ Haven't looked into IB tuning … yet. Experiences?
New LUN Disk Layout [IBM Proposal, Torsten Bloth]

LUNs (DataAndMetadata):
  CTRL A: A = dam1, B = dam2, C = dam3, D = dam4, E = dam5, F = dam6, G = dam7
  CTRL B: h = dam8, i = dam9, j = dam10, k = dam11, l = dam12, m = dam13, n = dam14

Slot     1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16
Enc 1    A  A  A  B  B  C  C  D  D  E  E  E  F  F  G  G
Enc 2    A  A  B  B  B  C  C  D  D  E  E  F  F  F  G  G
Enc 3    A  A  B  B  C  C  C  D  D  E  E  F  F  G  G  G
Enc 4    A  A  B  B  C  C  D  D  D  E  E  F  F  G  G  HS
Enc 6    h  h  i  i  j  j  k  k  k  l  l  m  m  n  n  HS
Enc 7    h  h  i  i  j  j  j  k  k  l  l  m  m  n  n  n
Enc 8    h  h  i  i  i  j  j  k  k  l  l  m  m  m  n  n
Enc 9    h  h  h  i  i  j  j  k  k  l  l  l  m  m  n  n

(HS = hot spare)
SAN Zoning [IBM Proposal, Torsten Bloth]

(Zoning diagram: four NSD servers, each with four ports, zoned through the SAN switches san03 (A controllers) and san04 (B controllers) to the A/B controllers of the storage systems ds12 - ds15.)
TSM/HSM Subsystem [Klaus Gottschalk, IBM]

Restore of one file system from new tapes/tape drives into the pre-production GPFS. One IBM E06 drive used!

Restore processing finished
  Total number of objects restored:   2 661 756
  Total number of objects failed:     62
  Total number of bytes transferred:  12.34 TB
  Data transfer time:                 77 943.60 sec
  Network data transfer rate:         170 093.47 KB/sec
  Aggregate data transfer rate:       124 869.86 KB/sec
  Elapsed processing time:            29:29:32

(The network rate is bytes transferred divided by the data transfer time; the aggregate rate divides by the total elapsed time, hence the lower figure.)
LoadL
Summary
➢ IBM iDataPlex promised the best price/performance and the best electrical efficiency for the PIK application benchmarks.
➢ Detailed technical planning ahead of the installation is required; authorization recommended.
➢ UPS is mandatory – but mind the cooling after power is interrupted!
➢ Though it (still) looks like an easy set-up, the devil is in the details:
  ➢ Electrical power uptake varies significantly with the application.
  ➢ Tight control of the secondary chilled-water circuit is required (a good facility manager, that is).
  ➢ Would have liked a Cisco NAM-2 or equivalent installed in the central ADMIN LAN switch for debugging.
  ➢ I/O performance tuning is still challenging.
  ➢ xCAT 2 is straightforward to use, but mind that it is just a basic tool.