Complete InfiniBand From QLogic
Wenhao Wu, HPC Systems Engineer
Meeting Agenda
• QLogic in the HPC market
• Sun/QLogic Engagements
• QLogic IB Portfolio – “Complete InfiniBand”
• QLogic IB Stack and Protocols
• Discussions
History of InfiniBand

In 1999, two separate initiatives merged to form the InfiniBand Trade Association (IBTA):
• NGIO, backed by Intel
• FIO, backed by IBM, Compaq, and HP

Charter members include Sun, HP, IBM, Dell, Microsoft, and Intel.
A spec was written that all companies in the IB space strive to maintain (currently version 1.2) for industry-standard interoperability.
Common InfiniBand Terminology

Switches
• Provide “any-to-any” high-speed access within the IB network
• Switches are currently available in sizes from 12 to 288 nodes
• Single Data Rate (SDR) at 10 Gb/s; Double Data Rate (DDR) at ~20 Gb/s; Quad Data Rate (QDR) at 40 Gb/s (see the note below)

HCA – Host Channel Adapter
• Resides within a server
• Configurations include:
  – PCI-X: dual port with memory (64-bit, 133 MHz optimal)
  – PCI Express (PCIe x8): dual port with memory; dual port memory-free; single port memory-free
  – HyperTransport
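For reference (not on the original slide), these figures are 4X signaling rates – four lanes at 2.5, 5, or 10 Gb/s each – and 8b/10b encoding leaves 80% of the signal rate as data bandwidth:

\[
4 \times 2.5\ \text{Gb/s} = 10\ \text{Gb/s (SDR)},\qquad
4 \times 5\ \text{Gb/s} = 20\ \text{Gb/s (DDR)},\qquad
4 \times 10\ \text{Gb/s} = 40\ \text{Gb/s (QDR)}
\]
\[
\text{data rate} = 0.8 \times \text{signal rate} \;\Rightarrow\; 8,\ 16,\ \text{and}\ 32\ \text{Gb/s respectively}
\]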
QLogic Provides Leading InfiniBand Technology
[Timeline: QLogic acquisitions feeding its Storage Solutions, Switch Products, and Computer Systems groups – Emulex Micro Devices (1993), Ancor Communications (2000), Little Mountain Group (2001), Troika Networks (2005), and two InfiniBand acquisitions in April 2006 and November 2006]
Positioned for Success in InfiniBand
Financially strong public company
• $586M revenue in FY07
• $500M cash

Acquired two leading InfiniBand companies in 2006
• Over 200 employees focused on InfiniBand and HPC

Strong and growing HPC business development team
Well-established global support team
Major investments in DVT and Signal Integrity
Demonstrated success in delivering the current generation of end-to-end solutions
• Directors offered over a year ahead of the nearest competition
• Over 55,000 external DDR InfiniBand switch ports shipped
[Chart: Revenue ($Millions) – 428.7 (FY05), 494.1 (FY06), 586.1 (FY07)]
QLogic Supercomputing TOP100
QLogic is installed in 10 of 26 IB Clusters in the Supercomputing TOP100
Partner              RMax (TFLOPS)  CPUs  Site           Rank
Cisco/Dell           15.6           2208  Stanford       54
Cisco/Dell           34.8           5440  Louisiana ONI  23
Cisco/Dell           53             9024  NNSA/Sandia    11
Cisco/Dell           46.7           5848  TACC           15
Cisco/Dell           42.4           5200  Maui MHPCC     16
SilverStorm Direct   12.2           2200  Virginia Tech  71
Alliance             10.8           1140  Intel          81
Linux Networx        15.2           3368  ARL            57
Dell                 18.3           2340  Cambridge      44
Linux Networx        40.6           4416  ARL            17
Industry’s First 500+ SDR Fabric – RIKEN
2004; #111 on the TOP500 after 3 years in operation

• Installation: Q1 2004. Operational: March 1, 2004
• 512-node cluster (1,024 CPUs) with InfiniBand, leveraging the combined technologies of Fujitsu and InfiniCon Systems
• Fujitsu rigorously evaluated InfiniCon’s products for data integrity, robustness, component selection, management interfaces, mechanical design, reliability, and numerous other criteria
• Benefits
  – Created one of the world’s top-performing supercomputers (6.2 TeraFlops) using commodity components and industry-standard technologies
  – First 500-node cluster
  – Accelerated performance, completing compute tasks 20–45% faster than competitive clustering solutions
Industry’s First 800+ node DDR Network Deployment
2006: TI-06 with ARL & Linux Networx

The first and largest (at the time) DDR network deployment:
2 systems (1,100 & 842 compute nodes), 70 TFLOPS. June 2006; #57 on the TOP500.

• Largest DDR cluster at the time
• 842 compute nodes
  – 3,368 processors
  – 17 TFLOPS
*1 HABU = equivalent performance of the weighted DoD benchmark suite on a 1024 CPU IBM Power3 system
First 1100+ DDR Cluster
June 2007; #17 on the TOP500

• Largest DDR cluster at the time
• 1,100 compute nodes (MJM)
  – 4,400 3.0 GHz Intel Woodcrest cores for computation
  – 53 TFLOPS
  – Increased computational capability by more than 64 HABUs* and 50 TFLOPS
• 28 management nodes
  – 112 3.0 GHz cores for login, storage, and administration
  – 8.8 TB of memory and 200 TB of disk
• A global file system of 200 TB
  – IBM GPFS using QLogic Sockets Direct Protocol (SDP) for InfiniBand
  – Performance in excess of 9.5 GB/sec
• All nodes communicate via 4X DDR (20 Gbps) QLogic IB
• IB-to-10GbE gateway module provides all nodes with uplink capability
Engagements with Sun
Sun/QLogic – Joint Efforts
• Sun Partner Advantage Program member
• Sun HPTC Alliance Partner
• Sun Solaris Ready Program member
• InfiniBand infrastructure supplier for the Sun Solution Center for HPC in Oregon – 1000+ node capability, high-density cluster, TOP500 SC listing
• InfiniBand infrastructure supplier for the Sun Standards Benchmark Lab in Oregon – 128 nodes plus FC I/O capability
• Lustre
  – Total at Houston: 512 cores with QLogic SDR InfiniBand adapters
  – QuickSilver stack certified with Lustre
Sun/QLogic – Joint Efforts
• Schlumberger: joint efforts produced a certified Oil & Gas offering used in Sun RFQ bids
• Oracle RAC: joint efforts produced a Finance reference demo
• Installs at Kuwaiti Oil Company, University of Granada, University of Oslo, Conoco-Alaska, SDSC, Penn State, Princeton, Oregon State, Univ. Catholique de Louvain, Univ. de Liège, BioInformatic Institute, UAE Weather
InfiniBand Product Portfolio
Anatomy of a 1000 Node Cluster
[Diagram: 3–4 management nodes, 968 compute nodes, and 32 I/O nodes, each fitted with HCAs, connect through IB edge switches and an IB director; the I/O nodes front an xxTB FC SAN, xxTB NAS, IB SAN, and xxTB DAS]
Complete InfiniBand Solution
Widest range of host InfiniBand adapters
OFED Plus software stacks
High Capacity Multi-protocol Director
InfiniBand Edge Switches
Multi-protocol gateways
Cables
InfiniBand Fabric Management
QLogic Host Channel Adapters
First Silicon Alternative for IB HCAs
First x16 DDR PCIe HCA
Complete Offering of the HCA IB Family
• x8, x16, SDR, DDR, dual port, single port, with memory, memory-free
Next-Generation HCA Technology Leadership
• Very low latency coupled with a very high message rate
• Industry-leading SPEC MPI2007 results
• Lowest power consumption: <1/2 the power of alternatives
Series 9000 InfiniBand Products

Core Fabric Switches / Multi-Protocol Fabric Director (2U–14U chassis, 19” rack):
• 9240 – 288 ports
• 9120 – 144 ports
• 9080 – 96 ports
• 9040 – 48 ports
• 9024 – 24 ports
• 9020 – 20 ports

Modules: 4X SDR, 4X DDR, EVIC, FVIC
Common across the series:
• Same modules – IB, FC, Ethernet
• Same spine
• Same power supply unit
• Same fan tray unit
• Same OOB management interface
• Same serial interface
• Same running image
Series 9000 Usage

Switch Type                            # of Nodes
9240 (Core) + 9240 (Edge)              10,369 – 20,736
9240 (Core) + 9120 (Edge)              3,457 – 10,368
9240 (Core) + 9024 Unmanaged (Edge)    1,729 – 3,456
9120 (Core) + 9024 Unmanaged (Edge)    289 – 1,728
9240                                   145 – 288
9120                                   97 – 144
9080                                   49 – 96
9040                                   21 – 48
9024 (Managed)                         1 – 24
9020                                   1 – 22
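One plausible reading of the two 9024-edge rows, assuming a two-tier constant-bisection fat tree in which each 24-port 9024 edge splits 12 ports to hosts and 12 uplinks to twelve core switches (so the number of edge switches equals the core’s port count):

\[
\text{max hosts} = 12 \times N_{\text{core ports}}
\;\Rightarrow\;
12 \times 144\ (\text{9120 core}) = 1728,
\qquad
12 \times 288\ (\text{9240 core}) = 3456
\]

which matches the upper bounds of the 9120+9024 and 9240+9024 rows.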
QLogic IB Stack and Protocols
IB Protocol
Linux Software Stacks: Standards +

[Diagram: three host software stacks side by side]
• Accelerated stacks (InfiniPath adapters only): the iPath driver exposes the PSM API to QLogic MPICH, Open MPI, HP-MPI, Scali MPI, MVAPICH, and MVAPICH2. For InfiniPath adapters, additional acceleration is available!
• Standard stack: a verbs provider driver beneath the standard Linux TCP/IP stack and the OpenFabrics stack – OpenFabrics verbs; VNIC, SDP, SRP, iSER; uDAPL; Intel MPI, Open MPI, MVAPICH, …
• QuickSilver stack: QuickSilver verbs and VAPI beneath uDAPL; Intel MPI, Open MPI, QS-MVAPICH, Scali MPI, HP-MPI, …; plus Oracle RAC accelerator (via RDS), GPFS accelerator (via SDP), Enterprise SDP (NFS, inet, etc.), Enterprise Ethernet IO controller (via VNIC), Enterprise IB storage & FC IO controller (via SRP), Enterprise Fabric Management, and FastFabric Tools
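As a concrete illustration of choosing between the accelerated PSM path and the generic verbs path at runtime (my example, not from the slide – it assumes an Open MPI build that includes both the psm MTL and the openib BTL; the host file and binary names are placeholders):

# run over the InfiniPath PSM fast path (accelerated stack)
mpirun --mca pml cm --mca mtl psm -np 32 -hostfile ./hosts ./my_app

# force the generic OpenFabrics verbs path instead (standard stack)
mpirun --mca pml ob1 --mca btl openib,self -np 32 -hostfile ./hosts ./my_app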
QLogic OFED+ Host Software Stack
All OFED benefits plus optional value-added capabilities
Host Software Support Model

[Diagram: OpenFabrics Alliance development stream – a snapshot of each OFED release (1.1, 1.2, 1.3) is taken and combined with QLogic value-adds to produce QLogic OFED 1.1, 1.2, and 1.3; bug fixes flow between the OFA-maintained, OFED-maintained, and QLogic-maintained streams]

The model provides for the rapid resolution of customer issues!
Rebuilding QuickSilver & QuickSilver MPI
Build QuickSilver from Source
• Get a copy of InfiniServ*GPL.tgz (and InfiniServNonGPL*.tgz if IFS2008 was purchased)
• Extract the packages, GPL first
• Common missing packages include: kernel-headers, kernel-source, x11-devel, g77, expect, and tcl
• If an error occurs, parse make.res for “ERROR” or “Error”, or send a copy of make.res to [email protected]
[root@tsg67 ~]$ tar zxf InfiniServBasicGPL.4.1.1.0.15.tgz
[root@tsg67 ~]$ cd InfiniServ.4.1.1.0.15/ALL_HOST/
[root@tsg67 ALL_HOST]$ ./do_build
ICS CDE Environment Settings:
Build Target    : X86_64
Build Target OS : redhat LINUX 2.6.9-67.ELsmp
Build Platform  : redhat LINUX
…
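A minimal sketch that strings these steps together, including the make.res error scan suggested above (the version string reuses the example; the location of make.res may differ in your tree):

#!/bin/sh
VER=4.1.1.0.15
tar zxf "InfiniServBasicGPL.${VER}.tgz"    # extract the GPL package first
cd "InfiniServ.${VER}/ALL_HOST" || exit 1
./do_build                                 # ICS CDE build, as in the transcript above
# parse make.res for errors, as recommended above
grep -i "error" make.res && echo "Problems found; review make.res or mail it to support."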
Building QuickSilver from Source
Once complete, you will see a summary report of the build. The software is located in release/<OS>/<arch>/.

From this point you can proceed with a standard install. InfiniServ*G.tgz – the G suffix means this was a compilation from source.
[root@tsg67 ALL_HOST]$ ls -al release/redhat/X86_64/
total 258748
drwxr-xr-x 4 root root      4096 Apr 9 13:36 .
drwxr-xr-x 3 root root      4096 Apr 9 13:35 ..
drwxr-xr-x 9 root root      4096 Apr 9 13:35 InfiniServ.4.1.1.0.15G
-rw-r--r-- 1 root root 100045684 Apr 9 13:36 InfiniServ.4.1.1.0.15G.tgz
lrwxrwxrwx 1 root root        22 Apr 9 13:36 InfiniServBasic.4.1.1.0.15G -> InfiniServ.4.1.1.0.15G
-rw-r--r-- 1 root root 100044283 Apr 9 13:36 InfiniServBasic.4.1.1.0.15G.tgz
Rebuilding QuickSilver MPI for other 3rd-party compiler support

Ensure the MPI source was installed (iba_config).

The following compiler build options are available if installed. (Ensure the compiler you need is in $PATH.)
[root@tsg67 mpich]$ cd /opt/iba/src/InfiniServMPI/mpich

Compiler options are:
gnu_autoselect:  choose gnu or gnu4 based on g77/gfortran availability (default)
gnu:             gcc, g77 (default if g77 on system)
gnu4:            gcc, gfortran (default if gfortran on system)
path_x86_64:     PathScale compiler for x86_64
pgi_x86_64:      Portland compiler for x86_64 (link symbols have two underscores)
pgi_x86_64_nsu:  Portland compiler for x86_64 (link symbols have single underscore)
pgi_x86_32:      Portland compiler for Opteron with 32-bit OS (link symbols have two underscores)
pgi_x86_32_nsu:  Portland compiler for Opteron with 32-bit OS (link symbols have single underscore)
pgi_ia32:        Portland compiler for IA32
pgi_ia32_nsu:    Portland compiler for IA32 (link symbols have single underscore)
ifc_ia32:        Intel Fortran compiler for IA32
ifc_x86_64:      Intel Fortran compiler for x86_64
ifc_ia64:        Intel Fortran compiler for IA64
ifcc_ia32:       Intel Fortran and C compiler for IA32
ifcc_x86_64:     Intel Fortran and C compiler for x86_64
ifcc_ia64:       Intel Fortran and C compiler for IA64
lf95_ia32:       Fujitsu F95 compiler for IA32
Rebuilding QuickSilver MPI

• ./do_config provides command-line control to select the installation location and compiler type, e.g. ./do_config pgi_x86_64 /opt/mpich_pgi
• ./do_build automates this by searching your $PATH for additional compilers

[root@tsg67 mpich]$ ./do_build
InfiniServ MPI Library/Tools rebuild
1) GNU_g77
2) pgi_x86_64
3) pgi_x86_64_nsu
Select Compiler: 1
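To sanity-check a rebuilt MPI, a hedged sketch (the /opt/mpich_pgi prefix reuses the do_config example above; hello.c is a throwaway test program):

# put the rebuilt MPI first in PATH
export PATH=/opt/mpich_pgi/bin:$PATH

# build and run a trivial MPI program with the new toolchain
cat > hello.c <<'EOF'
#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("hello from rank %d\n", rank);
    MPI_Finalize();
    return 0;
}
EOF
mpicc hello.c -o hello
mpirun -np 2 ./hello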
QuickSilver SM
QLogic Subnet Manager – SM

What is it?
• The backbone of the InfiniBand spec
• The SM is responsible for initializing the fabric and coordinating subnet-management responsibilities with other, redundant subnet managers. Fabric initialization includes:
  1. Setting up the routing tables for unicast and multicast operation
  2. Setting up user-specified partitions for security/provisioning
  3. Setting up SL-to-VL mapping tables to configure QoS (quality of service) policies
  4. Setting up VL arbitration tables to configure packet priorities

Without an SM, the InfiniBand fabric will not be active; it will stay in an “init” state.
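A quick way to observe this from a host – a sketch using the QuickSilver /proc interface shown later in the diagnostics slides (the exact path varies by HCA model):

# Active = an SM has configured the port; Init = no SM has swept the fabric yet
for f in /proc/iba/*/*/port*/info; do
    grep -H "PortState:" "$f"
done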
QLogic Fabric Manager
Key Capabilities
• Very High Performance
• Rapid Response to Fabric Changes
• Highly Scalable SM and SA
• Fully Redundant Operation
• Supports Fabric Verification and Diagnosis Tools
• Sophisticated Routing Algorithms
• Required for Adaptive Routing
• Will Support Virtual Fabrics
[Diagram: the QLogic Fabric Manager comprises the SM, PM, BM, SA, and FE components]
Fabric Manager: Scalability to Meet the Challenge

2048-node fabric initialization in <30 seconds
• 1024 nodes: <15 seconds
• 512 nodes: <6 seconds
• 128 nodes: <2 seconds
• Significant performance improvements every year

Rapid identification of fabric changes
• Essential to keeping the fabric up in the face of failures
• Trap-based mechanism allows <<1 second identification
• No need for frequent, bandwidth-wasting sweeps

Highly scalable SA implementation
• Rapid response time speeds up application startup
• Handles worst-case bursts of queries without dropping any
How to enable the Subnet Manager
• Runs on the switch itself
• An SM key needs to be generated for the switch’s specific GUID and the serial number from the CD case
• Contact support
Diagnostic Files
Diagnostic Files - Review Port Information
[root@compute-0-1 port1]# p1info (FastFabric Tools shortcut)
[root@compute-0-1 port1]# cat /proc/iba/mt23108/1/port1/info
Port 1 Info
PortState: Active PhysState: LinkUp DownDefault: Polling
LID: 0x002D LMC: 0
Subnet: 0xfe80000000000000 GUID: 0x00066a00a00006ff
SMLID: 0x0000 SMSL: 0 RespTimeout: 32 ms SubnetTimeout: 4096 ns
M_KEY: 0x0000000000000000 Lease: 0 s Protect: Readonly
MTU: Active: 512 Supported: 2048
LinkWidth: Active: 4x Supported: 1-4x Enabled: 1-4x
LinkSpeed: Active: 2.5Gbps Supported: 2.5Gbps Enabled: 2.5Gbps
VLs: Active: 4+1 Supported: 4+1 HOQLife: 4096 ns
Capability 0x02010048: CR CM SL Trap
Violations: M_Key: 0 P_Key: 0 Q_Key: 0
ErrorLimits: Overrun: 0 LocalPhys: 15 DiagCode: 0x0000
P_Key Enforcement: In: Off Out: Off FilterRaw: In: Off Out: Off
Note: IPoIB will not be able to register if the link is 1X.
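Because a 1X link breaks IPoIB registration, a hedged one-liner to scan every port for it (same /proc layout as the example above):

# flag any port whose link trained down to 1X
for f in /proc/iba/*/*/port*/info; do
    grep -q "LinkWidth: Active: 1x" "$f" && echo "WARN: ${f%/info} trained at 1X"
done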
Diagnostic Files - Reviewing Port Statistics
[root@compute-0-1 port1]# p1stats
[root@compute-0-1 port1]# cat stats
Port 1 Counters
Performance: Transmit
Xmit Data 78625 MB (0 Quads)
Xmit Pkts 92830192
Performance: Receive
Rcv Data 8792 MB (0 Quads)
Rcv Pkts 8292051
Errors: Async Events:
Symbol Errors 12 State Change 0
Link Error Recovery 0 Traps:
Link Downed 0 Link Integrity 0
Port Rcv Errors 0 Exc. Buffer Overrun 0
Port Rcv Rmt Phys Err 0 Flow Control Watchdog 0
Port Rcv Sw Relay Err 0 Capability Mask Chg 0
Port Xmit Discards 0 Platform Guid Chg 0
Port Xmit Constraint 0 Bad M-Key 0
Port Rcv Constraint 0 Bad P-Key 0
Local Link Integrity 0 Bad Q-Key 0
Exc. Buffer Overrun 0 Other 0
VL15 Dropped 0
To Clear counters: echo stats
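The Symbol Errors counter (12 above) tallies physical-layer errors and is usually the first sign of a marginal cable; a hedged way to keep an eye on it (path from the example above):

# re-read the counter every 5 seconds; a steadily climbing value points at the link
watch -n 5 "grep 'Symbol Errors' /proc/iba/mt23108/1/port1/stats"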
Diagnostic Files – iba_capture capture.tgz
iba_capture whatever.tgz will provide support with all relevant host information needed to diagnose a problem.

[root] less whatever.tgz
etc/sysconfig
tmp/capture19723
proc/sys
var/log
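A hedged sketch for reviewing a capture before mailing it to support (whatever.tgz is the example name from this slide; plain tar):

# list the archive contents, then unpack into a scratch directory for inspection
tar ztf whatever.tgz | head
mkdir -p /tmp/capture-review && tar zxf whatever.tgz -C /tmp/capture-review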
iba_capture – etc/
The etc/ directory of the capture file contains all of the configuration information that the QuickSilver drivers use. This gives our support team a look at the system’s configuration and lets them verify parts of the installation.
[root@tsg68 whatever]$ ls -aR etc/
etc/:
.  ..  hosts  modprobe.conf  modprobe.conf~  modprobe.conf.dist  sysconfig

etc/sysconfig:
.  ..  firstboot  iba  ipoib.cfg  ipoib.cfg-sample  network-scripts

etc/sysconfig/iba:
.  ..  busdrv.conf  iba_mon.conf  iba_mon.conf-sample  iba_stat.conf  iba_stat.conf-sample  uvp.conf  version

etc/sysconfig/network-scripts:
.  ..  ifcfg-eth0  ifcfg-eth1  ifcfg-ib1  ifcfg-lo
Discussions
Topics to discuss
• Which stack: QuickSilver or OFED+?
• Which MPI/tools/etc.?
• Multiple HCAs?
• Performance discussion – kernel benchmarks or application benchmarks?
How to utilize multiple HCAs?
Bonding? (The current state of the art is IPoIB only.)

Load balancing:
• By default, VIADEV_PATH_METHOD=4 is used. This causes all HCAs in the system to be used equally.
• If a system has more than one HCA, this is equivalent to VIADEV_PATH_METHOD=3.
• If a system has only one HCA, this is equivalent to VIADEV_PATH_METHOD=0.
• The MPI processes on a node will be evenly distributed among all HCAs with active ports.
• All HCAs with active ports must be connected to the same fabric as all other nodes in the job. Jobs may fail to start if the MPI job includes some nodes with one HCA and some nodes with multiple HCAs. (See the sketch below.)
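A hedged launch sketch – VIADEV_* variables are read by MVAPICH-derived MPIs such as QS-MVAPICH, and mpirun_rsh accepts ENV=VALUE pairs before the program; the host file and binary are placeholders:

# pin the path method so mixed one-HCA/multi-HCA nodes behave predictably
mpirun_rsh -np 64 -hostfile ./hosts VIADEV_PATH_METHOD=0 ./my_app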
MPI Performance with QLogic HCAs
Benchmark                                         Comparison                     QLE7280 DDR   ConnectX DDR   QLE7140 SDR   QuickSilver SDR~DDR
OSU Latency (0 byte)                              Point-to-point latency         1.4 µs        1.5 µs         1.56 µs       3.5~2.6 µs
OSU multi_bw @ 4ppn (non-coalesced)               Point-to-point message rate    11 M/s        6 M/s          8 M/s         0.45~1.4 M/s
OSU Bandwidth                                     Point-to-point bandwidth       1.9 GB/s      1.5 GB/s       0.9 GB/s      0.9~1.4 GB/s
HPCC MPI Random Access @ 32 cores (non-coalesced) Scalable message rate          .09 GUPs      .02 GUPs       .08 GUPs      –
HPCC Random Ring Latency @ 32 cores               Scalable latency               1.4 µs        2.3 µs         2.0 µs        –

Scaling results are for 8 nodes / 32 cores.
Latency results include a single switch crossing.
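To reproduce the point-to-point rows, a sketch using the OSU micro-benchmarks named in the table (launcher flags, host file, and binary paths depend on your MPI installation):

# 0-byte latency and streaming bandwidth between two nodes, one rank per node
mpirun -np 2 -hostfile ./two_nodes ./osu_latency
mpirun -np 2 -hostfile ./two_nodes ./osu_bw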
Thank You!