InfiniBand management and monitoring with Wingman and Hawk-eye
Cyrille Verrier / Haakon Bryhni
Fabriscale in a nutshell
● Founded in 2014 by Dr. Sven-Arne Reinemo and Professor Tor Skeie, both with a strong research background from the University of Oslo (UiO) and Simula Research Laboratory (Simula).
● Funded and backed by Simula Innovation and private Norwegian and British investors.
● Fabriscale's products are based on patented and patent-pending algorithms for network-agnostic routing and fault tolerance.
● By using the Fabriscale solution, users can increase their KPIs by more than 20%, and critical downtime can be minimized.
● Headquartered in Oslo, Norway. Local support in the US and Asia.
Enable users to invest efficiently
“Organizations tend to focus on investing in nodes, but put less focus on managing the system to optimize utilization.”
“With Fabriscale's products you can monitor the traffic very efficiently, minimize downtime and increase network utilisation, and simply increase your system performance.”
Fabriscale Products
Wingman: Watching your back
● Improved speed of communication through fast routing
● Robust performance through pre-calculated fault-actions
● Utilising virtual lanes
● Guarantee private space / partition
Hawk-eye: Providing a hunter's vision
● Highlight the terrain and its essentials
● An analytics platform
● Use of historical data to improve efficiency
Wingman: Network
Topologies: Fat Tree, Mesh, Torus, Hypercube, Dragonfly
● InfiniBand (Mellanox)
○ Routing algorithms: OpenSM Up/Down, OpenSM LASH, OpenSM DFSSSP, Wingman
● OmniPath (Intel)
○ Routing algorithms: Shortest Path, Device Group Routing, Up/Down, Dimension Ordered Routing
● Converged Ethernet (Mellanox, Broadcom, Cisco, Juniper, Marvell, Huawei)
○ Routing algorithms: Shortest Path, ECMP
x4 higher Latency
Performance Benchmark
OpenSM (Mellanox) vs Wingman (Fabriscale)
[Figure: road map with Oslo, Hamar, Fagernes, Geilo, Bergen and Trondheim. Caption: Distribution of traffic from Oslo to Trondheim across different routes.]
Decision based on local information
The result is a bad distribution of traffic further down the road, and the problem gets worse as the number of intersections grows.
The Patented Algorithm
[Figure: road map with Oslo, Hamar, Fagernes, Geilo, Bergen and Trondheim. Caption: Optimised distribution of traffic from Oslo to Trondheim across different routes.]
Decision based on global information
The result is a better distribution of traffic further down the road, independent of the number of intersections.
Global information when creating the routing enables:
☑ better balancing of network traffic, which provides faster communication.
☑ precalculation of redundant paths, which provides fast failover.
☑ robust routing, which handles irregular connectivity.
The above benefits increase with the size and complexity of the network.
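The difference between local and global decisions can be illustrated with a toy model. The sketch below is not Fabriscale's patented algorithm. It is a simple greedy heuristic on a made-up network in which the routes via Fagernes and via Geilo merge at a hypothetical junction "X" before reaching Trondheim. All link names and both routing functions are assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical toy network: three routes from Oslo to Trondheim.
# The routes via Fagernes and via Geilo merge at a made-up junction "X",
# so they share the final link X-Trondheim.
PATHS = {
    "via Hamar": ["Oslo-Hamar", "Hamar-Trondheim"],
    "via Fagernes": ["Oslo-Fagernes", "Fagernes-X", "X-Trondheim"],
    "via Geilo": ["Oslo-Geilo", "Geilo-X", "X-Trondheim"],
}


def link_loads(assignment):
    """Count how many flows cross each link for a list of chosen route names."""
    loads = Counter()
    for route in assignment:
        loads.update(PATHS[route])
    return loads


def max_link_load(assignment):
    """Load on the busiest link; lower means better balance."""
    return max(link_loads(assignment).values())


def route_local(n_flows):
    """Local decision: Oslo spreads flows evenly over its own exits only."""
    routes = list(PATHS)
    return [routes[i % len(routes)] for i in range(n_flows)]


def route_global(n_flows):
    """Global decision: each flow takes the route whose busiest link,
    anywhere along the route, would end up least loaded."""
    loads = Counter()
    assignment = []
    for _ in range(n_flows):
        best = min(PATHS, key=lambda r: max(loads[l] + 1 for l in PATHS[r]))
        assignment.append(best)
        loads.update(PATHS[best])
    return assignment


if __name__ == "__main__":
    print("local :", max_link_load(route_local(6)))
    print("global:", max_link_load(route_global(6)))
```

With six flows, the local strategy balances the three exits out of Oslo perfectly but puts four flows on the shared X-Trondheim link. The global strategy sees that link in advance and keeps every link at three flows or fewer.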
Wingman - FFT benchmark
System specs:
➢ 650 compute nodes
➢ Each node has 16 cores
➢ Interconnect at 56 Gbit/s
The performance comparison of Wingman and OpenSM was conducted by running a 3-dimensional Fast Fourier Transform (FFT) on the Abel HPC cluster at the University of Oslo.
[Figure: FFT benchmark chart; data labels 41%, 13%, 18%]
Wingman - Multiple bandwidth test (osu_mbw_mr)
System specs:
➢ 122 compute nodes
➢ Each node has 32 cores
➢ Interconnect at 100 Gbit/s
The performance comparison of Wingman and OpenSM was conducted by running the OSU multiple bandwidth test (osu_mbw_mr).
[Figure: % improvement of Wingman compared to OpenSM]
Wingman - Simulation of 648 node balanced fat-tree
Simulation set-up:
➢ 2-level fat-tree (1:1)
➢ 648 compute nodes
➢ Case 1: No faults
➢ Case 2: 3% of links taken down, irregular fat-tree
➢ Link speed set to 10 Gbit/s
➢ All-to-all traffic
Performance comparison of Wingman and OpenSM conducted by OMNeT++ simulations
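As a sanity check on the set-up, the 648-node count is consistent with a 1:1 two-level fat-tree built from 36-port switches (36 leaf switches with 18 nodes each, plus 18 spine switches). The slides do not state the switch radix, so the value 36 below is an assumption, as is applying the 3% failure rate to the leaf-to-spine links only. A minimal sketch of the arithmetic:

```python
def two_level_fat_tree(radix):
    """Size of a non-blocking (1:1) two-level fat-tree built from
    identical switches with `radix` ports each."""
    if radix % 2:
        raise ValueError("radix must be even for a 1:1 split")
    down = up = radix // 2   # leaf ports toward nodes / toward spines
    leaves = radix           # every spine port connects to a distinct leaf
    spines = up              # every leaf has one uplink to each spine
    return {
        "leaf_switches": leaves,
        "spine_switches": spines,
        "compute_nodes": leaves * down,
        "leaf_spine_links": leaves * up,
    }


def links_to_fail(total_links, fraction):
    """Number of links to take down for a given failure fraction."""
    return round(total_links * fraction)


if __name__ == "__main__":
    tree = two_level_fat_tree(36)  # assumed radix, not stated in the slides
    print(tree)
    print(links_to_fail(tree["leaf_spine_links"], 0.03))
```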
Wingman - Simulation results, 96 node real cluster
Simulation set-up:
➢ 2-level fat-tree (4:3)
➢ 96 compute nodes
➢ Case 1: No faults
➢ Case 2: 3% of links taken down
➢ Link speed set to 10 Gbit/s
➢ All-to-all traffic
Performance comparison of Wingman and OpenSM conducted by OMNeT++ simulations
Wingman - Simulation results, 1624 node real cluster
Simulation set-up:
➢ 3-level fat-tree (4:1)
➢ 1624 compute nodes
➢ Case 1: No faults
➢ Case 2: 3% of links taken down
➢ Link speed set to 10 Gbit/s
➢ All-to-all traffic
➢ OpenSM falls back to Minhop due to the irregularity of the fat-tree
Performance comparison of Wingman and OpenSM conducted by OMNeT++ simulations
Wingman - Fast failover performance
The failover time is measured from the moment the subnet manager is notified of a link failure until the fabric has re-established routes for all affected paths.
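Read literally, this definition means the failover time is set by the slowest affected path. The sketch below expresses that, assuming timestamps in seconds are available for the notification and for each rerouted path. The function name and the data layout are hypothetical and do not describe Wingman's actual instrumentation.

```python
def failover_time(notified_at, rerouted_at):
    """Seconds from the subnet manager being notified of a link failure
    until the last affected path has a new route.

    notified_at: timestamp of the failure notification (seconds).
    rerouted_at: dict mapping each affected path to the timestamp at
                 which its new route was in place (seconds).
    """
    if not rerouted_at:
        return 0.0  # no affected paths, nothing to re-establish
    return max(rerouted_at.values()) - notified_at


if __name__ == "__main__":
    # Hypothetical timestamps for two affected paths.
    print(failover_time(10.0, {"path-a": 10.12, "path-b": 10.35}))
```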
Fabriscale's product summary
Wingman: InfiniBand subnet manager with the following key features:
☑ Optimized fabric routing, which improves network utilisation.
☑ Fast and dynamic fault tolerance, which delivers sub-second rerouting in case of network failures.
☑ Dynamic and secure InfiniBand partitioning that integrates with Slurm to deliver on-the-fly, per-job partitioning.
Hawk-eye: InfiniBand network monitoring with the following key features:
☑ Compelling and modern user interface.
☑ Component navigation with alerts and live metrics.
☑ Interactive topology browser with metric, job and alert overlays.
☑ Flexible integration with workload management software including Slurm and Torque.
Hawk-eye
[Screenshots: Topology, Jobs, Dashboard]
Hawk-eye - Software architecture
[Architecture diagram]
Components:
● Fabriscale subnet manager (fsm + fsm event plugin)
● Fabriscale Monitoring System (fsmonitoring)
● fms-slurmplugin
● fms-torqueplugin
● Elasticsearch
● ArangoDB
● Fabriscale dashboard (WebUI served by nginx)
● SLURM
● PBS/Torque
Connection labels:
● Push metrics and topology
● Store network metrics, store job scheduler metrics
● Store network topology, store job scheduler data
● Fetch job, fetch nodes
● HTTP GET: nodes, ports, links, jobs, etc.
● Websocket notifications
Legend:
● Fabriscale component, forked from OSS
● Fabriscale component
● Third-party component
● Open-source component
Hawk-eye - Use case examples
● Network performance analysis
○ Congestion notifications.
○ Long term network utilisation.
○ Visualisation of load (per link, job and topology).
● Network health mapping
○ Gain a quick overview of the health of all monitored devices.
● Device failure detection
○ Link down.
○ Node down in context of both InfiniBand and job scheduler.
○ Predictive link failure detection through custom alerts.
● Detect configuration anomalies
○ Link speed inconsistencies.
○ Abnormal error metric values.
● Cluster bring up assistance
○ Discover and review devices as they are added.
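One of the use cases above, detecting link speed inconsistencies, can be sketched in a few lines. The record layout (`name`, `speed_gbps`) and the rule "flag anything that differs from the most common speed" are assumptions made for illustration. They do not describe Hawk-eye's data model or its actual detection logic.

```python
from collections import Counter


def find_speed_outliers(links):
    """Return (expected_speed, outliers) for a list of link records.

    Each record is a dict with hypothetical keys "name" and "speed_gbps".
    The expected speed is taken to be the most common one in the fabric.
    """
    if not links:
        return None, []
    speeds = Counter(link["speed_gbps"] for link in links)
    expected, _count = speeds.most_common(1)[0]
    outliers = [link for link in links if link["speed_gbps"] != expected]
    return expected, outliers


if __name__ == "__main__":
    # Made-up sample data: one link negotiated a lower speed than the rest.
    sample = [
        {"name": "leaf1-spine1", "speed_gbps": 56},
        {"name": "leaf1-spine2", "speed_gbps": 56},
        {"name": "leaf2-spine1", "speed_gbps": 40},
        {"name": "leaf2-spine2", "speed_gbps": 56},
    ]
    expected, bad = find_speed_outliers(sample)
    print(expected, [link["name"] for link in bad])
```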
Hawk-eye - Demo
Live demo
Contact
http://fabriscale.com