InfiniBand management and monitoring with Wingman and Hawk-eye Cyrille Verrier / Haakon Bryhni


  • InfiniBand management and monitoring with Wingman and Hawk-eye

    Cyrille Verrier / Haakon Bryhni

  • Fabriscale in a nutshell

    ● Founded in 2014 by Dr. Sven-Arne Reinemo and Professor Tor Skeie, both with a strong research background from the University of Oslo (UiO) and Simula Research Laboratory (Simula).

    ● Funded and backed by Simula Innovation and private Norwegian and British investors.

    ● Fabriscale's products are based on patented and patent-pending algorithms for network-agnostic routing and fault tolerance.

    ● By using the Fabriscale solution, users can increase their KPIs by more than 20%, and critical downtime can be minimized.

    ● Headquartered in Oslo, Norway. Local support in US and Asia.

  • Enable users to invest efficiently

    “Organizations tend to focus on investing in nodes, but put less focus on managing the system to optimize utilization.”

    “With Fabriscale's products you can monitor the traffic very efficiently, minimize downtime and increase network utilisation, and simply increase your system performance.”

  • Fabriscale Products

    ● Wingman: watching your back
      ○ Improved speed of communication through fast routing
      ○ Robust performance through pre-calculated fault actions
      ○ Utilising virtual lanes
      ○ Guarantee private space / partition

    ● Hawk-eye: providing a hunter’s vision
      ○ Highlight the terrain and its essentials
      ○ An analytics platform
      ○ Use of historical data to improve efficiency

  • Wingman: Network

    Topologies: Fat Tree, Mesh, Torus, Hypercube, Dragonfly

    ● InfiniBand (Mellanox)
      ○ Routing algorithms: OpenSM Up/Down, OpenSM LASH, OpenSM DFSSSP, Wingman

    ● OmniPath (Intel)
      ○ Routing algorithms: Shortest Path, Device Group Routing, Up/Down, Dimension Ordered Routing

    ● Converged Ethernet (Mellanox, Broadcom, Cisco, Juniper, Marvell, Huawei)
      ○ Routing algorithms: Shortest Path, ECMP

    Performance benchmark: OpenSM (Mellanox) vs Wingman (Fabriscale)

    x4 higher latency

  • The Patented Algorithm

    [Figure: map of routes from Oslo to Trondheim; labelled places: Trondheim, Oslo, Fagernes, Geilo, Bergen, Hamar]

    Distribution of traffic from Oslo to Trondheim across different routes.

    Decision based on local information: the result is a bad distribution of traffic further down the road, and the problem gets worse as the number of intersections grows.

  • The Patented Algorithm

    [Figure: map of routes from Oslo to Trondheim; labelled places: Trondheim, Oslo, Fagernes, Geilo, Bergen, Hamar]

    Optimised distribution of traffic from Oslo to Trondheim across different routes.

    Decision based on global information: the result is a better distribution of traffic further down the road, independent of the number of intersections.

  • The Patented Algorithm

    Global information when creating the routing enables:

    ☑ better balancing of network traffic, which provides faster communication.

    ☑ precalculation of redundant paths, which provides fast failover.

    ☑ robust routing, which handles irregular connectivity.

    The above benefits increase with the size and complexity of the network.
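
    The difference between local and global routing decisions can be illustrated with a toy example. The sketch below is not Fabriscale's patented algorithm; the fabric, link names and cost functions are all hypothetical. Two sources (A and B) send flows to T through switches S1 to S3, and S2 is shared. A "local" chooser only sees the load on its own first-hop link, while a "global" chooser sees the load on every link of a candidate path.

```python
from collections import Counter

# Hypothetical toy fabric: two sources (A, B) send to T through
# intermediate switches S1..S3. S2 is reachable from both sources.
CANDIDATES = {
    "A": [["A-S1", "S1-T"], ["A-S2", "S2-T"]],
    "B": [["B-S2", "S2-T"], ["B-S3", "S3-T"]],
}
FLOWS = ["A", "A", "A", "B", "B", "B"]  # source of each flow


def route(flows, cost):
    """Assign each flow to the candidate path with the lowest cost."""
    load = Counter()
    for src in flows:
        # min() keeps the first candidate when costs are tied
        path = min(CANDIDATES[src], key=lambda p: cost(p, load))
        for link in path:
            load[link] += 1
    return load


def local_cost(path, load):
    # Local view: only the load on the first-hop link is known.
    return load[path[0]]


def global_cost(path, load):
    # Global view: the busiest link anywhere along the path.
    return max(load[link] for link in path)


local = route(FLOWS, local_cost)
global_ = route(FLOWS, global_cost)
print("max link load, local decisions: ", max(local.values()))
print("max link load, global decisions:", max(global_.values()))
```

    In this toy case the local chooser piles three flows onto the shared S2-T link, because neither source can see what the other has already placed there, while the global chooser spreads the six flows evenly over the three links into T.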

  • Wingman - FFT benchmark

    System specs:
    ➢ 650 compute nodes
    ➢ Each node has 16 cores
    ➢ Interconnect at 56 Gbit/s

    The performance comparison of Wingman and OpenSM was conducted by running a 3-dimensional Fast Fourier Transform (FFT) on the Abel HPC cluster at the University of Oslo.

    [Figure: benchmark chart; surviving data labels: 41%, 13%, 18%]

  • Wingman - Multiple bandwidth test (osu_mbw_mr)

    System specs:
    ➢ 122 compute nodes
    ➢ Each node has 32 cores
    ➢ Interconnect at 100 Gbit/s

    The performance comparison of Wingman and OpenSM was conducted by running the OSU multiple bandwidth test (osu_mbw_mr).

    [Figure: % improvement of Wingman compared to OpenSM]

  • Wingman - Simulation of 648 node balanced fat-tree

    Simulation set-up:
    ➢ 2-level fat-tree (1:1)
    ➢ 648 compute nodes
    ➢ Case 1: No faults
    ➢ Case 2: 3% of links taken down, irregular fat-tree
    ➢ Link speed set to 10 Gbit/s
    ➢ All-to-all traffic

    The performance comparison of Wingman and OpenSM was conducted with OMNeT++ simulations.
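
    As a rough sanity check on the simulated topology, the sketch below sizes a non-blocking (1:1) two-level fat-tree. The slide does not state the switch radix; 36 ports is an assumption, chosen because it yields exactly 648 hosts. Whether the 3% fault rate applies to leaf-to-spine links only is also an assumption.

```python
def two_level_fat_tree(radix):
    """Sizes of a non-blocking (1:1) two-level fat-tree built from
    switches with `radix` ports each."""
    half = radix // 2
    leaves = radix              # each spine port connects to one leaf
    spines = half               # each leaf has `half` uplinks, one per spine
    hosts = leaves * half       # the other `half` leaf ports go to hosts
    uplinks = leaves * spines   # leaf-to-spine links
    return {"leaves": leaves, "spines": spines,
            "hosts": hosts, "uplinks": uplinks}


# Assumption: 36-port switches (not stated in the slides).
sizes = two_level_fat_tree(36)
print(sizes)

# Assumption: the 3% fault rate is applied to leaf-to-spine links.
failed_links = round(0.03 * sizes["uplinks"])
print("links taken down at 3%:", failed_links)
```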

  • Wingman - Simulation results, 96 node real cluster

    Simulation set-up:
    ➢ 2-level fat-tree (4:3)
    ➢ 96 compute nodes
    ➢ Case 1: No faults
    ➢ Case 2: 3% of links taken down
    ➢ Link speed set to 10 Gbit/s
    ➢ All-to-all traffic

    The performance comparison of Wingman and OpenSM was conducted with OMNeT++ simulations.

  • Wingman - Simulation results, 1624 node real cluster

    Simulation set-up:
    ➢ 3-level fat-tree (4:1)
    ➢ 1624 compute nodes
    ➢ Case 1: No faults
    ➢ Case 2: 3% of links taken down
    ➢ Link speed set to 10 Gbit/s
    ➢ All-to-all traffic
    ➢ OpenSM falls back to Minhop due to the irregularity of the fat-tree

    The performance comparison of Wingman and OpenSM was conducted with OMNeT++ simulations.
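
    The ratios in parentheses on the simulation slides (1:1, 4:3, 4:1) are read here as oversubscription ratios, that is, host-facing ports to uplink ports on a leaf switch. That reading, and the port counts below, are assumptions for illustration; the slides do not give per-switch port counts.

```python
from fractions import Fraction


def oversubscription(host_ports, uplink_ports):
    """Reduce host-facing ports : uplink ports to lowest terms."""
    ratio = Fraction(host_ports, uplink_ports)
    return f"{ratio.numerator}:{ratio.denominator}"


# Hypothetical leaf-switch port splits that give the ratios on the slides.
print(oversubscription(18, 18))  # balanced fat-tree
print(oversubscription(24, 18))  # mildly oversubscribed
print(oversubscription(32, 8))   # heavily oversubscribed
```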

  • Wingman - Fast failover performance

    The failover time is measured from the moment the subnet manager is notified of a link failure until the fabric has re-established routes for all affected paths.
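
    A minimal sketch of that measurement, assuming the two moments are available as timestamped events. The event names and timestamps below are made up for illustration and are not taken from Wingman logs.

```python
from datetime import datetime

# Hypothetical event log; names and timestamps are illustrative only.
EVENTS = [
    ("2019-01-01T12:00:00.000", "link_failure_notified"),
    ("2019-01-01T12:00:00.120", "rerouting_started"),
    ("2019-01-01T12:00:00.450", "routes_reestablished"),
]


def failover_seconds(events):
    """Time from failure notification to all routes being re-established."""
    times = {name: datetime.fromisoformat(ts) for ts, name in events}
    delta = times["routes_reestablished"] - times["link_failure_notified"]
    return delta.total_seconds()


print(f"failover time: {failover_seconds(EVENTS):.3f} s")
```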

  • Fabriscale's product summary

    Wingman: InfiniBand subnet manager with the following key features:

    ☑ Optimized fabric routing, which improves network utilisation.

    ☑ Fast and dynamic fault tolerance, which delivers sub-second rerouting in case of network failures.

    ☑ Dynamic and secure InfiniBand partitioning that integrates with Slurm to deliver on-the-fly, per-job partitioning.

    Hawk-eye: InfiniBand network monitoring with the following key features:

    ☑ Compelling and modern user interface.

    ☑ Component navigation with alerts and live metrics.

    ☑ Interactive topology browser with metric, job and alert overlays.

    ☑ Flexible integration with workload management software including Slurm and Torque.

  • Hawk-eye

    [Screenshots: Topology, Jobs, Dashboard]

  • Hawk-eye - Software architecture

    [Figure: architecture diagram; the recoverable components and labels are listed below]

    Components:
    ● Fabriscale subnet manager (fsm + fsm event plugin)
    ● Fabriscale Monitoring System (fsmonitoring)
    ● fms-slurmplugin
    ● fms-torqueplugin
    ● Elasticsearch
    ● ArangoDB
    ● Fabriscale dashboard (WebUI served by nginx)
    ● SLURM
    ● PBS/Torque

    Data flows:
    ● Push metrics and topology
    ● Store network metrics, store job scheduler metrics
    ● Store network topology, store job scheduler data
    ● Fetch job, fetch nodes
    ● HTTP GET: nodes, ports, links, jobs, etc.
    ● Websocket notifications

    Legend: Fabriscale component, forked from OSS; Fabriscale component; third-party component; open-source component

  • Hawk-eye - Use case examples

    ● Network performance analysis
      ○ Congestion notifications.
      ○ Long term network utilisation.
      ○ Visualisation of load (per link, job and topology).

    ● Network health mapping
      ○ Gain a quick overview of the health of all monitored devices.

    ● Device failure detection
      ○ Link down.
      ○ Node down in context of both InfiniBand and job scheduler.
      ○ Predictive link failure detection through custom alerts.

    ● Detect configuration anomalies
      ○ Link speed inconsistencies.
      ○ Abnormal error metric values.

    ● Cluster bring up assistance
      ○ Discover and review devices as they are added.
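
    Two of the use cases above, link speed inconsistencies and predictive failure alerts on error metrics, can be sketched as simple checks over per-link records. This is not Hawk-eye's implementation or API; the record layout, field names, sample values and threshold are all hypothetical.

```python
from collections import Counter

# Hypothetical link records; field names and values are illustrative only.
LINKS = [
    {"link": "sw1/1-node01", "speed_gbps": 100, "symbol_errors": [0, 0, 1, 1]},
    {"link": "sw1/2-node02", "speed_gbps": 100, "symbol_errors": [0, 2, 9, 30]},
    {"link": "sw1/3-node03", "speed_gbps": 56, "symbol_errors": [0, 0, 0, 0]},
    {"link": "sw1/4-node04", "speed_gbps": 100, "symbol_errors": [5, 5, 5, 5]},
]


def speed_outliers(links):
    """Links whose speed differs from the most common speed in the fabric."""
    common, _ = Counter(l["speed_gbps"] for l in links).most_common(1)[0]
    return [l["link"] for l in links if l["speed_gbps"] != common]


def rising_errors(links, max_growth=10):
    """Links whose error counter grew by more than `max_growth`
    between the first and last sample of the window."""
    return [
        l["link"]
        for l in links
        if l["symbol_errors"][-1] - l["symbol_errors"][0] > max_growth
    ]


print("speed inconsistencies:", speed_outliers(LINKS))
print("rising error counters:", rising_errors(LINKS))
```

    The second check alerts on counter growth rather than on the absolute value, so a link with a stable non-zero count (sw1/4-node04 here) stays quiet while a link whose errors are accelerating is flagged.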

  • Hawk-eye - Demo

    Live demo

  • Contact

    [email protected]

    http://fabriscale.com