Providing Australian researchers with world-class, high-end computing services
Perspectives on implementation of a high performance scientific cloud backed
by a 56G high speed interconnect
Presenter: Jakub Chrzeszczyk, Cloud Architect,
National Computational Infrastructure, The Australian National University
Authors: Jakub Chrzeszczyk, Dr. Muhammad Atif, Dr. Joseph Antony,
Dongyang Li, Matthew Sanderson, Allan Williams, National Computational Infrastructure,
The Australian National University
About the National Computational Infrastructure
History of supercomputing (at ANU)
• History:
– Origins in the ANU Supercomputing Facility from the late 1980s
– National computational services from 2000 under APAC until 2006/07
– NCI established in 2007 under the NCRIS Agreement
– Evolved as a comprehensive provider of computational/data-intensive services
– Infrastructure capabilities extended: EIF Climate HPC Project Agreement (2010) – $50M
• New petascale HPC facility and related infrastructure + Data Centre
• Comprehensive and vertically-integrated provider of infrastructure services:
– Petascale supercomputer (Fujitsu Primergy, 1.2 Pflops)
– Data-intensive services via VMs and cloud
– Large-scale data storage via NCI partnership & RDSI
– Nationally-acknowledged expertise
• Sustainable partnership (through until Dec. 2015, to be extended)
– Involving major national institutions and research-intensive universities, $11–12M p.a.
NCI – an overview
Mission:
• to foster ambitious and aspirational research objectives and to enable their realisation, in the Australian context, through world-class, high-end computing services.
NCI is:
• being driven by research objectives,
• an integral component of the Commonwealth's research infrastructure program,
• a comprehensive, vertically-integrated research service,
• engaging with, and is embedded in, research communities, high-impact centres, and institutions,
• fostering aspiration in computational and data-intensive research, and increasing ambition in the use of HPC to enhance research impact,
• delivering innovative "digital laboratories",
• providing national access on priority and merit, and
• being built on, and sustained by, a collaboration of national organisations and research-intensive universities.
[Diagram: NCI service stack, driven by Research Objectives – Research Outcomes; Communities and Institutions / Access and Services; Expertise, Support and Development; Data-Intensive Services and Digital Laboratories; Compute (HPC/Cloud) and Data Infrastructure]
Current Infrastructure
Current Peak Computing Infrastructure (NCRIS)
• Raijin – Fujitsu Primergy cluster – June 2013
• Approx. 57,500 Intel Sandy Bridge cores (2.6 GHz)
• 157 TBytes memory, 10 PBytes short-term storage
• FDR InfiniBand
• CentOS 6.5 Linux; PBS Pro scheduler
• Australia's first petaflop supercomputer
– 1195 Tflops, 1,400,000 SPECFPrate
– Fastest filesystem in the Southern Hemisphere
• Significant growth in highly scaling applications
– Largest: 32,000 cores; many 1,024-core tasks
Data Storage
• HSM (massdata): 10 PB as at September 2013
• Global Lustre filesystem
– 10 PB as of Sept 2013 and growing
– 25 GBytes/sec
• Object storage: being considered
Current Infrastructure
• Cloud computing (since 2009)
– DCC cluster (now retired in favour of OpenStack)
– Virtualized under VMware
– 384-core private cloud: Red Hat OpenStack
– NeCTAR Research Cloud node at NCI
• Australia's highest-performance cloud
• Architected for strong computational and I/O performance needed for "big data" research
• Intel Sandy Bridge (3,200 cores)
• 160 TB of SSDs; FDR IB for compute and I/O performance
NCI Cloud Overview
...from 10 Gbit to 56 Gbit Ethernet, and towards 56 Gbit InfiniBand.
10 Gigabit Ethernet: Introduction
The NCI Cloud prototype was first deployed using 8 Dell M710HD blades and 2x M8024K 10GE switches.
To keep up with increasing demand for cloud resources, the cluster was gradually expanded to 32 Dell M710HD blades and 4x M8024K 10GE switches.
10 Gigabit Ethernet : Design (1)
10 Gigabit Ethernet installation doesn't require any specific configuration.
Image source: access.redhat.com
10 Gigabit Ethernet : Design (2)
Network design for OpenStack running on 10 GE. No special hardware or configuration is required.
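As an illustration of what "no special hardware or configuration" means in practice, a minimal nova-network FlatDHCP fragment of the kind such a cluster could run on stock, in-kernel 10GE drivers is sketched below. The interface names and address range are assumptions for illustration, not NCI's actual values.

/etc/nova/nova.conf (illustrative fragment):
network_manager = nova.network.manager.FlatDHCPManager
flat_interface = eth0        # 10GE interface carrying fixed (tenant) traffic
public_interface = eth1      # interface used for floating IPs
fixed_range = 10.1.0.0/16    # fixed IP range handed out by dnsmasq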
10 Gigabit Ethernet: Pros
Pros:
• In-kernel drivers support 10GE "out of the box",
• There is no need to install additional modules for OpenStack components,
• A simple cluster like this can be deployed in days, using "off-the-shelf" hardware and software.
10 Gigabit Ethernet: Cons
Cons:
• Much lower bandwidth and much higher latency compared to HPC-oriented systems,
• High blocking factor, particularly while scaling beyond a single blade chassis,
• No RDMA support,
• No L2 separation between tenants (strictly speaking a nova-network limitation).
56 Gigabit Ethernet: Introduction (1)
Overcoming the limitations of 10G Ethernet:
• 56 Gbit Ethernet, and
• SR-IOV-enabled HCAs to build the next-generation cloud platform,
• Neutron.
56 Gigabit Ethernet: Introduction (2)
Cloud hardware:
• 3,200 Intel Sandy Bridge (E5-2670) cores
• Part of the cluster is HT-enabled
• Mellanox ConnectX-3 FDR HCAs
• SX1036 switches, IB- and Ethernet-capable.
56 Gigabit Ethernet: Network design
56 Gigabit Ethernet: Cloud Design (1)
Customizing OpenStack Networking and Nova Compute.
Image source: access.redhat.com
56 Gigabit Ethernet: Cloud Design (2)
Compute nodes are built with Mellanox OFED drivers and specific kernel module parameters:
[root@tc015 ~]# cat /etc/modprobe.d/mlx4_core.conf
options mlx4_core num_vfs=16
[root@tc015 ~]# ip link sh eth2
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9216 qdisc mq state UP qlen 1000
    link/ether 00:02:c9:1c:ca:d1 brd ff:ff:ff:ff:ff:ff
    vf 0 MAC 00:00:00:00:00:00, vlan 4095
    vf 1 MAC 00:00:00:00:00:00, vlan 4095
    vf 2 MAC 00:00:00:00:00:00, vlan 4095
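As a sanity check (not shown in the original slides), once mlx4_core is reloaded with num_vfs set, the virtual functions should also appear as additional PCI devices. The PCI addresses and exact device strings below are illustrative only:

[root@tc015 ~]# lspci | grep -i mellanox
04:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
04:00.1 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]
04:00.2 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]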
56 Gigabit Ethernet: Cloud Design (3)
Additional OpenStack components:
• eswitchd, mlnxvif and the mlnx-neutron-agent need to be installed on the compute nodes
Additional OpenStack configuration:
• The Neutron server needs to be configured to pass on SR-IOV guest configuration to the compute nodes
• Compute nodes have to be configured with SR-IOV guest support (an illustrative placement of these settings follows below):
vif_driver=mlnxvif.vif.MlxEthVIFDriver
vnic_type = hostdev
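For context, a sketch of where these two settings live, based on the Mellanox Icehouse guide linked under Further reading; the file placement shown here is an assumption rather than NCI's verified configuration:

/etc/nova/nova.conf (compute node):
vif_driver = mlnxvif.vif.MlxEthVIFDriver

/etc/neutron/plugins/mlnx/mlnx_conf.ini (compute node agent):
vnic_type = hostdev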
56 Gigabit Ethernet: Summary
Pros:
• Excellent performance (~40–50 Gbit/s in VMs)
• Support for RDMA via RoCE
• Support for private tenant networks (a Neutron feature)
Cons:
• SR-IOV is great for performance, but doesn't support a number of features (live migration and snapshots, high ratios of VCPU overcommit)
• Because Ethernet is used, a non-blocking topology can't be guaranteed (dependent on communication patterns)
Native Infiniband: Introduction
The hardware used to deploy the 56 Gigabit NCI Cloud is IB-capable. NCI engineers are working on building a native IB cloud; the first prototype was deployed in September 2014.
Native Infiniband: Motivation
Why is NCI interested in a native IB cloud?
• Ability to use RDMA via native IB – less overhead,
• Enabling a true fat-tree topology for all traffic patterns,
• Ability to use the (excellent) IB profiling, debugging and troubleshooting tools (see the illustrative commands below),
• IB is the "native mode of operation" for many HPC users.
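A few illustrative commands from the standard OFED/infiniband-diags toolset referred to in the bullet above (generic tools, not NCI-specific):

ibstat          # report local HCA state, firmware and port rates
ibnetdiscover   # discover and print the IB fabric topology
perfquery       # read per-port performance and error counters
ibdiagnet       # run a full fabric health check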
Native Infiniband: Network Design
The network topology is almost identical to the Ethernet fat tree. The main difference is the lack of the ring connection in the core.
Native Infiniband: Cloud Design (1)
A native IB installation of OpenStack requires the same degree of modification to core components as 56 Gigabit Ethernet, plus:
• Replacing the default DHCP agent with a custom agent with IPoIB support,
• OpenSM needs to be pre-configured with PKeys and the corresponding Ethernet VLANs,
• Neutron, the hypervisor OS and nova-compute need to be configured to support IB vifs instead of Ethernet.
Native Infiniband: Cloud Design (2)
Configuration examples
/etc/neutron/plugins/mlnx/mlnx_conf.ini:
tenant_network_type = ib
physical_interface_mapping = default:autoib
[root@tk-0503 ~]# cat /etc/opensm/partitions.conf
management=0x7fff, ipoib, sl=0, defmember=full : ALL, ALL_SWITCHES=full, SELF=full;
vlan1=0x1, ipoib, sl=0, defmember=full : ALL;
vlan2=0x2, ipoib, sl=0, defmember=full : ALL;
vlan3=0x3, ipoib, sl=0, defmember=full : ALL;
vlan4=0x4, ipoib, sl=0, defmember=full : ALL;
vlan5=0x5, ipoib, sl=0, defmember=full : ALL;
vlan6=0x6, ipoib, sl=0, defmember=full : ALL;
vlan7=0x7, ipoib, sl=0, defmember=full : ALL;
/etc/neutron/dhcp_agent.ini:
dhcp_driver = mlnx_dhcp.MlnxDnsmasq
Native Infiniband: Challenges
• The evolution of OpenStack can conflict with the out-of-tree customizations required by 56GE and/or IB.
• Example: as of September 2014, IB OpenStack Networker functionality was broken in Icehouse/Open vSwitch.
• NCI engineers built a limited-functionality physical "Networker" system to work around this problem.
• Vendor update (2 October): the fix will be available in Q4.
• An IB cloud means extensive use of IPoIB, which can mean higher overhead than 56 Gbit Ethernet.
• IPoIB is generally sensitive to CPU scheduling.
Performance benchmarks
First, we compared the 56 Gbit Ethernet cloud with the traditional 10 Gbit Ethernet cloud.
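The slides do not record the exact benchmark commands used; purely as an illustration, point-to-point bandwidth and latency between two VMs are commonly measured with tools such as iperf (TCP) and the OSU MPI micro-benchmarks:

iperf -s                                  # on the first VM: start the TCP server
iperf -c <server-ip> -P 4                 # on the second VM: run 4 parallel streams
mpirun -np 2 -host vm1,vm2 osu_latency    # MPI point-to-point latency
mpirun -np 2 -host vm1,vm2 osu_bw         # MPI point-to-point bandwidth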
[Benchmark result charts: 56 Gbit Ethernet cloud vs. 10 Gbit Ethernet cloud]
Performance benchmarks
Next, we compared with Raijin.
[Benchmark result charts: NCI Cloud vs. Raijin]
Benchmark conclusions (1)
• NCI Cloud is clearly the fastest cloud in Australia, and possibly the fastest in the world,
• There is significant room for optimizing performance further,
• Narrowing, and eventually closing, the gap between HPC and cloud is our ambition.
Benchmark conclusions (2)
• Cloud ≠ virtualization: Ironic (bare metal)? Docker (containers)? Native IB.
• Next goal: a 0%-overhead cloud.
Questions?
• Thank you for your attention.
Further reading
https://wiki.openstack.org/wiki/Mellanox-Neutron-Icehouse-Redhat
https://openstack.redhat.com/Neutron_with_OVS_and_VLANs