Providing Australian researchers with world-class, high-end computing services
Perspectives on implementation of a high performance scientific cloud backed
by a 56G high speed interconnect
Presenter: Jakub Chrzeszczyk, Cloud Architect,
National Computational Infrastructure, The Australian National University
Authors: Jakub Chrzeszczyk, Dr. Muhammad Atif, Dr. Joseph Antony,
Dongyang Li, Matthew Sanderson, Allan Williams, National Computational Infrastructure,
The Australian National University
About the National Computational Infrastructure
History of supercomputing (at ANU)
• History:
– Origins in the ANU Supercomputing Facility from the late 1980s
– National computational services from 2000 under APAC until 2006/07
– NCI established in 2007 under the NCRIS Agreement
– Evolved as a comprehensive provider of computational/data-intensive services
– Infrastructure capabilities extended: EIF Climate HPC Project Agreement (2010) – $50M
• New petascale HPC facility and related infrastructure + Data Centre
• Comprehensive and vertically-integrated provider of infrastructure services:
– Petascale supercomputer (Fujitsu Primergy, 1.2 Pflops)
– Data-intensive services via VMs and cloud
– Large-scale data storage via NCI partnership & RDSI
– Nationally-acknowledged expertise
• Sustainable partnership (through until Dec. 2015, to be extended)
– Involving major national institutions and research-intensive universities, $11–12M p.a.
NCI – an overview
Mission:
• to foster ambitious and aspirational research objectives and to enable their realisation, in the Australian context, through world-class, high-end computing services.
NCI is:
• being driven by research objectives,
• an integral component of the Commonwealth's research infrastructure program,
• a comprehensive, vertically-integrated research service,
• engaging with, and is embedded in, research communities, high-impact centres, and institutions,
• fostering aspiration in computational and data-intensive research, and increasing ambition in the use of HPC to enhance research impact,
• delivering innovative "digital laboratories",
• providing national access on priority and merit, and
• being built on, and sustained by, a collaboration of national organisations and research-intensive universities.
[Diagram: NCI service stack, driven by Research Objectives – Research Outcomes; Communities and Institutions / Access and Services; Expertise, Support and Development; Data-Intensive Services and Digital Laboratories; Compute (HPC/Cloud) and Data Infrastructure]
Current Infrastructure
Current Peak Computing Infrastructure (NCRIS)
• Raijin – Fujitsu Primergy cluster – June 2013
• Approx. 57,500 Intel Sandy Bridge cores (2.6 GHz)
• 157 TBytes memory, 10 PBytes short-term storage
• FDR InfiniBand
• CentOS 6.5 Linux; PBS Pro scheduler
• Australia's first petaflop supercomputer
– 1195 Tflops, 1,400,000 SPECFPrate
– Fastest filesystem in the Southern Hemisphere
• Significant growth in highly scaling applications
– Largest: 32,000 cores; many 1,024-core tasks
Data Storage
• HSM (massdata): 10 PB as at September 2013
• Global Lustre filesystem
– 10 PB as of Sept 2013 and growing
– 25 GBytes/sec
• Object storage: being considered
Current Infrastructure
• Cloud computing (since 2009)
– DCC cluster (now retired in favour of OpenStack)
– Virtualized under VMware
– 384-core private cloud: Red Hat OpenStack
– NeCTAR Research Cloud node at NCI
• Australia's highest-performance cloud
• Architected for strong computational and I/O performance needed for "big data" research
• Intel Sandy Bridge (3,200 cores)
• 160 TB of SSDs; FDR IB for compute and I/O performance
NCI Cloud Overview
...from 10 Gbit to 56 Gbit Ethernet, and towards 56 Gbit InfiniBand.
10 Gigabit Ethernet: Introduction
The NCI Cloud prototype was first deployed using 8 Dell M710HD blades and 2x M8024K 10GE switches.
To keep up with increasing demand for cloud resources, the cluster was gradually expanded to 32 Dell M710HD blades and 4x M8024K 10GE switches.
10 Gigabit Ethernet : Design (1)
10 Gigabit Ethernet installation doesn't require any specific configuration.
Image source: access.redhat.com
10 Gigabit Ethernet : Design (2)
Network design for OpenStack running on 10 GE. No special hardware or configuration is required.
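As an illustration of what "no special hardware or configuration" means in practice, a minimal nova-network FlatDHCP fragment of the kind such a cluster could run on stock, in-kernel 10GE drivers is sketched below. The interface names and address range are assumptions for illustration, not NCI's actual values.

/etc/nova/nova.conf (illustrative fragment):
network_manager = nova.network.manager.FlatDHCPManager
flat_interface = eth0        # 10GE interface carrying fixed (tenant) traffic
public_interface = eth1      # interface used for floating IPs
fixed_range = 10.1.0.0/16    # fixed IP range handed out by dnsmasq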
10 Gigabit Ethernet: Pros
Pros:
• In-kernel drivers support 10GE "out of the box",
• There is no need to install additional modules for OpenStack components,
• A simple cluster like this can be deployed in days, using "off-the-shelf" hardware and software.
10 Gigabit Ethernet: Cons
Cons:
• Much lower bandwidth and much higher latency compared to HPC-oriented systems,
• High blocking factor, particularly while scaling beyond a single blade chassis,
• No RDMA support,
• No L2 separation between tenants (strictly speaking a nova-network limitation).
56 Gigabit Ethernet: Introduction (1)
Overcoming the limitations of 10G Ethernet:
• 56 Gbit Ethernet, and
• SR-IOV-enabled HCAs to build the next-generation cloud platform,
• Neutron.
56 Gigabit Ethernet: Introduction (2)
Cloud hardware:
• 3,200 Intel Sandy Bridge (E5-2670) cores
• Part of the cluster is HT-enabled
• Mellanox ConnectX-3 FDR HCAs
• SX1036 switches, IB- and Ethernet-capable.
56 Gigabit Ethernet: Network design
56 Gigabit Ethernet: Cloud Design (1)
Customizing OpenStack Networking and Nova Compute.
Image source: access.redhat.com
56 Gigabit Ethernet: Cloud Design (2)
Compute nodes are built with Mellanox OFED drivers and specific kernel module parameters:
[root@tc015 ~]# cat /etc/modprobe.d/mlx4_core.conf
options mlx4_core num_vfs=16
[root@tc015 ~]# ip link sh eth2
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9216 qdisc mq state UP qlen 1000
    link/ether 00:02:c9:1c:ca:d1 brd ff:ff:ff:ff:ff:ff
    vf 0 MAC 00:00:00:00:00:00, vlan 4095
    vf 1 MAC 00:00:00:00:00:00, vlan 4095
    vf 2 MAC 00:00:00:00:00:00, vlan 4095
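As a sanity check (not shown in the original slides), once mlx4_core is reloaded with num_vfs set, the virtual functions should also appear as additional PCI devices. The PCI addresses and exact device strings below are illustrative only:

[root@tc015 ~]# lspci | grep -i mellanox
04:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
04:00.1 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]
04:00.2 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]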
56 Gigabit Ethernet: Cloud Design (3)
Additional OpenStack components:
• eswitchd, mlnxvif and the mlnx-neutron-agent need to be installed on the compute nodes
Additional OpenStack configuration:
• The Neutron server needs to be configured to pass on SR-IOV guest configuration to the compute nodes
• Compute nodes have to be configured with SR-IOV guest support (an illustrative placement of these settings follows below):
vif_driver=mlnxvif.vif.MlxEthVIFDriver
vnic_type = hostdev
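For context, a sketch of where these two settings live, based on the Mellanox Icehouse guide linked under Further reading; the file placement shown here is an assumption rather than NCI's verified configuration:

/etc/nova/nova.conf (compute node):
vif_driver = mlnxvif.vif.MlxEthVIFDriver

/etc/neutron/plugins/mlnx/mlnx_conf.ini (compute node agent):
vnic_type = hostdev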
56 Gigabit Ethernet: Summary
Pros:
• Excellent performance (~40–50 Gbit/s in VMs)
• Support for RDMA via RoCE
• Support for private tenant networks (a Neutron feature)
Cons:
• SR-IOV is great for performance, but doesn't support a number of features (live migration and snapshots, high ratios of VCPU overcommit)
• Because Ethernet is used, a non-blocking topology can't be guaranteed (dependent on communication patterns)
Native Infiniband: Introduction
The hardware used to deploy the 56 Gigabit NCI Cloud is IB-capable. NCI engineers are working on building a native IB cloud; the first prototype was deployed in September 2014.
Native Infiniband: Motivation
Why is NCI interested in a native IB cloud?
• Ability to use RDMA via native IB – less overhead,
• Enabling a true fat-tree topology for all traffic patterns,
• Ability to use the (excellent) IB profiling, debugging and troubleshooting tools (see the illustrative commands below),
• IB is the "native mode of operation" for many HPC users.
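A few illustrative commands from the standard OFED/infiniband-diags toolset referred to in the bullet above (generic tools, not NCI-specific):

ibstat          # report local HCA state, firmware and port rates
ibnetdiscover   # discover and print the IB fabric topology
perfquery       # read per-port performance and error counters
ibdiagnet       # run a full fabric health check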
Native Infiniband: Network Design
The network topology is almost identical to the Ethernet fat tree. The main difference is the lack of the ring connection in the core.
Native Infiniband: Cloud Design (1)
A native IB installation of OpenStack requires the same degree of modification to core components as 56 Gigabit Ethernet, plus:
• Replacing the default DHCP agent with a custom agent with IPoIB support,
• OpenSM needs to be pre-configured with PKeys and the corresponding Ethernet VLANs,
• Neutron, the hypervisor OS and nova-compute need to be configured to support IB vifs instead of Ethernet.
Native Infiniband: Cloud Design (2)
Configuration examples
/etc/neutron/plugins/mlnx/mlnx_conf.ini:
tenant_network_type = ib
physical_interface_mapping = default:autoib
[root@tk-0503 ~]# cat /etc/opensm/partitions.conf
management=0x7fff, ipoib, sl=0, defmember=full : ALL, ALL_SWITCHES=full, SELF=full;
vlan1=0x1, ipoib, sl=0, defmember=full : ALL;
vlan2=0x2, ipoib, sl=0, defmember=full : ALL;
vlan3=0x3, ipoib, sl=0, defmember=full : ALL;
vlan4=0x4, ipoib, sl=0, defmember=full : ALL;
vlan5=0x5, ipoib, sl=0, defmember=full : ALL;
vlan6=0x6, ipoib, sl=0, defmember=full : ALL;
vlan7=0x7, ipoib, sl=0, defmember=full : ALL;
/etc/neutron/dhcp_agent.ini:
dhcp_driver = mlnx_dhcp.MlnxDnsmasq
Native Infiniband: Challenges
• The evolution of OpenStack can conflict with the out-of-tree customizations required by 56GE and/or IB.
• Example: as of September 2014, IB OpenStack Networker functionality was broken in Icehouse/Open vSwitch.
• NCI engineers built a limited-functionality physical "Networker" system to work around this problem.
• Vendor update (2 October): the fix will be available in Q4.
• An IB cloud means extensive use of IPoIB, which can mean higher overhead than 56 Gbit Ethernet.
• IPoIB is generally sensitive to CPU scheduling.
Performance benchmarks
First, we compared the 56 Gbit Ethernet cloud with the traditional 10 Gbit Ethernet cloud.
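The slides do not record the exact benchmark commands used; purely as an illustration, point-to-point bandwidth and latency between two VMs are commonly measured with tools such as iperf (TCP) and the OSU MPI micro-benchmarks:

iperf -s                                  # on the first VM: start the TCP server
iperf -c <server-ip> -P 4                 # on the second VM: run 4 parallel streams
mpirun -np 2 -host vm1,vm2 osu_latency    # MPI point-to-point latency
mpirun -np 2 -host vm1,vm2 osu_bw         # MPI point-to-point bandwidth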
[Benchmark result charts: 56 Gbit Ethernet cloud vs. 10 Gbit Ethernet cloud]
Performance benchmarks
Next, we compared with Raijin.
[Benchmark result charts: NCI Cloud vs. Raijin]
Benchmark conclusions (1)
• NCI Cloud is clearly the fastest cloud in Australia, and possibly the fastest in the world,
• There is significant room for optimizing performance further,
• Narrowing, and eventually closing, the gap between HPC and cloud is our ambition.
Benchmark conclusions (2)
• Cloud ≠ virtualization: Ironic (bare metal)? Docker (containers)? Native IB.
• Next goal: a 0%-overhead cloud.
Questions?
• Thank you for your attention.
Further reading
https://wiki.openstack.org/wiki/Mellanox-Neutron-Icehouse-Redhat
https://openstack.redhat.com/Neutron_with_OVS_and_VLANs