This session will demonstrate that by extending KVM we can deliver, non-disruptively, the next level of IaaS platform modularization. We will first show instantaneous live migration of VMs. Then we will introduce the memory aggregation concept, and finally show how to achieve full operational flexibility by disaggregating datacenter resources into their core elements.
Benoit Hudzia Sr. Researcher; SAP Research CEC Belfast
Aidan Shribman Sr. Researcher; SAP Research Israel
TRND04
The Lego Cloud
© 2012 SAP AG. All rights reserved. 2
Agenda
Introduction
Hardware Trends
Live Migration
Memory Aggregation
Compute Aggregation
Summary
Introduction: The evolution of the datacenter
No virtualization
Basic Consolidation
Flexible Resources Management (Cloud)
Resources Disaggregation
(True utility Cloud)
Evolution of Virtualization
Why Disaggregate Resources?
Better Performance
Replacing slow local devices (e.g. disk) with fast remote devices (e.g. DRAM).
Many remote devices working in parallel (e.g. DRAM, disk, compute)
Superior Scalability
Going beyond boundaries of the single node
Improved Economics
Do more with existing hardware
Reach better hardware utilization levels
The Hecatonchires Project
Hecatonchires: “Hundred-Handed One”
Original idea: provide Distributed Shared Memory (DSM) capabilities to the cloud
Strategic goal: full resource liberation brought to the cloud by:
Providing more resource flexibility within the current cloud paradigm by breaking nodes down to their basic elements (CPU, memory, I/O)
Extending the existing cloud software stack (KVM, Qemu, libvirt, OpenStack) without degrading any existing capabilities
Using commodity cloud hardware: medium-sized hosts (typically 64 GB and 8/16 cores) and standard interconnects (such as 1 GbE or 10 GbE)
Initiated by Benoit Hudzia in 2011. Currently developed by two small teams of researchers from the TI Practice located in Belfast and Ra’anana
High Level Architecture
No special hardware required, only RDMA-enabled NICs which support the low-overhead, low-latency communication layer
VMs are no longer bound by host size, as resources such as memory, I/O and compute can be aggregated
Different-sized VMs can share infrastructure, so smaller VMs not requiring dedicated hosts are still supported
The application stack runs unmodified
[Diagram: guest VMs spanning Server #1 … Server #n; each host contributes CPUs, memory and I/O, joined by fast RDMA communication]
The Team - Panoramic View
Hardware Trends: Are hosts getting closer?
CPUs stopped getting faster
Moore’s law prevailed until around 2003, when core clock speed hit a practical limit of about 3.4 GHz
In the datacenter, cores run even slower, at 2.0–2.8 GHz, for power-conservation reasons
Since 2000 you do get more cores, but that does not affect compute cycle and instruction latencies
Effectively, arbitrary sequential algorithms have not gotten faster since
Source: http://www.intel.com/pressroom/kits/quickrefyr.htm
DRAM latency has remained constant
CPU clock speed and memory bandwidth increased steadily (at least until 2000)
But memory latency remained constant, so local memory has gotten slower from the CPU’s perspective
Source: J. Karstens: In-Memory Technology at SAP. DKOM 2010
Disk latency has virtually not improved
[Chart: average rotational latency (ms) by disk RPM: 3,600 RPM: 8.3; 4,200: 7.1; 4,500: 6.7; 4,900: 6.1; 5,200: 5.8; 5,400: 5.6; 7,200: 4.2; 10,000: 3.0; 12,000: 2.5; 15,000: 2.0]
A standard 1980s disk ran at 3,600 RPM; a standard 2010s disk runs at 7,200 RPM. A 2x speedup in 30 years is negligible: effectively, disk has become slower from the CPU’s perspective.
Panda et al. Supercomputing 2009
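The chart's latency column follows directly from rotational speed: the average rotational latency is the time for half a revolution. A minimal Python sketch (illustrative, assuming the chart reports purely rotational latency):

```python
# Average rotational latency of a spinning disk: on average the head must
# wait half a revolution for the target sector to come around.

def avg_rotational_latency_ms(rpm: int) -> float:
    """Time for half a revolution, in milliseconds."""
    ms_per_revolution = 60_000 / rpm
    return ms_per_revolution / 2

# Reproduce the chart: 3,600 RPM (1980s) vs 7,200 RPM (2010s) vs 15,000 RPM.
for rpm in (3600, 7200, 15000):
    print(rpm, "RPM ->", round(avg_rotational_latency_ms(rpm), 1), "ms")
```

Doubling RPM halves the latency, which is why three decades of RPM growth only bought a 2x improvement.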
But: Networks are Steadily Getting Faster
[Chart: network performance (Gbit/s) over time, 1979 to present]
Since 1979 networks went from 0.01 Gbit/s up to 64 Gbit/s, a 6,400x speedup
A competitive marketplace
10 and 40 Gbps Ethernet – originated from network
interconnects
40 Gbps QDR InfiniBand – originated from computer internal bus technology
InfiniBand/Ethernet convergence
Virtual Protocol Interconnects
InfiniBand over Ethernet
RDMA over converged enhanced Ethernet
Using standard semantics defined by OFED
Panda et al. Supercomputing 2009
And: Communication Stacks Are Becoming Faster
Network stack deficiencies
Application / OS context switches
Intermediate buffer copies
Transport processing
RDMA OFED Verbs API provides
Zero copy
Transport processing offloaded to the NIC (e.g. RoCE)
Flexibility to use InfiniBand, Ethernet or iWARP
Resulting in
Reduced latency
Processor offloading
Operational flexibility
Benchmarking Modern Interconnects
Intel MPI Benchmarks (IMB)
Used typically in HPC and parallel computing
Comparing:
4x DDR IB using Verbs API
10 GE TOE (TCP offloading engine) iWARP
1 GE
Measured latencies
IB 2 us
10 GE 8.23 us
1 GE 46.52 us
Broadcast latency
Exchange bandwidth
Source: Performance of HPC Applications over InfiniBand, 10 Gb and 1 Gb Ethernet, IBM
Conclusion: Remote Nodes Have Gotten Closer
Interconnects have become much faster
Fast interconnects have become a commodity
and are moving out of the High Performance
Computing (HPC) niche
IB latency of 2,000 ns is only 20x slower than RAM, and 100x faster than SSD
Remote page faulting is much faster than traditional disk-backed page swapping!
HANA Performance Analysis, Chaim Bendelac, 2011
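The slide's speedup claims follow from simple division over the latency ladder. A quick check in Python (the SSD figure of ~200 µs is an assumption, typical of 2012-era SATA SSDs; the others come from the slides):

```python
# Latency ladder in nanoseconds; "ssd" is an assumed value, the rest are
# the figures quoted in the deck.
LATENCY_NS = {
    "local_dram": 100,
    "infiniband_page": 2_000,
    "ssd": 200_000,          # assumption: ~200 us for a 2012-era SATA SSD
    "disk": 10_000_000,
}

ib = LATENCY_NS["infiniband_page"]
print("IB vs DRAM:", ib // LATENCY_NS["local_dram"], "x slower")
print("IB vs SSD: ", LATENCY_NS["ssd"] // ib, "x faster")
print("IB vs disk:", LATENCY_NS["disk"] // ib, "x faster")
```

The disk ratio (5,000x) is what makes remote paging over InfiniBand competitive with, indeed far better than, local swap.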
Result: Blurring of the physical node boundaries
[Chart: access latency ladder: disk ~10,000,000 ns; remote DRAM over IB ~2,000 ns; local DRAM ~100 ns]
Live Migration: Pretext to Hecatonchire
Enabling Live Migration of SAP Workloads
Business Problem
Typical SAP workloads (e.g. SAP ERP) are transactional,
large (possibly 64 GB), with a fast rate of memory writes.
Classic live migration fails for such workloads as rapid
memory writes cause memory pages to be re-sent over
and over again
Hecatonchire’s Solution
Enable live migration by reducing both the number of
pages re-sent and the cost of a page re-send
Non-intrusive; reduces downtime, service degradation and total migration time
Live Migration Technique
Pre-migration: VM active on host A; destination host selected (block devices mirrored)
Reservation: initialize container on target host
Iterative pre-copy: copy dirty pages in successive rounds
Stop and copy: suspend VM on host A; redirect network traffic; sync remaining state
Commitment: activate on host B; VM state on host A released
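The iterative pre-copy stage can be sketched as a loop that keeps resending pages dirtied during the previous round until the remaining dirty set is small enough to stop-and-copy. A minimal sketch, assuming a hypothetical dirty-page tracker and sender (names are illustrative, not QEMU's API):

```python
# Sketch of the iterative pre-copy loop. `get_dirty` returns the set of
# pages re-dirtied since the last call; `send` transfers a set of pages.

def live_migrate(all_pages, get_dirty, send, max_rounds=30, stop_threshold=50):
    """Iterate until the dirty set is small, then stop-and-copy the rest."""
    send(all_pages)                       # round 1: full memory copy
    for _ in range(max_rounds):
        dirty = get_dirty()               # pages written during the last round
        if len(dirty) <= stop_threshold:  # small enough: suspend and finish
            break
        send(dirty)                       # otherwise re-send and loop
    send(get_dirty())                     # stop-and-copy: final dirty pages
```

This also shows why fast writers defeat pre-copy: if `get_dirty` never shrinks below the threshold, the loop only terminates via `max_rounds`.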
Pre-copy live migration
Reducing number of page re-sends
Page LRU Reordering such that pages which
have a high chance of being re-dirtied soon are
delayed until later
Reducing the cost of a page re-send
By using the XBZRLE delta encoder we can represent page changes much more efficiently
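The idea behind XBZRLE-style encoding can be sketched in a few lines: XOR the old and new versions of a page (mostly zeros when few bytes changed), then run-length encode the zero runs. This is a simplified sketch of the concept, not QEMU's actual wire format:

```python
# XBZRLE-style delta encoding sketch: XOR old/new page, then store
# (zero_run_length, changed_bytes) pairs instead of the full page.

def xbzrle_encode(old: bytes, new: bytes):
    xored = bytes(a ^ b for a, b in zip(old, new))
    encoded, i = [], 0
    while i < len(xored):
        start = i
        while i < len(xored) and xored[i] == 0:   # count unchanged bytes
            i += 1
        zeros = i - start
        start = i
        while i < len(xored) and xored[i] != 0:   # collect changed bytes
            i += 1
        encoded.append((zeros, xored[start:i]))
    return encoded

def xbzrle_decode(old: bytes, encoded):
    out, i = bytearray(old), 0
    for zeros, diff in encoded:
        i += zeros                                 # skip unchanged bytes
        for b in diff:
            out[i] ^= b                            # old ^ (old ^ new) == new
            i += 1
    return bytes(out)
```

For a 4 KB page with a handful of changed bytes, the encoded form is a tiny fraction of a full page re-send.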
More Than One Way to Live Migrate …

Pre-Copy Live-Migration: pre-migrate; reservation; iterative pre-copy (X rounds); stop and copy (downtime); commit. The VM is live on A until the stop, then live on B.
Post-Copy Live-Migration: pre-migrate; reservation; stop and copy (downtime); page pushing (1 round, degraded on B); commit. The VM is live on A only briefly, then degraded on B until pushing completes.
Hybrid Post-Copy Live-Migration: pre-migrate; reservation; iterative pre-copy (X rounds, live on A); stop and copy (downtime); page pushing (1 round, degraded on B); commit.
In every variant, the total migration time spans pre-migration through commit.
Post-copy live migration using fast interconnects
In Post-copy live migration the state of the VM
is transferred to the destination and activated
before memory is transferred
Post-copy implementation includes
Handling of remote page faults
Background transfer of memory pages
Service degradation mitigated by
RDMA zero-copy interconnects
Pre-paging – similar in concept to pre-fetching
Hybrid Post Copy – begins with a pre-copy phase
MMU integration – eliminating need for VM pause
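Remote page faulting with pre-paging can be sketched as: on a fault, pull the missing page plus a window of subsequent pages in the same round trip, betting on access locality. A minimal sketch with an illustrative fetch interface and window size (Hecatonchire itself does this over RDMA with MMU integration):

```python
# Post-copy fault handling with pre-paging. `local` maps page number ->
# bytes already resident; `fetch_remote` pulls a batch from the source host.

def handle_fault(page_no, local, fetch_remote, window=4):
    """Serve a page fault; pre-page `window` subsequent pages in one pull."""
    wanted = [p for p in range(page_no, page_no + window) if p not in local]
    for p, data in fetch_remote(wanted):   # one batched round trip
        local[p] = data
    return local[page_no]
```

Batching is what mitigates service degradation: subsequent sequential accesses hit `local` instead of paying another interconnect round trip.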
Demo
Memory Aggregation: In the oven …
The Memory Cloud: turns memory into a distributed memory service
Breaks memory from the bounds of the physical box
Yields double-digit percentage gains in IT economics
Transparent deployment with performance at scale, and reliability
[Diagram: application, memory and storage layers spread across Server1–Server3, with each VM drawing RAM from multiple hosts]
RRAIM: Remote Redundant Array of Inexpensive Memory. Supporting large memory instances on-demand
Business Problem
Current instance memory sizes are constrained by the physical host’s memory size (Amazon’s biggest VM occupies a whole physical host)
Heavy swap usage slows execution time for data-intensive applications
Hecatonchire Solution
Access remote DRAM via fast zero-copy RDMA interconnects
Hide remote DRAM latency by using page pre-pushing
MMU integration for transparency to applications and VMs
Reliability via a RAID-1-like (mirroring) schema
Hecatonchire Value Proposition
Provides memory aggregation on-demand
Totally transparent to the workload (no integration needed)
No hardware investment! No dedicated servers!
[Diagram: the RAIM solution: application VMs swap to a memory cloud layering compression / de-duplication / N-tiers storage / HR-HA over pooled RAM]
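The RAID-1-like schema can be sketched as: every remote page is written to two memory sponsors, and a read succeeds as long as one replica survives. This is an illustrative sketch; the sponsor-selection rule and API are assumptions, not Hecatonchire's actual interface:

```python
# RRAIM sketch: mirror each page to two "memory sponsor" hosts (modelled as
# dicts of page number -> bytes), so losing any one sponsor loses no data.

class RRAIM:
    def __init__(self, sponsors):
        assert len(sponsors) >= 2, "mirroring needs at least two sponsors"
        self.sponsors = sponsors

    def write(self, page_no, data):
        # Illustrative placement rule: sponsor i and its neighbour i+1.
        a = self.sponsors[page_no % len(self.sponsors)]
        b = self.sponsors[(page_no + 1) % len(self.sponsors)]
        a[page_no] = b[page_no] = data            # RAID-1 style mirror

    def read(self, page_no):
        for s in self.sponsors:                   # any surviving replica works
            if page_no in s:
                return s[page_no]
        raise KeyError(page_no)
```

The cost is 2x remote memory per page, the same capacity trade-off RAID-1 makes for disks.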
Hecatonchire / RRAIM: Breakthrough Capability. Breaking the memory box barrier for memory-intensive applications
[Chart: access speed (1 µs to 10 ms) vs. capacity (MB to PB): embedded resources (L1/L2 cache); local resources (DRAM, SSD, local disk); networked resources (SAN, NAS). A performance barrier has traditionally separated local from networked resources]
Lego Cloud Architecture (Memory Block)
[Diagram: memory cloud management services spanning many physical nodes hosting a variety of VMs; configurations include VM memory guest, VM memory host, and combined guest-and-host, tied together over RRAIM]
Instant Flash Cloning On-Demand
Business Problem
Burst load / service usage that cannot be satisfied in time
Existing solutions
Vendors: Amazon / VMware / RightScale
Start up a VM from a disk image
Requires a full VM OS startup sequence
Hecatonchire Solution
Use a paused VM as the source for copy-on-write (CoW)
Then perform a post-copy live migration
Hecatonchire Value Proposition
Just in time (sub-second) provisioning
Instant Flash Cloning On-Demand
We can clone VMs to meet demand much faster than other solutions
Reduces infrastructure costs while still minimizing lost opportunities => just-in-time provisioning
Requires application integration
We track OS/application metrics in running VMs or in the Load Balancer (LB)
Alerts fire if metrics pass a pre-defined threshold
Based on the alerts we scale up, adding more resources, or scale down to release unutilized resources
Amazon Web Services - Guide
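The alert-driven rule above amounts to comparing a tracked metric against two thresholds. A minimal sketch; the metric name and threshold values are illustrative, not Hecatonchire's configuration:

```python
# Threshold-based scaling decision, as used behind a load balancer:
# above the upper threshold, flash-clone another VM; below the lower
# threshold, release an under-utilized clone; otherwise do nothing.

def scaling_decision(cpu_util, scale_up_at=0.80, scale_down_at=0.30):
    if cpu_util > scale_up_at:
        return "scale-up"      # flash-clone another VM behind the LB
    if cpu_util < scale_down_at:
        return "scale-down"    # release an under-utilized clone
    return "hold"
```

The gap between the two thresholds (hysteresis) prevents the system from oscillating between cloning and releasing on noisy metrics.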
Compute Aggregation: Our next challenge
Cost Effective “Small” HPC Grid
High Performance Computing (HPC)
Supercomputers at the frontline of processing capacity: 10k–100k cores
Typical benchmark: Top 500 (LINPACK, linear algebra)
Small HPC using 10-20 commodity (2 TB / 80 core) nodes
Typical Applications
Relational Databases
Analytics tasks (Linear Algebra)
Simulations
Hecatonchire Value Proposition
Optimal price / performance by using commodity hardware
Operational flexibility: node downtime without downing the cluster
Seamless deployment within existing cloud
Distributed Shared Memory (DSM)
Traditional cluster
Distributed memory
Standard interconnects
OS instance on each node
Distribution handled by application
ccNUMA
Cache coherent shared memory
Fast interconnects
One OS instance
Distribution handled by hardware
Vendors: ScaleMP, Numascale, others
Distributed Shared Memory – Inherent Limitations
Linux provides NUMA topology discovery
Distance between compute cores
Distance between cores to memory
While the Linux OS is aware of the NUMA
layout the application may not be aware …
Cache-coherency may get very expensive
Inter-core: L3 Cache 20 ns
Inter-socket: Main Memory 100 ns
Inter-node (IB): Remote Memory 2,000 ns
Thus the ccNUMA architecture may not be “really” transparent to the application!
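The cost of that non-transparency is easy to quantify from the slide's numbers: the average access latency is a weighted mix of local and remote costs, so even a small remote fraction dominates. A back-of-envelope sketch (the latency values come from the slide; the model itself is a simplification that ignores caching):

```python
# Effective memory latency under ccNUMA as a weighted average of local
# DRAM and remote (InfiniBand) accesses, using the slide's figures.

LAT_NS = {"l3": 20, "local_dram": 100, "remote_ib": 2_000}

def effective_latency_ns(remote_fraction):
    local = 1.0 - remote_fraction
    return local * LAT_NS["local_dram"] + remote_fraction * LAT_NS["remote_ib"]

# Just 10% remote accesses nearly triples the average access cost:
# 0.9 * 100 + 0.1 * 2000 = 290 ns
```

This is why a NUMA-unaware application can see large slowdowns on a ccNUMA cluster even though the hardware presents a single coherent memory.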
Summary
Roadmap
2011
• Live Migration
• Pre-copy XBZRLE Delta Encoding
• Pre-copy LRU page reordering
• Post-copy using RDMA interconnects
2012
• Resource Aggregation
• Cloud Management Integration
• Memory Aggregation – RAIM (Redundant Array of Inexpensive Memory)
• I/O Aggregation – vRAID (virtual Redundant Array of Inexpensive Disks)
• Flash cloning
2013
• Lego Landscape
• CPU Aggregation - ccNUMA
• Flexible resource management
Key takeaways
Hecatonchire extends the standard Linux stack, requiring only standard commodity hardware
With Hecatonchire, unmodified applications or VMs can tap into remote resources transparently
To be released as open source under the GPLv2 and LGPL licenses to the Qemu and Linux communities
Developed by SAP Research TI
Thank you
Contact information:
Benoit Hudzia; Sr. Researcher;
SAP Research CEC Belfast
Aidan Shribman; Sr. Researcher;
SAP Research Israel
Hecatonchire Wiki
https://wiki.wdf.sap.corp/wiki/display/cecbelfast/Hecatonchire%2C++Distributed+Shared+Memory+%28DSM%29+And+Datacenter+Resources+disaggregation+for+Cloud
Appendix
Linux Kernel Virtual Machine (KVM)
Released as a Linux Kernel Module (LKM)
under GPLv2 license in 2007 by Qumranet
Full virtualization via Intel VT-x and AMD-V
virtualization extensions to the x86 instruction
set
Uses Qemu for invoking KVM, for handling of
I/O and for advanced capabilities such as VM
live migration
KVM is considered the primary hypervisor on most major Linux distributions, such as Red Hat and SUSE
Remote Page Faulting Architecture Comparison
Hecatonchire
No context switches
Zero-copy
Uses iWARP RDMA
Yobusame
Context switches into user mode
Uses standard TCP/IP transport
Horofuchi and Yamahata, KVM Forum 2011 Hudzia and Shribman, SYSTOR 2012
Legal Disclaimer
The information in this presentation is confidential and proprietary to SAP and may not be disclosed without the permission of
SAP. This presentation is not subject to your license agreement or any other service or subscription agreement with SAP. SAP
has no obligation to pursue any course of business outlined in this document or any related presentation, or to develop or
release any functionality mentioned therein. This document, or any related presentation and SAP's strategy and possible future
developments, products and or platforms directions and functionality are all subject to change and may be changed by SAP at
any time for any reason without notice. The information on this document is not a commitment, promise or legal obligation to
deliver any material, code or functionality. This document is provided without a warranty of any kind, either express or implied,
including but not limited to, the implied warranties of merchantability, fitness for a particular purpose, or non-infringement. This
document is for informational purposes and may not be incorporated into a contract. SAP assumes no responsibility for errors or
omissions in this document, except if such damages were caused by SAP intentionally or grossly negligent.
All forward-looking statements are subject to various risks and uncertainties that could cause actual results to differ materially
from expectations. Readers are cautioned not to place undue reliance on these forward-looking statements, which speak only as
of their dates, and they should not be relied upon in making purchasing decisions.