This session will demonstrate that by extending KVM we can deliver, non-disruptively, the next level of IaaS platform modularization. We will first show instantaneous live migration of VMs. Then we will introduce the memory aggregation concept, and finally show how to achieve full operational flexibility by disaggregating datacenter resources into their core elements.
Benoit Hudzia Sr. Researcher; SAP Research CEC Belfast
Aidan Shribman Sr. Researcher; SAP Research Israel
TRND04
The Lego Cloud
© 2012 SAP AG. All rights reserved. 2
Agenda
Introduction
Hardware Trends
Live Migration
Memory Aggregation
Compute Aggregation
Summary
Introduction: The evolution of the datacenter
No virtualization
Basic Consolidation
Flexible Resources Management (Cloud)
Resources Disaggregation
(True utility Cloud)
Evolution of Virtualization
Why Disaggregate Resources?
Better Performance
Replacing slow local devices (e.g. disk) with fast remote devices (e.g. DRAM).
Many remote devices working in parallel (e.g. DRAM, disk, compute)
Superior Scalability
Going beyond boundaries of the single node
Improved Economics
Do more with existing hardware
Reach better hardware utilization levels
The Hecatonchires Project
Hecatonchires: “Hundred-Handed One”
Original idea: provide Distributed Shared Memory (DSM) capabilities to the cloud
Strategic goal: full resource liberation brought to the cloud by:
Providing more resource flexibility within the current cloud paradigm by breaking nodes down to their basic elements (CPU, memory, I/O)
Extending the existing cloud software stack (KVM, Qemu, libvirt, OpenStack) without degrading any existing capabilities
Using commodity cloud hardware: medium-sized hosts (typically 64 GB and 8/16 cores) and standard interconnects (such as 1 GbE or 10 GbE)
Initiated by Benoit Hudzia in 2011. Currently developed by two small teams of researchers from the TI Practice located in Belfast and Ra’anana
High Level Architecture
No special hardware required, only RDMA-enabled NICs which support the low-overhead, low-latency communication layer
VMs are no longer bound by host size, as resources such as memory, I/O and compute can be aggregated
Different-sized VMs can share infrastructure, so smaller VMs not requiring dedicated hosts are still supported
The application stack runs unmodified
[Diagram: guest VMs spanning Server #1 … Server #n; each host contributes CPUs, memory and I/O, joined by fast RDMA communication]
The Team - Panoramic View
Hardware Trends: Are hosts getting closer?
CPUs stopped getting faster
Moore’s law prevailed until around 2003, when core clock speed hit a practical limit of about 3.4 GHz
In the datacenter, cores run even slower, at 2.0–2.8 GHz, for power-conservation reasons
Since 2000 you do get more cores, but that does not affect compute cycle and instruction latencies
Effectively, arbitrary sequential algorithms have not gotten faster since
Source: http://www.intel.com/pressroom/kits/quickrefyr.htm
DRAM latency has remained constant
CPU clock speed and memory bandwidth increased steadily (at least until 2000)
But memory latency remained constant, so local memory has gotten slower from the CPU’s perspective
Source: J. Karstens: In-Memory Technology at SAP. DKOM 2010
Disk latency has virtually not improved
[Chart: average rotational latency (ms) by disk RPM: 3,600 RPM: 8.3; 4,200: 7.1; 4,500: 6.7; 4,900: 6.1; 5,200: 5.8; 5,400: 5.6; 7,200: 4.2; 10,000: 3.0; 12,000: 2.5; 15,000: 2.0]
A standard 1980s disk ran at 3,600 RPM; a standard 2010s disk runs at 7,200 RPM. A 2x speedup in 30 years is negligible: effectively, disk has become slower from the CPU’s perspective.
Panda et al. Supercomputing 2009
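The chart's latency column follows directly from rotational speed: the average rotational latency is the time for half a revolution. A minimal Python sketch (illustrative, assuming the chart reports purely rotational latency):

```python
# Average rotational latency of a spinning disk: on average the head must
# wait half a revolution for the target sector to come around.

def avg_rotational_latency_ms(rpm: int) -> float:
    """Time for half a revolution, in milliseconds."""
    ms_per_revolution = 60_000 / rpm
    return ms_per_revolution / 2

# Reproduce the chart: 3,600 RPM (1980s) vs 7,200 RPM (2010s) vs 15,000 RPM.
for rpm in (3600, 7200, 15000):
    print(rpm, "RPM ->", round(avg_rotational_latency_ms(rpm), 1), "ms")
```

Doubling RPM halves the latency, which is why three decades of RPM growth only bought a 2x improvement.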
But: Networks are Steadily Getting Faster
[Chart: network performance (Gbit/s) over time, 1979 to present]
Since 1979 networks went from 0.01 Gbit/s up to 64 Gbit/s, a 6,400x speedup
A competitive marketplace
10 and 40 Gbps Ethernet – originated from network
interconnects
40 Gbps QDR InfiniBand – originated from computer internal bus technology
InfiniBand/Ethernet convergence
Virtual Protocol Interconnects
InfiniBand over Ethernet
RDMA over converged enhanced Ethernet
Using standard semantics defined by OFED
Panda et al. Supercomputing 2009
And: Communication Stacks Are Becoming Faster
Network stack deficiencies
Application / OS context switches
Intermediate buffer copies
Transport processing
RDMA OFED Verbs API provides
Zero copy
Transport processing offloaded to the NIC (e.g. RoCE)
Flexibility to use InfiniBand, Ethernet or iWARP
Resulting in
Reduced latency
Processor offloading
Operational flexibility
Benchmarking Modern Interconnects
Intel MPI Benchmarks (IMB)
Used typically in HPC and parallel computing
Comparing:
4x DDR IB using Verbs API
10 GE TOE (TCP offloading engine) iWARP
1 GE
Measured latencies
IB 2 us
10 GE 8.23 us
1 GE 46.52 us
Broadcast latency
Exchange bandwidth
Source: Performance of HPC Applications over InfiniBand, 10 Gb and 1 Gb Ethernet, IBM
Conclusion: Remote Nodes Have Gotten Closer
Interconnects have become much faster
Fast interconnects have become a commodity
and are moving out of the High Performance
Computing (HPC) niche
IB latency of 2,000 ns is only 20x slower than RAM, and 100x faster than SSD
Remote page faulting is much faster than traditional disk-backed page swapping!
HANA Performance Analysis, Chaim Bendelac, 2011
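The slide's speedup claims follow from simple division over the latency ladder. A quick check in Python (the SSD figure of ~200 µs is an assumption, typical of 2012-era SATA SSDs; the others come from the slides):

```python
# Latency ladder in nanoseconds; "ssd" is an assumed value, the rest are
# the figures quoted in the deck.
LATENCY_NS = {
    "local_dram": 100,
    "infiniband_page": 2_000,
    "ssd": 200_000,          # assumption: ~200 us for a 2012-era SATA SSD
    "disk": 10_000_000,
}

ib = LATENCY_NS["infiniband_page"]
print("IB vs DRAM:", ib // LATENCY_NS["local_dram"], "x slower")
print("IB vs SSD: ", LATENCY_NS["ssd"] // ib, "x faster")
print("IB vs disk:", LATENCY_NS["disk"] // ib, "x faster")
```

The disk ratio (5,000x) is what makes remote paging over InfiniBand competitive with, indeed far better than, local swap.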
Result: Blurring of the physical node boundaries
[Chart: access latency ladder: disk ~10,000,000 ns; remote DRAM over IB ~2,000 ns; local DRAM ~100 ns]
Live Migration: Pretext to Hecatonchire
Enabling Live Migration of SAP Workloads
Business Problem
Typical SAP workloads (e.g. SAP ERP) are transactional,
large (possibly 64 GB), with a fast rate of memory writes.
Classic live migration fails for such workloads as rapid
memory writes cause memory pages to be re-sent over
and over again
Hecatonchire’s Solution
Enable live migration by reducing both the number of
pages re-sent and the cost of a page re-send
Non-intrusive; reduces downtime, service degradation and total migration time
Live Migration Technique
Pre-migration: VM active on host A; destination host selected (block devices mirrored)
Reservation: initialize container on target host
Iterative pre-copy: copy dirty pages in successive rounds
Stop and copy: suspend VM on host A; redirect network traffic; sync remaining state
Commitment: activate on host B; VM state on host A released
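The iterative pre-copy stage can be sketched as a loop that keeps resending pages dirtied during the previous round until the remaining dirty set is small enough to stop-and-copy. A minimal sketch, assuming a hypothetical dirty-page tracker and sender (names are illustrative, not QEMU's API):

```python
# Sketch of the iterative pre-copy loop. `get_dirty` returns the set of
# pages re-dirtied since the last call; `send` transfers a set of pages.

def live_migrate(all_pages, get_dirty, send, max_rounds=30, stop_threshold=50):
    """Iterate until the dirty set is small, then stop-and-copy the rest."""
    send(all_pages)                       # round 1: full memory copy
    for _ in range(max_rounds):
        dirty = get_dirty()               # pages written during the last round
        if len(dirty) <= stop_threshold:  # small enough: suspend and finish
            break
        send(dirty)                       # otherwise re-send and loop
    send(get_dirty())                     # stop-and-copy: final dirty pages
```

This also shows why fast writers defeat pre-copy: if `get_dirty` never shrinks below the threshold, the loop only terminates via `max_rounds`.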
Pre-copy live migration
Reducing number of page re-sends
Page LRU Reordering such that pages which
have a high chance of being re-dirtied soon are
delayed until later
Reducing the cost of a page re-send
By using the XBZRLE delta encoder we can represent page changes much more efficiently
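The idea behind XBZRLE-style encoding can be sketched in a few lines: XOR the old and new versions of a page (mostly zeros when few bytes changed), then run-length encode the zero runs. This is a simplified sketch of the concept, not QEMU's actual wire format:

```python
# XBZRLE-style delta encoding sketch: XOR old/new page, then store
# (zero_run_length, changed_bytes) pairs instead of the full page.

def xbzrle_encode(old: bytes, new: bytes):
    xored = bytes(a ^ b for a, b in zip(old, new))
    encoded, i = [], 0
    while i < len(xored):
        start = i
        while i < len(xored) and xored[i] == 0:   # count unchanged bytes
            i += 1
        zeros = i - start
        start = i
        while i < len(xored) and xored[i] != 0:   # collect changed bytes
            i += 1
        encoded.append((zeros, xored[start:i]))
    return encoded

def xbzrle_decode(old: bytes, encoded):
    out, i = bytearray(old), 0
    for zeros, diff in encoded:
        i += zeros                                 # skip unchanged bytes
        for b in diff:
            out[i] ^= b                            # old ^ (old ^ new) == new
            i += 1
    return bytes(out)
```

For a 4 KB page with a handful of changed bytes, the encoded form is a tiny fraction of a full page re-send.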
More Than One Way to Live Migrate …

Pre-Copy Live-Migration: pre-migrate; reservation; iterative pre-copy (X rounds); stop and copy (downtime); commit. The VM is live on A until the stop, then live on B.
Post-Copy Live-Migration: pre-migrate; reservation; stop and copy (downtime); page pushing (1 round, degraded on B); commit. The VM is live on A only briefly, then degraded on B until pushing completes.
Hybrid Post-Copy Live-Migration: pre-migrate; reservation; iterative pre-copy (X rounds, live on A); stop and copy (downtime); page pushing (1 round, degraded on B); commit.
In every variant, the total migration time spans pre-migration through commit.
Post-copy live migration using fast interconnects
In Post-copy live migration the state of the VM
is transferred to the destination and activated
before memory is transferred
Post-copy implementation includes
Handling of remote page faults
Background transfer of memory pages
Service degradation mitigated by
RDMA zero-copy interconnects
Pre-paging – similar in concept to pre-fetching
Hybrid Post Copy – begins with a pre-copy phase
MMU integration – eliminating need for VM pause
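Remote page faulting with pre-paging can be sketched as: on a fault, pull the missing page plus a window of subsequent pages in the same round trip, betting on access locality. A minimal sketch with an illustrative fetch interface and window size (Hecatonchire itself does this over RDMA with MMU integration):

```python
# Post-copy fault handling with pre-paging. `local` maps page number ->
# bytes already resident; `fetch_remote` pulls a batch from the source host.

def handle_fault(page_no, local, fetch_remote, window=4):
    """Serve a page fault; pre-page `window` subsequent pages in one pull."""
    wanted = [p for p in range(page_no, page_no + window) if p not in local]
    for p, data in fetch_remote(wanted):   # one batched round trip
        local[p] = data
    return local[page_no]
```

Batching is what mitigates service degradation: subsequent sequential accesses hit `local` instead of paying another interconnect round trip.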
Demo
Memory Aggregation: In the oven …
The Memory Cloud: turns memory into a distributed memory service
Breaks memory from the bounds of the physical box
Yields double-digit percentage gains in IT economics
Transparent deployment with performance at scale, and reliability
[Diagram: application, memory and storage layers spread across Server1–Server3, with each VM drawing RAM from multiple hosts]
RRAIM: Remote Redundant Array of Inexpensive Memory. Supporting large memory instances on-demand
Business Problem
Current instance memory sizes are constrained by the physical host’s memory size (Amazon’s biggest VM occupies a whole physical host)
Heavy swap usage slows execution time for data-intensive applications
Hecatonchire Solution
Access remote DRAM via fast zero-copy RDMA interconnects
Hide remote DRAM latency by using page pre-pushing
MMU integration for transparency to applications and VMs
Reliability via a RAID-1-like (mirroring) schema
Hecatonchire Value Proposition
Provides memory aggregation on-demand
Totally transparent to the workload (no integration needed)
No hardware investment! No dedicated servers!
[Diagram: the RAIM solution: application VMs swap to a memory cloud layering compression / de-duplication / N-tiers storage / HR-HA over pooled RAM]
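The RAID-1-like schema can be sketched as: every remote page is written to two memory sponsors, and a read succeeds as long as one replica survives. This is an illustrative sketch; the sponsor-selection rule and API are assumptions, not Hecatonchire's actual interface:

```python
# RRAIM sketch: mirror each page to two "memory sponsor" hosts (modelled as
# dicts of page number -> bytes), so losing any one sponsor loses no data.

class RRAIM:
    def __init__(self, sponsors):
        assert len(sponsors) >= 2, "mirroring needs at least two sponsors"
        self.sponsors = sponsors

    def write(self, page_no, data):
        # Illustrative placement rule: sponsor i and its neighbour i+1.
        a = self.sponsors[page_no % len(self.sponsors)]
        b = self.sponsors[(page_no + 1) % len(self.sponsors)]
        a[page_no] = b[page_no] = data            # RAID-1 style mirror

    def read(self, page_no):
        for s in self.sponsors:                   # any surviving replica works
            if page_no in s:
                return s[page_no]
        raise KeyError(page_no)
```

The cost is 2x remote memory per page, the same capacity trade-off RAID-1 makes for disks.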
Hecatonchire / RRAIM: Breakthrough Capability. Breaking the memory box barrier for memory-intensive applications
[Chart: access speed (1 µs to 10 ms) vs. capacity (MB to PB): embedded resources (L1/L2 cache); local resources (DRAM, SSD, local disk); networked resources (SAN, NAS). A performance barrier has traditionally separated local from networked resources]
Lego Cloud Architecture (Memory Block)
[Diagram: memory cloud management services spanning many physical nodes hosting a variety of VMs; configurations include VM memory guest, VM memory host, and combined guest-and-host, tied together over RRAIM]
Instant Flash Cloning On-Demand
Business Problem
Burst load / service usage that cannot be satisfied in time
Existing solutions
Vendors: Amazon / VMware / RightScale
Start up a VM from a disk image
Requires a full VM OS startup sequence
Hecatonchire Solution
Use a paused VM as the source for copy-on-write (CoW)
Then perform a post-copy live migration
Hecatonchire Value Proposition
Just in time (sub-second) provisioning
Instant Flash Cloning On-Demand
We can clone VMs to meet demand much faster than other solutions
Reduces infrastructure costs while still minimizing lost opportunities => just-in-time provisioning
Requires application integration
We track OS/application metrics in running VMs or in the Load Balancer (LB)
Alerts fire if metrics pass a pre-defined threshold
Based on the alerts we scale up, adding more resources, or scale down to release unutilized resources
Amazon Web Services - Guide
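The alert-driven rule above amounts to comparing a tracked metric against two thresholds. A minimal sketch; the metric name and threshold values are illustrative, not Hecatonchire's configuration:

```python
# Threshold-based scaling decision, as used behind a load balancer:
# above the upper threshold, flash-clone another VM; below the lower
# threshold, release an under-utilized clone; otherwise do nothing.

def scaling_decision(cpu_util, scale_up_at=0.80, scale_down_at=0.30):
    if cpu_util > scale_up_at:
        return "scale-up"      # flash-clone another VM behind the LB
    if cpu_util < scale_down_at:
        return "scale-down"    # release an under-utilized clone
    return "hold"
```

The gap between the two thresholds (hysteresis) prevents the system from oscillating between cloning and releasing on noisy metrics.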
Compute Aggregation: Our next challenge
Cost Effective “Small” HPC Grid
High Performance Computing (HPC)
Supercomputers at the frontline of processing capacity: 10k–100k cores
Typical benchmark: Top 500 (LINPACK, linear algebra)
Small HPC using 10-20 commodity (2 TB / 80 core) nodes
Typical Applications
Relational Databases
Analytics tasks (Linear Algebra)
Simulations
Hecatonchire Value Proposition
Optimal price / performance by using commodity hardware
Operational flexibility: node downtime without downing the cluster
Seamless deployment within existing cloud
Distributed Shared Memory (DSM)
Traditional cluster
Distributed memory
Standard interconnects
OS instance on each node
Distribution handled by application
ccNUMA
Cache coherent shared memory
Fast interconnects
One OS instance
Distribution handled by hardware
Vendors: ScaleMP, Numascale, others
Distributed Shared Memory – Inherent Limitations
Linux provides NUMA topology discovery
Distance between compute cores
Distance between cores to memory
While the Linux OS is aware of the NUMA
layout the application may not be aware …
Cache-coherency may get very expensive
Inter-core: L3 Cache 20 ns
Inter-socket: Main Memory 100 ns
Inter-node (IB): Remote Memory 2,000 ns
Thus the ccNUMA architecture may not be “really” transparent to the application!
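The cost of that non-transparency is easy to quantify from the slide's numbers: the average access latency is a weighted mix of local and remote costs, so even a small remote fraction dominates. A back-of-envelope sketch (the latency values come from the slide; the model itself is a simplification that ignores caching):

```python
# Effective memory latency under ccNUMA as a weighted average of local
# DRAM and remote (InfiniBand) accesses, using the slide's figures.

LAT_NS = {"l3": 20, "local_dram": 100, "remote_ib": 2_000}

def effective_latency_ns(remote_fraction):
    local = 1.0 - remote_fraction
    return local * LAT_NS["local_dram"] + remote_fraction * LAT_NS["remote_ib"]

# Just 10% remote accesses nearly triples the average access cost:
# 0.9 * 100 + 0.1 * 2000 = 290 ns
```

This is why a NUMA-unaware application can see large slowdowns on a ccNUMA cluster even though the hardware presents a single coherent memory.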
Summary
Roadmap
2011
• Live Migration
• Pre-copy XBZRLE Delta Encoding
• Pre-copy LRU page reordering
• Post-copy using RDMA interconnects
2012
• Resource Aggregation
• Cloud Management Integration
• Memory Aggregation – RAIM (Redundant Array of Inexpensive Memory)
• I/O Aggregation – vRAID (virtual Redundant Array of Inexpensive Disks)
• Flash cloning
2013
• Lego Landscape
• CPU Aggregation - ccNUMA
• Flexible resource management
Key takeaways
Hecatonchire extends the standard Linux stack, requiring only standard commodity hardware
With Hecatonchire, unmodified applications or VMs can tap into remote resources transparently
To be released as open source under the GPLv2 and LGPL licenses to the Qemu and Linux communities
Developed by SAP Research TI
Thank you
Contact information:
Benoit Hudzia; Sr. Researcher;
SAP Research CEC Belfast
Aidan Shribman; Sr. Researcher;
SAP Research Israel
Hecatonchire Wiki
https://wiki.wdf.sap.corp/wiki/display/cecbelfast/Hecatonchire%2C++Distributed+Shared+Memory+%28DSM%29+And+Datacenter+Resources+disaggregation+for+Cloud
Appendix
Linux Kernel Virtual Machine (KVM)
Released as a Linux Kernel Module (LKM)
under GPLv2 license in 2007 by Qumranet
Full virtualization via Intel VT-x and AMD-V
virtualization extensions to the x86 instruction
set
Uses Qemu for invoking KVM, for handling of
I/O and for advanced capabilities such as VM
live migration
KVM is considered the primary hypervisor on most major Linux distributions, such as Red Hat and SUSE
Remote Page Faulting Architecture Comparison
Hecatonchire
No context switches
Zero-copy
Uses iWARP RDMA
Yobusame
Context switches into user mode
Uses standard TCP/IP transport
Horofuchi and Yamahata, KVM Forum 2011 Hudzia and Shribman, SYSTOR 2012
Legal Disclaimer
The information in this presentation is confidential and proprietary to SAP and may not be disclosed without the permission of
SAP. This presentation is not subject to your license agreement or any other service or subscription agreement with SAP. SAP
has no obligation to pursue any course of business outlined in this document or any related presentation, or to develop or
release any functionality mentioned therein. This document, or any related presentation and SAP's strategy and possible future
developments, products and or platforms directions and functionality are all subject to change and may be changed by SAP at
any time for any reason without notice. The information on this document is not a commitment, promise or legal obligation to
deliver any material, code or functionality. This document is provided without a warranty of any kind, either express or implied,
including but not limited to, the implied warranties of merchantability, fitness for a particular purpose, or non-infringement. This
document is for informational purposes and may not be incorporated into a contract. SAP assumes no responsibility for errors or
omissions in this document, except if such damages were caused by SAP intentionally or grossly negligent.
All forward-looking statements are subject to various risks and uncertainties that could cause actual results to differ materially
from expectations. Readers are cautioned not to place undue reliance on these forward-looking statements, which speak only as
of their dates, and they should not be relied upon in making purchasing decisions.