Description: This session presents a memory scale-out solution that liberates SAP HANA, and similar memory-demanding enterprise applications, from the classical limitations of the underlying physical servers. The solution relies on key enabling technology developed within SAP. It allows applications or hypervisors to go beyond the boundaries of the underlying hardware, effectively enabling a fluid transformation from commodity-sized physical nodes to very large virtual instances, in order to meet the rapidly growing demands of memory-intensive applications.
Virtualized Memory Scale-Out for SAP HANA
Dr. Benoit Hudzia, SAP Research – Next Business and Technology
Cloud and Virtualization Week, April 2013
© 2013 SAP AG. All rights reserved. | Public
Agenda
• Reducing Memory TCO via Scale-Out
• Memory as a Service
• HANA Memory Scale-Out
• Conclusion & Next Steps
Reducing Memory TCO via Scale-Out
Hecatonchire Project – Memory Cloud
Achieving the Right-sized Servers
• Eliminates unnecessary components
• Increases power efficiency
• Optimizes performance per workload
Using high-volume components to build high-value systems
How Do We Offer Better Memory TCO?

• Memory in the nodes of current clusters is overscaled to fit the requirements of "any" application, yet it remains unused most of the time.
• How can we unleash a memory-constrained application by using the memory of the remaining nodes?

Memory that grows with your business, not before it.
Decouple Memory from Core Aggregation

Many shared-memory parallel applications do not scale beyond a few tens of cores. However, they may still benefit from large amounts of memory:
• In-memory databases
• Data mining
• Virtual machines
• Scientific applications
• etc.
Eliminate the physical limitations of the cloud / data center
Scaling-Out with Fast Networking
Low latency memory transfers can help reduce the need for overprovisioning while also providing much lower latency than swapping data out to disk.
(Figure credit: Chaim Bendalac)
Memory as a Service
Hecatonchire Project – Memory Cloud
The Idea: Turning memory into a distributed memory service
Break memory free from the bounds of the physical box: transparent deployment, with performance at scale and reliability.
High-Level Principle

[Diagram: a memory-demanding process on the memory demander maps its virtual memory address space across the RAM of memory sponsors A and B over the network.]
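A rough illustration of this principle (class and method names are ours, not the Hecatonchire API): a demander treats the sponsors' RAM as one flat, aggregated page range.

```python
class MemorySponsor:
    """A node that lends a slice of its RAM to a demander."""
    def __init__(self, name, size_pages):
        self.name = name
        self.size_pages = size_pages
        self.pages = {}          # page number (local to this sponsor) -> bytes

class MemoryDemander:
    """Maps a flat page range onto the sponsors, first-fit by offset."""
    def __init__(self, sponsors):
        self.sponsors = sponsors

    def sponsor_for(self, page_no):
        """Find which sponsor backs a given global page number."""
        offset = 0
        for s in self.sponsors:
            if page_no < offset + s.size_pages:
                return s, page_no - offset
            offset += s.size_pages
        raise IndexError("page outside the aggregated address space")

    def write(self, page_no, data):
        s, local = self.sponsor_for(page_no)
        s.pages[local] = data

    def read(self, page_no):
        s, local = self.sponsor_for(page_no)
        return s.pages.get(local, b"\x00" * 4096)   # untouched pages read as zero

demander = MemoryDemander([MemorySponsor("A", 1000), MemorySponsor("B", 1000)])
demander.write(1500, b"hello")
print(demander.sponsor_for(1500)[0].name)  # page 1500 lands on sponsor B
```

The point of the sketch is only the address-space split: the process sees one contiguous range, while the data physically lives on whichever sponsor covers that offset.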
Hard Page Fault Resolution Performance

Fabric               | One-way wire time, avg (μs) | Resolution time, avg (μs, 4 KB page) | 4 KB page transfers/sec (with prefetch)
SoftIwarp (10 GbE)   | 150                         | 330                                  | 50k/s
Iwarp (10 GbE)       | 4-6                         | 28                                   | 250k/s
Infiniband (40 Gbps) | 2-4                         | 16                                   | 650k/s
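As a sanity check on these rates (our own arithmetic, not from the slides, assuming 4 KiB pages), the quoted transfer rates translate into the following payload bandwidth:

```python
PAGE = 4096  # bytes per page, assuming 4 KiB pages

def implied_gbps(pages_per_sec):
    """Convert a 4 KiB page transfer rate into Gbit/s of page payload."""
    return pages_per_sec * PAGE * 8 / 1e9

rates = {"SoftIwarp (10GbE)": 50_000,
         "Iwarp (10GbE)": 250_000,
         "Infiniband (40Gbps)": 650_000}

for fabric, rate in rates.items():
    print(f"{fabric}: {implied_gbps(rate):.1f} Gbit/s of page payload")
# SoftIwarp ~1.6, Iwarp ~8.2, Infiniband ~21.3 Gbit/s
```

So Iwarp at 250k pages/s is close to 10 GbE line rate, while Infiniband at 650k pages/s uses roughly half of the 40 Gbps fabric.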
Average Compounded Page Fault Resolution Time (With Prefetch)

[Chart: average resolution time in microseconds, roughly 1,000-6,000 μs, for 1-8 threads, comparing IW 10GbE and IB 40 Gbps under sequential, binary-split, and random-walk access patterns.]
Benchmark of Scaled-Out HANA
Hecatonchire Project – Memory Cloud
Use Case: HANA Memory Scale-Out - Memory Aggregation

• Aggregate the RAM of a cluster into one huge HANA DB

Overview: given n nodes, each with X GB of RAM:
• 1 node runs a VM with the HANA DB (high core count / memory ratio)
• The other n-1 nodes serve as memory containers (low core count / memory ratio)
• Deploy a VM with n*X GB of RAM using Hecatonchire
• Instead of using swap to free up RAM, push pages to remote hosts
• If a page is needed again, request it back

Up to 30% hardware cost reduction compared to a single-node instance
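The push-to-remote-instead-of-swap idea can be sketched as a tiny LRU pager (a simplification of ours, not the actual kernel mechanism): when local RAM overflows, the coldest page goes to a remote host rather than to disk, and comes back on the next access.

```python
from collections import OrderedDict

class RemoteBackedRAM:
    """Local RAM of fixed capacity, spilling cold pages to remote hosts."""
    def __init__(self, local_capacity):
        self.local = OrderedDict()   # page -> data, kept in LRU order
        self.remote = {}             # pages pushed out to remote hosts
        self.capacity = local_capacity

    def access(self, page, data=None):
        if page in self.local:
            self.local.move_to_end(page)          # refresh LRU position
        else:
            if page in self.remote:               # remote page fault
                self.local[page] = self.remote.pop(page)
            else:                                 # first touch
                self.local[page] = b"" if data is None else data
            if len(self.local) > self.capacity:   # evict the coldest page
                victim, vdata = self.local.popitem(last=False)
                self.remote[victim] = vdata       # push to a remote host, not disk
        if data is not None:
            self.local[page] = data
        return self.local[page]

ram = RemoteBackedRAM(local_capacity=2)
ram.access(0, b"a"); ram.access(1, b"b"); ram.access(2, b"c")
print(sorted(ram.remote))   # page 0 was pushed to a remote host
```

Touching page 0 again faults it back in, evicting the current coldest page in turn.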
Memory Scale-Out for SAP HANA – Benchmark Setup

Hardware:
• Server with Intel Xeon Westmere: 4 sockets, 10 cores each, 512 GB RAM
• Fabric: Infiniband QDR 40 Gbps switch + Mellanox ConnectX-2; 10 GbE Ethernet switch + Chelsio T422 NIC

Virtual machine:
• Small: 64 GB RAM, 32 vCPU
• Medium: 128 GB RAM, 40 vCPU
• Hypervisor: KVM
• Application: SAP HANA (in-memory database)
• Workload: OLAP (TPC-H variant)
• Data size: ~600 GB uncompressed
• 18 different queries
• 15 iterations of each query set
Scaling Out HANA (Small Size)

[Chart: query set completion time (s), baseline vs. half-half memory split, for 1 to 60 concurrent users.]

Nb users | HECA overhead per query set
1        | ~3%
20       | ~3%
40       | ~3%
60       | ~3%

Virtual machine: 64 GB RAM, 32 vCPU (Small); Application: HANA (in-memory database); Workload: OLAP (TPC-H variant), ~600 GB uncompressed data set
Hardware: Intel Xeon Westmere, 4 sockets, 10 cores, 512 GB RAM; Fabric: Infiniband QDR 40 Gbps switch + Mellanox ConnectX-2
Per-Query Overhead

[Chart: per-query overhead, from -50% to 200%, for queries 1-18 at 1, 20, and 40 concurrent users.]
Scaling Out HANA (Medium Size)

Virtual machine: 128 GB RAM, 40 vCPU (Medium); Application: HANA (in-memory database); Workload: OLAP (TPC-H variant), 80 users, ~600 GB uncompressed data set
Hardware: Intel Xeon Westmere, 4 sockets, 10 cores, 512 GB RAM; Fabric: Infiniband QDR 40 Gbps switch + Mellanox ConnectX-2

Memory ratio | HECA overhead per query set (80 users)
1:2          | 1%
1:3          | 1.6%
2:1:1        | 0.1%
1:1:1        | 1.5%

[Chart: overhead per memory ratio.]
Production and Cloud Ready
Hecatonchire Project – Memory Cloud
High Availability of Memory Nodes (RRAIM) Running HANA

Memory ratio (Medium, 128 GB, SAP-H benchmark, 80 users) | RRAIM overhead vs. Heca
1:2 | -0.2%
1:3 | -0.1%

[Diagram: HA mirroring of each RAM region across paired remote nodes.]
• Each memory region is backed by two remote nodes.
• Remote page faults and swap-outs are initiated simultaneously to all relevant nodes.
• A node failure has no immediate effect on the computation node.
• When a new remote node enters the cluster, it synchronizes with the computation node and the mirror node.
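A minimal sketch of the mirroring idea (our own toy model, not RRAIM's implementation): every swap-out goes to both mirrors at once, so a fault can be served by whichever mirror is still alive.

```python
class MirroredRemoteMemory:
    """Remote memory backed by two mirror nodes (dicts stand in for nodes)."""
    def __init__(self, node_a, node_b):
        self.mirrors = [node_a, node_b]

    def swap_out(self, page, data):
        for node in self.mirrors:          # initiated simultaneously to both
            if node is not None:
                node[page] = data

    def fault_in(self, page):
        for node in self.mirrors:          # first live mirror answers
            if node is not None and page in node:
                return node[page]
        raise KeyError(page)

    def fail(self, index):
        self.mirrors[index] = None         # node loss: no immediate effect

a, b = {}, {}
mem = MirroredRemoteMemory(a, b)
mem.swap_out(7, b"payload")
mem.fail(0)                                # mirror A dies
print(mem.fault_in(7))                     # still served by mirror B
```

A real implementation would also resynchronize a replacement node, as the bullet above describes; the sketch only shows why a single failure is invisible to the computation node.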
Enterprise-Class Features

• Online RAM deduplication (via KSM) [Diagram: memory footprint before vs. after dedup]
• Automatic memory tiering [Diagram: pages flow from local memory to remote memory, compressed memory, SSD, and HDD across the local and remote nodes]
Enabling Live Migration of a HANA DB (Small Instance)

                                   | Baseline | Pre-Copy (Standard)                            | Post-Copy (Heca)
Downtime                           | N/A      | 7.47 s                                         | 675 ms
Performance degradation (80 users) | 0%       | Benchmark failed (HANA crash, VM unresponsive) | 5%
Next Steps
Hecatonchire Project – Memory Cloud
Transitioning to a Memory Cloud: Transparent Cloud Integration

[Diagram: many physical nodes hosting a variety of VMs (compute VMs acting as memory demanders, combination VMs acting as both sponsor and demander, and memory VMs acting as memory sponsors), pooled into a memory cloud managed by Memory Cloud Management Services (OpenStack, Heca-NOVA).]

PoC Q3 2013:
• Automatically reclaims underutilized or abandoned memory resources
• Automatically and intelligently redeploys memory workloads across the infrastructure
BareBone Memory Scale-Out and Fine-Grained Memory Sharing

BareBone memory scale-out:
• No need for virtualization or a hypervisor
• Similar to Linux control groups (cgroups)
• Allows bare-metal memory scale-out for HANA or any other application

Fine-grained distributed shared memory:
• Memory coherence model: shared nothing, write invalidate, read-write protection
PoC: Q2 2013 (BareBone memory scale-out); Q3 2013 (fine-grained DSM)
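The write-invalidate part of the coherence model can be sketched as a textbook toy (our own model, not the Hecatonchire protocol code): readers accumulate shared copies, and a write invalidates every other copy before it becomes visible.

```python
class WriteInvalidatePage:
    """One shared page under a write-invalidate coherence policy."""
    def __init__(self):
        self.value = 0
        self.holders = set()      # nodes currently holding a valid copy

    def read(self, node):
        self.holders.add(node)    # reader gains a shared copy
        return self.value

    def write(self, node, value):
        self.holders = {node}     # invalidate every other copy first
        self.value = value        # then make the write visible

page = WriteInvalidatePage()
page.read(0); page.read(1)        # nodes 0 and 1 cache the page
page.write(2, 42)                 # node 2 writes: copies on 0 and 1 invalidated
print(sorted(page.holders))       # only the writer keeps a valid copy
```

After the write, nodes 0 and 1 must re-read (and re-fetch the page) before they see any value at all, which is exactly the cost the protocol trades for simple consistency.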
Virtual Distributed Shared Memory System (Compute Cloud)

Compute aggregation idea: a single virtual machine whose compute and memory span multiple physical nodes.

Challenges:
• Coherency protocol
• Granularity (false sharing)

Hecatonchire value proposition:
• Optimal price/performance by using commodity hardware
• Operational flexibility: node downtime without downing the cluster
• Seamless deployment within existing clouds

[Diagram: guest VMs spanning the CPUs, memory, and I/O of servers #1 through #n over fast RDMA communication.]
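The false-sharing challenge is easy to see in a toy model (ours): two nodes update different variables that happen to live on the same 4 KiB page, so ownership of the page ping-pongs between them on every write.

```python
PAGE_SIZE = 4096

class PageOwnershipTracker:
    """Counts page migrations under single-writer, page-granularity coherence."""
    def __init__(self):
        self.owner = {}           # page number -> node currently owning it
        self.transfers = 0        # times a page had to migrate between nodes

    def write(self, node, addr):
        page = addr // PAGE_SIZE
        if self.owner.get(page) not in (None, node):
            self.transfers += 1   # page bounces to the new writer: false sharing
        self.owner[page] = node

tracker = PageOwnershipTracker()
for _ in range(100):
    tracker.write("node0", 0x10)  # variable x at offset 0x10
    tracker.write("node1", 0x20)  # variable y at offset 0x20, same page!
print(tracker.transfers)          # 199: nearly every write forces a migration
```

With finer granularity (or with x and y placed on different pages) the transfer count drops to zero, which is why protocol granularity is listed as a core challenge.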
Future Work
Hecatonchire: Open-Source Project

http://www.hecatonchire.com
https://github.com/hecatonchire/
Open Source Roadmap

Standardization:
• Linux kernel upstream
• KVM/QEMU upstream
• OpenStack module (Nova-Heca)

Collaboration, collaboration, collaboration!
• Academic and industrial
Thank You!

Contact information:
Dr. Benoit Hudzia [email protected]
Hecatonchire Project:
WWW: http://www.hecatonchire.com
Github: https://github.com/hecatonchire/
Backup Slides
How Does It Work (Simplified Version)

[Diagram: remote page fault resolution between physical node A and physical node B over the network fabric.]

On node A (the faulting side):
1. An access to a virtual address misses in the MMU (+ TLB); the page table entry is a remote PTE (a custom swap entry).
2. The coherency engine issues a page request through the RDMA engine, over the network fabric, to node B.

On node B (the owning side):
3. The coherency engine invalidates its PTE and MMU entry, extracts the page, and prepares it for RDMA transfer.
4. The RDMA engine sends the page response back to node A.

Back on node A:
5. The coherency engine writes the PTE and updates the MMU, so the virtual address now resolves to a local physical address.
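The fault path above can be walked through in a toy model (our own; `REMOTE` stands in for the custom swap entry, and a method call stands in for the RDMA round trip):

```python
REMOTE = object()   # sentinel playing the role of the custom swap entry

class NodeB:
    """The node currently owning the page."""
    def __init__(self):
        self.pages = {0x1000: b"remote page"}

    def handle_page_request(self, addr):
        # Extract the page and invalidate the local mapping, then ship it.
        return self.pages.pop(addr)

class NodeA:
    """The faulting node."""
    def __init__(self, peer):
        self.page_table = {0x1000: REMOTE}
        self.peer = peer

    def access(self, addr):
        entry = self.page_table[addr]
        if entry is REMOTE:                          # hard fault: remote PTE hit
            data = self.peer.handle_page_request(addr)
            self.page_table[addr] = data             # PTE write + MMU update
            entry = data
        return entry

node_b = NodeB()
node_a = NodeA(node_b)
print(node_a.access(0x1000))   # first access faults remotely, then is local
```

After the first access, node A owns the page and node B no longer maps it, matching the invalidate-on-transfer step in the flow above.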
Raw Bandwidth Usage
HW: 4-core i5-2500 CPU @ 3.30 GHz; SoftIwarp 10 GbE; Iwarp Chelsio T422 10 GbE; IB ConnectX-2 QDR 40 Gbps

[Chart: total Gbit/s (0-25) for SoftIwarp (SIW), Iwarp (IW), and Infiniband (IB) under sequential, binary-split, and random walks over 1 GB of shared RAM, for 1-7 threads.]

Observations:
• Maxing out bandwidth (IB: possibly not enough cores to saturate)
• No degradation under high load
• Software RDMA has significant overhead
Redundant Array of Inexpensive RAM: RRAIM

1. Each memory region is backed by two remote nodes; remote page faults and swap-outs are initiated simultaneously to all relevant nodes.
2. A node failure has no immediate effect on the computation node.
3. When a new remote node enters the cluster, it synchronizes with the computation node and the mirror node.
Quicksort Benchmark with Memory Constraint

Memory ratio (constrained via cgroup) | HECA overhead | RRAIM overhead
3:4 | 2.08% | 5.21%
1:2 | 2.62% | 6.15%
1:3 | 3.35% | 9.21%
1:4 | 4.15% | 8.68%
1:5 | 4.71% | 9.28%

[Chart: DSM vs. RRAIM overhead (0-10%) across memory ratios, for 512 MB, 1 GB, and 2 GB quicksort data sets.]