Description: This session presents a memory scale-out solution that liberates SAP HANA, and similar memory-demanding enterprise applications, from the classical limitations of the underlying physical servers. The solution relies on key enabling technology developed within SAP. It allows applications or hypervisors to go beyond the boundaries of the underlying hardware, effectively enabling a fluid transformation from commodity-sized physical nodes to very large virtual instances, in order to meet the rapidly growing demands of memory-intensive applications.
Virtualized Memory Scale-Out for SAP HANA
Dr. Benoit Hudzia, SAP Research – Next Business and Technology
Cloud and Virtualization Week, April 2013
© 2013 SAP AG. All rights reserved. | Public
Agenda
• Reducing Memory TCO via Scale-Out
• Memory as a Service
• HANA Memory Scale-Out
• Conclusion & Next Steps
Reducing Memory TCO via Scale-Out
Hecatonchire Project – Memory Cloud
Achieving the Right-sized Servers
• Eliminates unnecessary components
• Increases power efficiency
• Optimizes performance per workload
Using high-volume components to build high-value systems
How Do We Offer Better Memory TCO?

• Memory in the nodes of current clusters is overscaled to fit the requirements of "any" application, yet it remains unused most of the time.
• How can we unleash a memory-constrained application by using the memory of the remaining nodes?

Memory that grows with your business, not before it.
Decouple Memory from Core Aggregation

Many shared-memory parallel applications do not scale beyond a few tens of cores. However, they may still benefit from large amounts of memory:
• In-memory databases
• Data mining
• Virtual machines
• Scientific applications
• etc.
Eliminate the physical limitations of the cloud / data center
Scaling-Out with Fast Networking
Low latency memory transfers can help reduce the need for overprovisioning while also providing much lower latency than swapping data out to disk.
(Figure credit: Chaim Bendalac)
Memory as a Service
Hecatonchire Project – Memory Cloud
The Idea: Turning memory into a distributed memory service
Break memory free from the bounds of the physical box: transparent deployment, with performance at scale and reliability.
High-Level Principle

[Diagram: a memory-demanding process on the memory demander maps its virtual memory address space across the RAM of memory sponsors A and B over the network.]
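A rough illustration of this principle (class and method names are ours, not the Hecatonchire API): a demander treats the sponsors' RAM as one flat, aggregated page range.

```python
class MemorySponsor:
    """A node that lends a slice of its RAM to a demander."""
    def __init__(self, name, size_pages):
        self.name = name
        self.size_pages = size_pages
        self.pages = {}          # page number (local to this sponsor) -> bytes

class MemoryDemander:
    """Maps a flat page range onto the sponsors, first-fit by offset."""
    def __init__(self, sponsors):
        self.sponsors = sponsors

    def sponsor_for(self, page_no):
        """Find which sponsor backs a given global page number."""
        offset = 0
        for s in self.sponsors:
            if page_no < offset + s.size_pages:
                return s, page_no - offset
            offset += s.size_pages
        raise IndexError("page outside the aggregated address space")

    def write(self, page_no, data):
        s, local = self.sponsor_for(page_no)
        s.pages[local] = data

    def read(self, page_no):
        s, local = self.sponsor_for(page_no)
        return s.pages.get(local, b"\x00" * 4096)   # untouched pages read as zero

demander = MemoryDemander([MemorySponsor("A", 1000), MemorySponsor("B", 1000)])
demander.write(1500, b"hello")
print(demander.sponsor_for(1500)[0].name)  # page 1500 lands on sponsor B
```

The point of the sketch is only the address-space split: the process sees one contiguous range, while the data physically lives on whichever sponsor covers that offset.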
Hard Page Fault Resolution Performance

Fabric               | One-way wire time, avg (μs) | Resolution time, avg (μs, 4 KB page) | 4 KB page transfers/sec (with prefetch)
SoftIwarp (10 GbE)   | 150                         | 330                                  | 50k/s
Iwarp (10 GbE)       | 4-6                         | 28                                   | 250k/s
Infiniband (40 Gbps) | 2-4                         | 16                                   | 650k/s
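As a sanity check on these rates (our own arithmetic, not from the slides, assuming 4 KiB pages), the quoted transfer rates translate into the following payload bandwidth:

```python
PAGE = 4096  # bytes per page, assuming 4 KiB pages

def implied_gbps(pages_per_sec):
    """Convert a 4 KiB page transfer rate into Gbit/s of page payload."""
    return pages_per_sec * PAGE * 8 / 1e9

rates = {"SoftIwarp (10GbE)": 50_000,
         "Iwarp (10GbE)": 250_000,
         "Infiniband (40Gbps)": 650_000}

for fabric, rate in rates.items():
    print(f"{fabric}: {implied_gbps(rate):.1f} Gbit/s of page payload")
# SoftIwarp ~1.6, Iwarp ~8.2, Infiniband ~21.3 Gbit/s
```

So Iwarp at 250k pages/s is close to 10 GbE line rate, while Infiniband at 650k pages/s uses roughly half of the 40 Gbps fabric.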
Average Compounded Page Fault Resolution Time (With Prefetch)

[Chart: average resolution time in microseconds, roughly 1,000-6,000 μs, for 1-8 threads, comparing IW 10GbE and IB 40 Gbps under sequential, binary-split, and random-walk access patterns.]
Benchmark of Scaled-Out HANA
Hecatonchire Project – Memory Cloud
Use Case: HANA Memory Scale-Out - Memory Aggregation

• Aggregate the RAM of a cluster into one huge HANA DB

Overview: given n nodes, each with X GB of RAM:
• 1 node runs a VM with the HANA DB (high core count / memory ratio)
• The other n-1 nodes serve as memory containers (low core count / memory ratio)
• Deploy a VM with n*X GB of RAM using Hecatonchire
• Instead of using swap to free up RAM, push pages to remote hosts
• If a page is needed again, request it back

Up to 30% hardware cost reduction compared to a single-node instance
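The push-to-remote-instead-of-swap idea can be sketched as a tiny LRU pager (a simplification of ours, not the actual kernel mechanism): when local RAM overflows, the coldest page goes to a remote host rather than to disk, and comes back on the next access.

```python
from collections import OrderedDict

class RemoteBackedRAM:
    """Local RAM of fixed capacity, spilling cold pages to remote hosts."""
    def __init__(self, local_capacity):
        self.local = OrderedDict()   # page -> data, kept in LRU order
        self.remote = {}             # pages pushed out to remote hosts
        self.capacity = local_capacity

    def access(self, page, data=None):
        if page in self.local:
            self.local.move_to_end(page)          # refresh LRU position
        else:
            if page in self.remote:               # remote page fault
                self.local[page] = self.remote.pop(page)
            else:                                 # first touch
                self.local[page] = b"" if data is None else data
            if len(self.local) > self.capacity:   # evict the coldest page
                victim, vdata = self.local.popitem(last=False)
                self.remote[victim] = vdata       # push to a remote host, not disk
        if data is not None:
            self.local[page] = data
        return self.local[page]

ram = RemoteBackedRAM(local_capacity=2)
ram.access(0, b"a"); ram.access(1, b"b"); ram.access(2, b"c")
print(sorted(ram.remote))   # page 0 was pushed to a remote host
```

Touching page 0 again faults it back in, evicting the current coldest page in turn.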
Memory Scale-Out for SAP HANA – Benchmark Setup

Hardware:
• Server with Intel Xeon Westmere: 4 sockets, 10 cores each, 512 GB RAM
• Fabric: Infiniband QDR 40 Gbps switch + Mellanox ConnectX-2; 10 GbE Ethernet switch + Chelsio T422 NIC

Virtual machine:
• Small: 64 GB RAM, 32 vCPU
• Medium: 128 GB RAM, 40 vCPU
• Hypervisor: KVM
• Application: SAP HANA (in-memory database)
• Workload: OLAP (TPC-H variant)
• Data size: ~600 GB uncompressed
• 18 different queries
• 15 iterations of each query set
Scaling Out HANA (Small Size)

[Chart: query set completion time (s), baseline vs. half-half memory split, for 1 to 60 concurrent users.]

Nb users | HECA overhead per query set
1        | ~3%
20       | ~3%
40       | ~3%
60       | ~3%

Virtual machine: 64 GB RAM, 32 vCPU (Small); Application: HANA (in-memory database); Workload: OLAP (TPC-H variant), ~600 GB uncompressed data set
Hardware: Intel Xeon Westmere, 4 sockets, 10 cores, 512 GB RAM; Fabric: Infiniband QDR 40 Gbps switch + Mellanox ConnectX-2
Per-Query Overhead

[Chart: per-query overhead, from -50% to 200%, for queries 1-18 at 1, 20, and 40 concurrent users.]
Scaling Out HANA (Medium Size)

Virtual machine: 128 GB RAM, 40 vCPU (Medium); Application: HANA (in-memory database); Workload: OLAP (TPC-H variant), 80 users, ~600 GB uncompressed data set
Hardware: Intel Xeon Westmere, 4 sockets, 10 cores, 512 GB RAM; Fabric: Infiniband QDR 40 Gbps switch + Mellanox ConnectX-2

Memory ratio | HECA overhead per query set (80 users)
1:2          | 1%
1:3          | 1.6%
2:1:1        | 0.1%
1:1:1        | 1.5%

[Chart: overhead per memory ratio.]
Production and Cloud Ready
Hecatonchire Project – Memory Cloud
High Availability of Memory Nodes (RRAIM) Running HANA

Memory ratio (Medium, 128 GB, SAP-H benchmark, 80 users) | RRAIM overhead vs. Heca
1:2 | -0.2%
1:3 | -0.1%

[Diagram: HA mirroring of each RAM region across paired remote nodes.]
• Each memory region is backed by two remote nodes.
• Remote page faults and swap-outs are initiated simultaneously to all relevant nodes.
• A node failure has no immediate effect on the computation node.
• When a new remote node enters the cluster, it synchronizes with the computation node and the mirror node.
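A minimal sketch of the mirroring idea (our own toy model, not RRAIM's implementation): every swap-out goes to both mirrors at once, so a fault can be served by whichever mirror is still alive.

```python
class MirroredRemoteMemory:
    """Remote memory backed by two mirror nodes (dicts stand in for nodes)."""
    def __init__(self, node_a, node_b):
        self.mirrors = [node_a, node_b]

    def swap_out(self, page, data):
        for node in self.mirrors:          # initiated simultaneously to both
            if node is not None:
                node[page] = data

    def fault_in(self, page):
        for node in self.mirrors:          # first live mirror answers
            if node is not None and page in node:
                return node[page]
        raise KeyError(page)

    def fail(self, index):
        self.mirrors[index] = None         # node loss: no immediate effect

a, b = {}, {}
mem = MirroredRemoteMemory(a, b)
mem.swap_out(7, b"payload")
mem.fail(0)                                # mirror A dies
print(mem.fault_in(7))                     # still served by mirror B
```

A real implementation would also resynchronize a replacement node, as the bullet above describes; the sketch only shows why a single failure is invisible to the computation node.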
Enterprise-Class Features

• Online RAM deduplication (via KSM) [Diagram: memory footprint before vs. after dedup]
• Automatic memory tiering [Diagram: pages flow from local memory to remote memory, compressed memory, SSD, and HDD across the local and remote nodes]
Enabling Live Migration of a HANA DB (Small Instance)

                                   | Baseline | Pre-Copy (Standard)                            | Post-Copy (Heca)
Downtime                           | N/A      | 7.47 s                                         | 675 ms
Performance degradation (80 users) | 0%       | Benchmark failed (HANA crash, VM unresponsive) | 5%
Next Steps
Hecatonchire Project – Memory Cloud
Transitioning to a Memory Cloud: Transparent Cloud Integration

[Diagram: many physical nodes hosting a variety of VMs (compute VMs acting as memory demanders, combination VMs acting as both sponsor and demander, and memory VMs acting as memory sponsors), pooled into a memory cloud managed by Memory Cloud Management Services (OpenStack, Heca-NOVA).]

PoC Q3 2013:
• Automatically reclaims underutilized or abandoned memory resources
• Automatically and intelligently redeploys memory workloads across the infrastructure
BareBone Memory Scale-Out and Fine-Grained Memory Sharing

BareBone memory scale-out:
• No need for virtualization or a hypervisor
• Similar to Linux control groups (cgroups)
• Allows bare-metal memory scale-out for HANA or any other application

Fine-grained distributed shared memory:
• Memory coherence model: shared nothing, write invalidate, read-write protection
PoC: Q2 2013 (BareBone memory scale-out); Q3 2013 (fine-grained DSM)
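The write-invalidate part of the coherence model can be sketched as a textbook toy (our own model, not the Hecatonchire protocol code): readers accumulate shared copies, and a write invalidates every other copy before it becomes visible.

```python
class WriteInvalidatePage:
    """One shared page under a write-invalidate coherence policy."""
    def __init__(self):
        self.value = 0
        self.holders = set()      # nodes currently holding a valid copy

    def read(self, node):
        self.holders.add(node)    # reader gains a shared copy
        return self.value

    def write(self, node, value):
        self.holders = {node}     # invalidate every other copy first
        self.value = value        # then make the write visible

page = WriteInvalidatePage()
page.read(0); page.read(1)        # nodes 0 and 1 cache the page
page.write(2, 42)                 # node 2 writes: copies on 0 and 1 invalidated
print(sorted(page.holders))       # only the writer keeps a valid copy
```

After the write, nodes 0 and 1 must re-read (and re-fetch the page) before they see any value at all, which is exactly the cost the protocol trades for simple consistency.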
Virtual Distributed Shared Memory System (Compute Cloud)

Compute aggregation idea: a single virtual machine whose compute and memory span multiple physical nodes.

Challenges:
• Coherency protocol
• Granularity (false sharing)

Hecatonchire value proposition:
• Optimal price/performance by using commodity hardware
• Operational flexibility: node downtime without downing the cluster
• Seamless deployment within existing clouds

[Diagram: guest VMs spanning the CPUs, memory, and I/O of servers #1 through #n over fast RDMA communication.]
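The false-sharing challenge is easy to see in a toy model (ours): two nodes update different variables that happen to live on the same 4 KiB page, so ownership of the page ping-pongs between them on every write.

```python
PAGE_SIZE = 4096

class PageOwnershipTracker:
    """Counts page migrations under single-writer, page-granularity coherence."""
    def __init__(self):
        self.owner = {}           # page number -> node currently owning it
        self.transfers = 0        # times a page had to migrate between nodes

    def write(self, node, addr):
        page = addr // PAGE_SIZE
        if self.owner.get(page) not in (None, node):
            self.transfers += 1   # page bounces to the new writer: false sharing
        self.owner[page] = node

tracker = PageOwnershipTracker()
for _ in range(100):
    tracker.write("node0", 0x10)  # variable x at offset 0x10
    tracker.write("node1", 0x20)  # variable y at offset 0x20, same page!
print(tracker.transfers)          # 199: nearly every write forces a migration
```

With finer granularity (or with x and y placed on different pages) the transfer count drops to zero, which is why protocol granularity is listed as a core challenge.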
Future Work
Hecatonchire: Open-Source Project

http://www.hecatonchire.com
https://github.com/hecatonchire/
Open Source Roadmap

Standardization:
• Linux kernel upstream
• KVM/QEMU upstream
• OpenStack module (Nova-Heca)

Collaboration, collaboration, collaboration!
• Academic and industrial
Thank You!

Contact information:
Dr. Benoit Hudzia [email protected]
Hecatonchire Project:
WWW: http://www.hecatonchire.com
Github: https://github.com/hecatonchire/
Backup Slides
How Does It Work (Simplified Version)

[Diagram: remote page fault resolution between physical node A and physical node B over the network fabric.]

On node A (the faulting side):
1. An access to a virtual address misses in the MMU (+ TLB); the page table entry is a remote PTE (a custom swap entry).
2. The coherency engine issues a page request through the RDMA engine, over the network fabric, to node B.

On node B (the owning side):
3. The coherency engine invalidates its PTE and MMU entry, extracts the page, and prepares it for RDMA transfer.
4. The RDMA engine sends the page response back to node A.

Back on node A:
5. The coherency engine writes the PTE and updates the MMU, so the virtual address now resolves to a local physical address.
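The fault path above can be walked through in a toy model (our own; `REMOTE` stands in for the custom swap entry, and a method call stands in for the RDMA round trip):

```python
REMOTE = object()   # sentinel playing the role of the custom swap entry

class NodeB:
    """The node currently owning the page."""
    def __init__(self):
        self.pages = {0x1000: b"remote page"}

    def handle_page_request(self, addr):
        # Extract the page and invalidate the local mapping, then ship it.
        return self.pages.pop(addr)

class NodeA:
    """The faulting node."""
    def __init__(self, peer):
        self.page_table = {0x1000: REMOTE}
        self.peer = peer

    def access(self, addr):
        entry = self.page_table[addr]
        if entry is REMOTE:                          # hard fault: remote PTE hit
            data = self.peer.handle_page_request(addr)
            self.page_table[addr] = data             # PTE write + MMU update
            entry = data
        return entry

node_b = NodeB()
node_a = NodeA(node_b)
print(node_a.access(0x1000))   # first access faults remotely, then is local
```

After the first access, node A owns the page and node B no longer maps it, matching the invalidate-on-transfer step in the flow above.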
Raw Bandwidth Usage
HW: 4-core i5-2500 CPU @ 3.30 GHz; SoftIwarp 10 GbE; Iwarp Chelsio T422 10 GbE; IB ConnectX-2 QDR 40 Gbps

[Chart: total Gbit/s (0-25) for SoftIwarp (SIW), Iwarp (IW), and Infiniband (IB) under sequential, binary-split, and random walks over 1 GB of shared RAM, for 1-7 threads.]

Observations:
• Maxing out bandwidth (IB: possibly not enough cores to saturate)
• No degradation under high load
• Software RDMA has significant overhead
Redundant Array of Inexpensive RAM: RRAIM

1. Each memory region is backed by two remote nodes; remote page faults and swap-outs are initiated simultaneously to all relevant nodes.
2. A node failure has no immediate effect on the computation node.
3. When a new remote node enters the cluster, it synchronizes with the computation node and the mirror node.
Quicksort Benchmark with Memory Constraint

Memory ratio (constrained via cgroup) | HECA overhead | RRAIM overhead
3:4 | 2.08% | 5.21%
1:2 | 2.62% | 6.15%
1:3 | 3.35% | 9.21%
1:4 | 4.15% | 8.68%
1:5 | 4.71% | 9.28%

[Chart: DSM vs. RRAIM overhead (0-10%) across memory ratios, for 512 MB, 1 GB, and 2 GB quicksort data sets.]