Efficient Memory Disaggregation with Infiniswap
Juncheng Gu, Youngmoon Lee, Yiwen Zhang, Mosharaf Chowdhury, Kang G. Shin
Agenda
• Motivation and related work
• Design and system overview
• Implementation and evaluation
• Future work and conclusion
3/30/17 1
Performance degradation
[Figure: bar chart of normalized performance (y-axis, 0 to 1) for VoltDB (TPC-C), Memcached (Facebook/FBSYS), PowerGraph (TunkRank), and GraphX (PageRank), with 100%, 75%, and 50% working sets in memory. Data labels in application order: 75% working sets in memory: 0.18, 0.47, 0.94, 0.97; 50% working sets in memory: 0.04, 0.06, 0.12, 0.04.]
Memory overestimation
Memory underutilization
• Google Cluster Analysis[1]
[Figure: portion of memory (y-axis) over time in days (x-axis), allocated vs. used; annotated at about 0.8 allocated and 0.5 used, a gap of ≈30%.]
How to utilize ABU memory?
Can we utilize this memory?
[1] Reiss, Charles, et al. "Heterogeneity and dynamicity of clouds at scale: Google trace analysis." SoCC’12.
Disaggregate free memory
[Figure: Machine 1 through Machine N, each showing used memory, free memory, and remote memory, first as separate machines and then joined by a Memory Disaggregation Layer.]
What are the challenges?
• Minimize deployment overhead
  • No hardware design
  • No application modification
• Tolerate failures
  • e.g. network disconnection, machine crash
• Manage remote memory at scale
Recent work on memory disaggregation
Criteria compared: no HW design, no app modification, fault-tolerance, scalability
Systems compared:
• Memory Blade [ISCA’09]
• HPBD [CLUSTER’05] / NBDX[1]
• RDMA key-value service (e.g. HERD [SIGCOMM’14], FaRM [NSDI’14])
• Intel Rack Scale Architecture (RSA)[2]
• Infiniswap
[1] https://github.com/accelio/NBDX
[2] http://www.intel.com/content/www/us/en/architecture-and-technology/rack-scale-design-overview.html
System Overview
[Figure: on Machine 1, Application 1 and Application 2 run in user space; in kernel space the Virtual Memory Manager (VMM) sits above the Infiniswap Block Device, which reaches the local disk (async) and the RNIC (sync). On Machine 2, an application and the Infiniswap Daemon run in user space behind that machine's RNIC.]
• Infiniswap Block Device
  • Swap space
  • Request router
• Local disk
  • [ASYNC] Back up swapped-out data
  • Tolerate remote memory failure
• Infiniswap Daemon
  • Local memory region
  • Remote memory service
• RDMA
  • One-sided operations
  • Bypass remote CPU
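The division of labor between remote memory and the local disk can be pictured with a small Python model. This is a sketch under assumptions, not Infiniswap's kernel code: `RequestRouter`, `write_page`, `read_page`, `flush_backups`, and `fail_remote` are hypothetical names, plain dictionaries stand in for the remote daemon's memory and the local disk, and the asynchronous disk backup is modeled as a queue that is drained later.

```python
from collections import deque


class RequestRouter:
    """Toy model of a block device that pages to remote memory with a disk backup."""

    def __init__(self):
        self.remote = {}          # stands in for memory exposed by a remote daemon
        self.disk = {}            # stands in for the local backup disk
        self.pending = deque()    # backups queued but not yet written to disk

    def write_page(self, page_id, data):
        # Synchronous path: the write completes once remote memory holds the page.
        self.remote[page_id] = data
        # Asynchronous path: the disk backup is only queued here.
        self.pending.append((page_id, data))

    def flush_backups(self):
        # Drain the queue; a real system would do this in the background.
        while self.pending:
            page_id, data = self.pending.popleft()
            self.disk[page_id] = data

    def fail_remote(self):
        # Simulate a network disconnection or a remote machine crash.
        self.remote.clear()

    def read_page(self, page_id):
        # Prefer remote memory; fall back to the disk copy after a remote failure.
        if page_id in self.remote:
            return self.remote[page_id]
        return self.disk[page_id]


router = RequestRouter()
router.write_page(100, b"swapped-out data")
router.flush_backups()
router.fail_remote()
recovered = router.read_page(100)
```

In this model the page survives the simulated remote failure only because the backup was flushed first; a failure that arrives before the flush is not handled by the sketch.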
How to meet the design objectives?
Objectives | Ideas
No hardware design, no application modification | Remote paging
Fault-tolerance | Local backup disk
Scalability | Decentralized remote memory management
One-to-many
[Figure: the Infiniswap Block Device on Machine 1 (kernel space, below the VMM and Application 1 and Application 2) connects through RNICs to Infiniswap Daemons in user space on Machine 2 and Machine 3 (sync), and to its local disk (async).]
Many-to-many
[Figure: Infiniswap Block Devices on Machine 1 and Machine 4, each with its own local disk (async), connect through RNICs to Infiniswap Daemons on Machine 2 and Machine 3 (sync).]
How to scale remote memory?
• How to find remote memory in the cluster?
• Which remote mapping should be evicted?
Management unit: memory page?
[Figure: an Infiniswap Block Device mapping individual pages to three Infiniswap Daemons.]
Local Page | Remote Page
p100 | <s1, p1>
• 1GB = 256K entries
• 1GB = 256K RTTs
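The 256K figure follows from dividing 1GB by the page size. A quick check in Python, assuming 4KB pages and binary units (1GB = 2^30 bytes); neither assumption is stated on the slide, and `COARSE_UNIT_BYTES` is a hypothetical coarser management unit chosen only for comparison:

```python
GIB = 2**30                # 1GB in binary units (assumption)
PAGE_BYTES = 4 * 2**10     # 4KB page size (assumption; common on x86 Linux)
COARSE_UNIT_BYTES = GIB    # hypothetical coarser unit, for comparison only


def mapping_entries(total_bytes, unit_bytes):
    """Number of mapping entries needed when memory is managed in fixed-size units."""
    return total_bytes // unit_bytes


page_entries = mapping_entries(GIB, PAGE_BYTES)
coarse_entries = mapping_entries(GIB, COARSE_UNIT_BYTES)
print(page_entries, page_entries // 1024)  # prints "262144 256"
print(coarse_entries)                      # prints "1"
```

Under these assumptions, per-page management needs one table entry and one round trip per page, which is where both 256K numbers come from; a coarser unit shrinks both counts in proportion to its size.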
Management unit: memory slab!
[Figure: an Infiniswap Block Device and three Infiniswap Daemons, with memory mapped at slab granularity.]
Which remote machine should be selected?
Goal: balance memory utilization
• Central controller
• Decentralized approach
[Figure: an Infiniswap Block Device choosing among three Infiniswap Daemons.]
Power of two choices[1]
[Figure: an Infiniswap Block Device and three Infiniswap Daemons.]
[1] Mitzenmacher, Michael. "The power of two choices in randomized load balancing." Ph.D. thesis, U.C. Berkeley, 1996.
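A minimal sketch of the power-of-two-choices idea applied to the stated goal of balancing memory utilization. It assumes the block device can learn the free memory of each machine it samples; `pick_machine`, `place_slabs`, and the cluster sizes are hypothetical and are not taken from Infiniswap's code.

```python
import random


def pick_machine(free_memory, rng=random):
    """Sample two distinct machines at random and return the one with more free memory."""
    first, second = rng.sample(sorted(free_memory), 2)
    return first if free_memory[first] >= free_memory[second] else second


def place_slabs(free_memory, num_slabs, slab_gb=1, rng=random):
    """Place slabs one at a time, updating free memory after each placement."""
    free = dict(free_memory)  # work on a copy so the caller's view is unchanged
    placement = []
    for _ in range(num_slabs):
        target = pick_machine(free, rng)
        free[target] -= slab_gb
        placement.append(target)
    return placement, free


rng = random.Random(0)
cluster = {f"machine{i}": 32 for i in range(1, 9)}  # made-up free memory in GB
placement, remaining = place_slabs(cluster, num_slabs=64, rng=rng)
```

Each placement inspects only two machines, so no central controller or cluster-wide view is needed, yet the choice still leans toward the less loaded machine.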
Slab eviction
[Figure: an Infiniswap Daemon with slabs 1 to 4, each either a mapped slab or an unmapped slab, shown against the host's remote memory and used memory across three steps.]
Which slab should be evicted?
• Daemon: Does not know the swap activities
• Daemon: Too expensive to query all the slabs
[Figure: an Infiniswap Daemon with slabs 1 to 4.]
Power of multiple choices[1]
Select E least-active slabs from E+E’ random slabs
[Figure: an Infiniswap Daemon with slabs 1 to 4 before eviction, and slabs 1, 2, and 4 after.]
[1] Park, Gahyun. "A generalization of multiple choice balls-into-bins." PODC’11.
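The selection rule can be sketched in a few lines of Python. This is an illustration under assumptions: `choose_slabs_to_evict` is a hypothetical name, the `activity` dictionary stands in for whatever activity measure the daemon obtains for the slabs it samples, and the numbers are made up.

```python
import random


def choose_slabs_to_evict(activity, num_evict, extra_choices, rng=random):
    """Sample num_evict + extra_choices slabs and return the num_evict least active."""
    sample_size = min(len(activity), num_evict + extra_choices)
    candidates = rng.sample(sorted(activity), sample_size)
    # Only the sampled slabs are looked up, not every mapped slab.
    candidates.sort(key=lambda slab: activity[slab])
    return candidates[:num_evict]


# Four slabs with made-up activity levels; E = 1 and E' = 3 sample all four.
activity = {1: 50, 2: 40, 3: 5, 4: 70}
evicted = choose_slabs_to_evict(activity, num_evict=1, extra_choices=3)
```

With E' = 0 the rule degenerates to evicting random slabs; a larger E' costs more lookups but makes it more likely that the evicted slabs are truly among the least active.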
Implementation
• Connection Management
  • One RDMA connection per active block device - daemon pair
• Control Plane
  • SEND, RECV
• Data Plane
  • One-sided RDMA READ, WRITE
[Figure: the Infiniswap Block Device in kernel space and the Infiniswap Daemon in user space, connected by RDMA.]
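The connection rule can be modeled as a table keyed by the (block device, daemon) pair. The sketch below is a plain Python model under assumptions, not RDMA code: `ConnectionTable`, `fake_connect`, and the device and daemon names are hypothetical, and the `PLANES` mapping only restates which operations the slide assigns to each plane.

```python
class ConnectionTable:
    """Keeps at most one connection per active (block device, daemon) pair."""

    def __init__(self, connect):
        self._connect = connect      # callable that opens a connection; hypothetical
        self._connections = {}

    def get(self, device, daemon):
        # Reuse the existing connection for this pair, or open one on first use.
        key = (device, daemon)
        if key not in self._connections:
            self._connections[key] = self._connect(device, daemon)
        return self._connections[key]

    def release(self, device, daemon):
        # Drop the connection once the pair is no longer active.
        self._connections.pop((device, daemon), None)

    def __len__(self):
        return len(self._connections)


# Which operations each plane uses, as listed on the slide.
PLANES = {
    "control": ("SEND", "RECV"),
    "data": ("RDMA READ", "RDMA WRITE"),
}

opened = []


def fake_connect(device, daemon):
    # Stand-in for real connection setup; records each call for inspection.
    opened.append((device, daemon))
    return f"{device}<->{daemon}"


table = ConnectionTable(fake_connect)
table.get("device1", "daemonA")
table.get("device1", "daemonA")   # reused, no second connection is opened
table.get("device1", "daemonB")
```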
What are we expecting from Infiniswap?
• Application performance
• Cluster memory utilization
• Network usage
• Eviction overhead
• Fault-tolerance overhead
• Performance as a block device
Evaluation
• 32-node cluster
• InfiniBand network
• 2 x 8 cores (32 vcores)
• 64GB DRAM
• 56Gbps InfiniBand NIC
Application performance
• 50% working sets in memory
• Application performance is improved by 2-16x
[Figure: bar chart of normalized performance (y-axis, 0 to 1) per application, comparing 100% working sets in memory, Disk + 50% working sets in memory, and Infiniswap + 50% working sets in memory. Data labels:]
Application | Disk + 50% | Infiniswap + 50%
VoltDB (TPC-C) | 0.04 | 0.66
Memcached (Facebook/FBSYS) | 0.06 | 0.77
PowerGraph (TunkRank) | 0.12 | 0.61
GraphX (PageRank) | 0.04 | 0.08
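The 2-16x range can be checked against the chart's data labels. The snippet below assumes the labels map to applications in left-to-right order (VoltDB, Memcached, PowerGraph, GraphX), which is inferred from the chart layout rather than stated on the slide.

```python
# Normalized performance from the chart's data labels
# (1.0 = 100% working sets in memory). The label-to-application
# mapping is an assumption based on left-to-right order.
disk = {"VoltDB": 0.04, "Memcached": 0.06, "PowerGraph": 0.12, "GraphX": 0.04}
infiniswap = {"VoltDB": 0.66, "Memcached": 0.77, "PowerGraph": 0.61, "GraphX": 0.08}

# Improvement of Infiniswap over disk at 50% working sets in memory.
speedup = {app: infiniswap[app] / disk[app] for app in disk}
low, high = min(speedup.values()), max(speedup.values())

for app, ratio in speedup.items():
    print(f"{app}: {ratio:.1f}x")
```

Under that mapping the smallest ratio comes from GraphX and the largest from VoltDB, which brackets the quoted range.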
Cluster memory utilization
• 90 containers (applications), mixing all applications and memory constraints.
• Cluster memory utilization is improved from 40.8% to 60% (1.47x)
[Figure: memory utilization (%) by rank of machines, with Infiniswap vs. w/o Infiniswap.]
Limitations and future work
• Trade-off in fault-tolerance
  • Local disk is the bottleneck
  • Multiple remote replicas
  • Fault-tolerance vs. space-efficiency
• Performance isolation among applications
  • W/o limitation on each application’s usage
  • W/o mapping between remote memory and applications
Conclusion
• Infiniswap: remote paging over RDMA
  • Application performance
  • Cluster memory utilization
• Efficient, practical memory disaggregation
  • No hardware design
  • No application modification
  • Fault-tolerance
  • Scalability
Source code is coming soon!
https://github.com/Infiniswap/infiniswap.git