Kata Containers and gVisor: a Quantitative Comparison

Xu Wang, hyper.sh; Kata Containers Architecture Committee

Fupan Li, hyper.sh; Kata Containers Upstream Developer

“All problems in computer science can be solved by another level of indirection, except of course for the problem of too many indirections.”

----David Wheeler

Agenda

• Secure Containers, or Linux ABI Oriented Virtualization

• Architecture Comparison

• Benchmarks

• The Future of the Secure Container Projects

What’s Kata Containers

• A container runtime, like runC

• Built w/ virtualization tech, like VM

• Initiated by hyper.sh and Intel®

• Hosted by OpenStack Foundation

• Contributed by Huawei, Google, MSFT, etc.

Kata Containers is a virtualized container.

What’s gVisor

• A container runtime, like runC, too

• A user-space kernel, written in Go

• Implements a substantial portion of the Linux system surface

• Developed at Google and open-sourced in mid-2018

gVisor implements Linux by way of Linux.

The Similarity

• OCI compatible container runtime

• Linux ABI for container applications

• Support OCI runtime spec (docker, k8s, zun…)

• Extra isolation layer

• Guest Kernel, or say, Linux* in Linux

• Do not expose system interface of the host to the container apps

• Goals

• Less overhead, better isolation

[Figure: General Architecture of Kata Containers and gVisor. Labels: Container App; Linux ABI; BLACK BOX; OCI Interface Handling; Linux ABI; Host OS.]

Architecture Comparison

Run Kata Containers

• Employ an existing (or improved) VMM and a standard Linux kernel

• Per-container shims and per-sandbox proxy are stand-alone binaries, or built-in as library (such as in containerd shim v2 implementation)

• An agent in guest OS based on libcontainer

[Figure: Kata Containers components. Labels: runC-like cmdline; Kata runtime; shim (IO-stub process for container apps); proxy (no proxy when vsock is enabled); virt-serial or vsock; VMM (e.g. Qemu); Kata Agent; Container App (×2); Guest Linux Kernel; Host Kernel.]

Run gVisor

• All in one binary (runsc + Sentry + Gofer)

• Sentry as a user-space kernel, and per-container process stub

• Gofer for 9p IO

[Figure: gVisor components. Labels: runC-like cmdline; runsc; Sentry (×2); Container App (×2); Gofer (×2); 9p; Syscall redirect; Host Kernel.]

Detailed Architecture – Kata

• Classic hardware-assisted virtualization

• Provides a virtual hardware interface to the guest

• 2 layers of isolation
  • KVM and CPU ring protection

• Many ways to optimize IO
  • Pass-through
  • vhost and vhost-user

[Figure: Kata detailed architecture. Labels: agent; App proc (×2); Guest Kernel; Drivers; Qemu devices; kvm; vhost; vhost-user; passthru; Host Kernel.]

Detailed Architecture – gVisor (1)

• 2 layers of isolation

• Minimal kernel, written in Go
  • Safer, but the compatibility…

• Ring protection and a small set of syscalls
  • Avoids access to the more bug-prone syscalls

• File operations are handled in a separate Gofer process

• Syscall interception by ptrace or KVM
  • The performance impact…

[Figure: gVisor detailed architecture. Labels: App proc; Sentry; Gofer; Syscall interception; seccomp/namespace; Host Kernel.]

Detailed Architecture – gVisor (2)

• In gVisor-kvm:

• Sentry is both host process and guest kernel
  • Ring0 lower in guest
  • Upper in VM
  • In Host

• KVM is used for higher syscall interception performance rather than isolation

• No isolation between the different modes of Sentry itself

[Figure: gVisor-kvm architecture. Labels: App proc; Sentry as guest kernel; Sentry in host; Gofer; kvm: Syscall interception; seccomp/namespace; Host Kernel.]

Expectations based on the Arch

• Isolation:
  • Both include 2 isolation layers, giving better isolation than runC

• Compatibility:
  • Kata should have better compatibility than gVisor

• Performance:
  • Both should have little overhead on CPU/memory
  • gVisor should have a smaller memory footprint than Kata, and may boot faster
  • gVisor may struggle with syscall-heavy (including IO-heavy) workloads
  • Kata may benefit from several IO optimization options

Benchmarks and Analytics

Benchmarks & Setup

• Functional Tests
  • Syscall coverage

• Standard Benchmarks
  • Boot time
  • Memory footprint
  • CPU/Memory Performance
  • IO Performance
  • Networking Performance

• Real-life cases
  • nginx, redis, tensorflow
  • gVisor didn’t complete a mysql test, and a jenkins test (pull and build) failed due to a ‘git clone’ failure

Testing Setup

• Most of the Cases: server on packet.net

CPU: 4 Physical Cores @ 2.0 GHz (1 × E3-1578L)

Memory: 32GB DDR4-2400 ECC RAM

Storage: 240 GB of SSD (1 × 240 GB)

Network: 10 Gbps Network (2 × Intel X710 NICs in TLB)

• Disk IO cases: a more powerful server with additional disks

CPU: 24 Physical Cores @ 2.2 GHz (2 × E5-2650 V4)

Memory: 256 GB of DDR4 ECC RAM (16 × 16 GB)

Storage: 2.8 TB of SSD (6 × 480GB SSD)

• Most of the test containers have 8GB memory if not further noted

• For networking tests, two servers play roles as client and server

[Figure: testing topology. Labels: Testing client; container; host; disks.]

Syscall Coverage & Memory Footprint

• gVisor calls about 70 host syscalls, according to the code

• For Kata, we counted the syscalls made by the qemu processes in different cases
  • apt-get: 53
  • redis: 35
  • nginx: 36

• Memory footprint measured with busybox (512MB memory)
  • We enabled template for Kata

[Chart: Memory Footprint (KB), axis 0 to 80000; bars: kata containers, gvisor; series: single container, average of 20.]
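The per-workload figures for Kata are counts of the distinct syscalls made by the qemu process. The slides do not say which tool collected them. The following is a minimal sketch, assuming an strace-style text log in which each call line starts with the syscall name followed by "(". The function name `distinct_syscalls` and the sample log lines are hypothetical, made up for illustration only.

```python
import re

# Optional "[pid N]" or bare-PID prefix, then the syscall name right before "(".
_SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_][a-z0-9_]*)\(")


def distinct_syscalls(log_lines):
    """Return the set of syscall names seen in strace-style output lines."""
    names = set()
    for line in log_lines:
        match = _SYSCALL_RE.match(line.strip())
        if match:  # signal and exit lines ("---", "+++") do not match
            names.add(match.group(1))
    return names


# Hypothetical sample log, not taken from the benchmark runs.
sample_log = [
    'openat(AT_FDCWD, "/dev/kvm", O_RDWR) = 3',
    "ioctl(3, KVM_GET_API_VERSION, 0) = 12",
    "[pid  4242] ioctl(5, KVM_RUN, 0) = 0",
    '4243 read(7, "\\0", 512) = 1',
    "--- SIGCHLD {si_signo=SIGCHLD} ---",
    "+++ exited with 0 +++",
]

seen = distinct_syscalls(sample_log)
print(len(seen), sorted(seen))
```

Counting distinct names rather than total calls matches the way the slide reports a single small number per workload.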

Boot time

• Boot busybox image

• Kata (different configurations), gVisor, runC

[Chart: Boot time (seconds), axis 0 to 2.5; series: cli_dis-factory, cli_en-factory, shimv2_dis-factory, shimv2_en-factory, gvisor, runc.]
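A boot-time comparison of this kind comes down to timing how long a short-lived busybox container takes to start and exit, repeated a few times. Below is a minimal sketch of such a timing loop. The helper name `time_command` is hypothetical, and the docker invocation in the comment is an assumption: the runtime name passed to `--runtime` depends on how each runtime is registered on the host. It is not the authors' harness.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, rounds=5):
    """Run cmd `rounds` times and return (mean seconds, list of samples)."""
    samples = []
    for _ in range(rounds):
        start = time.perf_counter()
        subprocess.run(
            cmd,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), samples


# Harmless stand-in command so the sketch runs anywhere.
# On a test host one might instead time something like:
#   ["docker", "run", "--rm", "--runtime=runsc", "busybox", "true"]
# where "runsc" is whatever name the runtime was registered under (assumption).
mean_s, samples = time_command([sys.executable, "-c", "pass"], rounds=3)
print(f"mean {mean_s:.3f}s over {len(samples)} runs")
```

Timing the whole command includes client and daemon overhead, so it is only meaningful for comparing runtimes on the same host.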

CPU/Memory Performance

• Sysbench CPU and memory test

• Test 5 rounds with sysbench and take the average result

• CPU overhead is very limited for both Kata and gVisor
  • Random memory write: 2.5% vs 5.5% overhead compared to host (Kata vs gVisor)
  • Seq memory write: 8% vs 13% overhead compared to host (Kata vs gVisor)

CPU Test (sec), lower is better

                         host       kata       gvisor
  CPU test (sec)         10.00048   10.0004    10.00066

Memory Write

                         host       kata       gvisor
  avg seq write (MB/s)   5246.84    4868.7     4561.58
  avg rand write (MB/s)  1917.63    1870.29    1815.2
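The overhead percentages quoted for memory writes can be rechecked from the throughput values on the Memory Write chart. The sketch below assumes overhead means throughput lost relative to the host; the helper name `overhead_pct` is hypothetical. The results land close to, though not exactly on, the rounded percentages given in the bullets.

```python
def overhead_pct(host, guest):
    """Throughput lost relative to the host, as a percentage."""
    return (host - guest) / host * 100.0


# Sysbench memory-write throughput (MB/s) as read from the chart.
seq_write = {"host": 5246.84, "kata": 4868.7, "gvisor": 4561.58}
rand_write = {"host": 1917.63, "kata": 1870.29, "gvisor": 1815.2}

overheads = {
    (name, runtime): overhead_pct(series["host"], series[runtime])
    for name, series in (("seq", seq_write), ("rand", rand_write))
    for runtime in ("kata", "gvisor")
}

for (name, runtime), pct in overheads.items():
    print(f"{name:4s} write, {runtime:6s}: {pct:.1f}% overhead")
```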

IO performance (1)

• IO tests with FIO; gVisor didn’t complete the test

• Note: The “kata passthru fs” has multi-queue support (in the default kata guest kernel), which is not present in the host driver

[Chart: Buffer IO (MiB/s), axis 0 to 30000; workloads: 128k randread, 128k randwrite, 4k randread, 4k randwrite; series: host fs, runC, kata passthru (virtio-blk), kata passthru (virtio-scsi), kata 9p.]

[Chart: Direct IO (MiB/s), axis 0 to 1200; workloads: 128k randread, 128k randwrite, 4k randread, 4k randwrite; series: host fs, runC, kata passthru fs.]

IO Performance (2)

• IO test with dd command
  • dd if=/dev/zero of=/test/fio-rand-read bs=4k count=25000000
  • dd if=/dev/zero of=/test/fio-rand-read bs=128k count=780000

• Note: pay attention to the cache of 9p
  • DIO on a network FS like 9p only means no guest (client) page cache; there may still be a cache on the host (server) side

[Chart: dd test (MB/s), axis 0 to 800; block sizes: 128k, 4k; series: host fs, runC, kata passthru fs, kata 9p, gvisor 9p.]
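The two dd invocations use different block sizes but write roughly the same amount of data, which is what makes their MB/s figures comparable. A minimal sketch of that arithmetic follows. The helper names are hypothetical, and the elapsed time in the last step is a made-up placeholder, not a measured result.

```python
# dd treats the k/m/g suffixes used here as binary multiples.
UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3}


def parse_block_size(bs):
    """Convert a dd-style block size such as '4k' or '128k' to bytes."""
    suffix = bs[-1].lower()
    if suffix in UNITS:
        return int(bs[:-1]) * UNITS[suffix]
    return int(bs)


def throughput_mb_s(total_bytes, elapsed_seconds):
    """Throughput in MB/s, with 1 MB = 10**6 bytes as dd reports it."""
    return total_bytes / elapsed_seconds / 1e6


small_blocks = parse_block_size("4k") * 25_000_000
large_blocks = parse_block_size("128k") * 780_000
print(small_blocks, large_blocks)

# Hypothetical elapsed time, for illustration only.
example_rate = throughput_mb_s(small_blocks, elapsed_seconds=200.0)
print(f"{example_rate:.0f} MB/s")
```

Both runs come to about 102 GB, so the dataset is far larger than the 8GB container memory mentioned in the testing setup.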

Networking Performance

• The networking performance is tested on 10Gb Ethernet

• In default configurations

[Chart: Network Throughput (Gbps), axis 0 to 10; series: host, docker container, kata, gvisor.]

Real-life case: Nginx

• Apache Bench version 2.3

• ab -n 50000 -c 100 http://10.100.143.131:8080/

• General result

                      runC       Kata*      gVisor
Concurrency Level     100        100        100
Time taken for tests  3.455      3.439      161.338    seconds
Complete requests     50000      50000      50000
Failed requests       0          0          0
Total transferred     42250000   42700000   42250000   bytes
HTML transferred      30600000   30600000   30600000   bytes
Requests per second   14473.73   14541.18   309.91     [#/sec] (mean)
Time per request      6.909      6.877      322.677    [ms] (mean)
Time per request      0.069      0.069      3.227      [ms] (mean, across all concurrent requests)
Transfer rate         11943.65   12127.12   255.73     [Kbytes/sec] received

* The www-root of the kata test case is not located on 9pfs
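The two "Time per request" rows in the ab result follow directly from the request rate and the concurrency level. The sketch below rechecks the reported values using that relationship; the function names are hypothetical, and the small mismatches come from rounding in the reported figures.

```python
CONCURRENCY = 100  # from `ab -c 100`


def time_per_request_ms(concurrency, requests_per_second):
    """Mean latency seen by one client: concurrency / rate."""
    return concurrency / requests_per_second * 1000.0


def time_per_request_across_ms(requests_per_second):
    """Mean time per request across all concurrent requests: 1 / rate."""
    return 1000.0 / requests_per_second


# (requests per second, time per request, time per request across all), as reported.
reported = {
    "runC": (14473.73, 6.909, 0.069),
    "Kata": (14541.18, 6.877, 0.069),
    "gVisor": (309.91, 322.677, 3.227),
}

derived = {}
for runtime, (rps, per_req, across) in reported.items():
    derived[runtime] = (
        time_per_request_ms(CONCURRENCY, rps),
        time_per_request_across_ms(rps),
    )
    print(f"{runtime:6s} {derived[runtime][0]:8.3f} ms  {derived[runtime][1]:6.3f} ms")

# How many times more requests per second runC served than gVisor in this test.
rate_ratio = reported["runC"][0] / reported["gVisor"][0]
print(f"runC / gVisor request rate: {rate_ratio:.1f}x")
```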

Real-life case: Nginx (Cont)

• Detailed connection times (ms) and percentages

runc         min   mean   [+/-sd]   median   max
Connect        0      0       0.3        0     7
Processing     1      7       0.3        7    14
Waiting        1      7       0.3        7     8
Total          3      7       0.4        7    15

kata         min   mean   [+/-sd]   median   max
Connect        0      0       0.1        0     7
Processing     2      7       1          7   207
Waiting        2      7       1          7   207
Total          3      7       1          7   207

gvisor       min   mean   [+/-sd]   median   max
Connect        0      0       0.1        0     2
Processing    44    322     119.8      305   836
Waiting       21    322     119.9      304   836
Total         45    322     119.8      305   836

Percentage of the requests served within a certain time (ms)

             50%   66%   75%   80%   90%   95%   98%   99%   100%
runC           7     7     7     7     7     7     7     7     15
kata           7     7     7     7     7     8     8     8    207
gvisor       305   355   390   415   485   553   634   674    836

* The www-root of the kata test case is not located on 9pfs
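One way to read the percentile data on this slide is as a tail-to-median ratio for each runtime. The sketch below computes it from the values on the percentile chart. The function name `tail_ratio` and the choice of the 99th percentile as the tail are assumptions made for illustration.

```python
PERCENTILES = [50, 66, 75, 80, 90, 95, 98, 99, 100]

# Time (ms) within which each percentage of requests was served.
latency_ms = {
    "runC": [7, 7, 7, 7, 7, 7, 7, 7, 15],
    "kata": [7, 7, 7, 7, 7, 8, 8, 8, 207],
    "gvisor": [305, 355, 390, 415, 485, 553, 634, 674, 836],
}


def tail_ratio(series, tail=99, base=50):
    """Ratio of the tail-percentile latency to the base-percentile latency."""
    return series[PERCENTILES.index(tail)] / series[PERCENTILES.index(base)]


ratios = {runtime: tail_ratio(series) for runtime, series in latency_ms.items()}
for runtime, ratio in ratios.items():
    print(f"{runtime:6s} p99/p50 = {ratio:.2f}")
```

The 100% column is a single worst-case request, so a ratio based on the 99th percentile is less sensitive to one outlier such as the 207 ms request in the Kata run.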

Real-life case: Redis

[Chart: Redis-benchmark (requests per second), axis 0 to 160000; series: runC, kata, gvisor; tests: PING_INLINE, PING_BULK, SET, GET, INCR, LPUSH, RPUSH, LPOP, RPOP, SADD, HSET, SPOP, LPUSH (needed to benchmark LRANGE), LRANGE_100 (first 100 elements), LRANGE_300 (first 300 elements), LRANGE_500 (first 450 elements), LRANGE_600 (first 600 elements), MSET (10 keys).]

Real-life case: Tensorflow (CPU)

Note: this is a CPU test (No GPU)

[Chart: CNN benchmarks (images/sec), axis 0 to 1.6; series: runc, kata, gvisor-kvm, gvisor-ptrace.]

TensorFlow    1.10
Model         resnet50
Dataset       imagenet (synthetic)
Mode          training
SingleSess    False
Batch size    32 global, 32 per device
Num batches   100
Num epochs    0
Devices       ['/cpu:0']
Data format   NHWC
Optimizer     sgd

Summary

• Both Kata and gVisor
  • Small attack surface
  • CPU performance identical to host and runC

• On Kata Containers
  • Some suggestions: passthru fs, factory and/or shimv2, etc.
  • Good performance in most cases with proper configuration

• On gVisor
  • Fast boot and low memory footprint
  • Compatibility still needs improvement for general usage

• Further tests
  • Fine-tuned benchmarks, vhost-user/DPDK etc., NUMA-specific tests, and nested virtualization tests
  • Different kinds of networking tests
  • More real-life cases

The Future of the Projects

Kata Containers

• Kata is on its way to being more efficient
  • The shim v2 shown in the benchmarks is going to be merged soon
  • NEMU will reduce the VMM overhead
  • Multiple networking improvements are under development or testing

• And more featureful
  • Better GPU and other accelerator support

• Contributions are welcome

gVisor and Similar Projects

• gVisor demonstrates a truly lightweight secure sandboxing technology

• gVisor needs more work on
  • More efficient syscall interception and processing
  • Better compatibility for more applications

• Another project, linuxd, is based on the Linux kernel and does a similar job
  • Lai, Jiangshan, Containerize Linux Kernel, Open Source Summit 2018, Vancouver (https://schd.ws/hosted_files/ossna18/db/Containerize%20Linux%20Kernel.pdf)

Q&A

Contributions are Welcome
