

Towards Sidecore Management for Virtualized Environments

Research Thesis

Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science

Eyal Moscovici

Submitted to the Senate of the Technion — Israel Institute of Technology

Av 5776, Haifa, August 2016


The research thesis was done under the supervision of Prof. Dan Tsafrir in the Computer Science Department.

The generous financial support of the Technion is gratefully acknowledged.


Contents

Abstract

Abbreviations and Notations

1 Introduction

2 Background
2.1 Hardware Virtualization
2.2 I/O Virtualization
2.2.1 Emulation
2.2.2 Paravirtual I/O
2.2.3 Single Root I/O Virtualization (SRIOV)
2.3 Sidecores
2.3.1 Efficient and Scalable Paravirtual I/O System (ELVIS)
2.3.2 Data-Plane Development Kit (DPDK)
2.3.3 Virtual Remote I/O (vRIO)

3 Motivation
3.1 Sidecore vs. Traditional Approaches
3.2 Drawbacks of Static Configuration

4 Aspects Affecting Sidecore Performance
4.1 Experimental Methodology
4.2 Netperf
4.3 Load Conditions
4.3.1 ELVIS' Stuck Cycles
4.3.2 VM Idle Mode: Poll or Yield
4.4 Performance Metrics
4.5 Number of Sidecores
4.6 Packet Batching
4.6.1 TCP Segmentation Offload Batching
4.6.2 Latency vs. Throughput
4.7 Affinity
4.7.1 Dividing the VMs Among Cores
4.7.2 Dividing the Virtual I/O Devices Among Sidecores
4.8 Exit-less Interrupts & Polling I/O Devices

5 Number of Sidecores
5.1 The Butterfly Model
5.2 Cost of Generating a Message
5.3 I/O Interposition
5.4 Throughput and Scalability
5.5 Latency and Scalability

6 Design and Implementation
6.1 Generalizing I/O Threads for Sidecores
6.2 Dynamically Relocating Devices between I/O Threads
6.3 Communicating with Userland
6.4 The I/O Manager
6.4.1 Architecture
6.4.2 Setting Parameters
6.4.3 Keeping History and a Regret Policy
6.4.4 Utilizing the Baseline
6.4.5 Minimum Batching
6.4.6 Dividing Virtual I/O Devices Among Sidecores
6.4.7 Overhead

7 Evaluation
7.1 Experimental Methodology
7.2 Micro benchmarks
7.2.1 Netperf TCP Stream
7.2.2 I/O Interposition
7.2.3 Varying Message Costs
7.3 Apache Web Server
7.4 Dynamic Workload

8 Related Work

9 Conclusions

Abstract in Hebrew

List of Figures

2.1 Side-by-side comparison of the components used for implementing a virtual network device across the four virtual I/O models. Emulation, baseline Virtio, ELVIS, DPDK and vRIO are interposable; SRIOV is not.

3.1 A Netperf TCP stream with 12 VMs (left) and four VMs (right). The baseline traditional model is denoted base. The statically tuned sidecore configurations with a different number of sidecores are denoted e1-e7. We see that different configurations perform better under various workload parameters.

4.1 Our testbed setup: Two dual-socket machines with eight cores per socket, each connected using two dual-port 10Gbps Ethernet NICs.

4.2 The VM configuration. Every three VMs share a single physical NIC port.

4.3 12 VMs and four sidecores, running a Netperf TCP stream: (1) Throughput, (2) Overall VCPU CPU usage.

4.4 Undercommitted system and four VM guests running Netperf UDP R/R: (1) Transactions per second, (2) Overall I/O CPU usage.

4.5 A Netperf TCP stream with various message sizes: (a) 12 VMs, (b) 4 VMs.

4.6 Comparison of CPU usage in the baseline no affinity configuration running a Netperf TCP stream with various message sizes.

4.7 One VM with one sidecore running a Netperf TCP stream.

4.8 12 VMs with 1-2 sidecores running a Netperf TCP stream with 256B messages.

4.9 A comparison between one- and four-sidecore configurations, running a Netperf TCP stream with 64B messages. We compare: (a) the number of VM exits and their reasons; (b) average transmitted packet size; (c) VM guest CPU usage; (d) average guest interrupts per second.

4.10 12 VMs running a Netperf TCP stream. (a) The difference in throughput between running with and without VM affinity; a positive value means that using VM affinity outperformed no VM affinity. (b) The difference in standard deviation per VM running with and without affinity; a positive value means that there is a larger standard deviation in the throughput between the VMs with affinity than between the VMs without affinity.

4.11 12 VMs with four sidecores running a Netperf TCP stream. Mirrored: device affinity = VM affinity. Spread: the devices belonging to VMs with the same affinity are assigned to different sidecores. We see that spread outperforms mirrored by up to 1.8x with default and full TSO. For no TSO, mirrored outperforms spread by about 1.1x.

4.12 A Netperf TCP stream with 1KB messages. A comparison of the VM guests running with the posted interrupts feature enabled vs. disabled.

5.1 A Netperf TCP stream with messages of size 1KB, tested under various sidecore configurations (e1-e7).

5.2 A Netperf TCP stream with varying message sizes, tested under various sidecore configurations (e1-e7).

5.3 A Netperf TCP stream with a varying message-generating cost.

5.4 A Netperf TCP stream with various interposition delays: (a1) 64B; (a2) 1KB; (b1) 16KB.

5.5 All configurations with 1-12 VM guests running a Netperf TCP stream with different message sizes: (a1) 64B; (a2) 1KB; (b1) 16KB.

5.6 All configurations with 1-12 VM guests running a Netperf UDP request/response.

6.1 Prominent files and directories in the sysfs interface.

7.1 A Netperf TCP stream with various message sizes.

7.2 A Netperf TCP stream with various interposition delays.

7.3 A Netperf TCP stream with a busy-loop delay that simulates performing a computation before transmitting packets.

7.4 Running Apachebench with static HTML page sizes of 64B-1MB.

7.5 Running Apachebench with static HTML page sizes of 64B-1MB (cont.).

7.6 Dynamic benchmark running a Netperf TCP stream with 2KB messages and seven phases of various busy-loop configurations.

Abstract

Virtualization is the ability of modern computer systems to run guest Virtual Machines (VMs). The VM host exposes various I/O devices to its guests, such as the Network Interface Controller (NIC) and the hard disk. Paravirtual I/O is a common technique for presenting the guest VM with an interface similar, but not identical, to the underlying hardware. Such interfaces are called virtual I/O devices, and their behavior is emulated by the VM host. This emulation must be scalable, able to handle high-throughput I/O workloads, and must not consume system resources during periods of low throughput.

There are two leading approaches in use today: traditional paravirtual I/O and sidecores. In the traditional approach, VM guests experience performance degradation while handling high-throughput I/O workloads. On the other hand, the sidecore approach always consumes a set amount of system resources, which are wasted during low-throughput I/O workloads. This work explores a dynamic virtual I/O device management design that improves system utilization by combining the two approaches.

Our system, SidecoreM, uses an I/O manager to dynamically determine the preferred approach based on the current I/O load. To this end, we first modeled the system under varying I/O workloads, then implemented the I/O manager based on the model.

Evaluation of our design under Linux shows that SidecoreM is able to find the optimum configuration in all cases tested. While searching for the optimum configuration, it degrades performance by at most 6% compared to a statically tuned system. SidecoreM shows up to a 2.2x performance gain over the traditional approach.

SidecoreM incurs 280µs of overhead per second during normal operation, and a few milliseconds when changing the configuration of the system. The results show that SidecoreM improved performance by up to 2x with a dynamic workload and by 1.8x in static workloads compared to traditional paravirtual I/O. For most workloads, SidecoreM's overhead – how far it was from the state-of-the-art statically tuned configuration – was less than 5%, and in most cases, there was no noticeable overhead.


Abbreviations and Notations

VM — Virtual Machine
KVM — Kernel-based Virtual Machine
QEMU — Quick Emulator
I/O — Input/Output
NIC — Network Interface Controller
TX — Transmit
RX — Receive
RR — Request/Response
IP — Internet Protocol
TCP — Transmission Control Protocol
HTTP — Hypertext Transfer Protocol
API — Application Programming Interface
TPS — Transactions Per Second
Sec — Second
ms — Millisecond
µs — Microsecond
B — Byte
KB — Kilobyte
MB — Megabyte
GB — Gigabyte
Gb — Gigabit
Gbps — Gigabits Per Second
VCPU — VM Virtual CPU


Chapter 1

Introduction

Machine virtualization technology has proven useful in many scenarios. Real-world applications of virtualization, including most enterprise data centers and cloud computing sites, use paravirtual I/O [16, 5, 21, 25, 26]. Paravirtual I/O is a technique by which a hypervisor exposes a software interface to I/O devices that is similar, but not identical, to existing hardware. Examples of such interfaces are KVM's Virtio [21], VMware's VMXNET [25] and Xen PV [5]. Paravirtual I/O has been shown to outperform device emulation, as the interface is optimized for virtualized environments.

Another benefit of paravirtual I/O is that it allows the hypervisor to inspect and interpose on a VM's I/O channels at runtime. Device assignment, on the other hand, does not allow for interposition, as the hypervisor is removed from the datapath. For this reason, paravirtualization is the de facto standard for exposing I/O devices to virtual machines.

In the traditional model, the guest sends I/O requests that are handled by the host. The guest's driver, also called the front-end, sends each I/O request to a dedicated thread in the host, called the back-end. The back-end handles the request and later returns a reply. As a result, each paravirtual I/O device has one dedicated thread.

Paravirtual I/O devices are not without their drawbacks, and have been shown to incur major performance penalties [6, 7, 17, 18, 27]. These arise from two main sources: exits and scheduling. Exits, which are context switches between the VM and the hypervisor, are generated in order to communicate with the virtual I/O device, as was shown in [1]. The scheduling problem is caused by the hypervisor scheduler, which schedules virtual I/O devices regardless of I/O activity. This scheduling may cause major performance losses, as shown in [12, 28].

The sidecore approach, applied to paravirtualization in [12], was shown to reduce some of these bottlenecks. This approach divides the CPU cores in the system into two groups. The first group of CPUs runs the VMs' VCPUs and is called the VM-cores. The second group runs the virtual I/O devices' threads and is called the IO-cores. Each IO-core is assigned a group of I/O devices, enabling each core to have its own custom scheduler and to poll the guest VMs. Polling is instrumental in reducing the number of exits.

The sidecore approach is not without its drawbacks. The division of the CPU cores into two groups reduces the computation power available to the VMs, which in turn reduces system utilization when the workload is not I/O intensive. Furthermore, in current solutions, the number of cores that should be assigned to each set depends on the constantly changing workload.

In this work we show that for optimum performance, the resources must be allocated according to measurements taken at runtime. Chapters 2 and 3 present the historical and technical background of virtualization and paravirtual I/O, and introduce the motivation behind this work. In Chapter 4, we go over aspects that affect sidecore performance, such as packet batching and core affinity, and discuss each identified aspect. In Chapter 5, we investigate the number of sidecores required, concentrating on throughput-intensive workloads in an overcommitted system. Our analysis uncovers a model, denoted the "butterfly model", which explains the throughput of different setups and guides our design of the sidecore manager. In Chapter 6, we use the model to create an I/O manager, called SidecoreM, for virtualized environments.

We show that SidecoreM is able to improve performance by up to 2x with a dynamic workload and by 1.8x in static workloads compared to traditional paravirtual I/O. For most workloads, SidecoreM's overhead – how far it was from the state-of-the-art statically tuned configuration – was less than 5%, and in most cases, there was no noticeable overhead.


Chapter 2

Background

2.1 Hardware Virtualization

Virtualization as a concept and a technique has been around for more than 50 years, and its roots stem from mainframe technology pioneered in the 1960s and 1970s. In the broadest terms, virtualization is defined as a framework or methodology in which a computer's resources are divided into multiple execution environments, by applying one or more concepts or technologies such as hardware and software partitioning, partial or complete machine simulation, emulation, etc. [23].

Hardware virtualization, or machine virtualization, divides the resources of the physical hardware machine into VMs. Software running on these VMs is presented with virtual hardware interfaces that are separate from the underlying hardware resources. The physical hardware machine, named the host, runs a special program called the hypervisor, which is in charge of running these guest virtual machines, or guests. For example, a common hypervisor is KVM/QEMU, which runs on the Linux operating system; KVM is able to run VMs that run the Windows operating system.

There are two major types of virtualization: full virtualization and paravirtualization. In the first, full virtualization, the software running on the VM is unaware that it is not running on a physical machine. Full virtualization dictates that the hypervisor must provide three properties to its VMs: compatibility, isolation and protection. Compatibility is the ability of the hypervisor to run software in the VM in an unmodified fashion regardless of the underlying physical hardware. Isolation means that the guest VM should run as if it were running on a dedicated physical machine regardless of the number of currently running VMs. Protection means that a VM cannot access or affect the state of the hypervisor or another VM running in the system.

Full virtualization allows the guest software to run unmodified. The first such machine, the IBM CP-40 of the IBM System/360 Model 67, was produced in 1967. The IBM CP-40 provided VMs with a virtual hardware interface encompassing all of their target virtualized architecture. Each VM ran in a "problem state" where privileged instructions such as I/O operations caused a machine exception that was caught and simulated by the hypervisor. When a VM attempted to access an address not present in main memory, this would trigger a page fault exception that was again caught and simulated by the hypervisor. These two techniques are still very much in use today in virtualization.

In the second virtualization type, paravirtualization, the software running inside the VM is aware that it is being run on a VM, as the hypervisor exposes hardware interfaces similar but not identical to the underlying hardware. As a result, the software inside the VM needs to be modified to run in this virtualization type. Thus the hypervisor loses some compatibility in favor of performance and increased efficiency. Most modern hypervisors such as VMware ESXi, Xen, Hyper-V and KVM apply paravirtual techniques [5, 21, 25].

Today, most virtualization is hardware-assisted virtualization, meaning that the hardware, and especially the CPU, exposes special instructions and mechanisms for virtualization. These mechanisms help improve the performance and efficiency of virtualization.

2.2 I/O Virtualization

I/O virtualization is an important part of virtualization that allows VMs to communicate with the outside world. VMs are not permitted by the hypervisor to access peripheral hardware such as I/O devices directly, in order to preserve the three principles of virtualization. Compatibility is broken if the guest VM cannot work with the physical hardware. Isolation and protection may be broken because the guest VM is usually running an operating system that assumes exclusive access to and control over all hardware devices.


Figure 2.1: Side-by-side comparison of the components used for implementing a virtual network device across the four virtual I/O models. Emulation, baseline Virtio, ELVIS, DPDK and vRIO are interposable; SRIOV is not.


I/O virtualization is a technique used to virtualize access to I/O devices by decoupling the logical I/O from the physical I/O through an indirection layer. The VM host exposes a virtual I/O device to its guest VMs. It then intercepts the requests that the guest directs at the virtual device and fulfills them using the physical hardware. The benefits of the indirection layer are substantial: (1) improving utilization of the physical I/O devices via time and space multiplexing; (2) facilitating live migration and suspend/resume by using the indirection layer to decouple and re-couple VMs from/to the physical device; (3) enabling seamless switching between different I/O channels, which allows for device aggregation, load balancing between channels, and failure masking; and (4) providing hosts with the ability to interpose on (control and manipulate) the I/O activity of their guests, thus supporting features such as software-defined networking (SDN), file-based images, replication and snapshots, along with security-related functionality such as record-replay, encryption, firewalls, and intrusion detection.

SidecoreM uses and builds upon existing I/O virtualization models. Figure 2.1 shows an abstract representation of these models. In the remainder of this chapter we briefly survey them and highlight some additional aspects.

2.2.1 Emulation

The most basic implementation of virtual I/O devices is emulation, whereby the VM host exposes an interface identical to that of existing hardware. Emulation is beneficial in that the guest VM is not aware it is being virtualized and requires no changes to work in this configuration. However, hardware interfaces emulated in this way are not optimized for virtualization, and are often inefficient and costly to emulate.

For example, KVM/QEMU, a common Linux-based hypervisor, uses the Intel e1000 interface as its default NIC when interacting with its VM guests. The Intel e1000 interface is common and widely supported by most unmodified VM operating systems (OSs). This makes the e1000 a very convenient default interface for VM guests; however, it suffers from high performance overhead. The interface contains several memory-mapped registers (MMRs) and a ring buffer shared between the guest VM's OS and the virtual I/O device. When the guest VM sends or receives packets from the network through the virtual NIC, it writes to and reads from several of its MMRs, which are mapped to addresses in its RAM. Each time the addresses of these MMRs are accessed, the access must be handled by the hypervisor, which emulates the communication with the NIC. Handling access to MMRs is done using a trap mechanism: each time the VM guest accesses the address of an MMR, an exception occurs that transfers control from it to the underlying hypervisor. The hypervisor then determines the reason for the exception and emulates the behavior of the MMR, finally resuming the execution of the guest VM. Transferring control from the guest VM to the hypervisor by use of exceptions is called a VM exit or world switch, and is among the most costly operations in modern computer systems.

As a result, the need for several MMR accesses during every packet transmission is the main source of performance overhead of emulated devices.
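To make the trap-and-emulate flow concrete, the following is a minimal sketch of how a hypervisor might dispatch a trapped MMR write for an emulated NIC. It is not QEMU's actual e1000 code; the register offsets, the structure, and the helper names are invented for illustration.

#include <stdint.h>

/* Hypothetical register offsets of an emulated NIC; the real e1000
 * register layout is far richer. */
enum { REG_TX_TAIL = 0x10, REG_INT_MASK = 0x20 };

struct emu_nic {
    uint32_t int_mask;
    /* ... emulated ring and link state ... */
};

/* Stub standing in for the code that walks the TX ring and sends the
 * frames the guest queued. */
static void emu_nic_transmit(struct emu_nic *nic, uint32_t tail)
{
    (void)nic; (void)tail;
}

/* Invoked by the hypervisor after a VM exit triggered by a guest write
 * to an address backed by one of the device's MMRs. Every such write
 * costs a full exit/entry round trip, which is the overhead discussed
 * above. */
void handle_mmr_write(struct emu_nic *nic, uint32_t offset, uint32_t value)
{
    switch (offset) {
    case REG_TX_TAIL:
        /* The guest advanced the TX tail: emulate the DMA and transmit. */
        emu_nic_transmit(nic, value);
        break;
    case REG_INT_MASK:
        nic->int_mask = value;
        break;
    default:
        /* Unknown register: ignore it, much like real hardware would. */
        break;
    }
    /* The hypervisor then resumes the guest (VM entry). */
}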

2.2.2 Paravirtual I/O

A paravirtual I/O device shares many of the benefits of emulated I/O devices, given that the hypervisor exposes to the guest VM an interface similar yet not identical to existing hardware. This interface is designed to minimize the number of VM exits that occur during normal operation, and consequently has a smaller impact on performance. Accordingly, paravirtualization is the preferred device type used in an overwhelming number of installations. Despite being similar in essence across different hypervisors, paravirtual device drivers are hypervisor-specific, because vendors have opted to use different interfaces. Thus, appropriate paravirtual device drivers must be installed in order to use the device. In this work, we use the KVM Virtio [21] as the state-of-practice representative of paravirtualization. The main source of overhead is VM exits and interrupts during the transmission and reception of packets. Further, although better performing than naive emulation, paravirtualization still traps upon every I/O request and thus fails to deliver the same performance level as a hardware implementation.
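At the heart of such an interface is a ring of request descriptors shared between the guest front-end and the host back-end. The sketch below illustrates the idea only; it is not the actual Virtio/vhost layout, and the structure names, the single ring, and the kick callback are simplifying assumptions.

#include <stdint.h>
#include <stddef.h>

#define RING_SIZE 256  /* power of two, assumed for this sketch */

/* One outstanding I/O request, e.g. a packet buffer to transmit. */
struct io_desc {
    uint64_t guest_addr;  /* guest-physical address of the buffer */
    uint32_t len;
};

/* Shared between the guest front-end (producer) and host back-end (consumer). */
struct io_ring {
    struct io_desc desc[RING_SIZE];
    volatile uint32_t head;  /* advanced by the guest */
    volatile uint32_t tail;  /* advanced by the host  */
};

/* Guest side: publish a request and, in the traditional model, notify
 * the host with a "kick", which traps and causes a VM exit. */
static inline void guest_post(struct io_ring *r, uint64_t addr, uint32_t len,
                              void (*kick)(void))
{
    uint32_t h = r->head;
    r->desc[h % RING_SIZE] = (struct io_desc){ .guest_addr = addr, .len = len };
    __atomic_store_n(&r->head, h + 1, __ATOMIC_RELEASE);
    if (kick)
        kick();  /* omitted entirely when the back-end polls */
}

/* Host side: drain everything the guest has published so far. */
static inline void host_drain(struct io_ring *r,
                              void (*process)(const struct io_desc *))
{
    while (r->tail != __atomic_load_n(&r->head, __ATOMIC_ACQUIRE)) {
        process(&r->desc[r->tail % RING_SIZE]);
        r->tail++;
    }
}

In the traditional model the kick traps to the hypervisor on every notification; a sidecore back-end skips the kick and discovers new requests by polling the head index, which is what eliminates these exits.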

2.2.3 Single Root I/O Virtualization (SRIOV)

SRIOV is seen as a viable alternative to paravirtual I/O because it offers the VM guests better performance than the paravirtual approach. PCI-SIG has standardized the SRIOV extension of PCIe, which allows physical I/O devices to self-virtualize and expose multiple virtual instances of themselves such that each instance can be assigned to a different VM. SRIOV outperforms paravirtualization, as I/O requests generated by guests do not trap. Nevertheless, SRIOV does not provide bare-metal performance, because device interrupts are still handled by the hypervisor. The Exit-less Interrupt approach (denoted ELI) overcomes this barrier by utilizing a virtual I/O model in which guests receive the interrupts of their SRIOV instances directly, without hypervisor involvement. SRIOV+ELI provides bare-metal performance for VMs. In this research, we decided not to use SRIOV because it negates many key benefits of virtualization that rely on interposition. For example, it conflicts with memory overcommitment and host swapping, as DMA transfers do not tolerate page faults, forcing the host to either pin the entire image of the VM to memory, or to become aware of all its DMA target buffers via explicit guest-host interaction. SRIOV likewise conflicts with live migration, because the source host cannot decouple the device while it is working, and because the target host might not have compatible hardware. Other problems include the host not being able to meter, account for, suspend, resume, log, isolate, and manipulate the I/O activity. All the features that interposable I/O provides are missing.

2.3 Sidecores

The sidecore approach is an interposable I/O model that provides all the benefits of traditional paravirtualization, but with higher performance. As discussed in Chapter 1, the sidecore approach divides the host's cores into two groups: sidecores dedicated to processing virtual I/O, and VM-cores for running the VMs that generate the I/O. The sidecores poll relevant memory regions, observe changes that reflect I/O requests, and process these requests without inducing the overhead of trapping. The sidecore paradigm has been applied to various tasks such as IOMMU virtualization [2], storage virtualization [6], TCP/IP processing [24], and GPU processing [22]. This approach was shown to be superior to the baseline, oftentimes approaching SRIOV performance. Presently, in existing implementations, the number of sidecores is determined statically during setup, limiting the usability of this approach.


There are several different implementations that apply the sidecore approach to the baseline Virtio. The following is a quick survey of the implementations used in or considered for this work.

2.3.1 Efficient and Scalable Paravirtual I/O System (ELVIS)

The ELVIS system applies the sidecore paradigm to the baseline KVM Virtio setup [12]. Using the sidecore approach allows ELVIS to eliminate the trap-and-emulate model of the traditional paravirtual implementation, instead polling the guest VM for outbound I/O packets. In this way, costly VM exits are avoided.

Consider, for example, the exits induced by the virtual I/O models when receiving one network packet addressed to a VM guest from an external client and sending back a packet in response. In this case, a VM guest induces three synchronous exits: one when sending the response, and two when writing to a register after handling the interrupts that notify the VM guest that (1) a packet has arrived and that (2) the response has been sent. ELVIS avoids all three: the first by using polling, and the others by utilizing ELI to deliver exit-less interrupts (in the form of IPIs) from host (side)cores to guest cores. The ELVIS source code is publicly available [10], and it is the central approach used in this work as representative of the state of the art in interposable sidecore virtual I/O. The number of sidecores utilized by ELVIS is static.
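To illustrate the structure of such a sidecore, the following is a minimal sketch of an I/O worker pinned to a dedicated IO-core that round-robins over its assigned virtual devices. It is not ELVIS's actual code; the types and helper functions are assumptions.

#include <stdbool.h>
#include <stddef.h>

struct vdev {
    struct vdev *next;  /* list of virtual devices assigned to this core */
    /* ... shared rings, guest notification state ... */
};

/* Hypothetical per-device helpers. */
extern bool vdev_has_work(struct vdev *d);      /* new descriptors published? */
extern void vdev_process(struct vdev *d);       /* handle the pending requests */
extern void vdev_notify_guest(struct vdev *d);  /* exit-less IPI (via ELI)     */

/* One such worker runs on each dedicated IO-core. It never sleeps:
 * 100% of the sidecore's cycles go to polling, which is the trade-off
 * revisited in Chapter 3. */
void sidecore_worker(struct vdev *devices)
{
    for (;;) {
        for (struct vdev *d = devices; d != NULL; d = d->next) {
            if (!vdev_has_work(d))
                continue;            /* nothing new: poll the next device */
            vdev_process(d);
            vdev_notify_guest(d);    /* completion without a guest exit   */
        }
    }
}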

2.3.2 Data-Plane Development Kit (DPDK)

The Data-Plane Development Kit (DPDK) [13] framework is a set of library APIs for the acceleration of packet processing. Among the DPDK libraries is a Virtio poll-mode driver. The poll-mode driver allows polling both on the outgoing packets sent by the guest VM and on packets arriving from the physical NIC addressed to the guest VM. This driver enables the sidecore approach in a similar fashion to ELVIS. Both implementations aim at eliminating the cost of asynchronous I/O, though each focuses on a different aspect. ELVIS's approach is to eliminate VM exits caused by the mechanism used to notify the hypervisor of an outbound packet, while DPDK focuses on eliminating the overhead of interrupts caused by inbound packets from the physical NIC. Despite the difference in perspective, both approaches ended up with more or less the same solution: polling. In this work, we chose to use the ELVIS implementation over the DPDK one, but believe that the results can be applied to DPDK.

2.3.3 Virtual Remote I/O (vRIO)

In vRIO [16], we extended the sidecore model from a single server to an entire rack of servers. Instead of the classical sidecore approach of dividing cores into two groups, the servers themselves are divided into two groups: VM hosts dedicated to running guest VMs, and IO hosts dedicated to emulating virtual I/O devices for the guest VMs. The VM hosts do not process (and are unaware of) the paravirtual I/O produced by their VM guests. Consolidation of cores dedicated to I/O processing into servers dedicated only to virtual I/O handling (the IO hosts) has multiple benefits. The first is the ability to utilize sidecores dedicated to other servers in order to alleviate a bottleneck, which may occur when the load generated by the VMs exceeds the processing capabilities of their designated sidecores. The second benefit of consolidation, and arguably the most notable benefit of virtualization, is resource consolidation: the ability of multiple VM hosts to share physical hardware resources through an IO host. Experience shows that multiple virtual servers can be adequately serviced by a smaller number of physical servers. In vRIO, we showed that consolidating the sidecores in IO hosts allows us to achieve superior performance using the same number of CPUs, or comparable performance using fewer CPUs. In exchange for reducing the number of CPUs, vRIO requires additional networking infrastructure in the rack. We showed how this trade-off yields a cheaper system. In the present work, we do not use vRIO, since we are concentrating on the performance of a single server.


Chapter 3

Motivation

3.1 Sidecore vs. Traditional Approaches

In Chapter 2 we covered the basics of both the traditional paravirtual model and the sidecore model. In this chapter and in Table 3.1, we compare the two models and discuss their strengths and weaknesses.

The I/O resource sharing of the sidecore model is static in two regards. The first is that the correct number of sidecores must be set during system setup, which requires prior knowledge about the workload. Moreover, in cases where the load changes over time, it may be impossible to establish the correct number of sidecores needed to optimize performance. The second aspect is that VM VCPUs and I/O threads cannot coincide on the same physical CPU, because all cycles within each dedicated IO-core are always in use: the sidecore must constantly poll all relevant memory regions of its associated guest VMs in order to fully avoid the cost of interrupts and VM exits, and therefore 100% of the sidecore's cycles are consumed.

                              Traditional Model            Sidecore Model
I/O resource sharing          Dynamic but inefficient      Static
I/O scheduling                Host thread scheduling       Based on I/O activity
Guest notifications           Cause VM exit                Exit-less
System utilization advantage  I/O non-intensive workloads  I/O intensive workloads

Table 3.1: A side-by-side comparison of the traditional paravirtual model and the sidecore model.


As such, the computational power consumed by the sidecores might be too low or too high, thereby wasting valuable cycles and hindering performance.

In contrast, the traditional model is dynamic in both regards. It uses interrupts, meaning that cycles are consumed only during I/O handling. The result is that only the necessary number of cycles is in use at any given time, enabling the system to utilize unused cores and cycles for other purposes.

Having said this, the traditional approach is inefficient when handling intensive I/O workloads, e.g., due to the high overhead of VM exits. As mentioned in Chapter 2, ELVIS avoids all VM exits during the sending and receiving of packets by leveraging polling and exit-less interrupts, while the traditional model does not. In the traditional Virtio design, each virtual I/O device has its own thread that is scheduled using the Linux scheduler.

ELVIS, on the other hand, has a single I/O thread handling multiple virtual I/O devices on each dedicated IO-core. The ELVIS approach has been shown to be beneficial compared to the traditional model in cases where several VM guests produce a high I/O load. In the traditional thread-based I/O scheduling, the scheduler may service one VM guest for a long time, delaying I/O to other VM guests until the OS decides to switch threads. In contrast, the ELVIS fine-grained scheduler promptly switches between the virtual I/O devices and contains mechanisms to promote fairness and decrease latency under a heavy I/O load.

To conclude, we claim that in terms of system utilization, the traditional approach has an advantage when the I/O workload is low, and the sidecore approach has an advantage when the I/O workload is high.

3.2 Drawbacks of Static Configuration

Using the sidecore approach with a single manually defined static configuration, to be used regardless of actual I/O loads, may lead to performance degradation compared to the traditional approach. This limits the usability of the sidecore approach to specialized setups where the I/O load is known in advance and the system is manually tuned. Moreover, determining the optimal configuration may prove to be a challenge.

We ran different statically tuned sidecore configurations on a Netperf TCP stream with various parameters, as detailed in Chapter 4. We saw that in most tested cases, the sidecore approach outperformed the traditional approach only when we got the "right" configuration, as demonstrated in Figure 3.1. Note that our results show that not all workloads, in our case the different message sizes, require the same number of sidecores. Additionally, we see that e1 does not scale with the increase in message size; we detail and explain this in Section 4.6.1.

Figure 3.1: A Netperf TCP stream with 12 VMs (left) and four VMs (right). The baseline traditional model is denoted base. The statically tuned sidecore configurations with a different number of sidecores are denoted e1-e7. We see that different configurations perform better under various workload parameters.

There seems to be a direct link between message sizes and the number of sidecores needed. Nonetheless, while such a link exists, we cannot utilize it in this work, as we assume no knowledge of the workload running in the guest VM. In Chapters 4 and 5, we explore the difficulties inherent in statically tuning the system, and model the virtual I/O behavior in our system using metrics we can leverage.


Chapter 4

Aspects Affecting Sidecore Performance

In order to gain better insight into virtual I/O device behavior, we investigated it under different loads. This was achieved by conducting a thorough investigation using a wide variety of parameters and configurations.

During our investigation, we monitored the following system characteristics: (1) throughput of the virtual I/O device; (2) CPU usage of the VM, both inside the VM guest and on the host system; (3) distribution of CPU usage within the I/O worker threads, specifically the number of cycles spent performing I/O processing and the amount of time spent idling and waiting; and (4) average packet size sent and received, and how it is affected by different system parameters. We also analyze two baseline configurations: baseline with affinity, in which each I/O thread shares a core with its guest VM, and baseline no affinity, in which no affinity is used and the Linux scheduler can schedule the VMs on all the cores available in the socket. In each graph, we display only the one of the two that performs better.

We identified several aspects that affect the performance of the system under examination. Some of these aspects are the number of sidecores in the sidecore model, the load on the system, and the performance metric to optimize. In the rest of Chapter 4, we go over each aspect and explain and demonstrate its importance. We do not perform a full investigation of each one, as that is beyond the scope of a single work.


Figure 4.1: Our testbed setup: Two dual-socket machines with eight cores per socket, each connected using two dual-port 10Gbps Ethernet NICs.

We will focus on the number of sidecores needed when optimizing the overall throughput of a system in which the number of VCPUs and sidecores combined is larger than the number of cores available.

4.1 Experimental Methodology

Our testbed system consists of two physical machines: a host and a load generator, as shown in Figure 4.1. The two servers are IBM System x3550 M4 machines, each equipped with two 8-core Intel 2.2GHz Xeon E5-2660 CPUs, 56GB of memory, and two Intel x520 dual-port 10Gbps NICs, allowing a total throughput of 40Gbps. The host hypervisor is KVM with QEMU 1.3.0, hosting VMs configured with one VCPU and 1GB of memory each, backed by huge pages of 2MB to maximize performance.

All machines run Linux 3.9. Hyperthreading and all power optimizations are turned off, namely sleep states (C-states) and DVFS (Dynamic Voltage and Frequency Scaling). This is done in order to obtain consistent results and to avoid reporting artifacts caused by nondeterministic events.

In all our experiments, unless stated otherwise, we use 12 VM guests, each with a single paravirtual NIC. We divided the VM guests into four groups, each sharing one 10Gbps NIC port, as shown in Figure 4.2. The VM guests connect to the NIC directly using Macvtap, a device driver in the Linux kernel that connects VMs to a physical NIC without the need for a virtual bridge, thereby improving performance.

Figure 4.2: The VM configuration. Every three VMs share a single physical NIC port.

4.2 Netperf

We chose to test our system with a well-known and widely used network benchmark, Netperf, which is used to test various aspects of network performance. We use Netperf with different configurations and perform various tests to simulate a real-life setup. In all tests the VMs were the sending side, each with a single flow.

TCP stream is a unidirectional transfer of data over a TCP/IP connection using Berkeley sockets. This test is a common measurement of the maximum throughput achievable in the system.

UDP request/response performs a sequence of single request and response transactions and measures the average number of transactions performed in one second. The data is sent and received over the UDP/IP stack. This test measures system latency.

TCP stream with varying message cost is an extension of the original Netperf TCP/IP stream test that we implemented. Our extension performs a busy-loop delay before sending data. This delay enables us to better simulate the time taken by the guest VM to produce a packet, as opposed to the original test, which sends the same packet over and over. For example, when the VM is running a web-server application, it takes time to produce the static or dynamic web pages to be sent over the network.
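The following is a minimal sketch of the kind of busy-loop delay our extension adds to the send path; it is illustrative only, and the delay parameter and function names are ours rather than the actual patch.

#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h>

/* Spin for roughly `cycles` iterations to emulate the CPU work an
 * application (e.g., a web server) performs before each send. The
 * volatile counter keeps the compiler from optimizing the loop away. */
static void busy_loop_delay(uint64_t cycles)
{
    for (volatile uint64_t i = 0; i < cycles; i++)
        ;
}

/* Send-side inner loop of the modified TCP stream test (sketch). */
void stream_send(int sock, const char *buf, size_t len, uint64_t delay_cycles)
{
    for (;;) {
        busy_loop_delay(delay_cycles);   /* cost of generating a message */
        if (send(sock, buf, len, 0) < 0)
            break;                       /* connection closed or error   */
    }
}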

4.3 Load Conditions

First, we differentiate between two possible states that the system can be in with regard to its load conditions. We call a system overcommitted when the combined number of sidecores and VCPUs is larger than the number of cores in the system. This definition is in line with the commonly used definition of "overcommit", which only takes into account the number of VCPUs in the system. Alternatively, a system can be undercommitted, i.e., the combined number of sidecores and VCPUs is equal to or less than the number of cores in the system.
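Stated as code, this condition is simply a comparison of core counts; the helper below expresses the definition above (the function name is ours, for illustration only).

#include <stdbool.h>

/* Overcommitted: sidecores plus VCPUs exceed the physical cores.
 * Equal or fewer means the system is undercommitted. */
bool is_overcommitted(int num_sidecores, int num_vcpus, int num_cores)
{
    return num_sidecores + num_vcpus > num_cores;
}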

Previous work [11, 12] concentrated on presenting and optimizing the sidecore approach for undercommitted systems with a single sidecore per socket. We concentrate on overcommitted systems with multiple sidecores per socket. Designing an I/O manager that uses the sidecore approach under such system conditions presents challenges that are different from those of an undercommitted system. The sidecore approach divides the cores in the system into IO-cores and VM-cores, so increasing the number of IO-cores causes the number of VM-cores to decrease. As such, adding more IO-cores does not necessarily increase the overall throughput of the system. To improve system performance, the I/O manager must find the best balance between the two.

Previous work developed some optimizations for the sidecore approach. In Sections 4.3.1 and 4.3.2 we present optimizations that improve performance in undercommitted systems, but do not work well in overcommitted systems.

4.3.1 ELVIS’ Stuck Cycles

The first such optimization is tied to the ELVIS fine-grained scheduler, and was disabled in this work. In ELVIS, each sidecore runs a single I/O worker thread, and this thread handles multiple I/O devices. The fine-grained scheduler runs in each I/O worker thread and decides when to switch between the different virtual I/O devices assigned to the thread. In most cases, the scheduler chooses the next device using round-robin semantics. These semantics, however, can be broken with the stuck-cycle heuristic, in which the scheduler can pause the handling of one virtual device and switch to another if it recognizes that the device has been stuck for a certain amount of time. A device is considered stuck if a set amount of time has passed since it received any new data. This heuristic was born from the need to balance workloads in which handling I/O for throughput-intensive queues can cause starvation of latency-sensitive queues. Although the heuristic works well in the scenario described above, we found that it has negative effects in our workloads. In our setup with co-located VMs, using the heuristic often causes switching to a VM that is not currently running. This hurts performance and increases latency. Accordingly, we decided to disable the stuck-cycles heuristic.
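For illustration, the sketch below shows how such a stuck-device check could fit into a round-robin worker. It is not the ELVIS scheduler itself; the budget, the threshold constant, and the helper functions are assumptions.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct vdev {
    struct vdev *next;       /* devices assigned to this I/O worker thread */
    uint64_t last_data_tsc;  /* last time this device produced new data    */
};

extern uint64_t read_tsc(void);                /* cycle counter             */
extern bool vdev_handle_some(struct vdev *d);  /* true if progress was made */

#define STUCK_THRESHOLD_CYCLES (1ULL << 20)    /* assumed tuning constant   */

/* Service one device for at most `budget` iterations; with the stuck
 * heuristic enabled, give up early if the device has produced no new
 * data for STUCK_THRESHOLD_CYCLES and let the next device run. */
static void service_device(struct vdev *d, int budget, bool stuck_heuristic)
{
    while (budget-- > 0) {
        if (vdev_handle_some(d)) {
            d->last_data_tsc = read_tsc();
            continue;
        }
        if (stuck_heuristic &&
            read_tsc() - d->last_data_tsc > STUCK_THRESHOLD_CYCLES)
            return;  /* device looks stuck: switch to another one */
        /* otherwise keep polling this device until its budget runs out */
    }
}

/* The worker round-robins over the devices assigned to its sidecore. */
void worker_schedule(struct vdev *devices, bool stuck_heuristic)
{
    for (struct vdev *d = devices; d != NULL; d = d->next)
        service_device(d, 64 /* assumed budget */, stuck_heuristic);
}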

4.3.2 VM Idle Mode: Poll or Yield

The second optimization is the behavior of the guest VM when idle. In pre-

vious work where each VCPU had a dedicated physical core, the VM would

never yield the physical CPU. This behavior improved both throughput and

latency, with latency showing a greater improvement. The reason for the

improvement in latency, as shown in [12], is that the guest VM can halt

between sending a request and receiving a response. In some cases, this halt

can dominate latency [12, 20, 19]. We did not use this method as it would

not be beneficial with co-located VMs since it does not yield the core when

an idle guest VM “wastes” CPU time that could have been used to schedule

more VMs. This effect can be seen in Figure 4.3, which shows that polling

improves the throughput by at most 11% when the VCPUs are not fully

utilized, but degrades performance by about 20% when the system is fully

utilized. A possible solution might be to optimize and dynamically change

the idle mode to poll.

Figure 4.3: 12 VMs and four sidecores, running a Netperf TCP stream: (1) Throughput, (2) Overall VCPU CPU usage.

Figure 4.4: Undercommitted system and four VM guests running Netperf UDP R/R: (1) Transactions per second, (2) Overall I/O CPU usage.

A better solution for balancing between having the guest halt when idle and not increasing latency is to use halt-polling, as in recent KVM implementations [20, 19]. With halt-polling, the guest polls for external interrupts and events for a short while before yielding the processor. This feature has been shown to have a good impact on latency, but was not used in this work because of compatibility issues with the version of the Linux kernel used.
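The halt-polling idea can be sketched as follows; this is an illustration rather than KVM's implementation, and the poll budget and helper primitives are assumptions.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers standing in for guest kernel primitives. */
extern bool work_pending(void);  /* pending interrupt or event          */
extern uint64_t read_tsc(void);  /* cycle counter                       */
extern void cpu_halt(void);      /* HLT: gives the physical core back   */
extern void cpu_relax(void);     /* pause hint inside the poll loop     */

/* Idle path of a VCPU with halt-polling: poll briefly so that an event
 * arriving right away is handled without the cost of halting and being
 * reawakened (interrupt injection plus VM entry); only then halt. */
void idle_with_halt_polling(uint64_t poll_cycles)
{
    uint64_t start = read_tsc();

    while (read_tsc() - start < poll_cycles) {
        if (work_pending())
            return;   /* an event arrived while polling */
        cpu_relax();
    }

    cpu_halt();       /* nothing arrived within the budget */
}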

4.4 Performance Metrics

While designing the dynamic manager, we identified two main performance metrics to optimize: throughput and latency. Although we evaluate latency both in Section 5.5 and here, our focus is on optimizing the overall throughput of the system.

We now show the results of not applying some of the optimizations mentioned in previous work. To this end, we focus on the latency of Netperf UDP R/R with four VMs and 1-4 sidecores. In this test case, the system is undercommitted. In Figure 4.4(2), we see that the four-sidecore configuration outperforms all other configurations, although only by a very small margin. The fact that the margin is small can be explained by the high number of VM exits that result from guest halts. The halts occur because the VM becomes idle while waiting for a response to arrive after each request is sent. The cost of the VM becoming idle and reawakening greatly affects the latency, as discussed in Section 4.3.2.

4.5 Number of Sidecores

The number of cores is an aspect that greatly affects system performance, as shown in Chapter 3, where the most noticeable difference between the configurations tested is the number of sidecores dedicated to I/O processing. We now focus on exploring this aspect.

To further show the effect of the number of cores, we present Figure 4.5.

The figure shows the results of a Netperf TCP stream with 4 and 12 VMs.

With 4 VMs, we demonstrate the behavior of an undercommitted system under a throughput-intensive workload, while with 12 VMs we show overcommitted system behavior. The two topmost graphs (a1, a2) in Figure 4.5

each show the results of three configurations. The first, base, is the baseline

Virtio implementation in Linux. The second, e1, is the ELVIS implementation with a single sidecore. The third, sbest, shows the performance of the

best sidecore configuration in terms of throughput. Each point along the

sbest line has a label detailing the number of sidecores in use.

We begin by focusing on the graphs in the left column, which display the

results for 12 VMs. Note that the number of sidecores and VCPUs is larger

than the number of available cores in the system. In this test we run all

sidecore configurations, from 1 through 7 sidecores. We see in (a1) that for

each message size, there exists a statically tuned sidecore configuration that

outperforms the baseline configuration. Moreover, there is no single sidecore

configuration that consistently gives the best performance. (b1) shows the

throughput normalized to the baseline configuration. We can see

that for 64B messages, a single sidecore outperforms the baseline by a factor

of 1.8x, while for 128B messages, the two-sidecore configuration outperforms


Figure 4.5: A Netperf TCP stream with various message sizes. Left column (panels a1-d1): 12 VMs; right column (panels a2-d2): 4 VMs. Rows: (a) throughput [Gbps]; (b) throughput normalized to the baseline; (c) I/O thread CPU usage [%] (overall and overhead); (d) VCPU thread CPU usage [%]; all vs. message size [Bytes]. Labels along the sbest line give the number of sidecores in use.


Figure 4.6: Comparison of CPU usage in the baseline no affinity configuration running a Netperf TCP stream with various message sizes: CPU usage [%] and throughput [Gbps] vs. message size, broken down into VM, effective I/O, and overhead I/O.

the baseline by 1.25x, and so on up to 4 sidecores for 16KB messages.

The right hand side of Figure 4.5 shows the results of running a Netperf

TCP stream on each different configuration using 4 VMs. Examination of

the figure reveals some interesting phenomena. For example, there is a large

leap in the performance of the baseline no affinity configuration between

2KB and 4KB message sizes. Additionally, we note that allocating more

sidecores is not always beneficial for all message sizes in terms of throughput. Lastly, it can be seen that the baseline configuration outperformed all

sidecore configurations with 4KB messages. We explain these phenomena

below.

In order to explain these phenomena, we begin by analyzing the big leap

in throughput of the baseline no affinity configuration between 2KB and

4KB message sizes. The main cause of this leap is the ability of the I/O

threads to use the CPU quantum allocated to them by the Linux scheduler

for I/O handling. This can be seen in Figure 4.6, which shows three kinds of

CPU usage: cycles executed on the VM guest, effective I/O, and overhead

I/O. Effective I/O is the cycles spent in the I/O threads processing the

I/O produced by the guest VMs, while overhead I/O is the rest of the cycles

spent in the I/O threads. Clearly, there is a great reduction in overhead I/O

for message sizes of 4KB and larger. The subsequent increase in effective

I/O accounts for the leap in throughput.
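The effective/overhead breakdown used above can be computed directly from per-thread cycle counters. The following is a minimal sketch under the assumption that such counters (total cycles consumed by an I/O thread and cycles spent actually processing packets) are available:

def io_cpu_breakdown(total_cycles: int, effective_cycles: int, cycles_per_second: float):
    """Split an I/O thread's cycles into effective and overhead I/O, in % of one core."""
    overhead_cycles = total_cycles - effective_cycles
    effective_pct = 100.0 * effective_cycles / cycles_per_second
    overhead_pct = 100.0 * overhead_cycles / cycles_per_second
    return effective_pct, overhead_pct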


4.6 Packet Batching

4.6.1 TCP Segmentation Offload Batching

The size of the packet sent by the guest affects performance greatly. The

smaller the packet size, the more processing per byte is required. TCP

segmentation offload (TSO) is a technique aimed at increasing a system’s

throughput by offloading some packet processing from the CPU to the physical NIC. TCP packets have to be segmented before transmission to the network because the maximum buffer size for the TCP/IP stack is 64KB, while the Ethernet maximum transmission unit (MTU) limits each segment's TCP payload to 1448B (plus 12B of TCP options). Segmenting the packets into up to 46 segments and

transmitting each one consumes CPU and increases other overhead such as

checksum calculations, packet copying costs, etc. Current TCP/IP stack

implementations, such as the one in the Linux kernel, try to take advantage

of this feature by aggregating a few small packets received by the application

and transferring them to the NIC in one big chunk.

In paravirtual I/O, the size of transmitted packets plays a pivotal part

in the overhead of virtualization. The bigger the packet, the higher the

throughput, because with larger packets, there are fewer VM exits and less

data is being copied. A Netperf TCP stream test is an example of this

behavior. The size of the message signifies the amount of work the Netperf

user space application has to perform in order to send data over the network.

Netperf works by pre-allocating a send buffer of the message size and then

sending it over and over again using the standard C library call send(). Each

send requires a system call to transfer the data from the user space to the

Linux kernel TCP/IP stack. The TCP/IP stack can then choose to either

buffer or send the message immediately. Thus, when the average packet

sent by the TCP/IP stack is constant, the difference between two message

sizes in Netperf is influenced only by the number of system calls required to

reach the packet size sent by the TCP/IP stack. In Figure 4.7, we see this

behavior.
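The Netperf behavior described above reduces to a loop of the following form (a simplified sketch; the host, port, and duration are placeholders). The message size only determines how many send() calls are needed to hand a given amount of data to the kernel's TCP/IP stack:

import socket
import time

def tcp_stream(host: str, port: int, message_size: int, duration_s: float) -> None:
    """Netperf-style TCP stream: pre-allocate one send buffer and send it repeatedly."""
    buf = b"\x00" * message_size                 # the send buffer, allocated once
    end = time.monotonic() + duration_s
    with socket.create_connection((host, port)) as sock:
        while time.monotonic() < end:
            sock.sendall(buf)                    # each call is a system call into the TCP/IP stack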

Figure 4.7 shows the measurement of throughput, CPU utilization and

average packet size sent when a single VM sends various message sizes using

a Netperf TCP stream. The VM uses three configurations of TSO: off, in

which the feature is turned off and the guest can send packets of up to

1.5KB; on, in which the feature is turned on at its default configuration


Figure 4.7: One VM with one sidecore running a Netperf TCP stream, with TSO off, on, and full: (a1) throughput [Gbps]; (a2) VM idle [% CPU]; (b1) average packet size [Bytes]; (b2) I/O threads do-nothing polling [% CPU]; (c1) average packets/sec; all vs. message size [Bytes].


Figure 4.8: 12 VMs with 1-2 sidecores running a Netperf TCP stream with 256B messages: average packet size [KB], CPU usage [%], and throughput [Gbps], each with TSO on and off.

which we fixed as detailed below; and full, in which the guest stack tries to

buffer packets as much as possible.

Additionally, current paravirtual implementations such as Virtio take

advantage of the TSO feature, passing large packets to the physical NIC without segmenting them whenever possible. We see in Figure 4.7 that the sidecore

maximum throughput is also dependent on the size of the packet sent.

During our examination, we came across a bug/oddity in the TCP/IP

stack in our guest VMs. We saw that the average message size was a lot

larger with three sidecores when compared to a four-sidecore configuration

for Netperf with message sizes of 1KB and 2KB. This, in turn, caused the

throughput using three sidecores to be higher compared to four sidecores,

though the CPU utilization was lower. During our investigation of the occurrence, we saw that in kernel 3.9, the mechanism responsible for batching packets in the TCP/IP stack chooses never to batch packets when the sending window is larger than 64KB. This behavior was changed in later kernels,

but occasionally still caused a regression in performance. As a result we

backported the patch to kernel 3.9 and tested the batching with the new

behavior.

The results of these tests are shown in Figure 4.8. As can be seen, three

sidecores perform better with 2KB messages in the default configuration.

The improvement stems from the fact that with three sidecores, the I/O

workers are more loaded, and as such, poll the virtual I/O devices less and

are able to handle larger packets each time.


Figure 4.9: A comparison between one- and four-sidecore configurations running a Netperf TCP stream with 64B messages. We compare: (a) the number of VM exits and their reasons; (b) average transmitted packet size; (c) VM guest CPU usage; (d) average guest interrupts per second.

4.6.2 Latency vs. Throughput

Next, we explain why allocating more sidecores does not always benefit

throughput, and why the baseline configuration outperforms all of the sidecore configurations when sending 4KB messages. All these phenomena derive from the same reasons, which we will discuss using the 64B message

size as an example. With 64B and four sidecores, we get a throughput of

2.8Gbps. While this throughput is better than that of the baseline under

the same message size, 2.43Gbps, it is much worse than the 3.82Gbps that

is achieved in the one-sidecore configuration.

To explain the results, we present Figure 4.9, which compares one and

four sidecores in terms of VM exits, guest CPU usage, average transmitted

packet size, and guest notifications. In Figure 4.9(1) we see that with four

sidecores, the guest VMs perform three times as many exits as they do

under one sidecore. The result is that the guest VM spends 3.3x as much

time in the hypervisor emulating these exits, as shown in Figure 4.9(2). The

increased time is directly related to the increase in VM exits, which play a

pivotal role in the overhead of paravirtual I/O, as shown by the slowdown

of the configuration in which more VM exits occur.

Another aspect that contributes to the slowdown is the average packet

size transmitted by the VM guest. Figure 4.9(3) shows that with one

sidecore, the average is 1.42KB, which is larger than the 1.24KB average


# sidecores   core 0       core 1       core 2       core 3        core 4    core 5        core 6      core 7
baseline (0)  0, 8         1, 9         2, 10        3, 11         4         5             6           7
1             all          0, 8         1, 9         2, 10         3, 11     4             5, 7        6
2             even         odd          0, 8         1, 9          2, 10     3, 11         4, 6        5, 7
3             0, 4, 8, 10  1, 5, 6, 9   2, 3, 7, 11  0, 9          1, 5      2, 6, 10      3, 7, 11    4, 8
4             0, 1, 2      3, 4, 5      6, 7, 8      9, 10, 11     0, 4, 8   1, 5, 9       2, 6, 10    3, 7, 11
5             0, 9         1, 5         2, 6, 10     3, 7, 11      4, 8      0, 4, 8, 10   1, 5, 6, 9  2, 3, 7, 11
6             0, 8         1, 9         2, 10        3, 11         4, 6      5, 7          even        odd
7             0, 8         1, 9         2, 10        3, 11         4         5, 7          6           all

Table 4.1: The static configurations used throughout the work. Each row corresponds to one configuration; for a configuration with n sidecores, the first n cells of its row are IO-cores (orange in the original) and the remaining cells are VM-cores (white in the original). The numbers in each cell are the IDs of the VMs assigned to that core, or of the virtual I/O devices handled by that core.

with four sidecores. The number of packets sent in each case is roughly the

same. We attribute the lower average packet size to the fact that with four

sidecores available, the I/O workers were able to poll the VM guest faster,

causing the buffers to be emptied faster. This results in smaller packet sizes,

because of the behavior of the Linux kernel inside the guest. Furthermore,

Figure 4.9(4) shows that the number of virtual interrupts injected to guest

VMs with four sidecores is 2.2x higher than that of the one-sidecore configuration.

To overcome the cost of VM exits, previous studies suggested an Exit-less

interrupt technique. While we did not use this technique, it would benefit

our system in cases similar to this experiment, as there are no co-located

VM guests. This method, combined with better virtual interrupt coalescing techniques, which we propose as future work, would help eliminate most, if

not all, factors hindering the performance with four sidecores.

4.7 Affinity

We now detail the static configurations of our system when using the sidecore

approach, and discuss some prominent decisions made throughout the configuration process.


4.7.1 Dividing the VMs Among Cores

In the sidecore approach, we divide the cores in our system into two groups:

IO-cores and VM-cores. There are several ways to partition the VMs among the VM-cores. We considered two possible schemes. The first option is the

no VM affinity configuration, in which all VMs can run on all VM-cores.

The second is the with VM affinity configuration, in which each group of

VMs is assigned to a dedicated core. We tested both configurations, as

shown in Figure 4.10. The results are mixed and do not show a clear-cut

winner in terms of throughput or fairness.

In column (a), we show the difference in the overall throughput of the

system between VMs with and without affinity. Positive numbers show that

the overall system throughput with VM affinity is better. We can see that

using VM affinity performs well in some cases, such as with three sidecores

with large message sizes and TSO enabled, while showing worse performance

in other cases.

We measured fairness as the average standard deviation in throughput

between the different VMs running. In column (b), we show the difference

between the average standard deviation with and without affinity. Here,

a positive value shows that using VM affinity provides less fairness, while

using no affinity provides more. Most of the unfairness, however, is in the

baseline configuration, and in the fact that the VM VCPUs must be divided

unevenly when using VM affinity.

We chose to use the configurations with VM affinity as we wanted to

maximize the overall system throughput in spite of the relative lack of fairness. To test this feature on the baseline configurations, we tested two

configurations: with and without VM affinity. We show the configuration

that performed better.

4.7.2 Dividing the Virtual I/O Devices Among Sidecores

In the traditional paravirtual approach, each virtual I/O device has a dedicated worker thread that emulates it. In the sidecore approach, an I/O

worker thread handles multiple virtual I/O devices and has a dedicated core

in the system upon which to run.

When tuning the system, we statically divided the virtual I/O devices

among the different I/O workers. The division is straightforward and is


Figure 4.10: 12 VMs running a Netperf TCP stream, with TSO off (a), on (b), and full (c). Column 1: the difference in throughput [Gbps] between running with and without VM affinity; a positive value means that using VM affinity outperformed no VM affinity. Column 2: the difference in the per-VM standard deviation [Gbps] with and without affinity; a positive value means that the throughput varies more across VMs with affinity than without.


presented in Table 4.1. The main idea behind the division is to create as much

symmetry in the system as possible. As mentioned in Section 4.7.1, we

set the VM guest core affinity to a specific core. For the baseline Virtio

configuration, we divided the VM guests as equally as possible among the

VM-cores.

In the baseline configuration, we used a single core, which is the traditional approach. We ended up using VM-core affinity for consistency and

fairness. We wanted to create as much symmetry as possible in the system,

which worked well except for one particular place (explained below).

Interrupt Affinity    Physical NICs generate interrupts, which can be directed to a physical CPU core. Each NIC port has its own distinct interrupt

vector. In our configuration, there are four NIC ports, giving a total of

four interrupt vectors, one per NIC port. For the baseline configuration, we

assigned each vector to its own core, with the vector of NIC port 0 assigned

to core 0, vector 1 to core 2 and so on.

When using the sidecore approach, we divided the interrupt vectors’

affinity among the sidecores. We assigned each NIC port's interrupt handler to

a sidecore handling virtual I/O devices communicating with this NIC port.
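On Linux, pinning an interrupt vector to a core amounts to writing a CPU mask to /proc/irq/<irq>/smp_affinity. The sketch below illustrates the idea; the IRQ numbers and their mapping to NIC ports are assumptions of the example, not values from our testbed:

def set_irq_affinity(irq: int, core: int) -> None:
    """Pin interrupt vector `irq` to CPU `core` by writing a CPU mask to procfs."""
    mask = 1 << core                                       # one bit per CPU in the mask
    with open(f"/proc/irq/{irq}/smp_affinity", "w") as f:
        f.write(f"{mask:x}\n")

# Hypothetical usage: four NIC-port vectors pinned to the sidecores that
# service the virtual I/O devices communicating with those ports.
# for sidecore, irq in enumerate([40, 41, 42, 43]):
#     set_irq_affinity(irq, core=sidecore)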

Configuration Symmetry    During our investigation, we noticed an interesting case of symmetry affecting performance. We compared two configurations with four sidecores, as shown in Table 4.2. In the first configuration, mirrored, the VM guest affinity matches the virtual device IO-core

affinity. All guests on a single core are handled by the same IO-core and

are connected to the same physical NIC port. This configuration is highly

symmetrical compared to the other configuration we tested, spread. In the

spread configuration, the affinity of the guest remains the same but guests

connected to a single NIC port are divided among multiple IO-cores. As

shown in Figure 4.11, we observed an increase in performance of up to 1.8x

using the spread configuration with the default TSO settings. As a result,

we opted to use the spread configuration throughout our experiments.
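The difference between the mirrored and spread layouts can be expressed as a small assignment function. This is an illustrative sketch that assumes the numbering of Table 4.2 (12 devices, four IO-cores, and device i belonging to VM i, with devices sharing a NIC port when their IDs are equal modulo four):

def assign_devices(num_devices: int = 12, num_io_cores: int = 4, mirrored: bool = True):
    """Return a mapping io_core -> list of device IDs for the mirrored or spread layout."""
    layout = {core: [] for core in range(num_io_cores)}
    per_core = num_devices // num_io_cores
    for dev in range(num_devices):
        if mirrored:
            # Mirrored: a device follows its VM's core, so all devices on one
            # IO-core share the same physical NIC port (dev % num_io_cores).
            core = dev % num_io_cores
        else:
            # Spread: devices that share a NIC port end up on different IO-cores.
            core = dev // per_core
        layout[core].append(dev)
    return layout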


core        0            1            2            3             4           5           6            7
Mirrored    Dev 0, 4, 8  Dev 1, 5, 9  Dev 2, 6, 10 Dev 3, 7, 11  VM 0, 4, 8  VM 1, 5, 9  VM 2, 6, 10  VM 3, 7, 11
Spread      Dev 0, 1, 2  Dev 3, 4, 5  Dev 6, 7, 8  Dev 9, 10, 11 VM 0, 4, 8  VM 1, 5, 9  VM 2, 6, 10  VM 3, 7, 11
(Cores 0-3 are IO-cores; cores 4-7 are VM-cores.)

Table 4.2: Two static configurations with four sidecores. (1) Mirrored: the core affinity of the VMs and of their respective virtual I/O devices is mirrored, and all devices serviced by an IO-core are connected to the same physical NIC port. (2) Spread: the configuration is non-symmetrical, and I/O devices connected to the same physical NIC port are serviced by different IO-cores.

Figure 4.11: 12 VMs with four sidecores running a Netperf TCP stream; throughput [Gbps] with (1) 1KB and (2) 2KB messages, under no, default, and full TSO. Mirrored: device affinity = VM affinity. Spread: the devices belonging to VMs with the same affinity are assigned to different sidecores. Spread outperforms mirrored by up to 1.8x with default and full TSO; with no TSO, mirrored outperforms spread by about 1.1x.


Figure 4.12: A Netperf TCP stream with 1KB messages: a comparison of the VM guests running with the posted-interrupts feature enabled vs. disabled, for the base and e1-e4 configurations: (a1) throughput [Gbps]; (a2) VCPU thread CPU usage [%]; (b1) VM exits; (b2) halt exits.


4.8 Exit-less Interrupts & Polling I/O Devices

Host–Guest Notifications in Co-located VMs When the host wants

to cause an interrupt inside one of its guests, it first sends an Inter-Processor Interrupt (IPI) to the core on which the VM is currently running.

The signal causes a VM exit. During the exit, the host injects a virtual

interrupt and then resumes the guest. In previous works in which the

environment was configured statically and the VMs were not co-located, the

Exit-less Interrupts (ELI) technique, or similar techniques such as posted

interrupts, gave a clear benefit when used. These techniques work by overriding a running guest's interrupt descriptor table (IDT) so that all interrupts,

including IPIs, can be sent directly to the running VM guests. Each VM

guest only handles interrupts directed to itself. Any other interrupts received

cause a VM exit to the hypervisor. Thus, when the VM has a dedicated core,

all interrupts can be sent to this core without causing a VM exit. These

techniques have proven to be beneficial.

In our work, we want a dynamic system in which the cores available to the

VMs vary over time, and which falls back to the traditional approach when

the load is low. When faced with co-located VMs, the Exit-less interrupts

technique falters, since, often, a currently running VM is interrupted when

the hypervisor tries to inject an interrupt not addressed to this VM. This

makes the technique less beneficial in reducing the number of VM exits and

in improving performance. To emphasize this point, we present Figure 4.12,

which shows the difference in throughput and exits of a Netperf TCP stream

with a message size of 1KB. In the graphs, we see that the executions look

similar both in terms of throughput and number of VM exits. We note that

when using this feature, we saw a decrease in throughput with 64B messages

and an increase with 16KB messages.

There may be ways to incorporate the Exit-less notifications technique

in a more advanced form that takes into account the currently running VM,

and thus reap the benefit of injecting the interrupt. Another possible

optimization is to add a dynamic component that turns on posted interrupts when the system is undercommitted. These optimizations, however,

are outside the scope of this work and are left as future work.


Polling the Physical NIC    The guest VMs send data to the load generator. We chose this direction for two important reasons. The first is that

we found that sending the data is much more CPU-intensive than receiving

data, and as a result, our load generator became the bottleneck. The second

reason is that our work is based on the ELVIS implementation, and so it

only polls the VM guest and not the physical NIC. As such, guest VMs only

send data, and do not receive, since sending is the more optimized I/O path. We did

implement a version of ELVIS that polled the physical NIC; however, we

failed to see an improvement in performance in terms of either throughput

or latency. We suspect that the lack of improvement is due to our implementation not being fully optimized. Consequently, we decided to omit the

results from the current work and test only the more optimized I/O path.


Chapter 5

Number of Sidecores

Our goal is to develop a method to determine the optimal number of sidecores, whereby an “optimal” configuration maximizes throughput in an overcommitted environment. To this end, we systematically explored a large

number of configurations and analyzed their performance using micro and

macro benchmarks. Micro refers to a Netperf stream with varying message

sizes. Macro refers to Apache with varying HTML file sizes. Our analysis uncovers a model, denoted the Butterfly model, which guides our design

of the sidecore manager. We use this model to explain the throughput of

different setups.

In this chapter, we present the Butterfly model and use it to show

that eliminating wasted CPU cycles increases the system performance. We

demonstrate this observation through the use of micro benchmarks. In Section 7.3 we will evaluate the dynamic I/O manager with macro benchmarks.

5.1 The Butterfly Model

The Butterfly model is best demonstrated using a simple test case of 12 VMs

running a Netperf TCP stream with 1KB messages. Figure 5.1 shows that

the maximum throughput is achieved when the fewest CPU cycles are wasted. Additionally, it can be seen that in each configuration, the bottleneck is either the CPU usage of the sidecores or the CPU usage of the VCPU threads. When the number of sidecores is low, only the

VM-cores have wasted cycles, and when the number of sidecores is high, the


Figure 5.1: A Netperf TCP stream with messages of size 1KB, tested under various sidecore configurations (e1-e7): sidecore and VM-core CPU usage [%] and throughput [Gbps] vs. the number of sidecores.

Figure 5.2: A Netperf TCP stream with varying message sizes, tested under various sidecore configurations (e1-e7): (1) throughput [Gbps]; (2) wasted CPU [%].

VM-cores are saturated and the sidecores have wasted cycles.

This behavior can be observed for almost all message sizes, as can be

seen in Figure 5.2, which shows the effect of each message size for each sidecore

configuration. Figure 5.2(1) reflects the aggregated throughput, and the

graph on the right shows the wasted CPU cycles. The observation holds

for all message sizes other than 256B, in which the wasted CPU utilization

is lowest for three IO-cores and the most throughput is achieved with two

IO-cores.

This anomaly is a side effect of the inherent inefficiency of our configuration for three IO-cores when compared to two IO-cores. The inefficiency stems from the fact that the three IO-core configuration causes an


asymmetric allocation of resources amongst the available cores: four NICs must be

divided among three IO-cores instead of two, and 12 VMs must be divided

among five cores instead of six.

The asymmetric allocation does not affect other message sizes to the

same extent. In the case of message sizes larger than 256B, the peak is

not reached with two IO-cores. On the other hand, message sizes smaller

than 256B require fewer CPU cycles on three IO-cores than they do on two

because the VMs cannot produce as much I/O traffic.

The inverse similarity between the peaks in the two graphs suggests that

we can use simple CPU-usage thresholds to determine when to change the

number of IO-cores in the system. The thresholds are defined for the VM

guests and for the IO-cores. When increasing the number of IO-cores, we

are effectively decreasing the number of VM-cores. Thus, we require two

thresholds in order to increase the number of IO-cores: one indicating that

the IO-cores need more computation power than currently available and

another indicating that the VM guests are not utilizing all cores allocated

to them and so will not be hurt by relinquishing a core.
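A minimal sketch of this two-threshold check is given below; the threshold values and the averaged CPU-usage inputs are placeholders rather than the tuned values used by our manager:

IO_BUSY_THRESHOLD = 0.90   # IO-cores need more compute (placeholder value)
VM_IDLE_THRESHOLD = 0.70   # VM-cores can spare a core (placeholder value)

def should_add_io_core(io_core_usage: list, vm_core_usage: list) -> bool:
    """Add an IO-core only if the IO-cores are saturated AND the VM-cores are under-utilized."""
    io_busy = sum(io_core_usage) / len(io_core_usage) >= IO_BUSY_THRESHOLD
    vms_can_spare = sum(vm_core_usage) / len(vm_core_usage) <= VM_IDLE_THRESHOLD
    return io_busy and vms_can_spare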

5.2 Cost of Generating a Message

We tested the effects of running a Netperf TCP stream while varying the

cost of generating messages. To this end, we added a busy-loop delay before

transmitting data. The delay helped us to better simulate a guest VM that

performs computation before transmitting packets on the network. The

various delays were tested for packets of size 2KB. The results in Figure 5.3

row (a) show that the sidecore approach outperforms the baseline by 1.22x-

1.73x.
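The varying message-generation cost was emulated with a busy-loop delay executed before each transmission. A rough sketch of such a delay (approximating cycles with the monotonic clock and an assumed clock rate) is:

import time

def busy_delay(cycles: int, cycles_per_ns: float = 2.4) -> None:
    """Spin for roughly `cycles` CPU cycles, simulating per-message computation in the guest."""
    end = time.monotonic_ns() + int(cycles / cycles_per_ns)
    while time.monotonic_ns() < end:
        pass    # burn CPU without sleeping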

One notable result is the peak reached with a 10000 cycle delay, as seen

in the normalized throughput shown in Figure 5.3 row (b). The peak is

achieved because all the sidecores are fully utilized.

We note that with larger delays, the sidecore configuration’s relative

improvement increases. This finding is in line with our earlier observation

that the inefficiency is caused by the Linux scheduler.


Figure 5.3: A Netperf TCP stream with a varying message-generating cost [cycles]: (1) throughput [Gbps]; (2) throughput normalized to the baseline; shown with TSO on and off.


5.3 I/O Interposition

I/O interposition is enabled by paravirtual I/O, and provides many capabilities such as deep packet inspection and software-defined networking. To perform

these tasks, the sidecore consumes more CPU cycles per packet. To simulate

the behavior of the system under different interposition loads, we added a

busy-loop delay inside the packet handling function of vHost. The delay is

applied per packet. In Figure 5.4, we show the results of the experiment in

which we tested various delay lengths in order to simulate different types

of interposition. The experiment was executed on three different message

sizes: 64B, 1KB and 16KB. Each combination was measured both with and

without TSO. As shown in the graphs, the number of sidecores used in

the best static configuration generally increases with the number of cycles.

Consider, for example, messages of size 64B without TSO, in which the

number of sidecores increases from one to six. We also observe that for

a large interposition delay, the baseline performs on par with the sidecore

approach, especially when the best static configuration is seven sidecores

(e7). This makes sense, given that the load of processing the I/O in the sidecores increases dramatically and, in some cases, may consume most of the socket's cores.

5.4 Throughput and Scalability

We tested the system under different load conditions, starting from an undercommitted system with a single VM and a single IO-core up to an overcommitted system with 12 VMs and seven sidecores. We ran a Netperf TCP

stream with 64B, 1KB, and 16KB message sizes on 1-12 VMs. As mentioned above, a system is overcommitted when the combined number of sidecores

(IO-cores) and VCPUs is larger than the number of cores. The system we

tested contains eight cores, and as such, for a single-sidecore configuration,

the system becomes overcommitted when using eight or more VCPUs. In a similar fashion, the system is overcommitted in a two-sidecore configuration

when using seven or more VCPUs, etc. The results, depicted in Figure 5.5,

show that when the system is overcommitted, the throughput does not degrade with the increase in the number of VMs. Thus, we conclude that using 12 VMs is a good representative of an overcommitted system, as the number


Figure 5.4: A Netperf TCP stream with various interposition delays [cycles], for 64B, 1KB, and 16KB messages, each with TSO on and off (throughput in Gbps).


Figure 5.5: All configurations with 1-12 VM guests running a Netperf TCP stream with different message sizes: (a1) 64B; (a2) 1KB; (b1) 16KB; throughput [Gbps] vs. the number of VMs.


of co-located VMs does not affect performance.

Additionally, we see that with messages of size 64B, a single-sidecore

configuration improves throughput over the baseline by a factor of 1.2x with

a single VM and up to two times with seven VMs. With 1KB messages, we

again see that the sidecore approach outperforms the baseline by a factor

of 2.28x with a single sidecore and two VM guests, and down to about 1.1x

with 8-12 VM guests and three sidecores. This indicates that in the case

of 1KB messages, unlike the case of 64B messages, there is no single best

sidecore configuration for all loads.

Lastly, it is evident from our measurements that when using one or two

sidecores, each sidecore reaches a maximum throughput of around 7.3Gbps.

When using three sidecores, the utilization is lower due to the configuration

being less efficient, as there are four physical NICs that must be divided

among the three sidecores. For messages of size 16KB, we see a linear

improvement in performance as the number of VMs increases. Further, we

see that the sidecore approach outperforms the baseline by up to 1.3x with

a large number of VM guests.

We conclude that the sidecore approach scales with the number of VM

guests when the system is undercommitted. For overcommitted systems, 12 is a representative number of VMs.

5.5 Latency and Scalability

Previous works have already explored the scalability of the sidecore approach

with a single sidecore per socket. They showed that it scales well compared

to the baseline.

Using a Netperf UDP request/response as a benchmark, we tested the

effect on latency of having multiple sidecores in a single socket. As in Section 5.4, we used 1-12 VMs. Figure 5.6 shows the average latency per VM.

The figure shows that the six- and seven-sidecore configurations had an increase in latency of up to 3.2x and 6.8x, respectively, when compared to a

single VM. In both configurations, the VM-cores were a bottleneck, as they

could not handle the increase in the I/O load coming from increasing the

number of VMs.

Figure 5.6 also shows that the baseline configuration achieved an increase of 2x with co-located VMs and 1.4x without, whereas the


Figure 5.6: All configurations with 1-12 VM guests running a Netperf UDP request/response: average latency per VM [µsec] vs. the number of VMs.

best sidecore configuration (two sidecores) achieved a 1.2x increase with co-located VMs and 1.3x without. We conclude that our sidecore configurations

scale better than the baseline even without co-located VMs, and that the

results further improve with co-located VMs. We see that the two-sidecore

configuration outperforms all other configurations, though the system is not

fully utilized. We conclude that in an overcommitted system, the number of

VMs plays a large part in influencing latency.


Chapter 6

Design and Implementation

We implemented our design as a user space application running on the

Linux OS. Our implementation communicates with the ELVIS system, which

we modified and enhanced. The ELVIS system was implemented inside the

KVM/QEMU hypervisor based on the Vhost-net kernel module. The Vhost-

net module enables KVM to offload servicing of the virtual I/O device to

a kernel module, reducing context switches and packet copies in the virtual

data path.

6.1 Generalizing I/O Threads for Sidecores

In the KVM/QEMU hypervisor, Virtio I/O devices have three major components: the in-guest driver, the device back-end in the host, and a backing

device. The backing device is either a software or a physical device. The in-

guest driver and the back-end communicate via a shared-memory ring buffer

and ioeventfds, while the back-end and the backing device communicate

through a file-descriptor-like interface.

There are two implementations of the Virtio I/O devices’ back-end: one

in user space and one in kernel space. The kernel space implementation is

named VHost. VHost is faster than the user space implementation as it

does not require costly context switches between user and kernel space.


Algorithm 6.1: vhost.h in ELVIS (and SidecoreM)

  /* A vhost work-item. */
  struct vhost_work {
      // ...
  };

  /* The virtqueue structure describes a queue attached to a device. */
  struct vhost_virtqueue {
      struct vhost_dev *dev;
+     /* statistics */
+     struct {
+         // ...
+     } stats;
+     /* contains a list element as well as other related data */
+     struct {
+         // ...
+     } vqpoll;
  };

  /* The device structure describes an I/O device working with vhost. */
  struct vhost_dev {
-     /* the work-item list of the worker */
-     struct list_head work_list;
-     /* The pointer to the thread task_struct */
-     struct task_struct *worker_thread;
+     struct vhost_worker *worker;
      // ...
      struct vhost_virtqueue *vqs;
      int nvqs;
+     /* statistics */
+     struct {
+         // ...
+     } stats;
  };

+ /* Worker structure, a worker thread performing the real I/O operations.
+  * The worker data is no longer part of the device. */
+ struct vhost_worker {
+     /* the work-item list of the worker, moved from vhost_dev */
+     struct list_head work_list;
+     int id;
+     /* The pointer to the thread task_struct */
+     struct task_struct *worker_thread;
+     /* A list of virtqueues currently being polled by this worker */
+     struct list_head vqpoll_list;
+     /* statistics */
+     struct {
+         // ...
+     } stats;
+ };


The VHost implementation presented in Algorithm 6.1 defines three main data structures: vhost_work, vhost_dev, and vhost_virtqueue. vhost_virtqueue is a data structure that holds pointers to the shared memory buffers. vhost_dev is the back-end of a virtual I/O device. vhost_dev contains worker_thread, a task_struct pointer to the kernel worker thread handling this device.

vhost_dev also contains a list of vhost_work items named work_list. vhost_work is a work-item processed by a worker thread. There are multiple types of vhost_work. Chief among them is a work-item containing a notification

from either the in-guest driver or the backing device. For network devices,

notifications are generated from the in-guest driver if no prior notification

is pending and the back-end has enabled notifications. The conditions for

issuing a notification are met when a packet is queued to an empty shared

memory-buffer. Upon receiving the first incoming packet, the backing device

generates a notification. When it has finished sending all of the packets

queued by the back-end, it does the same.

Two main modifications must be made to ELVIS in order to allow for sidecores. These modifications are marked in Algorithm 6.1. The first modification is to enable a single kernel thread to handle

multiple back-ends. The second is to allow the kernel thread to poll the

guest for outgoing traffic.

We enabled the single kernel thread modification by creating a new struct called vhost_worker, into which we moved the work_list and the task_struct pointer to the kernel worker thread. The worker pointer in the vhost_dev struct was then changed to point to the vhost_worker that processes this device. Lastly, we queue all of the work-items that were previously queued in the vhost_dev's work_list in the vhost_worker's work_list.

Enabling the kernel thread to poll guests was achieved by adding to vhost_worker a new list, vqpoll_list, comprising the currently polled vhost_virtqueues. In its main loop, the vhost_worker thread first checks for a work-item in the work_list, and then picks a vhost_virtqueue to poll. The worker stops polling the virtqueue when it passes the maximum allowed processing length, or when there is another queue that needs to be polled. Polling is done, for the most part, in round-robin fashion.
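The following is a user-space Python model of this main loop, given only to illustrate the control flow (it assumes a hypothetical virtqueue object with a poll_once() method and callable work-items); it is not the kernel code:

from collections import deque

class SharedWorker:
    def __init__(self, max_batch: int = 64):
        self.work_list = deque()     # pending work-items (notifications)
        self.vqpoll_list = deque()   # virtqueues currently being polled
        self.max_batch = max_batch   # maximum allowed processing length per queue

    def main_loop_iteration(self) -> None:
        # 1) Drain pending work-items first.
        while self.work_list:
            work = self.work_list.popleft()
            work()                                  # process the work-item
        # 2) Then poll one virtqueue, round-robin, bounded per visit.
        if self.vqpoll_list:
            vq = self.vqpoll_list[0]
            self.vqpoll_list.rotate(-1)             # move it to the back of the rotation
            for _ in range(self.max_batch):
                if not vq.poll_once():              # nothing left to process in this queue
                    break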


6.2 Dynamically Relocating Devices between I/O

Threads

Implementation of our dynamic design requires that we be able to quickly

switch from the traditional paravirtual model to the sidecore model. Additionally, we must be able to control the transition from within our user space

implementation. In order to achieve these goals, we added two enhancements

to ELVIS: (1) The ability to move live virtual I/O devices between workers

without pausing the service to the device; and (2) a virtual file system based

interface.

Switching from the traditional model to a single sidecore configuration

entails moving from a configuration in which each worker thread emulates

a single virtual I/O device, to one in which each worker thread emulates

multiple devices. Three steps are required in order to switch between the

configurations. First, we move all virtual I/O devices to a single worker

thread. We then stop and kill all other active worker threads in the system.

Lastly, we dedicate a core and clear it for our use by moving all currently

running VM guest vCPUs’ threads to other available cores according to the

new sidecore configuration.

Switching from the sidecore back to the traditional model is done by

creating new worker threads until we have a worker dedicated to each virtual

I/O device. Once enough worker threads have been created, each virtual

I/O device is assigned to one thread. Finally, we adjust the affinity of both

the vCPUs’ threads and of the worker threads according to the traditional

configuration.
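Using the sysfs interface described in Section 6.3, the switch to a single shared worker can be sketched as follows. This is a sketch only: the paths follow Figure 6.1 and Algorithm 1, while pinning the shared worker to its IO-core and relocating the VCPU threads are omitted:

import glob
import os

SYS = "/sys/class/vhost"

def write(path: str, value: str) -> None:
    with open(path, "w") as f:
        f.write(value)

def switch_to_single_sidecore() -> None:
    """Sketch: collapse all per-device I/O workers into one shared worker."""
    workers = sorted(os.path.basename(w)
                     for w in glob.glob(f"{SYS}/worker/*") if os.path.isdir(w))
    target, others = workers[0], workers[1:]
    for w in others:                               # keep new devices off the workers to be removed
        write(f"{SYS}/worker/{w}/locked", "1")
    for dev in glob.glob(f"{SYS}/dev/*"):          # move every device to the shared worker
        write(f"{dev}/worker", target)
    for w in others:                               # remove the now-empty workers
        write(f"{SYS}/worker/remove", w)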

The ability to move virtual devices between I/O threads without disrupting or stopping them, even during polling, is the key to switching rapidly

between models. The move is facilitated using each worker’s work queue.

Each I/O worker has a dedicated work queue in which it holds pending work-

items. The work queue of a traditional virtual NIC worker thread contains

two main work-item types. The first type is queued when the guest VM

wishes to notify the host of outbound packets. The second is added when

the physical NIC notifies the host of incoming packets.

In SidecoreM, we added two new work-item types: one that tells the

worker to detach a specified device that is currently assigned to it, and one

that adds a detached device to the worker. Using the new work-item types,


each worker thread can manipulate the list of devices attached to it. This

enables us to enforce an important convention: Only a worker thread may

manipulate its own devices. In this fashion, we reduce the amount of locking

required.

Moving a device is done in the following stages: (1) To start the transfer,

we set a flag indicating that this device is about to move. As a result, we

cannot initiate a new move while one is currently in progress. Additionally,

all future incoming work-items for this device are temporarily added to an

auxiliary queue until the move is complete. (2) We add a work-item telling

the worker to detach the device. When this work-item is reached, the worker

must have already performed all the work for this device present in the

queue. Since a device can only be manipulated by the worker to which it is

assigned, we now know that the device is currently not in use. (3) The source

worker adds a work-item to the work queue of the destination worker, telling

it to assign the device to itself. (4) When the destination worker dequeues

this work-item from its work queue, it assigns the device to itself and then

handles the accumulated work-items in the auxiliary queue. This is safe as

the device is currently not in use. (5) To conclude the operation, we set the

transfer flag to false.
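The hand-off can be modeled in user space as follows. This is a simulation of the five stages only (it assumes worker objects with devices and work_list attributes, as in the earlier sketch), not the kernel implementation:

from collections import deque

class MovableDevice:
    def __init__(self, dev_id: int):
        self.id = dev_id
        self.in_flight_move = False     # stage (1): set while a move is in progress
        self.aux_queue = deque()        # work-items arriving during the move

def move_device(dev, src_worker, dst_worker) -> None:
    """Stages (1)-(5): hand a live device from src_worker to dst_worker."""
    assert not dev.in_flight_move                   # (1) only one move at a time
    dev.in_flight_move = True                       #     new work now goes to dev.aux_queue

    def attach():                                   # (4) runs on dst; the device is unused here
        dst_worker.devices.append(dev)
        dst_worker.work_list.extend(dev.aux_queue)  #     replay work accumulated meanwhile
        dev.aux_queue.clear()
        dev.in_flight_move = False                  # (5) move complete

    def detach():                                   # (2) runs on src after all prior work for dev
        src_worker.devices.remove(dev)
        dst_worker.work_list.append(attach)         # (3) ask the destination to adopt the device

    src_worker.work_list.append(detach)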

6.3 Communicating with Userland

The I/O manager initiates the movement of virtual I/O devices when switching from one system configuration to the other. It does so by using sysfs and

other interfaces we added to ELVIS. sysfs is a virtual file system provided

by the Linux kernel and designed to export information about various kernel

subsystems, such as KVM, to user space. This interface also allows us to

set various configurations by writing to the virtual files.

The interface we export is shown in Figure 6.1. The root directory is

located at /sys/class/vhost and is divided into three major subdirectories:

worker, dev, and vq.

The worker subdirectory is the root for all files and directories that handle

I/O worker threads. It contains a subdirectory for each worker thread,

enabling us to read various statistics and counters related to the respective

worker. Additionally, there are two files in the worker subdirectory that

enable addition and removal of workers.


vhost .............................................. /sys/class/vhost
    worker
        create
        remove
        <worker id> .................. an I/O worker thread with this id
            (various statistics)
            cpu
            pid
            dev_list
    dev
        <dev id> .................................. a virtual I/O device
            worker
            owner
            vq_list
    vq
        <vq id> ....................................... a virtual queue
            (various statistics)
            dev
            poll

Figure 6.1: Prominent files and directories in the sysfs interface.


Similarly, the dev subdirectory contains a separate subdirectory for each

virtual I/O device present in the system. Each device subdirectory contains

files that handle the moving of the device between I/O workers, along with

files for reading data and statistics related to the device.

The third subdirectory, vq, handles the virtual queues. Each virtual I/O

device has one or more virtual queues. These queues are the communication

channel between the guest and the host. Here, too, there exists a separate

subdirectory for each queue. Each such subdirectory contains a file for

controlling whether the assigned worker polls the queue.

As an example of working with our API, we present the following algorithm for removing an I/O worker thread. Assume that we have two worker threads whose IDs are w.1 and w.2, as well as one device d.1 that is assigned

to worker w.1. In Algorithm 1 we show the shell commands necessary for

removing w.1. We first lock w.1 so that no new devices can be assigned to it.

We then move all devices assigned to w.1 to other workers. In our example,

the only device assigned is d.1, which we move to worker w.2. Finally, after

all devices have been assigned to other workers, we can remove w.1 by writing its ID to the remove file. We maintain a single remove file rather than

a unique file in each worker directory in order to avoid a deadlock that may

arise from the fact that removing the worker entails removing its directory.

Algorithm 1 Removing I/O worker w.1 using bash shell commands. In our system we have two workers, w.1 and w.2, as well as one device, d.1. Device d.1 is assigned to worker w.1.

echo 1 > /sys/class/vhost/worker/w.1/locked
echo "w.2" > /sys/class/vhost/dev/d.1/worker
echo "w.1" > /sys/class/vhost/worker/remove

The various counters and statistics exposed through this interface guide

the decisions of our I/O manager. Throughout the development process, we

saw that this interface is relatively slow when it is used several hundred times

a second. Moreover, we saw that reading a single file takes 133 microseconds

(hundreds of thousands of cycles). Thus, in the original implementation, reading these files dominated the run time of our I/O manager. To reduce the overhead, we implemented a system call that copies the data directly from kernel space. This mechanism reduced our overhead by a few orders of

magnitude, to around four hundred cycles per read.


Another issue that came up during implementation was related to the effective CPU usage inside our I/O worker. Our I/O manager uses this statistic very frequently, and it is measured by reading the timestamp counter

before and after each packet is processed.

Measuring effective CPU usage in this fashion is efficient, but is inaccurate when asynchronous events such as hardware interrupts occur during the measurement. The time taken to handle these interrupts is wrongly attributed to packet processing by the timestamp counter. In order to correct the measurement, we compute the number of concurrent asynchronous events by subtracting the system-wide interrupt count read before each measurement from the count read after it.

The I/O manager estimates the CPU time consumed by these events and subtracts it from the effective CPU usage derived from the counter. Although this estimate is not perfectly accurate, it is cheap to compute and accurate enough for our purposes.
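The following sketch illustrates the correction. It is not the kernel code used in our implementation: time.perf_counter_ns() stands in for the timestamp counter, /proc/interrupts is used as the interrupt count, and the per-interrupt cost is an assumed constant.

import time

# Assumed (hypothetical) average cost of servicing one interrupt, in nanoseconds.
IRQ_COST_NS = 2000

def interrupt_count():
    # Sum all per-CPU interrupt counters from /proc/interrupts (Linux).
    total = 0
    with open("/proc/interrupts") as f:
        for line in f.readlines()[1:]:
            total += sum(int(tok) for tok in line.split() if tok.isdigit())
    return total

def effective_time_ns(process_packet):
    # Read the counters before and after processing one packet.
    t0, irq0 = time.perf_counter_ns(), interrupt_count()
    process_packet()
    irq1, t1 = interrupt_count(), time.perf_counter_ns()
    # Discount the estimated time spent servicing interrupts that fired
    # during the measurement window.
    return max(0, (t1 - t0) - (irq1 - irq0) * IRQ_COST_NS)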

6.4 The I/O Manager

Based on the analysis of the virtual I/O behavior, we designed and im-

plemented a dynamic virtual I/O manager. We took three main design

considerations into account when implementing the manager: (1) When is

it beneficial to use the sidecore approach over the traditional one? (2) When

should the number of IO-cores be increased or decreased if the sidecore ap-

proach is used? (3) How should the virtual I/O devices be divided among

the IO-cores in the sidecore approach? To answer these questions, the I/O

manager takes into account system-wide statistics, e.g., CPU utilization.

The I/O manager then makes configuration changes based on the thresholds and policies described below.

6.4.1 Architecture

The manager works by periodically polling the statistics exported by the I/O

threads, along with other system information such as CPU utilization of the

guest VMs’ VCPUs. We present its main-loop structure in Algorithm 6.2.

We start the main loop by reading system statistics, and then calculating

the load on the system in terms of CPU utilization and overall throughput.


From this point, the main loop consists of two segments: (1) deciding whether to start or stop shared workers, and (2) deciding whether to add or remove IO-cores. In Section 6.4.2, we cover the parameters used in each segment.

We run the I/O manager periodically. Since running the I/O manager

adds overhead to the system, we want to run it at the lowest possible rate

while still being able to react quickly and accurately to changes in I/O load.

In this work, we ran the manager every 10 milliseconds. This interval strikes a good balance because the statistics reported by the Linux kernel are updated at the granularity of the kernel tick (1/HZ), which is 4 ms by default. Running the manager at intervals close to or below a single tick makes little sense, since it adds overhead but does not improve accuracy, as we found when we tested the manager at intervals of

1 ms. On the other hand, running the I/O manager at large intervals such

as 50 or 100 ms causes a decrease in performance, as it takes longer to react

to changes.

def run(self):
    while True:
        time.sleep(interval)
        Vhost.INSTANCE.update()     # Get the latest vhost statistics
        CPUUsage.INSTANCE.update()  # Get CPU usage of the VM-cores
        shared_workers = len(self.io_workers) > 0
        calculate_load(shared_workers)  # calculate the load on the system
        update_history()                # update the load history

        if not shared_workers:
            should_start, suggested_io_cores = should_start_shared_worker()
            if not should_start or \
                    not regret_policy.can_do_move("start_shared_worker"):
                continue
            # Enable shared IO workers.
            cpu_ids = vm_manager.remove_cores(number=suggested_io_cores)
            # Add an IO worker on each CPU in cpu_ids, divide devices between cores
            enable_shared_workers(cpu_ids)
            continue

        if should_stop_shared_worker():
            if not can_do_move("stop_shared_worker"):
                return True
            # Disable shared IO workers.
            cpu_id = self.io_core_policy.remove()  # choose a CPU to remove
            vm_manager.add_core(cpu_id)            # move VCPUs to the free core
            disable_shared_workers()               # move devices to dedicated workers

        should_add_io_core, can_remove_io_core = should_update_io_cores_number()
        should_remove_io_core, can_add_io_core = should_update_vm_cores_number()
        batching_remove_io_core = batching_should_remove_io_core_number()

        if batching_remove_io_core and \
                regret_policy.can_do_move("batching_remove_io_core"):
            remove_io_core()
            if regret_policy.is_move_good("batching_remove_io_core"):
                continue
            # Revert action
            add_io_core()
            continue

        if should_add_io_core and can_add_io_core and \
                regret_policy.can_do_move("add_io_core"):
            add_io_core()
            if regret_policy.is_move_good("add_io_core"):
                continue
            # Revert action
            remove_io_core()
            continue

        if not should_remove_io_core or not can_remove_io_core or \
                not regret_policy.can_do_move("remove_io_core"):
            # we don't want or can't remove an IO-core
            continue

        remove_io_core()
        if self.regret_policy.is_move_good("remove_io_core"):
            continue
        # Revert action
        add_io_core()

Algorithm 6.2: I/O manager main loop

6.4.2 Setting Parameters

We begin by describing the parameters used in the first segment of the I/O manager's loop: starting shared I/O workers, and thereby moving from the traditional to the sidecore approach. We consider the overall CPU utilization of all I/O

threads in the system. If their combined CPU utilization is larger than a

single core, we move to the sidecore approach by starting the shared I/O

workers. We stop the shared I/O workers when the load on the sidecores

becomes lower than half a core. This leaves us a safety margin that aids in

avoiding thrashing.

When moving to the sidecore approach, we must determine the number

of sidecores necessary. Thus, the second segment of the I/O manager’s

loop handles the decision of whether to increase or decrease the number of

sidecores. We consider the Butterfly model presented in Section 5.1. In the

model, we want to reduce the amount of wasted CPU cycles in the system

as much as possible. This is done by examining CPU usage thresholds to

determine when the number of sidecores in the system should be changed.

The thresholds are defined for the VM guests and for the IO-cores. When

increasing the number of IO-cores, we effectively decrease the number of VM-

cores. Thus, two thresholds are required in order to increase the number of

IO-cores: One indicating that the IO-cores need more computation power

than currently available, and one indicating that the VM guests are not


utilizing all cores allocated to them and so will not be hurt by relinquishing

a core.

The threshold indicating that the IO-cores are fully saturated, and that their number should be increased, is reached when the effective CPU usage of the IO-cores is ≥ #IOcores × 100 − 5. The opposite threshold, indicating that the VM-cores can relinquish a core, is reached when the effective CPU usage of the VM-cores is ≤ (#VMcores − 1) × 100.

Similarly, two more thresholds are required in order to decide when to

decrease the number of IO-cores. The threshold indicating that the IO-cores

are not fully utilized and will not be hurt by giving up a core in favor of

the VM guests is reached when the effective CPU usage is ≤ (#IOcores − 1) × 100.

Likewise, the threshold indicating that the VM guests are fully saturated

and require an additional core is reached when their effective CPU usage is

≥ #VMcores × 100 − 5.
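For concreteness, the following sketch combines the four thresholds into the two decisions described above. The function and variable names are ours; the inequalities and the 5-percentage-point slack come from the text, with effective CPU usage expressed in percent of a core (one fully used core equals 100).

SLACK = 5

def want_more_io_cores(io_usage, vm_usage, num_io_cores, num_vm_cores):
    io_saturated = io_usage >= num_io_cores * 100 - SLACK
    vm_has_spare = vm_usage <= (num_vm_cores - 1) * 100
    return io_saturated and vm_has_spare

def want_fewer_io_cores(io_usage, vm_usage, num_io_cores, num_vm_cores):
    io_has_spare = io_usage <= (num_io_cores - 1) * 100
    vm_saturated = vm_usage >= num_vm_cores * 100 - SLACK
    return io_has_spare and vm_saturated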

6.4.3 Keeping History and a Regret Policy

As mentioned above, the manager polls all of the relevant statistics regularly.

During testing, we saw fluctuations in the measured CPU load over time when the polling rate was high, even though the workload was constant. The load appears to change because not all VM guests are scheduled at the same time, and because Linux timekeeping is delayed: the CPU utilization of a process is updated only after the process is scheduled out.

To account for these fluctuations, the manager makes a configuration

change based on its observations over several iterations. In this work, the

manager makes a configuration change after considering the results of the

last 20 iterations. This “history” of iterations is created by having the I/O

manager make a decision after each iteration based on the state of the system

during this iteration. The manager executes a configuration change only if

it decided on the change in at least 60% of the previous 20 iterations.

After the I/O manager executes a configuration change, we want to

make sure that it made the “correct” decision. To this end, we record the average throughput of the 20 iterations before the change and use it as a comparison point. After the change is executed, we measure

throughput for a further 20 iterations, during which the manager cannot


execute a new change. If the average throughput during these iterations was

lower than the throughput before the configuration change, we revert to the

old configuration. After reverting, we apply a penalty to the unsuccessful

change by setting a time frame in which the same change cannot be retried.

This time frame is calculated using an exponential backoff, up to a maximum

of 5 seconds. This mechanism allows us to apply changes rapidly but also

to ensure that we persist with a change only if it is beneficial.
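A minimal sketch of this history-and-regret mechanism follows. The class and method names are ours, and the before/after throughputs are passed explicitly as the 20-iteration averages described above; only the constants (20 iterations, a 60% majority, and a backoff capped at 5 seconds) are taken from the text.

import time
from collections import deque

class RegretPolicy:
    HISTORY = 20
    MAJORITY = 0.6
    MAX_BACKOFF = 5.0  # seconds

    def __init__(self):
        self.votes = {}          # move name -> deque of per-iteration decisions
        self.blocked_until = {}  # move name -> time before which it may not be retried
        self.backoff = {}        # move name -> current penalty in seconds

    def record(self, move, wanted):
        self.votes.setdefault(move, deque(maxlen=self.HISTORY)).append(wanted)

    def can_do_move(self, move):
        if time.time() < self.blocked_until.get(move, 0):
            return False  # still serving a penalty from a previous bad attempt
        votes = self.votes.get(move, ())
        return (len(votes) == self.HISTORY and
                sum(votes) >= self.MAJORITY * self.HISTORY)

    def is_move_good(self, move, throughput_before, throughput_after):
        if throughput_after >= throughput_before:
            self.backoff[move] = 0
            return True
        # Penalize the unsuccessful change with an exponential backoff.
        self.backoff[move] = min(self.MAX_BACKOFF,
                                 max(0.1, 2 * self.backoff.get(move, 0.05)))
        self.blocked_until[move] = time.time() + self.backoff[move]
        return False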

6.4.4 Utilizing the Baseline

We further optimized our manager by using the current effective I/O CPU

usage to choose the number of sidecores when moving from the baseline to

the sidecore configuration. For example, if the effective CPU usage in the

baseline configuration is 120% (1.2 full cores), we will choose a two-sidecore

configuration rather than a single-sidecore configuration. This enables us to

find the optimum number of sidecores much more quickly than the trivial

approach, in which we first move to the single-sidecore configuration and

then leave the I/O manager to raise the number of cores as needed. Test-

ing shows that this optimization always brings us to within at most one

configuration change from the optimum.
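For example, with the effective I/O CPU usage expressed in percent of a core, the initial number of sidecores can be chosen as follows (a sketch with our own naming):

import math

def initial_sidecores(effective_io_cpu_percent):
    # e.g., 120% of a core (1.2 full cores) -> start with 2 sidecores rather than 1.
    return max(1, math.ceil(effective_io_cpu_percent / 100.0))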

6.4.5 Minimum Batching

Minimum batching is an optimization we applied to increase packet buffering, thereby increasing the average size of the packets sent in the system. We achieved this by allowing the I/O manager to decrease the number of sidecores in the system when the observed average packet size was less than Ethernet's MTU (Maximum Transmission Unit). The I/O manager performs the

configuration change using the history and regret policy described above.
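A sketch of this check, with our own naming and the standard 1500-byte Ethernet MTU assumed:

ETHERNET_MTU = 1500  # bytes

def average_packet_below_mtu(bytes_sent, packets_sent):
    # Propose shrinking the sidecore pool when packets are, on average,
    # smaller than the MTU; fewer sidecores encourage more batching.
    if packets_sent == 0:
        return False
    return bytes_sent / packets_sent < ETHERNET_MTU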

6.4.6 Dividing Virtual I/O Devices Among Sidecores

We previously detailed the design of both the traditional paravirtual model

and ELVIS. In the traditional model, each I/O device is handled by a dedi-

cated I/O worker thread. In the sidecore approach, which was based on the

traditional paravirtual approach, each virtual I/O device is still emulated by an I/O worker thread assigned to it. Unlike the traditional approach, however,


each I/O worker thread handles multiple virtual I/O devices. Consequently,

the last major step in creating a dynamic sidecore manager is the division

of virtual I/O devices among the various sidecores. Dividing the devices is

crucial, especially when the I/O workload is not uniform and the virtual I/O

devices produce different amounts of I/O load.

We analyzed the theoretical ramifications of dividing the devices and

found a similarity to an existing optimization problem, Bin Packing, which

is known to be NP-Hard. In Bin Packing, objects having different volumes

must be partitioned into containers of a given volume, with the aim of

minimizing the number of containers used. In our case, the objects are the

virtual I/O devices with their processing cycles demands, and the containers

are the I/O workers that can supply a given amount of processing cycles.

In order to complete the reduction, we simplify our problem by considering

throughput as being dependent only on the amount of processing allocated

to the device in the I/O worker. By minimizing the number of I/O workers

needed to fulfill all the system demands, we get a partition of the I/O devices

into I/O workers with maximum throughput.

Although on the surface this reduction seems correct, there are some

inherent issues that make using an approximation algorithm very difficult.

The first is that the actual demand of the virtual I/O device is unknown

to us, especially when the I/O worker is fully saturated. In this case, each

virtual I/O device assigned to the saturated worker receives only an unknown

part of its demand. Second, allocating more I/O workers than needed reduces the number of cores available to the VMs, which may reduce the

demand and decrease the throughput. The problem becomes even more

difficult when taking into account other parameters such as latency. Further,

as discussed below, even if the devices are symmetrical in terms of load, their

assignment contributes a great deal to the overall throughput.

As a result, we decided to concentrate on I/O devices that produce a

uniform workload and to assign them statically for each configuration. We

divided the devices as equally as possible among the various I/O workers in

the system.
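A sketch of this equal, round-robin division (our own naming):

def divide_devices(devices, workers):
    # Assign devices to workers round-robin so that the per-worker counts
    # differ by at most one.
    assignment = {w: [] for w in workers}
    for i, dev in enumerate(devices):
        assignment[workers[i % len(workers)]].append(dev)
    return assignment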

To eliminate the problem of dividing the devices, a future optimization

would be to allow the I/O workers in the system to poll and handle all the

available I/O devices, thereby removing the need for division.


Iterations per second              100
Single iteration                   28 µsec
Per second                         2.8 ms
Move device (no I/O)               12 µsec
Move device (with I/O)             1.2 ms
Configuration change               >10 ms

Table 6.1: I/O manager overhead.

6.4.7 Overhead

We now evaluate the performance impact of the various aspects of Side-

coreM. The time it takes our I/O manager to perform one iteration is 28

microseconds when no configuration change is required. An iteration in

which the manager decides that a configuration change is necessary takes

604 microseconds. The main difference lies in the fact that after a system

configuration change, the I/O manager must use the slow sysfs interface to

communicate with the kernel, as described in Section 6.3. SidecoreM iterates

once every 10 milliseconds, resulting in around 280 microseconds of overhead

per second, as shown in Table 6.1.

Another aspect of overhead we incur in the system comes from moving

a virtual I/O device between two I/O workers. We measured the time it

takes to perform 100 transfers of a device between two worker threads. As

shown in Table 6.1, it takes around 12 microseconds to move a device when

there is no I/O traffic in the system; and around 1200 microseconds when

the system is loaded and the device is transmitting and receiving packets.

We attribute this difference to our asynchronous moving scheme, in which

we queue the device transfer order in the I/O workers’ work queues, and the

transfer is performed as the work queue item is handled.

Lastly, we measured the impact on the throughput by moving a single

VM between two cores every 1 millisecond. We saw no noticeable drop

in performance compared to the case in which the VM was running on

dedicated cores. This result makes sense in the context of this work as we

do not move the VM across NUMA nodes, and the VM is not moved frequently enough for the resulting cache misses to impact performance.


Chapter 7

Evaluation

We implemented SidecoreM as described above and evaluated its perfor-

mance, strengths, and weaknesses using micro and macro benchmarks, as

well as dynamic benchmarks.

The results show that SidecoreM improved performance by up to two

times with a dynamic workload and up to 1.8x with a static workload when

compared to traditional paravirtual I/O. In most cases, SidecoreM had no noticeable overhead, where overhead is defined as how far SidecoreM falls short of the state-of-the-art statically tuned configurations. In the few cases in which there was overhead, it was less than 5%.

7.1 Experimental Methodology

Our testbed is detailed in Section 4.1. We evaluate SidecoreM against the

I/O models we have considered thus far: the KVM/Virtio baseline, the state of practice, which we denote base [21], and a group of ELVIS configurations we denote Static Best. In each experiment, we tested a wide range of ELVIS

configurations (see Chapter 4). The sbest line in each graph represents the

best performing ELVIS configuration for each test case.

We consider ELVIS as the state-of-the-art sidecore approach. We use the

following benchmarks: (1) Netperf UDP RR (request-response) – a standard

tool for measuring network latency [14], which works by repeatedly sending

one byte and waiting for a one-byte response; (2) Netperf TCP stream – a

tool for measuring the maximal throughput sent over a TCP connection.


We use this tool with a variety of packet sizes; (3) Apache [8] – an HTTP web server driven by ApacheBench [4].

We perform each experiment five times and present averages. The stan-

dard deviation for the ELVIS and SidecoreM models is less than 2% of the

average, and it is less than 5% for all the baseline experiments.

7.2 Micro benchmarks

7.2.1 Netperf TCP Stream

The most important aspects of SidecoreM's behavior are its ability to find the best configuration and the performance impact of the time it takes the manager to converge. We evaluated the impact of SidecoreM in the least

favorable conditions as compared to a statically tuned configuration. These

conditions are reached when the load is static. In a dynamic workload, the

demand and the load on the IO-cores may change, and as such, no single

static configuration can perform as well as a dynamic configuration.

Using a Netperf TCP stream, we measured the throughput of the system

while it is running our I/O manager. For each message size tested, we

restarted the I/O manager, each time from the baseline configuration. Each

run takes two minutes. As we see in Figure 7.1(b), SidecoreM performed

on par with the best static sidecore configuration, trailing by at most 4%,

or 0.94 Gbps, with a message size of 1KB. We investigated this gap and found that the I/O manager's thresholds cause it to fluctuate between four sidecores, which yields the best throughput, and three sidecores. The reason for this fluctuation is that when four IO-cores are in

use, the VM-cores are fully saturated and the IO-cores’ utilization hovers

on the threshold that triggers the manager to try to reduce the number of IO-cores. When the manager decides to remove a core, the regret policy

identifies the degradation in throughput and reverts to the configuration

from before the change. The performance hit is reduced by the application

of the exponential backoff penalty, which delays further attempts to reduce

the number of cores and keeps the regression in throughput relatively low.

SidecoreM found the optimum configuration within about 0.47 seconds,

or 47 iterations. About half of the iterations were spent on making sure

that the change in configuration is beneficial – the first part of the regret


policy. Our findings in this experiment show that using the effective CPU

utilization of the baseline configuration as a guide always brings us within

one configuration change from the optimum.

As we saw in Chapter 4, enabling TCP Segmentation Offload (TSO)

changes the behavior of the guest TCP/IP stack. These changes make it

difficult for us to model the behavior of the system under TSO. This is a

commonly used feature, however, often enabled by default, and so it was

imperative that we explore its effects on SidecoreM.

We ran the manager with TSO enabled at its default configuration. As

shown in Figure 7.1(a1), a maximum of two or three sidecores was necessary to achieve the line rate of our 40 Gbps NIC with TSO enabled, depending on the average packet size.

Additionally, Figure 7.1(a2) shows that SidecoreM was able to reach the

optimal configuration as defined by the fully tuned static configuration in all

cases. SidecoreM outperformed the baseline by 1.25x-1.8x before reaching

line-rate.

7.2.2 I/O Interposition

Next, we tested SidecoreM with various delays to emulate the computational

cost in terms of CPU usage inside the sidecores. We see in Figure 7.2 that for

all message sizes, SidecoreM always found the optimal number of sidecores

required — both with and without TSO. Additionally, the overhead is at

most 5%. SidecoreM improved over the baseline by up to 1.79x with TSO, with the gap narrowing as the simulated interposition lengthens.

7.2.3 Varying Message Costs

We tested SidecoreM under a more CPU-intensive workload on the guest

VM. To this end, we utilized a Netperf TCP stream with a busy-loop delay

before transmitting data. The busy-loop delay simulates a CPU-intensive

workload in a controlled manner. We sent 2KB messages with busy-loop

delays of 0-35000 cycles before each transmission. From the results shown

in Figure 7.3(a), we see that SidecoreM performed as well as the best static

configuration, with a maximum of 1% deviation. Moreover, we see that

SidecoreM outperformed the baseline by a factor of 1.22x-1.73x.


Figure 7.1: A Netperf TCP stream with various message sizes (64B-16KB). Panels (a1) and (a2) show throughput [Gbps] and throughput normalized to the baseline with TSO on; panels (b1) and (b2) show the same with TSO off. Each panel plots base, e1, sbest, and SidecoreM as a function of message size [Bytes]; the labels next to the data points give the number of sidecores used.


Figure 7.2: A Netperf TCP stream with various interposition delays (1K-64K cycles). The panels show throughput (Gbps) as a function of the interposition delay for 64B, 1KB, and 16KB messages, each with TSO on and with TSO off; the labels next to the data points give the number of sidecores used.


Figure 7.3: A Netperf TCP stream with a busy-loop delay that simulates performing a computation before transmitting packets. Panels (a1) and (a2) show throughput [Gbps] and throughput normalized to the baseline with TSO on; panels (b1) and (b2) show the same with TSO off. Each panel plots base, e1, sbest, and SidecoreM as a function of the delay [cycles]; the labels next to the data points give the number of sidecores used.

Figure 7.3(b) shows a peak in the normalized throughput when a delay

of 10000 cycles is tested. The peak is achieved because these parameters

enable full utilization of all sidecores.

7.3 Apache Web Server

Figure 7.4(a) displays the performance of the Apache web server. We config-

ured the Apache server to send static HTML pages of various sizes, from 64B

to 1MB. The server adds a header to each page sent to the client, adding an

overhead we measured as 272B. This is important to note because for small

page sizes, the overhead dominates the size of the sent data, whereas for

larger pages, the overhead is negligible. As a result, when handling smaller

page sizes, doubling the page size does not double the amount of data sent.

Having said that, we measure throughput in transactions per second, as is

customary in this benchmark, rather than in actual network throughput.


As shown in Figure 7.4(a), SidecoreM approached the optimum defined

by the best static sidecore configuration for all HTML page sizes, with a

deviation of at most 1%. Compared to the baseline paravirtual configuration, SidecoreM surpassed its performance for all HTML page sizes.

Figure 7.4(b) shows the throughput normalized to the baseline vhost

implementation. We see that for small page sizes, the gain is significant.

SidecoreM performed up to two times better than the baseline for pages in

the range of 64B-1KB. The benefit of using the sidecore approach declines

for HTML page sizes between 2KB-16KB, then peaks for 32KB page sizes.

For larger page sizes, the benefit continues to decay, while remaining above

the baseline.

The two-times gain for page sizes of up to 1KB can be explained by the fact that the load produced by the VM guests is only 151Mbps-624Mbps, mostly because the packets are small and the cost of generating them is high. This agrees with the results shown in Section 7.2.3. This is greatly

improved with polling and dedicated cores.

For pages larger than 2KB, throughput starts to become a factor: the actual amount of data sent starts to double as the page size doubles. To get a better understanding of this behavior, compare Figures 7.4(b)

and 7.4(c). Figure 7.4(c) shows the CPU usage of the guest VMs in our sys-

tem. As can be seen, the benefit of SidecoreM in normalized throughput is directly related to the guest VMs' CPU usage, except for the case of 16KB.

The decline between 2KB-8KB is influenced by the fact that the system

achieves maximum throughput with only one sidecore for these page sizes.

The effective CPU usage for the IO-core in these cases is just below 100%,

while the baseline slowly increases in efficiency with the larger page sizes. As

a result, the Transactions per second (TPS) decline linearly as the amount

of data sent doubles. The increase in the baseline efficiency, which decreases

the relative gain of the sidecore approach, occurs for two reasons: First, the

amount of processing inside the VM guests lessens as the file sizes grow.

Second, the effective CPU usage of the baseline I/O threads rises.

For 16KB, the best configuration consists of two sidecores, as opposed

to the one-sidecore configuration used to handle pages of size 8KB. This

accounts for the increase in TPS between pages of size 8KB and 16KB seen in Figure 7.4(b). The two-sidecore configuration, however, does not reach its full

potential when handling pages of size 16KB because of a bottleneck caused


Figure 7.4: Running ApacheBench with static HTML page sizes of 64B-1MB. Panel (a): throughput [TPS] with TSO off; panel (b): throughput normalized to the baseline with TSO off; panel (c): CPU usage of the guests' VCPU threads [%] with TSO off; panel (d): throughput [TPS] with TSO on. All panels are plotted as a function of HTML file size, for base, e1, sbest, and SidecoreM; the labels next to the data points give the number of sidecores used.


Figure 7.5: Running ApacheBench with static HTML page sizes of 64B-1MB (cont.). Panel (e): throughput normalized to the baseline with TSO on, as a function of HTML file size; the labels next to the data points give the number of sidecores used.

by the VM guests, which do not produce enough data to fill both sidecores.

There is a further, sharper rise in the TPS between 16KB and 32KB. In

this case, the bottleneck no longer exists, as the VM guests are now able

to produce much more data and fill both sidecores. The behavior repeats

itself for pages of sizes 64KB-128KB, but with a less dramatic peak as the

baseline configuration becomes more efficient. To conclude, we see a gap of

1.2x to two times in performance between the best static configuration and

the baseline configuration for a multitude of page sizes, which SidecoreM is

able to reproduce dynamically.

With TSO on, we get similar results, as shown in Figure 7.4(d) and Figure 7.5(e).

7.4 Dynamic Workload

Finally, we demonstrate the real strength of having a dynamic environment

that is able to adapt to change. We test the system with a workload that

varies over time. This is achieved by running a Netperf TCP stream with

a busy-loop test that sends 2KB messages, while using changing amounts

of busy-loop cycles. The test comprises seven phases, each running for 30

seconds. The throughput is measured every two seconds. The results are

shown in Figure 7.6, with the sbest line showing the optimum performance

achievable in each stage (the static best).

We note that the variance in the transition periods between phases is

large, for the following reasons. The first is that the I/O manager must no-

tice the change in the load and make the decision to adapt. The second, and


Figure 7.6: Dynamic benchmark running a Netperf TCP stream with 2KB messages and seven phases of various busy-loop configurations. Panels (a) TSO on and (b) TSO off show the throughput [Gbps] of sbest, SidecoreM, base, and e1 over time [secs]; each phase is annotated with its busy-loop delay and the corresponding number of sidecores.

unfortunately more significant reason, is that we had a problem measuring

throughput in the transitional periods. We measured the throughput using

the Netperf periodic results inside each VM guest. These separate Netperf

executions are not entirely synchronized, allowing for an offset between the

times each VM guest starts its own Netperf test. Additionally, each phase

is a separate execution, and as such, there is some offset in the exact time

that each phase starts and ends. This discrepancy hides the time it takes

the I/O manager and the baseline configuration to adjust to changes. We

note, however, that we already showed the time it takes SidecoreM to find

the optimum configuration, as well as the overhead incurred while searching

for said optimum.

Figure 7.6 shows that although both SidecoreM and the baseline are dynamic in nature, SidecoreM outperformed the baseline configuration throughout the benchmark. It can also

be seen that SidecoreM moved to the best static configuration in less than

two seconds in all phases, echoing the results of tests in previous sections.

We show the ELVIS configuration, which uses a single sidecore, to illustrate


that a static configuration will not surpass a dynamic configuration. We

see that SidecoreM performed at most 6% worse than the static best con-

figuration when the workload had stabilized. The maximum discrepancy is

reached only in the case of a busy-loop of 35000 cycles, which requires a sin-

gle sidecore. This is due to our I/O manager trying to raise the number of

sidecores to two and reverting to the configuration used prior to the change.

We note that SidecoreM outperformed the baseline configuration by a factor

of up to 2x.


Chapter 8

Related Work

The sidecore approach is a parallelization technique for accelerating the exe-

cution of virtual machines by polling relevant memory regions via host cores

external to the VMs. Rather than taking exits on the cores that run the

VMs, the host sidecores continuously observe the requests of the VMs and

service them accordingly. This behavior (1) eliminates the direct and indi-

rect overheads of exits; and (2) offloads the corresponding work to sidecores,

thus freeing VM cores to process other activities. In essence, the sidecore

approach offloads work from guest cores to host cores while replacing the

costly trap-and-emulate guest/host communication mechanism with simple,

much faster load/store operations directed at the coherent memory caches.

The sidecore approach was first introduced in 2007 by Kumar et

al. [9, 15]. They used it to accelerate network interrupts and the page-table-management processing of paravirtual guests. In 2009, Liu and Abali [18]

reintroduced the same concept, calling it the Virtualization Polling Engine

(VPE). In addition, they used it to accelerate the receive and transmit net-

work I/O paths. In 2012, Ben-Yehuda et al. implemented a block device

on a sidecore [6]. In 2013, Har'El et al. proposed ELVIS [12] – used in this work – combining the sidecore paradigm with exit-less interrupts [3, 11] and making it applicable to any Virtio device (both block and net) such that it scales linearly with the number of cores (VMs). Kuperman et al. proposed vRIO [16], which takes the sidecore approach a step further by migrating the sidecores to another server, reaping the benefits of consolidation at rack scale.

All of the aforementioned sidecore work was done using paravirtual I/O,


whereby the guest VM code is changed to explicitly cooperate with a host

that runs on a sidecore, rather than on the same core with regular trap-

and-emulate exits. Conversely, Amit et al. exposed a virtual IOMMU

(“vIOMMU”) to unmodified VMs, and were able to substantially improve

performance by running the vIOMMU code on a sidecore, keeping the VMs

unaware of the fact they are not using the physical IOMMU [2]. Landau

et al. proposed a hardware extension (“SplitX”) that allows the hypervisor

and its unmodified guests to run on disjoint sets of cores, such that ev-

ery exit occurring on a VM core is delivered to, and is serviced by, a host

sidecore [17].

The insight underlying SidecoreM is that the sidecore approach requires

special tuning of its configuration, as it is often used in high performance

environments. Thus, unlike all previous studies, SidecoreM looks at how to

configure the system dynamically.


Chapter 9

Conclusions

Paravirtual I/O is the most popular I/O virtualization model. Most notably,

it is utilized in virtualized data centers and in cloud computing sites be-

cause it enables useful features such as live migration and software-defined-

networks. These features, and many more, require the hypervisor to inter-

pose on the I/O activity of its guests. The performance and scalability of

this interposition are extremely important for cloud providers and enterprise

data centers.

The sidecore approach is an effective way of accelerating and optimizing

paravirtual I/O. We showed that in order to reap the benefits of the sidecore

approach, the system must be properly configured. That is to say, the number of sidecores must match the given workload. SidecoreM is able to find the optimal number of sidecores for throughput-intensive workloads.


Bibliography

[1] Keith Adams and Ole Agesen. A comparison of software and

hardware techniques for x86 virtualization. In ACM Architec-

tural Support for Programming Languages & Operating Systems

(ASPLOS), 2006.

[2] Nadav Amit, Muli Ben-Yehuda, Dan Tsafrir, and Assaf Schus-

ter. vIOMMU: efficient IOMMU emulation. In USENIX Annual

Technical Conference, 2011.

[3] Nadav Amit, Abel Gordon, Nadav Har’El, Muli Ben-Yehuda,

Alex Landau, Assaf Schuster, and Dan Tsafrir. Bare-metal per-

formance for virtual machines with exitless interrupts. Commun.

ACM, 59(1):108–116, December 2015.

[4] Apachebench. http://en.wikipedia.org/wiki/ApacheBench.

[5] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim

Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, and Andrew

Warfield. Xen and the art of virtualization. In ACM Symposium

on Operating Systems Principles (SOSP), 2003.

[6] Muli Ben-Yehuda, Eran Borovik, Michael Factor, Eran Rom,

Avishay Traeger, and Ben-Ami Yassour. Adding advanced stor-

age controller functionality via low-overhead virtualization. In

USENIX Conference on File & Storage Technologies (FAST),

2012.

[7] Yaozu Dong, Xiaowei Yang, Xiaoyong Li, Jianhui Li, Kun Tian,

and Haibing Guan. High performance network virtualization with


SR-IOV. In IEEE International Symposium on High Performance

Computer Architecture (HPCA), 2010.

[8] Roy T. Fielding and Gail Kaiser. The Apache HTTP server

project. IEEE Internet Computing, 1(4):88–90, 1997.

[9] Ada Gavrilovska, Sanjay Kumar, Himanshu Raj, Karsten

Schwan, Vishakha Gupta, Ripal Nathuji, Radhika Niranjan, Adit

Ranadive, and Purav Saraiya. High-performance hypervisor ar-

chitectures: Virtualization in HPC systems. In Workshop on

System-level Virtualization for HPC (HPCVirt), 2007.

[10] Github Repository. Virtual I/O acceleration technolo-

gies for KVM. https://github.com/abelg/virtual_io_

acceleration. (Accessed May, 2014).

[11] Abel Gordon, Nadav Amit, Nadav Har’El, Muli Ben-Yehuda,

Alex Landau, Assaf Schuster, and Dan Tsafrir. ELI: Bare-metal

performance for I/O virtualization. In ACM Architectural Support

for Programming Languages & Operating Systems (ASPLOS),

2012.

[12] Nadav Har’El, Abel Gordon, Alex Landau, Muli Ben-Yehuda,

Avishay Traeger, and Razya Ladelsky. High performance I/O

interposition in virtual systems. In USENIX Annual Technical

Conference, 2013.

[13] 6WIND and Intel Corporation. DPDK – Data Plane Development Kit. http://dpdk.org. (DPDK website).

[14] Rick A. Jones. A network performance benchmark (revision 2.0).

Technical report, Hewlett Packard, 1995.

[15] Sanjay Kumar, Himanshu Raj, Karsten Schwan, and Ivan Ganev.

Re-architecting VMMs for multicore systems: The sidecore ap-

proach. In Workshop on Interaction between Operating Systems

& Computer Architecture (WIOSCA), 2007.

[16] Yossi Kuperman, Eyal Moscovici, Joel Nider, Razya Ladelsky,

Abel Gordon, and Dan Tsafrir. Paravirtual remote I/O. In ACM


Architectural Support for Programming Languages & Operating

Systems (ASPLOS), 2016.

[17] Alex Landau, Muli Ben-Yehuda, and Abel Gordon. SplitX: Split

guest/hypervisor execution on multi-core. In USENIX Workshop

on I/O Virtualization (WIOV), 2011.

[18] Jiuxing Liu. Evaluating standard-based self-virtualizing devices:

A performance study on 10 GbE NICs with SR-IOV support. In

IEEE International Parallel & Distributed Processing Symposium

(IPDPS), 2010.

[19] David Matlack. kvm: adaptive halt-polling toggle. https://

lkml.org/lkml/2015/9/1/540. (Accessed July, 2016).

[20] David Matlack. Message passing workloads in kvm.

http://events.linuxfoundation.org/sites/events/files/

slides/Message%20Passing%20Workloads%20in%20KVM%

20(SLIDES).pdf. (Accessed July, 2016).

[21] Rusty Russell. virtio: towards a de-facto standard for virtual

I/O devices. ACM SIGOPS Operating Systems Review (OSR),

42(5):95–103, 2008.

[22] Arvind Seshadri, Mark Luk, Ning Qu, and Adrian Perrig. SecVi-

sor: a tiny hypervisor to provide lifetime kernel code integrity

for commodity OSes. In ACM Symposium on Operating Systems

Principles (SOSP), pages 335–350, 2007.

[23] Amit Singh. An introduction to virtualization. http:

//www.kernelthread.com/publications/virtualization/,

2004. (Accessed May, 2016).

[24] Tomas Hruby, Herbert Bos, and Andrew S. Tanenbaum. When

slower is faster: On heterogeneous multicores for reliable systems.

In USENIX Ann. Technical Conf. (ATC), pages 255–266, 2013.

[25] VMware. ESX server 2 - architecture and performance implica-

tions. Technical report, VMware, 2005.


[26] Andrew Whitaker, Marianne Shaw, and Steven D. Gribble. De-

nali: Lightweight virtual machines for distributed and networked

applications. Technical Report 02-02-01, University of Washing-

ton, 2002.

[27] Paul Willmann, Jeffrey Shafer, David Carr, Aravind Menon, Scott

Rixner, Alan L. Cox, and Willy Zwaenepoel. Concurrent direct

network access for virtual machine monitors. In IEEE Interna-

tional Symposium on High Performance Computer Architecture

(HPCA), 2007.

[28] Cong Xu, Sahan Gamage, Hui Lu, Ramana Kompella, and

Dongyan Xu. vTurbo: Accelerating virtual machine I/O process-

ing using designated turbo-sliced core. In USENIX Ann. Technical

Conf. (ATC), pages 243–254, 2013.



Abstract

Virtualization is the ability of a modern computing system to run virtual machines. The host machine exposes virtual I/O devices to the guest machines, such as a network card, a hard disk, and so on. Paravirtual I/O is a common technique for presenting an I/O interface to the guest machines; today, most cloud computing sites use it when exposing I/O devices to their virtual machines. With this technique, the guests see an interface that is similar, but not identical, to the hardware found in the host machine. These interfaces are called virtual I/O devices, and their behavior is emulated by the host machine. The emulation must be able to serve, in a scalable manner, virtual systems that generate I/O at a very high rate; when the system generates I/O at a low rate, the emulation must not waste system resources.

The great advantage of the paravirtual approach is that it allows the host machine to interpose between the virtual machine and the physical hardware and the outside world. This advantage does not exist in the competing approach, direct device assignment, in which the virtual machine receives direct, unsupervised access to the physical hardware in the host machine. Direct device assignment has been shown to be more efficient in terms of performance, but it does not give the host machine control over the I/O leaving the virtual machine, and it thereby makes highly useful operations, such as migration of a virtual machine between host machines, very difficult.

Today there are two leading approaches to implementing paravirtual devices: the traditional approach and the sidecore approach. In the traditional approach, the virtual machine sends I/O requests that are handled by the host machine. The device driver resides in the guest machine and is also called the front end of the device; it sends each I/O request to a dedicated thread in the host machine called the back end of the device. The back end processes the request and returns a response to the front end. In addition, each paravirtual device has a dedicated thread in the host machine.

The traditional approach is not without drawbacks, such as degraded performance when processing large volumes of I/O. The degradation stems from two sources: exits and task scheduling. Exits are switches between the program running in the guest machine and the host machine; among other things, these exits are used to communicate between the two ends of the paravirtual device. The scheduling problem stems from the task scheduler running in the host machine, which schedules tasks, and in particular the back end of the device, without knowledge of the I/O activity passing through it. This scheduling problem can lead to poor performance, as demonstrated in previous work.

The sidecore approach, as applied to paravirtual devices, has been shown to be effective in removing several of the bottlenecks of a paravirtual system. In this approach, the cores of the host machine are divided into two groups. The first group of cores runs only the virtual CPUs of the guest machines; a core in this group is called a VM-core. The second group of cores runs only the back ends of the I/O devices; a core in this group is called an IO-core. Each IO-core is assigned a group of virtual I/O devices that it processes, and it has a dedicated task scheduler that schedules those devices. Using IO-cores allows these cores to poll the devices serially, which makes it possible to reduce the number of exits to a minimum.

The sidecore approach is not without its own drawbacks: partitioning the cores into two groups leaves less processing power available for running the virtual CPUs, which reduces the utilization of the system when it does not generate large volumes of I/O. This problem is especially aggravated in current implementations, in which the number of cores in each group is fixed statically in advance.

This work investigates a dynamic solution that improves the utilization of system resources by combining the two approaches. We present SidecoreM, which uses an I/O manager that knows how to configure the system to use the preferred approach according to the amount of I/O generated in the system. To build this manager, we characterized the behavior of the system under various I/O loads.

We evaluated the performance of the system we designed by implementing it under the Linux operating system. The evaluation showed that SidecoreM is able to find the optimal configuration for the system. While searching for the optimal configuration, SidecoreM degraded the system's performance by at most 6% compared to a system whose configuration was tuned statically. SidecoreM shows a performance improvement of up to 2.2x compared to the traditional approach. The overhead of SidecoreM during regular operation, without a configuration change, is 280 microseconds per second; when a configuration change occurs, the overhead can reach a few milliseconds.
