
Survey of State-of-the-art in Inter-VM Communication Mechanisms

Jian Wang

September 27, 2009

Abstract

Advances in virtualization technology have focused mainly on strengthening the isolation barrier between virtual machines (VMs) that are co-resident within a single physical machine. At the same time, a large category of communication-intensive distributed applications and software components exists, such as web services, high performance grid applications, transaction processing, and graphics rendering, that often wish to communicate across this isolation barrier with other endpoints on co-resident VMs. This report presents a survey of the state-of-the-art research that aims to improve communication between applications on co-located virtual machines. These efforts can be classified under two broad categories: (a) shared-memory approaches that bypass the traditional network communication datapath to improve both latency and throughput of communication, and (b) improvements to CPU scheduling algorithms at the hypervisor level that address the latency requirements of inter-VM communication. We describe the state-of-the-art approaches in these two categories, compare and contrast their benefits and drawbacks, and outline open research problems in this area.

1 Introduction

Virtual Machines (VMs) are rapidly finding their way into data centers, enterprise service platforms, high performance computing (HPC) clusters, and even end-user desktop environments. The primary attraction of VMs is their ability to provide functional and performance isolation across applications and services that share a common hardware platform. VMs improve the system-wide utilization efficiency, provide live migration for load balancing, and lower the overall operational cost of the system.

The hypervisor (also sometimes called the virtual machine monitor) is the software entity that enforces isolation across VMs residing within a single physical machine, often in coordination with hardware assists and other trusted software components. For instance, the Xen hypervisor runs at the highest system privilege level and coordinates with a trusted VM called Domain 0 (or Dom0) to enforce isolation among unprivileged guest VMs.


Enforcing isolation is an important requirement from the viewpoint of security of individual software components. At the same time, enforcing isolation can result in significant communication overheads when different software components need to communicate across this isolation barrier to achieve application objectives.

For example, a distributed HPC application may have two processes running in different VMs that need to communicate using messages over MPI libraries. Similarly, a web service running in one VM may need to communicate with a database server running in another VM in order to satisfy a client transaction request. Or a graphics rendering application in one VM may need to communicate with a display engine in another VM. Even routine inter-VM communication, such as file transfers or heartbeat messages, may need to frequently cross this isolation barrier. Different applications have different requirements for communication throughput and latency based on their objectives. For example, file transfer or graphics rendering applications tend to require high throughput, but communication latency can be more critical in web services and MPI-based applications. In all the above examples, when the VM endpoints reside on the same physical machine, ideally we would like to minimize the communication latency and maximize the bandwidth, without having to rewrite existing applications or communication libraries.

This report surveys the state-of-the-art research in improving communication performance between co-located virtual machines. The major obstacles to efficient inter-VM communication are as follows.

• Long communication data path: A major source of inter-VM communication overhead is the long data path between co-located VMs. For example, the Xen platform enables applications to transparently communicate across VM boundaries using standard TCP/IP sockets. However, all network traffic from the sender VM to the receiver VM is redirected via Dom0, resulting in a significant performance penalty. Packet transmission and reception involve traversal of the TCP/IP network stack and invocation of multiple Xen hypercalls. To overcome this bottleneck, a number of prior works such as XenSocket [12], IVC [2], XWay [4], MMNet [9], and XenLoop [11] have exploited the facility of inter-domain shared memory provided by the Xen hypervisor. Using shared memory for direct packet exchange is much more efficient than traversing the network communication path via Dom0. Each of the above approaches offers varying tradeoffs between communication performance and application transparency.

• Lack of communication awareness in CPU scheduler: The CPU scheduler in the hypervisor also has a major influence on the latency of communications between co-located VMs. If the CPU scheduler is unaware of the communication requirements of co-located VMs, then it might make non-optimal scheduling decisions that increase the inter-VM communication latency. For example, the Xen hypervisor currently has two schedulers: the simple earliest deadline first (SEDF) scheduler and the Credit scheduler. The SEDF scheduler makes each VM specify a required time slice in a certain period; a (slice, period) pair represents how much CPU time a domain is guaranteed in a period. The SEDF scheduler preferentially schedules the domain with the earliest deadline. It requires a finely tuned parameter configuration for meeting VMs' performance requirements. On the other hand, the Credit scheduler is a proportional share scheduler with a load balancing feature for SMP systems. The Credit scheduler is simple but provides reasonable fairness and performance guarantees for CPU-intensive guests. Both the Credit and SEDF schedulers within Xen focus on fairly sharing processor resources among all the domains, and they perform well for compute-intensive domains. However, neither scheduler can provide good performance when faced with I/O intensive VMs or those with extensive inter-VM interactions. This is because the two schedulers are agnostic to inter-VM communication requirements.

A number of prior works have attempted to partly address this problem. For example, [1] and [3] have exploited the I/O statistics of the guest operating system to obtain greater knowledge of the VM's internal behavior and to guide scheduling decisions. Xen's Credit scheduler itself uses a boosting mechanism to preferentially schedule I/O bound VMs that receive event notifications. Such approaches serve as useful heuristics, but do not fully address the broader problem of optimizing inter-VM interactions.

• Absence of real-time inter-VM interactions: Another problem with current virtualization platforms is the lack of support for real-time inter-VM interactions. For example, after one VM transmits a packet to another co-located peer VM, the peer VM must be scheduled as quickly as possible to guarantee the lowest possible message latency. Also, when the peer VM is scheduled, the CPU scheduler within the peer VM needs to ensure that the incoming message is processed in a timely manner. The unpredictability of current VM scheduling mechanisms makes it hard to provide any form of timeliness guarantees. Since the hypervisor lacks knowledge of the timing requirements of applications within each VM, it cannot meaningfully provide timing guarantees for interactions between co-located VMs. This semantic gap is the root cause of poor support for real-time inter-VM interactions.

The rest of the report is organized as follows. Section 2 provides background on the Xen networking subsystem and CPU schedulers. Section 3 describes the state-of-the-art techniques that use inter-VM shared memory mechanisms to improve both communication throughput and latency. Section 4 describes the state-of-the-art in reducing inter-VM communication latency by optimizing CPU scheduling policies in the hypervisor. Section 5 describes some of the open research problems in inter-VM communication and concludes the report.


Figure 1: Split netfront/netback driver architecture in Xen. Network traffic between VM1 and VM2 needs to traverse the software bridge in the driver domain.

2 Background

2.1 Xen Network Interface Background

Xen virtualization technology provides close to native machine performance through the use of para-virtualization – a technique by which the guest OS is co-opted into reducing the virtualization overhead via modifications to its hardware dependent components. In this section, we review the relevant background of the Xen networking subsystem. Xen exports virtualized views of network devices to each guest OS, as opposed to real physical network cards with a specific hardware make and model. The actual network drivers that interact with the real network card can either execute within Dom0 – a privileged domain that can directly access all hardware in the system – or within Isolated Driver Domains (IDD), which are essentially driver specific virtual machines. IDDs require the ability to hide PCI devices from Dom0 and expose them to other domains. In the rest of the paper, we will use the term driver domain to refer to either Dom0 or the IDD that hosts the native device drivers.

The physical network card can be multiplexed among multiple concurrently executing guest OSes. To enable this multiplexing, the privileged driver domain and the unprivileged guest domains (DomU) communicate by means of a split network-driver architecture shown in Figure 1. The driver domain hosts the backend of the split network driver, called netback, and the DomU hosts the frontend, called netfront. The netback and netfront interact using a high-level network device abstraction instead of low-level network hardware specific mechanisms. In other words, a DomU only cares that it is using a network device, but doesn't worry about the specific type of network card.


Netfront and netback communicate with each other using two producer-consumer ring buffers – one for packet reception and another for packet transmission. The ring buffers are nothing but a standard lockless shared memory data structure built on top of two primitives – grant tables and event channels. Grant tables can be used for bulk data transfers across domain boundaries by enabling one domain to allow another domain to access its memory pages. The access mechanism can consist of either sharing or transfer of pages. The primary use of the grant table in network I/O is to provide a fast and secure mechanism for unprivileged domains (DomUs) to receive indirect access to the network hardware via the privileged driver domain. Grant tables enable the driver domain to set up a DMA based data transfer directly to/from the system memory of a DomU rather than performing the DMA to/from the driver domain's memory with additional copying of the data between the DomU and the driver domain.

The grant table can be used to either share or transfer pages between the DomU and the driver domain. For example, the frontend of a split driver in a DomU can notify the Xen hypervisor (via the gnttab_grant_foreign_access hypercall) that a memory page can be shared with the driver domain. The DomU then passes a grant table reference via the event channel to the driver domain, which directly copies data to/from the memory page of the DomU. Once the page access is complete, the DomU removes the grant reference (via the gnttab_end_foreign_access call). Such a page sharing mechanism is useful for synchronous I/O operations, such as sending packets over a network device or issuing read/write to a block device.
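
As a concrete illustration of the sharing path just described, the following minimal sketch grants a frontend page to the driver domain using the Linux grant-table interface. The exact signatures of gnttab_grant_foreign_access(), gnttab_end_foreign_access(), and the virt_to_mfn() helper vary across kernel versions, so this should be read as an assumption-laden sketch rather than the code of any particular driver.

```c
#include <xen/grant_table.h>

/*
 * Sketch: allow the driver domain (backend_id) to access one page of this
 * DomU, then pass the returned grant reference to the backend via the
 * descriptor ring / event channel.
 */
static int share_page_with_driver_domain(domid_t backend_id, void *page_va)
{
    /* Hypercall wrapper: marks the frame as accessible by backend_id. */
    int gref = gnttab_grant_foreign_access(backend_id,
                                           virt_to_mfn(page_va),
                                           0 /* read-write */);
    if (gref < 0)
        return gref;  /* no free grant entries */

    /* gref is now sent to the driver domain, which maps or copies the
     * page directly while the I/O operation is in flight. */
    return gref;
}

/* Once the backend signals completion, revoke its access to the page. */
static void unshare_page(int gref, unsigned long page)
{
    gnttab_end_foreign_access(gref, 0 /* read-write */, page);
}
```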

At the same time, network devices can receive data asynchronously, that is, the driver domain may not know the target DomU for an incoming packet until the entire packet has been received and its header examined. In this situation, the driver domain first DMAs the packet into its own memory page. Next, if the received packet is small, the driver domain can choose to copy the entire packet to the DomU's memory across a shared page. Alternatively, if the packet is large, the driver domain notifies the Xen hypervisor (via the gnttab_grant_foreign_transfer call) that the page can be transferred to the target DomU. The DomU then initiates a transfer of the received page from the driver domain and returns a free page back to the hypervisor.

Excessive switching of a CPU between domains can negatively impact performance due to an increase in TLB and cache misses. An additional source of overhead can be the invocation of frequent hypercalls (the equivalent of system calls for the hypervisor) in order to perform page sharing or transfers. Security considerations may also force a domain to zero a page being returned in exchange for a page transfer, which can negate the benefits of page transfer to a large extent [5].

2.2 Xen CPU Scheduler Background

The Xen hypervisor has two major CPU schedulers – the Simple Earliest Deadline First (SEDF) scheduler and the Credit scheduler. SEDF was designed for the earliest versions of Xen. This scheduler provides weighted CPU sharing and uses soft real-time algorithms to partition the CPU time. With the SEDF scheduler, each domain is allocated processing resources according to two parameters: period and slice. The SEDF scheduler assures that a domain can get the CPU for the amount of time given by its slice during every interval of time given by its period, as long as the domain is runnable and not blocked. The SEDF scheduler maintains a deadline and a time slice for each domain. The deadline is the time at which the domain's current period ends, and the time slice is the amount of processing time that the domain is supposed to receive before the deadline passes. The run queue is ordered according to the deadlines of the domains, and the domain with the earliest deadline is scheduled first. I/O-intensive domains that consume little processing time will typically have earlier deadlines, and thus higher priority, than CPU-intensive domains, which allows I/O-intensive domains to achieve lower latency. The SEDF scheduler works fine for workloads that combine CPU- and I/O-intensive domains on a single core machine, but it does not support multi-processors well, which are the platforms on which virtualization is predominantly deployed.

The Credit scheduler is currently used by the Xen hypervisor as the default CPU scheduler. The main goal of the Credit scheduler is to allocate processor resources fairly. Each domain is given a certain number of credits initially. Credits are deducted periodically from the running domains and stand for the cost that each domain should pay for its processor resources. Therefore, the domains should receive an equal fraction of processor resources if each domain is given the same number of credits. The Credit scheduler works as follows. Domains can be in one of two states: OVER or UNDER. The UNDER state means that the domain has credits remaining; the OVER state means that it has gone over its credit allocation. Credits are charged to the currently running domain periodically (every 10ms via scheduler interrupts). If there are domains in the UNDER state waiting to get CPU resources, domains in the OVER state cannot get scheduled. That means a domain cannot use more than its fair share of the processor resources unless the processor(s) would otherwise have been idle. When the sum of the credits in the system goes negative, new credits are added for all domains. The run queue is organized in a first-in, first-out manner for domains in the same state. The domain at the head of the run queue is scheduled next to execute. A domain is removed from the run queue when it gets its turn to run, and is reinserted into the appropriate run queue (depending on its state) after execution. The Credit scheduler replaced the SEDF scheduler as the default scheduler in recent versions of Xen because the Credit scheduler improves scheduling fairness on multiprocessors and provides better QoS controls.
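
The following sketch (not the actual Xen source) captures the Credit scheduler behavior described above: a periodic accounting tick debits the running domain, priority is derived from the sign of its remaining credits, and the next domain is picked FIFO from UNDER-state domains before OVER-state ones. The constant and field names are invented for illustration.

```c
enum credit_prio { UNDER, OVER };

struct domain {
    int credits;                 /* remaining credit balance */
    enum credit_prio prio;
    struct domain *next;         /* FIFO run-queue link */
};

#define TICK_DEBIT 100           /* credits charged per 10 ms scheduler tick */

/* Periodic accounting: debit the running domain and update its state. */
static void credit_account_tick(struct domain *running)
{
    running->credits -= TICK_DEBIT;
    running->prio = (running->credits > 0) ? UNDER : OVER;
}

/*
 * Pick the next domain to run: the head of the UNDER queue if it is
 * non-empty, otherwise the head of the OVER queue. An OVER domain
 * therefore runs only when the CPU would otherwise be idle, which is
 * what keeps every domain within its fair share.
 */
static struct domain *credit_pick_next(struct domain *under_q,
                                       struct domain *over_q)
{
    return under_q ? under_q : over_q;
}
```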

Both the SEDF and Credit schedulers provide only coarse-grained scheduling for domains and do not have the notion of timeliness or performance assurance for inter-VM interactions.


3 Shared Memory Techniques for Inter-VM Communication

The default communication mechanism between two co-located VMs in virtualized settings is through their virtual network interfaces. This default network datapath typically results in low network throughput and high CPU usage. The root cause of this poor performance is that each packet needs to traverse an intermediate driver domain, besides requiring the use of multiple hypercalls. One approach to reduce this overhead is to use a simple shared memory channel between the communicating domains for exchanging network packets, which requires fewer hypercalls and bypasses part or all of the default network datapath. In this section, we describe and compare a number of proposed techniques employing shared memory communication.

3.1 XenSocket [12]

XenSocket is a high throughput unidirectional inter-VM communication mechanism that uses a shared memory buffer between communicating VMs to completely bypass the network protocol stack. The receiver VM allocates a 128 KB pool of pages and asks the Xen hypervisor to share those pages with the sender VM. These pages are reused in a circular buffer. XenSocket provides a one-way communication pipe. Once the sender writes data to the shared memory channel, the data is visible immediately to the receiver. The core piece of the implementation is an efficient data transfer algorithm that uses one shared control variable indicating the number of bytes available for writes in the circular buffer. The sender and receiver maintain local read/write offsets into the circular buffer.
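
The data transfer algorithm can be sketched as follows; the structure layout, the use of C11 atomics for the shared free-space counter, and the helper names are assumptions for illustration, not the XenSocket implementation itself.

```c
#include <stdatomic.h>
#include <stddef.h>

#define XS_BUF_SIZE (128 * 1024)        /* 128 KB circular buffer */

/* Memory shared (via grant table) between sender and receiver. */
struct xs_shared {
    atomic_long free_bytes;             /* the single shared control variable */
    char data[XS_BUF_SIZE];
};

/* The write offset is private to the sender; the receiver keeps its own
 * private read offset and never sees this one. */
static size_t write_off;

static int xs_send(struct xs_shared *s, const char *buf, size_t len)
{
    if ((size_t)atomic_load(&s->free_bytes) < len)
        return -1;                      /* buffer full: caller retries later */

    for (size_t i = 0; i < len; i++) {  /* copy with wrap-around */
        s->data[write_off] = buf[i];
        write_off = (write_off + 1) % XS_BUF_SIZE;
    }
    atomic_fetch_sub(&s->free_bytes, (long)len);
    /* An event-channel notification would be sent here to wake the peer. */
    return 0;
}

/* The receiver mirrors this loop with its private read offset and calls
 * atomic_fetch_add(&s->free_bytes, len) after consuming the data. */
```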

A new “XenSocket”-type socket family is used to communicate across the shared memory channel. Network applications/libraries have to use this new socket type in order to take advantage of XenSocket. The core operations on a XenSocket are bind() and connect(). These are mandatory steps for the sender and receiver to negotiate the control information and set up the shared memory. After the two functions are successfully called, the communication channel is set up and ready to exchange data packets.

The receiver calls the bind() API to set up shared memory before the sender calls connect(). The bind() operation does the following tasks:

• Binds the socket to an address

• Allocates physical memory for the descriptor page and the shared circular buffer

• Returns the grant table reference of the descriptor page

• Uses the sender's domain ID to allocate an event channel for communication with the sender

The sender calls the connect() API to map the shared memory into its own address space. The connect() operation does the following tasks:


• Uses the supplied receiver domain ID and grant table reference of the shared descriptor page

• Maps the physical pages of the shared circular buffer into the virtual address space of the sender

• Establishes the other end of the event channel to communicate events

XenSocket needs to invoke hypercalls to notify the other peer about read/write events. Every time the sender/receiver writes/reads packets to/from the shared circular buffer, it sends out an event channel notification to the other peer.
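
Putting these steps together, application-level usage of a XenSocket-style interface might look like the sketch below. The AF_XENSOCK constant, the sockaddr_xs layout, and the convention that bind() returns the grant reference of the descriptor page (as listed above) are illustrative placeholders rather than the actual constants from the XenSocket patch.

```c
#include <sys/socket.h>
#include <stdint.h>

#define AF_XENSOCK 40                 /* placeholder address-family number */

struct sockaddr_xs {                  /* assumed address layout */
    sa_family_t sxs_family;           /* AF_XENSOCK */
    uint16_t    sxs_domid;            /* peer domain ID */
    uint32_t    sxs_gref;             /* descriptor-page grant reference */
};

/* Receiver side: allocate the shared buffer and obtain its grant reference. */
static int xs_listen(uint16_t sender_domid)
{
    int fd = socket(AF_XENSOCK, SOCK_STREAM, 0);
    struct sockaddr_xs addr = { AF_XENSOCK, sender_domid, 0 };

    /* Per the description above, bind() allocates the descriptor page and
     * circular buffer and returns the descriptor's grant table reference,
     * which is then communicated to the sender out of band. */
    int desc_gref = bind(fd, (struct sockaddr *)&addr, sizeof(addr));
    return desc_gref < 0 ? -1 : fd;
}

/* Sender side: map the receiver's shared memory using the advertised
 * domain ID and grant reference, and set up the event channel. */
static int xs_connect(uint16_t receiver_domid, uint32_t desc_gref)
{
    int fd = socket(AF_XENSOCK, SOCK_STREAM, 0);
    struct sockaddr_xs addr = { AF_XENSOCK, receiver_domid, desc_gref };

    return connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ? -1 : fd;
}
```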

In summary, XenSocket provides a shared memory communication mechanism that exports a new socket-like interface. Test results show that XenSocket interdomain throughput approaches that of interprocess communication using UNIX domain sockets. However, applications and libraries need to be modified to explicitly invoke these calls. In the absence of support for automatic discovery and migration, XenSocket is primarily used for applications that are already aware of the co-location of the other VM endpoint on the same physical machine, and which do not expect to be migrated. XenSocket is particularly targeted at applications in high-throughput distributed stream systems, where latency requirements are not as strict and batching can be performed at the receiver side.

3.2 XWay [4]

XWay is another shared memory based inter-domain communication mechanism which provides an accelerated communication path between VMs on the same physical machine. Unlike XenSocket, which cannot provide binary compatibility and only provides a one-way communication channel, XWay provides full binary compatibility for applications communicating over TCP sockets. In other words, user applications using TCP sockets don't need to be modified in order to use the shared memory channel.

As depicted in Figure 2, the XWay architecture is composed of three layers: switch, protocol, and driver. It takes a significant amount of effort to implement a new socket module that supports all socket options for a new transport layer. To reduce the development effort as much as possible, a virtual socket, called the XWay socket, is introduced. The XWay switch layer transparently switches between the TCP socket and the XWay protocol; it chooses which lower layer protocol should be called whenever a message is sent. At the very first packet delivery attempt, the XWay switch determines whether the destination domain resides in the same physical machine by consulting a static file that lists all co-located VMs. XWay then creates a shared memory channel and binds it to the XWay socket. In order to set up the shared memory channel and an event channel, the two VMs have to exchange the grant reference (or identity) of the shared memory and the port number of the event channel. This is normally done through another control channel. All subsequent packets from the same connection are redirected to the shared memory channel. Otherwise, the switch layer simply forwards the requests to the TCP layer.

Figure 2: Architecture of XWay.
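
The switching logic can be summarized by the following sketch: on the first transmission of a connection the switch consults the static co-location list and then pins the socket to either the shared-memory path or the ordinary TCP path. All function and type names here are invented for illustration and do not come from the XWay source.

```c
#include <stdbool.h>
#include <stddef.h>
#include <netinet/in.h>

enum xway_path { PATH_UNDECIDED, PATH_XWAY, PATH_TCP };

struct xway_sock {
    enum xway_path path;
    struct in_addr peer_ip;
};

/* Assumed helpers: co-location lookup against the static file, channel
 * setup (grant reference + event channel exchange over the control
 * channel), and the two transmit paths. */
bool peer_is_colocated(struct in_addr ip);
int  xway_setup_channel(struct xway_sock *xs);
int  xway_shmem_send(struct xway_sock *xs, const void *buf, size_t len);
int  tcp_send(struct xway_sock *xs, const void *buf, size_t len);

int xway_switch_send(struct xway_sock *xs, const void *buf, size_t len)
{
    if (xs->path == PATH_UNDECIDED) {            /* first packet only */
        if (peer_is_colocated(xs->peer_ip) && xway_setup_channel(xs) == 0)
            xs->path = PATH_XWAY;                /* later packets reuse it */
        else
            xs->path = PATH_TCP;
    }
    return (xs->path == PATH_XWAY) ? xway_shmem_send(xs, buf, len)
                                   : tcp_send(xs, buf, len);
}
```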

The XWay protocol layer supports TCP socket semantics for data send/receive operations such as blocking and non-blocking mode I/O. XWay defines a virtual device and a device driver to represent the XWay socket. The virtual device is defined as a set of XWay channels, where each channel consists of two circular queues (one for transmission and the other for reception) and one event channel shared by the two domains. The event channel is used to send an event to the opposite domain. An event is generated if either a sender places data on the send queue or a receiver pulls data out of the receive queue. Basic flow control is performed by means of the circular buffers.

In summary, like XenSocket, XWay achieves high performance by bypassing the TCP/IP stack and providing a direct shared-memory communication channel between VMs in the same machine. However, XWay can only support TCP communication, and it requires significant changes to the Linux kernel code.

3.3 Inter Virtual Machine Communication (IVC) [2]

A distributed message-passing paradigm provides the most useful transition path for minimally-modified distributed software architectures into a multi-core environment. With great benefits for system management like fault tolerance, performance isolation, and load balancing, VMs are an attractive solution for HPC. But VM technologies haven't yet been widely deployed in HPC environments due to performance overheads.

This work proposed the Inter-VM Communication (IVC) library, which provides direct shared memory communication between co-located VMs. It modifies MVAPICH2, an MPI library over InfiniBand, to allow user applications to use a socket-style user API for IVC-aware applications. MVAPICH2-ivc extends MVAPICH2: it automatically chooses between IVC and network communication and hides the complexities of the IVC-specific APIs from user applications. IVC differs from the previous XenSocket and XWay in that it provides migration support and automatic domain discovery.

IVC sets up shared memory regions through the Xen grant table mechanism. IVC consists of two parts: a user space communication library and a kernel driver. The client process calls the IVC user library to initiate connection setup, and is notified of the result when connection setup is finished. Internally, the IVC user library allocates a shared memory region. The user library then calls the kernel driver to grant the remote VM the right to access those pages and returns the grant table reference handles from the Xen hypervisor. The reference handles are sent to the IVC library on the destination VM, which then maps the communication buffer into its own address space through the kernel driver.

The kernel driver uses an algorithm similar to XenSocket to communicate over the shared memory connection between the two VM domains, and provides a socket-style interface. Communication establishment follows a client-server model, and communication uses a producer/consumer circular ring buffer.

IVC supports automatic discovery of peers on the same node. An IVC backend driver runs in the privileged Domain-0. Each parallel job that intends to use IVC is assigned a unique magic ID by the cluster administrator. All registered domains with the same magic ID form a local communication group in which processes can communicate through IVC. When a computing process initializes, it notifies the IVC library of the magic ID through a function call. This process is then registered with the backend driver. Assignment of this unique magic ID across the cluster can be provided by batch schedulers, though the paper does not provide additional details.

A communication coordinator is introduced to handle migration by dynamically creating and tearing down IVC connections as the VM migrates. The communication coordinator keeps lists of channels and outstanding packets that are being actively used on the same host. The IVC kernel driver gets a callback from the Xen hypervisor once the VM is about to migrate. It then notifies the IVC user library to stop writing data into the send ring, to prevent data loss during migration. After that, the IVC kernel driver notifies all other VMs on the same host through event channels that the VM is migrating. IVC then tells user programs that the IVC channel will be torn down due to migration. Finally, IVC unmaps the shared memory pages. Once migrated, IVC will be available to peers on the new host. The coordinator will get a callback from IVC and set up IVC connections to eligible peers.
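
The teardown sequence on the migrating host can be summarized as a suspend handler along the following lines; the function names are invented placeholders for the steps described above, not the IVC driver's real entry points.

```c
struct ivc_channel;

/* Assumed helpers corresponding to the steps described in the text. */
void ivc_freeze_send_ring(struct ivc_channel *ch);     /* stop user writes    */
void ivc_notify_local_peers(struct ivc_channel *ch);   /* via event channels  */
void ivc_notify_user_teardown(struct ivc_channel *ch); /* callback to MPI lib */
void ivc_unmap_shared_pages(struct ivc_channel *ch);   /* release grants/maps */

/* Invoked when the hypervisor signals that this VM is about to migrate. */
void ivc_on_suspend(struct ivc_channel *channels[], int nchannels)
{
    for (int i = 0; i < nchannels; i++) {
        ivc_freeze_send_ring(channels[i]);      /* avoid losing in-flight data */
        ivc_notify_local_peers(channels[i]);    /* tell co-located VMs         */
        ivc_notify_user_teardown(channels[i]);  /* user program drops channel  */
        ivc_unmap_shared_pages(channels[i]);
    }
    /* After migration completes, the coordinator is called back again and
     * re-establishes IVC connections to eligible peers on the new host. */
}
```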

The paper evaluates the design on multi-core computing systems, showing that MVAPICH2-ivc performs close to native platforms. Migration tests show that MVAPICH2-ivc automatically switches to IVC whenever the target peers are on the same physical node.


3.4 MMNet [9]

MMNet is another inter-VM communication mechanism. It differs from the previous approaches in that it eliminates data copies across VM kernels by mapping the entire kernel address space of one VM into the address space of its communicating peer VM in a read-only fashion. During data transfer, shared-memory mechanisms typically either set up a designated, limited shared memory segment, which entails copying costs, or employ hypervisor calls to enable mapping of pages to avoid copying. Both approaches add latency to the critical data path. MMNet eliminates data copies and hypervisor calls in the critical path by mapping in the entire kernel address space of the peer VM. The disadvantage of this approach is that it relaxes memory isolation between VMs. Although a bug in one VM cannot corrupt the peer VM's memory because of the read-only mapping, it could potentially crash the peer VM if that VM attempts to traverse corrupted data structures such as linked lists.

The Linux kernel enables extensibility by exporting a standard generic link layer device interface. This allows MMNet to make its link-layer driver a loadable kernel module without affecting the behavior of other system functions. MMNet implements and exports a standard Ethernet interface, and interposes at the link layer by updating routing tables. Since this is the lowest layer in the protocol stack, most of the OS subsystems and all the user applications can work transparently.

VM discovery does not require a coordinating MMNet module in Domain 0. Each joining VM writes to a global XenStore directory to advertise its presence, and each member VM watches the directory for membership updates. When a VM is created, new MMNet connection channels are established dynamically between the new VM and other co-located VMs, and new IP routing table entries are added to route traffic to the MMNet Ethernet interface. Within a VM, running applications can seamlessly switch to (or from) the MMNet connection path by updating the IP routing tables appropriately. VM destruction triggers removal of the routing table entries and tear-down of the MMNet channel.
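
A minimal sketch of this XenStore-based discovery, written against the Linux xenbus interface, is shown below. The directory name "/local/mmnet" is an invented placeholder, and the xenbus watch callback signature differs between kernel versions, so treat the details as assumptions.

```c
#include <linux/kernel.h>
#include <xen/xenbus.h>

static struct xenbus_watch mmnet_watch;

/* Fired whenever the membership directory changes: rescan the directory,
 * establish channels to new peers, and remove routes to departed ones. */
static void mmnet_membership_changed(struct xenbus_watch *watch,
                                     const char *path, const char *token)
{
    /* re-read "/local/mmnet" and update MMNet channels / IP routes */
}

static int mmnet_advertise_and_watch(int my_domid)
{
    char node[16];

    /* Advertise our presence under the shared directory. */
    snprintf(node, sizeof(node), "%d", my_domid);
    xenbus_write(XBT_NIL, "/local/mmnet", node, "up");

    /* Watch the directory so membership updates trigger channel setup. */
    mmnet_watch.node = "/local/mmnet";
    mmnet_watch.callback = mmnet_membership_changed;
    return register_xenbus_watch(&mmnet_watch);
}
```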

Within the MMNet channel, data is encapsulated in a scatter-gather (SG) array. The shared meta-data segment, which contains a pointer to the SG array and its length, is similar to the Xen I/O rings, which use a producer/consumer circular buffer algorithm. Translation between sk_buff and SG array happens before packet send and after packet receive, similar to the mechanism in netfront but without copying and page flipping.

Security can be a major concern for MMNet when the communicating VMs cannot fully trust each other.

3.5 Efforts in Improving Xen Network Performance

An application's performance in a virtual machine environment can differ remarkably from its performance in a non-virtualized environment. The following three papers try to diagnose and solve the performance issues with network I/O in a virtualized environment. Since network I/O typically requires communication between a guest VM and Domain-0, the techniques proposed here can lead to insights into inter-VM communication mechanisms.

3.5.1 Diagnosing Performance Overheads in the Xen VM [7]

This work introduces Xenoprof, a statistical profiling toolkit implemented for the Xen virtual machine environment. The toolkit enables coordinated profiling of multiple VMs in a system to obtain the distribution of hardware events such as clock cycles and cache and TLB misses. Using Xenoprof to analyze the performance overheads incurred by network I/O, the work shows that networking workloads in the Xen environment can suffer significant performance degradation. The results identify the main sources of this overhead as the page-sharing and page-flipping mechanisms, which are the focus of subsequent optimization efforts.

3.5.2 Optimizing Network Virtualization in Xen [6]

This paper proposes and evaluates three techniques for optimizing network performance in the Xen virtualized environment. First, it redefines the virtual network interfaces of guest domains to incorporate high-level network offload features available in most modern network cards. Second, it optimizes the implementation of the data transfer path between guest and driver domains. The optimization avoids expensive data remapping operations on the transmit path, and replaces page remapping with data copying on the receive path. Finally, it provides support for the guest operating systems to effectively utilize advanced virtual memory features such as superpages and global page mappings.

The optimizations improve the transmit performance of guest domains by a factor of 4.4. The receive performance of the driver domain is improved by 35% and reaches within 7% of native Linux performance. The receive performance in guest domains improves by 18%, but still trails native Linux performance by 61%.

3.5.3 Safe and Transparent Network Interface Virtualization [10]

Although the full 10Gbps line rate is achievable in native Linux, it is not yet achievable in virtualized systems due to poor network I/O performance. For example, Xen can only achieve 2.9 Gbps. This paper mainly focuses on improving the receive data path throughput for VMs in machines with a 10Gbps Ethernet interface.

Currently, the basic steps involved in the receive data path are (1) hardware devices DMA packets into the virtualization system and (2) the virtualization system de-multiplexes and copies the packets to the guest VM. De-multiplexing in software and the consequent packet copying are common sources of overhead.

The paper proposes a hardware-assisted (direct) I/O virtualization model in which customized NICs with I/O virtualization support are used to allow direct access to guest memory. The hardware takes care of de-multiplexing and DMAs packets directly into guest domain memory, without giving the guest domain direct access to the hardware.

The I/O virtualization uses multi-queue NICs to obtain the benefits of the direct I/O model, which eliminates de-multiplexing and packet copying overheads. The key idea behind multi-queue NICs is to allocate a unique receive queue, configured with its own MAC address, for each guest, and to populate each receive queue with that guest's buffers. Packets are de-multiplexed based on the MAC address in hardware, and they are placed directly into guest memory after de-multiplexing. The hardware falls back on a default queue when all queues are used up.

Although this approach can compromise device transparency and fault isolation, it has the benefit of near native performance without using a memory sharing mechanism. Some additional guest domain optimizations are also introduced to improve network I/O performance.

Grant Reuse: The Grant Reuse mechanism reduces memory sharing overhead by reusing grant references. The traditional mechanism for sharing memory pages would create and destroy grants for the same page over and over again. The basic idea behind Grant Reuse is to reuse existing grants in order to amortize the cost of creating grants over multiple I/O operations.

LRO: The guest domain optimizations also utilize Large Receive Offload (LRO) to improve network performance. LRO aggregates many small packets into fewer large packets before passing them to the stack, which reduces packet processing overheads in the stack.

SPF: Software Prefetching (SPF) is used to prefetch socket buffer structures and packet headers.

HalfPG: Half Page Buffer Allocation (HalfPG) modifies the driver to allocate smaller (half-page) buffers instead of page sized buffers to reduce TLB misses.

In summary, the paper proposes new optimizations to receive packets at 10Gbps line rates. These optimizations bring about an 85% reduction in driver domain packet processing costs and a 30% reduction in guest domain packet processing costs. The total CPU cost is reduced from 4.7 to just 1.55 times the cost in native Linux. However, the approach achieves this performance boost at the expense of device transparency and fault isolation. Live migration also becomes difficult, since I/O devices are closely tied to guest domains.

3.6 XenLoop [11]

XenLoop provides a fully transparent and high performance inter-VM network loopback channel that permits direct network traffic exchange between two VMs in the same machine without the intervention of a third software component, such as Domain-0, along the data path. XenLoop operates transparently beneath existing socket interfaces and libraries. Consequently, XenLoop allows existing network applications and libraries to benefit from improved inter-VM communication without the need for any code modification, recompilation, or re-linking. Additionally, XenLoop does not require any changes to either the guest operating system code or the Xen hypervisor, since it is implemented as a self-contained Linux kernel module. Guest VMs using XenLoop can automatically detect the identity of other co-resident VMs and set up/tear down XenLoop channels on-the-fly as needed. Guests can even migrate from one machine to another without disrupting ongoing network communications, seamlessly switching the network traffic between the standard network path and the XenLoop channel.

Figure 3: Architecture of XenLoop.

Whenever two guest VMs within the same machine have an active exchange of network traffic, they dynamically set up a bidirectional inter-VM data channel between themselves using a handshake protocol. This channel bypasses the standard data path via Dom0 for any communication involving the two guests. Conversely, the guests can choose to tear down the channel in the absence of active traffic exchange in order to conserve system resources.

Figure 3 shows that the heart of the XenLoop module is a high-speed bidirectional inter-VM channel. This channel consists of three components: two first-in-first-out (FIFO) data channels, one for transferring packets in each direction, and one bidirectional event channel. The two FIFO channels are set up using the inter-domain shared memory facility, whereas the event channel is a 1-bit event notification mechanism for the endpoints to notify each other of the presence of data on the FIFO channels. The XenLoop module contains a guest-specific software bridge that is capable of intercepting every outgoing packet from the network layer in order to inspect its header and determine the packet's destination. Linux provides a netfilter hook mechanism to perform this type of packet interception.
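
The interception point can be sketched as a netfilter hook along the following lines. Hook registration and the hook-function signature have changed across Linux versions (this follows the older nf_register_hook() style that was current around the time of these papers), and the two helper functions are placeholders for XenLoop's soft-state lookup table and FIFO channel send path.

```c
#include <linux/types.h>
#include <linux/ip.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <linux/skbuff.h>

/* Placeholders: soft-state table of co-resident XenLoop guests, and the
 * FIFO + event-channel transmit path. */
bool peer_has_xenloop(__be32 daddr);
int  xenloop_channel_send(struct sk_buff *skb);

static unsigned int xenloop_hook(unsigned int hooknum, struct sk_buff *skb,
                                 const struct net_device *in,
                                 const struct net_device *out,
                                 int (*okfn)(struct sk_buff *))
{
    struct iphdr *iph = ip_hdr(skb);

    /* If the destination guest is co-resident and runs XenLoop, short-
     * circuit the packet onto the shared-memory channel; otherwise let it
     * continue down the normal netfront/netback path via Dom0. */
    if (peer_has_xenloop(iph->daddr) && xenloop_channel_send(skb) == 0)
        return NF_STOLEN;
    return NF_ACCEPT;
}

static struct nf_hook_ops xenloop_ops = {
    .hook     = xenloop_hook,
    .pf       = PF_INET,
    .hooknum  = NF_INET_POST_ROUTING,    /* below the network layer */
    .priority = NF_IP_PRI_FIRST,
};

/* registered at module load time with nf_register_hook(&xenloop_ops) */
```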

XenLoop also employs a soft-state domain discovery mechanism to enable transparent setup and teardown of inter-VM channels. The Domain0 discovery module periodically transmits an announcement message to each guest that resides in the same machine. Only the guests that are alive with an active XenLoop module can participate in communication via inter-VM channels.

Figure 4: Throughput versus UDP message size using the netperf benchmark, comparing netback/netfront, native loopback, native inter-machine, and XenLoop.

XenLoop also permits guest VMs to migrate transparently across machines while seamlessly switching between the standard network data path and the high-speed XenLoop channel, depending on whether they reside on the same or different machines.

Evaluation using a number of unmodified benchmarks demonstrates a significant reduction in inter-VM round trip latency and an increase in communication throughput.

Figure 4 plots the bandwidth measured with netperf's UDP_STREAM test as the sending message size increases. We can see from the figure that XenLoop delivers a significant improvement in bandwidth over the netfront/netback path, ranging from a factor of 1.55 to 6.19.

As part of future enhancements, we are presently investigating whether XenLoop functionality can be implemented transparently between the socket and transport layers in the protocol stack, instead of below the network layer, without modifying the core operating system code or user applications. This can potentially eliminate network protocol processing overhead from the inter-VM data path.


Feature                        XenSocket            XWay                 IVC                       MMNet            XenLoop
User transparent               X                    √                    X                         √                √
Kernel transparent             √                    X                    X                         √                √
Transparent migration support  X                    X                    Not fully transparent     X                √
Standard protocol support      X                    Only TCP             Only MPI/app protocols    √                √
Auto VM discovery & setup      X                    X                    X                         √                √
Complete memory isolation      √                    √                    √                         X                √
Location in software stack     Below socket layer   Below socket layer   User library + syscalls   Below IP layer   Below IP layer
Copying overhead               2 copies             2 copies             2 copies                  2 copies         4 copies (at present)

Figure 5: Comparison of mechanisms that utilize shared memory.

3.7 Summary of Shared Memory Inter-VM Communication Mechanisms

State-of-the-art works achieve better inter-VM communication performance by snooping on every packet and short-circuiting those destined to co-located VMs via shared memory. Figure 5 compares the main features of the mechanisms that use shared memory to improve the communication throughput of co-located VMs.

User transparency: Network applications and/or communication libraries need to be rewritten against new APIs and system calls for XenSocket and IVC, thus giving up user-level transparency.

Kernel transparency: With XWay and IVC, guest operating system code needs to be modified and recompiled, thus giving up kernel-level transparency.

Transparent VM Migration: IVC supports VM migration but is not fully transparent, since it still needs to deliver callbacks to the application when the VM is about to be suspended. Only XenLoop supports fully transparent VM migration.


Location in software stack: XenSocket uses a new socket protocol that is parallel to the socket layer, and XWay intercepts socket layer calls and redirects them to its own protocol. Both MMNet and XenLoop sit below the IP layer.

Protocol support: XWay is restricted to applications that use TCP, and IVC is specially designed for parallel protocols like MPI, while MMNet and XenLoop sit below the IP layer and thus support standard protocols.

Auto discovery: IVC uses a magic ID to support discovery, but further details are not available. MMNet and XenLoop support automatic discovery via XenStore.

Copying overhead: XenLoop currently uses 4 copies while all the others use 2 copies.

4 Scheduler Optimizations for Inter-VM Communication

A key shortcoming in existing virtual machine monitors (VMMs) is that the CPU schedulers are agnostic to the communication behavior of modern, multi-tier applications, where each tier could potentially execute in a different co-located VM. This section reviews some of the recent research on inter-VM communication-aware CPU scheduling algorithms.

4.1 Xen & Co: Communication-aware CPU Scheduling [1]

The idea behind this work is to introduce short-term unfairness in CPU allocations by preferentially scheduling communication-oriented VMs over their CPU-intensive counterparts. When the scheduler decides which domain should be chosen out of multiple ready domains, it picks the one that has the most pending packets – either the domain that received the most packets or the domain that has the most packets to send. The goal is to reduce the overall scheduling-induced delay in I/O processing. An upper bound on the time-granularity is also set to assure fairness of CPU allocations.

Xen & Co uses a “book-keeping” page in Domain0 to keep track of the number of packets received and sent by each guest domain. It uses the number of pages flipped with each domain as an indicator of the number of packets received by that domain and increments the reception-intensity variable of the corresponding recipient domain in the book-keeping page of Domain0. For packet transmission, it uses a simple last-value-like prediction in which the number of packets sent by a domain during its last transmission is used as a predictor of the number of packets that it will transmit the next time it is ready to send. The scheduler picks the domain with the highest network intensity to run next, where network intensity is the sum of reception-intensity and transmission-intensity.
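
The selection rule can be illustrated with the following sketch (not the authors' code): network intensity is the sum of the observed reception intensity and the predicted transmission intensity, and among the ready domains that still have reserved slice left, the one with the highest intensity is picked.

```c
struct dom_stats {
    int reception_intensity;   /* pages flipped to this domain (book-keeping page) */
    int predicted_tx;          /* last-value prediction of packets to transmit */
    int slice_remaining;       /* unused portion of the SEDF (slice, period) reservation */
    int runnable;
};

static int network_intensity(const struct dom_stats *d)
{
    return d->reception_intensity + d->predicted_tx;
}

/* Returns the index of the ready domain with the highest network intensity
 * that has not yet exhausted its reservation, or -1 to fall back to the
 * default SEDF decision. */
static int pick_most_communicative(struct dom_stats doms[], int n)
{
    int best = -1;

    for (int i = 0; i < n; i++) {
        if (!doms[i].runnable || doms[i].slice_remaining <= 0)
            continue;          /* fairness: respect the guaranteed slice */
        if (best < 0 ||
            network_intensity(&doms[i]) > network_intensity(&doms[best]))
            best = i;
    }
    return best;
}
```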


A domain receiving high-intensity traffic won't starve other domains. Since the algorithm continues to provide the reservations guaranteed by the default SEDF scheduler, such a domain will only be preferentially chosen so long as it has received less than its slice for the ongoing period.

Experimental results show that not only is the average latency of network communications reduced, but the number of context switches is also reduced significantly. A possible pitfall is that the scheduler preferentially schedules a domain receiving high-intensity traffic even if another domain had received packets earlier, which could result in some level of unfairness.

4.2 Scheduling I/O in Virtual Machine Monitors [8]

This is the first study of the impact of the VMM's CPU scheduler on the performance of multiple VMs concurrently running different types of applications. In particular, different combinations of processor-intensive, bandwidth-intensive, and latency-sensitive applications are run concurrently to quantify the impacts of different scheduler configurations on processor and I/O performance.

This paper provides solutions for three different problems. The first problem with the old Credit scheduler is that a domain has only two states: UNDER and OVER. An UNDER domain has credits remaining, while an OVER domain has exhausted its credit allowance. This approach is biased against I/O, because even though I/O guests cannot use up their credit allowance, they still have to compete with other domains in the same state, given that the Credit scheduler has no built-in notion of urgency.

In order to give I/O domains a timely response, this paper introduces a new state: BOOST, which has higher priority than UNDER and OVER. When a domain receives a virtual interrupt from another domain, normally as a result of inter-VM interactions, it enters the BOOST state if it is presently UNDER. As a result, it does not have to enter the run queue and compete with other domains. Since the scheduler preempts the current domain after it sends the virtual interrupt, the boosted domain is very likely to be scheduled immediately. In this manner, the boosting mechanism allows domains performing I/O operations to achieve lower response latency.
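
The boosting rule amounts to the small state change sketched below (illustrative, not the Xen source): a blocked UNDER domain that receives a virtual interrupt is promoted to BOOST so that it can preempt the sender almost immediately.

```c
enum sched_prio { PRIO_BOOST, PRIO_UNDER, PRIO_OVER };  /* BOOST is highest */

struct sched_dom {
    enum sched_prio prio;
    int blocked;
};

/* Called when `target` receives a virtual interrupt from another domain. */
static void boost_on_virtual_interrupt(struct sched_dom *target)
{
    if (target->blocked && target->prio == PRIO_UNDER) {
        target->prio = PRIO_BOOST;   /* skip the tail of the run queue */
        target->blocked = 0;
        /* The scheduler runs next, and the boosted domain is very likely
         * to be dispatched immediately after the sending domain. */
    }
}
```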

The second problem is that I/O domains tend to block often. When an I/O event arrives to wake up such domains, they have to be re-inserted at the end of the run queue even though they may have a lot of credits remaining. Often they have to wait for domains with fewer credits to finish before they can be scheduled again. To solve this problem, the paper suggests sorting domains by credits remaining within each state. Thus short-running I/O domains are re-inserted near the head of the run queue and hence get a timely response. This technique allows infrequently running but latency-sensitive domains to execute sooner.

Consider another frequent occurrence: when multiple packets destined for distinct domains are delivered to the driver domain, the driver domain will be scheduled. After the driver domain forwards each packet to the corresponding guest domain, the driver domain could get preempted even though there might be time-sensitive packets waiting for delivery. This is because each time a domain delivers a virtual interrupt to another domain, the CPU scheduler is “tickled”, or prematurely invoked, to attempt scheduling the target domain of the virtual interrupt. In order to solve this problem, the paper suggests that tickling not be performed while sending virtual interrupts. Consequently, all recipient domains will be runnable after the driver domain completes, and the scheduler will choose the domain with the highest priority to run.

The performance evaluation shows that the above optimizations greatly improve I/O fairness among co-located VMs. Test results also show that latency-sensitive applications perform best if they are not combined in the same domain with a compute-intensive application, but instead are placed within their own domain. The paper discussed next provides algorithms to address this mixed domain scheduling problem.

4.3 Task-aware Virtual Machine Scheduling [3]

Current CPU schedulers in the hypervisor have the limitation that they do not track the internal activities of a VCPU. For example, an I/O-bound task running in an otherwise idle VCPU will be boosted, while a non-idle VCPU with a mix of I/O- and CPU-intensive tasks cannot be boosted and consequently suffers from low responsiveness. This paper addresses the trade-off between fairness and responsiveness. To improve responsiveness, the scheduler boosts a VCPU whenever a corresponding event arrives; although the responsiveness of a VM is improved, CPU fairness could be compromised.

To achieve both low I/O latency and fairness of CPU allocation, the VMM needs to supplement the boosting mechanism with knowledge about the characteristics of guest-level tasks. The main goal is to boost a VCPU when a pending event is destined for an I/O-bound task in the VCPU while guaranteeing overall CPU fairness.

A priority-based preemptive scheduler preferentially schedules an I/O-bound task when a corresponding I/O event occurs, for low latency, since an I/O-bound task typically consumes very little CPU time. So the algorithm in this paper determines which VCPUs are I/O-bound and boosts their priority. The VMM assumes that a task is I/O-bound only if its degree of confidence is larger than a certain threshold.

To improve I/O responsiveness while keeping CPU fairness, the paper uses a partial boosting mechanism. When an event is pending for a VCPU, the VMM initiates partial boosting for the VCPU regardless of its priority if the VCPU has at least one inferred I/O-bound task. A partially boosted VCPU can preempt a running VCPU and handle the pending event. When the guest operating system schedules a task that is not inferred to be I/O-bound, the VMM revokes the CPU from the partially boosted VCPU. The paper employs a correlation mechanism to correlate an event with the I/O-boundness of a task. Partial boosting based only on the I/O-boundness of tasks has limitations due to the lack of correlation between an event and a task. For example, without correlation information, partial boosting could be initiated in response to an event that is destined for a CPU-intensive task. With correlation information, we can further improve the accuracy of the partial boosting mechanism, but accurate correlation can be challenging because of I/O queuing delays or insufficient correlation clues. The VMM has to use some kind of fuzzy prediction of the correlation possibility. As a simple method, the VMM correlates a requested block I/O with the task running at the request time: it regards a block request as sent by an I/O-bound task if at least one inferred I/O-bound task is inside the inspection window at the request time. For network I/O, an N-bit saturating counter associated with each port number is used to predict whether an incoming packet is associated with an I/O-bound task.
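
The per-port predictor is essentially the classic N-bit saturating counter, sketched below with N = 2 (the paper leaves N as a parameter). The counter is strengthened whenever a packet on the port turns out to belong to an inferred I/O-bound task and weakened otherwise; the upper half of its range predicts "I/O-bound" for the next packet.

```c
#define COUNTER_BITS 2
#define COUNTER_MAX  ((1 << COUNTER_BITS) - 1)

struct port_predictor {
    unsigned counter;                    /* 0 .. COUNTER_MAX, one per port */
};

/* Update after the packet has been attributed to a task. */
static void predictor_update(struct port_predictor *p, int was_io_bound)
{
    if (was_io_bound && p->counter < COUNTER_MAX)
        p->counter++;
    else if (!was_io_bound && p->counter > 0)
        p->counter--;
}

/* Predict whether the next packet on this port targets an I/O-bound task. */
static int predict_io_bound(const struct port_predictor *p)
{
    return p->counter > COUNTER_MAX / 2;
}
```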

Adopting task-awareness in VM scheduling adds intelligence to the VMM to favor I/O performance while guaranteeing CPU fairness. The inference technique for tracking I/O-boundness and the correlation mechanisms are lightweight and best-effort. Test results show that task-aware scheduling is effective for unpredictable and varying workloads such as virtual desktop or cloud computing environments.

4.4 Atomic Inter-VM Control Transfer (AICT)

A key shortcoming in existing virtual machine monitors is that CPU schedulers do not consider the communication behavior of co-located VMs. AICT brings forth a new inter-VM communication-aware CPU scheduling algorithm to solve this problem.

I/O bound domains are unable to use up their time slice most of the time. When an I/O domain communicates with a co-located domain, it needs to frequently yield CPU control in order to wait for the peer VM to process its request. If unrelated VMs get scheduled frequently in between the two interacting domains, the request-response latency of the two domains is lengthened.

Another problem of the Credit scheduler is accounting accuracy. The Credit scheduler charges credits to the running domain every 10ms, but there is a chance that this period may not be consumed completely by that domain. For example, another domain could be scheduled and give up the CPU before this running domain takes over. Figure 6 illustrates this situation for two domains, Dom1 and Dom2, which experience additional communication delays due to DomX being scheduled in between. Ideally, we would like to minimize (t2−t1) and (t4−t3) if we want to ensure timely communication between Dom1 and Dom2. Currently, neither the Credit scheduler nor SEDF can solve this problem.

Figure 6: Scheduling latency between co-located domains Dom1 and Dom2: an unrelated domain DomX scheduled between their interactions lengthens the intervals (t2−t1) and (t4−t3).

The basic idea behind AICT is for the source domain (Dom1 in the above case) to donate its unused time slice to the target domain (Dom2 in the above case) with which it communicates. The target domain gets scheduled immediately after the source domain if it is runnable, and executes for the time slice left over from the previous domain.

As depicted in Figure 7, there are two approaches for the AICT algorithm.

One way AICT: The source domain donates its time slice to the target guest when it has unused time slice. With this approach, we can guarantee low-latency control transfer from the source to the target, but we cannot guarantee the same from the target back to the source.

Two way AICT: Source domain donates time slice to target and target do-main gives back control to source if it is not used up. With this approach,we can get better round-trip response time, since one way AICT cannotassure lowest latency possible from target to source.

We use a modified accounting algorithm to ensure fairness and fine-grained accounting. When the source domain donates its time slice to the target guest, the credits are still charged to the source domain rather than the target domain; that is, whenever a domain donates its time slice, all of the corresponding credits are charged to the donor. The AICT algorithm can thus be used for inter-VM interactions that require low-latency communication between co-located guests.
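
A minimal sketch in C of the one way AICT control transfer and its donor-pays accounting rule is shown below. It is a simplification for illustration, not the actual scheduler modification; the structure fields, millisecond granularity, and function names are assumptions.

    #include <stdbool.h>
    #include <stdio.h>

    /* Simplified per-domain scheduling state (illustrative only). */
    struct domain {
        const char *name;
        int credits;            /* remaining CPU credits */
        int slice_left_ms;      /* unused portion of the current time slice */
        bool runnable;
    };

    /* One way AICT: when 'src' yields while communicating with co-located
     * 'dst', hand the remainder of src's slice directly to dst so that dst
     * runs next instead of an unrelated domain. */
    static void aict_donate(struct domain *src, struct domain *dst,
                            struct domain **run_next)
    {
        if (!dst->runnable || src->slice_left_ms <= 0) {
            *run_next = NULL;                     /* fall back to normal scheduling */
            return;
        }
        dst->slice_left_ms = src->slice_left_ms;  /* donated slice */
        src->slice_left_ms = 0;
        *run_next = dst;                          /* dst is scheduled immediately */
    }

    /* Donor-pays accounting: charge the donor for whatever the donee
     * actually consumed out of the donated slice. */
    static void aict_account(struct domain *donor, int consumed_ms)
    {
        donor->credits -= consumed_ms;
    }

    int main(void)
    {
        struct domain dom1 = { "Dom1", 300, 18, true };
        struct domain dom2 = { "Dom2", 300, 0,  true };
        struct domain *next = NULL;

        aict_donate(&dom1, &dom2, &next);         /* Dom1 yields to Dom2 */
        if (next)
            printf("%s runs next with %d ms donated\n",
                   next->name, next->slice_left_ms);

        aict_account(&dom1, 5);                   /* Dom2 used 5 ms of Dom1's slice */
        printf("Dom1 credits after accounting: %d\n", dom1.credits);
        return 0;
    }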

4.5 Summary

These previous works all recognize that virtual machine scheduling has a critical impact on inter-VM communication and I/O responsiveness. However, the virtual machine monitor remains agnostic with respect to the internal workloads and the communication behavior of the virtual machines. The Xen & Co paper gives higher priority to network I/O intensive domains, but it also has a number of limitations. First, the algorithm relies on Domain0 and the page-flipping mechanism to keep track of network intensity; if packets are exchanged between co-located VMs in a way that bypasses Domain0, the algorithm will not work. Second, the paper assumes that in default Xen each packet occupies an entire page regardless of its size, and then uses the number of pages to estimate network intensity. This assumption does not hold in some optimized versions of Xen.

[Figure 7: Atomic Inter-VM Control Transfer. Within one 30 ms time slice, one way AICT transfers control from Dom1 to Dom2, while two way AICT transfers control from Dom1 to Dom2 and back to Dom1.]

Scheduling I/O in virtual machines [8] provides several specific optimizations to improve the I/O fairness of the Credit scheduler. However, it does not give a solution to the problem of scheduling mixed workloads, nor does it take communication between co-located VMs into consideration.

Task-aware scheduling [3] focuses on improving the performance of I/O-bound tasks within heterogeneous workloads through a lightweight partial boosting mechanism. Both this paper and the Xen & Co paper propose new algorithms that exploit additional I/O information from within guest VMs to help the scheduler make decisions.

None of the previous works actually tackles inter-VM communication workloads. AICT implements a new algorithm specifically for inter-VM communication workloads; it also addresses the accounting inaccuracy of the Credit scheduler that arises when two VMs interact with each other.


5 Current Challenges in Inter-VM Communications

5.1 Real-time Guarantees

Since the hypervisor lacks knowledge of the timing requirements of applications within each VM, it cannot meaningfully provide timing guarantees for interactions between co-located VMs. AICT can partially solve the problem by reducing the latency of communication between co-located VMs, but two problems still remain to be solved.

The first problem is that AICT cannot guarantee that no other domain will be scheduled in between the communicating domains, since the CPU credits may be exhausted by either the source domain or the target domain. We can address this by temporarily granting extra credits to such a domain and charging them back later, when the domain no longer needs real-time guarantees.
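
One possible shape of this temporary credit lending is sketched below in C, assuming a per-domain record of borrowed credits; the names and the repayment policy are hypothetical.

    #include <stdio.h>

    /* Illustrative per-domain credit state. */
    struct domain {
        const char *name;
        int credits;        /* regular CPU credits */
        int borrowed;       /* credits granted temporarily for real-time work */
    };

    /* Grant extra credits so the domain is not descheduled mid-interaction. */
    static void lend_credits(struct domain *d, int amount)
    {
        d->credits  += amount;
        d->borrowed += amount;
    }

    /* Once the real-time phase is over, recover the loan so that long-term
     * CPU fairness is preserved. */
    static void repay_credits(struct domain *d, int amount)
    {
        if (amount > d->borrowed)
            amount = d->borrowed;
        d->credits  -= amount;
        d->borrowed -= amount;
    }

    int main(void)
    {
        struct domain dom1 = { "Dom1", 10, 0 };

        lend_credits(&dom1, 50);     /* keep Dom1 schedulable during the exchange */
        printf("%s: credits=%d borrowed=%d\n", dom1.name, dom1.credits, dom1.borrowed);

        repay_credits(&dom1, 50);    /* charge the loan back later */
        printf("%s: credits=%d borrowed=%d\n", dom1.name, dom1.credits, dom1.borrowed);
        return 0;
    }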

The second problem is that, even though the target domain may be scheduled as intended, the specific real-time process inside the guest may still fail to obtain a CPU time slice from the guest's own scheduler. To assure timely execution of the real-time process within the guest, the hypervisor's CPU scheduler needs to coordinate with the guest's CPU scheduler to communicate the real-time requirements.

5.2 Compositional VM systems

A compositional VM system is a multi-tier system of multiple co-located domains interacting with each other to complete certain tasks. For example, consider an application server sitting between a web server and a database server: the role of the application server is to handle the complex logic required to process a web request, and often the request will also involve access to the database server.

We want to improve the per-transaction responsiveness of the whole compositional VM system. Suppose there is one source domain (Dom1) and two target domains (Dom2 and Dom3), where Dom2 uses the service provided by Dom3. The first option is to charge the credits to the source domain, irrespective of how many target domains the compositional VM system has. The second option is to charge the credits to the domain that uses the service provided by the next target domain: in the example above, when Dom2 provides service for Dom1, the credits are charged to Dom1, and when Dom3 provides service for Dom2, the cost is charged to Dom2.
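
The two charging options can be contrasted with the following C sketch of a three-domain request chain. The chain representation and the names are assumptions made for illustration only, not taken from any of the surveyed systems.

    #include <stdio.h>

    /* A request travelling through a compositional VM system:
     * Dom1 (web) -> Dom2 (application) -> Dom3 (database). */
    #define CHAIN_LEN 3

    struct domain {
        const char *name;
        int credits;
    };

    /* Option 1: always charge the original source of the request. */
    static void charge_source(struct domain chain[], const int consumed[])
    {
        for (int i = 1; i < CHAIN_LEN; i++)
            chain[0].credits -= consumed[i];
    }

    /* Option 2: each domain pays for the time its immediate callee
     * spends on its behalf, i.e. Dom(i) pays for Dom(i+1). */
    static void charge_requester(struct domain chain[], const int consumed[])
    {
        for (int i = 1; i < CHAIN_LEN; i++)
            chain[i - 1].credits -= consumed[i];
    }

    int main(void)
    {
        struct domain a[CHAIN_LEN] = { { "Dom1", 100 }, { "Dom2", 100 }, { "Dom3", 100 } };
        struct domain b[CHAIN_LEN] = { { "Dom1", 100 }, { "Dom2", 100 }, { "Dom3", 100 } };
        int consumed[CHAIN_LEN] = { 0, 4, 7 };   /* ms consumed by Dom2 and Dom3 */

        charge_source(a, consumed);      /* option 1: Dom1 pays for everything */
        charge_requester(b, consumed);   /* option 2: each domain pays its callee */

        for (int i = 0; i < CHAIN_LEN; i++)
            printf("%s  option1: %d  option2: %d\n",
                   a[i].name, a[i].credits, b[i].credits);
        return 0;
    }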

6 Conclusions

This report summarizes the state-of-the-art research in improving the performance of inter-VM communication. Shared memory can greatly improve throughput and reduce latency because it bypasses the deep network datapath and avoids excessive hypercalls. We compared the different approaches that utilize shared memory and showed that each approach offers a tradeoff between communication performance between co-resident VMs and the extent of user-level and kernel-level transparency. We also described different approaches that optimize the hypervisor's CPU scheduling algorithm in order to achieve low communication latency between co-resident guest VMs, and we presented the AICT mechanism, which has the potential to minimize scheduling latency without losing CPU or I/O fairness. Finally, we outlined open problems in improving real-time responsiveness and supporting compositional VM systems.

References

[1] Sriram Govindan, Arjun R. Nath, Amitayu Das, Bhuvan Urgaonkar, and Anand Sivasubramaniam. Xen and Co.: Communication-aware CPU scheduling for consolidated Xen-based hosting platforms. In Proceedings of the 3rd International Conference on Virtual Execution Environments, pages 126–136, 2007.

[2] W. Huang, M. Koop, Q. Gao, and D. K. Panda. Virtual machine aware communication libraries for high performance computing. In Proceedings of SuperComputing, Reno, NV, November 2007.

[3] Hwanju Kim, Hyeontaek Lim, Jinkyu Jeong, Heeseung Jo, and Joonwon Lee. Task-aware virtual machine scheduling for I/O performance. In Proceedings of the 2009 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, pages 101–110, 2009.

[4] Kangho Kim, Cheiyol Kim, Sung-In Jung, Hyun-Sup Shin, and Jin-Soo Kim. Inter-domain socket communications supporting high performance and full binary compatibility on Xen. In Proceedings of the Fourth ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, pages 11–20, 2008.

[5] A. Menon, A. L. Cox, and W. Zwaenepoel. Optimizing network virtualization in Xen. In Proceedings of the USENIX Annual Technical Conference, 2006.

[6] Aravind Menon, Alan L. Cox, and Willy Zwaenepoel. Optimizing network virtualization in Xen. In Proceedings of the 2006 USENIX Annual Technical Conference, 2006.

[7] Aravind Menon, Jose Renato Santos, Yoshio Turner, G. John Janakiraman, and Willy Zwaenepoel. Diagnosing performance overheads in the Xen virtual machine environment. In Proceedings of the 1st ACM/USENIX International Conference on Virtual Execution Environments, pages 13–23, 2005.

[8] Diego Ongaro, Alan L. Cox, and Scott Rixner. Scheduling I/O in virtual machine monitors. In Proceedings of the ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, 2008.


[9] Kiran Srinivasan and Prashanth Radhakrishnan. MMNet: An efficient inter-VM communication mechanism. In Proceedings of the Xen Summit, 2008.

[10] Kaushik K. Ram, Jose R. Santos, Yoshio Turner, Alan L. Cox, and Scott Rixner. Achieving 10 Gb/s using safe and transparent network interface virtualization. In Proceedings of the 2009 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, 2009.

[11] Jian Wang, Kwame-Lante Wright, and Kartik Gopalan. XenLoop: A transparent high performance inter-VM network loopback. In Proceedings of the 17th International Symposium on High Performance Distributed Computing (HPDC), pages 109–118, 2008.

[12] X. Zhang, S. McIntosh, P. Rohatgi, and J. L. Griffin. XenSocket: A high-throughput interdomain transport for virtual machines. In Proceedings of Middleware, 2007.
