Using DRBD over Wide Area Networks


This publication is licensed under Creative Commons Attribution 3.0 Unported.
More information on this license can be found at http://creativecommons.org/licenses/by/3.0/

Project: GigaPort3
Project Year: 2010
Project Manager: Rogier Spoor
Author(s): Arjan Peddemors, Rogier Spoor, Paul Dekkers, Christiaan den Besten
Completion Date: 2011-3-8
Version: 1.1

Summary

This document describes our experiences with using the Distributed Replicated Block Device (DRBD) open-source storage technology in a Wide Area Network (WAN). It focuses on two aspects: data replication (the basic DRBD functionality) and virtual machine migration (using DRBD as a supporting technology). We provide an overview of existing literature and describe our own experiments with using DRBD in the SURFnet WAN environment, showing that DRBD is a versatile product that can be applied to a range of different applications. The experiments give administrators and users of DRBD a clear impression of what can be expected from this technology when storage nodes are far apart and show that open-source technology can be successfully applied to build sophisticated storage environments.

Colophon

Programme line: Enabling Dynamic Services
Part: Task 3 - Storage Clouds
Activity: Activity II PoC Storage Clouds
Deliverable: Proof of Concept on Storage Clouds report
Access rights: Public
External party: Novay, Prolocation

This project was made possible by the support of SURF, the collaborative organisation for higher education institutes and research institutes aimed at breakthrough innovations in ICT. More information on SURF is available on the website www.surf.nl.

Contents

1 Introduction
2 Overview of VM migration in WANs
  2.1 Reasons for live migration over a WAN
  2.2 Live migration steps
  2.3 Specific aspects of WAN migration
    2.3.1 VM connectivity
    2.3.2 Access to data
    2.3.3 Migration performance
3 Overview of DRBD usage in WANs
4 Experiences in the SURFnet WAN
  4.1 Test environment
  4.2 Data replication with DRBD
  4.3 Virtual machine migration using DRBD
5 Conclusions
6 References


1 Introduction

Online data storage has come within reach of many users and organizations over the past several years. The exponential increase of Internet bandwidth is an important enabler for this trend. SURFnet is investigating various technologies that can be used to create an open storage architecture for higher education and research in the Netherlands, by leveraging its high-speed, state-of-the-art national network infrastructure. This effort focuses primarily on providing high-performance and highly available online storage facilities to its participants in the SURFnet Wide Area Network (WAN). As part of this effort, SURFnet has executed a scan of existing solutions and technologies for storage in Wide Area Networks (WANs) (see the survey document at [13]). The Distributed Replicated Block Device (DRBD) [4] was identified as an interesting storage technology that may be used for WAN data replication. In this document, we focus on DRBD usage in WANs; in parallel, we are evaluating other technologies identified as interesting by the scan (GPFS and Gluster).

DRBD is a Linux kernel module that synchronizes the data on two block devices installed in two distinct Linux machines. It provides functionality similar to RAID 1, using the network between those nodes. A DRBD configuration is visible as a logical block device to the operating system, and thus may be used for a wide range of purposes. Under normal operation, a DRBD configuration consists of a primary node and a secondary node, where all access goes through the primary node and data on the secondary node is not accessed directly by applications. All changes on the primary node are replicated at the secondary node, using different replication protocols: asynchronous replication (protocol A), memory synchronous replication (protocol B), and fully synchronous replication (protocol C). With protocol C, a write operation only finishes when the data is stored on disk at both nodes, and it is therefore the safest. In case of a failure on the primary node (e.g., due to a disk crash), the primary node redirects all read and write access to the secondary node.

In a typical setup, the DRBD nodes are connected through a high-speed, low-latency Local Area Network (LAN) or cluster network, which ensures good performance. When operating in a WAN, the performance degrades due to generally lower bandwidth and (much) higher latency. Bandwidth can be expected to rise, but latency has a lower limit imposed by the speed of light. In this document, we provide a more in-depth analysis of the various uses of DRBD in WANs (focusing on VM migration as an application) and report on our own experiences with DRBD. We consider data distribution within the SURFnet infrastructure (which is deployed in the Netherlands on a national scale) and therefore use WANs with diameters up to several hundreds of kilometres (km) and maximum latencies of around 20 milliseconds (ms). This document does not provide a general introduction to DRBD or VM migration and assumes that the reader has a basic understanding of these two technologies.
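As a concrete illustration (not the configuration used later in this document), a minimal DRBD 8.3-style resource definition for two nodes might look like the sketch below; the host names, addresses, and device paths are placeholders.

```bash
# Minimal sketch of a two-node DRBD resource; all names, devices, and
# addresses are placeholders. The file location depends on the DRBD
# packaging (e.g. /etc/drbd.conf or a file under /etc/drbd.d/).
cat > /etc/drbd.d/r0.res <<'EOF'
resource r0 {
  protocol C;             # fully synchronous; A = asynchronous, B = memory synchronous
  device    /dev/drbd0;   # logical block device exposed to the operating system
  disk      /dev/sdb1;    # backing block device on each node
  meta-disk internal;
  on node-a { address 192.168.10.1:7789; }
  on node-b { address 192.168.10.2:7789; }
}
EOF
# After creating the metadata and bringing the resource up on both nodes
# (drbdadm create-md r0; drbdadm up r0), one node is promoted to primary and
# /dev/drbd0 can carry a file system like any other block device.
```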


    The outline of this document is as follows. Section 2 and Section 3 describe the results of a literature scan, while Section 4 summarizes our own experiences. In Section 2, we give an overview of virtual machine migration over long distances (inside a WAN) and the implication of this migration for VM storage access. Section 3 provides an overview of the usage of DRBD in WAN environments. In Section 4, we describe the results obtained from experiments with DRBD and DRBD-based VM migration in the SURFnet WAN environment. Section 5 discusses the conclusions.


2 Overview of VM migration in WANs

Virtualization platforms have long been able to move a guest VM from one host system to another. By means of the common abstraction offered to guest operating systems by the virtualization platform, a guest can be 1) suspended on one host, 2) moved to the other host, and 3) have its operation resumed there. This type of VM migration can take considerable time, which is, for many tasks, not acceptable. More recently, many virtualization platforms have added functionality for live migration, to support moving a running VM from one host to another with a very short interruption time. For example, VMware [10], Xen [3], Microsoft's Hyper-V [9], and VirtualBox [18] all now support live migration. The typical target environment for live migration is a cluster of host machines interconnected by a fast, low-latency network (such as a high-speed LAN), and in this setting VMs may be migrated with very little application-level downtime. In this section, we discuss various aspects of executing live VM migration beyond this target environment, from one host to another over a WAN, and its relevance. We use existing literature and documentation as input for our discussion here; our own experiences with WAN migration are described in Section 4.

2.1 Reasons for live migration over a WAN

Live WAN migration may be desirable for a number of different reasons (see also [17], section 2):

- Bring the computational resources to the data. Operations on (large amounts of) data may be most efficient when executed without moving the data over long distances.

- High availability. Migration over a long distance may preserve availability when computational resources and data networks in a large geographical area are expected to see diminished availability or outages, for instance in case of natural disasters.

- Load balancing. When administering resources spread over considerable distances, for instance when managing machines in multiple datacenters, moving VMs from one location to another may achieve a better balance of the computational load over these locations.

- Maintenance. Large maintenance on a datacenter or WAN network may favor a migration to another location.

All these reasons are valid in the SURFnet context, i.e., administrators of parties connected to the SURFnet infrastructure may move VMs over long distances for all the above reasons. Obviously, when migrating VMs from one location to another in the Netherlands, network distances are not very large (on the order of hundreds of kilometers), although they exceed the metropolitan area network (MAN). Even when considering these limited WAN sizes, it is likely that latency has a strong impact on migration performance.


2.2 Live migration steps

The main steps involved in the live migration of a VM from a source host to a destination host (for all virtualization platforms) are the following:

1. Initialize migration. Reserve resources on the destination host to make sure it will be able to run the VM.

2. Pre-copy the memory state. While running the VM on the source host, transfer all memory pages from the source host to the destination host and keep track of pages that are written. Updated pages will be retransferred.

3. Stop the VM on the source host and copy the non-memory state (CPU state, device state) and remaining dirty memory pages to the destination host.

4. Start the VM at the destination host, transfer any pending state on the source side, and release resources at the source host.

Note that the activity on the VM during the pre-copy phase determines how efficiently the actual transfer of operation from one side to the other can be executed: when the VM continuously makes changes to many memory pages, the fraction of dirty pages will not become very small. In that case, the interruption time between stop and start is relatively long, because many memory pages still need to be transferred. The efficiency of the migration process therefore depends on the application(s) running on the VM. Also note that the steps above concern the migration of the state of the VM itself and do not cover the VM disk image. In LANs or clusters, access to the VM disk file may be arranged through a network file system or a SAN accessible by both the source and destination host. Furthermore, the steps do not indicate how to deal with external storage accessed by the VM.

2.3 Specific aspects of WAN migration

When executing live migration over long distances, multiple aspects require special attention. These aspects are discussed in various research papers (e.g., [17], [6], [8], [20]), product manuals ([9], [18]), and reference architecture documents ([7], [19]). So far, live migration is not widely done over long distances and little further data is available about the operational aspects of WAN migration. Also, many of the experiments done on research prototypes have been executed in simulated environments, i.e., by using WAN emulators such as NetEm [11] and Dummynet [5], not in real WAN testbeds. We now list the main aspects that make WAN migration different from regular VM migration, as they arise from these documents.

2.3.1 VM connectivity

To maintain the transport layer protocol state of the VM during migration (e.g., keep state information about open TCP connections), it is necessary that the VM has the same IP address before and after migration. When moving a VM inside a local network or cluster, this is straightforward to achieve: the VM may keep its MAC address (in an Ethernet network) and rely on the network switch to detect the new point of attachment, or it may send out ARP announcements to update the ARP table at the router and other hosts involved in communication with the VM [3].


In case of WAN migration, the VM is very likely to move to a network served by a different router and a different network address range, which would result, without extra arrangements, in the assignment of another IP number to the VM. As this would break protocol state and application level communication, it is necessary to have a mechanism in place to preserve the IP number. The most common proposed solution is to configure a VLAN or VPN that stretches over the different sites (e.g., [20], [19], [7], [17]). The drawback of this solution is that it leads to suboptimal routing of IP packets in case of external IP nodes communicating with the VM. The VPN router, which handles all traffic between the VM and external nodes, may for example be located close to the source host and far from the destination host. Even when external nodes are close to the destination host, the traffic goes through the VPN router. Some technologies exist that target this problem, but these have their own problems: Mobile IP, for example, has provisions for route optimization, but these rely on extra functionality at both peers. VMs with a lot of external communication are therefore less suitable to migrate in a WAN environment.

Within the SURFnet backbone, a good solution is to use BGP routing (the default routing protocol for client locations). BGP supports routing packets with the same IP address to more than one physical location. The approach consists of using optical lightpaths (fixed or dynamic) [16] to create a VLAN between those locations, enabling the machines within this VLAN to use the same subnet numbering.

Some applications, however, do not require external communication (for instance, because they operate on data within local networks or have local clients), or have short-lived external connections that are allowed to break (e.g., a web crawler). For some of these cases, the VM may be assigned a new IP number, which simplifies the requirements imposed on the migration functionality.

2.3.2 Access to data

The VM at the destination host requires access to the VM disk file (or disk image). Also, many VMs have access to storage external to their own system, i.e., they connect to network drives and block devices available through the LAN or SAN. When migrating locally, it is likely that access to these storage resources is possible at the destination host. A long-haul migration, however, makes access more problematic (although not impossible). A number of different approaches can be taken to keep external storage (i.e., storage external to the destination host running the VM) available to a VM migrated over a long distance, assuming that the connectivity issues described above have been successfully addressed:

1. No special provisions are made and access to external storage is over the WAN. This is easy to configure (identical to the LAN case) but obviously has serious drawbacks in the form of degraded performance. Read and write access is much slower, because in general the WAN bandwidth is low compared to a clustered/LAN environment, but also because the latency between the destination host and the storage server is high. Even in case of a high bandwidth WAN, throughput may still be affected by the latency (i.e., throughput and latency are coupled), depending on the transport layer protocol used. The TCP protocol, for instance, uses a maximum congestion window size which is related to the buffer size allocated for a TCP socket. Default TCP buffer sizes may result in congestion windows that are too small, resulting in only partial usage of the available bandwidth, for instance in case of transferring data with FTP (see also the discussion in [15] on tuning TCP parameters for GridFTP); a worked example of this coupling is given after this list. In case of chatty storage protocols such as CIFS and iSCSI, the latency degrades performance even more. This approach is used when running virtualization platforms with off-the-shelf regular live migration functionality.

    2. The external data is copied to the destination side before and during the migration process. The data always moves with the VM, and is available locally when the VM resumes operation at the destination side, which is optimal in terms of storage access performance, but takes considerable time and WAN resources during migration. This approach is common for functionality focusing on live WAN migration (both research prototypes and products).

    3. The external data is fetched on demand after the VM migration process and cached close to the new location of the VM. This eventually leads to accessed data being stored close to the destination host, and requires little effort prior to migration. The drawback of on-demand retrieval is the diminished performance after migration, especially directly after resuming the VM at the destination because then no data is available locally. This approach is followed by the live storage migration mechanism described in [6].

    4. The external data is continuously replicated at multiple sites. This allows for fast migration but may result in diminished overall storage access performance (especially write actions) because data must be replicated at both sides. Also, it consumes WAN resources all the time, which makes this approach expensive and only attractive in cases of high-availability requirements or in case of highly frequent back and forth migration. The Storage VMotion technology [19] of VMware is capable of supporting this approach (as well as the other approaches).
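To illustrate the throughput and latency coupling mentioned under approach 1: a single TCP stream can transfer at most one congestion window per round trip, so sustaining the full link rate requires a window (and hence a socket buffer) of at least the bandwidth-delay product. For a 1 Gbit/s path with a 20 ms round-trip time (the upper end of the latencies considered in this document), this amounts to

\[
W_{\min} = B \cdot \mathrm{RTT} = 1\ \mathrm{Gbit/s} \times 20\ \mathrm{ms} = 2 \times 10^{7}\ \mathrm{bit} \approx 2.5\ \mathrm{MB},
\]

which is well above the default TCP socket buffer size of many operating systems.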

Note that in this discussion there is an implicit assumption that the migration is taking place between two predefined locations. In cases where VMs may migrate to multiple sites, some external storage approaches may be less feasible. For instance, when data must be continuously replicated at many sites, performance and overhead may be substantially impacted, rendering the fourth approach less attractive. Also, the impact of a second migration to a third site, while data is spread over the two initial sites, is not further considered.

A number of technologies exist that may support the basic operations necessary to implement the above approaches, by combining them with off-the-shelf live migration functionality of existing virtualization platforms. The different modes of DRBD (see Section 3) may be used to realize approaches 2 and 4. A modified version of the Network Block Device (NBD, [12]) is used in [6] to implement the third approach.

The chosen storage migration approach has substantial impact on the time it takes to complete a full migration (including the storage). Note that under no circumstances can an instant migration take place, because some time is always needed to transfer the VM state: a sudden outage or crash leaves no time to migrate. By choosing approach 3 or approach 4, the migration time can be kept short. The choice between approach 3 and 4 depends on whether it is acceptable to have diminished performance after migration (approach 3) or to suffer the overhead of continuous replication (approach 4).

2.3.3 Migration performance

The lower bandwidth and higher latency of WANs (compared to LANs and clustered environments) result in a longer pre-copy phase and a slower transfer of the VM state and external storage. Obviously, the size of the VM (memory and image) as well as the size of any external (non-system) storage data that must be transferred influences the time to migrate considerably. The research system proposed in [20] (called CloudNet) discusses optimizations to mitigate these effects. VMware has set requirements for its live migration functionality (called VMotion) to work over long distances: the minimum network bandwidth must be 622 Mbps and the latency between the source host and destination host must be less than 5 ms.

Figure 1: time required for disk, memory, and control activities during WAN VM migration (modified from Figure 2 in [20])


The experimental results on migration times reported in the research papers mention, in multiple cases, a VM downtime of a few seconds or less ([17], [20]), which is only a little slower than local migration. The total migration takes much longer, and is especially influenced by the storage data transfer (assuming that the storage capacity is well over the memory capacity). The steps taken over time to execute the migration are depicted in Figure 1.

Figure 2: impact of migration activities on the VM application performance (time is not to scale; modified from Figure 13a in [20])

During most of the migration, the VM application is up and running, although the performance of the application is negatively influenced by the migration activities. When copying storage data, the disks are partly used for migration, leaving less capacity for the application. Likewise, during pre-copy of the VM memory, CPU and other system resources are allocated to the migration process. These effects are indicated in Figure 2, showing that the application response time is higher during migration than under regular circumstances.


3 Overview of DRBD usage in WANs

The Linbit Distributed Replicated Block Device (DRBD) is a network block device that supports replication of data between two sites [4]. A system running a DRBD node uses the provided block device like any other block device, e.g., to create a file system. DRBD is targeted at replication in LAN environments, supporting high availability by failover. It can run in different modes, representing different types of synchronization: asynchronous replication (protocol A), memory synchronous replication (protocol B), and fully synchronous replication (protocol C). Protocol C is the safest, as it guarantees at the end of a write operation that the data is stored on disk on both sides. DRBD is able to recover from network outages and other failures. It is important to note that it is the responsibility of the functionality on top of DRBD (such as a distributed file system) to handle concurrent access (i.e., simultaneous writes at both nodes). In many configurations, DRBD runs with one primary node that handles all application requests and one secondary node that receives updates from the primary.

DRBD seems less widely used in WAN environments, although Linbit offers an extension called DRBD Proxy specifically targeted at WAN replication. Unlike the core DRBD product, DRBD Proxy is not available as open source. DRBD Proxy essentially works as a large buffer, to spread the network load more evenly over time. As a consequence, the nodes may be considerably out of sync. Also, it is able to compress the block data, resulting in less traffic over the WAN.

Figure 3: typical DRBD configuration for supporting live VM migration over a WAN (at each site a virtual machine uses a DRBD device backed by LVM and local disks; data replication and VM movement take place over the wide area network)

In this document, our objective is to summarize and share experiences with using the basic DRBD functionality (not DRBD Proxy) in WANs. When looking at the sparse use of DRBD in WAN settings, it turns out that DRBD is applied a number of times to support the live migration of VMs over long distances (as described in the previous section). The research prototypes described in [8] and [20] do this. We have not found documents describing experiences with using DRBD Proxy. DRBD may be used to continuously synchronize data between the sites, which makes the migration process fast. It may also be used on demand, when it is necessary to migrate a VM from one node to another (i.e., data between sites is not replicated continuously). This is the case for the storage migration in the CloudNet prototype [20], which takes the following steps:

    1. Reserve disk space at the destination host and pre-copy the storage data from source host to destination host in asynchronous mode.

    2. Switch to synchronous mode during the VM state transfer, to make sure that all writes are propagated.

    3. Just before resuming the VM at the destination, switch DRBD to dual primary mode (i.e., enable access at both sides concurrently), to allow the VM access to storage at the destination host.

    4. Discard the DRBD node at the source host.
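As an illustration of how these steps map onto DRBD administration, the sketch below shows a plausible command sequence using standard drbdadm subcommands (DRBD 8.3 style). The resource name is a placeholder, and the CloudNet prototype drives these transitions programmatically rather than from a shell; this is not the prototype's actual code.

```bash
# Hypothetical sketch of the CloudNet-style storage migration steps for a
# DRBD resource named "vmdisk" (placeholder). Protocol changes are made by
# editing the resource definition and re-applying it with "drbdadm adjust".

# Step 1: pre-copy storage while the VM runs at the source.
#   Resource configured with "protocol A;" (asynchronous replication).
drbdadm adjust vmdisk

# Step 2: during the VM state transfer, switch to synchronous replication.
#   Resource edited to "protocol C;", then re-applied.
drbdadm adjust vmdisk

# Step 3: just before resuming the VM at the destination, allow dual-primary
#   access ("allow-two-primaries;" in the net section) and promote there.
drbdadm adjust vmdisk            # on both nodes, after the config change
drbdadm primary vmdisk           # on the destination node

# Step 4: discard the DRBD node at the source.
drbdadm secondary vmdisk         # on the source node
drbdadm down vmdisk              # tear the resource down at the source
```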


4 Experiences in the SURFnet WAN

This section describes our experiences with running DRBD in the SURFnet WAN. It covers two experiments: 1) straightforward DRBD data replication between two sites connected through the WAN, and 2) using DRBD as a supporting technology for VM migration between two sites connected through the WAN. These two experiments both run in the same test environment, which is described first. The basic DRBD data replication test provides an indication of the performance of DRBD in our WAN test environment. The performance numbers obtained for this experiment can be used to assess the influence of different network technologies and configuration settings on the basic DRBD operation in a WAN. The VM migration experiment gives an indication of what can be expected of DRBD as an enabling technology for a sophisticated, real-world application.

4.1 Test environment

To investigate DRBD WAN properties and the usage of DRBD as a building block for VM WAN migration, we have built a test environment consisting of multiple machines at two geographical locations. Data traffic over fiber optic cables between these two locations travels a distance of slightly more than 200 km. The test setup consists of two different types of machines, a virtual machine server and a storage server, connected using a number of different network technologies. This enables us to test DRBD and compare performance under different conditions. The two locations each have one server machine with capacity to run multiple virtual machines (the virtual machine server configuration). Location A has two, and location B has one, server machine with powerful CPUs and fast storage (the storage server configuration). These machine configurations can be summarized as follows.

Virtual machine server:
- 2x Intel Xeon Duo E5130 2.0 GHz CPU
- 8 GB RAM memory
- system harddisk (for host OS / virtualization software)
- 1 GBit/s Ethernet RJ45

Storage server:
- 2x Intel Xeon L5410 Quad Core 2.33 GHz CPU
- 16 GB RAM memory
- 2x 32 GB solid state harddisk
- 8x 300 GB SAS disks (2.4 TB in total)
- 1 GBit/s Ethernet RJ45
- 20 GBit/s Dolphin DXH510
- 10 GBit/s Ethernet optical fiber


    The machines are interconnected as indicated in Figure 4. The distance between the two locations results in a minimum round trip delay of ~2 ms (considering the speed of light in fiber, not counting delays caused by repeaters, switching, etc.). The measured ping delay under Linux is approximately 3.5 ms.
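As a back-of-the-envelope check of this minimum round trip delay (assuming light propagates through the fiber at roughly two thirds of its vacuum speed, i.e., a refractive index of about 1.5):

\[
\mathrm{RTT}_{\min} \approx \frac{2 \times 200\ \mathrm{km}}{c / 1.5} = \frac{400\ \mathrm{km}}{2 \times 10^{5}\ \mathrm{km/s}} = 2\ \mathrm{ms}
\]

The difference with the measured ~3.5 ms ping delay is accounted for by the repeaters, switching, and other equipment mentioned above.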

Figure 4: test environment consisting of two types of machines (storage server and virtual machine server) spread over two different locations (location A and location B). The machines are interconnected using different network technologies (1 GBit/s Ethernet over copper, 20 GBit/s Dolphin, 10 GBit/s Ethernet over optical fiber).

4.2 Data replication with DRBD

The basic DRBD tests provide an indication of disk read and write performance under relatively simple conditions, using a file system on top of a DRBD configuration. They are all executed using the storage server machines, showing performance numbers for:

- native (local) disk access (single node)
- access to a dual-node DRBD configuration using the 1 GBit/s LAN (Ethernet) at location A
- access to a dual-node DRBD configuration using the 20 GBit/s LAN (Dolphin) at location A
- access to a dual-node DRBD configuration using the 10 GBit/s WAN (Lightpath), with one node at location A and one node at location B

These tests are performed using the Bonnie++ benchmark software [1]. The storage machines operate with the two solid-state disks in a RAID0 configuration (striped) and with the eight SAS disks in a RAID1+0 configuration (mirrored & striped). They run a Linux 2.6.18 kernel and DRBD version 8.3.9rc1. The most interesting performance indicator is write speed, because for writing, the data must be transferred over the link between the two DRBD nodes (which are high-end server machines). Reading is less interesting because it always operates on local data (it does not require communication between the two DRBD nodes). We consider the following configurations:

Configuration name   DRBD protocol    Network                     Nodes
native               none (no DRBD)   single node, no network     storage server 1
1gb_local_A          A                1 Gbit/s Ethernet (LAN)     storage server 1, storage server 2
1gb_local_B          B                1 Gbit/s Ethernet (LAN)     storage server 1, storage server 2
1gb_local_C          C                1 Gbit/s Ethernet (LAN)     storage server 1, storage server 2
10gb_remote_A        A                10 Gbit/s Ethernet (WAN)    storage server 1, storage server 3
10gb_remote_B        B                10 Gbit/s Ethernet (WAN)    storage server 1, storage server 3
10gb_remote_C        C                10 Gbit/s Ethernet (WAN)    storage server 1, storage server 3
dolphin_local_A      A                20 Gbit/s Dolphin (LAN)     storage server 1, storage server 2
dolphin_local_B      B                20 Gbit/s Dolphin (LAN)     storage server 1, storage server 2
dolphin_local_C      C                20 Gbit/s Dolphin (LAN)     storage server 1, storage server 2

The native configuration has a file system directly on top of the disks (without DRBD in between). The other 9 configurations have a file system on top of DRBD, with DRBD using disks on two server nodes connected by different kinds of interconnects. The results of the Bonnie++ v1.03 tests are depicted in Figure 5 - Figure 8. We ran the tests on the two different types of disks available on the server nodes: SAS and SSD. Additionally, we used two different file systems on top of DRBD: ext3 and xfs. We used a file size of 32 GBytes and a chunk size of 64 kBytes for running Bonnie++ (an example invocation is sketched after the list below). Most Bonnie++ runs were repeated 5 times: we observed that the standard deviation of these runs was low, mostly well within 10% of the mean, and therefore we have not executed all tests multiple times. Given this test configuration, we expect the following results.

- The native performance should be highest, as no DRBD is used and all file operations are local (no replication over the network).
- The raw disk performance is more than 1 GBit/s, which implies that for configurations using the 1 GBit/s Ethernet LAN, the network should be the bottleneck.
- Comparing the Dolphin link with the fiber optic WAN link, the Dolphin link has a greater throughput and a lower latency and therefore should support the higher performance of the two.
- Under otherwise equal circumstances, a DRBD configuration using protocol A should have a higher write performance than one with protocol B. Likewise, using protocol B should result in a higher performance than protocol C.
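For reference, a Bonnie++ run matching the parameters above would look roughly as follows; the target directory and user are placeholders, not the exact invocation used for the reported measurements.

```bash
# Sketch of a Bonnie++ run: 32 GB file size (32768 MiB) with 64 kB chunks,
# skipping the small-file creation tests. Directory and user are placeholders.
bonnie++ -d /mnt/drbd-test -s 32768:65536 -n 0 -u nobody
```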

The results in Figure 5 are largely in line with what is expected. The native performance is clearly highest, and the performance over 1 Gbit/s Ethernet is limited by the network. Also, the write speed of the configuration using Dolphin is higher than that of the configuration using the WAN link. The performance of configurations with DRBD protocol B, however, is lower than that of the same configuration using protocol C (for the 10 GBit/s WAN and Dolphin). A possible explanation is that the implementation of protocol C is more finely tuned than the implementation of protocol B. Protocol C performance is lower than protocol A, which is expected.


    Figure 5: sequential block write performance reported by Bonnie++, for configurations with an ext3 file system using DRBD on SAS disks

Figure 6: sequential block write performance reported by Bonnie++, for configurations with an xfs file system using DRBD on SAS disks

The results in Figure 6 are also mostly in line with what can be expected. A noticeable difference with Figure 5 is the much higher native and Dolphin write speed of the xfs file system. Again, 1 GBit/s Ethernet is the bottleneck for the 1gb_local_{A, B, C} configurations, which show similar performance numbers as under ext3. Dolphin has a much better performance than the WAN configurations. When comparing DRBD protocols A, B, and C, we observe that the 10gb_remote_A performance is lower than 10gb_remote_B, which is unexpected. Configurations with the other two network types do show the expected relative performance numbers in this respect.

    Figure 7: sequential block write performance reported by Bonnie++, for configurations with an ext3 file system using DRBD on SSD disks

In Figure 7, we show the results of writing to the ext3 file system running on top of a DRBD/SSD setup. The first result that stands out is the low native write performance, which is even lower than that of the distributed setup using Dolphin. We do not have an explanation for this result. Other aspects are mostly in line with what is expected: only dolphin_local_A is too low compared to dolphin_local_B.

Figure 8: sequential block write performance reported by Bonnie++, for configurations with an xfs file system using DRBD on SSD disks

The results in Figure 8 again show that xfs is faster than ext3. Here, native performance is higher than the rest, but the Dolphin configurations perform worse than the WAN configurations. This could be caused by the fact that the WAN and Dolphin configurations are close to the native performance. Also, 10gb_remote_C is too high compared to 10gb_remote_B, although this may be caused by variations between runs (the performance is only slightly higher). Compared with the same configurations in the other figures, the 1 GBit/s Ethernet configurations perform below the maximum imposed by the 1 GBit/s link.

4.3 Virtual machine migration using DRBD

For the VM migration experiments, we focus on a setup where virtual machines run at two different locations, with each location using separate servers for virtual machine execution and virtual machine disk storage. The VM disk of each VM is synchronously replicated over the WAN to the storage server at the other location. This means that at any time, (almost) all the data on a virtual disk belonging to a VM is available at both locations, which supports the fast migration of that VM to the other location. DRBD is used to keep a VM disk synchronized (i.e., in this configuration, DRBD uses protocol C for fully synchronous replication). This setup is depicted in Figure 9. The virtualization layer on the VM server accesses the virtual disks on the storage server using iSCSI over the LAN, with fallback access over the WAN to the replica disk: a multipath iSCSI configuration for high availability.

A straightforward approach to organize the distributed iSCSI storage is to define one iSCSI logical device (LUN) which stores all the VM disks at both sides, and to use DRBD to keep the VM disk replicas in sync between the two sides. DRBD must be configured to run in dual-primary mode to support concurrent writes to the DRBD disk at both sides (but never concurrently to the same VM disk, because a VM is only active at a single location at any given time). Unfortunately, a problem with this approach occurs in case of a split-brain situation: when the WAN connection is unavailable for a prolonged duration and DRBD switches to standalone mode at both sides, the sides go out of sync. When the two sides later reconnect, recovery from the split-brain situation results in loss of data (i.e., recent changes at one of the locations are lost).

We tackled this problem in the following way. Instead of using one LUN, we use a configuration with two LUNs, with each LUN dedicated to storing the VM disks running primarily at one location. After migration of a VM, the VM disk is still on the LUN tied to the source side (remote side). Immediately after migration, however, this VM disk is migrated to the (local) destination LUN, i.e., the storage is migrated locally from one LUN to another LUN, while the VM keeps running on the same VM server. This situation is illustrated in Figure 10.

Figure 9: VM migration test configuration with, at both locations, a VM server and a storage server. VM disks are synchronously replicated to the other location.

When a VM migrates from one location to another, the virtualization software on the destination side uses the replica VM disk on the destination storage server. We use multipath iSCSI where each VM has a primary path to the local iSCSI target (the local storage server) and a standby path to the iSCSI target at the other location (the remote storage server). During migration, the primary and standby paths switch roles, which lets the VM use local storage at the destination through the same LUN when resuming on the destination side.


The test configuration is implemented on the hardware configuration described in Section 4.1. The server machines run CentOS [2] with Linux kernel 2.6.18, and the iSCSI target is based on SCST [14], using CentOS packages, and exports two LUNs. Each LUN corresponds to a DRBD distributed block device that runs in dual-primary mode. The DRBD version is 8.3.9rc1; the DRBD configuration uses the SAS disks. The virtualization layer is implemented by the ESX product of VMware. The VMs run Linux and use the ext3 file system on top of vmfs/iSCSI/DRBD. The size of the VM images is 40 GByte.
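For readers who want to reproduce a similar setup, the sketch below shows what a dual-primary DRBD 8.3 resource for one of the LUNs could look like. Resource and host names, devices, addresses, and the split-brain recovery policies are illustrative placeholders, not the exact configuration used in these tests.

```bash
# Hypothetical resource definition for one per-site LUN; the file location
# depends on the DRBD packaging (e.g. /etc/drbd.conf or /etc/drbd.d/).
cat > /etc/drbd.d/lun_a.res <<'EOF'
resource lun_a {
  protocol C;                           # fully synchronous replication over the WAN
  net {
    allow-two-primaries;                # both sites may promote the resource
    after-sb-0pri discard-zero-changes; # example split-brain policies only
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;           # manual recovery if both sides were primary
  }
  device    /dev/drbd1;
  disk      /dev/mapper/vg0-lun_a;      # backing device (placeholder)
  meta-disk internal;
  on storage-server-1 { address 10.0.0.1:7790; }
  on storage-server-3 { address 10.0.1.1:7790; }
}
EOF
```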

Figure 10: VM migration in two steps: step 1 moves the VM from the source to the destination, and step 2 moves the storage from the source LUN to the destination LUN.


The first test considers the performance of the basic, static configuration, with one VM running on the VM server at location A, without migrating the VM to the other site. We execute this test to validate the basic configuration and get an indication of its performance. Note that the setup here is quite different from the test configuration in Section 4.2: here we use separate storage machines and iSCSI, and access storage through a VM. Also note that, although the VM is not migrated, DRBD is configured to replicate the virtual disk to the other location (location B) over the WAN. We test the performance under normal network conditions, i.e., when DRBD continuously replicates the virtual disk to location B. Additionally, we test under the condition that the link between the locations is down, i.e., when DRBD is not able to replicate to location B.

The test considers disk block write performance using Bonnie++ v1.03 with a file size of 16 GBytes and a chunk size of 64 kBytes. In connected mode, i.e., when DRBD synchronously replicates data between the two sites, the block write performance is ~110 MByte/s. In disconnected mode, when DRBD runs without replicating to the other side, the block write performance is ~100 MByte/s. This performance is roughly half of the performance of ext3 directly on DRBD and around the maximum that can be expected given the Gigabit Ethernet link between the VM server and the storage server. It is remarkable, however, that the connected mode performs slightly better than the disconnected mode.

The second test is an endurance test, which concurrently migrates two VMs back and forth between the two locations during one day. This includes the two steps described above: migrating the VM itself and then migrating the storage from one LUN to the other (at a single location). The VMs run idle, i.e., they do not heavily use the file system. The concurrent migration of the VMs was performed successfully during the test day. The average time for step 1 (VM migration, consisting of memory transfer and a short stop/resume) is ~15 seconds. The average time for step 2 (live storage migration, consisting of moving the virtual disk data locally from one LUN to the other) is ~30 minutes.

Our preliminary conclusion is that, given the results from the tests above, DRBD can be successfully used as an open source technology to support fast VM migration over WANs. More elaborate experiments (for instance, running with many, occasionally migrating VMs, each with high load) must be executed to support this conclusion.


5 Conclusions

This report evaluated the usage of DRBD in WANs, from the perspective of existing work as well as from the perspective of our own experiences with running DRBD in the SURFnet WAN environment. We considered the basic DRBD operation in WANs as well as the usage of DRBD as an enabling technology for wide-area virtual machine migration.

Existing work shows that DRBD is not (yet) widely used in WANs and also not widely used for supporting VM migration in WANs (as discussed in Section 3). A number of research prototypes use DRBD for VM migration. The approach taken there is to continuously replicate VM storage between two locations, which should enable fast VM migration.

The first experiment (Section 4.2) shows file system write performance for a number of basic DRBD configurations deployed in our WAN test environment (two sites with an end-to-end delay of ~3.5 ms). DRBD performs reasonably well in our test bed, when comparing WAN performance with local network performance. A notable result is that our measurements are not always consistent with the theoretically expected relative performance. For example, in some circumstances, WAN performance is better than high speed local link (Dolphin) performance. Also, for certain configurations, a higher level of DRBD synchronization provides better performance than a lower level of synchronization (which may be explained by a more optimized implementation of the synchronous DRBD mode).

The second experiment (Section 4.3) considers the application of DRBD to VM migration over WANs. We proposed a new setup based on multipath iSCSI, using a dedicated LUN per WAN site. A number of initial tests running under load in a static configuration, together with an endurance test with two concurrently migrating VMs, provide a strong indication that DRBD can be used for fast migration in our proposed setup. Additional tests are necessary, however, to give insight into aspects we did not cover here (such as application performance during migration).

Considering the results described in this report, we conclude that DRBD is a versatile basic storage technology, showing that an open source product may be used to support sophisticated storage solutions.


6 References

[1] Bonnie++ benchmark suite, http://www.coker.com.au/bonnie++/
[2] CentOS: The Community ENTerprise Operating System, http://www.centos.org/
[3] C. Clark, K. Fraser, S. Hand, J.G. Hansen, E. Jul, C. Limbach, I. Pratt, and A. Warfield, Live Migration of Virtual Machines, In Proceedings of the 2nd Symposium on Networked Systems Design and Implementation (NSDI '05), May 2005
[4] DRBD Home Page, http://www.drbd.org/
[5] Dummynet Home Page, http://info.iet.unipi.it/~luigi/dummynet/
[6] T. Hirofuchi, H. Nakada, H. Ogawa, S. Itoh, and S. Sekiguchi, A Live Storage Migration Mechanism over WAN and its Performance Evaluation, In Proceedings of the 3rd International Workshop on Virtualization Technologies in Distributed Computing (VTDC '09), 2009
[7] Hyper-V Live Migration over Distance, Retrieved on 5/7/2010 from http://www.hds.com/assets/pdf/hyper-v-live-migration-over-distance-reference-architecture-guide.pdf
[8] M. van Kleij and A. de Groot, Virtual Machine WAN Migration with DRBD, OS3 report, Retrieved on 5/7/2010 from https://www.os3.nl/_media/2008-2009/students/attilla_de_groot/virt_migration.pdf
[9] Microsoft Hyper-V Server Home Page, http://www.microsoft.com/hyper-v-server/en/us/default.aspx
[10] M. Nelson, B.-H. Lim, and G. Hutchins, Fast Transparent Migration for Virtual Machines, In Proceedings of the USENIX Annual Technical Conference, 2005
[11] NetEm Home Page, http://www.linuxfoundation.org/collaborate/workgroups/networking/netem
[12] Network Block Device Home Page, http://nbd.sourceforge.net/
[13] A. Peddemors, C. Kuun, R. Spoor, P. Dekkers, and C. den Besten, Survey of Technologies for Wide Area Distributed Storage, Retrieved on 17/8/2010 from http://www.surfnet.nl/nl/innovatie/gigaport3/Documents/EDS-3R%20open-storage-scouting-v1.0.pdf
[14] SCST: A Generic SCSI Target Subsystem for Linux, http://scst.sourceforge.net/
[15] H. Stockinger, A. Samar, K. Holtman, B. Allcock, I. Foster, and B. Tierney, File and Object Replication in Data Grids, Cluster Computing, Vol. 5, No. 3, pp. 305-314, 2002
[16] SURFnet lightpaths, http://www.surfnet.nl/en/diensten/netwerkinfrastructuur/Pages/lightpaths.aspx
[17] F. Travostino, P. Daspit, L. Gommans, C. Jog, C. de Laat, J. Mambretti, I. Monga, B. van Oudenaarde, S. Raghunath, and P. Wang, Seamless Live Migration of Virtual Machines over the MAN/WAN, Future Generation Computer Systems, Vol. 22, Issue 8, 2006
[18] VirtualBox Home Page, http://www.virtualbox.org/
[19] Virtual Machine Mobility with VMware VMotion and Cisco Data Center Interconnect Technologies, Retrieved on 5/7/2010 from http://www.cisco.com/en/US/solutions/collateral/ns340/ns517/ns224/ns836/white_paper_c11-557822.pdf
[20] T. Wood, K. Ramakrishnan, J. van der Merwe, and P. Shenoy, CloudNet: A Platform for Optimized WAN Migration of Virtual Machines, University of Massachusetts Technical Report TR-2010-002, January 2010
