
    Dual Primary: Think Twice

    Martin Loschwitz

    Edited by Florian Haas
    Copyright 2010, 2011 LINBIT HA-Solutions GmbH

    Trademark notice
    DRBD and LINBIT are trademarks or registered trademarks of LINBIT in Austria, the United States, and other countries. Other names mentioned in this document may be trademarks or registered trademarks of their respective owners.

    License information
    The text and illustrations in this document are licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported license ("CC BY-NC-ND").

    A summary of CC BY-NC-ND is available at http://creativecommons.org/licenses/by-nc-nd/3.0/.

    The full license text is available at http://creativecommons.org/licenses/by-nc-nd/3.0/legalcode.

    In accordance with CC BY-NC-ND, if you distribute this document, you must provide the URL for the original version.

    Table of Contents
    1. Active/Active vs. Dual-Primary
    2. What a filesystem does
    3. Requirements for Dual-Primary DRBD
    4. Cluster filesystems to the rescue
    5. Design-related issues
    6. Fencing mechanisms
    7. Dual-Primary and long-distance links
    8. Valid Dual-Primary scenarios
    9. Conclusion
    10. Feedback

    This document explains the pros and cons of DRBD Dual-Primary configurations. It illustrates why Dual-Primary configurations can quite often do more harm than good, and suggests giving it a good second thought before creating this type of DRBD setup.

    1. Active/Active vs. Dual-Primary

    First of all, let's get down to some basics and create a convention for describing specific types of setups.

    Active/Active simply means that in a two-node cluster, both cluster nodes run certain applications. This does not necessarily imply identical applications, and certainly does not imply concurrent data access. Think of an HA cluster where a test database and a production database are set up; if both cluster nodes are up and running, one database will run on one node and the other database will run on the other. If one node is down, only the production database will be up on the remaining node. The test database will be down and remain down until the other cluster node comes back online.

    Here, different DRBD resources can be in the Primary role on different nodes, but one DRBD resource does not necessarily need to be Primary on both nodes at the same time. In fact, it usually will not be.

    Dual-Primary, in contrast, means that the same DRBD resource has the Primary role on both cluster nodes at the same time, so that it becomes possible to access the resource in read/write mode on both nodes simultaneously.


    The official LINBIT naming scheme is as follows: when a DRBD resource is in Primary mode on two nodes at the same time, we will refer to it as Dual-Primary. When a two-node cluster is set up to run different applications on different nodes, given that both nodes are available, then this is an Active/Active setup in LINBIT speak.

    2. What a filesystem does

    Filesystems exist for exactly one reason: they organize storage devices. Standard storage devices follow no specific structure. Adding a structure is what a filesystem does: it makes it possible to quickly find information written to the device in the past. When you open a file in your favourite editor, it is the filesystem's task to know which specific area of your storage device it has to access to open exactly the file you want to see. And when you write changes to the disk, the filesystem makes sure the newly created information is integrated properly into the filesystem's structure.

    Imagine the following situation: two applications on your system try to access the same region on your storage device at the same time. Both write requests hit the filesystem. The filesystem will process one request while queueing the other, then process the other request. It knows at all times which piece of information it wrote to which part of the disk. The filesystem remains consistent.

    Keep in mind that it is not the filesystem's task to actually validate the content of a specific file it writes down to the disk. If one application in your system overwrites the contents of a file created by another application, making that other application go haywire, then from the filesystem's point of view everything is still fine, as long as it knows which piece of information it wrote down where and when. Administrators, on the other hand, are obviously free to disagree.

    The opposite of a consistent filesystem is a corrupt filesystem. Filesystems might be damaged by a number of factors, including hardware problems, general stability problems, or even inept administrators. Recovering a corrupt filesystem is a tricky task and often does not work as expected. That is why administrators need to guard against filesystem corruption in advance (and while we are at it: the best way is, and always will be, keeping good backups).

    3. Requirements for Dual-Primary DRBD

    When a DRBD resource is in Dual-Primary mode, the DRBD kernel driver allows write attempts to this specific DRBD resource on both nodes sharing the resource. By design, a DRBD resource is supposed to have the same contents on both nodes of a cluster: changes from node A are replicated over to node B, and the other way around.
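
    For illustration only, a resource configured for Dual-Primary operation might look roughly like the following sketch. The resource name, hostnames, device and disk paths are made up for this example, and the exact option syntax depends on your DRBD version (DRBD 8.3-style syntax shown here); consult the drbd.conf man page for your release.

        resource r0 {
          protocol C;              # synchronous replication, required for Dual-Primary
          net {
            allow-two-primaries;   # permit the Primary role on both nodes at once
          }
          startup {
            become-primary-on both;   # promote both nodes at boot; omit if a cluster manager handles promotion
          }
          on alice {
            device    /dev/drbd0;
            disk      /dev/sdb1;
            address   10.0.0.1:7789;
            meta-disk internal;
          }
          on bob {
            device    /dev/drbd0;
            disk      /dev/sdb1;
            address   10.0.0.2:7789;
            meta-disk internal;
          }
        }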

    With a standard filesystem (Ext3/4, XFS, ...), the following situation could arise: imagine a DRBD resource with an Ext4 filesystem on it. DRBD is in Dual-Primary mode, and the Ext4 filesystem is mounted on both cluster nodes. Application A writes something to the filesystem residing on the DRBD resource on node A, which then gets written to the physical storage device. At the very same time, application B writes something to the filesystem on node B, which gets written to exactly the same region on the storage device of node B.

    DRBD replicates the changes from node A to node B and the other way around, changing the contents of the physical storage devices. However, as DRBD resides underneath the mentioned Ext4 filesystems, the filesystem on the physical disk of node A does not notice the changes coming from node B (and vice versa). This is called a concurrent write. From that point on, the actual content of the storage device differs from what the filesystem there thinks it should be. The filesystem is corrupt.

    Because of this, normal filesystems simply cannot be used in Dual-Primary setups, not even in read-only mode (on one of the two nodes): even then, the filesystem metadata might still be changed. Additionally, Linux assumes that when it mounts a filesystem in read-only mode, there will be absolutely nothing else changing that filesystem. This leads to massive cache coherency problems: imagine a file that was read into the cache on the read-only node at some point in the past and is re-accessed later, while in the meantime it was changed on the node where the filesystem is mounted in read-write mode. The node holding the read-only mount is not going to notice these changes.

    4. Cluster filesystems to the rescue

    Cluster filesystems are an attempt to face that challenge. You might have heard of them: GFS, GFS2, OCFS2; there are quite a few of them around. What they all have in common are mechanisms to avoid concurrent writes. With a cluster filesystem in place, every server exports its storage devices (a DRBD resource can be such a storage device, too). Every client which wants to use the cluster filesystem has to mount it with the cluster filesystem software itself as the appropriate interface. All write attempts to the cluster filesystem are coordinated by that software.
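
    As a rough sketch of what this looks like in practice, assuming a Dual-Primary DRBD resource exposed as /dev/drbd0 and an already configured OCFS2 cluster stack (both of which are assumptions made for this example, not part of the original text), creating and mounting the cluster filesystem on both nodes might look like this:

        # on one node only: create the OCFS2 filesystem with two node slots
        mkfs.ocfs2 -N 2 -L webdata /dev/drbd0

        # on both nodes: mount it; both nodes then access the same filesystem concurrently
        mount -t ocfs2 /dev/drbd0 /srv/webdata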

    Numerous components needed for this process are already part of recent Linux kernels, amongst them the Distributed Lock Manager (DLM), which provides locking-related functions to the cluster filesystems.
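
    On GFS2-based setups, for example, the DLM kernel module and its userspace control daemon (dlm_controld) typically have to be present and running before the filesystem can be mounted; exact package and service names vary by distribution and cluster stack, so treat this merely as a quick sanity check:

        # is the DLM kernel module loaded?
        lsmod | grep dlm

        # is the DLM control daemon running? (daemon name varies by distribution/cluster stack)
        ps aux | grep dlm_controld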

    5. Design-related issues

    Given that cluster filesystems exist, you might wonder why using Dual-Primary setups should be a problem. The consistency of a filesystem residing on a storage device is not the only condition that needs to be met in order for a storage device to be considered usable. There are other challenges rendering the task of running a DRBD resource in Dual-Primary mode complicated, many of them related to the design of modern storage solutions.

    There is, for instance, the all-or-nothing rule. That rule explicitly defines that a set of data can only be considered consistent if it is really known that every single bit in that set of data is what it is supposed to be (in DRBD speak, for example, this would mean a resource is UpToDate). As soon as a device is being written to in an uncoordinated manner (caused by hardware errors, a cluster filesystem going berserk, or whatever), the set of data has to count as inconsistent, because we cannot definitely assume that it is consistent. It is then worthless.

    In a simple Primary/Secondary setup with DRBD, even if one node crashes, the DRBD resource on the remaining machine will still hold a consistent filesystem. We can safely assume that because, in this sort of setup, there is nothing that could have written to the Secondary DRBD drive except for the Primary DRBD drive, which went away.

    With a Dual-Primary resource, we principally have to assume that as soon as the two nodes sharing that DRBD drive get disconnected from each other, uncoordinated write attempts can happen on either of them. Measures need to be taken to make sure that when a node is in trouble, that node can no longer cause corruption of the data set: welcome to fencing.

    6. Fencing mechanisms

    You might have heard of fencing mechanisms. Doing some fencing is generally a good idea, and not only in Dual-Primary setups. In Dual-Primary configurations, however, fencing is vital and absolutely necessary for proper cluster functionality.

    One of the well-known fencing mechanisms is STONITH (which stands for "Shoot the other node in the head"). Using STONITH means that as soon as the cluster manager detects that a node has problems, it will reboot that node to make sure that no further changes to data happen on that cluster node. Setting up fencing is not only complicated for Dual-Primary setups (aside from the fact that in most cases it needs special hardware), it also adds considerable complexity to the cluster setup.
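
    On the DRBD side, fencing is typically wired up through a fencing policy and a fence-peer handler in the resource configuration. The following sketch shows the Pacemaker-based handlers shipped with DRBD 8.x; the section in which the fencing option lives, and the handler paths, can differ between DRBD versions and distributions, so treat this purely as an illustration:

        resource r0 {
          disk {
            fencing resource-and-stonith;   # suspend I/O and fence the peer on connection loss
          }
          handlers {
            # outdate/fence the peer via the Pacemaker CIB
            fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
            # lift the constraint again once the peer has resynchronized
            after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
          }
        }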

    Last but not least, there are practical impacts caused by the internal fencing mechanisms used by all cluster filesystems: whenever the cluster manager notices that a node needs to be fenced, the cluster filesystem will halt all I/O operations on all nodes, leading to possible I/O wait (depending on how long it takes for the fencing to happen).

    7. Dual-Primary and long-distance links

    As pointed out previously, a Dual-Primary DRBD resource is far more sensitive to connection hiccups and connection breakdowns than a standard resource. For that reason, Dual-Primary setups should only be run in environments where a back-to-back connection ("cross-link") between the two nodes sharing a DRBD resource is available. Doing Dual-Primary setups over a long-distance link is begging for trouble, as long as the link is not dark fibre.

    Running DRBD in Dual-Primary mode requires that the affected resource uses protocol C. DRBD will return an error message if you try to put a resource which uses protocol A or B into Dual-Primary mode.
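
    Assuming a resource named r0 that is already configured with protocol C and allow-two-primaries (as in the sketch in section 3), and that both nodes are connected and UpToDate, switching to Dual-Primary operation is then a matter of promoting the resource on both nodes; the resource name is, again, just an example:

        # after editing the configuration, let DRBD pick up the changes (run on both nodes)
        drbdadm adjust r0

        # promote the resource to Primary on the first node ...
        drbdadm primary r0

        # ... and then promote it on the other node as well
        drbdadm primary r0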

    8. Valid Dual-Primary scenarios

    As stated previously, building a reasonable Dual-Primary setup is a complex and difficult task. Nevertheless, this document does not want to put into question that valid usage scenarios for it do exist. Here are some scenarios that LINBIT considers to be worth the effort.

    Clustered Samba (CTDB): Clustered Samba is a special flavour of Samba built for cluster setups. It depends on a clustered filesystem.

    Oracle: When one needs to use Oracle's database with the RAC feature, a clustered filesystem is needed (OCFS2, obviously, being the one preferred by Oracle).

    Note: Consult with an Oracle support representative about support considerations for Oracle RAC on Dual-Primary storage with DRBD and OCFS2.

    Live migration for VMs: When using virtual machines (be it KVM or Xen), there is the possibility of using live migration to move a running VM from one host to another without having to power it off and on again. During the live migration, for a short period of time, both nodes obviously need to be able to access the underlying storage device in read/write mode. That can be achieved by using DRBD in Dual-Primary mode. Please note: you do not need a clustered filesystem in this case; libvirt can take care of this for you.
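
    As a sketch of what such a migration looks like with libvirt/KVM, assuming a guest named vm1 whose disk sits on a DRBD resource that is (at least for the duration of the migration) in Dual-Primary mode, and a peer host named bob (both names are made up for this example), the migration itself could be triggered like this:

        # live-migrate the running guest to the peer host over SSH
        virsh migrate --live vm1 qemu+ssh://bob/system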

    Parallel NFS: Starting with version 4.1, NFS supports pNFS, which stands for Parallel NFS. pNFS can be run on top of a Dual-Primary DRBD, too.

    9. Conclusion

    As pointed out, there are some scenarios where using Dual-Primary DRBD makes sense: CTDB, Oracle, live migration of virtual machines, and pNFS, to name a few examples. If you are about to use Dual-Primary for other setups, it may well be worth the effort to think about that again; in most cases, the time invested into avoiding Dual-Primary setups pays off double and triple once the solution is in production use.

    If you, however, still think you need a Dual-Primary setup, give us a call to find out whether LINBIT is able to help you with whatever you are about to implement.

    10. Feedback

    Any questions or comments about this document are highly appreciated and much encouraged. Please contact the author(s) directly; contact email addresses are listed on the title page.


    For a public discussion about the concepts mentioned in this white paper, you are invited to subscribe and post to the drbd-user mailing list. Please see http://lists.linbit.com/listinfo/drbd-user for details.
