
DRBD: a distributed block device


April 22, 2009

This article was contributed by Goldwyn Rodrigues

The three R's of high availability are Redundancy, Redundancy and Redundancy. However, on a typical setup built with commodity hardware, it is not possible to add redundancy beyond a certain point in order to increase the number of nines in your uptime percentage (i.e. 99.999%). Consider a simple example: an iSCSI server with cluster nodes using a distributed filesystem such as GFS2 or OCFS2. Even with redundant power supplies and data channels on the iSCSI storage server, there still exists a single point of failure: the storage itself.

The Distributed Replicated Block Device (DRBD) patch, developed by Linbit, introduces replicated block storage over the network with synchronous data replication. If one of the storage nodes in the replicated environment fails, the system has another block device to rely on and can fail over safely. In short, DRBD can be thought of as RAID1 mirroring between a local disk and a disk on a remote node, but with better integration with cluster software such as heartbeat, and with efficient resynchronization based on the exchange of dirty bitmaps and data generation identifiers. DRBD currently works only on two-node clusters, though a stacked (hybrid) setup can be used to get around this limit. When both nodes of the cluster are up, writes are replicated: they go both to the local disk and to the other node. For efficiency reasons, reads are served from the local disk.
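
As a rough sketch of how this looks in practice, a DRBD resource is described in a configuration file that is identical on both nodes. The host names, devices, and addresses below are made up for illustration, and the exact option set varies between DRBD releases:

    resource r0 {
        on alpha {
            device    /dev/drbd0;          # replicated device exposed to the filesystem
            disk      /dev/sdb1;           # local backing storage
            address   192.168.10.1:7789;   # replication link endpoint
            meta-disk internal;            # DRBD metadata kept on the backing disk
        }
        on beta {
            device    /dev/drbd0;
            disk      /dev/sdb1;
            address   192.168.10.2:7789;
            meta-disk internal;
        }
    }

The device is then initialized and brought up with the drbdadm tool (roughly: drbdadm create-md r0 and drbdadm up r0 on both nodes, followed by promoting one node to primary; the exact promotion command for the initial synchronization varies by release).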

The level of data coupling depends on which replication protocol is chosen; a configuration sketch follows the list:

  • Protocol A: Writes are considered complete as soon as the local disk write has completed and the data packet has been placed in the send queue for the peer. In case of a node failure, data loss may occur because the data destined for the remote node's disk may still be in the send queue. The data on the failover node is consistent, but not up to date. This protocol is usually used for geographically separated nodes.
  • Protocol B: Writes on the primary node are considered complete as soon as the local disk write has completed and the replication packet has reached the peer node. Data loss may occur if both participating nodes fail simultaneously, because the in-flight data may not yet have been committed to disk.
  • Protocol C: Writes are considered complete only after both the local and the remote node's disks have confirmed them. There is no data loss, so this is the popular choice for clustered nodes, but I/O throughput is limited by the network bandwidth.
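
In the configuration file this choice is a single line in the resource definition. The snippet below is only a sketch, reusing the hypothetical resource from above; in the DRBD 8 configuration format the protocol statement sits directly in the resource section:

    resource r0 {
        protocol C;    # C: wait for both disks; B: wait until the packet reaches the peer;
                       # A: wait for the local disk and the send queue only
        # ... host sections as in the earlier example ...
    }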

DRBD classifies the cluster nodes as either "primary" or "secondary." Primary nodes can initiate modifications or writes; secondary nodes cannot. This means that a secondary DRBD node does not provide any access and cannot be mounted; even read-only access is disallowed, for cache coherency reasons. The secondary node is present mainly to act as the failover device in case of an error. A secondary node may become primary depending on the network configuration; role assignment is performed by the cluster management software.

There are different ways in which a node may be designated as primary:

  • Single Primary: The primary designation is given to one cluster member. Since only one cluster member manipulates the data, this mode is useful with conventional filesystems such as ext3 or XFS.
  • Dual Primary: Both cluster nodes can be primary and are allowed to modify the data. This mode is typically used with cluster-aware filesystems such as OCFS2. The current DRBD release supports a maximum of two primary nodes in a basic cluster; a configuration sketch follows this list.
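
Dual-primary mode has to be enabled explicitly before both nodes can be promoted. A hedged sketch, again using option names from the DRBD 8 configuration format and the hypothetical resource r0:

    resource r0 {
        net {
            allow-two-primaries;   # permit writes from both cluster members
        }
        # ... protocol and host sections as before ...
    }

Each node is then promoted with "drbdadm primary r0", and a cluster-aware filesystem such as OCFS2 is created on top of /dev/drbd0.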

Worker Threads

Part of the communication between the nodes is handled by kernel threads, to avoid deadlocks and complex design issues. The threads used for communication are:

  • drbd_receiver: handles incoming packets. On the secondary node, it allocates buffers, receives data blocks, and issues write requests to the local disk. If it receives a write barrier, it sleeps until all pending write requests have finished.
  • drbd_sender: the sender thread for data blocks sent in response to read requests. This work is done in a thread other than drbd_receiver to avoid distributed deadlocks. If a resynchronization is running, its packets are also generated by this thread.
  • drbd_asender: the acknowledgment sender. Hard drive drivers signal request completion through interrupts, but sending data over the network from an interrupt callback routine could block the handler. So the interrupt handler places the packet in a queue, which is picked up by this thread and sent over the network.

Failures

DRBD requires a small reserved area for metadata in order to handle post-failure operations (such as resynchronization) efficiently. This area can be configured either on a separate device (external metadata) or within the DRBD block device itself (internal metadata). It holds the metadata for the disk, including the activity log and the dirty bitmap (described below).
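
The choice between the two is made per backing device with the meta-disk statement; the device name and index below are placeholders, and the exact syntax for external metadata may differ between releases:

    # internal metadata: activity log and bitmap live at the end of the backing device
    meta-disk internal;

    # external metadata: a small dedicated partition, using metadata slot 0
    meta-disk /dev/sdc1[0];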

Node Failures

If a secondary node dies, it does not affect the system as a whole, because writes are not initiated by the secondary node. If the failed node is primary, data that has yet to be written to disk, but for which completions have not been received, may be lost. To cope with this, DRBD maintains an "activity log," a reserved area on the local disk which records which write operations have not yet completed. The log is organized in extents, maintained as a least-recently-used (LRU) list. Each change to the activity log causes a metadata update (a single-sector write). The size of the activity log is configured by the user; it is a tradeoff between minimizing metadata updates and the resynchronization time after the crash of a primary node.
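
The activity-log size is given in extents. In the DRBD 8 configuration it is set in the syncer section; the value below is purely illustrative:

    syncer {
        al-extents 257;   # more extents: fewer metadata updates during normal
                          # operation, but a longer resync after a primary crash
    }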

DRBD maintains a "dirty bitmap" in case it has to run without a peer node or without a local disk. It describes the pages which have been dirtied by the local node. Writes to the on-disk dirty bitmap are minimized by the activity log: each time an extent is evicted from the activity log, the part of the bitmap associated with it, which is no longer covered by the activity log, is written to disk. Should a resynchronization become necessary, the dirty bitmap is sent over the network to communicate which pages are dirty. Bitmaps are compressed using run-length encoding before being sent, to reduce network overhead; since most bitmaps are sparse, this proves to be quite effective.

DRBD resynchronizes data once the crashed node comes back up, or in response to data inconsistencies caused by an interruption of the link. Synchronization proceeds in linear order, by disk offset, following the disk layout of the consistent node. The rate of synchronization can be configured with the rate parameter in the DRBD configuration file.
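
The rate parameter also lives in the syncer section; the figure below is just an example, and it should normally be set well below the capacity of the replication link so that application I/O is not starved during a resync:

    syncer {
        rate 40M;   # limit background resynchronization to roughly 40 MB/s
    }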

Disk Failures

In case of local disk errors, the system can deal with them in one of the following ways, depending on the configuration; a sketch of the corresponding options follows the list:

  • detach: Detach the node from the failed backing device and continue in diskless mode. In this situation, the device on the peer node becomes the main disk. This is the recommended configuration for high availability.
  • pass_on: Pass the error to the upper layers on a primary node. If the node is secondary, the disk error is logged but otherwise ignored.
  • call-local-io-error: Invoke a script. This mode can be used to perform a failover to a "healthy" node and automatically shift the primary designation to another node.
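
These policies map onto the on-io-error option in the disk section of the configuration file, with call-local-io-error relying on a handler script; the script path below is hypothetical:

    disk {
        on-io-error detach;    # alternatives: pass_on, call-local-io-error
    }
    handlers {
        local-io-error "/usr/local/sbin/drbd-io-error-failover.sh";
    }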

Data Inconsistency issues

In the dual-primary case, both nodes may write to the same disk sector, making the data inconsistent. Writes at different offsets require no synchronization. To avoid inconsistency, data packets sent over the network are numbered sequentially to establish the order of writes. However, there are still some corner-case inconsistency problems the system can suffer from:

  • Simultaneous writes by both nodes to the same block. In this situation, one of the nodes' writes is discarded. One of the primary nodes carries the "discard-concurrent-writes" flag, which causes it to discard the other node's write request when it detects a concurrent write. The node with the flag set sends a "discard ACK" to the other node, informing it that its write has been discarded. The other node, on seeing the discard ACK, writes the data from the first node, keeping the two disks consistent.
  • Local request while a remote request is in flight: this can happen when the disk latency exceeds the network latency. The local node writes to a given block and sends the write to the other node. The remote node then acknowledges the completion of the request and sends a new write of its own to the same block, all before the local write has completed. In this case, the local node holds the new data write request until its local write is complete.
  • Remote request while a local request is still pending: this situation comes about if the network reorders packets, causing a remote write to a given block to arrive before the acknowledgment of a previous, locally-generated write. Once again, the receiving node simply holds the new data until the ACK is received.

Conclusion

DRBD is not the only distributed storage implementation under development. The Distributed Storage (DST) implementation contributed by Evgeniy Polyakov and accepted into the staging tree takes a different approach. DRBD is limited to two-node clusters, while DST can span a larger number of nodes. DST works on a client-server model, where the storage sits at the server end, whereas DRBD is peer-to-peer and designed for high availability rather than for distributing storage. DST, by contrast, is designed for cumulative storage, with storage nodes that can be added as needed. DST has a pluggable module which accepts different algorithms for mapping the storage nodes into one cumulative store; with a mirroring algorithm, it would provide the same basic replicated-storage capability as DRBD.

The DRBD code is maintained in the git repository at git://git.drbd.org/linux-2.6-drbd.git, under the "drbd" branch. It incorporates the minor review comments posted on LKML after the patch set was released by Philipp Reisner. For further information, see the several PDF documents mentioned in the DRBD patch posting.

Index entries for this article
Kernel: Block layer
Kernel: Clusters
Kernel: DRBD
GuestArticles: Rodrigues, Goldwyn



Sounds like fun :-)

Posted Apr 23, 2009 15:44 UTC (Thu) by felixfix (subscriber, #242)

Aside from sounding like it might actually be practical, it also sounds like fun. What if the remote node (or the local node, if it is allowed) is an NFS volume, which is itself under DRBD control? The devil in me wonders about a chain of these beasties, especially if you can see or hear drive activity, activating one after the other, fractions of a second apart.

How much delay is there from node protocol? If you had an infinitely fast network, how much difference is there among the various Protocols and compared to a standard local filesystem not under DRBD control?

Is it possible to choose whether a disk is under DRBD control? Could you run benchmarks both ways with just a simple umount / mount between?

Sounds like fun :-)

Posted Apr 23, 2009 18:57 UTC (Thu) by bronson (subscriber, #4806)

DRBD has a steeper learning curve than you'd expect but, yea, it is pretty fun.

You'd need a filesystem between DRBD and NFS. If you want a primary-primary DRBD (data locally accessible on both nodes), that filesystem must be distributed, which restricts you to GFS, OCFS2, etc, which don't play well with NFS. If you set up a primary-secondary DRBD, you can use ext3 in the middle, which works great with NFS, but then the only benefit DRBD brings is hot failover. And there are MUCH easier ways to set up HA NFS. So... Probably not the best way to go.

Yes, you can stack DRBD: http://www.drbd.org/users-guide/s-three-nodes.html But, unless you're just asynchronously mirroring a volume for hot backup, which works great, stacking tends to be pretty cranky. Definitely don't think, "hey, I can create a 7 layer stack and distribute a single block device to all my satellite offices!" You won't be happy.

The delay is entirely dependent on your network; DRBD itself is pretty light. But remember that block devices tend to use TONS of bandwidth. DRBD includes a userspace proxy that will do compression to make things more tolerable over wan links but it makes things more complex... Only use it if you need to.

The node protocol just specifies how long the primary needs to wait for an ack. It allows you to trade a small risk of data loss for a large improvement in write latencies. The remote can ack immediately (A), keeping less data in flight, but there's a slightly higher risk of data loss. Or the remote can delay the ack until the data is actually in the oxide (C), reducing your potential data loss to pretty much nil, but then writes to the primary will take a lot longer and there will be a lot more data in flight at any one time.

So, with an infinitely fast network, there's basically no downside to going with C. Over a WAN, C would probably be intolerable.

> Is it possible to choose whether a disk is under DRBD control?

What do you mean? You can put pretty much any block device under DRBD control. Right now my stack is SATA > LVM > DRBD > OCFS2. Putting DRBD on top of LVM means that I can grow DRBD+OCFS2 just by attaching more storage anywhere on the system. It's pretty nice. But you could just as easily go SATA > DRBD > LVM > OCFS2 (if you will be snapshotting a lot), or SATA > LVM > DRBD > LVM > etc...

Sounds like fun :-)

Posted Apr 23, 2009 19:35 UTC (Thu) by felixfix (subscriber, #242)

You've answered all my questions and more. As for what I meant by whether a disk is under DRBD control or not, I was thinking of LVM. You can't choose at mount time whether a disk is under LVM control or not -- it is built one way or the other. I realize now I meant file system, not disk ... the B stands for Block, so a little thought and a little more clarity in my question would have answered it right from the start.

Thanks.

DRBD learning curve

Posted Apr 26, 2009 22:25 UTC (Sun) by giraffedata (guest, #1954)

DRBD has a steeper learning curve than you'd expect

From context, I'm sure you mean shallower. A steep learning curve (the graph of productivity vs time) indicates that you learn everything there is to know quickly.

DRBD learning curve

Posted Apr 27, 2009 1:45 UTC (Mon) by mgb (guest, #3226)

The familiar expression "steep learning curve" may refer alternately to rapid learning that is easy, or especially hard, or to steady progress that is increasingly difficult. Which is referred to needs to be clarified by context. The difference is specifically whether one is referring to the rate of learning or the rate of investment needed to learn.

[Wikipedia]

DRBD learning curve

Posted Apr 27, 2009 3:03 UTC (Mon) by giraffedata (guest, #1954)

Yes, "steep learning curve" is frequently (usually, I think) used to mean it's hard to learn or there's lots to learn. So one has to know that when reading the term. But that doesn't change the fact that it's wrong and it would be better if people didn't write the term that way.

The learning curve is a well known name for a useful concept with a clear history. It was invented by industrial engineers to describe the effect of introducing a new process or machine and is always a graph of productivity vs time. It slopes upward because learning takes place.

One could imagine a graph which shows, as Wikipedia suggests, a rate of learning or of investment, but you won't find anyone drawing such graphs anywhere, unlike true learning curves. One could imagine a graph in which steepness reflects something that is hard to learn too, but no one ever uses those either (and if someone does, I'm sure he would give it a name that isn't already taken for something else).

DRBD learning curve

Posted Apr 27, 2009 3:09 UTC (Mon) by bronson (subscriber, #4806)

Time/effort required to learn is on the Y axis. It takes some studying to wrap your head around DRBD, thus the curve rises fast. Was this unclear?

Some reading material: http://www.google.com/search?q="steep+learning+curve"

DRBD learning curve

Posted Apr 27, 2009 15:26 UTC (Mon) by giraffedata (guest, #1954)

Was this unclear?

It was not unclear. You'll note I didn't ask for a clarification.

DRBD: a distributed block device

Posted Apr 23, 2009 19:04 UTC (Thu) by bronson (subscriber, #4806)

Are there any opinions as to the best POSIXy distributed filesystem to use on top of DRBD? GFS and OCFS2 look more or less identical. Both grow but don't shrink, both are pretty close to POSIX, both are in kernel and in reasonably active development. How am I supposed to choose?

Also, has anybody played with DST? Just merged in 2.6.30, seems pretty exciting.

http://tservice.net.ru/~s0mbre/old/?section=projects&...

DRBD: a distributed block device

Posted Apr 24, 2009 10:48 UTC (Fri) by nix (subscriber, #2304)

I was jumping up and down when DST went in because it portends good things for POHMELFS, which knocks the *socks* off NFS speedwise.

But now Evgeniy is making noises about moving POHMELFS to run atop the elliptics distributed hashtable, and, well, if that means that it can't export an ordinary filesystem anymore, it's useless to me, and to anyone else who wanted the same ease-of-exporting that NFS was good at.

A shame.

(but perhaps I misread and it's just gaining the *ability* to use elliptics as its backend, without losing the ability to use a regular filesystem as well.)


Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds