Saturday, September 19, 2009

Head to the Cloud for Storage with Linux and DRBD

While some say the cloud doesn't perform well enough and isn't stable enough, others say it is perfect for their usage model. Why not take advantage of such a scalable architecture? It can make a great place to "stuff" data, such as backups.

In this article we use Amazon EC2 (the computing cloud) and S3 (the storage) as the example cloud provider, but any similar cloud service can be used the same way. There are many ways to back up your data to the cloud, including:

  • simple file copies onto the remote OS
  • remote API calls to send data into the cloud
  • normal block-based replication onto volumes in your cloud OS

File Copies

Running a Linux image in the cloud means you are free to implement whatever type of backup strategy you like. The simplest, brute-force method is to create a volume of a few hundred gigabytes and simply copy files to it. You can also copy disk images, database backup dumps, and anything else that falls outside the traditional category of "file copy."

Many backup programs, especially open source ones, support disk-based backups. You can simply point a backup program at a remote volume with SSH access, and it will treat it as its backup volume. If you're implementing a quick-and-dirty backup solution for the first time, tools like rsync are wonderful. Simply rsync a directory every hour via a cron job, and it will copy only the data that has changed. Many forms of rsync backup scripts exist, posted to various Internet forums and newsgroups, ranging from simple to extremely complex.
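As a rough illustration of the hourly rsync approach described above, here is a minimal Python sketch that assembles and runs the rsync command over SSH. The paths and host name are hypothetical placeholders, and the flags chosen (-a for archive mode, -z for compression) are one reasonable set of assumptions rather than the only sensible choice.

```python
import shlex
import subprocess


def build_rsync_command(source_dir, remote_host, remote_dir, delete=False):
    """Return the rsync argument list for mirroring source_dir over SSH."""
    # A trailing slash on the source tells rsync to copy the directory's
    # contents rather than the directory itself.
    source = source_dir.rstrip("/") + "/"
    command = ["rsync", "-az", "-e", "ssh"]
    if delete:
        # --delete removes remote files that no longer exist locally.
        # Leave it off if the remote copy should keep deleted files.
        command.append("--delete")
    command += [source, f"{remote_host}:{remote_dir}"]
    return command


def run_backup(source_dir, remote_host, remote_dir):
    """Run one rsync pass; raises CalledProcessError if rsync fails."""
    subprocess.run(
        build_rsync_command(source_dir, remote_host, remote_dir),
        check=True,
    )


if __name__ == "__main__":
    # Hypothetical paths and host name, for illustration only.
    command = build_rsync_command(
        "/srv/data", "backup@ec2-host.example.com", "/mnt/backup/data"
    )
    print(shlex.join(command))
```

Calling run_backup from a script scheduled hourly in cron gives the behavior described above: after the first full copy, only changed data crosses the wire. SSH key authentication has to be set up beforehand, since cron cannot answer a password prompt.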

Cloud Storage APIs

In addition to simply copying data into your cloud OS, you can also leverage the S3 storage grid directly. API calls allow you to store and retrieve objects in S3 buckets, but that doesn't mean you need to spend time programming a solution.

Included in your EC2 OS image are two scripts: ec2-bundle-vol and ec2-upload-bundle. They can be run in sequence to upload a volume on your EC2 instance into S3 storage. If you're using the EC2 OS to copy in large amounts of backup data, as in the previous example, most likely you will eventually want to ship some of that data onto S3 storage directly. It is less expensive, and also easier to manage.
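To make the "run in sequence" step concrete, the following Python sketch builds the two command lines in the order they need to run. The key and certificate paths, account ID, bucket name, and staging directory are all hypothetical; the point is only that ec2-upload-bundle consumes the manifest file that ec2-bundle-vol writes.

```python
import os


def bundle_and_upload_commands(private_key, cert, account_id, bucket,
                               access_key, secret_key,
                               dest_dir="/mnt/bundle", prefix="backup"):
    """Return [bundle_command, upload_command] as argument lists."""
    # Step 1: bundle the volume into dest_dir. The bundle parts and the
    # manifest are named after the prefix.
    bundle = [
        "ec2-bundle-vol",
        "-k", private_key,
        "-c", cert,
        "-u", account_id,
        "-d", dest_dir,
        "-p", prefix,
    ]
    # Step 2: upload the bundle described by the manifest to an S3 bucket.
    manifest = os.path.join(dest_dir, prefix + ".manifest.xml")
    upload = [
        "ec2-upload-bundle",
        "-b", bucket,
        "-m", manifest,
        "-a", access_key,
        "-s", secret_key,
    ]
    return [bundle, upload]


if __name__ == "__main__":
    # Every value below is a placeholder, not a real credential or path.
    for command in bundle_and_upload_commands(
        "/root/pk.pem", "/root/cert.pem", "123456789012",
        "my-backup-bucket", "ACCESS_KEY_ID", "SECRET_ACCESS_KEY",
    ):
        print(" ".join(command))
```

Each list can be handed to subprocess.run(command, check=True) so that the upload is attempted only if the bundling step succeeded.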

Aside from running scripts to send huge chunks of data to S3, which quickly becomes a scheduling and management nightmare, you can also just point backup software at your S3 account. Instead of disk-based backups in EC2, it often makes more sense to simply talk straight to S3. For example, the open source backup product Amanda supports S3 as a backup device in addition to the traditional disk and tape media.

Real-Time Replication With DRBD

Before we get started, let us first point out that replication is not backup. With careful planning, however, you can have easy access to an up-to-date copy of data, and lessen the need for certain types of backup.

Block-level replication can be thought of as RAID-1 over TCP. In Linux, the standard software that implements this type of real-time replication is called DRBD, which stands for Distributed Replicated Block Device [Editor's note: the author works for LINBIT, the authors of DRBD]. Conceptually, data is written to one server, and immediately replicated to a second. DRBD is free and open source, and according to the Linux Kernel Mailing List, slated to be integrated into kernel 2.6.32 this month.
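The "RAID-1 over TCP" idea can be pictured with a toy model. The Python sketch below is not DRBD and involves no networking; the class and method names are invented for illustration. It only shows the ordering that matters: a write is acknowledged to the caller after both the local copy and the peer's copy hold the block.

```python
class ToyMirror:
    """Toy stand-in for a synchronously mirrored block device."""

    def __init__(self):
        self.local = {}    # blocks on the primary node
        self.replica = {}  # blocks on the peer node

    def _send_to_peer(self, block_number, data):
        # In a real setup this would be a TCP transfer to the second
        # server, followed by waiting for its acknowledgement.
        self.replica[block_number] = data

    def write(self, block_number, data):
        """Write one block locally, mirror it, then acknowledge."""
        self.local[block_number] = data
        self._send_to_peer(block_number, data)
        # Returning here models the acknowledgement: both copies now
        # contain the block.
        return True


if __name__ == "__main__":
    mirror = ToyMirror()
    mirror.write(0, b"superblock")
    mirror.write(7, b"database page")
    print(mirror.local == mirror.replica)
```

Because the copy happens at the block level, the peer needs no knowledge of the file system or database sitting on top; it simply holds the same blocks.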

DRBD is most often used to create high-availability clusters with no shared storage, and therefore no single point of failure. Should a server in the pair crash due to an OS bug or failed hardware, the other can be configured to automatically take over its duties, with an up-to-date copy of the application or database data. DRBD's third-node feature is most often used to locate a third copy of data off-site. Should an entire site go down, another up-to-date (as of the last file system write) copy of the data is ready to go. So how does this apply to the cloud?

A few models exist to replicate your data into the cloud. A pair of servers, configured to fail over using ClusterLabs' Pacemaker cluster manager software, can have one node local and one node in the cloud. Getting failover to work in this instance is non-trivial, and requires some fancy BGP routing knowledge, or VPN tunnels to migrate an IP address. The VPN option assumes the local site isn't completely down, though.

More often, a slower failover in the case of a datacenter meltdown is acceptable. Maybe DNS has to be changed, and the propagation time is acceptable. In this case, the general 3-node DRBD setup is used. One highly available cluster (pair of servers) at the local datacenter site serves up MySQL, Oracle, NFS, iSCSI, Apache or whatever application is needed, and a third node in the cloud quietly receives a copy of every write.

One problem with real-time replication is that the performance of your Internet connection is extremely important. Writes to a DRBD volume are immediately sent to replica nodes, and the primary node must wait for acknowledgement before proceeding with too many more data writes, to ensure data integrity. This can dramatically slow down file system performance in the face of the most common bottleneck: your Internet connection. Enter DRBD Proxy (not free, not open source) to alleviate this pain. The proxy runs on the cluster, buffering data very much like the write cache of a SAN storage controller. The primary node(s) can now write huge amounts of data, without waiting for the off-site copy to catch up.
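A little arithmetic shows why buffering helps and where it stops helping. The Python sketch below uses a deliberately simplified model with made-up numbers (it is not a description of how DRBD Proxy is implemented): data arrives at the application's write rate, leaves at the Internet link's rate, and the difference piles up in a fixed-size buffer.

```python
def seconds_until_buffer_full(write_rate_mb_s, link_rate_mb_s, buffer_mb):
    """Return how long a write burst can last before the buffer fills.

    Returns None when the link keeps up with the writes, in which case
    the buffer never fills.
    """
    backlog_rate = write_rate_mb_s - link_rate_mb_s
    if backlog_rate <= 0:
        return None
    return buffer_mb / backlog_rate


def seconds_to_drain(backlog_mb, link_rate_mb_s):
    """Return how long the off-site copy needs to catch up once writes stop."""
    return backlog_mb / link_rate_mb_s


if __name__ == "__main__":
    # Assumed figures: 50 MB/s of local writes, a 10 MB/s uplink,
    # and a 2000 MB buffer.
    print(seconds_until_buffer_full(50, 10, 2000))  # 50.0
    print(seconds_to_drain(2000, 10))               # 200.0
```

Under these assumptions a burst of 50 seconds is absorbed without slowing the primary, and the off-site copy then lags by up to 200 seconds while the buffer drains. A sustained write rate above the link rate will fill any buffer eventually, so buffering smooths bursts but does not replace adequate bandwidth.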

In summary, if you need a crude but effective way to simply ensure that some data exists off-site, a quick rsync to the cloud may work. You can also use robust backup software to schedule "real" backups via cloud APIs. If, however, you want to have an up-to-date copy of your data to implement whatever clever real-time failover or disaster-mode fallback mechanism you like, you should consider DRBD.
