30 July 2008

Backups Reorganization pt. 5: SCM

There are two Source Control Management (SCM) systems in use by the client: Subversion and Mercurial.

Subversion is a client/server SCM with a central repository to which clients synchronize. It ships with a built-in command, "svnadmin hotcopy", which, as the "hotcopy" subcommand name suggests, guarantees that I'll get a consistent, restorable backup of a repository even while it is in use.
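Here is a minimal sketch of how one repository could be hot-copied from Python. The function names and the backup layout (one directory per repository under a common backup root) are my own illustration, not necessarily what the final scripts do. Note that two repositories with the same directory name would collide under this naming scheme.

```python
import os
import subprocess


def hotcopy_command(repo_path, backup_root):
    """Build the 'svnadmin hotcopy' command line for one repository.

    The destination is named after the repository directory; this naming
    scheme is an assumption made for the sketch.
    """
    name = os.path.basename(os.path.normpath(repo_path))
    destination = os.path.join(backup_root, name)
    return ["svnadmin", "hotcopy", repo_path, destination]


def hotcopy(repo_path, backup_root):
    """Run the hot copy; raises CalledProcessError if svnadmin fails."""
    subprocess.check_call(hotcopy_command(repo_path, backup_root))
```

Keeping the command construction separate from the call that runs it makes the first half easy to unit test without having Subversion installed.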

Great. Now the problem is, where are the repositories? There are dozens of repositories that need to be dumped and I don't know what they are or where they are located in the root file system. The client creates repositories on the fly wherever they want. That's just the way they do their work. I had to come up with some scripts to help me determine the repository locations by traversing a given part of the file system. Of course, I used Python to do that work.

The gist of the scripts is that they enumerate every directory under a given starting point and look for "signature" directories and files. By "signature" I mean entries with specific names: if you see all of them together, you have probably found a Subversion repository that can be backed up. Python made quick work of this task with the os.walk() function, the set type, and the accompanying set-based math.
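A minimal sketch of the idea follows. The signature below (the "conf", "db", "hooks", and "locks" directories plus a "format" file at the repository root) is my assumption about what to look for, and the function name is made up for this example; the real scripts may check for more or fewer entries.

```python
import os

# Assumed signature of a Subversion repository root directory.
SIGNATURE_DIRS = {"conf", "db", "hooks", "locks"}
SIGNATURE_FILES = {"format"}


def find_svn_repositories(top):
    """Yield directories under `top` that look like Subversion repositories."""
    for dirpath, dirnames, filenames in os.walk(top):
        # Subset tests: every signature entry must be present.
        if SIGNATURE_DIRS <= set(dirnames) and SIGNATURE_FILES <= set(filenames):
            # Clear the list in place so os.walk() does not descend
            # into a repository that has already been identified.
            dirnames[:] = []
            yield dirpath
```

Calling list(find_svn_repositories("/home")) would return the candidate repository paths under /home. A match is only "probably" a repository, so running "svnadmin verify" on each hit before trusting it seems prudent.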

I have put all of my code into a Python package so that it is easy to build and deploy at the various client sites. I also tried to make the code fairly generic so that it can be reused with other clients. I used test-driven development while writing these scripts. Of course, the code is under version control in my company's own Subversion repository.

Is it really worth traversing the file system to search for Subversion repositories? Yes! I found dozens of repositories scattered all around. Each has a file system backup performed on its host partition but none were being dumped as they should to guarantee a consistent, restorable backup.

Mercurial is a distributed revision control system. There is no centralized server by design. I didn't find a lot of good information about getting a good backup. Some items pointed to "hg clone". From some reading, it seems that backups consist of the copies kept on developers' machines. Some folks keep a central computer on which they have a Mercurial client and keep a master copy. I decided that I was just going to let the backups stand as is and rely on the file system level backup. I'm not entirely happy about it and may want to revisit this decision as soon as I take care of the other applications that need attention.

Next up...Oracle.

21 July 2008

Backups Reorganization pt. 4: PostgreSQL

The first thing that I wanted to do in order to back up the data from various applications was to create an identically named space on each machine's file system in which to dump data. I chose /var/data_backup/. Each client machine uses LVM extensively, so I had to create a logical volume and auto-mount it at /var/data_backup/.

For the most part, creating the new LV was not a problem because two of the three servers had ample space. The third server, however, needed to have at least one existing LV reduced in order to have enough free space in the volume group.

Here are the steps I used to reduce the size of an existing LVM partition:
  1. back up the partition to be reduced locally using "rsync --archive"
  2. stop the application that was making use of that partition, in this case, apache
  3. umount the partition
  4. run e2fsck against the partition
  5. run resize2fs to reduce the size of the data on the partition
  6. reduce the size of the partition using lvreduce
  7. mount the partition
  8. start the application that uses the partition
After that I was able to create the standard /var/data_backup/ partition as normal using LVM since there was now enough free space in the volume group.
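
The steps above can be sketched as an ordered list of commands. This is only an illustration: the function name, the init script path, and every argument value are hypothetical, and it assumes an ext2/ext3 file system (hence e2fsck and resize2fs) with an /etc/fstab entry for the mount point. The sketch builds the commands but deliberately does not run them, since a mistake here destroys data.

```python
def reduce_lv_commands(device, mount_point, new_size, backup_dir, service):
    """Return the commands, in order, for shrinking an ext2/ext3 logical volume.

    The file system must be shrunk (resize2fs) BEFORE the logical volume
    (lvreduce); doing it the other way around truncates the file system.
    """
    init_script = "/etc/init.d/" + service  # assumed init script location
    return [
        ["rsync", "--archive", mount_point + "/", backup_dir + "/"],
        [init_script, "stop"],
        ["umount", mount_point],
        ["e2fsck", "-f", device],
        ["resize2fs", device, new_size],
        ["lvreduce", "--size", new_size, device],
        ["mount", mount_point],
        [init_script, "start"],
    ]
```

For example, reduce_lv_commands("/dev/vg0/www", "/var/www", "10G", "/var/tmp/www_copy", "apache2") yields the eight commands in the same order as the list above. Note that lvreduce asks for confirmation when run interactively.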

Now it was time to actually back up the PostgreSQL databases. After some research, it seemed that the preferred way to do this was to use pg_dumpall, since it dumps the application databases as well as the system-wide data such as database roles and permissions. At this point I simply had to create the appropriate user and access rights in PostgreSQL and then set up the cron job on each target machine running PostgreSQL. Along the way, I had to set up automatic login on two of the boxes using a .pgpass file.
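
As a rough sketch of the two pieces involved: a .pgpass entry has the form hostname:port:database:username:password, and pg_dumpall writes the dump to standard output, which can be redirected to a file. The function names and the "backup" user in the usage note are hypothetical, and the sketch ignores the escaping that .pgpass requires when a field itself contains a colon or backslash.

```python
import subprocess


def pgpass_line(host, port, database, user, password):
    """Format one ~/.pgpass entry: hostname:port:database:username:password."""
    return ":".join([host, str(port), database, user, password])


def dump_all(outfile_path, user):
    """Redirect the output of pg_dumpall into outfile_path.

    Raises CalledProcessError if pg_dumpall exits with an error.
    """
    with open(outfile_path, "w") as outfile:
        subprocess.check_call(["pg_dumpall", "-U", user], stdout=outfile)
```

For instance, pgpass_line("localhost", 5432, "*", "backup", "secret") gives a line that matches every database on the local server, which is convenient because pg_dumpall connects to each database in turn. The .pgpass file must be readable only by its owner (mode 0600) or PostgreSQL ignores it.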

I tested the restore of the data on a throwaway server back at the office. The idea is to connect to an existing database and read the dump into it: template1 for PostgreSQL 7.3.* and postgres for 8.1.*. The restore appeared to be fine.
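
A small sketch of that choice follows, assuming the dump is replayed with "psql -f". The function name and the dump path in the usage note are hypothetical. The version cutoff reflects that the default "postgres" database first appeared in PostgreSQL 8.1.

```python
def restore_command(dump_path, server_version):
    """Build the psql command that replays a pg_dumpall file.

    Servers older than 8.1 have no default 'postgres' database,
    so the dump is read in while connected to template1 instead.
    """
    major_minor = tuple(int(part) for part in server_version.split(".")[:2])
    database = "postgres" if major_minor >= (8, 1) else "template1"
    return ["psql", "-f", dump_path, database]
```

So restore_command("/var/data_backup/all.sql", "7.3.21") connects to template1, while the same call with "8.1.11" connects to postgres.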

Lastly, of course, I needed to perform the one-time setup of the backup servers to grab the /var/data_backup/ directory on each machine. I also grabbed the /etc/ directory since there is a lot of application configuration information there, including the pg_hba.conf file.

Next up will be...Subversion.

15 July 2008

Backups Reorganization pt. 3: Sidestepping LVM Snapshots

The previous backup strategy had been to create LVM snapshots on the target machines of the various filesystems, mount the individual snapshots on the filesystem, and then back up the files on the mount point. There was quite a bit of scripting complexity to create and mount the snapshots safely.

As far as I can tell, the advantage of using a snapshot is that rdiff-backup won't complain that files may be actively changing while the backup is in the process of being created. (If someone thinks of other reasons to use LVM snapshots then please comment.)

What using the snapshot does not address is the fact that many applications cannot be backed up directly. Backing up certain application files directly may be reported as successful but would, in fact, yield a corrupted backup. In other words, the backup system copies the files successfully, but the application cannot use them once they are restored. The client has several applications that fall into this category, including Oracle, PostgreSQL, Zope, Subversion, and others.

So, the next task is a system administrative one of dumping application data, backing up the dumps, and excluding the underlying source files. First up...PostgreSQL.
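
To make the "excluding the underlying source files" part concrete, here is a sketch of how the exclusions might be passed to rdiff-backup. It assumes the classic rdiff-backup command line (options, then source, then destination) with its --exclude option. The function name, the host name, and the paths in the usage note are hypothetical.

```python
def rdiff_backup_command(source, destination, excludes):
    """Build an rdiff-backup command line that skips the given paths.

    Each excluded path is a live application data directory whose
    contents are covered by a dump stored elsewhere.
    """
    command = ["rdiff-backup"]
    for path in excludes:
        command += ["--exclude", path]
    return command + [source, destination]
```

For example, rdiff_backup_command("root@client1::/", "/backups/client1", ["/var/lib/postgresql"]) would back up the whole remote root file system except the raw PostgreSQL data files, which are captured by the dump instead.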

13 July 2008

Backups Reorganization pt. 2: Partitioning Overload

The first step in getting a handle on the data backup and recovery strategy is to untangle some of what has been done already. The key problem is that each backup server's volume group was partitioned into dozens of logical volumes. The result is that certain logical volumes are filling up (or are full) while others are underutilized.

The quickest solution would be to reduce a largely empty partition and enlarge the full one(s). That works but forces me to guess whether each volume has been given enough space. It puts me in the position of having to actively manage space in the logical volumes.

A better approach would be to condense all of the logical volumes into one BIG one. The hope is that there will be enough overall free-space for the medium-term. That's the approach that I took. It meant recreating all of the backups and that took a few days.

In the process of condensing the logical volumes, I learned quite a bit about LVM at this Debian link and liked what I learned. Along the way, I had to readdress the backup scripts that were used. That'll be the topic of a later post.

12 July 2008

Backups Reorganization pt. 1: Introduction

Over the next couple of weeks, I plan to blog a series of articles describing a project at work to reorganize the backups for a client. The client runs a number of Debian and Fedora Core servers which are backed up remotely. The previous sysadmin had done a really good job of planning it out and provided some decent documentation. The problem is that the solution doesn't scale and is just complicated enough that another sysadmin (such as me) has to spend a lot of time getting up to speed.

My goals in reorganizing their backups are as follows:
  1. Ensure that the data being backed up is actually restorable. Perform tests to confirm. For example, one cannot simply back up database files and expect the data to be restorable. Usually, with databases, one has to dump the data into a file and then back up the file. This will be true of other applications as well, such as source control management systems.
  2. Simplify the backup process where possible. For example, dozens of scripts on multiple machines need to be invoked on a daily basis to perform backups. That could all be condensed into one machine and one or two scripts.
  3. Document both the backup and restore procedure for the client. Plan for disaster recovery of the data.
  4. Plan for future requirements, including backing up Windows machines.
  5. Give the client a real confidence that their data is being adequately safeguarded. Again, this comes from periodically testing restores.
  6. Centralize the backup job control onto one machine (currently, it is duplicated across four).
  7. Provide reliable alerting of both failures and successes of backup jobs via configurable methods including syslog, email, and even rss.