Archive for the ‘UNIX’ Category

Amanda on Mac OS X

Thursday, July 22nd, 2010

Given that Retrospect 8 is essentially a piece of crap, I’ve been searching for an alternative I can use when Time Machine is not an option for backing up Macs. The main two points I’m focused on is reliability and speed. I want a backup system I can trust that won’t take the age of the universe to recover a file.

I’ve been using Amanda for a while now to backup all the Macs in our workgroup (10 machines) and so far I’m nothing but happy. Amanda is an open source backup system for UNIX-based operating systems, Mac OS X included (I believe it can also backup Windows clients, but I couldn’t care less).

Amanda was originally designed to backup to tapes. Today, since hard drives became cheap and are a great media for backup, Amanda also supports virtual tapes. A virtual tape is a directory on disk that essentially acts as a tape, storing raw information. While configuring an Amanda server, you create a set of tapes of an arbitrary size (100 GB, for instance) and Amanda will use at least one tape per day, eventually rotating trough all of them.

I’ll now lay down some considerations about how Amanda works and how does that reflect on usage and backup planning. Also, I’ll do some comparisons with Time Machine.

Scheduling

In amanda, you define one or more configuration files. Each configuration can have a set of hosts (a backed-up machine), and for each host a set of disks (a disk is any directory you want to backup, it’s not needed to match Amanda disks with physical disks or logical HFS+ volumes). You generally run a backup operation that will backup all the disks of all the hosts on that configuration file.

Launching a backup operation is done by simply executing an UNIX command from whatever mechanism you prefer (cron, launchd, manually, etc). This means you can define how often does your backup run, and at what time. Usually, a daily backup is performed, but you may want to backup every 6 hours or only once per week, depending on how often does your data change and how bad is it to loose a few hours of work. This means that, in Amanda, the server is proactive and initiates the backup procedure whenever you set it to. In Time Machine, backup operations are initiated by the clients every hour. The server is simply a dumb file server where backups are stored.

Storage

Amanda uses the tar format to store data on a virtual tape. For each backed-up disk Amanda creates a tar file inside the tape with data. Amanda stores data in incremental fashion, using some redundancy as well. This is implemented trough the concept of backup levels. The base backup (where all the files on a disk are backed-up) is level 0. From there, backup levels are incremented, and each level essentially means it contains the incremental changes relative to an archive in the immediate lower level. This means a level 3 backup contains the changes relative to a level 2 backup. That level 2 backup contains the changes relative to a level 1 backup which contains, as expected, the changes relative to the level 0 backup.

Amanda decides on the level is should operate on based on a complex planner that considers several factors that result in a decision. You may add some configuration options to customize the weight of some of those factors in the final planner decision. Essentially, Amanda tries to obtain the right balance between reliability (which, in this context, means the probability of recovering a backup successfully) and used disk space. Reliability decreases as the backup level increases, due to the fact that, if you want to recover a level 5 backup, you need to have the consistent level 0, 1, 2, 3, 4 and obviously 5 backups, because each of those build upon the previous one. A level 0 backup is more reliable in the sense that you only need the level 0 backup itself to recover. Of course, a level 0 backup is an “everything” backup. If you do a level 0 backup every day, you’ll use a lot of disk space.

This is an example of a production Amanda installation for a host called bergman and volume / (the root):


date                host    disk lv tape or file file part status
2010-07-01 06:05:24 bergman /     2 DAILYS-6       15  1/1 OK
2010-07-02 04:19:11 bergman /     3 DAILYS-7       23  1/1 OK
2010-07-04 04:47:44 bergman /     3 DAILYS-9        9  1/1 OK
2010-07-05 02:51:01 bergman /     0 DAILYS-10      21  1/1 OK
2010-07-06 02:51:34 bergman /     1 DAILYS-11      11  1/1 OK
2010-07-07 04:59:40 bergman /     1 DAILYS-12      12  1/1 OK
2010-07-08 02:59:32 bergman /     1 DAILYS-13      15  1/1 PARTIAL
2010-07-10 02:46:41 bergman /     1 DAILYS-15      20  1/1 OK
2010-07-11 05:25:49 bergman /     2 DAILYS-16       8  1/1 OK
2010-07-12 03:43:04 bergman /     0 DAILYS-17      16  1/1 OK
2010-07-13 04:25:23 bergman /     1 DAILYS-18      11  1/1 OK
2010-07-14 02:56:48 bergman /     1 DAILYS-19      15  1/1 OK
2010-07-15 02:54:01 bergman /     2 DAILYS-20      11  1/1 OK
2010-07-16 02:46:30 bergman /     2 DAILYS-1       11  1/1 OK
2010-07-17 05:05:08 bergman /     3 DAILYS-2       12  1/1 OK
2010-07-18 03:00:43 bergman /     3 DAILYS-3        9  1/1 OK
2010-07-19 05:11:36 bergman /     3 DAILYS-4        7  1/1 OK
2010-07-20 02:49:03 bergman /     4 DAILYS-5       24  1/1 OK

The backup level is indicated by the “lv” column. As you can see, I have 20 virtual tapes, and the last one to be used was DAILYS-5. The next one will be DAILYS-6 (its content will be erased and the tape will be reused). You may be asking yourself why are there two level 0 backups, and a lot of consecutive repeated backups with the same level.

Consider two rules of thumb to understand that:

1) A level 0 backup is needed to recover, so Amanda must make sure at least one level 0 backup exists at any time, given the number of tapes and their rotation. Also, bad stuff can happen, like a machine being down or unavailable at the time the backup runs. If that happens during a few Amanda runs, it may happen than you hit a number of rotations where you effectively loose the level 0 backup without creating a new one. That’s why Amanda makes a few level 0 backups among the way, to make sure you still have a second level 0 backup if the first one is destroyed. You can configure the maximum number of runs that go by without a level 0 backup being created. It’s highly recommended that that number is lower than the half of the total amount of tapes you have, so that, in the worst case (like in my example, where I have 20 tapes), you have a level 0 backup in the “middle” of your tape recycling circuit. Now, why did Amanda created two level 0 backups, one in July 5, and another one in July 12, effectively less than the maximum time allowed? See below.

2) Despite what seems intuitive, Amanda does not increase the backup level every run until it gets back to zero again. I don’t know in detail all the data the planner uses to make a decision, but there are at least two interesting considerations the planner takes into account.

The first one, is a level N+1 backup a lot smaller than a level N? If the answer is no, Amanda decides to keep the same level. There’s no point in increasing the level (and lowering reliability) if no significative amount of disk space will be saved. This may happen if you change approximately the same set of files in each run. In that case, the incremental backup from level N to N+1 would be of almost the same size as the N-1 to N.

Second, Amanda sometimes promotes level 0 backups ahead of schedule to spread them trough time. Doing all the level 0 backups in the same day would be very slow and might not fit in the maximum number of tapes allowed for a single run (you can define that value on the config file). There’s also another factor: free space on the tape. Remember that, for now, Amanda cannot use the same tape in two consecutive runs, so there’s no point in leaving unused space on a tape. If there’s enough space to perform a level 0 backup instead of a higher level, Amanda may decide to do it.

As you can see by now, data storage is handled quite differently in Amanda and Time Machine. I’ll assume you know how Time Machine works, so I’ll get straight to the point: in Amanda you may have more than one copy of the entire volumes you are backing up (ie, several level 0 backups of the same data), so you need to take that in consideration while planning storage space. On the other hand, amanda allows gzip compression (and encryption) of backup data, so it’s not completely obvious how much more (or less) space you need for Amanda backups compared to Time Machine. In some situations, if your data is highly compressible, you may even end up with two level 0 backups taking less space than a single Time Machine backup, although the opposite will happen most of the time. What’s cool is that if you use compression, Amanda learns with time how compressible your data is, and adjusts it’s planning according to that.

Another interesting note is Amanda using standard file formats for storing backups (tar and optionally gzip). This allows recovering of data even in machines where Amanda is not present. If you navigate into a virtual tape directory on your file system and run the “head” command in one of the stored files, you’ll see something like this:

AMANDA: FILE 20100715010001 serpa /Library  lev 2 comp .gz
  program /usr/bin/gnutar
DLE=<<ENDDLE
<dle>
  <program>GNUTAR</program>
  <disk>/Library</disk>
  <level>2</level>
  <auth>ssh</auth>
  <compress>FAST</compress>
  <record>YES</record>
  <index>YES</index>
  <exclude>
    <list>.amanda-exclude.list</list>
    <optional>YES</optional>
  </exclude>
</dle>
ENDDLE
To restore, position tape at start of file and run:
	dd if=<tape> bs=32k skip=1 | /usr/bin/gzip -dc |
          /usr/bin/gnutar -xpGf - ...

The first part is metadata used by Amanda. After the metadata ends, there are two lines used to tell you how to restore the data stored on that file with standard UNIX tools. This means you don’t need to worry about being able to recover your backups if Amanda development happens to stop for some reason. As long as you have an UNIX machine, you’ll be able to restore your data.

Security

There are two points I want to mention about security: transport and storage encryption, and system architecture. I’ll assume we’re always talking about having a backup server and several clients. If the problem you’re trying to solve is so simple that it can be fixed with an USB disk and Time Machine, you’re just wasting your time reading this. :)

Amanda supports encryption both during the data transport and on data storage. Transport security is guaranteed by using ssh with public/private key pairs. It also supports data encryption on the data storage by using symmetric private-key based encryption of public/private key pairs. Encrypting your backups is important, specially if you store them in an offsite location (either via network transfer to a remote data center, or by physically storing hard drives or real tapes in a safe). In the event data ends up in the wrong hands, encryption will rend it useless for the bad guys. Things are substantially different in Time Machine. Data storage encryption is simply non-existing (unless you use a lower level encryption method, like PGP). Transport encryption may be provided by the AFP protocol, used by Time Machine to connect to the backup server, depending on your server configuration.

More interesting, in my opinion, is the implicit security resulting from the client/server architecture used by Amanda. A backup server should be one of the most safer and well guarded machines you have in your network. All your data will be there. If the backup server gets compromised, it means all the data from all the backed-up machines might now be in the wrong hands. You would want your backup server to run as few services as possible, and to allow access to the server (like ssh) only by trusted admins from controlled networks.

In Amanda, this is possible. As I described before, when a backup operation starts, the Amanda server will contact its clients using the method you chose in the configuration (ssh in my case). So the connection is made from the server to the client, and not the opposite. This means your user’s machines (which will naturally be less secure than your server because those pesky users will run all the crap they get from the internets!) will never access the backup server, and though compromise it’s security. Time Machine works in the opposite direction. There’s no concept of a “backup server”. The clients run the show, and the server is simply a file server where backups are stored. This exposes the user backups to whatever malware and trojan horses they may have running on their Macs. If the server is misconfigured, or if some hacker exploits an unknown vulnerability in the AFP protocol, other users backups may be compromised as well.

Conclusions

Amanda may be a great option to backup always-on Macs, like xServes or desktop machines (along with Linux or FreeBSD servers, of course). It offers a vey fast, reliable and secure infra-structure upon you may build your backup system. However, it lacks the user interface and simplicity of Time Machine (most configurations require sysadmin intervention for recovering data from a backup). Also, Time Machine may be more appropriate if you rely heavily on laptops that will not be available on the network on a predictable schedule.

Amanda pros:

  • Works on any UNIX system (and Windows clients) which may help when planning a multi-OS backup scenario.
  • Offers very good control of used disk space.
  • Allows data encryption and compression.
  • Secure client/server architecture.
  • May be used on real tapes, not just hard drive-based backups.

Amanda cons:

  • Not appropriate for laptops that may be out of network reach during backup operations.
  • Step learning curve, a little hard for beginners to configure and get everything running smoothly.
  • Unless your users are computer experts and have access to the backup server, it requires sysadmin intervention for recovering data.
  • Recovering a single file may take some time, because the entire tar archive has to be read until the file is found.
  • Backups are usually less frequent than Time Machine.

Time Machine pros:

  • Very simple setup.
  • Non-expert users may recover lost files and easily browse filesystem history using the stunning user interface, without requiring sysadmin intervention.
  • Works fine with laptops with intermittent network connections.
  • Very well integrated with Mac OS X, makes backing up and recovering very easy and part of normal usage and new machine installation workflow.

Time Machine cons:

  • Works only on Mac OS X, and requires a Mac OS X Server as backup server (unless you go with unsupported devices and face possible consequences).
  • Offers no data storage security.
  • Very hard/impossible to control data storage strategy and used space.
  • Requires clients to access server, which decreases security.
  • Works only with disk-based backups.

There’s no clear winner here, it highly depends on your needs and restrictions. I hope this article gave you a general idea of what Amanda is, how does it work, and it’s advantages and weaknesses. In the next article, I’ll describe how to install a Mac OS X amanda client, and how to recover from a catastrophic drive failure.

Avoid escaping URLs in Apache rewrite rules

Thursday, January 24th, 2008

Today we started hacking some Apache rewrite rules to make some URLs a little more friendly. All of the URLs we are rewriting are entry points for our application, which in the WO world means direct actions.

All of them contain queries parts. These are the arguments in a URL, like http://domain.com/something?arg=value&anotherArg=anotherValue. Well, everything was running fine until we accessed an URL with a escaped character. We had the following rule:


RewriteRule ^/optOut(.*) /cgi-bin/WebObjects/OurApp.woa/wa/optOut$1 [R]

When we tried to access the URL http://domain.com/optOut?email=me%40domain.com, we got:

http://domain.mac/cgi-bin/WebObjects/OurApp.woa/wa/optOut?email=me%2540domain.com

Note the last part or the URL. The problem is that the original % in %40 (the escape code for @) was being itself escaped, leading to a corrupt URL. The result of calling formValueForKey on this was me%40domain.com.

After googling for a while, I found out many people have this problem, but surprisingly I could not find a decent solution. I saw hacks with escape internal functions and PHP weird variables that obviously wouldn’t help me at all, and I was starting to get nervous.

Well, while reading the rewrite module documentation, I found the solution right there: the NE option. According to the docs:

‘noescape|NE’ (no URI escaping of output)
This flag prevents mod_rewrite from applying the usual URI escaping rules to the result of a rewrite. Ordinarily, special characters (such as ‘%’, ‘$’, ‘;’, and so on) will be escaped into their hexcode equivalents (‘%25′, ‘%24′, and ‘%3B’, respectively); this flag prevents this from happening. This allows percent symbols to appear in the output(…)

Well, this is it. I saw so many pages with people asking about this problem without getting any answers that I was not very confident on this, but I tried to change the rule for:


RewriteRule ^/optOut(.*) /cgi-bin/WebObjects/OurApp.woa/wa/optOut$1 [R,NE]

And that’s it. It’s now working as it should, producing the correct URL:

http://domain.mac/cgi-bin/WebObjects/OurApp.woa/wa/optOut?email=me%40domain.com

Apparently, it’s that simple.

Testing memory

Sunday, August 12th, 2007

I wrote some days ago about badblocks for testing a hard drive surface. Now, the same for memory.

As I said, I bought a second-hand PowerMac G5 to replace my old G4. When I got the new machine, I run Apple Hardware Test (AHT), using the Extended Test. AHT tested my hardware, including the 2.5 GB of RAM, taking more than two hours (and making a hell of a noise, because during tests, the G5 ventilation system works in failsafe mode, which means, full power). Everything seemed to be fine. Until I installed Retrospect. I use Retrospect to make all my backups at home, and despite all it’s quirks, it always worked fine on the G4. Since I installed it on the G5, I got strange errors (the famous “internal consistentcy check”) and even crashes.

After nailing down all the possibilities (trashing preferences and existing backup sets, reinstalling Retrospect, etc) I suspected it could be an hardware problem, because I was told that the “internal consistentcy check” appears when the backup set contents are corrupted. So, I thought, my hard drive is corrupting data. I duplicated one backup set with about 80 GB, and surprise – after duplicating and running an md5 checksum on it (and diff), the files were different! This was NOT supposed to happen, naturally. So I tried the same thing on my boot drive – same problem. Ops… it’s not the drives. So, if it’s not the drives, and supposing (more exactly, praying) that it was not a motherboard issue, it must be the memory.

All my collegues at IST System Adminstration team use memtest on PCs to test the memory. This great distribution of memtest has a really nice touch: you can burn this on a CD, and boot the PC from it. It takes less that 200K of RAM, so all the other memory will be tested. Unfortunately, it’s not possible to boot it in Macs (not even Intel Macs – I tried it!). So, you must get the Mac OS X version of memtest, and boot the OS in Single User mode (using command-S during the boot sequence). The OS will take about 50 MB of RAM, which is, of course, much worse than the 200K used by the PC version, because those 50 MB will simply not be tested. But it’s better than nothing.

The official site for the Mac OS X version of memtest is here, but unfortunately, the author requires you to pay a small ammount for the download. I don’t like the approach very much because I don’t really know what I’m buying. the author says that, after paying, he sends a password for the encrypted DMG you downloaded. But I cannot download without paying, because the link is no-where. So… what happens when a new version comes out? Do I have to pay it again? Well, anyway, someone else is distributing memtest for OS X for free. Yes, it’s legal, because the software is under GNU license. So, if you don’t want to pay, just click here and grap your own free copy. Happy testing!

By the way, some tests take a lot of time. Let all of them run. Don’t assume the fact that all the “quick” tests passed means your memory is OK. Some problems may only be found with the more complex and slower tests – that’s why they are there. So, let it run. And if you have a G5, get the hell out of there, or use ear-plugs. It won’t be a nice office to work during testing, trust me.

memtest will detetct lots of common problems in memories, and will probably identify more than 99% of the defective memory modules arround. But never forget: it’s impossible to be entirely sure that a memory module is OK, simply because it’s not possible, in a reasonable time frame, to test all the possible combinations of data. Also, memory may pass all the tests in a day, and fail the next day. There are many factors that may trigger a hidden problem in memory modules: temperature, electrical flutuations, the data it contains, age, etc. If you suspect you have a bad memory module, and if you have time, run memtest for several days in a row, using the option to do many passes.

Bad blocks? badblocks!

Monday, July 16th, 2007

There are not many things I miss from Mac OS 9. But there’s one that was really useful: the ability to test a hard drive surface. OS 9 disk formatter (I don’t even recall it’s name) had a “Test Disk” option that would perform a surface scan of the selected hard drive. That was awsome to test for bad blocks on the drives.

Unfortunately, that’s impossible to do with Mac OS X, at least with it’s built-in software. There are some commerical applications to do that (like TechTool Pro), but I get a little pissed off when I have to spend a lot of money buying a software that does a zillion things when all I want is surface scans, and specially when I could do it with the “old” OS and not with the new powerful UNIX-based one.

Well, Linux has the badblocks command that will do just that: test the disk surface for bad blocks. It’s a simple UNIX command, so I thought there must be a port of that to OS X (and, of course, I could try to compile it in OS X as last resource). After some googling, I found out badblocks is part of the ext2fs tools. And, fortunately, Brian Bergstrand has already done the port to OS X, including a nice installer.

The installer installs all the ext2fs stuff, including an extension that will allow you to access ext2fs volumes on OS X. As always, this is a somewhat risky operation. Personally, I avoid as many extensions as I can, because they run too close to the kernel for me to feel confortable. So, if possible, install it on a secondary OS (like an utility/recover system on an exteral hard drive, or so).

The badblocks command will be installed in /usr/local/sbin/badblocks, and it will probably not be on your PATH, so you have to type the entire path when using, or edit your PATH environment variable.

Usage is simple. First, run the “mount” command, so that you know the device names for the drives you want to test. You can obtain something like this:

arroz% mount
/dev/disk0s3 on / (local, journaled)
devfs on /dev (local)
fdesc on /dev (union)
on /.vol
automount -nsl [142] on /Network (automounted)
automount -fstab [168] on /automount/Servers (automounted)
automount -static [168] on /automount/static (automounted)

The internal hard drive is /dev/disk0 (note that /dev/disk0 is the entire drive, /dev/disk0s3 is a single partition). Imagining you want to test the internal hard drive you would type the command (as root):

badblocks -v /dev/disk0

This would start a read-only test on the entire volume. The -v is the typical verbose setting, so you may follow what’s happening. This will take a long time, depending on the hard drive you use. For a 160 GB hard drive, it took between 2 and 3 hours in a G5 Dual 2 Ghz.

I mention this because time is an important factor when testing hard drives! You should run badblocks on a known-to-be-in-good-condition hard drive, so that you can get the feeling of how fast (or slow) badblocks is. Later, if you test a possibly failing hard drive, and badblocks progresses notably slower, it will probably mean that the hard drive is in bad condition (even if it doesn’t have badblocks).

After running the command, you may get two results: your disk has, or hasn’t badblocks! :) You will see many outputs of a successful surface scan, so I leave here an example of a not-so-successful one:

/usr/local/sbin arroz$ sudo ./badblocks -v /dev/disk0
Password:
Checking blocks 0 to 156290904
Checking for bad blocks (read-only test): 120761344/156290904
120762872/156290904
120762874/156290904
done
Pass completed, 3 bad blocks found.

This is the result of a test on a 160 GB hard drive with 3 bad blocks.

After getting something like this, you may try to run badblocks again, in write mode. Note that this will destroy all the information you have on the hard drive! badblocks won’t copy the information to memory, and than back to disk. It simple destroys it. The point of running a write-enabled badblocks check is forcing the hard drive to remap the damaged sectors. Hard drives have a reserved space to use when bad blocks are found. The bad blocks are remapped to that reserved space, until it fills. And this will only happen on a write. So, run badblocks in write mode, and then again in read-only mode. If badblocks finds no bad blocks, your hard drive is fine (for now). If badblocks still finds bad blocks, it means that there are so many damaged blocks on the disk surface that the reserved area is full. Forget it, and throw the disk away. It’s useless.

Emulating a slow internet link for http

Tuesday, May 22nd, 2007

When developing a Web Application, many times you need to test how will it react then operating over a slow internet connection, like 56k modem, GPRS or any other situation where the bandwidth is tight. This is important not only to test how long will those funky pictures take to load, but as AJAX gets more and more used everywhere, you really need to guarantee that everything works acceptably on slow network links, including AJAX calls, timeouts, animations, and everything else that might depend on the server.

There are some graphical applications for Mac OS X that try to do that, but many don’t work as promised. Well, Mac OS X has the solution built-in, you just have to configure it, and that it the internal firewall (ipfw). Since the introduction of Tiger, ipfw can now do QoS (Quality of Service) stuff, and that means you can create channels (or pipes, in ipfw terminology) and control the priority, bandwith, delay and some other settings of those channels. Than, you simply have to pump the traffic you want to see slowed down thru those channels.

I have done two simple scripts to configure my firewall automatically. They will slow down the http port (80) to a crawl. I do not handle the SSL port for https traffic, but it’s easily achieved by duplicating the script content and adjusting the parameters.

Here are the scripts. The first one, that slows down everything:


#!/bin/sh

/sbin/ipfw add 100 pipe 1 ip from any 80 to any out
/sbin/ipfw add 200 pipe 2 ip from any 80 to any in
/sbin/ipfw pipe 1 config bw 128Kbit/s queue 64Kbytes delay 250ms
/sbin/ipfw pipe 2 config bw 128Kbit/s queue 64Kbytes delay 250ms

And the one that brings your network back to normally:


#!/bin/sh

/sbin/ipfw delete 100
/sbin/ipfw delete 200

As you can see, this slows down all your http traffic to 128 Kbps, and introduces a delay of 250ms (which is roughly what you have on slow, modem-like or wireless connections). The delay can really impact the performance of your application, specially if you do a lot of requests to get information to display on a web page. You can change the values accoring to your needs, and even bring the 128 Kbps down to 56 or even less if you have the guts! I still don’t know exactly what infuence is caused bu changing the queue size, but the 64 KB value produces a nice result.

Don’t forget, if you are just making a static web page, for this to work, you have to turn on the apache web server (the easiest way if to turn on Web Sharing in the System Preferences “Sharing” panel) and browse your pages thru it. Direct file access won’t naturally be affected by the firewall. Also, this slows down ALL your http traffic. Not only the traffic that comes from your computer, but also the traffic that comes from the Internet. So, if you are wondering why is your net really slow, the solution is simple: you forgot to run the second script after finishing your tests! :) If for some reason you cannot recover the normal speed, just reboot and your Mac will be back to normal.

To run this scripts, just copy them to two text files (using any text editor that saves in TXT – RTF won’t do it!) and then make then executable (using the “chmod u-x ” command on the shell). Also, you have to run then as root (using “sudo” or switching to root before running them).

Backups, rsync, and –link-dest not working

Sunday, May 20th, 2007

I use Retrospect to backup most of the machines at GAEL. You may wonder why do I use a commercial tool that still shows it’s OS 9 roots, instead of open source alternatives. Well, Retrospect has some cool advantages (namely the very good support of laptops that may be disconnected abruptely from the network while a backup is in progress). Also, when I first did this setup, Amanda and other tools did not work reliably with Mac OS X file format.

While this works to backup all the desktop workstations and laptops of GAEL members, I have a problem with our xServe. It runs Mac OS X Server, and Retrospect will not backup machines with the Server version of Mac OS X with the license we have. To do that, we would have to buy a much more expensive license.

No problem. A server, due to it’s nature, doesn’t have the “sudden disappearing” problem of the laptops, so I can use a “classic” UNIX approach – and my choice was rsync and the –link-dest option. You may read about this option in the rsync manpage, but in case you don’t know, what it does is the following: instead of synchronizing a directory in the usual way, it will create a new directory with a new file tree. But, to save space, it won’t copy the non-updated files from the old tree to the new one. Instead, it creates hard links, so that both entries in the file system point to the same data on the hard drive (to the same inode), thus saving space. So, everytime you update your backup, you will create a new tree, but you will only waste the space required by the files that were updated since the last backup, and some more space for the filesystem structures that support the directory tree. You can use a command like this:


// rotate old dirs
rm -rf /Volumes/Storage/test/test.5
mv /Volumes/Storage/test/test.4 /Volumes/Storage/test/test.5
mv /Volumes/Storage/test/test.3 /Volumes/Storage/test/test.4
mv /Volumes/Storage/test/test.2 /Volumes/Storage/test/test.3
mv /Volumes/Storage/test/test.1 /Volumes/Storage/test/test.2
mv /Volumes/Storage/test/test.0 /Volumes/Storage/test/test.1

/usr/bin/rsync --rsync-path=/usr/bin/rsync -az -E -e ssh --exclude=/dev/\* --exclude=/private/tmp/\* --exclude=/Network/\* --exclude=/Volumes/\* --exclude=/private/var/run/\* --exclude=/afs/\* --exclude=/automount/\* --exclude=/.Spotlight-V100/\* --link-dest="/Volumes/Storage/test/test.1" "root@my.machine.com:/Users/arroz/TestDirectory" "/Volumes/Storage/test/test.0/"

Side note: the -E option (capital E) is an option present on Mac OS X rsync version, that forces rsync to copy all the extended Mac file system attributes, including resource forks. It only exists in Mac OS X 10.4 (Tiger) or newer versions. If you are still using 10.3 (Panther) or older, use rsyncx. Do not use rsyncx with Tiger.

Until about a week ago, my backup machine (an old PowerMac G4) had an external SCSI Raid with 640 GB, and an internal RAID 0 (2 * 80 GB drives), besides the boot disk. All the Retrospect backups were being placed on the external RAID, and the server backups were going to the internal RAID 0. Now, I know it’s living on the edge to backup to a RAID 0. But there was really no more space, and it was a temporary situation, because the new drives for the external RAID were already ordered.

When the new drives arrived, I stored all the backups where I could for some days (640 GB was huge when we purchased the RAID, but today is relatively managable), switched the drives and created a new fresh RAID 5. Formatted it in the HFS+ file system, and copied back all the backups, including the server backups and finally trashed the internal RAID 0.

Some path adjustements on my server backup scripts, and we are back in business. But the RAID free space was getting dramatically shorter every day. I used the ‘ls -i’ command to compare the inodes of files that were supposed to be unchanged from the backup of a day to the other in the next day, and as I suspected, rsync was duplicating all the files, instead of hard-linking them.

After Googling a lot, I could not find answers for this. I tried to see if ‘cp -la’ would successfully create hard links, but to my surprise, I found out that the Mac OS X built-in ‘cp’ command would not support the “l” option. Nice. Before installing the GNU ‘cp’ version (and because I’m lazy and I didn’t want to do that) I started thinking about everything I had done since the new drives arrived. The OS was the same, the rsync command was the same, it worked before, so it had to work now. The only reason why it could be not working was because rsync, somehow, thought that all the files were changing, even when they did not.

Suddenly, the solution poped up in my head. Mac OS X has an option, associated to every HFS+ volume, called “Ignore ownership on this volume”. This is turned off by default on the boot drive, but it’s turned on by default on all the external drives you format. There’s a good reason for this: Mac OS X is a consumer product. And average users want to buy an external drive, store data on it, bring it to another Mac, and read their data. They don’t care if their UID is the same on both machines or not.

But this causes serious problems to rsync. Althought the file system will store the owner of the files, it probably won’t report it to the applications who try to read it (or will mask them to the user who’s trying to access them). Somewhere this information is filtered, between the file system and application layers. So, rsync was not getting the real UID of the files. As the files that came from the server had real UIDs, both UIDs wouldn’t match, and rsync would create a new copy because, from it’s point of view, the file had been changed.

The solution was simple – just going to the machine console, and “Get Info” of the external volume. I turned off the “Ignore ownership on this volume” setting, and rsync started operating normally again.