What do you do when your VMware server dies?
You start panicking, of course.

Ok, seriously now. You try to fix the problem first. If that doesn’t work, you bring the server back from the backups you are so regularly making. You are, aren’t you? And you always make sure that your backup plan is actually working. Don’t you?
About a week ago the virtual Debian server that hosts this blog stopped running. The VMware console reported a virtual disk error. I had to shut down the machine, and when I tried to boot it up again it wouldn’t start. There was an error on one of the virtual disks. Hm, so what to do? I had a fresh backup, only a few days old, so I went for that.
But first, let me tell you how my virtual machines are set up. I have a Debian host OS (currently stable Lenny) and on top of that I run VMware Server 2.0.1. Inside it I have a virtual Debian server where this blog is hosted alongside my whole code repository. I have a couple of other virtual machines for development and testing (Windows and Linux). All important virtual machines are backed up regularly to the NAS machine. I wrote about that a while ago here and here.
Ok, so I took my last backup, which is a tar archive that is further compressed with 7-zip to reduce its size. At first I only had a tar archive. I tested that and it worked just fine. Then at some point I started running 7-zip over the tar, and because I lacked time I did not test the resulting archives. No errors were reported, so I assumed all was well. What a mistake. All the 7-zip archives were partially corrupted. But thanks to sheer dumb luck I could still get the VMDK file that had been corrupted on the server out of the archive. I replaced that file on the virtual machine (the size was the same) and the server booted again. Hooray!
Well, not quite. Everything was working, but SVN was reporting errors. It could not open the database. After investigating the Apache logs (I use DAV SVN over HTTPS) it seemed that the “db/current” file was corrupted. This file holds the info about the last (current) revision in the SVN repository. I tried “svnadmin recover” to no avail. I tried removing the file and repeating the recover, which then failed elsewhere.
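Going back to the untested archives for a moment: a small check right after each backup would have caught the problem. Below is a minimal sketch, not the script I actually use. It assumes the tar was created with paths relative to the backed-up directory and that the virtual machine's files have not changed since the backup was taken. The names `verify_tar`, `archive_path` and `source_dir` are made up for this example. It only covers the tar layer; for the 7-zip layer, running `7z t` on the archive is the equivalent integrity test.

```python
import hashlib
import os
import tarfile


def sha256_of(fileobj, chunk_size=1 << 20):
    """Hash a file-like object in chunks so large VMDK files are not loaded whole."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify_tar(archive_path, source_dir):
    """Return a list of problems; an empty list means archive and source match."""
    problems = []
    try:
        with tarfile.open(archive_path) as tar:
            for member in tar:
                if not member.isfile():
                    continue
                # Assumption: member names are relative to source_dir.
                original = os.path.join(source_dir, member.name)
                if not os.path.isfile(original):
                    problems.append("missing on disk: " + member.name)
                    continue
                with open(original, "rb") as src:
                    expected = sha256_of(src)
                actual = sha256_of(tar.extractfile(member))
                if actual != expected:
                    problems.append("content differs: " + member.name)
    except (tarfile.TarError, EOFError, OSError) as exc:
        # A damaged or truncated archive ends up here.
        problems.append("archive unreadable: %s" % exc)
    return problems
```

Called as `verify_tar("/path/to/backup.tar", "/path/to/vm-directory")`, it reads every byte of every file in the archive, so a partially corrupted backup shows up the same day instead of on the day you need it.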
After searching the web and finding nothing I almost gave up. But then I found a little gem on one of the forums: a little Python script that solved my problems. I am posting it here in case anybody else has the same misfortune. If the author finds this offensive I will remove the script.
#!/usr/bin/python

def dec_to_36(dec):
    key = '0123456789abcdefghijklmnopqrstuvwxyz'
    result = ''
    while 1:
        div = dec / 36
        mod = dec % 36
        dec = div
        result = key[mod] + result
        if dec == 0:
            break
    return result

import os, re, sys

repo_path = sys.argv[1]
rev_path = os.path.join(repo_path, 'db', 'revs')
current_path = os.path.join(repo_path, 'db', 'current')

id_re = re.compile(r'^id:\ ([a-z0-9]+)\.([a-z0-9]+)\.r([0-9]+).*')

max_node_id = 0
max_copy_id = 0
max_rev_id = 0

for rev in os.listdir(rev_path):
    f = open(os.path.join(rev_path, rev), 'r')
    for line in f:
        m = id_re.match(line)
        if m:
            node_id = int(m.group(1), 36)
            copy_id = int(m.group(2), 36)
            rev_id = int(m.group(3), 10)
            if copy_id > max_copy_id:
                max_copy_id = copy_id
            if node_id > max_node_id:
                max_node_id = node_id
            if rev_id > max_rev_id:
                max_rev_id = rev_id

f = open(current_path, 'w+b')
f.write("%d %s %s\n" % (max_rev_id, dec_to_36(max_node_id+1), dec_to_36(max_copy_id+1)))
f.close()
This script is a little gem. It goes through all your revisions and reconstructs the “db/current” file. It worked. I lost the last 4 or 5 revisions, but that was easily solved, as I naturally still had them on my computer. So all was well, I made a backup of the current state and I was happy. Well, that was another false sense of security.
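To make it clearer what the script computes, here is a small sketch of the same idea in Python 3 that works on a list of lines instead of a real repository. The function names and the sample lines are made up for illustration, and the `id:` line format is simply taken from the regular expression in the script above. Newer repository formats may store “db/current” differently, so treat this as an illustration of the logic, not as a recovery tool.

```python
import re

KEY = "0123456789abcdefghijklmnopqrstuvwxyz"
# Same pattern idea as the script above: id: <node>.<copy>.r<revision>
ID_RE = re.compile(r"^id: ([a-z0-9]+)\.([a-z0-9]+)\.r([0-9]+)")


def to_base36(number):
    """Encode a non-negative integer using the digits 0-9 and a-z."""
    digits = ""
    while True:
        number, remainder = divmod(number, 36)
        digits = KEY[remainder] + digits
        if number == 0:
            return digits


def rebuild_current(lines):
    """Build the text of db/current from the 'id:' lines of revision files."""
    max_rev = max_node = max_copy = 0
    for line in lines:
        match = ID_RE.match(line)
        if match:
            max_node = max(max_node, int(match.group(1), 36))
            max_copy = max(max_copy, int(match.group(2), 36))
            max_rev = max(max_rev, int(match.group(3)))
    # Highest revision seen, then the next free node id and copy id.
    return "%d %s %s\n" % (max_rev, to_base36(max_node + 1), to_base36(max_copy + 1))


# Made-up sample lines, not taken from a real repository.
sample = ["id: 0.0.r12/345", "id: 1z.3.r11/90", "type: file", "id: a.0.r7/12"]
print(rebuild_current(sample), end="")  # prints: 12 20 4
```

The highest node id in the sample is `1z` (71 in decimal), so the next free one is 72, which is `20` in base 36. This also shows why the newest revisions were lost: the file is rebuilt only from the revision files that are actually present on disk.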
Last night the blog went down again. This time it could not access the database. It turned out that the file system inside the virtual machine was corrupted. I ran “fsck”, but frankly I was prepared for the worst. To my surprise all errors were corrected and the server is once again running happily. I suspect that the physical hard drive of the host is slowly dying, so I will migrate to a new drive in the future. But for now I am truly impressed by how sturdy Debian is. The host and the virtual server have both been running for about 3 years now without a reinstall. Both were migrated from Etch to Lenny (two stable releases) in live mode (no shutdown) and have survived a hardware change (CPU, motherboard and RAM) and file system corruption. Now do that to your Windows if you dare.