Server Problems

    Rumors have been flying about the problems plaguing RPGamer's server over the last few days. We have compiled the questions readers asked during that time and will answer them here.

The downtime was caused by a hardware failure in the machine. I will spare you the technical description here; those who really care to know will find it at the end of this post. To clear a few things up, RPGamer would like to make the following statements:

    -> This was not an April Fool's prank. We know the timing seemed suspicious, but it was purely coincidental. The problems actually started over a week ago, and simply became worse as the week progressed.
    -> We expect no more troubles. We have a couple more planned interruptions, but they should be less than 20 minutes each, and during non-peak server hours.
    -> RPGamer is not closing. Many of you wrote the webmasters asking if this was the end of RPGamer. No need to worry, we're here for the long haul.
    -> The downtime was not caused by a malicious attack. Many more of you wrote to ask whether the page had been compromised by hackers or by the infamous Melissa virus. Nothing of the sort happened and, hopefully, never will.

    We apologize for any confusion this may have caused our readers. We are not only back, but will have a nice surprise for you later today (04.05.1999). Keep an eye on the page for more details.

Technical explanation

    Here's the scoop. On Friday, March 26th, dragon (the machine RPGamer was hosted on) rebooted without reason. At first we took this with a grain of salt, but then it happened twice over the weekend, once for almost 10 hours. At that point Mike became very suspicious and set up a second machine to act as a serial console so he could trap the root of the problem. With this in place, he finally tracked it down to SCSI errors. Thinking at first that one of the drives was going bad, we made plans to replace it later that week. Then, in the span of less than 4 hours, dragon became completely unstable, with SCSI errors coming from all drives. We concluded that the SCSI controller had gone bad, and immediately made plans to move the drives to the working machine, voyager. Unfortunately, the RPGamer drive uses Ultra-Wide SCSI, and voyager's controller didn't have that option.

    Working with what was available, we cleaned off another HDD and started the restore from tape backup. RPGamer had been using a program called amanda to back up files off the LAN, but unfortunately this software did not turn out to be as reliable as we had hoped. With 5 levels of saved data, each 2 weeks before the last, each level had to be restored separately. What we didn't know was that each level would take slightly more than 5 hours, for a total of 20 hours restoring files. Starting late Wednesday night, the restore ran until late Thursday. This explains the changing of mains during April 1.

    Two problems came from the restore. The first, which we noticed right away, was that some of the data was not restored because of bad sectors on the hard drive. We fixed this by verifying the disk and saving what we could. Because the original drive is still intact, we did not want to attempt another 20-hour restore for the lost files. The second problem was that many permissions were broken, so many scripts and other files were inaccessible through the web server. This was straightened out as soon as it was discovered.
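
    For the curious, a check along the following lines can flag broken permissions after a restore. This is only a minimal sketch in Python, not the procedure we actually used. The function name find_unreadable is made up for illustration, and the sketch assumes the web server reaches files through the "other" permission bits, which may not match every setup.

```python
import os
import stat


def find_unreadable(root):
    """Return sorted paths under root that lack 'other' read access.

    Assumption: the web server is neither the owner nor in the group of
    the files, so it depends on the world-readable bit (plus the
    world-searchable bit for directories).
    """
    problems = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            mode = os.lstat(path).st_mode
            if stat.S_ISLNK(mode):
                # Symlink permissions are not meaningful; skip them.
                continue
            if stat.S_ISDIR(mode):
                # Directories need read and search (execute) permission.
                needed = stat.S_IROTH | stat.S_IXOTH
            else:
                needed = stat.S_IROTH
            if mode & needed != needed:
                problems.append(path)
    return sorted(problems)
```

    Calling find_unreadable on the web root (whatever path that is on a given server) returns the list of files and directories to inspect. The sketch deliberately does not change any permissions itself, since the right fix depends on what each file is for.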

    Everything is back to semi-normal at this point. Our new machine (calico) was ordered on Wednesday and built on Friday. Because of some minor problems, the machine was not installed until Saturday, and because we needed the hard drives for the new machine, everything was down until it was in place. We have two upgrades left. The first is an upgrade to the latest FreeBSD release, which will support our onboard Ultra-Wide SCSI adapter and allow us to use the original RPGamer drive again (when it is reactivated, we'll copy over any updates made between now and then, so nothing new is lost). The other is a quick downtime for a memory upgrade to 256 MB; the memory has already been ordered and will arrive later this week. These are the last scheduled downtimes for quite a while.

