PSX Extreme recently suffered a rather severe server failure that took our website offline from December 30th to January 6th. This was the longest unplanned outage in our 24-year history.
So, what happened?
To simplify a rather long story: the database server that powers PSX Extreme malfunctioned. We tried to repair it numerous times, but every attempt was unsuccessful. In fact, we actually ended up making things worse. The database that powers our site became irreversibly corrupted.
Our only real solution at that point was to completely wipe our server clean and reinstall everything from the ground up. On paper, this should have been an easy thing to do: reinstall the operating system, reconfigure our control panel. Easy. Time-consuming, no doubt, but most definitely an easy task.
Except absolutely nothing went correctly.
Downloading The Backups
PSX Extreme has four primary and entirely different backup methods, each intended for a different type of hardware or software failure. For instance, we back up our database, posts, pages, and primary directories to the cloud once every 24 hours. This method is great when we need to quickly revert a day or two. The downside? It doesn’t back up the entire website and all of its directories; it only backs up what is required to keep the core of our website operational. In other words, the absolute basics.
We also create a full backup of our entire website and all directories within our main web folder. This backup is an exact replica of our website as it appeared on the date it was created. Unfortunately, we only run this clone-based backup once every seven days, which, for a high-content website like PSX Extreme, is not the most ideal of solutions. However, it’s a fallback that is nearly guaranteed to work.
That is also the backup method we opted to use.
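For anyone curious, a daily “essentials” backup like the one described above can be sketched roughly as follows. This is a minimal illustration, not our actual script: the paths, the retention window, and the use of `tar` are all placeholder assumptions, and the dummy files stand in for the real database dump and core directories so the sketch is self-contained.

```shell
#!/bin/sh
# Sketch of a daily "essentials" backup (hypothetical paths).
# In production, SOURCE would be the live database dump plus the
# site's core directories, and DEST a cloud-synced folder.
set -eu

WORKDIR=$(mktemp -d)
SOURCE="$WORKDIR/site-core"     # stand-in for the real core directories
DEST="$WORKDIR/backups"         # stand-in for a cloud-synced destination
mkdir -p "$SOURCE" "$DEST"
echo "post content"   > "$SOURCE/post-001.html"
echo "dummy sql dump" > "$SOURCE/database.sql"  # real script: a mysqldump-style export

# Date-stamped archive of just the essentials (database + core dirs).
STAMP=$(date +%Y-%m-%d)
tar -czf "$DEST/core-$STAMP.tar.gz" -C "$SOURCE" .

# Keep only the last 7 daily archives (retention window is an assumption).
ls -1t "$DEST"/core-*.tar.gz | tail -n +8 | xargs -r rm -f

echo "created: $DEST/core-$STAMP.tar.gz"
```

A script like this would typically run from a daily cron entry; the weekly full backup is the same idea pointed at the whole web folder instead of just the essentials.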
The actual act of downloading the backups from the server and storing them on our local drive took approximately 36 hours. PSX Extreme is a large website, containing over 400GB of total data.
Easy, but time consuming.
Restoring The Backups
Unfortunately, this is where things started to take a turn for the worse. Downloading the actual backup files wasn’t overly complicated, just time-consuming. The same cannot be said for the restoration process.
We had to upload the compressed backup files to the server and then run a restore command. Sadly, every single time we attempted that, the restore process failed. We tried several times, wasting approximately three days. Each time, the restore would get to about 95%, hang for several hours, and then ultimately fail. Since we were restoring a rather large file, having the restore process hang was normal and expected. Having it crash? Not so normal or expected.
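A failure that late in a restore can sometimes be caught much earlier by verifying the archive before extracting it, since a transfer-corrupted archive often only breaks near the end of extraction. A minimal sketch of that idea, using a dummy archive and hypothetical paths rather than our real backup:

```shell
#!/bin/sh
# Sketch: verify a compressed backup *before* committing to a long restore.
set -eu

WORKDIR=$(mktemp -d)
mkdir -p "$WORKDIR/site"
echo "hello" > "$WORKDIR/site/index.html"
tar -czf "$WORKDIR/backup.tar.gz" -C "$WORKDIR" site   # stand-in backup archive

# 1. Cheap integrity check of the compressed stream.
gzip -t "$WORKDIR/backup.tar.gz"

# 2. List the archive contents without extracting (catches tar-level damage).
tar -tzf "$WORKDIR/backup.tar.gz" > /dev/null

# 3. Only then extract, into a staging directory rather than over the live site.
mkdir -p "$WORKDIR/staging"
tar -xzf "$WORKDIR/backup.tar.gz" -C "$WORKDIR/staging"
echo "restore OK"
```

Extracting into a staging directory also means a failed restore doesn’t leave the live site half-overwritten.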
Once we finally got the site restored, we attempted to restore one of our cloud backups to get as close to our previous live site as we could. Sadly, restoring the cloud backup ended up corrupting our database, requiring us to wipe the database and reinstall the original backup again. Each new restore meant sitting and babysitting the process for a whopping four hours.
So now an additional eight hours had been wasted just trying to restore a working backup. But finally, it was done. Things were no longer crashing. All was good in the world!
And Now We’re Here
PSX Extreme is back online. Things are not fully stable quite yet, but at the very least, we’re functional. We can once again contribute content to our site, and all core functionality is good to go.
And yet, things are still moderately unstable. We’re slow, and several visual bugs and glitches have yet to be fixed as of this writing. But at least we’re back online, right?
I would like to thank everyone for your patience. Restoring PSX Extreme was no easy task, even if it was supposed to be one on paper.
To help ensure this never happens again, we have implemented a new caching method on our site, which should speed things up rather significantly. Beyond that, we are also going to create full cloned copies of the entire public directory every 24 hours, to more or less match our cloud-based backup services.
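The daily clone plan could look something like the following sketch. Again, the paths, the date-stamped naming, and the seven-day retention are illustrative assumptions, not our real configuration, and the dummy file stands in for the real public directory.

```shell
#!/bin/sh
# Sketch: date-stamped daily clone of the public web directory.
set -eu

WORKDIR=$(mktemp -d)
PUBLIC="$WORKDIR/public_html"   # stand-in for the real web root
CLONES="$WORKDIR/clones"        # stand-in for local clone storage
mkdir -p "$PUBLIC" "$CLONES"
echo "front page" > "$PUBLIC/index.html"

# Full copy, preserving permissions and timestamps.
STAMP=$(date +%Y-%m-%d)
cp -a "$PUBLIC" "$CLONES/public-$STAMP"

# Prune clones older than 7 days (retention window is an assumption).
find "$CLONES" -maxdepth 1 -type d -name 'public-*' -mtime +7 -exec rm -rf {} +

echo "cloned to $CLONES/public-$STAMP"
```

Because each clone is a plain directory rather than a compressed archive, restoring from it is just a copy back, with no multi-hour restore command to babysit.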
We are also going to rely a lot less on remote cloud backups, considering that those have, thus far, been of little real value. This was supposed to be our most secure and most reliable method of backup and restoration, but instead it proved the least reliable of the bunch.
We will also be looking into hosting our website on a different hosting network. Right now, we run our own servers and more or less provide and do everything ourselves. This is fine when it works, but as we just discovered, it’s a real pain in the ass when things hit the proverbial fan.
All in all, we’re back online. Hopefully for good this time around.