LinkaGoGo’s availability
You may have noticed that we had some downtime early last week. The server shut down on Sunday morning and would not reboot.
Once we noticed this, we immediately switched to our backup server, which always has the previous night’s backup ready to go. But it turned out the software on it wasn’t configured correctly, and it took a couple of hours to fix the configuration before we could start the backup server.
After it had been running for a couple of hours, some members reported strange behavior, such as unfamiliar links appearing on their pages. This turned out to be caused by a buggy (newer!) version of the application server we were using. So we had to fall back to an earlier version of the software, and then everything started working correctly (but more slowly) on the backup server.
In the meantime our official server was able to reboot again, and after some checks and a couple more test reboots it continued to behave normally. On Tuesday morning we were back on our official server.
The main lesson I learned from this is that these were two very stressful days, and I don’t intend to go through many more of them. In the 5 years we have been operational, this was our second major outage. (The previous one was in August 2004; I was in Paris at the time.)
So, based on this experience and the feedback I got, I’m going to take the following actions:
- As soon as an outage occurs, announce the problem on the announcement forum, and then start working on it.
- Make an uptime report available that shows the status of the website and some uptime statistics. This report is generated by Hyperspin, a third-party monitoring service, which checks every minute whether the site is still up.
- Improve our hardware health monitoring, which will allow us to do preventive maintenance.
- Introduce more redundancy into our service infrastructure so that we do not rely on a single server. This will improve uptime and, if we do go down, should minimize the downtime. We are investigating different service redundancy options, which will hopefully give us some performance advantages as well but will require additional investment.
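For the curious, here is a rough idea of how an uptime statistic can be derived from once-a-minute checks. This is only a sketch for illustration, not how Hyperspin or our own report actually computes it; the function names and the sample day of data are made up.

```python
from typing import Sequence


def uptime_percent(checks: Sequence[bool]) -> float:
    """Return the share of successful checks as a percentage."""
    if not checks:
        raise ValueError("no checks recorded")
    # True counts as 1 and False as 0, so sum() is the number of good checks.
    return 100.0 * sum(checks) / len(checks)


def downtime_minutes(checks: Sequence[bool], interval_minutes: int = 1) -> int:
    """Estimate downtime by counting failed checks times the check interval."""
    return interval_minutes * sum(1 for ok in checks if not ok)


# Hypothetical day: 1440 one-minute checks, 36 of which failed.
day = [True] * 1404 + [False] * 36
print(f"{uptime_percent(day):.2f}% up, {downtime_minutes(day)} min down")
```

With one check per minute, each failed check counts as roughly one minute of downtime, so a short blip between two checks can go unnoticed.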
With actions 1 and 2 implemented, and actions 3 and 4 being worked on over the next few weeks, you should expect fewer disruptions in the service and, if there is a disruption, be better informed. It will also give me more peace of mind.
Since I’m concentrating on implementing these actions, the beta program for the Webservice API and LinkaGoGo Organizer will be extended by 2 more months.
GoGolian
P.S. Outages such as this are one of the reasons we implemented the weekly automatic email of your backup. (You can find this premium feature under Options/Account.)
