November 4th, 2009
Let's talk about the DFW power outage...
As many of you with Slices in our Rackspace DFW Datacenter are painfully aware, there was a power outage in the early hours of Tue, 3rd Nov.
Firstly, it goes without saying that I personally apologise for the outage and I want to explain what happened. Although this will not change your experience, I hope it will go some way to easing concerns about a repetition of last night.
Secondly, it has taken much of today for me to get accurate information. For those that know me, you know I am a straight talker (to say the least) and I wanted facts before I approached you with details.
So, what happened?
At around 12:29am CST this morning, the DFW data center experienced a power outage during routine (non-impacting) maintenance. Clearly, this non-impacting maintenance became impacting and many Slicehost customers in the DFW DC experienced downtime.
Power was restored within five minutes and most Slices were up and running in a good timeframe. However, some of you did experience lengthy delays while we restored your Slice. We had the Systems team working from the moment we knew something had occurred and they did not stop until everyone was up and running again.
The issue was further complicated by internal DNS issues which were caused by the unexpected power outage.
However, this does bring up a couple of specific points I would like to discuss:
We did not post a notice about this maintenance. Until now, we have restricted notification of maintenance windows to those that might, under normal circumstances, have an impact. We didn't expect this period to be impacting and so we did not post a notice.
This turned out to be very wrong and, as a result of this, I will be posting more routine maintenance warnings on our status blog.
The second point is that I feel our communication was lacking in a couple of areas. We did not keep you informed as regularly as we should have done. Even if there was no specific news, I feel regular updates are essential so you know we are dealing with the issue. I have already changed our procedures and there is a clearer, more defined, route for communication.
I make no excuses for what happened. It reflects on us badly and it affected you negatively. This is not something I find acceptable and I will continue to work to provide the best service for you.
I know you will be concerned and if you have any questions then please email us or come to our chatroom.
Paul
November 4th, 2009 at 01:34 AM Katharine Osborne
Wow, thanks for the sincere humility :-) I’m used to dealing with impersonal monoliths of companies.
November 4th, 2009 at 01:35 AM Matt Tessar
Thanks for the information. I know how hard it can be when things like this happen. We have a number of our clients using slicehost, and it was a bit of a surprise when things broke down.
Can you make sure to notify people via email next time something like this happens as soon as it happens? I knew something was wrong from our monitoring software, and because I have worked with you all long enough I knew you would be on it. If I were a newer customer with less experience I might have been more worried, so that’s where the communication would have come in handy.
November 4th, 2009 at 01:38 AM Daniel
Excellent post. Thanks for the information, Paul.
November 4th, 2009 at 01:43 AM Ben S
I was, by a fluke, awake and monitoring my slice when it went down. I thought you guys actually did a pretty good job letting people know you were working on it quickly, and providing updates.
The only part I’ve really found lacking, in fact, was this post. I’d like to hear more about precisely what happened, how it happened, and why it won’t happen again in the future.
November 4th, 2009 at 01:43 AM Tim
Just out of curiosity, what was the routine maintenance that was being performed at the time?
November 4th, 2009 at 02:08 AM Garry
Did other companies (not related to Slicehost or their customers) in the DFW data center also experience the outage? That is, did the data center screw up?
November 4th, 2009 at 05:24 AM Zviki
Thanks for the post.
It is still unclear why the power outage impacted the data center. Clearly you should have some combination of UPS & emergency generators, so why didn’t those kick in to save the day?
November 4th, 2009 at 05:30 AM Chris
Thanks for the honesty. Love your level of service! Keep up the hard work!
November 4th, 2009 at 12:40 PM Jason Palmer
Thank you for the post. The humble and sincere message is appreciated.
However, two things happened yesterday that were concerning to me. Firstly, the chatroom was completely overwhelmed and the techs did a pretty terrible job at answering people’s questions. I understand a lot was going on and they were prioritizing, but the customer service was simply abysmal.
Secondly, when a major outage like this happens, I’d prefer (as I’m sure others would) a courtesy TEXT MESSAGE. The technology is there to make this happen. When an outage happens at 1am, an email isn’t good enough.
November 4th, 2009 at 01:25 PM Francisco
Well… it’s all good now. I had a Media Temple server outage that lasted 52 hours. That was BAD service! Not comparing, though. This is just a thumbs up message, I know how frustrating it can be when things go wrong and everyone falls down on you.
Keep up the excellent hosting service.
November 4th, 2009 at 04:35 PM PickledOnion
Hi,
There are some good questions here and I want to answer them.
The power outage was in the DFW DC itself and was during a routine maintenance. It did affect other Rackspace products (Cloud Servers and Mail). Again, this does not make it ‘better’ for Slicehost customers who were affected but it does show that every resource at Rackspace is on this issue and is working to prevent a repeat.
We did fall down on communicating with you.
Communication will improve with more notices of routine maintenance periods. These notices will be posted on the status blog. We will also improve the frequency of updates to a blog post during an incident.
We will also increase the frequency of updates via our Twitter accounts (@slicehost and @slicehoststatus).
I am taking this whole issue very seriously and I will be making every effort to improve our communication with you.
Paul
November 6th, 2009 at 08:32 PM Kevin DeGraaf
I’m with Ben and Zviki on this – how, precisely, did power get interrupted, why didn’t the backup systems function properly, and what assurances do we have that the root cause(s) is/are being addressed?
November 7th, 2009 at 02:09 AM Mark
I thought the incident was actually handled well but if procedures can be tightened, it’ll only make Slicehost more attractive. Aside from that incident, I’ve had perfect uptime, so thank you all for the work you put in to keep the service running.
To those people requesting email or text message alerts of downtime:
I’m not a member of Slicehost staff but I wouldn’t imagine email and text notification of incidents would be particularly helpful. By posting updates on the status blog and Twitter, there is sufficient coverage of updates without creating too much additional, distracting work during incidents for staff. More importantly, issuing updates in central locations ensures any discussions about the incident are available to all.
If anyone does wish to receive email or text notifications of downtime, Pingdom.com are offering a free account with one site monitor. I’m not affiliated with their service either but have found the free monitor to be very useful. Their monitors have a minimum resolution of 1 minute and email notifications are free. Text notifications obviously cost but if these are important, it is probably worthwhile to try the Pingdom service.
Ultimately, if a website is that mission-critical, I would want my own site monitors and failover plans. Slicehost even provide a tutorial on how to set up Heartbeat.
November 7th, 2009 at 06:33 PM Sylvain
This was the weirdest thing ever, I rarely log into my slice but moments before the outage I was in via SSH. I typed ‘uptime’ and contemplated 169d of uptime IIRC. Within 20 seconds the slice was down and so was manage.slicehost.com, I eventually found the notice and just couldn’t believe it! I’m sorry everybody if I somehow jinxed the DC :-)
November 12th, 2009 at 06:00 PM Grant
While I’m a little miffed about the loss of my 2 year uptime, the fact that my slice was up for 2 years is evidence of slicehost’s (normally) rock-solid reliability.
And maybe it’s a blessing in disguise – now I can upgrade my linux kernel. 2.6.16 is getting a bit long in the tooth.