February 16th, 2007
Outage Notes
Technical Nitty Gritty
As scheduled, there was backbone maintenance at our NOC overnight, and all went as planned (minor network flaps during the window). Just before the outage, a core router upstream from us began to have trouble and ultimately stopped routing. Normally this wouldn’t be a problem: everything would fail over to a secondary core router. However, the primary’s BGP sessions to both the failover router and Verizon stayed up, so the secondary took no action. Bouncing the BGP sessions had no effect. The fix required removing some routes and restarting the device (earlier bounce attempts had failed).
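For the curious, the failure mode above can be sketched in a few lines. This is an illustration only: `RouterStatus`, the two decision functions and the notion of a "forwarding probe" are hypothetical names made up for this example, not the actual router configuration or anything the routers themselves run.

```python
from dataclasses import dataclass


@dataclass
class RouterStatus:
    """Hypothetical snapshot of what a monitor can observe about a router."""
    bgp_session_up: bool       # control plane: is the BGP session established?
    forwarding_probe_ok: bool  # data plane: did a test packet make it through?


def should_fail_over_session_only(primary: RouterStatus) -> bool:
    """Fail over only when the BGP session drops."""
    return not primary.bgp_session_up


def should_fail_over_with_probe(primary: RouterStatus) -> bool:
    """Fail over when the session drops OR traffic stops being forwarded."""
    return not primary.bgp_session_up or not primary.forwarding_probe_ok


# The state described in this post: sessions established, no traffic routed.
stuck_primary = RouterStatus(bgp_session_up=True, forwarding_probe_ok=False)

print(should_fail_over_session_only(stuck_primary))  # False
print(should_fail_over_with_probe(stuck_primary))    # True
```

A decision based only on session state sees a healthy primary and does nothing, which matches what happened here. Checking whether traffic is actually being forwarded is one general way to catch a router that still looks alive to its peers; whether it suits this particular network is a separate question.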
Post Mortem
We cannot apologize enough for this outage – it is simply unacceptable. Jason and I want Slicehost to be an infrastructure for our customers to build upon. Downtime does not factor into that vision. We’re working on changes which we hope will simplify networking, give us more control and improve reliability at the same time. An offsite server to handle communication during an outage will be in place as soon as possible. It was certainly frustrating for those outside of the chatrooms to go for an hour without updates. We’ll let everyone know how to reach this site once it is online.
Slicehost has experienced unbelievable growth and built an amazing community since we launched last year. Our customers have high standards. They demand technical efficiency, transparency and communication. We are devoted to serving this community and meeting its changing needs. Thanks so much for choosing our services; we know you have many options. Please let us know if you have any questions.
PS – We really appreciate the words of encouragement sent our way by several folks who have been in similar situations themselves.
March 17th, 2008 at 10:00 PM Jared Kuolt
You guys rock. Thanks so much.
March 17th, 2008 at 10:00 PM A. Shaw
Kudos for the way you guys handled this. I hopped onto your IRC channel when I noticed the outage, and was pleased to find Slicehost’s dynamic duo, offering total transparency into what was going on.
March 17th, 2008 at 10:00 PM Helena
I really appreciate your openness and I am glad to read that you are going to add an offsite server to handle communication during outages.
March 17th, 2008 at 10:00 PM Mars
I ♥ Slicehost.
March 17th, 2008 at 10:00 PM Carlos
Thank you so much for offering live support via the chat service. All I needed to know was that you were working on fixing the issue. I love your transparency.
March 17th, 2008 at 10:00 PM Scott Meade
Thanks guys. I love Slicehost because it feels like a family, not a giant 1,000-person customer service org. Keep up the good work.
March 17th, 2008 at 10:00 PM Tristan
A few hours of downtime is not that big a deal, especially the way you handled it. I’m sure you learned from it and it will help build an even more reliable network for the future. :)