September 5th, 2006
Our first outage :(
First – an apology to all of our customers. We experienced 30 minutes of downtime today, around 1pm. That’s unacceptable and we’re working hard to make sure it doesn’t happen again.
Now here’s what happened:
- 1245 CST: Slicehoster Matt T. pops into the chatroom and asks if he is the only one who can’t reach his box. I sarcastically respond yes, about 15 seconds before our phones start beeping and the alarms sound. Predictably, all heck breaks loose.
- 1247 CST: I call the NOC. We recently doubled bandwidth, but were having trouble seeing the increase on our side. A NOC engineer was plugged into our backbone switch to investigate. I take off in a sprint for the NOC (tall programmers are not the most graceful runners), while Jason and others man the phone/chatroom to relay updates.
- 1252 CST: Huffing and puffing, I arrive at the NOC to discover the backbone router is off. The engineer admits he may have bumped it, or maybe it failed. Power is cycled. But this doesn’t look good: a bump shouldn’t put a router on the fritz.
- 1257 CST: After a few failed reboot attempts, it becomes apparent something else is very wrong. More investigation via the console points to a memory problem. We reseat all of the memory in the device.
- 1307 CST: More reboot attempts after reinstalling the router lead nowhere. We cut over to the emergency backup device, which requires upstream changes.
- 1315 CST: Everything is back online.
- The router did indeed have a bad memory chip. This prevents it from booting. A replacement is on the way.
- HSRP, scheduled for later this month, was not yet in place. This falls on us.
Again, we apologize for the outage; it should not have happened. Something got fried and we didn’t have the proper failovers in place. This will be remedied soon. In the meantime, everyone is back online and there shouldn’t be any noticeable difference. And the network should feel faster ;). Please contact us with any questions, and thanks again for your support and words of encouragement in the chatroom.
March 17th, 2008 at 10:00 PM pedro figueiredo
although the outages are “not a good thing” (tm), this kind of openness is. i hope you continue this policy.
oh, and one other thing: hsrp is a cisco proprietary protocol. would you consider using vrrp?
March 17th, 2008 at 10:00 PM matt
Yes Pedro, VRRP should be used. I use the terms HSRP and VRRP somewhat interchangeably, but open is always the winner in our book.