In a nutshell, our engineering team moved everything over to Amazon Web Services, with one developer working overnight despite a tree falling through his roof.
But here’s the longer version, passed on by BuzzFeed’s director of technology, Mark Wilkie.
Two key things helped BuzzFeed recover: After Irene, BuzzFeed commissioned an off-site data center that replicates everything in near real time. More recently, the site started using Akamai to cache content. That means that when Datagram was offline, the site and its pages should have stayed up — and many did.
Unfortunately, the cascading catastrophe at Datagram made getting the site back up fairly complicated. After its initial outage, Datagram managed to restore power in spurts, which actually made things worse: our servers kept restarting, stopping, crashing, and throwing disk errors, so Akamai could only pull incomplete, broken versions of the site to cache. Afterward, with the entire Datagram network down (rather than just BuzzFeed’s servers having crashed), Akamai essentially wasn’t getting the correct error code, so it wiped the good cached copies it had been serving, despite efforts during a brief moment of uptime to keep it from doing so.
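For the technically curious, here is a minimal sketch of how a cache in front of a flapping origin can end up in that state. It is not Akamai’s actual logic, which the account above doesn’t spell out; the `EdgeCache` class, the `scripted_origin` helper, and the rules for each status code are all assumptions made for illustration. The idea: a cache that serves its last good copy on a server error will still overwrite that copy when the origin answers "OK" with a broken page, and will drop it entirely when it sees a response it treats as authoritative.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical origin response: (status_code, body).
# A status of None stands for "origin unreachable".
Response = Tuple[Optional[int], str]


def scripted_origin(responses: List[Response]) -> Callable[[str], Response]:
    """Return a fake origin that replays the given responses in order."""
    queue = list(responses)

    def fetch(url: str) -> Response:
        return queue.pop(0)

    return fetch


class EdgeCache:
    """Toy edge cache; the status-code rules here are assumptions."""

    def __init__(self, fetch_origin: Callable[[str], Response]) -> None:
        self.fetch_origin = fetch_origin
        self.store: Dict[str, str] = {}

    def get(self, url: str) -> str:
        status, body = self.fetch_origin(url)
        if status == 200:
            # A 200 is trusted as-is, so a truncated page from a crashing
            # server replaces the good copy.
            self.store[url] = body
            return body
        if status is None or status >= 500:
            # Unreachable origin or server error: fall back to the stale copy.
            return self.store.get(url, "error")
        # Any other status is treated as authoritative: evict the cached copy.
        self.store.pop(url, None)
        return "not found"


origin = scripted_origin([
    (200, "full page"),   # healthy origin
    (503, ""),            # origin down: stale copy is served
    (200, "<html><bo"),   # power flickers back: broken page is cached
    (503, ""),            # down again: the broken page is now the "stale" copy
    (404, ""),            # misleading status: cached copy is evicted
    (503, ""),            # nothing left to fall back on
])
cache = EdgeCache(origin)
results = [cache.get("/") for _ in range(6)]
```

In this toy version, the good page survives the first outage but not the partial recovery, which is roughly the sequence described above.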
Around midnight, when it became clear that Datagram wouldn’t be up again anytime soon (at one point it had been reported that there was 20 feet of water in the basement), the decision was made to rebuild everything on Amazon’s cloud, where we already host all of our static content.
From there, the developers moved everything from the hot-replicating data center to Amazon. Barring a scary moment when the transfer hung for three minutes at 70 percent, it went smoothly. Wilkie and his team began configuring a hardware setup that mirrored our system at Datagram, server for server. He and his team, along with director of product Chris Johansen, worked through the night and into the morning; one developer, Eugene Ventimiglia, kept working after a tree came through his roof. Another, Raymond Wong, got flooded out and knocked offline. And yet another developer, Andy Yaco-Mink, came online around 8:30 this morning.
All that was lost was a small percentage of the data collection that was happening around 7 p.m. last night. Everything else from Datagram is now up and running at Amazon.
You might wonder why, of course, BuzzFeed, Gawker, and others aren’t already all aboard the cloud train, ready to switch to different servers at the drop of a hat.
The fact is, Amazon’s cloud service and others like it weren’t around when BuzzFeed and sites like Gawker and Huffington Post were architected years ago. If the site were built today, the architecture might look a bit more cloud-like, rather than relying on a huge data center based in downtown Manhattan.
Still, there are few guarantees — Amazon’s had several high-profile outages in the last year. When the developers get a little sleep, they say, they’ll turn to figuring out how to avoid letting this happen again.