Last week, we had an outage on our systems. The reason: our DNS provider, PointDNS, was under a DDoS attack. This is the second time this year our systems have faced a serious outage, so here's what we've learned.
The web today is a network of brittle, interconnected systems: DNS, physical servers, virtual servers, databases, web servers, database servers, proxy servers, load balancers, CDNs and many more. Keeping all of this in check can be a difficult task.
First off - why bother with server management at all?
We often provide setup, hosting and support for our clients. A lot of people in this business ask us why we do it. Why do we want to take on the risk of hosting? They usually say, "Why don't you just deliver the project and have the client take care of the hosting and maintenance?"
The reason is - we're aiming for top-tier clients - and these clients don't want to deal with hosting, server setup and maintenance. They want a turn-key solution, something that just works. They want someone who'll solve their problems, not just create more.
If you're not familiar with what a distributed denial-of-service (DDoS) attack is, here's a short explanation. Basically, someone (we'll refer to them as hackers in this article, but you can also call them fucking bastards) gets a hold of a lot of computing power (either by purchasing it or by infiltrating servers all over the Internet).
They use that processing power to flood your servers with excessive traffic to the point that your servers are rendered unusable.
Imagine if you were trying to run a bar that accommodates 100 people. And someone shoved 200 non-paying customers through the door, while another 200 waited in front of the bar, all this basically preventing your paying customers from entering. If this were to continue for long enough, and you didn't have a way to get them out of the building, they would basically drive you out of business.
Typically, these cyber criminals try to extort money in exchange for stopping the attacks. I don't know if this was the case with PointDNS or if someone attacked them out of pure spite, but I would guess it was extortion.
Crisis management protocol
To handle these situations, we decided to develop a crisis management protocol. The protocol is very simple and has five main steps:
- Centralize communication - this is the most important part. Usually, when things start to fall apart, chaos starts inside your organization. Everybody scatters and tries to fix what they can, often without coordination. A centralized communication channel helps keep this chaos under control.
- Assign a crisis leader - this person should be in charge of getting you back on track. Usually, that's someone experienced who knows your architecture best and can make decisions.
- Notify customers - leaving your customers in the dark is the worst thing ever. If you don't keep them in the loop, they are left wondering. Maybe all their data is destroyed? Maybe you went bankrupt? How should they know what happened if you don't notify them? Also, be very upfront and honest with them. Don't sugarcoat information. They'll also probably want an ETA (which is always problematic to hand out in situations like this), so you'll need to communicate something here.
- Prioritize - time is critical, so figure out what's most important. You should have a list of all components/servers somewhere anyway (this is very important). Go over that and see what needs to be up and running ASAP. For example, once the DNS outage hit, our e-mail dropped. That was the first thing we needed to get up and running.
- Mitigate problems - if you can't fix problems, figure out a temporary solution. Move stuff to temporary hosting. For us, that was purchasing a DNSimple account and moving our name servers there. We were up and running in an hour.
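The prioritization step above depends on having that list of components ready before anything breaks. Here is a minimal sketch in Python of what such an inventory could look like. The component names, priorities and dependencies are hypothetical placeholders, not our actual setup, and `restore_order` is just an illustrative helper name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    priority: int          # 1 = restore first
    depends_on: frozenset  # names of the services this component needs


# Hypothetical inventory: every name and priority here is illustrative only.
INVENTORY = [
    Component("marketing-site", 3, frozenset({"dns", "cdn"})),
    Component("email", 1, frozenset({"dns"})),
    Component("client-apps", 2, frozenset({"dns", "database"})),
    Component("internal-wiki", 4, frozenset({"database"})),
]


def restore_order(inventory, failed_service):
    """Return the names of components hit by a failure, most urgent first."""
    affected = [c for c in inventory if failed_service in c.depends_on]
    return [c.name for c in sorted(affected, key=lambda c: c.priority)]


print(restore_order(INVENTORY, "dns"))
# prints ['email', 'client-apps', 'marketing-site']
```

Even a list this small answers the first question the crisis leader will ask: what is affected, and what do we bring back first?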
In my career, I have seen a lot of DDoS attacks and they usually last for a long time. So once we started the rescue mission, the first decision was not to wait it out, but to move stuff to a new DNS provider ASAP.
PointDNS did manage to get their system up and running in around 4 hours. They got a lot of bad rep on Twitter, with people threatening to leave and all sorts of bad things.
This isn't very nice in my book. Would you stop going to a local bar because it got extorted by the mafia and got all its windows broken? Would you blame the owners, hard-working people that are just trying to make a living?
I think the PointDNS guys did a great job and we don't plan to move away from them because of this incident. What we're doing now is setting up replication from PointDNS to DNSimple so we'll have a redundant DNS provider in place if this sort of thing happens again.
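A redundant provider only helps if both providers serve the same records. As a rough sketch, here is one way to check for drift between the two zones in Python. It assumes the records from each provider have already been exported into plain `(name, type, value)` tuples by whatever means each provider offers. It calls no PointDNS or DNSimple API, and the function name and sample records are made up for illustration.

```python
def diff_zones(primary, secondary):
    """Compare two zones given as iterables of (name, type, value) tuples."""

    def normalize(records):
        # Names and types are case-insensitive and a trailing dot on the
        # name is optional; values are only stripped of whitespace because
        # some of them (TXT, for example) are case-sensitive.
        return {
            (name.lower().rstrip("."), rtype.upper(), value.strip())
            for name, rtype, value in records
        }

    p, s = normalize(primary), normalize(secondary)
    return {
        "missing_on_secondary": sorted(p - s),
        "extra_on_secondary": sorted(s - p),
    }


# Illustrative records only (example.com and 192.0.2.10 are reserved for docs).
primary = [
    ("example.com.", "A", "192.0.2.10"),
    ("www.example.com", "cname", "example.com."),
    ("example.com", "MX", "10 mail.example.com."),
]
secondary = [
    ("example.com", "A", "192.0.2.10"),
    ("www.example.com", "CNAME", "example.com."),
]

print(diff_zones(primary, secondary))
```

In this example the MX record would be reported as missing on the secondary, which is exactly the kind of gap that would take e-mail down again after a failover.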
I'm also very proud of my team and how they managed to react promptly. I would also like to issue an apology to all our customers. While I can't promise that we'll never have problems again, I can assure you that we'll continually work to improve our service. We've gotten better at handling these issues, but there is still a lot of room for progress. As time goes by, we'll mitigate these problems faster and in a more efficient manner.