It was the best of DNS times, it was the worst of DNS times… ( May 2013 )
In the beginning, Al Gore and a team of computer pioneers decided to create the internet, and on the 4th addressing try the current method of IP was born. It quickly became obvious that telling your friends to visit 188.8.131.52 would keep the internet club esoteric to just us nerds, and from that need DNS emerged. DNS is a simple concept: roughly, you ask the root name servers who the authority for a domain is, then you ask that authority what the domain translates to. Since most requests ( at the time ) went to the root name servers, bombarding them with packets, decisions had to be made so that users would get a timely response. The solution? Building DNS on the UDP protocol. While the choice at the time seemed obvious ( low latency, connectionless, smaller packet headers, no connection state bottlenecks, low CPU consumption ), it is now the reason why I sit here writing an article after a week of sleepless nights. Before you make any rash assumptions, the problem was not your typical amplification attack, simply a HUGE increase in DNS questions. At worst we were seeing about 5MBps of traffic in and out, with responses being sent to no consistent range of destinations.
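If you want to watch that root-then-authority handoff yourself, dig's trace mode walks the delegation chain in exactly that order ( the name below is just a placeholder ):

# start at the roots, follow the referral to the TLD servers, then ask the
# authoritative server for the final answer
dig +trace example.com A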
To convey the problem I first need to explain the importance of DNS to our business model. We perform authoritative answering for any question sent to us, but do not worry, we are not malicious! This particular use case is an advertising network for domain parking. SLD ( second-level domain ) owners point millions of domains to our DNS servers, which in turn send back responses with varying IP addresses. With such a wide variety of traffic, the DNS back-end must be able to cope with huge amounts of requests of all types regardless of potential maliciousness from the source. Simply put, we cannot pick and choose whom we respond to, as all traffic is vital. Our normal qps ( questions per second ) before the jump was around 500, with peaks of about 1,000. Once we knew it was not an amplification problem, we were able to start thinking about how to increase capacity.
At the time we were sitting on top of a strong authoritative infrastructure that had chugged along, well overpowered, for years. The setup was a Netscaler load balancer and, behind it, 4 physical 8-core DNS servers. At the 7th layer was PowerDNS utilizing a MySQL backend.
The first step in developing a solution was to establish a solid baseline of the problem, with an emphasis placed on response times. Anything over a 500ms response time was considered a failure for the baseline, and no response at all was considered disastrous. Failure before the attack typically occurred around 1.5k qps ( questions per second ), with disastrous results at around 2k qps ( per server ). With 4 physical servers in rotation at the time, we were capable of handling about 6k qps while being crippled around 8k qps. To determine these results, we used an application called ‘queryperf’, which is part of the contrib section of BIND, and created 250k random-length domains using available TLDs (http://data.iana.org/TLD/tlds-alpha-by-domain.txt). The python program ‘gen-data-queryperf.py’, also part of the queryperf contrib, created the domains for us.
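For anyone wanting to reproduce the test data without digging the script out of the BIND contrib tarball, a rough shell equivalent is below; it emits the one-question-per-line ( name, then qtype ) format queryperf expects. The counts and label lengths are just illustrative.

# grab the current TLD list and lowercase it, skipping the comment header
curl -sO http://data.iana.org/TLD/tlds-alpha-by-domain.txt
grep -v '^#' tlds-alpha-by-domain.txt | tr 'A-Z' 'a-z' > tlds.txt

# emit 250k random-length labels under random real TLDs, one "name qtype" per line
awk 'BEGIN { srand() }
     { tld[NR] = $1 }
     END {
         for (i = 0; i < 250000; i++) {
             len = int(rand() * 20) + 3
             name = ""
             for (j = 0; j < len; j++)
                 name = name sprintf("%c", 97 + int(rand() * 26))
             print name "." tld[int(rand() * NR) + 1] " A"
         }
     }' tlds.txt > 250k_domains.txt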
The day of the question increase, we spiked north of 50k qps, and as noted above we were only capable of responding to 6k qps before crossing our 500ms average and 8k qps before we stopped responding at all ( servfail ).
To quickly mitigate the problem, we decided to use the Netscaler's query cache and placed PowerDNS caching servers in front of the PowerDNS authoritative servers to help respond to requests. While this bought us another 10k qps in capacity, the solution was clumsy and far from ideal, and we were still encountering some packet loss coupled with sporadic load times. Searching the web, we quickly noticed that we were not alone and that other large DNS providers were experiencing a similar traffic increase.
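I will not reproduce the exact stopgap configuration here, but for the curious, a caching PowerDNS Recursor fronting an authoritative pool can be sketched in a few lines, assuming the recursor is told to forward the root zone at the authoritatives; the addresses are placeholders, not our production ones.

# /etc/pdns-recursor/recursor.conf ( sketch, placeholder addresses )
# forward-zones sends questions upstream WITHOUT the recursion-desired bit,
# which is what you want when the upstreams are authoritative servers
forward-zones=.=172.21.9.97;172.21.9.98
local-address=0.0.0.0
allow-from=0.0.0.0/0    # we answer the whole internet, so no ACL here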
With a stopgap in place, we began working on a way to handle the increased workload and also be able to scale quickly in case of further spikes. Luckily, we had already built a new environment, as we had been wanting to upgrade and replace the physical servers with virtual ones and move to the most recent CentOS and PowerDNS releases. When we tested the new setup, we quickly and somewhat nervously watched the same results of 1.5k qps before failure coming through queryperf’s emotionless stdout. As this product was inherited, I was not familiar with the other backend options available to PowerDNS, and upon reading the documentation I saw that it supported BIND-style zone files. Being very well versed in MySQL, I knew that data reporting was something MySQL was well suited for, but the SQL query overhead was probably costing us significant CPU cycles, as sar was showing high amounts of context switches. Immediately I began converting our DNS records into flat text files.
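For reference, pointing PowerDNS at BIND-style flat files instead of MySQL is roughly the two settings below plus one zone declaration per SLD; the paths, names, and addresses are placeholders rather than our actual records.

# /etc/pdns/pdns.conf ( excerpt ) - swap the gmysql backend for the bind backend
launch=bind
bind-config=/etc/pdns/named.conf

# /etc/pdns/named.conf - one stanza per parked SLD
zone "example.com" {
    type master;
    file "/etc/pdns/zones/example.com.zone";
};

; /etc/pdns/zones/example.com.zone - a minimal parked zone, wildcard included
$TTL 3600
@   IN  SOA ns1.example.com. hostmaster.example.com. ( 2013051501 7200 900 1209600 3600 )
    IN  NS  ns1.example.com.
    IN  A   192.0.2.10
*   IN  A   192.0.2.10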
A few hundred regular expressions and an hour of QA later resulted in 28MB worth of zone files ready to be questioned. With queryperf in hand, we eagerly ran our first tests.
sh-4.1# ./queryperf -d 250k_domains.txt -s 172.21.9.97 -T 50000
DNS Query Performance Testing Tool
Version: $Id: queryperf.c,v 1.12 2007/09/05 07:36:04 marka Exp $
[Status] Processing input data
[Status] Sending queries (beginning with 172.21.9.97)
[Status] Testing complete
Parse input file: once
Ended due to: reaching end of file
Queries sent: 250000 queries
Queries completed: 249982 queries
Queries lost: 18 queries
Queries delayed(?): 0 queries
RTT max: 0.265384 sec
RTT min: 0.000083 sec
RTT average: 0.000525 sec
RTT std deviation: 0.001293 sec
RTT out of range: 0 queries
Percentage completed: 99.99%
Percentage lost: 0.01%
Started at: Sun Jun 16 12:22:34 2013
Finished at: Sun Jun 16 12:22:40 2013
Ran for: 5.072967 seconds
Queries per second: 49277.276986 qps
Total QPS/target: 49285.430261/50000 qps
After tuning PowerDNS and the virtual container, I was astonished to see the per-server qps capacity increase from 1.5k to over 49k before disastrous failure. Even more remarkable, performance was MUCH more consistent, with response times averaging in the sub-1ms range ( internally of course ). After continued testing we felt confident in deploying our solution, adding 3 more of the same virtual machines to replace the old setup. The new structure was capable of handling well over 120k random questions per second ( not including Netscaler help ), and if we needed to scale out, we simply had to clone more virtual machines.
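The tuning pass mentioned above will look different for every workload, so treat the following as a sketch of the kinds of knobs involved rather than our production values.

# /etc/pdns/pdns.conf ( excerpt ) - illustrative values only
distributor-threads=8      # e.g. roughly one per vCPU on the guest
cache-ttl=60               # packet cache lifetime
query-cache-ttl=60         # record / query cache lifetime
negquery-cache-ttl=60

# and the usual UDP buffer sysctls on the guest itself
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216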
In operations, learning how to put out fires is a constant struggle, and your downtime mainly consists of dreaming up even worse fires and trying to prevent them. Deadlines have a great way of focusing the mind, and thanks to an outstanding group of peers, we were quickly able to MacGyver together a CO2 scrubber when the time came and bring our DNS queries home.