Sunday, February 3, 2013

DNS Performance and the Effectiveness of Caching
Jung, Sit, Balakrishnan, and Morris

The paper explores the current effectiveness of the DNS service through analysis of three large data sets collected in 2000.


This paper was all about understanding the three data sets the authors were able to collect: two from a vantage point on MIT's network and one on the KAIST network. I thought it was interesting to see two traces from the same network a year apart, since nearly the same demographic is present and you can watch the evolution of their use of the service. While the authors don't make much of a point of this, there was nearly a doubling of DNS lookups over an 11-month period. It would be interesting to see a longer-term study of usage patterns based on this, but I digress.

The authors have two big takeaways. First, retransmissions and zero referrals cause a substantial portion of DNS traffic. They mention that 63% of the DNS packets were generated by lookups that obtain no answer, even though those lookups are only between 3.1% and 4.9% of all lookups. Second, they look at the effectiveness of caching, noting that the popularity of requested sites follows a Zipf distribution, so there is a large portion of sites for which caching provides no benefit. Looking at the TTL values of DNS entries, they find there isn't much benefit in having a TTL longer than about ten minutes: most of the benefit comes in the first few minutes, and cache performance is not improved by a longer value. The benefit of sharing a cache between people also levels off quite quickly; after 6 or 7 people are using the same cache, there isn't much additional benefit from having another person use it.
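To make the caching result concrete, here is a rough sketch of the kind of trace-driven replay the paper performs, run against a synthetic Zipf-shaped trace rather than the MIT/KAIST data (the name count, arrival rate, and TTL values below are assumptions purely for illustration):

```python
import random

def zipf_trace(num_names=10000, num_lookups=200000, alpha=1.0, mean_gap=1.0):
    """Synthetic lookup trace: (timestamp, name) pairs whose name
    popularity roughly follows a Zipf distribution."""
    weights = [1.0 / (rank ** alpha) for rank in range(1, num_names + 1)]
    names = random.choices(range(num_names), weights, k=num_lookups)
    t, trace = 0.0, []
    for name in names:
        t += random.expovariate(1.0 / mean_gap)     # Poisson-ish arrivals
        trace.append((t, name))
    return trace

def hit_rate(trace, ttl):
    """Replay the trace through one cache where every record lives `ttl` seconds."""
    expires, hits = {}, 0
    for t, name in trace:
        if expires.get(name, -1.0) > t:
            hits += 1                               # answer still cached
        else:
            expires[name] = t + ttl                 # miss: resolve and cache
    return hits / len(trace)

trace = zipf_trace()
for ttl in (0, 60, 300, 600, 1800, 3600):
    print(f"TTL {ttl:>5} s -> hit rate {hit_rate(trace, ttl):.2f}")
```

This is only a toy: the paper replays its real traces using each record's actual TTL, but the structure of the experiment, counting hits while expiring entries after their TTL, is the same.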

The one thing that bugged me about this paper was that they didn't propose any changes to DNS to help improve performance. They had a great data set to use for simulations of changes, and only used it for the caching experiments. While that was an interesting experiment, its results are not something you can force on people: a website trying to load-balance across servers will still set a low TTL even if it knows this makes DNS caching ineffective. An experiment on modifying DNS's retransmission timeout in order to alleviate some of the redundant traffic would have had great implications.
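For what it's worth, the sort of change I have in mind is a less aggressive, exponentially backed-off retransmission timer. Here is a minimal sketch of that idea (my own hypothetical resolver-side logic, not anything proposed in the paper or taken from a real resolver; the server address and timeout values are placeholders):

```python
import socket
import struct

def build_query(name, qtype=1):
    """Build a minimal DNS query packet (QTYPE 1 = A, QCLASS 1 = IN).
    The ID is fixed for brevity; a real resolver would randomize it."""
    header = struct.pack("!HHHHHH", 0x1234, 0x0100, 1, 0, 0, 0)  # id, RD flag, 1 question
    qname = b"".join(bytes([len(label)]) + label.encode() for label in name.split(".")) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)

def lookup_with_backoff(name, server="8.8.8.8", first_timeout=2.0, max_tries=3):
    """Send the query over UDP, doubling the timeout on each retry instead of
    retransmitting aggressively at a fixed interval."""
    query = build_query(name)
    timeout = first_timeout
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for _ in range(max_tries):
            sock.settimeout(timeout)
            sock.sendto(query, (server, 53))
            try:
                response, _ = sock.recvfrom(4096)
                return response                  # got some answer (possibly an error response)
            except socket.timeout:
                timeout *= 2                     # back off before retransmitting
        return None                              # give up rather than hammer the server
    finally:
        sock.close()

print(lookup_with_backoff("example.com") is not None)
```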

This study would be interesting to see redone now. With DNS prefetching built into browsers, the number of queries per user must have exploded since this paper was published. I would expect the latency numbers to go up even further as a result.

6 comments:

  1. This paper presents a detailed analysis of traces of DNS and associated TCP traffic collected on links at MIT and KAIST, and then discusses trace-driven simulations that study the effectiveness of DNS caching as a function of TTL and the degree of cache sharing.

    By analyzing the collected data, the authors reach these results:
    1. The distribution of name popularity follows Zipf's law, so larger TTL values do little to improve the effectiveness of caching.
    2. Sharing a cache among a group of clients yields little additional gain in hit rate once the group grows beyond 20-25 members.
    3. Client-perceived latency is adversely affected by the number of referrals; caching NS records to reduce referrals decreases both latency and the load on the root servers.
    4. The distribution of names causing negative responses is heavy-tailed as well, so the hit rate of negative caching is also limited.

    Using trace-driven simulations, the authors want to find out how useful it is to share DNS caches among many client machines and what impact the choice of TTL has on caching effectiveness. They quantify two important statistics that determine cache hit rates: the distribution of name popularity and the distribution of TTL values in the trace data. They conclude that for A records, lower TTLs do not seem to harm hit rates much, since the caching of A records appears to have limited effectiveness both in practice and in potential; for NS records, however, lower TTLs would increase the load on the root servers and harm DNS scalability.
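    To illustrate the sharing experiment, a rough sketch of that replay could look like the following (with a synthetic trace, a fixed TTL, and made-up client counts rather than the paper's data); the point is to watch how the hit rate changes as more clients share one cache:

    ```python
    import random

    def shared_cache_hit_rate(trace, num_clients, ttl=300.0):
        """Replay only lookups from the first `num_clients` clients through one
        shared cache and return that group's hit rate."""
        expires, hits, total = {}, 0, 0
        for t, client, name in trace:
            if client >= num_clients:
                continue
            total += 1
            if expires.get(name, -1.0) > t:
                hits += 1
            else:
                expires[name] = t + ttl
        return hits / total if total else 0.0

    # Synthetic trace of (time, client id, name id) with Zipf-ish name popularity.
    random.seed(0)
    num_names, clients, lookups = 1000, 100, 50000
    weights = [1.0 / r for r in range(1, num_names + 1)]
    names = random.choices(range(num_names), weights, k=lookups)
    trace = sorted((random.uniform(0, 3600), random.randrange(clients), n) for n in names)

    for n in (1, 2, 5, 10, 25, 50, 100):
        print(f"{n:>3} clients sharing one cache: hit rate {shared_cache_hit_rate(trace, n):.2f}")
    ```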

    In the end, the paper concludes that the widespread use of dynamic, low-TTL A-record bindings should not harm DNS performance, and that the scalability of DNS is due less to the hierarchical design of its name space or to good A-record caching than was originally believed.

  2. What I got out of this paper was that DNS works well both for its intended use and for more unintended uses like load balancing and fault tolerance. However, close to half the requests are malformed or unnecessary. This contributes to a lot of unnecessary traffic, especially to root and TLD servers, as these malformed requests cannot be handled by a cache. Misconfiguration of resolvers and the like is inherent in a hierarchical system like DNS. It did seem as if the issues of aggressive retransmission were legitimate and addressable, though.

  3. The paper does a great job of studying the effect of caching in DNS; not to forget that exploiting caching was the major reason for shifting away from the hosts.txt file used by legacy systems.

    It also makes a good argument that keeping very high TTLs can actually affect systems adversely. This follows the operations-research principle that when a queue gets longer than a certain length, people stop joining it. Similarly, keeping TTLs above 1000 s (a bit over 16 minutes) yields little improvement and does more harm, since faster re-transmission is the need of the hour.

    But DNS uses UDP and thus has no scheme for reliable data transmission, so that is one of the reasons (if not the major one) for the non-responses. The authors did not mention this when stating their results.

  4. I would suspect that the number of queries one sees has also increased, as the number of services/pages that make use of CDNs (and therefore of hacky DNS tricks with short TTLs and the like) has grown as well, or at least the number of queries that have to go all the way to the authoritative server has.

    I guess my sense of timescale is all messed up when I think about these things. 15 minutes seems like a pretty short cache lifetime to me, but on further consideration it does seem rather long (i.e., when thinking about browsing habits). Anyone feel similarly?

  5. I looked up the Zipf distribution, since it is mentioned that web request popularity and DNS name popularity both have this sort of "heavy-tailed" distribution, in which a large fraction of the samples sits in the tail (so a log-log plot of frequency against rank is a straight line). It's interesting that this means higher TTL values only really benefit the cache hit rates of a few super-popular names, while the heavy tail isn't helped much.
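    To put a rough number on that (a back-of-the-envelope sketch with assumed values, not figures from the paper): if popularity is Zipf with frequency proportional to 1/rank, the share of lookups that go to the top k of N names is H(k)/H(N), which grows only logarithmically, so a big chunk of lookups lands on rarely seen names that caching can't help.

    ```python
    # Assumed toy numbers, not the paper's: N distinct names, popularity ~ 1/rank.
    N = 100000
    harmonic = [0.0]
    for r in range(1, N + 1):                # harmonic[k] = H(k) = 1 + 1/2 + ... + 1/k
        harmonic.append(harmonic[-1] + 1.0 / r)

    for k in (10, 100, 1000, 10000, N):
        share = harmonic[k] / harmonic[N]    # fraction of lookups going to the top k names
        print(f"top {k:>6} names: {share:.0%} of lookups; the tail gets the other {1 - share:.0%}")
    ```

    In this toy setup, even perfectly caching the top 1% of names still leaves well over a third of lookups in the uncacheable tail.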

    This distribution is a crucial link between the two papers (this and CoDoNS). Beehive pretty much derives its replication behavior by learning from and using this distribution to provide a score of popularity. This approach is general enough that I could imagine it being used in other DHTs dealing with data that has power law distribution characteristics.

  6. DNS Performance and the Effectiveness of Caching has rich data and nice results. It examined DNS lookups and caching from several different points of view and presented its results well.

    But I think they could have derived more meaningful results from their experiments, even though they did nice work. For example, KAIST's ratio of negative answers is much larger than MIT's, and there must be a reason for that, but the authors didn't even mention it. The same thing happens with the total number of iterative lookups. Moreover, a certain number of results from KAIST were simply discarded, even though this is a data set from two institutions located on nearly opposite sides of the earth, so they could have made better use of it.
