Designing a DHT for low latency and high throughput
Frank Dabek, Jingyang Li, Emil Sit, James Robertson, M. Frans Kaashoek, Robert Morris
MIT CS and AI Laboratory
As this paper points out, designing a wide-area DHT with high throughput and low latency is a challenge. The paper explores a range of solutions that existing systems employ and derives from them some new techniques, including the use of latency predictions based on synthetic coordinates, efficient integration of lookup routing and data fetching, and a congestion control mechanism suitable for fetching data striped over large numbers of servers.
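To make the synthetic-coordinate idea concrete, below is a minimal Python sketch of a Vivaldi-style spring-relaxation update; the function names, the 2-D coordinate space, and the damping factor are illustrative assumptions rather than the paper's implementation.

    import math

    def vivaldi_update(local, remote, measured_rtt_ms, delta=0.25):
        # One spring-relaxation step: move the local coordinate along the line
        # between the two nodes so that coordinate distance approaches the
        # measured round-trip time.  `delta` (an assumed damping factor) limits
        # how far the coordinate moves on a single measurement.
        dist = math.dist(local, remote)
        error = measured_rtt_ms - dist
        if dist == 0:
            direction = [1.0, 0.0]   # arbitrary direction if coordinates coincide
        else:
            direction = [(l - r) / dist for l, r in zip(local, remote)]
        return [l + delta * error * d for l, d in zip(local, direction)]

    def predicted_latency_ms(a, b):
        # The predicted RTT between two nodes is simply their coordinate distance.
        return math.dist(a, b)

    # Example: a node measures a 50 ms RTT to a node currently placed 20 "ms"
    # away, so it moves away from it to stretch the predicted distance.
    print(vivaldi_update([0.0, 0.0], [20.0, 0.0], measured_rtt_ms=50.0))

With coordinates like these, a node can estimate the round-trip time to any node it has heard about without ever having probed it directly, which is what the techniques below rely on for proximity choices and retransmit timers.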
The paper finds that existing work already makes the lookup of keys in DHTs scalable, low-latency, fault-tolerant, and secure, but has paid much less attention to achieving the same properties for reading and storing data. The paper makes the following contributions.
For low latency, the paper first compares recursive and iterative lookup algorithms. In an iterative lookup the originator waits for each hop's response before contacting the next node; in a recursive lookup each node in the lookup path forwards the query directly to the next node, and when the query reaches the key's predecessor, that predecessor sends its successor list directly back to the originator. However, if a recursive lookup elicits no response, the originator has no information about what happened or how to retry in a way that is more likely to succeed. A simple retry may sometimes work, but in the extreme case only the originator knows a problem exists. In practice, DHash++ therefore uses recursive lookups by default and falls back to iterative lookups after persistent failures.
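The contrast between the two message patterns can be sketched schematically as follows; the rpc, is_predecessor_of, closest_preceding_node, and deliver_successor_list helpers are hypothetical placeholders for Chord-style routines, not DHash++'s actual interfaces.

    def iterative_lookup(originator, key):
        # Iterative style: the originator itself contacts every hop and waits
        # for the reply before proceeding, so when an RPC times out it knows
        # exactly which hop failed and can retry around it.
        node = originator
        while not node.is_predecessor_of(key):
            node = originator.rpc(node, "closest_preceding_node", key)
        return originator.rpc(node, "get_successor_list")

    def recursive_lookup(node, key, originator):
        # Recursive style: each hop forwards the query straight to the next hop;
        # only the key's predecessor replies, and it replies directly to the
        # originator.  A silent failure anywhere in the chain is invisible to
        # the originator, which is the weakness described above.
        if node.is_predecessor_of(key):
            node.rpc(originator, "deliver_successor_list", node.successor_list)
        else:
            node.rpc(node.closest_preceding_node(key), "recursive_lookup", key, originator)

The recursive pattern saves roughly half the one-way trips per hop, which is why it is the faster default despite its weaker failure information.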
Then the paper discusses the trade-off between replication and coding. While replicated data allows low-latency reads, because the originator has many choices for server selection, coded data reduces bandwidth consumption at the expense of increased read latency: the lower bound on the total time to find and fetch a block is the time to the most distant of the closest set of fragments required to reconstruct the block.
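The latency side of this trade-off is essentially an order-statistics argument, which the short sketch below illustrates; the RTT values and the 3-of-7 coding parameters are made up for the example and are not the paper's measured numbers.

    def replication_read_latency(replica_rtts_ms):
        # With whole-block replicas, the originator can fetch from the single
        # closest copy.
        return min(replica_rtts_ms)

    def coded_read_latency(fragment_rtts_ms, m):
        # With erasure coding, the block cannot be reconstructed until the m-th
        # closest fragment arrives, so the m-th smallest RTT bounds the read.
        return sorted(fragment_rtts_ms)[m - 1]

    rtts = [12, 30, 45, 80, 95, 110, 200]        # illustrative RTTs in ms
    print(replication_read_latency(rtts))        # 12: the closest replica wins
    print(coded_read_latency(rtts, m=3))         # 45: gated by the 3rd-closest fragment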
Lastly, the paper explains that the final lookup steps can take little advantage of proximity routing: as the lookup approaches the target key, the number of candidate next hops that are closer to it shrinks, so the average latency of each remaining hop increases. However, the lookup can stop early at any predecessor whose successor list already contains enough nodes to reconstruct the block. The trade-off is data-fetch latency, because stopping early decreases the number of successors (and thus fragments) that the originator can choose from.
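A rough sketch of that stopping rule follows; the function and parameter names are illustrative, and the exact condition DHash++ uses may differ.

    def can_stop_lookup(hop_id, successor_ids, key, fragments_needed, ring=2 ** 160):
        # A hop may terminate the lookup if its successor list already contains
        # enough of the key's successors (the nodes holding the fragments) to
        # reconstruct the block.  Stopping with only the bare minimum of holders
        # leaves the originator no freedom to pick the closest fragments, which
        # is the data-fetch latency cost mentioned above.
        def clockwise(a, b):
            # Clockwise distance from a to b on the identifier ring.
            return (b - a) % ring
        holders = [s for s in successor_ids
                   if clockwise(hop_id, s) >= clockwise(hop_id, key)]
        return len(holders) >= fragments_needed

    # Example: for a block coded so that any 7 fragments suffice
    # (fragments_needed=7), a hop can answer the lookup itself only if at least
    # 7 entries in its successor list follow the key on the ring.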
In terms of high throughput, the paper introduces an integrated transport protocol called STP, which borrows many ideas from TCP but implements a different retransmit timer and retransmission policy. Because each block transfer is preceded by a multi-RPC lookup, lost packets have a large negative impact on DHash++ throughput, and under-predicting the latency of an RPC is costly to the advancement of the control window. STP uses Vivaldi latency predictions to help choose the retransmit time-out interval for each RPC. In addition, it keeps a moving average of the error in Vivaldi's prediction of each successful RPC's round-trip time, and adds a 15 ms constant to avoid retransmissions to other virtual nodes on the same host. Unlike TCP, when STP recognizes a time-out it does not re-send the RPC right away; instead it gives DHash++ a chance to re-send the RPC to a different destination, to avoid wasting time sending RPCs to nodes that have crashed or have overloaded access links.
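The timer computation described above can be sketched as follows; the moving-average weight and the use of the absolute error are my assumptions, while the Vivaldi prediction, the per-RPC error average, and the 15 ms constant come from the description above.

    class RetransmitTimer:
        def __init__(self, alpha=0.1, constant_ms=15.0):
            self.alpha = alpha               # moving-average weight (assumed)
            self.avg_error_ms = 0.0          # running average of prediction error
            self.constant_ms = constant_ms   # slack for co-located virtual nodes

        def observe(self, predicted_ms, measured_ms):
            # After each successful RPC, fold the error of the Vivaldi latency
            # prediction into the moving average.
            err = abs(measured_ms - predicted_ms)
            self.avg_error_ms = (1 - self.alpha) * self.avg_error_ms + self.alpha * err

        def rto_ms(self, predicted_ms):
            # Time-out for the next RPC to a node with this predicted latency:
            # prediction, plus typical prediction error, plus the fixed slack.
            return predicted_ms + self.avg_error_ms + self.constant_ms

On a time-out computed this way, STP then hands the decision back to DHash++, which may pick a different fragment holder rather than retransmitting to the unresponsive node.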
Overall this paper discusses a series of design decisions and options faced by DHTs. What interests me most is that the performance gains come not from optimizing any single algorithm, but from making different trade-offs in different scenarios, guided by simulations and measurements.