Sunday, January 27, 2013

design a DHT with low latency and high throughput


Designing a DHT for low latency and high throughput

Frank Dabek, Jinyang Li, Emil Sit, James Robertson, M. Frans Kaashoek, Robert Morris
MIT CS and AI Laboratory

As this paper points out, designing a wide-area DHT with high throughput and low latency is a challenge. The paper explores a range of solutions that existing systems employ, and distills from them some new techniques, including latency predictions based on synthetic coordinates, efficient integration of lookup routing and data fetching, and a congestion control mechanism suitable for fetching data striped over large numbers of servers.

The paper finds that existing work has made the lookup of keys in DHTs scalable, low-latency, fault-tolerant, and secure, but has not considered the same issues for actually reading and storing data. The paper then makes the following contributions.

For low latency, the paper first compares recursive and iterative lookup algorithms. Unlike the iterative algorithm, in which the originator waits for each response before proceeding, in the recursive algorithm each node on the lookup path forwards the query directly to the next node, and when the query reaches the key's predecessor, the predecessor sends its successor list directly back to the originator. But if a recursive lookup elicits no response, the originator has no information about what happened or how to retry in a way that is more likely to succeed. Sometimes a simple retry may work, but in the extreme case only the originator knows a problem exists. So in practice, DHash++ uses recursive lookups by default and falls back to iterative lookups after persistent failures.
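
To make the contrast concrete, here is a minimal, runnable sketch of the two lookup styles on a toy in-memory Chord-like ring. The class and function names, the identifier space, and the ring contents are illustrative assumptions of mine, not DHash++'s actual interface; a real deployment would use RPCs and finger tables rather than direct successor walks.

```python
# Toy in-memory ring: each node knows only its successor. In the iterative
# style the originator drives every hop itself; in the recursive style the
# query hops node-to-node (simulated here by a recursive call) and only the
# final answer returns to the originator.

class Node:
    def __init__(self, ident):
        self.id = ident
        self.successor = None
    def __repr__(self):
        return f"Node({self.id})"

def build_ring(ids):
    nodes = [Node(i) for i in sorted(ids)]
    for a, b in zip(nodes, nodes[1:] + nodes[:1]):
        a.successor = b
    return nodes

def in_interval(x, a, b):
    """True if x lies in the half-open ring interval (a, b]."""
    return a < x <= b if a < b else x > a or x <= b

def iterative_lookup(origin, key):
    """The originator contacts each hop itself (one round trip per hop), so
    it always knows the last live node and can retry around a failure."""
    node, hops = origin, 0
    while not in_interval(key, node.id, node.successor.id):
        node, hops = node.successor, hops + 1   # in reality: an RPC per hop
    return node.successor, hops

def recursive_lookup(node, key, hops=0):
    """Each node forwards the query directly to the next hop; only the key's
    predecessor answers the originator. Faster, but if the reply never
    arrives the originator cannot tell which hop failed."""
    if in_interval(key, node.id, node.successor.id):
        return node.successor, hops
    return recursive_lookup(node.successor, key, hops + 1)

if __name__ == "__main__":
    ring = build_ring([4, 90, 300, 1025, 4096, 20000, 41000, 60000])
    print(iterative_lookup(ring[0], 5000))   # (Node(20000), 4)
    print(recursive_lookup(ring[0], 5000))   # same answer, same hop count
```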

Then the paper discusses the trade-off between replication and coding. Replicated data allows for low-latency reads because there are many choices for server selection, while coded data reduces bandwidth consumption at the expense of increased read latency, because the lower bound on the total time to find and fetch a block is the time to the most distant of the closest set of fragments required to reconstruct the block.
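
A tiny numerical sketch of that trade-off is below, with made-up round-trip times and illustrative parameters (a few whole replicas versus fetching the m closest of n fragments); none of these numbers come from the paper.

```python
# Read-latency lower bounds under replication vs. erasure coding, using
# hypothetical RTTs in milliseconds. Parameters are illustrative only.

def replica_read_latency(replica_rtts):
    """With whole-block replicas, the reader fetches from the single
    closest copy, so the bound is the minimum RTT."""
    return min(replica_rtts)

def coded_read_latency(fragment_rtts, fragments_needed):
    """With an m-of-n code, the block is usable only once the m-th-closest
    fragment arrives, so the bound is the m-th smallest RTT."""
    return sorted(fragment_rtts)[fragments_needed - 1]

if __name__ == "__main__":
    rtts = [12, 35, 48, 80, 95, 110, 150]                      # ms, made up
    print("3 replicas:            ", replica_read_latency(rtts[:3]), "ms")  # 12
    print("need 3 of 7 fragments: ", coded_read_latency(rtts, 3), "ms")     # 48
```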

Lastly, the paper explains that the final lookup steps can take little advantage of proximity routing: as the query approaches the target node, the number of candidate next hops that are close to the current node shrinks, so the average latency of those last hops increases. However, the lookup could stop early at any predecessor whose successor list already contains enough nodes to reconstruct the block. The trade-off is higher data-fetch latency, since stopping early decreases the number of successors (and thus fragments) that the originator can choose from.
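
The sketch below illustrates that early-termination test under assumed parameters; the helper names, the toy identifier space, and the crude "follows the key within half the ring" test are simplifications of mine, not the paper's exact rule.

```python
# Stop-early test: a lookup may end at a node whose successor list already
# names enough nodes past the key to hold the fragments needed to rebuild
# the block. Identifier space and parameters are toy values.

RING = 2 ** 16          # toy identifier space
FRAGMENTS_NEEDED = 7    # m fragments suffice to reconstruct a block (assumed)

def follows(node_id, key):
    """Crude test that node_id lies clockwise after `key` (within half the
    toy ring), i.e. it is one of the key's successors."""
    return (node_id - key) % RING < RING // 2

def can_stop_early(successor_list, key, m=FRAGMENTS_NEEDED):
    """True if at least m successor-list entries follow the key. Stopping
    earlier saves lookup hops but leaves the originator fewer fragment
    holders to choose from, so expected fetch latency goes up."""
    return sum(follows(s, key) for s in successor_list) >= m

if __name__ == "__main__":
    succ_list = [5100, 5400, 6000, 6200, 7000, 9000, 9500]   # hypothetical IDs
    print(can_stop_early(succ_list, key=5000))        # True: 7 holders found
    print(can_stop_early(succ_list, key=5000, m=10))  # False: keep routing
```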

In terms of high throughput, the paper introduces an integrated transport protocol called STP, which borrows many ideas from TCP but implements a different retransmit timer and policy. Lost packets have a large negative impact on DHash++ throughput, since each block transfer is preceded by a multi-RPC lookup, and under-predicting the latency of an RPC is costly to the advancement of the control window. STP therefore uses Vivaldi latency predictions to help choose the retransmit time-out interval for each RPC. In addition, it keeps a moving average of the error Vivaldi makes on each successful RPC's round-trip time, and adds a 15 ms constant to avoid retransmissions to other virtual nodes on the same host. Unlike TCP, when STP detects a time-out it does not re-send the RPC right away, but instead gives DHash++ a chance to re-send the RPC to a different destination, to avoid wasting time sending RPCs to nodes that have crashed or have overloaded access links.
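
Here is a small sketch of that retransmit-timer logic as I read it. The smoothing weight and all interfaces (including the pick_alternative hook) are assumptions; only the "Vivaldi prediction + average prediction error + 15 ms constant" structure and the "ask the application before retransmitting" policy come from the description above.

```python
# STP-style retransmit timing (a sketch, not DHash++'s real code).

ALPHA = 0.1             # smoothing weight for the error average (assumed)
SAME_HOST_PAD = 0.015   # the 15 ms constant, so co-located virtual nodes
                        # sharing one host don't trigger spurious retransmits

class RetransmitTimer:
    def __init__(self):
        self.avg_error = 0.0   # moving average of |measured - predicted| RTT

    def timeout(self, vivaldi_predicted_rtt):
        """Per-RPC retransmit timeout: the Vivaldi-predicted RTT plus the
        running prediction error plus the 15 ms same-host pad."""
        return vivaldi_predicted_rtt + self.avg_error + SAME_HOST_PAD

    def observe(self, vivaldi_predicted_rtt, measured_rtt):
        """Fold each successful RPC's prediction error into the average."""
        err = abs(measured_rtt - vivaldi_predicted_rtt)
        self.avg_error = (1 - ALPHA) * self.avg_error + ALPHA * err

def on_timeout(rpc, dhash):
    """Unlike TCP, a timed-out RPC is not immediately re-sent to the same
    node: the application layer is asked first whether it wants to redirect
    the request elsewhere (pick_alternative is a hypothetical hook)."""
    alt_destination = dhash.pick_alternative(rpc)
    return alt_destination if alt_destination is not None else rpc.destination
```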

Overall, this paper discusses a series of design decisions and options faced by DHTs. What interests me most is that the performance improvement comes not from optimizing the algorithm itself, but from making different trade-offs in different scenarios with the help of simulations and measurements.

8 comments:

  1. As Yixi mentioned, this paper discusses various design decisions and options. It seems the authors achieve a high degree of accuracy in comparing the series of options, and the high performance of STP in comparison with TCP looks interesting, though STP is not that widespread.

    However, I think people should have focused on a more important point. Because a DHT uses key/value pairs, the search terms must correspond exactly to the expected results, and searchers cannot use wildcards (*) to speed up their searches. This is a huge flaw in searching, so the DHT's practical usage must be very restricted. I know some applications like BitTorrent already use DHTs, but I have also heard that many users complain about the inconvenience of searching. This problem should be solved urgently for DHTs to see real use, even though the performance of DHTs has improved tremendously, as the paper shows.

    Replies
    1. I honestly did not understand what you meant by "the search terms must correspond exactly to the expected results, and searchers cannot use wildcards (*) to speed up their searches", but it is assumed in this paper, and in the Kademlia paper as far as I remember, that all nodes have to contribute to data storage and retrieval. In other words, the whole set of nodes (the network) acts as a single system from an outside view. When this network needs to store data, it finds the appropriate place and stores it there, in a distributed fashion. However, in the BitTorrent case, users don't store data for others (if I am not mistaken), and this disrupts the spirit of the DHT systems discussed in this paper.

    2. Jin - it all depends on how you generate the key; there's some more recent work that shows how to generate keys based on keywords/attributes rather than content or the full attribute set (so that you can search for all objects with a given attribute value); still, it is not trivial for sure.

  2. This paper explores some design decisions for optimizing DHTs in peer-to-peer systems. DHash++ uses the Chord lookup algorithm as its base to find data. Chord is a lookup protocol that finds keys in O(log N) hops; nodes store coordinates along with IP addresses in their routing tables and successor lists. In DHash++, values are mapped to keys using the SHA-1 hash function. It stores key/value pairs (so-called blocks) on a set of servers and uses Chord together with Vivaldi coordinates. These are the original spirit of many DHTs in P2P information systems such as Kademlia. This DHT mechanism can meet the requirements of high availability, balanced load, and fast movement of data. However, it also brings up problems like network congestion and packet loss, because the data is distributed across such a large range of servers.
    In the evaluation section, it is obvious that the PlanetLab and RON test-beds used in the DHash++ implementation show lower (near half) latency than the King dataset hosts. As for application workload, DHash++ is designed for read-heavy applications that demand low latency and high throughput; the examples are SFR and UsenetDHT. In the aspect of low latency, the recursive vs. iterative design choice is very important. Iterative lookup sends the query to each successive node on the lookup path itself, so it can detect node failures, and the requesting node can predict the latencies to each of the responsible nodes without having to communicate with them first. Recursive lookup cannot do so; it just forwards the query directly to the next node, so it cannot detect failed nodes, even though it has less congestion and faster response time.

    Lastly, the authors also introduce an alternative transport protocol, STP (Striped Transport Protocol), which DHash++ uses to transport data. It transmits data directly between nodes instead of through multiple hops. Its core is a TCP-like congestion window controlling the number of concurrent outstanding RPCs. Compared to TCP, it achieves higher throughput and lower latency.

  3. Regarding the 'iterative' versus 'recursive' techniques for key lookup, it seems like a small change to the recursive technique would allow you to have all the speed benefits of recursive without the reliability drawbacks. Simply have each node along the path of the 'recursion' send a packet back to the originator, indicating that it received the lookup and is forwarding it appropriately. This way, if there is a failure along the way to the key, the originator will know of it quickly (because it has stopped receiving updates), will know the last successful node along the way, and can quickly retry by sending the lookup request straight to that node.

    Perhaps there is some reason the authors are aware of why this solution is impractical or wouldn't work. I can't think of anything obviously wrong with it, though. Or perhaps I'm overestimating the benefit that I imagine this change would provide.

    Replies
    1. Not sure what the overall advantage of acks would be; you still have to wait for and process them, so how close are you getting to iterative? The acks are sent in parallel, so there's some gain there ... maybe worth checking Dabek's thesis (search for it; he's at Google and has a link to it).

    2. I agree with John that acks used in that way would be helpful, and basically you don't need to wait for them, because the time-out is a very conservative estimate and any acks should arrive earlier than the response from the target node.

      Btw, I haven't read his thesis, but the link is here: http://pdos.csail.mit.edu/~fdabek/thesis.pdf
