Tuesday, March 5, 2013

Scatter

Scalable Consistency in Scatter
by Lisa Glendenning, Ivan Beschastnikh, Arvind Krishnamurthy and Thomas Anderson

This paper introduces a new distributed and consistent key/value storage system called Scatter.

First, the authors start from the idea of decentralized peer-to-peer systems such as Chord and Kademlia. However, these systems lack consistency, so they propose a new system that is fault tolerant, consistent, and planetary-scale. Nodes are structured into groups, and each group covers a fraction of the circular key space. Every group is managed by the Paxos protocol, which we studied earlier; "merge", "split", "migrate" and "repartition" operations are defined on groups, and every group is connected to its predecessor and successor through individual nodes. Group sizes are kept small since Paxos' performance drops for a large number of nodes. A two-phase commit protocol is used for multi-group operations: the participating group leaders must agree (commit) on the operation before the global picture changes. Storage is handled, concisely, as follows: the key space of each group is divided among the member nodes, and each node is the primary for certain keys. For each store operation, the value is therefore forwarded to the primary node in the group and replicated among the others using Paxos (see the sketch below). Another novelty is the join policy: join operations are differentiated depending on whether the goal is resilience to group failure, throughput, or latency, and the joining node is added to the most appropriate group for that need.
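To make that routing path concrete, here is a minimal, hypothetical sketch of the flow, not the authors' implementation: a client locates the group whose key range covers the key, the write is forwarded to the member acting as primary for that key, and the primary replicates it to the rest of the group. The actual Paxos machinery is hidden behind a placeholder, and all names here (Group, route_put, replicate_via_paxos) as well as the way a group's range is sliced among members are illustrative assumptions.

```python
# Hypothetical sketch of Scatter-style key routing; not the paper's code.
# Paxos replication is abstracted behind a placeholder method.
from typing import Dict, List

class Group:
    """A group of nodes covering [start, end) of the circular key space."""
    def __init__(self, start: int, end: int, members: List[str]):
        self.start, self.end, self.members = start, end, members
        self.store: Dict[int, str] = {}  # replicated key/value state

    def covers(self, key: int) -> bool:
        return self.start <= key < self.end

    def primary_for(self, key: int) -> str:
        # The group's range is divided among members; each member is the
        # primary for a contiguous slice (illustrative assignment).
        width = max(1, (self.end - self.start) // len(self.members))
        idx = min((key - self.start) // width, len(self.members) - 1)
        return self.members[idx]

    def put(self, key: int, value: str) -> str:
        primary = self.primary_for(key)
        # In Scatter the primary proposes the write and the group agrees on
        # it via Paxos; here consensus is reduced to a local assignment.
        self.replicate_via_paxos(primary, key, value)
        return primary

    def replicate_via_paxos(self, proposer: str, key: int, value: str) -> None:
        self.store[key] = value  # placeholder for the consensus round

def route_put(groups: List[Group], key: int, value: str) -> str:
    """Forward a write to the primary of the group covering the key."""
    for g in groups:
        if g.covers(key):
            return g.put(key, value)
    raise KeyError("no group covers this key")

# Example: two groups splitting a small key space among their members.
groups = [Group(0, 100, ["n1", "n2"]), Group(100, 200, ["n3", "n4", "n5"])]
print(route_put(groups, 150, "hello"))  # lands on n4's slice of group 2
```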

I personally was not able to identify any major flaws, but I have a couple of concerns. First, they want the system to include millions of nodes, yet group formation does not depend on the geographic placement of nodes, which may leave each group's members scattered around the world; this would add considerable overhead to Paxos. Second, a joining node samples k random candidate groups, and with a rather small k (=3) I don't understand how they manage non-negligible improvements in resilience and latency.

All in all, solid ideas are presented in the paper, and they can be implemented provided the security issues are addressed.

5 comments:

  1. Doesn't the Load Balance policy potentially conflict with the Latency policy? If a node joins the group with maximum load, then with each join the latency of that group increases. Suppose many nodes join within a very small interval: which policy should be followed? Does the application get a say in which it considers more important?

  2. It seems as if this same broad concept has been tossed around pretty consistently in multiple papers. If the system is too decentralized to scale, make groups and leaders, divide up the key space, and find out how to do group management and leader election. Then the system will scale because nodes are no longer communicating significantly outside their group.

  3. Even though it seems they've done a nice job, I see a couple of problems with the evaluation.
    First, since they implemented Scatter on top of Paxos, they should have evaluated leader-failure cases in depth. I think this is non-trivial, but they only mention the mechanism briefly and show no such evaluation.
    Second, they could have compared performance, availability and consistency against other distributed key-value stores to show their accomplishment, but they only present a comparison to OpenDHT. I'm not sure whether it was feasible, but comparing Scatter's features with Dynamo's would have made the difference between strong consistency and strong availability much clearer.

  4. They say that Scatter is designed for the biggest systems, but will Paxos be able to handle that much traffic? Even more concerning, their RPC calls add further overhead on top of Paxos, during a repartition for example, when groups of nodes are exchanging a lot of data.

    Replies
    1. Tubes, Spanner runs on Paxos. Your argument would apply equally well to Spanner, and we would all figure out pretty quickly if it didn't scale (because, you know, the entire internet runs on Google). I highly doubt that Scatter would be unable to handle the traffic because of Paxos (although I guess there are some differences in how it is used, I think it's a non-issue).
