Sunday, February 10, 2013

Summary of D3S: Debugging Deployed Distributed Systems

NOTE: I didn't write this summary. I'm just posting it for Jake Sherin, because he doesn't have access to the blog at the moment.

D3S is a checker that takes predicates specified by a developer and checks them while the system is running. The goal is to automate much of what would otherwise be a manual debugging process, while letting run-time checking scale to large systems and making the checking itself fault tolerant. D3S addresses three main challenges. The first is that the developer must be able to state the properties to check simply, yet the checking must be able to use multiple machines so that large systems can be checked at run-time. To address this, checkers are organized into a directed acyclic graph. Each vertex is a computation stage, and the checkers in a stage are written as sequential C++ functions. The checkers output state tuples that flow to the downstream vertices in the graph, and a vertex can be mapped to several verifier processes that run its checkers in parallel, so D3S can use multiple machines to scale run-time checking. The second challenge is that D3S must tolerate failures of the checking machines themselves. D3S therefore monitors the verifier processes, so that when one fails a new one can be started with the failed process's input. The third challenge is that D3S should handle failures of the processes being checked. This is solved by removing failed processes from the globally-consistent snapshots before checking them: a logical clock orders the exposed state tuples, and D3S knows which processes belong to the snapshot at each timestamp.
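
To make this concrete, here is a rough sketch of what a single checker stage might look like, in the spirit of the paper's sequential C++ checkers but with hypothetical names and types rather than D3S's actual API. It consumes the state tuples exposed by the processes of a lock service and outputs a tuple for every lock that is held exclusively by more than one client; in D3S, those output tuples would flow to the downstream vertices of the graph, and several verifier processes could run the same function in parallel over different parts of the input.

    #include <map>
    #include <vector>

    // Hypothetical state tuple exposed by each process: which client holds
    // which lock, and in what mode (names are illustrative, not D3S's API).
    enum LockMode { SHARED, EXCLUSIVE };
    struct LockState { int client; int lock; LockMode mode; };

    // One checker stage: given the tuples that make up one globally-consistent
    // snapshot, emit the id of every lock with more than one exclusive holder.
    std::vector<int> CheckExclusiveLocks(const std::vector<LockState>& snapshot) {
        std::map<int, int> exclusiveHolders;   // lock id -> #exclusive holders
        for (const LockState& s : snapshot)
            if (s.mode == EXCLUSIVE) ++exclusiveHolders[s.lock];

        std::vector<int> conflicts;            // "output tuples" for downstream vertices
        for (const auto& kv : exclusiveHolders)
            if (kv.second > 1) conflicts.push_back(kv.first);
        return conflicts;
    }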

D3S allows a developer to write predicates, which D3S compiles into a state exposer and a checking-logic module. These can be inserted while the system is running, although violations that occurred before the insertion may be missed. The D3S run-time can also partition the checking computation across multiple machines without much effort from the developer. D3S supports stream processing, in which vertices transmit only the difference in their output relative to the previous timestamp and check state incrementally; this is useful because consecutive timestamps often differ only slightly. D3S also supports sampling to help developers further reduce the overhead.
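
The incremental idea can be sketched roughly as follows (illustrative types and names, not D3S's real interface): a vertex ships only the tuples that were added or removed since the previous timestamp, and a downstream checker applies that delta to its cached view of the state, so the per-timestamp cost is proportional to the change rather than to the full state.

    #include <set>

    // A delta between two consecutive timestamps: tuples added and removed.
    struct Delta { std::set<int> added, removed; };

    // Computed at the upstream vertex: diff the current output against the
    // previous one instead of retransmitting the whole tuple set.
    Delta Diff(const std::set<int>& prev, const std::set<int>& curr) {
        Delta d;
        for (int t : curr) if (!prev.count(t)) d.added.insert(t);
        for (int t : prev) if (!curr.count(t)) d.removed.insert(t);
        return d;
    }

    // Applied at the downstream vertex to bring its cached view up to date.
    void Apply(std::set<int>& view, const Delta& d) {
        for (int t : d.removed) view.erase(t);
        for (int t : d.added)   view.insert(t);
    }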

Global snapshots are integral to D3S. A global timestamp is obtained using a logical clock: when a process sends a message, it attaches its logical clock, and the receiving process sets its own clock to the maximum of its local value and the attached value. This lets the D3S run-time preserve happens-before relationships, order all tuples in a consistent total order, and construct snapshots.
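
This is essentially a Lamport-style logical clock. A minimal sketch of the update rules just described (illustrative, not D3S's actual implementation) is:

    #include <algorithm>
    #include <cstdint>

    struct LogicalClock {
        std::uint64_t time = 0;

        // Local event (e.g., exposing a state tuple): advance the clock and
        // stamp the tuple with the new value.
        std::uint64_t Tick() { return ++time; }

        // A message carries the sender's clock; the receiver sets its clock to
        // the maximum of its own value and the attached one (the usual Lamport
        // rule then advances it by one), which preserves happens-before.
        std::uint64_t OnReceive(std::uint64_t attached) {
            time = std::max(time, attached) + 1;
            return time;
        }
    };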

A predicate is modeled as a function defined over a finite number of consecutive snapshots, where a snapshot at time t is the set of states of all live processes at t. To verify a predicate correctly, D3S needs a complete snapshot, which it obtains by collecting the state of every process that constitutes the system at that timestamp.
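
As a rough illustration (hypothetical names, not the paper's interface), such a predicate can be thought of as a function that is re-evaluated over a sliding window of the last K complete snapshots each time a new one arrives:

    #include <cstddef>
    #include <deque>
    #include <functional>
    #include <utility>
    #include <vector>

    // A snapshot at time t: the exposed states of all live processes at t
    // (represented here simply as a vector of opaque per-process states).
    using Snapshot  = std::vector<int>;
    // A predicate over a finite number of consecutive snapshots.
    using Predicate = std::function<bool(const std::deque<Snapshot>&)>;

    class WindowChecker {
    public:
        WindowChecker(std::size_t k, Predicate p) : k_(k), pred_(std::move(p)) {}

        // Called once per timestamp with that timestamp's complete snapshot;
        // returns false when the predicate is violated over the current window.
        bool OnSnapshot(const Snapshot& s) {
            window_.push_back(s);
            if (window_.size() > k_) window_.pop_front();
            return window_.size() < k_ || pred_(window_);
        }

    private:
        std::size_t k_;
        Predicate pred_;
        std::deque<Snapshot> window_;
    };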

D3S was applied to five different systems to check its effectiveness: PacificA, a semi-structured distributed system; the MPS Paxos implementation, a Paxos-based replicated state machine service; a web search engine; the i3 Chord implementation; and a BitTorrent client. The results are broadly the same: D3S detected bugs in far less time than it would have taken developers otherwise, and the case studies show that D3S is useful in a wide range of situations. The caveat is that the results are only as good as the predicates, so it is up to the developer to write them effectively.

D3S's performance, according to the data, varies with the number of threads, but even in the worst case the overhead stays just below 8%; the main point the data stresses is that the number of threads is vital to performance. Related work is discussed and compared with D3S: replay-based predicate checking, online monitoring, log analysis, large-scale parallel applications, and model checking. The authors see replay as an essential complementary technology to online predicate checking; ultimately, the goal would be to use online checking to catch violations and then enable time-travel debugging of a recent history with bounded-time replay. Future work focuses on combining online predicate checking with offline replay, and on exploring ways for a developer to easily find bugs in data-center applications that are composed of many systems.

11 comments:

  1. Debugging Deployed Distributed Systems (D3S) shows cool results for debugging distributed systems efficiently. It deals with several important challenges, and its solutions produce an impressive evaluation result: at most 8% slowdown, and mostly around 1%, across the different circumstances (although this seems a little vague, since the worst case could exceed 8% slowdown with 512/1024-byte packets at a frequency of 1000). However, the evaluation does not seem realistic; under truly large-scale distributed circumstances, we cannot be sure whether it works well or not. Moreover, their effort to decrease overhead also diminishes the information that is useful for debugging, though that is undoubtedly a trade-off. To be more useful, D3S must be improved in this respect.

    Replies
    1. This concern is brought up in the paper - at the end of Section 5, the authors address concerns about realistic large-scale systems. This particular problem really epitomizes the root difficulty of working with distributed systems - it is hard to pinpoint where errors are occurring when an application involves many systems, each of which is potentially changing constantly.
      While the paper moves us one step closer towards debugging large-scale systems, it also highlights the difficulties of doing so.

  2. One of the things that I always find interesting is the number of monitoring nodes that need to be added to systems for things like D3S, or Codons, or Consensus Routing to work. With all of the issues being discovered with distributed systems, it's very possible that the number of monitoring systems needed could put a serious damper on productivity in the system because of all the bandwidth and data flow necessary to keep the system in good shape. This is less of a specific comment and more of a general observation, but it's something I've definitely seen in all of these papers.

    Replies
    1. I definitely agree. The test cases they perform in most of these papers are far from realistic simulations, but they just pass that off as an assumption that their system "scales well". It would be nice to see some better testing and reliable results.

    2. That's true, but I think the key thing to observe is that in the absence of such systems, an even bigger damper could be put on productivity: failure. So sure the monitoring may cost you 2% (probably more) of the system data, but a system with 2% less capacity is better than one that's broken.

      If you think about the nature of a distributed system, their relative weight makes sense: if you want to provide a service (be it monitoring, DNS, whatever) to a large group, you can't really expect to do that in a way that's not as scalable as the system itself. So the only way you could ever hope to monitor a distributed system is... well, with a distributed system. And you're right, one naturally runs into the same problems we've been seeing the whole time (trying to understand system state, distribute information, etc.). But it's not clear to me that there is any alternative.

    3. I think the thing to compare/look at here is the "baseline cost"; what would be the overhead if you have something like D3S but don't use it? Whatever else you add to improve reliability comes at a cost and here they try to give you an idea of that.

  3. I had a problem with the testing methodology in this paper, since the authors used both (1) a very small network and (2) uniform machines. If they are trying to simulate a realistic scenario for, say, BitTorrent clients, then a mix of machines, bandwidths, networks, and users is necessary to have a truly accurate simulation.

    Beyond that, the idea of the monitoring system seems interesting, but I couldn't get past the distracting (and prevalent) spelling and grammatical mistakes. The paper seemed very unprofessional and unreliable as a result.

  4. I really wish they had given examples of the rules they wrote for the 5 systems tested. It seems like you have to know a lot about what's happening under the hood in the systems, so this tool would be ideal for a developer of a single system. I just wonder if it could be used in a deployment where several distributed systems interact with each other.

    Replies
    1. True but that's the case w/ any assertion you write, right?

  5. Although I agree that there were some spelling / grammar issues, I think it's hard to discredit someone who built such a complicated system based solely on their English skills. This is a rather unfortunate consequence of the fact that we require research to be done in English but don't do much to help authors with poor English skills. Some conferences have English review systems to help authors, but apparently this paper did not get that type of help.

    As for their testing, it is unfortunate that they used similar machines, but they did say that they injected errors into the system to account for the similarities. Overall, it would not have been difficult to use varied types of machines to test their system, which does show some carelessness in their research.

    Replies
    1. Gotta loving the present participle - best verb form!
