Sunday, February 24, 2013

Spanner


Spanner: Google's Globally-Distributed Database

Spanner is a scalable, multi-version, globally-distributed, synchronously-replicated database, and the paper discusses parts of its design in varying levels of depth.

There is a lot in this paper, so I'll just try to hit the highs and lows, starting with the good parts. The TrueTime API solves a big problem in ordering events in a distributed system, and as the authors mention several times, without it the central design of Spanner would be impossible. TrueTime is a sub-system used by Spanner that returns the current global time as an interval whose width reflects the worst-case clock uncertainty. The time-master nodes backing it rely on either atomic clocks, which have a known drift rate and must be synchronized periodically, or GPS clocks, which are more accurate but susceptible to communication failures. Because the API returns an interval rather than a point, there are moments when Spanner has to wait, either when resolving two conflicting requests or before declaring a transaction committed, but the wait is short, on the order of ~10 ms, and at Spanner's scale two transactions are very unlikely to touch the same rows anyway. TrueTime is a cool system, and one I hope they open source.
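
To make the commit-wait idea concrete, here is a minimal sketch in Python of what a TrueTime-style interval and the resulting wait might look like. The `tt_now()` helper and the fixed 7 ms uncertainty are my own stand-ins, not Google's API; a real deployment gets its bound from the GPS/atomic-clock time masters.

```python
import time
from dataclasses import dataclass

# Hypothetical stand-in for TrueTime: a real deployment derives the
# uncertainty bound (epsilon) from its time masters, not a constant.
EPSILON = 0.007  # assume ~7 ms worst-case clock uncertainty

@dataclass
class TTInterval:
    earliest: float
    latest: float

def tt_now() -> TTInterval:
    """TrueTime-style now(): the true time is guaranteed to lie inside the interval."""
    t = time.time()
    return TTInterval(t - EPSILON, t + EPSILON)

def commit_wait(commit_ts: float) -> None:
    """Block until the commit timestamp is definitely in the past everywhere,
    so readers at any replica observe a consistent commit order."""
    while tt_now().earliest <= commit_ts:
        time.sleep(0.001)

# A coordinator would pick s = tt_now().latest as the commit timestamp,
# then commit-wait before releasing locks and acknowledging the client.
s = tt_now().latest
commit_wait(s)
```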

Spanner itself provides a semi-relational model to the client. This is really cool: distributed databases (in my limited experience reading about them) typically provide only a key-value store and push the burden of any more advanced lookups onto the application. Spanner maps a row's primary keys to its remaining columns and interleaves table data into directories, which allows certain relational features while still remaining distributed.
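
Here is a rough sketch of how I picture the primary-key mapping and interleaving working; the flat key encoding and the table names are hypothetical, just to show why a parent row and its interleaved children end up contiguous in the same directory.

```python
# Hypothetical encoding: each row maps from its primary key to the remaining
# columns, and a child table interleaved in its parent shares the parent's key
# prefix, so the "directory" (parent row plus children) stays contiguous and
# can be sharded and moved as a unit.
store = {}

def put_user(user_id, name):
    store[("Users", user_id)] = {"name": name}

def put_album(user_id, album_id, title):
    # child key starts with the parent's primary key
    store[("Users", user_id, "Albums", album_id)] = {"title": title}

put_user(1, "ann")
put_album(1, 10, "vacation")
put_album(1, 11, "pets")

# Scanning keys with prefix ("Users", 1) yields the user row and all of its
# albums together -- the directory Spanner replicates and moves as one unit.
directory = {k: v for k, v in store.items() if k[:2] == ("Users", 1)}
print(directory)
```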

I wish this paper had listed more performance metrics and compared an application running on Spanner against the same application on Bigtable or Megastore, both of which are mentioned a few times in the paper. Since the system was primarily designed for Google's F1 advertising backend, it would have been nice to see throughput numbers for the new system versus the old (they have the new numbers), as well as which application-level complexities were resolved and how system-level upkeep changed. Numbers on user-facing latencies while leaders are failing or a datacenter gets knocked offline would also have been interesting. I don't doubt that Google has the numbers for these scenarios; I just wonder why they included the numbers they did.


Sunday, February 17, 2013

Dynamo: Amazon’s Highly Available Key-value Store



Dynamo is a highly available key-value storage implementation for large-scale distributed systems. Dynamo relies on the following properties to be effective:

  • Primary key data storage for a simple read/write interface
  • Consistent hashing for replication
  • Object versioning for consistency

The use of primary-key data storage cuts down on the overhead of a relational database. Dynamo also operates under Service Level Agreements between services in order to maintain latency requirements, which are stated against the 99.9th percentile rather than a mean or median. Amazon also acknowledges a trade-off between availability and consistency: to improve availability, eventual consistency is maintained with optimistic replication. This raises the issue of conflict management, which is handled at read time using a "last write wins" or application-specified policy; pushing resolution to reads reduces the number of writes that are rejected because of network and node failures. Dynamo is essentially a "zero-hop" DHT in which every node stores enough routing information to contact any node it needs to communicate with.
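
To illustrate the "zero-hop" routing that consistent hashing makes possible, here is a toy ring with virtual nodes. The hash function, token count, and node names are my own choices for the sketch, not Dynamo's actual parameters.

```python
import bisect, hashlib

def h(key: str) -> int:
    """Hash a key onto the ring's position space."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    """Toy consistent-hash ring: every physical node claims several virtual
    positions ("tokens"), and a key is owned by the first token clockwise."""
    def __init__(self, nodes, vnodes_per_node=8):
        self.tokens = sorted(
            (h(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes_per_node)
        )
        self.positions = [t for t, _ in self.tokens]

    def coordinator(self, key: str) -> str:
        idx = bisect.bisect(self.positions, h(key)) % len(self.tokens)
        return self.tokens[idx][1]

ring = Ring(["A", "B", "C", "D"])
# Every node holds the full token table, so any node can compute the
# coordinator for a key locally -- zero routing hops.
print(ring.coordinator("user:42"))
```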

Dynamo exposes get() and put() operations, which are executed with a quorum-like procedure that waits for a configurable number of replica responses before returning. For partitioning, consistent hashing assigns each physical node a set of virtual nodes that determine which ranges of data it is responsible for, and each range is replicated on a number of successor nodes on the ring to tolerate failures. Because of failures and concurrent updates, multiple versions of the same data can exist in the system simultaneously; Dynamo uses vector clocks to determine the causal relationship between replicas and to drive conflict resolution. Divergent replicas are detected with Merkle trees, whose traversal quickly identifies which key ranges are out of sync.
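
And a small sketch of the vector-clock comparison that decides whether one version supersedes another or whether the two conflict; representing a clock as a dict of node → counter is an assumption for illustration, not Dynamo's actual format.

```python
# Hypothetical vector-clock helpers illustrating version reconciliation.
def descends(a: dict, b: dict) -> bool:
    """True if clock `a` has seen everything clock `b` has (element-wise >=)."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def reconcile(a: dict, b: dict):
    if descends(a, b):
        return a, "a supersedes b"
    if descends(b, a):
        return b, "b supersedes a"
    # Neither descends from the other: the writes were concurrent, so both
    # versions are kept and handed to the application (or to last-write-wins).
    merged = {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}
    return merged, "conflict"

print(reconcile({"Sx": 2, "Sy": 1}, {"Sx": 1}))   # a supersedes b
print(reconcile({"Sx": 2}, {"Sy": 1}))            # concurrent -> conflict, merged clock
```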

This contributes to Dynamo’s use as a scalable and highly available distributed data storage mechanism.

P. Hunt et al., ZooKeeper: Wait-free Coordination for Internet-scale Systems, USENIX ATC, 2010

In this paper, the authors introduce ZooKeeper, a wait-free coordination service for large-scale distributed systems.

One of the most important requirements of a distributed system is coordination. We have been in touch with this since the first day we were introduced to multi-threaded programming: we need locks to prevent multiple threads from accessing the same piece of memory at the same time. That is just the simplest example, and there are countless other situations that require different degrees and forms of coordination. As mentioned in the paper, Amazon's Simple Queue Service focuses on queuing and Chubby provides locking; each of these services was built for a specific coordination need and implemented separately on the server side, which costs a large amount of unnecessary time and work. ZooKeeper heads in another direction: it lets developers implement their own primitives through an API backed by a coordination kernel, so new primitives can be supported without changes to the service core. Another very nice property of ZooKeeper is that it moves away from blocking primitives, which can slow the whole system down because of one or two exceptionally slow processes. This property, called wait-freedom, is the most important feature of ZooKeeper, and unsurprisingly it appears in the title of the paper. Furthermore, in terms of Herlihy's hierarchy, ZooKeeper implements a universal object that provides ordering guarantees for operations. Client-side caching and pipelining also help improve ZooKeeper's performance.
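
As an example of the kind of primitive the API lets clients build for themselves, here is a sketch of a lock recipe using ephemeral sequential znodes. The `zk` client object and its `create` / `get_children` / `exists_and_watch` / `delete` methods are hypothetical stand-ins for whatever client library you would actually use, not ZooKeeper's real API; treat this as the shape of the recipe rather than runnable production code.

```python
# Sketch of a herd-free lock built from ZooKeeper-style znodes, written against
# a hypothetical client object `zk` (all method names here are stand-ins).

def acquire_lock(zk, lock_path="/lock"):
    # 1. Create an ephemeral, sequential znode under the lock node; the server
    #    appends a monotonically increasing, zero-padded sequence number.
    me = zk.create(lock_path + "/guid-", ephemeral=True, sequential=True)
    while True:
        children = sorted(zk.get_children(lock_path))
        if me.endswith(children[0]):
            return me                        # lowest sequence number holds the lock
        # 2. Otherwise watch only the znode immediately below ours; when it is
        #    deleted (released or its holder crashed), re-check. Watching a
        #    single predecessor avoids waking every waiter at once.
        predecessor = children[children.index(me.split("/")[-1]) - 1]
        zk.exists_and_watch(lock_path + "/" + predecessor)   # assume this blocks until deletion

def release_lock(zk, me):
    # Deleting the ephemeral znode wakes exactly the next waiter in line.
    zk.delete(me)
```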

Sunday, February 10, 2013

Summary D3S: Debugging Deployed Distributed Systems

NOTE: I didn't write this summary. I'm just posting it for Jake Sherin, because he doesn't have access to the blog at the moment.

D3S is a checker that takes predicates specified by a developer and checks them while a system is running. The goal is to automate much of what would normally be a manual debugging process, while letting run-time checking scale to large systems and making the checking itself fault tolerant. D3S addresses three main challenges. The first is that the developer must be able to easily state which properties to check while still allowing the checking to spread across multiple machines, so that large systems can be checked at run-time. To address this, checkers are organized in a directed acyclic graph. Each vertex is a computation stage, and the checkers at a stage are written as sequential C++ functions. State tuples output by the checkers flow to the downstream vertices in the graph, and a vertex can be mapped to several verifier processes that run its checkers in parallel, so D3S can use multiple machines to scale run-time checking. The second challenge is that D3S must tolerate failures of the checking machines. To do this, D3S monitors verifier processes, and when one fails a new one is started with the failed process's input. The third challenge is that D3S should handle failures of the processes being checked. This is solved by removing failed processes from the globally consistent snapshots before checking them. A logical clock is used to order the exposed state tuples, and D3S knows which processes are in the snapshot at each timestamp.
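
The paper's checkers are C++ functions; here is a Python analogue of what a single-vertex checker over exposed state tuples might look like, using a made-up "no two exclusive holders of the same lock" predicate to show the tuple-in, tuple-out flavor.

```python
from collections import defaultdict

def check_exclusive_locks(snapshot_tuples):
    """snapshot_tuples: iterable of (lock_id, client_id, mode) tuples exposed
    by the checked processes at one timestamp. Returns violation tuples that
    would flow to a downstream vertex in the checking graph."""
    holders = defaultdict(set)
    for lock_id, client_id, mode in snapshot_tuples:
        if mode == "EXCLUSIVE":
            holders[lock_id].add(client_id)
    # A violation is any lock with more than one exclusive holder.
    return [(lock, sorted(clients)) for lock, clients in holders.items() if len(clients) > 1]

print(check_exclusive_locks([
    ("L1", "p1", "EXCLUSIVE"),
    ("L1", "p2", "EXCLUSIVE"),   # conflicting exclusive holder -> violation
    ("L2", "p3", "SHARED"),
]))
```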

D3S lets a developer write predicates, which it compiles into a state exposer and a checking-logic module. These can be inserted while the system is running, though violations that occurred before insertion may be missed. In addition, the D3S run-time can partition the checking computation across multiple machines without much effort from the developer. D3S supports stream processing, where vertices transmit only the difference in their output from the last timestamp and check state incrementally; this is useful because consecutive timestamps often differ only slightly. Sampling is also supported in order to help developers further reduce the overhead.

Global snapshots are integral to D3S. First, a global timestamp is acquired using a logical clock. When a process sends a message, it attaches its logical clock, and the receiving process sets its own clock to the maximum of its local value and the attached value. This way the D3S run-time preserves happens-before relationships, can order all tuples in a consistent total order, and can construct snapshots.
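
A minimal sketch of that logical-clock discipline (standard Lamport-style: attach on send, take the max and advance on receive); the class and method names here are mine, not D3S's.

```python
class LogicalClock:
    """Lamport-style logical clock: timestamps respect happens-before, so the
    run-time can order exposed tuples and group them into snapshots."""
    def __init__(self):
        self.time = 0

    def tick(self):                 # local event, e.g. exposing a state tuple
        self.time += 1
        return self.time

    def send(self):                 # attach the current clock to an outgoing message
        return self.tick()

    def receive(self, msg_time):    # merge on receipt: max(local, attached) + 1
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LogicalClock(), LogicalClock()
t = a.send()          # A sends a message stamped t
b.receive(t)          # B's clock jumps past t, preserving happens-before
```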

A predicate is technically modeled as a function defined over a finite number of consecutive snapshots, where a snapshot at time t is the set of states of all live processes at t. To verify a predicate correctly, though, D3S needs a complete snapshot, which it obtains by collecting the state of every process that constitutes the system at that timestamp.

D3S was applied to five different systems to check its effectiveness: PacificA, a semi-structured distributed storage system; the MPS Paxos implementation, a Paxos-based replicated state-machine service; a web search engine; the i3 Chord implementation; and a BitTorrent client. The results were basically the same across all of them: D3S detected bugs in less time than it would have taken developers otherwise, which shows how wide a range of situations it is useful in. The only caveat is that the predicates themselves must be useful in order to get useful results, so effectiveness ultimately rests on the developer writing good predicates.

According to the data, D3S's performance varies with the number of threads, but even in the worst case the overhead stays just below 8%; the main point the data stresses is that the number of threads is vital to performance. Related work is discussed and compared with D3S: replay-based predicate checking, online monitoring, log analysis, large-scale parallel applications, and model checking. The authors see replay as an essential complementary technology to online predicate checking; ultimately, the goal would be to use online checking to catch violations and then enable time-travel debugging of the recent history with bounded-time replay. Future work focuses on combining online predicate checking with offline replay, as well as exploring ways for a developer to easily find bugs in data-center applications composed of many systems.

Sunday, February 3, 2013

CoDoNS

In the paper The Design and Implementation of a Next Generation Name Service for the Internet, the authors Venugopalan Ramasubramanian and Emin Gün Sirer introduce CoDoNS, a next-generation domain name system. The motivation for improving the current DNS comes from three observations: first, the current DNS is extremely vulnerable to denial-of-service attacks; second, DNS lookup delays are often long (thirty percent of web retrievals take more than one second for the lookup); and third, because of the lack of cache coherency, clients can wait a long time for updates to propagate, which prevents quick service relocation in response to changes in the environment. The newly designed Cooperative Domain Name System (CoDoNS) addresses these problems with three corresponding features: resilience to attacks, high performance, and fast update propagation. Structurally, CoDoNS is built on a peer-to-peer overlay. This provides resilience to attacks because the system does not depend on central servers, which are the usual targets of DoS attacks. Furthermore, CoDoNS applies analytically informed proactive caching, which increases the system's ability to adapt to changes in its environment. This caching layer, called Beehive, replicates DNS mappings to match anticipated demand, which provides strong, predictable performance. Intuitively, once the whole system is deployed, it works like one giant distributed cache in which each client participates locally. Note that despite the advantages of this freshly designed system, it is impossible to replace the current DNS overnight; any takeover would have to happen piece by piece.
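
To give a feel for the Beehive idea, here is a toy popularity-to-replication-level assignment. This is not the paper's closed-form analytical model (which solves for replication levels from the Zipf parameter and a target lookup cost); it only captures the intuition that more popular records get lower levels, and therefore more replicas, in a prefix-routing DHT.

```python
# Intuition sketch: in a prefix-routing DHT with base b, a record at level i is
# replicated on roughly N / b**i nodes, so level 0 means "cached everywhere"
# and the maximum level means "home node only". Rank records by popularity and
# give the most popular ones the lowest levels.
def assign_replication_levels(query_counts, max_level=3):
    """query_counts: dict of name -> observed query count (a Zipf-like workload).
    Returns a hypothetical name -> replication level assignment."""
    ranked = sorted(query_counts, key=query_counts.get, reverse=True)
    levels = {}
    for rank, name in enumerate(ranked):
        frac = (rank + 1) / len(ranked)            # how far down the popularity list
        levels[name] = min(max_level, int(frac * max_level))
    return levels

print(assign_replication_levels(
    {"cnn.com": 9000, "mit.edu": 700, "example.org": 40, "rare.net": 2}))
# -> the most popular name gets level 0 (widely replicated); rare ones stay at level 3
```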

The Design and Implementation of a Next Generation Name Service for the Internet


According to this paper, traditional DNS is susceptible to denial-of-service (DoS) attacks, has long delays in name-to-address translation, and does not support fast updates; these problems stem fundamentally from the structure of the legacy DNS. The paper proposes the Cooperative Domain Name System (CoDoNS) as an alternative whose goal is to overcome these limitations. CoDoNS consists of globally distributed nodes that self-organize into a peer-to-peer network and aims for high lookup performance, resilience to denial-of-service attacks, and fast propagation of updates.

Very large-scale performance measurements suggest that CoDoNS can provide lower latencies for query lookups. Several features of CoDoNS are particularly attractive:

1. Dropping the hierarchical structure of DNS and shedding load dynamically onto distributed peers reduces susceptibility to DoS attacks.

2. Better performance without extra expense, by varying the degree of replication with query popularity: more popular records are replicated more widely, while less popular records get fewer replicas.

3. Better adaptation to flash crowds and bottlenecks: peer nodes organize themselves and continuously adapt their replication.

4. Proactive update propagation ensures that unexpected changes are circulated quickly to the caches throughout the system.

Last but not least, the system has several compatibility issues with the current DNS infrastructure, which may inhibit its deployment in real environments.

The Design and Implementation of a Next Generation Name Service for the Internet


V. Ramasubramanian, E. Sirer, The Design and Implementation of a Next Generation Name Service for the Internet, In Proc. of SIGCOMM 2004.

This paper describes the design and implementation of the Cooperative Domain Name System (CoDoNS), which leverages a peer-to-peer network to improve lookup performance.

Traditional DNS suffers from high latency, load imbalance, slow update propagation, and implementation errors, and it is vulnerable to attacks such as DoS attacks. The authors propose using distributed hash tables to build a distributed domain name system called CoDoNS. CoDoNS decouples namespace management from the physical location of name servers in the network. Built on a peer-to-peer network, CoDoNS provides lower latency, as the experiments show. Eliminating the static query-processing hierarchy and shedding load dynamically onto peer nodes greatly decreases CoDoNS's vulnerability to denial-of-service attacks, and self-organization plus continuous adaptation of replication avoids bottlenecks in the presence of flash crowds.

Many distributed DNS systems have been released since the paper was published, such as BitTorrent-powered DNS. Some organizations, such as The Pirate Bay, use decentralized DNS to avoid the increasing control governments exert over the DNS root servers and to continue exchanging copyrighted files.

DNS Performance and the Effectiveness of Caching


DNS Performance and the Effectiveness of Caching
Jung, Sit, Balakrishnan, and Morris

The paper explores the current effectiveness of the DNS service through analysis of three large data sets collected in 2000.


This paper is all about understanding the three data sets the authors were able to collect: two from a vantage point on MIT's network and one on the KAIST network. I thought it was interesting to see two traces from the same network a year apart, since nearly the same demographic is present and you can see the evolution of their use of the service. While the authors don't make much of a point of this, there was nearly a doubling of DNS lookups in an 11-month period; it would be interesting to see a longer-term study of usage patterns based on this, but I digress. The authors have two big takeaways. First, retransmissions and lookups that get no answer cause a substantial portion of DNS traffic: they mention that 63% of the DNS packets were generated by lookups that obtain no answer, even though those lookups are only between 3.1% and 4.9% of all lookups. Second, they look at the effectiveness of caching. The popularity of requested sites follows a Zipf distribution, so there is a large portion of sites for which caching will have no benefit. Looking at the TTL values of DNS entries, they find there isn't much benefit in a TTL longer than about 10 minutes; most of the benefit comes in the first few minutes, and cache performance is not improved by a longer value. The benefit of sharing a cache between people also levels off quickly: once 6 or 7 people share a cache, adding another person doesn't help much.
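
Out of curiosity, here is a toy simulation of the TTL question, not the authors' trace-driven methodology: Zipf-distributed lookups against one shared cache with a fixed TTL. The parameters (10,000 names, 20 lookups per second, Zipf exponent 1) are arbitrary choices of mine, but the hit-rate curve flattens quickly as the TTL grows, which is roughly the shape of the result they report.

```python
import random
from itertools import accumulate

def hit_rate(ttl, n_names=10_000, n_lookups=200_000, lookups_per_sec=20, alpha=1.0):
    """Simulate Zipf-distributed lookups against one shared cache with a fixed TTL."""
    cum = list(accumulate(1 / rank ** alpha for rank in range(1, n_names + 1)))
    cache = {}          # name -> expiry time
    hits, clock = 0, 0.0
    for _ in range(n_lookups):
        clock += random.expovariate(lookups_per_sec)          # Poisson-ish arrivals
        name = random.choices(range(n_names), cum_weights=cum)[0]
        if cache.get(name, -1.0) > clock:
            hits += 1
        else:
            cache[name] = clock + ttl                         # miss: resolve and cache
    return hits / n_lookups

for ttl in (0, 60, 600, 3600):
    print(f"TTL {ttl:>5} s -> hit rate ~{hit_rate(ttl):.2f}")
```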

The one thing that bugged me about this paper is that they didn't propose any changes to DNS to help improve performance. They had a great data set to use for simulating changes and only used it for the caching experiments. While that was an interesting experiment, its results are not something you can force on people: a website trying to load-balance servers will still set a low TTL even if it knows that makes DNS caching ineffective. An experiment on modifying DNS's retransmission timeout to alleviate some of the redundant traffic would have had greater practical implications.

This study would be interesting to repeat now. With DNS prefetching built into browsers, the number of queries per user must have exploded since this paper was published, and I would expect the latency numbers to go up even further as a result.