Tuesday, March 12, 2013

Where Does the Internet Filtering Occur in China?

It is hardly a secret that China filters what its people can see on the internet.  This censorship is used to control propaganda, shape sentiment, and silence views unfavorable to the government. 

This paper seeks to discover where this filtering takes place. 

Some terms before diving into the paper:
AS - Autonomous System, a collection of IP routing prefixes presented to the internet under a single, clearly defined routing policy.  These are assigned a number, and each ISP has at least one.
Border AS - these ASes connect to ASes in other countries. 
Internal AS - these ASes connect only to ASes within the country.
IDS - Intrusion Detection System - these devices are what the paper is looking for; they effectively implement the filter.

The paper is primarily separated into two parts: first getting a better understanding of China's AS topology, then finding out which ASes the IDSes are attached to.  In mapping China's AS topology, the paper describes a lot of research into which ASes belong to China and all that good stuff.  Basically, they find that most of China's ASes are internal (138), with 24 border ASes, 92 external ASes, and 133 unique peerings with external ASes.  Another interesting result is that, treating each border AS as the root of a tree, the maximum depth is 2, and only 18 ASes sit at level 2.  In other words, almost every AS is either a border AS or connected directly to one.  This is significant because it means the easiest place for the Great Firewall of China to loom over its citizens is at these border ASes.
The second part of the paper describes a method to find out which ASes the IDSes are attached to: they simply send HTTP GET requests containing keywords known to trigger the firewall and record the IP addresses from which the interference (forged RST packets) is sent.  This is done against a variety of websites all over the country.  Here, they find that almost all of the IDSes belong to the border ASes, with only 2.9% belonging to internal ASes.  This shows that there is not very much domestic filtering, but mostly filtering at the border. 
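
To make the method concrete, here is a minimal sketch of the probing idea (not the authors' actual tooling): open a connection through the filtered path, send a GET containing a sensitive keyword, and watch for the connection being reset by injected packets.  The hostname and keyword below are placeholders, and locating the injecting AS additionally requires TTL-limited probes, which this sketch omits.

```python
import socket

def probe(host, keyword, port=80, timeout=5):
    """Send an HTTP GET containing a filter-triggering keyword and report what happens."""
    request = f"GET /?q={keyword} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
    s = socket.create_connection((host, port), timeout=timeout)
    try:
        s.sendall(request.encode())
        data = s.recv(4096)                  # a normal reply suggests no interference
        return "reply received" if data else "connection closed quietly"
    except ConnectionResetError:
        # Injected RSTs surface here; the paper also varies the probe's IP TTL to find
        # the hop (and hence the AS) where the resets originate.
        return "reset observed -- likely filtered"
    finally:
        s.close()

print(probe("example.com", "falun"))          # placeholder target and keyword
```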

Overall, the paper is pretty informative; however, it constantly laments the difficulty of getting complete data.  The paper also assumes an understanding of networking terms and concepts (RST packets, AS, BGP, etc.) that could be explained pretty quickly; doing so would greatly expand the target audience, since the topic is quite controversial and brought up often.  The results are pretty consistent with their hypotheses.  While the paper does not seek methods of bypassing the filter, it does mention that it is the "first study dedicated to explore both AS and router-level structures of China's censored network," so it seems it would be prudent to suggest how this information might bear on the difficulty of bypassing the firewall.  In fact, the paper never mentions (as far as I can tell) what else could be done on this topic in the future.

Tor: The Second-Generation Onion Router

Tor: The Second-Generation Onion Router
 by Roger Dingledine, Nick Mathewson, and Paul Syverson

Tor, a circuit-based low-latency anonymous communication service, is the second-generation Onion Routing system. It addresses limitations in the original design by adding perfect forward secrecy, congestion control, directory servers, integrity checking, configurable exit policies, and a practical design for location-hidden services via rendezvous points.

To understand this paper, we first need to understand onion routing. The following is an explanation of onion routing from Wikipedia:
A routing onion is a data structure formed by wrapping a plain text message with successive layers of encryption, such that each layer can be 'unwrapped' (decrypted) like the layer of an onion by one intermediary in a succession of intermediaries, with the original plain text message only being viewable by at most, the sender, the last intermediary (the exit node) and the recipient.
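
As a toy illustration of that layering (assuming the third-party `cryptography` package is available; real Tor uses telescoping circuits and different primitives), each hop's key peels exactly one layer:

```python
from cryptography.fernet import Fernet

hops = [Fernet(Fernet.generate_key()) for _ in range(3)]   # entry, middle, exit keys

# the sender wraps the message once per hop, innermost layer for the exit node
onion = b"GET /index.html HTTP/1.0"
for hop in reversed(hops):
    onion = hop.encrypt(onion)

# each intermediary decrypts exactly one layer and sees only the next hop's payload
for hop in hops:
    onion = hop.decrypt(onion)

print(onion)   # b'GET /index.html HTTP/1.0' -- plaintext appears only after the last layer
```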




Tor is distinguished from the original onion router by several features.
- perfect forward secrecy: Tor uses a telescoping path-building design in which the initiator negotiates session keys with each successive hop in the circuit.
- congestion control: Tor’s decentralized congestion control uses end-to-end acks to maintain anonymity while allowing nodes at the edges of the network to detect congestion or flooding and send less data until the congestion subsides.
- directory servers: Tor takes a simplified view toward distributing state information, i.e. certain more trusted nodes act as directory servers providing signed directories describing known routers and their current state.
- integrity checking: Tor verifies data integrity before it leaves the network to hamper attacks.
- configurable exit policies: Tor provides a consistent mechanism for each node to advertise a policy describing the hosts and ports to which it will connect.
- rendezvous points and hidden services: Tor provides an integrated mechanism for responder anonymity via location-protected servers. Instead of using reply onions like previous designs, Tor clients negotiate rendezvous points to connect with hidden servers.

The authors present clear goals and even non-goals, along with many kinds of attacks and the defenses against them. In addition, they lay out several goals and directions for future work. By doing so, they clarify the limitations of the paper and show how the work could be improved.

In conclusion, Tor is now the predominant technology that employs onion routing. However, threats to anonymity are growing rapidly, and distributed systems are particularly vulnerable to them. Its security will have to keep improving to cope with these kinds of threats.

Sunday, March 10, 2013

The Costs and Limits of Availability of Replicated Services


From Georgios T., who can't seem to post:

The Costs and Limits of Availability of Replicated Services. Haifeng Yu & Amin Vahdat

The motivation of this paper comes from the realization that the utility of networked services is limited by availability rather than performance. Contemporary approaches (at the time of the paper) to improving availability used caching and replication, which create consistency problems. The authors identify the two extremes for dealing with inconsistency as 'strong consistency' and 'optimistic consistency', and set out to explore the costs and limits of dynamically trading consistency for availability using a continuous consistency model. In a continuous consistency model, applications specify the maximum distance from strong consistency, thus bounding how optimistic the system may be.
    In order to evaluate different implementations, a methodology is described for computing the upper bounds of service availability as a function of faultload, workload, consistency and replication level. The methodology is based on Q(offline) and Q(online) -- sets of questions that a protocol has to answer and that can be resolved optimally offline and online respectively. The answers to the questions form a dominant algorithm; Q(online) answers are specified by the inputs of this algorithm, so fewer questions in the online set improves the chance of a tractable solution.
    Another contribution of this paper is a comparison, using a prototype based on TACT, of the service availability that can be achieved with different consistency protocols (Voting, Primary, Golding), faultloads and consistency levels. Through the experiments it is shown that small optimizations to the existing protocols (e.g. aggressive write propagation) can bring them closer to the availability upper bounds. An interesting finding of the evaluation section is that maximizing availability by maintaining as strong a consistency level as possible creates a trade-off between availability and communication. Finally, the authors also explore the effect of the degree of replication and network reliability on availability, and it is shown that more replicas are not always good: they can improve read availability but degrade write availability due to consistency constraints.
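
A toy sketch of the core idea, under assumptions of my own (a single "unsynchronized writes" count standing in for the model's consistency units): a replica accepts writes optimistically until an application-specified bound is reached, at which point it must synchronize, trading availability back for consistency. A bound of zero degenerates to strong consistency; an unbounded value is pure optimism.

```python
class BoundedReplica:
    def __init__(self, max_unsynced_writes):
        self.bound = max_unsynced_writes   # application-specified distance from strong consistency
        self.unsynced = 0

    def write(self, value):
        if self.unsynced >= self.bound:
            self.synchronize()             # must reach peers; the write blocks if they are unreachable
        self.apply_locally(value)
        self.unsynced += 1

    def synchronize(self):
        # placeholder: push pending writes to (and pull from) the other replicas
        self.unsynced = 0

    def apply_locally(self, value):
        pass                               # placeholder for the actual update
```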

Thursday, March 7, 2013

Practical Byzantine Fault Tolerance

This short post is about the paper Practical Byzantine Fault Tolerance by Miguel Castro and Barbara Liskov from MIT. Before reading this paper, one very important term should be understood: the Byzantine fault. Previously, when we mentioned fault tolerance, we meant resilience to node failure or non-response. In the real world, however, we also need to protect the system from malicious actions and mis-sent information or instructions. These problems are called Byzantine faults, so Byzantine fault tolerance means mechanisms that protect the system from these kinds of situations. The paper is named Practical because earlier papers on Byzantine fault tolerance were mostly devoted to the theory of the topic and assumed a synchronous environment, while this paper tackles it from a practical point of view, with an algorithm that solves the problem in an asynchronous system.

The algorithm works as follows:
1. a client sends a request to invoke a service operation to the primary
2. the primary multicasts the request to the backups
3. replicas execute the request and send a reply to the client
4. the client waits for f+1 replies from different replicas with the same result;
this is the result of the operation

The algorithm guarantees liveness and safety because all non-faulty replicas agree on a total order for the execution of requests despite failures.
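
A minimal sketch of the client-side rule in step 4, assuming the usual PBFT setting of n = 3f + 1 replicas: a result is accepted once f + 1 distinct replicas report it, since at most f of them can be faulty.

```python
from collections import defaultdict

def accept_result(replies, f):
    """replies: iterable of (replica_id, result). Returns the accepted result, or None."""
    votes = defaultdict(set)
    for replica_id, result in replies:
        votes[result].add(replica_id)         # count distinct replicas per result
        if len(votes[result]) >= f + 1:
            return result
    return None                               # keep waiting, or retransmit the request

print(accept_result([(0, "ok"), (1, "ok"), (2, "bad"), (3, "ok")], f=1))   # -> "ok"
```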

Tuesday, March 5, 2013

Scatter

Scalable Consistency in Scatter
by Lisa Glendenning, Ivan Beschastnikh, Arvind Krishnamurthy and Thomas Anderson

This paper introduces a new distributed and consistent key/value storage system called Scatter.

Firstly, the authors adapt the idea of decentralized peer-to-peer systems like Chord, Kademlia and such. However, those systems lack consistency, so they aim to propose a new system that is fault tolerant, consistent and planetary in scale. They structure the nodes into groups, and each group covers a fraction of the circular key space. Every group is managed by the Paxos protocol, which we studied earlier; there are "merge", "split", "migrate" and "repartition" operations defined on groups, and every group is connected with its predecessor and successor through individual nodes. Also, group size is kept small since Paxos' performance drops for large numbers of nodes. A two-phase commit protocol is used in the case of a multi-group operation; the participating group leaders must agree (commit) on the operation in order for the global picture to change. Storage is handled, concisely, as follows: the key space of each group is divided among the member nodes, and those nodes are called primaries for certain keys. Therefore, for each store operation, the value is forwarded to the primary node in the group and replicated among the others using Paxos. Another novelty they propose is their join operation: they differentiate join policies depending on whether the goal is resilience to group failure, throughput, or latency, and add the node to the most appropriate group with respect to that need.
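
To illustrate the key-space partitioning, here is a rough sketch (group boundaries and hash width are invented for illustration) of how a key might be mapped to the group whose range contains it; in Scatter, the group's Paxos-replicated state would then name the primary node for that key.

```python
import bisect, hashlib

RING = 2 ** 32
group_starts = sorted([0x00000000, 0x40000000, 0x80000000, 0xC0000000])   # start of each group's range

def key_to_point(key):
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:4], "big") % RING

def owning_group(key):
    point = key_to_point(key)
    return bisect.bisect_right(group_starts, point) - 1    # index of the covering group

print(owning_group("user:42"))
```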

I personally was not able to identify any major flaws, but I have a couple of concerns. First of all, they want their system to include millions of nodes, yet their group formation does not depend on the geographic placement of the nodes, which may create a case where each group's members are scattered around the world. This will bring too much overhead for Paxos. Another thing is that they sample k random groups as join candidates, and with a rather small k (=3), I don't understand how they can manage non-negligible improvements in resilience and latency.

All in all, solid ideas are presented in the paper, and they can be implemented provided the security issues are addressed.

Sunday, February 24, 2013

Spanner


Spanner: Google's Globally-Distributed Database

Spanner is a scalable, multi-version, globally-distributed, synchronously-replicated database, and the paper discusses parts of its design in varying levels of depth.

There is a lot in this paper, so I'll just try to hit the highs and lows, starting with the good parts. The TrueTime API solves a big problem when trying to order events in a distributed system, and as they mention several times, without it the central design of Spanner would be impossible. TrueTime is a sub-system used by Spanner that provides a global time for the current moment along with a confidence interval based on the worst-case clock skew. The nodes providing this service rely on either atomic clocks, which have a known drift rate and must be synchronized periodically, or GPS clocks, which have greater accuracy but are susceptible to communication failures. Given the interval nature of the times this system provides, there are moments when Spanner has to wait when resolving two conflicting requests, or after a transaction has been committed, but this tends to be a short interval on the order of ~10ms, and two transactions are very unlikely to touch the same rows in the database, just given the scale of the system itself. TrueTime is a cool system, and one I hope they open source.
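
A small sketch of the commit-wait idea (my own stand-in for TrueTime, with a made-up uncertainty bound): timestamps are intervals, and the system waits until the chosen commit timestamp is guaranteed to be in the past everywhere before making the transaction visible.

```python
import time

UNCERTAINTY = 0.007   # assumed worst-case clock uncertainty in seconds, not Google's figure

def tt_now():
    """Stand-in for TrueTime: returns (earliest, latest) bounds on the true time."""
    t = time.time()
    return t - UNCERTAINTY, t + UNCERTAINTY

def commit(transaction):
    _, latest = tt_now()
    commit_ts = latest                        # pick a timestamp no smaller than the true time
    while tt_now()[0] <= commit_ts:           # commit wait: until commit_ts is surely in the past
        time.sleep(0.001)
    return commit_ts                          # now safe to release locks / reply to the client
```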

Spanner itself provides a semi-relational model to the client. This is really cool: typically, distributed databases (in my limited experience reading about them) provide only a key-value store and push the burden onto the application to do any more advanced lookups. Spanner uses a trick involving the primary keys of a row to map to the other values, and uses interleaving of table data in directories to allow certain features of a relational database while still remaining distributed.

I wish this paper had listed more metrics about its performance and compared an application that uses Spanner against Bigtable or Megastore, both of which were mentioned a few times in the paper. Since this system was primarily designed to be used with Google's F1 advertising backend, it would have been nice to see some numbers on throughput in the new system versus the old (they have the new numbers), as well as application-level complexities that might have been resolved and system-level upkeep. Seeing numbers on user-facing latencies when leaders are failing or a data center gets knocked offline would also have been interesting. I don't doubt that Google has the numbers on these scenarios, I just wonder why they included the numbers they did.  


Sunday, February 17, 2013

Dynamo: Amazon’s Highly Available Key-value Store



Dynamo is a highly available key-value storage implementation for large-scale distributed systems. Dynamo relies on the following properties to be effective:

  • Primary key data storage for a simple read/write interface
  • Consistent hashing for replication
  • Object versioning for consistency

The use of primary-key data storage cuts down on the overhead of a relational database. Dynamo also implements Service Level Agreements between services in order to maintain latency requirements. Latency requirements are made to satisfy the 99.9th percentile rather than relying on a mean or median requirement. Amazon also acknowledges a trade-off between availability and consistency. In order to improve availability, eventual consistency is maintained with optimistic replication. This brings up the issue of conflict management, which is handled at read time with a "last write wins" or application-specified policy. This reduces the number of writes that are rejected because of network and node failures. Dynamo is essentially a "zero-hop" DHT in which a node stores all the information necessary to contact any node with which it needs to communicate. 

Dynamo exposes the functions get() and put() to the system which are executed using a quorum-like procedure, requiring a number of responses before propagating. Along with consistent hashing, Dynamo’s partitioning system assigns each node a set of virtual nodes which determine which set of data it is responsible for. These nodes then replicate the data they are responsible for on a number of successor nodes to ensure failure tolerance. Because of system failure, it is possible to have multiple versions of the same data in the system simultaneously. Dynamo uses vector clocks to determine versioning between these data replicas and perform conflict resolution. Dynamo detects inconsistent data through Merkle trees in which a tree traversal algorithm is implemented to determine if all keys are up-to-date. 
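
The vector-clock check is simple enough to sketch; here each version carries a map from node name to update count (names invented), and two versions are concurrent, and thus in need of reconciliation, when neither descends from the other.

```python
def descends(a, b):
    """Clock a descends from clock b if a has seen at least everything b has."""
    return all(a.get(node, 0) >= count for node, count in b.items())

v1 = {"node_A": 2, "node_B": 1}    # sibling versions produced during a partition
v2 = {"node_A": 1, "node_B": 2}

print(descends(v1, v2), descends(v2, v1))   # False False -> concurrent: return both to the reader
```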

This contributes to Dynamo’s use as a scalable and highly available distributed data storage mechanism.

P. Hunt et al., Zookeeper: Wait-Free Coordination in Internet-scale Systems, USENIX ATC, 2010

In this paper, the authors introduce ZooKeeper, a wait-free solution for coordinating large-scale distributed systems.

One of the most important requirements of a distributed system is coordination. We have been in touch with this since the first day we were introduced to multi-threaded programming: we need locks to prevent multiple threads from accessing the same piece of memory at the same time. That is just the simplest example, and there are countless other cases that require different degrees and different forms of coordination. As mentioned in the paper, Amazon Simple Queue Service focuses on queuing, and Chubby guarantees synchronization. All of these services were developed for specific coordination needs and implemented separately on the server side, which takes a large amount of unnecessary time and work. ZooKeeper, however, heads in another direction. It enables developers to implement their own primitives through an API backed by a coordination kernel that supports new primitives without requiring changes to the service core. Another very nice property of ZooKeeper is that it moves away from blocking primitives, which could slow down the whole system because of one or two exceptionally slow processes. This property, called wait-freedom, is the most important feature of ZooKeeper, and unsurprisingly it appears in the title of the paper. Furthermore, in terms of Herlihy's hierarchy, ZooKeeper implements a universal object and provides ordering guarantees for operations. Caching data on the client side and pipelining also help enhance ZooKeeper's performance. 
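
The kind of primitive developers build on this API is easy to sketch. Below is a rough version of the well-known lock recipe using the kazoo client library (assumed installed; the host and paths are placeholders, and a real implementation would block on the watch event instead of spinning). kazoo also ships a ready-made Lock recipe.

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

def acquire_lock(lock_dir="/myapp/lock"):
    zk.ensure_path(lock_dir)
    # ephemeral + sequential znode: vanishes if we crash, ordered by creation
    me = zk.create(lock_dir + "/lock-", ephemeral=True, sequence=True)
    my_name = me.rsplit("/", 1)[1]
    while True:
        children = sorted(zk.get_children(lock_dir))
        if children[0] == my_name:
            return me                           # lowest sequence number holds the lock
        # watch only the znode just before ours, avoiding a thundering herd
        predecessor = lock_dir + "/" + children[children.index(my_name) - 1]
        zk.exists(predecessor, watch=lambda event: None)   # simplified: should wait on the event
```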

Sunday, February 10, 2013

Summary D3S: Debugging Deployed Distributed Systems

NOTE: I didn't write this summary. I'm just posting it for Jake Sherin, because he doesn't have access to the blog at the moment.

D3S is a checker that uses predicates, specified by a developer, and checks these predicates while a system is running. The goal is to automate a lot of what would normally be a manual debugging process. In addition, it allows run-time checking to scale to large systems and makes the checking fault tolerant. D3S addresses three main challenges. The first is that the developer must be able to easily point out what properties to check but allow the checking to use multiple machines so large systems can be checked at run-time. To address this, checkers can be organized in a directed-acyclic graph. Each vertex is a computational stage, and sequential C++ functions can be written for checkers in this stage. State tuples are outputted by the checkers and flow to the downstream vertices in the graph. A vertex can be mapped to several verifier processes that run the checkers in parallel, so that D3S can use multiple machines to scale run-time checking. The second challenge is that D3S must be able to deal with failures of checking machines. To do this, D3S monitors verifier processes, so that when one fails, a new one can be started with the input of the failed process. The third challenge is that D3S should handle failures of processes being checked. This is solved by removing the failed processes from globally-consistent snapshots before checking them. A logical clock is used to order the exposed state tuples, and there is knowledge of which processes are in the snapshot at each timestamp.

D3S allows a developer to write predicates, which are compiled by D3S into a state exposer and a checking logic module. These can be inserted while the system is running, but violations that occurred prior to the insertion may be missed. In addition, the D3S run-time can partition computation across multiple machines without much effort from the developer. D3S supports stream processing, where vertices only transmit the differences in their output compared to the last timestamp and check the state incrementally. This is useful because consecutive timestamps may differ only slightly in state. Sampling is also supported by D3S in order to help developers further reduce the overhead.

Global snapshots are integral to D3S. First, a global timestamp is acquired, using a logical clock. When a process sends a message, it attaches its logical clock as well. A receiving process sets its logical clock to the maximum of its local logical clock and the attached clock, so that D3S run-time preserves happens-before relationships, can order all tuples in a consistent total order, and can construct snapshots.
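
That clock rule is the classic Lamport logical clock; a minimal sketch:

```python
class LogicalClock:
    def __init__(self):
        self.time = 0

    def tick(self):                  # local event, e.g. exposing a state tuple
        self.time += 1
        return self.time

    def stamp_outgoing(self):        # value attached to an outgoing message
        return self.tick()

    def on_receive(self, attached):  # take the max of local and attached, then advance
        self.time = max(self.time, attached)
        return self.tick()
```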

A predicate is technically modeled as a function defined over a finite number of consecutive snapshots, where a snapshot is the set of states of all live processes at time t. D3S needs a complete snapshot, though, in order to verify a predicate correctly. This is obtained by getting the state of all processes that constitute the system at a timestamp.

D3S was implemented on five different systems to check its effectiveness. It was implemented on PacificA, a semi-structured distributed system, on an MPS Paxos implementation, a Paxos-based replicated state machine service, on a web search engine, on the i3 Chord implementation, and on a BitTorrent client. They all basically show the same result, in that D3S detected bugs in a shorter period of time than it would have taken developers otherwise, which demonstrates that D3S is useful in a wide array of situations. The only issue is that the predicates must be useful in order to get useful results, so it is up to the developer to create them effectively.

D3S's performance, according to the data, varies depending on the number of threads, but even in the worst case it has just below 8% overhead. The main thing the data seems to stress is that the number of threads is vital to performance. Related work is discussed and compared with D3S; the other methods are replay-based predicate checking, online monitoring, log analysis, large-scale parallel applications, and model checking. The authors mention that they see replay as an essential complementary technology to online predicate checking. Ultimately, the goal would be to use online checking to catch violations, then enable time-travel debugging of a recent history with bounded-time replay. Future work focuses on combining online predicate checking and offline replay, as well as exploring ways for a developer to easily find bugs in data center applications that are composed of many systems.

Sunday, February 3, 2013

CoDoNS

In the paper The Design and Implementation of a Next Generation Name Service for the Internet, the authors Venugopalan Ramasubramanian and Emin Gün Sirer introduce CoDoNS, a next-generation domain name system. The motivation for improving the current domain name system comes from three problems: first, the current DNS is extremely vulnerable to denial-of-service attacks; second, delays in DNS name resolution are generally very long (thirty percent of web retrievals take more than one second for the lookup); and third, because of the lack of cache coherency, clients wait too long for updates to propagate, which prevents quick service relocation in response to changes in the environment. The newly designed Cooperative Domain Name System (CoDoNS) addresses these problems with three corresponding features: resilience to attacks, high performance, and fast update propagation. Structurally, CoDoNS follows a peer-to-peer overlay. This provides resilience to attacks because the system does not rely on a central server, which is the usual target of DoS attacks. Furthermore, CoDoNS applies analytically informed proactive caching, which increases the system's ability to adjust to a changing environment. This caching layer, called Beehive, replicates DNS mappings to match anticipated demand, which provides guaranteed strong performance. Intuitively, once the whole system is deployed it works like a giant distributed system in which each client participates locally. Notice that despite the advantages of this freshly designed system, it is impossible to replace the current DNS overnight; any takeover would have to happen piece by piece.

The Design and Implementation of a Next Generation Name Service for the Internet


Traditional DNS, according to this paper, is susceptible to denial of service (DoS) attacks, has long delays in name-address translation, and does not support fast updates. These problems stem fundamentally from the structure of the legacy DNS. This paper proposes as an alternative the Cooperative Domain Name System (CoDoNS), whose goal is to overcome such limitations. It consists of globally distributed nodes that self-organize to form a peer-to-peer network and is aimed at high lookup performance, resilience to denial of service attacks, and fast propagation of updates.

Very large-scale performance measurements suggest that CoDoNS can provide lower latencies for query lookups. Several features of CoDoNS are particularly attractive:

1. Dropping the hierarchical structure of DNS and decentralizing load dynamically onto distributed peers to reduce susceptibility to DoS attacks.

2. Better performance without higher expense by varying the replication level according to query popularity: more popular records are replicated more widely, while fewer replicas are kept for less popular records.

3. Better adaptation to flash crowds/bottlenecks. Peer nodes organize themselves and continuously adapt replication.

4. Proactive update propagation ensures that unexpected changes are circulated quickly and cached throughout the system.

Last but not least, this system has several compatibility issues with the current DNS infrastructure, which may inhibit its deployment in a real environment.

The Design and Implementation of a Next Generation Name Service for the Internet


V. Ramasubramanian, E. Sirer, The Design and Implementation of a Next Generation Name Service for the Internet, In Proc. of SIGCOMM 2004.

This paper describes the design and implementation of the Cooperative Domain Name System (CoDoNS), which leverages P2P network to improve lookup performance.

Traditional DNS systems suffer from high latency, load imbalance, slow update propagation, and implementation errors. They are vulnerable to attacks such as DoS attacks. The authors propose using Distributed Hash Tables to maintain a distributed Domain Name System called CoDoNS. CoDoNS decouples namespace management from the physical location of name servers in the network. Based on P2P networks, CoDoNS provides lower latency, as the experiments show. Eliminating the static query-processing hierarchy and shedding load dynamically onto peer nodes greatly decreases the vulnerability of CoDoNS to denial of service attacks. Self-organization and continuous adaptation of replication avoid bottlenecks in the presence of flash crowds.

Many distributed DNS systems have been released since the paper was published, such as BitTorrent-powered DNS systems. Some organizations such as The Pirate Bay use decentralized DNS to avoid the increasing control that governments have over the DNS root servers and to continue exchanging copyrighted files.

DNS Performance and the Effectiveness of Caching


DNS Performance and the Effectiveness of Caching
Jung, Sit, Balakrishnan, and Morris

The paper explores the current effectiveness of the DNS service through analysis of three large data sets collected in 2000.


This paper was all about understanding the three data sets the authors were able to collect, two from a vantage point on MIT's network and one on the KAIST network. I thought it was interesting to see two traces from the same network a year apart, since nearly the same demographic is present and you can see the evolution of their use of the service. While the authors don't make much of a point of this, there was nearly a doubling of DNS lookups over an 11-month period. It would be interesting to see a more long-term study of usage patterns based on this, but I digress. The authors have two big takeaways. First, retransmissions and lookups that get no answer cause a substantial portion of DNS traffic: they mention that 63% of the DNS packets were generated by lookups that obtain no answer, even though these lookups are only between 3.1% and 4.9% of all lookups. Second, they look at the effectiveness of caching, noting that the popularity of the names requested follows a Zipf distribution, so there is a long tail of names for which caching has no benefit. Looking at the TTL values of DNS entries, they find there isn't much benefit in having a TTL longer than about 10 minutes: most of the benefit comes in the first few minutes, and cache performance is not improved by a longer value. The benefits of sharing a cache between people also level out quite quickly; once 6 or 7 people are using the same cache, adding another person doesn't help much.
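
The TTL finding is easy to reproduce in spirit with a back-of-the-envelope simulation; the Zipf exponent, query rate and trace length below are made up rather than taken from the paper, but the shape of the result (hit rate flattening after roughly ten minutes) is the point.

```python
import numpy as np

rng = np.random.default_rng(0)

def hit_rate(ttl_seconds, n_queries=200_000, query_interval=0.25):
    names = rng.zipf(1.9, size=n_queries)        # popular names get small integers
    expires, hits = {}, 0
    for i, name in enumerate(names):
        now = i * query_interval
        if expires.get(name, -1.0) > now:
            hits += 1                            # answer still cached
        else:
            expires[name] = now + ttl_seconds    # miss: resolve and cache
    return hits / n_queries

for ttl in (60, 600, 3600, 86400):
    print(ttl, round(hit_rate(ttl), 3))
```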

The one thing that bugged me about this paper was that they didn't propose any changes to DNS to help improve performance. They had a great data set to use for simulations of changes, and only used it for the caching experiments. While that was an interesting experiment, its results are not something you can force on people: a website trying to load-balance servers will still set a low TTL even if it knows this makes DNS caching ineffective. An experiment on modifying DNS's retransmission timeout in order to alleviate some of the redundant traffic would have had great implications.

This study would be interesting to see redone now. With DNS prefetching as a feature in browsers, the number of queries per user has to have exploded since the time of this paper's publication. I would expect the latency numbers to go up even further as a result.

Sunday, January 27, 2013

Designing a DHT for low latency and high throughput


Traditionally, several problems constrain the design of DHTs, e.g. the link capacity of nodes, the physical distance between nodes, and network congestion/packet loss. This paper introduces a distributed hash table aimed at lowering latency and improving throughput, and it addresses several important issues concerning the DHT's design.

Firstly, for latency reduction, iterative and recursive lookups are compared. An iterative lookup queries each successive node in the path; although it can detect node failures, it has to wait for each response before the next operation. In comparison, a recursive lookup takes fewer messages and less network traffic, since each node forwards the query directly to the next node. Secondly, proximity neighbor selection is also important for reducing latency: each routing-table entry may be filled by any node in a range of the ID space, so a nearby node can be chosen. Lastly, latency is related to the storage policy (erasure coding vs. replication): with coding, a block is split into l fragments of which any m suffice to reconstruct it, so a read is bounded by the most distant of the closest m fragments, whereas with replication the reader can fetch from the single closest replica.
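
The storage-policy point can be made with a couple of lines (the latencies below are arbitrary example values in milliseconds): an erasure-coded read is bounded by the slowest of the closest m fragments, while a replicated read can use the single closest copy.

```python
fragment_latencies = [12, 18, 25, 40, 95, 110, 150]   # one value per node holding a fragment/replica

def coded_read_latency(latencies, m):
    return sorted(latencies)[m - 1]      # must wait for the farthest of the m closest fragments

def replicated_read_latency(latencies):
    return min(latencies)                # any single replica suffices

print(coded_read_latency(fragment_latencies, m=4))    # 40
print(replicated_read_latency(fragment_latencies))    # 12
```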

As for high throughput, the biggest difficulty is that data is spread over a large set of servers. This paper introduces an alternative transport protocol, STP (Striped Transport Protocol), which lets a node send data directly to other nodes while managing congestion across all of its transfers in a single instance; this achieves higher throughput and lower latency than maintaining many TCP connections.

design DHT with low-latency and high-throughput


Designing a DHT for low latency and high throughput

Frank Dabek, Jingyang Li, Emil Sit, James Robertson, M. Frans Kaashoek, Robert Morris
MIT CS and AI Laboratory

As this paper points out, designing a wide-area DHT with high throughput and low latency is a challenge. The paper explores a range of solutions that existing systems employ and derives from them some new techniques, including the use of latency predictions based on synthetic coordinates, efficient integration of lookup routing and data fetching, and a congestion control mechanism suitable for fetching data striped over large numbers of servers.

The paper finds that existing work makes the lookup of keys in DHTs scalable, low-latency, fault-tolerant, and secure, but has not considered as much how to achieve the same for reading and storing data. The paper makes the following contributions.

For low latency, the paper first compares the recursive and iterative lookup algorithms. Unlike the iterative algorithm, which waits for each response before proceeding, the recursive algorithm has each node in the lookup path forward the query directly to the next node, and when the query reaches the key's predecessor, the predecessor sends its successor list directly back to the originator. But if a recursive lookup elicits no response, the originator has no information about what happened or how to retry in a way that is more likely to succeed. Sometimes a simple retry may work, but in the extreme case only the originator knows a problem exists. So in practice, DHash++ uses recursive lookups by default and iterative lookups after persistent failures.

Then the paper discusses the trade-off between replication and coding. While replicated data allows for low-latency reads because there are many choices for server selection, coded data reduces bandwidth consumption at the expense of increased read latency, because the lower bound on the total time to find and fetch a block is the time to the most distant of the closest set of fragments required to reconstruct the block.

Lastly, the paper explains that the final lookup steps can take little advantage of proximity routing, because as the lookup approaches the target node, the number of choices that are closer to it decreases, so the average per-hop latency increases. However, the lookup can stop early at any predecessor whose successor list contains enough nodes to reconstruct the block. The trade-off is increased data-fetch latency, as stopping early decreases the number of successors (and thus fragments) that the originator can choose from.

In terms of high throughput, the paper introduces an integrated transport protocol called STP, which borrows many ideas from TCP but implements a different retransmit timer and policy. Because lost packets have a large negative impact on DHash++ throughput (each block transfer is preceded by a multi-RPC lookup), under-predicting the latency of an RPC is costly to the advancement of the control window. STP uses Vivaldi latency predictions to help choose the retransmit timeout interval for each RPC. In addition, it keeps a moving average of the error Vivaldi makes on each successful RPC's round-trip time, and adds a 15 ms constant to avoid retransmissions to other virtual nodes on the same host. Unlike TCP, when STP recognizes a timeout it doesn't re-send the RPC right away; instead it gives DHash++ a chance to re-send the RPC to a different destination, to avoid wasting time sending RPCs to nodes that have crashed or have overloaded access links.
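
A sketch of that retransmit-timer logic as I read it (the smoothing factor is my own assumption; the 15 ms slack is the constant the paper mentions):

```python
class RtoEstimator:
    def __init__(self, alpha=0.125, slack=0.015):
        self.avg_error = 0.0
        self.alpha = alpha          # weight for the moving average (assumed value)
        self.slack = slack          # ~15 ms, avoids retransmits to virtual nodes on the same host

    def rto(self, predicted_rtt):
        # timeout = Vivaldi prediction + average past prediction error + constant slack
        return predicted_rtt + self.avg_error + self.slack

    def observe(self, predicted_rtt, actual_rtt):
        error = abs(actual_rtt - predicted_rtt)
        self.avg_error = (1 - self.alpha) * self.avg_error + self.alpha * error
```

On a timeout, instead of retransmitting immediately, the RPC would be handed back to DHash++ so that a different destination can be chosen.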

Overall this paper discusses a series of design decisions and options faced by DHTs. What interests me most is that the performance improvements come not from optimizing the lookup algorithm itself, but from making different trade-offs in various scenarios with the help of simulations and measurements.

Summary of "SCRIBE: A large-scale and decentralized application-level multicast infrastructure"


1. Title: SCRIBE: A large-scale and decentralized application-level multicast infrastructure

Authors: Miguel Castro, Peter Druschel, Anne-Marie Kermarrec and Antony Rowstron

2. In this paper, the application-layer IP multicast tool Scribe, along with its supporting framework Pastry, are discussed in terms of their implementation, advantages and disadvantages, and performance.

3. Scribe and Pastry, sitting at the application level, avoid the limitations network-level multicast faced in the past: namely, the lack of wide-scale deployment (and any motivation for such deployment), and the inability to track group membership. This system works in a very similar way to previously discussed distributed hashing schemes, including the focus of our project, Kademlia. One clever design idea the authors note is the flexibility of the reliability guarantees that can be built on top of Scribe. Scribe is based on best-effort delivery, much like its TCP base; however, the application can be adjusted to provide better reliability guarantees if necessary.

4. The experiment they run seems infeasible without some sort of cap or bottleneck-elimination routine, which they apply later. Having thousands of children tables or children-table entries would not be feasible on an end-user machine, and bringing down such a well-connected node would not be an option either. Thus, it seems that the bottleneck elimination is necessary in their design. The authors also mention the infeasibility of the software with small networks; however, it's unlikely that this software would be adopted en masse, and smaller networks (for instance, a private chat client) would be extremely inefficient or even unusable due to the stresses and overhead imposed by Scribe on small networks. Finally, although the paper lists a variety of other, related projects, the authors make no attempt to discuss the shortcomings of Scribe or the potential advantages of their competitors.

5. A system of IP multicast is still relevant today. While certain applications have made attempts at solving this problem, there are still considerable reliability and performance concerns for application-layer multicast. The age of this paper (about a decade), coupled with ISPs, might stifle development of a Scribe-like system for multicast. However, similar issues of reliability and large-scale deployment, as presented in the paper, are very real and relevant to this day.

Friday, January 25, 2013

better DHT, Scribe

In the first paper, Designing a DHT for low latency and high throughput, the authors give us an idea of how to build a Distributed Hash Table (DHT for short) with lower latency and higher throughput. Because of the fast-growing need for distributed systems to handle large amounts of data, a better distributed hash table for fetching and storing data becomes more and more desperately needed. When considering such a dictionary structure, a couple of aspects should be considered. Latency, which indicates the time it takes to process one request, and throughput, which indicates the ability to handle a large number of requests simultaneously, are definitely among the most important ones. In this paper, the authors present several ways to reduce latency, including changing the data layout, switching from iterative lookups to recursive ones, choosing nearby nodes as routing table entries, and integrating routing and fetching. Personally, I like the idea of switching from iterative to recursive the most, because it might eliminate half of the latency by letting intermediate nodes forward the lookup immediately rather than reporting back to the originator at every hop. To me, it seems like a way of thinking that might help reduce latency in many other systems. Regarding higher throughput, this paper introduces the Striped Transport Protocol (STP) to solve the throughput problems that come with traditional TCP connections by allowing nodes to put and get data directly to and from other nodes, instead of routing data through multiple overlay hops.

The second paper describes SCRIBE: A large-scale and decentralized application-level multicast infrastructure, along with Pastry, a p2p location and routing substrate upon which Scribe is built; together they form a robust, self-organizing overlay network. Scribe lets an individual node create a group, lets nodes join groups created by others, and, most importantly, lets nodes with appropriate credentials multicast messages to all members of a group. However, one property must also be noticed: Scribe specifies no particular delivery order, and this lack of restriction leaves room to optimize the multicast process. Concretely, the Scribe API consists of create, join, leave, and multicast. With the help of Pastry, Scribe manages group creation, joining, and multicasting. One important property of Pastry and Scribe is that both are fully decentralized, meaning all decisions are based on local information. This guarantees the independence of individual nodes and fits the system well because all connections are peer-to-peer. It is also through this model that scalability and reliability are achieved.
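
A toy sketch of the mechanism behind those calls, with the node wiring and names invented for illustration: joins install per-group children tables along the route toward the group's root, and a multicast simply flows down those tables (a forwarder relays even if it is not itself a member).

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.children = {}                    # group -> set of child Nodes

    def handle_join(self, group, child, next_hop_toward_root=None):
        first_time = group not in self.children
        self.children.setdefault(group, set()).add(child)
        if first_time and next_hop_toward_root is not None:
            next_hop_toward_root.handle_join(group, self)   # propagate only the first join

    def multicast(self, group, message):
        print(f"{self.name}: forwarding {message!r} for {group}")
        for child in self.children.get(group, ()):
            child.multicast(group, message)

root, relay, member = Node("root"), Node("relay"), Node("member")
relay.handle_join("sports", member, next_hop_toward_root=root)
root.multicast("sports", "score update")      # root -> relay -> member
```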


Monday, January 21, 2013

Week 3


Implementing Remote Procedure Calls A. Birrell and D. Nelson

This paper describes the overall structure of the RPC mechanism.

In this paper the authors introduce remote procedure calls. An RPC can be seen as an extension of an ordinary procedure call to a procedure executed in another address space, possibly on another computer. RPC transfers control and data across a communication network. It can be provided in a high-level language, and the programmer does not need to worry about the details of remote interaction. RPC is widely used in many areas.
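
A tiny illustration of that "remote call looks like a local call" idea, using Python's standard xmlrpc modules rather than the Cedar RPC system the paper describes:

```python
from xmlrpc.server import SimpleXMLRPCServer
import threading
import xmlrpc.client

def add(a, b):                        # the "remote" procedure
    return a + b

server = SimpleXMLRPCServer(("localhost", 8000), logRequests=False)
server.register_function(add, "add")
threading.Thread(target=server.serve_forever, daemon=True).start()

proxy = xmlrpc.client.ServerProxy("http://localhost:8000/")
print(proxy.add(2, 3))                # reads like a local call; the stubs hide the network
```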

The many faces of publish/subscribe P. Eugster, P. Felber, R. Guerraoui and A-M. Kermarrec

In this paper the authors discuss about the publish/subscribe interaction scheme and its variants, as well as some alternative interaction schemes.

The authors identify the common denominators of publish/subscribe schemes, namely time, space and synchronization decoupling of subscribers and publishers. They survey several alternative interaction schemes such as RPC and message queuing, and conclude that publish/subscribe supports time, space and synchronization decoupling better than any of those methods. They also compare two widely used variants, topic-based and content-based. In both, publishers and subscribers are loosely coupled, and this provides better scalability.
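
A minimal in-memory, topic-based sketch makes the decoupling concrete: publishers and subscribers never reference each other, only the broker and a topic name (all names here are invented). A real system would also queue messages and deliver them asynchronously to get the time and synchronization decoupling the paper emphasizes; this toy only shows the space decoupling.

```python
from collections import defaultdict

class Broker:
    def __init__(self):
        self.subscribers = defaultdict(list)       # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in self.subscribers[topic]:   # fan out to everyone on the topic
            callback(message)

broker = Broker()
broker.subscribe("stock/ACME", lambda m: print("subscriber got", m))
broker.publish("stock/ACME", {"price": 42})
```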

Security might be a disadvantage of the publish/subscribe scheme. It is possible for a client to fool the middleware into delivering information it is not authorized to receive.

Implementing Remote Procedure Calls


This paper deals with the implementation details of RPC on a test infrastructure at Xerox PARC. It incorporates the various design principles defined as a part of RPC's theoretical description by B. J. Nelson and gives relevant explanations of why the parameters are defined in the way they are and how they benefit distributed computing.

Primarily aimed at easing the development of distributed applications, RPC is based on simplicity, efficiency and generality. Simplicity is mainly observed at the programmer's end, for whom executing a remote functionality is equivalent to making a local function call. The efficiency lies in the fact that the simplicity of procedure calls guarantees rapid over-the-wire communication. Finally, the generality is exhibited in two aspects: firstly, in the fact that RPC is essentially a function call, which is an omnipresent phenomenon in every computing environment, and secondly, in the fact that the design is not tightly bound to the procedure-call mechanism and can be implemented over an underlying message-passing system, if need be. Another very important thought maintained throughout the design of RPC is the closeness to local procedure calls. This deliberate insistence sets correct expectations for programmers implementing distributed applications, both at the caller and the callee ends.

In view of maintaining the simplicity of the design and implementation, the part where bulk data transfers require twice the number of packets (an acknowledgement for every packet but the last) seems like an overhead in today's scheme of things. As the authors pointed out, typically only 10% of the network's bandwidth was used at that time, easily accommodating the additional traffic generated in the scenario mentioned above. Keeping in mind that bandwidth today is many times what it was then, the sheer amount of data exchanged today would probably result in significant bandwidth being used for otherwise avoidable information.

Today, message passing and RPC form the main methods of communication. They can be analogously compared to UDP and TCP respectively. While RPCs ensure synchronous communication, and hence are inherently reliable, message passing is asynchronous in nature and hence unreliable. On the other hand, concurrency is very easy with message passing, while it can be tedious to implement in the case of RPCs, as the authors themselves mention, e.g. in multicast- and broadcast-like scenarios.

Tuesday, January 15, 2013

Week 2

Rethink the design of the Internet

The authors first explain the main idea of the end-to-end argument and the advantages brought by such a network design paradigm: moving functions "up" from the core and "out" to the edge nodes gives us a less complex core network and facilitates the construction of more innovative and reliable application software.

Then the paper lists the motivations for redesigning the network: end users may not trust each other, more demanding applications, the inclusion of third parties, etc. For these issues, technical and non-technical solutions are offered. Some of them do not violate the end-to-end arguments, while others add more control in the core network, for example firewall systems and traffic or spam filters. The paper also evaluates the functions served by non-technical mechanisms, such as the law.


Week 2 Papers


Rethinking the design of the Internet: The end to end arguments vs. the brave new world by David D. Clark and Marjory S. Blumenthal

In this paper the authors discuss the possibility that the end-to-end principle could be compromised.

In this paper, which is more than 10 years old, the authors discuss some scenarios in which implementing an application-level function in the core of the network is necessary. Some of the arguments have since proved true. For example, as the authors predicted, the Internet today is an untrustworthy world, and manufacturers such as Cisco have been selling devices such as the Cisco Spam & Virus Blocker. CDNs serve content to end users with high availability and high performance; they serve a large fraction of Internet content today, so the delivery of data is accelerated.

On the other hand, some other problems have already been solved. For example, today ISPs are becoming tunnels for data, and ISP service differentiation is almost unnoticeable.

This paper successfully foresaw a more sophisticated Internet in which application-level functions are implemented in the core of the network.

Consensus Routing: The Internet as a Distributed System by J. John, E. Katz-Basset, A. Krishnamurthy, T. Anderson and A. Venkataramani

This 2008 paper proposed to improve the consistency of the network to reduce routing loops and blackholes.

In this paper the authors propose running a distributed coordination algorithm to ensure that a route is adopted only after all dependent routers have agreed upon a globally consistent view of global state, in order to reduce loops and blackholes. In the meantime, if a safe route has not yet been found, a small fraction of the packets are heuristically forwarded to find one. This is called the transient mode of operation.

One concern of this paper in my eyes is that if the network is not very reliable, the mechanism might have to update the consistency information very frequently, and the cost for sending and receiving these packets might be high.

This paper reminds me of OpenFlow, which is an open standard that enables researchers to run experimental protocols in the campus networks we use every day. We could leverage OpenFlow to try more optimized protocols.

Monday, January 14, 2013

week 2

The two papers this week describe two important topics in networking: a discussion of the appropriateness of the end-to-end argument after all these years of network development, and consensus routing as a solution to the problem of lack of consistency.

On Monday, we read the paper End-to-End Arguments in System Design, which states that functions should be placed at the ends of the network, or as "up" as possible, for the completeness and correctness of the function, and it is amazing that it has guided the internet for more than 30 years. However, with the growth of the internet, the functionality and environment have changed a lot, and pure end-to-end arguments run into trouble in some cases. In the paper Rethinking the Design of the Internet: The end-to-end arguments vs. the brave new world, the authors argue that in the new world, instead of strictly following the end-to-end argument, we should carefully balance the pros and cons of the placement of each individual function. Reasons that force us to rethink the placement include the fact that users no longer trust each other, the complexity of functions, and the interposition of third parties like governments. As a result, we need to rethink our principles for function placement.

In the second paper, Consensus Routing: The Internet as a Distributed System, the authors introduce consensus routing as a solution to the lack of consistency that results from our internet routing protocols' preference for responsiveness over consistency. Furthermore, consensus routing also takes the goal of liveness into account through two different modes of packet delivery: a stable mode, which ensures consistency because a route is adopted only after consensus among all routers, and a transient mode, in which heuristic forwarding proceeds after link failures. Transient forwarding schemes include deflection, detouring, and backup routes. At the end of the paper, the evaluation section shows us the effectiveness of this consensus routing method in maintaining connectivity, which turns out to be surprisingly good.

Sunday, January 13, 2013

Design Philosophy of DARPA Protocols Summary- Week 2

1. D. Clark, The design philosophy of the DARPA Internet Protocols, In Proc. of ACM SIGCOMM, August 1988

2. This paper talks about the historical developments and requirements of TCP/IP protocols.

3.  The paper focuses on the historical goals of the internet. They are (in order of importance): 1. communication must continue despite loss of gateways, 2. support for multiple types of service, 3. the architecture must accommodate multiple types of networks, 4. distributed management, 5. cost effectiveness, 6. ease of host attachment, and 7. resource accountability. The paper claims that the internet does a good job at the first couple of goals, but there is more work to be done for the later ones. I thought it was interesting to note that TCP and IP were designed as a single layer and were later split into two layers to be more flexible. I also found it interesting that the internet had much more of a military consideration than I had originally realized.  

4. As this paper is not a scientific or technical write-up, it does not have any flaws specific to its methodology. I found the paper to be generally well written and insightful toward its stated purpose. I would have liked the author to talk more about the internal arguments the designers of TCP/IP had and how those shaped the original specifications. I also would have liked the author to focus more on the government's intervention in the development of the protocols and how that affected the design considerations. Finally, I would have liked the discussion to be more technical so that the reasoning behind some of the decisions would have been clearer. 

5. The history of the development of the internet is very relevant to today's internet. Looking at how the early requirements of the internet changed over time and how that affected the design decisions is extremely relevant today. In the past several years, the types of devices and the mediums they use to connect to the internet have completely changed. Traditionally, computers were large and connected to the internet through a wire. Increasingly today, computers are small, mobile, and connected to the internet through less reliable connections on radio networks. Changing the protocols to best serve these models is important to keep performance as high as possible given the infrastructure we are utilizing. 

<3
Jon

Wednesday, January 9, 2013

Summary for "Lessons from Giant-Scale Services"


Summary:
In this paper the author talks about lessons learned from giant-scale services. This includes: the model, high availability and its different metrics, and updates and growth.

Important Ideas/Strength:
I believe the focus on availability metrics and their importance to giant-scale services was key. Beyond explaining each metric in detail, the author emphasizes that uptime is not enough for these types of systems; yield and harvest are more important. A figure such as 0.9999 uptime could be misleading because the 0.0001 downtime could fall during the peak period.
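
A quick worked example of the two metrics (numbers made up): a service can look nearly perfect by the uptime/yield measure while quietly returning answers computed over only part of the data.

```python
queries_offered, queries_completed = 1_000_000, 999_000
partitions_total, partitions_available = 100, 97       # e.g. 3 of 100 data partitions unreachable

yield_ = queries_completed / queries_offered            # 0.999 -- looks like near-perfect uptime
harvest = partitions_available / partitions_total       # 0.97  -- each answer reflects only 97% of the data
print(yield_, harvest)
```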

Flaws/Advance Since Publication:
The author describes upgrades and changes to these services as controlled failures to the service. It could be that some of the cases weren't avoidable at the time. Nowadays, these failures could be avoided. With the advances in load balancing solutions and virtualization of servers/software these failures can be reduced if planned properly.

Tuesday, January 8, 2013

Week1

I'd like to explain an interesting phenomenon that we may have seen in daily life and that is related to topics in the paper "Lessons from Giant-Scale Services". The paper notes that in a large-scale application, read operations vastly outnumber writes, so companies may deploy the databases serving write operations in a different place from those serving reads. A load balancer picks the server for each specific request when routing.

Sometimes when we finish editing a profile on a site, the profile page does not immediately show the change when we view it. This can be confusing and leads to repeated edits, and the reason is related to the infrastructure design just mentioned. Suppose that the servers for the profile-editing page (write operations) are on the west coast and those for page viewing are on the east coast. When we finish editing and then want to view the result, we are still routed to the east coast, and because of replication delay, this weird thing happens.

Facebook's way around this is to store the time of the write operation in a browser cookie. When the load balancer finds this timestamp in the cookie, it compares the write time with the current time. If the gap is big enough that replication has definitely finished, the server on the east coast is used as usual. If not, we are routed to the west coast.
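
A small sketch of that load-balancer check, with invented names and an assumed replication lag (not Facebook's real numbers):

```python
import time

REPLICATION_LAG = 20.0    # seconds; assumption for illustration

def on_write(cookies):
    cookies["last_write_ts"] = str(time.time())        # set by the web tier on every write

def pick_datacenter(cookies):
    last_write = float(cookies.get("last_write_ts", 0))
    if time.time() - last_write < REPLICATION_LAG:
        return "west-coast (writer)"    # replica may still be stale; read where we wrote
    return "east-coast (replica)"       # enough time has passed for replication to finish
```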

Week 1


In Lessons from Giant-Scale Services, Brewer addresses several important methods for evaluating the availability of giant-scale distributed systems from a generalized perspective. First, harvest and yield, instead of traditional uptime, are introduced as measurement metrics. Second, the DQ principle is used to analyze fault impact and to evaluate system designs (replication vs. partitioning).

In the paper Experiences with CoralCDN, drawing on the operational practice of CoralCDN, Freedman shows how this decentralized network system (proxies, DNS, indexing) can reduce load on origin web servers. This distributed system typically serves terabytes of data and tens of millions of HTTP requests per day. In addition to high availability, scalability, fault tolerance, etc., resource management is another big challenge in distributed systems: with limited bandwidth, performance, and storage, minimized client latency is always desired. Also, security and resource protection cannot be compromised.

Donghan

In the paper on CoralCDN, Freedman shares the design, successes and challenges of the CoralCDN system, partly in consideration of building and/or managing similar CDN systems in the future.
He mentions that the system was not suited for accessing unpopular content and that an eviction policy was used to evict content from proxies. I would have liked him to share more on this policy: what benchmark is used to determine that content is unpopular, and how long such content is cached before eviction.
The paper on giant-scale services basically shares the ideas explained in chapter 1 of the course textbook: that is, scalability, availability and fault tolerance are required for an efficient distributed system.
Osezua.


Week 1

Real world distributed systems are complicated pieces of machinery that focus on making many moving parts work together.

The papers were both about high performance systems and their design. The Brewer paper focuses on the more practical implementation issues of these systems, while the Freedman paper focuses on a dynamic CDN implementation.

I would have liked the Brewer paper to go into more detail on each of the subjects it talked about. It is very rare to see how these systems are designed, and the more detail the better. The Freedman paper, in my opinion, was a little dry compared to the Brewer paper and could have been shorter. I did like how thorough they were; it just seemed a little too academic in tone to me (which helped make the paper dry).

For this post, I want to focus on the way both papers talk about saturation. In the Brewer paper, resource saturation is framed as a tradeoff in terms of D (data per query) and Q (queries per second), and performance degradation is phrased as a tradeoff between these two. In the Freedman paper, performance degradation is handled from a more simplistic view: cutting off bandwidth after a fixed limit (10g). This shows that the spectrum for dealing with performance degradation can range from complicated schemes to simple ones and still be effective.

<3
Jon

Week 1

Hey everybody,

I decided to write about some trends in these first two papers that seem to be generally important concepts in distributed systems.

These were laid out plainly in E. Brewer, Lessons from Giant-Scale Services, while M. Freedman, Experiences with CoralCDN: A Five-Year Operational View was more of a case study highlighting similar aspects. A primary concern of these systems is reliability, which takes the form of fault tolerance. There is much discussion of how to deal with failing nodes and even disasters in these systems. These same issues are a focus of the CoralCDN reflection, as it must deal with websites in flash-crowd scenarios that may not be able to respond to requests. Another obvious focus is minimizing client latencies, or simply making things faster, a primary goal of distributed computing. Also noted in both papers was minimizing internal bandwidth usage and utilizing resources as effectively as possible. This was discussed via the DQ principle, as bandwidth correlates more than I/O seeks with overall performance capacity.