Designing a DHT for low latency and high throughput
Frank Dabek, Jingyang Li, Emil Sit, James Robertson, M. Frans Kaashoek, Robert Morris
MIT CS and AI Laboratory
As this paper points out, designing a wide-area DHT with high throughput and low latency is a challenge. The paper explores a range of solutions that existing systems employ and derives from them some new techniques, including the use of latency predictions based on synthetic coordinates, efficient integration of lookup routing and data fetching, and a congestion control mechanism suitable for fetching data striped over large numbers of servers.
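To make the synthetic-coordinate idea concrete, below is a minimal Python sketch of a Vivaldi-style spring-relaxation update; the function names, the 2-D coordinate space, and the damping factor are illustrative assumptions rather than the paper's implementation.

    import math

    def vivaldi_update(local, remote, measured_rtt_ms, delta=0.25):
        # One spring-relaxation step: move the local coordinate along the line
        # between the two nodes so that coordinate distance approaches the
        # measured round-trip time.  `delta` (an assumed damping factor) limits
        # how far the coordinate moves on a single measurement.
        dist = math.dist(local, remote)
        error = measured_rtt_ms - dist
        if dist == 0:
            direction = [1.0, 0.0]   # arbitrary direction if coordinates coincide
        else:
            direction = [(l - r) / dist for l, r in zip(local, remote)]
        return [l + delta * error * d for l, d in zip(local, direction)]

    def predicted_latency_ms(a, b):
        # The predicted RTT between two nodes is simply their coordinate distance.
        return math.dist(a, b)

    # Example: a node measures a 50 ms RTT to a node currently placed 20 "ms"
    # away, so it moves away from it to stretch the predicted distance.
    print(vivaldi_update([0.0, 0.0], [20.0, 0.0], measured_rtt_ms=50.0))

With coordinates like these, a node can estimate the round-trip time to any node it has heard about without ever having probed it directly, which is what the techniques below rely on for proximity choices and retransmit timers.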
The paper finds that existing work already makes the lookup of keys in DHTs scalable, low-latency, fault-tolerant, and secure, but has paid much less attention to achieving the same properties for reading and storing data. The paper makes the following contributions.
For low latency, the paper first compares recursive and iterative lookup algorithms. In an iterative lookup the originator waits for each hop's response before contacting the next node; in a recursive lookup each node in the lookup path forwards the query directly to the next node, and when the query reaches the key's predecessor, that predecessor sends its successor list directly back to the originator. However, if a recursive lookup elicits no response, the originator has no information about what happened or how to retry in a way that is more likely to succeed. A simple retry may sometimes work, but in the extreme case only the originator knows a problem exists. In practice, DHash++ therefore uses recursive lookups by default and falls back to iterative lookups after persistent failures.
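The contrast between the two message patterns can be sketched schematically as follows; the rpc, is_predecessor_of, closest_preceding_node, and deliver_successor_list helpers are hypothetical placeholders for Chord-style routines, not DHash++'s actual interfaces.

    def iterative_lookup(originator, key):
        # Iterative style: the originator itself contacts every hop and waits
        # for the reply before proceeding, so when an RPC times out it knows
        # exactly which hop failed and can retry around it.
        node = originator
        while not node.is_predecessor_of(key):
            node = originator.rpc(node, "closest_preceding_node", key)
        return originator.rpc(node, "get_successor_list")

    def recursive_lookup(node, key, originator):
        # Recursive style: each hop forwards the query straight to the next hop;
        # only the key's predecessor replies, and it replies directly to the
        # originator.  A silent failure anywhere in the chain is invisible to
        # the originator, which is the weakness described above.
        if node.is_predecessor_of(key):
            node.rpc(originator, "deliver_successor_list", node.successor_list)
        else:
            node.rpc(node.closest_preceding_node(key), "recursive_lookup", key, originator)

The recursive pattern saves roughly half the one-way trips per hop, which is why it is the faster default despite its weaker failure information.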
Then the paper discusses the trade-off between replication and coding. While replicated data allows low-latency reads, because the originator has many choices for server selection, coded data reduces bandwidth consumption at the expense of increased read latency: the lower bound on the total time to find and fetch a block is the time to the most distant of the closest set of fragments required to reconstruct the block.
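The latency side of this trade-off is essentially an order-statistics argument, which the short sketch below illustrates; the RTT values and the 3-of-7 coding parameters are made up for the example and are not the paper's measured numbers.

    def replication_read_latency(replica_rtts_ms):
        # With whole-block replicas, the originator can fetch from the single
        # closest copy.
        return min(replica_rtts_ms)

    def coded_read_latency(fragment_rtts_ms, m):
        # With erasure coding, the block cannot be reconstructed until the m-th
        # closest fragment arrives, so the m-th smallest RTT bounds the read.
        return sorted(fragment_rtts_ms)[m - 1]

    rtts = [12, 30, 45, 80, 95, 110, 200]        # illustrative RTTs in ms
    print(replication_read_latency(rtts))        # 12: the closest replica wins
    print(coded_read_latency(rtts, m=3))         # 45: gated by the 3rd-closest fragment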
Lastly, the paper explains that the final lookup steps can take little advantage of proximity routing: as the lookup approaches the target key, the number of candidate next hops that are closer to it shrinks, so the average latency of each remaining hop increases. However, the lookup can stop early at any predecessor whose successor list already contains enough nodes to reconstruct the block. The trade-off is data-fetch latency, because stopping early decreases the number of successors (and thus fragments) that the originator can choose from.
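A rough sketch of that stopping rule follows; the function and parameter names are illustrative, and the exact condition DHash++ uses may differ.

    def can_stop_lookup(hop_id, successor_ids, key, fragments_needed, ring=2 ** 160):
        # A hop may terminate the lookup if its successor list already contains
        # enough of the key's successors (the nodes holding the fragments) to
        # reconstruct the block.  Stopping with only the bare minimum of holders
        # leaves the originator no freedom to pick the closest fragments, which
        # is the data-fetch latency cost mentioned above.
        def clockwise(a, b):
            # Clockwise distance from a to b on the identifier ring.
            return (b - a) % ring
        holders = [s for s in successor_ids
                   if clockwise(hop_id, s) >= clockwise(hop_id, key)]
        return len(holders) >= fragments_needed

    # Example: for a block coded so that any 7 fragments suffice
    # (fragments_needed=7), a hop can answer the lookup itself only if at least
    # 7 entries in its successor list follow the key on the ring.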
In terms of high throughput, the paper introduces an integrated transport protocol called STP, which borrows many ideas from TCP but implements a different retransmit timer and retransmission policy. Because each block transfer is preceded by a multi-RPC lookup, lost packets have a large negative impact on DHash++ throughput, and under-predicting the latency of an RPC is costly to the advancement of the control window. STP uses Vivaldi latency predictions to help choose the retransmit time-out interval for each RPC. In addition, it keeps a moving average of the error in Vivaldi's prediction of each successful RPC's round-trip time, and adds a 15 ms constant to avoid retransmissions to other virtual nodes on the same host. Unlike TCP, when STP recognizes a time-out it does not re-send the RPC right away; instead it gives DHash++ a chance to re-send the RPC to a different destination, to avoid wasting time sending RPCs to nodes that have crashed or have overloaded access links.
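The timer computation described above can be sketched as follows; the moving-average weight and the use of the absolute error are my assumptions, while the Vivaldi prediction, the per-RPC error average, and the 15 ms constant come from the description above.

    class RetransmitTimer:
        def __init__(self, alpha=0.1, constant_ms=15.0):
            self.alpha = alpha               # moving-average weight (assumed)
            self.avg_error_ms = 0.0          # running average of prediction error
            self.constant_ms = constant_ms   # slack for co-located virtual nodes

        def observe(self, predicted_ms, measured_ms):
            # After each successful RPC, fold the error of the Vivaldi latency
            # prediction into the moving average.
            err = abs(measured_ms - predicted_ms)
            self.avg_error_ms = (1 - self.alpha) * self.avg_error_ms + self.alpha * err

        def rto_ms(self, predicted_ms):
            # Time-out for the next RPC to a node with this predicted latency:
            # prediction, plus typical prediction error, plus the fixed slack.
            return predicted_ms + self.avg_error_ms + self.constant_ms

On a time-out computed this way, STP then hands the decision back to DHash++, which may pick a different fragment holder rather than retransmitting to the unresponsive node.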
Overall this paper discusses a series of design decisions and options faced by DHTs. What interests me most is that the performance gains come not from optimizing any single algorithm, but from making different trade-offs in different scenarios, guided by simulations and measurements.