Tuesday, March 12, 2013

Where Does the Internet Filtering Occur in China?

It is hardly a secret that China filters what its people can see on the internet.  This censorship is used to spread propaganda, control public sentiment, and silence views that are unfavorable to the government. 

This paper seeks to discover where this filtering takes place. 

Some terms before diving into the paper:
AS - Autonomous System, a collection of IP routing prefixes with a clearly defined routing policy to the internet.  Each AS is assigned a number, and each ISP has at least one.
Border AS - an AS that connects to ASes in other countries. 
Internal AS - an AS that connects only to ASes within the country.
IDS - Intrusion Detection System - these devices are what the paper is looking for; they effectively implement the filter.

The paper is primarily separated into two parts: first getting a better understanding of China's AS topology, then finding out which ASes the IDSes are attached to.  To build a picture of China's AS topology, the paper describes a lot of work on determining which ASes belong to China.  In the end they find that most ASes are internal (138), along with 24 border ASes and 92 external ASes, and 133 unique peerings with external ASes.  Another interesting result is that, treating each border AS as the root of a tree, the maximum depth is 2, and only 18 ASes sit at level 2.  In other words, almost every AS is either a border AS or connected directly to one.  This is significant because it suggests the easiest way for the Great Firewall of China to menace its citizens is to set itself up at these border ASes.
The second part of the paper describes a method to find out which ASes the IDSes are attached to: they send HTTP GET requests containing keywords known to trigger the firewall and record the IP addresses from which the interfering packets (forged RSTs) are sent.  This is done against a variety of websites across the country.  They find that almost all of the IDSes sit in border ASes, with only 2.9% attached to internal ASes.  This suggests there is very little domestic filtering; the filtering happens almost entirely at the border. 
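To make the probing idea concrete, here is a minimal Python sketch of the detection step, under my own assumptions: the host name and keywords are hypothetical, and this only detects interference from the client's side rather than localizing which router or AS injected the resets, which is the part the paper actually works out.

import socket

def probe(host, keyword, port=80, timeout=5):
    # Send an HTTP GET containing a possibly filtered keyword and report
    # whether the connection is reset mid-transfer (consistent with an
    # in-path IDS injecting forged RST packets).
    request = (
        f"GET /?q={keyword} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n\r\n"
    )
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.sendall(request.encode())
            data = b""
            while chunk := s.recv(4096):
                data += chunk
        return "ok", len(data)
    except ConnectionResetError:
        return "reset", 0
    except socket.timeout:
        return "timeout", 0

# Hypothetical usage: compare a benign query with a known trigger keyword.
# print(probe("example.cn", "weather"))
# print(probe("example.cn", "sensitive-keyword"))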

Overall, the paper is quite informative, although it repeatedly laments the difficulty of getting complete data.  The paper also assumes an understanding of networking terms and concepts (RST packets, AS, BGP, etc.) that could be explained fairly quickly; doing so would greatly expand the target audience, since the topic is controversial and comes up often.  The results are consistent with the authors' hypotheses.  While the paper does not look for ways to bypass the filter, it does claim to be the "first study dedicated to explore both AS and router-level structures of China's censored network," so it seems prudent to suggest how this information bears on the difficulty of bypassing the firewall.  In fact, the paper never mentions (as far as I can tell) what future work remains on this topic.

Tor: The Second-Generation Onion Router

Tor: The Second-Generation Onion Router
by Roger Dingledine, Nick Mathewson, and Paul Syverson

Tor, a circuit-based low-latency anonymous communication service, is a second-generation onion routing system. It addresses limitations in the original design by adding perfect forward secrecy, congestion control, directory servers, integrity checking, configurable exit policies, and a practical design for location-hidden services via rendezvous points.

To understand this paper, we first need to understand onion routing. The following explanation is from Wikipedia:
A routing onion is a data structure formed by wrapping a plain text message with successive layers of encryption, such that each layer can be 'unwrapped' (decrypted) like the layer of an onion by one intermediary in a succession of intermediaries, with the original plain text message only being viewable by at most, the sender, the last intermediary (the exit node) and the recipient.
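To make the layering concrete, here is a minimal sketch using preshared symmetric keys (Fernet from the Python cryptography package). Real Tor negotiates session keys hop by hop and uses its own cell format, so this only illustrates the wrap/unwrap idea, not Tor's protocol.

from cryptography.fernet import Fernet

# One symmetric key per relay in the circuit (entry, middle, exit).
# In Tor these would be session keys negotiated hop by hop, not preshared.
circuit_keys = [Fernet.generate_key() for _ in range(3)]

def wrap(message: bytes, keys) -> bytes:
    # Encrypt with the exit relay's key first, then middle, then entry,
    # so each relay can peel off exactly one layer in path order.
    data = message
    for key in reversed(keys):
        data = Fernet(key).encrypt(data)
    return data

def unwrap_one_layer(onion: bytes, key: bytes) -> bytes:
    # What a single relay does: remove only its own layer.
    return Fernet(key).decrypt(onion)

onion = wrap(b"GET / HTTP/1.1", circuit_keys)
for key in circuit_keys:          # entry -> middle -> exit
    onion = unwrap_one_layer(onion, key)
print(onion)                      # the plaintext appears only after the exit's layer is removed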

Tor is distinguished from the original onion router by several features.
- perfect forward secrecy: Tor uses a telescoping path-building design in which the initiator negotiates session keys with each successive hop in the circuit.
- congestion control: Tor’s decentralized congestion control uses end-to-end acks to maintain anonymity while allowing nodes at the edges of the network to detect congestion or flooding and send less data until the congestion subsides.
- directory servers: Tor takes a simplified approach to distributing state information; certain more trusted nodes act as directory servers, providing signed directories that describe known routers and their current state.
- integrity checking: Tor verifies data integrity before it leaves the network to hamper attacks.
- configurable exit policies: Tor provides a consistent mechanism for each node to advertise a policy describing the hosts and ports to which it will connect (a toy sketch of such a policy check follows this list).
- rendezvous points and hidden services: Tor provides an integrated mechanism for responder anonymity via location-protected servers. Instead of using reply onions like previous designs, clients negotiate rendezvous points to connect with hidden servers.
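Here is the toy exit-policy sketch referenced above. The rule format and first-match behavior are my own simplification; Tor's actual policy language (address masks, port ranges, default rules) is richer.

from fnmatch import fnmatch

# Toy exit policy: each rule is (action, host_pattern, port_pattern),
# checked in order, first match wins.
POLICY = [
    ("reject", "*", "25"),        # refuse outbound SMTP from this exit
    ("accept", "*", "80"),
    ("accept", "*", "443"),
    ("reject", "*", "*"),         # default: refuse everything else
]

def exit_allows(host: str, port: int, policy=POLICY) -> bool:
    for action, host_pat, port_pat in policy:
        if fnmatch(host, host_pat) and fnmatch(str(port), port_pat):
            return action == "accept"
    return False

print(exit_allows("example.com", 443))   # True
print(exit_allows("example.com", 25))    # False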

The authors present clear goals, and even non-goals, along with many kinds of attacks and the defenses against them. They also give several goals and directions for work beyond their own. By doing so, they clarify the limitations of the paper and show how the work could be improved.

In conclusion, Tor is the predominant technology that employs onion routing today. However, threats to anonymity are growing rapidly, and distributed systems are particularly vulnerable to them. Security will need to keep improving to cope with these kinds of threats.

Sunday, March 10, 2013

The Costs and Limits of Availability of Replicated Services


From Georgios T., who can't seem to post:

The Costs and Limits of Availability of Replicated Services. Haifeng Yu & Amin Vahdat

The motivation of this paper comes from the realization that the utility of networked services is limited by availability rather than performance. The contemporary approaches (at the time of the paper) to improving availability used caching and replication, which create inconsistency problems. The authors identify the two extremes for dealing with inconsistency as 'strong consistency' and 'optimistic consistency', and they set out to explore the costs and limits of dynamically trading consistency for availability using a continuous consistency model. In a continuous consistency model, applications specify the maximum allowed distance from strong consistency, thus bounding how optimistic the system may be.
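A toy sketch of the "bounded distance from strong consistency" idea follows. The class name and the single bound on unseen writes are my own simplification; the actual model bounds several metrics (such as numerical error, order error, and staleness), not just a count of missing writes.

class BoundedReplica:
    # A replica that refuses to serve stale reads once it has fallen more
    # than max_unseen_writes behind; a bound of 0 degenerates to strong
    # consistency, while a huge bound behaves like optimistic consistency.
    def __init__(self, max_unseen_writes: int):
        self.max_unseen = max_unseen_writes
        self.unseen = 0          # remote writes known about but not yet applied
        self.value = 0

    def note_remote_write(self):
        self.unseen += 1

    def read(self):
        if self.unseen > self.max_unseen:
            self.sync()          # must catch up before answering
        return self.value

    def sync(self):
        # Placeholder for pulling and applying the missing remote writes.
        self.unseen = 0

replica = BoundedReplica(max_unseen_writes=2)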
In order to evaluate different implementations, the authors describe a methodology for computing upper bounds on service availability as a function of faultload, workload, consistency, and replication level. The methodology is based on Q(offline) and Q(online), sets of questions that a protocol must answer and that can be resolved optimally offline and online, respectively. The answers to the questions form a dominant algorithm; Q(online) answers are specified by the inputs of this algorithm, so fewer questions in the online set improve the chances of a tractable solution.
Another contribution of this paper is a comparison, using a prototype based on TACT, of the service availability that can be achieved with different consistency protocols (Voting, Primary, Golding), faultloads, and consistency levels. The experiments show that small optimizations to the existing protocols (e.g., aggressive write propagation) can bring them closer to the availability upper bounds. An interesting finding of the evaluation section is that maximizing availability by maintaining as strong a consistency level as possible creates a trade-off between availability and communication. Finally, the authors also explore the effect of the degree of replication and of network reliability on availability, and they show that more replicas are not always better: they can improve the reliability of reads but degrade write availability due to consistency constraints.

Thursday, March 7, 2013

Practical Byzantine Fault Tolerance

This short post is written about the paper Practical Byzantine Fault Tolerance by Miguel Castro and Barbara Liskov from MIT. Before reading this paper, one important term should be understood. Byzantine fault: previously, when we mentioned fault tolerance, we meant resilience to node failures or non-responding nodes. In the real world, however, we also need to protect the system from malicious actions and mis-sent information or instructions. These problems are called Byzantine faults, and Byzantine fault tolerance refers to the mechanisms that protect a system from these situations. The paper is named Practical because earlier papers on Byzantine fault tolerance were mostly devoted to the theoretical side of the topic and assumed a synchronous environment, while this paper tackles the topic from a practical point of view, with an algorithm that solves the problem in asynchronous system settings.

The algorithm works as follows:
1. a client sends a request to invoke a service operation to the primary
2. the primary multi-casts the request to the backups
3. replicas execute the request and send a reply to the client
4. the client waits for f + 1 replies from different replicas with the same result; this is the result of the operation (a toy sketch of this step follows the list)
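Here is the toy sketch of step 4. The data shapes are my own, but the rule is the paper's: with at most f Byzantine replicas, f + 1 matching replies from distinct replicas must include at least one honest replica, so the result can be trusted.

from collections import Counter

def accept_result(replies, f):
    # replies: list of (replica_id, result) pairs received so far.
    seen = {}                      # one vote per replica id
    for replica_id, result in replies:
        seen.setdefault(replica_id, result)
    counts = Counter(seen.values())
    for result, votes in counts.items():
        if votes >= f + 1:
            return result
    return None                    # not enough matching replies yet; keep waiting

replies = [(0, "ok"), (1, "ok"), (3, "bad")]
print(accept_result(replies, f=1))   # "ok"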

The algorithm guarantees liveness and safety because all non-faulty replicas agree on a total order for the execution of requests despite failures.

Tuesday, March 5, 2013

Scatter

Scalable Consistency in Scatter
by Lisa Glendenning, Ivan Beschastnikh, Arvind Krishnamurthy and Thomas Anderson

This paper introduces a new distributed and consistent key/value storage system called Scatter.

First, the authors adopt ideas from decentralized peer-to-peer systems like Chord and Kademlia. However, those systems lack consistency, so the authors propose a new system that is fault tolerant, consistent, and planetary scale. They structure the nodes into groups, and each group covers a fraction of the circular key space. Every group is managed by the Paxos protocol, which we studied earlier; there are "merge", "split", "migrate" and "repartition" operations defined on groups, and every group is connected to its predecessor and successor groups through individual nodes. Group sizes are kept small since Paxos performance drops with large numbers of nodes. A two-phase commit protocol is used for multi-group operations: the participating group leaders must agree (commit) on the operation in order for the global picture to change. Storage is handled, concisely, by dividing each group's key space among its member nodes; each such node is the primary for certain keys. For each store operation, the value is forwarded to the primary node in the group and replicated among the others using Paxos. Another novelty they propose is their join operation: they differentiate join policies depending on whether the goal is reducing group failure, improving throughput, or reducing latency, and they add the joining node to the most appropriate group for that goal.
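A toy sketch of how a key might be mapped to a group and then to a primary node on the circular key space follows. The group boundaries and member names are hypothetical, and Scatter's real lookup routes through overlay links between adjacent groups rather than a global table like this one.

import hashlib
from bisect import bisect_right

RING = 2**32   # toy circular key space

def h(s: str) -> int:
    return int(hashlib.sha1(s.encode()).hexdigest(), 16) % RING

# Hypothetical groups, each owning the arc from its start up to the next start.
# In Scatter each group is a small Paxos-replicated set of nodes.
groups = [
    {"start": 0,             "members": ["a1", "a2", "a3"]},
    {"start": RING // 3,     "members": ["b1", "b2", "b3"]},
    {"start": 2 * RING // 3, "members": ["c1", "c2", "c3"]},
]
starts = [g["start"] for g in groups]

def owning_group(key: str) -> dict:
    idx = bisect_right(starts, h(key)) - 1
    return groups[idx]            # wraps correctly because starts[0] == 0

def primary_for(key: str) -> str:
    g = owning_group(key)
    # The group's key space is split among its members; the chosen member is
    # the primary for this key and replicates writes to the rest via Paxos.
    return g["members"][h(key) % len(g["members"])]

print(primary_for("some-key"))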

I personally was not able to identify any major flaws, but I have a couple of concerns. First of all, they want their system to include millions of nodes, and their group formation does not depend on the geographic placement of the nodes, which may create groups whose members are scattered around the world. This would bring a lot of overhead for Paxos. Another thing is that they pick k random groups as candidates for a join, and with a rather small k (=3) I don't understand how they can achieve non-negligible improvements in resilience and latency.

All in all, solid ideas are presented in the paper, and they could be deployed once the security issues are addressed.

Sunday, February 24, 2013

Spanner


Spanner: Google's Globally-Distributed Database

Spanner is a scalable, multi-version, globally-distributed, synchronously-replicated database, and the paper discusses parts of its design in varying levels of depth.

There is a lot in this paper, so I'll just try to hit the highs and lows, starting with the good parts. The TrueTime API solves a big problem when trying to order events in a distributed system, and as the authors mention several times, without it the central design of Spanner would be impossible. TrueTime is a sub-system (?) used by Spanner that provides a global time for the current moment along with a confidence interval based on the worst-case clock skew. The nodes providing this service rely either on atomic clocks, which have a known drift rate and must be synchronized periodically, or on GPS clocks, which have greater accuracy but are susceptible to communication failures. Given the interval nature of the time this system provides, there are moments when Spanner has to wait when resolving two conflicting requests or when committing a transaction, but the wait tends to be short, on the order of ~10 ms, and two transactions are very unlikely to touch the same rows in the database, just given the scale of the system itself. TrueTime is a cool system, and one I hope they open source.
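A rough sketch of the commit-wait rule built on TrueTime's interval, with a stubbed clock, follows. The tt_now() stub, the epsilon value, and the polling loop are illustrative assumptions, not Google's implementation; the real API lives inside Google's infrastructure.

import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float
    latest: float

def tt_now(epsilon: float = 0.005) -> TTInterval:
    # Stub for TrueTime's TT.now(): the true time is guaranteed to lie
    # somewhere in [earliest, latest]; epsilon is the clock uncertainty.
    t = time.time()
    return TTInterval(t - epsilon, t + epsilon)

def commit_wait() -> float:
    # Pick a commit timestamp no smaller than any time that could be "now",
    # then block until that timestamp is guaranteed to be in the past
    # everywhere, i.e. TT.now().earliest has passed it. This is the wait
    # of roughly 2 * epsilon (~10 ms) mentioned above.
    s = tt_now().latest
    while tt_now().earliest <= s:
        time.sleep(0.001)
    return s

print(commit_wait())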

Spanner itself provides a semi-relational model to the client. This is really cool; typically distributed databases (in my limited experience reading about them) provide only a key-value store and push the burden onto the application to do any more advanced lookups. Spanner uses a trick involving the primary keys of a row to map to the other values, and uses interleaving of table data in directories to allow certain features of a relational database while still remaining distributed.
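A toy illustration of the interleaving idea, with hypothetical Users/Albums tables of my own: child rows carry the parent's key as a prefix, so sorting by key keeps each user's albums physically next to the user row. Spanner's actual key encoding and directory placement are more involved than this.

# Keys are tuples; a child row's key starts with its parent's key.
rows = {
    ("Users", 1): {"name": "alice"},
    ("Users", 1, "Albums", 10): {"title": "hiking"},
    ("Users", 1, "Albums", 11): {"title": "skiing"},
    ("Users", 2): {"name": "bob"},
    ("Users", 2, "Albums", 10): {"title": "cats"},
}

# Sorting by key groups each user with their albums, which is what lets a
# directory of related rows be placed and moved as a unit.
for key in sorted(rows):
    print(key, rows[key])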

I wish this paper had listed more metrics about its performance and compared an application that uses Spanner versus Bigtable or Megastore, both of which were mentioned a few times in the paper. Since this system was primarily designed to be used with Google's F1 advertising backend, it would have been nice to see some numbers on throughput in the new system versus the old (they have the new numbers), as well as on application-level complexity that might have been resolved and on system-level upkeep. Seeing numbers on user-facing latencies when leaders fail or a datacenter gets knocked offline would also have been interesting. I don't doubt that Google has the numbers for these scenarios; I just wonder why they included the numbers they did.  


Sunday, February 17, 2013

Dynamo: Amazon’s Highly Available Key-value Store



Dynamo is a highly available key-value storage implementation for large-scale distributed systems. Dynamo relies on the following properties to be effective:

  • Primary key data storage for a simple read/write interface
  • Consistent hashing for replication
  • Object versioning for consistency

The use of primary-key data storage cuts down on the overhead of a relational database. Dynamo also implements Service Level Agreements between services in order to maintain latency requirements; latency targets are set at the 99.9th percentile rather than relying on a mean or median. Amazon also acknowledges a trade-off between availability and consistency. In order to improve availability, eventual consistency is maintained with optimistic replication. This raises the issue of conflict management, which is handled at read time using either a "last write wins" policy or application-specified reconciliation. This reduces the number of writes that are rejected because of network and node failures. Dynamo is essentially a "zero-hop" DHT in which each node stores all the information it needs to contact the nodes with which it must communicate. 

Dynamo exposes the functions get() and put() to the system, which are executed using a quorum-like procedure that requires a number of responses before returning. Along with consistent hashing, Dynamo's partitioning scheme assigns each physical node a set of virtual nodes, which determine the set of data it is responsible for. Each node then replicates the data it is responsible for on a number of successor nodes to ensure failure tolerance. Because of failures, it is possible to have multiple versions of the same data in the system simultaneously; Dynamo uses vector clocks to determine the ordering between these replicas and to perform conflict resolution. Dynamo detects inconsistent data using Merkle trees, traversing the trees to determine whether all keys are up to date. 
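A toy sketch of consistent hashing with virtual nodes and a preference list follows. The node names, the number of virtual nodes, and the replica count are made up, and Dynamo's real scheme adds explicit token management, hinted handoff, and more; this only shows how a key lands on N distinct physical nodes around the ring.

import hashlib
from bisect import bisect_right

def h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def build_ring(nodes, vnodes_per_node=8):
    # Each physical node claims several positions (virtual nodes) on the ring.
    return sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes_per_node))

def preference_list(ring, key, n_replicas=3):
    # Walk clockwise from the key's position, collecting distinct physical
    # nodes; these are the replicas responsible for the key.
    positions = [pos for pos, _ in ring]
    idx = bisect_right(positions, h(key)) % len(ring)
    chosen, seen = [], set()
    while len(chosen) < n_replicas:
        _, node = ring[idx]
        if node not in seen:          # skip extra virtual nodes of the same host
            seen.add(node)
            chosen.append(node)
        idx = (idx + 1) % len(ring)
    return chosen

ring = build_ring(["node-a", "node-b", "node-c", "node-d"])
print(preference_list(ring, "user:42"))   # e.g. ['node-c', 'node-a', 'node-d']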

This contributes to Dynamo’s use as a scalable and highly available distributed data storage mechanism.