Sunday, February 24, 2013

Spanner


Spanner: Google's Globally-Distributed Database

Spanner is a scalable, multi-version, globally-distributed, synchronously-replicated database, and the paper discusses parts of its design in varying levels of depth.

There is a lot in this paper, so I'll just try to hit the highs and lows, starting with the good parts. The TrueTime API solves a big problem when trying to order events in a distributed system, and as they mention several times, without it the central design of Spanner would be impossible. TrueTime is a sub-system (?) used by Spanner that provides a global time for the current moment along with a confidence interval based on the worst-case clock skew. The nodes providing these services rely on either system level atomic clocks, which have a known skew rate and must be synchronized periodically, or GPS clocks, having a greater accuracy but are susceptible communication failures. Given the interval nature this system provides, there are times when Spanner will have to wait when resolving two conflicting requests, or when a transaction has been committed, but this tends to be a short interval on the order of ~10ms, and two transactions are very unlikely to touch the same rows in the database, just given the scale of the system itself. TrueTime is a cool system, and one I hope they open source.

Spanner itself provides a semi-relational model to the client. This is really cool, typically distributed databases (in my limited experiences reading about them) provide only key-value store, and pushed the burden on the application to do any more advanced lookups. They use a trick involving the primary keys of a row to map to the other values, and use interleaving of table data in directories to allow certain features of a relational database while still remaining distributed.

I wished this paper had listed more metrics about its performance, and compare an application that uses spanner versus Bigtable or Megastore, both of which were mentioned a few times in the paper. Since this system was primary designed to be used with Google's F1 advertising backend, it would have been nice to see some quantities on throughput in the new system versus the old (they have the new numbers), as well as application level complexities that might have been resolved and system level upkeep. Seeing numbers on latencies on the user-end of an application when leaders are failing or a data-center get knocked offline would have also been interesting. I don't doubt that Google has the numbers on these scenarios, I just wonder why they included the numbers they did.  


16 comments:

  1. I thought the argument for the TrueTime API was pretty convincing. The more certain you are about how uncertain you are about your data, the more likely you are to react appropriately..

    ReplyDelete
    Replies
    1. I agree. Timestamp is key for this. Making sure that you have a globally consistent timestamp allows you to have a database that is serializable. I think the real insight here is that its pretty much impossible to have a true timestamp for when an action happens. Allowing for a more flexible representation of time is a way to ensure accuracy and increase your awareness of the system at a global level.

      Honestly nothing more to say about this paper other than it must have been an awesome process to build and test. Seriously awesome work.

      Delete
    2. Table 2 shows that only read-write transaction requires locks for the sake of concurrency. I agree that the most valuable part of this paper is the reducing the possibility of blocking by leveraging the timestamps. I would be interested to see the performance of Spanner when it is really deployed in the backend of Google’s advertising.

      Delete
  2. It's interesting to see how Google combines various technologies (like Colossus, Paxos, etc) to gain performance or usability benefits for each platform they create. They created BigTable to move away from a relational model, but have now moved back to that with Spanner. Apparently a replacement for Colossus is also in the works, that aims to attain much better latencies and fault-tolerance, so it will be cool to see how the next iteration of these products works.

    Regarding TrueTime, while it would be kind of cool to see how it is implemented, it requires a good bit of specialized hardware, so I'm not sure how much effect open-sourcing it would have.

    I agree that more evaluation results would have been nice to see. The paper talks about latency and throughput experiments that were carried out, but does not present this data.

    ReplyDelete
    Replies
    1. Very much agree on the more evaluations, in particular the call for comparisons. These latencies and throughput measurements *seem* good, but what do I know? Especially since this system is billed as a replacement for an old method, comparisons seem critical.

      Delete
    2. Especially since we never really know from limited testing how well it will hold up in practice, the numbers can't really be evaluated well until a couple of months into actual release, in my opinion.

      Delete
  3. A directory is said to be the smallest granularity whose geographic replication properties can be application-specified. A point that I did not quite understand was about the sharding of extremely large directories into fragments, that can possibly be served from different Paxos groups. Does this not imply that, in essence, fragments should be the smallest granularity that should be application-configurable? Or is it that even though directories are split into fragments (and possibly served from different groups), the configuration properties are applied at the fragment level?

    ReplyDelete
    Replies
    1. I think they're just saying that directories are the smallest unit whose replication properties can be configured but the way that that replication is managed by Spanner is using "fragments" which are not exposed to the end user.

      Delete
  4. This paper initiates a new era in distributed databases in my opinion. That being said, the usage of atomic clocks and GPS receivers really push this work forward. In addition, having lock-free reads is one of the keys to reduce latency. Also, the scalability of this system is really marvelous.

    One thing I was looking for in the evaluation section is that how Spooner perform against BigTable and likes in a small scale. Also, although it seems very implementable and easy to comprehend, I cannot find a very good reason why this kind of work came out this late.

    ReplyDelete
  5. The most surprising aspect of this paper, to me, was how bad things were in F1 before Google started looking into Spanner. They mention that manually re-sharding data took two years to fully complete the last time it was performed, coupled with other complexities. Google always seems to be a company on top of the latest technologies, so I was very surprised to read how bad things had gotten.

    On the other hand, migration to a completely new backend technology is never easy, especially at the scale of the largest technological companies in the world. In this way, the Spanner system serves almost as a warning, and a lesson, to theorize solutions to big data well in advance of the necessary timeframe.

    All-in-all, Spanner is an impressive feat of technology and theory. Great read.

    ReplyDelete
    Replies
    1. Just goes to show that even the best of companies put off large migrations until it is absolutely necessary. Although, while the paper was published recently, they've been working on this for quite a while - around since '07-'08 ish?

      Delete
  6. I believe what stands out in this paper is the TrueTime API. By making Distributed Systems with stronger time semantics (almost near perfect), this is just what the Systems community were looking for (This was just like making something which was only probable before, near certain now).

    I do not see the war between the Spanner (NewSQL) and BigTable (NoSQL) war ending soon especially as most NoSQL-ites do not believe that NewSQL would work for all large Databases though they seem to agree that it works for Google. I believe though that Google puts up a strong point strengthened by the fact that even Facebook and other OSNs are looking for a Spanner-like System.

    I also was surprised like Besim to see this work coming up so late, and agree with Josiah that things seemed really bad before Spanner. This works for Google especially because the nature of the queries are by nature simple.

    This needs to be seen how it scales up to more complex queries. Data coming from CDNs based on OSNs and Recommendation-Systems would generate more complex queries. Also, I see even search getting more complex in near future thanks to "dreams" like search based on data (mp3, jpegs and more) than mere text.

    ReplyDelete
  7. The biggest innovation for Google spanner is its consistent global timestamp, TrueTime API. The philosophy is, the uncertainty is quantified and can be made small (with atomic clocks and GPS).
    Google is migrating their database to Spanner, it will be interesting if Spanner is commercialized. The need for database system with high scalability and consistency has never been satisfied, NoSQL cannot fully replace traditional RDBMS. Google Spanner is an incredible innovation that provides unbelievable scalability and transaction support.

    ReplyDelete
  8. Google has been doing awesome works and I guess most of computer guys are agree with it. Google spanner is cool as well as their other works. A huge defect of CAP theory in NoSQL has been pointed out; none of its three features, capability, availability and partition tolerance, cannot accomplished simultaneously. However, Google just provides groundwork to overcome it in this paper as introducing TrueTime which is the key work of their research. Since NoSQL was introduced most recently, in 2009, though the terminology was used 1998 first, I think it's not late but quite fast that Google introduced spanner last year.

    ReplyDelete
  9. It definitely seems like the TrueTime API makes the biggest impression, and is incredibly effective. The strengthening of time semantics is something that the DS community can benefit from, but Google's TrueTime does seem to be almost too good, and so I'm a little skeptical that in practice it is as clean as they make it appear.

    ReplyDelete
  10. Took me time to read all the comments, but I really enjoyed the article. It proved to be Very helpful to me and I am sure to all the commenters here! It’s always nice when you can not only be informed, but also entertained! spanner set India

    ReplyDelete