Ex-Googlers Building Cloud Software That’s Almost Impossible to Take Down

slashdev · on July 22, 2014

So what's special about this database? What properties does it have that are superior to the currently available alternatives? This article is all hype and no substance.

EDIT: GitHub has more information. It's a scalable ordered key-value store (think of a distributed version of Berkeley DB). Storage is based on RocksDB (a variant of LevelDB) and consensus is achieved using Raft. The database is written in Go. It's meant to "tolerate disk, machine, rack, and even datacenter failures with minimal latency disruption and no manual intervention."

What's not clear at all is where it sits CAP-wise. It says it's available and strongly consistent, which would be CA, and that is not an option (especially for something claiming to be failure tolerant). It must sacrifice either write availability or consistency in the event of a partition. No idea which way it leans.

0xbadcafebee · on July 22, 2014

> Cockroach provides snapshot isolation (SI) and serializable snapshot isolation (SSI) semantics, allowing externally consistent, lock-free reads and writes--both from an historical snapshot timestamp and from the current wall clock time. SI provides lock-free reads and writes but still allows write skew. SSI eliminates write skew, but introduces a performance hit in the case of a contentious system. SSI is the default isolation; clients must consciously decide to trade correctness for performance. Cockroach implements a limited form of linearizability, providing ordering for any observer or chain of observers.

Either you use a strongly consistent mode that can have poor performance under contention, or a strongly consistent mode that will have good performance under contention but lots of failed transactions. So you get to decide on performance vs availability.

To answer your question, it sacrifices availability, not consistency. It's an MVCC, after all.
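For readers unfamiliar with the write skew mentioned in the quote, here is a minimal toy sketch. It is not CockroachDB code: the on-call example, the variable names, and the simplified SSI-style check are all assumptions made for illustration. Two concurrent transactions read the same snapshot and write disjoint keys, so plain SI sees no conflict and the application invariant breaks.

```python
# Toy model of write skew; all names here are hypothetical.
# Application invariant: at least one doctor must stay on call.
db = {"alice_on_call": True, "bob_on_call": True}

def go_off_call(snapshot, me, other):
    """Return (read_set, write_set) computed only from the transaction's snapshot."""
    reads = {other}
    # Going off call looks safe if the other doctor is on call in my snapshot.
    writes = {me: False} if snapshot[other] else {}
    return reads, writes

# Both transactions start from the same snapshot, i.e. they run concurrently.
snapshot = dict(db)
reads_1, writes_1 = go_off_call(snapshot, "alice_on_call", "bob_on_call")
reads_2, writes_2 = go_off_call(snapshot, "bob_on_call", "alice_on_call")

# Plain SI only rejects transactions whose write sets overlap.
si_conflict = bool(set(writes_1) & set(writes_2))

# Simplified SSI-style check (an assumption, not a real implementation):
# each transaction read a key that the other one wrote.
ssi_conflict = bool(reads_1 & set(writes_2)) and bool(reads_2 & set(writes_1))

# Apply both write sets the way SI would allow.
after_si = dict(db)
if not si_conflict:
    after_si.update(writes_1)
    after_si.update(writes_2)

invariant_holds_under_si = after_si["alice_on_call"] or after_si["bob_on_call"]
print(si_conflict, ssi_conflict, invariant_holds_under_si)
```

Under SI both transactions commit and nobody is left on call; a serializable mode would have to abort one of them, which is where the failed transactions under contention come from.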

Goranek · on July 22, 2014

I think they're trying to create an open source Spanner. Here is Google's research paper on Spanner: http://static.googleusercontent.com/media/research.google.co...

boomzilla · on July 22, 2014

In my opinion, one of the most profound ideas in Spanner is the introduction of TrueTime API and the guarantee provided in the implementation. I wonder if this project is going to have something similar?

yid · on July 22, 2014

From the description, it sounds like they're not at the moment. Instead, they seem to be aiming for the globally replicated, consistent, SQL-supporting features. Nothing wrong with that -- the world could use more geographically-aware database implementations. Seems like they'd be able to make use of the time sync for more efficient replication/failover/transactions in the future when the hardware is more widely available.

jbooth · on July 22, 2014

Why are they using RocksDB rather than LMDB?

If it's based on Raft, then it sacrifices availability if there isn't a quorum of nodes online.
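As a rough illustration of the quorum point, assuming the usual Raft rule that a strict majority of the cluster must be reachable (the function names are hypothetical, and this says nothing about CockroachDB's actual code):

```python
def quorum_size(cluster_size):
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1

def can_commit(cluster_size, reachable_nodes):
    """A majority-based protocol can make progress only with a quorum reachable."""
    return reachable_nodes >= quorum_size(cluster_size)

for n in (3, 4, 5):
    tolerated = n - quorum_size(n)
    print(f"{n} nodes: quorum {quorum_size(n)}, tolerates {tolerated} failures")
```

A 5-node group keeps accepting writes with 3 nodes reachable, but the minority side of a partition has to stop, which is the availability sacrifice described above.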

0xbadcafebee · on July 22, 2014

Probably because 1. LMDB is limited to the logical address space, 2. it has one big global lock, 3. it's a B-Tree, and the last two contribute to the fact that 4. LMDB is a read-oriented database [performance wise]. I would conjecture that Rocks could also be 'more easily embeddable', but I'm talking out my ass there :)

And yeah, you kind of have to sacrifice availability if you want to stay consistent in the face of write skew...

jbooth · on July 22, 2014

One big global write lock in LMDB maps pretty well to one single stream of replicated log entries in Raft, IMO.

And logical address space is still far in excess of what most disks or arrays can fit, right? 40 bits or so on linux?

EDIT: 47 bits, for 128TB -- http://stackoverflow.com/questions/2159456/whats-the-max-fil...
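The 47-bit figure can be checked with a line or two of arithmetic. This is only a sketch of the conversion, using binary terabytes (TiB):

```python
TIB = 2 ** 40  # bytes in one tebibyte

def addressable_tib(bits):
    """Bytes addressable with the given number of address bits, expressed in TiB."""
    return (2 ** bits) // TIB

print(addressable_tib(47))  # user-space half of a 48-bit virtual address space
print(addressable_tib(48))
```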

0xbadcafebee · on July 22, 2014

I don't see how? One global write lock means a single instance can't update multiple ranges at a time, so determining consensus and writing from multiple peers would just take a long time for no reason. The whole point of an SSI MVCC is to get around difficult locks....

If Moore's law holds, a single SSD will outgrow the address space in around 7 years. In four years, an array of eight disks would outgrow the address space. This is just for a single server. If you want a linearly-scaling, robust solution for future requirements (like multi-petabyte and exabyte distributed datastores), there's no reason to lock yourself into technology that'll be obsolete in half a decade.

(edit: SanDisk says it may release 8TB SSDs next year, also adding "We see reaching the 4TB mark as really just the beginning and expect to continue doubling the capacity every year or two, far outpacing the growth for traditional HDDs")
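A back-of-the-envelope version of that estimate, as a sketch under stated assumptions: a 4TB drive today, capacity doubling every one or two years as in the SanDisk quote, and a 128TB mappable limit. The helper name is hypothetical, and TB versus TiB is ignored.

```python
def doublings_until(capacity_tb, limit_tb):
    """Count capacity doublings needed to reach or exceed the limit."""
    count = 0
    while capacity_tb < limit_tb:
        capacity_tb *= 2
        count += 1
    return count

LIMIT_TB = 128                                      # 47-bit address space
single_ssd = doublings_until(4, LIMIT_TB)           # one 4TB drive
eight_disk_array = doublings_until(8 * 4, LIMIT_TB) # eight 4TB drives

for years_per_doubling in (1, 2):
    print(f"doubling every {years_per_doubling}y: "
          f"single SSD {single_ssd * years_per_doubling}y, "
          f"8-disk array {eight_disk_array * years_per_doubling}y")
```

With these inputs a single drive reaches the limit in 5 to 10 years and an eight-drive array in 2 to 4 years, which is in the same range as the figures quoted above.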

t1m · on July 22, 2014

IIRC, current x86-64 chips are limited to 48 bits of virtual address space to simplify the address translation logic (cheaper to manufacture).

This makes sense for the current generation of storage sub-systems, though it would be misleading to say using memory map technology will be "obsolete in half a decade". The 48 bit limit is arbitrary. Manufacturers have 56 bit designs on the table right now, and there is nothing stopping them from implementing full 64 bit virtual address support.

jbooth · on July 22, 2014

I'm not on the inside of cockroachDB's raft implementation, but typically you've got a single thread processing AppendEntries requests in a defined order, exactly one at a time, to guarantee the same order of execution on every node. There might be some small savings from doing a couple of updates concurrently here and there but your overall flow should be single threaded.
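A toy sketch of why the order matters, assuming a replicated log of simple key/value writes (this is not Raft itself and not CockroachDB's implementation; the names are hypothetical):

```python
def apply_log(entries):
    """Apply entries strictly in log order, one at a time, to a fresh state machine."""
    state = {}
    for key, value in entries:
        state[key] = value
    return state

log = [("x", 1), ("x", 2), ("y", 3)]

# Two replicas that apply the same log in the same order end up identical.
replica_a = apply_log(log)
replica_b = apply_log(log)

# A replica that applied the first two entries in the opposite order diverges.
reordered = apply_log([log[1], log[0], log[2]])

print(replica_a == replica_b, replica_a == reordered)
```

Entries touching unrelated keys could in principle be applied concurrently, but the conflicting ones still need a single agreed order.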

As far as the address space and big SSDs thing.. I'd be willing to gamble on linux supporting mmap up to the biggest devices on the market, one way or another. Heck, there's only 16 more bits after that 47 before every FS under VFS has to be rewritten, right?

kjksf · on July 22, 2014

I benchmarked lmdb vs. leveldb once and on a write-heavy workload leveldb destroys lmdb (think 10x better perf).

The author of LMDB makes pretty bold performance claims and people are too eager to believe them.

You shouldn't propagate those claims unless you've done benchmarking to verify them.

t1m · on July 22, 2014

I would be interested in seeing your benchmarks.

The author of LMDB doesn't really make bold claims, he actually just included LMDB (and the venerable Berkeley DB) in LevelDB's published benchmarks. The benchmarks were developed by the LevelDB team.

http://symas.com/mdb/microbench/

http://symas.com/mdb/inmem/

Goranek · on July 22, 2014

Just noticed that it's written in Go. Lately more and more databases are written in Go.

mathattack · on July 22, 2014

I see a headline like that thinking someone will take it as a dare. Of course the source is prone to overhype, so it's worth a grain (or twenty) of salt.

jahewson · on July 22, 2014

You got your idiom backwards: more salt = more meat and substance in the story, less salt = less substance.

jcbrand · on July 22, 2014

That idiom doesn't work the way you think it works :) https://en.wikipedia.org/wiki/Grain_of_salt

mathattack · on July 23, 2014

I like the reference. :-) And I've been misusing the idiom for at least 10 years without correction!

jahewson · on July 24, 2014

My dictionary lied to me! Wikipedia to the rescue again, very interesting.

jflowers45 · on July 22, 2014

I like the name: "CockroachDB" - and find it interesting that a bunch of the guys are working on it while they also work at Square - and also that it's supposedly based off a Google research paper for "Spanner" which I hadn't previously read about. Lot of good nuggets here.

grogenaut · on July 22, 2014

Why, when I see a Wired article title, do I immediately assume that it is going to be completely untrue?

curiousDog · on July 22, 2014

Spanner truly is remarkable. I've always wondered why Google hasn't opened it up!

toast0 · on July 22, 2014

The article mentions that Spanner has dependencies on several other Google projects. Assuming Google wants to open it up (more than publishing a paper), they would need to stub out all the dependencies first, which is a major effort.

dekhn · on July 22, 2014

It would make more sense to host it as a cloud API, or perhaps use it as a backend for Datastore: https://developers.google.com/datastore/

However, according to this page, it seems to use Megastore: http://googledevelopers.blogspot.com/2013/05/get-started-wit...

Megastore: http://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper32.pdf

outside1234 · on July 22, 2014

Google does open source when it benefits them. Giving everyone else Spanner doesn't benefit them.

cle · on July 22, 2014

It wouldn't benefit anyone. Spanner is most likely tightly coupled with their internal services and infrastructure.

If you want to know how Spanner works, read the paper. If you want to use it yourself, you'll need to build it on top of your own infrastructure, just like Google did.

nostrademons · on July 23, 2014

Or more when it doesn't harm them. Google's open-sourced several projects (Protocol Buffers, Closure, Gumbo) that didn't directly benefit Google, but also don't harm them or give away highly-valuable IP either.

_ofdw · on July 22, 2014

Personally I don't think takedowns are the biggest threat facing cloud users. By far the larger threat is having your cloud data harvested and used against you by adversaries such as advertising firms and three-letter agencies.

xionon · on July 22, 2014

That's not what this article is about at all. The technology described in this article is a distributed database.

Github project: https://github.com/cockroachdb/cockroach/

lern_too_spel · on July 22, 2014

Neither of those is a realistic problem. (How and why would an advertising firm use your data against you, anyway?)

Servers go down every day, and data centers go down every month. This project solves a real problem.

_ofdw · on July 22, 2014

Targeting me with ads is absolutely an adversarial situation. I and most of the population would rather not see ads at all. I do not want advertising agencies building a profile of me so they can sell me stuff. Do you?

Maybe you'll come back with a completely-bogus, typical hackernews-ish response, deluding yourself, saying that yes you like it when advertisers target you based on the data they collect about you because of reasons X Y and Z, something something "better for me". Ads targeted at you do not benefit you in any way unless you work for an advertising firm.

And are you seriously arguing that NSA et al do not peek into cloud storage? Have you been living under a rock?

cwyers · on July 22, 2014

Most of the population seems to have determined that, given the tradeoff between "seeing ads" and "paying more for content," they are more willing to do the former than the latter.

icebraining · on July 22, 2014

Given current transaction costs, yes, but we don't really know what would happen if these were lower (both in money and hassle).

In any case, it's not every day people are given the choice. There's probably a handful of sites you can pay to remove ads, and the cost is usually much higher than the value the user would have provided in ad revenue.

tormeh · on July 22, 2014

I think that's because it's perfect price differentiation. The alternative is that advertisement is massively overvalued.

eropple · on July 22, 2014

> Ads targeted at you do not benefit you in any way unless you work for an advertising firm.

They mean I don't have to pay every website that generates content in order to have them...you know...exist.

_ofdw · on July 22, 2014

You sure about that? You can't think of a single site that produces content without ads? I can.

eropple · on July 22, 2014

I certainly cannot think of a single site that produces professional content without income. I did not say without ads. Ads do mean, however, that I don't have to pay sites directly and the value I get is well in excess of the value I pay by providing ad impressions.

Economics is not a hard subject. Stop being a reductivist. Or stop being a sneering jerk (and let's not pretend you aren't trying to pick a fight with your tone, 'kay?). Both not-reductivist and not-sneering-jerk would be nice, though.

icebraining · on July 22, 2014

> the value I get is well in excess of the value I pay by providing ad impressions.

You don't pay value. You pay a price, which provides value to the recipient. And the value received by the site being low does not mean the price is, since they're both subjective. Which is the problem with ads: for many, the price paid - the loss in privacy - is much greater than the pennies received by the advertiser. I'm glad you value your loss so low, but you shouldn't assume everyone does.

eropple · on July 22, 2014

Alternate character interpretation: you wildly overvalue your individual privacy. You, and I, are just not that important.

icebraining · on July 22, 2014

Oh, but this is where you're wrong; you seem to think one only values privacy if they're scared of black helicopters, but that's not the case. I value my privacy because I dislike the everyday intrusions on people's lives. As Raoul Vaneigem wrote, "The economy of everyday life is based on a continuous exchange of humiliations and aggressive attitudes", and tracking is nothing more than an automated and therefore efficient version of this.

Besides, privacy is like vaccination - it also needs herd immunity. If everyone is exposed, the few "important" people who really need it will stand out like a sore thumb.

eropple · on July 23, 2014

It has nothing to do with black helicopters. It has to do with your data being utterly unimportant to anybody except yourself. This is why it's only valuable in the aggregate.

Nobody cares.

icebraining · on July 23, 2014

They don't have to care. Target didn't care about some random teenager being pregnant, but her father still found out due to their targeted ads. That they don't care is irrelevant.