So what's special about this database? What properties does it have that are superior to the currently available alternatives? This article is all hype and no substance.
EDIT: GitHub has more information. It's a scalable ordered key-value store (think a distributed version of Berkeley DB). Storage is based on RocksDB (a fork of LevelDB), and consensus is achieved using Raft. The database is written in Go. It's meant to "tolerate disk, machine, rack, and even datacenter failures with minimal latency disruption and no manual intervention."
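Here is a rough picture of what "ordered key-value store" buys you. Because keys are kept sorted, the keyspace can be cut into contiguous ranges, and a range scan is just a walk between two positions. This is a minimal Go sketch. It assumes a hypothetical list of split keys and a local sorted key slice, and it is not CockroachDB's actual range-addressing code.

```go
package main

import (
	"fmt"
	"sort"
)

// rangeFor returns the index of the range that owns key, given sorted split
// keys (hypothetical). Range 0 holds keys < splits[0], range i holds keys in
// [splits[i-1], splits[i]), and the last range holds keys >= the last split.
func rangeFor(splits []string, key string) int {
	// The owning range is the index of the first split strictly greater than key.
	return sort.Search(len(splits), func(i int) bool { return splits[i] > key })
}

// scan returns the keys in [start, end) from a sorted key slice.
func scan(keys []string, start, end string) []string {
	lo := sort.SearchStrings(keys, start)
	hi := sort.SearchStrings(keys, end)
	return keys[lo:hi]
}

func main() {
	splits := []string{"g", "p"} // three ranges: [.., g), [g, p), [p, ..)
	for _, k := range []string{"apple", "grape", "zebra"} {
		fmt.Println(k, "-> range", rangeFor(splits, k))
	}
	keys := []string{"apple", "banana", "cherry", "grape", "pear"}
	fmt.Println(scan(keys, "b", "g")) // [banana cherry]
}
```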
What's not clear at all is where it sits CAP-wise. It says it's available and strongly consistent, which would make it CA, and CA is not an option (especially for something claiming to be failure tolerant). In the event of a partition it must sacrifice either write availability or consistency. No idea which way it leans.
> Cockroach provides snapshot isolation (SI) and serializable snapshot isolation (SSI) semantics, allowing externally consistent, lock-free reads and writes--both from an historical snapshot timestamp and from the current wall clock time. SI provides lock-free reads and writes but still allows write skew. SSI eliminates write skew, but introduces a performance hit in the case of a contentious system. SSI is the default isolation; clients must consciously decide to trade correctness for performance. Cockroach implements a limited form of linearizability, providing ordering for any observer or chain of observers.
Either you use a strongly consistent mode that can have poor performance under contention, or a strongly consistent mode that will have good performance under contention but lots of failed transactions. So you get to decide on performance vs availability.
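Write skew is easiest to see with the textbook on-call example. This toy Go model rests on stated assumptions, and the names are hypothetical:

- All transactions start from the same snapshot.
- SI is modeled as first-committer-wins on overlapping write sets.
- The "serializable" mode is a simplified read-set check standing in for SSI. It is not CockroachDB's actual algorithm.

```go
package main

import "fmt"

// txn is a transaction that read and wrote some keys from one shared snapshot.
type txn struct {
	name   string
	reads  []string
	writes []string
}

// commitAll tries to commit concurrent transactions (all started from the same
// snapshot) in order and returns the names of those that committed.
// Under SI the only check is first-committer-wins on overlapping writes.
// With serializable set, a txn also aborts if an earlier committer wrote a key
// it read: a simplified read-set validation standing in for SSI.
func commitAll(txns []txn, serializable bool) []string {
	written := map[string]bool{} // keys written by already-committed txns
	var committed []string
	for _, t := range txns {
		conflict := false
		for _, k := range t.writes {
			conflict = conflict || written[k]
		}
		if serializable {
			for _, k := range t.reads {
				conflict = conflict || written[k]
			}
		}
		if conflict {
			continue // abort: the client would have to retry
		}
		for _, k := range t.writes {
			written[k] = true
		}
		committed = append(committed, t.name)
	}
	return committed
}

// onCallTxns models the classic write-skew case: two doctors each see that
// both are on call, then each takes only themselves off call.
func onCallTxns() []txn {
	return []txn{
		{name: "alice-off", reads: []string{"alice", "bob"}, writes: []string{"alice"}},
		{name: "bob-off", reads: []string{"alice", "bob"}, writes: []string{"bob"}},
	}
}

func main() {
	si := commitAll(onCallTxns(), false)
	ser := commitAll(onCallTxns(), true)
	// Each committed txn takes one of the two doctors off call.
	fmt.Println("SI committed:", si, "doctors left on call:", 2-len(si))
	fmt.Println("serializable committed:", ser, "doctors left on call:", 2-len(ser))
}
```

Under SI both transactions commit because their write sets don't overlap, which leaves nobody on call. The serializable check aborts the second transaction instead. That abort is the "failed transactions" cost in practice.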
To answer your question, it sacrifices availability, not consistency. It's an MVCC system, after all.
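Here is what "sacrificing availability" looks like in a majority-quorum design like Raft. The side of a partition that holds a majority keeps accepting writes, and the minority side refuses them rather than diverging. This is a simplified Go sketch with hypothetical names, not CockroachDB code.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoQuorum signals that a write was refused to preserve consistency.
var errNoQuorum = errors.New("no quorum: write refused")

// quorumWrite models a CP-style replicated write: it succeeds only if a strict
// majority of replicas can acknowledge it. reachable[i] reports whether
// replica i is on this side of a partition (hypothetical input).
func quorumWrite(reachable []bool) error {
	acks := 0
	for _, ok := range reachable {
		if ok {
			acks++
		}
	}
	if 2*acks > len(reachable) {
		return nil
	}
	return errNoQuorum
}

func main() {
	// A 5-replica group split 3/2 by a partition.
	majority := []bool{true, true, true, false, false}
	minority := []bool{false, false, false, true, true}
	fmt.Println("majority side:", quorumWrite(majority)) // <nil>
	fmt.Println("minority side:", quorumWrite(minority)) // no quorum: write refused
}
```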
In my opinion, one of the most profound ideas in Spanner is the TrueTime API and the guarantee its implementation provides. I wonder if this project is going to have something similar?
From the description, it sounds like they're not at the moment. Instead, they seem to be aiming for the globally replicated, consistent, SQL-supporting features. Nothing wrong with that -- the world could use more geographically-aware database implementations. Seems like they'd be able to make use of the time sync for more efficient replication/failover/transactions in the future when the hardware is more widely available.
Probably because 1. LMDB is limited by the virtual address space, 2. it has one big global write lock, 3. it's a B-tree, and the last two of those contribute to the fact that 4. LMDB is a read-oriented database (performance-wise). I would conjecture that Rocks could also be 'more easily embeddable', but I'm talking out my ass there :)
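Point 3 is about B-trees versus LSM trees. RocksDB, like LevelDB, is a log-structured merge tree. It buffers writes in memory and flushes them as sorted immutable runs instead of updating pages in place. This toy Go sketch has hypothetical names and no WAL or compaction. It shows why that favors writes, and why a read may have to check several runs.

```go
package main

import (
	"fmt"
	"sort"
)

// kv is one key/value pair inside a sorted run.
type kv struct{ key, value string }

// toyLSM buffers writes in a memtable and flushes them as immutable sorted
// runs, so a write is an in-memory insert rather than an in-place page update.
type toyLSM struct {
	memtable map[string]string
	runs     [][]kv // oldest first
	limit    int    // hypothetical memtable size that triggers a flush
}

func newToyLSM(limit int) *toyLSM {
	return &toyLSM{memtable: map[string]string{}, limit: limit}
}

func (l *toyLSM) put(key, value string) {
	l.memtable[key] = value
	if len(l.memtable) >= l.limit {
		l.flush()
	}
}

// flush writes the memtable out as one sorted run (a stand-in for an SSTable).
func (l *toyLSM) flush() {
	run := make([]kv, 0, len(l.memtable))
	for k, v := range l.memtable {
		run = append(run, kv{k, v})
	}
	sort.Slice(run, func(i, j int) bool { return run[i].key < run[j].key })
	l.runs = append(l.runs, run)
	l.memtable = map[string]string{}
}

// get checks the memtable, then every run from newest to oldest; this is the
// read amplification a B-tree avoids.
func (l *toyLSM) get(key string) (string, bool) {
	if v, ok := l.memtable[key]; ok {
		return v, true
	}
	for i := len(l.runs) - 1; i >= 0; i-- {
		run := l.runs[i]
		j := sort.Search(len(run), func(j int) bool { return run[j].key >= key })
		if j < len(run) && run[j].key == key {
			return run[j].value, true
		}
	}
	return "", false
}

func main() {
	db := newToyLSM(2)
	db.put("a", "1")
	db.put("b", "2") // memtable full: flushed as run 0
	db.put("a", "3") // newer value shadows the flushed one
	v, _ := db.get("a")
	fmt.Println(v, len(db.runs)) // 3 1
}
```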
And yeah, you kind of have to sacrifice availability if you want to stay consistent in the face of write skew...
I don't see how. One global write lock means a single instance can't update multiple ranges at a time, so reaching consensus and writing from multiple peers would just take a long time for no reason. The whole point of an SSI MVCC design is to get around heavy locking...
If Moore's law holds, a single SSD will outgrow the address space in around 7 years. In four years, an array of eight disks would outgrow the address space. This is just for a single server. If you want a linearly-scaling, robust solution for future requirements (like multi-petabyte and exabyte distributed datastores), there's no reason to lock yourself into technology that'll be obsolete in half a decade.
(edit: SanDisk says it may release 8TB SSDs next year, also adding "We see reaching the 4TB mark as really just the beginning and expect to continue doubling the capacity every year or two, far outpacing the growth for traditional HDDs")
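The arithmetic behind these estimates can be checked quickly. Here is a short Go calculation. It assumes SanDisk's 8 TB as the starting capacity and one doubling every one to two years, both taken from the quote. Here 2^47 bytes is the user-space half of a 48-bit address space, and 2^48 bytes is the whole space.

```go
package main

import "fmt"

// doublingsToReach returns how many capacity doublings it takes for a device
// that starts at startBytes to reach or exceed limitBytes.
func doublingsToReach(startBytes, limitBytes uint64) int {
	if startBytes == 0 {
		return -1 // a zero-capacity device never reaches the limit
	}
	n := 0
	for c := startBytes; c < limitBytes; c *= 2 {
		n++
	}
	return n
}

func main() {
	const start = 8_000_000_000_000 // 8 TB, in decimal bytes as drive vendors count
	for _, bits := range []uint{47, 48} {
		limit := uint64(1) << bits
		n := doublingsToReach(start, limit)
		// Assumption: one doubling every 1 to 2 years.
		fmt.Printf("%d-bit space (%d TiB): %d doublings, about %d-%d years\n",
			bits, limit>>40, n, n, 2*n)
	}
}
```

This gives 5 doublings to pass 2^47 bytes and 6 to pass 2^48, or very roughly 5 to 12 years. The ~7-year figure falls inside that range.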
IIRC, current x86-64 chips are limited to 48-bit virtual addresses to simplify the address translation logic (cheaper to manufacture).
This makes sense for the current generation of storage subsystems, though it would be misleading to say that memory-mapping will be "obsolete in half a decade". The 48-bit limit is arbitrary. Manufacturers have 56-bit designs on the table right now, and nothing stops them from implementing full 64-bit virtual address support.
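Under 48-bit addressing, x86-64 only accepts "canonical" addresses. In these, bits 63 through 47 must all equal bit 47, which splits the space into two 128 TiB halves. That is also where the "47" below comes from. Here is a small Go check of that rule; the function name is made up for illustration.

```go
package main

import "fmt"

// isCanonical48 reports whether addr is a canonical x86-64 virtual address
// under 48-bit addressing: bits 63 through 47 must all equal bit 47.
func isCanonical48(addr uint64) bool {
	top := addr >> 47 // the 17 bits 63..47
	return top == 0 || top == 1<<17-1
}

func main() {
	for _, a := range []uint64{
		0x00007fffffffffff, // top of the lower half (user space on Linux)
		0x0000800000000000, // first non-canonical address
		0xffff800000000000, // bottom of the upper half (kernel space on Linux)
	} {
		fmt.Printf("%#018x canonical=%v\n", a, isCanonical48(a))
	}
	// The lower half alone is 2^47 bytes, i.e. 128 TiB per process.
	fmt.Println((uint64(1)<<47)>>40, "TiB")
}
```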
I'm not on the inside of CockroachDB's Raft implementation, but typically you've got a single thread processing AppendEntries requests in a defined order, exactly one at a time, to guarantee the same order of execution on every node. There might be some small savings from doing a couple of updates concurrently here and there, but your overall flow should be single-threaded.
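Here is a sketch of that single-threaded, in-order pattern in Go. It assumes a hypothetical channel of already-committed log entries and a plain map as the state machine. It illustrates the general Raft apply loop, not CockroachDB's code.

```go
package main

import "fmt"

// entry is a committed log entry carrying a hypothetical key/value write.
type entry struct {
	index uint64
	key   string
	value string
}

// applyLoop is the single consumer of committed entries. Applying them one at
// a time, strictly in log-index order, means every replica that sees the same
// log reaches the same state. It returns the last applied index.
func applyLoop(committed <-chan entry, state map[string]string) (uint64, error) {
	var lastApplied uint64
	for e := range committed {
		if e.index != lastApplied+1 {
			return lastApplied, fmt.Errorf("log gap: expected index %d, got %d", lastApplied+1, e.index)
		}
		state[e.key] = e.value
		lastApplied = e.index
	}
	return lastApplied, nil
}

func main() {
	committed := make(chan entry, 3)
	committed <- entry{1, "x", "a"}
	committed <- entry{2, "y", "b"}
	committed <- entry{3, "x", "c"}
	close(committed)

	state := map[string]string{}
	last, err := applyLoop(committed, state)
	fmt.Println(last, err, state["x"], state["y"]) // 3 <nil> c b
}
```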
As for the address space and big SSDs thing, I'd be willing to bet on Linux supporting mmap up to the biggest devices on the market, one way or another. Heck, there are only 16 more bits after that 47 before every FS under VFS has to be rewritten, right?
The author of LMDB doesn't really make bold claims, he actually just included LMDB (and the venerable Berkeley DB) in LevelDB's published benchmarks. The benchmarks were developed by the LevelDB team.
When I see a headline like that, I figure someone will take it as a dare. Of course the source is prone to overhype, so it's worth a grain (or twenty) of salt.
I like the name "CockroachDB", and I find it interesting that a bunch of the guys are working on it while they also work at Square, and that it's supposedly based on Google's research paper on Spanner, which I hadn't previously read about. Lot of good nuggets here.
The article mentions that Spanner has dependencies on several other Google projects. Assuming Google wants to open it up (more than publishing a paper), they would need to stub out all the dependencies first, which is a major effort.
It wouldn't benefit anyone. Spanner is most likely tightly coupled with their internal services and infrastructure.
If you want to know how Spanner works, read the paper. If you want to use it yourself, you'll need to build it on top of your own infrastructure, just like Google did.
Or more when it doesn't harm them. Google's open-sourced several projects (Protocol Buffers, Closure, Gumbo) that didn't directly benefit Google, but also don't harm them or give away highly-valuable IP either.
Personally I don't think takedowns are the biggest threat facing cloud users. By far the larger threat is having your cloud data harvested and used against you by adversaries such as advertising firms and three-letter agencies.
Targeting me with ads is absolutely an adversarial situation. I and most of the population would rather not see ads at all. I do not want advertising agencies building a profile of me so they can sell me stuff. Do you?
Maybe you'll come back with a completely-bogus, typical hackernews-ish response, deluding yourself, saying that yes you like it when advertisers target you based on the data they collect about you because of reasons X Y and Z, something something "better for me". Ads targeted at you do not benefit you in any way unless you work for an advertising firm.
And are you seriously arguing that NSA et al do not peek into cloud storage? Have you been living under a rock?
Most of the population seems to have determined that, given the tradeoff between "seeing ads" and "paying more for content," they are more willing to do the former than the latter.
Given current transaction costs, yes, but we don't really know what would happen if these were lower (both in money and hassle).
In any case, it's not every day people are given the choice. There's probably a handful of sites you can pay to remove ads, and the cost is usually much higher than the value the user would have provided in ad revenue.
I certainly cannot think of a single site that produces professional content without income. I did not say without ads. Ads do mean, however, that I don't have to pay sites directly and the value I get is well in excess of the value I pay by providing ad impressions.
Economics is not a hard subject. Stop being a reductivist. Or stop being a sneering jerk (and let's not pretend you aren't trying to pick a fight with your tone, 'kay?). Both not-reductivist and not-sneering-jerk would be nice, though.
> the value I get is well in excess of the value I pay by providing ad impressions.
You don't pay value. You pay a price, which provides value to the recipient. And the value received by the site being low does not mean the price is, since they're both subjective. Which is the problem with ads: for many, the price paid - the loss in privacy - is much greater than the pennies received by the advertiser. I'm glad you value your loss so low, but you shouldn't assume everyone does.
Oh, but this is where you're wrong; you seem to think one only values privacy if they're scared of black helicopters, but that's not the case. I value my privacy because I dislike the everyday intrusions on people's lives. As Raoul Vaneigem wrote, "The economy of everyday life is based on a continuous exchange of humiliations and aggressive attitudes", and tracking is nothing more than an automated and therefore efficient version of this.
Besides, privacy is like vaccination - it also needs herd immunity. If everyone is exposed, the few "important" people who really need it will stand out like a sore thumb.
It has nothing to do with black helicopters. It has to do with your data being utterly unimportant to anybody except yourself. This is why it's only valuable in the aggregate.
They don't have to care. Target didn't care about some random teenager being pregnant, but her father still found out due to their targeted ads. That they don't care is irrelevant.