One thing I notice is that over-use of sharding, especially hash-based sharding, can turn your relational database into just another key-value store: consistency constraints move into application code, and you lose many of the advantages of a traditional RDBMS
optimistically, you can try to shard by read use case, but that's never gonna be stable over time
if you need a true multi-tenant system you can only shard by individual entity and move all of the composition logic to the next layer up, there's no way to cheat
You’re exactly right. I work for a large cloud database service and the vast majority of our top customers shard by customer. This also gives you the benefit of using higher-level sharding abstractions that map to performance SKUs for more demanding customers, allowing much more efficient allocation of COGS.
> the only real answer here is to shard by customer
No.
Pick a stable, guaranteed-to-exist shard key (composite or atomic properties) and use that. If composite, order the properties from most-to-least distinct value distribution.
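For concreteness, a minimal Python sketch of routing rows by a hash of such a stable key; the shard count and the key parts passed by callers are assumptions for illustration, not anything prescribed in the thread:

```python
import hashlib

NUM_SHARDS = 16  # assumption: fixed shard count, illustration only

def shard_for(*key_parts) -> int:
    # Join the stable key parts into a canonical string, then hash it.
    # The part names callers use (e.g. tenant_id, region) are hypothetical.
    raw = ":".join(str(p) for p in key_parts).encode()
    digest = hashlib.sha256(raw).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS
```

The key point is stability: the same key always routes to the same shard, so anything that must be co-located can share the key.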
> if you need a true multi-tenant system you can only shard by individual entity and move all of the composition logic to the next layer up, there's no way to cheat
This is incorrect. Sharding and multi-tenancy are orthogonal concepts.
> No. Pick a stable, guaranteed-to-exist shard key (composite or atomic properties) and use that. If composite, order the properties from most-to-least distinct value distribution.
what??
if you shard users by user ID and orders by order ID, then a query that joins a bunch of user orders in the same tenant namespace will spread across multiple user shards and multiple order shards
value distribution doesn't really have any impact here
(shard keys are also guaranteed to exist by definition, not clear what you mean by that)
if you don't care that specific queries cross sharding boundaries, okay, then no problem, but in that case sharding is not solving the problem that we are talking about here
> Sharding and multi-tenancy are orthogonal concepts.
sharding and multi-tenancy are only orthogonal if you don't care that a single tenant can have information on multiple shards
>> No. Pick a stable, guaranteed-to-exist shard key (composite or atomic properties) and use that. If composite, order the properties from most-to-least distinct value distribution.
> what??
> if you shard users by user ID and orders by order ID, then a query that joins a bunch of user orders in the same tenant namespace will spread across multiple user shards and multiple order shards
Note my recommendation of picking a "stable, guaranteed-to-exist, shard key."
If there is a users table/collection sharded by its id and an orders table/collection sharded by its id, then there is no "guaranteed-to-exist shard key" between them, right? So, in that case, where the two are often accessed together, having a "guaranteed-to-exist shard key" of the "tenant namespace" would be the logical choice.
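As a sketch of that choice (the names `tenant_id`, `user`, `order` and the shard count are assumptions): if both tables route on the tenant namespace, a per-tenant users/orders join stays on a single shard.

```python
import hashlib

NUM_SHARDS = 8  # assumption: illustration only

def shard_of(tenant_id: str) -> int:
    # Route on the tenant namespace, which every row is guaranteed to carry.
    digest = hashlib.blake2b(tenant_id.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % NUM_SHARDS

# Hypothetical rows: both tables include the tenant id.
user = {"tenant_id": "acme", "user_id": 17}
order = {"tenant_id": "acme", "order_id": 903, "user_id": 17}

# Both rows hash to the same shard, so the per-tenant join is local.
same_shard = shard_of(user["tenant_id"]) == shard_of(order["tenant_id"])
```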
> value distribution doesn't really have any impact here
I mentioned value distribution strictly in the context of ordering composite shard keys (if applicable). My apologies for any confusion this might have introduced.
> (shard keys are also guaranteed to exist by definition, not clear what you mean by that)
My implication was in reference to what is always available when accessing sharded entities together. In the scenario you describe, sharding by either user or order id would not be ideal. In situations where one or the other is not sharded, then identifying a common shard key likely is not needed.
>> Sharding and multi-tenancy are orthogonal concepts.
> sharding and multi-tenancy are only orthogonal if you don't care that a single tenant can have information on multiple shards
My assertion was about theory, not specific scenarios. In practice, having multiple tables/collections needing shards, with the tenant as the common entity, strongly implies sharding on a tenant property and not on properties unique to each table/collection, as implied in your reply.
>Pick a stable, guaranteed-to-exist shard key (composite or atomic properties) and use that.
This is a pretty risky approach since it's almost certainly the case that you won't end up evenly distributing your data across shards using this method.
> This is a pretty risky approach since it's almost certainly the case that you won't end up evenly distributing your data across shards using this method.
Distributing data across shards is a function of the properties selected to use for partitioning. So I do not understand how "a stable, guaranteed-to-exist, shard key (composite or atomic properties)" is "a pretty risky approach."
While high volume multi-tenant "customers/users/accounts" systems are common, they are not the only ones which benefit from sharded persistent stores.
For example, consider a system which monitors farm equipment for Caterpillar and John Deere. Let's say each company has 100k devices which send one message per day to the system.
While it is easy to envision sharding device messages based on "DeviceId / Company" in this hypothetical system, there would be no value in sharding by customer, since there are only two of them.
you're right, uneven shards are an inevitable outcome of this approach
but shard "even-ness" is in direct tension with the concern of the GP, which is execution atomicity
frequently, it's better to have uneven shards (that you can e.g. scale independently when necessary) that give you atomic execution, than even shards that require distributed transactions
Past a certain point, you ought to think about whether to keep using an RDBMS as a K-V store or switch to a real distributed K-V store like Cassandra, ScyllaDB, DynamoDB and the like
As for hot spots, they have always been an issue with K-V stores, and the only real solution is good key design, though there are some tricks:
* Use a uniformly distributed but deterministic key prefix. For example, instead of using the raw user_id as the key, attach a small hash before it: (hash(<user_id>), <user_id>). This can help with load distribution if your <user_id> is not uniformly distributed by itself, such as a phone or ID number.
* Add more data to your key to increase cardinality. For example, with time series data, instead of using object_id alone as the partition key, use (object_id, time_bucket) so the data for a busy object gets split into different partitions over time.
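A quick Python sketch of both tricks; the blake2b salt, 16 salt buckets, and daily time buckets are arbitrary choices for illustration:

```python
import hashlib
from datetime import datetime

def salted_key(user_id: str, buckets: int = 16) -> tuple[int, str]:
    # Trick 1: prefix a small deterministic hash so skewed ids
    # (e.g. sequential phone numbers) spread across partitions.
    h = hashlib.blake2b(user_id.encode(), digest_size=2).digest()
    return (int.from_bytes(h, "big") % buckets, user_id)

def time_bucketed_key(object_id: str, ts: datetime) -> tuple[str, str]:
    # Trick 2: append a time bucket (daily here, an assumption) so a
    # hot object's rows split across partitions as time advances.
    return (object_id, ts.strftime("%Y-%m-%d"))
```

Both keys stay deterministic, so reads can recompute the same key; the salt only helps if readers know the full user_id, since range scans across salted prefixes become scatter-gathers.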
Sorry if I was unclear. Hoping to hear what strategies were practical within a given DB. A whole different platform is usually a much larger undertaking.