One thing I notice is that over-use of sharding, especially hash-based sharding, can turn your relational database into just another key-value store: consistency constraints move into application code, and you lose many of the advantages of a traditional RDBMS
optimistically, you can try to shard by read use case, but that's never gonna be stable over time
if you need a true multi-tenant system you can only shard by individual entity and move all of the composition logic to the next layer up, there's no way to cheat
You’re exactly right. I work for a large cloud database service and the vast majority of our top customers shard by customer. This also gives you the benefit of using higher-level sharding abstractions that map to performance SKUs for more demanding customers, allowing much more efficient allocation of COGS.
> the only real answer here is to shard by customer
No.
Pick a stable, guaranteed-to-exist shard key (composite or atomic properties) and use that. If composite, order the properties from most-to-least distinct value distribution.
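For concreteness, a minimal Python sketch of routing rows by a hash of such a stable key; the shard count and the key parts passed by callers are assumptions for illustration, not anything prescribed in the thread:

```python
import hashlib

NUM_SHARDS = 16  # assumption: fixed shard count, illustration only

def shard_for(*key_parts) -> int:
    # Join the stable key parts into a canonical string, then hash it.
    # The part names callers use (e.g. tenant_id, region) are hypothetical.
    raw = ":".join(str(p) for p in key_parts).encode()
    digest = hashlib.sha256(raw).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS
```

The key point is stability: the same key always routes to the same shard, so anything that must be co-located can share the key.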
> if you need a true multi-tenant system you can only shard by individual entity and move all of the composition logic to the next layer up, there's no way to cheat
This is incorrect. Sharding and multi-tenancy are orthogonal concepts.
> No. Pick a stable, guaranteed-to-exist shard key (composite or atomic properties) and use that. If composite, order the properties from most-to-least distinct value distribution.
what??
if you shard users by user ID and orders by order ID, then a query that joins a bunch of user orders in the same tenant namespace will spread across multiple user shards and multiple order shards
value distribution doesn't really have any impact here
(shard keys are also guaranteed to exist by definition, not clear what you mean by that)
if you don't care that specific queries cross sharding boundaries, okay, then no problem, but in that case sharding is not solving the problem that we are talking about here
> Sharding and multi-tenancy are orthogonal concepts.
sharding and multi-tenancy are only orthogonal if you don't care that a single tenant can have information on multiple shards
>> No. Pick a stable, guaranteed-to-exist shard key (composite or atomic properties) and use that. If composite, order the properties from most-to-least distinct value distribution.
> what??
> if you shard users by user ID and orders by order ID, then a query that joins a bunch of user orders in the same tenant namespace will spread across multiple user shards and multiple order shards
Note my recommendation of picking a "stable, guaranteed-to-exist, shard key."
If there is a users table/collection sharded by its id and an orders table/collection sharded by its id, then there is no "guaranteed-to-exist shard key" between them, right? So, in that case, where the two are often accessed together, having a "guaranteed-to-exist shard key" of the "tenant namespace" would be the logical choice.
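As a sketch of that choice (the names `tenant_id`, `user`, `order` and the shard count are assumptions): if both tables route on the tenant namespace, a per-tenant users/orders join stays on a single shard.

```python
import hashlib

NUM_SHARDS = 8  # assumption: illustration only

def shard_of(tenant_id: str) -> int:
    # Route on the tenant namespace, which every row is guaranteed to carry.
    digest = hashlib.blake2b(tenant_id.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % NUM_SHARDS

# Hypothetical rows: both tables include the tenant id.
user = {"tenant_id": "acme", "user_id": 17}
order = {"tenant_id": "acme", "order_id": 903, "user_id": 17}

# Both rows hash to the same shard, so the per-tenant join is local.
same_shard = shard_of(user["tenant_id"]) == shard_of(order["tenant_id"])
```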
> value distribution doesn't really have any impact here
I mentioned value distribution strictly in the context of ordering composite shard keys (if applicable). My apologies for any confusion this might have introduced.
> (shard keys are also guaranteed to exist by definition, not clear what you mean by that)
My implication was in reference to what is always available when accessing sharded entities together. In the scenario you describe, sharding by either user or order id would not be ideal. In situations where one or the other is not sharded, then identifying a common shard key likely is not needed.
>> Sharding and multi-tenancy are orthogonal concepts.
> sharding and multi-tenancy are only orthogonal if you don't care that a single tenant can have information on multiple shards
My assertion was about theory, not specific scenarios. In practice, having multiple tables/collections needing shards, with the tenant as the common entity, strongly implies sharding on a tenant property and not on properties unique to each table/collection, as implied in your reply.
>Pick a stable, guaranteed-to-exist shard key (composite or atomic properties) and use that.
This is a pretty risky approach since it's almost certainly the case that you won't end up evenly distributing your data across shards using this method.
> This is a pretty risky approach since it's almost certainly the case that you won't end up evenly distributing your data across shards using this method.
Distributing data across shards is a function of the properties selected to use for partitioning. So I do not understand how "a stable, guaranteed-to-exist, shard key (composite or atomic properties)" is "a pretty risky approach."
While high volume multi-tenant "customers/users/accounts" systems are common, they are not the only ones which benefit from sharded persistent stores.
For example, consider a system which monitors farm equipment for Caterpillar and John Deere. Let's say each company has 100k devices which send one message per day to the system.
While it is easy to envision sharding device messages based on "DeviceId / Company" in this hypothetical system, there would be no value in sharding by customer, since there are only two of them.
you're right, uneven shards are an inevitable outcome of this approach
but shard "even-ness" is in direct tension with the concern of the GP, which is execution atomicity
frequently, it's better to have uneven shards (that you can e.g. scale independently when necessary) that give you atomic execution, than even shards that require distributed transactions
Past a certain point, you ought to think about whether to keep using an RDBMS as a K-V store or switch to a real distributed K-V store like Cassandra, ScyllaDB, DynamoDB and the like
As for hot spots, they have always been an issue with K-V stores, and the only real solution is good key design, though there are some tricks:
* Use a uniformly distributed but deterministic key prefix. For example, instead of using the raw user_id as the key, attach a small hash before it: (hash(<user_id>), <user_id>). This can help with load distribution if your <user_id> is not uniformly distributed by itself, such as a phone or ID number.
* Add more data to your key to increase cardinality. For example, with time series data, instead of using object_id alone as the partition key, use (object_id, time_bucket) so the data for a busy object gets split into different partitions over time.
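A quick Python sketch of both tricks; the blake2b salt, 16 salt buckets, and daily time buckets are arbitrary choices for illustration:

```python
import hashlib
from datetime import datetime

def salted_key(user_id: str, buckets: int = 16) -> tuple[int, str]:
    # Trick 1: prefix a small deterministic hash so skewed ids
    # (e.g. sequential phone numbers) spread across partitions.
    h = hashlib.blake2b(user_id.encode(), digest_size=2).digest()
    return (int.from_bytes(h, "big") % buckets, user_id)

def time_bucketed_key(object_id: str, ts: datetime) -> tuple[str, str]:
    # Trick 2: append a time bucket (daily here, an assumption) so a
    # hot object's rows split across partitions as time advances.
    return (object_id, ts.strftime("%Y-%m-%d"))
```

Both keys stay deterministic, so reads can recompute the same key; the salt only helps if readers know the full user_id, since range scans across salted prefixes become scatter-gathers.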
Sorry if I was unclear. Hoping to hear what strategies were practical within a given DB. A whole different platform is usually a much larger undertaking.