I've found myself explaining how sharding works to many people over the past year and realized it would be useful (and maybe even interesting) to break it down in plain English.
Sharding, at its core, is splitting your data into smaller chunks spread across distinct buckets. A bucket could be a table, a Postgres schema, or a separate physical database. Then, as you need to continue scaling, you can move your shards to new physical nodes, improving performance.
The one step that happens in every sharding implementation is deciding what you’re going to shard or partition your data on. There are a number of trade-offs to various keys, and which is right really depends on your application. Once you determine what your sharding key will be, you need to ensure the sharding key is in place throughout your application. You can do this a few ways; an easy one is materializing that sharding key on all your models. By materializing or de-normalizing it, you allow your application to make fewer queries to determine how it needs to route the data.
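To make that concrete, here is a minimal sketch of what materializing the sharding key can look like in plain SQL. The table and column names (visits, page_views, customer_id) are hypothetical; the point is that every table carries customer_id, even where it could technically be derived through a join.

-- Hypothetical CRM-style schema: every table carries the sharding key
-- (customer_id), even where it could be derived through a join.
CREATE TABLE visits (
  customer_id bigint NOT NULL,
  visit_id    bigint NOT NULL,
  occurred_at timestamptz NOT NULL DEFAULT now(),
  PRIMARY KEY (customer_id, visit_id)
);

CREATE TABLE page_views (
  customer_id bigint NOT NULL,   -- de-normalized copy of the sharding key
  visit_id    bigint NOT NULL,
  page        text   NOT NULL,
  viewed_at   timestamptz NOT NULL DEFAULT now()
);

With customer_id on every table, any query that knows the customer can be routed without first joining back to some parent table to figure out where the data lives.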
When a request comes in, if you’re doing your sharding at the application layer, your application will need to determine how to route the request. The same principle applies with Citus; you just don’t have to do any extra work to determine how to route the request, because Citus figures that out for you. If your shard key is on the query itself, Citus sees the shard key and routes the query accordingly.
Bringing this a little more to life, here is an example query sent to Citus and how Citus transforms it:
SELECT *
FROM visits
WHERE occurred_at >= now() - '1 week'::interval
AND customer_id = 2
The above gets transformed with the Citus router executor to:
SELECT *
FROM visits_007
WHERE occurred_at >= now() - '1 week'::interval
AND customer_id = 2
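For reference, the only setup Citus needs in order to do that routing is knowing the distribution column. A minimal sketch, assuming the hypothetical visits and page_views tables from earlier and using the create_distributed_table function Citus provides:

-- Distribute both tables by customer_id so Citus can route queries
-- that filter on the sharding key straight to a single shard.
SELECT create_distributed_table('visits', 'customer_id');
SELECT create_distributed_table('page_views', 'customer_id');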
So there you have it, you’re done, right? You know all you need to know about sharding… well, not quite yet.
A common misconception about sharding is that when you define your shard key, the actual value of the shard key is what lives in the metadata tables that determine the routing. Let’s look at a real-world example: say you’re building a SaaS application. You start in year 1 and get 20 customers, in year 2 you get 100 customers, and in year 3 you get 500 customers. Imagine your SaaS application is a CRM system, and you decide to shard by customer because the data for each of your customers needs to be kept separate from the others. Because your early-adopter customers will have been using your application over a longer time period, customers from years 1 and 2 are likely to have significantly more data than the new customers who onboarded in year 3. One option to define how values map to shards would be the following (a sketch of the lookup this implies follows the list):
shard 1: customer ids 1-100
shard 2: customer ids 101-200
shard 3: customer ids 201-300
shard 4: customer ids 301-400
shard 5: customer ids 401-500
shard 6: customer ids 501-
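A routing table for that range-based scheme could be as simple as the following sketch. The shard_ranges table and its columns are hypothetical, purely to illustrate the lookup:

-- Hypothetical metadata table mapping customer id ranges to shards
CREATE TABLE shard_ranges (
  shardid         int PRIMARY KEY,
  min_customer_id bigint NOT NULL,
  max_customer_id bigint          -- NULL for the open-ended last shard
);

-- Which shard holds customer 42?
SELECT shardid
FROM shard_ranges
WHERE 42 >= min_customer_id
  AND (max_customer_id IS NULL OR 42 <= max_customer_id);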
Because our older customers have had more time to accumulate data, you now have an uneven skew of data. With customers with ids 1-100 (those with the most data) all living on the same shard, you can see why this may not be ideal.
Sharding by customer is super common—especially for multi-tenant applications—and has lots of benefits for performance and scale. But the example above highlights how you can end up with a very uneven distribution of data across shards. The solution lies in the implementation of how you shard. As I like to say, it’s just an “implementation detail”. In this case at least, that’s more true than not.
The solution is simple: you hash each customer_id as it comes in, then you have a table that maps which hash values are stored in which shards. Let’s say you have 8 shards, and you start hashing the customer_ids. The hash range on integers within Postgres runs from -2147483648 (-2^31) to 2147483647 (2^31-1). If we were to divide that hash range evenly, we would have shards with the following values (a quick sanity check of these boundaries follows the table):
shardid | min hash value | max hash value
---------+----------------+----------------
1 | -2147483648 | -1610612737
2 | -1610612736 | -1073741825
3 | -1073741824 | -536870913
4 | -536870912 | -1
5 | 0 | 536870911
6 | 536870912 | 1073741823
7 | 1073741824 | 1610612735
8 | 1610612736 | 2147483647
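Those boundaries fall out of dividing the 2^32 possible hash values into 8 equal slices of 536870912 each. If you want to double-check them, here is a quick sketch in plain SQL (no Citus required):

-- Split the int4 hash space (-2^31 .. 2^31-1) into 8 equal ranges
SELECT s AS shardid,
       -2147483648 + (s - 1) * 536870912::bigint     AS min_hash_value,
       -2147483648 +  s      * 536870912::bigint - 1 AS max_hash_value
FROM generate_series(1, 8) AS s;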
As our customer_ids come in, we evaluate the hash value of each one. For customer 1, our hash comes out to -1905060026, for customer 2 it comes out to 1134484726, and for customer 3 it comes out to -28094569. So as we can see, our early customers from year 1 already start to get evenly distributed among shards.
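Mapping those hash values back onto the ranges above, customer 1 (-1905060026) lands in shard 1, customer 2 (1134484726) in shard 7, and customer 3 (-28094569) in shard 4. If you’re on a recent Citus version you don’t have to do this arithmetic yourself; assuming the distributed visits table sketched earlier, the get_shard_id_for_distribution_column function reports the shard directly (Citus assigns its own shard ids, so the numbers will differ from the 1 through 8 labels above):

-- Ask Citus which shard each customer_id maps to
SELECT get_shard_id_for_distribution_column('visits', 1);
SELECT get_shard_id_for_distribution_column('visits', 2);
SELECT get_shard_id_for_distribution_column('visits', 3);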
As we briefly mentioned at the very beginning, a shard is some distinct grouping of data. Too often a shard gets equated with a physical instance. In practice, there is a lot of leverage in beginning with a higher number of shards than you have underlying instances. In the case of Citus, shards are Postgres tables, and we run nodes which contain at least one, but often many more, shards. By placing multiple shards on a single node, you’re able to easily scale by moving a shard between nodes without having to break apart the data within the shard.
Let’s go back to our above example of the 8 shards we have. Assuming we have two nodes they might map something like the following:
shardid | min hash value | max hash value | nodeid
---------+----------------+----------------+--------
1 | -2147483648 | -1610612737 | 1
2 | -1610612736 | -1073741825 | 2
3 | -1073741824 | -536870913 | 1
4 | -536870912 | -1 | 2
5 | 0 | 536870911 | 1
6 | 536870912 | 1073741823 | 2
7 | 1073741824 | 1610612735 | 1
8 | 1610612736 | 2147483647 | 2
If we wanted to scale out our setup, we could easily move half of our shards from one node to another without having to change our mapping:
shardid | min hash value | max hash value | nodeid
---------+----------------+----------------+--------
1 | -2147483648 | -1610612737 | 1
2 | -1610612736 | -1073741825 | 2
3 | -1073741824 | -536870913 | 3
4 | -536870912 | -1 | 4
5 | 0 | 536870911 | 1
6 | 536870912 | 1073741823 | 2
7 | 1073741824 | 1610612735 | 3
8 | 1610612736 | 2147483647 | 4
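In Citus, a move like that is a single function call; depending on your version the function is citus_move_shard_placement (or master_move_shard_placement in older releases). A hedged sketch, with a hypothetical shard id and hostnames:

-- Move one shard from node 1 to a newly added node 3 (hypothetical
-- shard id and hostnames; the shard's hash range does not change).
SELECT citus_move_shard_placement(
  102008,            -- shardid to move
  'node-1', 5432,    -- current placement
  'node-3', 5432     -- new placement
);

Newer Citus releases also ship a rebalance_table_shards() function that automates spreading shards evenly across nodes, though whether you reach for that or move shards by hand depends on your setup.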
Whether you’re sharding by a granular uuid or by something higher in your model hierarchy like customer id, the approach of hashing your shard key before you leverage it remains the same. Hashing your partition key and keeping a mapping of how things route is key to a scalable sharding approach. Within Citus we follow the same standard principles that many others have followed for years when they hand-rolled their own sharding; you just don’t have to put in all that effort now. Under the covers, we automatically evaluate the hash value as it comes in and, using the catalog tables, quickly determine how to route to the appropriate table or shard without you needing to re-write any queries.
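If you’re curious what that catalog metadata looks like on a Citus cluster, the shard-to-hash-range mapping for a distributed table lives in the pg_dist_shard catalog table (shown here for the visits table from the earlier sketch; the exact shard ids and ranges will vary by cluster and shard count):

-- Hash range boundaries for each shard of the visits table
SELECT shardid, shardminvalue, shardmaxvalue
FROM pg_dist_shard
WHERE logicalrelid = 'visits'::regclass
ORDER BY shardid;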
Sharding may not make sense in all cases, but for most, if the data model fits into a clean sharding model, there can be a lot of gains. You can get performance boosts even on smaller datasets, and more importantly, you can rest easier at night not having to worry about when you’re going to run into a ceiling on how much higher you can grow. If you have any questions about whether sharding may make sense for you, feel free to drop us a note.
This post originally appeared on Citusdata.com.