Understanding Database Sharding: The Good and the Ugly
What is database sharding?
Sharding is a database architecture pattern related to horizontal partitioning: the practice of separating one table’s rows into multiple different tables, known as partitions. Each partition has the same schema and columns but entirely different rows, and the data held in each partition is unique and independent of the data held in the other partitions.
Advantages of sharding
Scaling
The main appeal of sharding a database is that it can help to facilitate horizontal scaling, also known as scaling out. Horizontal scaling is the practice of adding more machines to an existing stack in order to spread out the load and allow for more traffic and faster processing. This is often contrasted with vertical scaling, otherwise known as scaling up, which involves upgrading the hardware of an existing server, usually by adding more RAM or CPU.
Improve Performance
Another reason why some might choose a sharded database architecture is to speed up query response times. When you submit a query on a database that hasn’t been sharded, it may have to search every row in the table you’re querying before it can find the result set you’re looking for. For an application with a large, monolithic database, queries can become prohibitively slow. By sharding one table into multiple, though, queries have to go over fewer rows and their result sets are returned much more quickly.
Reliability
Sharding can also help to make an application more reliable by mitigating the impact of outages. If your application or website relies on an unsharded database, an outage has the potential to make the entire application unavailable. With a sharded database, though, an outage is likely to affect only a single shard. Even though this might make some parts of the application or website unavailable to some users, the overall impact would still be less than if the entire database crashed.
Easier to Manage
Production databases must be fully managed for regular backups, database optimization, and other common tasks. With a single large database, these routine tasks can be very difficult to accomplish, if only in terms of the time window required for completion. Routine table and index optimizations can stretch from hours to days, in some cases making regular maintenance infeasible. With the sharding approach, each individual “shard” can be maintained independently, which makes maintenance far more manageable and allows such tasks to be performed in parallel.
Reduce Costs
Most database sharding implementations take advantage of low-cost open-source databases and commodity databases. The technique can also take full advantage of reasonably priced “workgroup” versions of many commercial databases. Sharding works well with commodity multi-core server hardware, systems that are far less expensive than high-end, multi-CPU servers and expensive storage area networks (SANs). The overall reduction in cost due to savings in license fees, software maintenance, and hardware investment is substantial, in some cases around 70% compared to traditional solutions.
Sharding types
Sharding by Key Range
How does it work?
One way of partitioning is to assign a continuous range of keys (from some minimum to some maximum) to each partition, like the volumes of a paper encyclopedia.
If you know the boundaries between the ranges, you can easily determine which partition contains a given key. If you also know which partition is assigned to which node, then you can make your request directly to the appropriate node (or, in the case of the encyclopedia, pick the correct book off the shelf).
Within each partition, we can keep keys in sorted order. This has the advantage that range scans are easy, and you can treat the key as a concatenated index in order to fetch several related records in one query.
For example, consider an application that stores data from a network of sensors, where the key is the timestamp of the measurement (year-month-day-hour-minute-second). Range scans are very useful in this case because they let you easily fetch, say, all the readings from a particular month.
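To make the lookup concrete, here is a minimal Python sketch of key-range partitioning, assuming illustrative quarter-based boundaries and partition names (not any particular database’s layout). A binary search over the sorted boundaries finds the partition whose range contains a given key:

```python
# A minimal sketch of key-range partitioning; the boundaries and partition
# names below are illustrative, not any particular database's layout.
import bisect

# Upper-exclusive boundaries between partitions, kept in sorted order,
# e.g. one partition per quarter of sensor readings.
BOUNDARIES = ["2024-04-01", "2024-07-01", "2024-10-01"]
PARTITIONS = ["part-0", "part-1", "part-2", "part-3"]   # one more than boundaries

def partition_for(key: str) -> str:
    """Return the partition whose key range contains `key`."""
    return PARTITIONS[bisect.bisect_right(BOUNDARIES, key)]

print(partition_for("2024-02-14 08:30:00"))   # part-0
print(partition_for("2024-08-01 12:00:00"))   # part-2
```

Because the keys in each partition are kept in sorted order, a range scan for, say, all the readings from August only has to touch the partitions whose ranges overlap that month.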
The downside of key range partitioning (hot spots)
However, the downside of key range partitioning is that certain access patterns can lead to hot spots. If the key is a timestamp, then the partitions correspond to ranges of time, for example one partition per day. Unfortunately, because we write data from the sensors to the database as the measurements happen, all the writes end up going to the same partition (the one for today), so that partition can be overloaded with writes.
To avoid this problem in the sensor database, you need to use something other than the timestamp as the first element of the key. For example, you could prefix each timestamp with the sensor name so that the partitioning is first by sensor name and then by time. Assuming you have many sensors active at the same time, the write load will end up more evenly spread across the partitions. Now, when you want to fetch the values of multiple sensors within a time range, you need to perform a separate range query for each sensor name.
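As a rough illustration of this workaround, the sketch below makes the sensor name the first element of the key, so writes from concurrently active sensors land on different partitions; the name-range boundaries and the in-memory “shards” are assumptions made purely for the example:

```python
# A toy illustration of the compound-key approach above: the first key element
# is the sensor name, the second is the timestamp. The name-range boundaries
# and the dict-based "shards" are stand-ins for real partitions.
import bisect

BOUNDARIES = ["h", "p"]                        # 3 partitions: names a-g, h-o, p-z
shards = [dict() for _ in range(len(BOUNDARIES) + 1)]

def shard_for(sensor: str) -> dict:
    # Route by the sensor name (the first key element), not the timestamp,
    # so concurrent sensors spread their writes across partitions.
    return shards[bisect.bisect_right(BOUNDARIES, sensor)]

def write(sensor: str, timestamp: str, value: float) -> None:
    shard_for(sensor)[(sensor, timestamp)] = value   # timestamps stay sorted per sensor

def read_range(sensor: str, start: str, end: str) -> list[float]:
    # Fetching a time range now requires one range query per sensor name.
    return [v for (s, t), v in sorted(shard_for(sensor).items())
            if s == sensor and start <= t <= end]
```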
Sharding by Hash of Key
How does it work?
Because of this risk of skew and hot spots, many distributed datastores use a hash function to determine the partition for a given key.
A good hash function takes skewed data and makes it uniformly distributed. Say you have a 32-bit hash function that takes a string. Whenever you give it a new string, it returns a seemingly random number between 0 and 2³² − 1.
Even if the input strings are very similar, their hashes are evenly distributed across that range of numbers.
For partitioning purposes, the hash function need not be cryptographically strong: for example, Cassandra and MongoDB use MD5, and Voldemort uses the Fowler–Noll–Vo function. Many programming languages have simple hash functions built in (as they are used for hash tables), but they may not be suitable for partitioning: for example, in Java’s Object.hashCode() and Ruby’s Object#hash, the same key may have a different hash value in different processes.
Once you have a suitable hash function for keys, you can assign each partition a range of hashes (rather than a range of keys), and every key whose hash falls within a partition’s range will be stored in that partition.
This technique is good at distributing keys fairly among the partitions. The partition boundaries can be evenly spaced, or they can be chosen pseudorandomly (in which case the technique is sometimes known as consistent hashing).
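A minimal sketch of this scheme is shown below, assuming a 32-bit hash space split into evenly spaced ranges. MD5 is used only as a conveniently stable hash: as noted above, cryptographic strength is not required, but built-in hashes such as Java’s Object.hashCode() may not be stable across processes.

```python
# A sketch of hash partitioning over evenly spaced hash ranges.
# The partition count and key format are illustrative assumptions.
import hashlib

NUM_PARTITIONS = 16
HASH_SPACE = 2 ** 32

def stable_hash(key: str) -> int:
    # 32 bits of an MD5 digest: stable across processes and uniform enough
    # for partitioning; cryptographic strength is not needed here.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % HASH_SPACE

def partition_for(key: str) -> int:
    # Evenly spaced boundaries: partition i owns hashes in
    # [i * HASH_SPACE // NUM_PARTITIONS, (i + 1) * HASH_SPACE // NUM_PARTITIONS).
    return stable_hash(key) * NUM_PARTITIONS // HASH_SPACE

print(partition_for("sensor-42|2024-08-01 12:00:00"))   # some value in 0..15
```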
Consistent hashing
Consistent hashing is a way of evenly distributing the load across an internet-wide system of caches such as a content delivery network (CDN).
It uses randomly chosen partition boundaries to avoid the need for central control or distributed consensus. Note that consistent here has nothing to do with replica consistency or ACID consistency, but rather describes a particular approach to rebalancing. As we shall see in “Rebalancing Shards”, this particular approach actually doesn’t work very well for databases, so it is rarely used in practice (the documentation of some databases still refers to consistent hashing, but it is often inaccurate).
Because this is so confusing, it’s best to avoid the term consistent hashing and just call it hash partitioning instead.
Rebalancing Shards
Over time, things change in a database:
- The query throughput increases, so you want to add more CPUs to handle the load.
- The dataset size increases, so you want to add more disks and RAM to store it.
- A machine fails, and other machines need to take over the failed machine’s responsibilities.
All of these changes call for data and requests to be moved from one node to another. The process of moving load from one node in the cluster to another is called rebalancing.
No matter which partitioning scheme is used, rebalancing is usually expected to meet some minimum requirements:
- After rebalancing, the load (data storage, read and write requests) should be shared fairly between the nodes in the cluster.
- While rebalancing is happening, the database should continue accepting reads and writes.
- No more data than necessary should be moved between nodes, to make rebalancing fast and to minimize the network and disk I/O load.
Strategies for Rebalancing
There are a few different ways of assigning partitions to nodes. Let’s briefly discuss each in turn.
How not to do it (hash mod N)
When partitioning by the hash of a key, we said earlier that it’s best to divide the possible hashes into ranges and assign each range to a partition (e.g., assign a key to partition 0 if 0 ≤ hash(key) < b0, to partition 1 if b0 ≤ hash(key) < b1, etc.).
Perhaps you wondered why we don’t just use mod (the % operator in many programming languages).
For example, hash(key) mod 10 would return a number between 0 and 9 (if we write the hash as a decimal number, the hash mod 10 would be the last digit).
The problem with the mod N approach is that if the number of nodes N changes, most of the keys will need to be moved from one node to another. For example, say hash(key) = 123456. If you initially have 10 nodes, that key starts out on node 6 (because 123456 mod 10 = 6). When you grow to 11 nodes, the key needs to move to node 3 (123456 mod 11 = 3), and when you grow to 12 nodes, it needs to move to node 0 (123456 mod 12 = 0). Such frequent moves make rebalancing excessively expensive.
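A quick way to see the scale of the problem is to count how many keys change nodes when the cluster grows by one node; the sketch below reuses an MD5-based stable hash and arbitrary key names purely for illustration:

```python
# A small demonstration of why hash mod N rebalances badly: count how many of
# 10,000 arbitrary keys would change nodes when a 10-node cluster grows to 11.
import hashlib

def stable_hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

keys = [f"key-{i}" for i in range(10_000)]
moved = sum(1 for k in keys if stable_hash(k) % 10 != stable_hash(k) % 11)
print(f"{moved} of {len(keys)} keys move when going from 10 to 11 nodes")
# Expect roughly 90% of the keys to move, which is why rebalancing this way
# is so expensive.
```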
We need an approach that doesn’t move data around more than necessary.
Fixed number of partitions
Fortunately, there is a fairly simple solution: create many more partitions than there are nodes, and assign several partitions to each node. For example, a database running on a cluster of 10 nodes may be split into 1,000 partitions from the outset so that approximately 100 partitions are assigned to each node.
Now, if a node is added to the cluster, the new node can steal a few partitions from every existing node until partitions are fairly distributed once again. If a node is removed from the cluster, the same happens in reverse.
Only entire partitions are moved between nodes. The number of partitions does not change, nor does the assignment of keys to partitions. The only thing that changes is the assignment of partitions to nodes. This change of assignment is not immediate, since it takes some time to transfer a large amount of data over the network, so the old assignment of partitions is used for any reads and writes that happen while the transfer is in progress.
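The sketch below illustrates this scheme: the key-to-partition mapping is untouched, and adding a node only reassigns a fair share of whole partitions. The helper names and the naive “steal the first N partitions” choice are assumptions for the example, not how any particular database decides which partitions to move.

```python
# A sketch of the fixed-number-of-partitions scheme: the key -> partition
# mapping never changes; adding a node only changes which node holds which
# partitions.
NUM_PARTITIONS = 1000

def assign(partitions, nodes):
    # Initial round-robin assignment of partitions to nodes.
    return {p: nodes[i % len(nodes)] for i, p in enumerate(partitions)}

def add_node(assignment, new_node):
    nodes = sorted(set(assignment.values())) + [new_node]
    fair_share = len(assignment) // len(nodes)
    stolen = set(list(assignment)[:fair_share])   # partitions the new node steals (simplified)
    return {p: (new_node if p in stolen else n) for p, n in assignment.items()}

before = assign(list(range(NUM_PARTITIONS)), ["node-0", "node-1", "node-2"])
after = add_node(before, "node-3")
moved = sum(1 for p in before if before[p] != after[p])
print(f"{moved} of {NUM_PARTITIONS} partitions moved")   # 250 moved, 750 stayed put
```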
Dynamic partitioning
For databases that use key range partitioning, a fixed number of partitions with fixed boundaries would be very inconvenient: if you got the boundaries wrong, you could end up with all of the data in one partition and all of the other partitions empty.
Reconfiguring the partition boundaries manually would be very tedious.
For that reason, key range–partitioned databases such as HBase and RethinkDB create partitions dynamically.
When a partition grows to exceed a configured size (on HBase, the default is 10 GB), it is split into two partitions so that approximately half of the data ends up on each side of the split. Conversely, if lots of data is deleted and a partition shrinks below some threshold, it can be merged with an adjacent partition. This process is similar to what happens at the top level of a B-tree.
Each partition is assigned to one node, and each node can handle multiple partitions, like in the case of a fixed number of partitions. After a large partition has been split, one of its two halves can be transferred to another node in order to balance the load.
In the case of HBase, the transfer of partition files happens through HDFS, the underlying distributed filesystem.
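As a toy illustration of the splitting step, the sketch below splits an in-memory “partition” at its median key once it exceeds a configured size threshold; the tiny threshold and dict-based structures are stand-ins for real partition files.

```python
# A toy sketch of dynamic partitioning: once a partition exceeds a configured
# size threshold it is split at its median key, much like a B-tree node split.
MAX_PARTITION_SIZE = 4        # e.g. HBase's default is around 10 GB; tiny here

def split_if_needed(partition: dict) -> list[dict]:
    if len(partition) <= MAX_PARTITION_SIZE:
        return [partition]
    keys = sorted(partition)
    mid = keys[len(keys) // 2]                      # split point: the median key
    left = {k: v for k, v in partition.items() if k < mid}
    right = {k: v for k, v in partition.items() if k >= mid}
    return [left, right]

p = {f"2024-0{m}-01": m for m in range(1, 7)}       # 6 rows, exceeds the threshold
print([sorted(part) for part in split_if_needed(p)])
# [['2024-01-01', '2024-02-01', '2024-03-01'],
#  ['2024-04-01', '2024-05-01', '2024-06-01']]
```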
An advantage of dynamic partitioning is that the number of partitions adapts to the total data volume. If there is only a small amount of data, a small number of partitions is sufficient, so overheads are small; if there is a huge amount of data, the size of each individual partition is limited to a configurable maximum.
However, a caveat is that an empty database starts off with a single partition, since there is no a priori information about where to draw the partition boundaries. While the dataset is small, until it hits the point at which the first partition is split, all writes have to be processed by a single node while the other nodes sit idle.
To mitigate this issue, HBase and MongoDB allow an initial set of partitions to be configured on an empty database (this is called pre-splitting). In the case of key-range partitioning, pre-splitting requires that you already know what the key distribution is going to look like.
Dynamic partitioning is not only suitable for key range–partitioned data, but can equally well be used with hash-partitioned data. MongoDB, since version 2.4, supports both key-range and hash partitioning, and it splits partitions dynamically in either case.
Disadvantages of sharding
Adds complexity to the system
Properly implementing a sharded database architecture is a complex task. If not done correctly, there is a significant risk that the sharding process can lead to lost data or corrupted tables. Sharding also has a major impact on your team’s workflows. Rather than managing and accessing one’s data from a single entry point, users must manage data across multiple shard locations, which could be potentially disruptive to some teams.
Rebalancing data
In a sharded database architecture, sometimes one shard outgrows the others and becomes unbalanced, which is also known as a database hotspot. In this case, any benefits of sharding the database are canceled out. The database would likely need to be re-sharded to allow for a more even data distribution. Rebalancing has to be built in from the start; otherwise, re-sharding and moving data from one shard to another requires a lot of downtime.
Joining data from multiple shards
To implement some complex functionality, we may need to pull a lot of data from different sources spread across multiple shards. We can’t issue a single query and get data from all shards at once; we need to issue separate queries to the different shards, collect all the responses, and merge them.
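In practice this is usually a scatter-gather pattern in application code: fan the query out to every shard, then merge the partial results. The sketch below assumes a hypothetical query_shard helper standing in for a real per-shard database client.

```python
# A sketch of the scatter-gather pattern: fan the query out to every shard in
# parallel, then merge the partial results in the application.
from concurrent.futures import ThreadPoolExecutor

SHARDS = ["shard-0", "shard-1", "shard-2"]

def query_shard(shard: str, sql: str) -> list[dict]:
    # Hypothetical stand-in: in practice this would run `sql` against that shard.
    return [{"shard": shard, "total": 42}]

def scatter_gather(sql: str) -> list[dict]:
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partial_results = pool.map(lambda s: query_shard(s, sql), SHARDS)
    # The merge step here is a simple concatenation; cross-shard joins and
    # aggregations need extra application-side logic on top of this.
    return [row for rows in partial_results for row in rows]

print(scatter_gather("SELECT SUM(amount) AS total FROM orders"))
```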
No Native Support
Sharding is not natively supported by every database engine. Because of this, sharding often requires a “roll your own” approach. This means that documentation for sharding or tips for troubleshooting problems are often difficult to find.
Conclusion
To wrap up this article, here are some resources for further reading:
- Understanding Database Sharding
- Sharding in Distributed Systems
- Designing Data-Intensive Applications (Chapter 6)