Sharding: Architecture Pattern

Scalability is a core tenet in the design and development of systems, applications, and infrastructure. Scaling is a given in today’s world of distributed systems, and while we can scale our services easily (assuming they’re stateless!), the same cannot be said for stateful systems like data-stores. In this article, we’ll delve into one of the common ways to horizontally scale stateful systems: sharding.

Sharding

Sharding is a technique used to horizontally partition a data-store into smaller, more manageable fragments called shards, which are distributed across multiple servers or nodes. This allows us to scale our data-stores not only in terms of storage but also in terms of compute, because the queries and operations on each node touch only a subset of the data, i.e. that node’s shard.
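To make the idea concrete, here is a minimal in-memory sketch, not a production design. The class name ShardedStore, the modulo_route helper, and the choice of four shards are all hypothetical and chosen only for illustration. Each Python dict stands in for a separate server or node.

```python
from typing import Callable, Dict, List, Optional


class ShardedStore:
    """Toy key-value store split across several shards (hypothetical example)."""

    def __init__(self, num_shards: int, route: Callable[[int, int], int]) -> None:
        # Each dict stands in for a separate server or node.
        self.shards: List[Dict[int, str]] = [{} for _ in range(num_shards)]
        # route(key, num_shards) decides which shard owns a key.
        self.route = route

    def put(self, key: int, value: str) -> None:
        self.shards[self.route(key, len(self.shards))][key] = value

    def get(self, key: int) -> Optional[str]:
        # Only the owning shard is consulted; the other shards are never touched.
        return self.shards[self.route(key, len(self.shards))].get(key)


def modulo_route(key: int, num_shards: int) -> int:
    """Placeholder placement rule (an assumption): key modulo shard count."""
    return key % num_shards


store = ShardedStore(num_shards=4, route=modulo_route)
for user_id in range(1000):
    store.put(user_id, f"user-{user_id}")

# Each shard ends up storing, and serving queries for, a quarter of the keys.
sizes = [len(shard) for shard in store.shards]
print(sizes)  # [250, 250, 250, 250]
```

Because the placement rule is passed in as a function, the store itself does not care how keys are assigned to shards; choosing that rule is the real design decision.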

Sharding Techniques

The choice of sharding approach depends on factors such as the nature of the data, access patterns, scalability requirements, and the specific characteristics of the system. Here are some common sharding techniques:

  1. Range-Based Sharding: Range-based sharding involves partitioning data based on a specific range of values within a chosen attribute. For example, data can be partitioned based on the range of customer IDs or timestamps. This approach allows for efficient querying of contiguous ranges of data but may lead to data skew if the distribution of values is uneven.
  2. Hash-Based Sharding: Hash-based sharding involves applying a hash function to a selected attribute to determine the shard assignment for each data item. A well-chosen hash function spreads data roughly uniformly across shards, which evens out the workload. This approach allows for easy scaling and load balancing, but it scatters adjacent values across shards, so queries over a range of values typically have to touch multiple shards.
  3. Composite Sharding: Composite sharding involves combining multiple sharding techniques to partition data. This approach is useful when a single sharding strategy may not be sufficient to handle the complexity or size of the data. For example, a composite sharding approach might involve range-based sharding based on a primary attribute and then using hash-based sharding within each range.
  4. Geo Sharding: In geo sharding, the data is divided into shards based on geographic boundaries, such as countries, states, cities, or specific spatial regions. Each shard is responsible for storing and…
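
The range-based, hash-based, and composite techniques above can each be sketched as a small routing function that maps a record to a shard. This is an illustrative sketch under stated assumptions: the function names, the boundary list, and the use of SHA-256 with a modulo are hypothetical choices, not something the techniques prescribe.

```python
import bisect
import hashlib
from typing import List, Tuple


def range_shard(customer_id: int, upper_bounds: List[int]) -> int:
    """Range-based: shard i holds IDs below upper_bounds[i] (sorted, exclusive).

    IDs at or above the last bound fall into one extra overflow shard.
    """
    return bisect.bisect_right(upper_bounds, customer_id)


def hash_shard(key: str, num_shards: int) -> int:
    """Hash-based: a stable hash of the key, reduced modulo the shard count.

    hashlib is used instead of the built-in hash() because string hashes
    from hash() are randomized per Python process by default.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def composite_shard(
    customer_id: int, upper_bounds: List[int], shards_per_range: int
) -> Tuple[int, int]:
    """Composite: pick a range first, then hash within that range."""
    return (
        range_shard(customer_id, upper_bounds),
        hash_shard(str(customer_id), shards_per_range),
    )


bounds = [1000, 2000]  # hypothetical boundaries: <1000, 1000-1999, >=2000
print(range_shard(999, bounds))   # 0
print(range_shard(1000, bounds))  # 1
print(range_shard(5000, bounds))  # 2
print(hash_shard("customer-42", 4))
print(composite_shard(1500, bounds, 4))
```

Contiguous IDs stay together under range_shard, which is what makes range scans cheap, while hash_shard deliberately breaks that adjacency to even out the load. One caveat with the plain modulo used here: changing num_shards reassigns most keys, so real systems often use schemes such as consistent hashing to limit data movement.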
Pratik Pandey - https://pratikpandey.substack.com

Senior Engineer with experience in designing and architecting large scale distributed systems.