Consistent Hashing: Architecture Pattern

Imagine you are building a shared distributed system that needs to store and retrieve data across multiple servers. One of the main challenges is deciding which server should be responsible for storing a given piece of data. Traditional approaches like modulo-based hashing often lead to inefficient data distribution and expensive rebalancing when the number of servers changes.

Problem with Modulo-Based Hashing

Let’s illustrate the problems with modulo-based hashing using a simple example of a distributed system with four servers (S1, S2, S3, and S4) and a dataset of five items (A, B, C, D, and E). We’ll use a modulo-based hashing technique to assign data items to servers based on their hash values.

Step 1: Calculate Hash Values Assuming our hash function assigns the following hash values to the data items:

A -> hash(A) = 11 B -> hash(B) = 20 C -> hash(C) = 32 D -> hash(D) = 45 E -> hash(E) = 50

Step 2: Assign Data Items to Servers Now, we distribute the data items to servers using the modulo-based hashing technique. Let’s say we use the modulo operator (%) to determine the server for each data item:

A -> S1 (hash(A) % 4 = 11 % 4 = 3) 
B -> S2 (hash(B) % 4 = 20 % 4 = 0)
C -> S2 (hash(C) % 4 = 32 % 4 = 0)
D -> S3 (hash(D) % 4 = 45 % 4 = 1)
E -> S4 (hash(E) % 4 = 50 % 4 = 2)

Data Distribution:

S1: A S2: B,C S3: D S4: E

Problem 1: Inefficient Data Redistribution

Now, let’s assume we add a new server, S5, to the system. According to the modulo-based hashing, the data distribution will change as follows:

A -> S1 (hash(A) % 5 = 11 % 5 = 1) 
B -> S2 (hash(B) % 5 = 20 % 5 = 0)
C -> S4 (hash(C) % 5 = 32 % 5 = 2)
D -> S2 (hash(D) % 5 = 45 % 5 = 0)
E -> S2 (hash(E) % 5 = 50 % 5 = 0)

Updated Data Distribution:

S1: A S2: B,D,E S4: C S3: S5:

As we can see, there is an inefficient data distribution. Ideally, we’d not have wanted data movement from servers that are not overloaded(anything except S2). However, we see that the new server wasn’t utilized at all and data was also moved away from S2. This process becomes even more problematic as the system scales and more servers are added, leading to inefficiency and resource waste.

Pratik Pandey - https://pratikpandey.substack.com

Senior Engineer with experience in designing and architecting large scale distributed systems.