Centralized Coordination II - Distributed Design Patterns

In my last article, we talked about how Centralized Coordination can help us reduce the cost of performing Linearizable operations in a huge cluster. The idea is to perform operations requiring Linearizability in a small cluster (3–5 nodes). However, just choosing a small cluster doesn't solve our problem completely. There are edge cases that we need to handle, which we will discuss now.

Problem Statement

As a client, I’m connected to my coordination cluster, which will give me linearizable results. My connection to the coordination cluster can be with either the leader node or a follower node.

Leader Isolation

There could be a scenario where the leader node, despite being up, is partitioned from the rest of the coordination cluster. Let’s assume that we have a 3-node cluster with Nodes N1, N2 and N3.

A client (C1) initially connects to the leader N1, and let’s assume that all nodes in the cluster have a consistent state for key K with value V. Now suppose N1 gets partitioned from the cluster (but not from the client).

Once the followers stop receiving heartbeats from the leader within the defined timeout, they trigger a leader election, assuming the leader is down. A new leader is elected; let’s assume it’s N2.

Another client (C2) connects to N2 and updates the value of K to V1. The update is accepted by both N2 and N3, and since a quorum has acknowledged it, the value is updated on N2 and N3 but not on N1.

Now if C1 reads the value of K through N1, it gets the stale result V, which breaks the linearizability guarantee!
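A toy in-memory model makes the violation concrete (all class and variable names below are illustrative, not from any real system):

```python
# Toy model of the leader-isolation scenario (names are illustrative).
class Node:
    def __init__(self, name):
        self.name = name
        self.store = {"K": "V"}  # all nodes start with a consistent state

cluster = {name: Node(name) for name in ("N1", "N2", "N3")}

# N1 is partitioned away from N2/N3 but is still reachable by client C1.
partitioned = {"N1"}

# N2 and N3 elect N2 as the new leader; C2 writes K=V1 through N2.
# The write replicates to the quorum {N2, N3} but never reaches N1.
for name, node in cluster.items():
    if name not in partitioned:
        node.store["K"] = "V1"

# C1, still talking to the old leader N1, reads a stale value.
stale_read = cluster["N1"].store["K"]
quorum_read = cluster["N2"].store["K"]
print(stale_read, quorum_read)  # N1 returns V while the quorum holds V1
```

The old leader happily answers from its local state, which is exactly what the solutions below must prevent.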

Follower Isolation

Now let’s take the scenario where a follower node, despite being up, is partitioned from the rest of the coordination cluster. We will use the same cluster as above, with all nodes holding the same initial value.

A client (C1) connects to the leader N1 and updates the value of K to V1. N1 propagates this change to N2 and N3. N2 receives the update and acknowledges it. Since N1 now has acknowledgements from a quorum of nodes (itself and N2), it commits the update to V1. Note that N3 does not have the updated value, since it was partitioned.

Now if another client (C2) connects to N3 to read the value of K, it gets the stale result V, which breaks the linearizability guarantee.

Solution

Leader Isolation

Approach 1 -

To ensure linearizability, the most straightforward approach is for the leader to confirm with a majority of the cluster that it is still the leader. It can do so by checking the leader and generation details against a majority of the nodes in the cluster; if it doesn’t get confirmation, it knows that it’s isolated and must not serve the read request from the client.

Pros: Linearizability is maintained and no stale data is served

Cons: The read latency increases because every read on the leader requires an additional roundtrip to a quorum of nodes, impacting read throughput.
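A minimal sketch of this quorum check, assuming a simple RPC that returns each peer's view of the current leader and term (the function and variable names are invented for illustration):

```python
# Sketch: before serving a read, the leader confirms its (leader_id, term)
# with a majority of the full cluster. All names here are illustrative.
def confirm_leadership(my_id, my_term, peers, rpc):
    """Return True iff a majority of the cluster (self included) still
    recognises (my_id, my_term) as the current leader."""
    votes = 1  # the leader counts itself
    for peer in peers:
        try:
            leader_id, term = rpc(peer)  # ask the peer who it thinks leads
        except TimeoutError:
            continue                     # unreachable peer: no vote
        if leader_id == my_id and term == my_term:
            votes += 1
    # Majority of the whole cluster, not just of the reachable peers.
    return votes > (len(peers) + 1) // 2

# Example: a healthy 3-node cluster vs. a fully partitioned leader N1.
rpc_ok = lambda peer: ("N1", 7)
def rpc_partitioned(peer):
    raise TimeoutError

print(confirm_leadership("N1", 7, ["N2", "N3"], rpc_ok))           # True
print(confirm_leadership("N1", 7, ["N2", "N3"], rpc_partitioned))  # False
```

The extra round trip per read is exactly the latency cost called out above.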

Approach 2 -

The problem we’re facing with leader isolation is that even after the leader gets partitioned, it continues to act as the leader. We can stop this behaviour by attaching a lease to the leader, which ensures -

  1. A leader can only hold the lease for its term. Once the lease expires, the leader steps down and becomes a follower/candidate. At this point, it will no longer serve client requests.
  2. No follower can become the leader until the leader’s lease expires. So the update of the value to V1 would never be accepted in the first place.

How will the followers know about the lease time of the leader?

A leader, after setting the lease interval and starting the lease timer, sends a message to all the followers informing them of its lease interval. Each follower, on receiving the message, starts its own countdown.

If a leader election is held, then each voter must include the remaining lease time it knows about as part of its vote. This tells the new leader how long it must wait before taking over, ensuring we never have more than one active leader in the system at any point.

Pros: Linearizability is maintained and no stale data is served, without inducing read latency.

Cons: Because the new leader cannot take up leadership until the old leader’s lease has expired, there can be a window during which the cluster cannot accept writes/updates. This window should be small and should only occur in failure scenarios (leader crashes/partitions), so it’s important not to set the lease time to a very high value.
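The lease mechanics can be sketched as follows. This is a simplified model (class and method names are invented); production systems such as Raft leader leases additionally budget for clock drift between nodes with a safety margin:

```python
import time

# Sketch of a leader lease based on a monotonic clock (illustrative only;
# real implementations also add a clock-drift safety margin).
class Lease:
    def __init__(self, interval_s):
        self.interval_s = interval_s
        self.expires_at = time.monotonic() + interval_s

    def renew(self):
        self.expires_at = time.monotonic() + self.interval_s

    def is_valid(self):
        return time.monotonic() < self.expires_at

    def remaining(self):
        return max(0.0, self.expires_at - time.monotonic())

# The old leader may only serve requests while its lease is valid.
old_lease = Lease(interval_s=0.05)
assert old_lease.is_valid()

# Each voter reports the lease time it still knows to be outstanding;
# the new leader must wait out the longest of these before taking over.
def wait_for_old_lease(remaining_from_votes):
    time.sleep(max(remaining_from_votes, default=0.0))

wait_for_old_lease([old_lease.remaining()])
print(old_lease.is_valid())  # False: the old lease has expired
```

The `wait_for_old_lease` pause is precisely the write-unavailability window described in the cons above, which is why the lease interval should stay short.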

Follower Isolation

Approach 1 -

To ensure that a follower can never serve stale results, the easiest way is to have clients always connect to the leader for both read and write requests. Since the leader has the latest state at any point in time, the client would never get stale results.

Pros: Linearizability is maintained and no stale data is served

Cons: The leader can be overwhelmed by a high number of clients, leading to crashes.

Approach 2 -

The other approach is for the follower to read the latest committed offset from the leader; if the follower’s state is not up to date with the leader’s, it fetches and applies the missing offsets so that it is consistent with the leader and can serve linearizable reads.

Pros: Linearizability is maintained and no stale data is served

Cons: Read latency can be higher if the follower is not caught up with the leader.
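A minimal sketch of this catch-up read, under the assumption of a simple offset-based log (the `Leader`/`Follower` classes and their methods are invented for illustration):

```python
# Sketch: a follower serves a linearizable read by first catching up to
# the leader's latest committed offset (all names are illustrative).
class Leader:
    def __init__(self):
        self.log = []  # committed entries as (key, value) pairs

    def append(self, key, value):
        self.log.append((key, value))

    def commit_offset(self):
        return len(self.log)

    def entries_from(self, offset):
        return self.log[offset:]

class Follower:
    def __init__(self, leader):
        self.leader = leader
        self.applied = 0  # offset up to which we have applied entries
        self.store = {}

    def linearizable_read(self, key):
        # 1. Ask the leader for its latest committed offset.
        target = self.leader.commit_offset()
        # 2. Fetch and apply any entries we are missing.
        if self.applied < target:
            for k, v in self.leader.entries_from(self.applied):
                self.store[k] = v
            self.applied = target
        # 3. Local state is now at least as fresh as the leader's state
        #    at the moment the read request arrived.
        return self.store.get(key)

leader = Leader()
follower = Follower(leader)
leader.append("K", "V")
leader.append("K", "V1")  # the follower has not replicated this yet
print(follower.linearizable_read("K"))  # V1, not the stale V
```

The catch-up in step 2 is where the extra read latency comes from when the follower has fallen behind.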

This brings us to the end of this article. We talked about different edge cases with clusters that provide Linearizability guarantees, and how we could handle them. Please post comments on any doubts you might have and will be happy to discuss them!

I found a great article on how Yugabyte handles leader isolation, so please check it out here: https://www.yugabyte.com/blog//low-latency-reads-in-geo-distributed-sql-with-raft-leader-leases/

Thank you for reading! I’ll be posting weekly content on distributed systems & patterns, so please like, share and subscribe to this newsletter for notifications of new posts.

Please comment on the post with your feedback; it will help me improve! :)

Until next time, Keep asking questions & Keep learning!


Pratik Pandey - https://pratikpandey.substack.com

Senior Engineer with experience in designing and architecting large scale distributed systems.