Consistent hashing

chapter 5 of System Design Interview – An insider’s guide
System Design Interview Fundamentals

Before you jump into consistent hashing…

Consistent hashing is NOT the silver bullet to all problems. It has trade-offs based on the data distribution, queries, and hotkeys.

Why?

assuming a load balancer where it uses modulo operation to distribute the requests to X servers.
When one server is down, and we have to redistribute the request to the remaining X - 1 servers, almost all the requests would have mapped to a different server compared to original setup.

Consistent hashing is a special kind of hashing such that when a hash table is re-sized and consistent hashing is used, only k/n keys need to be remapped on average, where k is the number of keys, and n is the number of slots. In contrast, in most traditional hash tables, a change in the number of array slots causes nearly all keys to be remapped - from wiki

So it minimize data migration due to shard redistribution when servers are added, removed, and down.

How? (Basics)

cache keys are hashed onto the hash “ring” (ring-like hash space), instead of the modulo operation
Each server is mapped to certain point on the ring as well.
To determine which server a key is stored on, we go clockwise from the key position on the ring until a server is found.
As a result, adding a new server or delete a new server should only change partial portion

Basic is alright but …

it is impossible to keep the same size of partitions on the ring for all servers considering a server can be added or removed. A partition is the hash space between adjacent servers. It is possible that the size of the partitions on the ring assigned to each server is very small or fairly large.
it is possible to have a non-uniform key distribution on the ring.

Solution

basically separate the ring to granular parts through introducing virtual nodes
Each virtual node refers to one real node, and each server is represented by multiple virtual nodes on the ring.
As the number of virtual nodes increases, the distribution of keys becomes more balanced. This is because the standard deviation gets smaller with more virtual nodes, leading to balanced data distribution.
More spaces are needed for the virtual nodes. This is a tradeoff, and we can tune the number of virtual nodes to fit our system requirements.

What

Distribute data across multiple servers evenly.
Minimize data movement when nodes are added or removed.
Heterogeneity: the number of virtual nodes for a server is proportional to the server capacity. For example, servers with higher capacity are assigned with more virtual nodes.