Sharding

Notes from System Design Interview Fundamentals

Food for thoughts

In an interview, before you decide to shard your application servers or databases, you need a reason to do so.
- For example, you can calculate how much total memory you’ll need in the next few years and check if a single database can handle that capacity. If not, you need to shard. (But note, if total storage is an issue, you can also move some hot data into cold storage instead of sharding)
- To improve the throughput of your database, you can calculate QPS and determine if a single database can handle the QPS, then decide if you need to shard. (or, for bandwidth issues you can use compression. Or, if QPS is an issue, you can potentially batch the query or reduce the calling client.)
- Similarly, if latency is an issue, you can potentially have a geo-shard to move the database closer to the user.
Sharding is just one of the many solutions to fix the problems above, and you shouldn’t immediately jump into sharding as the only solution. So consider sharding only if it makes the most sense.
It is important to understand the query pattern in an interview to think about the trade-off between hot spot and big scatter-gather in general. How to shard depends on the assumptions you make about the API calls.
there isn’t a perfect solution for sharding. Make an assumption that sets yourself up for success and does not over-complicate things.

What

When you decide you need to shard, it is not enough to just say you want to shard the database horizontally and expect everything to magically work. Ideally, if you followed the system design framework, you would have the database schema already.
It is essential to give at least 2 sharding scheme options and discuss the trade-off. The interviewer isn’t looking for your ability to come up with a single solution, but the ability to come up with options and trade-offs.

Why

Improve the Throughput of the System: You can have multiple shards that take write instead of a single shard. As long as the shards are share-nothing, the throughput is improved.
Improve the Capacity of the System: Assume each database can only store up to 1 TB of data. When you shard, you’re able to store more than 1 TB of data.
Improve the Latency of the System: you can create a local shard to take in local writes and lower latency. Or you can let each shard to have fewer data. When a database has less data.
Improve Perceived Availability: When there are more shards and a shard goes down, only the failed shard is impacted. Note that this doesn’t improve overall availability, however.

Types of sharding

Horizontal Sharding

When a database has too much data, you can shard by dividing up the rows into multiple different tables.
Having a shard to handle parts of the request will reduce the burden on a single global shard.

Vertical Sharding

(unlikely needed in an interview context)
When a database has too much data (more than the storage can handle), you can shard your database by migrating some table columns into new tables due to differences in query patterns and storage capacity.
The advantage is that this reduces the amount of data for a given table and the amount of storage needed if a column is sparse.
The disadvantage is you’ll need to make multiple updates for the same key to multiple tables and joining the tables on read is comparatively more expensive.

How to shard

Consistent hashing
(+) A hash-based sharding scheme’s advantage is that the system will distribute the data well across shards to minimize any hot spots.
(-) However, it doesn’t completely solve the problem if there’s an abnormally hot key.
(-) A hash-based sharding scheme’s disadvantage is there’s no relationship between the keys within the same shard.
- Sometimes when you make a query, you might be fetching for multiple rows, and you might need to get your query from multiple shards.
If you need to do a range query by the sharding key, this might not be the right solution.
However, once you are within a shard, you can add primary keys and indices as you normally would optimize for queries for that shard.

Range Key

applications may need to ensure a range of keys live on the same shard so the query won’t need to do a big scatter-gather.
The keys are sortable for the range key sharding scheme, and once sorted, the algorithm assigns each range to different shards.
The advantage of this approach is that a query becomes efficient when looking for data within the same shard.
- A common usage is to shard by timestamp so when the user queries for data with the latest timestamps, the query can get it from the same shard as opposed to scatter-gather from multiple shards.
The disadvantage of this approach is the likelihood of hot spot on writes and reads.
- For example, when you generate events with timestamps, all the event writes and reads will go into the same shard if you shard it by time buckets.

Outlier Keys / hot spot

Sometimes you may have a few keys that are outliers due to celebrity, big enterprise companies, or power users.

Option 1: Dedicated Shard

take the outlier keys out of the generic sharding problem space and have a dedicated shard.
advantage is that you can prevent outliers from the normal problem space, but
the disadvantage is the complexity of maintaining these one-off shards with configurations

Option 2: Shard Further

Another possibility is to divide up the shard further. Sometimes it’s not a problem when you just need the data from a sub-shard. However, sometimes you may still need to scatter-gather all sub-shards.

Shard Key and Primary / Index Key

When sharding, the sharding key can be separate from the primary and index key.
Sharding keys mainly helps you determine what to use to break the data down. Once you reach a particular shard, you can still have your primary and index key to optimize for the read and writes.

Geo-Shards

it’s possible to shard into multiple layers, and the common use case is geo-sharding, also known as zone.
You can create geo-shards, so user requests closer to the geo-shard will go there. Then, within each shard region, you can shard it further.
Sometimes in social graphs, you may be interested in your friends’ data, and if the sharding scheme scatters your friends across different geo-shards, then you need to scatter-gather.
- To optimize for this, use the heuristic that friends tend to be closer to each other by location, and you should group friends closer if possible to minimize the scatter-gather.

Trade-offs of Sharding

Scatter/Gather

When you shard, you need to think about retrieving the data you need based on your sharding scheme. Scatter-gather is not as performant as if you were to fetch it from a single shard.
For example, assume you have a driver location table with driver_id and location_id. If you shard it by driver_id, each shard will likely have drivers with that location_id. When you fetch drivers for a location_id, you need to scatter-gather all the shards.

Hot spots

Whenever you shard, you need to think about distributing your data across the shards and think about if there are real-life scenarios that could lead to hotspots.
- Examples: If you’re designing a chat application and each message has a timestamp, if you shard it by time range (10:00 AM to 11:00 AM go to the same shard), your shard that’s taking in the current time will be pretty hot.
- If you’re storing the driver locations and you’re sharding it by location ID, a location ID might be associated with a populated city and lead to a hotspot.
- If you are designing for a social media app database that stores feeds, celebrity posts will be frequently read and become hot.
When you’re figuring out hotspots, you need to consider both read and write queries. That should come from your APIs defined in the system design framework, and you can think about how those read and write APIs impact your design.

Machine Hops

Machine hops happen when you need to read from shard to shard for subsequent queries.
For example
- if you’re designing social graph storage, you have to query a friend of friends. If you shard it by name, all your friends may live in different shards and need a big scatter-gather. And to fetch for friends of friends, you need to read from other shards again if they don’t have a good locality.
- However, you can also shard it by their primary location because friends usually live close to each other. Then you may only need to gather from fewer machines because they might already be on the same shard.
- Of course, this could lead to hotspots for particular areas, but that’s a trade-off you need to mention and discuss with the interviewer.