Following my previous post, we are going start with the “hottest potato” - single-threaded vs multi-threaded argument.

This question in the context of Redis has been raised quite a few times before, sometimes sparkling pretty heated discussions. See, for example

and .I was blocked and riduclued for saying Redis should be multi-threaded. Both by the community and the maker for years.

— Kelly Sommers (@kellabyte) March 28, 2019

Ahem.https://t.co/Aom4CIYSaf

But we already knew this would be the result, didn’t we? This is how modern computers work. It’s not really a debate.

Eventually, Salvatore has decided to allow the offloading of I/O processing onto additional threads. But as I said before, only a single designated thread handles the main Redis dictionary.

It seems, however, that was not enough for the community. Two and half years ago,
a couple of talented folks from Canada decided to change the status quo and
created a multi-threaded Redis fork called *KeyDb*.
KeyDb allows multiple threads to access the Redis dictionary by protecting it with spinlocks.
See their introductionary post
about this.

In the previous post I mentioned the following arguments why Redis was built as single-threaded:

- It preserves low tail latency by avoiding contention on locks.
- Redis Cluster provides horizontal scale, which should be as good as vertical scale if not better: N single-core machines are equivalent to a single process spanning N cores.
- Pipelining gives you more throughput. A redis client can pipeline requests reaching ~1M qps, which is an order of magnitude higher than the regular traffic throughput.
- Room for the upside on a single machine is limited anyway.
- Single-threaded is simple, multi-threaded is complicated.

I think there are great benefits of having multi-threaded databases, and I will try to challenge these arguments. But before we continue forward, let’s agree on a few terms:

*A vertically scalable system* is a system that can scale *almost* linearly with its performance metrics
(aka requests per second) as a function of available CPUs on a single server.
*Horizontally scalable system* is a distributed system that can run on multiple nodes/processes
and scale *almost* linearly with a number of nodes.

I assert the following claims:

- A vertically scalable system is more cost-efficient than its equivalent horizontally scalable configuration until it reaches its physical limits on a single server.
- In order to reach the full potential of the modern hardware, and preserve the low latency property, a system should be designed as a shared-nothing architecture, i.e. avoid locking and contention, along with the original philosophy of Redis.

I will try to prove (1). I base (2) on the empirical evidence gathered in the research community and on lessons learned from other well-designed systems like ScyllaDb or Twitter’s Pelikan.

## Vertical scale vs Horizontal scale

Imagine you are in the business of selling sugar. You need to choose between renting a warehouse that can hold 10,000kg of sugar vs 10 warehouses that can hold 1000kg each. The price of smaller warehouses is precisely a tenth of the bigger one. It seems that there is no difference, right? However, if you choose to rent 10 smaller warehouses, you will see that after some time, some, but not all of them, will be nearly empty, and you need to send trucks to fill them up. Why? Because the randomness of nature dictates that you have almost zero chance that all warehouses are depleted at precisely the same rate. So you need to spend resources to fill some of the warehouses. Then you will spend resources to fill others and so on. Moreover, you need to staff those warehouses: you might need 2.5 people per place on average, but you will need to round it up and hire 3 folks in every place. There will be high-pressure days when your workers will work like crazy, while during other days, they will do nothing. Let’s do some math to understand how to model those inefficiencies.

Let’s assume that $X_1…X_n$ are independent random variables. Then the expected mean and variance of their sum is the sum of their means and variances:

$$ \mathbb{E} \biggl[ \sum_{i=1}^n X_i \biggr] = \sum_{i=1}^n {E[X_i]}$$ $$ Var \biggl[ \sum_{i=1}^n X_i \biggr] = \sum_{i=1}^n {Var[X_i]} $$

These formulas are basic rules in probability theory, see here, for example. Specifically, for independent random variables with the same mean $\mu$ and standard deviation $\sigma$, we get:

$$ \mathbb{E} \biggl[ \sum_{i=1}^n X_i \biggr] = \sum_{i=1}^n {E[X_i]} = \sum_{i=1}^n \mu = n* \mu $$

$$ Var \biggl[ \sum_{i=1}^n X_i \biggr] = \sum_{i=1}^n {Var[X_i]} = \sum_{i=1}^n \sigma^2 = n*\sigma^2 $$

The last equation can be rewritten as $ StdDev \biggl[\sum_{i=1}^n X_i \biggr] = \sqrt n * \sigma $.

These equations prove the so-called `square root staffing law`

in queueing theory, which
explains why provisioning a bigger system is more efficient than provisioning a few smaller ones
with equivalent capacity.

Indeed, when we provision a system that handles the load distributed as $(\mu, \sigma)$, we usually take an additional margin, say, twice the standard deviation, to cope with the intrinsic stochastic variability of that load. With $n$ warehouses, we can model their load as $n$ independent variables distributing as $(\mu, \sigma)$, therefore, in order to cope with $n * \mu$ effective load, we will need to staff $2 n \sigma$ additional resources. However, a single warehouse that handles the load $(n \mu, \sqrt n \sigma)$ needs only $2 \sqrt n \sigma$ resources to cover the same margin. The bigger warehouse is, the larger the difference between $\sqrt n \sigma$ and $n \sigma$.

There are quite a few articles on the internet and lots of academic research in this area. See this post, for example.

Obviously, the vertical scale is equivalent to renting a bigger warehouse, and the horizontal scale is equivalent to provisioning multiple smaller places.

Let’s switch back to the memory-store example. Suppose we provision 9 nodes that are expected to
host 12GB of data *on average* with the standard deviation - 2GB. We would take servers with
`12GB+2*2GB=16GB`

capacity, in total `144GB`

with `36GB`

margin. With a single server,
however, we would need `108 + sqrt(9) * 4 = 120GB`

. We just reduced the overall cost of the system
by 17%. And if 17% seems not much, we can compare a single server with arbitrary many `n`

servers
of $\frac 1 n$ capacity. With `n`

large enough, the standard deviation becomes the significant
factor compared to the average load. For example, 108 nodes that host 108GB overall, would
need to sustain load $L(\mu=1, \sigma=0.577)$, thus we would need to provision `108*(1 + 2*0.577) = 232GB`

which is 93% higher than the single server cost.

So far, I have talked about economy of scale and why pooling resources is more efficient than employing
multiple independent capacities. There are additional factors that increase the total cost of ownership
for multi-node system: I believe that any experienced devop would agree that managing a fleet of `N`

servers
is more complicated than managing a single server just because of moving parts: The chance that at least one of the servers will fail is approximately `N`

time bigger than for a single machine.
In addition, the horizontally scalable system might impose additional restrictions on how the system is used. Specifically, with Redis - Redis Cluster does not allow transactions or multi-key operations covering
multiple nodes, it lacks multi-database support, and it can not issue consistent management operations
like `flushdb`

, `save`

across the cluster.

The goal of this section is not to persuade you that vertical scale is strictly better than horizontal scale - obviously, the vertical scale is bounded by the physical limits of its host and can not always be chosen. However, when there is a choice - it can be the much simpler and most cost-efficient alternative to splitting your workloads across separate nodes.

## Shared-nothing architecture

`“Hardware on which modern workloads must run is remarkably different from the hardware on which current programming paradigms depend, and for which current software infrastructure is designed.”`

- From Scylla blog.

Share-nothing architecture is a methodology to build a system in such way that its state is partitioned among multiple processors, and each processor can execute its work independently from others. Some prominent examples of shared-nothing architecture: a) Map-Reduce is a distributed processing framework that processes huge amounts of data over multiple independent workers. b) Redis Cluster is comprised of independent Redis nodes, where each one of them can perform without relying on others. c) ScyllaDb is a Cassandra clone that partitions its server database over CPUs in such a way that each CPU thread consistently handles the same partition. See more info about their open-sourced Seastar framework d) Similarly, Envoy is a prominent proxy server that uses thread-local storage to minimize contention between multiple threads inside its process.

Shared-nothing architecture can be designed over multiple nodes like with (a) and (b) or within a single process like with (c) and (d).

In our case, I believe that memory store can be designed as a multi-threaded, single-process system that utilizes the underlying CPUs using shared-nothing architecture. In other words, every cpu thread will handle its dictionary partition shard. Other threads can not access directly the data-structures they do not own. If a thread needs to write or read from a dictionary managed by another thread, it achieves it by sending a message to the owner via a dedicated message bus. This architecture is not novel - it appears a lot in technical papers, and became mainstream thanks to ScyllaDb design.

Any mature database needs to perform operations equivalent to Redis `flushdb`

,
`save`

, `load`

etc. It also needs to perform resize or compaction operations periodically.
With the single-threaded architecture, it means that a single CPU is involved in processing all this data
which heavily reduces database resilience. This brings us to another significant benefit for shared-nothing architecture with the thread-per-core threading model, which is often neglected.
Modern servers maintain a bounded ratio of CPU vs memory.
Say, in AWS, `r5`

family has 1:8 ratio, `m5`

family has 1:4. Similarly, in GCP `n2`

family maintains
ratios between 1-8 GB per vcpu. Therefore, with the thread-per-core model, each thread handles
at most `K`

GB of workload, which means that a database stays resilient whether it’s 8GB or
860 GB on a single machine.

## Benchmarks

This section addresses arguments (3) and (4) from the beginning of the post, namely how much upside we can bring by employing shared-nothing architecture using modern cloud servers. For that, I will use a toy redis-like memory store called midi-redis. midi-redis is similar to rust tokio mini-redis - i.e. it’s built to demonstrate the capabilities of the underlying IO library by implementing a subset of redis/memcached commands.

Specifically, midi-redis supports `GET`

, `SET`

, `PING`

, and `DEBUG POPULATE`

commands.
It also has complimentary support for memcached `GET`

and `SET`

commands and pipeline mode for both protocols.

The underlying I/O library that powers `midi-redis`

is helio
that uses linux io-uring api underneath. `helio`

is specially designed for
shared-nothing architectures and it is the evolution of my previous library GAIA.

I performed benchmarking of `midi-redis`

on a variety of instances on the AWS cloud.
The point is not to evaluate `midi-redis`

but to show the true potential of the hardware vs what
Redis gives us on this hardware. Please, treat these numbers as directional only -
I run each load-test only once, so I believe there is some variance that could affect numbers in +-15% range. `redis-1`

bar in the graph represents Redis 6 with the `io-threads=1`

configuration, and `redis-8`

denotes
`io-threads=8`

configuration. Redis results were similar on all instance types,
therefore I used the `redis-1`

run on `c5.18xlarge`

as my relative baseline for all other runs.

I used read-only traffic to minimize the influence of memory allocations and
database resizes on the results. Also, I run “debug populate 10000000” command before each run
to fill a server under test with data. Finally, I run memtier_benchmark from 3 client machines in the same zone with the following configuration:
`--key-prefix=key: --key-pattern=P:P --ratio 0:1 -n 2000000`

. The number of threads and connections
for `memtier_benchmark`

were chosen to maximize the throughput of the server under test and were different
for each instance type.

My first graph shows throughput for regular traffic without pipelined requests:

You can see that midi-redis running most network-capable AWS instance c6gn.16xlarge has the throughput
that is >20 times higher than `redis-1`

and >10 times higher than `redis-8`

.

My next graph shows the throughput of instances with pipeline mode when `memtier_benchmark`

sends bursts
of 10 requests at once (`--pipeline 10`

):
Here, `c6gn.16xlarge`

has seven times more throughput reaching a staggering 7.4M qps. Interestingly,
`redis-8`

is a bit slower than `redis-1`

because Redis main thread becomes the bottleneck for pipelined traffic.
And `redis-8`

spends additional cpu for coordination with its io-threads. `midi-redis`

on the other hand,
splits its dictionary between all its threads, reduces their communication to a minimum,
and scales its performance much better.

I do not know if a factor of 20 or a factor of 7 sounds impressive to you, but please remember that Redis 6 is the product of a decade of development and optimizations. Even 5%, 10% of incremental improvement is significant here. By changing the foundation, we allow potential 2000% upside for the non-pipeline case. Moreover, with time we will benefit from additional tailwinds from hardware advancements - with better networking and more cpus we will see even higher rates.

The speed of a car is 140 km/h, the speed of a jet plane is 950 km/h and the speed of sound is 1235 km/h. If Redis is a fast car, then we could fly instead on a supersonic plane.