Kubernetes Redis controller for autoscaling a Redis cluster
This post outlines the design and development of Maestro, a Kubernetes controller for autoscaling a Redis cluster. Redis Cluster is a high-performance, distributed implementation of Redis that can scale up to 1000 nodes. It offers high availability, is able to survive network partitions through the replication of data, and can scale horizontally by increasing the number of Redis nodes in the cluster. Kubernetes, or k8s, is a container orchestration platform for deploying Linux containers in a cluster; it automates and simplifies container deployment by orchestrating networking and container lifecycle management.
2. Redis Cluster
Redis Cluster uses two different node types called masters and slaves. Redis is a key-value store; every key is hashed into one of a fixed set of 16384 hash slots, and these hash slots are assigned to master nodes. Slave nodes in a Redis cluster replicate all the data of a given master node, and a master node can have multiple slave nodes assigned to replicate it. When horizontally scaling up a Redis cluster by adding a new master node, it is necessary to re-assign hash slots in the cluster. In practice, the goal is to have the hash slots evenly distributed among the existing master nodes and the new master node. Similarly, when removing a master node from the cluster, the hash slots currently assigned to that node must be re-assigned to the remaining Redis cluster nodes. A Ruby script called `redis-trib.rb`, distributed with the Redis server source, can be used to re-shard the cluster when adding a new master node.
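To make the even distribution of hash slots concrete, here is a small illustrative sketch (not Cornucopia's actual code) that splits the 16384 slots into contiguous, near-even ranges, one per master node:

```python
# Illustrative sketch: divide Redis Cluster's 16384 hash slots into
# contiguous, near-even ranges across a given number of master nodes.
TOTAL_SLOTS = 16384

def slot_ranges(num_masters):
    """Return a list of (start, end) inclusive slot ranges, one per master."""
    base, extra = divmod(TOTAL_SLOTS, num_masters)
    ranges, start = [], 0
    for i in range(num_masters):
        size = base + (1 if i < extra else 0)  # spread the remainder
        ranges.append((start, start + size - 1))
        start += size
    return ranges

# Three masters cover all 16384 slots between them:
# slot_ranges(3) -> [(0, 5461), (5462, 10922), (10923, 16383)]
```

This is the kind of distribution `redis-trib.rb` produces when creating a fresh cluster; re-sharding after the fact moves slots between masters to approach the same balance.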
Redis cluster nodes need to maintain their state. It is necessary to keep a record of the cluster topology on Redis cluster nodes in case those nodes are restarted or fail for one reason or another. Luckily, Kubernetes has a dedicated resource that serves this purpose: StatefulSets, which allow for the ordered, graceful creation and termination of a set of pods with stable identities.
3. Cornucopia

Cornucopia is a library that acts as the controller for the Redis cluster managed by Maestro; as such, it is a dependency of Maestro. Cornucopia exposes the following four main commands:
1. Add Master
2. Add Slave
3. Remove Master
4. Remove Slave
Cornucopia performs automatic re-sharding when executing operations involving master nodes. When a new slave node is added to the cluster, it is automatically assigned to replicate the poorest master, defined as a master node with the fewest slave nodes replicating it.
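The poorest-master rule can be sketched in a few lines. This is a hypothetical illustration (the node names and data shape are invented, and Cornucopia itself is written against Akka, not Python):

```python
# Illustrative sketch of "poorest master" selection: choose the master with
# the fewest slaves currently replicating it.
def poorest_master(replication):
    """replication maps master node id -> list of slave node ids."""
    return min(replication, key=lambda m: len(replication[m]))

replication = {
    "master-a": ["slave-1", "slave-2"],
    "master-b": ["slave-3"],
    "master-c": [],  # no replicas yet, so this is the poorest master
}
poorest_master(replication)  # -> "master-c"
```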
When adding a new master node to the Redis cluster, Cornucopia calculates a re-shard table, which is used to redistribute the hash slots in the cluster. Hash slots are assigned to the newly added master node from the existing ones, so that all 16384 hash slots end up evenly distributed across the entire set of master nodes. For example, when adding a fourth master node to a cluster that already has three masters, we want to move 16384 / 4 = 4096 hash slots to the new master, taken evenly from the existing masters.
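The arithmetic behind the re-shard table might look like the following simplified sketch (this computes only per-source slot counts, not the actual slot ids, and is not Cornucopia's real implementation):

```python
# Assumed, simplified re-shard table for adding one new master: each existing
# master gives up an even share of the slots the new master should own.
TOTAL_SLOTS = 16384

def reshard_table_for_add(existing_masters):
    """Return {source_master: slot_count} of slots to move to the new master."""
    target = TOTAL_SLOTS // (len(existing_masters) + 1)  # new master's share
    base, extra = divmod(target, len(existing_masters))
    return {m: base + (1 if i < extra else 0)
            for i, m in enumerate(existing_masters)}

table = reshard_table_for_add(["m1", "m2", "m3"])
# 16384 // 4 = 4096 slots move in total, split across the three sources.
assert sum(table.values()) == 4096
```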
When adding a new master node, Cornucopia performs a sequence of steps behind the scenes to ensure that the cluster re-sharding proceeds correctly and seamlessly. First, the newly added node is joined to the Redis cluster; this results in an empty master node that is recognized by the cluster. Next, Cornucopia gets connections to each of the master nodes in the cluster; these connections are used to perform the slot migration in the later stages. Cornucopia then calculates the re-shard table required to redistribute the hash slots evenly across the new set of masters. Finally, this re-shard table is used to perform the necessary migration of hash slots and their keys to the new master node.
Removing a master is similar to adding one, with a few extra requirements. A Remove Master command submitted to Cornucopia must include the URI of a node in the current cluster; this is the node that will be removed. If the node at this URI is not currently a master node in the Redis cluster, Cornucopia converts it into one by performing a manual failover on the slave node that currently occupies that URI. The reason a specific URI must be chosen is that Maestro uses StatefulSets, and StatefulSets scale down in a specific order. More will be said about this later.

After the Remove Master command is submitted, Cornucopia performs a manual failover if necessary. Next, it gets connections to each of the master nodes in the cluster. It then calculates a re-shard table that redistributes the hash slots evenly across the remaining master nodes, excluding the master being removed: every hash slot currently assigned to the retiring master is assigned to one of the remaining masters. This re-shard table is used to perform the necessary migration of hash slots away from the retiring master to the remaining masters. After the slot migration completes, Cornucopia takes all the slave nodes that were replicating the retired master and reassigns them to replicate the poorest remaining masters. Finally, the retired master is forgotten from the Redis cluster.
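The removal-side re-shard table is the mirror image of the addition case. A simplified sketch (slot counts only, invented names, not Cornucopia's actual code):

```python
# Assumed, simplified re-shard table for removing a master: the retiring
# master's slots are spread evenly over the surviving masters.
def reshard_table_for_remove(retiring_slot_count, remaining_masters):
    """Return {destination_master: slot_count} of slots each survivor receives."""
    base, extra = divmod(retiring_slot_count, len(remaining_masters))
    return {m: base + (1 if i < extra else 0)
            for i, m in enumerate(remaining_masters)}

table = reshard_table_for_remove(4096, ["m1", "m2", "m3"])
# The 4096 slots held by the retiring master are split three ways,
# so each remaining master receives 1365 or 1366 slots.
```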
Adding and removing slave nodes is similar to adding and removing masters, except that no re-sharding is performed. Adding a slave node is the simplest operation: after the new node joins the cluster, it is initially an empty master, and Cornucopia then instructs it to replicate the poorest master (the master with the fewest slaves replicating it). Removing a slave node requires a URI to be provided in the command, just as when removing a master node. If the URI in the Remove Slave command does not belong to a current slave node in the Redis cluster (i.e., it is a master node), Cornucopia finds a slave node replicating the master at the given URI and performs a manual failover on it, so that the node at the provided URI becomes a slave node. The slave node is then forgotten from the Redis cluster.
4. Maestro

Maestro makes decisions on when to add or remove nodes from a Redis cluster being managed as a StatefulSet in a Kubernetes cluster. Maestro uses Cornucopia as a library dependency, sending it commands to add or remove Redis nodes. Maestro is configured to add master and slave nodes in alternating order, so when scaling the cluster up or down it adds or removes two nodes at a time: when scaling up, one master and one slave are added, and when scaling down, one slave and one master are removed. Maestro uses memory utilization calculated across the Redis cluster as the metric for deciding when to scale the Redis cluster up or down.
5. Memory-based autoscaling

The memory metrics that control how Maestro makes scaling decisions are configurable. Metrics that control scaling up are configured independently from metrics that control scaling down.
The memory-metric-based autoscaler in Maestro has two parts. The first part is a sampler that periodically takes samples from all of the Redis nodes; the sampling period is configurable, and the metric sampled is the average memory utilization of all Redis nodes. If the sampled memory utilization surpasses a configured scale-up threshold, a counter is incremented. If, on the other hand, it falls below a scale-down threshold, a second counter is incremented. Otherwise, if the sampled value falls between the two thresholds (in other words, within the desired range), both counters are reset; in this case the Redis cluster is in a steady state. The second part of the autoscaler periodically checks the values of these counters. If the first counter is above a configured amount, a scale-up of the cluster is initiated; if the second counter is above a configured amount, a scale-down is initiated. These two possibilities are mutually exclusive. When neither counter is above its configured value, nothing is done.
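The two-part counter logic above can be sketched as follows. This is a minimal illustration; the threshold and trigger values are invented placeholders, not Maestro's actual defaults:

```python
# Minimal sketch of the two-part memory autoscaler: a sampler that maintains
# two counters, and a checker that turns counter values into scaling actions.
class MemoryAutoscaler:
    def __init__(self, scale_up_at=0.8, scale_down_at=0.3, trigger_count=3):
        self.scale_up_at = scale_up_at      # illustrative thresholds
        self.scale_down_at = scale_down_at
        self.trigger_count = trigger_count
        self.up_counter = 0
        self.down_counter = 0

    def sample(self, avg_memory_utilization):
        """Part one: record a periodic sample of average memory utilization."""
        if avg_memory_utilization > self.scale_up_at:
            self.up_counter += 1
        elif avg_memory_utilization < self.scale_down_at:
            self.down_counter += 1
        else:
            # Steady state: sample is in the desired range, reset both counters.
            self.up_counter = self.down_counter = 0

    def decide(self):
        """Part two: periodically check counters; at most one action fires."""
        if self.up_counter >= self.trigger_count:
            self.up_counter = 0
            return "scale_up"
        if self.down_counter >= self.trigger_count:
            self.down_counter = 0
            return "scale_down"
        return None
```

Requiring several consecutive out-of-range samples before acting keeps a single transient spike from triggering a costly re-shard.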
6. Interesting challenges
6.1 Rate-limiting migrate slot workers
During development of Cornucopia, it was discovered that performing slot migrations of up to several thousand hash slots at a time over-burdened the Redis cluster nodes receiving the migrate-slot commands. If all migrate-slot commands were dispatched at the same time, asynchronously, the receiving nodes became unavailable, resulting in CLUSTERDOWN errors from the Redis cluster. To solve this problem, some form of rate limiting was required.
The rate-limiting of migrate-slot commands was solved by implementing the work-pulling pattern. Since Cornucopia is an Akka actor-based application, this was done with a master actor that controls a set of worker actors. The master actor creates a configurable number of worker actors on the fly when a migrate-slot job is requested. As part of the work-pulling pattern, each worker actor requests a migrate-slot job from the master actor when it is ready to process another command, and each worker must fully process a job before requesting a new one. This makes it possible to control the number of migrate-slot jobs that are in flight at any given time.
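The rate-limiting effect of work pulling can be shown with a thread-based sketch. Cornucopia's real implementation uses Akka actors; this Python version only illustrates the idea that a fixed pool of workers pulling one job at a time bounds concurrency:

```python
# Simplified work-pulling sketch: a fixed pool of workers pulls migrate-slot
# jobs from a shared queue, so at most `num_workers` jobs are in flight.
import queue
import threading

def migrate_slots(jobs, num_workers=2, migrate=lambda job: None):
    q = queue.Queue()
    for job in jobs:
        q.put(job)

    def worker():
        while True:
            try:
                job = q.get_nowait()  # "pull" the next job when ready
            except queue.Empty:
                return                # no more work: worker retires
            migrate(job)              # fully process before pulling another
            q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

processed = []
migrate_slots(range(5), num_workers=2, migrate=processed.append)
# All five jobs complete, but never more than two run concurrently.
```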
6.2 Scaling down the cluster using StatefulSets
As was mentioned earlier, during the removal of a node from the cluster, Cornucopia expects to be given the URI of the node that should be removed. This is necessary because StatefulSets scale down their pods in a specific order, so the node that will be terminated is predetermined. In hindsight, it might be better to use a Kubernetes resource such as a Deployment instead, although the current solution works well enough for now.