

Chasing a Kubernetes connection reset issue

manuel ranieri
paolo vitali
francesco gualazzi

When working with open source, help from the community can come in many ways: a post on the Kubernetes blog gave us the solution to a problem we had not been able to address.


Recently we read an interesting post on the Kubernetes Blog about a connection reset issue we had seen in our clusters for a long time. What we observed is exactly what the post describes: when a cluster grows in size and enough pods connect to each other via Services, applications sometimes experience a connection reset error. This results in hanging connections and can cause 5xx HTTP errors from applications or, even worse, split-brain scenarios if your service runs distributed, consensus-based software.

When we set out to test the solutions proposed in the GitHub issue, we found it was not so easy to replicate the problem in a testing environment. We'd like to share how we replicated it and tested the fix in a safe, low-traffic environment, so that we could confidently deploy the chosen fix to our production environment.

Disclaimer

The commands in this post will create issues or bring down your Kubernetes cluster: do not run them in a production environment! Try this procedure only in a safe environment.

Background

If you encounter this behavior you can solve it with the fix proposed in the blog post above, but if you want to test it in a demo cluster the story is different. Reproducing the issue in a small cluster can be quite complex because this behavior happens when your cluster is under heavy load. This specific issue occurs when the conntrack hash table runs out of capacity. What is [conntrack](http://conntrack-tools.netfilter.org/manual.html)? It's a userspace utility that interacts with the kernel's connection tracking system used by iptables, keeping track of firewall connections and enabling the user to inspect or manipulate connection state changes. Some of our kernel default configuration values for conntrack (/proc/sys/net/netfilter/nf_conntrack_*) are:

net.netfilter.nf_conntrack_buckets = 147456
net.netfilter.nf_conntrack_max = 589824
net.netfilter.nf_conntrack_tcp_be_liberal = 0

The numbers above can vary depending on your kernel version or the Linux distribution in use. The aforementioned Kubernetes blog post proposes some tests to verify whether you are facing the issue. The first is a simple app that continuously performs network requests using cURL; you then read its logs searching for the message curl: (56) Recv failure: Connection reset by peer. The GitHub README suggests using Stackdriver to check the logs, but if your cluster is not running on GKE and Stackdriver is not your monitoring tool, finding these log messages can be quite tedious. On the same GitHub issue #74839 you can find another way to detect the connection reset, using a slightly different setup and custom-crafted software: one application exchanges traffic with an application on another node, passing through a Service (so that iptables is involved), and the second application forges a TCP packet with an out-of-window response to simulate the connection reset issue.
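
As a rough illustration, this is the kind of loop the cURL-based test runs inside its client pod. The service name echo-server and the one-second interval are assumptions for the sake of the sketch, not the names used in the original manifests:

# Minimal sketch of the cURL-based check: hit the Service in a loop and
# report whenever the "Connection reset by peer" error shows up.
# "echo-server" is an assumed Service name, not the original one.
while true; do
  curl -sS -o /dev/null http://echo-server 2>&1 \
    | grep -q "Connection reset by peer" \
    && echo "$(date -Iseconds) connection reset detected"
  sleep 1
done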

We preferred to use this last test as a canary; we'll refer to it as boom-server, as this is how it's named in the Deployment descriptor. If the boom-server pod dies with a CrashLoopBackOff error, we know we are experiencing the connection reset. We also need to saturate the conntrack table in our test cluster, so we will use the simple app to increase the number of entries until the table is full. With both tests in place we will trigger the connection reset between the services, apply a patch, and verify whether it fixes the problem.
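
To keep an eye on the canary while experimenting, you can simply watch the pod status; the label selector app=boom-server below is an assumption about how the Deployment is labelled:

# Watch the canary: a growing restart count or a CrashLoopBackOff status
# means the connection reset is happening.
# The label "app=boom-server" is an assumption about the Deployment labels.
kubectl get pods -l app=boom-server -w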

Prepare for the experiment

To prepare this demonstration you have to run the boom-server and the simple app in your test cluster; you can follow the instructions in the corresponding repositories to do so. You also need the conntrack package on your nodes to be able to inspect and control conntrack easily; it should already be shipped with your distribution, but in case it is not, just install it with

apt-get install conntrack #[ debian/ubuntu ]
yum install conntrack #[ centos/redhat ]

In our test environment we scaled the simple app deployment gradually from 0 to 10, 20, and 50 pods without experiencing any issues, and we saw the boom-server working as expected, meaning its pod stayed in the Running state. Before starting the simulation we need to verify that everything is working as expected; start by scaling the simple app deployment down to 0 replicas, to reduce the entropy in the simulation scenario.

kubectl scale deployment simple-app --replicas=0

Check the conntrack configuration on your nodes:

sysctl net.netfilter.nf_conntrack_tcp_be_liberal

The output should be 0; this means that conntrack will mark packets as INVALID when it is not able to keep track of the connection between the originating IP and the IP in the response, which is the default behaviour. Run the command conntrack -L on all the nodes of your test cluster to list the flow entries in the conntrack table; at the end of the output you will see a message like "flow entries have been shown" together with the number of entries. In our demo cluster the values vary from 200 to 1100.
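
If you only need the number of tracked connections, conntrack can print the counter directly; the following one-liners are just a convenience and read from the same table listed by conntrack -L:

# Current number of tracked connections (reads net.netfilter.nf_conntrack_count)
conntrack -C

# Equivalent: count the entries listed by conntrack -L
conntrack -L 2>/dev/null | wc -l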

Trigger the issue

Now we can work on recreating the issue in our testing environment. The number of buckets in the conntrack hash table and the maximum number of tracked connections are correlated by default; according to the kernel documentation the relation is nf_conntrack_max = nf_conntrack_buckets x 4. We chose to stick with this default, and throughout the post we will change the maximum number of entries in the conntrack table according to this rule.
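
As a quick sanity check, you can read the current bucket count on a node and derive the corresponding maximum; this is just arithmetic on the sysctl values shown earlier:

# Read the number of hash buckets and derive the default maximum
# number of tracked connections (buckets x 4).
buckets=$(sysctl -n net.netfilter.nf_conntrack_buckets)
echo "nf_conntrack_max should be $((buckets * 4))"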

Saturate conntrack table

With this test we will set the maximum number of conntrack entries to 1200; as we stick with the default ratio, to track at most 1200 connections we set the hash table to 300 buckets:

sysctl -w net.netfilter.nf_conntrack_buckets=300
sysctl -w net.netfilter.nf_conntrack_max=1200

Now raise the number of simple app pods to a sufficiently high value; in our case 50 replicas is enough to generate a decent amount of TCP connections in the cluster while still leaving some capacity on the nodes. With 50 replicas of the simple app deployed, our conntrack entries are a bit less than 1200.
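
Scaling up uses the same command we used earlier to scale down:

kubectl scale deployment simple-app --replicas=50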

On the nodes, with the command conntrack -L you will see the total number of flow entries in the conntrack table grow up to 1200 (or up to the number you set) and then stop. This means we have saturated the conntrack table, and our nodes are no longer able to keep track of TCP connections. You will also see some CrashLoopBackOff errors for the boom-server pod, and the Connection reset by peer messages in the simple app logs by running

kubectl logs -l app=client  | grep "reset by peer"
"curl: (56) Recv failure: Connection reset by peer"

Deplete cluster networking and fix the issue

Now we reduce the conntrack hash table size to trigger the out-of-capacity condition that causes the connection reset errors; we lower nf_conntrack_max to 600 and nf_conntrack_buckets to 150 by issuing the commands:

sysctl -w net.netfilter.nf_conntrack_buckets=150
sysctl -w net.netfilter.nf_conntrack_max=600

on our nodes. At this point conntrack on the nodes is not able to keep the state of connections, the kubectl command returns errors connecting to the Kubernetes control plane, and almost all pods go into CrashLoopBackOff or stop responding. The boom-server pod is also in CrashLoopBackOff.
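
To see from the node's perspective how badly the tracking is failing, the conntrack statistics are useful; the counters to look at are insert_failed and drop, which grow when the table has no room for new connections:

# Per-CPU conntrack error counters: insert_failed and drop grow
# when the table cannot accept new connections.
conntrack -S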

Fixing the issue

At this point we can try to solve the issue using the magic flag proposed by the blog post: setting conntrack to be liberal. On all nodes run:

sysctl -w net.netfilter.nf_conntrack_tcp_be_liberal=1

This instructs conntrack not to mark as INVALID the packets it cannot track; now you will see that everything works smoothly again.
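
Note that sysctl -w only changes the running kernel; if you want the setting to survive a reboot you have to persist it, for example in a file under /etc/sysctl.d/ (the file name below is just an example):

# Persist the setting across reboots (the file name is arbitrary)
echo "net.netfilter.nf_conntrack_tcp_be_liberal = 1" > /etc/sysctl.d/90-conntrack-liberal.conf
sysctl --system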

Conclusion

We decided that setting conntrack to be liberal works better for us, as it allows packets to be delivered to their destination even when they would otherwise be marked invalid, speeding up network transfers and reducing the per-packet processing time. We saw that the same solution has been implemented in the kubelet systemd unit for AKS, and we are happy to be in good company. The other solution proposed on the Kubernetes blog is to instruct iptables to drop the packets marked as INVALID by conntrack; this is the fix that will probably land in future versions of Kubernetes, by configuring kube-proxy to inject an additional rule in iptables. Another viable option is to configure a larger conntrack hash table by raising net.netfilter.nf_conntrack_buckets and net.netfilter.nf_conntrack_max: we did not test this solution, but we suspect that growing the number of entries could be detrimental to kernel performance, as it would mean higher memory usage for the networking stack.
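
For reference, the iptables rule that implements the "drop INVALID packets" alternative looks like the sketch below; the exact chain used by kube-proxy may differ, so take this as an illustration of the idea rather than the final upstream implementation:

# Drop forwarded packets that conntrack marked as INVALID, so they never
# reach the pod and cannot trigger a connection reset. Applied on every node.
iptables -I FORWARD -m conntrack --ctstate INVALID -j DROP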

We look forward to seeing how the discussion progresses and whether the connection reset issue can be addressed in a better way, maybe by switching to IPVS.

THANKS!

To Paolo Vitali for finding the solution and reviewing the whole work on testing and patching our clusters

To Francesco Gualazzi for finding the boom-server, for the continuous requests for more tests, and for the review of this article, giving it a more readable structure and adding a lot of useful and valuable remarks

