Recently we read an interesting post on the Kubernetes Blog about a connection reset issue we had been seeing in our clusters for a long time.
What we observed is exactly what is described in the post: when a cluster grows in size and enough pods connect to each other through Services, applications sometimes experience connection reset errors. This results in hanging connections and can cause 5xx HTTP errors from applications or, even worse, split-brain scenarios if your service runs distributed, consensus-based software.
When we set out to test the solutions proposed in the GitHub issue, we found the problem was not so easy to replicate in a testing environment. We'd like to share how we replicated it and tested the fix in a safe, low-traffic environment, so that we could confidently deploy the chosen fix to production.
Disclaimer
Running the commands in this post will create issues or bring down your Kubernetes cluster. Do not try this in a production environment! Run this procedure only in a safe, disposable environment.
Background
If you encounter this behavior you can fix it with the solution proposed in the blog post above, but if you want to test it in a demo cluster the story is different.
Reproducing the issue in a small cluster can be quite complex because this behavior happens when your cluster is under heavy load.
This specific issue happens when the `conntrack` hash table runs out of capacity.
What is [conntrack](http://conntrack-tools.netfilter.org/manual.html)? It is a utility that interacts with the kernel's connection tracking used by `iptables`, keeping track of firewall connections and letting the user watch or manipulate connection state changes.
Some of our kernel default configuration values for `conntrack` (`/proc/sys/net/netfilter/nf_conntrack_*`) are:

```
net.netfilter.nf_conntrack_buckets = 147456
net.netfilter.nf_conntrack_max = 589824
net.netfilter.nf_conntrack_tcp_be_liberal = 0
```
The numbers above can vary depending on your kernel version or Linux distribution in use.
The aforementioned Kubernetes blog post proposes some tests to verify whether you are facing the issue; the first one is a simple app that continuously performs network requests using cURL.
The simple app runs cURL in a loop while you read its logs looking for the message `curl: (56) Recv failure: Connection reset by peer`; the GitHub README suggests using Stackdriver to check the logs, but if your cluster does not run on GKE and Stackdriver is not your monitoring tool, hunting for these log messages can be quite tedious.
The same GitHub issue #74839 describes another way to check for the connection reset, using a slightly different setup and custom-crafted software: one application sends traffic from one node to another through a Service (so that `iptables` is involved), and the receiving application forges a TCP packet with an out-of-window response to reproduce the connection reset issue.
We preferred to use this last test as a canary; we'll refer to it as `boom-server`, since that is how it is named in the `Deployment` descriptor. If the `boom-server` pod dies with a `CrashLoopBackOff` error, we know we are experiencing the connection reset.
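A simple way to keep an eye on the canary is to watch its pod status; a minimal sketch, assuming the pod carries an `app=boom-server` label (adjust the selector to whatever labels your Deployment actually uses):

```bash
# Watch the canary pod: a CrashLoopBackOff status signals the connection reset.
# The app=boom-server label is an assumption, adapt it to your Deployment.
kubectl get pods -l app=boom-server -w
```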
We also need to saturate the conntrack table in our test cluster, so we will use the simple app to grow the number of entries in the conntrack table until it fills up.
With both tests in place we will trigger the connection reset between the services, then apply a patch and verify whether it fixes the problem.
Prepare for the experiment
To prepare this demonstration you have to run the `boom-server` and the simple app in your test cluster; to do that you can follow the instructions in the corresponding repositories.
You also need the `conntrack` package on your nodes to control the `conntrack` configuration easily; it may already be installed on your nodes, but if it is not, install it with:

```
apt-get install conntrack   # debian/ubuntu
yum install conntrack       # centos/redhat
```
In our test environment we scaled the simple app deployment gradually from 0 to 10, 20, and 50 pods without experiencing any issues, and we saw the `boom-server` working as expected, meaning its pod stayed in the `Running` state.
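For reference, the gradual scale-up is nothing more than a series of `kubectl scale` commands, assuming the Deployment is named `simple-app` as in the rest of this post:

```bash
# Scale the simple app step by step; the boom-server should stay in Running state throughout.
kubectl scale deployment simple-app --replicas=10
kubectl scale deployment simple-app --replicas=20
kubectl scale deployment simple-app --replicas=50
```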
Before starting the simulation we need to verify that everything is working as expected; start by scaling the simple app deployment down to 0 replicas, which reduces the entropy in the simulation scenario:

```
kubectl scale deployment simple-app --replicas=0
```
Check the `conntrack` configuration on your nodes:

```
sysctl net.netfilter.nf_conntrack_tcp_be_liberal
```

The output should be `0`; this means that conntrack marks packets as `INVALID` when it is not able to track the connection between the originating IP and the IP in the response. This is the default behaviour.
Run the command `conntrack -L` on all the nodes of your test cluster to list the flow entries in the conntrack table; the command output ends with a message like “flow entries have been shown” together with a value representing the number of entries. In our demo cluster the values vary from 200 to 1100.
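If you only need the number of tracked connections rather than the full listing, the kernel exposes a live counter; a handy shortcut, not part of the original test:

```bash
# Current number of entries in the conntrack table.
sysctl net.netfilter.nf_conntrack_count

# Or ask conntrack itself for the counter.
conntrack -C
```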
Trigger the issue
Now we can reproduce the issue in our testing environment.
The number of buckets in the conntrack hash table and the maximum number of tracked connections are correlated by default; according to the kernel documentation the relation is `nf_conntrack_max = nf_conntrack_buckets x 4`.
We chose to stick with this default, and throughout the post we will change the maximum number of entries in the `conntrack` table according to this rule.
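As a sanity check, the defaults shown earlier already respect this ratio (589824 = 147456 x 4); you can read both values on a node with:

```bash
# Hash table buckets and maximum tracked connections; by default max is 4x the buckets.
sysctl -n net.netfilter.nf_conntrack_buckets
sysctl -n net.netfilter.nf_conntrack_max
```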
Saturate conntrack table
With this test we will set the maximum number of `conntrack` entries to 1200; since we stick with the default ratio, to track at most 1200 connections we set the hash table to 300 buckets:

```
sysctl -w net.netfilter.nf_conntrack_buckets=300
sysctl -w net.netfilter.nf_conntrack_max=1200
```
Now raise the number of simple app pods to a sufficiently high value; in our case 50 replicas is the right number to generate a decent amount of TCP connections in the cluster while still leaving some capacity on the nodes.
With 50 replicas of the simple app deployed, our `conntrack` entries are a bit less than 1200.
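The scale-up itself is a single command, and you can follow the table filling up from a node while the replicas come up; a small sketch, assuming the same `simple-app` Deployment (the `watch` interval is arbitrary):

```bash
# Bring the simple app to 50 replicas to generate traffic through the Service.
kubectl scale deployment simple-app --replicas=50

# On a node, watch the conntrack entries climb towards nf_conntrack_max (1200 here).
watch -n 2 sysctl net.netfilter.nf_conntrack_count
```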
On the nodes, with the command `conntrack -L` you will see the total flow entries in the `conntrack` table grow up to 1200 (or up to whatever number you set) and then stop.
This means we have saturated the `conntrack` table, and our nodes are no longer able to keep track of TCP connections.
You will also see some `CrashLoopBackOff` errors for the `boom-server` pod, and you will find the `Connection reset by peer` messages in the simple app logs by running:

```
kubectl logs -l app=client | grep "reset by peer"
"curl: (56) Recv failure: Connection reset by peer"
```
Deplete cluster networking and fix the issue
Now we want to shrink the `conntrack` hash table to trigger the out-of-capacity condition that causes the connection reset errors; we lower `nf_conntrack_max` to 600 and `nf_conntrack_buckets` to 150 by issuing the following commands on our nodes:

```
sysctl -w net.netfilter.nf_conntrack_buckets=150
sysctl -w net.netfilter.nf_conntrack_max=600
```
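To confirm the table is now pinned at its new, lower ceiling, you can compare the live counter against the maximum on each node; a quick check, not strictly required:

```bash
# With the limits lowered, the counter should sit at (or very close to) the maximum.
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
```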
At this point `iptables` on the nodes is no longer able to keep the state of connections: `kubectl` commands return errors connecting to the Kubernetes control plane, almost all pods go into `CrashLoopBackOff`, or the applications simply stop responding.
The `boom-server` pod is also in the `CrashLoopBackOff` error state.
Fixing the issue
At this point we can try to solve the issue using the magic flag proposed in the blog post, setting `conntrack` to the liberal mode.
On all nodes run:

```
sysctl -w net.netfilter.nf_conntrack_tcp_be_liberal=1
```

This instructs `conntrack` not to mark as `INVALID` the packets it cannot track; you will now see that everything works smoothly again.
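Keep in mind that `sysctl -w` changes do not survive a reboot; if you decide to keep this fix, a common way to persist it is a drop-in file under `/etc/sysctl.d/` (a sketch, the file name is arbitrary and the commands must run as root):

```bash
# Persist the liberal flag across reboots.
echo "net.netfilter.nf_conntrack_tcp_be_liberal = 1" > /etc/sysctl.d/90-conntrack-liberal.conf

# Reload the settings from the configuration files.
sysctl --system
```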
Conclusion
We decided that setting `conntrack` to be liberal works better for us, as it lets packets that would otherwise be marked invalid reach their destination, speeding up network transfers and reducing the per-packet processing overhead.
We saw that the same solution has been implemented in the kubelet systemd unit for AKS, and we are happy to be in good company.
The other solution proposed on the Kubernetes blog is to instruct `iptables` to drop the packets marked as `INVALID` by `conntrack`; this is the solution that will probably land in future versions of Kubernetes, by configuring `kube-proxy` to inject an additional rule into `iptables`.
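Conceptually that rule is a one-liner; a hedged sketch of what it looks like if applied manually on a node (kube-proxy would place it in its own chain, so this exact form is our approximation, not the upstream patch):

```bash
# Drop forwarded packets that conntrack could not associate with a tracked connection.
iptables -I FORWARD -m conntrack --ctstate INVALID -j DROP
```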
Another viable option is to enlarge the `conntrack` hash table by setting higher values for `net.netfilter.nf_conntrack_buckets` and `net.netfilter.nf_conntrack_max`: we did not test this solution, as we suspected that growing the number of entries could hurt kernel performance and would mean higher memory usage for the networking stack.
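For completeness, raising the limits would again be a pair of sysctl writes keeping the 4x ratio; the numbers below are purely illustrative, since we did not test this path:

```bash
# Example only: double our defaults while preserving max = buckets x 4.
sysctl -w net.netfilter.nf_conntrack_buckets=294912
sysctl -w net.netfilter.nf_conntrack_max=1179648
```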
We look forward to seeing how the discussion progresses and whether the connection reset issue can be addressed in a better way, perhaps by switching to IPVS.
THANKS!
To Paolo Vitali for finding the solution and reviewing the whole work on testing and patching our clusters
To Francesco Gualazzi for finding the boom-server, for continuously pushing for more tests, and for reviewing this article, giving it a more readable structure and adding many useful and valuable remarks