May 28, 2019 - 8 min read ⏱️
When working with open source, help from the community can come in many forms: a post on the Kubernetes blog gave us the solution to a problem we had not been able to address.
The post discusses a connection reset issue that we had been seeing in our clusters for a long time.
What we observed is exactly what the post describes: when a cluster grows in size and enough pods connect to each other via services, applications sometimes experience a connection reset error. This results in hanging connections and can cause 5xx HTTP errors from applications or, even worse, split-brain scenarios if your service runs distributed, consensus-based software.
When we tried to test the solutions proposed in the GitHub issue, we found it was not easy to replicate the problem in a testing environment. We'd like to share how we reproduced it and tested the fix in a safe, low-traffic environment, so that we could confidently deploy the chosen fix to production.
Warning: the commands in this post will cause problems in, or bring down, your Kubernetes cluster. Do not run them in production; try this procedure only in a safe environment.
If you encounter this behavior you can fix it with the solution proposed in the post above, but if you want to test it in a demo cluster the story is different.
Reproducing the issue in a small cluster can be quite complex because this behavior happens when your cluster is under heavy load.
This specific issue happens when the conntrack hash table runs out of capacity.
What is [conntrack](http://conntrack-tools.netfilter.org/manual.html)? It's a utility that interacts with the kernel's packet inspection performed by iptables, keeping track of firewall connections and letting the user watch or manipulate connection state changes.
Some of our kernels' default configuration values for conntrack:

```
net.netfilter.nf_conntrack_buckets = 147456
net.netfilter.nf_conntrack_max = 589824
net.netfilter.nf_conntrack_tcp_be_liberal = 0
```
The numbers above can vary depending on your kernel version or Linux distribution in use.
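To see what your own nodes use, the values can be read back with sysctl. A minimal sketch (the keys may be absent if the nf_conntrack module is not loaded, hence the fallback):

```shell
# Read the current conntrack settings; fall back to "n/a" when a key
# is unavailable (e.g. the nf_conntrack module is not loaded).
settings=$(
  for key in net.netfilter.nf_conntrack_buckets \
             net.netfilter.nf_conntrack_max \
             net.netfilter.nf_conntrack_tcp_be_liberal; do
    printf '%s = %s\n' "$key" "$(sysctl -n "$key" 2>/dev/null || echo n/a)"
  done
)
echo "$settings"
```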
The aforementioned Kubernetes blog post proposes some tests to verify whether you are facing the issue; the first is a simple app that continuously performs network requests using cURL.
The simple app runs cURL in a loop while you read its logs looking for the message curl: (56) Recv failure: Connection reset by peer. The GitHub README suggests using Stackdriver to check the logs, but if your cluster is not running on GKE and Stackdriver is not your monitoring tool, finding these log messages can be quite tedious.
In the same GitHub issue #74839 you can find another way to trigger the connection reset, using a slightly different setup and custom-crafted software: one application exchanges traffic with another application on a different node, passing through a service (so that iptables is involved), and the second application forges a TCP packet with an out-of-window response to simulate the connection reset issue.
We preferred to use this last test as a canary; we'll refer to it as boom-server, as that is how it's named in the Deployment descriptor. If the boom-server pod dies with a CrashLoopBackOff error, we know we are experiencing the connection reset.
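To watch the canary without tailing logs, one option is to poll the pod status directly. A sketch, assuming the pod carries an app=boom-server label (the actual label depends on the Deployment descriptor):

```shell
# Print each boom-server pod with its waiting reason (e.g. CrashLoopBackOff).
# The label selector is an assumption; adjust it to match your Deployment.
kubectl get pods -l app=boom-server \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}'
```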
We also need to saturate the conntrack table in our test cluster, so we will use the simple app to grow the number of entries until the table is full.
With both tests in place we will trigger the connection reset between the services, then apply a patch and verify that it fixes the problem.
To prepare this demonstration, run the boom-server and the simple app in your test cluster by following the instructions in the corresponding repositories.
You also need the conntrack package on your nodes to control the conntrack configuration easily; it may already be installed on your distribution, but if not, install it with:

```
apt-get install conntrack   # Debian/Ubuntu
yum install conntrack       # CentOS/RHEL
```
In our test environment we scaled the simple app deployment gradually from 0 to 10, 20, and 50 pods without experiencing any issues, and we saw the boom-server working as expected, meaning its pod stayed in the Running state.
Before starting the simulation we need to verify that everything works as expected. Start by scaling the simple app deployment down to 0 replicas; this reduces the entropy in the simulation scenario.
```
kubectl scale deployment simple-app --replicas=0
```
Check the conntrack configuration on your nodes with sysctl net.netfilter.nf_conntrack_tcp_be_liberal: the output should be 0. This means conntrack will mark packets as INVALID when it cannot match a response to the tracked connection between the originating IP and the IP in the response; this behaviour is the default configuration.
Run conntrack -L on all the nodes of your test cluster to list the flow entries in the conntrack table; the command output ends with a summary like "flow entries have been shown" preceded by the number of entries. In our demo cluster the values vary from 200 to 1100.
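That summary line makes it easy to grab just the count. A small sketch that extracts the number from a sample summary string (on a real node you would feed it the stderr of conntrack -L instead):

```shell
# conntrack -L prints its summary on stderr, e.g.:
#   conntrack v1.4.6 (conntrack-tools): 842 flow entries have been shown.
# Extract the entry count from such a line (sample string used here):
summary='conntrack v1.4.6 (conntrack-tools): 842 flow entries have been shown.'
count=$(grep -o '[0-9]\+ flow entries' <<<"$summary" | cut -d' ' -f1)
echo "$count"   # prints 842
```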
Now we can work to create the issue in our testing environment.
The number of buckets in the conntrack hash table and the maximum number of tracked connections are correlated by default; according to the kernel documentation the relation is nf_conntrack_max = nf_conntrack_buckets x 4. We chose to stick with this default, and throughout the post we will change the maximum number of entries in the conntrack table according to this rule.
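The rule is trivial to apply with shell arithmetic; for example, the bucket count we use below yields the matching maximum:

```shell
# Default kernel relation: nf_conntrack_max = nf_conntrack_buckets * 4
buckets=300
max=$(( buckets * 4 ))
echo "nf_conntrack_max should be $max"   # prints: nf_conntrack_max should be 1200
```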
With this test we will set the maximum number of conntrack entries to 1200; sticking with the default ratio, to manage at most 1200 connections we set the hash table to 300 buckets:

```
sysctl -w net.netfilter.nf_conntrack_buckets=300
sysctl -w net.netfilter.nf_conntrack_max=1200
```
Now raise the number of simple app pods to a sufficiently high value; in our case 50 replicas was enough to generate a decent amount of TCP connections in the cluster while still leaving some capacity on the nodes.
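The scale-up itself is a one-liner; a sketch (deployment name as used earlier, watch interval arbitrary):

```shell
# Scale the simple app up to 50 replicas...
kubectl scale deployment simple-app --replicas=50
# ...and, on a node, watch the conntrack table fill up
# (entries go to stdout, the summary line to stderr):
watch -n 2 'conntrack -L 2>/dev/null | wc -l'
```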
With 50 replicas of the simple app deployed, our conntrack entries are a bit less than 1200. On the nodes, conntrack -L will show the total flow entries in the conntrack table grow to 1200 (or whatever value you set) and then stop. This means we have saturated the conntrack table, and our nodes are no longer able to keep track of TCP connections.
You will also see CrashLoopBackOff errors for the boom-server pod, and Connection reset by peer in the simple app logs:

```
kubectl logs -l app=client | grep "reset by peer"
curl: (56) Recv failure: Connection reset by peer
```
Now we want to shrink the conntrack hash table to trigger the out-of-capacity error that causes the connection resets; we lower nf_conntrack_max to 600 and nf_conntrack_buckets to 150 by issuing on our nodes:

```
sysctl -w net.netfilter.nf_conntrack_buckets=150
sysctl -w net.netfilter.nf_conntrack_max=600
```
At this point iptables on the nodes cannot keep the state of connections: the kubectl command returns errors connecting to the Kubernetes control plane, almost all pods go into CrashLoopBackOff, and applications stop responding. The boom-server pod is also in CrashLoopBackOff.
Now we can try to solve the issue with the magic flag proposed in the blog post: let's set conntrack to be liberal.
On all nodes run:

```
sysctl -w net.netfilter.nf_conntrack_tcp_be_liberal=1
```

This instructs conntrack not to mark as INVALID the packets it cannot track; now you will see that everything works smoothly again.
We decided that setting conntrack to be liberal works best for us, as it allows packets to be delivered to their destination even when they would otherwise be marked invalid, speeding up network transfers and reducing the per-packet processing time.
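Note that a sysctl -w change does not survive a reboot; if you adopt this fix, you may want to persist it. A sketch using a sysctl drop-in file (the file name is an example):

```shell
# Persist the liberal setting across reboots (path/name are examples).
cat <<'EOF' > /etc/sysctl.d/90-conntrack-liberal.conf
net.netfilter.nf_conntrack_tcp_be_liberal = 1
EOF
sysctl --system   # reload all sysctl configuration files
```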
We saw that the same solution has been implemented in the kubelet systemd unit for AKS, and we are happy to be in good company.
The other solution proposed on the Kubernetes blog is to instruct iptables to drop the packets that conntrack marks as INVALID; this is the fix that will probably land in future versions of Kubernetes, by having kube-proxy inject an additional rule into iptables.
Another viable option is to configure a larger conntrack hash table by setting higher values for net.netfilter.nf_conntrack_max. We did not test this solution, as we thought growing the number of entries could be detrimental to kernel performance: it means higher memory usage for the networking stack.
We look forward to seeing how the discussion progresses and whether the connection reset issue can be addressed in a better way, perhaps by switching to IPVS.
To Paolo Vitali, for finding the solution and reviewing the whole work of testing and patching our clusters.
To Francesco Gualazzi, for finding the boom-server, continuously requesting tests, and reviewing this article, giving it a more readable structure and adding many useful and valuable remarks.
The postings on this site are authors' opinions and experiences and do not necessarily represent the postings, strategies or opinions of lastminute.com group.