
Ingress-Nginx bug makes a comeback

francesco gualazzi
manuel ranieri

A three-year-old bug in Kubernetes' Ingress-Nginx makes a comeback in an even more subtle form...


Using cutting-edge technologies like Kubernetes confronts our engineers with technical challenges every day. We face the complexity and uncertainty of multiple components stacked on top of each other, each with fast release cycles and therefore prone to bugs introduced from one version to the next.
In such a scenario there is great value in the history that brought the software to its current state: knowing the decisions the community took in the past can help debug present issues and possibly save you a headache.

TL;DR

If you run Kubernetes with ingress-nginx at the latest version (0.24.1 at the time of writing), you are likely affected by a three-year-old bug that was reintroduced in version 0.22.0. The issue concerns the round_robin balancing type, the default configuration for the ingress, and causes applications immediately behind the ingress to receive traffic skewed towards a subset of pods instead of an even distribution; configure the ingress to use the ewma algorithm to address this behaviour. To read more, see the repo issue.

The long story

lastminute.com has been pioneering the adoption of Kubernetes and has been active in the community since the very beginning. In 2016 our Platform team discovered and reported an issue to the ingress-nginx maintainers regarding the balancing of incoming requests. In the data recorded at that time (with ingress version 0.8) we noticed that the list of IPs used by multiple ingress pods was exactly the same because it was sorted; this meant that the first request of every ingress instance was routed to the same backend pod. This is not a problem with a small ingress deployment, but it becomes dangerous as you increase the ingress replicas, because you risk overloading an application tuned for a certain QPS value. The bug was patched internally and unfortunately we were not able to contribute upstream, but it was eventually patched in the official repo too.

Last week one of our product development teams reported instability in an application during deployment. We use canary releases and each application lives in a dedicated namespace; when the canary started to roll out, the stable release pods showed higher CPU load and growing GC and response times, and after some minutes the situation was back to normal. We started our investigation and found the problem a couple of hours later: the instability was due to more traffic being delivered to only a subset of pods in the stable deployment! The ingress was not forwarding an equal number of requests to all pods but was instead sending more requests to some pods and fewer to others… Blast from the past! The ingress bug had returned, 3 years and 16 versions later!

Discovering

The following graph represents the starting point of our investigation: each line is the number of requests per second handled by one pod behind the same ingress rule, a publicly reachable service in our production cluster.

Round Robin: received server requests per second per pod

The graph shows that the normal behaviour is a uniform number of requests per pod, but at a certain moment the deviation between pods starts to increase: some pods get more requests, some fewer; the turbulence goes on and amplifies for some minutes, then the service recovers. We were able to match the moment where the request rates start to diverge with the start of a canary release for the application, and we started to wonder why the canary pods should have any effect on the requests received by the other pods; the only resources shared between canary and regular pods are the Service and the Ingress. When canary pods are added to a deployment, the Endpoints object is updated by the endpoints controller and this generates two events (a quick way to inspect that object is sketched right after the list):

  • the iptables rules are updated by kube-proxy with the list of Pod IPs behind a Service
  • the NGINX upstream configuration changes, updating the Pod IPs behind a Service
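Both consumers read the same Endpoints object, so it is worth looking at it directly; a minimal sketch, with placeholder namespace and service names:

# Hypothetical names: replace my-namespace/my-service with your own.
# List the pod IPs currently behind the Service, i.e. what both kube-proxy and the ingress receive.
kubectl -n my-namespace get endpoints my-service \
  -o jsonpath='{range .subsets[*].addresses[*]}{.ip}{"\n"}{end}'

# Watch the object change while a canary rolls out.
kubectl -n my-namespace get endpoints my-service -o yaml --watch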

Remembering that a similar issue had happened before, we ditched the iptables hypothesis and started looking into the NGINX ingress controller: what we found was that since version 0.22.0 the way requests are distributed across backends had changed, introducing a Lua script to perform the balancing. When NGINX is configured this way it expects a Lua implementation to dispatch requests to a list of peers; we did a deep dive into the Lua code used by the ingress and found the issue.

This is the round_robin balancing algorithm implementation (the default, used when nothing else is specified in the configMap or via annotation):

local _M = balancer_resty:new({ factory = resty_roundrobin, name = "round_robin" })

function _M.new(self, backend)
  -- build the endpoint -> weight table from the backend's Endpoints
  local nodes = util.get_nodes(backend.endpoints)
  local o = {
    -- the actual round robin comes from lua-resty-balancer (resty.roundrobin)
    instance = self.factory:new(nodes),
    traffic_shaping_policy = backend.trafficShapingPolicy,
    alternative_backends = backend.alternativeBackends,
  }
  setmetatable(o, self)
  self.__index = self
  return o
end

Here util.get_nodes(backend.endpoints) receives the endpoints of the backend, taken from the Endpoints object of the Kubernetes API.

function _M.get_nodes(endpoints)
  local nodes = {}
  local weight = 1

  -- every endpoint gets the same weight; keys are "ip:port" strings
  for _, endpoint in pairs(endpoints) do
    local endpoint_string = endpoint.address .. ":" .. endpoint.port
    nodes[endpoint_string] = weight
  end

  return nodes
end

The problem with this code is subtle because it’s the kind of bug that can only be noticed at scale and it’s very difficult to test: the Endpoints JSON that comes from the Kubernetes API is the same for all ingress pods. Iterating over a Lua table is deterministic for identical input, so pairs(endpoints) yields the same sequence of key/value pairs on every ingress instance, and the round-robin balancer built from it starts from the same peer everywhere; this means that if you have, say, 20 ingress instances behind a load balancer, the first 20 requests will all be routed to the first pod in the Endpoints list.
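To see this determinism in isolation, here is a minimal sketch you can run with OpenResty's resty CLI, assuming the lua-resty-balancer module is on the Lua path and a library version that predates the randomised-start patch mentioned at the end of this post; the IPs are made up:

resty -e '
  local roundrobin = require("resty.roundrobin")

  -- the same endpoint -> weight table that every ingress pod builds from the Endpoints object
  local nodes = { ["10.0.0.1:8080"] = 1, ["10.0.0.2:8080"] = 1, ["10.0.0.3:8080"] = 1 }

  -- two independent balancer instances, standing in for two different ingress pods
  local a = roundrobin:new(nodes)
  local b = roundrobin:new(nodes)

  -- both pick the same peer for their first request
  print("instance A first pick: ", a:find())
  print("instance B first pick: ", b:find())
'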

Here is a graphical representation of the flow of the first N requests for an ingress deployment: given N ingress pods, the first N requests are all routed to the first backend pod in the list.

Round Robin: flow of the first N requests

This is not going to be a big issue for small-scale ingress deployments and services: it only happens when the configuration is refreshed, i.e. when pods are added to or removed from the deployment (for example during a rollout). It became a problem on one of our higher-traffic services because, with a high number of ingress pods, the additional traffic that had to be handled by a subset of pods (the first ones in the list) was causing higher GC times and slower responses, and thus degrading the service.

Verifying

To prove our hypothesis we did two things:

  • ensure that the configuration across multiple ingress pods was the same, with the same order of pod IPs in the backends
  • experiment with a different balancing algorithm to see whether the problem would go away

To verify the ingress pods' configuration we used the official troubleshooting guide. You can run the controller in a container locally on your workstation and connect it to the Kubernetes API with the following commands:

# after having run kubectl proxy in another shell...
docker run --rm -ti --name nginx-ingress-local --user root --privileged -v "$HOME/.kube:/kube" \
     -e "POD_NAME=nginx-ingress-ctrl-upgrade-78d9b95ccd-abcd1" \
     -e "POD_NAMESPACE=ingress-nonprd" \
    registry.bravofly.intra:5000/application/nginx-ingress-controller:0.24.1-gcr \
     /nginx-ingress-controller --kubeconfig /kube/config  \
     --default-backend-service=ingress-nonprd/default-backend-upgrade --configmap=ingress-nonprd/cfm-nginx-ingress-ctrl-upgrade \
     --annotations-prefix=ingress.kubernetes.io --sync-period=10s --stderrthreshold=0 --logtostderr=0 --log_dir=/var/log/nginx --v=5

# with the container running, open another shell inside it
docker exec --user root -ti nginx-ingress-local /bin/bash
# check that the nginx binary was compiled with debug support, then start the master
nginx -V | grep '--with-debug'
nginx

From here you can dump the generated configuration using the gdb commands in the guide.

The same can be achieved with the pods running in the cluster: first get a couple of pod names, then run

# get ingress pods
kubectl -n ingress-namespace get pods
kubectl exec -ti ingress-nginx-aaabbbccd-123ab -- /bin/bash

# once in the shell, get the backends from the configuration endpoint
curl --unix-socket /tmp/nginx-status-server.sock http://fakehost/configuration/backends

You can dump this to a file too and copy it locally with kubectl cp. We did it and proved our assumption: the list of IPs was exactly the same on every ingress pod! The ordering bug had returned!
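For reference, this is roughly how such a comparison can be scripted; the pod and service names are placeholders, jq is assumed to be available, and we rely on the backends JSON having the name/endpoints shape used by the Lua snippets above:

# Hypothetical pod names: pick two real ingress pods from `kubectl get pods`.
for pod in nginx-ingress-ctrl-aaaaa nginx-ingress-ctrl-bbbbb; do
  kubectl -n ingress-namespace exec "$pod" -- \
    curl -s --unix-socket /tmp/nginx-status-server.sock http://fakehost/configuration/backends \
    > "/tmp/${pod}.json"
done

# Extract the endpoint list of the backend under investigation and compare the two dumps.
jq -r '.[] | select(.name | test("my-service")) | .endpoints[] | "\(.address):\(.port)"' \
  /tmp/nginx-ingress-ctrl-aaaaa.json > /tmp/backends-a.txt
jq -r '.[] | select(.name | test("my-service")) | .endpoints[] | "\(.address):\(.port)"' \
  /tmp/nginx-ingress-ctrl-bbbbb.json > /tmp/backends-b.txt
diff /tmp/backends-a.txt /tmp/backends-b.txt && echo "same IPs, same order, on both ingress pods"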

Fixing

The only way to address this issue right now is to switch to the ewma algorithm; you can do so by setting load-balance: ewma in the configMap passed to the ingress controller, or by adding the annotation ingress.kubernetes.io/load-balance: "ewma" to the Ingress definition YAML. The difference between the two settings is that the configMap applies the algorithm globally to all ingresses, while the annotation lets you pinpoint the algorithm to the desired Ingress and namespace; a sketch of both options follows. The graph after the sketch shows the average number of requests dispatched to every pod per second after we switched to the ewma algorithm.
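Both switches are one command away; the resource names below are hypothetical, and the configMap and namespace must match the ones your controller is started with (see the --configmap flag in the docker run example above):

# Globally: set the algorithm in the ingress controller configMap.
kubectl -n ingress-namespace patch configmap cfm-nginx-ingress-ctrl \
  --type merge -p '{"data":{"load-balance":"ewma"}}'

# Per Ingress: add the annotation (this setup uses a custom --annotations-prefix;
# on a default installation the prefix is nginx.ingress.kubernetes.io).
kubectl -n my-namespace annotate ingress my-ingress \
  ingress.kubernetes.io/load-balance=ewma --overwrite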

EWMA: received server requests per second per pod, after switching to ewma

In the graph:

  • initially the round_robin algorithm is in use: low standard deviation, but no scaling in place
  • 14:10: the new ewma algorithm is configured as an annotation on the ingress
  • the standard deviation grows as requests get different weights
  • 14:30: a service is scaled in, reducing the number of pods by 20%
  • the deviation stays under control, so the service handles the scale-in gracefully

ewma is a different balancing implementation that generates a weight for every pod IP based on the most recent server response times: in short, it tries to dispatch more requests to the backends that reply faster, on the assumption that they are less loaded. As you can see from the graph, with this algorithm we observe a higher standard deviation than what we saw with round robin in normal operation, but during the rollout of a deployment the deviation does not grow any further, it holds steady! Our services are now safeguarded from this ingress-balancing “DDoS”.
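For the curious, the controller's ewma balancer keeps a score per endpoint and decays it towards the latest observed response time; as we understand it (the exact constants live in the controller's ewma.lua), the update is roughly

score_new = score_old * e^(-Δt/τ) + rtt * (1 - e^(-Δt/τ))

where Δt is the time since the endpoint was last scored, τ is a fixed decay constant and rtt is the last observed response time; endpoints with the lowest score are preferred.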

Kudos

We are happy to say that the ingress-nginx project maintainers were super responsive on this issue and a patch has already been proposed in the library that performs the round robin: with the patch, the round robin starts dispatching requests from a random IP in the list, so even if the list is the same for multiple ingress pods, the first request arriving at each pod will be routed to a random backend. We hope the new library version will be included in the next release and that reading this post saves you some headache while debugging your services!

Happy k8s’ing 😎!

