
Measuring KPIs of Continuous Delivery on Jenkins

staff

Recording the Four Key Metrics and beyond to track our path towards effective software delivery: this is the story of how we gained visibility into pipeline execution steps to improve our CI/CD architecture.


At lastminute.com we love data: we like to measure everything and use the feedback it gives us to drive our next actions and improve continuously. In the Platform team we take care of the Continuous Integration and Continuous Delivery (CI/CD) systems that enable our product teams to ship features to our users with confidence. When it comes to collecting data from these systems, the challenge is extracting the right data points to measure software delivery KPIs; the off-the-shelf software we use usually does not provide enough data to allow such reasoning.

Jenkins metrics

Part of our CI uses Jenkins to perform artifact builds; Jenkins has also been adopted as part of our Continuous Delivery system to orchestrate deployments on Kubernetes. We have been pioneering the use of Kubernetes since early 2015, when there was no Helm or Kustomize, so we built our own deployment tool, and Jenkins was the obvious way to orchestrate deployments across multiple environments. The code that performs this orchestration is based on Jenkins scripted pipelines, so it is written in the Groovy programming language, interoperable with Java. When we started measuring our Jenkins we only had the standard infrastructure metrics (CPU, network, disk access, etc.) read by the standard collectd plugins; this was not enough to get decent feedback on the interactions between all the parts of our CI/CD systems along the deployment pipeline. We found ourselves in need of gathering more data:

  • how many deployments were we shipping every minute?
  • how long did it take to deploy a canary to a staging environment compared to production?
  • how much time of the pipeline was spent on processing and how much on waiting for network transfers?

While most of this data may be accessible through the Jenkins API, in a complex system with hundreds of microservices deployed multiple times a day it was very difficult to normalize the textual representation of this data (Jenkins has a JSON/REST API) into time-series data for visual analysis. Looking for a way to extend the monitoring surface of our Jenkins pipelines, we found an interesting plugin called Metrics: it is an implementation of Dropwizard Metrics that exposes application metrics to the outside world; by installing the plugin you get JMX beans and a web UI to explore the standard metrics collected. Among the standard metrics you can find the number of builds executed by each agent, the build queue length and times, and the status of Jenkins jobs; this data is very handy and gives a good picture of what Jenkins is doing.

Standard metrics collection

To enable JMX to be scraped by collectd, we added the JMX options to the Jenkins configuration in a JENKINS_JMX_OPTIONS variable and appended it to the standard Java options variable:

JENKINS_JMX_OPTIONS="-Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.rmi.port=4001 -Dcom.sun.management.jmxremote.port=4001 -Dcom.sun.management.jmxremote.local.only=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false  -Djava.rmi.server.hostname=<PRIVATE_IP>"

## Type: string
## Default:     "-Djava.awt.headless=true"
## ServiceRestart: jenkins
#
# Options to pass to java when running Jenkins.
#
JENKINS_JAVA_OPTIONS="... $JENKINS_JMX_OPTIONS"

This is needed to allow the JMX connection from collectd to the JVM instance running Jenkins. We then configured collectd to read the Dropwizard beans using the GenericJMX plugin, in a file called jmx-jenkins.conf; here is an excerpt with the relevant configuration.

LoadPlugin java
<Plugin "java">
  JVMArg "-verbose:jni"
  JVMArg "-Djava.class.path=/opt/collectd/share/collectd/java/collectd-api.jar:/opt/collectd/share/collectd/java/generic-jmx.jar"
  LoadPlugin "org.collectd.java.GenericJMX"

  <Plugin "GenericJMX">
    <MBean "ci-metrics">
      ObjectName "metrics:name=jenkins.*"
      InstancePrefix ""
      InstanceFrom "name"
      <Value>
        Type "requests"
        InstancePrefix "Count"
        Table false
        Attribute "Count"
      </Value>
      ...
    </MBean>

    <Connection>
      ServiceURL "service:jmx:rmi:///jndi/rmi://localhost:4001/jmxrmi"
      Collect "ci-metrics"
      InstancePrefix "cicd-"
    </Connection>

  </Plugin>
</Plugin>

The name of the bean is specified in the ObjectName parameter; the beans exposed by default are jenkins.*, http.* and vm.*, which respectively represent the Jenkins job metrics (build queue, agents), the Jetty HTTP server metrics (number of requests, status codes) and the JVM metrics (heap, threads, GC). We are not very interested in the HTTP stats of the web server, so we chose to collect only the jenkins and vm beans. Once collectd was reloaded with systemctl reload collectd we started getting our data in and built some interesting visualizations of our system: for instance, we were able to correlate the Jenkins master instance load with the number of queued jobs and scale our agents accordingly. We leverage Jenkins distributed builds, but we would never know how many executors we need without this data.

Custom metrics

After this first successful iteration we decided to investigate what we could do to get even more data, and whether it was possible to create our own custom metrics on top of the Dropwizard API implemented in the plugin. Full credit goes to a former colleague (the one and only Mauro Ferratello) for finding a way to inject custom metrics into the JMX beans exposed by the plugin; after a couple of hours digging into the source code and some thread dumps he was able to discover a MetricRegistry object, a publicly visible registry of all the metrics exposed. Starting from there we found a way of creating custom beans by simply adding a common prefix to the new metrics; we chose pipelines to distinguish them from the already existing beans.

import com.codahale.metrics.MetricRegistry

class MetricsFactory implements Serializable
{
  // Common prefix for all custom metrics, so they end up under a dedicated JMX bean
  public static final String COLLECTD_BEAN_NAME = "pipelines"
  private final Object metricsClazz
  private baseMetricPath

  MetricsFactory(metricsClazz, String nodeName, String appName) {
    this.metricsClazz = metricsClazz
    // Builds a dotted metric path like "pipelines.<node>.<app>";
    // aggregateDynamicNodes is a helper defined elsewhere in our pipeline code
    this.baseMetricPath = MetricRegistry.name(COLLECTD_BEAN_NAME, aggregateDynamicNodes(nodeName), appName)
  }
}
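
As an illustration, here is a minimal sketch (not our actual pipeline code) of how such a factory could be constructed from a scripted pipeline. The jenkins.metrics.api.Metrics entry point of the Metrics plugin, the agent name and the application name used below are assumptions for the example.

import jenkins.metrics.api.Metrics

// Hypothetical instantiation inside a Jenkins scripted pipeline:
// the Metrics plugin class gives access to the shared MetricRegistry,
// env.NODE_NAME identifies the agent and "my-app" is a placeholder application name
def metricsFactory = new MetricsFactory(Metrics, env.NODE_NAME, "my-app")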

Then we created a Timer class, a nanosecond-resolution timer that records spans of time under a custom metric name; using the Dropwizard Clock object is mandatory to properly record the elapsed time in a format that the metrics collector understands.

import com.cloudbees.groovy.cps.NonCPS
import com.codahale.metrics.Clock
import java.util.concurrent.TimeUnit

class Timer implements Serializable
{
  private long startTime
  String name
  private final Object metricsClazz

  Timer(metricsClazz, String name)
  {
    this.metricsClazz = metricsClazz
    this.name = name
    // Record the starting tick with the Dropwizard clock (nanosecond resolution)
    this.startTime = Clock.defaultClock().getTick()
  }

  @NonCPS
  void stop()
  {
    // Register (or reuse) a Dropwizard timer under this name and record the elapsed time
    def thisTimer = metricsClazz.metricRegistry().timer(name)
    thisTimer.update(Clock.defaultClock().getTick() - startTime, TimeUnit.NANOSECONDS)
  }
}
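
Usage is then just a matter of creating a timer before the step we want to measure and stopping it right after; the following is a minimal sketch where the metric name, registry and image are hypothetical placeholders.

// Hypothetical usage inside a scripted pipeline step: the elapsed time is recorded
// under the given metric name when stop() is called
def pushTimer = new Timer(Metrics, "pipelines.agent-01.my-app.docker-push")
sh "docker push registry.example.com/my-app:latest"   // placeholder registry and image
pushTimer.stop()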

We added the call to MetricsFactory in our pipelines and, when we verified that the new metrics were there by connecting to the JMX and checking the new beans, we knew we had found the holy grail! The complete configuration file now tells collectd to also fetch the custom pipelines bean and all the metrics below it: we are now able to measure every single step of our pipeline execution by adding a timer where we need it. For example, one of the first things we wanted to measure was how much pipeline time was spent waiting for other components (artifact repository, Docker registry, Kubernetes controllers) and on network transfers between them, so that we could optimize our network and deployment process.

An example pipeline shows how we are now collecting multiple timers for every step along the flow; below is an example of the visualizations we can build from these metrics.

This graph shows, for each microservice, the time spent in network transfers between the different components of the CI system; it is very valuable for understanding which operations are network-bound. In the selected metric, in light blue, we see a Docker image push time with high deviation, so we know we should keep an eye on the artifact size: maybe the team is adding or removing dependencies?

CI time spent on network transfers

In this graph we record, for each microservice, the time spent rolling out canary and stable deployments after we have sent the Deployment request to the Kubernetes API; we want to know whether some microservices have slower rollout times so we can tune the maxSurge and maxUnavailable deployment parameters for a faster rollout.

Kubernetes deployment time per stage
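
As a sketch of how such a rollout can be timed (the deployment, namespace and metric names below are hypothetical, and in practice our own deployment tool drives the rollout), the wait for the Kubernetes rollout can be wrapped in one of our timers:

// Hypothetical timing of a canary rollout in a scripted pipeline:
// the timer measures how long Kubernetes takes to complete the rollout
def rolloutTimer = new Timer(Metrics, "pipelines.agent-01.my-app.canary-rollout")
sh "kubectl rollout status deployment/my-app-canary -n staging --timeout=10m"
rolloutTimer.stop()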

Wrapping up

Analyzing data from the CI/CD environment is valuable: it gives product teams powerful insights into how their software is built and delivered along the pipeline, and it lets infrastructure teams optimize around that data to allow faster and faster iterations. We presented here just a few examples of the data we now have at our disposal thanks to the metrics we collect from our pipelines; we hope this small piece of code can help improve your CI/CD as well!

Credit

Mauro Ferratello is a Jenkins expert, an Agile mentor and a solid engineer; part of the amazing CI/CD platform we run was architected and implemented by him, and we owe him a lot for the work he has done at lastminute.com. This post represents just a small part of the tremendous impact his contributions had on the company, so: thank you Mauro ❤️.

