
DevOps & Observability

Alessandro Gazzoni

Lorenzo Girardi

January 27, 2022 - ⏱️ 3 min read

In this fourth blog post of the series about our tech ecosystem, we introduce the principles and practices we've put in place to adopt a DevOps model and mindset, and explain how we're going to implement a comprehensive observability strategy.

Alessandro Gazzoni and Lorenzo Girardi are Engineering Managers within the Technology department. They began their journey at lastminute.com group several years ago, starting as sysadmins and witnessing technological changes that led to the current infrastructure.
Alessandro is responsible for the Automation Site Reliability Engineers and Datastore Engineers team, while Lorenzo is responsible for the Data Center, Middleware Monitoring and Enterprise Services.

DevOps

Team autonomy is one of our pillars: we want to decentralise responsibility for the infrastructure and empower functional teams to use the platform on a self-service basis. Increasing team autonomy brings several benefits:

- Faster time to market
- More innovation
- Infrastructure teams can focus on optimisation and automation

To foster a culture that supports this way of working, we recently introduced the DevOps role. DevOps engineers collaborate with Site Reliability Engineers and developers to extend development teams' accountability to the underlying services hosted in cloud environments (a shared responsibility model).

Below are some of the activities that will be taken care of by developers, with the support of DevOps:

- Maintaining CI/CD pipelines
- Provisioning and managing specific parts of the infrastructure through IaC (Terraform)
- Implementing automation (contributing capabilities to the Rundeck self-service portal together with SREs)
- Adopting new technologies
- Monitoring and optimisation

and much more.

To support our vision, the CI/CD has been designed to be very flexible: modular pipelines allow us to manage projects written in different languages, such as Python, Java, Golang and .NET.
As Kubernetes is our main technology, the pipelines can build the different types of manifests that Kubernetes accepts, such as deployments, cron jobs, configmaps and secrets.
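As a minimal sketch of what a modular manifest-building step could look like, the Python snippet below maps a manifest kind to a builder function and renders the result as JSON, which kubectl accepts alongside YAML. The function names (`build_deployment`, `build_cron_job`, `render`), the `BUILDERS` registry, the image names and the default values are all hypothetical and chosen for illustration only; the actual pipeline code is not shown in this post.

```python
import json


def build_deployment(name, image, replicas=2, port=8080):
    """Return a Deployment manifest as a plain dict (hypothetical helper)."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            # The selector must match the pod template labels.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "ports": [{"containerPort": port}],
                        }
                    ]
                },
            },
        },
    }


def build_cron_job(name, image, schedule):
    """Return a CronJob manifest as a plain dict (hypothetical helper)."""
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,
            "jobTemplate": {
                "spec": {
                    "template": {
                        "spec": {
                            "containers": [{"name": name, "image": image}],
                            "restartPolicy": "OnFailure",
                        }
                    }
                }
            },
        },
    }


# Registry of manifest kinds; a new kind is added by registering a builder.
BUILDERS = {"deployment": build_deployment, "cronjob": build_cron_job}


def render(kind, **params):
    """Render the manifest for the given kind as a JSON document."""
    return json.dumps(BUILDERS[kind](**params), indent=2)


if __name__ == "__main__":
    print(render("cronjob", name="cleanup",
                 image="registry.example.com/cleanup:1.0.0",
                 schedule="0 3 * * *"))
```

Keeping each manifest kind behind its own builder is one way to get the modularity described above: supporting configmaps or secrets would mean registering one more function rather than changing the pipeline itself.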
GitLab CI runners are hosted on AWS EKS, with Cluster Autoscaler and Horizontal Pod Autoscaling dynamically resizing the number of nodes and pod replicas according to developer demand.
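For readers unfamiliar with Horizontal Pod Autoscaling, the sketch below reproduces the core scaling rule documented by Kubernetes: the desired replica count is the current count multiplied by the ratio between the observed and the target metric, rounded up. The function name, the `min_replicas`/`max_replicas` defaults and the example numbers are assumptions made for illustration; the real autoscaler also handles pod readiness, missing metrics and stabilisation windows, which are left out here.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Simplified version of the Horizontal Pod Autoscaler scaling rule.

    desired = ceil(current_replicas * current_metric / target_metric),
    clamped to [min_replicas, max_replicas]. No scaling happens while the
    ratio stays within the tolerance band around 1.0 (0.1 is the
    Kubernetes default tolerance).
    """
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Four runner pods at 90% average CPU against a 60% target.
    print(desired_replicas(4, current_metric=90, target_metric=60))
```

Cluster Autoscaler works one level below this: when the extra pods cannot be scheduled on the existing nodes, it adds nodes, and it removes them again once they are no longer needed.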
Spinnaker allows engineers to deploy microservices across different Kubernetes clusters (both on AWS and on-premises). In the future, the entire CI/CD workflow, including the Kubernetes deployments, will be centralised on GitLab.

Figure: The DevOps technology stack

Observability

Logs and metrics are a valuable source of information, but if they are not well managed they can easily become ‘noise’.

Governance becomes even more important when enabling observability, as we are doing, and therefore:

We created an internal framework (contracts) that describes the structure of the data in the logs, the maximum size allowed and the threshold rate. Instead of tracking every event, we raise alerts that highlight issues and let us react as quickly as possible:
- Only errors are logged
- Applications go into debug mode only when we are troubleshooting and actually need the verbosity
- A standard format has been defined, in line with our Elasticsearch indexes
- Rate limiting is in place, to prevent flooding in case of misconfiguration
We identified three macro clusters of metrics that provide insights into our products at different levels:
- Business (customer/business view of the application) - business and developers are accountable for these metrics, which are the entry point for any investigation
- Framework (technical characteristics of the application) - developers are accountable for these metrics, which provide deeper insights into what needs to be investigated
- System (standard OS information) - SREs are accountable for these metrics, which cover the deepest level of investigation we can conduct
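The logging rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the internal framework itself: the `LogContract` fields, the required field names, the size limit, the rate and the choice of a token bucket for rate limiting are all hypothetical.

```python
import json
import time
from dataclasses import dataclass

LEVELS = {"DEBUG": 10, "INFO": 20, "WARNING": 30, "ERROR": 40}


@dataclass(frozen=True)
class LogContract:
    """Hypothetical contract: structure, maximum size and threshold rate."""
    required_fields: tuple = ("timestamp", "level", "service", "message")
    max_bytes: int = 2048
    max_events_per_second: float = 50.0
    min_level: str = "ERROR"


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.tokens = capacity
        self.last = clock()

    def allow(self):
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class ContractLogger:
    def __init__(self, contract, clock=time.monotonic):
        self.contract = contract
        self.debug_mode = False  # switched on only while troubleshooting
        rate = contract.max_events_per_second
        self.bucket = TokenBucket(rate, rate, clock)

    def emit(self, event):
        """Return the serialised line if the event is accepted, else None."""
        missing = [f for f in self.contract.required_fields if f not in event]
        if missing:
            raise ValueError(f"event violates contract, missing: {missing}")
        floor = "DEBUG" if self.debug_mode else self.contract.min_level
        if LEVELS[event["level"]] < LEVELS[floor]:
            return None  # below the level that is currently logged
        line = json.dumps(event, separators=(",", ":"))
        if len(line.encode("utf-8")) > self.contract.max_bytes:
            return None  # larger than the maximum size allowed
        if not self.bucket.allow():
            return None  # rate limited to prevent flooding
        return line
```

The level check runs before the rate limiter on purpose, so that events which would be discarded anyway do not consume the budget reserved for real errors.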

As a result, our dashboards are now easier to read, smaller, and more useful to the business.

The next steps we have in mind are:

- Introducing an error budget. Development teams can ‘spend’ this budget in any way they like: if the product is running with few or no errors, they can launch whatever they want, whenever they want. Conversely, once they have met or exceeded the error budget and are operating at or below the defined SLA, all launches are frozen until the number of errors drops to a level that allows launches to resume.
- Introducing OpenTracing, to get us even closer to real observability.
- Complementing our current tech stack with Loki, to correlate metrics and logs and help developers understand the reliability of their applications.
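The error budget rule can be made concrete with a small worked example. The sketch below assumes a request-based availability target (for instance 99.9% of requests succeeding over a given window); the function names and figures are hypothetical, and exact fractions are used so that "meeting" the budget is not blurred by floating-point rounding.

```python
from fractions import Fraction


def error_budget(target: str, total_requests: int) -> Fraction:
    """Number of failed requests the target allows, e.g. target='0.999'."""
    return (1 - Fraction(target)) * total_requests


def budget_remaining(target: str, total_requests: int, failed: int) -> Fraction:
    """Failures that can still be 'spent' before launches are frozen."""
    return error_budget(target, total_requests) - failed


def launches_frozen(target: str, total_requests: int, failed: int) -> bool:
    """Launches are frozen once the budget has been met or exceeded."""
    return budget_remaining(target, total_requests, failed) <= 0


if __name__ == "__main__":
    # With a 99.9% target and 1,000,000 requests, 1,000 failures are allowed.
    print(error_budget("0.999", 1_000_000))
    print(launches_frozen("0.999", 1_000_000, failed=400))
    print(launches_frozen("0.999", 1_000_000, failed=1_000))
```

In this example a team that has recorded 400 failures still has 600 left to spend on launches, while a team at 1,000 failures has met the budget and would have to reduce errors before launching again.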


About Alessandro Gazzoni

Head of Platform Engineering

Alessandro is our Head of Platform Engineering. His primary responsibility lies in overseeing the strategic direction and execution of the Platform Engineering area.

About Lorenzo Girardi

Platform Architect

Lorenzo is responsible for the Data Center, Middleware Monitoring and Enterprise Services.


