Good morning everyone. As you surely know, on Tuesday morning a group of starfighters from the Rebel Force managed to penetrate the defenses of our magnificent Pink Star and, by inflicting a surgical blow on its core, caused its complete destruction.
This incident led to the death of about 100,000 men and the total loss of an infrastructure that cost two years of work and about 15 billion euros.
Obviously this is a very serious matter, which casts a shadow on our reputation at a galactic level. Today we are here to analyze in detail what happened, identify the weaknesses of the project and propose to the Pink Empire some useful actions to avoid similar incidents in the future.
So arm yourselves with post-it notes: the Pink Star Destruction Post Mortem Meeting is about to begin.
Beginning of the attack
At 10:32, Grafana sent a notification about an alert on the monitoring dashboard: “Unknown starfighters approaching”. The Radar Team had configured the monitoring to notify them on the Slack channel whenever the number of unknown starfighters exceeded 2. In this case, as many as 35 approaching starfighters were detected, but no specific procedure was in place for such an event.
The Radar Team manager wrote a Slack message to the Defense Director, but he was busy in another meeting, deciding the correct shade of pink for the base, and he never replied.
Action plan: It is essential that the company adopts specific procedures to follow in case of alert. Everyone must know what to do during the emergency and it is necessary to have a dedicated team and/or space to engage all the involved stakeholders.
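To make the idea of a "specific procedure" concrete, here is a minimal sketch of how an alert reading could be mapped to a predefined response. Only the "more than 2 unknown starfighters" rule comes from the story; the second threshold, the severity names and the runbook steps are hypothetical and purely illustrative.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Only ALERT_THRESHOLD reflects the rule described above.
# CRITICAL_THRESHOLD and the runbook steps are illustrative assumptions.
ALERT_THRESHOLD = 2
CRITICAL_THRESHOLD = 10


@dataclass(frozen=True)
class Response:
    severity: str
    steps: Tuple[str, ...]


def plan_response(unknown_starfighters: int) -> Optional[Response]:
    """Return the procedure for a radar reading, or None if no alert fires."""
    if unknown_starfighters <= ALERT_THRESHOLD:
        return None
    if unknown_starfighters <= CRITICAL_THRESHOLD:
        return Response("warning", ("notify the Radar Team channel",))
    return Response(
        "critical",
        (
            "post in the shared emergency channel",
            "page the on-call incident manager",
            "check for maintenance in progress",
        ),
    )
```

With a mapping like this, a reading of 35 starfighters would not depend on one director reading a Slack message: the critical steps are known in advance and can be triggered by whoever receives the alert.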
The moment the starfighters reached the surface of the base, a heavy firefight began with our defense systems. During the conflict, many enemies were shot down, but unfortunately some of our turrets also suffered damage. The Defense Team then opened a ticket on the Jira portal for each damaged turret.
However, the Jira project was not shared with the rest of the company, so no other team learned about the damage in real time.
Action plan: it is essential that the company shares information relating to ongoing incidents. This will allow all teams to know what is happening and react accordingly.
Entry into the base
As the firefight continued, at 10:53 three starfighters managed to get inside the base, where the defenses are much weaker. The obvious question we have been asking ourselves from the beginning is: “how did they overcome the force field that should surround the base?”
Our subsequent analysis revealed that the force field had been deactivated for extraordinary maintenance and, lacking any information about the attack in progress, the Shield Team was unable to reactivate the protections in time.
Action plan: extraordinary maintenance must always be communicated to the company. Also, if the team had been notified of the attack immediately, they could have stopped the update and rolled back to the previous version.
Once inside the base, a single starfighter reached the core and hit it with a laser beam. The core's stability problems had been known for some time and had already caused various internal incidents, but the technicians had always intervened to fix the symptoms of these problems without correcting their causes. The single laser beam to the core triggered a chain reaction which, within seconds, led to the destruction of the base.
Action plan: it is essential that the company also works on the root causes of past incidents, to identify vulnerabilities and make its systems more stable.
Report to the Emperor
The day after the attack, the Emperor organized a video call to collect the first data about the impact of the destruction and about all the other incidents that had occurred on the base in the last year. Unfortunately we hadn’t tracked the impacts on Jira, so we weren’t able to provide accurate statistics covering both the overall costs and the volume of incidents for each service used within the Pink Star. The Emperor said that because of these shortcomings our year-end evaluation is at risk, and we may not receive the full productivity bonus!
Action plan: it is essential to collect and analyze all the data relating to incidents that have occurred over time. Specifically, it is important to estimate the impact of each event in terms of lost revenue, costs and disruption for the company. The teams should provide reporting on a regular basis, to give everyone visibility into company incidents.
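As a rough illustration of the kind of reporting the Emperor was asking for, the sketch below aggregates incidents per service. The `Incident` fields and the `summarize_by_service` helper are hypothetical names chosen for this example; a real report would be built from whatever the ticketing system actually records.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    service: str           # e.g. "turrets", "force field", "core"
    cost: float            # estimated cost, in the currency used by the report
    downtime_minutes: int  # how long the service was disrupted


def summarize_by_service(incidents):
    """Aggregate incident count, total cost and total downtime per service."""
    summary = defaultdict(
        lambda: {"incidents": 0, "cost": 0.0, "downtime_minutes": 0}
    )
    for incident in incidents:
        entry = summary[incident.service]
        entry["incidents"] += 1
        entry["cost"] += incident.cost
        entry["downtime_minutes"] += incident.downtime_minutes
    return dict(summary)
```

Had every turret ticket and core incident been recorded with even these three fields, answering "how many incidents per service, and at what cost?" would have been a single function call rather than an awkward silence on a video call.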
Why could this never have happened at lastminute.com?
At lastminute.com we empathize very much with the destruction of the Pink Star, because we know that even large companies can suffer serious damage if they do not learn how to handle incidents correctly! For this reason, the company has formalized an Incident Management Process that would undoubtedly have mitigated the impacts of an event like the one described above.
Taking the alert on the number of “unknown starfighters approaching” as an example, at lastminute.com the owning team would have reported the event in the Slack channel dedicated to emergencies, which is monitored 24/7, and the on-call incident manager would have immediately taken charge of the incident.
With the support of the Monitoring Team, they would have tracked the incident via a Jira ticket and emailed the entire company to align stakeholders. Any related incidents, such as the turret destruction, would have been linked to the main incident to keep all the information organized.
The on-call incident manager would have checked for any maintenance in progress and would certainly have asked the Shield Team to interrupt their maintenance immediately and restore the Force Shield to the previous version. This would, in all likelihood, have led to the failure of the attack.
Within the story, however, two other relevant aspects emerge: the management of problems and the calculation of impacts.
At lastminute.com we try to analyze recurring incidents and, with the collaboration of all the development teams, we track them as problems and work to resolve the root cause. In this case, the core vulnerability would have been handled in time, and we would probably have avoided such an obvious weakness in our structure.
Furthermore, we do not have an Emperor (yet), but we certainly have many top managers who are anxiously awaiting the reports relating to the incidents that occurred in the previous month, to understand how much they cost us and what we are doing to improve management and reduce resolution times.
Ultimately, within a complex and evolving system, incidents are inevitable. The company’s goal is obviously to reduce them, but it must come to terms with their inevitability and for this reason it must equip itself with the right people and the right processes to be able to deal with them. If you want to conquer the galaxy, make sure you have a good IT Service Management Team!