A good incident management process is fast and predictable. It quickly turns detection into response, escalates to the right people, makes communications clear, and keeps customers in the loop.
Incident management in software development means having a plan for how to act when incidents occur.
To be honest, the term incident management only became real to me recently, when I moved into a more critical management role. I had to learn the hard way why it's so important to have a process for when something goes wrong.
In this post, I'll present my own version of how to manage incidents, built over the last few years of solving and participating in incidents.
Table of contents
A System crash story
What is an incident
Who are the stakeholders
The importance of having a reliable system
Detecting an incident
Delivery channels
Keeping track of the solution
1. A System crash story
Suddenly, you are receiving emails and calls from all kinds of people complaining they can't use your product. Your system just crashed.
There are external and internal customers that depend on your system, so the issue should be fixed ASAP.
You try to understand which component may have caused the incident, and if you don't know how to fix it, you send a message to someone who does.
Once you find someone who knows how to solve the problem, you keep asking for updates so you can reassure customers that it will be fixed soon.
After a couple of hours, a developer finds out that a migration did not run properly during the previous deployment. He fixes it by running a command, and the system is back online.
You send an email to everyone saying that the issue has been resolved and you move on.
There is a lot wrong in this story: communication, code reviews, CI/CD, and, most importantly, the lack of incident management.
I won't focus on how to solve incidents here, but instead, discuss what it means to have a solid process to act when an incident happens.
2. What is an incident
“An incident is an event that could lead to loss of, or disruption to, an organization's operations, services, or functions.” — Wikipedia
One of the problems that prevent people from managing incidents properly is the absence of a solid plan to follow when something goes wrong.
A good incident process is fast and predictable. It quickly turns detection into response, escalates to the right people on the shortest path, makes communications clear and keeps customers in the loop. 
In other words, when you have a plan written down, it's just like following a recipe.
You'll know what to do.
3. Who are the stakeholders
Before we dig into the actual plan, you may need to take a step back and identify who the product stakeholders are.
When an incident happens, these are the people you need to keep in the loop and make sure they know what happened, why it happened, and that it won't happen again.
To make it practical, I suggest having a document (example below) with all stakeholders.
Consider a product that anyone can sign up for free.
Eventually, some of the users who create an account will become paid users. That's how the product makes revenue.
In this scenario, it's easy to spot one stakeholder: the average customer.
Also, there are developers, designers, and support analysts that work on this application, right? Those are also stakeholders.
However, stakeholders should not be treated equally. There are stakeholders that need more attention and should be satisfied — customers — and others who need to be aware of every important change in the system — developers, support analysts, etc.
To identify stakeholders and how to treat them, the following criteria may be useful:
High power, highly interested people — Manage Closely
High power, less interested people — Keep Satisfied
Low power, highly interested people — Keep Informed
Low power, less interested people — Monitor
It's important to have a document describing who the product stakeholders are: customers, developers, support analysts, and other folks in the same company.
4. The importance of having a reliable system
Reliability is an important pillar of software development. Systems that lack reliability are doomed to crash suddenly. It may be during a deployment or due to a spike in requests. No one knows.
For a business, this is bad news, and it's something you can't afford in the enterprise world.
Monitoring and Alerting
You should receive the bad news before your customers do. That gives the team a huge advantage in identifying the problem, notifying customers, and keeping them informed throughout the incident.
Trust is something that you only lose once. If your customer needs to tell you that something is broken or down, you have a serious problem right there.
That's why solid monitoring and alerting services are essential.
Popular monitoring services:
New Relic APM
Popular alerting services:
New Relic Alerts
Opsgenie
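To make the idea concrete, here is a minimal sketch of what such a service does under the hood: poll a health endpoint and push an alert when it fails. The URL and webhook below are hypothetical placeholders; in practice you would rely on a dedicated monitoring service rather than a script like this.

```python
# Minimal health-check sketch: poll an endpoint, alert on failure.
# HEALTH_URL and ALERT_WEBHOOK are hypothetical placeholders.
import json
import urllib.request

HEALTH_URL = "https://example.com/health"
ALERT_WEBHOOK = "https://chat.example.com/hooks/incidents"

def check_health(url: str, timeout: float = 5.0) -> bool:
    """Return True if the service answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, DNS failure, HTTP error
        return False

def alert(message: str) -> None:
    """Post an alert to a chat webhook so the team hears the bad news first."""
    body = json.dumps({"text": message}).encode()
    req = urllib.request.Request(
        ALERT_WEBHOOK,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=5)

# Run from cron or a loop, e.g.:
#   if not check_health(HEALTH_URL):
#       alert(f"Health check failed for {HEALTH_URL}")
```

Real monitoring tools add the parts that matter at scale (retries, deduplication, escalation policies), but the core loop is exactly this.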
This is an automated way to be aware of ongoing problems, but you also need a way to collect organized incident reports from your customers.
Instead of receiving calls and emails, you should have a place where customers can report what happened — usually not incidents, but bugs, misconfigurations, or just questions.
For incident reports and ticket management, there are tools like Jira Service Desk, Zendesk, and Intercom.
Such services provide efficient ways to have a single source of incident management and a good communication channel for both internal and external customers.
Also, they provide a unique identifier for each report, so you can refer to it when discussing a specific incident.
Good coding process
The way code is produced and maintained in a product has a direct impact on incident management.
You receive an alert regarding a failure when adding items to the shopping cart.
You check the logs, and a weird error shows up complaining about a missing database column when an item is added to the cart.
The first thing you try to do is identify when the change was introduced and why.
You run git log and receive a list like this one:
738125619 fix typo
127125619 add cart
That's great, right? No!
Now, what if instead, you encounter something like this?
Simplify serialize.h's exception handling

Remove the 'state' and 'exceptmask' from serialize.h's stream
implementations, as well as related methods.

As exceptmask always included 'failbit', and setstate was always
called with bits = failbit, all it did was immediately raise an
exception. Get rid of those variables, and replace the setstate
with direct exception throwing (which also removes some dead code).

As a result, good() is never reached after a failure (there are
only 2 calls, one of which is in tests), and can just be replaced
by !eof(). fail(), clear(n) and exceptions() are just never called.
Delete them.

Closes #123
See also #456
Good commit messages make the codebase easier to understand and, therefore, make it easier to identify and solve an incident related to the code.
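One way to nudge a team toward messages like the good example above is a commit-msg hook. The sketch below checks a couple of basic rules; the thresholds are arbitrary assumptions, and real teams often reach for an off-the-shelf linter instead.

```python
# Sketch of the check a Git commit-msg hook could run. The thresholds
# are arbitrary assumptions; tune them for your team.
MIN_SUBJECT = 10
MAX_SUBJECT = 72

def check_message(text: str) -> list:
    """Return a list of problems found in a commit message."""
    problems = []
    # Git strips comment lines starting with '#' from the final message.
    lines = [l for l in text.splitlines() if not l.startswith("#")]
    subject = lines[0].strip() if lines else ""
    if len(subject) < MIN_SUBJECT:
        problems.append("subject is too short to be descriptive")
    if len(subject) > MAX_SUBJECT:
        problems.append("keep the subject line at 72 characters or less")
    if len(lines) > 1 and lines[1].strip():
        problems.append("leave a blank line between subject and body")
    return problems

# In the actual hook (.git/hooks/commit-msg), read the message file
# passed as sys.argv[1], print the problems, and exit non-zero to
# block the commit.
```

A message like "fix typo" fails the first rule, while the serialize.h example passes all three.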
Any fool can write code that a computer can understand. Good programmers write code that humans can understand. — Martin Fowler
That's critical — seriously.
Having a production system without a staging environment is like walking a tightrope between two buildings, hoping that nothing goes wrong.
The key lesson here is not just having a staging environment; it should be almost exactly like production.
Same Linux/Windows version, same node/ruby/python/etc version, same MySQL/PostgreSQL/MongoDB/etc versions … etc…
Which brings me to the next topic — you must keep all environments in sync when it comes to software versions and state.
Infrastructure as Code (IaC)
By representing resources as code, we can parameterize the code to support multiple environments, share the code with our teammates, and even test the code to ensure accuracy. — Kyle Galbraith
But more important than that, it prevents someone from crashing the entire system by changing something directly in production!
By writing the infrastructure specification down as code, every change to the production environment goes through a process — it ensures changes are made safely.
Using IaC dramatically reduces the number of incidents related to misconfigurations and software updates.
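As a toy illustration of the idea (not a real IaC tool — in practice you would use something like Terraform, CloudFormation, or Ansible), the desired state of an environment can live in version-controlled code and be compared against what is actually deployed. All versions below are made-up examples.

```python
# Toy illustration of the IaC idea: the desired state lives in code,
# and a script reports drift against what is actually deployed.
# Version strings are made-up examples.
DESIRED = {
    "runtime": "node-10.16",
    "database": "postgresql-11.5",
    "os": "ubuntu-18.04",
}

def find_drift(actual: dict) -> dict:
    """Return {setting: (desired, actual)} for everything out of sync."""
    return {
        key: (want, actual.get(key))
        for key, want in DESIRED.items()
        if actual.get(key) != want
    }

# Example: staging drifted from the spec after a manual change.
staging = {
    "runtime": "node-10.16",
    "database": "postgresql-11.5",
    "os": "ubuntu-16.04",
}
drift = find_drift(staging)
# drift == {"os": ("ubuntu-18.04", "ubuntu-16.04")}
```

This is exactly how environments stay in sync: the spec is the single source of truth, and any manual change shows up as drift.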
5. Detecting an incident
There are many ways to be ahead of the customer and detect incidents in the early stages. It's important to have good monitoring and alerting systems.
When the system is monitored, an issue should pop up somewhere indicating that there is a failure. The next step is to solve or escalate the problem.
The person who received the alert should fix or delegate the problem — Remember that this is an incident, it's not the time to learn something new.
If you're on call (with Opsgenie, PagerDuty, etc.), it's important to know where the failure might come from and who owns that specific area. By owner, I mean the person who is usually the creator or maintainer of (and accountable for) a specific feature or domain.
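The ownership idea can be captured in something as simple as a routing table. The components, names, and channels below are made up for illustration; on-call tools like Opsgenie model the same thing with schedules and routing rules.

```python
# Sketch of an ownership map for escalation: given the component an
# alert points at, find who to page. Names and channels are made up.
OWNERS = {
    "checkout": {"owner": "alice", "channel": "#team-payments"},
    "deployments": {"owner": "bob", "channel": "#team-platform"},
}
DEFAULT = {"owner": "on-call-lead", "channel": "#incidents"}

def escalate(component: str) -> dict:
    """Return who should be paged for a failing component."""
    return OWNERS.get(component, DEFAULT)

# An alert about the checkout service goes straight to its maintainer;
# anything unmapped falls back to the on-call lead.
```

Keeping this map written down (and current) is what turns "find someone who knows" into a one-step lookup during an incident.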
6. Delivery channels
Keep customers (stakeholders) informed about what's going on
Professional support teams and site reliability engineers don’t decide on the fly what channels to communicate over. They make a plan ahead of time.
The main communication channels for incident communication include:
A dedicated status page
Workplace chat tool
If the product has many different public services, it’s a good idea to maintain a status page. But don't just create the page, use it!
Everything that doesn't have an owner gets abandoned at some point, so the status page should be updated and maintained by at least one person.
Let's call this role the Incident Manager.
This person should be able to communicate between multiple areas, such as customer support, engineering, and product.
They are responsible for updating the status page and communicating with customers during incidents.
A good workflow that I've seen work in practice is the following:
Acknowledge the problem - Check with engineering if the problem is real and what’s the impact;
Update the status page to calm customers down. Your customers should already have access to this page; if they don't, send them an email. Explain that you know something is wrong and that the team is working on a fix;
If necessary, send internal emails with the status page link to managers who need to know about the issue;
Keep the page updated on every finding - Use slack channels to keep the communication flowing between teams;
When the incident is resolved, start to write the Post Mortem.
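The workflow above boils down to keeping one timestamped record of the incident and its public updates. Here is a minimal sketch of that record; real status-page products expose the same idea through an API, and the statuses used below are just illustrative labels.

```python
# Minimal sketch of the record an incident manager keeps while
# following the workflow above: acknowledge, post updates, resolve.
# Status labels are illustrative, not a real product's API.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Incident:
    title: str
    status: str = "investigating"
    updates: list = field(default_factory=list)

    def post_update(self, status: str, message: str) -> None:
        """Record a timestamped update; this is what customers see."""
        self.status = status
        self.updates.append(
            (datetime.now(timezone.utc).isoformat(), status, message)
        )

incident = Incident("Errors when adding items to the cart")
incident.post_update("identified", "A failed migration was found.")
incident.post_update("resolved", "Migration re-run; system back online.")
```

By the time the incident is resolved, `incident.updates` already contains the skeleton of the post-mortem timeline.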
In my opinion, post-mortems are one of the most important practices of incident management.
The difference between average programmers and excellent developers is not a matter of knowing the latest language or buzzword-laden technique. Rather, it can boil down to something as simple as not making the same mistakes over and over again. Fortunately, there's a powerful tool that any developer can use to help learn from the past: the project postmortem. — developer.com
The outline of the post-mortem is simple: 
Acknowledge the problem, empathize with those affected and apologize
Explain what went wrong and why
Explain what was done to fix the incident and what was done to prevent repeat incidents
Acknowledge, empathize, and apologize once again
Forcing yourself and your team to spend some time documenting and discussing what happened during an incident will prevent similar issues in the future and also bring to light problems you didn't know you had.
It's important to conduct this ritual without looking for someone to blame. That's not the point. Instead, focus on discovering the root cause so you can fix it completely once the incident is resolved.
Postmortems should have a special place
Store post-mortems in a dedicated place.
You can use Google Drive, Dropbox, GitHub, Confluence, etc. It doesn't matter where you store them, as long as everyone on the team has access.
Also, it's important to designate an owner to conduct post-mortem meetings and publish the final document (example below). However, every team member who participates in the solution should write their own post-mortem in a simple way.
When acting on an incident, it is recommended to take notes in a simple notes app.
It's important to document facts while you're working on the solution. I like to write down important things that happened during the process.
9:10 am - Received an alert from Opsgenie of an increased error rate on the deployments system
9:24 am - Discovered several error logs on AWS Beanstalk regarding a service X due to a memory limit issue
9:35 am - Reviewed last commits on the deployment services and identified a new NPM package introduced yesterday
9:45 am - Searched for similar issues on Google and identified that the package Y version 1.2.3 has an incompatibility with Node JS Version 10.x
10:00 am - Had a chat with the platform team that introduced the package, and we decided to revert the commit that introduced it
10:22 am - We re-deployed the system and it worked as expected
Merging to a final document
After the incident is resolved we need to take a break and work on the post-mortem document.
It's important to have the following sections on the document:
Summary — Overview of what happened
Impact — Who was impacted and how badly
Root Cause — Description of the root cause
Resolution — Description of what solved the problem. If it was a temporary fix, describe the long-term solution
Timeline — Like the one I wrote in the previous subsection
Action Items — List of what should be done to prevent it from happening again, mostly derived from the Root Cause and the Resolution
7. Keeping track of the solution
When a post mortem is done, the incident is not truly completed.
How will you make sure it won't happen again?
The team may have to produce a quick fix to resolve the incident as soon as possible — a common name for this is a hotfix.
Hotfixes are usually urgent fixes designed to be shipped as quickly as possible.
So when the incident is resolved, the team should start working on the definitive solution — until then, the hotfix is technical debt.
Technical debt is something that works but may — and probably will — cause problems in the future.
It's good to have a specific backlog for Technical debts, and more importantly, you must kill your debts — Don't just add them to a list and forget it.
If the team uses scrum, it’s a good practice to use a part of the sprint to work on technical debts.
In this piece, I've shared many lessons from my career in software development. Most of them I learned from incidents.
After many failures, you realize how important documentation and solid processes are.
If I could make a simple ordered list of the most important topics in incident management, it would be something like this:
Have monitoring and alerts for your systems
Have staging and if possible preview environments
Have a code review process
Have a status page
Know your customers
Have an incident manager responsible for communications during incidents — Note, this is just a role, not a job title. Anyone can be an incident manager and do something else.
Write post mortems
I hope you find this useful.
References
- Incident management — https://en.wikipedia.org/wiki/Incident_management
- How to create an incident response playbook — https://www.atlassian.com/blog/it-teams/how-to-create-an-incident-response-playbook
- Martin Fowler — https://en.wikiquote.org/wiki/Martin_Fowler
- The benefits you need to know about infrastructure as code — https://blog.kylegalbraith.com/2018/08/21/the-benefits-you-need-to-know-about-infrastructure-as-code/
- The project postmortem — https://www.developer.com/design/article.php/3637441
- Incident postmortems — https://www.atlassian.com/incident-management/postmortem