What is Critical Incident Management?
Critical incident management defines the alignment of company operations, services and functions to manage high-priority assets and situations. Coordinated response between multiple teams requires critical incident management.
The first step in handling a critical incident is to define what type of situation the team is facing. Several severity levels can describe an incident, and IT teams usually use “SEV” definitions. These range from a severity five (SEV-5), which is a low-priority incident, to a severity one (SEV-1), which is a high-priority event. Anything above a SEV-3 (that is, a SEV-2 or a SEV-1) is considered a “major event” and becomes a critical incident requiring critical incident management.
“Severity” determines the importance of an incident based on pre-defined guidelines. The intent is to guide responders on the type of response they can provide. The higher the severity, the riskier the decisions responders may be justified in making.
| Severity | Description | Example |
|---|---|---|
| SEV-1 | Critical production issue that severely impacts use of the service. Often called a show stopper. This type of situation has no workarounds. Severity one issues require a dedicated resource to work on the issue. | Internet service is down, which prevents the application from running. |
| SEV-2 | Critical situation where functionality is impacted or customer experience is seriously degraded. High impact to portions of the business and no reasonable workaround exists. | Server is down, preventing storage of new files or records. |
| SEV-3 | A partial loss of service with a medium-to-low impact on the business. Business is still able to function. Short-term workaround is available, but not scalable. Issue could escalate to SEV-2 if not managed properly. | Part of a solution’s functionality is unavailable. |
| SEV-4 | Performance of systems is delayed but still functioning. Bug affects a small number of users. Acceptable workaround available. | Website is slow in responding to requests. |
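The severity scale above can also be written down in code so that tooling and responders apply the same rule. The following is a minimal sketch in Python, not a standard: the `Severity` enum and the `is_critical` helper are hypothetical names, and the threshold follows this article's statement that anything above a SEV-3 (read here as SEV-2 and SEV-1) is a critical incident.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels as described in this article; a lower number is more severe."""
    SEV1 = 1  # critical production issue, no workaround
    SEV2 = 2  # functionality impacted, no reasonable workaround
    SEV3 = 3  # partial loss of service, short-term workaround
    SEV4 = 4  # degraded performance, acceptable workaround
    SEV5 = 5  # low-priority incident


def is_critical(severity: Severity) -> bool:
    """Return True for incidents more severe than SEV-3 (assumed threshold)."""
    return severity < Severity.SEV3


if __name__ == "__main__":
    for sev in Severity:
        label = "critical incident" if is_critical(sev) else "incident"
        print(f"SEV-{sev.value}: {label}")
```

Using an integer-backed enum keeps the comparison readable: "more severe" is simply "smaller number".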
How CIM Differs From Incident Management
Incident management defines the orchestration of personnel, technology and processes to resolve IT service interruptions. It is not fundamentally different from critical incident management, and at times the terms are used interchangeably. However, critical incident management differs from ordinary incident management in the severity of the incidents it covers. Much of the change is one of mindset.
An incident management situation might correspond to a SEV-5 or a SEV-4 on the chart above. This differs from a critical incident management situation, which describes a SEV-2 or a SEV-1. In either of these latter two situations, the decision-making process changes. Actions might be riskier during a SEV-1 given the importance of what is at stake.
At times, it can be difficult for team members to understand the difference between critical incident management and incident management. That is why it is important to have experienced team managers who can help shepherd the thinking of the team.
The Cost of Downtime
Proper critical incident management requires understanding the actual impact of downtime. According to a January 2016 article in Network Computing on the high price of IT downtime, organizations face:
“… An average of five downtime events each month, with each downtime event being expensive indeed: from $1 million a year for a typical midsize company to more than $60 million for a large enterprise.”
The leading cause of this downtime is equipment failure, which accounts for nearly 40 percent of downtime. The second most frequent cause is human error, which accounts for 25 percent.
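To make these figures concrete, the sketch below works through the arithmetic in Python. It assumes, purely for illustration, that the quoted annual cost is spread evenly across five events per month and that cost tracks each cause's share of downtime. Neither assumption comes from the cited article, and the function names are hypothetical.

```python
EVENTS_PER_MONTH = 5             # from the quoted article
ANNUAL_COST_MIDSIZE = 1_000_000  # USD, low end of the quoted range
CAUSE_SHARE = {                  # share of downtime by cause, from the text
    "equipment failure": 0.40,   # "nearly 40 percent", rounded up here
    "human error": 0.25,
}


def cost_per_event(annual_cost: float, events_per_month: int) -> float:
    """Average cost of one event, assuming cost is spread evenly (assumption)."""
    return annual_cost / (events_per_month * 12)


def cost_by_cause(annual_cost: float, shares: dict[str, float]) -> dict[str, float]:
    """Split annual cost by cause, assuming cost tracks downtime share (assumption)."""
    return {cause: annual_cost * share for cause, share in shares.items()}


if __name__ == "__main__":
    per_event = cost_per_event(ANNUAL_COST_MIDSIZE, EVENTS_PER_MONTH)
    print(f"Per event: ${per_event:,.0f}")
    for cause, cost in cost_by_cause(ANNUAL_COST_MIDSIZE, CAUSE_SHARE).items():
        print(f"{cause}: ${cost:,.0f} per year")
```

Under these assumptions, a midsize company's 60 events per year work out to roughly $16,667 each.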
Traditional workflows alert help or service desks to downtime incidents via pagers or emails. The use of email alerts assumes, falsely, that an email will get the attention of the appropriate data center manager or service desk engineer. Unfortunately, critical messages often get buried in email inboxes. Instead, IT support teams need incident management platforms that alert them immediately.
Critical Incident Management Best Practices
An organized approach to addressing and managing an incident requires teams to not just solve the incident, but to handle the situation in a way that limits damage and reduces recovery time and costs. Critical to the success of this process is establishing protocols for managing IT roles not just during an incident, but also before and after the urgent event.
1) Critical Incident Preparation
Establish a workflow for how incidents are handled by the IT operations team so everyone knows their role. This could mean that the help desk is the first to receive the incident and either creates a ticket and sends it to the proper service desk, or uses the persistent alerting feature of its incident management platform to alert the proper service desk based on the problem. The help desk uses the high-priority alerting feature if the incident is SEV-3 through SEV-1, and the low-priority alerting feature for a SEV-4.
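The routing rule described above can be sketched in a few lines of Python. This is an illustration, not the API of any incident management platform: the `SERVICE_DESKS` mapping, the desk names and both function names are hypothetical, and treating a SEV-5 as low priority is an assumption, since the text only mentions SEV-4.

```python
# Hypothetical mapping from problem category to the service desk that owns it.
SERVICE_DESKS = {
    "network": "network-desk",
    "storage": "storage-desk",
}
DEFAULT_DESK = "general-desk"  # hypothetical fallback for unknown categories


def alert_priority(sev: int) -> str:
    """Map a severity number (1 = most severe) to an alert priority."""
    if not 1 <= sev <= 5:
        raise ValueError(f"unknown severity: SEV-{sev}")
    # SEV-1 through SEV-3 use high-priority alerting, per the text.
    if sev <= 3:
        return "high"
    # The text specifies low priority for SEV-4; treating SEV-5 the
    # same way is an assumption.
    return "low"


def route_incident(category: str, sev: int) -> tuple[str, str]:
    """Return the (service desk, alert priority) pair for an incident."""
    desk = SERVICE_DESKS.get(category, DEFAULT_DESK)
    return desk, alert_priority(sev)


if __name__ == "__main__":
    print(route_incident("network", 2))
    print(route_incident("printing", 4))
```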
Once the proper service team is alerted, it needs a protocol for managing the situation. Does the team call in a subject matter expert (SME), or can it handle the incident internally? If it is a SEV-2 or SEV-1, the protocol might be to contact the SME to ensure the team is following best practices. The team might also want to consider how its members will communicate with one another while they work on resolving the issue. Will they use OnPage to exchange messages or will they hop on a conference bridge?
It is also important to determine when to notify management that an issue has arisen. Again, this should all run according to a prescribed script. There should be no guesswork on what role everyone needs to play during a high-priority incident.
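One way to remove guesswork is to encode the script as data that every responder can read. The sketch below is a hypothetical example in Python. Only the rule about contacting the SME for a SEV-2 or SEV-1 comes from the text. The management-notification threshold and the choice of communication channel are assumptions that each team would set for itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResponsePlan:
    contact_sme: bool        # call in a subject matter expert?
    notify_management: bool  # tell management now?
    channel: str             # how responders talk to each other


def plan_response(sev: int) -> ResponsePlan:
    """Return the scripted response for a severity number (1 = most severe)."""
    if not 1 <= sev <= 5:
        raise ValueError(f"unknown severity: SEV-{sev}")
    return ResponsePlan(
        # The text suggests contacting the SME for SEV-2 and SEV-1.
        contact_sme=sev <= 2,
        # Assumption: management is notified for SEV-2 and SEV-1.
        notify_management=sev <= 2,
        # Assumption: SEV-1 gets a conference bridge, the rest use messaging.
        channel="conference bridge" if sev == 1 else "messaging",
    )


if __name__ == "__main__":
    for sev in range(1, 6):
        print(f"SEV-{sev}: {plan_response(sev)}")
```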
2) During an Incident
Incidents are best managed by maintaining a constant flow of information. Engineers are fond of exchanging text messages so that they can provide run books and advice to colleagues. A solution like OnPage is ideal for this use as it allows end-users the opportunity to not only exchange messages, but also see the status of the message sent. Has the message been delivered? Has the message been seen? Has the message been read?
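The status questions above amount to a small, forward-only state machine. The following Python sketch illustrates the idea. It is not OnPage's API, and `MessageStatus` and `TrackedMessage` are hypothetical names chosen for this example.

```python
from enum import IntEnum


class MessageStatus(IntEnum):
    """Statuses in the order a message passes through them."""
    SENT = 0
    DELIVERED = 1
    SEEN = 2
    READ = 3


class TrackedMessage:
    """A message whose status only ever moves forward."""

    def __init__(self, text: str) -> None:
        self.text = text
        self.status = MessageStatus.SENT

    def advance(self, new_status: MessageStatus) -> MessageStatus:
        """Apply a status update, ignoring stale or out-of-order ones."""
        if new_status > self.status:
            self.status = new_status
        return self.status


if __name__ == "__main__":
    msg = TrackedMessage("Run book for the storage outage is attached.")
    msg.advance(MessageStatus.DELIVERED)
    msg.advance(MessageStatus.READ)
    msg.advance(MessageStatus.SEEN)  # arrives late, so it is ignored
    print(msg.status.name)
```

Ignoring late updates matters in practice because delivery receipts can arrive out of order, and a message that has been read should not drop back to "seen".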
Additionally, it is important to see the status of the colleague one is supposed to be working with. Is that colleague logged on and available? If not, the engineer can call the colleague and get them up to speed.
Colleagues should also be able to see the status of the incident from the console. Has an engineer received the ticket for the incident and begun work on it? Has the incident not yet been assigned?
3) After an Incident
After a SEV-3, SEV-2 or SEV-1, teams should conduct a post-mortem analysis as the final step of the critical incident management process. In the analysis, one of the team’s engineers should write up details such as:
- What caused the incident?
- Which team members were called to resolve the incident?
- How long did it take for the team to get alerted on the issue?
- What resources were required to resolve the incident?
- What did the team do to resolve the issue?
- How long did it take to resolve the incident?
- What lessons did the team learn from resolving the issue?
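The questions above map naturally onto a structured record, which makes post-mortems easier to compare over time. The following is a minimal sketch in Python. The `PostMortem` class and its field names are hypothetical, the sample values are invented for illustration, and measuring resolution time from the start of the incident rather than from the alert is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class PostMortem:
    cause: str                   # what caused the incident
    responders: list[str]        # team members called in
    resources: list[str]         # resources required to resolve it
    resolution_steps: list[str]  # what the team did
    lessons: list[str]           # what the team learned
    started_at: datetime
    alerted_at: datetime
    resolved_at: datetime

    @property
    def time_to_alert(self) -> timedelta:
        """How long it took for the team to be alerted."""
        return self.alerted_at - self.started_at

    @property
    def time_to_resolve(self) -> timedelta:
        """How long the incident lasted, measured from its start (assumption)."""
        return self.resolved_at - self.started_at


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    report = PostMortem(
        cause="Disk failure on the file server",
        responders=["on-call engineer", "storage SME"],
        resources=["spare disk", "backup snapshot"],
        resolution_steps=["replaced disk", "restored from snapshot"],
        lessons=["add disk health monitoring"],
        started_at=datetime(2024, 1, 1, 10, 0),
        alerted_at=datetime(2024, 1, 1, 10, 5),
        resolved_at=datetime(2024, 1, 1, 11, 0),
    )
    print("Time to alert:", report.time_to_alert)
    print("Time to resolve:", report.time_to_resolve)
```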