What Is Critical Incident Management?

Critical incident management defines the alignment of company operations, services and functions to manage high-priority assets and situations. Coordinated response between multiple teams requires critical incident management.

The first step in defining a critical incident is to determine what type of situation the team is facing. Incidents are described by severity level, and IT teams usually use “SEV” definitions. Severities range from severity five (SEV-5), a low-priority incident, to severity one (SEV-1), a high-priority event. Anything more severe than a SEV-3 (that is, a SEV-2 or SEV-1) is considered a “major event” and becomes a critical incident requiring critical incident management.

Classifying and Responding to Priority Incidents

“Severity” determines the importance of an incident based on pre-defined guidelines. The intent is to guide responders on the type of response they can provide. The higher the severity, the riskier the decisions responders may have to make.

Each severity level below lists its definition, an example and the expected action.

SEV-1

Definition: Critical production issue that severely impacts use of the service. Often called a show stopper. This type of situation has no workarounds. Severity one issues require a dedicated resource to work on the issue.

Example: Internet service is down which prevents the application from running.

Action:

  • High-priority alert sent to service team
  • Multiple technicians are required to remedy the situation
  • Use high-priority messages when exchanging information with colleagues
  • Notify internal stakeholders

SEV-2

Definition: Critical situation where functionality is impacted or customer experience is seriously degraded. High impact to portions of the business and no reasonable workaround exists.

Example: Server is down preventing storage of new files or records.

Action:

  • High-priority alert sent to service team
  • Multiple technicians are required to remedy the situation
  • Use high-priority messages when exchanging information with colleagues

SEV-3

Definition: A partial loss of service with a medium-to-low impact on the business. Business is still able to function. Short-term workaround is available, but not scalable. Issue could escalate to SEV-2 if not managed properly.

Example: Part of a solution’s functionality is unavailable.

Action:

  • High-priority alert is sent to service team
  • Usually requires one engineer to work on issue although it is her top priority
  • Continue to monitor situation in case it needs escalation

SEV-4

Definition: Performance of systems is delayed but still functioning. Bug affects a small number of users. Acceptable workaround available.

Example: Website is slow in responding to requests.

Action:

  • Low-priority alert is sent to service team
  • Engineer works on the issue as her first priority
  • Monitors situation to ensure it doesn’t escalate

SEV-5

Definition: Systems experience minor issues that affect a small, limited number of users. SEV-5 issues are classified as low-priority events. They do not require immediate attention and resolution.

Example: Users do not remember login credentials.

Action:

  • Low-priority alert is sent to service team
  • Engineer works on the issue when possible
  • Occasionally monitors situation to ensure it doesn’t escalate
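
The severity levels above can be sketched in code. The following is a minimal illustration in Python, not part of any particular incident management product: the names `Severity`, `is_critical` and `escalate` are hypothetical, and the rule that SEV-1 and SEV-2 count as critical follows the definitions in this article.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels from this article; a lower number is more severe."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4
    SEV5 = 5


def is_critical(severity: Severity) -> bool:
    """Return True for SEV-1 and SEV-2, which this article treats as critical."""
    return severity <= Severity.SEV2


def escalate(severity: Severity) -> Severity:
    """Return the next more severe level, stopping at SEV-1."""
    return Severity(max(severity - 1, Severity.SEV1))
```

For instance, a SEV-3 that is not managed properly would move to `escalate(Severity.SEV3)`, which is `Severity.SEV2` and therefore critical.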

How CIM Differs From Incident Management

Incident management defines the orchestration of personnel, technology and processes to resolve IT service interruptions. It is not fundamentally different from critical incident management, and at times the terms are used interchangeably. However, critical incident management differs from ordinary incident management in the severity of the incidents it covers. Much of the change is one of mindset.

An incident management situation might correspond to a SEV-5 or SEV-4 on the chart above. A critical incident management situation, by contrast, describes a SEV-2 or a SEV-1. In either of these latter two situations, the decision-making process changes. Actions might be riskier during a SEV-1 given the importance of what is at stake.

At times, it can be difficult for team members to understand the difference between critical incident management and incident management. That is why it is important to have experienced team managers, who can help shepherd the thinking of the team.

The Cost of Downtime

Proper critical incident management requires understanding the actual impact of downtime. According to a January 2016 article in Network Computing on the high price of IT downtime, organizations face:

“An average of five downtime events each month, with each downtime event being expensive indeed: from $1 million a year for a typical midsize company to more than $60 million for a large enterprise.”

The major cause of this downtime is equipment failure, which accounts for nearly 40 percent of downtime. The second most frequent cause is human error, which accounts for 25 percent.

In traditional workflows, help desks or service desks are alerted to downtime incidents via pagers or emails. The use of email alerts assumes, falsely, that an email will get the attention of the appropriate data center manager or service desk engineer. Unfortunately, critical messages often get buried in email inboxes. Instead, IT support teams need an incident management platform that alerts them immediately.

Critical Incident Management Best Practices

An organized approach to addressing and managing an incident requires teams to not just solve the incident, but to handle the situation in a way that limits damage and reduces recovery time and costs. Critical to the success of this process is establishing protocols for managing IT roles not just during an incident, but also before and after the urgent event.

1: Critical Incident Preparation

Establish a workflow for how the IT operations team handles incidents, so everyone knows their role. For example, the help desk might be the first to receive the incident. It then either creates a ticket and sends it to the proper service desk, or uses the persistent alerting feature of its incident management platform to alert the proper service desk based on the problem. The help desk uses the high-priority alerting feature if the incident is SEV-3 through SEV-1, and low-priority alerting for a SEV-4 or SEV-5.
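
The help desk routing step described above could look roughly like the following Python sketch. Everything named here is an assumption made for illustration: the `SERVICE_DESKS` mapping, the desk names, the `Alert` type and the `route_incident` function are hypothetical, and no real platform's API is being shown. The priority rule treats SEV-1 through SEV-3 as high priority and SEV-4 and SEV-5 as low priority, in line with the severity definitions earlier in this article.

```python
from dataclasses import dataclass

# Hypothetical mapping from problem category to the responsible service desk.
SERVICE_DESKS = {
    "network": "network-ops",
    "storage": "storage-team",
    "web": "web-team",
}


@dataclass
class Alert:
    service_desk: str
    priority: str  # "high" or "low"
    summary: str


def route_incident(category: str, severity: int, summary: str) -> Alert:
    """Pick the service desk from the problem category and the alert priority from the severity."""
    if severity not in range(1, 6):
        raise ValueError("severity must be between 1 (SEV-1) and 5 (SEV-5)")
    # Unknown categories stay with the help desk (an assumption of this sketch).
    desk = SERVICE_DESKS.get(category, "help-desk")
    priority = "high" if severity <= 3 else "low"
    return Alert(service_desk=desk, priority=priority, summary=summary)
```

A slow website reported as SEV-4 would be routed with `route_incident("web", 4, "Website is slow")`, producing a low-priority alert for the hypothetical `web-team` desk.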

Once the proper service team is alerted, they must have a protocol on how to manage the situation. Do they call in a subject matter expert (SME) or can they handle it internally? If it is a SEV-2 or SEV-1, the protocol might be to contact the SME to ensure they are following best practices. The team might also want to consider how they will communicate with one another while they work on resolving the issue. Will they use OnPage to exchange messages or will they hop on a conference bridge?

It is also important to determine when to notify management that an issue has occurred. Again, this should all run according to a prescribed script. There should be no guesswork on what role everyone needs to play during a high-priority incident.
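
One way to remove the guesswork is to write the script down as data or code that the team can review in advance. The sketch below is only an illustration under stated assumptions: the function name `response_script` is hypothetical, the rule for contacting an SME on SEV-1 and SEV-2 follows the example protocol above, and notifying stakeholders only on SEV-1 is an assumption taken from the SEV-1 action list. Each team would set its own thresholds.

```python
def response_script(severity: int) -> list[str]:
    """Return the ordered steps a team agreed to follow for a given severity (1 to 5)."""
    if severity not in range(1, 6):
        raise ValueError("severity must be between 1 (SEV-1) and 5 (SEV-5)")
    steps = ["Alert the responsible service team"]
    if severity <= 2:
        # Example protocol from the text: bring in the subject matter expert.
        steps.append("Contact the subject matter expert (SME)")
        steps.append("Open the agreed channel (messaging or conference bridge)")
    if severity == 1:
        # Assumption of this sketch: stakeholders and management are told on SEV-1 only.
        steps.append("Notify internal stakeholders and management")
    return steps
```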

2: During an Incident

Incidents are best managed by maintaining a constant flow of information. Engineers are fond of exchanging text messages so that they can provide runbooks and advice to colleagues. A solution like OnPage is ideal for this use as it allows end users the opportunity to not only exchange messages, but also see the status of the message sent. Has the message been delivered? Has the message been seen? Has the message been read?
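
The message status questions above amount to a small state progression. The following Python sketch shows one hypothetical way to model it. It does not represent OnPage's actual API: the `MessageStatus` enum and the `advance` function are invented for illustration.

```python
from enum import IntEnum


class MessageStatus(IntEnum):
    """Hypothetical message states, ordered from least to most acknowledged."""
    SENT = 0
    DELIVERED = 1
    SEEN = 2
    READ = 3


def advance(current: MessageStatus, update: MessageStatus) -> MessageStatus:
    """Apply a status update, ignoring stale updates that would move the status backwards."""
    return max(current, update)
```

Keeping the states ordered means a late "delivered" receipt that arrives after a "read" receipt does not downgrade what the sender sees.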

Additionally, it is important to see the status of the colleague one is supposed to be working with. Is that colleague logged in and available? If they are not, then the engineer can call the colleague and get her up to speed.

Colleagues should also be able to see the status of the incident from the console. Has an engineer received the ticket for the incident and begun work on it? Has the incident not yet been assigned?
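
The console questions above can be answered from very little ticket data. As a rough sketch, with the `Ticket` type, its fields and the `console_status` function all being hypothetical names chosen for this example:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Ticket:
    incident_id: str
    assignee: Optional[str] = None  # None until an engineer receives the ticket
    work_started: bool = False


def console_status(ticket: Ticket) -> str:
    """Summarize where an incident stands for colleagues watching the console."""
    if ticket.assignee is None:
        return "unassigned"
    if not ticket.work_started:
        return f"assigned to {ticket.assignee}, work not started"
    return f"in progress with {ticket.assignee}"
```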

3: After an Incident

After a SEV-3, SEV-2 or SEV-1, teams should conduct a post-incident analysis as the final step of the critical incident management process. In the analysis, one of the team’s engineers should write up details such as:

  • What caused the incident?
  • Which team members were called to resolve the incident?
  • How long did it take for the team to get alerted on the issue?
  • What resources were required to resolve the incident?
  • What did the team do to resolve the issue?
  • How long did it take to resolve the incident?
  • What lessons did the team learn from resolving the issue?
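
The questions above map naturally onto a structured record, which makes reports comparable across incidents. The Python sketch below is one possible shape, not a prescribed template: the `PostIncidentReport` class and its field names are assumptions, and the two durations are measured from the time the incident started.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class PostIncidentReport:
    cause: str                    # What caused the incident?
    responders: list[str]         # Which team members were called?
    started_at: datetime          # When the incident began
    alerted_at: datetime          # When the team was alerted
    resolved_at: datetime         # When the incident was resolved
    resources: list[str] = field(default_factory=list)  # Resources required
    resolution: str = ""          # What the team did to resolve the issue
    lessons: list[str] = field(default_factory=list)    # Lessons learned

    @property
    def time_to_alert(self) -> timedelta:
        """How long it took for the team to get alerted."""
        return self.alerted_at - self.started_at

    @property
    def time_to_resolve(self) -> timedelta:
        """How long it took to resolve the incident."""
        return self.resolved_at - self.started_at
```

Recording timestamps rather than hand-written durations lets the report compute the time to alert and the time to resolve consistently.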
