Major Incident Management

What’s an Incident?

ITIL defines an incident as an unplanned interruption to a service or reduction in the quality of a service. Incidents have four priority levels: Critical, High, Medium and Low. Major incidents are typically classified as high-priority events based on the urgency and business impact of the situation.
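As a rough illustration of how urgency and business impact can combine into the four priority levels, here is a minimal sketch. The three-step `Level` scale and the score thresholds are assumptions made for this example; they are not prescribed by ITIL, and each organization defines its own matrix.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-step scale for both urgency and impact."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def priority(urgency: Level, impact: Level) -> str:
    """Map urgency and business impact to one of four priority levels."""
    score = int(urgency) + int(impact)  # ranges from 2 to 6
    if score == 6:
        return "Critical"
    if score == 5:
        return "High"
    if score >= 3:
        return "Medium"
    return "Low"


print(priority(Level.HIGH, Level.HIGH))   # Critical
print(priority(Level.LOW, Level.MEDIUM))  # Medium
```

In this sketch, only incidents that are both highly urgent and highly impactful reach the Critical level.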

What's a Major Incident?

A major incident disrupts critical business operations. It can bring an organization’s entire operation to a standstill and hurt its revenue and bottom line. It can also have far-reaching consequences for the company’s reputation.

Major incidents commonly include:

  • Nonfunctional or unresponsive eCommerce websites
  • Client access portal outages
  • Severe outages in airline check-in processes

Time is critical during a major incident, and an organization’s ability to bring business back to normal makes all the difference. Major incident management (MIM) helps distinguish a true major incident from a routine outage. The goal of MIM is to manage the incident life cycle and remediate the issue while minimizing disruption.

The Cloudflare global outage of 2019 is an example of a major incident. A minor change to a firewall rule resulted in a worldwide outage. Per Cloudflare’s estimates, systems were down for approximately 27 minutes, disrupting access to a large number of websites that depend on its network.

The Four Stages of Major Incident Management

The Process in Detail ...

Stage 1: Detection

Identifying the Major Incident

The first step in the MIM process is identifying a major incident. Organizations encounter incidents every few minutes, so the challenge is distinguishing major incidents from the rest.

A key indicator of a major incident is that it affects many users and disrupts one or more of the business’s critical services. Incidents are often reported to a service desk technician or detected by monitoring tools that automatically trigger notifications when anomalies are identified.
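The indicator described above can be sketched as a simple triage rule. The service names, the user threshold and the `Incident` fields below are all hypothetical values chosen for illustration; a real deployment would tune them to its own service catalog.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Assumed values for this sketch only.
CRITICAL_SERVICES = {"checkout", "client-portal", "check-in"}
USER_THRESHOLD = 500


@dataclass
class Incident:
    title: str
    affected_users: int
    affected_services: set[str] = field(default_factory=set)


def is_major(incident: Incident) -> bool:
    """Flag incidents that hit a critical service and affect many users."""
    hits_critical = bool(incident.affected_services & CRITICAL_SERVICES)
    return hits_critical and incident.affected_users >= USER_THRESHOLD


print(is_major(Incident("Checkout errors", 1200, {"checkout"})))     # True
print(is_major(Incident("Printer offline", 3, {"office-printer"})))  # False
```

A rule this simple will not catch every case, so it is best treated as a first filter that a service desk technician can override.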

Stakeholder Communication

When a major incident is detected, the relevant stakeholders need to be engaged to contain the situation and minimize business losses. There are three key groups that must be informed of the situation:

  • Incident Response Team: The team takes control of and manages the critical situation. It is important that teams have the right tools in place to detect issues and alert engineers to severe outages.
  • Senior Management: The command manager must send timely situational reports (SITREPs) with timelines to senior management. Incident alert management platforms can streamline this process through pre-configured recipient groups and email templates.
  • Users: Being transparent and keeping users informed about a major incident helps alleviate stress and anxiety. It also reflects well on the company, solidifying trust and fostering better relationships with customers. Mass messaging platforms can be deployed to broadcast timely updates to users over multiple messaging channels.
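One way to picture pre-configured recipient groups and templates is a small lookup of message templates, one per stakeholder group. The group keys, template wording and field names here are assumptions for this sketch and do not reflect any particular platform.

```python
from string import Template

# Hypothetical templates, one per stakeholder group.
TEMPLATES = {
    "response_team": Template("PAGE: $title. Join the incident bridge now."),
    "senior_management": Template("SITREP $time: $title. Status: $status."),
    "users": Template("We are aware of an issue: $title. Status: $status."),
}


def build_notifications(title: str, status: str, time: str) -> dict:
    """Render one message per stakeholder group from shared incident fields."""
    fields = {"title": title, "status": status, "time": time}
    return {group: tpl.substitute(fields) for group, tpl in TEMPLATES.items()}


messages = build_notifications("Checkout outage", "investigating", "10:15 UTC")
for group, text in messages.items():
    print(f"{group}: {text}")
```

Keeping the wording in templates means the same incident facts reach each audience in a form suited to it, without anyone drafting messages under pressure.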

Stage 2: Orchestration

Assemble the major incident team to remediate the issue. The team should include engineers, incident commanders and other key stakeholders, such as external consultants. All parties work to minimize damage and resolve the issue.

Centralized Communication

In critical situations, email and SMS are ineffective messaging channels. Messages are often missed or left unaddressed, and neither channel can elevate high-priority messages. Incident alert management applications provide real-time, secure messaging for team collaboration.

Stage 3: Resolution

The resolution stage is reached when the outage has subsided and systems have been restored to full functionality.

Once resolution is achieved, the team should document the entire incident management process for future reference. If necessary, carefully implement process changes and ensure that other dependencies are not affected.

Stage 4: Post-Incident Analysis

Post-incident analysis measures the performance of the resources and systems in place. Incident managers conduct post-incident reviews to gain insight into the situation and the resolution process. This helps organizations become better prepared for future incidents. Message audit trails can be analyzed to learn how the team responded to the incident.
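As a sketch of what analyzing an audit trail might look like, the snippet below computes time to acknowledge and time to resolve from timestamped events. The event names and timestamps are made-up sample data, and the `(timestamp, event)` layout is an assumption; real audit trails will differ by tool.

```python
from datetime import datetime

# Made-up sample trail: (ISO timestamp, event name).
AUDIT_TRAIL = [
    ("2024-01-01T10:00:00", "alert_sent"),
    ("2024-01-01T10:04:00", "acknowledged"),
    ("2024-01-01T10:31:00", "resolved"),
]


def minutes_between(trail, start_event: str, end_event: str) -> float:
    """Minutes from the first start_event to the first end_event."""
    first_seen = {}
    for timestamp, event in trail:
        # Keep only the earliest occurrence of each event.
        first_seen.setdefault(event, datetime.fromisoformat(timestamp))
    delta = first_seen[end_event] - first_seen[start_event]
    return delta.total_seconds() / 60


print(minutes_between(AUDIT_TRAIL, "alert_sent", "acknowledged"))  # 4.0
print(minutes_between(AUDIT_TRAIL, "alert_sent", "resolved"))      # 31.0
```

Tracking these two numbers across incidents gives a review something concrete to compare against.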

Adopt a Major Incident Management Tool

While you can catalog an incident using your IT service management (ITSM) ticketing tools, there is very little you can do to manage the incident by simply using tickets. Ticketing tools only allow for unprioritized SMS and email incident alerts. This limitation inhibits team collaboration and slows down the incident remediation process.

Integrate ticketing tools with an incident alert management system. This way, tickets can be converted into intelligent alerts that are sent to response teams whenever there is a critical incident. Teams can then collaborate on tickets and bring additional responders into the conversation thread if required.
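Converting a ticket into an alert can be thought of as a small mapping step. The ticket field names (`id`, `priority`, `summary`, `description`) and the `Alert` shape below are hypothetical; they stand in for whatever the ticketing tool and alerting system actually expose.

```python
from dataclasses import dataclass

# Ticket priorities that should trigger a high-priority alert (assumed).
HIGH_PRIORITY_LEVELS = {"Critical", "High"}


@dataclass
class Alert:
    subject: str
    body: str
    high_priority: bool
    ticket_id: str


def ticket_to_alert(ticket: dict) -> Alert:
    """Build an alert from a ticket, keeping a link back to the ticket."""
    return Alert(
        subject=f"[{ticket['priority']}] {ticket['summary']}",
        body=ticket.get("description", ""),
        high_priority=ticket["priority"] in HIGH_PRIORITY_LEVELS,
        ticket_id=ticket["id"],
    )


alert = ticket_to_alert(
    {"id": "TCK-1", "priority": "Critical", "summary": "Checkout is down"}
)
print(alert.subject)        # [Critical] Checkout is down
print(alert.high_priority)  # True
```

Carrying the ticket ID on the alert lets responders move between the conversation and the ticket without searching.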

The incident alert management tool needs to be more than just an alerting service. Here are some requirements of any incident alert management solution:

  • High-priority notifications that bypass the silent switch on mobile devices
  • Secure messaging to aid team communication
  • Ability to integrate with ticketing tools
  • Persistent, distinguishable mobile alerts
  • Digital on-call schedulers
  • Alert escalation policies
  • Fail-over options if an alert escalation fails
  • High- and low-priority alerting
  • Ability to track incoming and outgoing alerts and messages
  • Reporting to summarize and gain insights into historical data
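To make the escalation and fail-over requirements above more concrete, here is a minimal simulation of an escalation policy. The responder names are placeholders, and the `acknowledged` callback stands in for however a real system would wait for and detect an acknowledgement; this is a sketch of the idea, not any product's behavior.

```python
from typing import Callable, List, Tuple


def escalate(
    chain: List[str],
    acknowledged: Callable[[str], bool],
    failover: str,
) -> Tuple[str, List[str]]:
    """Alert each responder in order; fall back if nobody acknowledges.

    Returns the final owner of the alert and everyone who was notified.
    """
    notified = []
    for responder in chain:
        notified.append(responder)
        if acknowledged(responder):
            return responder, notified
    # Escalation chain exhausted: hand the alert to the fail-over contact.
    notified.append(failover)
    return failover, notified


# Example: the primary on-call misses the alert, the secondary picks it up.
owner, notified = escalate(
    ["primary", "secondary"], lambda r: r == "secondary", "it-manager"
)
print(owner)     # secondary
print(notified)  # ['primary', 'secondary']
```

In practice, the chain would typically come from the on-call schedule, so that the order of escalation follows whoever is on duty.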

OnPage Customer Testimonial

“We service a large number of clients, and with OnPage, we are able to respond very quickly to user issues. The alerts contain all the information and the canned response feature allows us to reply quickly with predefined messages. If I need to get in touch with another technician, I can easily send a message directly from the [OnPage] app.”

Enterprise IT | OnPage Customer
