What is Incident Management?
Incident management refers to the IT processes and people put in place to identify, analyze and correct incidents that cause company downtime or service interruption.
The professionals who handle these incidents are part of an IT incident response team. This team is usually directed by an incident manager. The goal is to resolve major incidents as quickly as possible.
MTTR - How important is incident response and resolution speed?
Speed is of the utmost importance for IT incident response teams, and Mean Time To Resolution (MTTR) is the metric used to measure it. If an IT team doesn’t know how long it takes to fix issues, it can’t improve performance.
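As a rough illustration, MTTR can be computed as the average time between detecting an incident and resolving it. The sketch below assumes each incident is recorded as a pair of timestamps; the function name and the sample data are hypothetical and not taken from any particular tool.

```python
from datetime import datetime, timedelta

def mean_time_to_resolution(incidents):
    """Average time between detection and resolution.

    `incidents` is a list of (detected_at, resolved_at) datetime pairs.
    """
    if not incidents:
        raise ValueError("need at least one resolved incident")
    # Sum the individual resolution times, starting from a zero duration.
    total = sum((resolved - detected for detected, resolved in incidents), timedelta())
    return total / len(incidents)

# Hypothetical sample data, for illustration only.
sample_incidents = [
    (datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 1, 2, 30)),    # resolved in 30 minutes
    (datetime(2024, 1, 3, 14, 0), datetime(2024, 1, 3, 15, 30)),  # resolved in 90 minutes
]
mttr = mean_time_to_resolution(sample_incidents)
print(mttr)  # 1:00:00
```

Tracking this number over time, rather than as a one-off figure, is what lets a team see whether process changes are actually helping.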
There are many roadblocks to minimizing MTTR.
• Inconsistent data channel connectivity. As an example, let’s say there’s an IT team in India as well as in the US. The US-based team should cover the hours not worked in India, and vice versa. Yet because the data channel is expensive, the team in India turns it off and is reachable only while in the office. Since the India team is delayed in receiving and responding to messages, MTTR increases.
• Lack of effective monitoring tools. Without quality monitoring solutions and processes, it will take more time than necessary to do root cause analysis of the incident. Techs can also use monitoring tools to see the change in data as they apply fixes and tweaks to ensure that they are headed in the right direction towards resolution.
• No escalation. When an engineer is alerted to a critical incident, he or she may want to escalate the issue if the scope of the problem is larger than originally anticipated. Resolving a problem effectively often requires bringing in other members of the team. If there’s no fast way to alert the team or determine who’s available, the incident will take much longer to fix.
• Lack of audit trails. If no trail exists of who was alerted based on what criterion, management is unable to see incident reports with a history of the cause of the most recent alert and who was notified and in which order. This is a missed opportunity to help the IT team discuss their performance during a post-mortem review and work on continuously improving MTTR.
• No scheduling tools. Management cannot coordinate who’s to be alerted based on the type of incident. Instead, the whole team is alerted regardless of their ability to provide insight or assistance.
• Excessive alerting. The team receives too many false positives, inevitably begins to ignore alerts and eventually starts to miss important ones. Alert fatigue not only affects MTTR, but also leads to employee burnout and high employee churn rates in the IT organization.
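Two of the roadblocks in this list, missing scheduling tools and missing audit trails, can be illustrated together. The following is a minimal sketch, assuming a simple in-memory on-call schedule keyed by incident type; the names, schedule entries and record fields are all hypothetical.

```python
from datetime import datetime, timezone

# Hypothetical on-call schedule: incident type -> engineers in notification order.
ON_CALL = {
    "database": ["dana", "lee"],
    "network": ["sam"],
}
# Fallback when no one is scheduled for an incident type (also hypothetical).
DEFAULT_RECIPIENTS = ["duty-manager"]

# Append-only record of who was notified, for which incident, and in what order.
audit_trail = []

def route_alert(incident_type, summary, now=None):
    """Notify only the people scheduled for this incident type and log each step."""
    recipients = ON_CALL.get(incident_type, DEFAULT_RECIPIENTS)
    now = now or datetime.now(timezone.utc)
    for order, person in enumerate(recipients, start=1):
        audit_trail.append({
            "time": now.isoformat(),
            "incident_type": incident_type,
            "summary": summary,
            "notified": person,
            "order": order,
        })
    return recipients

notified = route_alert("database", "replica lag above threshold")
print(notified)          # ['dana', 'lee']
print(len(audit_trail))  # 2
```

Because every notification is recorded with its order and incident type, the same list can later be reviewed in a post-mortem instead of being reconstructed from memory.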
Effective incident management
To be successful, IT incident teams must monitor and manage deviations from, and threats to, the standard operation of services to ensure that they meet service level agreements (SLAs). Even the best IT department will eventually experience a critical incident. How IT reacts to resolve incidents is a key driver of MTTR as well as customer satisfaction.
Not everything should be an emergency. Management should take steps to reduce the noise so IT teams know which alerts truly require action at 2 a.m. Too many events and alerts, many of them false positives, will reduce the effectiveness of IT operations, and the team will start to overlook critical ones. To reduce noise, it’s important to determine the few occurrences, metrics and levels that command a high priority response. The rest can be classified as lower priority and do not require immediate action.
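One way to picture this is a short, explicit list of metrics and levels that qualify as high priority, with everything else defaulting to low. The sketch below is only an illustration: the metric names and threshold values are assumptions, and a real team would choose its own.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    metric: str
    value: float

# Hypothetical thresholds: only these metrics, at or above these levels,
# justify waking someone up. Everything else is treated as low priority.
HIGH_PRIORITY_THRESHOLDS = {
    "error_rate_percent": 5.0,
    "disk_used_percent": 95.0,
}

def classify(alert):
    """Return "high" only for the few metrics and levels listed above."""
    threshold = HIGH_PRIORITY_THRESHOLDS.get(alert.metric)
    if threshold is not None and alert.value >= threshold:
        return "high"
    return "low"

print(classify(Alert("error_rate_percent", 7.5)))  # high
print(classify(Alert("error_rate_percent", 1.0)))  # low
print(classify(Alert("cpu_percent", 99.0)))        # low (metric not on the list)
```

Keeping the high priority list deliberately short is the point: an unlisted metric never pages anyone by default.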
High priority alerts need to be distinctive and get immediate attention. That means they should not be transmitted via email, instant messaging or text messaging, where they’ll be buried under a multitude of unimportant content. High priority alerts should be delivered on a device that is always available and convenient: a smartphone.
Make it easy to escalate alerts
To ensure that alerts are never missed, the workflow must include a way to automatically escalate the notification if the tech assigned to the incident does not respond within a predetermined length of time. Some IT teams have found that by incorporating this technology and process, the number of missed alerts has been reduced quickly and dramatically (in many cases to zero) and responses to critical incidents have sped up by 300% or more.
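The escalation rule described here can be sketched as a walk down an ordered chain of people, moving on whenever someone fails to acknowledge within the timeout. This is a simplified simulation rather than a real paging system: the function, the chain and the acknowledgement delays are hypothetical.

```python
def escalate(chain, ack_delays, timeout_seconds):
    """Walk the escalation chain until someone acknowledges in time.

    chain:           ordered list of people to notify
    ack_delays:      person -> seconds they take to acknowledge
                     (a missing entry means they never respond)
    timeout_seconds: how long to wait before escalating to the next person

    Returns (responder, notified), where responder is None if nobody answered.
    """
    notified = []
    for person in chain:
        notified.append(person)
        delay = ack_delays.get(person)
        if delay is not None and delay <= timeout_seconds:
            return person, notified
    return None, notified

# Hypothetical scenario: the primary never responds, the secondary answers in 2 minutes.
chain = ["primary", "secondary", "manager"]
responder, notified = escalate(chain, {"secondary": 120}, timeout_seconds=300)
print(responder)  # secondary
print(notified)   # ['primary', 'secondary']
```

In a live system the delays would come from real acknowledgements rather than a lookup table, but the ordering and timeout logic would follow the same shape.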
Invest in automation as well as the right processes
Most IT teams have an abundance of tools, so a lack of solutions for automation is not as much of an issue as determining which ones are crucial in a time of need.
If a task can be automated, there is no reason to alert an engineer to the event. For example, if backups can be automated, the IT team should adopt the technology and tools that make this happen, saving time and labor hours and avoiding the potential for human error.
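The idea of automating first and alerting only when automation cannot help might look like the following minimal sketch. The event types, the remediation table and the `alert` callback are assumptions made for illustration.

```python
def handle_event(event, remediations, alert):
    """Try an automated fix first; alert a person only if none exists or it fails."""
    fix = remediations.get(event["type"])
    if fix is not None:
        try:
            if fix(event):
                return "auto-resolved"
        except Exception:
            pass  # the automated fix crashed, so fall through to a human
    alert(event)
    return "alerted"

# Hypothetical remediation table: event type -> function returning True on success.
remediations = {"backup_due": lambda event: True}
alerts = []  # stands in for a real paging channel

print(handle_event({"type": "backup_due"}, remediations, alerts.append))   # auto-resolved
print(handle_event({"type": "unknown_crash"}, remediations, alerts.append))  # alerted
print(len(alerts))  # 1
```

Only the event with no automated remedy reaches an engineer, which keeps human attention for the problems where it adds value.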
Engineers should really only be assigned to work on a problem where their knowledge can add value. This is particularly true for issues picked up by monitoring and alerting tools. In fact, precise identification of problems is the first step in the incident management workflow.
OnPage is the ideal incident alert management solution
OnPage is a SaaS-based incident alert management system hosted in secure, SSAE-16 compliant hosting facilities across the USA. With OnPage, IT professionals:
- Get instant visibility and feedback on incident status
- Track alert delivery, ticket status and responses to tickets
- Depend on rock-solid reliability – a must for those who need to elevate critical incidents and ensure fast resolution
OnPage provides powerful integrations with mission-critical systems through the industry’s easiest integration framework.