What is Incident Management?
Incident Management refers to the processes and people put in place to identify, analyze and correct incidents that cause company downtime or service interruption.
The people who handle these incidents are part of an incident response team or an incident management team. This team is usually overseen by an incident manager. The goal of incident management is to restore the service as quickly as possible.
How important is speed in Incident Management?
Speed is of the utmost importance within incident management and the Mean Time To Resolution (MTTR) needs to be recorded and measured. If you don’t know how long it takes for you or your team to fix issues, you cannot improve on that time.
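As a rough illustration of what recording and measuring MTTR can look like, the sketch below averages the time between when each incident was opened and when it was resolved. The function name `mean_time_to_resolution` and the sample timestamps are assumptions made for this example, not part of any particular tool.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average (resolved - opened) over incidents that have been resolved.

    `incidents` is a list of (opened, resolved) datetime pairs; a resolved
    value of None marks an incident that is still open and is skipped.
    """
    durations = [resolved - opened
                 for opened, resolved in incidents
                 if resolved is not None]
    if not durations:
        return None  # nothing resolved yet, so there is no MTTR to report
    return sum(durations, timedelta()) / len(durations)


# Hypothetical sample data for illustration only.
incidents = [
    (datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 1, 2, 30)),    # 30 minutes
    (datetime(2024, 1, 3, 14, 0), datetime(2024, 1, 3, 15, 30)),  # 90 minutes
    (datetime(2024, 1, 5, 9, 0), None),                           # still open
]

mttr = mean_time_to_resolution(incidents)
print(mttr)  # 1:00:00
```

Tracking this number per week or per month is what makes it possible to tell whether a process change actually shortened resolution time.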
Issues impeding fast Incident Management
While the importance of MTTR is generally acknowledged, the impediments to its effective management are many.
• Data channel connectivity. Consider, for example, a team in India whose working hours are meant to complement those of your U.S.-based team, and vice versa. Yet due to the high cost of the data channel, your team in India turns their data channel off and is only reachable when they are in the office. Since your India team is delayed in receiving and responding to messages, MTTR increases.
• Lack of effective monitoring tools. There is often no baseline for how your system should operate. Rather than following ITIL's best practices for aligning IT with business needs, teams rely on homegrown tools to monitor the system and establish a baseline, and ITSM best practices are ignored. Without monitoring tools, or with tools that are not robust enough, you cannot truly understand the state of the systems you monitor.
• No escalation. Even if an engineer is alerted to the incident, they have no easy way to escalate the issue once they realize the scope of the problem. Effective resolution often requires bringing in other members of the team.
• Audit trails. No trail exists of who was alerted based on what criterion. Looking back, management is unable to see a history of the cause of the most recent alert and who was notified and in which order.
• Scheduling tools. Management cannot coordinate who’s to be alerted based on the type of incident. Instead, the whole team is alerted regardless of their ability to provide insight or assistance.
• Excessive alerting. The team receives too many false positives, inevitably begins to ignore alerts, and eventually starts to miss important ones.
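Several of the impediments above (no escalation, no audit trail, no routing by incident type) can be addressed together. The following is a minimal sketch, assuming a hypothetical `EscalationPolicy` class and made-up responder names: it routes each incident type to an ordered chain of responders, allows escalation to the next person in the chain, and records every page in an audit log. It is not how any specific product implements these features.

```python
from dataclasses import dataclass, field


@dataclass
class EscalationPolicy:
    # Maps an incident type to an ordered list of responders.
    # A "default" entry is required as the fallback chain.
    routes: dict
    audit_log: list = field(default_factory=list)

    def page(self, incident_type, level, reason):
        """Pick the responder at `level` for this incident type and log it."""
        chain = self.routes.get(incident_type, self.routes["default"])
        # Past the end of the chain, keep paging the last responder.
        responder = chain[min(level, len(chain) - 1)]
        self.audit_log.append({
            "incident_type": incident_type,
            "level": level,
            "responder": responder,
            "reason": reason,
        })
        return responder


# Hypothetical routing table for illustration only.
policy = EscalationPolicy(routes={
    "database": ["dba_on_call", "dba_lead"],
    "default": ["ops_on_call", "ops_manager"],
})

first = policy.page("database", 0, "replication lag above threshold")
second = policy.page("database", 1, "no acknowledgement from first responder")
fallback = policy.page("network", 0, "packet loss detected")

for entry in policy.audit_log:
    print(entry)
```

Because the log is appended in order, it answers the questions management typically asks afterwards: who was notified, why, and in which sequence.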
How to Improve Incident Management
Effective IT incident management is concerned with deviations from, and threats to, the standard operation of services. Over time, even the best IT department will experience incidents. How IT reacts to incidents is a key driver of MTTR (mean time to resolution) as well as customer satisfaction.
Effective incident management is all about reducing the noise so IT teams know which alerts truly require a reaction at 2 a.m. Too many events and alerts, many of them false positives, reduce the effectiveness of IT operations, because the team starts to overlook the important ones. Consequently, it is important to learn which statistics are worth tracking.
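One common way to reduce noise is to suppress repeats of the same alert within a short window. The sketch below assumes a hypothetical `AlertDeduplicator` class, with alert keys and timestamps (in seconds) supplied by the caller. Note that it only removes duplicates; it does not detect false positives, which usually requires tuning the monitoring thresholds themselves.

```python
class AlertDeduplicator:
    """Suppress repeats of the same alert key inside a time window."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._last_sent = {}  # alert key -> time of last notification

    def should_notify(self, alert_key, now):
        last = self._last_sent.get(alert_key)
        if last is not None and now - last < self.window_seconds:
            return False  # same alert fired recently, stay quiet
        self._last_sent[alert_key] = now
        return True


# Hypothetical alert stream: (alert key, time in seconds).
dedup = AlertDeduplicator(window_seconds=300)
stream = [("disk_full", 0), ("disk_full", 60), ("cpu_high", 60), ("disk_full", 300)]
decisions = [dedup.should_notify(key, t) for key, t in stream]
print(decisions)  # [True, False, True, True]
```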
Invest in Automation
Effective IT incident management requires effective use of tools. Most IT teams have an abundance of tools, so the issue is less about having them than about determining which ones are crucial in a time of need.
If a task can be automated, then there is no reason an engineer needs to be alerted to the event. For example, if backups can be automated, the IT team should adopt the technology and tools that make this possible. Automation saves money on man-hours and reduces the potential for mistakes.
Engineers should really only be brought in to work on a problem where their knowledge can add value. This is particularly true for issues picked up by monitoring and alerting tools. Indeed, effective identification of problems is the first step in successful incident management. Effective alerting brings the need for incident management to the forefront.
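The idea of alerting an engineer only when automation cannot handle an event can be sketched as follows. The function `handle_event`, the remediation functions, and the event types are all hypothetical names chosen for this example; a real setup would call actual runbook scripts and a real paging service.

```python
def handle_event(event, remediations, alert_engineer):
    """Try an automated fix first; page a human only if that is not enough."""
    action = remediations.get(event["type"])
    if action is not None:
        try:
            if action(event):
                return "auto-resolved"
        except Exception:
            pass  # the automated fix itself failed, so fall through to a human
    alert_engineer(event)
    return "escalated"


# Hypothetical remediation actions for illustration only.
def restart_service(event):
    return True  # pretend the restart fixed the problem


def failing_cleanup(event):
    raise RuntimeError("cleanup script failed")


remediations = {"service_down": restart_service, "disk_full": failing_cleanup}
paged = []  # stands in for a real paging system

results = [
    handle_event({"type": kind}, remediations, paged.append)
    for kind in ("service_down", "disk_full", "unknown_error")
]
print(results)     # ['auto-resolved', 'escalated', 'escalated']
print(len(paged))  # 2
```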
OnPage is the perfect tool for Incident Management
OnPage is a SaaS-based Incident Management system hosted in secure, SSAE-16 compliant hosting facilities across the USA.
- Get instant visibility and feedback on incident status.
- Track alert delivery, ticket status and responses to tickets.
- Rock Solid Reliability – A must for those who need to elevate critical incidents and ensure fast resolution.
OnPage provides powerful integrations with mission-critical systems through the industry’s easiest integration framework.
OnPage has written several relevant whitepapers that can assist you in understanding the complexities of an effective IT on-call policy.