Monitoring and Alerting Best Practices
Not all alerts are created equal! Most IT teams have adopted some form of IT alerting, but their practices often fall well short of monitoring and alerting best practices. It’s not enough to just have an alerting tool. Like a monitoring tool, an alerting tool that is left uncalibrated will simply produce a sea of noise. Instead, teams should calibrate alerts so that each one is meaningful.
For example, a meaningful alert might be something along the lines of “web requests are taking more than x seconds to process and respond” or “new servers are failing to spin up as expected.” Both are good examples of what could be high priority alerts for a company.
By contrast, a lower priority alert such as “server is 90% full” can be forwarded to the on-call engineer, but it doesn’t rise to the level of a 2am wakeup call. In OnPage, you can route this low priority alert to the engineer’s account and have the account notify the engineer only during normal business hours.
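The split between high and low priority alerts can be sketched in a few lines of Python. This is only an illustration of the idea, not OnPage’s API: the function names, the priority labels, and the business-hours window (09:00 to 17:00, Monday to Friday) are all assumptions made for the example.

```python
from datetime import datetime, time, timedelta

# Assumed business-hours window; adjust to your own team's schedule.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


def in_business_hours(moment: datetime) -> bool:
    """True on a weekday between BUSINESS_START and BUSINESS_END."""
    return moment.weekday() < 5 and BUSINESS_START <= moment.time() < BUSINESS_END


def next_business_start(moment: datetime) -> datetime:
    """First business-day start at or after the given moment."""
    candidate = datetime.combine(moment.date(), BUSINESS_START)
    if moment.time() >= BUSINESS_START:
        # Today's window has already opened (or closed), so look at tomorrow.
        candidate += timedelta(days=1)
    while candidate.weekday() >= 5:  # skip Saturday and Sunday
        candidate += timedelta(days=1)
    return candidate


def delivery_time(priority: str, raised_at: datetime) -> datetime:
    """Decide when the on-call engineer should actually be notified."""
    if priority == "high":
        return raised_at  # page immediately, even at 2am
    if in_business_hours(raised_at):
        return raised_at
    return next_business_start(raised_at)  # hold low priority alerts


# A "server is 90% full" alert raised at 2am on a Tuesday waits until 9am.
print(delivery_time("low", datetime(2024, 1, 2, 2, 0)))   # 2024-01-02 09:00:00
print(delivery_time("high", datetime(2024, 1, 2, 2, 0)))  # 2024-01-02 02:00:00
```

The design choice here is that the alert is never dropped, only delayed: the engineer still sees it, just not in the middle of the night.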
Make sure your alerts are calibrated
Establish a baseline so you know how your systems behave when they are working as intended. Without one, you cannot tell a genuine problem from normal variation.
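One possible way to turn a baseline into an alert threshold is to take recent measurements from a healthy period and alert only when a new value sits well above them. The sketch below uses the mean plus three standard deviations, which is a common rule of thumb rather than anything this article prescribes. The function names and sample values are made up for illustration.

```python
from statistics import mean, stdev
from typing import Sequence


def baseline_threshold(samples: Sequence[float], sigmas: float = 3.0) -> float:
    """Alert threshold derived from healthy-period samples (mean + N sigma)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate a baseline")
    return mean(samples) + sigmas * stdev(samples)


def should_alert(value: float, threshold: float) -> bool:
    """True when the observed value exceeds the calibrated threshold."""
    return value > threshold


# Illustrative response times in seconds, collected while the system was healthy.
healthy_response_times = [0.8, 1.0, 1.2]
threshold = baseline_threshold(healthy_response_times)  # roughly 1.6 seconds

print(should_alert(1.3, threshold))  # False: within normal variation
print(should_alert(2.5, threshold))  # True: well outside the baseline
```

In practice you would recompute the baseline periodically, since what counts as normal shifts as traffic and infrastructure change.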
Ensure alerts are tied to a schedule
As weird as it sounds, some shops just alert everyone. You never want to alert everyone. Make sure your alerts are tied to a schedule so that one person is alerted. If the engineer is unavailable, then escalate to the next person on call.
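The schedule-plus-escalation idea can be sketched as a simple loop over an ordered on-call rotation. This is a minimal illustration under stated assumptions: the rotation is a plain list of names, and `acknowledges` is a hypothetical callback standing in for whatever your alerting tool uses to detect that someone has picked up the alert (typically an acknowledgement within a timeout).

```python
from typing import Callable, List, Optional, Sequence, Tuple


def alert_with_escalation(
    rotation: Sequence[str],
    acknowledges: Callable[[str], bool],
) -> Tuple[Optional[str], List[str]]:
    """Notify one person at a time, in on-call order, until someone responds.

    Returns the responder (or None if nobody responded) and everyone notified.
    """
    notified: List[str] = []
    for engineer in rotation:
        notified.append(engineer)  # exactly one person is alerted per step
        if acknowledges(engineer):
            return engineer, notified
    return None, notified


# Illustrative rotation: the primary is unavailable, so the alert escalates.
rotation = ["primary", "secondary", "manager"]
responder, notified = alert_with_escalation(rotation, lambda name: name == "secondary")
print(responder)  # secondary
print(notified)   # ['primary', 'secondary']
```

Note that the manager is never notified in this run: the alert stops travelling as soon as someone takes ownership, which is the opposite of alerting everyone.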
Ensure alerts are actionable
Who wants to be woken up by a pointless message such as “there’s a problem with deployment in the test environment”? Instead, ensure each alert carries a specific piece of information that needs to be investigated and resolved.
Develop run books
Publish operating procedures so that on-call response becomes more standardized.
Review audit trails
Make sure alerts went to the person on the team who is best able to resolve the issue.
Review on call at weekly meetings
Review alerts that were received during the week to ensure sufficient information is arriving with alerts and that alerts are actionable. If they are not, then alter the alert messaging so it is more effective.
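To prepare for the weekly review, it can help to tally which alerts were flagged as not actionable so the noisiest ones get fixed first. The sketch below assumes each alert is recorded as a dictionary with hypothetical `name` and `actionable` fields. Your alerting tool’s export will likely look different.

```python
from collections import Counter
from typing import Dict, List, Tuple


def noisy_alerts(alerts: List[Dict]) -> List[Tuple[str, int]]:
    """Count non-actionable alerts by name, most frequent first."""
    counts = Counter(a["name"] for a in alerts if not a["actionable"])
    return counts.most_common()


# Illustrative week of alerts; the field names are assumptions for this example.
week = [
    {"name": "disk 90% full", "actionable": False},
    {"name": "disk 90% full", "actionable": False},
    {"name": "slow web requests", "actionable": True},
    {"name": "test deployment problem", "actionable": False},
]

print(noisy_alerts(week))
# [('disk 90% full', 2), ('test deployment problem', 1)]
```

The alerts at the top of this list are the first candidates for rewording, re-routing, or a lower priority.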
What is Critical Incident Management?
Critical Incident Management is the alignment of company operations, services and functions to manage high priority assets and situations. Any incident that requires a coordinated response between multiple teams can be defined as requiring critical incident management.