Monitoring and Alerting Best Practices

Calibrate alerts so that they are meaningful and powerful.

Not all alerts are created equal! Most response teams have adopted some form of IT alerting, but many still fall short of monitoring and alerting best practices. Having an alerting system is not enough: if monitoring tools are left uncalibrated, they simply produce a sea of noisy data. Instead, teams should calibrate alerts so that they are prioritized and meaningful.

Meaningful alerts notify engineers when, for example, a queue of web requests is taking more than “x” seconds to process and respond. Quality alerts can also inform engineers of critical server failures and other high-priority IT incidents. High-priority notifications can bypass the mute switch on smartphones to ensure incidents are responded to immediately.

Low-priority alerts, on the other hand, notify engineers of less important incidents. These alerts can inform on-call engineers that a server is “90 percent full” without triggering a loud, persistent high-priority notification. With OnPage, you can send low-priority alerts to an engineer’s account and ensure the account notifies the IT specialist during normal business hours.
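
To make the high-priority versus low-priority split concrete, here is a minimal sketch of routing by priority. The function names, the return labels and the 9:00 to 17:00 weekday business hours are assumptions for illustration only. They are not part of OnPage or any other product.

```python
from datetime import datetime, time

# Assumed business hours, for illustration only.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


def is_business_hours(now: datetime) -> bool:
    """True on Monday to Friday between the assumed start and end times."""
    return now.weekday() < 5 and BUSINESS_START <= now.time() < BUSINESS_END


def route_alert(priority: str, now: datetime) -> str:
    """Decide how to deliver an alert (hypothetical return labels).

    High-priority alerts always page immediately. Low-priority alerts
    are delivered right away only during business hours and are held
    otherwise.
    """
    if priority == "high":
        return "page_now"
    if is_business_hours(now):
        return "notify_now"
    return "hold_until_business_hours"
```

With a rule like this, a “90 percent full” warning raised at 11 p.m. waits for the morning, while a critical server failure pages the on-call engineer at any hour.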

Monitoring and Alerting

Make Sure Your Alerts Are Calibrated

Establish a baseline so you know how your systems are supposed to work.
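
One simple way to turn a baseline into an alert threshold is to measure normal behavior and alert only on values well outside it. The sketch below assumes you have historical samples of a metric, such as response times in seconds. The "mean plus three standard deviations" rule and the function names are illustrative choices, not a prescribed method.

```python
from statistics import mean, stdev


def baseline_threshold(samples: list[float], k: float = 3.0) -> float:
    """Return mean + k sample standard deviations of historical data.

    Needs at least two samples, because stdev() is undefined for one.
    """
    return mean(samples) + k * stdev(samples)


def should_alert(value: float, threshold: float) -> bool:
    """Alert only when the observed value exceeds the calibrated threshold."""
    return value > threshold


# Example: response times (in seconds) observed during normal operation.
normal_response_times = [1.0, 1.2, 0.8, 1.1, 0.9]
threshold = baseline_threshold(normal_response_times)
```

A larger `k` produces fewer, more significant alerts. A smaller `k` catches problems earlier at the cost of more noise.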

Ensure Alerts Are Tied to a Schedule

As odd as it sounds, some shops just alert everyone. Notifying everyone of an incident leaves no one clearly responsible for it. Tie your alerts to an actionable on-call schedule so that one person is alerted, and configure escalation so that if the on-call engineer is unavailable, the notification moves to the next person in line.
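
As a rough sketch of escalation, the function below picks who should be notified based on how long an alert has gone unacknowledged. The escalation chain, the names in it and the 10-minute step are hypothetical. Real alerting tools expose this as configuration rather than code.

```python
def responder_at(chain: list[str], minutes_unacknowledged: int,
                 step_minutes: int = 10) -> str:
    """Return the person to notify after a given unacknowledged time.

    The alert starts with the first person in the chain and moves one
    step down the chain every `step_minutes`, stopping at the last person.
    """
    if not chain:
        raise ValueError("escalation chain must not be empty")
    level = min(minutes_unacknowledged // step_minutes, len(chain) - 1)
    return chain[level]


# Hypothetical on-call order for this week.
on_call_chain = ["alice", "bob", "carol"]
first = responder_at(on_call_chain, 0)    # primary on-call engineer
second = responder_at(on_call_chain, 12)  # escalated once after 10 minutes
```

Only one person is notified at each step, so ownership of the incident stays clear.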

Ensure Alerts Are Actionable

No one wants to be woken up in the middle of the night by a pointless message, such as an alert about deployment problems in a test environment. Instead, ensure that every alert carries contextual, meaningful information about an issue that needs to be investigated and resolved immediately.
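
A check like the following could run before an alert is allowed to page anyone. It is only a sketch: the `Alert` fields and the two rules (production only, context required) are assumptions chosen to mirror the advice above.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """Hypothetical alert record, for illustration only."""
    summary: str
    environment: str
    details: str = ""


def actionability_problems(alert: Alert) -> list[str]:
    """List the reasons an alert should not page someone at night.

    An empty list means the alert passes both checks.
    """
    problems = []
    if alert.environment != "production":
        problems.append("not a production incident")
    if not alert.details.strip():
        problems.append("missing contextual details")
    return problems
```

An alert that returns any problems can be downgraded to low priority instead of waking the on-call engineer.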

Develop Runbooks

Publish operating procedures so that on-call response is more standardized and effective.

Review Audit Trails

Review real-time audit trails to ensure that incidents are being managed effectively by the right people at the right time. Gain instant visibility into incident alerts and message acknowledgements.

Review On Call at Weekly Meetings

Review all alerts received during the week to determine whether contextual information is being attached to notifications. Ensure that critical incident alerts are detailed and actionable.
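
For the weekly review, a short summary of how many alerts arrived without context can focus the discussion. This sketch assumes each alert is exported as a dictionary with an optional `details` field. That export format is an assumption, not a feature of any particular tool.

```python
def weekly_summary(alerts: list[dict]) -> dict:
    """Count the week's alerts and how many lacked contextual details."""
    total = len(alerts)
    with_context = sum(1 for a in alerts if str(a.get("details", "")).strip())
    return {
        "total": total,
        "with_context": with_context,
        "missing_context": total - with_context,
    }


# Hypothetical export of one week's alerts.
week = [
    {"summary": "DB down", "details": "Primary unreachable since 02:14"},
    {"summary": "Disk warning", "details": ""},
    {"summary": "API latency high"},
]
summary = weekly_summary(week)
```

A rising `missing_context` count is a signal that alert templates need more detail.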

What Is Critical Incident Management?

Critical incident management is the alignment of company operations, services and functions to manage high- and low-priority IT issues. It is needed for incidents that require a coordinated response from multiple teams.