How to Manage a Critical Incident
A critical incident is any unplanned event that disrupts business- or mission-critical operations—such as a production outage, cybersecurity incident, clinical workflow disruption, or facilities failure—and demands immediate, coordinated response. For IT and operations teams, the cost of delays is real: revenue loss, compliance exposure, patient safety risks, and reputational damage.
OnPage is the leader in secure incident alerting and mission-critical communication. Our platform ensures that the right responder is reached the first time, every time—automating on-call routing, secure notifications, and escalations to reduce response times and drive measurable improvement.
This guide aligns with recognized incident management best practices (including FEMA’s National Incident Management System [NIMS] and New Zealand’s Coordinated Incident Management System [CIMS]) to help you build a resilient, repeatable program grounded in standards-based workflows.
- Learn how to prioritize alerts to avoid fatigue and missed escalations
- See how secure, persistent notifications and automated on-call routing accelerate response
- Measure outcomes using KPIs that matter (MTTA, MTTR, escalation frequency)
- Integrate with your monitoring and ITSM tools for end-to-end automation
For a deeper foundation, review our overview of critical incident management, then continue below to operationalize best practices with OnPage.
Note: For more on standards and terminology, see FEMA’s NIMS and New Zealand’s CIMS, which inform scalable command, communication, and escalation protocols used throughout this guide.
Planning Phase
Effective incident management starts before the event. Conduct a risk assessment and develop a documented incident response plan grounded in incident management best practices and aligned with NIMS/CIMS principles (clear command structure, common terminology, scalable response).
Plan for precision and repeatability:
- Roles and responsibilities (incident commander, communications lead, technical responders)
- Severity definitions and prioritization criteria (what is critical vs. non-critical)
- Communication protocols and approval workflows (internal/external, templates, timing)
- Escalation paths and on-call coverage (primary, secondary, managerial)
- RACI and SLAs/OLAs for acknowledgment and resolution
- Toolchain integrations (monitoring, ITSM, ticketing, chat, collaboration)
- Training, tabletop exercises, and post-incident reviews
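One way to keep a plan like this honest is to store it as data and check it for gaps before an incident happens. The sketch below is illustrative only: the plan layout, the role names, and the `plan_gaps` helper are assumptions made for this example, not an OnPage format or API.

```python
# Roles the plan should assign, taken from the planning list above.
REQUIRED_ROLES = {"incident_commander", "communications_lead", "technical_responder"}


def plan_gaps(plan: dict) -> list:
    """Return human-readable gaps; an empty list means these checks passed."""
    gaps = []
    # Every required role needs a named owner.
    for role in sorted(REQUIRED_ROLES - set(plan.get("roles", {}))):
        gaps.append(f"role unassigned: {role}")
    # Every escalating severity needs at least a primary and a secondary.
    paths = plan.get("escalation_paths", {})
    for severity in plan.get("severities", []):
        if len(paths.get(severity, [])) < 2:
            gaps.append(f"{severity}: needs a primary and a secondary responder")
    return gaps


# Hypothetical draft plan with two deliberate gaps.
draft_plan = {
    "severities": ["P1", "P2"],
    "roles": {"incident_commander": "dana", "technical_responder": "eli"},
    "escalation_paths": {"P1": ["eli", "dana"], "P2": ["eli"]},
}

for gap in plan_gaps(draft_plan):
    print(gap)
```

Running a check like this during tabletop exercises surfaces missing owners and single-person escalation paths while they are still cheap to fix.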
How OnPage strengthens the Planning Phase:
- Centralized, policy-based on-call scheduling aligned to your escalation matrix
- Secure incident alerting and HIPAA-compliant messaging with receipt, read, and acknowledgment tracking
- Automated routing rules that reflect your documented severity and escalation policies
- Audit trails and reporting to validate compliance with SLAs and support continuous improvement
Step One: Categorizing Alerts (High Versus Low-Priority Notifications)
Start with clear, documented severity definitions aligned to NIMS/CIMS principles so teams use a common language.
Examples:
- High priority (P1): Production outage affecting customers, ransomware activity detected, EHR downtime impacting patient care, critical safety system failure
- Medium priority (P2): Performance degradation, intermittent service errors, partial site disruption
- Low priority (P3): Non-urgent maintenance notifications, informational updates without operational impact
Alert fatigue is real—when teams are inundated with undifferentiated notifications, they miss the few that matter. OnPage’s secure incident alerting combats fatigue through:
- Priority-based notifications with distinct critical alert tones and persistent, repeat-until-acknowledged behavior
- Intelligent routing and escalations based on on-call schedules and severity policy
- Message status tracking (delivered, read, acknowledged) to eliminate ambiguity
- Event consolidation from integrated monitoring tools to reduce noise
| Severity | Example Events | OnPage Action | Response Expectation |
|---|---|---|---|
| P1 – Critical | Production outage, ransomware indicator, EHR downtime | Trigger high-priority, persistent alert; route to primary on-call; auto-escalate on timeout | Acknowledge within minutes; incident commander engaged |
| P2 – Major | Performance degradation, partial failure | Standard priority alert; route to service owner; escalate to secondary on-call as needed | Acknowledge within defined SLA (e.g., 15 minutes) |
| P3 – Informational | Maintenance complete, advisory | Low-priority message; no escalation | Asynchronous review |
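The severity table above can be expressed as a small lookup so that routing decisions are made the same way every time. This is a minimal sketch, not OnPage configuration: the event flags (`outage`, `security_threat`, `patient_care_impact`, `degraded`), the role names, and the 5-minute P1 acknowledgment target are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Severity(Enum):
    P1 = "critical"
    P2 = "major"
    P3 = "informational"


@dataclass(frozen=True)
class AlertPolicy:
    persistent: bool                 # repeat until acknowledged
    route_to: Optional[str]          # hypothetical role name; None = no page
    escalate_on_timeout: bool
    ack_sla_minutes: Optional[int]   # None = asynchronous review


# Illustrative values; the 5-minute P1 target is an assumption.
POLICIES = {
    Severity.P1: AlertPolicy(True, "primary_on_call", True, 5),
    Severity.P2: AlertPolicy(False, "service_owner", True, 15),
    Severity.P3: AlertPolicy(False, None, False, None),
}


def classify(event: dict) -> Severity:
    """Map a monitoring event to a severity using hypothetical flags."""
    if (event.get("outage") or event.get("security_threat")
            or event.get("patient_care_impact")):
        return Severity.P1
    if event.get("degraded"):
        return Severity.P2
    return Severity.P3


event = {"source": "monitoring", "degraded": True}
severity = classify(event)
print(severity.name, POLICIES[severity])
```

Keeping classification rules and policies in separate structures means a threshold can be tuned without touching the routing behavior attached to each severity.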
Learn more about configuring severity-based routing with OnPage incident alerting for IT operations.
Execution Phase
During an active incident, execution depends on discipline and automation. Effective protocols include:
- Predefined escalation paths mapped to on-call schedules and severities
- Incident roles and decision authority (incident commander, communications lead, technical leads)
- Communication templates for stakeholders (internal, customers, regulators) and predefined channels
- Regular handoff and status cadence (e.g., 15/30/60-minute updates) with audit-friendly records
- Scheduled drills and tabletop exercises to validate readiness
Aligned with NIMS/CIMS, these protocols ensure clear command and control, coordinated operations, and consistent communication.
OnPage operationalizes execution:
- Automated, policy-based alert routing and escalations to eliminate manual paging
- Secure, HIPAA-compliant mobile messaging with read/acknowledge receipts and threaded context
- Integration with popular platforms across industries to trigger alerts from real-time signals and incidents
- Reliable, persistent alerts that cut through noise so mission-critical communication reaches the right responder consistently
Step Two: Learning About the Right Tools
Selecting the right platform determines whether your protocols work under pressure. OnPage delivers secure incident alerting capabilities that solve the problems responders face:
- On-call schedules: Close coverage gaps with centralized, role-based scheduling and easy rotations—no spreadsheets required. Aligns to your escalation matrix and supports after-hours and holiday policies.
- Automated escalations: Prevent stalled incidents by auto-advancing to the next responder on timeout. Managers receive visibility when policies trigger.
- Persistent critical alerts: Distinct high-priority tones and repeat-until-acknowledged notifications ensure urgent messages aren’t missed—even during device Do Not Disturb.
- Message status tracking and audit trails: Real-time delivered/read/acknowledged states, plus downloadable logs for audits and post-incident analysis.
- Integrations and automation: Reduce manual steps and speed time-to-engage by leveraging OnPage’s integrations. Explore OnPage integrations to streamline your toolchain.
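To make the idea of a centralized rotation concrete, the sketch below resolves who is primary and secondary at a given moment in a fixed-length rotation. It is a simplified model under stated assumptions: the responder names, the weekly shift length, and the rule that the secondary is the next person in the list are hypothetical, and real schedules also need overrides for holidays and swaps.

```python
from datetime import datetime, timedelta


def on_call_pair(rotation, start, shift, when):
    """Return (primary, secondary) for `when` in a fixed-length rotation.

    The secondary is simply the next person in the rotation.
    """
    if when < start:
        raise ValueError("`when` precedes the start of the rotation")
    # Whole shifts elapsed since the first handoff, wrapped around the list.
    index = ((when - start) // shift) % len(rotation)
    return rotation[index], rotation[(index + 1) % len(rotation)]


rotation = ["ana", "ben", "chen"]            # hypothetical responders
start = datetime(2024, 1, 1, 9, 0)           # first handoff
weekly = timedelta(days=7)

print(on_call_pair(rotation, start, weekly, datetime(2024, 1, 10, 3, 30)))
```

Computing coverage from one rotation definition, rather than from a hand-edited spreadsheet, is what removes the gaps at handoff boundaries.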
Step Three: Adopting the Right Tools
With stakes high during a crisis or emergency, choose a platform that won’t fail when it matters most. Use this adoption checklist to evaluate solutions and see how OnPage aligns:
Platform evaluation checklist:
- Security and compliance: End-to-end encryption, HIPAA-compliant messaging, role-based access (OnPage provides secure, compliant workflows and audit trails)
- Reliability and delivery assurance: Persistent alerts, high-priority tones, redundant delivery paths (OnPage ensures repeat-until-acknowledged delivery)
- On-call and escalations: Centralized scheduling, policy-based routing, time-based escalations (OnPage’s on-call management routes to the right responder automatically)
- Integration and automation: Native integrations/APIs with monitoring and ITSM (OnPage integrates with popular tools to automate alert creation and updates)
- Visibility and reporting: Real-time message status, post-incident reports, exportable logs (OnPage surfaces delivered/read/acknowledged states and detailed analytics)
- Usability and adoption: Intuitive mobile app and web console, minimal training (OnPage is designed for fast adoption across IT, MSPs, and clinical teams)
- Support and partnership: Implementation guidance and best practices (OnPage is a trusted partner for mission-critical communication)
Real-world scenario:
A managed service provider configured OnPage to route high-priority alerts from its monitoring stack to the primary on-call engineer, escalating to a team lead if not acknowledged within five minutes. With persistent notifications and clear acknowledgment tracking, handoffs became reliable, nighttime pages were no longer missed, and customer updates were consistent and timely.
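The scenario above, with a primary engineer and a team lead paged after a five-minute timeout, can be modeled as a walk through an escalation chain. The sketch below is a simulation rather than a paging client: the `Tier` type, the `escalate` function, and the 10-minute timeout for the team lead are assumptions, and it simplifies by treating a responder as out of the picture once their timeout passes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    responder: str
    timeout_min: float   # minutes to wait for an acknowledgment


def escalate(chain, ack_delay_min):
    """Walk the chain until someone acknowledges within their timeout.

    ack_delay_min maps responder -> minutes they take to acknowledge
    after being paged; a missing entry means they never respond.
    Returns (acknowledger or None, page log of (minute paged, responder)).
    """
    clock = 0.0
    log = []
    for tier in chain:
        log.append((clock, tier.responder))
        delay = ack_delay_min.get(tier.responder)
        if delay is not None and delay <= tier.timeout_min:
            return tier.responder, log
        clock += tier.timeout_min   # timeout elapsed; move to the next tier
    return None, log


chain = [Tier("primary_engineer", 5), Tier("team_lead", 10)]

# The primary misses the page; the team lead acknowledges 2 minutes after theirs.
print(escalate(chain, {"team_lead": 2}))
```

A `None` result means the whole chain timed out, which is worth treating as a policy gap to review rather than a normal outcome.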
See how these capabilities work in practice with OnPage’s incident alerting platform and learn more about OnPage integrations.
How Did Your Team Perform?
A repeatable post-incident review closes the loop and drives continuous improvement. Measure what happened, why it happened, and how you will prevent recurrence—using objective KPIs and auditable communication records. In Step Four, we outline the metrics and process that align to industry best practices and show how OnPage reporting simplifies the work.
Step Four: Post-Mortem Analysis and Reports
Perform a structured after-action review aligned with NIMS/CIMS disciplines. Use objective, repeatable metrics to understand performance and inform improvements:
Key KPIs to track:
- MTTA (Mean Time to Acknowledge)
- MTTR (Mean Time to Resolve)
- Notification delivery success and time-to-deliver
- Acknowledgment time by role/team
- Escalation frequency and depth (how often timeouts trigger an escalation, and how many tiers are reached)
- False-positive rate and noise sources
- SLA/OLA adherence
- Communication effectiveness (stakeholder update cadence, clarity)
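MTTA and MTTR are simple averages once each incident carries consistent timestamps. The sketch below assumes both are measured from the moment the alert was sent, and the field names (`alerted`, `acknowledged`, `resolved`) and sample data are hypothetical; teams that start the MTTR clock at acknowledgment would change the start key accordingly.

```python
from datetime import datetime
from statistics import mean


def mean_minutes(incidents, start_key, end_key):
    """Average minutes between two timestamps, skipping incomplete records."""
    spans = [
        (i[end_key] - i[start_key]).total_seconds() / 60
        for i in incidents
        if i.get(start_key) and i.get(end_key)
    ]
    return mean(spans) if spans else None


# Hypothetical incident records; the third is still unresolved.
incidents = [
    {"alerted": datetime(2024, 3, 1, 2, 0),
     "acknowledged": datetime(2024, 3, 1, 2, 4),
     "resolved": datetime(2024, 3, 1, 2, 50)},
    {"alerted": datetime(2024, 3, 2, 14, 0),
     "acknowledged": datetime(2024, 3, 2, 14, 2),
     "resolved": datetime(2024, 3, 2, 14, 30)},
    {"alerted": datetime(2024, 3, 3, 9, 0),
     "acknowledged": datetime(2024, 3, 3, 9, 6),
     "resolved": None},
]

mtta = mean_minutes(incidents, "alerted", "acknowledged")
mttr = mean_minutes(incidents, "alerted", "resolved")
print(f"MTTA: {mtta} min, MTTR: {mttr} min")
```

Whichever definition you choose, record it alongside the KPI so that period-over-period comparisons measure the same thing.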
How OnPage accelerates post-incident improvement:
- Threaded, HIPAA-compliant conversations maintain full context of decisions and actions
- Real-time message status (delivered/read/acknowledged) and timestamps create an auditable trail
- Downloadable, post-incident reports support compliance requirements and trend analysis
- Analytics highlight recurring issues (e.g., frequent escalations at a certain hour) to guide staffing and policy changes
- OnPage’s AI-powered reporting automates the post-event review process by analyzing all notes related to a message, timestamps, and incident details to generate a comprehensive report
Recommended review workflow:
- Gather artifacts: OnPage reports, monitoring timelines, ticket history, change records
- Facilitate a blameless debrief with key stakeholders; identify contributing factors and systemic gaps
- Document corrective actions with owners and due dates; update runbooks, escalation policies, and on-call schedules in OnPage
- Validate changes through drills and track KPI improvements over time
For a deeper look at OnPage’s reliable alert-until-read capabilities, see our overview of OnPage’s state-of-the-art Alert Engine.
Lessons Learned
High-performing teams institutionalize learning. After each incident, translate insights into updated runbooks, routing policies, and training—then measure the effect.
Example in practice:
Following a late-night P1 outage, a team discovered acknowledgments frequently timed out between 2–4 a.m. They updated their OnPage on-call schedule to add a secondary engineer during that window and tuned monitoring thresholds to reduce noise. In the next incident, acknowledgment times improved and escalations decreased, demonstrating measurable resilience.
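A pattern like the 2–4 a.m. timeouts in this example can be found by grouping acknowledgment timeouts by hour of day. The sketch below assumes alert records exported as dictionaries with hypothetical `sent_at` and `timed_out` fields; it does not reflect a specific OnPage report format.

```python
from collections import Counter
from datetime import datetime


def timeouts_by_hour(alerts):
    """Count acknowledgment timeouts per hour of day (0-23)."""
    return Counter(a["sent_at"].hour for a in alerts if a["timed_out"])


def hot_hours(alerts, min_timeouts=2):
    """Hours with at least `min_timeouts` timeouts, in ascending order."""
    counts = timeouts_by_hour(alerts)
    return sorted(h for h, n in counts.items() if n >= min_timeouts)


# Hypothetical export of alert records.
alerts = [
    {"sent_at": datetime(2024, 3, 1, 2, 15), "timed_out": True},
    {"sent_at": datetime(2024, 3, 4, 2, 50), "timed_out": True},
    {"sent_at": datetime(2024, 3, 5, 3, 40), "timed_out": True},
    {"sent_at": datetime(2024, 3, 5, 3, 10), "timed_out": False},
    {"sent_at": datetime(2024, 3, 6, 14, 5), "timed_out": False},
]

print(timeouts_by_hour(alerts))
print(hot_hours(alerts))
```

Hours that keep appearing in the result are candidates for added secondary coverage or for threshold tuning, as the team in the example chose to do.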
Incident management lifecycle at a glance:
Detect → Prioritize → Notify → Escalate → Resolve → Review → Improve (with OnPage at the center of the cycle)
Readiness checklist:
- Documented severity definitions and escalation matrix
- Centralized on-call schedules in OnPage with primary/secondary coverage
- Integrated systems via email, API, or native integrations to trigger automated alerts
- Secure messaging policies and communication templates
- Defined KPIs (MTTA, MTTR, escalation frequency) and reporting cadence
- Scheduled drills/tabletops and a blameless review process
When you’re ready to elevate mission-critical communication, we’re here to help. Contact our team or start a free trial to experience secure, dependable incident alerting with OnPage.



