Routing Alerts by Team and Severity with Grafana Alerting Contact Points & Notification Policies
There's a mistake people almost always make when first building a monitoring system. They create a bunch of alerts and funnel them all into a single Slack channel. I did this too. The first few days are fine. Then one night at 3 AM, hundreds of alerts explode into your #alerts channel, and the truly important severity=critical ones get buried in the noise where nobody sees them. That incident led me to redesign the entire alert structure from scratch, and this post is a summary of what I learned through that process.
Grafana Unified Alerting's Contact Points and Notification Policies solve this problem with a label-based routing tree. Add Grafana Cloud IRM's Escalation Chains to the mix, and you can declaratively define the entire flow from the moment an alert fires to when an on-call engineer responds or the alert escalates. After reading this, you'll be able to declare a severity- and team-based routing tree with Terraform and wire up Grafana IRM escalation chains, building a structure where alerts are never missed.
One thing I want to flag upfront: Grafana OnCall OSS was officially archived on March 24, 2026. If you were running self-hosted OnCall, the IRM escalation chain section of this post should be read with Grafana Cloud IRM as the reference point. If you must stay fully self-hosted, directly integrating Alertmanager with PagerDuty or OpsGenie is the realistic alternative. Confirm this premise before reading the rest — it'll make things much clearer.
Core Concepts
The Three Pillars of Grafana Unified Alerting
Grafana's alerting system is made up of three main components. Each has a clearly defined role, and once you internalize them, it's much easier to change or extend the structure later.
| Component | Role | Analogy |
|---|---|---|
| Alert Rule | Defines "when to fire an alert" | Fire detection sensor |
| Contact Point | Defines "where to send the alert" | Fire station contact info |
| Notification Policy | Routes "which alerts go to which Contact Point" | 911 dispatcher |
The entire flow in one line:
Alert Rule fires → labels evaluated → Notification Policy matched → Contact Point delivers → IRM Escalation Chain
Contact Point — The Destination Where Alerts Land
A Contact Point is an object that defines the channel where alerts are actually delivered. It supports 30+ integrations including Slack, PagerDuty, Microsoft Teams, Email, and Webhook, and you can bundle multiple channels into a single Contact Point. For example, if you register both PagerDuty and Slack under a Contact Point named backend-critical, a single alert will be sent to both channels simultaneously.
Contact Point vs. Notification Channel: The "Notification Channel" concept used in the old Grafana Legacy Alerting was replaced by "Contact Point" in Unified Alerting. The concept is largely the same, but Contact Points are fully decoupled from the routing tree, making them much easier to reuse.
Notification Policy — The Label-Based Routing Tree
Notification Policy is the heart of this system. It's structured as a routing tree with a Default Policy always at the root. Every alert traverses this tree from top to bottom and is matched to the most specific child policy.
Each policy filters alerts using label matcher conditions. There are four matcher operators:
| Operator | Meaning | Example |
|---|---|---|
| `=` | Exact match | `team = backend` |
| `!=` | Does not match | `severity != info` |
| `=~` | Regex match | `namespace =~ prod-.*` |
| `!~` | Regex non-match | `team !~ test-.*` |
Multiple conditions are combined with AND logic, and the most specific (deepest) child policy takes precedence. Alerts that don't match any child policy fall through to the Default Policy. The =~ operator is especially useful for cases with dynamically varied values, like Kubernetes namespaces.
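To make the matcher semantics concrete, here is a minimal Python sketch of how a set of matchers could be evaluated against an alert's labels. It is a simplified model, not Grafana's implementation. The function names `matches` and `matches_all` are my own, and it assumes (as in Prometheus Alertmanager) that regex matchers are anchored to the whole label value and that a missing label behaves like an empty string. Python's `re` syntax also differs slightly from the RE2 syntax used by Go programs such as Grafana, so treat it as an illustration only.

```python
import re


def matches(labels, name, op, value):
    """Evaluate one label matcher against an alert's labels (a dict)."""
    actual = labels.get(name, "")  # assumption: a missing label acts like ""
    if op == "=":
        return actual == value
    if op == "!=":
        return actual != value
    if op == "=~":
        # fullmatch models the anchored behaviour: the regex must cover the whole value
        return re.fullmatch(value, actual) is not None
    if op == "!~":
        return re.fullmatch(value, actual) is None
    raise ValueError(f"unknown matcher operator: {op}")


def matches_all(labels, matchers):
    """Multiple matchers on one policy are combined with AND."""
    return all(matches(labels, name, op, value) for name, op, value in matchers)


alert = {"team": "backend", "severity": "critical", "namespace": "prod-payments"}
policy_matchers = [("team", "=", "backend"), ("namespace", "=~", "prod-.*")]
print(matches_all(alert, policy_matchers))
```

Under the anchoring assumption, `namespace =~ prod` would not match `prod-payments`, which is why the examples in this post always write the trailing `.*` explicitly.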
Alert Grouping — group_by, group_wait, group_interval, repeat_interval
Understanding how these four settings work together will save you confusion later:
| Setting | Role | Example |
|---|---|---|
| `group_by` | Which label combination to use for grouping alerts | `[alertname, cluster]` |
| `group_wait` | After receiving the first alert, how long to wait for additional alerts in the same group before sending them together | `30s` |
| `group_interval` | After sending a group, how long to wait before sending another update if new alerts arrive | `5m` |
| `repeat_interval` | How long before re-sending an already-delivered alert | `4h` |
For example, setting group_wait: 0s and group_interval: 1m means critical alerts are sent immediately, and any additional alerts in the same group will trigger updates at 1-minute intervals.
Grouping caveat: Alerts routed to different routes are never grouped together. Grouping only works within the same route, so it's worth thinking about your grouping strategy alongside your routing design.
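As a rough mental model of `group_by`, the sketch below buckets alerts that have already landed on the same route by the values of their grouping labels. The function name `group_alerts`, the sample alerts, and the `checkout` service are hypothetical, and timing (`group_wait`, `group_interval`, `repeat_interval`) is deliberately left out.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts (dicts of labels) by the values of the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Three alerts that are assumed to have matched the same policy.
alerts = [
    {"alertname": "HighErrorRate", "cluster": "production", "service": "api-gateway"},
    {"alertname": "HighErrorRate", "cluster": "production", "service": "checkout"},
    {"alertname": "HighErrorRate", "cluster": "staging", "service": "api-gateway"},
]

for key, members in group_alerts(alerts, ["alertname", "cluster"]).items():
    print(key, len(members))
```

With `[alertname, cluster]` the two production alerts share one notification while the staging alert gets its own. Adding `service` to `group_by` would split them into three.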
Enabling Continue matching subsequent sibling nodes allows a single alert to be processed by multiple policies. This is useful when you want to send all alerts to a logging channel while simultaneously routing them to severity-specific channels, but it can produce unintended duplicate notifications — enable it only when you have a clear purpose.
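The tree walk, including the effect of the continue option, can be sketched in a few lines of Python. This models Alertmanager-style routing semantics as I understand them and is not Grafana's code. The `Policy` class, `route`, `build_tree`, and the `slack-alert-log` contact point are hypothetical names, and matchers are reduced to simple equality to keep it short.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    contact_point: str
    matchers: dict = field(default_factory=dict)  # label -> required value (equality only)
    continue_matching: bool = False               # "Continue matching subsequent sibling nodes"
    children: list = field(default_factory=list)


def route(policy, labels):
    """Return the contact points that would receive an alert with these labels."""
    if any(labels.get(name) != value for name, value in policy.matchers.items()):
        return []  # this policy does not match, so its children are never visited
    matched = []
    for child in policy.children:
        hits = route(child, labels)
        matched.extend(hits)
        if hits and not child.continue_matching:
            break  # first matching child wins unless it opts into "continue"
    # No child matched: the current policy handles the alert itself.
    return matched or [policy.contact_point]


def build_tree(with_log_channel):
    children = [
        Policy("pagerduty-critical", {"severity": "critical"}),
        Policy("slack-warning", {"severity": "warning"}),
    ]
    if with_log_channel:
        # Catch-all sibling placed first, with "continue" enabled.
        children.insert(0, Policy("slack-alert-log", continue_matching=True))
    return Policy("slack-general", children=children)


critical = {"severity": "critical", "team": "backend"}
print(route(build_tree(with_log_channel=False), critical))
print(route(build_tree(with_log_channel=True), critical))
print(route(build_tree(with_log_channel=False), {"severity": "info"}))
```

The second call returns two contact points for one alert, which is exactly the duplicate-notification behavior to watch out for when the option is enabled without a clear purpose.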
Grafana Cloud IRM — Escalation Chains
IRM's Escalation Chain is the step-by-step workflow that kicks in after a Contact Point receives an alert. This is where you define automated escalation logic like "if nobody responds within 5 minutes, escalate to the next on-call engineer."
```
Escalation Chain: backend-critical
├── Step 1: Notify on-call from schedule [backend-primary]
├── Step 2: Wait 5 minutes
├── Step 3: Notify on-call from schedule [backend-primary] (Important)
├── Step 4: Wait 10 minutes
├── Step 5: Notify on-call from schedule [backend-backup]
├── Step 6: Wait 15 minutes
└── Step 7: Notify all team members [backend-team]
```
The chain continues executing until the alert is Acknowledged, Resolved, or Silenced.
Default vs. Important notifications: In IRM, each Step can deliver notifications as either "Default" or "Important." Individual engineers can configure their personal notification rules so that Default notifications go to app push and Important ones trigger a phone call. This separates team policy from personal preference, which meaningfully helps reduce alert fatigue.
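To see what the waits add up to, here is a small Python sketch that turns the backend-critical chain above into a timeline of who gets notified at which minute. It is only a model of the step list. `escalation_timeline` and the tuple format are my own invention, not an IRM API, and resolution or silencing is treated the same as acknowledgement.

```python
# Steps mirror the backend-critical chain: ("notify", target) or ("wait", minutes).
BACKEND_CRITICAL = [
    ("notify", "on-call from backend-primary"),
    ("wait", 5),
    ("notify", "on-call from backend-primary (Important)"),
    ("wait", 10),
    ("notify", "on-call from backend-backup"),
    ("wait", 15),
    ("notify", "all members of backend-team"),
]


def escalation_timeline(steps, acknowledged_at=None):
    """Return (minute, target) pairs for notifications sent before the chain stops."""
    clock = 0
    sent = []
    for kind, value in steps:
        if kind == "wait":
            clock += value
            continue
        if acknowledged_at is not None and clock >= acknowledged_at:
            break  # someone responded, so the remaining steps never run
        sent.append((clock, value))
    return sent


for minute, target in escalation_timeline(BACKEND_CRITICAL):
    print(f"t+{minute:>2} min: notify {target}")
```

Run without an acknowledgement, the whole team is only paged 30 minutes after the alert fires. If that is too slow for your critical alerts, shorten the waits rather than adding more steps.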
Key Summary — TL;DR
- Alert Rule: Attaches labels (`severity`, `team`) to define alert properties
- Notification Policy: Uses a label matcher tree to decide which Contact Point receives each alert
- Contact Point: Defines the actual delivery channels (Slack, PagerDuty, IRM webhook)
- IRM Escalation Chain: The workflow that automatically escalates to the next on-call engineer if there's no response
Practical Application
Example 1: Routing by Severity (critical → PagerDuty, warning → Slack)
When to use this pattern: Ideal for small-to-medium teams where everyone shares the same on-call pool and separating channels by severity alone is sufficient.
This is the most fundamental yet effective pattern. You use a single severity label to assess urgency and route to the appropriate channel. Even this alone lets you pick out what matters during a 3 AM alert storm.
Start by creating three Contact Points:
| Contact Point Name | Channel | Purpose |
|---|---|---|
| `pagerduty-critical` | PagerDuty | Critical alerts only |
| `slack-warning` | Slack `#alerts-warning` | Warning alerts |
| `slack-general` | Slack `#alerts-general` | Default, unclassified alerts |
This is how you attach labels to an Alert Rule. If PromQL is unfamiliar, check the Prometheus query language documentation first — it'll make this much easier to follow.
```yaml
groups:
  - name: backend-alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status_code=~"5.."}[5m]) > 0.05
        labels:
          severity: critical
          team: backend
          service: api-gateway
        annotations:
          summary: "API Gateway error rate spike"
```
Then configure the Notification Policy tree:
```
[Default Policy]
  contact: slack-general
  group_by: [alertname, cluster]
│
├── [severity=critical]
│     contact: pagerduty-critical
│     group_wait: 0s          ← critical is sent immediately
│     group_interval: 1m
│     repeat_interval: 30m
│   │
│   └── [team=backend]
│         contact: pagerduty-backend-critical
│         ← Uses the backend team's dedicated PagerDuty service key
│
├── [severity=warning]
│     contact: slack-warning
│     group_wait: 30s
│     group_interval: 5m
│     repeat_interval: 4h
│
└── [severity=info]
      contact: slack-general
      group_wait: 1m
      repeat_interval: 12h
```
Optional deep dive (managing as IaC with Terraform): Teams that haven't adopted IaC can skip this block and move to the next example. Everything here can be configured identically through the UI.
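```hcl
# Contact Point resources. The policy tree references them by name, which gives
# Terraform the dependency order; declaration order in the file does not matter.
resource "grafana_contact_point" "pagerduty_critical" {
  name = "pagerduty-critical"
  pagerduty {
    integration_key = var.pagerduty_integration_key
  }
}

# Assumed declaration for the backend team's dedicated PagerDuty service.
# The variable name is hypothetical; use whatever holds that service's key.
resource "grafana_contact_point" "pagerduty_backend_critical" {
  name = "pagerduty-backend-critical"
  pagerduty {
    integration_key = var.pagerduty_backend_integration_key
  }
}

resource "grafana_contact_point" "slack_general" {
  name = "slack-general"
  slack {
    url = var.slack_webhook_url
  }
}

# Assumed declaration for the warning channel; the variable name is hypothetical.
resource "grafana_contact_point" "slack_warning" {
  name = "slack-warning"
  slack {
    url = var.slack_warning_webhook_url
  }
}

# Notification Policy tree
resource "grafana_notification_policy" "main" {
  group_by      = ["alertname", "cluster"]
  contact_point = grafana_contact_point.slack_general.name

  # severity=critical: page immediately
  policy {
    matcher {
      label = "severity"
      match = "="
      value = "critical"
    }
    contact_point   = grafana_contact_point.pagerduty_critical.name
    group_wait      = "0s"
    group_interval  = "1m"
    repeat_interval = "30m"

    # Nested policy: backend critical alerts use the team's own PagerDuty service
    policy {
      matcher {
        label = "team"
        match = "="
        value = "backend"
      }
      contact_point = grafana_contact_point.pagerduty_backend_critical.name
    }
  }

  # severity=warning: Slack only
  policy {
    matcher {
      label = "severity"
      match = "="
      value = "warning"
    }
    contact_point   = grafana_contact_point.slack_warning.name
    group_wait      = "30s"
    group_interval  = "5m"
    repeat_interval = "4h"
  }
}
```
With this setup, severity=critical, team=backend alerts go to the backend team's dedicated PagerDuty service, severity=critical alerts from any other team go to the shared PagerDuty service, and severity=warning alerts route to the Slack warning channel. After switching to this structure, our team stopped having to hunt for critical alerts buried in nighttime Slack threads.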
Example 2: Team-First Routing with Special Handling for the Security Team
When to use this pattern: Ideal when each team has a completely different way of handling alerts, or when a team like security needs a pipeline entirely isolated from external channels.
Depending on your organization, you may want to route by team ownership before considering severity.
```
[Default Policy]
  contact: slack-general
│
├── [team=security]
│     contact: security-slack        ← Dedicated Slack channel for the security team
│     Continue matching: OFF         ← Prevent overlap with other policies
│   │
│   └── [severity=critical]
│         contact: security-pagerduty  ← Additional escalation for critical only
│
├── [team=backend]
│     contact: backend-slack
│   │
│   └── [severity=critical]
│         contact: pagerduty-backend
│
└── [team=frontend]
      contact: frontend-slack
```
Turning off Continue matching subsequent sibling nodes prevents team=security alerts from flowing into other sibling policies (such as severity-based ones). Since this option is off by default, it's recommended to enable it explicitly only when cross-routing is intentional.
Example 3: Configuring an Escalation Chain with Grafana IRM
When to use this pattern: Ideal for teams that need automatic escalation on unanswered critical alerts, or teams with an on-call schedule.
Honestly, this connection was the most confusing part when I first set it up. Walk through the steps below to see how Grafana Alerting and IRM connect.
Step 1: Create an IRM integration and retrieve the webhook URL
Go to Grafana Cloud IRM → Integrations → + New Integration → Grafana Alerting to create an integration. This generates a webhook URL that acts as the bridge between Grafana Alerting and IRM.
Step 2: Register the IRM integration as a Contact Point
In Grafana → Alerts & IRM → Alerting → Contact Points, click + Add contact point, select Grafana OnCall (or Grafana IRM) as the integration type, and enter the webhook URL from Step 1.
Step 3: Define an Escalation Chain in IRM
Go to Grafana Cloud IRM → Escalation Chains → + New Escalation Chain, give it a name, and add steps:
```
Step 1: Notify on-call from schedule
  Schedule: backend-primary-schedule
  Notification type: Default
Step 2: Wait for 5 minutes
Step 3: Notify on-call from schedule
  Schedule: backend-primary-schedule
  Notification type: Important   ← Stronger notification method (phone call, etc.)
Step 4: Wait for 10 minutes
Step 5: Notify on-call from schedule
  Schedule: backend-backup-schedule
  Notification type: Important
Step 6: Wait for 15 minutes
Step 7: Notify users and teams
  Team: backend-team
Step 8: Repeat escalation (max: 5 times)
```
Step 4: Connect label conditions to the escalation chain in IRM routing
Go to IRM → Integration → backend-irm-integration → Routes → + Add Route and enter the routing condition:
```
Routing condition: {{ labels.severity == "critical" }}
Escalation Chain: backend-critical-chain
```
The {{ labels.severity == "critical" }} syntax is IRM's own routing DSL, an expression derived from Jinja2. You enter it directly into the Route configuration UI in Grafana Cloud IRM, and it can reference Prometheus label values directly. Before saving, use the "Preview" feature in the IRM UI to verify that the condition actually matches as expected.
IRM's two-stage routing philosophy: Grafana's official documentation recommends "coarse-grained routing by team and service in Grafana Alerting, and fine-grained escalation by severity in IRM." I tried adding detailed conditions in both places at first, and it became very hard to trace which side was making the final routing decision. Keeping the roles separated is clearly the better approach.
Example 4: Namespace- and Cluster-Based Routing in Kubernetes Environments
When to use this pattern: Ideal for multi-cluster or multi-namespace environments where you want different escalation levels for staging versus production.
When operating Kubernetes clusters, you'll frequently need to handle alerts differently based on namespace or cluster.
```yaml
groups:
  - name: k8s-pod-alerts
    rules:
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        labels:
          severity: critical
          cluster: production
          # The namespace label is already included in the metric,
          # so it can be used directly in Notification Policy matchers
        annotations:
          summary: "Pod restart detected: {{ $labels.namespace }}/{{ $labels.pod }}"
```
Dynamic label mapping caveat: Rather than generating dynamic values in the `labels` field of an Alert Rule using something like `team: "{{ $labels.namespace }}"`, it's clearer and more predictable to match directly in the Notification Policy using a regex matcher like `namespace =~ prod-.*`, since the `kube_pod_container_status_restarts_total` metric already includes the `namespace` label.
```
[Default Policy]
  contact: slack-general
│
├── [cluster=production, severity=critical]
│     contact: irm-production-critical   ← Routes to IRM escalation chain
│     group_by: [cluster, namespace, alertname]
│     group_wait: 0s
│
├── [cluster=production, severity=warning]
│     contact: slack-production-warning
│
└── [cluster=staging]
      contact: slack-staging             ← Staging goes to Slack only
```
When an alert matches both cluster=production and severity=critical, it's handed off to IRM, which escalates to the on-call engineer. Staging alerts are handled with Slack notifications only, preventing middle-of-the-night phone calls.
If you want to version-control the actual Notification Policy structure as a file, call the GET /api/v1/provisioning/policies endpoint — it returns the full tree as JSON:
```json
{
  "receiver": "slack-general",
  "group_by": ["alertname", "cluster"],
  "routes": [
    {
      "receiver": "irm-production-critical",
      "matchers": ["cluster=\"production\"", "severity=\"critical\""],
      "group_by": ["cluster", "namespace", "alertname"],
      "group_wait": "0s"
    },
    {
      "receiver": "slack-staging",
      "matchers": ["cluster=\"staging\""]
    }
  ]
}
```
Pros and Cons
Summarized based on what our team actually experienced.
Advantages
| Item | Details |
|---|---|
| Flexible routing tree | Hierarchical structure with label matchers lets you declaratively express complex routing by team, severity, and service |
| Single platform integration | Manage dashboards, alerting, on-call, and incidents all within Grafana Cloud IRM |
| Broad channel support | 30+ integrations including Slack, PagerDuty, Microsoft Teams, OpsGenie, and Webhook |
| Personalized notification rules | Each engineer configures their own Default/Important notification preferences, reducing alert fatigue |
| Automated escalation | Chains automatically hand off to the next on-call engineer on no response, preventing missed alerts |
| Alert grouping | Bundles multiple alerts matching the same conditions into a single message, suppressing alert storms |
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| OnCall OSS end-of-life | Self-hosted OnCall OSS was archived on March 24, 2026 and is no longer maintained | Switch to a paid Grafana Cloud IRM plan, or explore Alertmanager with alternative tools |
| IRM is cloud-only | IRM escalation chains cannot be used in fully self-hosted environments | Replace with direct Alertmanager + OpsGenie/PagerDuty integration |
| Routing tree complexity | Debugging becomes harder as the number of policies grows | Use the "Route Preview" feature to validate routing paths for specific label combinations ahead of time |
| Default Policy dependence | Unmatched alerts pile up in the Default Policy and are easily missed | Keep the Default Policy's Contact Point pointing to a valid channel at all times, and audit regularly |
| Label standardization prerequisite | Routing quality depends entirely on label design | Agree on label naming conventions (severity, team, service) as a team beforehand |
| Migration complexity | No official migration tool supports OnCall OSS as a source for migration to IRM | Use the Terraform grafana provider or the OnCall API for manual migration |
The Most Common Mistakes in Practice
- Creating Alert Rules without labels: Without labels, nothing matches in the Notification Policy and all alerts fall to the Default Policy. Make it a habit to define `severity` and `team` labels whenever you create an Alert Rule.
- Enabling the `Continue matching` option indiscriminately: Turning this on causes a single alert to be sent to multiple Contact Points simultaneously. Because it can produce unintended duplicate notifications, it's recommended to enable it only when you have a clear purpose.
- Over-complicating both IRM routing and Alerting routing: It's better to maintain a clear separation, with coarse routing by team in Grafana Alerting and fine-grained handling by severity in IRM. Adding complex conditions in both places makes it hard to determine which side is actually making the final routing decision.
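The first mistake, rules created without labels, is easy to catch mechanically. Here is a minimal Python sketch of a label audit, assuming the rules have already been loaded into dicts shaped like the rule file shown earlier (an `alert` name plus an optional `labels` mapping). The function name `audit_rule_labels`, the required-label list, and the sample rules are my own assumptions, so adapt them to your team's conventions.

```python
REQUIRED_LABELS = ("severity", "team")
ALLOWED_SEVERITIES = {"critical", "warning", "info"}


def audit_rule_labels(rules):
    """Return {alert name: [problems]} for rules that would not route as intended."""
    problems = {}
    for rule in rules:
        labels = rule.get("labels") or {}
        issues = [f"missing label: {name}" for name in REQUIRED_LABELS if not labels.get(name)]
        severity = labels.get("severity")
        if severity and severity not in ALLOWED_SEVERITIES:
            issues.append(f"unexpected severity: {severity}")
        if issues:
            problems[rule["alert"]] = issues
    return problems


# Hypothetical sample rules for illustration.
rules = [
    {"alert": "HighErrorRate", "labels": {"severity": "critical", "team": "backend"}},
    {"alert": "DiskAlmostFull", "labels": {"severity": "high"}},
    {"alert": "CertExpiringSoon"},
]

for alert, issues in audit_rule_labels(rules).items():
    print(alert, issues)
```

Any rule this reports would silently land in the Default Policy, so running a check like this in CI is a cheap way to keep the routing tree honest.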
Closing Thoughts
Grafana Alerting's Contact Point and Notification Policy routing tree, combined with IRM escalation chains, lets you declaratively define the entire flow from the moment an alert fires to the moment an on-call engineer responds — like writing code.
One tip from experience: trying to design the perfect tree from the very beginning is actually what gets you stuck. Starting with the attitude of solving just the single most painful problem first is genuinely faster. I spent two months trying to draw out the entire organization's routing structure all at once and never changed a thing. If you're still in an environment where all alerts pour into a single channel, try going through these steps one at a time.
Three steps you can start right now:
- Start by defining labels: Add `severity: critical | warning | info` and `team: [team name]` labels to your existing Alert Rules. Opening the full Alert Rule list and filling in labels for the ones missing them, one by one, is enough to start.
- Build the Notification Policy tree: In `Alerts & IRM → Alerting → Notification Policies`, create a single `[severity=critical]` child policy under the Default Policy and connect it to a separate Contact Point (such as a Slack `#alerts-critical` channel). This alone immediately isolates critical alerts.
- Connect the IRM escalation chain: In Grafana Cloud IRM, create an `Integrations → Grafana Alerting` integration and register the webhook URL as a Contact Point. Create a simple two-step Escalation Chain ("wait 5 minutes → page backup on-call") and connect it to the critical route. You'll immediately see firsthand how escalation works.
References
Essential reading to start with:
- Notification Policies | Grafana Official Docs — The best official reference for understanding the routing tree concept
- Best Practices for Alert Routing | Grafana Cloud IRM Docs — The source of the coarse/fine-grained separation philosophy
- Best Practices for Escalation Chains | Grafana Cloud IRM Docs — Best practices for designing escalation chains
Additional references:
- Configure Notification Policies | Grafana Official Docs
- Contact Points | Grafana Official Docs
- Configure Contact Points | Grafana Official Docs
- Configure Escalation Chains | Grafana Cloud IRM Docs
- Grafana Alerting Integration for Grafana OnCall | Grafana OnCall Docs
- Migrate from Grafana OnCall OSS to Grafana Cloud IRM | Grafana Cloud Docs
- Grafana OnCall OSS Archival Notice | Grafana OnCall Docs
- Attach the Schedule to the Escalation Chain | Grafana Labs Learning Path