Monitoring Flagger Canary Deployments on Kubernetes: Grafana Dashboard + AlertManager Slack Notification Guide
At 2 AM, the canary deployment was automatically rolled back. No one on the team knew. It wasn't until we arrived at work at 9 AM that we dug through Flagger event logs to trace the cause of the rollback — during those seven hours, some of the traffic had been returning incorrect responses. Flagger automates the canary analysis loop, but without a pipeline to deliver the results immediately to the team, it is only a half-baked tool.
This article covers how to build a structure that enables on-call personnel to detect and respond to canary analysis failures within 30 seconds by directly connecting the observation stack from Flagger → Prometheus → Grafana/AlertManager → Slack. While the target audience is teams already operating Flagger, prerequisites are specified first so that teams configuring the entire stack for the first time can also follow along.
Prerequisites: a Kubernetes cluster, Helm 3, Prometheus, Grafana, AlertManager, Flagger, and Istio (or an NGINX/Traefik ingress) must be installed. To install Prometheus, Grafana, and AlertManager in one step, see the kube-prometheus-stack Helm chart; for Istio, see the official installation guide.
Key Concepts
Overall Data Flow — Start with the Big Picture
Flagger ──(records metrics)──▶ Prometheus
                                  │
                 ┌────────────────┴────────────────┐
                 ▼                                 ▼
             Grafana                         AlertManager
     (real-time dashboard)           (alert grouping & routing)
                                                   │
                                                   ▼
                                   Slack (#deployments-alert)
                              (whole team aware within 30 seconds)

Keep this flow in mind before diving into the detailed settings: Grafana owns visualization, while AlertManager owns notification policy — the two roles are cleanly separated.
Flagger's Canary Analysis Loop
Flagger splits the Deployment into two services: Primary (stable version) and Canary (new version). It evaluates Prometheus metrics at configured intervals to determine whether to increment traffic weights or roll back. Istio's VirtualService converts these weights into actual traffic splitting — the structure is such that when Flagger updates the weight field of VirtualService at intervals, the Istio sidecar reflects this to change the actual packet routing.
Primary (90%) ──▶ real user traffic
Canary  (10%) ──▶ gradually increased (stepWeight: 10)
        └▶ rolled back to 0% immediately if analysis fails

Flagger exposes two key metrics to Prometheus.
| Metric | Meaning |
|---|---|
| `flagger_canary_status` | Canary status code: 0 = Reset, 1 = In Progress, 2 = Success (promoted), 3 = Failure (rolled back) |
| `flagger_canary_weight` | Current share of traffic routed to the canary (%) |
Important: flagger_canary_status values may differ between Flagger versions. The mapping above is based on the official source code; it is recommended to verify the actual metric values directly against the Flagger metrics endpoint (e.g. via kubectl port-forward) before writing alert rules.
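That verification step can be sketched as follows. The port-forward target (`deploy/flagger` on port 8080) and the sample metric lines are assumptions to adjust for your install; the parsing runs against a canned snippet so the logic is reproducible without a cluster:

```shell
# In a real cluster you would first expose Flagger's metrics endpoint:
#   kubectl -n flagger-system port-forward deploy/flagger 8080:8080 &
#   curl -s localhost:8080/metrics > /tmp/flagger-metrics.txt
# Hypothetical sample output so the parsing step below is reproducible:
cat > /tmp/flagger-metrics.txt <<'EOF'
flagger_canary_status{name="my-app",namespace="default"} 3
flagger_canary_weight{name="my-app",namespace="default"} 0
EOF

# Extract the numeric status code for my-app (3 = failure/rollback in this sample)
status=$(grep 'flagger_canary_status{name="my-app"' /tmp/flagger-metrics.txt | awk '{print $2}')
echo "canary status: $status"
```

If the value you see for a rolled-back canary is not 3, use the observed value in your alert rule instead.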
For Prometheus to scrape this metric, a ServiceMonitor for the Flagger Pod is required. It is created automatically when the Flagger Helm chart is installed with --set serviceMonitor.enabled=true, or it can be applied manually as shown below.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: flagger
  namespace: flagger-system
  labels:
    release: kube-prometheus-stack  # adjust to match the Prometheus Operator's selector
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: flagger
  namespaceSelector:
    matchNames:
      - flagger-system
  endpoints:
    - port: http
      path: /metrics
      interval: 15s

Without this ServiceMonitor, Flagger never appears in the Prometheus target list, so the flagger_canary_status metric cannot be queried at all. If the Grafana dashboard shows no data, this is the first thing to check.
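That first check can be done against the Prometheus targets API. A sketch under assumed names — the service `prometheus-operated` in the `monitoring` namespace and the job label `flagger` are assumptions; the sample response below is canned so the check itself runs anywhere:

```shell
# In a real cluster:
#   kubectl -n monitoring port-forward svc/prometheus-operated 9090:9090 &
#   curl -s 'localhost:9090/api/v1/targets?state=active' > /tmp/targets.json
# Hypothetical sample response (trimmed) so the check is reproducible:
cat > /tmp/targets.json <<'EOF'
{"status":"success","data":{"activeTargets":[{"labels":{"job":"flagger","namespace":"flagger-system"},"health":"up"}]}}
EOF

# If this prints nothing, the ServiceMonitor was not picked up
grep -q '"job":"flagger"' /tmp/targets.json && echo "flagger target registered"
```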
Canary Deployment: a deployment strategy that exposes a new version to a small share of users first and increases traffic gradually only while metrics such as error rate and response time stay within thresholds. The name derives from the "canary in a coal mine" — an early warning for danger.
AlertManager: the notification-routing hub of the Prometheus ecosystem. It groups recurring alerts, silences alerts during chosen time windows, and routes alerts to different channels based on conditions.
Practical Application
Example 1: Visualizing Canary Status with Grafana Dashboard
Step 1 — Install Grafana for Flagger
helm upgrade -i flagger-grafana flagger/grafana \
--create-namespace \
--namespace=flagger-system \
  --set url=http://prometheus:9090

Step 2 — Import the Official Dashboard (ID: 15158)
Grafana UI → Dashboards → Import → Enter ID 15158 → Select Prometheus data source.
If there is no data: check, in order, that the Prometheus data source URL is correct (use the in-namespace service name, e.g. http://prometheus-operated:9090) and that the ServiceMonitor has been applied.
Step 3 — Core PromQL Queries
# Current canary status (3 = failed / rolled back)
flagger_canary_status{name="my-app", namespace="default"}

# Current traffic weight (%)
flagger_canary_weight{name="my-app", namespace="default"}

# Request success rate (excluding 5xx)
sum(rate(http_requests_total{status!~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100

# Elapsed canary analysis time (seconds)
flagger_canary_duration_seconds{name="my-app"}

Example 2: Prometheus Alert Rule — Rollback Detection
Apply the YAML below as the PrometheusRule CRD, or add it to additionalPrometheusRulesMap in Helm values.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: flagger-alerts
  namespace: flagger-system
  labels:
    release: kube-prometheus-stack  # adjust to match the Prometheus Operator's selector
spec:
  groups:
    - name: flagger.rules
      rules:
        - alert: CanaryRollback
          expr: flagger_canary_status == 3
          for: 0m
          labels:
            severity: critical
          annotations:
            summary: "Canary rollback occurred: {{ $labels.name }}"
            description: >
              Canary analysis for {{ $labels.name }} in namespace
              {{ $labels.namespace }} failed and triggered a rollback
        - alert: CanaryProgressing
          expr: flagger_canary_status == 1
          for: 1m
          labels:
            severity: info
          annotations:
            summary: "Canary deployment in progress: {{ $labels.name }}"
        - alert: CanaryWeightHigh
          expr: flagger_canary_weight > 50
          for: 0m
          labels:
            severity: warning
          annotations:
            summary: "Canary traffic above 50%: {{ $labels.name }} ({{ $value }}%)"

Apply with kubectl apply -f flagger-alerts.yaml. If you use kube-prometheus-stack, the Prometheus Operator picks up the PrometheusRule automatically.
Example 3: AlertManager → Slack Routing (Recommended)
Issuing Slack Incoming Webhook URL: Slack Workspace → Apps → Incoming WebHooks → Add → Select Channel → Copy Webhook URL.
There are two ways to apply AlertManager settings to Kubernetes.
Method A — Apply with kube-prometheus-stack Helm values (Recommended)
# values.yaml
alertmanager:
  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: ['alertname', 'namespace']
      group_wait: 10s
      group_interval: 10m
      repeat_interval: 1h
      receiver: 'slack-default'
      routes:
        - match:
            alertname: CanaryRollback
            severity: critical
          receiver: 'slack-canary-rollback'
          continue: false
    receivers:
      - name: 'slack-canary-rollback'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/T.../B.../XXXX'
            channel: '#deployments-alert'
            send_resolved: true
            title: ':rotating_light: Canary rollback occurred'
            text: |
              *App:* {{ .CommonLabels.name }}
              *Namespace:* {{ .GroupLabels.namespace }}
              *Status:* {{ .CommonAnnotations.description }}
            color: 'danger'
      - name: 'slack-default'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/T.../B.../XXXX'
            channel: '#deployments'
            send_resolved: true
            title: '{{ .GroupLabels.alertname }}'
            text: '{{ .CommonAnnotations.summary }}'

helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  -f values.yaml

Method B — Apply the AlertManager Secret Directly
kubectl create secret generic alertmanager-kube-prometheus-stack-alertmanager \
--from-file=alertmanager.yaml=./alertmanager.yaml \
--namespace monitoring \
  --dry-run=client -o yaml | kubectl apply -f -

For security, it is better to specify api_url separately per receiver. Because global.slack_api_url shares one webhook URL across every receiver, it becomes hard to manage once each channel uses its own webhook.
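Before relying on either method, it helps to confirm the webhook itself works. A minimal sketch — the URL below is a placeholder to replace with the one copied from Slack, and the actual POST is left commented out so nothing is sent accidentally:

```shell
# Placeholder webhook URL — substitute your real Incoming Webhook URL
WEBHOOK_URL='https://hooks.slack.com/services/T.../B.../XXXX'

# Slack Incoming Webhooks accept a JSON payload with a "text" field
PAYLOAD='{"text":":white_check_mark: AlertManager webhook connectivity test"}'

# Uncomment to actually post; a working webhook responds with the body "ok"
# curl -s -X POST -H 'Content-type: application/json' --data "$PAYLOAD" "$WEBHOOK_URL"
echo "$PAYLOAD"
```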
Example 4: Directly Integrating Flagger with Slack Without AlertManager (Simple Alternative)
This example is an alternative to Example 3. It is used for small teams that do not operate AlertManager or when only simple notifications are needed. If you set up Example 3 and Example 4 simultaneously, two Slack messages will be sent at once, so you must choose only one.
Step 1 — Create AlertProvider CRD
Save the Slack Webhook URL as a Secret, then create AlertProvider.
kubectl create secret generic slack-webhook-secret \
  --from-literal=address='https://hooks.slack.com/services/T.../B.../XXXX' \
  --namespace flagger-system

apiVersion: flagger.app/v1beta1
kind: AlertProvider
metadata:
  name: slack-provider
  namespace: flagger-system
spec:
  type: slack
  channel: deployments
  secretRef:
    name: slack-webhook-secret

Step 2 — Connect Notifications to the Canary CRD
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: my-app
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500
        interval: 30s
    alerts:
      - name: "rollback-alert"
        severity: error
        providerRef:
          name: slack-provider
          namespace: flagger-system

With severity: error, notifications fire only on failure events. Change it to info to receive all events — deployment start, completion, and rollback.
Pros and Cons Analysis
Advantages
| Item | Content |
|---|---|
| Automatic Rollback | Minimize downtime with unattended rollback when metric thresholds are exceeded |
| Gradual Risk Exposure | Validate with a small number of users first using stepWeight, then scale |
| Multiple Metric Providers | Integrate Prometheus, Datadog, and CloudWatch via MetricTemplate CRD |
| GitOps Friendly | Native Integration with Flux/Argo CD |
| Multi-channel Notifications | Supports Slack, Teams, Discord, and Rocket.Chat |
Disadvantages and Precautions
| Item | Content | Response Plan |
|---|---|---|
| Prometheus Dependency | The analysis loop cannot operate without Prometheus | Install kube-prometheus-stack via Helm in one step |
| Minimum Traffic Requirement | Metrics are unreliable when traffic is too low | Generate artificial traffic during the analysis window with flagger-loadtester |
| Short-term Data Retention | The Flagger-bundled Prometheus keeps data for only 2 hours by default | Extend long-term retention with Thanos or Cortex |
| Risk of Duplicate Notifications | Configuring Example 3 and Example 4 together sends duplicate notifications | Use only one of them, or suppress with inhibit_rules |
| Service Mesh Required | Traffic splitting requires Istio, Linkerd, or an ingress controller | Can be configured without a mesh using NGINX or Traefik |
| RBAC Permissions | ServiceAccount permission error occurs depending on the cluster | Check rbac.create: true in Flagger Helm chart |
The Most Common Mistakes in Practice
- Writing alert rules without verifying the `flagger_canary_status` values — status codes can vary by Flagger version. After deployment, access Flagger's metrics endpoint directly via `kubectl port-forward` and confirm the actual values before putting them in an alert rule. With a wrong value, no notification arrives even when a rollback occurs.
- Configuring Example 3 and Example 4 together — on a rollback, two Slack messages arrive at once. Suppress one with `inhibit_rules` or use only one of the two.
- Passing canary analysis with no traffic — if there is no traffic during an off-hours deployment, the success rate reads 100% and the analysis passes. Generate artificial traffic during the analysis window with `flagger-loadtester`, or wire a load test into the analysis loop via the `webhooks` configuration.
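The no-traffic pitfall is commonly addressed with Flagger's webhooks mechanism, which triggers a load generator during each analysis interval. A sketch under assumed names — flagger-loadtester installed in the `test` namespace and a canary service `my-app-canary` in `production` on port 80; adjust both to your environment:

```yaml
# Added under spec.analysis of the Canary resource
webhooks:
  - name: load-test
    url: http://flagger-loadtester.test/
    timeout: 5s
    metadata:
      # hey sends ~10 req/s for 1 minute at the canary service,
      # giving the success-rate and duration metrics real samples
      cmd: "hey -z 1m -q 10 -c 2 http://my-app-canary.production:80/"
```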
In Conclusion
Flagger Canary Analysis is truly complete when an on-call manager receives a 4 AM deployment failure via Slack within 30 seconds and can track the cause by viewing traffic weights and success rate graphs on a Grafana dashboard.
3 Steps to Start Right Now:
- Install Grafana with `helm upgrade -i flagger-grafana flagger/grafana --create-namespace --namespace=flagger-system --set url=http://prometheus:9090` and import dashboard ID `15158`. If there is no data, first check the Prometheus URL and whether the `ServiceMonitor` has been applied.
- Apply a `PrometheusRule` with a `CanaryRollback` alert on the `flagger_canary_status == 3` condition, set to `for: 0m` so a notification fires immediately on rollback. Verify the actual metric values before applying.
- Add the `slack-canary-rollback` receiver to the AlertManager values, route it to the `#deployments-alert` channel, and apply with `helm upgrade`.
Next Post: Integrating external APM metrics such as Datadog and New Relic into canary analysis using the Flagger MetricTemplate CRD