Burn Rate SLO-Based Canary Auto-Rollback with Kubernetes Argo Rollouts AnalysisTemplate and Datadog
Have you ever been jolted awake at 3 AM by a PagerDuty alert? I have. More than once. Every time, I'd dig through logs and eventually land on the same thought: "We were doing a canary deployment — why didn't we catch this at deploy time?" The error rate was under 1%, so as far as the system was concerned, nothing was wrong. An error rate of 0.9% had been holding steady for 30 minutes, and nobody — no system — had intervened.
The cause was clear, but the solution demanded more nuanced judgment than I expected. Instead of a fixed threshold like "error rate below 1%", what I needed was a pattern that automates canary promotion and rollback based on how quickly the error budget is being consumed — the Burn Rate. After switching to this approach, instead of getting woken up by a phone call, I'd check Slack the next morning to find messages like "Canary auto-rolled back, no impact" — and that happened more than once.
This post is aimed at developers with a basic understanding of Kubernetes. No prior SLO knowledge required. I'll walk through the process step by step: connecting Argo Rollouts' AnalysisTemplate to Datadog and expressing SLO-based decision logic as Kubernetes resources. Along with YAML examples inspired by real-world patterns, I'll be honest about the pitfalls and gotchas of this approach.
Core Concepts
SLO, Error Budget, Burn Rate — Why They Work as Deployment Decision Criteria
When I first encountered these concepts, my reaction was "isn't this just error rate?" Once I understood burn rate, my thinking changed completely.
Let's start with the terminology.
| Concept | Meaning | Example |
|---|---|---|
| SLI (Service Level Indicator) | The measurement metric | HTTP 5xx rate, p95 latency |
| SLO (Service Level Objective) | Target value for the SLI | 99.9% availability over a 30-day window |
| Error Budget | Allowable error margin of 1 - SLO | 30 days × 0.1% = 43.2 minutes |
| Burn Rate | Speed at which the error budget is consumed | Burn rate 1 = budget exhausted after 30 days |
The burn rate formula is straightforward.
Burn Rate = Current Error Rate / (1 - SLO)
If the SLO is 99.9%, the allowable error rate is 0.1% (= 0.001). If the current error rate is 0.1%, the burn rate is 1.0, which means the budget will be exhausted in exactly 30 days. What if the error rate is 1.44%?
Burn Rate = 0.0144 / 0.001 = 14.4
A burn rate of 14.4 means you're consuming the error budget 14.4 times faster than the sustainable pace. At that rate, the entire 30-day error budget would be gone in roughly 2 days (30 days ÷ 14.4 ≈ 2.1 days). If you observe this figure during a canary deployment, it's a signal that immediate action is needed.
Why burn rate instead of a simple error rate? Whether an error rate of 1% is "dangerous" depends on the service. For a service with an SLO of 99%, 1% is a burn rate of 1.0 — normal consumption. For a service with an SLO of 99.99%, 1% is a burn rate of 100 — immediate danger. Burn rate automatically captures this context.
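To make the arithmetic concrete, here is a minimal Python sketch of the formulas above. The function names are my own and not part of any library, and the 30-day window simply mirrors the SLO example in the table.

```python
import math


def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed 'bad' minutes in the SLO window (1 - SLO of the window)."""
    return window_days * 24 * 60 * (1 - slo)


def burn_rate(error_rate: float, slo: float) -> float:
    """Burn Rate = current error rate / (1 - SLO)."""
    return error_rate / (1 - slo)


def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Days until the whole budget is gone if this burn rate persists."""
    return math.inf if rate <= 0 else window_days / rate


if __name__ == "__main__":
    rate = burn_rate(0.0144, slo=0.999)
    print(f"budget: {error_budget_minutes(0.999):.1f} min")     # budget: 43.2 min
    print(f"burn rate: {rate:.1f}")                              # burn rate: 14.4
    print(f"exhausted in: {days_to_exhaustion(rate):.1f} days")  # exhausted in: 2.1 days
```

The same 1% error rate gives a burn rate of about 1 at a 99% SLO and about 100 at a 99.99% SLO, which is the context that a fixed error-rate threshold cannot express.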
What AnalysisTemplate Does
AnalysisTemplate is a Kubernetes CRD that enables Argo Rollouts to make decisions during a canary deployment — "keep going, pause, or roll back" — based on external metrics.
Rollout Controller
│ (instantiates AnalysisTemplate)
▼
AnalysisRun (analysis execution)
│ (queries Datadog Metrics API v2)
▼
Datadog
│ (returns burn rate calculation result)
▼
successCondition / failureCondition evaluation
│
├─ Successful → auto-promote to next step
├─ Failed → auto-rollback
└─ Inconclusive → pause (awaiting manual intervention)
Of the three possible outcomes, Inconclusive is the important one. It's returned when both successCondition and failureCondition are defined and a measurement satisfies neither; the Rollout automatically pauses and waits for manual review. It's a safety net that reduces the risk of full automation.
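The three-way verdict can be modeled in a few lines of Python. This is a simplified sketch of the behaviour as I understand it from the Argo Rollouts documentation, not the controller's actual code; `classify` and the example thresholds (2.0 and 14.4) are illustrative choices.

```python
from typing import Callable, Optional

Condition = Optional[Callable[[float], bool]]


def classify(result: float, success: Condition = None, failure: Condition = None) -> str:
    """Judge a single measurement as Successful, Failed, or Inconclusive."""
    if success is None and failure is None:
        return "Successful"  # nothing to evaluate against
    ok = success(result) if success is not None else None
    bad = failure(result) if failure is not None else None
    if bad is None:
        bad = not ok   # only a success condition: anything not OK is a failure
    if ok is None:
        ok = not bad   # only a failure condition: anything not bad is a success
    if bad:
        return "Failed"
    if not ok:
        return "Inconclusive"  # both conditions defined, neither satisfied
    return "Successful"


if __name__ == "__main__":
    def safe(r: float) -> bool:
        return r <= 2.0

    def dangerous(r: float) -> bool:
        return r > 14.4

    for value in (1.0, 9.0, 20.0):
        print(value, classify(value, safe, dangerous))
```

With only a success condition, the middle value would be judged Failed instead of Inconclusive, which is why the gap between the two conditions matters.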
How the Datadog Connection Works
For Argo Rollouts to communicate with Datadog, you need to prepare an API key and App key as a Kubernetes Secret first.
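apiVersion: v1
kind: Secret
metadata:
  name: datadog-api-key
  namespace: argo-rollouts
stringData:
  api-key: "<DATADOG_API_KEY>"
  app-key: "<DATADOG_APP_KEY>"
Since the manifest holds the keys in plain text under stringData (and a Secret is only base64-encoded once it's in the cluster), integrating with Sealed Secrets or External Secrets Operator is recommended in production. Two details are easy to miss. First, the AnalysisTemplates in this post don't set secretRef, and as far as I know the Datadog provider then looks for a Secret named datadog in the controller's namespace, so either name the Secret accordingly or point the provider's secretRef at datadog-api-key. Second, it's good practice to set namespaced: true on that secretRef so the Secret is read from the AnalysisRun's own namespace; it makes things much easier when you expand to a multi-tenant environment.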
The Full Picture at a Glance
Before diving into the code examples, it helps to see the overall picture first — how AnalysisTemplate, Datadog, and the Rollout Controller connect together.
┌─────────────────────────────────────────────────────────────┐
│                       Git Repository                        │
│  ┌──────────────────┐    ┌─────────────────────────────┐    │
│  │ Rollout.yaml     │    │ AnalysisTemplate.yaml       │    │
│  └────────┬─────────┘    └──────────────┬──────────────┘    │
└───────────┼─────────────────────────────┼───────────────────┘
            │ GitOps deploy               │
            ▼                             ▼
┌─────────────────────────────────────────────────────────────┐
│                     Kubernetes Cluster                      │
│                                                             │
│  ┌──────────────────┐ instantiate ┌───────────────────────┐ │
│  │ Rollout          │ ──────────→ │ AnalysisRun           │ │
│  │ Controller       │ ←────────── │ (metric query + eval) │ │
│  │                  │   verdict   └───────────┬───────────┘ │
│  └──────────────────┘                         │             │
│         │                             Datadog API query     │
│  ┌──────▼───────────┐                         │             │
│  │ Canary Pods      │                         ▼             │
│  │ (5%→20%→50%→100%)│             ┌──────────────────────┐  │
│  └──────────────────┘             │  Datadog Metrics v2  │  │
│                                   └──────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
AnalysisTemplate is a "template" that defines how to query metrics and what criteria to evaluate against, while AnalysisRun is the actual execution instance created for each deployment. The Rollout Controller receives the verdict from the AnalysisRun and decides whether to advance the canary steps.
Practical Application
Basic — Validating Canary Quality with Error Rate
If you're using Argo Rollouts + Datadog for the first time, it's worth starting with a simple error rate check before getting into complex burn rate calculations. This is the basic pattern: query a 5-minute error rate using the Datadog Metrics API v2 and mark it as a failure if it exceeds 1%.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: datadog-error-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: error-rate
      interval: 5m
      successCondition: default(result, 0) <= 0.01
      failureLimit: 3
      provider:
        datadog:
          apiVersion: v2
          queries:
            errors: sum:requests.errors{service:{{args.service-name}}}.as_count()
            total: sum:requests{service:{{args.service-name}}}.as_count()
          formula: "moving_rollup(errors, 300, 'sum') / moving_rollup(total, 300, 'sum')"
| Setting | Description |
|---|---|
| interval: 5m | Run a query against Datadog every 5 minutes |
| default(result, 0) | Treat nil query results (no traffic) as 0. Without this, the analysis itself is treated as an error in the early stages of a canary |
| failureLimit: 3 | Tolerates up to 3 failed measurements; the metric fails, and the rollout aborts, on the 4th. The count is cumulative over the run, not consecutive. Absorbs transient spikes |
| moving_rollup(errors, 300, 'sum') | Sum over a rolling 5-minute (300-second) window |
You may notice there's no failureCondition. In this case, Argo Rollouts counts any measurement that doesn't satisfy successCondition as a failure, and triggers a rollback once the number of failed measurements exceeds failureLimit. Without failureCondition, there's no Inconclusive state: each measurement is a binary pass or fail.
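The role of default(result, 0) and the binary verdict can be sketched in Python. This only imitates the idea; the helper names are mine, and None stands in for an empty Datadog result.

```python
from typing import Optional


def default(value: Optional[float], fallback: float) -> float:
    """Substitute a fallback when the query produced no value."""
    return fallback if value is None else value


def error_rate(errors: float, total: float) -> Optional[float]:
    """Error ratio for the window; None models 'no data' when there was no traffic."""
    if total == 0:
        return None
    return errors / total


def judge(result: Optional[float], threshold: float = 0.01) -> str:
    """Binary verdict when only a success condition exists."""
    return "Successful" if default(result, 0) <= threshold else "Failed"


if __name__ == "__main__":
    print(judge(error_rate(0, 0)))      # no traffic yet, treated as 0
    print(judge(error_rate(5, 1000)))   # 0.5% error rate
    print(judge(error_rate(20, 1000)))  # 2% error rate
```

Treating "no data" as 0 is a deliberate trade-off: it keeps the early canary from erroring out, but it also means a canary that receives no traffic at all will pass this check.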
Core — SLO Auto-Rollback Based on Burn Rate
Now for the main event. The failureCondition threshold value was what I agonized over the most in this pattern, and ultimately following the Google SRE Workbook standard turned out to be the most sensible approach. The burn rate is calculated in real time using the Datadog Formula by dividing the current error rate by the allowable error rate.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: slo-burn-rate-check
spec:
  args:
    - name: service-name
    - name: slo-error-rate  # e.g., "0.001" (allowable error rate for SLO 99.9%)
  metrics:
    - name: error-budget-burn-rate
      interval: 5m
      successCondition: result <= 2.0   # Burn rate ≤ 2x → safe
      failureCondition: result > 14.4   # Burn rate > 14.4 → failed measurement
      failureLimit: 1                   # one failed measurement is tolerated; use 0 to roll back on the first
      provider:
        datadog:
          apiVersion: v2
          queries:
            errors: sum:requests.errors{service:{{args.service-name}}}.as_count()
            total: sum:requests{service:{{args.service-name}}}.as_count()
          formula: >-
            (moving_rollup(errors, 300, 'sum') / moving_rollup(total, 300, 'sum'))
            / {{args.slo-error-rate}}
The range between successCondition and failureCondition, that is, a burn rate between 2.0 and 14.4, is treated as Inconclusive. The Rollout automatically pauses and waits for manual intervention. This can serve as a safety net between full automation and human review.
Where does 14.4 come from? It's a value derived from the Google SRE Workbook. If a burn rate of 14.4 persists for one hour, approximately 2% of the 30-day error budget is consumed (14.4 ÷ 720 hours = 2%). While the absolute consumption isn't huge, observing this speed during the short observation window of a canary deployment is a signal that an underlying problem could impact all of production. It should be treated as an alert level that demands immediate action.
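The 2% figure is easy to verify. The helper below is a small illustrative sketch (the name `budget_consumed` is mine) that converts a burn rate sustained for some number of hours into the fraction of a 30-day budget it uses up.

```python
def budget_consumed(burn_rate: float, hours: float, window_days: int = 30) -> float:
    """Fraction of the window's error budget used if burn_rate lasts for `hours`."""
    return burn_rate * hours / (window_days * 24)


if __name__ == "__main__":
    # Burn rate 14.4 held for one hour of a 720-hour window
    print(f"{budget_consumed(14.4, 1):.1%}")   # 2.0%
    # Burn rate 9.0 held for 30 minutes, as in the incident from the introduction
    print(f"{budget_consumed(9.0, 0.5):.3%}")  # 0.625%
```

The second number shows why a canary window feels deceptively safe: half an hour at a burn rate of 9 costs well under 1% of the budget, yet the same rate left running in production would exhaust it in a little over three days.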
Recommended Pattern — Multi-Window Burn Rate to Minimize False Positives
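Honestly, I started out running on a single-window burn rate too. After watching several perfectly fine deployments roll back because of a 10-minute platform hiccup, I switched to this pattern. The approach is to watch a long window (5 minutes) and a short window (1 minute) side by side, so that a sustained problem shows up in both while a brief blip mostly affects the short one. One caveat: Argo Rollouts evaluates each metric on its own and fails the AnalysisRun as soon as any single metric fails, so the two metrics below are not a strict "both must exceed" gate. The tolerance for noise comes from each metric's failureLimit.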
One important point: count is omitted in the example below. If you set something like count: 5, combined with interval: 1m, the analysis terminates after 5 minutes. That completely defeats the purpose of using Background Analysis to monitor the entire deployment window. Omitting count keeps the analysis running until the Rollout completes.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: multiwindow-burn-rate
spec:
  args:
    - name: service
    - name: slo-threshold
  metrics:
    # Long window: burn rate based on 5-minute aggregation
    - name: burn-rate-5m
      interval: 1m
      successCondition: default(result, 0) <= 2.0
      failureCondition: default(result, 0) > 14.4
      failureLimit: 1
      provider:
        datadog:
          apiVersion: v2
          queries:
            e: sum:http.errors{service:{{args.service}},version:canary}.as_count()
            r: sum:http.requests{service:{{args.service}},version:canary}.as_count()
          formula: >-
            (moving_rollup(e, 300, 'sum') / moving_rollup(r, 300, 'sum'))
            / {{args.slo-threshold}}
    # Short window: burn rate based on 1-minute aggregation (fast spike detection)
    - name: burn-rate-1m
      interval: 1m
      successCondition: default(result, 0) <= 14.4
      failureCondition: default(result, 0) > 14.4
      failureLimit: 2
      provider:
        datadog:
          apiVersion: v2
          queries:
            e: sum:http.errors{service:{{args.service}},version:canary}.as_count()
            r: sum:http.requests{service:{{args.service}},version:canary}.as_count()
          formula: >-
            (moving_rollup(e, 60, 'sum') / moving_rollup(r, 60, 'sum'))
            / {{args.slo-threshold}}
Pay attention to the explicit version:canary tag in the queries. When a canary is only handling 5% of total traffic, using global metrics without version differentiation lets the stable old-version traffic from the other 95% dilute the burn rate, masking real problems. By tagging canary pods with a version: canary label and configuring Datadog to collect this as a tag, you can accurately measure the burn rate for the canary slice alone.
You'll also notice that burn-rate-1m uses the same threshold (14.4) for both successCondition and failureCondition. This makes the verdict binary, with no Inconclusive range, but failureLimit: 2 tolerates two failed measurements and only fails the metric on the third, which filters out most one-minute noise. Keep in mind that the count is cumulative over the AnalysisRun rather than consecutive. If you need finer-grained control, you could explicitly create an Inconclusive range with something like successCondition: result <= 10.0.
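To see how differently the two windows react to the same traffic, here is a small Python sketch. It works on made-up per-minute (errors, requests) counts for the canary and is only an approximation of what a rolling sum over 60 or 300 seconds would return; the function name and sample numbers are hypothetical.

```python
from typing import List, Tuple

Minute = Tuple[int, int]  # (errors, requests) observed in one minute


def window_burn_rate(per_minute: List[Minute], minutes: int, slo_error_rate: float) -> float:
    """Burn rate over the most recent `minutes` samples."""
    recent = per_minute[-minutes:]
    errors = sum(e for e, _ in recent)
    requests = sum(r for _, r in recent)
    if requests == 0:
        return 0.0  # same spirit as default(result, 0)
    return (errors / requests) / slo_error_rate


if __name__ == "__main__":
    slo_error_rate = 0.001  # SLO 99.9%
    blip = [(0, 1000)] * 4 + [(30, 1000)]  # one bad minute after four clean ones
    sustained = [(20, 1000)] * 5           # five bad minutes in a row
    for name, series in (("blip", blip), ("sustained", sustained)):
        short = window_burn_rate(series, 1, slo_error_rate)
        long_ = window_burn_rate(series, 5, slo_error_rate)
        print(f"{name}: 1m={short:.1f} 5m={long_:.1f}")
```

In the blip case the one-minute window jumps to roughly 30 while the five-minute window only reaches about 6, inside the 2.0 to 14.4 band. In the sustained case both windows sit near 20, well past 14.4.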
Connecting to a Rollout — Monitoring the Full Window with Background Analysis
Here's how to attach the AnalysisTemplate you built to an actual Rollout. Using Background Analysis keeps the analysis running continuously throughout all canary steps.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-service
spec:
  replicas: 10
  strategy:
    canary:
      analysis:
        templates:
          - templateName: multiwindow-burn-rate
        startingStep: 1  # analysis begins immediately after setWeight: 5
        args:
          - name: service
            value: payment-service
          - name: slo-threshold
            value: "0.001"  # SLO 99.9%
      steps:
        - setWeight: 5
        - pause: {duration: 10m}
        - setWeight: 20
        - pause: {duration: 10m}
        - setWeight: 50
        - pause: {duration: 15m}
        - setWeight: 100
Setting startingStep: 1 means the analysis begins immediately after the first setWeight. You're effectively monitoring SLO impact in real time from when canary traffic is at 5%.
Checking Status During Deployment — Debugging AnalysisRun
The first place people get stuck when adopting this pattern is "something went wrong, but where do I look?" I spent a long time lost here myself when I first started.
# List AnalysisRuns and check current status
kubectl get analysisrun -n <namespace>
# View details for a specific AnalysisRun
kubectl describe analysisrun <analysisrun-name> -n <namespace>
# Extract only the status field
kubectl get analysisrun <analysisrun-name> -n <namespace> -o json \
  | jq '.status | {phase, message, metricResults}'
status.phase will typically be one of Pending, Running, Successful, Failed, Error, or Inconclusive. status.metricResults shows the most recent measured values and verdict for each metric. If the state is Inconclusive, checking which threshold range the measured value falls between, and whether the Datadog query is returning nil, will usually surface the cause.
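If you would rather post-process that JSON in a script, a sketch along these lines can condense it. The field names (metricResults, measurements, value, failed, inconclusive) are what I have seen in AnalysisRun status output, so check them against your own cluster; the SAMPLE document is invented for illustration.

```python
from typing import Any, Dict


def summarize(analysis_run: Dict[str, Any]) -> Dict[str, Any]:
    """Reduce an AnalysisRun object to its phase and one line per metric."""
    status = analysis_run.get("status", {})
    metrics = []
    for metric in status.get("metricResults", []):
        measurements = metric.get("measurements", [])
        metrics.append({
            "name": metric.get("name"),
            "phase": metric.get("phase"),
            "failed": metric.get("failed", 0),
            "inconclusive": metric.get("inconclusive", 0),
            # value of the most recent measurement, if any
            "last_value": measurements[-1].get("value") if measurements else None,
        })
    return {"phase": status.get("phase"), "message": status.get("message"), "metrics": metrics}


# Hypothetical AnalysisRun excerpt, shaped like `kubectl get analysisrun -o json`
SAMPLE = {
    "status": {
        "phase": "Inconclusive",
        "metricResults": [
            {
                "name": "error-budget-burn-rate",
                "phase": "Inconclusive",
                "inconclusive": 1,
                "measurements": [{"phase": "Inconclusive", "value": "9"}],
            }
        ],
    }
}

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

In practice you would load the dictionary with json.load from the kubectl output instead of using SAMPLE.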
Pros and Cons
Advantages
| Item | Detail |
|---|---|
| Automated safety net | Automatically rolls back on SLO violations without human intervention, reducing MTTR |
| Business impact alignment | Decisions are based on error budget consumption speed rather than raw technical metrics, making it directly tied to business SLAs |
| Reduced false positives | The multi-window burn rate pattern reduces unnecessary rollbacks caused by transient spikes |
| Gradual risk exposure | Stepwise traffic shifting at 5% → 20% → 50% minimizes blast radius when issues occur |
| GitOps integration | AnalysisTemplate is managed as code, making deployment policy auditing and reproduction possible |
Disadvantages and Caveats
| Item | What's the problem | Mitigation |
|---|---|---|
| Cold start | At the start of a canary, low request counts make burn rate statistics unstable | Use the default(result, 0) function and allow sufficient initial pause |
| Low-traffic services | A single error request can calculate as 100% error rate | Set failureLimit generously or add a minimum request count condition |
| Canary slice isolation | Without a service mesh, isolating metrics for canary pods alone is difficult | Combining with a service mesh like Istio or Linkerd is recommended |
| Datadog query cost | A Datadog API call is made on every analysis interval | Avoid setting the interval too short |
| SLO window mismatch | There's a gap in statistical reliability between a 30-day SLO window and a 15-minute canary observation window | Treat burn rate judgments as signals rather than absolute criteria, and leverage the Inconclusive range |
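The "minimum request count" mitigation for low-traffic services can be illustrated with a short Python sketch. The function name and the default of 100 requests are hypothetical choices; how you would express the same guard in a Datadog query depends on your metrics.

```python
from typing import Optional


def guarded_burn_rate(
    errors: int,
    requests: int,
    slo_error_rate: float,
    min_requests: int = 100,
) -> Optional[float]:
    """Burn rate, or None when the sample is too small to judge."""
    if requests < min_requests:
        return None  # not enough traffic: abstain instead of reporting a wild ratio
    return (errors / requests) / slo_error_rate


if __name__ == "__main__":
    print(guarded_burn_rate(1, 1, 0.001))     # a single failed request: abstain
    print(guarded_burn_rate(2, 1000, 0.001))  # enough traffic: a real burn rate
```

Without the guard, one failed request out of one would read as a 100% error rate and a burn rate of about 1000 at a 99.9% SLO.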
The Most Common Mistakes in Practice
- Omitting the default() function: When a Datadog query returns an empty result (nil), the successCondition expression itself errors out, and the AnalysisRun ends without a usable verdict (typically as Error). The first 5 minutes of a canary deployment, when traffic is thinnest, is the highest-risk window for this. It's best to apply the default(result, 0) pattern to all metrics by default.
- Misconfiguring count so the analysis ends too early: Using count: 5 together with interval: 1m causes the analysis to terminate after 5 minutes. This completely contradicts the goal of using Background Analysis to monitor the entire deployment window. Omitting count keeps the analysis running until the Rollout completes.
- Judging canary quality using global metrics: When a canary is only handling 5% of total traffic, the stable old-version traffic from the other 95% dilutes the burn rate. Explicitly including a tag like version:canary in Datadog queries, or clearly separating traffic with a service mesh, is the most reliable way to avoid this pitfall.
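The dilution effect in the last item is worth working through with numbers. The sketch below assumes requests are split exactly in proportion to the canary weight, which real traffic only approximates, and the error rates are made up for illustration.

```python
def blended_error_rate(canary_weight: float, canary_rate: float, stable_rate: float) -> float:
    """Error rate seen by a query that does not separate canary from stable."""
    return canary_weight * canary_rate + (1 - canary_weight) * stable_rate


if __name__ == "__main__":
    slo_error_rate = 0.001  # SLO 99.9%
    canary_rate = 0.02      # canary fails 2% of its requests
    stable_rate = 0.0005    # stable version fails 0.05%
    blended = blended_error_rate(0.05, canary_rate, stable_rate)
    print(f"canary-only burn rate: {canary_rate / slo_error_rate:.1f}")
    print(f"global burn rate:      {blended / slo_error_rate:.2f}")
```

Here the canary alone burns budget at about 20 times the sustainable pace, yet the global figure comes out near 1.5 and would pass a burn rate <= 2.0 check.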
Closing Thoughts
Thinking back to that 3 AM situation I opened with: if burn rate-based rollback had been in place, that alert probably never would have come. With the error rate holding steady at 0.9% for 30 minutes against a 99.9% SLO, the burn rate would have been 9.0: not high enough to trigger failureCondition: result > 14.4, but not satisfying successCondition: result <= 2.0 either, which makes it an Inconclusive state. The Rollout would have paused automatically, and I could have reviewed the situation calmly the next morning.
The core of SLO-based canary auto-rollback is moving away from simple fixed thresholds like "error rate below 1%" and directly connecting the deployment pipeline to error budget burn rate — a business reliability metric.
If you try to configure multi-window burn rate and Background Analysis all at once from the start, it can feel overwhelming. The following approach makes it much more manageable.
- Start by defining your service's SLO in Datadog. Using Service Level Objectives > New SLO to set your SLI and target value lets you immediately use the 1 - SLO value (e.g., 0.001) as the slo-threshold argument.
- It's recommended to attach the basic error rate AnalysisTemplate above to an existing Rollout first. This step is about confirming that Datadog query results are reaching the AnalysisRun correctly, before introducing burn rate calculations. You can check verdict results in real time with kubectl get analysisrun.
- Once stability is confirmed, you can swap in the burn rate formula and evolve toward the multi-window pattern. Externalizing slo-threshold as a Rollout argument lets you flexibly apply different SLOs per service.
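Since slo-threshold is passed to the Rollout as a string, a tiny helper can derive it from each service's SLO percentage. This is a sketch with a hypothetical function name; it uses Decimal so that "99.9" becomes exactly "0.001" rather than a float with trailing noise.

```python
from decimal import Decimal


def slo_threshold(slo_percent: str) -> str:
    """Convert an SLO such as "99.9" (percent) into the allowable error rate string."""
    allowed = Decimal(1) - Decimal(slo_percent) / Decimal(100)
    if not (Decimal(0) < allowed < Decimal(1)):
        raise ValueError(f"SLO must be strictly between 0 and 100, got {slo_percent!r}")
    return format(allowed, "f")  # fixed-point notation, never scientific


if __name__ == "__main__":
    for slo in ("99", "99.9", "99.99"):
        print(slo, "->", slo_threshold(slo))
```

The resulting string can be used directly as the value of the slo-threshold argument, for example "0.001" for a 99.9% SLO.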
References
- Argo Rollouts Official Docs: Analysis & Progressive Delivery
- Argo Rollouts Official Docs: Datadog Metrics Provider
- Datadog Official Docs: Burn Rate Alerts
- Datadog Blog: Burn rate is a better error rate
Further Reading
- Datadog Official Docs: Argo Rollouts Integration
- Datadog Official Docs: Error Budget Alerts
- InfraCloud Blog: Progressive Delivery with Argo Rollouts: Canary with Analysis
- Mario Fernandez Blog: Multiwindow, Multi-Burn-Rate Alerts in DataDog
- Google SRE Workbook: Canarying Releases
- Argo Rollouts GitHub Example: rollout-analysis-step.yaml