Burn Rate SLO-Based Canary Auto-Rollback with Kubernetes Argo Rollouts AnalysisTemplate and Datadog

Have you ever been jolted awake at 3 AM by a PagerDuty alert? I have. More than once. Every time, I'd dig through logs and eventually land on the same thought: "We were doing a canary deployment — why didn't we catch this at deploy time?" The error rate was under 1%, so as far as the system was concerned, nothing was wrong. An error rate of 0.9% had been holding steady for 30 minutes, and nobody — no system — had intervened.

The cause was clear, but the solution demanded more nuanced judgment than I expected. Instead of a fixed threshold like "error rate below 1%", what I needed was a pattern that automates canary promotion and rollback based on how quickly the error budget is being consumed — the Burn Rate. After switching to this approach, instead of getting woken up by a phone call, I'd check Slack the next morning to find messages like "Canary auto-rolled back, no impact" — and that happened more than once.

This post is aimed at developers with a basic understanding of Kubernetes. No prior SLO knowledge required. I'll walk through the process step by step: connecting Argo Rollouts' AnalysisTemplate to Datadog and expressing SLO-based decision logic as Kubernetes resources. Along with YAML examples inspired by real-world patterns, I'll be honest about the pitfalls and gotchas of this approach.

Core Concepts

SLO, Error Budget, Burn Rate — Why They Work as Deployment Decision Criteria

When I first encountered these concepts, my biggest confusion was "isn't this just error rate?" But once I understood burn rate, my thinking completely changed.

Let's start with the terminology.

Concept	Meaning	Example
SLI (Service Level Indicator)	The measurement metric	HTTP 5xx rate, p95 latency
SLO (Service Level Objective)	Target value for the SLI	99.9% availability over a 30-day window
Error Budget	Allowable error margin of `1 - SLO`	30 days × 0.1% = 43.2 minutes
Burn Rate	Speed at which the error budget is consumed	Burn rate 1 = budget exhausted after 30 days

The burn rate formula is straightforward.

Burn Rate = Current Error Rate / (1 - SLO)

If the SLO is 99.9%, the allowable error rate is 0.1% (= 0.001). If the current error rate is 0.1%, the burn rate is 1.0 — the budget will be exactly exhausted in 30 days. What if the error rate is 1.44%?

Burn Rate = 0.0144 / 0.001 = 14.4

A burn rate of 14.4 means you're consuming the error budget 14.4 times faster than normal. At that rate, the entire 30-day error budget would be gone in roughly 2 days (30 days ÷ 14.4 ≈ 2.1 days). If you observe this figure during a canary deployment, it's a signal that immediate action is needed.

Why burn rate instead of a simple error rate? Whether an error rate of 1% is "dangerous" depends on the service. For a service with an SLO of 99%, 1% is a burn rate of 1.0 — normal consumption. For a service with an SLO of 99.99%, 1% is a burn rate of 100 — immediate danger. Burn rate automatically captures this context.
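The arithmetic above fits in a few lines of Python. This is a minimal sketch: the function names are my own illustration (not part of Argo Rollouts or Datadog), and the 30-day window is the one assumed throughout this post.

```python
import math


def burn_rate(error_rate: float, slo: float) -> float:
    """Burn rate = current error rate / allowable error rate (1 - SLO)."""
    return error_rate / (1.0 - slo)


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days until the whole budget is gone if this burn rate holds."""
    if rate <= 0:
        return math.inf  # nothing is being burned
    return window_days / rate


def error_budget_minutes(slo: float, window_days: float = 30.0) -> float:
    """Error budget expressed as minutes of full outage within the window."""
    return window_days * 24 * 60 * (1.0 - slo)


if __name__ == "__main__":
    print(round(burn_rate(0.0144, 0.999), 2))      # 14.4
    print(round(days_to_exhaustion(14.4), 2))      # 2.08
    print(round(burn_rate(0.01, 0.99), 2))         # 1.0
    print(round(burn_rate(0.01, 0.9999), 2))       # 100.0
    print(round(error_budget_minutes(0.999), 1))   # 43.2
```

Swapping in your own SLO shows how the same 1% error rate maps to very different burn rates.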

What AnalysisTemplate Does

AnalysisTemplate is a Kubernetes CRD that enables Argo Rollouts to make decisions during a canary deployment — "keep going, pause, or roll back" — based on external metrics.

Rollout Controller
    │ (instantiates AnalysisTemplate)
    ▼
AnalysisRun (analysis execution)
    │ (queries Datadog Metrics API v2)
    ▼
Datadog
    │ (returns burn rate calculation result)
    ▼
successCondition / failureCondition evaluation
    │
    ├─ Successful   → auto-promote to next step
    ├─ Failed       → auto-rollback
    └─ Inconclusive → pause (awaiting manual intervention)

Of the three possible outcomes, Inconclusive is the important one. It's returned when neither successCondition nor failureCondition is met; the Rollout automatically pauses and waits for manual review. It's a safety net that reduces the risk of full automation.

How the Datadog Connection Works

For Argo Rollouts to communicate with Datadog, you need to prepare an API key and App key as a Kubernetes Secret first.

yaml

apiVersion: v1
kind: Secret
metadata:
  name: datadog-api-key
  namespace: argo-rollouts
stringData:
  api-key: "<DATADOG_API_KEY>"
  app-key: "<DATADOG_APP_KEY>"

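The values under stringData sit in the manifest as plain text (and a Secret is only base64-encoded once stored), so integrating with Sealed Secrets or External Secrets Operator is recommended in production. Also note that, by default, the Datadog provider looks for a Secret named datadog in the namespace where the Argo Rollouts controller runs; a differently named Secret like the one above has to be referenced from the provider through secretRef. Recent Argo Rollouts versions also accept namespaced: true under secretRef so that the Secret is read from the AnalysisRun's own namespace, which makes things much easier when you expand to a multi-tenant environment.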

The Full Picture at a Glance

Before diving into the code examples, it helps to see the overall picture first — how AnalysisTemplate, Datadog, and the Rollout Controller connect together.

┌─────────────────────────────────────────────────────────────┐
│  Git Repository                                             │
│  ┌──────────────────┐  ┌─────────────────────────────┐     │
│  │   Rollout.yaml   │  │  AnalysisTemplate.yaml      │     │
│  └────────┬─────────┘  └──────────────┬──────────────┘     │
└───────────┼───────────────────────────┼────────────────────┘
            │ GitOps deploy             │
            ▼                           ▼
┌─────────────────────────────────────────────────────────────┐
│  Kubernetes Cluster                                         │
│                                                             │
│  ┌──────────────────┐  instantiate  ┌─────────────────────┐  │
│  │ Rollout          │ ──────────→ │ AnalysisRun         │  │
│  │ Controller       │ ←────────── │ (metric query + eval)│  │
│  │                  │  verdict     └──────────┬──────────┘  │
│  └──────────────────┘                        │              │
│         │                          Datadog API query        │
│  ┌──────▼───────────┐                        │              │
│  │  Canary Pods     │                        ▼              │
│  │ (5%→20%→50%→100%)│             ┌──────────────────────┐  │
│  └──────────────────┘             │  Datadog Metrics v2  │  │
│                                   └──────────────────────┘  │
└─────────────────────────────────────────────────────────────┘

AnalysisTemplate is a "template" that defines how to query metrics and what criteria to evaluate against, while AnalysisRun is the actual execution instance created for each deployment. The Rollout Controller receives the verdict from the AnalysisRun and decides whether to advance the canary steps.

Practical Application

Basic — Validating Canary Quality with Error Rate

If you're using Argo Rollouts + Datadog for the first time, it's worth starting with a simple error rate check before getting into complex burn rate calculations. This is the basic pattern: query a 5-minute error rate using the Datadog Metrics API v2 and mark it as a failure if it exceeds 1%.

yaml

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: datadog-error-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: error-rate
      interval: 5m
      successCondition: default(result, 0) <= 0.01
      failureLimit: 3
      provider:
        datadog:
          apiVersion: v2
          queries:
            errors: sum:requests.errors{service:{{args.service-name}}}.as_count()
            total: sum:requests{service:{{args.service-name}}}.as_count()
          formula: "moving_rollup(errors, 300, 'sum') / moving_rollup(total, 300, 'sum')"

Setting	Description
`interval: 5m`	Run a query against Datadog every 5 minutes
`default(result, 0)`	Treat nil query results (no traffic) as 0. Without this, the analysis itself is treated as an error in the early stages of a canary
`failureLimit: 3`	Tolerates up to 3 failed measurements in total (they do not have to be consecutive); the metric fails on the 4th. Prevents false positives from transient spikes
`moving_rollup(errors, 300, 'sum')`	Summed aggregation over a 5-minute (300-second) window

You may notice there's no failureCondition. In this case, Argo Rollouts counts any measurement that doesn't satisfy successCondition as a failure, and fails the analysis (triggering a rollback) once the number of failures exceeds failureLimit. Without failureCondition, there's no Inconclusive state; it's a binary pass-or-fail judgment.
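How failureLimit interacts with a successCondition-only metric is easy to misread, so here is a small Python model of the counting rule as I understand it from the Argo Rollouts documentation: failed measurements are counted cumulatively, and the metric fails once the count exceeds failureLimit. The function name and sample values are hypothetical, and this is a simplified model, not the controller's actual code.

```python
from typing import Iterable, Optional


def metric_verdict(error_rates: Iterable[Optional[float]],
                   threshold: float = 0.01,
                   failure_limit: int = 3) -> str:
    """Model a metric with only successCondition: default(result, 0) <= threshold."""
    failures = 0
    for result in error_rates:
        value = 0.0 if result is None else result  # default(result, 0)
        if not value <= threshold:                 # not a success, so a failure
            failures += 1
        if failures > failure_limit:               # limit exceeded, metric fails
            return "Failed"
    return "Running"  # no verdict yet; the analysis keeps measuring


if __name__ == "__main__":
    # Three failures in total are tolerated with failure_limit=3.
    print(metric_verdict([0.02, 0.0, 0.02, 0.02]))
    # The fourth failure exceeds the limit, even though they were not consecutive.
    print(metric_verdict([0.02, 0.02, None, 0.02, 0.02]))
```

With the default failureLimit of 0, the very first failed measurement would fail the metric.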

Core — SLO Auto-Rollback Based on Burn Rate

Now for the main event. The failureCondition threshold value was what I agonized over the most in this pattern, and ultimately following the Google SRE Workbook standard turned out to be the most sensible approach. The burn rate is calculated in real time using the Datadog Formula by dividing the current error rate by the allowable error rate.

yaml

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: slo-burn-rate-check
spec:
  args:
    - name: service-name
    - name: slo-error-rate  # e.g., "0.001" (allowable error rate for SLO 99.9%)
  metrics:
    - name: error-budget-burn-rate
      interval: 5m
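      successCondition: default(result, 0) <= 2.0   # Burn rate ≤ 2x → safe
      failureCondition: default(result, 0) > 14.4   # Burn rate > 14.4 → counted as a failed measurement
      failureLimit: 1   # one failure is tolerated; the second fails the run (use 0 to roll back on the first)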
      provider:
        datadog:
          apiVersion: v2
          queries:
            errors: sum:requests.errors{service:{{args.service-name}}}.as_count()
            total: sum:requests{service:{{args.service-name}}}.as_count()
          formula: >-
            (moving_rollup(errors, 300, 'sum') / moving_rollup(total, 300, 'sum'))
            / {{args.slo-error-rate}}

The range between successCondition and failureCondition — that is, a burn rate between 2.0 and 14.4 — is treated as Inconclusive. The Rollout automatically pauses and waits for manual intervention. This can serve as a safety net between full automation and human review.
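The three-way verdict for a single measurement can be sketched in Python as follows. This is a simplified model of the two conditions in the template above, with a hypothetical function name; it ignores failureLimit and everything else the controller tracks across measurements.

```python
def classify(burn_rate: float,
             success_max: float = 2.0,
             failure_min: float = 14.4) -> str:
    """Mirror successCondition: result <= 2.0 and failureCondition: result > 14.4."""
    if burn_rate > failure_min:
        return "Failed"
    if burn_rate <= success_max:
        return "Successful"
    return "Inconclusive"  # neither condition holds, so the Rollout pauses


if __name__ == "__main__":
    allowable_error_rate = 0.001  # 1 - SLO for a 99.9% SLO
    for error_rate in (0.001, 0.009, 0.02):
        rate = error_rate / allowable_error_rate
        print(f"{error_rate:.1%} -> burn rate {rate:.1f} -> {classify(rate)}")
```

Note that a burn rate of exactly 14.4 is still Inconclusive, because the failure condition uses a strict greater-than.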

Where does 14.4 come from? It's a value derived from the Google SRE Workbook. If a burn rate of 14.4 persists for one hour, approximately 2% of the 30-day error budget is consumed (14.4 ÷ 720 hours = 2%). While the absolute consumption isn't huge, observing this speed during the short observation window of a canary deployment is a signal that an underlying problem could impact all of production. It should be treated as an alert level that demands immediate action.
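The 14.4 figure can be reproduced by working the relationship in both directions. A minimal sketch, assuming a 30-day (720-hour) SLO window; the function names are mine, and the 5%-in-6-hours case is the Workbook's slower alert tier, included only as a second data point.

```python
PERIOD_HOURS = 30 * 24  # 720 hours in a 30-day SLO window


def budget_consumed(burn_rate: float, hours: float) -> float:
    """Fraction of the 30-day error budget spent at this burn rate."""
    return burn_rate * hours / PERIOD_HOURS


def burn_rate_for(budget_fraction: float, hours: float) -> float:
    """Burn rate that spends `budget_fraction` of the budget in `hours`."""
    return budget_fraction * PERIOD_HOURS / hours


if __name__ == "__main__":
    print(round(budget_consumed(14.4, 1), 4))  # 0.02, i.e. 2% in one hour
    print(round(burn_rate_for(0.02, 1), 2))    # 14.4
    print(round(burn_rate_for(0.05, 6), 2))    # 6.0
```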

Recommended Pattern — Multi-Window Burn Rate to Minimize False Positives

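Honestly, I started out running on a single-window burn rate too. After watching several perfectly fine deployments roll back because of a 10-minute platform hiccup, I switched to this pattern. The approach is to evaluate a long window (5 minutes) and a short window (1 minute) side by side. One caveat: Argo Rollouts fails the AnalysisRun as soon as any single metric fails, so two separate metrics are not a strict "both must exceed the threshold" rule like the multi-window alerts in the Google SRE Workbook. Here the noise filtering comes from the Inconclusive band on the long window and from failureLimit on the short one.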

One important point: count is omitted in the example below. If you set something like count: 5, combined with interval: 1m, the analysis terminates after 5 minutes. That completely defeats the purpose of using Background Analysis to monitor the entire deployment window. Omitting count keeps the analysis running until the Rollout completes.

yaml

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: multiwindow-burn-rate
spec:
  args:
    - name: service
    - name: slo-threshold
  metrics:
    # Long window: burn rate based on 5-minute aggregation
    - name: burn-rate-5m
      interval: 1m
      successCondition: default(result, 0) <= 2.0
      failureCondition: default(result, 0) > 14.4
      failureLimit: 1
      provider:
        datadog:
          apiVersion: v2
          queries:
            e: sum:http.errors{service:{{args.service}},version:canary}.as_count()
            r: sum:http.requests{service:{{args.service}},version:canary}.as_count()
          formula: >-
            (moving_rollup(e, 300, 'sum') / moving_rollup(r, 300, 'sum'))
            / {{args.slo-threshold}}
 
    # Short window: burn rate based on 1-minute aggregation (fast spike detection)
    - name: burn-rate-1m
      interval: 1m
      successCondition: default(result, 0) <= 14.4
      failureCondition: default(result, 0) > 14.4
      failureLimit: 2
      provider:
        datadog:
          apiVersion: v2
          queries:
            e: sum:http.errors{service:{{args.service}},version:canary}.as_count()
            r: sum:http.requests{service:{{args.service}},version:canary}.as_count()
          formula: >-
            (moving_rollup(e, 60, 'sum') / moving_rollup(r, 60, 'sum'))
            / {{args.slo-threshold}}

Pay attention to the explicit version:canary tag in the queries. When a canary is only handling 5% of total traffic, using global metrics without version differentiation lets the stable old-version traffic from the other 95% dilute the burn rate, masking real problems. By tagging canary pods with a version: canary label and configuring Datadog to collect this as a tag, you can accurately measure the burn rate for the canary slice alone.
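A quick calculation shows how strong the dilution effect is. This sketch assumes the canary receives exactly its configured 5% share of requests and uses made-up error rates (5% on the canary, 0.05% on stable) against a 99.9% SLO; the function name is hypothetical.

```python
ALLOWABLE_ERROR_RATE = 0.001  # 1 - SLO for a 99.9% SLO


def blended_error_rate(canary_share: float,
                       canary_error_rate: float,
                       stable_error_rate: float) -> float:
    """Error rate seen by a query that does not filter on version."""
    return (canary_share * canary_error_rate
            + (1.0 - canary_share) * stable_error_rate)


if __name__ == "__main__":
    canary_only = 0.05 / ALLOWABLE_ERROR_RATE
    blended = blended_error_rate(0.05, 0.05, 0.0005) / ALLOWABLE_ERROR_RATE
    print(f"canary-only burn rate: {canary_only:.1f}")
    print(f"global burn rate:      {blended:.1f}")
```

In this scenario the canary-only burn rate is far above 14.4, while the unfiltered burn rate comes out at roughly 3, which would never trip the failure condition.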

You'll also notice that burn-rate-1m uses the same threshold (14.4) for both successCondition and failureCondition. This is binary, with no Inconclusive range, but failureLimit: 2 tolerates two failed measurements (counted cumulatively, not consecutively) and fails the metric only on the third, which filters out one-minute noise. If you need finer-grained control, you could explicitly create an Inconclusive range with something like successCondition: default(result, 0) <= 10.0.

Connecting to a Rollout — Monitoring the Full Window with Background Analysis

Here's how to attach the AnalysisTemplate you built to an actual Rollout. Using Background Analysis keeps the analysis running continuously throughout all canary steps.

yaml

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-service
spec:
  replicas: 10
  strategy:
    canary:
      analysis:
        templates:
          - templateName: multiwindow-burn-rate
        startingStep: 1   # analysis begins immediately after setWeight: 5
        args:
          - name: service
            value: payment-service
          - name: slo-threshold
            value: "0.001"   # SLO 99.9%
      steps:
        - setWeight: 5
        - pause: {duration: 10m}
        - setWeight: 20
        - pause: {duration: 10m}
        - setWeight: 50
        - pause: {duration: 15m}
        - setWeight: 100

Setting startingStep: 1 means the analysis begins immediately after the first setWeight. You're effectively monitoring SLO impact in real time from when canary traffic is at 5%.

Checking Status During Deployment — Debugging AnalysisRun

The first place people get stuck when adopting this pattern is "something went wrong, but where do I look?" I spent a long time lost here myself when I first started.

bash

# List AnalysisRuns and check current status
kubectl get analysisrun -n <namespace>
 
# View details for a specific AnalysisRun
kubectl describe analysisrun <analysisrun-name> -n <namespace>
 
# Extract only the status field
kubectl get analysisrun <analysisrun-name> -n <namespace> -o json \
  | jq '.status | {phase, message, metricResults}'

status.phase is typically one of Pending, Running, Successful, Failed, Error, or Inconclusive. status.metricResults shows the most recent measured values and verdict for each metric. If the state is Inconclusive, check which threshold range the measured value falls into; if it is Error, check whether the Datadog query is returning nil. Between them, these two checks will usually surface the cause.
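If you find yourself reading that JSON often, a small helper can condense it. The sketch below assumes the usual AnalysisRun status layout (status.phase, status.message, and status.metricResults with per-metric counters and measurements); the function name and the sample document are hypothetical, with made-up values.

```python
from typing import List


def summarize(analysis_run: dict) -> List[str]:
    """Condense an AnalysisRun (parsed from `-o json`) into one line per metric."""
    status = analysis_run.get("status", {})
    lines = [f"phase={status.get('phase', 'Unknown')} "
             f"message={status.get('message', '')}"]
    for metric in status.get("metricResults", []):
        measurements = metric.get("measurements", [])
        last = measurements[-1].get("value") if measurements else None
        lines.append(
            f"{metric.get('name')}: phase={metric.get('phase')} "
            f"failed={metric.get('failed', 0)} "
            f"inconclusive={metric.get('inconclusive', 0)} "
            f"error={metric.get('error', 0)} last={last}"
        )
    return lines


# Hypothetical sample; real values come from `kubectl get analysisrun -o json`.
SAMPLE = {
    "status": {
        "phase": "Inconclusive",
        "message": "example message",
        "metricResults": [
            {"name": "burn-rate-5m", "phase": "Inconclusive", "inconclusive": 1,
             "measurements": [{"phase": "Inconclusive", "value": "9"}]},
        ],
    }
}

if __name__ == "__main__":
    print("\n".join(summarize(SAMPLE)))
```

To use it against a live cluster, parse the kubectl output with json.load(sys.stdin) and pass the resulting dict to summarize.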

Pros and Cons

Advantages

Item	Detail
Automated safety net	Automatically rolls back on SLO violations without human intervention, reducing MTTR
Business impact alignment	Decisions are based on error budget consumption speed rather than raw technical metrics, making it directly tied to business SLAs
Reduced false positives	The multi-window burn rate pattern reduces unnecessary rollbacks caused by transient spikes
Gradual risk exposure	Stepwise traffic shifting at 5% → 20% → 50% minimizes blast radius when issues occur
GitOps integration	`AnalysisTemplate` is managed as code, making deployment policy auditing and reproduction possible

Disadvantages and Caveats

Item	What's the problem	Mitigation
Cold start	At the start of a canary, low request counts make burn rate statistics unstable	Use the `default(result, 0)` function and allow sufficient initial `pause`
Low-traffic services	A single error request can calculate as 100% error rate	Set `failureLimit` generously or add a minimum request count condition
Canary slice isolation	Without a service mesh, isolating metrics for canary pods alone is difficult	Combining with a service mesh like Istio or Linkerd is recommended
Datadog query cost	A Datadog API call is made on every analysis interval	Avoid setting the interval too short
SLO window mismatch	There's a gap in statistical reliability between a 30-day SLO window and a 15-minute canary observation window	Treat burn rate judgments as signals rather than absolute criteria, and leverage the `Inconclusive` range
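The low-traffic row is worth a concrete example. The sketch below shows the idea of a minimum request count in plain Python; the function name and the min_requests value of 100 are assumptions for illustration, and how you express the same guard in a Datadog query is up to your setup.

```python
from typing import Optional


def guarded_burn_rate(errors: int,
                      total: int,
                      allowable_error_rate: float,
                      min_requests: int = 100) -> Optional[float]:
    """Return the burn rate, or None when there is too little traffic to judge."""
    if total < min_requests:
        return None  # not enough samples; treat like a nil result
    return (errors / total) / allowable_error_rate


if __name__ == "__main__":
    # 1 error out of 20 requests is a 5% error rate, i.e. burn rate 50
    # against a 99.9% SLO, but the sample is too small to trust.
    print(guarded_burn_rate(1, 20, 0.001))
    # With enough traffic the same single error is unremarkable.
    print(guarded_burn_rate(1, 1000, 0.001))
```

Returning None here plays the same role as a nil query result, which default(result, 0) then treats as a burn rate of 0.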

The Most Common Mistakes in Practice

Omitting the default() function: When a Datadog query returns an empty result (nil), the successCondition expression itself errors out and the measurement is recorded as an error. Enough consecutive errors end the AnalysisRun in the Error phase, which aborts the rollout just like a failure would. The first 5 minutes of a canary deployment is the highest-risk window. It's best to apply the default(result, 0) pattern to all metrics by default.
Misconfiguring count so the analysis ends too early: Using count: 5 together with interval: 1m causes the analysis to terminate after about 5 minutes. This contradicts the goal of using Background Analysis to monitor the entire deployment window. Omitting count keeps the analysis running until the Rollout completes.
Judging canary quality using global metrics: When a canary is only handling 5% of total traffic, the stable old-version traffic from the other 95% dilutes the burn rate. Explicitly including a tag like version:canary in Datadog queries, or clearly separating traffic with a service mesh, is the most reliable way to avoid this pitfall.

Closing Thoughts

Thinking back to that 3 AM situation I opened with: if burn rate-based rollback had been in place, that alert probably never would have come. Assuming a 99.9% SLO, an error rate of 0.9% holding steady for 30 minutes would have meant a burn rate of 9.0: not high enough to trigger failureCondition: result > 14.4, but not satisfying successCondition: result <= 2.0 either. That is an Inconclusive state. The Rollout would have automatically paused, and I could have reviewed the situation calmly the next morning.

The core of SLO-based canary auto-rollback is moving away from simple fixed thresholds like "error rate below 1%" and directly connecting the deployment pipeline to error budget burn rate — a business reliability metric.

If you try to configure multi-window burn rate and Background Analysis all at once from the start, it can feel overwhelming. The following approach makes it much more manageable.

Start by defining your service's SLO in Datadog. Using Service Level Objectives > New SLO to set your SLI and target value lets you immediately use the 1 - SLO value (e.g., 0.001) as the slo-threshold argument.
It's recommended to attach the basic error rate AnalysisTemplate above to an existing Rollout first. This step is about confirming that Datadog query results are reaching the AnalysisRun correctly, before introducing burn rate calculations. You can check verdict results in real time with kubectl get analysisrun.
Once stability is confirmed, you can swap in the burn rate formula and evolve toward the multi-window pattern. Externalizing slo-threshold as a Rollout argument lets you flexibly apply different SLOs per service.

