Error Budget Automation: A Practical Implementation Guide to Blocking SLO Violations with GitOps Deployment Gates
Argo Rollouts · Sloth · GitHub Actions Step-by-Step Configuration
"Is this code safe to deploy?" — a question every engineering team asks repeatedly. Engineers manually check SLO dashboard numbers, ask in the team Slack, and ultimately rely on human judgment. This process is slow, subjective, and fails to prevent late-night incidents.
Error budget policy automation lets the system answer this question instead. The CI/CD pipeline reads Prometheus-calculated burn rates in real time, and automatically blocks deployments when a pre-agreed threshold is exceeded. Rather than waiting for human judgment, the system enforces the contract your team codified in code. This article walks through everything step by step — from core error budget concepts to building a practical GitOps gate using Argo Rollouts, Sloth, GitHub Actions, and Flagger.
After reading this, you'll be able to add a burn-rate-based deployment gate to your existing CI/CD pipeline and configure a GitOps environment that automatically blocks deployments when SLOs are violated. Note that this article assumes readers are familiar with basic Kubernetes concepts (Pod, Deployment, kubectl). The service mesh (Istio) examples apply only to teams already operating in that environment.
Core Concepts
What Is an Error Budget
An Error Budget quantifies the allowable "amount of unreliability" against an SLO target. Take a 99.9% availability SLO as an example: over one month (30 days = 43,200 minutes), approximately 43.2 minutes of downtime is permitted.
Error Budget = (1 - SLO target) × measurement period
Example: (1 - 0.999) × 43,200 min = 43.2 min/month

Key Insight: An error budget is not a license for "how much we can fail" — it is a fuel gauge for balancing innovation speed against stability. The core principle: move fast while the budget is plentiful, and slow down as it depletes.
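The formula is easy to sanity-check in a few lines of Python. This is an illustrative sketch; the helper name error_budget_minutes is ours, not from any library:

```python
def error_budget_minutes(slo_target: float, period_days: int = 30) -> float:
    """Allowed downtime (in minutes) for a given SLO target over the period."""
    period_minutes = period_days * 24 * 60  # 30 days = 43,200 minutes
    return (1 - slo_target) * period_minutes

# A 99.9% SLO over 30 days allows ~43.2 minutes of downtime.
print(f"99.9%  -> {error_budget_minutes(0.999):.1f} min/month")   # 43.2
print(f"99.5%  -> {error_budget_minutes(0.995):.1f} min/month")   # 216.0
print(f"99.99% -> {error_budget_minutes(0.9999):.2f} min/month")  # 4.32
```

The same function makes the cost of each extra "nine" immediately visible during SLO-target discussions.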
Burn Rate: Quantifying the Rate of Depletion
Burn Rate indicates how quickly the error budget is being consumed. A burn rate of 1.0 means the budget is being consumed at exactly the sustainable pace: fully depleted precisely at the end of the 30-day window.
The table below shows tiered responses based on Google SRE Workbook recommendations.
| Burn Rate | Meaning | Recommended Action |
|---|---|---|
| 1.0x | Depleting at normal pace | Keep monitoring |
| 2.0x | Depleting at 2× budget speed | Log a warning |
| 5.0x | Depleting at 5× budget speed | Create a ticket, begin investigation |
| 10.0x | Depleting at 10× budget speed | Page on-call engineer |
| 14.4x | 2% of budget consumed within 1 hour | Fast burn alert (Google SRE threshold) |
| 20.0x+ | Extremely rapid depletion | Automatically block deployments |
Definition: Fast Burn is a state where a burn rate of 14.4x persists for more than one hour. At this pace, the monthly budget will be fully exhausted in roughly 2 days (30 ÷ 14.4 ≈ 2.1 days). The Google SRE Workbook recommends this state as the threshold for immediate alerting.
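The burn rate arithmetic behind this table can be verified directly. A small illustrative sketch (the helper names are ours):

```python
def days_to_exhaustion(burn_rate: float, period_days: int = 30) -> float:
    """Days until the full error budget is gone at a constant burn rate."""
    return period_days / burn_rate

def budget_consumed(burn_rate: float, hours: float, period_days: int = 30) -> float:
    """Fraction of the total budget consumed after `hours` at `burn_rate`."""
    return burn_rate * hours / (period_days * 24)

print(days_to_exhaustion(14.4))        # ~2.08 days: why 14.4x is a "fast burn"
print(days_to_exhaustion(6))           # 5.0 days: the slow-burn threshold
print(budget_consumed(14.4, 1) * 100)  # ~2% of the monthly budget in one hour
```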
Error Budget Policy: An Operational Contract Between Teams
An Error Budget Policy is a document pre-agreed upon by SRE, development, and product teams that defines "what action to take when the budget is depleted by a given amount." Before automation, it was just a number on a dashboard — but as SLOs-as-Code has gained adoption, embedding this policy directly into the pipeline is becoming the standard.
The core automation flow is as follows:
Measure SLI
↓
Record in Prometheus (Recording Rule)
↓
Calculate Burn Rate
↓
Evaluate Policy (compare against threshold)
↓
Block or allow deployment

A Deployment Gate operates at the final step of this flow and can be inserted at three points in a GitOps environment:
- Git PR approval stage (GitHub Actions Status Check)
- Argo CD PreSync/PostSync Hook
- Argo Rollouts AnalysisRun stage
SLOs-as-Code: Managing SLOs as Code
The practice of codifying SLO definitions as YAML files and version-controlling them in Git is spreading rapidly. This approach allows SLO change history to be managed through the code review process, and when combined with a Kubernetes Operator, it automatically generates Prometheus Recording Rules and Alert Rules.
| Tool | Role | Characteristics |
|---|---|---|
| Sloth | Automatic Prometheus SLO rule generation | CLI + Kubernetes Operator, OpenSLO compatible |
| Pyrra | Auto-converts SLO → PrometheusRule | Kubernetes Operator approach |
| OpenSLO | Declarative YAML standard spec for SLOs | Vendor-neutral |
| Nobl9 sloctl | GitLab/GitHub CI integration CLI | Commercial platform integration |
Practical Application
Example 1: Argo Rollouts Canary Gate — Automatic Rollback Based on Prometheus SLO
Argo Rollouts is a Kubernetes-native rollout controller that creates an AnalysisRun at each stage of a canary deployment to evaluate Prometheus query results. When failure conditions are met, it automatically rolls back and halts the deployment. A key characteristic is that it operates on a Kubernetes cluster alone, without requiring a service mesh dependency.
```yaml
# k8s/analysis/slo-analysis-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: slo-analysis
  namespace: production
spec:
  metrics:
    - name: error-rate
      interval: 60s
      successCondition: result[0] < 0.01 # Error rate below 1%
      failureLimit: 3 # Rollback after 3 consecutive failures
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc.cluster.local:9090
          query: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
    - name: latency-p99
      interval: 60s
      successCondition: result[0] < 0.3 # P99 below 300ms (in seconds)
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc.cluster.local:9090
          query: |
            histogram_quantile(0.99,
              sum by (le) (rate(http_duration_seconds_bucket[5m])))
```

| Field | Role |
|---|---|
| interval | Metric measurement interval (Prometheus query runs every 60 seconds) |
| successCondition | Canary stage proceeds only if this condition is satisfied |
| failureLimit | Automatic rollback triggered after this many failures |
| provider.prometheus.query | The actual PromQL query — the heart of the SLI definition |
The canary traffic flow proceeds as follows:
10% traffic shifted → AnalysisRun runs → SLO passes
→ 30% traffic shifted → AnalysisRun runs → SLO passes
→ ... → 100% shift complete
(SLO failure at any stage → immediate rollback to 0%)

Practical Tip: For the address field, use the FQDN format http://service-name.namespace.svc.cluster.local:port to ensure stable DNS resolution within the cluster. Shortened addresses that omit the namespace may fail DNS resolution in cross-namespace environments.
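For completeness, the AnalysisTemplate is wired into a deployment through the canary steps of a Rollout resource. The sketch below is illustrative: the resource names, image, and step weights are our assumptions, while the templateName must match the AnalysisTemplate defined above.

```yaml
# k8s/rollout/api-rollout.yaml (illustrative sketch)
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-service
  namespace: production
spec:
  replicas: 5
  selector:
    matchLabels:
      app: api-service
  strategy:
    canary:
      steps:
        - setWeight: 10              # shift 10% of traffic to the canary
        - analysis:
            templates:
              - templateName: slo-analysis  # runs the AnalysisRun defined above
        - setWeight: 30
        - analysis:
            templates:
              - templateName: slo-analysis
        - setWeight: 100
  template:
    metadata:
      labels:
        app: api-service
    spec:
      containers:
        - name: api
          image: example/api:latest
```

If any AnalysisRun fails its failureLimit, the controller aborts the rollout and routes all traffic back to the stable version.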
Example 2: Sloth + GitHub Actions — Burn Rate Deployment Gate at the PR Stage
This pattern defines SLOs as YAML using Sloth, queries the Prometheus burn rate in GitHub Actions, and blocks the deployment stage when the threshold is exceeded.
Step 1: Define SLO with Sloth
```yaml
# slo/api-availability.yaml
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: api-availability
  namespace: monitoring
spec:
  service: "api-service"
  slos:
    - name: "availability"
      objective: 99.9 # 99.9% availability SLO
      sli:
        events:
          error_query: |
            sum(rate(http_requests_total{code=~"5.."}[{{.window}}]))
          total_query: |
            sum(rate(http_requests_total[{{.window}}]))
      alerting:
        name: APIAvailabilityBurnRate
        pageAlert:
          labels:
            severity: critical
            team: platform
```

Running sloth generate -i slo/api-availability.yaml | kubectl apply -f - automatically generates Prometheus Recording Rules and Alert Rules.
Important: Recording Rule names generated by Sloth may vary by version. Verify the actual generated rule names with kubectl get prometheusrule -n monitoring -o yaml before using them in the GitHub Actions query below.
Step 2: GitHub Actions Deployment Gate
```yaml
# .github/workflows/deploy.yml
name: Deploy with Error Budget Gate
on:
  push:
    branches: [main]
jobs:
  check-error-budget:
    runs-on: ubuntu-latest
    steps:
      - name: Check Error Budget Burn Rate
        env:
          PROMETHEUS: ${{ secrets.PROMETHEUS_URL }}
        run: |
          BURN_RATE=$(curl -sf "$PROMETHEUS/api/v1/query" \
            --data-urlencode \
            "query=slo:error_budget_burn_rate:ratio_rate1h" \
            | jq -r '.data.result[0].value[1]')
          # Block as fail-safe if Prometheus returns no response or empty value
          if [[ -z "$BURN_RATE" || "$BURN_RATE" == "null" ]]; then
            echo "::error::Failed to retrieve burn rate from Prometheus. Blocking deployment."
            exit 1
          fi
          echo "Current burn rate: ${BURN_RATE}x"
          if (( $(echo "$BURN_RATE > 10" | bc -l) )); then
            echo "::error::Error budget burn rate ${BURN_RATE}x exceeds threshold (10x)."
            echo "::error::Deployment blocked. Please check SLO status."
            exit 1
          fi
          echo "Error budget status healthy. Proceeding with deployment."
  deploy:
    needs: check-error-budget # Runs only after gate passes
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4 # The k8s/ manifests must be checked out before applying
      - name: Deploy to Kubernetes
        run: kubectl apply -f k8s/
```

Terminology: slo:error_budget_burn_rate:ratio_rate1h is an example of a Recording Rule name auto-generated by Sloth. A Recording Rule is a Prometheus feature that pre-stores the results of complex PromQL calculations to improve query performance. The empty-value guard ([[ -z "$BURN_RATE" ]]) is an essential safety mechanism that prevents the gate from automatically opening (fail-open) in error scenarios.
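The gate logic itself is portable beyond GitHub Actions (for example, to a GitLab or Jenkins job). Below is a minimal Python sketch of the same fail-closed decision; the function name check_gate and the 10x default threshold are our illustrative choices, not part of any tool's API:

```python
from typing import Optional

def check_gate(burn_rate: Optional[float], threshold: float = 10.0) -> bool:
    """Return True if deployment may proceed.

    Fails closed: a missing burn rate (Prometheus unreachable or empty query
    result) blocks the deploy, mirroring the empty-value guard in Example 2.
    """
    if burn_rate is None:
        return False  # fail-closed: no data means no deploy
    return burn_rate <= threshold

assert check_gate(None) is False   # no data -> block (fail-closed)
assert check_gate(12.5) is False   # burn rate above threshold -> block
assert check_gate(1.2) is True     # healthy budget -> deploy
```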
Example 3: Flagger + Istio — Service Mesh-Based Automatic Canary
Prerequisite: This example applies only when Istio service mesh is installed in the cluster. If you are not using a service mesh, refer to Example 1 (Argo Rollouts).
Both tools serve the same role (canary SLO gate) but take different approaches. This comparison can help you choose the right tool for your environment.
| Category | Argo Rollouts | Flagger |
|---|---|---|
| Dependencies | Kubernetes only | Requires a service mesh such as Istio or Linkerd |
| Traffic control | Direct control via Rollout resource | Control via auto-generated VirtualService |
| Ecosystem integration | Tightly integrated with Argo CD | Native integration with Flux CD |
| Best suited for | Kubernetes environments without a service mesh | Microservice environments based on Istio/Linkerd |
When Flagger detects a Canary object, it automatically creates an Istio VirtualService to progressively shift traffic. It measures Prometheus metrics at each stage and immediately restores the original version if a threshold is exceeded.
```yaml
# canary/api-canary.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: api-service
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service
  service:
    port: 80
  analysis:
    interval: 1m # Analyze every 1 minute
    threshold: 5 # Rollback after 5 failures
    maxWeight: 50 # Maximum 50% canary traffic
    stepWeight: 10 # Increase by 10% increments
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99 # Success rate at least 99% (percentage, range 0–100)
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500 # P99 latency at most 500ms (in milliseconds)
        interval: 1m
```

| Stage | Traffic Percentage | Action |
|---|---|---|
| Stage 1 | 10% | Measure Prometheus metrics, evaluate SLO |
| Stage 2 | 20% | Increase traffic after previous stage passes |
| ... | ... | Repeat |
| Final success | 100% | Canary fully promoted to primary |
| On failure | 0% | Immediate rollback, Kubernetes event recorded |
Unit note: thresholdRange.min: 99 for request-success-rate is a percentage (%) based on Flagger's built-in metrics. When defining custom metrics directly, always verify the unit of values returned by the PromQL query (decimal ratio vs. percentage) before setting thresholds.
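To see why the unit matters, compare a raw PromQL success ratio against Flagger's percentage threshold. An illustrative sketch (the helper name is ours):

```python
def to_percent(ratio: float) -> float:
    """Convert a PromQL decimal ratio (0-1) to a Flagger-style percentage (0-100)."""
    return ratio * 100.0

success_ratio = 0.995   # what a raw PromQL ratio query returns (99.5% success)
flagger_min = 99        # thresholdRange.min for request-success-rate

print(success_ratio >= flagger_min)              # False: a raw ratio never clears 99
print(to_percent(success_ratio) >= flagger_min)  # True: converted to percent, it passes
```

A unit mismatch like this makes the gate fail every canary stage, so every deployment silently rolls back.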
Example 4: Multi-Window Burn Rate Alerts — Google SRE Recommended Pattern
Using only a single short window causes two problems: reacting to transient spikes (noise) or missing long-term degradation (slow response). The Google SRE Workbook recommends combining a short window and a long window with an AND condition to detect both patterns.
```yaml
# prometheus/slo-alerts.yaml
groups:
  - name: slo.rules
    rules:
      # Fast burn alert: combination of 1h + 5m windows
      - alert: ErrorBudgetFastBurn
        expr: |
          (
            slo:error_budget_burn_rate:ratio_rate1h > 14.4
            and
            slo:error_budget_burn_rate:ratio_rate5m > 14.4
          )
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Fast burn detected: burn rate {{ $value }}x"
          description: "Monthly budget will be exhausted in roughly 2 days at the current rate"
      # Slow burn alert: combination of 6h + 30m windows
      - alert: ErrorBudgetSlowBurn
        expr: |
          (
            slo:error_budget_burn_rate:ratio_rate6h > 6
            and
            slo:error_budget_burn_rate:ratio_rate30m > 6
          )
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Slow burn detected: burn rate {{ $value }}x"
          description: "A sustained low burn rate is quietly depleting the budget over time"
```

| Alert Type | Window Combination | Threshold | Detects |
|---|---|---|---|
| Fast burn | 1h + 5m | 14.4x | Sudden outages requiring immediate response |
| Slow burn | 6h + 30m | 6x | Long-running low-level performance degradation |
Terminology: Slow Burn is a pattern where the burn rate is low but sustained over a long period, quietly depleting the budget. Because it is difficult to detect with short windows alone, it is recommended to combine it with a long window of 6 hours or more using an AND condition.
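The AND semantics of a multi-window rule reduce to a one-line predicate. An illustrative sketch (the function name and arguments are ours):

```python
def multiwindow_alert(long_window_rate: float, short_window_rate: float,
                      threshold: float) -> bool:
    """Fire only when BOTH windows exceed the threshold: the long window
    proves the burn is sustained, the short window proves it is still happening."""
    return long_window_rate > threshold and short_window_rate > threshold

# Fast burn check (1h + 5m windows, 14.4x threshold)
print(multiwindow_alert(15.0, 16.0, 14.4))  # True: sustained and still ongoing
print(multiwindow_alert(15.0, 2.0, 14.4))   # False: the burn already stopped, no page
print(multiwindow_alert(3.0, 20.0, 14.4))   # False: transient spike, no page
```

The second case is why the short window matters: without it, an alert would keep firing long after the incident is over.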
Pros and Cons Analysis
Advantages
| Item | Details |
|---|---|
| Automated reliability protection | Deployments are automatically blocked on SLO violations without human judgment, even at night or on weekends |
| Balance between development speed and stability | Fast deployments when budget is healthy, automatic braking when risk is high |
| Objective criteria across teams | Deployment decisions shift from subjective judgment to data-driven decisions |
| Audit trail | All deployment decisions are recorded in Git history in a GitOps environment |
| Progressive risk mitigation | Minimizes production risk when combined with Progressive Delivery |
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| False Positives | Thresholds that are too strict will block legitimate deployments | Adjust burn rate thresholds gradually; start with warnings only |
| Initial setup complexity | Requires infrastructure investment in Prometheus, Argo Rollouts, alert rules, etc. | Minimize boilerplate using Sloth/Pyrra |
| Incorrect SLI definitions | A poorly defined SLI renders the entire gate system meaningless | Mandatory team-wide review and documentation before defining SLIs |
| Window selection issues | Short windows produce noise; long windows produce slow responses | Use the Google SRE recommended multi-window combination |
Terminology: SLI (Service Level Indicator) is the actual measured metric (e.g., HTTP success rate, P99 latency), while SLO is the target value that the SLI must achieve (e.g., success rate ≥ 99.9%). If the SLI is poorly defined, the SLO and burn rate calculations both become meaningless.
Team Preparation Before Adoption
Error budget policy automation cannot be completed through technical configuration alone. For automatic deployment blocking to have a real impact on the team, the following organizational groundwork must come first:
- Agree on SLO targets: It is critical to first agree on realistic SLO numbers with service owners and the product team. If you proceed with technical configuration without that agreement, every automatic block will generate team conflict.
- Document hotfix bypass paths: If the gate also blocks security patch or P0 incident response deployments, it can lead to even larger outages. It is important for the team to agree on and document gate bypass procedures (e.g., using specific Git tags or labels).
- Gradual cultural adaptation: Start with warning notifications only instead of blocking, and switch to an actual blocking policy after the team becomes comfortable with the data.
Most Common Mistakes in Practice
- Setting SLO targets too high — A target like 99.999% leaves only 26 seconds of monthly error budget, meaning the budget is perpetually exhausted. Start with realistic targets (99.5%–99.9%).
- Omitting the fail-safe guard in burn rate checks — When Prometheus doesn't respond or the query result is empty, the gate can fail open automatically. Always include the empty-value guard from Example 2 ([[ -z "$BURN_RATE" ]]).
- Using only a single short window — Monitoring only a 5-minute window causes the gate to trigger on transient spikes, reducing reliability. Follow the Google SRE Workbook recommendation and use the 1h+5m (fast burn) and 6h+30m (slow burn) combination.
Closing Thoughts
The essence of error budget policy automation is transforming "whether a deployment is allowed" from a team's implicit judgment into an objective contract enforced by the system. This transformation does not happen overnight, but it can be started small and expanded incrementally.
Steps you can take right now:
- First, agree on realistic SLO targets with service owners and the product team. Discussing whether 99.9% or 99.5% strikes the right balance between your team's deployment velocity and business requirements should come before any technical configuration.
- Define one SLI from an existing service and observe it for two weeks. Adding the query sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) to a Grafana dashboard will give you real measured data to use as the basis for setting a realistic SLO target.
- Codify your SLO as YAML using Sloth and commit it to Git. You can see firsthand how Prometheus Recording Rules and Alert Rules are automatically generated with a single command: sloth generate -i slo/api-availability.yaml | kubectl apply -f -.
- Add a burn rate check step to your existing CI/CD pipeline, but start by outputting only a warning message instead of exit 1. Switching to an actual blocking policy after the team is comfortable with the data lets you introduce automation while minimizing cultural resistance.
Next Article (Series Upcoming): A deep-dive Progressive Delivery guide to simultaneously implementing A/B testing and SLO-based automatic rollback in canary deployments using Argo Rollouts' AnalysisTemplate.
References
- Google SRE Workbook — Error Budget Policy
- Google SRE Workbook — Alerting on SLOs (Prometheus)
- Google SRE Workbook — Implementing SLOs
- Argo Rollouts — Analysis & Progressive Delivery Official Docs
- Argo Rollouts — Prometheus Analysis Official Docs
- Flagger Official Docs
- Flagger — Istio Canary Deployments Tutorial
- GitHub — slok/sloth: Prometheus SLO Generator
- GitHub — pyrra-dev/pyrra: SLOs with Prometheus
- Nobl9 — CI/CD Integration Guide
- OpenSLO Spec Docs (Nobl9)
- Datadog — Burn Rate Alerts Official Docs
- GitOps in 2025: From Old-School Updates to the Modern Way | CNCF
- Error Budgets 2.0 — Agentic AI for SLO-Apprehensive Deployment | DZone
- Error Budgets in Practice: A Data-Driven Approach | DEV Community
- GitOps using Flux and Flagger | InfraCloud