Error Budget Automation: A Practical Implementation Guide to Blocking SLO Violations with GitOps Deployment Gates
Argo Rollouts · Sloth · GitHub Actions Step-by-Step Configuration
"Is this code safe to deploy?" — a question every engineering team asks repeatedly. Engineers manually check SLO dashboard numbers, ask in the team Slack, and ultimately rely on human judgment. This process is slow, subjective, and fails to prevent late-night incidents.
Error budget policy automation lets the system answer this question instead. The CI/CD pipeline reads Prometheus-calculated burn rates in real time, and automatically blocks deployments when a pre-agreed threshold is exceeded. Rather than waiting for human judgment, the system enforces the contract your team codified in code. This article walks through everything step by step — from core error budget concepts to building a practical GitOps gate using Argo Rollouts, Sloth, GitHub Actions, and Flagger.
After reading this, you'll be able to add a burn-rate-based deployment gate to your existing CI/CD pipeline and configure a GitOps environment that automatically blocks deployments when SLOs are violated. Note that this article assumes readers are familiar with basic Kubernetes concepts (Pod, Deployment, kubectl). The service mesh (Istio) examples apply only to teams already operating in that environment.
Core Concepts
What Is an Error Budget
An Error Budget quantifies the allowable "amount of unreliability" against an SLO target. Take a 99.9% availability SLO as an example: over one month (30 days = 43,200 minutes), approximately 43.2 minutes of downtime is permitted.
Error Budget = (1 - SLO target) × measurement period
Example: (1 - 0.999) × 43,200 min = 43.2 min/month

Key Insight: An error budget is not a license for "how much we can fail" — it is a fuel gauge for balancing innovation speed against stability. The core principle: move fast while the budget is plentiful, and slow down as it depletes.
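The formula is easy to sanity-check in a few lines of Python. This is an illustrative sketch; the helper name error_budget_minutes is ours, not from any library:

```python
def error_budget_minutes(slo_target: float, period_days: int = 30) -> float:
    """Allowed downtime (in minutes) for a given SLO target over the period."""
    period_minutes = period_days * 24 * 60  # 30 days = 43,200 minutes
    return (1 - slo_target) * period_minutes

# A 99.9% SLO over 30 days allows ~43.2 minutes of downtime.
print(f"99.9%  -> {error_budget_minutes(0.999):.1f} min/month")   # 43.2
print(f"99.5%  -> {error_budget_minutes(0.995):.1f} min/month")   # 216.0
print(f"99.99% -> {error_budget_minutes(0.9999):.2f} min/month")  # 4.32
```

The same function makes the cost of each extra "nine" immediately visible during SLO-target discussions.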
Burn Rate: Quantifying the Rate of Depletion
Burn Rate indicates how quickly the error budget is being consumed. A burn rate of 1.0 means the budget is being consumed at exactly the sustainable pace: fully depleted precisely at the end of the 30-day window.
The table below shows tiered responses based on Google SRE Workbook recommendations.
| Burn Rate | Meaning | Recommended Action |
|---|---|---|
| 1.0x | Depleting at normal pace | Keep monitoring |
| 2.0x | Depleting at 2× budget speed | Log a warning |
| 5.0x | Depleting at 5× budget speed | Create a ticket, begin investigation |
| 10.0x | Depleting at 10× budget speed | Page on-call engineer |
| 14.4x | 2% of budget consumed within 1 hour | Fast burn alert (Google SRE threshold) |
| 20.0x+ | Extremely rapid depletion | Automatically block deployments |
Definition: Fast Burn is a state where a burn rate of 14.4x persists for more than one hour. At this pace, the monthly budget will be fully exhausted in roughly 2 days (30 ÷ 14.4 ≈ 2.1 days). The Google SRE Workbook recommends this state as the threshold for immediate alerting.
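The burn rate arithmetic behind this table can be verified directly. A small illustrative sketch (the helper names are ours):

```python
def days_to_exhaustion(burn_rate: float, period_days: int = 30) -> float:
    """Days until the full error budget is gone at a constant burn rate."""
    return period_days / burn_rate

def budget_consumed(burn_rate: float, hours: float, period_days: int = 30) -> float:
    """Fraction of the total budget consumed after `hours` at `burn_rate`."""
    return burn_rate * hours / (period_days * 24)

print(days_to_exhaustion(14.4))        # ~2.08 days: why 14.4x is a "fast burn"
print(days_to_exhaustion(6))           # 5.0 days: the slow-burn threshold
print(budget_consumed(14.4, 1) * 100)  # ~2% of the monthly budget in one hour
```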
Error Budget Policy: An Operational Contract Between Teams
An Error Budget Policy is a document pre-agreed upon by SRE, development, and product teams that defines "what action to take when the budget is depleted by a given amount." Before automation, it was just a number on a dashboard — but as SLOs-as-Code has gained adoption, embedding this policy directly into the pipeline is becoming the standard.
The core automation flow is as follows:
Measure SLI
↓
Record in Prometheus (Recording Rule)
↓
Calculate Burn Rate
↓
Evaluate Policy (compare against threshold)
↓
Block or allow deployment

A Deployment Gate operates at the final step of this flow and can be inserted at three points in a GitOps environment:
- Git PR approval stage (GitHub Actions Status Check)
- Argo CD PreSync/PostSync Hook
- Argo Rollouts AnalysisRun stage
SLOs-as-Code: Managing SLOs as Code
The practice of codifying SLO definitions as YAML files and version-controlling them in Git is spreading rapidly. This approach allows SLO change history to be managed through the code review process, and when combined with a Kubernetes Operator, it automatically generates Prometheus Recording Rules and Alert Rules.
| Tool | Role | Characteristics |
|---|---|---|
| Sloth | Automatic Prometheus SLO rule generation | CLI + Kubernetes Operator, OpenSLO compatible |
| Pyrra | Auto-converts SLO → PrometheusRule | Kubernetes Operator approach |
| OpenSLO | Declarative YAML standard spec for SLOs | Vendor-neutral |
| Nobl9 sloctl | GitLab/GitHub CI integration CLI | Commercial platform integration |
Practical Application
Example 1: Argo Rollouts Canary Gate — Automatic Rollback Based on Prometheus SLO
Argo Rollouts is a Kubernetes-native rollout controller that creates an AnalysisRun at each stage of a canary deployment to evaluate Prometheus query results. When failure conditions are met, it automatically rolls back and halts the deployment. A key characteristic is that it operates on a Kubernetes cluster alone, without requiring a service mesh dependency.
```yaml
# k8s/analysis/slo-analysis-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: slo-analysis
  namespace: production
spec:
  metrics:
    - name: error-rate
      interval: 60s
      successCondition: result[0] < 0.01 # Error rate below 1%
      failureLimit: 3 # Rollback after 3 consecutive failures
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc.cluster.local:9090
          query: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
    - name: latency-p99
      interval: 60s
      successCondition: result[0] < 0.3 # P99 below 300ms (in seconds)
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc.cluster.local:9090
          query: |
            histogram_quantile(0.99,
              sum by (le) (rate(http_duration_seconds_bucket[5m])))
```

| Field | Role |
|---|---|
| interval | Metric measurement interval (Prometheus query runs every 60 seconds) |
| successCondition | Canary stage proceeds only if this condition is satisfied |
| failureLimit | Automatic rollback triggered after this many failures |
| provider.prometheus.query | The actual PromQL query — the heart of the SLI definition |
The canary traffic flow proceeds as follows:
10% traffic shifted → AnalysisRun runs → SLO passes
→ 30% traffic shifted → AnalysisRun runs → SLO passes
→ ... → 100% shift complete
(SLO failure at any stage → immediate rollback to 0%)

Practical Tip: For the address field, use the FQDN format http://service-name.namespace.svc.cluster.local:port to ensure stable DNS resolution within the cluster. Shortened addresses that omit the namespace may fail DNS resolution in cross-namespace environments.
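For completeness, the AnalysisTemplate is wired into a deployment through the canary steps of a Rollout resource. The sketch below is illustrative: the resource names, image, and step weights are our assumptions, while the templateName must match the AnalysisTemplate defined above.

```yaml
# k8s/rollout/api-rollout.yaml (illustrative sketch)
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-service
  namespace: production
spec:
  replicas: 5
  selector:
    matchLabels:
      app: api-service
  strategy:
    canary:
      steps:
        - setWeight: 10              # shift 10% of traffic to the canary
        - analysis:
            templates:
              - templateName: slo-analysis  # runs the AnalysisRun defined above
        - setWeight: 30
        - analysis:
            templates:
              - templateName: slo-analysis
        - setWeight: 100
  template:
    metadata:
      labels:
        app: api-service
    spec:
      containers:
        - name: api
          image: example/api:latest
```

If any AnalysisRun fails its failureLimit, the controller aborts the rollout and routes all traffic back to the stable version.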
Example 2: Sloth + GitHub Actions — Burn Rate Deployment Gate at the PR Stage
This pattern defines SLOs as YAML using Sloth, queries the Prometheus burn rate in GitHub Actions, and blocks the deployment stage when the threshold is exceeded.
Step 1: Define SLO with Sloth
```yaml
# slo/api-availability.yaml
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: api-availability
  namespace: monitoring
spec:
  service: "api-service"
  slos:
    - name: "availability"
      objective: 99.9 # 99.9% availability SLO
      sli:
        events:
          error_query: |
            sum(rate(http_requests_total{code=~"5.."}[{{.window}}]))
          total_query: |
            sum(rate(http_requests_total[{{.window}}]))
      alerting:
        name: APIAvailabilityBurnRate
        pageAlert:
          labels:
            severity: critical
            team: platform
```

Running sloth generate -i slo/api-availability.yaml | kubectl apply -f - automatically generates Prometheus Recording Rules and Alert Rules.
Important: Recording Rule names generated by Sloth may vary by version. Verify the actual generated rule names with kubectl get prometheusrule -n monitoring -o yaml before using them in the GitHub Actions query below.
Step 2: GitHub Actions Deployment Gate
```yaml
# .github/workflows/deploy.yml
name: Deploy with Error Budget Gate
on:
  push:
    branches: [main]
jobs:
  check-error-budget:
    runs-on: ubuntu-latest
    steps:
      - name: Check Error Budget Burn Rate
        env:
          PROMETHEUS: ${{ secrets.PROMETHEUS_URL }}
        run: |
          BURN_RATE=$(curl -sf "$PROMETHEUS/api/v1/query" \
            --data-urlencode \
            "query=slo:error_budget_burn_rate:ratio_rate1h" \
            | jq -r '.data.result[0].value[1]')
          # Block as fail-safe if Prometheus returns no response or empty value
          if [[ -z "$BURN_RATE" || "$BURN_RATE" == "null" ]]; then
            echo "::error::Failed to retrieve burn rate from Prometheus. Blocking deployment."
            exit 1
          fi
          echo "Current burn rate: ${BURN_RATE}x"
          if (( $(echo "$BURN_RATE > 10" | bc -l) )); then
            echo "::error::Error budget burn rate ${BURN_RATE}x exceeds threshold (10x)."
            echo "::error::Deployment blocked. Please check SLO status."
            exit 1
          fi
          echo "Error budget status healthy. Proceeding with deployment."
  deploy:
    needs: check-error-budget # Runs only after gate passes
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4 # The k8s/ manifests must be checked out before applying
      - name: Deploy to Kubernetes
        run: kubectl apply -f k8s/
```

Terminology: slo:error_budget_burn_rate:ratio_rate1h is an example of a Recording Rule name auto-generated by Sloth. A Recording Rule is a Prometheus feature that pre-stores the results of complex PromQL calculations to improve query performance. The empty-value guard ([[ -z "$BURN_RATE" ]]) is an essential safety mechanism that prevents the gate from automatically opening (fail-open) in error scenarios.
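The gate logic itself is portable beyond GitHub Actions (for example, to a GitLab or Jenkins job). Below is a minimal Python sketch of the same fail-closed decision; the function name check_gate and the 10x default threshold are our illustrative choices, not part of any tool's API:

```python
from typing import Optional

def check_gate(burn_rate: Optional[float], threshold: float = 10.0) -> bool:
    """Return True if deployment may proceed.

    Fails closed: a missing burn rate (Prometheus unreachable or empty query
    result) blocks the deploy, mirroring the empty-value guard in Example 2.
    """
    if burn_rate is None:
        return False  # fail-closed: no data means no deploy
    return burn_rate <= threshold

assert check_gate(None) is False   # no data -> block (fail-closed)
assert check_gate(12.5) is False   # burn rate above threshold -> block
assert check_gate(1.2) is True     # healthy budget -> deploy
```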
Example 3: Flagger + Istio — Service Mesh-Based Automatic Canary
Prerequisite: This example applies only when Istio service mesh is installed in the cluster. If you are not using a service mesh, refer to Example 1 (Argo Rollouts).
Both tools serve the same role (canary SLO gate) but take different approaches. This comparison can help you choose the right tool for your environment.
| Category | Argo Rollouts | Flagger |
|---|---|---|
| Dependencies | Kubernetes only | Requires a service mesh such as Istio or Linkerd |
| Traffic control | Direct control via Rollout resource | Control via auto-generated VirtualService |
| Ecosystem integration | Tightly integrated with Argo CD | Native integration with Flux CD |
| Best suited for | Kubernetes environments without a service mesh | Microservice environments based on Istio/Linkerd |
When Flagger detects a Canary object, it automatically creates an Istio VirtualService to progressively shift traffic. It measures Prometheus metrics at each stage and immediately restores the original version if a threshold is exceeded.
```yaml
# canary/api-canary.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: api-service
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service
  service:
    port: 80
  analysis:
    interval: 1m # Analyze every 1 minute
    threshold: 5 # Rollback after 5 failures
    maxWeight: 50 # Maximum 50% canary traffic
    stepWeight: 10 # Increase by 10% increments
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99 # Success rate at least 99% (percentage, range 0–100)
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500 # P99 latency at most 500ms (in milliseconds)
        interval: 1m
```

| Stage | Traffic Percentage | Action |
|---|---|---|
| Stage 1 | 10% | Measure Prometheus metrics, evaluate SLO |
| Stage 2 | 20% | Increase traffic after previous stage passes |
| ... | ... | Repeat |
| Final success | 100% | Canary fully promoted to primary |
| On failure | 0% | Immediate rollback, Kubernetes event recorded |
Unit note: thresholdRange.min: 99 for request-success-rate is a percentage (%) based on Flagger's built-in metrics. When defining custom metrics directly, always verify the unit of values returned by the PromQL query (decimal ratio vs. percentage) before setting thresholds.
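To see why the unit matters, compare a raw PromQL success ratio against Flagger's percentage threshold. An illustrative sketch (the helper name is ours):

```python
def to_percent(ratio: float) -> float:
    """Convert a PromQL decimal ratio (0-1) to a Flagger-style percentage (0-100)."""
    return ratio * 100.0

success_ratio = 0.995   # what a raw PromQL ratio query returns (99.5% success)
flagger_min = 99        # thresholdRange.min for request-success-rate

print(success_ratio >= flagger_min)              # False: a raw ratio never clears 99
print(to_percent(success_ratio) >= flagger_min)  # True: converted to percent, it passes
```

A unit mismatch like this makes the gate fail every canary stage, so every deployment silently rolls back.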
Example 4: Multi-Window Burn Rate Alerts — Google SRE Recommended Pattern
Using only a single short window causes two problems: reacting to transient spikes (noise) or missing long-term degradation (slow response). The Google SRE Workbook recommends combining a short window and a long window with an AND condition to detect both patterns.
```yaml
# prometheus/slo-alerts.yaml
groups:
  - name: slo.rules
    rules:
      # Fast burn alert: combination of 1h + 5m windows
      - alert: ErrorBudgetFastBurn
        expr: |
          (
            slo:error_budget_burn_rate:ratio_rate1h > 14.4
            and
            slo:error_budget_burn_rate:ratio_rate5m > 14.4
          )
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Fast burn detected: burn rate {{ $value }}x"
          description: "Monthly budget will be exhausted in roughly 2 days at the current rate"
      # Slow burn alert: combination of 6h + 30m windows
      - alert: ErrorBudgetSlowBurn
        expr: |
          (
            slo:error_budget_burn_rate:ratio_rate6h > 6
            and
            slo:error_budget_burn_rate:ratio_rate30m > 6
          )
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Slow burn detected: burn rate {{ $value }}x"
          description: "A sustained low burn rate is quietly depleting the budget over time"
```

| Alert Type | Window Combination | Threshold | Detects |
|---|---|---|---|
| Fast burn | 1h + 5m | 14.4x | Sudden outages requiring immediate response |
| Slow burn | 6h + 30m | 6x | Long-running low-level performance degradation |
Terminology: Slow Burn is a pattern where the burn rate is low but sustained over a long period, quietly depleting the budget. Because it is difficult to detect with short windows alone, it is recommended to combine it with a long window of 6 hours or more using an AND condition.
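The AND semantics of a multi-window rule reduce to a one-line predicate. An illustrative sketch (the function name and arguments are ours):

```python
def multiwindow_alert(long_window_rate: float, short_window_rate: float,
                      threshold: float) -> bool:
    """Fire only when BOTH windows exceed the threshold: the long window
    proves the burn is sustained, the short window proves it is still happening."""
    return long_window_rate > threshold and short_window_rate > threshold

# Fast burn check (1h + 5m windows, 14.4x threshold)
print(multiwindow_alert(15.0, 16.0, 14.4))  # True: sustained and still ongoing
print(multiwindow_alert(15.0, 2.0, 14.4))   # False: the burn already stopped, no page
print(multiwindow_alert(3.0, 20.0, 14.4))   # False: transient spike, no page
```

The second case is why the short window matters: without it, an alert would keep firing long after the incident is over.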
Pros and Cons Analysis
Advantages
| Item | Details |
|---|---|
| Automated reliability protection | Deployments are automatically blocked on SLO violations without human judgment, even at night or on weekends |
| Balance between development speed and stability | Fast deployments when budget is healthy, automatic braking when risk is high |
| Objective criteria across teams | Deployment decisions shift from subjective judgment to data-driven decisions |
| Audit trail | All deployment decisions are recorded in Git history in a GitOps environment |
| Progressive risk mitigation | Minimizes production risk when combined with Progressive Delivery |
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| False Positives | Thresholds that are too strict will block legitimate deployments | Adjust burn rate thresholds gradually; start with warnings only |
| Initial setup complexity | Requires infrastructure investment in Prometheus, Argo Rollouts, alert rules, etc. | Minimize boilerplate using Sloth/Pyrra |
| Incorrect SLI definitions | A poorly defined SLI renders the entire gate system meaningless | Mandatory team-wide review and documentation before defining SLIs |
| Window selection issues | Short windows produce noise; long windows produce slow responses | Use the Google SRE recommended multi-window combination |
Terminology: SLI (Service Level Indicator) is the actual measured metric (e.g., HTTP success rate, P99 latency), while SLO is the target value that the SLI must achieve (e.g., success rate ≥ 99.9%). If the SLI is poorly defined, the SLO and burn rate calculations both become meaningless.
Team Preparation Before Adoption
Error budget policy automation cannot be completed through technical configuration alone. For automatic deployment blocking to have a real impact on the team, the following organizational groundwork must come first:
- Agree on SLO targets: It is critical to first agree on realistic SLO numbers with service owners and the product team. If you proceed with technical configuration without that agreement, every automatic block will generate team conflict.
- Document hotfix bypass paths: If the gate also blocks security patch or P0 incident response deployments, it can lead to even larger outages. It is important for the team to agree on and document gate bypass procedures (e.g., using specific Git tags or labels).
- Gradual cultural adaptation: Start with warning notifications only instead of blocking, and switch to an actual blocking policy after the team becomes comfortable with the data.
Most Common Mistakes in Practice
- Setting SLO targets too high — A target like 99.999% leaves only 26 seconds of monthly error budget, meaning the budget is perpetually exhausted. Start with realistic targets (99.5%–99.9%).
- Omitting the fail-safe guard in burn rate checks — When Prometheus doesn't respond or the query result is empty, the gate can fail open automatically. Always include the empty-value guard from Example 2 ([[ -z "$BURN_RATE" ]]).
- Using only a single short window — Monitoring only a 5-minute window causes the gate to trigger on transient spikes, reducing reliability. Follow the Google SRE Workbook recommendation and use the 1h+5m (fast burn) and 6h+30m (slow burn) combination.
Closing Thoughts
The essence of error budget policy automation is transforming "whether a deployment is allowed" from a team's implicit judgment into an objective contract enforced by the system. This transformation does not happen overnight, but it can be started small and expanded incrementally.
Steps you can take right now:
- First, agree on realistic SLO targets with service owners and the product team. Discussing whether 99.9% or 99.5% strikes the right balance between your team's deployment velocity and business requirements should come before any technical configuration.
- Define one SLI from an existing service and observe it for two weeks. Adding the query sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) to a Grafana dashboard will give you real measured data to use as the basis for setting a realistic SLO target.
- Codify your SLO as YAML using Sloth and commit it to Git. You can see firsthand how Prometheus Recording Rules and Alert Rules are automatically generated with a single command: sloth generate -i slo/api-availability.yaml | kubectl apply -f -.
- Add a burn rate check step to your existing CI/CD pipeline, but start by outputting only a warning message instead of exit 1. Switching to an actual blocking policy after the team is comfortable with the data lets you introduce automation while minimizing cultural resistance.
Next Article (Series Upcoming): A deep-dive Progressive Delivery guide to simultaneously implementing A/B testing and SLO-based automatic rollback in canary deployments using Argo Rollouts' AnalysisTemplate.
References
- Google SRE Workbook — Error Budget Policy
- Google SRE Workbook — Alerting on SLOs (Prometheus)
- Google SRE Workbook — Implementing SLOs
- Argo Rollouts — Analysis & Progressive Delivery Official Docs
- Argo Rollouts — Prometheus Analysis Official Docs
- Flagger Official Docs
- Flagger — Istio Canary Deployments Tutorial
- GitHub — slok/sloth: Prometheus SLO Generator
- GitHub — pyrra-dev/pyrra: SLOs with Prometheus
- Nobl9 — CI/CD Integration Guide
- OpenSLO Spec Docs (Nobl9)
- Datadog — Burn Rate Alerts Official Docs
- GitOps in 2025: From Old-School Updates to the Modern Way | CNCF
- Error Budgets 2.0 — Agentic AI for SLO-Apprehensive Deployment | DZone
- Error Budgets in Practice: A Data-Driven Approach | DEV Community
- GitOps using Flux and Flagger | InfraCloud