Automating Canary Rollbacks with Kargo + Argo Rollouts: AnalysisTemplate and Freight Propagation Blocking in Practice
Have you ever been afraid of a deployment? I have. Especially that moment when you say "let's send just 10% of traffic to this release first," then sit there refreshing the Slack channel, staring holes through the error rate dashboard. The process of manually reading metrics, making judgments, and deciding whether to roll back was itself the bottleneck. When people are tired, their judgment suffers—and if it's a late-night deployment, even more so.
By using Kargo and Argo Rollouts together, you can move that judgment process into code. Declare "automatically block if the error rate exceeds 5%" in YAML, and from then on the pipeline reads Prometheus on its own, evaluates success criteria, and blocks Freight propagation on failure. After switching our deployment pipeline to this structure, the number of late-night alerts dropped noticeably. In this post, I'll walk through the specifics with concrete YAML—from setting success criteria based on AnalysisTemplate, to Kargo Stage verification, and the two independent layers where automatic rollback actually happens.
Prerequisite for this post: This is especially helpful for those already running Kubernetes and doing GitOps deployments with Argo CD. Terms like Prometheus, PromQL, and CRD will appear without prior explanation, so keep that in mind.
Argo Rollouts' Analysis Engine: AnalysisTemplate
Progressive Delivery is a deployment approach that validates stability by gradually shifting traffic, as with canary or blue/green strategies. It can reduce the blast radius compared to traditional rolling updates.
Argo Rollouts is a controller that handles canary and blue/green deployments in Kubernetes. But it doesn't stop at "we sent 10% canary traffic"—it can automatically judge whether that traffic is healthy. That's AnalysisTemplate.
AnalysisTemplate is a Kubernetes CRD that declares which metrics to collect, at what interval, how many times, and by what criteria to determine success or failure. An `AnalysisRun` is a single execution instance of that template.
One important design point about AnalysisTemplate: it doesn't have to be directly tied to a Rollout object. It can exist independently and be referenced by other tools—including Kargo. This is the key that makes integration with Kargo possible.
There's another thing worth noting. If you define both successCondition and failureCondition, a middle ground emerges. If the success threshold is 95% or above and the failure threshold is below 80%, the 80–95% range becomes Inconclusive. If you only specify failureCondition, anything that doesn't fall below it is automatically treated as a success. I missed this distinction at first and spent a long time puzzling over "why is the analysis always in an inconclusive state?"
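As a minimal sketch of how those bands fall out, using the same 95%/80% thresholds (this is an illustrative fragment of a metric definition, not a complete template):

```yaml
# Illustrative fragment: where each measurement lands with both conditions set.
metrics:
  - name: success-rate
    successCondition: result[0] >= 0.95   # >= 95%         -> Successful
    failureCondition: result[0] < 0.80    # <  80%         -> Failed
                                          # 80% to <95%    -> Inconclusive (neither condition matches)
```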
How Kargo Orchestrates Promotions
Kargo is a GitOps-based promotion orchestrator built by Akuity, the company founded by creators of the Argo project. Understanding just three objects gives you the full picture.
| Object | Role |
|---|---|
| Warehouse | Detects changes in container images, Git commits, and Helm charts, and generates Freight |
| Freight | An immutable object representing a bundle of artifacts at a specific version. Moves between Stages and becomes the promotion history |
| Stage | Each environment such as dev, staging, and prod. Receives Freight, validates it, and passes it downstream |
Thinking of Freight as a "deployment ticket" makes it easier to understand. When Freight for image version `v1.2.3` passes dev Stage verification, it gets a `Verified` stamp, and the staging Stage becomes eligible to receive that Freight. If verification fails, the Freight stops at that Stage.
Where the Two Tools Connect — Two Layers Where Rollback Happens
This is the most confusing part, so let's address it directly. When Kargo and Argo Rollouts work together, "rollback" actually happens in two independent layers. Thinking of them as one will inevitably cause problems when you go to implement this.
Layer 1 — Argo Rollouts Layer: This is when analysis is defined directly in the canary steps of a Rollout object. Argo Rollouts creates an AnalysisRun on its own, and if the analysis fails, it transitions the Rollout object to an Aborted state and reverts the canary weight to 0. This behavior is handled by Argo Rollouts alone, independently of Kargo.
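For reference, a Layer 1 setup might look like the sketch below. The service name and template name are illustrative (the `success-rate-check` template is defined later in this post):

```yaml
# Sketch: analysis wired directly into a Rollout's canary steps (Layer 1).
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-service
spec:
  replicas: 5
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
        - name: my-service
          image: registry.example.com/my-service:v1.2.3
  strategy:
    canary:
      steps:
        - setWeight: 10             # send 10% of traffic to the canary
        - analysis:                 # Argo Rollouts creates this AnalysisRun itself;
            templates:              # on failure it aborts and reverts the weight to 0
              - templateName: success-rate-check
            args:
              - name: service-name
                value: my-service
        - setWeight: 50
        - pause: {duration: 10m}
```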
Layer 2 — Kargo Layer: This is when a Kargo Stage references an AnalysisTemplate in spec.verification.analysisTemplates. Once a promotion completes, Kargo creates an AnalysisRun and evaluates the result. If this analysis fails, Kargo does not mark the Freight as Verified. That means the Freight is not propagated to the next Stage. It does not directly modify the state of the Argo Rollouts Rollout object.
To summarize:
| Layer | Behavior on Failure | Responsible Component |
|---|---|---|
| Argo Rollouts analysis | Abort current Rollout + revert canary weight to 0 | Argo Rollouts |
| Kargo Stage verification | Block downstream Stage propagation for that Freight | Kargo |
Using both layers together is powerful. While Argo Rollouts immediately restores traffic for the current deployment, Kargo prevents that bad Freight from flowing into the next environment.
The full flow looks like this:
```
CI build/push
        ↓
Kargo Warehouse detects change → creates Freight
        ↓
Stage receives Freight, updates Git manifests
        ↓
Argo CD syncs to cluster (canary traffic splitting begins)
        ↓
Argo Rollouts: executes analysis steps within Rollout (if defined)
  → On failure: canary weight immediately reverts to 0 (Argo Rollouts handles this alone)
        ↓
Kargo: creates AnalysisRun for Stage verification → evaluates Prometheus metrics
  → Success: marks Freight as Verified → allows movement to downstream Stage
  → Failure: blocks downstream Stage propagation (does not directly modify Rollout state)
```

A quick note on ClusterAnalysisTemplate: while AnalysisTemplate is namespace-scoped, ClusterAnalysisTemplate can be referenced cluster-wide. It's useful when you want to manage verification criteria for the entire organization in one place, and you can reference it with `kind: ClusterAnalysisTemplate` in a Kargo Stage's `spec.verification.analysisTemplates`.
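As a sketch, the Stage-side reference then changes only in the `kind` field (the template name here is illustrative, and the assumption that `kind` defaults to AnalysisTemplate when omitted should be verified against your Kargo version):

```yaml
# Fragment of a Stage spec referencing a cluster-scoped template.
verification:
  analysisTemplates:
    - name: org-success-rate        # a ClusterAnalysisTemplate, so no namespace applies
      kind: ClusterAnalysisTemplate # omit for a namespaced AnalysisTemplate
```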
Practical Application
Example 1: Writing a Prometheus-Based AnalysisTemplate
The pattern used most often in practice is measuring both success rate and error rate simultaneously. Looking at just one can lead to missed cases. I once only watched success rate and had errors on a specific endpoint get diluted and slip through, so ever since I always define both metrics together.
interval × count = total observation time. In the example below, interval: 5m with count: 6 means "measure every 5 minutes and observe for a total of 30 minutes." The key is to set these values differently per environment—keeping count short (e.g., count: 2) for dev and longer (e.g., count: 12) for prod.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate-check
spec:
  args:
    - name: service-name            # dynamically injected from the Stage
  metrics:
    - name: success-rate
      interval: 5m                  # measure every 5 minutes
      count: 6                      # 6 measurements total = 30 minutes of observation
      successCondition: result[0] >= 0.95   # success threshold: 95% or above
      failureCondition: result[0] < 0.80    # failure threshold: below 80% (80–95% is Inconclusive)
      failureLimit: 2               # metric fails once more than 2 measurements have failed (cumulative)
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{
              service="{{args.service-name}}",
              status!~"5.."
            }[5m])) /
            sum(rate(http_requests_total{
              service="{{args.service-name}}"
            }[5m]))
    - name: error-rate
      interval: 5m
      count: 6                      # same as success-rate: 30 minutes of observation
      successCondition: result[0] < 0.05    # error rate below 5%
      failureCondition: result[0] >= 0.05   # a measurement fails if the error rate is 5% or above
      failureLimit: 3               # metric fails once more than 3 measurements have failed
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{
              service="{{args.service-name}}",
              status=~"5.."
            }[5m])) /
            sum(rate(http_requests_total{
              service="{{args.service-name}}"
            }[5m]))
```

A summary of each key field's role:
| Field | Value | Meaning |
|---|---|---|
| `interval` | `5m` | Executes the PromQL query every 5 minutes |
| `count` | `6` | Observes for 30 minutes total before the final verdict |
| `successCondition` | `result[0] >= 0.95` | Evaluated as a Go expression |
| `failureCondition` | `result[0] < 0.80` | Marks a measurement as failed (counts toward `failureLimit`) |
| `failureLimit` | `2` | Fails the metric once the cumulative failure count exceeds this value |
`result[0]` is the first value in the vector returned by the PromQL query. Because PromQL returns a vector by default, you access it with an index (`[0]`). If your query uses `scalar()` to return a single number, you can use `result` alone.
Example 2: Configuring Kargo Stage Verification
Once you've created the AnalysisTemplate, all that's left is referencing it from the Stage.
```yaml
apiVersion: kargo.akuity.io/v1alpha1
kind: Stage
metadata:
  name: staging
  namespace: my-project
spec:
  requestedFreight:
    - origin:
        kind: Warehouse
        name: my-warehouse
      sources:
        stages:
          - dev                      # only receive Freight that has been Verified in the dev Stage
  verification:
    analysisTemplates:
      - name: success-rate-check     # references the AnalysisTemplate in the same namespace
    args:
      - name: service-name
        value: my-service
```

Starting with Kargo v1.3, you can use expressions for args values, making it possible to dynamically pass in the commit hash being verified:
```yaml
args:
  - name: service-name
    value: my-service
  - name: commit
    value: "${{ freight.git.commit }}" # dynamically injects the deployed commit hash
```

Example 3: Multi-Stage Pipeline — A Financial Services Deployment Pattern
Here's one practical pattern worth referencing. It's a configuration commonly used when deploying payment services in a fintech environment. Before using this setup, someone had to stare at a dashboard for 30 minutes and make a manual judgment on every prod deployment.
```yaml
# dev Stage: lightweight HTTP health check
apiVersion: kargo.akuity.io/v1alpha1
kind: Stage
metadata:
  name: dev
  namespace: payments
spec:
  requestedFreight:
    - origin:
        kind: Warehouse
        name: payments-warehouse
      sources:
        direct: true                 # dev takes Freight straight from the Warehouse
  verification:
    analysisTemplates:
      - name: http-healthcheck       # simple Job-based health check (defined separately)
---
# staging Stage: Prometheus-based 30-minute analysis
apiVersion: kargo.akuity.io/v1alpha1
kind: Stage
metadata:
  name: staging
  namespace: payments
spec:
  requestedFreight:
    - origin:
        kind: Warehouse
        name: payments-warehouse
      sources:
        stages:
          - dev
  verification:
    analysisTemplates:
      - name: success-rate-check
    args:
      - name: service-name
        value: payments-service
---
# prod Stage: manual promotion gate + Prometheus analysis
# (prod promotions stay manual by leaving auto-promotion disabled for this
# Stage in the Project's promotionPolicies)
apiVersion: kargo.akuity.io/v1alpha1
kind: Stage
metadata:
  name: prod
  namespace: payments
spec:
  requestedFreight:
    - origin:
        kind: Warehouse
        name: payments-warehouse
      sources:
        stages:
          - staging
  promotionTemplate:
    spec:
      steps:
        - uses: argocd-update
  verification:
    analysisTemplates:
      - name: success-rate-check
    args:
      - name: service-name
        value: payments-service
```

If the error rate exceeds 5% in staging, the Freight won't be marked Verified and won't advance to prod at all. If Argo Rollouts also has analysis defined within its Rollout steps, the canary weight reverts to 0 as well. No one has to monitor things manually in the middle of the night. If you want a longer observation window in prod, a common pattern is to create a separate AnalysisTemplate that accepts `count` as a parameter, or to split out a prod-specific template entirely.
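One way to sketch that parameterized variant is below. The template and arg names are illustrative, and this relies on Argo Rollouts accepting arg substitution in the `count` field (it is an int-or-string field), so verify against the version you run:

```yaml
# Hypothetical prod-tunable template: observation length passed in as an arg.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate-check-tunable
spec:
  args:
    - name: service-name
    - name: measurement-count
      value: "6"                            # default: 30 minutes at a 5m interval
  metrics:
    - name: success-rate
      interval: 5m
      count: "{{args.measurement-count}}"   # pass e.g. "12" for a 60-minute prod window
      successCondition: result[0] >= 0.95
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}", status!~"5.."}[5m])) /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))
```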
Real-World Experience: What Worked Well and What to Watch Out For
What Worked Well
| Item | Details |
|---|---|
| Full GitOps integration | All promotion history is recorded in Git, so rollback is simply git revert |
| Automated quality gates | Automatically blocks failed Freight from propagating to downstream Stages |
| Reusability | Manage organization-wide verification criteria in one place with ClusterAnalysisTemplate |
| Rich metric providers | Supports Prometheus, Datadog, New Relic, CloudWatch, Kubernetes Jobs, and more |
| Declarative success criteria | Go expression-based conditions can express complex business rules |
What to Watch Out For
| Item | Details | Mitigation |
|---|---|---|
| Metrics infrastructure required | Observability stack such as Prometheus must already be in place | Can be set up quickly with the kube-prometheus-stack Helm chart |
| Initial setup complexity | Running three components simultaneously: Argo Rollouts + Kargo + Argo CD | Recommended to practice in a local k3d environment before applying to production |
| Analysis wait time | Next Stage promotion is blocked for the `count × interval` duration | Set `count` differently per environment (shorter for dev, longer for prod) |
| False negative risk | Inaccurate metric definitions can let bad deployments through or block good ones | Tune by progressively adjusting thresholds |
| Sharding environment caution | In distributed cluster environments, AnalysisRun may read metrics from the wrong shard, causing false positives | Kargo shard configuration must match the AnalysisRun's target cluster |
The Most Common Mistakes in Practice
- Leaving an `Inconclusive` gap between `successCondition` and `failureCondition` — With a success threshold of 95% and a failure threshold of 80%, any measurement in between lands in `Inconclusive`, and the analysis can sit there awaiting manual judgment instead of finishing. Make the two conditions complementary, or consider the simpler approach of using only `failureCondition` (anything that doesn't fail then counts as success).
- Not accounting for analysis start timing — If an AnalysisRun starts before canary Pods are in a Ready state, early measurements can be skewed. Use the `initialDelay` field to allow time for Pod stabilization.
- Confusing Kargo verification and Argo Rollouts rollout steps as the same layer — As explained above, the two layers are independent. The pipeline order must be designed clearly so that Kargo's AnalysisRun runs at the point when canary traffic is actually flowing.
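The `initialDelay` mitigation mentioned above is a one-line addition to a metric definition (illustrative fragment; the delay value should match how long your Pods typically take to become Ready):

```yaml
# Fragment: delay the first measurement so canary Pods can become Ready.
metrics:
  - name: success-rate
    initialDelay: 60s   # wait 60 seconds after the AnalysisRun starts
    interval: 5m
    count: 6
    successCondition: result[0] >= 0.95
```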
Closing Thoughts
To summarize what this post covered: by declaring success criteria in code with AnalysisTemplate and controlling Freight flow with Kargo Stage verification, the two layers each play their own role in protecting deployment stability. While Argo Rollouts immediately reverts the current canary, Kargo prevents that bad version from spreading to downstream environments.
Instead of refreshing the Slack channel, you can delegate that judgment logic to YAML and move on to the next problem. You can be someone who designs deployments rather than someone who watches them.
Three steps you can start right now:
- Install Argo Rollouts locally and experiment with `AnalysisTemplate` on its own. Installation is just one line: `kubectl create namespace argo-rollouts && kubectl apply -n argo-rollouts -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml`. You can copy the `success-rate-check` YAML from this post, change only the Prometheus address, and apply it.
- Add Kargo to a cluster that already has Argo CD, and build a mini pipeline with just two Stages: dev → staging. Following the official QuickStart lets you see Freight moving between Stages within 30 minutes.
- Attach the `verification` block from this post to the staging Stage, then intentionally trigger an error and verify that automatic blocking kicks in. Once you've seen a failure scenario with your own eyes, designing for production becomes much more concrete.
References
- Kargo: Verifying Freight in a Stage | kargo.io
- Kargo: Analysis Templates Reference | kargo.io
- Kargo Core Concepts | kargo.io
- Kargo v1.3 Release Notes — Conditional Steps & Advanced Verification | akuity.io
- Kargo v1.10 Release Notes | akuity.io
- What is Kargo? | akuity.io
- Argo Rollouts: Analysis & Progressive Delivery | argo-rollouts.readthedocs.io
- Argo Rollouts: Prometheus Analysis Provider | argo-rollouts.readthedocs.io
- Progressive Delivery with Argo Rollouts: Canary with Analysis | infracloud.io
- Continuous Promotion on Kubernetes with GitOps | piotrminkowski.com
- Canary delivery with Argo Rollout and Amazon VPC Lattice for EKS | aws.amazon.com