Implementing AI Agent (MCP) Server Canary Deployment and Automatic Rollback with KEDA + Argo Rollouts
For a Model Context Protocol (MCP) server, where an LLM directly calls external tools, a single bad version can bring the entire AI workflow to a halt. Standard rolling updates do not adequately control this risk: during a version transition, old and new pods coexist, and when some requests are routed to a defective version there is no immediate rollback, so error detection is left to humans. The MCP server provides the schemas (function signatures, parameter types) of the tools the LLM agent uses, and keeping the transition window short is critical because a schema mismatch between versions causes the LLM tool calls themselves to fail.
This article examines step-by-step how to automate MCP server canary deployments by combining the Kubernetes event-driven autoscaler KEDA with the incremental deployment controller Argo Rollouts. The goal is to create an autonomous system that reduces operating costs by up to 90% by achieving scale-to-zero during off-peak hours when there are no requests, and automatically executes rollbacks based on Prometheus metrics the moment the success rate drops below 95%.
By combining KEDA's event-driven autoscaling with Argo Rollouts' incremental traffic switching, you can achieve both stability and cost-effectiveness in MCP server deployments.
Before Reading This Article: It is assumed that you are familiar with Kubernetes Pod, Service, and HPA concepts, Prometheus metric basics, and Istio VirtualService concepts. This article is intended for backend and DevOps developers with experience in Kubernetes operations.
Key Concepts
KEDA: Event-based Autoscaling
KEDA (Kubernetes Event-Driven Autoscaling) is a CNCF graduated project that extends the native Kubernetes HPA (Horizontal Pod Autoscaler). The key difference lies in the breadth of trigger sources: in addition to CPU and memory, it can use over 65 external event sources as scaling criteria, including Kafka queue depth, Prometheus query results, and the number of AWS SQS messages.
Two features particularly useful for MCP servers:
scale-to-zero: Reduces the number of pods to zero when there are no requests, eliminating idle costs entirely.
Argo Rollout Direct Target Support: You can directly specify a Rollout CRD for scaleTargetRef within ScaledObject to auto-scale resources currently in canary deployment simultaneously.
```yaml
# Note: KEDA automatically creates an HPA internally.
# An existing HPA will conflict with it, so remove any HPA first and use only the ScaledObject.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
spec:
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1  # target the Rollout CRD, not a Deployment
    kind: Rollout
    name: mcp-server
```
When multiple triggers are configured, KEDA applies the largest of the target replica counts calculated by each trigger. If the connection-count trigger requires 5 pods and the queue-depth trigger requires 8 pods, it scales up to 8. If you are not aware of this behavior, you may be puzzled when more pods appear than expected.
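As a concrete illustration of this rule, consider a two-trigger fragment like the sketch below. The metric values in the comments are hypothetical, chosen to match the numbers above:

```yaml
triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090
      metricName: mcp_active_connections
      query: sum(mcp_active_connections{namespace="ai-services"})
      threshold: "50"   # e.g. 250 active connections -> ceil(250 / 50) = 5 pods
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090
      metricName: mcp_request_queue_depth
      query: sum(mcp_request_queue_depth{namespace="ai-services"})
      threshold: "100"  # e.g. 800 queued requests -> ceil(800 / 100) = 8 pods
# KEDA (via the HPA it generates) applies the larger target: 8 pods
```

Each Prometheus trigger divides its query result by its threshold and rounds up; the generated HPA then takes the maximum across all triggers.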
scale-to-zero: KEDA polls the external event sources and, when minReplicaCount is 0 and no events are found, scales the workload down to zero pods. A plain HPA maintains at least 1 replica, but KEDA can shut the workload down entirely.
Argo Rollouts: Incremental Deployment Controller
Argo Rollouts is a Progressive Delivery controller that replaces the standard Kubernetes Deployment. If you declare canary steps in the Rollout CRD, it works with service meshes such as Istio or Linkerd to split traffic by weight.
The key is AnalysisTemplate. It executes a Prometheus query during the deployment phase to evaluate the success rate, and if the conditions are not met, it automatically reverts to the stable version. There is no need for a human to press the approval button.
AnalysisTemplate: Argo Rollouts' CRD that automatically evaluates deployment quality by integrating with metric providers such as Prometheus, Datadog, and CloudWatch. It declaratively defines evaluation intervals, allowed number of failures, and success conditions.
Considerations for Deploying MCP Server and Kubernetes
MCP (Model Context Protocol) is an LLM tool integration standard designed by Anthropic and currently managed by the Linux Foundation. Thanks to the Streamable HTTP transport method introduced in the 2025 specification update, MCP servers can now be horizontally scaled as remote services on Kubernetes.
There are two reasons why the canary strategy is particularly important for MCP servers. First, the MCP server provides the schemas for the tools called by the LLM; if schemas diverge between versions, the LLM tool calls themselves fail, and even a brief window in which old and new versions are mixed can halt the entire AI workflow or produce inconsistent LLM inference results. Second, if the MCP server maintains session state, connections may be dropped during the canary transition.
The MCP 2025 specification recommends a stateless design, and KEDA's scale-to-zero and canary traffic switching operate safely only if there is no session state. If state is required, it is recommended to separate it into an external storage such as Redis.
Prerequisites: kubectl, Helm, and the Argo Rollouts kubectl plugin (kubectl argo rollouts) must be installed. Without the plugin, commands such as kubectl argo rollouts get rollout cannot be executed.
```shell
# Add the Helm chart repositories (standard upstream repos)
helm repo add argo https://argoproj.github.io/argo-helm
helm repo add kedacore https://kedacore.github.io/charts
helm repo update

# Install Argo Rollouts
kubectl create namespace argo-rollouts
helm install argo-rollouts argo/argo-rollouts -n argo-rollouts

# Install KEDA
helm install keda kedacore/keda -n keda --create-namespace
```
The examples below are applied in the following order:
Example 1: Istio DestinationRule + VirtualService — Traffic Splitting-based Configuration
For Argo Rollouts to perform weighted traffic splitting, subsets of stable and canary must first be defined in Istio's DestinationRule. Since referencing subset in VirtualService without DestinationRule will immediately result in a routing error, these must be applied first.
```yaml
# destination-rule.yaml
# Prerequisite resource for the subset (stable/canary) references in the VirtualService
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: mcp-server-destrule
  namespace: ai-services
spec:
  host: mcp-server  # must match the Kubernetes Service name
  subsets:
    - name: stable
      labels:
        app: mcp-server  # Argo Rollouts adds the rollouts-pod-template-hash label to stable pods automatically
    - name: canary
      labels:
        app: mcp-server
```
```yaml
# virtual-service.yaml
# Argo Rollouts rewrites the weights automatically at each deployment step.
# Do not edit them manually, or they will drift out of sync with the Rollout state.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: mcp-server-vsvc
  namespace: ai-services
spec:
  hosts:
    - mcp-server
  http:
    - name: primary
      route:
        - destination:
            host: mcp-server
            subset: stable
          weight: 100  # initial value; Argo Rollouts adjusts it automatically once a rollout starts
        - destination:
            host: mcp-server
            subset: canary
          weight: 0
```
Example 2: Argo Rollout — MCP Server Canary Deployment Strategy
Replace the existing Deployment with a Rollout. Transition traffic in stages from 10% → 30% → 60% → 100%, and run AnalysisTemplate at the 30% stage to automatically verify the success rate. The key is to pass the hash value of the current Canary Pod to AnalysisTemplate via analysis.args. Without this value, it is impossible to evaluate metrics filtered solely for Canary Pods.
| Field | Description |
| --- | --- |
| setWeight: 10 | Routes 10% of traffic to canary pods via the Istio VirtualService |
| pause: {duration: 2m} | Waits 2 minutes, then automatically proceeds to the next step |
| analysis.args.canary-hash | If you specify podTemplateHashValue: Latest, the Rollout automatically injects the current canary's rollouts-pod-template-hash label value into the AnalysisRun |
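Putting the steps above together, the Rollout manifest might look like the following sketch. The container image and port are assumptions; the VirtualService, DestinationRule, and AnalysisTemplate names match Examples 1 and 3:

```yaml
# rollout.yaml (sketch)
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: mcp-server
  namespace: ai-services
spec:
  selector:
    matchLabels:
      app: mcp-server
  template:
    metadata:
      labels:
        app: mcp-server
    spec:
      containers:
        - name: mcp-server
          image: registry.example.com/mcp-server:v2  # assumed image
          ports:
            - containerPort: 8080                    # assumed port
  strategy:
    canary:
      trafficRouting:
        istio:
          virtualService:
            name: mcp-server-vsvc      # VirtualService from Example 1
            routes:
              - primary
          destinationRule:
            name: mcp-server-destrule  # DestinationRule from Example 1
            stableSubsetName: stable
            canarySubsetName: canary
      steps:
        - setWeight: 10
        - pause: {duration: 2m}
        - setWeight: 30
        - analysis:                    # quality gate at the 30% stage
            templates:
              - templateName: mcp-success-rate      # AnalysisTemplate from Example 3
            args:
              - name: canary-hash
                valueFrom:
                  podTemplateHashValue: Latest      # injects the current canary pod hash
        - setWeight: 60
        - pause: {duration: 2m}
        - setWeight: 100
```

Note that spec.replicas is intentionally omitted: the KEDA ScaledObject in Example 4 manages the replica count.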
Example 3: AnalysisTemplate — Prometheus-based Automatic Rollback Conditions
AnalysisTemplate is the "automatic quality gate" for canary deployments. It queries Prometheus every 60 seconds, records a failure if the MCP request success rate is less than 95%, and triggers an automatic rollback after 3 consecutive failures.
```yaml
# analysis-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: mcp-success-rate
  namespace: ai-services
spec:
  args:
    - name: canary-hash  # variable injected from the Rollout's analysis.args
  metrics:
    - name: success-rate
      interval: 60s    # evaluate every 60s (at least 4x the 15s Prometheus scrape interval recommended)
      failureLimit: 3  # roll back after 3 consecutive failures (avoids false positives from transient metric glitches)
      # result[0]: how to access the value when the Prometheus query returns a single scalar
      # vector (time-series set) queries have a different result structure, so be careful
      successCondition: result[0] >= 0.95
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            # rate(metric[2m]): per-second average rate of increase over the last 2 minutes
            # sum(): aggregates values across pods
            # {version="{{args.canary-hash}}"}: filters to canary pods only
            # without this label, canary errors are diluted into the overall totals and go undetected
            sum(rate(mcp_requests_total{status="success", version="{{args.canary-hash}}"}[2m]))
            /
            sum(rate(mcp_requests_total{version="{{args.canary-hash}}"}[2m]))
    - name: error-rate
      interval: 60s
      failureLimit: 2
      successCondition: result[0] <= 0.05  # error rate at or below 5%
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(mcp_requests_total{status="error", version="{{args.canary-hash}}"}[2m]))
            /
            sum(rate(mcp_requests_total{version="{{args.canary-hash}}"}[2m]))
```
How to read the PromQL: rate(metric[2m]) calculates the per-second rate of increase over the past 2 minutes. sum() aggregates the values across pods, and dividing the rate of successful requests by the rate of all requests gives the success rate. Without the {version="..."} label filter, stable and canary pods are mixed together, and canary errors are diluted in the overall success rate.
Example 4: KEDA ScaledObject — Event-based Autoscaling
Once Rollout is defined, attach KEDA ScaledObject to add event-based scaling. The key is to directly specify Rollout in scaleTargetRef.
```yaml
# scaled-object.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: mcp-server-scaler
  namespace: ai-services
spec:
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout        # target the Rollout, not a Deployment
    name: mcp-server
  minReplicaCount: 2     # weight calculation is only correct with at least 2 replicas during the canary phase
  maxReplicaCount: 20
  cooldownPeriod: 60     # unit: seconds; wait 60s before scaling down
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: mcp_active_connections
        query: sum(mcp_active_connections{namespace="ai-services"})
        threshold: "50"  # one pod per 50 active connections
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: mcp_request_queue_depth
        query: sum(mcp_request_queue_depth{namespace="ai-services"})
        threshold: "100" # one pod per 100 queued requests
```
| Field | Description |
| --- | --- |
| minReplicaCount: 2 | If set to 0, the replica count can drop to 0 during a canary deployment, which breaks the Rollout's weight calculation |
| cooldownPeriod: 60 | In seconds; prevents premature scale-down while MCP sessions are still being processed |
| Multiple trigger priority | Between connection count and queue depth, the larger of the replica counts calculated by each trigger is applied |
Check the deployment status in real time with the following commands:

```shell
# Check rollout steps and AnalysisRun status
kubectl argo rollouts get rollout mcp-server --watch -n ai-services

# Check KEDA ScaledObject behavior
kubectl get scaledobject mcp-server-scaler -n ai-services
```
Pros and Cons Analysis
Advantages
| Item | Description |
| --- | --- |
| Zero-downtime incremental deployment | The 10% → 30% → 60% → 100% phase transitions minimize the blast radius and limit the impact of an issue to canary traffic |
| Automatic rollback | When Prometheus metrics fall below threshold, the stable version is restored immediately without human intervention, shortening MTTR |
| Event-driven cost optimization | Scale-to-zero during periods with no AI agent requests reduces MCP server operating costs by up to 90% |
| Full GitOps integration | The Argo CD + Argo Rollouts combination manages all deployment state declaratively in Git |
| Multiple metric conditions | Combining conditions such as success rate, error rate, and latency P99 with AND enables sophisticated quality gates |
Disadvantages and Precautions
| Item | Problem | Mitigation |
| --- | --- | --- |
| KEDA ↔ Rollout state conflict | The canary weight ratio is distorted if KEDA changes the replica count mid-rollout | Set minReplicaCount above the minimum canary replica count and tune cooldownPeriod |
| KEDA and HPA conflict | KEDA automatically generates an HPA internally, so it conflicts with any existing HPA | Remove existing HPAs and use only the ScaledObject |
| Metric detection delay | Error detection lags because of the gap between the Prometheus scrape interval (15s) and the AnalysisTemplate interval | Set the interval to at least 4x the scrape interval (60s recommended) |
| MCP session drops | Connections may be lost during the canary switch if Streamable HTTP session state is kept in the pod | Design the MCP server as stateless and offload session state to external storage such as Redis |
| Complex debugging | Failure causes are hard to trace because three components (KEDA, Argo Rollouts, Istio) are all involved | Check status step by step with kubectl argo rollouts get rollout mcp-server --watch and collect logs per component |
The Most Common Mistakes in Practice
Setting failureLimit: 1 in AnalysisTemplate: Prometheus scrape timing can produce transient NaN results. Set failureLimit: 3 or higher to prevent false-positive rollbacks.
Combining KEDA minReplicaCount: 0 with a canary deployment: if replicas scale down to 0 during the canary phase, the Rollout controller's weight calculation breaks. Keep a minimum of 2 replicas during deployments.
Omitting version labels from Prometheus queries: aggregating the whole metric without distinguishing stable from canary pods dilutes canary errors into the stable success rate, hiding the problem. Always filter by the version or rollouts-pod-template-hash label.
In Conclusion
Rolling updates struggle with the schema inconsistencies and the lack of immediate rollback that occur during MCP server version transitions. The MCP server deployment pipeline combining KEDA and Argo Rollouts relies on "automatic verification during deployment" rather than "post-deployment observation," achieving two benefits simultaneously without human intervention: immediate rollback when the success rate falls below 95%, and up to 90% cost reduction during idle periods.
3 Steps to Start Right Now:
Migrate to Rollout after installing Argo Rollouts + KEDA: Install the two components using Helm, and replace the existing MCP Server Deployment with the YAML from Example 2 above. Check the progress of the steps in real-time using kubectl argo rollouts get rollout mcp-server --watch.
Connect MCP-specific metrics to AnalysisTemplate: If the mcp_requests_total metric does not exist, first add the Prometheus exporter to the server. Deploy the AnalysisTemplate with conditions failureLimit: 3 and successCondition: result[0] >= 0.95, and connect the canary-hash injection (podTemplateHashValue: Latest) to the analysis.args of the Rollout.
Configure scale-to-zero with KEDA ScaledObject: minReplicaCount: 2, maxReplicaCount: 20, set the sum(mcp_active_connections) query to threshold 50 using a Prometheus trigger, and verify normal operation with kubectl get scaledobject.
Next Article: We will examine how Flagger resolves the limitation of this configuration—that "state management is complex because KEDA ScaledObject and AnalysisTemplate are separated into different CRDs"—using a single CRD, and how to automate multi-cluster MCP server canary deployment with Argo CD ApplicationSet.