Stabilizing MCP Servers with HPA Custom Metrics + Grafana Dashboards: Practical Operation of AI Agent Servers on Kubernetes
The moment an AI agent goes into production, the infrastructure team faces a perplexing situation: the HPA, configured with a 70% CPU threshold, fails to respond to traffic spikes at all, and user requests pile up in the waiting queue by the hundreds. The reason is that LLM inference workloads bottleneck on model latency, which skyrockets while CPU consumption stays low. Deploy without autoscaling on the right signals, and the situation goes unchecked while the error rate surges during every spike.
This article covers step-by-step how to deploy an MCP (Model Context Protocol) AI agent server on Kubernetes, configure HPA with custom metrics based on call latency and error rates, and make the entire stack observable with Prometheus and Grafana.
If you read this article to the end, you can configure the Prometheus Adapter and custom metric HPA yourself, and complete a dashboard in Grafana that displays p99 latency, error rate, and queue depth at a glance.
Prerequisites: this article is intended for backend/infrastructure developers with basic kubectl usage, an understanding of Kubernetes Deployment and Service concepts, and experience building Docker images. Minikube or kind is recommended for local practice; the same steps apply to production EKS, GKE, and AKS clusters.
Key Concepts
MCP Server and Kubernetes Deployment Architecture
The Model Context Protocol (MCP) is a communication protocol that enables AI assistants to access external tools and data sources in a standardized manner. The MCP server acts as a context layer between the "agent and the outside world," standardizing communication between AI agents and tools just as REST APIs standardize communication between web services.
MCP (Model Context Protocol): An open standard proposed by Anthropic that enables LLM agents to access resources such as file systems, databases, and external APIs through a consistent interface.
Deployment in a Kubernetes environment follows the following pattern.
- Transport: Streamable HTTP for production, stdio for local development
- Packaging: Docker image → Kubernetes Deployment
- Configuration management: OAuth, resource limits, and telemetry options managed together via Helm charts
- Authentication: as of the March 2025 specification revision, HTTP-based MCP servers must implement OAuth 2.1 (stdio transport is exempt)
HPA Operation Principles and Custom Metrics
HPA checks the metric every 15 seconds and determines the number of Pods based on the ratio of the current value to the target value.
desiredReplicas = ceil(currentReplicas × (currentMetricValue / targetMetricValue))

The problem lies in the characteristics of the LLM workload. While CPU utilization on a standard web server rises as request volume grows, most of an LLM inference request's time is spent waiting on the model and on I/O. No matter how carefully the CPU threshold is tuned, it cannot detect the average response time climbing from 300ms to 3,000ms.
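As a quick sanity check, the desired-replica formula can be sketched in TypeScript (the numbers below are illustrative, not from a real cluster):

```typescript
// HPA's scaling decision: scale the current replica count by the ratio
// of the observed metric to the target, rounding up.
function desiredReplicas(
  currentReplicas: number,
  currentMetric: number,
  targetMetric: number,
): number {
  return Math.ceil(currentReplicas * (currentMetric / targetMetric));
}

// 4 Pods averaging 600 ms latency against a 200 ms target:
// ceil(4 × (600 / 200)) = 12 Pods.
console.log(desiredReplicas(4, 600, 200)); // 12
```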
| Metric Type | Source | Suitability for MCP Servers |
|---|---|---|
| CPU usage | kubelet | Low — LLM latency is not reflected in CPU |
| Memory usage | kubelet | Medium — useful only as a secondary metric |
| Call latency (p99) | Prometheus | High — directly tied to perceived quality |
| Error rate (5xx) | Prometheus | High — key metric for service stability |
| Queue depth | Prometheus | Very high — enables preemptive scaling |
Overall Flow of Monitoring Stack
```text
MCP server Pod
 └── /metrics endpoint (Prometheus format)
        ↓
Prometheus
 └── scrapes every 15 seconds
        ↓
Prometheus Adapter
 └── converts to the custom.metrics.k8s.io API
        ↓
HPA controller
 └── queries metrics every 15 seconds → decides Pod count
        ↓
Grafana
 └── visualization and alerting
```

Prometheus Adapter: a bridge component that converts Prometheus metrics into the Kubernetes Custom Metrics API (custom.metrics.k8s.io). Without it, HPA cannot read Prometheus metrics directly.
Now, let's apply these concepts to an actual cluster step by step.
Practical Application
Step 1: Configure MCP Server /metrics Endpoint
To use HPA custom metrics, the MCP server must first expose the metrics in Prometheus format. In particular, mcp_request_duration_seconds must be exposed as a histogram type. If exposed as a simple gauge or counter, the histogram_quantile PromQL used later will not work.
```typescript
// Node.js (TypeScript) example
import express from 'express';
import { Registry, Histogram, Counter, Gauge } from 'prom-client';

const app = express();
const registry = new Registry();

// histogram type is required: _bucket, _sum, and _count series are generated automatically
export const requestDuration = new Histogram({
  name: 'mcp_request_duration_seconds',
  help: 'MCP request processing latency',
  labelNames: ['method', 'status'],
  buckets: [0.05, 0.1, 0.2, 0.5, 1, 2, 5], // in seconds
  registers: [registry],
});

export const errorsTotal = new Counter({
  name: 'mcp_errors_total',
  help: 'Total number of MCP errors',
  labelNames: ['error_type'],
  registers: [registry],
});

export const queueDepth = new Gauge({
  name: 'mcp_queue_depth',
  help: 'Number of requests waiting to be processed',
  labelNames: ['service'],
  registers: [registry],
});

// /metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', registry.contentType);
  res.end(await registry.metrics());
});
```

Note also that `resources.requests` should be set on the target Deployment: HPA cannot compute utilization-based resource metrics without it, and scheduling density becomes unpredictable as replicas scale.
```yaml
# mcp-server-deployment.yaml (key parts)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mcp-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: mcp-server
  template:
    metadata:
      labels:
        app: mcp-server
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: mcp-server
          image: your-registry/mcp-server:latest
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "250m"    # required for HPA to work
              memory: "512Mi"
            limits:
              cpu: "2"
              memory: "2Gi"
```

Step 2: Install Prometheus Adapter and Set Up Custom Metrics
First, check the cluster's Prometheus service name. The service name and namespace vary depending on the installation environment.
```shell
# Find the Prometheus service name (differs per environment)
kubectl get svc -n monitoring

# Install the Adapter, pointing at the service you found
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus-adapter prometheus-community/prometheus-adapter \
  --namespace monitoring \
  --set prometheus.url=http://prometheus-operated.monitoring.svc.cluster.local \
  --set prometheus.port=9090
```

Next, define the custom metric rules. When computing average latency, you must divide `rate(_sum)` by `rate(_count)`. Using the rate of `_sum` alone gives the rate of increase of total accumulated latency, not the actual average latency per request.
```yaml
# prometheus-adapter-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-adapter
  namespace: monitoring
data:
  config.yaml: |
    rules:
      - seriesQuery: 'mcp_request_duration_seconds_sum{service="mcp-server"}'
        resources:
          overrides:
            namespace: { resource: namespace }
            pod: { resource: pod }
        name:
          matches: "^mcp_request_(.*)_seconds_sum"
          as: "mcp_request_${1}_latency"
        # Correct average latency: divide sum by count to get the actual latency value
        metricsQuery: |
          avg(
            rate(mcp_request_duration_seconds_sum{service="mcp-server"}[2m])
            / rate(mcp_request_duration_seconds_count{service="mcp-server"}[2m])
          )
      - seriesQuery: 'mcp_errors_total{service="mcp-server"}'
        resources:
          overrides:
            namespace: { resource: namespace }
            pod: { resource: pod }
        name:
          matches: "^mcp_(.*)_total"
          as: "mcp_${1}_rate"
        metricsQuery: 'sum(rate(mcp_errors_total{service="mcp-server"}[2m]))'
```

Why the `resources` stanza is needed: without this block, HPA cannot find the metrics at namespace/Pod scope in `custom.metrics.k8s.io`. If it is omitted, `kubectl describe hpa` will just repeat the "unable to get metric" error.
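The reason `rate(_sum)` must be divided by `rate(_count)` can be sketched with two hypothetical scrapes of the histogram's cumulative series (the numbers are made up for illustration):

```typescript
// Two Prometheus scrapes of the same histogram, 60 seconds apart.
// _sum accumulates total latency in seconds; _count accumulates requests.
const scrapeA = { sum: 120.0, count: 500 };
const scrapeB = { sum: 150.0, count: 560 };
const windowSeconds = 60;

const rateSum = (scrapeB.sum - scrapeA.sum) / windowSeconds;     // latency-seconds per second
const rateCount = (scrapeB.count - scrapeA.count) / windowSeconds; // requests per second

// Dividing the two rates cancels the window, leaving Δsum / Δcount:
// 30 seconds of latency spread over 60 requests = 0.5 s average per request.
const avgLatency = rateSum / rateCount;
console.log(avgLatency); // 0.5
```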
Verify that the custom metric API is exposed correctly.
```shell
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/" | jq '.resources[].name'
```

Step 3: HPA Custom Metrics Setup
Calculate the maxReplicas value from the number of cluster nodes and the Pods' resource requests. For example, in an environment with 5 nodes of 4 CPUs each, Pods requesting `cpu: 250m` could theoretically fit up to 80 replicas. Here we set it to 20 to leave a safety margin.
```yaml
# mcp-server-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mcp-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mcp-server
  minReplicas: 2
  maxReplicas: 20   # 5 nodes × 4 CPU / 0.25 (requests) × 0.25 (safety margin)
  metrics:
    - type: Pods
      pods:
        metric:
          name: mcp_request_duration_latency
        target:
          type: AverageValue
          averageValue: "200m"   # scale out when average latency per Pod exceeds 200ms
    - type: Pods
      pods:
        metric:
          name: mcp_errors_rate
        target:
          type: AverageValue
          averageValue: "0.01"   # scale out when the error rate per Pod exceeds 1%
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0   # react to spikes immediately
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15   # at most double every 15 seconds
    scaleDown:
      stabilizationWindowSeconds: 300   # shrink only after 5 minutes of stability
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
```

| Setting | Value | Meaning |
|---|---|---|
| `minReplicas: 2` | 2 | Minimum Pod count to avoid a single point of failure |
| `averageValue: "200m"` | 200ms | Triggers scale-out when this latency is exceeded |
| `scaleUp.stabilizationWindowSeconds: 0` | Immediate | Responds instantly to latency spikes |
| `scaleDown.stabilizationWindowSeconds: 300` | 5 min | Prevents premature scale-in on temporary load drops |
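The maxReplicas sizing described above can be sketched as a small helper (the cluster dimensions are the hypothetical ones from this example, not a recommendation):

```typescript
// Theoretical Pod capacity = total CPU / per-Pod CPU request,
// then scaled down by a safety margin fraction.
function sizeMaxReplicas(
  nodes: number,
  cpusPerNode: number,
  cpuRequest: number,   // per-Pod CPU request, in cores (250m = 0.25)
  safetyMargin: number, // fraction of theoretical capacity to actually allow
): number {
  const theoreticalMax = Math.floor((nodes * cpusPerNode) / cpuRequest);
  return Math.floor(theoreticalMax * safetyMargin);
}

// 5 nodes × 4 CPUs / 250m requests = 80 Pods theoretical; 25% margin → 20.
console.log(sizeMaxReplicas(5, 4, 0.25, 0.25)); // 20
```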
Step 4: Adding Queue-Based Preemptive Scaling with KEDA
While HPA reacts to current latency levels, KEDA preemptively scales up Pods the moment queues start to build up. Since the two methods have different roles, they can be used together in a complementary manner.
```yaml
# mcp-server-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: mcp-server-keda
spec:
  scaleTargetRef:
    name: mcp-server
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-operated.monitoring.svc.cluster.local:9090
        metricName: mcp_request_queue_depth
        query: 'sum(mcp_queue_depth{service="mcp-server"})'
        threshold: '30'             # scale out when 30 or more requests are queued
        activationThreshold: '5'    # scaler stays inactive below 5 queued requests (avoids needless startups)
```

KEDA (Kubernetes Event-driven Autoscaling): an autoscaler supporting over 65 event sources (Kafka, Redis, SQS, Prometheus, and more). You can cut costs by setting minReplicaCount: 0 so that Pods scale to zero when completely idle. However, for latency-sensitive applications such as MCP servers, it is safer to keep minReplicaCount at 1 or higher, since cold starts can take 30 to 120 seconds.
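The interaction of `threshold` and `activationThreshold` can be sketched as follows. This is a deliberately simplified model (real KEDA delegates the scaling math to the HPA it creates), using the values from the ScaledObject above:

```typescript
// Simplified KEDA trigger behavior: below activationThreshold the scaler
// is inactive; above it, replicas track ceil(queueDepth / threshold),
// clamped between the min and max replica counts.
function kedaTargetReplicas(
  queueDepth: number,
  threshold: number,           // 30 in the ScaledObject above
  activationThreshold: number, // 5 in the ScaledObject above
  minReplicas: number,
  maxReplicas: number,
): number {
  if (queueDepth < activationThreshold) return minReplicas; // inactive
  const wanted = Math.ceil(queueDepth / threshold);
  return Math.min(maxReplicas, Math.max(minReplicas, wanted));
}

console.log(kedaTargetReplicas(3, 30, 5, 1, 20));  // queue of 3 < 5 → stays at 1
console.log(kedaTargetReplicas(90, 30, 5, 1, 20)); // ceil(90 / 30) → 3
```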
Step 5: Configure Grafana Dashboard Core Panels
These are the four core PromQL panels required for operation. histogram_quantile in Panel 1 works only if the MCP server exposes metrics as a histogram type (see Step 1).
```promql
# Panel 1: p99 call latency — requires a histogram-type metric
histogram_quantile(0.99,
  rate(mcp_request_duration_seconds_bucket[5m])
)

# Panel 2: error rate (%) — assumes standard HTTP request counters are also exposed
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100

# Panel 3: current number of active Pods
count(up{job="mcp-servers"})

# Panel 4: queue depth (real time)
mcp_queue_depth{service="mcp-server"}
```

| Panel | Alert Threshold | Meaning |
|---|---|---|
| p99 latency | > 500ms | Onset of perceived quality degradation |
| Error rate | > 1% | Service stability boundary |
| Pod count | 80% of maxReplicas | Early warning of hitting the capacity limit |
| Queue depth | > 50 | Signal that processing delay is accumulating |
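To see why Panel 1 needs histogram buckets, here is a simplified sketch of the interpolation behind `histogram_quantile` (it ignores the `+Inf` bucket and multi-series aggregation; the bucket counts are hypothetical):

```typescript
// Cumulative bucket: `count` requests observed with latency ≤ `le` seconds.
type Bucket = { le: number; count: number };

function quantileFromBuckets(q: number, buckets: Bucket[]): number {
  const total = buckets[buckets.length - 1].count;
  const rank = q * total; // the rank-th slowest request defines the quantile
  let prevLe = 0;
  let prevCount = 0;
  for (const b of buckets) {
    if (b.count >= rank) {
      // Linear interpolation inside the bucket, as Prometheus does.
      return prevLe + (b.le - prevLe) * ((rank - prevCount) / (b.count - prevCount));
    }
    prevLe = b.le;
    prevCount = b.count;
  }
  return prevLe;
}

// Bucket boundaries mirror the Step 1 configuration: [0.05, 0.1, 0.2, 0.5, 1, 2, 5].
const buckets: Bucket[] = [
  { le: 0.05, count: 400 },
  { le: 0.1, count: 700 },
  { le: 0.2, count: 900 },
  { le: 0.5, count: 980 },
  { le: 1, count: 995 },
  { le: 2, count: 999 },
  { le: 5, count: 1000 },
];
// rank 990 of 1000 falls in the 0.5–1 s bucket → p99 ≈ 0.83 s.
console.log(quantileFromBuckets(0.99, buckets));
```

A plain gauge or counter cannot support this: without the per-bucket cumulative counts there is nothing to interpolate, which is why Step 1 insists on the histogram type.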
Step 6: Integrating Distributed Trace with OpenTelemetry Collector
By collecting distributed traces per MCP tool call in addition to metrics, you can immediately identify which tool is causing the latency.
```yaml
# otel-collector-configmap.yaml
receivers:
  otlp:   # the receiver referenced by the traces pipeline — must be defined
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
        - job_name: 'mcp-servers'
          static_configs:
            - targets: ['localhost:8080']
processors:
  batch:
    timeout: 10s
    send_batch_size: 1024
exporters:
  prometheus:
    endpoint: "0.0.0.0:8888"
  jaeger:
    endpoint: jaeger-collector:14250
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [prometheus]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [jaeger]
```

Pros and Cons Analysis
Situations where this stack is suitable
| Item | Description |
|---|---|
| Irregular LLM traffic patterns | Combining latency, error rate, and queue depth responds precisely to unpredictable spikes |
| Cost optimization needed | With KEDA minReplicaCount: 0, Pods can be removed entirely when idle |
| Diverse event sources | Over 65 triggers (Kafka, Redis, SQS, etc.) consolidated into a single autoscaler |
| Vendor-neutral observability | Metrics, traces, and logs unified on OpenTelemetry standards, with no platform lock-in |
Situations where this stack should not be used
| Item | Description | Alternative |
|---|---|---|
| Small single cluster | Adapter and KEDA management overhead is excessive relative to service scale | Native CPU/memory HPA is sufficient |
| Environment without Prometheus | The entire custom-metric HPA pipeline depends on Prometheus | Use an external metrics provider such as Datadog or New Relic |
| Services that cannot tolerate cold starts | Pod startup of ~30 seconds plus model warm-up of up to 2 minutes | Always keep minimum replicas; configure a warm-up readinessProbe |
| Insufficient cardinality management | Unique-value labels such as `user_id` generate millions of time series, causing out-of-memory (OOM) | Enforce code review so that labels allow only low-cardinality values |
Cardinality: The number of unique time series for a metric. Using unique values like user_id as labels generates millions of time series, causing Prometheus memory to skyrocket. Please use only low-cardinality values for labels, such as service names, endpoints, and HTTP status codes.
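The cardinality blow-up is easy to estimate: the number of time series per metric is roughly the product of each label's distinct value count. A quick sketch with hypothetical label sizes:

```typescript
// Series per metric ≈ product of the distinct-value counts of its labels.
function seriesCount(labelCardinalities: number[]): number {
  return labelCardinalities.reduce((acc, n) => acc * n, 1);
}

// Low-cardinality labels: method (10 values) × status (5 values) → 50 series. Safe.
console.log(seriesCount([10, 5]));
// Add a user_id label with 1M users → 50 million series. OOM territory.
console.log(seriesCount([10, 5, 1_000_000]));
```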
The Most Common Mistakes in Practice
- Setting up CPU-based HPA only: LLM inference uses very little CPU while requests wait on the model, so CPU utilization completely misses the scaling moment. Make call latency and queue depth the primary triggers from the start.
- Omitting the `resources` stanza in the Prometheus Adapter ConfigMap: deploying without this block means the HPA cannot find the metric in `custom.metrics.k8s.io`, producing a continuous "unable to get metric" error. Check API exposure first with `kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/"`.
- Setting scaleDown `stabilizationWindowSeconds` to 0: unlike scale-up, scale-down needs a stabilization window. At 0, the Pod count oscillates, repeatedly shrinking and growing with every momentary dip in traffic, which actually reduces availability.
In Conclusion
The criteria for judgment are simple. For services like LLM-based MCP servers, where request latency fluctuates from hundreds of milliseconds to several seconds and traffic patterns are irregular, a custom metric HPA is essential. On the other hand, for services with predictable and consistent traffic, CPU/memory HPA alone is sufficient, and the overhead of adding a Prometheus Adapter is not justified in that case.
3 Steps to Start Right Now:
1. Add the `/metrics` endpoint to the MCP server: implement the three metrics `mcp_request_duration_seconds` (histogram type), `mcp_errors_total`, and `mcp_queue_depth`, following the Step 1 code snippet above.
2. Install the Prometheus Adapter and verify API exposure: after `helm install prometheus-adapter ...`, confirm that the custom metrics API responds via `kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/"`.
3. Apply the HPA YAML and verify operation: from `kubectl apply -f mcp-server-hpa.yaml`, monitor metric collection and scaling events in real time with `kubectl describe hpa mcp-server-hpa`.
Next Post: How to Implement Canary Deployment and Automatic Rollback on MCP Server by Combining KEDA and Argo Rollouts
Reference Materials
- Kubernetes MCP Server - AI-powered cluster management | Red Hat — Overview of the official MCP Server architecture for Kubernetes
- Scale LLM Tools With a Remote MCP Architecture on Kubernetes | The New Stack — Practical Examples of Remote MCP Server Scaling Patterns
- 15 Best Practices for Building MCP Servers in Production | The New Stack — MCP Server Production Operation Checklist
- Horizontal Pod Autoscaling Walkthrough | Kubernetes Official Documentation — Official Reference for HPA Configuration and Custom Metrics
- How to Use Custom Metrics with Kubernetes HPA | OneUptime — Detailed Explanation of Prometheus Adapter ConfigMap Rules
- Prometheus MCP Server - AI-driven monitoring intelligence | AWS — Prometheus MCP Integration in AWS Environment
- MCP Server Monitoring Via Prometheus & Grafana | Medium — Grafana Dashboard Panel Configuration Example
- MCP Observability with OpenTelemetry | SigNoz — OTel Collector Integration and Distributed Trace Setup
- Monitor MCP servers with OpenLIT and Grafana Cloud | Grafana — MCP monitoring based on Grafana Cloud
- KEDA - Kubernetes Event-driven Autoscaling | KEDA Official — KEDA Trigger Types and ScaledObject Reference
- Autoscaling AI Inference Workloads with KEDA | KEDAify — LLM Inference Workload KEDA Cost Optimization Case Study
- Best practices for autoscaling LLM inference workloads on GKE | Google Cloud — Official HPA recommendations for LLM in GKE environments
- The great migration: Why every AI platform is converging on Kubernetes | CNCF — Analysis of Kubernetes Convergence Trends in AI Infrastructure