Stabilizing MCP Servers with HPA Custom Metrics + Grafana Dashboards: Practical Operation of AI Agent Servers on Kubernetes
The moment an AI agent goes into production, the infrastructure team faces a perplexing situation: the HPA, configured with a 70% CPU threshold, fails to respond to traffic spikes at all, and user requests pile up in the waiting queue by the hundreds. The reason is that LLM inference workloads bottleneck on model latency, which skyrockets while CPU consumption stays low. Deploy without autoscaling on the right signals, and the situation goes unchecked while the error rate surges during every spike.
This article covers step-by-step how to deploy an MCP (Model Context Protocol) AI agent server on Kubernetes, configure HPA with custom metrics based on call latency and error rates, and make the entire stack observable with Prometheus and Grafana.
If you read this article to the end, you can configure the Prometheus Adapter and custom metric HPA yourself, and complete a dashboard in Grafana that displays p99 latency, error rate, and queue depth at a glance.
Prerequisites: this article is intended for backend/infrastructure developers with basic kubectl usage, an understanding of Kubernetes Deployment and Service concepts, and experience building Docker images. Minikube or kind is recommended for local practice; the same steps apply to production EKS, GKE, and AKS clusters.
Key Concepts
MCP Server and Kubernetes Deployment Architecture
The Model Context Protocol (MCP) is a communication protocol that enables AI assistants to access external tools and data sources in a standardized manner. The MCP server acts as a context layer between the "agent and the outside world," standardizing communication between AI agents and tools just as REST APIs standardize communication between web services.
MCP (Model Context Protocol): An open standard proposed by Anthropic that enables LLM agents to access resources such as file systems, databases, and external APIs through a consistent interface.
Deployment in a Kubernetes environment follows the following pattern.
- Transport: Streamable HTTP for production, stdio for local development
- Packaging: Docker image → Kubernetes Deployment
- Configuration management: OAuth, resource limits, and telemetry options managed together via Helm charts
- Authentication: as of the March 2025 specification revision, HTTP-based MCP servers must implement OAuth 2.1 (stdio transport is exempt)
HPA Operation Principles and Custom Metrics
HPA checks the metric every 15 seconds and determines the number of Pods based on the ratio of the current value to the target value.
desiredReplicas = ceil(currentReplicas × (currentMetricValue / targetMetricValue))

The problem lies in the characteristics of the LLM workload. While CPU utilization on a standard web server rises as request volume grows, most of an LLM inference request's time is spent waiting on the model and on I/O. No matter how carefully the CPU threshold is tuned, it cannot detect the average response time climbing from 300ms to 3,000ms.
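As a quick sanity check, the desired-replica formula can be sketched in TypeScript (the numbers below are illustrative, not from a real cluster):

```typescript
// HPA's scaling decision: scale the current replica count by the ratio
// of the observed metric to the target, rounding up.
function desiredReplicas(
  currentReplicas: number,
  currentMetric: number,
  targetMetric: number,
): number {
  return Math.ceil(currentReplicas * (currentMetric / targetMetric));
}

// 4 Pods averaging 600 ms latency against a 200 ms target:
// ceil(4 × (600 / 200)) = 12 Pods.
console.log(desiredReplicas(4, 600, 200)); // 12
```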
| Metric Type | Source | Suitability for MCP Servers |
|---|---|---|
| CPU usage | kubelet | Low — LLM latency is not reflected in CPU |
| Memory usage | kubelet | Medium — useful only as a secondary metric |
| Call latency (p99) | Prometheus | High — directly tied to perceived quality |
| Error rate (5xx) | Prometheus | High — key metric for service stability |
| Queue depth | Prometheus | Very high — enables preemptive scaling |
Overall Flow of Monitoring Stack
```text
MCP server Pod
 └── /metrics endpoint (Prometheus format)
        ↓
Prometheus
 └── scrapes every 15 seconds
        ↓
Prometheus Adapter
 └── converts to the custom.metrics.k8s.io API
        ↓
HPA controller
 └── queries metrics every 15 seconds → decides Pod count
        ↓
Grafana
 └── visualization and alerting
```

Prometheus Adapter: a bridge component that converts Prometheus metrics into the Kubernetes Custom Metrics API (custom.metrics.k8s.io). Without it, HPA cannot read Prometheus metrics directly.
Now, let's apply these concepts to an actual cluster step by step.
Practical Application
Step 1: Configure MCP Server /metrics Endpoint
To use HPA custom metrics, the MCP server must first expose the metrics in Prometheus format. In particular, mcp_request_duration_seconds must be exposed as a histogram type. If exposed as a simple gauge or counter, the histogram_quantile PromQL used later will not work.
```typescript
// Node.js (TypeScript) example
import express from 'express';
import { Registry, Histogram, Counter, Gauge } from 'prom-client';

const app = express();
const registry = new Registry();

// histogram type is required: _bucket, _sum, and _count series are generated automatically
export const requestDuration = new Histogram({
  name: 'mcp_request_duration_seconds',
  help: 'MCP request processing latency',
  labelNames: ['method', 'status'],
  buckets: [0.05, 0.1, 0.2, 0.5, 1, 2, 5], // in seconds
  registers: [registry],
});

export const errorsTotal = new Counter({
  name: 'mcp_errors_total',
  help: 'Total number of MCP errors',
  labelNames: ['error_type'],
  registers: [registry],
});

export const queueDepth = new Gauge({
  name: 'mcp_queue_depth',
  help: 'Number of requests waiting to be processed',
  labelNames: ['service'],
  registers: [registry],
});

// /metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', registry.contentType);
  res.end(await registry.metrics());
});
```

Note also that `resources.requests` should be set on the target Deployment: HPA cannot compute utilization-based resource metrics without it, and scheduling density becomes unpredictable as replicas scale.
```yaml
# mcp-server-deployment.yaml (key parts)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mcp-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: mcp-server
  template:
    metadata:
      labels:
        app: mcp-server
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: mcp-server
          image: your-registry/mcp-server:latest
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "250m"    # required for HPA to work
              memory: "512Mi"
            limits:
              cpu: "2"
              memory: "2Gi"
```

Step 2: Install Prometheus Adapter and Set Up Custom Metrics
First, check the cluster's Prometheus service name. The service name and namespace vary depending on the installation environment.
```shell
# Find the Prometheus service name (differs per environment)
kubectl get svc -n monitoring

# Install the Adapter, pointing at the service you found
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus-adapter prometheus-community/prometheus-adapter \
  --namespace monitoring \
  --set prometheus.url=http://prometheus-operated.monitoring.svc.cluster.local \
  --set prometheus.port=9090
```

Next, define the custom metric rules. When computing average latency, you must divide `rate(_sum)` by `rate(_count)`. Using the rate of `_sum` alone gives the rate of increase of total accumulated latency, not the actual average latency per request.
```yaml
# prometheus-adapter-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-adapter
  namespace: monitoring
data:
  config.yaml: |
    rules:
      - seriesQuery: 'mcp_request_duration_seconds_sum{service="mcp-server"}'
        resources:
          overrides:
            namespace: { resource: namespace }
            pod: { resource: pod }
        name:
          matches: "^mcp_request_(.*)_seconds_sum"
          as: "mcp_request_${1}_latency"
        # Correct average latency: divide sum by count to get the actual latency value
        metricsQuery: |
          avg(
            rate(mcp_request_duration_seconds_sum{service="mcp-server"}[2m])
            / rate(mcp_request_duration_seconds_count{service="mcp-server"}[2m])
          )
      - seriesQuery: 'mcp_errors_total{service="mcp-server"}'
        resources:
          overrides:
            namespace: { resource: namespace }
            pod: { resource: pod }
        name:
          matches: "^mcp_(.*)_total"
          as: "mcp_${1}_rate"
        metricsQuery: 'sum(rate(mcp_errors_total{service="mcp-server"}[2m]))'
```

Why the `resources` stanza is needed: without this block, HPA cannot find the metrics at namespace/Pod scope in `custom.metrics.k8s.io`. If it is omitted, `kubectl describe hpa` will just repeat the "unable to get metric" error.
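The reason `rate(_sum)` must be divided by `rate(_count)` can be sketched with two hypothetical scrapes of the histogram's cumulative series (the numbers are made up for illustration):

```typescript
// Two Prometheus scrapes of the same histogram, 60 seconds apart.
// _sum accumulates total latency in seconds; _count accumulates requests.
const scrapeA = { sum: 120.0, count: 500 };
const scrapeB = { sum: 150.0, count: 560 };
const windowSeconds = 60;

const rateSum = (scrapeB.sum - scrapeA.sum) / windowSeconds;     // latency-seconds per second
const rateCount = (scrapeB.count - scrapeA.count) / windowSeconds; // requests per second

// Dividing the two rates cancels the window, leaving Δsum / Δcount:
// 30 seconds of latency spread over 60 requests = 0.5 s average per request.
const avgLatency = rateSum / rateCount;
console.log(avgLatency); // 0.5
```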
Verify that the custom metric API is exposed correctly.
```shell
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/" | jq '.resources[].name'
```

Step 3: HPA Custom Metrics Setup
Calculate the maxReplicas value from the number of cluster nodes and the Pods' resource requests. For example, in an environment with 5 nodes of 4 CPUs each, Pods requesting `cpu: 250m` could theoretically fit up to 80 replicas. Here we set it to 20 to leave a safety margin.
```yaml
# mcp-server-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mcp-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mcp-server
  minReplicas: 2
  maxReplicas: 20   # 5 nodes × 4 CPU / 0.25 (requests) × 0.25 (safety margin)
  metrics:
    - type: Pods
      pods:
        metric:
          name: mcp_request_duration_latency
        target:
          type: AverageValue
          averageValue: "200m"   # scale out when average latency per Pod exceeds 200ms
    - type: Pods
      pods:
        metric:
          name: mcp_errors_rate
        target:
          type: AverageValue
          averageValue: "0.01"   # scale out when the error rate per Pod exceeds 1%
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0   # react to spikes immediately
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15   # at most double every 15 seconds
    scaleDown:
      stabilizationWindowSeconds: 300   # shrink only after 5 minutes of stability
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
```

| Setting | Value | Meaning |
|---|---|---|
| `minReplicas: 2` | 2 | Minimum Pod count to avoid a single point of failure |
| `averageValue: "200m"` | 200ms | Triggers scale-out when this latency is exceeded |
| `scaleUp.stabilizationWindowSeconds: 0` | Immediate | Responds instantly to latency spikes |
| `scaleDown.stabilizationWindowSeconds: 300` | 5 min | Prevents premature scale-in on temporary load drops |
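The maxReplicas sizing described above can be sketched as a small helper (the cluster dimensions are the hypothetical ones from this example, not a recommendation):

```typescript
// Theoretical Pod capacity = total CPU / per-Pod CPU request,
// then scaled down by a safety margin fraction.
function sizeMaxReplicas(
  nodes: number,
  cpusPerNode: number,
  cpuRequest: number,   // per-Pod CPU request, in cores (250m = 0.25)
  safetyMargin: number, // fraction of theoretical capacity to actually allow
): number {
  const theoreticalMax = Math.floor((nodes * cpusPerNode) / cpuRequest);
  return Math.floor(theoreticalMax * safetyMargin);
}

// 5 nodes × 4 CPUs / 250m requests = 80 Pods theoretical; 25% margin → 20.
console.log(sizeMaxReplicas(5, 4, 0.25, 0.25)); // 20
```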
Step 4: Adding Queue-Based Preemptive Scaling with KEDA
While HPA reacts to current latency levels, KEDA preemptively scales up Pods the moment queues start to build up. Since the two methods have different roles, they can be used together in a complementary manner.
```yaml
# mcp-server-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: mcp-server-keda
spec:
  scaleTargetRef:
    name: mcp-server
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-operated.monitoring.svc.cluster.local:9090
        metricName: mcp_request_queue_depth
        query: 'sum(mcp_queue_depth{service="mcp-server"})'
        threshold: '30'             # scale out when 30 or more requests are queued
        activationThreshold: '5'    # scaler stays inactive below 5 queued requests (avoids needless startups)
```

KEDA (Kubernetes Event-driven Autoscaling): an autoscaler supporting over 65 event sources (Kafka, Redis, SQS, Prometheus, and more). You can cut costs by setting minReplicaCount: 0 so that Pods scale to zero when completely idle. However, for latency-sensitive applications such as MCP servers, it is safer to keep minReplicaCount at 1 or higher, since cold starts can take 30 to 120 seconds.
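The interaction of `threshold` and `activationThreshold` can be sketched as follows. This is a deliberately simplified model (real KEDA delegates the scaling math to the HPA it creates), using the values from the ScaledObject above:

```typescript
// Simplified KEDA trigger behavior: below activationThreshold the scaler
// is inactive; above it, replicas track ceil(queueDepth / threshold),
// clamped between the min and max replica counts.
function kedaTargetReplicas(
  queueDepth: number,
  threshold: number,           // 30 in the ScaledObject above
  activationThreshold: number, // 5 in the ScaledObject above
  minReplicas: number,
  maxReplicas: number,
): number {
  if (queueDepth < activationThreshold) return minReplicas; // inactive
  const wanted = Math.ceil(queueDepth / threshold);
  return Math.min(maxReplicas, Math.max(minReplicas, wanted));
}

console.log(kedaTargetReplicas(3, 30, 5, 1, 20));  // queue of 3 < 5 → stays at 1
console.log(kedaTargetReplicas(90, 30, 5, 1, 20)); // ceil(90 / 30) → 3
```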
Step 5: Configure Grafana Dashboard Core Panels
These are the four core PromQL panels required for operation. histogram_quantile in Panel 1 works only if the MCP server exposes metrics as a histogram type (see Step 1).
```promql
# Panel 1: p99 call latency — requires a histogram-type metric
histogram_quantile(0.99,
  rate(mcp_request_duration_seconds_bucket[5m])
)

# Panel 2: error rate (%) — assumes standard HTTP request counters are also exposed
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100

# Panel 3: current number of active Pods
count(up{job="mcp-servers"})

# Panel 4: queue depth (real time)
mcp_queue_depth{service="mcp-server"}
```

| Panel | Alert Threshold | Meaning |
|---|---|---|
| p99 latency | > 500ms | Onset of perceived quality degradation |
| Error rate | > 1% | Service stability boundary |
| Pod count | 80% of maxReplicas | Early warning of hitting the capacity limit |
| Queue depth | > 50 | Signal that processing delay is accumulating |
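To see why Panel 1 needs histogram buckets, here is a simplified sketch of the interpolation behind `histogram_quantile` (it ignores the `+Inf` bucket and multi-series aggregation; the bucket counts are hypothetical):

```typescript
// Cumulative bucket: `count` requests observed with latency ≤ `le` seconds.
type Bucket = { le: number; count: number };

function quantileFromBuckets(q: number, buckets: Bucket[]): number {
  const total = buckets[buckets.length - 1].count;
  const rank = q * total; // the rank-th slowest request defines the quantile
  let prevLe = 0;
  let prevCount = 0;
  for (const b of buckets) {
    if (b.count >= rank) {
      // Linear interpolation inside the bucket, as Prometheus does.
      return prevLe + (b.le - prevLe) * ((rank - prevCount) / (b.count - prevCount));
    }
    prevLe = b.le;
    prevCount = b.count;
  }
  return prevLe;
}

// Bucket boundaries mirror the Step 1 configuration: [0.05, 0.1, 0.2, 0.5, 1, 2, 5].
const buckets: Bucket[] = [
  { le: 0.05, count: 400 },
  { le: 0.1, count: 700 },
  { le: 0.2, count: 900 },
  { le: 0.5, count: 980 },
  { le: 1, count: 995 },
  { le: 2, count: 999 },
  { le: 5, count: 1000 },
];
// rank 990 of 1000 falls in the 0.5–1 s bucket → p99 ≈ 0.83 s.
console.log(quantileFromBuckets(0.99, buckets));
```

A plain gauge or counter cannot support this: without the per-bucket cumulative counts there is nothing to interpolate, which is why Step 1 insists on the histogram type.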
Step 6: Integrating Distributed Trace with OpenTelemetry Collector
By collecting distributed traces per MCP tool call in addition to metrics, you can immediately identify which tool is causing the latency.
```yaml
# otel-collector-configmap.yaml
receivers:
  otlp:   # the receiver referenced by the traces pipeline — must be defined
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
        - job_name: 'mcp-servers'
          static_configs:
            - targets: ['localhost:8080']
processors:
  batch:
    timeout: 10s
    send_batch_size: 1024
exporters:
  prometheus:
    endpoint: "0.0.0.0:8888"
  jaeger:
    endpoint: jaeger-collector:14250
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [prometheus]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [jaeger]
```

Pros and Cons Analysis
Situations where this stack is suitable
| Item | Description |
|---|---|
| Irregular LLM traffic patterns | Combining latency, error rate, and queue depth responds precisely to unpredictable spikes |
| Cost optimization needed | With KEDA minReplicaCount: 0, Pods can be removed entirely when idle |
| Diverse event sources | Over 65 triggers (Kafka, Redis, SQS, etc.) consolidated into a single autoscaler |
| Vendor-neutral observability | Metrics, traces, and logs unified on OpenTelemetry standards, with no platform lock-in |
Situations where this stack should not be used
| Item | Description | Alternative |
|---|---|---|
| Small single cluster | Adapter and KEDA management overhead is excessive relative to service scale | Native CPU/memory HPA is sufficient |
| Environment without Prometheus | The entire custom-metric HPA pipeline depends on Prometheus | Use an external metrics provider such as Datadog or New Relic |
| Services that cannot tolerate cold starts | Pod startup of ~30 seconds plus model warm-up of up to 2 minutes | Always keep minimum replicas; configure a warm-up readinessProbe |
| Insufficient cardinality management | Unique-value labels such as `user_id` generate millions of time series, causing out-of-memory (OOM) | Enforce code review so that labels allow only low-cardinality values |
Cardinality: The number of unique time series for a metric. Using unique values like user_id as labels generates millions of time series, causing Prometheus memory to skyrocket. Please use only low-cardinality values for labels, such as service names, endpoints, and HTTP status codes.
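The cardinality blow-up is easy to estimate: the number of time series per metric is roughly the product of each label's distinct value count. A quick sketch with hypothetical label sizes:

```typescript
// Series per metric ≈ product of the distinct-value counts of its labels.
function seriesCount(labelCardinalities: number[]): number {
  return labelCardinalities.reduce((acc, n) => acc * n, 1);
}

// Low-cardinality labels: method (10 values) × status (5 values) → 50 series. Safe.
console.log(seriesCount([10, 5]));
// Add a user_id label with 1M users → 50 million series. OOM territory.
console.log(seriesCount([10, 5, 1_000_000]));
```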
The Most Common Mistakes in Practice
- Setting up CPU-based HPA only: LLM inference uses very little CPU while requests wait on the model, so CPU utilization completely misses the scaling moment. Make call latency and queue depth the primary triggers from the start.
- Omitting the `resources` stanza in the Prometheus Adapter ConfigMap: deploying without this block means the HPA cannot find the metric in `custom.metrics.k8s.io`, producing a continuous "unable to get metric" error. Check API exposure first with `kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/"`.
- Setting scaleDown `stabilizationWindowSeconds` to 0: unlike scale-up, scale-down needs a stabilization window. At 0, the Pod count oscillates, repeatedly shrinking and growing with every momentary dip in traffic, which actually reduces availability.
In Conclusion
The criteria for judgment are simple. For services like LLM-based MCP servers, where request latency fluctuates from hundreds of milliseconds to several seconds and traffic patterns are irregular, a custom metric HPA is essential. On the other hand, for services with predictable and consistent traffic, CPU/memory HPA alone is sufficient, and the overhead of adding a Prometheus Adapter is not justified in that case.
3 Steps to Start Right Now:
1. Add the `/metrics` endpoint to the MCP server: implement the three metrics `mcp_request_duration_seconds` (histogram type), `mcp_errors_total`, and `mcp_queue_depth`, following the Step 1 code snippet above.
2. Install the Prometheus Adapter and verify API exposure: after `helm install prometheus-adapter ...`, confirm that the custom metrics API responds via `kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/"`.
3. Apply the HPA YAML and verify operation: from `kubectl apply -f mcp-server-hpa.yaml`, monitor metric collection and scaling events in real time with `kubectl describe hpa mcp-server-hpa`.
Next Post: How to Implement Canary Deployment and Automatic Rollback on MCP Server by Combining KEDA and Argo Rollouts
Reference Materials
- Kubernetes MCP Server - AI-powered cluster management | Red Hat — Overview of the official MCP Server architecture for Kubernetes
- Scale LLM Tools With a Remote MCP Architecture on Kubernetes | The New Stack — Practical Examples of Remote MCP Server Scaling Patterns
- 15 Best Practices for Building MCP Servers in Production | The New Stack — MCP Server Production Operation Checklist
- Horizontal Pod Autoscaling Walkthrough | Kubernetes Official Documentation — Official Reference for HPA Configuration and Custom Metrics
- How to Use Custom Metrics with Kubernetes HPA | OneUptime — Detailed Explanation of Prometheus Adapter ConfigMap Rules
- Prometheus MCP Server - AI-driven monitoring intelligence | AWS — Prometheus MCP Integration in AWS Environment
- MCP Server Monitoring Via Prometheus & Grafana | Medium — Grafana Dashboard Panel Configuration Example
- MCP Observability with OpenTelemetry | SigNoz — OTel Collector Integration and Distributed Trace Setup
- Monitor MCP servers with OpenLIT and Grafana Cloud | Grafana — MCP monitoring based on Grafana Cloud
- KEDA - Kubernetes Event-driven Autoscaling | KEDA Official — KEDA Trigger Types and ScaledObject Reference
- Autoscaling AI Inference Workloads with KEDA | KEDAify — LLM Inference Workload KEDA Cost Optimization Case Study
- Best practices for autoscaling LLM inference workloads on GKE | Google Cloud — Official HPA recommendations for LLM in GKE environments
- The great migration: Why every AI platform is converging on Kubernetes | CNCF — Analysis of Kubernetes Convergence Trends in AI Infrastructure