Stabilizing MCP Servers with HPA Custom Metrics and Grafana Dashboards: Running AI Agent Servers on Kubernetes in Practice

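The moment an AI agent reaches production, the infrastructure team runs into an awkward failure mode. An HPA configured with a 70% CPU threshold does not react to a traffic spike at all, while user requests pile up in the queue by the hundreds. The reason is that LLM inference workloads do not bottleneck on CPU; they bottleneck on model wait time, so latency balloons while CPU stays low. Deploy with CPU-only autoscaling and the error rate will climb during every spike with nothing to stop it.

This article walks through deploying an MCP (Model Context Protocol) AI agent server on Kubernetes, configuring the HPA with custom metrics based on call latency and error rate, and making the whole stack observable with Prometheus and Grafana.

By the end you will be able to configure Prometheus Adapter and a custom-metrics HPA yourself, and build a Grafana dashboard that shows p99 latency, error rate, and queue depth at a glance.

Prerequisites: this article is aimed at backend and infrastructure developers who know basic kubectl usage, understand Kubernetes Deployments and Services, and have built Docker images. minikube or kind is recommended for local practice, and the same steps apply to real clusters on EKS, GKE, or AKS.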

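Core concepts

MCP servers and the Kubernetes deployment architecture

The Model Context Protocol (MCP) is a communication protocol that gives AI assistants a standardized way to reach external tools and data sources. An MCP server is the context layer between the agent and the outside world: just as REST APIs standardized communication between web services, MCP standardizes communication between AI agents and tools.

MCP (Model Context Protocol): an open standard proposed by Anthropic that lets LLM agents access resources such as file systems, databases, and external APIs through a consistent interface.

Deployment on Kubernetes follows this pattern.

Transport: Streamable HTTP in production, Stdio for local development
Packaging: Docker image → Kubernetes Deployment
Configuration: a Helm chart that manages OAuth, resource limits, and telemetry options together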
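Authentication: since its March 2025 revision, the MCP specification bases authorization for HTTP transports on OAuth 2.1, so treat it as a requirement for any HTTP MCP server exposed beyond a trusted network (it does not apply to the Stdio transport)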

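How the HPA works, and custom metrics

The HPA queries metrics every 15 seconds (the controller's default sync period) and sets the Pod count from the ratio of the current value to the target value.

desired Pods = ceil(current Pods × (current metric value / target metric value))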
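
As a worked example of the formula above, here is a minimal TypeScript sketch. The function name is illustrative, and the 10% tolerance band is an assumption that mirrors the HPA controller's documented default; the real controller also handles missing metrics and unready Pods, which this sketch ignores.

```typescript
// Sketch of the HPA replica formula. Metric values are in milliseconds
// here so that the arithmetic stays exact.
function desiredReplicas(
  currentReplicas: number,
  currentMetric: number,
  targetMetric: number,
  tolerance: number = 0.1,
): number {
  const ratio = currentMetric / targetMetric;
  // Inside the tolerance band the replica count is left unchanged.
  if (Math.abs(ratio - 1) <= tolerance) {
    return currentReplicas;
  }
  return Math.ceil(currentReplicas * ratio);
}

// 4 Pods averaging 600 ms against a 200 ms target: ceil(4 × 3) = 12 Pods.
console.log(desiredReplicas(4, 600, 200));
// 4 Pods averaging 210 ms: ratio 1.05 is within tolerance, so no change.
console.log(desiredReplicas(4, 210, 200));
```

The second call shows why small fluctuations around the target do not cause constant rescaling.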

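The problem lies in how LLM workloads behave. On a typical web server, CPU utilization rises with request volume, but an LLM inference request spends most of its time waiting on the model and on I/O. However carefully you tune a CPU threshold, it cannot detect average response time growing from 300 ms to 3,000 ms.

Metric	Source	Fit for MCP servers
CPU utilization	kubelet (via metrics-server)	Low: LLM wait time does not show up in CPU
Memory utilization	kubelet (via metrics-server)	Moderate: use only as a secondary signal
Call latency (p99)	Prometheus	High: tracks perceived quality directly
Error rate (5xx)	Prometheus	High: core stability indicator
Queue depth	Prometheus	Very high: enables proactive scaling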

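Monitoring stack: end-to-end flow

MCP server Pod
  └── /metrics endpoint (Prometheus format)
        ↓
Prometheus
  └── scrapes every 15 seconds
        ↓
Prometheus Adapter
  └── serves the metrics through the custom.metrics.k8s.io API
        ↓
HPA controller
  └── queries metrics every 15 seconds → decides the Pod count
        ↓
Grafana
  └── visualization and alerting (queries Prometheus directly)

Prometheus Adapter: a bridge component that serves Prometheus metrics through the Kubernetes Custom Metrics API (custom.metrics.k8s.io). Without it, the HPA cannot read Prometheus metrics.

The next sections apply these concepts to a real cluster, step by step.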

Hands-on

Step 1: Expose a /metrics endpoint on the MCP server

To use HPA custom metrics, the MCP server must first expose metrics in Prometheus format. In particular, mcp_request_duration_seconds must be a histogram. If you expose it as a plain gauge or counter, the histogram_quantile PromQL used later will not work.

typescript

// Node.js (TypeScript) example.
// Assumption: the server uses Express; the original snippet did not show where `app` comes from.
import express from 'express';
import { Registry, Histogram, Counter, Gauge } from 'prom-client';

const app = express();
const registry = new Registry();

// Must be a histogram: the _bucket, _sum, and _count series are generated automatically
export const requestDuration = new Histogram({
  name: 'mcp_request_duration_seconds',
  help: 'MCP request handling latency in seconds',
  labelNames: ['method', 'status'],
  buckets: [0.05, 0.1, 0.2, 0.5, 1, 2, 5], // seconds
  registers: [registry],
});

export const errorsTotal = new Counter({
  name: 'mcp_errors_total',
  help: 'Total number of MCP errors',
  labelNames: ['error_type'],
  registers: [registry],
});

export const queueDepth = new Gauge({
  name: 'mcp_queue_depth',
  help: 'Number of requests waiting to be processed',
  labelNames: ['service'],
  registers: [registry],
});

// /metrics endpoint scraped by Prometheus
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', registry.contentType);
  res.end(await registry.metrics());
});

// Port 8080 matches the containerPort and scrape annotation in the Deployment
app.listen(8080);
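
To make the histogram requirement concrete, the following toy model shows what a histogram records for each observation. It is not prom-client and the class is purely illustrative; it only mimics the cumulative `_bucket`, `_sum`, and `_count` series that a real client library generates.

```typescript
// Toy model of a Prometheus histogram (illustrative, not a real client).
class ToyHistogram {
  private readonly counts: number[];
  private sum = 0;
  private count = 0;

  constructor(private readonly upperBounds: number[]) {
    this.counts = upperBounds.map(() => 0);
  }

  observe(seconds: number): void {
    this.sum += seconds;
    this.count += 1;
    // Buckets are cumulative: an observation increments every bucket
    // whose upper bound (the `le` label) is >= the observed value.
    this.upperBounds.forEach((le, i) => {
      if (seconds <= le) this.counts[i] += 1;
    });
  }

  // Returns the three series families a histogram exposes.
  snapshot(): { buckets: Record<string, number>; sum: number; count: number } {
    const buckets: Record<string, number> = {};
    this.upperBounds.forEach((le, i) => {
      buckets[String(le)] = this.counts[i];
    });
    buckets["+Inf"] = this.count; // the +Inf bucket always equals _count
    return { buckets, sum: this.sum, count: this.count };
  }
}

const toy = new ToyHistogram([0.05, 0.1, 0.2, 0.5, 1, 2, 5]);
[0.03, 0.15, 0.4, 3].forEach((v) => toy.observe(v));
console.log(toy.snapshot());
```

A gauge or counter keeps only a single number, so the per-bucket distribution that quantile estimation depends on is never recorded.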

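Set resources.requests on the target Deployment. Utilization-based HPA metrics (CPU or memory as a percentage) are computed against the requests and cannot be evaluated without them. The custom Pods metrics used in this article do not strictly depend on requests, but the scheduler and the maxReplicas capacity estimate in step 3 do.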

yaml

# mcp-server-deployment.yaml (key parts)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mcp-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: mcp-server
  template:
    metadata:
      labels:
        app: mcp-server
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      containers:
      - name: mcp-server
        image: your-registry/mcp-server:latest
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: "250m"       # basis for utilization-based HPA metrics and for scheduling
            memory: "512Mi"
          limits:
            cpu: "2"
            memory: "2Gi"

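Step 2: Install Prometheus Adapter and define custom metrics

First check the name of the Prometheus Service in your cluster. The Service name and namespace differ between installations.

bash

# Check the Prometheus Service name (varies by environment)
kubectl get svc -n monitoring

# Install the adapter, pointing it at the Service you found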
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus-adapter prometheus-community/prometheus-adapter \
  --namespace monitoring \
  --set prometheus.url=http://prometheus-operated.monitoring.svc.cluster.local \
  --set prometheus.port=9090

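Next, define the custom metric rules. To compute average latency you must divide the rate of _sum by the rate of _count. The rate of _sum on its own is not an average latency; it is the speed at which total latency accumulates.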

yaml

# prometheus-adapter-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-adapter
  namespace: monitoring
data:
  config.yaml: |
    rules:
    - seriesQuery: 'mcp_request_duration_seconds_sum{service="mcp-server"}'
      resources:
        overrides:
          namespace: { resource: namespace }
          pod: { resource: pod }
      name:
        matches: "^mcp_request_(.*)_seconds_sum"
        as: "mcp_request_${1}_latency"
      # Average latency = rate(_sum) / rate(_count).
      # The adapter fills in <<.LabelMatchers>> and <<.GroupBy>> so that each Pod
      # gets its own value, which a Pods-type HPA metric needs.
      metricsQuery: |
        sum(rate(mcp_request_duration_seconds_sum{service="mcp-server",<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)
        / sum(rate(mcp_request_duration_seconds_count{service="mcp-server",<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)

    - seriesQuery: 'mcp_errors_total{service="mcp-server"}'
      resources:
        overrides:
          namespace: { resource: namespace }
          pod: { resource: pod }
      name:
        matches: "^mcp_(.*)_total"
        as: "mcp_${1}_rate"
      # Errors per second for each Pod (a rate, not a percentage)
      metricsQuery: 'sum(rate(<<.Series>>{service="mcp-server",<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
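
The division in the latency rule can be checked with a little arithmetic. The sketch below is a deliberate simplification: it takes two scrapes of the cumulative `_sum` and `_count` series and ignores the counter-reset handling and extrapolation that PromQL's `rate()` performs. The type and function names are illustrative.

```typescript
// Two scrapes of the cumulative series, taken windowSeconds apart.
interface Scrape { sum: number; count: number }

// Per-second increase of a counter between two scrapes.
function ratePerSecond(earlier: number, later: number, windowSeconds: number): number {
  return (later - earlier) / windowSeconds;
}

// Average latency over the window = rate(_sum) / rate(_count).
function averageLatency(a: Scrape, b: Scrape, windowSeconds: number): number {
  const sumRate = ratePerSecond(a.sum, b.sum, windowSeconds);
  const countRate = ratePerSecond(a.count, b.count, windowSeconds);
  return countRate === 0 ? NaN : sumRate / countRate;
}

const before: Scrape = { sum: 100, count: 400 };
const after: Scrape = { sum: 160, count: 640 };

// 60 s of latency accumulated over 120 s: 0.5 "seconds per second".
console.log(ratePerSecond(before.sum, after.sum, 120));
// 240 requests shared those 60 s: 0.25 s average latency.
console.log(averageLatency(before, after, 120));
```

The first number grows with traffic even when every request is equally fast, which is why it cannot serve as a latency target.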

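Why the resources stanza matters: it tells the adapter which Prometheus labels map to the Kubernetes namespace and Pod resources. Without it, the HPA cannot find Namespace-scoped or Pod-scoped metrics in custom.metrics.k8s.io, and kubectl describe hpa just keeps reporting "unable to get metric".

Verify that the custom metrics API is being served.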

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/" | jq '.resources[].name'

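Step 3: Configure the HPA with custom metrics

Derive maxReplicas from the number of cluster nodes and the Pod's resource requests. For example, with 5 nodes at 4 CPUs each and requests.cpu: 250m, the theoretical ceiling is 80 Pods (5 × 4 / 0.25). This example applies a safety margin and sets the value to 20.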
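
The capacity estimate above can be written as a small helper. This is a rough sketch that looks at CPU only; it ignores memory, system-reserved capacity, and other workloads on the nodes, and the function name and parameters are illustrative.

```typescript
// CPU-only estimate of maxReplicas. Values are in millicores so the
// arithmetic stays in integers.
function estimateMaxReplicas(
  nodes: number,
  cpuPerNodeMilli: number,
  requestMilli: number,
  safetyFactor: number,
): { theoretical: number; recommended: number } {
  // How many Pods fit if requests were the only constraint.
  const theoretical = Math.floor((nodes * cpuPerNodeMilli) / requestMilli);
  // Keep only a fraction of that as the HPA ceiling.
  const recommended = Math.floor(theoretical * safetyFactor);
  return { theoretical, recommended };
}

// 5 nodes × 4000m / 250m = 80 Pods in theory; 80 × 0.25 = 20 as the ceiling.
console.log(estimateMaxReplicas(5, 4000, 250, 0.25));
```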

yaml

# mcp-server-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mcp-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mcp-server
  minReplicas: 2
  maxReplicas: 20  # 5 nodes × 4 CPU / 0.25 (requests) × 0.25 (safety margin)
  metrics:
  - type: Pods
    pods:
      metric:
        name: mcp_request_duration_latency
      target:
        type: AverageValue
        averageValue: "200m"   # scale out when average latency per Pod exceeds 0.2 s (200 ms)
  - type: Pods
    pods:
      metric:
        name: mcp_errors_rate
      target:
        type: AverageValue
        averageValue: "0.01"   # scale out above 0.01 errors/s per Pod (a per-second rate, not 1%)
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0      # react to spikes immediately
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15                # at most double every 15 seconds
    scaleDown:
      stabilizationWindowSeconds: 300    # shrink only after 5 stable minutes
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60

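Setting	Value	Meaning
`minReplicas: 2`	2	Minimum Pod count, avoids a single point of failure
`averageValue: "200m"`	200 ms	Scale-out triggers above this latency
`scaleUp.stabilizationWindowSeconds: 0`	immediate	Reacts at once to a latency spike
`scaleDown.stabilizationWindowSeconds: 300`	5 minutes	Prevents premature scale-in on a brief dip in load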
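
To get a feel for how fast the scaleUp policy can grow the Deployment, here is a simplified simulation of a Percent policy, where each array element stands for one policy period (15 seconds in the configuration above). It assumes the desired count stays fixed during the ramp, and the function name is illustrative.

```typescript
// Simulates a Percent scale-up policy: each period may add at most
// `percent`% of the current replica count.
function scaleUpSteps(
  start: number,
  desired: number,
  maxReplicas: number,
  percent: number,
): number[] {
  const target = Math.min(desired, maxReplicas);
  const steps: number[] = [start];
  let current = start;
  while (current < target) {
    const allowed = current + Math.ceil((current * percent) / 100);
    if (allowed <= current) break; // a 0% policy or 0 replicas can never grow
    current = Math.min(allowed, target);
    steps.push(current);
  }
  return steps;
}

// From 2 Pods toward 20 at 100% per period.
console.log(scaleUpSteps(2, 20, 20, 100));
```

Under these assumptions, going from the minimum of 2 Pods to the ceiling of 20 takes four periods, or about a minute, which is worth comparing against how quickly your queue fills during a spike.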

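Step 4: Add proactive, queue-based scaling with KEDA

Where the HPA above reacts to latency that has already risen, a KEDA trigger on queue depth adds Pods as soon as requests start to back up. The two signals complement each other, but the two objects do not: KEDA creates and manages its own HPA for the target workload, so do not leave a hand-written HPA and a ScaledObject pointing at the same Deployment. To use both signals, either replace the step 3 HPA with the ScaledObject and add the latency and error queries as additional prometheus triggers, or keep the HPA and skip this step.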

yaml

# mcp-server-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: mcp-server-keda
spec:
  scaleTargetRef:
    name: mcp-server
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-operated.monitoring.svc.cluster.local:9090
      metricName: mcp_request_queue_depth
      query: 'sum(mcp_queue_depth{service="mcp-server"})'
      threshold: '30'           # target of about 30 queued requests per replica
      activationThreshold: '5'  # controls activation from 0 replicas; matters mainly when minReplicaCount is 0

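KEDA (Kubernetes Event-driven Autoscaling): an autoscaler that supports more than 65 event sources (Kafka, Redis, SQS, Prometheus, and others). Setting minReplicaCount: 0 scales the workload to zero Pods when it is fully idle, which saves cost. For a latency-sensitive service such as an MCP server, however, a cold start of 30 to 120 seconds follows, so keeping minReplicaCount at 1 or higher is the safer choice.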
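
As a rough illustration of how a queue-depth threshold turns into a replica count: KEDA hands the query result to an HPA as an average-value target, so the desired count is approximately the total queue depth divided by the threshold, clamped to the configured bounds. The sketch below assumes exactly that and leaves out tolerance and scaling behavior; the function name is illustrative.

```typescript
// Approximate replica count for a queue-depth trigger.
function replicasForQueue(
  totalQueueDepth: number,
  thresholdPerReplica: number,
  minReplicas: number,
  maxReplicas: number,
): number {
  const raw = Math.ceil(totalQueueDepth / thresholdPerReplica);
  // Clamp to the ScaledObject's min/max replica counts.
  return Math.max(minReplicas, Math.min(maxReplicas, raw));
}

console.log(replicasForQueue(95, 30, 1, 20));   // ceil(95 / 30) = 4
console.log(replicasForQueue(0, 30, 1, 20));    // empty queue: stays at the minimum
console.log(replicasForQueue(1000, 30, 1, 20)); // capped at the maximum
```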

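Step 5: Build the core Grafana dashboard panels

These are the PromQL queries for the four panels you need in operation. The histogram_quantile in panel 1 only works if the MCP server exposes the metric as a histogram (see step 1).

promql

# Panel 1: p99 call latency (requires a histogram metric); aggregate by le before taking the quantile
histogram_quantile(0.99,
  sum by (le) (rate(mcp_request_duration_seconds_bucket[5m]))
)

# Panel 2: error rate (%), assuming the status label from step 1 carries HTTP status codes
sum(rate(mcp_request_duration_seconds_count{status=~"5.."}[5m]))
/ sum(rate(mcp_request_duration_seconds_count[5m])) * 100

# Panel 3: number of Pods currently up (count() would also include targets that are down)
sum(up{job="mcp-servers"})

# Panel 4: queue depth (live)
mcp_queue_depth{service="mcp-server"}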

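Panel	Alert threshold	Meaning
p99 latency	> 500 ms	Where perceived quality starts to degrade
Error rate	> 1%	Stability boundary
Pod count	80% of maxReplicas	Early warning of a capacity limit
Queue depth	> 50	Sign that processing delay is accumulating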
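
The p99 panel relies on histogram_quantile, which estimates a quantile by interpolating linearly inside the bucket that contains the target rank. The sketch below is a simplified version of that idea under stated assumptions: 0 < q < 1, cumulative bucket counts, and a final +Inf bucket. The bucket counts are made-up sample data.

```typescript
// One cumulative bucket: `count` observations were <= `le` seconds.
interface Bucket { le: number; count: number }

// Simplified quantile estimate from cumulative histogram buckets.
function histogramQuantile(q: number, buckets: Bucket[]): number {
  const sorted = buckets.slice().sort((a, b) => a.le - b.le);
  if (sorted.length < 2) return NaN;
  const total = sorted[sorted.length - 1].count;
  if (total === 0) return NaN;

  const rank = q * total;
  let idx = 0;
  while (idx < sorted.length - 1 && sorted[idx].count < rank) idx++;

  // Rank lands in the +Inf bucket: report the highest finite bound.
  if (idx === sorted.length - 1) return sorted[idx - 1].le;

  const lower = idx === 0 ? 0 : sorted[idx - 1].le;
  const below = idx === 0 ? 0 : sorted[idx - 1].count;
  const upper = sorted[idx].le;
  // Linear interpolation inside the bucket that contains the rank.
  return lower + (upper - lower) * ((rank - below) / (sorted[idx].count - below));
}

const sample: Bucket[] = [
  { le: 0.1, count: 500 }, { le: 0.2, count: 800 }, { le: 0.5, count: 950 },
  { le: 1, count: 985 }, { le: Infinity, count: 1000 },
];
console.log(histogramQuantile(0.9, sample));  // interpolated inside the 0.2 to 0.5 bucket
console.log(histogramQuantile(0.99, sample)); // rank falls in +Inf, so capped at 1
```

The second call shows a practical consequence: once the p99 falls beyond the largest finite bucket, the panel flatlines at that bound, so bucket boundaries should extend past the latencies you expect to alert on.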

Step 6: Add distributed tracing with the OpenTelemetry Collector

Collecting a distributed trace for each MCP tool call, on top of metrics, lets you pinpoint which tool is introducing latency.

yaml

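# otel-collector-configmap.yaml
receivers:
  otlp:                        # receiver referenced by the traces pipeline; must be defined
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
      - job_name: 'mcp-servers'
        static_configs:
        - targets: ['localhost:8080']   # assumes the Collector runs as a sidecar in the MCP server Pod

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"   # 8888 is the Collector's own telemetry port by default
  otlp/jaeger:                 # the dedicated jaeger exporter was removed from the Collector; Jaeger accepts OTLP
    endpoint: jaeger-collector:4317
    tls:
      insecure: true           # assumption: plaintext traffic inside the cluster

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [prometheus]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]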

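Trade-offs

When this stack fits

Situation	Details
Irregular LLM traffic patterns	Combines latency, error rate, and queue depth to respond precisely to unpredictable spikes
Cost optimization needed	KEDA `minReplicaCount: 0` can remove all Pods when idle
Diverse event sources	More than 65 triggers such as Kafka, Redis, and SQS under a single autoscaler
Vendor-neutral observability	OpenTelemetry unifies metrics, traces, and logs with no platform lock-in

When not to use this stack

Situation	Details	Alternative
Small single cluster	Managing the adapter and KEDA is too much overhead for the size of the service	The built-in CPU/memory HPA is enough
No Prometheus	The whole custom-metrics HPA setup depends on Prometheus	Use an external metrics provider such as Datadog or New Relic
Services that cannot tolerate cold starts	Pod startup takes 30 seconds, plus up to 2 minutes of model warm-up	Keep a minimum replica count running; add a readinessProbe that waits for warm-up
Poor cardinality control	Labels with unique values such as `user_id` create millions of series and cause OOM	Enforce low-cardinality labels in code review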

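Cardinality: the number of unique time series for a metric. Using a unique value such as user_id as a label creates millions of series and makes Prometheus memory usage explode. Restrict labels to low-cardinality values such as service name, endpoint, and HTTP status code.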
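
A back-of-the-envelope calculation shows how quickly this grows. The sketch below counts the series one histogram produces per scrape target: one series per bucket (including +Inf) plus `_sum` and `_count`, multiplied by every combination of label values. The label cardinalities used here (10 methods, 5 status codes, 100,000 users) are hypothetical numbers chosen for illustration.

```typescript
// Series produced by one histogram metric on one scrape target.
function seriesPerHistogram(labelCardinalities: number[], finiteBuckets: number): number {
  // Every combination of label values is a separate set of series.
  const combinations = labelCardinalities.reduce((acc, n) => acc * n, 1);
  // finite buckets + the +Inf bucket + _sum + _count
  return combinations * (finiteBuckets + 1 + 2);
}

// method (10) × status (5), 7 finite buckets: 50 × 10 = 500 series.
console.log(seriesPerHistogram([10, 5], 7));
// Adding a user_id label with 100,000 values: 50,000,000 series.
console.log(seriesPerHistogram([10, 5, 100000], 7));
```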

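The most common mistakes in practice

Configuring only a CPU-based HPA: while a request waits on the model, LLM inference uses almost no CPU, so CPU utilization misses the moment to scale entirely. Make call latency and queue depth the primary triggers from the start.
Omitting the resources stanza from the Prometheus Adapter ConfigMap: without it, the HPA cannot find the metric in custom.metrics.k8s.io and keeps reporting "unable to get metric". Check that the API is exposed first with kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/".
Setting scaleDown stabilizationWindowSeconds to 0: unlike scale-up, scale-down needs a stabilization window. At 0, every momentary dip in traffic removes Pods that then have to be added back, and this oscillation ends up lowering availability.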
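
The effect of the scale-down window in the last item can be sketched as follows. For scale-down, the HPA considers the recommendations computed during the stabilization window and acts on the highest one. This simplified model assumes one recommendation per 15-second sync period and a non-empty history; the function name is illustrative.

```typescript
// Picks the replica count a scale-down would settle on, given the
// recent history of recommendations (oldest first).
function stabilizedScaleDown(
  recommendations: number[],
  windowSeconds: number,
  syncPeriodSeconds: number = 15,
): number {
  // A window of 0 still includes the current recommendation.
  const periods = Math.max(1, Math.ceil(windowSeconds / syncPeriodSeconds));
  const recent = recommendations.slice(-periods);
  // The highest recommendation in the window wins.
  return recent.reduce((highest, r) => Math.max(highest, r), recent[0]);
}

const history = [10, 4, 9, 3]; // load dips and recovers between syncs

console.log(stabilizedScaleDown(history, 300)); // 10: the brief dips are ignored
console.log(stabilizedScaleDown(history, 0));   // 3: follows every dip, causing flapping
```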

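Wrapping up

The decision rule is simple. When request latency swings between hundreds of milliseconds and several seconds and traffic is irregular, as with an LLM-backed MCP server, a custom-metrics HPA is essential. When traffic is predictable and steady, a CPU/memory HPA is enough, and the overhead of adding Prometheus Adapter is not justified.

Three steps you can start on right now:

Add a /metrics endpoint to the MCP server: implement the three metrics mcp_request_duration_seconds (histogram type), mcp_errors_total, and mcp_queue_depth, using the code snippet in step 1 as a reference
Install Prometheus Adapter and confirm the API is exposed: after helm install prometheus-adapter ..., verify that the custom metrics API responds with kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/"
Apply the HPA YAML and confirm it works: run kubectl apply -f mcp-server-hpa.yaml, then check metric collection status and scaling events with kubectl describe hpa mcp-server-hpa

Next post: combining KEDA and Argo Rollouts to implement canary deployments and automatic rollback for MCP servers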

