Configuring LLM p99 Latency-Based Canary Auto-Rollback with Flagger MetricTemplate
Immediately after a model version upgrade, production p99 latency spiked from 400 ms to 3.2 seconds. Users were hitting timeout errors, but the monitoring dashboard stayed silent because the HTTP status code was still 200. Standard error-rate-based rollbacks never fire even when p99 latency quietly climbs past 3 seconds. This is why LLM services need latency-dedicated canary analysis.
In this article, we build a pipeline that uses Flagger's MetricTemplate CRD to wire LLM tool call p99 latency up as a rollback trigger. You can complete the pipeline by following the steps in order: Python instrumentation code, then ServiceMonitor, MetricTemplate, and Canary YAML.
Things to know before reading this article: Kubernetes Pod/Deployment/Service basics, Prometheus metric collection concepts (scrape, label), Python FastAPI basics. Refer to the Official Installation Guide for Flagger installation and the kube-prometheus-stack Helm chart for Prometheus Operator installation.
Key Concepts
If you understand these three things, you can immediately follow the example below.
Progressive Delivery and Flagger
Progressive Delivery: A deployment strategy that exposes a new version to a small number of users first to verify stability, followed by a gradual rollout. Unlike Blue/Green delivery, it finely controls traffic in percentage units, and automatic promotion or rollback occurs based on metric analysis results.
Flagger is a CNCF Graduated project and a Progressive Delivery operator that automates canary deployments, A/B testing, and blue/green deployments on Kubernetes. The core operating principle is as follows.
- Detect a new version deployment → create canary pods
- Gradually shift traffic to the canary in increments of the configured stepWeight
- Run metric analysis every interval
- Checks pass the threshold → promotion / checks keep failing → rollback
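The loop above can be sketched as a toy Python model. This is not Flagger's actual implementation; the function names, the list of p99 samples standing in for one metric check per interval, and the failure-counting logic are all illustrative assumptions:

```python
def simulate_canary(p99_samples, threshold_max=5.0, step_weight=10,
                    max_weight=50, failed_threshold=5):
    """Toy model of Flagger's analysis loop (illustrative, not the real code).

    Each interval runs one metric check: a passing check shifts step_weight
    more traffic to the canary; failed checks accumulate, and reaching
    failed_threshold triggers a rollback.
    """
    weight, failed = 0, 0
    for p99 in p99_samples:
        if p99 > threshold_max:          # metric check failed
            failed += 1
            if failed >= failed_threshold:
                return "rolled_back", weight
        else:                            # check passed: advance traffic
            weight = min(weight + step_weight, max_weight)
            if weight >= max_weight:
                return "promoted", weight
    return "in_progress", weight

# A healthy canary reaches maxWeight and is promoted:
print(simulate_canary([0.8, 0.9, 0.7, 0.8, 0.9]))  # ('promoted', 50)
# A latency regression above the threshold triggers rollback:
print(simulate_canary([6.0, 6.2, 6.1, 6.3, 6.0]))  # ('rolled_back', 0)
```

Note that in the healthy case traffic never exceeds maxWeight, and in the failing case no traffic was ever promoted: the regression is contained to the canary's share.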
MetricTemplate: a Flagger CRD that defines one unit of analysis. It runs a query against Prometheus, Datadog, CloudWatch, etc., gets back a float64 value, and compares it with a threshold. Queries can be built dynamically with template variables such as `{{ namespace }}`, `{{ target }}`, `{{ interval }}`, and `{{ variables.xxx }}`.
Prometheus Operator and ServiceMonitor
The Prometheus Operator declaratively manages Prometheus instances on Kubernetes. Instead of editing prometheus.yml directly, you deploy a ServiceMonitor CRD and the operator discovers scrape targets automatically via label selectors.
# Service-discovery flow managed by the Prometheus Operator
ServiceMonitor (label selector)
  → detected by the Prometheus Operator
  → scrape_configs generated automatically in prometheus.yml
  → Prometheus scrapes the /metrics endpoint

Note: prometheus-operated, the service specified in the MetricTemplate's provider.address, is the default service name created by the kube-prometheus-stack Helm chart. The name can vary by installation method, so verify the actual service name with kubectl get svc -n monitoring.
Why use LLM tool call latency as the rollback criterion?
In an LLM agent, a turn that makes function calls chains LLM inference + function execution + re-inference, so its latency is at least double that of a plain conversation turn. If the function call schema changes in a new version, or a regression lands in the parsing logic, p99 spikes: the HTTP error rate looks normal while the user experience is already ruined.
Histogram quantile: Prometheus's histogram_quantile(φ, ...) function estimates the φ-quantile by linearly interpolating bucket counts. p99 (φ=0.99) is the latency upper bound for the slowest 1% of requests. For accurate interpolation, the rollback threshold (e.g., 3 seconds) must be one of the histogram bucket boundaries.
The rate() function turns a counter metric into a per-second rate. In the sum(rate(histogram_bucket[2m])) by (le) pattern, [2m] is a PromQL range selector covering the last 2 minutes. The interval: 2m setting in Flagger defines the analysis window, and that value is inserted verbatim into the query's [{{ interval }}].
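The interpolation described above can be reproduced in a few lines. The following is a simplified re-implementation of histogram_quantile (ignoring Prometheus edge cases such as +Inf-bucket handling); the bucket counts are made-up example data showing how dropping the 3.0 boundary shifts a p99 estimate across the 3-second rollback threshold:

```python
import math

def histogram_quantile(phi, buckets):
    """Simplified mirror of Prometheus's histogram_quantile.

    `buckets` is a sorted list of (le, cumulative_count) pairs,
    the last entry having le == inf. Linearly interpolates inside
    the bucket where the target rank falls.
    """
    total = buckets[-1][1]
    rank = phi * total
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                return prev_le  # cannot interpolate into the +Inf bucket
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count

# 1000 requests; the slow tail sits between 2s and 3s.
# With a 3.0 boundary, rank 990 falls in the narrow 2.0-3.0 bucket:
with_3s = histogram_quantile(0.99, [(1.0, 900), (2.0, 980), (3.0, 992),
                                    (5.0, 1000), (math.inf, 1000)])
# Same data without the 3.0 boundary: rank 990 now falls in the wide
# 2.0-5.0 bucket and the estimate jumps past the 3s rollback threshold.
without_3s = histogram_quantile(0.99, [(1.0, 900), (2.0, 980),
                                       (5.0, 1000), (math.inf, 1000)])
print(with_3s, without_3s)  # ~2.83 vs 3.5
```

Identical traffic, identical tail, but the coarser bucket layout moves the reported p99 from ~2.83 s to 3.5 s; depending on which side of the threshold the error lands, this either masks a real regression or triggers a false rollback.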
Practical Application
Example 1: Measuring Tool Call Latency on a Python FastAPI Server
Add a histogram to the FastAPI app with prometheus_client and expose the /metrics endpoint. Bucket design is key: the rollback threshold (e.g., 3 seconds) must appear among the bucket boundaries to minimize histogram_quantile's interpolation error.
from prometheus_client import Histogram, make_asgi_app
from contextlib import contextmanager
from fastapi import FastAPI

app = FastAPI()

# Mount the /metrics endpoint
metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)

tool_call_latency = Histogram(
    "llm_tool_call_duration_seconds",
    "LLM tool call duration in seconds",
    ["tool_name", "model"],
    # Explicitly include the 3.0s rollback threshold in the buckets
    buckets=[0.1, 0.5, 1.0, 2.0, 3.0, 5.0, 10.0]
)

@contextmanager
def track_tool_call(tool_name: str, model: str):
    with tool_call_latency.labels(
        tool_name=tool_name,
        model=model
    ).time():
        yield

# Usage example
# time() works correctly even when the with block contains an await:
# the context manager is synchronous, but the await inside it does not
# block the event loop, and wall-clock time is measured either way.
async def call_web_search(query: str) -> str:
    with track_tool_call("web_search", "gpt-4o"):
        return await search_api(query)

After starting the server, check that the metrics are exposed at /metrics.
uvicorn main:app --reload
curl http://localhost:8000/metrics | grep llm_tool_call_duration

| Code Point | Description |
|---|---|
| `make_asgi_app()` | Mounts the /metrics endpoint on the FastAPI app |
| `buckets=[..., 3.0, ...]` | Includes the rollback threshold among the bucket boundaries, ensuring interpolation accuracy |
| `["tool_name", "model"]` | Labels allow aggregation split by tool and by model |
| `contextmanager` + `time()` | Automatically measures the time from with-block entry to exit |
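The comment in Example 1 claims that a synchronous context manager measures correctly around an await. A stdlib-only sketch makes that checkable: `timed` below is a hypothetical stand-in with the same shape as `track_tool_call`, using time.perf_counter in place of Histogram.time():

```python
import asyncio
import time
from contextlib import contextmanager

@contextmanager
def timed(out: list):
    # Same shape as track_tool_call: a synchronous context manager that
    # measures wall-clock time, so time spent suspended on an await
    # inside the with block is included in the measurement.
    start = time.perf_counter()
    try:
        yield
    finally:
        out.append(time.perf_counter() - start)

async def fake_tool_call() -> float:
    durations = []
    with timed(durations):
        await asyncio.sleep(0.05)  # stands in for the real tool call
    return durations[0]

elapsed = asyncio.run(fake_tool_call())
print(f"measured {elapsed:.3f}s")
```

The measured duration covers the full 50 ms sleep, confirming that the event-loop suspension inside the with block is counted, which is exactly what we want for tool call latency.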
Example 2: Configuring Kubernetes Service and ServiceMonitor
For ServiceMonitor to detect the target, a Service resource that exposes the metrics port is required first.
# Service: exposes the metrics port
apiVersion: v1
kind: Service
metadata:
  name: llm-agent
  namespace: production
  labels:
    app: llm-agent  # must match the ServiceMonitor's matchLabels
spec:
  selector:
    app: llm-agent
  ports:
    - name: http
      port: 8000
      targetPort: 8000
    - name: metrics  # port name referenced by the ServiceMonitor
      port: 8001
      targetPort: 8001

# ServiceMonitor: scrapes the /metrics endpoint automatically
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: llm-agent-monitor
  namespace: production
  labels:
    app: llm-agent
spec:
  selector:
    matchLabels:
      app: llm-agent  # must match the Service's labels (not the Deployment's)
  endpoints:
    - port: metrics  # port name on the Service
      path: /metrics
      interval: 15s

Note: ServiceMonitor.spec.selector.matchLabels must exactly match the labels of the Service resource, not the labels of the Deployment.
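The selection rule in that note can be stated as a tiny predicate. `servicemonitor_selects` is a hypothetical helper (not part of any Kubernetes client library) that mirrors how label selection works: every matchLabels entry must appear, with the same value, among the Service's own labels:

```python
def servicemonitor_selects(service_labels: dict, match_labels: dict) -> bool:
    """A ServiceMonitor selects a Service only if every matchLabels entry
    appears, with the same value, among the Service's own labels."""
    return all(service_labels.get(k) == v for k, v in match_labels.items())

# The labels from Example 2: selector and Service labels agree.
print(servicemonitor_selects({"app": "llm-agent"},
                             {"app": "llm-agent"}))          # True
# A selector that asks for a label the Service doesn't carry
# (e.g. one copied from a Deployment's pod template) selects nothing.
print(servicemonitor_selects({"app": "llm-agent"},
                             {"app": "llm-agent", "tier": "backend"}))  # False
```

The second case is the classic "0 scrape targets" symptom: the selector is stricter than the Service's label set, so discovery silently comes up empty.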
Example 3: MetricTemplate — p99 Latency Query (Basic / Extended)
Deploy the MetricTemplate CRD to the monitoring namespace. Flagger runs this query during canary analysis and compares the returned float64 value against the threshold.
Basic form: measures the p99 across all tool calls of the canary pods.
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: llm-tool-call-p99-latency
  namespace: monitoring
spec:
  provider:
    type: prometheus
    # verify the actual service name with: kubectl get svc -n monitoring
    address: http://prometheus-operated.monitoring:9090
  query: |
    histogram_quantile(0.99,
      sum(
        rate(
          llm_tool_call_duration_seconds_bucket{
            namespace="{{ namespace }}",
            pod=~"{{ target }}-[0-9a-zA-Z]+(-[0-9a-zA-Z]+)"
          }[{{ interval }}]
        )
      ) by (le)
    )

Flagger names the canary pods {deployment-name}-{replicaset-hash}-{pod-hash} (the target Deployment itself serves as the canary) and the primary pods {deployment-name}-primary-{replicaset-hash}-{pod-hash}. Because PromQL regex matching is fully anchored, the pod=~"{{ target }}-[0-9a-zA-Z]+(-[0-9a-zA-Z]+)" pattern matches exactly the canary pods' two hash segments and rejects primary pod names, which carry the extra -primary- segment. Without this filter, the primary pods' samples would dilute the canary's latency deterioration and a rollback would never be triggered.
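Because PromQL's `=~` matcher is fully anchored, Python's re.fullmatch mirrors it, so the anchored pod filter `{{ target }}-[0-9a-zA-Z]+(-[0-9a-zA-Z]+)` (the form Flagger's own docs use) can be sanity-checked offline. The pod hashes below are hypothetical examples for a target Deployment named llm-agent:

```python
import re

# PromQL's =~ matcher is fully anchored, so re.fullmatch mirrors it.
pattern = re.compile(r"llm-agent-[0-9a-zA-Z]+(-[0-9a-zA-Z]+)")

canary_pod = "llm-agent-6f7d9c5b8-x2kvq"            # {target}-{rs}-{pod}
primary_pod = "llm-agent-primary-6f7d9c5b8-x2kvq"   # {target}-primary-{rs}-{pod}

print(pattern.fullmatch(canary_pod) is not None)   # True: two hash segments
print(pattern.fullmatch(primary_pod) is not None)  # False: extra segment rejected
```

A trailing `.*` instead of the anchored second group would match the primary pods too, which is precisely the dilution failure mode described above.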
Extended form: measures the p99 of one specific tool using templateVariables, applying the pod filter and a tool_name filter together to isolate a single tool within the canary pods. It reuses the same metadata.name, so applying it replaces the basic form; Example 4 references this extended template.
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: llm-tool-call-p99-latency
  namespace: monitoring
spec:
  provider:
    type: prometheus
    address: http://prometheus-operated.monitoring:9090
  query: |
    histogram_quantile(0.99,
      sum(
        rate(
          llm_tool_call_duration_seconds_bucket{
            namespace="{{ namespace }}",
            pod=~"{{ target }}-[0-9a-zA-Z]+(-[0-9a-zA-Z]+)",
            tool_name="{{ variables.toolName }}"
          }[{{ interval }}]
        )
      ) by (le)
    )

| Template Variable | Example Value | Description |
|---|---|---|
| `{{ namespace }}` | production | Namespace of the Canary resource |
| `{{ target }}` | llm-agent | The Canary's targetRef.name |
| `{{ interval }}` | 2m | Analysis interval, injected from the Canary spec |
| `{{ variables.toolName }}` | web_search | Injected from the Canary's templateVariables |
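Flagger renders these placeholders with Go templates; a naive Python stand-in (illustrative only, using plain string replacement rather than a real template engine) makes it concrete which value from the table lands where in the final PromQL:

```python
# A fragment of the extended query with its template placeholders.
template = ('llm_tool_call_duration_seconds_bucket{'
            'namespace="{{ namespace }}",'
            'tool_name="{{ variables.toolName }}"'
            '}[{{ interval }}]')

def render(tpl: str, values: dict) -> str:
    """Naive stand-in for Flagger's Go-template rendering (illustrative)."""
    for key, val in values.items():
        tpl = tpl.replace("{{ %s }}" % key, val)
    return tpl

rendered = render(template, {"namespace": "production",
                             "variables.toolName": "web_search",
                             "interval": "2m"})
print(rendered)
```

The rendered string is the literal selector Prometheus receives: the namespace and tool name become exact label matchers, and the analysis interval becomes the range selector's window.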
Example 4: Canary Resource — Declaring Tool-Specific Threshold Analysis Policy
This Canary configuration applies a different threshold per tool by reusing the extended MetricTemplate with different templateVariables.
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: llm-agent
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-agent
  analysis:
    interval: 1m
    threshold: 5        # auto-rollback after 5 failed metric checks
    maxWeight: 50       # canary receives at most 50% of traffic
    stepWeight: 10      # traffic grows by 10% each interval
    metrics:
      - name: web-search-latency
        templateRef:
          name: llm-tool-call-p99-latency
          namespace: monitoring
        templateVariables:
          toolName: "web_search"
        thresholdRange:
          max: 5.0      # web_search p99 > 5s fails the check
        interval: 2m    # inserted into the query's [{{ interval }}]
      - name: code-exec-latency
        templateRef:
          name: llm-tool-call-p99-latency
          namespace: monitoring
        templateVariables:
          toolName: "code_exec"
        thresholdRange:
          max: 10.0     # code_exec p99 > 10s fails the check
        interval: 2m

Pros and Cons Analysis
Advantages
| Item | Content |
|---|---|
| Automated safety net | Quality degradation is detected via metrics and rolled back without manual operator intervention |
| Declarative GitOps integration | Analysis policy is codified in a single YAML file and integrates naturally with Flux CD |
| Custom metric flexibility | Combine LLM-specific metrics such as tool call success rate and token throughput in addition to HTTP error rate |
| templateVariables Reusability | Apply tool-specific and service-specific thresholds with a single MetricTemplate by simply changing parameters |
Disadvantages and Precautions
| Item | Content | Response Plan |
|---|---|---|
| Cold Start | Insufficient histogram samples due to lack of initial canary traffic | Set interval to at least 2 minutes |
| Histogram bucket design error | Interpolation error occurs if threshold is not at bucket boundary | Explicitly include rollback threshold in buckets array |
| Prometheus address misconfigured | Service name varies by installation environment | Check actual name with kubectl get svc -n monitoring |
| Insufficient threshold tuning | Too low overreacts to transient spikes, too high delays rollback | Set after analyzing existing production p99 distribution |
| Intrinsic Variability of LLM Latency | Latency itself varies significantly depending on input length and model state | Consider setting the threshold as a relative value at the level of baseline p99 × 2 |
The Most Common Mistakes in Practice
- Bucket does not include the threshold: with `buckets=[0.5, 1.0, 2.0, 5.0]` and `thresholdRange.max: 3.0`, `histogram_quantile` interpolates across the 2.0–5.0 range and returns a value lower or higher than the true p99. Add `3.0` to the buckets.
- Canary pod filter not applied: if the query scope includes the primary pods because the `pod=~"{{ target }}-[0-9a-zA-Z]+(-[0-9a-zA-Z]+)"` pattern is missing, the canary's latency deterioration is diluted and a rollback is not triggered.
- ServiceMonitor label mismatch: pointing `ServiceMonitor.spec.selector.matchLabels` at the `Deployment` labels instead of the `Service` labels results in 0 scrape targets. Check the detection state with `kubectl get servicemonitor -o yaml`.
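The first mistake is cheap to guard against in code. `validate_buckets` is a hypothetical helper (not part of prometheus_client) you might run at app startup or in CI, failing fast when the rollback threshold is missing from the bucket boundaries:

```python
def validate_buckets(buckets: list, threshold: float) -> bool:
    """Fail fast when the rollback threshold is not a bucket boundary,
    since histogram_quantile would then interpolate across a wider bucket
    and the p99 estimate near the threshold becomes unreliable."""
    if threshold not in buckets:
        raise ValueError(
            f"threshold {threshold} is not a bucket boundary in {buckets}")
    return True

# The bucket layout from Example 1 passes:
print(validate_buckets([0.1, 0.5, 1.0, 2.0, 3.0, 5.0, 10.0], 3.0))  # True

# The mistaken layout from the first bullet is caught before deployment:
try:
    validate_buckets([0.5, 1.0, 2.0, 5.0], 3.0)
except ValueError as err:
    print(f"rejected: {err}")
```

Calling this next to the Histogram definition keeps the instrumentation and the Canary's thresholdRange from drifting apart silently.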
Troubleshooting
This is the point to check when metrics are not collected or Flagger analysis fails.
| Symptom | Command to Check | Checkpoint |
|---|---|---|
| ServiceMonitor target not detected | `kubectl get servicemonitor -n production -o yaml` | `matchLabels` matches the Service's labels |
| No metric in Prometheus | Prometheus UI → Status > Targets | Target state is UP |
| Canary analysis failing | `kubectl describe canary llm-agent -n production` | Metric lookup error messages in the Events section |
| histogram_quantile returns NaN | PromQL: `llm_tool_call_duration_seconds_count` | Bucket sample count is sufficient |
In Conclusion
Protect LLM service canary deployments with latency, not error rate. You can declaratively detect and automatically roll back latency regressions that destroy the user experience while returning HTTP 200 using Flagger MetricTemplate and Prometheus Operator.
3 Steps to Start Right Now:
- Start with instrumentation: add the `prometheus_client` Histogram to the FastAPI app, include the rollback threshold (e.g., `3.0`) in `buckets`, start the server with `uvicorn main:app --reload`, and confirm the metric is exposed with `curl localhost:8000/metrics | grep llm_tool_call`.
- Deploy the ServiceMonitor: apply the Service and ServiceMonitor YAML from Example 2 to the cluster, then verify in the Prometheus UI under `Status > Targets` that the target is in the `UP` state.
- Apply the MetricTemplate + Canary: deploy the YAML from Examples 3 and 4 in order, and watch the analysis logs in the `Events` section in real time with `kubectl describe canary llm-agent -n production`.
Next Part
The next part covers how to visualize Flagger canary analysis status in real time on a Grafana dashboard and how to wire rollback event notifications to Slack via Alertmanager.
Reference Materials
- Metrics Analysis | Flagger Official Documentation
- Canary analysis with Prometheus Operator | Flagger
- How it works | Flagger
- vLLM Metrics Official Documentation
- An Introduction to Observability for LLM-based applications using OpenTelemetry
- Canary analysis metrics templating · Issue #418 · fluxcd/flagger
- Prometheus Operator Official Documentation