Configuring LLM p99 Latency-Based Canary Auto-Rollback with Flagger MetricTemplate
Immediately after a model version upgrade, production p99 latency spiked from 400 ms to 3.2 seconds. Users were hitting timeout errors, but the monitoring dashboard stayed silent because the HTTP status code was still 200. Standard error-rate-based rollbacks never fire even when p99 latency quietly climbs past 3 seconds. This is why LLM services need latency-dedicated canary analysis.
In this article, we build a pipeline that uses Flagger's MetricTemplate CRD to wire LLM tool call p99 latency up as a rollback trigger. You can complete the pipeline by following the steps in order: Python instrumentation code, then ServiceMonitor, MetricTemplate, and Canary YAML.
Things to know before reading this article: Kubernetes Pod/Deployment/Service basics, Prometheus metric collection concepts (scrape, label), Python FastAPI basics. Refer to the Official Installation Guide for Flagger installation and the kube-prometheus-stack Helm chart for Prometheus Operator installation.
Key Concepts
If you understand these three things, you can immediately follow the example below.
Progressive Delivery and Flagger
Progressive Delivery: A deployment strategy that exposes a new version to a small number of users first to verify stability, followed by a gradual rollout. Unlike Blue/Green delivery, it finely controls traffic in percentage units, and automatic promotion or rollback occurs based on metric analysis results.
Flagger is a CNCF Graduated project and a Progressive Delivery operator that automates canary deployments, A/B testing, and blue/green deployments on Kubernetes. The core operating principle is as follows.
- Detect a new version deployment → create canary pods
- Gradually shift traffic to the canary in increments of the configured stepWeight
- Run metric analysis every interval
- Checks pass the threshold → promotion / checks keep failing → rollback
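The loop above can be sketched as a toy Python model. This is not Flagger's actual implementation; the function names, the list of p99 samples standing in for one metric check per interval, and the failure-counting logic are all illustrative assumptions:

```python
def simulate_canary(p99_samples, threshold_max=5.0, step_weight=10,
                    max_weight=50, failed_threshold=5):
    """Toy model of Flagger's analysis loop (illustrative, not the real code).

    Each interval runs one metric check: a passing check shifts step_weight
    more traffic to the canary; failed checks accumulate, and reaching
    failed_threshold triggers a rollback.
    """
    weight, failed = 0, 0
    for p99 in p99_samples:
        if p99 > threshold_max:          # metric check failed
            failed += 1
            if failed >= failed_threshold:
                return "rolled_back", weight
        else:                            # check passed: advance traffic
            weight = min(weight + step_weight, max_weight)
            if weight >= max_weight:
                return "promoted", weight
    return "in_progress", weight

# A healthy canary reaches maxWeight and is promoted:
print(simulate_canary([0.8, 0.9, 0.7, 0.8, 0.9]))  # ('promoted', 50)
# A latency regression above the threshold triggers rollback:
print(simulate_canary([6.0, 6.2, 6.1, 6.3, 6.0]))  # ('rolled_back', 0)
```

Note that in the healthy case traffic never exceeds maxWeight, and in the failing case no traffic was ever promoted: the regression is contained to the canary's share.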
MetricTemplate: a Flagger CRD that defines one unit of analysis. It runs a query against Prometheus, Datadog, CloudWatch, etc., gets back a float64 value, and compares it with a threshold. Queries can be built dynamically with template variables such as `{{ namespace }}`, `{{ target }}`, `{{ interval }}`, and `{{ variables.xxx }}`.
Prometheus Operator and ServiceMonitor
The Prometheus Operator declaratively manages Prometheus instances on Kubernetes. Instead of editing prometheus.yml directly, you deploy a ServiceMonitor CRD and the operator discovers scrape targets automatically via label selectors.
# Service-discovery flow managed by the Prometheus Operator
ServiceMonitor (label selector)
  → detected by the Prometheus Operator
  → scrape_configs generated automatically in prometheus.yml
  → Prometheus scrapes the /metrics endpoint

Note: prometheus-operated, the service specified in the MetricTemplate's provider.address, is the default service name created by the kube-prometheus-stack Helm chart. The name can vary by installation method, so verify the actual service name with kubectl get svc -n monitoring.
Why use LLM tool call latency as the rollback criterion?
In an LLM agent, a turn that makes function calls chains LLM inference + function execution + re-inference, so its latency is at least double that of a plain conversation turn. If the function call schema changes in a new version, or a regression lands in the parsing logic, p99 spikes: the HTTP error rate looks normal while the user experience is already ruined.
Histogram quantile: Prometheus's histogram_quantile(φ, ...) function estimates the φ-quantile by linearly interpolating bucket counts. p99 (φ=0.99) is the latency upper bound for the slowest 1% of requests. For accurate interpolation, the rollback threshold (e.g., 3 seconds) must be one of the histogram bucket boundaries.
The rate() function turns a counter metric into a per-second rate. In the sum(rate(histogram_bucket[2m])) by (le) pattern, [2m] is a PromQL range selector covering the last 2 minutes. The interval: 2m setting in Flagger defines the analysis window, and that value is inserted verbatim into the query's [{{ interval }}].
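The interpolation described above can be reproduced in a few lines. The following is a simplified re-implementation of histogram_quantile (ignoring Prometheus edge cases such as +Inf-bucket handling); the bucket counts are made-up example data showing how dropping the 3.0 boundary shifts a p99 estimate across the 3-second rollback threshold:

```python
import math

def histogram_quantile(phi, buckets):
    """Simplified mirror of Prometheus's histogram_quantile.

    `buckets` is a sorted list of (le, cumulative_count) pairs,
    the last entry having le == inf. Linearly interpolates inside
    the bucket where the target rank falls.
    """
    total = buckets[-1][1]
    rank = phi * total
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                return prev_le  # cannot interpolate into the +Inf bucket
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count

# 1000 requests; the slow tail sits between 2s and 3s.
# With a 3.0 boundary, rank 990 falls in the narrow 2.0-3.0 bucket:
with_3s = histogram_quantile(0.99, [(1.0, 900), (2.0, 980), (3.0, 992),
                                    (5.0, 1000), (math.inf, 1000)])
# Same data without the 3.0 boundary: rank 990 now falls in the wide
# 2.0-5.0 bucket and the estimate jumps past the 3s rollback threshold.
without_3s = histogram_quantile(0.99, [(1.0, 900), (2.0, 980),
                                       (5.0, 1000), (math.inf, 1000)])
print(with_3s, without_3s)  # ~2.83 vs 3.5
```

Identical traffic, identical tail, but the coarser bucket layout moves the reported p99 from ~2.83 s to 3.5 s; depending on which side of the threshold the error lands, this either masks a real regression or triggers a false rollback.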
Practical Application
Example 1: Measuring Tool Call Latency on a Python FastAPI Server
Add a histogram to the FastAPI app with prometheus_client and expose the /metrics endpoint. Bucket design is key: the rollback threshold (e.g., 3 seconds) must appear among the bucket boundaries to minimize histogram_quantile's interpolation error.
from prometheus_client import Histogram, make_asgi_app
from contextlib import contextmanager
from fastapi import FastAPI

app = FastAPI()

# Mount the /metrics endpoint
metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)

tool_call_latency = Histogram(
    "llm_tool_call_duration_seconds",
    "LLM tool call duration in seconds",
    ["tool_name", "model"],
    # Explicitly include the 3.0s rollback threshold in the buckets
    buckets=[0.1, 0.5, 1.0, 2.0, 3.0, 5.0, 10.0]
)

@contextmanager
def track_tool_call(tool_name: str, model: str):
    with tool_call_latency.labels(
        tool_name=tool_name,
        model=model
    ).time():
        yield

# Usage example
# time() works correctly even when the with block contains an await:
# the context manager is synchronous, but the await inside it does not
# block the event loop, and wall-clock time is measured either way.
async def call_web_search(query: str) -> str:
    with track_tool_call("web_search", "gpt-4o"):
        return await search_api(query)

After starting the server, check that the metrics are exposed at /metrics.
uvicorn main:app --reload
curl http://localhost:8000/metrics | grep llm_tool_call_duration

| Code Point | Description |
|---|---|
| `make_asgi_app()` | Mounts the /metrics endpoint on the FastAPI app |
| `buckets=[..., 3.0, ...]` | Includes the rollback threshold among the bucket boundaries, ensuring interpolation accuracy |
| `["tool_name", "model"]` | Labels allow aggregation split by tool and by model |
| `contextmanager` + `time()` | Automatically measures the time from with-block entry to exit |
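The comment in Example 1 claims that a synchronous context manager measures correctly around an await. A stdlib-only sketch makes that checkable: `timed` below is a hypothetical stand-in with the same shape as `track_tool_call`, using time.perf_counter in place of Histogram.time():

```python
import asyncio
import time
from contextlib import contextmanager

@contextmanager
def timed(out: list):
    # Same shape as track_tool_call: a synchronous context manager that
    # measures wall-clock time, so time spent suspended on an await
    # inside the with block is included in the measurement.
    start = time.perf_counter()
    try:
        yield
    finally:
        out.append(time.perf_counter() - start)

async def fake_tool_call() -> float:
    durations = []
    with timed(durations):
        await asyncio.sleep(0.05)  # stands in for the real tool call
    return durations[0]

elapsed = asyncio.run(fake_tool_call())
print(f"measured {elapsed:.3f}s")
```

The measured duration covers the full 50 ms sleep, confirming that the event-loop suspension inside the with block is counted, which is exactly what we want for tool call latency.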
Example 2: Configuring Kubernetes Service and ServiceMonitor
For ServiceMonitor to detect the target, a Service resource that exposes the metrics port is required first.
# Service: exposes the metrics port
apiVersion: v1
kind: Service
metadata:
  name: llm-agent
  namespace: production
  labels:
    app: llm-agent  # must match the ServiceMonitor's matchLabels
spec:
  selector:
    app: llm-agent
  ports:
    - name: http
      port: 8000
      targetPort: 8000
    - name: metrics  # port name referenced by the ServiceMonitor
      port: 8001
      targetPort: 8001

# ServiceMonitor: scrapes the /metrics endpoint automatically
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: llm-agent-monitor
  namespace: production
  labels:
    app: llm-agent
spec:
  selector:
    matchLabels:
      app: llm-agent  # must match the Service's labels (not the Deployment's)
  endpoints:
    - port: metrics  # port name on the Service
      path: /metrics
      interval: 15s

Note: ServiceMonitor.spec.selector.matchLabels must exactly match the labels of the Service resource, not the labels of the Deployment.
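The selection rule in that note can be stated as a tiny predicate. `servicemonitor_selects` is a hypothetical helper (not part of any Kubernetes client library) that mirrors how label selection works: every matchLabels entry must appear, with the same value, among the Service's own labels:

```python
def servicemonitor_selects(service_labels: dict, match_labels: dict) -> bool:
    """A ServiceMonitor selects a Service only if every matchLabels entry
    appears, with the same value, among the Service's own labels."""
    return all(service_labels.get(k) == v for k, v in match_labels.items())

# The labels from Example 2: selector and Service labels agree.
print(servicemonitor_selects({"app": "llm-agent"},
                             {"app": "llm-agent"}))          # True
# A selector that asks for a label the Service doesn't carry
# (e.g. one copied from a Deployment's pod template) selects nothing.
print(servicemonitor_selects({"app": "llm-agent"},
                             {"app": "llm-agent", "tier": "backend"}))  # False
```

The second case is the classic "0 scrape targets" symptom: the selector is stricter than the Service's label set, so discovery silently comes up empty.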
Example 3: MetricTemplate — p99 Latency Query (Basic / Extended)
Deploy the MetricTemplate CRD to the monitoring namespace. Flagger runs this query during canary analysis and compares the returned float64 value against the threshold.
Basic form: measures the p99 across all tool calls of the canary pods.
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: llm-tool-call-p99-latency
  namespace: monitoring
spec:
  provider:
    type: prometheus
    # verify the actual service name with: kubectl get svc -n monitoring
    address: http://prometheus-operated.monitoring:9090
  query: |
    histogram_quantile(0.99,
      sum(
        rate(
          llm_tool_call_duration_seconds_bucket{
            namespace="{{ namespace }}",
            pod=~"{{ target }}-[0-9a-zA-Z]+(-[0-9a-zA-Z]+)"
          }[{{ interval }}]
        )
      ) by (le)
    )

Flagger names the canary pods {deployment-name}-{replicaset-hash}-{pod-hash} (the target Deployment itself serves as the canary) and the primary pods {deployment-name}-primary-{replicaset-hash}-{pod-hash}. Because PromQL regex matching is fully anchored, the pod=~"{{ target }}-[0-9a-zA-Z]+(-[0-9a-zA-Z]+)" pattern matches exactly the canary pods' two hash segments and rejects primary pod names, which carry the extra -primary- segment. Without this filter, the primary pods' samples would dilute the canary's latency deterioration and a rollback would never be triggered.
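Because PromQL's `=~` matcher is fully anchored, Python's re.fullmatch mirrors it, so the anchored pod filter `{{ target }}-[0-9a-zA-Z]+(-[0-9a-zA-Z]+)` (the form Flagger's own docs use) can be sanity-checked offline. The pod hashes below are hypothetical examples for a target Deployment named llm-agent:

```python
import re

# PromQL's =~ matcher is fully anchored, so re.fullmatch mirrors it.
pattern = re.compile(r"llm-agent-[0-9a-zA-Z]+(-[0-9a-zA-Z]+)")

canary_pod = "llm-agent-6f7d9c5b8-x2kvq"            # {target}-{rs}-{pod}
primary_pod = "llm-agent-primary-6f7d9c5b8-x2kvq"   # {target}-primary-{rs}-{pod}

print(pattern.fullmatch(canary_pod) is not None)   # True: two hash segments
print(pattern.fullmatch(primary_pod) is not None)  # False: extra segment rejected
```

A trailing `.*` instead of the anchored second group would match the primary pods too, which is precisely the dilution failure mode described above.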
Extended form: measures the p99 of one specific tool using templateVariables, applying the pod filter and a tool_name filter together to isolate a single tool within the canary pods. It reuses the same metadata.name, so applying it replaces the basic form; Example 4 references this extended template.
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: llm-tool-call-p99-latency
  namespace: monitoring
spec:
  provider:
    type: prometheus
    address: http://prometheus-operated.monitoring:9090
  query: |
    histogram_quantile(0.99,
      sum(
        rate(
          llm_tool_call_duration_seconds_bucket{
            namespace="{{ namespace }}",
            pod=~"{{ target }}-[0-9a-zA-Z]+(-[0-9a-zA-Z]+)",
            tool_name="{{ variables.toolName }}"
          }[{{ interval }}]
        )
      ) by (le)
    )

| Template Variable | Example Value | Description |
|---|---|---|
| `{{ namespace }}` | production | Namespace of the Canary resource |
| `{{ target }}` | llm-agent | The Canary's targetRef.name |
| `{{ interval }}` | 2m | Analysis interval, injected from the Canary spec |
| `{{ variables.toolName }}` | web_search | Injected from the Canary's templateVariables |
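Flagger renders these placeholders with Go templates; a naive Python stand-in (illustrative only, using plain string replacement rather than a real template engine) makes it concrete which value from the table lands where in the final PromQL:

```python
# A fragment of the extended query with its template placeholders.
template = ('llm_tool_call_duration_seconds_bucket{'
            'namespace="{{ namespace }}",'
            'tool_name="{{ variables.toolName }}"'
            '}[{{ interval }}]')

def render(tpl: str, values: dict) -> str:
    """Naive stand-in for Flagger's Go-template rendering (illustrative)."""
    for key, val in values.items():
        tpl = tpl.replace("{{ %s }}" % key, val)
    return tpl

rendered = render(template, {"namespace": "production",
                             "variables.toolName": "web_search",
                             "interval": "2m"})
print(rendered)
```

The rendered string is the literal selector Prometheus receives: the namespace and tool name become exact label matchers, and the analysis interval becomes the range selector's window.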
Example 4: Canary Resource — Declaring Tool-Specific Threshold Analysis Policy
This Canary configuration applies a different threshold per tool by reusing the extended MetricTemplate with different templateVariables.
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: llm-agent
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-agent
  analysis:
    interval: 1m
    threshold: 5        # auto-rollback after 5 failed metric checks
    maxWeight: 50       # canary receives at most 50% of traffic
    stepWeight: 10      # traffic grows by 10% each interval
    metrics:
      - name: web-search-latency
        templateRef:
          name: llm-tool-call-p99-latency
          namespace: monitoring
        templateVariables:
          toolName: "web_search"
        thresholdRange:
          max: 5.0      # web_search p99 > 5s fails the check
        interval: 2m    # inserted into the query's [{{ interval }}]
      - name: code-exec-latency
        templateRef:
          name: llm-tool-call-p99-latency
          namespace: monitoring
        templateVariables:
          toolName: "code_exec"
        thresholdRange:
          max: 10.0     # code_exec p99 > 10s fails the check
        interval: 2m

Pros and Cons Analysis
Advantages
| Item | Content |
|---|---|
| Automated safety net | Quality degradation is detected via metrics and rolled back without manual operator intervention |
| Declarative GitOps integration | Analysis policy is codified in a single YAML file and integrates naturally with Flux CD |
| Custom metric flexibility | Combine LLM-specific metrics such as tool call success rate and token throughput in addition to HTTP error rate |
| templateVariables Reusability | Apply tool-specific and service-specific thresholds with a single MetricTemplate by simply changing parameters |
Disadvantages and Precautions
| Item | Content | Response Plan |
|---|---|---|
| Cold Start | Insufficient histogram samples due to lack of initial canary traffic | Set interval to at least 2 minutes |
| Histogram bucket design error | Interpolation error occurs if threshold is not at bucket boundary | Explicitly include rollback threshold in buckets array |
| Prometheus address misconfigured | Service name varies by installation environment | Check actual name with kubectl get svc -n monitoring |
| Insufficient threshold tuning | Too low overreacts to transient spikes, too high delays rollback | Set after analyzing existing production p99 distribution |
| Intrinsic Variability of LLM Latency | Latency itself varies significantly depending on input length and model state | Consider setting the threshold as a relative value at the level of baseline p99 × 2 |
The Most Common Mistakes in Practice
- Bucket does not include the threshold: with `buckets=[0.5, 1.0, 2.0, 5.0]` and `thresholdRange.max: 3.0`, `histogram_quantile` interpolates across the 2.0–5.0 range and returns a value lower or higher than the true p99. Add `3.0` to the buckets.
- Canary pod filter not applied: if the query scope includes the primary pods because the `pod=~"{{ target }}-[0-9a-zA-Z]+(-[0-9a-zA-Z]+)"` pattern is missing, the canary's latency deterioration is diluted and a rollback is not triggered.
- ServiceMonitor label mismatch: pointing `ServiceMonitor.spec.selector.matchLabels` at the `Deployment` labels instead of the `Service` labels results in 0 scrape targets. Check the detection state with `kubectl get servicemonitor -o yaml`.
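The first mistake is cheap to guard against in code. `validate_buckets` is a hypothetical helper (not part of prometheus_client) you might run at app startup or in CI, failing fast when the rollback threshold is missing from the bucket boundaries:

```python
def validate_buckets(buckets: list, threshold: float) -> bool:
    """Fail fast when the rollback threshold is not a bucket boundary,
    since histogram_quantile would then interpolate across a wider bucket
    and the p99 estimate near the threshold becomes unreliable."""
    if threshold not in buckets:
        raise ValueError(
            f"threshold {threshold} is not a bucket boundary in {buckets}")
    return True

# The bucket layout from Example 1 passes:
print(validate_buckets([0.1, 0.5, 1.0, 2.0, 3.0, 5.0, 10.0], 3.0))  # True

# The mistaken layout from the first bullet is caught before deployment:
try:
    validate_buckets([0.5, 1.0, 2.0, 5.0], 3.0)
except ValueError as err:
    print(f"rejected: {err}")
```

Calling this next to the Histogram definition keeps the instrumentation and the Canary's thresholdRange from drifting apart silently.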
Troubleshooting
This is the point to check when metrics are not collected or Flagger analysis fails.
| Symptom | Command to Check | Checkpoint |
|---|---|---|
| ServiceMonitor target not detected | `kubectl get servicemonitor -n production -o yaml` | `matchLabels` matches the Service's labels |
| No metric in Prometheus | Prometheus UI → Status > Targets | Target state is UP |
| Canary analysis failing | `kubectl describe canary llm-agent -n production` | Metric lookup error messages in the Events section |
| histogram_quantile returns NaN | PromQL: `llm_tool_call_duration_seconds_count` | Bucket sample count is sufficient |
In Conclusion
Protect LLM service canary deployments with latency, not error rate. You can declaratively detect and automatically roll back latency regressions that destroy the user experience while returning HTTP 200 using Flagger MetricTemplate and Prometheus Operator.
3 Steps to Start Right Now:
- Start with instrumentation: add the `prometheus_client` Histogram to the FastAPI app, include the rollback threshold (e.g., `3.0`) in `buckets`, start the server with `uvicorn main:app --reload`, and confirm the metric is exposed with `curl localhost:8000/metrics | grep llm_tool_call`.
- Deploy the ServiceMonitor: apply the Service and ServiceMonitor YAML from Example 2 to the cluster, then verify in the Prometheus UI under `Status > Targets` that the target is in the `UP` state.
- Apply the MetricTemplate + Canary: deploy the YAML from Examples 3 and 4 in order, and watch the analysis logs in the `Events` section in real time with `kubectl describe canary llm-agent -n production`.
Next Part
The next part covers how to visualize Flagger canary analysis status in real time on a Grafana dashboard and how to wire rollback event notifications to Slack via Alertmanager.
Reference Materials
- Metrics Analysis | Flagger Official Documentation
- Canary analysis with Prometheus Operator | Flagger
- How it works | Flagger
- vLLM Metrics Official Documentation
- An Introduction to Observability for LLM-based applications using OpenTelemetry
- Canary analysis metrics templating · Issue #418 · fluxcd/flagger
- Prometheus Operator Official Documentation