Kubernetes SLO Automation: Declarative SLO Management with Sloth and Pyrra
Prometheus Operator CRD-Based Approach and Comparison with Grafana SLO
As services grow more complex, it becomes increasingly difficult to clearly answer the question: "How well is our service performing right now?" Alerts fire constantly, yet it's hard to tell what truly matters, and teams find themselves repeatedly reacting only after an incident occurs. In the Platform Engineering trends of 2024–2025, "SLO as Code" has emerged as a key methodology for addressing this problem. By declaring SLOs (Service Level Objectives) as code and managing them through a GitOps workflow, service reliability becomes a shared language across the entire team.
This article focuses on Sloth and Pyrra — open-source tools that enable SLO-as-Code management in Kubernetes environments — exploring what the Prometheus Operator CRD-based approach entails, how it differs from Grafana Cloud's managed SLO service, and walking through real YAML examples. After reading this, you will have a concrete understanding of how to choose the right SLO tool for your team's situation and integrate SLOs into your GitOps workflow. This article targets developers and SREs operating Kubernetes. Prior PromQL experience will help you follow along faster, but each concept is explained so that newcomers can also understand.
Core Concepts
SLO, SLI, and Error Budget — How the Three Concepts Relate
Before adopting SLOs, you need to distinguish three key terms.
SLI (Service Level Indicator): The actual metric measuring service quality. Examples: HTTP request success rate, response latency (p99 latency)
SLO (Service Level Objective): The target for an SLI. Example: "Maintain HTTP success rate of 99.9% or above over a 4-week period"
Error Budget: The allowable failure margin defined by the SLO. If the SLO is 99.9%, then 0.1% is the error budget. Exhausting it becomes the basis for decisions such as halting new deployments.
When these three align, operational decisions like "is it safe to deploy now?" become data-driven rather than gut-feel.
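The arithmetic behind an error budget is worth internalizing. Here is a quick sketch (illustrative numbers only, not tied to any tool) of how a percentage objective translates into concrete downtime:

```python
# Translate a 99.9% SLO over a 4-week (28-day) window into an error budget.
slo = 0.999
window_days = 28

error_budget = 1 - slo  # fraction of requests allowed to fail: 0.1%
# If every failure were a full outage, this is the equivalent downtime:
budget_minutes = window_days * 24 * 60 * error_budget

print(f"Error budget: {error_budget:.1%} of requests")
print(f"Equivalent downtime: {budget_minutes:.1f} minutes over {window_days} days")
```

Roughly 40 minutes of full outage per four weeks at 99.9%; at 99.99% the same window allows only about 4 minutes, which is why overly ambitious targets make even routine deployments risky.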
What Is Prometheus Operator + CRD-Based SLO Management?
The modern approach to managing SLOs in Kubernetes is to leverage CRDs (Custom Resource Definitions). Understanding the related components first gives you the full picture.
PrometheusRule: A CRD provided by Prometheus Operator. It allows you to declare alerting rules and recording rules (rules that pre-compute frequently used PromQL expressions and store the results) as Kubernetes resources. Prometheus periodically evaluates the expressions defined in this YAML to fire alerts or generate new metric time series.
Sloth and Pyrra each provide their own CRDs. When these CRDs are deployed to Kubernetes, each operator detects CRD changes via a watch mechanism and automatically creates or updates PrometheusRule resources through a reconcile loop. In other words, the user only needs to define a simple SLO CRD, and the operator automatically generates the complex alerting rule YAML.
```
Developer → SLO CRD (YAML) → Git Repository
                  ↓ ArgoCD/Flux
            K8s Cluster
                  ↓ Sloth/Pyrra Operator (watch → reconcile)
PrometheusRule CR (alerting/recording rules auto-generated)
                  ↓
Prometheus (rule evaluation → alert firing)
```

The key insight of this flow is that the SLO definition itself becomes a Kubernetes resource. You can apply existing development workflows — code review, GitOps deployment, version control — directly to SLO management.
What Is a Multi-Window, Multi-Burn-Rate Alert?
The core challenge of SLO alerting is balancing "fast detection" with "noise minimization." The multi-window, multi-burn-rate alerting methodology proposed in the Google SRE Workbook combines short windows and long windows: when the error budget is burning quickly, it alerts immediately; when it burns slowly, the alert is classified as lower severity.
Burn Rate: The speed at which the error budget is consumed. A burn rate of 1 means the error budget is exhausted exactly over the SLO window (e.g., 4 weeks), while a burn rate of 14 means the 4-week error budget is exhausted in 2 days, requiring immediate action.
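Because a burn rate divides directly into the window length, time-to-exhaustion is easy to compute. A toy calculation (not tool output) for a 4-week window:

```python
# At a constant burn rate, the error budget is exhausted after
# window / burn_rate. Using a 4-week (28-day) SLO window:
window_days = 28

for burn_rate in (1, 3, 6, 14):
    days = window_days / burn_rate
    print(f"burn rate {burn_rate:>2}x -> budget exhausted in {days:.1f} days")
```

This reproduces the numbers above: at burn rate 1 the budget lasts exactly the 28-day window, while at burn rate 14 it is gone in 2 days.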
When you declare `alerting.page` (fast burn, critical) and `alerting.ticket` (slow burn, warning) in Sloth, it internally auto-generates multi-window burn-rate alerting rules based on Google SRE guidelines, combining windows like the following:
| Alert Type | Long Window | Short Window | Burn Rate |
|---|---|---|---|
| page (critical) | 1h | 5m | 14× |
| page (critical) | 6h | 30m | 6× |
| ticket (warning) | 1d | 2h | 3× |
| ticket (warning) | 3d | 6h | 1× |
(The actual rules generated may vary depending on the Sloth version and configuration.)
Both Sloth and Pyrra automatically generate these complex multi-burn-rate alerting rules, so users never need to write them manually in PromQL.
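For intuition, here is a hand-written sketch of what a fast-burn page rule following this pattern looks like in PromQL. This is the general shape described in the Google SRE Workbook, not the exact output of Sloth or Pyrra, and the metric and alert names are illustrative:

```yaml
# Sketch of a fast-burn alert: fire only when the error ratio exceeds
# 14x the budgeted rate over BOTH a long (1h) and short (5m) window.
groups:
  - name: slo-burn-rate-alerts
    rules:
      - alert: MyServiceHighErrorRate
        expr: |
          (
              sum(rate(http_requests_total{job="my-service",code=~"5.."}[1h]))
            /
              sum(rate(http_requests_total{job="my-service"}[1h]))
          ) > (14 * (1 - 0.999))
          and
          (
              sum(rate(http_requests_total{job="my-service",code=~"5.."}[5m]))
            /
              sum(rate(http_requests_total{job="my-service"}[5m]))
          ) > (14 * (1 - 0.999))
        labels:
          severity: critical
```

The short window is what keeps noise down: once the error rate recovers, the 5m condition clears quickly and the alert stops firing even though the 1h average is still elevated.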
Sloth vs Pyrra vs Grafana SLO — Positioning of Each Tool
The three tools solve the same problem with different philosophies.
| | Sloth | Pyrra | Grafana SLO |
|---|---|---|---|
| Type | CLI + K8s Operator | K8s Operator | Managed Cloud Service |
| Open Source | ✅ | ✅ | ❌ |
| Built-in UI | ❌ | ✅ | ✅ |
| GitOps Friendliness | High | Medium | Low |
| Thanos/Mimir Support | Limited | ✅ (v0.8.0+) | ✅ |
| OpenSLO Support | ✅ | ❌ | ❌ |
| Cost | Free | Free | $25,000+/year |
Thanos / Grafana Mimir: A layer responsible for long-term metric retention and high-availability querying for Prometheus. Used alongside Prometheus in environments that need to consolidate metrics from multiple clusters or retain months of data.
Why Pyrra is rated "Medium" for GitOps friendliness: Pyrra operates exclusively as an operator and does not offer a CLI mode like Sloth. This makes offline workflows — such as pre-generating or validating rules in a CI pipeline without the operator — impossible. While CRDs themselves can be managed in Git, there is no independent validation mechanism like the Sloth CLI to verify the resulting PrometheusRule output at the PR stage.
Practical Application
Example 1: Defining an HTTP Availability SLO with Sloth
Using Sloth's CRD PrometheusServiceLevel, you can declare complex multi-burn-rate alerting rules in simple YAML.
```yaml
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: my-service-slo
  namespace: monitoring
spec:
  service: "my-service"
  slos:
    - name: "requests-availability"
      objective: 99.9
      description: "Maintain 99.9% HTTP request success rate"
      sli:
        events:
          # Summed per-second rate of 5xx responses (error events)
          errorQuery: >
            sum(rate(http_requests_total{job="my-service",code=~"5.."}[{{.window}}]))
          # Summed per-second rate of all requests
          totalQuery: >
            sum(rate(http_requests_total{job="my-service"}[{{.window}}]))
      alerting:
        name: MyServiceHighErrorRate
        page:
          labels:
            severity: critical
        ticket:
          labels:
            severity: warning
```

| Field | Description |
|---|---|
| `objective: 99.9` | 99.9% availability target over a 4-week period |
| `sli.events.errorQuery` | PromQL for events counted as errors. `{{.window}}` is auto-substituted by Sloth when generating rules |
| `sli.events.totalQuery` | PromQL for total events |
| `alerting.page` | Critical alert fired on high burn rate (fast consumption) |
| `alerting.ticket` | Warning alert fired on low burn rate (slow consumption) |
Important: The `{{.window}}` in the YAML is Go template syntax internal to Sloth; you never resolve it yourself. Either the Sloth operator watches the CRD and expands the template automatically while generating rules, or the Sloth CLI converts the spec into a PrometheusRule YAML before you apply it.
```bash
# CLI mode: pre-generate and validate PrometheusRule YAML (works without the operator)
sloth generate -i my-service-slo.yaml -o output-rules.yaml
```

Applying this single YAML causes the Sloth operator to automatically generate a PrometheusRule containing the multi-burn-rate alerting rules described earlier.
Example 2: Defining a gRPC Error Rate SLO with Pyrra
Pyrra's CRD ServiceLevelObjective offers an even more concise syntax.
```yaml
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: grpc-service-availability
  namespace: monitoring
  labels:
    pyrra.dev/team: "platform"  # used for team-based filtering in the Pyrra UI
spec:
  target: "99.5"  # string type (the Pyrra CRD spec defines this as string)
  window: 4w
  indicator:
    ratio:
      errors:
        metric: grpc_server_handled_total{job="grpc-service",grpc_code!="OK"}
      total:
        metric: grpc_server_handled_total{job="grpc-service"}
```

| Field | Location | Description |
|---|---|---|
| `pyrra.dev/team` | `metadata.labels` | Label used for team-based filtering in the Pyrra UI. Located under `metadata`, not `spec` |
| `target: "99.5"` | `spec` | 99.5% availability target over 4 weeks. Quotes are required because the Pyrra CRD defines this field as a string |
| `window: 4w` | `spec` | SLO evaluation window (4 weeks) |
| `indicator.ratio` | `spec` | Ratio-based SLI definition |
| `errors.metric` | `spec.indicator.ratio` | Metric selector for events counted as errors |
| `total.metric` | `spec.indicator.ratio` | Metric selector for total requests |
Pyrra not only generates PrometheusRule from this CRD, but also visualizes the error budget burn rate and remaining error budget in real time through its built-in Web UI. While the ability to immediately understand SLO status without Grafana is a major differentiator for Pyrra, its true value in production lies in query optimization for Thanos environments. With built-in subquery pre-aggregation for high-cardinality metric environments, query performance improves significantly in multi-cluster setups using Thanos.
Example 3: GitOps Workflow Integration Pattern
In real production environments, the standard pattern is to store SLO CRDs in a dedicated Git repository (or a subdirectory of an infrastructure repository) and deploy via ArgoCD or Flux.
```
infra-repo/
├── slos/
│   ├── my-service-availability.yaml   # Sloth CRD
│   ├── grpc-service-slo.yaml          # Pyrra CRD
│   └── kustomization.yaml             # Used when deploying with kubectl apply -k
└── argocd/
    └── slo-app.yaml                   # ArgoCD Application
```

```yaml
# slos/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - my-service-availability.yaml
  - grpc-service-slo.yaml
```

```yaml
# argocd/slo-app.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: slos
  namespace: argocd
spec:
  source:
    repoURL: https://github.com/my-org/infra-repo
    targetRevision: main
    path: slos
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

In this pattern, every SLO change must go through the sequence PR → code review → merge → automated deployment. "Who changed which SLO target, and when" is fully traceable through Git history.
Pros and Cons Analysis
Advantages
| Item | Sloth | Pyrra | Grafana SLO |
|---|---|---|---|
| GitOps Integration | Supports offline CI pipeline validation via CLI | Declarative CRD-based management (no CLI) | Terraform IaC support |
| Alert Quality | Auto-generates multi-burn-rate rules based on Google SRE | Same level of alert auto-generation | Pre-configured alerts provided |
| Scalability | Reuse SLI logic via plugin system | Built-in high-cardinality optimization for Thanos/Mimir | Full integration with Grafana Cloud ecosystem |
| Accessibility | Requires PromQL knowledge | Requires PromQL knowledge | Configurable via UI alone |
| Standard Support | Accepts OpenSLO spec as direct input | — | — |
OpenSLO: A vendor-neutral SLO declaration spec in the CNCF ecosystem. Sloth can accept this spec directly as input to generate Prometheus rules, enabling SLO definitions that are not locked to a specific tool.
Drawbacks and Caveats
Cardinality: The number of unique label combinations in metric time series. Higher cardinality increases Prometheus memory usage and query cost. Pyrra has built-in subquery pre-aggregation optimization for high-cardinality environments, making it especially advantageous in Thanos setups.
| Item | Details | Mitigation |
|---|---|---|
| No built-in UI for Sloth | Sloth provides no visualization tooling | Import official Grafana dashboard ID 14348 or build your own |
| Limited Pyrra-Grafana integration | Visualizing Pyrra-generated rules in Grafana requires the `-generic-rules` flag; grouping is not supported | Use Pyrra's built-in UI and Grafana side by side for different purposes |
| Grafana SLO vendor lock-in | Grafana Cloud only; self-hosting not available | Factor in migration costs before initial adoption if you may switch to open source later |
| High cost of Grafana SLO | Enterprise pricing of $25,000+/year | Evaluate team size and ROI in advance |
| Sloth visualization effort | Higher initial Grafana dashboard setup effort compared to Pyrra | Recommended to start by importing the official dashboard template |
Most Common Mistakes in Practice
- Setting up SLOs without monitoring the error budget burn rate — Even with SLOs configured, if you don't regularly review how quickly the error budget is being consumed, they remain purely ceremonial metrics. Consider making a weekly error budget review part of your team routine.
- Applying SLOs to too many services at once — Rolling out SLOs across all services from the start increases alert fatigue. It's recommended to begin with 1–2 of your most critical services and expand gradually.
- Setting SLO targets arbitrarily high — Overly ambitious targets like 99.99% make the error budget so small that even routine deployments become difficult. Measure your actual current service level first, then set a realistic target.
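To ground the target in reality, you can measure your current success rate with a ratio query like this one (a sketch reusing the illustrative `http_requests_total` metric from the earlier examples; substitute your own job and metric names):

```promql
# Actual 28-day availability: 1 minus the ratio of 5xx to total requests
1 - (
    sum(rate(http_requests_total{job="my-service",code=~"5.."}[28d]))
  /
    sum(rate(http_requests_total{job="my-service"}[28d]))
)
```

If this returns 99.92%, a 99.9% objective is realistic and 99.99% is not.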
Closing Thoughts
Sloth and Pyrra are powerful open-source tools that abstract complex SLO alerting rules into simple CRD declarations, enabling seamless integration of SLOs into GitOps workflows in Kubernetes environments.
When choosing a tool, the following checklist — based on your team's current stack and needs — should help:
| Checklist Item | Sloth | Pyrra | Grafana SLO |
|---|---|---|---|
| Already actively using Grafana | ✅ | — | — |
| Need immediate error budget visualization without Grafana | — | ✅ | ✅ |
| Thanos/Mimir multi-cluster environment | — | ✅ | ✅ |
| Need offline SLO rule validation in CI pipeline | ✅ | — | — |
| Want to configure via UI without PromQL | — | — | ✅ |
| Open source, self-hosted required | ✅ | ✅ | — |
| Fast adoption, already using Grafana Cloud | — | — | ✅ |
Three steps you can start right now:
1. Deploy your first SLO with Pyrra — Install Pyrra with the commands below. If the `monitoring` namespace doesn't exist, create it first with `kubectl create namespace monitoring`, and make sure Prometheus Operator (or kube-prometheus-stack) is already installed.

   ```bash
   helm repo add pyrra https://pyrra-dev.github.io/pyrra
   helm install pyrra pyrra/pyrra -n monitoring
   ```

   After installation, apply the `ServiceLevelObjective` YAML from the example above to one of your most important services and verify that the error budget is visualized in Pyrra's built-in UI.

2. Create a `slos/` directory in your GitOps repository — Add a `slos/` directory to your existing infrastructure repository and apply a PR-based workflow for managing SLO CRD YAMLs. Adding just the single ArgoCD Application YAML from the example above completes the GitOps integration.

3. Introduce a weekly error budget review routine — After adopting an SLO tool, build the habit of spending 5 minutes in your team's weekly meeting reviewing the error budget burn rate. This small habit prevents SLOs from becoming purely ceremonial metrics and is the starting point for a data-driven deployment decision culture.
Next article: Error budget policy automation — how to configure a GitOps pipeline that automatically blocks deployment gates when SLOs are violated
References
- Sloth Official Documentation — Kubernetes CRD Spec
- Sloth GitHub Repository (slok/sloth)
- Pyrra Official Site
- Pyrra GitHub Repository (pyrra-dev/pyrra)
- Service Level Objectives made easy with Sloth and Pyrra — 0xDC.me
- Service Level Objectives made easy with Sloth and Pyrra — Medium (David Calvert)
- SLO Reporting Frameworks: Pyrra vs. SloK — TECHVZERO
- Monitor SLOs with the Grafana LGTM Stack: Daimler Truck Case Study — Grafana Labs
- Grafana SLO Official Documentation
- Introduction to SLOTH Prometheus SLI Generator — Medium (Oct 2024)
- The SLO Toolkit: Setup & Alerting with Pyrra — tb.lx insider
- Pyrra on Wikimedia — Wikitech
- How we built a complex SLO app tightly integrated with Grafana — GrafanaCON 2024
- How to Define and Configure SLOs for Kubernetes Services — OneUptime Blog