Kubernetes SLO Automation: Declarative SLO Management with Sloth and Pyrra
Prometheus Operator CRD-Based Approach and Comparison with Grafana SLO
As services grow more complex, it becomes increasingly difficult to clearly answer the question: "How well is our service performing right now?" Alerts fire constantly, yet it's hard to tell what truly matters, and teams find themselves repeatedly reacting only after an incident occurs. In the Platform Engineering trends of 2024–2025, "SLO as Code" has emerged as a key methodology for addressing this problem. By declaring SLOs (Service Level Objectives) as code and managing them through a GitOps workflow, service reliability becomes a shared language across the entire team.
This article focuses on Sloth and Pyrra — open-source tools that enable SLO-as-Code management in Kubernetes environments — exploring what the Prometheus Operator CRD-based approach entails, how it differs from Grafana Cloud's managed SLO service, and walking through real YAML examples. After reading this, you will have a concrete understanding of how to choose the right SLO tool for your team's situation and integrate SLOs into your GitOps workflow. This article targets developers and SREs operating Kubernetes. Prior PromQL experience will help you follow along faster, but each concept is explained so that newcomers can also understand.
Core Concepts
SLO, SLI, and Error Budget — How the Three Concepts Relate
Before adopting SLOs, you need to distinguish three key terms.
SLI (Service Level Indicator): The actual metric measuring service quality. Examples: HTTP request success rate, response latency (p99 latency)
SLO (Service Level Objective): The target for an SLI. Example: "Maintain HTTP success rate of 99.9% or above over a 4-week period"
Error Budget: The allowable failure margin defined by the SLO. If the SLO is 99.9%, then 0.1% is the error budget. Exhausting it becomes the basis for decisions such as halting new deployments.
When these three align, operational decisions like "is it safe to deploy now?" become data-driven rather than gut-feel.
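The arithmetic behind an error budget is worth internalizing. Here is a quick sketch (illustrative numbers only, not tied to any tool) of how a percentage objective translates into concrete downtime:

```python
# Translate a 99.9% SLO over a 4-week (28-day) window into an error budget.
slo = 0.999
window_days = 28

error_budget = 1 - slo  # fraction of requests allowed to fail: 0.1%
# If every failure were a full outage, this is the equivalent downtime:
budget_minutes = window_days * 24 * 60 * error_budget

print(f"Error budget: {error_budget:.1%} of requests")
print(f"Equivalent downtime: {budget_minutes:.1f} minutes over {window_days} days")
```

Roughly 40 minutes of full outage per four weeks at 99.9%; at 99.99% the same window allows only about 4 minutes, which is why overly ambitious targets make even routine deployments risky.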
What Is Prometheus Operator + CRD-Based SLO Management?
The modern approach to managing SLOs in Kubernetes is to leverage CRDs (Custom Resource Definitions). Understanding the related components first gives you the full picture.
PrometheusRule: A CRD provided by Prometheus Operator. It allows you to declare alerting rules and recording rules (rules that pre-compute frequently used PromQL expressions and store the results) as Kubernetes resources. Prometheus periodically evaluates the expressions defined in this YAML to fire alerts or generate new metric time series.
Sloth and Pyrra each provide their own CRDs. When these CRDs are deployed to Kubernetes, each operator detects CRD changes via a watch mechanism and automatically creates or updates PrometheusRule resources through a reconcile loop. In other words, the user only needs to define a simple SLO CRD, and the operator automatically generates the complex alerting rule YAML.
```
Developer → SLO CRD (YAML) → Git Repository
                  ↓ ArgoCD/Flux
            K8s Cluster
                  ↓ Sloth/Pyrra Operator (watch → reconcile)
PrometheusRule CR (alerting/recording rules auto-generated)
                  ↓
Prometheus (rule evaluation → alert firing)
```

The key insight of this flow is that the SLO definition itself becomes a Kubernetes resource. You can apply existing development workflows — code review, GitOps deployment, version control — directly to SLO management.
What Is a Multi-Window, Multi-Burn-Rate Alert?
The core challenge of SLO alerting is balancing "fast detection" with "noise minimization." The multi-window, multi-burn-rate alerting methodology proposed in the Google SRE Workbook combines short windows and long windows: when the error budget is burning quickly, it alerts immediately; when it burns slowly, the alert is classified as lower severity.
Burn Rate: The speed at which the error budget is consumed. A burn rate of 1 means the error budget is exhausted exactly over the SLO window (e.g., 4 weeks), while a burn rate of 14 means the 4-week error budget is exhausted in 2 days, requiring immediate action.
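Because a burn rate divides directly into the window length, time-to-exhaustion is easy to compute. A toy calculation (not tool output) for a 4-week window:

```python
# At a constant burn rate, the error budget is exhausted after
# window / burn_rate. Using a 4-week (28-day) SLO window:
window_days = 28

for burn_rate in (1, 3, 6, 14):
    days = window_days / burn_rate
    print(f"burn rate {burn_rate:>2}x -> budget exhausted in {days:.1f} days")
```

This reproduces the numbers above: at burn rate 1 the budget lasts exactly the 28-day window, while at burn rate 14 it is gone in 2 days.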
When you declare `alerting.page` (fast burn, critical) and `alerting.ticket` (slow burn, warning) in Sloth, it internally auto-generates multi-window burn-rate alerting rules based on Google SRE guidelines, combining windows like the following:
| Alert Type | Long Window | Short Window | Burn Rate |
|---|---|---|---|
| page (critical) | 1h | 5m | 14× |
| page (critical) | 6h | 30m | 6× |
| ticket (warning) | 1d | 2h | 3× |
| ticket (warning) | 3d | 6h | 1× |
(The actual rules generated may vary depending on the Sloth version and configuration.)
Both Sloth and Pyrra automatically generate these complex multi-burn-rate alerting rules, so users never need to write them manually in PromQL.
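For intuition, here is a hand-written sketch of what a fast-burn page rule following this pattern looks like in PromQL. This is the general shape described in the Google SRE Workbook, not the exact output of Sloth or Pyrra, and the metric and alert names are illustrative:

```yaml
# Sketch of a fast-burn alert: fire only when the error ratio exceeds
# 14x the budgeted rate over BOTH a long (1h) and short (5m) window.
groups:
  - name: slo-burn-rate-alerts
    rules:
      - alert: MyServiceHighErrorRate
        expr: |
          (
              sum(rate(http_requests_total{job="my-service",code=~"5.."}[1h]))
            /
              sum(rate(http_requests_total{job="my-service"}[1h]))
          ) > (14 * (1 - 0.999))
          and
          (
              sum(rate(http_requests_total{job="my-service",code=~"5.."}[5m]))
            /
              sum(rate(http_requests_total{job="my-service"}[5m]))
          ) > (14 * (1 - 0.999))
        labels:
          severity: critical
```

The short window is what keeps noise down: once the error rate recovers, the 5m condition clears quickly and the alert stops firing even though the 1h average is still elevated.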
Sloth vs Pyrra vs Grafana SLO — Positioning of Each Tool
The three tools solve the same problem with different philosophies.
| | Sloth | Pyrra | Grafana SLO |
|---|---|---|---|
| Type | CLI + K8s Operator | K8s Operator | Managed Cloud Service |
| Open Source | ✅ | ✅ | ❌ |
| Built-in UI | ❌ | ✅ | ✅ |
| GitOps Friendliness | High | Medium | Low |
| Thanos/Mimir Support | Limited | ✅ (v0.8.0+) | ✅ |
| OpenSLO Support | ✅ | ❌ | ❌ |
| Cost | Free | Free | $25,000+/year |
Thanos / Grafana Mimir: A layer responsible for long-term metric retention and high-availability querying for Prometheus. Used alongside Prometheus in environments that need to consolidate metrics from multiple clusters or retain months of data.
Why Pyrra is rated "Medium" for GitOps friendliness: Pyrra operates exclusively as an operator and does not offer a CLI mode like Sloth. This makes offline workflows — such as pre-generating or validating rules in a CI pipeline without the operator — impossible. While CRDs themselves can be managed in Git, there is no independent validation mechanism like the Sloth CLI to verify the resulting PrometheusRule output at the PR stage.
Practical Application
Example 1: Defining an HTTP Availability SLO with Sloth
Using Sloth's CRD PrometheusServiceLevel, you can declare complex multi-burn-rate alerting rules in simple YAML.
```yaml
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: my-service-slo
  namespace: monitoring
spec:
  service: "my-service"
  slos:
    - name: "requests-availability"
      objective: 99.9
      description: "Maintain 99.9% HTTP request success rate"
      sli:
        events:
          # Summed per-second rate of 5xx responses (error events)
          errorQuery: >
            sum(rate(http_requests_total{job="my-service",code=~"5.."}[{{.window}}]))
          # Summed per-second rate of all requests
          totalQuery: >
            sum(rate(http_requests_total{job="my-service"}[{{.window}}]))
      alerting:
        name: MyServiceHighErrorRate
        page:
          labels:
            severity: critical
        ticket:
          labels:
            severity: warning
```

| Field | Description |
|---|---|
| `objective: 99.9` | 99.9% availability target over a 4-week period |
| `sli.events.errorQuery` | PromQL for events counted as errors. `{{.window}}` is auto-substituted by Sloth when generating rules |
| `sli.events.totalQuery` | PromQL for total events |
| `alerting.page` | Critical alert fired on high burn rate (fast consumption) |
| `alerting.ticket` | Warning alert fired on low burn rate (slow consumption) |
Important: The `{{.window}}` in the YAML is Go template syntax internal to Sloth; you never resolve it yourself. Either the Sloth operator watches the CRD and expands the template automatically while generating rules, or the Sloth CLI converts the spec into a PrometheusRule YAML before you apply it.
```bash
# CLI mode: pre-generate and validate PrometheusRule YAML (works without the operator)
sloth generate -i my-service-slo.yaml -o output-rules.yaml
```

Applying this single YAML causes the Sloth operator to automatically generate a PrometheusRule containing the multi-burn-rate alerting rules described earlier.
Example 2: Defining a gRPC Error Rate SLO with Pyrra
Pyrra's CRD ServiceLevelObjective offers an even more concise syntax.
```yaml
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: grpc-service-availability
  namespace: monitoring
  labels:
    pyrra.dev/team: "platform"  # used for team-based filtering in the Pyrra UI
spec:
  target: "99.5"  # string type (the Pyrra CRD spec defines this as string)
  window: 4w
  indicator:
    ratio:
      errors:
        metric: grpc_server_handled_total{job="grpc-service",grpc_code!="OK"}
      total:
        metric: grpc_server_handled_total{job="grpc-service"}
```

| Field | Location | Description |
|---|---|---|
| `pyrra.dev/team` | `metadata.labels` | Label used for team-based filtering in the Pyrra UI. Located under `metadata`, not `spec` |
| `target: "99.5"` | `spec` | 99.5% availability target over 4 weeks. Quotes are required because the Pyrra CRD defines this field as a string |
| `window: 4w` | `spec` | SLO evaluation window (4 weeks) |
| `indicator.ratio` | `spec` | Ratio-based SLI definition |
| `errors.metric` | `spec.indicator.ratio` | Metric selector for events counted as errors |
| `total.metric` | `spec.indicator.ratio` | Metric selector for total requests |
Pyrra not only generates PrometheusRule from this CRD, but also visualizes the error budget burn rate and remaining error budget in real time through its built-in Web UI. While the ability to immediately understand SLO status without Grafana is a major differentiator for Pyrra, its true value in production lies in query optimization for Thanos environments. With built-in subquery pre-aggregation for high-cardinality metric environments, query performance improves significantly in multi-cluster setups using Thanos.
Example 3: GitOps Workflow Integration Pattern
In real production environments, the standard pattern is to store SLO CRDs in a dedicated Git repository (or a subdirectory of an infrastructure repository) and deploy via ArgoCD or Flux.
```
infra-repo/
├── slos/
│   ├── my-service-availability.yaml   # Sloth CRD
│   ├── grpc-service-slo.yaml          # Pyrra CRD
│   └── kustomization.yaml             # Used when deploying with kubectl apply -k
└── argocd/
    └── slo-app.yaml                   # ArgoCD Application
```

```yaml
# slos/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - my-service-availability.yaml
  - grpc-service-slo.yaml
```

```yaml
# argocd/slo-app.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: slos
  namespace: argocd
spec:
  source:
    repoURL: https://github.com/my-org/infra-repo
    targetRevision: main
    path: slos
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

In this pattern, every SLO change must go through the sequence PR → code review → merge → automated deployment. "Who changed which SLO target, and when" is fully traceable through Git history.
Pros and Cons Analysis
Advantages
| Item | Sloth | Pyrra | Grafana SLO |
|---|---|---|---|
| GitOps Integration | Supports offline CI pipeline validation via CLI | Declarative CRD-based management (no CLI) | Terraform IaC support |
| Alert Quality | Auto-generates multi-burn-rate rules based on Google SRE | Same level of alert auto-generation | Pre-configured alerts provided |
| Scalability | Reuse SLI logic via plugin system | Built-in high-cardinality optimization for Thanos/Mimir | Full integration with Grafana Cloud ecosystem |
| Accessibility | Requires PromQL knowledge | Requires PromQL knowledge | Configurable via UI alone |
| Standard Support | Accepts OpenSLO spec as direct input | — | — |
OpenSLO: A vendor-neutral SLO declaration spec in the CNCF ecosystem. Sloth can accept this spec directly as input to generate Prometheus rules, enabling SLO definitions that are not locked to a specific tool.
Drawbacks and Caveats
Cardinality: The number of unique label combinations in metric time series. Higher cardinality increases Prometheus memory usage and query cost. Pyrra has built-in subquery pre-aggregation optimization for high-cardinality environments, making it especially advantageous in Thanos setups.
| Item | Details | Mitigation |
|---|---|---|
| No built-in UI for Sloth | Sloth provides no visualization tooling | Import official Grafana dashboard ID 14348 or build your own |
| Limited Pyrra-Grafana integration | Visualizing Pyrra-generated rules in Grafana requires the `-generic-rules` flag; grouping is not supported | Use Pyrra's built-in UI and Grafana side by side for different purposes |
| Grafana SLO vendor lock-in | Grafana Cloud only; self-hosting not available | Factor in migration costs before initial adoption if you may switch to open source later |
| High cost of Grafana SLO | Enterprise pricing of $25,000+/year | Evaluate team size and ROI in advance |
| Sloth visualization effort | Higher initial Grafana dashboard setup effort compared to Pyrra | Recommended to start by importing the official dashboard template |
Most Common Mistakes in Practice
- Setting up SLOs without monitoring the error budget burn rate — Even with SLOs configured, if you don't regularly review how quickly the error budget is being consumed, they remain purely ceremonial metrics. Consider making a weekly error budget review part of your team routine.
- Applying SLOs to too many services at once — Rolling out SLOs across all services from the start increases alert fatigue. It's recommended to begin with 1–2 of your most critical services and expand gradually.
- Setting SLO targets arbitrarily high — Overly ambitious targets like 99.99% make the error budget so small that even routine deployments become difficult. Measure your actual current service level first, then set a realistic target.
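To ground the target in reality, you can measure your current success rate with a ratio query like this one (a sketch reusing the illustrative `http_requests_total` metric from the earlier examples; substitute your own job and metric names):

```promql
# Actual 28-day availability: 1 minus the ratio of 5xx to total requests
1 - (
    sum(rate(http_requests_total{job="my-service",code=~"5.."}[28d]))
  /
    sum(rate(http_requests_total{job="my-service"}[28d]))
)
```

If this returns 99.92%, a 99.9% objective is realistic and 99.99% is not.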
Closing Thoughts
Sloth and Pyrra are powerful open-source tools that abstract complex SLO alerting rules into simple CRD declarations, enabling seamless integration of SLOs into GitOps workflows in Kubernetes environments.
When choosing a tool, the following checklist — based on your team's current stack and needs — should help:
| Checklist Item | Sloth | Pyrra | Grafana SLO |
|---|---|---|---|
| Already actively using Grafana | ✅ | — | — |
| Need immediate error budget visualization without Grafana | — | ✅ | ✅ |
| Thanos/Mimir multi-cluster environment | — | ✅ | ✅ |
| Need offline SLO rule validation in CI pipeline | ✅ | — | — |
| Want to configure via UI without PromQL | — | — | ✅ |
| Open source, self-hosted required | ✅ | ✅ | — |
| Fast adoption, already using Grafana Cloud | — | — | ✅ |
Three steps you can start right now:
1. Deploy your first SLO with Pyrra — Install Pyrra with the commands below. If the `monitoring` namespace doesn't exist, create it first with `kubectl create namespace monitoring`, and make sure Prometheus Operator (or kube-prometheus-stack) is already installed.

   ```bash
   helm repo add pyrra https://pyrra-dev.github.io/pyrra
   helm install pyrra pyrra/pyrra -n monitoring
   ```

   After installation, apply the `ServiceLevelObjective` YAML from the example above to one of your most important services and verify that the error budget is visualized in Pyrra's built-in UI.

2. Create a `slos/` directory in your GitOps repository — Add a `slos/` directory to your existing infrastructure repository and apply a PR-based workflow for managing SLO CRD YAMLs. Adding just the single ArgoCD Application YAML from the example above completes the GitOps integration.

3. Introduce a weekly error budget review routine — After adopting an SLO tool, build the habit of spending 5 minutes in your team's weekly meeting reviewing the error budget burn rate. This small habit prevents SLOs from becoming purely ceremonial metrics and is the starting point for a data-driven deployment decision culture.
Next article: Error budget policy automation — how to configure a GitOps pipeline that automatically blocks deployment gates when SLOs are violated
References
- Sloth Official Documentation — Kubernetes CRD Spec
- Sloth GitHub Repository (slok/sloth)
- Pyrra Official Site
- Pyrra GitHub Repository (pyrra-dev/pyrra)
- Service Level Objectives made easy with Sloth and Pyrra — 0xDC.me
- Service Level Objectives made easy with Sloth and Pyrra — Medium (David Calvert)
- SLO Reporting Frameworks: Pyrra vs. SloK — TECHVZERO
- Monitor SLOs with the Grafana LGTM Stack: Daimler Truck Case Study — Grafana Labs
- Grafana SLO Official Documentation
- Introduction to SLOTH Prometheus SLI Generator — Medium (Oct 2024)
- The SLO Toolkit: Setup & Alerting with Pyrra — tb.lx insider
- Pyrra on Wikimedia — Wikitech
- How we built a complex SLO app tightly integrated with Grafana — GrafanaCON 2024
- How to Define and Configure SLOs for Kubernetes Services — OneUptime Blog