Implementing SLO-as-Code with Terraform grafana_slo: A Step-by-Step GitOps Pipeline
Tags: DevOps / Infrastructure / SRE / Observability
"When and why did this SLO change?" — If no one can answer that question during an on-call shift, it's a sign your SLO exists quietly somewhere in the UI. The UI is a graveyard for SLOs — no change history, no reviews, no rollbacks. An SLO is a reliability contract your team must uphold, and if that contract can be modified with a few clicks, building a trust-based SRE culture becomes very difficult.
SLO-as-Code is an approach that puts this contract into Git, reviews it via PRs, and deploys it through CI/CD — securing transparency and consistency in SLO management. By leveraging the grafana_slo resource provided by Grafana's official Terraform provider (grafana/grafana), you can declaratively define SLI queries, target values, and alerting rules as HCL code and manage them under version control.
What this article covers:
- Core query types and SLO components in grafana_slo
- Reusable Terraform module design and per-environment management
- A GitHub Actions GitOps pipeline with automated PR plans and sequential deployments
- Migration procedures for converting existing UI SLOs to code
- Advanced features including Knowledge Graph integration
Prerequisites: If you have basic familiarity with Terraform syntax (resources, variables, modules) and PromQL fundamentals, you can set up the pipeline in about an hour. Complete beginners are encouraged to go through HashiCorp's Terraform Get Started tutorial first.
Core Concepts
The Four Core Components of an SLO
Before writing SLO-as-Code, it helps to have a clear understanding of the concepts that make up an SLO.
| Concept | Description | Example |
|---|---|---|
| SLI (Service Level Indicator) | A metric that measures service quality | Ratio of successful requests excluding HTTP 5xx |
| SLO (Service Level Objective) | A target value for an SLI | Maintain 99.9% availability over 30 days |
| Error Budget | 1 - SLO target, the allowable margin of error | 99.9% SLO → 0.1% (43.2 minutes/month) |
| Burn Rate | The rate at which the error budget is consumed | Burn Rate 2 = budget exhausted in 15 days |
What is Burn Rate? A Burn Rate of 1 means the error budget is consumed at exactly the rate that exhausts it over the 30-day compliance window. A Burn Rate of 2 means half the window — 15 days — is all you get before the budget is gone, requiring immediate action.
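The error-budget arithmetic in the table can be sanity-checked with a few lines of plain Python (values taken from the examples above):

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Error budget = (1 - SLO target), expressed as minutes in the window."""
    return (1 - slo_target) * window_days * 24 * 60

def days_to_exhaustion(window_days: int, burn_rate: float) -> float:
    """At a constant burn rate, the budget lasts window / burn_rate days."""
    return window_days / burn_rate

# 99.9% SLO over 30 days -> 0.1% of 43,200 minutes = 43.2 minutes of budget
print(round(error_budget_minutes(0.999, 30), 1))  # 43.2
# Burn rate 2 over a 30-day window -> budget exhausted in 15 days
print(days_to_exhaustion(30, 2))  # 15.0
```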
SLI Query Types Supported by grafana_slo
Grafana provides four query types for defining SLIs.
| Type | Description | Best for |
|---|---|---|
| ratio | Ratio of successful requests / total requests | HTTP availability, error rate measurement |
| freeform | Arbitrary PromQL expression | Latency, composite condition measurement |
| threshold | Threshold-based measurement | Determining whether a value exceeds a limit |
| grafana_queries | Reuse existing Grafana panel queries | Integrating with existing dashboard queries |
The grafana_queries type lets you reuse already-built dashboard panel queries directly as SLIs, making it especially useful during initial migrations.
# grafana_queries type: reuse existing dashboard panel queries as SLIs
resource "grafana_slo" "panel_query_slo" {
name = "Checkout API Availability (Dashboard Query)"
description = "Reuse an existing dashboard panel query as an SLI"
query {
type = "grafana_queries"
grafana_queries {
success_query {
query_type = "instant"
ref_id = "A"
expr = "sum(http_requests_total{service=\"checkout\",status!~\"5..\"})"
}
total_query {
query_type = "instant"
ref_id = "B"
expr = "sum(http_requests_total{service=\"checkout\"})"
}
}
}
objectives {
value = 0.999
window = "30d"
}
destination_datasource {
uid = var.datasource_uid
}
}

Multi-window Alerting Structure
Grafana SLO alerting follows the multi-window approach recommended in the Google SRE Workbook.
Multi-window Alerting: Detects both fastburn (severity: critical), which rapidly consumes the error budget in a short time, and slowburn (severity: warning), which steadily consumes it at a low rate. Each alert evaluates the burn rate over a short and a long window simultaneously and fires only when both windows exceed the threshold, dramatically reducing false positives.
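The multiwindow check can be sketched as follows. This is an illustration of the logic, not Grafana's internal implementation; the 14.4 threshold is the fast-page example from the Google SRE Workbook, and Grafana's actual fastburn/slowburn thresholds may differ:

```python
def should_alert(short_window_burn: float, long_window_burn: float,
                 threshold: float) -> bool:
    """Multiwindow burn-rate check: fire only when BOTH the short and the
    long window exceed the threshold. The short window lets the alert
    reset quickly once the problem stops; the long window filters out
    brief spikes that would otherwise page someone."""
    return short_window_burn > threshold and long_window_burn > threshold

# Illustrative threshold from the SRE Workbook's fast-page example
FASTBURN_THRESHOLD = 14.4

# Brief spike: short window is hot, long window is not -> no page
print(should_alert(20.0, 3.0, FASTBURN_THRESHOLD))   # False
# Sustained incident: both windows exceed the threshold -> page
print(should_alert(20.0, 16.0, FASTBURN_THRESHOLD))  # True
```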
Basic Structure of the grafana_slo Resource
Now that the concepts are clear, let's look at an actual grafana_slo resource.
The grafana_slo resource defines the SLI query, target value, datasource, labels, and alerting in a single block. Rather than hardcoding destination_datasource.uid, the best practice is to use a data "grafana_data_source" block for dynamic reference. Hardcoded UIDs are a common source of errors when migrating environments or recreating datasources.
terraform {
required_providers {
grafana = {
source = "grafana/grafana"
version = ">= 3.5.0"
}
}
}
provider "grafana" {
url = var.grafana_url
auth = var.grafana_service_account_token
}
# Dynamically reference the datasource (recommended over hardcoding)
data "grafana_data_source" "prometheus" {
name = "grafanacloud-prom"
}
# Ratio type: HTTP availability SLO
resource "grafana_slo" "api_availability" {
name = "API Availability SLO"
description = "Ratio of successful requests excluding HTTP 5xx"
query {
type = "ratio"
ratio {
success_metric = "http_requests_total{status!~\"5..\"}"
total_metric = "http_requests_total"
group_by_labels = ["service", "env"]
}
}
objectives {
value = 0.999
window = "30d"
}
destination_datasource {
uid = data.grafana_data_source.prometheus.uid
}
label {
key = "team"
value = "platform"
}
alerting {
fastburn {
annotation {
key = "summary"
value = "Error budget burning rapidly"
}
label {
key = "severity"
value = "critical"
}
}
slowburn {
annotation {
key = "summary"
value = "Error budget burning steadily"
}
label {
key = "severity"
value = "warning"
}
}
}
}

When a grafana_slo resource is created, Grafana automatically generates Recording Rules (Prometheus rules for SLI aggregation), an error budget dashboard, and fastburn/slowburn alert rules. These auto-generated resources are not directly managed by Terraform and are cleaned up when the SLO is deleted.
For SLIs that are difficult to express as a ratio — such as latency — you can use the freeform type.
# Freeform type: P99 latency SLO
resource "grafana_slo" "latency_slo" {
name = "API P99 Latency SLO"
description = "Keep P99 response time at or below 200ms"
query {
type = "freeform"
freeform {
query = "sum(rate(http_request_duration_seconds_bucket{le=\"0.2\"}[$__rate_interval])) / sum(rate(http_request_duration_seconds_count[$__rate_interval]))"
}
}
objectives {
value = 0.95
window = "7d"
}
destination_datasource {
uid = data.grafana_data_source.prometheus.uid
}
}
$__rate_interval and freeform queries: $__rate_interval is a built-in Grafana variable that automatically adjusts the rate() calculation window to match the scrape interval. After terraform apply, the Grafana backend substitutes this variable at runtime with the evaluation window used internally by the SLO engine (e.g., 5 minutes, 30 minutes). However, if you run the query directly with tools outside Grafana, such as promtool or curl /api/v1/query, this variable will not be substituted and will cause an error. It is strongly recommended to validate queries in the Grafana Explore panel before committing them to code.
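For offline validation with promtool or the raw Prometheus API, one workaround is to substitute the variable with a concrete interval first. The helper below is a hypothetical convenience, and the 5m default is an arbitrary assumption for offline checks, not the value the Grafana SLO engine uses internally:

```python
def substitute_rate_interval(promql: str, interval: str = "5m") -> str:
    """Replace Grafana's $__rate_interval with a concrete interval so the
    query can be run against the raw Prometheus API (e.g. via curl or
    promtool). The 5m default is an arbitrary choice for offline testing."""
    return promql.replace("$__rate_interval", interval)

query = ('sum(rate(http_request_duration_seconds_bucket{le="0.2"}'
         '[$__rate_interval]))')
print(substitute_rate_interval(query))
# sum(rate(http_request_duration_seconds_bucket{le="0.2"}[5m]))
```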
Practical Application
Now that the concepts are clear, let's write actual code. We'll walk through each step: module design, CI/CD pipeline, and migration.
Example 1: Designing a Reusable SLO Module
When managing SLOs for multiple services, it is recommended to abstract the grafana_slo resource into a Terraform module rather than writing it repeatedly. The pattern has the platform team provide the module while development teams fill in the variables to create SLOs as a self-service operation.
Directory structure:
slo-gitops/
├── modules/
│ └── grafana-slo/
│ ├── main.tf # grafana_slo resource definition
│ ├── variables.tf # input variable declarations
│ └── outputs.tf # output value declarations
├── envs/
│ ├── staging/
│ │ ├── backend.tf # S3 remote state configuration
│ │ ├── main.tf
│ │ └── terraform.tfvars
│ └── production/
│ ├── backend.tf
│ ├── main.tf
│ └── terraform.tfvars
└── .github/
└── workflows/
└── slo-deploy.yml

Module variable declarations (modules/grafana-slo/variables.tf):
variable "service_name" {
type = string
description = "Service identifier used in the SLO name"
}
variable "team" {
type = string
description = "Owning team name (used in labels and the description)"
}
variable "slo_target" {
type = number
default = 0.999
description = "SLO target value (0.0 ~ 1.0)"
}
variable "success_metric" {
type = string
description = "PromQL metric representing successful requests"
}
variable "total_metric" {
type = string
description = "PromQL metric representing total requests"
}
variable "datasource_uid" {
type = string
description = "UID of the Prometheus datasource"
}

Module resource definition (modules/grafana-slo/main.tf):
resource "grafana_slo" "this" {
name = "${var.service_name} Availability"
description = "Managed by Terraform - team: ${var.team}"
query {
type = "ratio"
ratio {
success_metric = var.success_metric
total_metric = var.total_metric
}
}
objectives {
value = var.slo_target
window = "30d"
}
destination_datasource {
uid = var.datasource_uid
}
label {
key = "team"
value = var.team
}
label {
key = "managed"
value = "terraform"
}
alerting {
fastburn {
label {
key = "severity"
value = "critical"
}
}
slowburn {
label {
key = "severity"
value = "warning"
}
}
}
}

Module outputs (modules/grafana-slo/outputs.tf):
output "slo_id" {
value = grafana_slo.this.id
description = "UUID of the created SLO"
}
output "slo_name" {
value = grafana_slo.this.name
description = "Name of the created SLO"
}

S3 backend configuration (envs/production/backend.tf):
terraform {
backend "s3" {
bucket = "my-terraform-state"
key = "slo/production/terraform.tfstate"
region = "ap-northeast-2"
encrypt = true
# Terraform 1.9 and below: use DynamoDB locking
dynamodb_table = "terraform-lock"
# Terraform 1.10+: native S3 locking (can be used instead of dynamodb_table above)
# use_lockfile = true
}
}

Terraform State: A file that stores the current state of the infrastructure Terraform manages. In Terraform 1.9 and below, DynamoDB locking was required to prevent concurrent modifications, but since 1.10 the use_lockfile = true option enables native S3 locking, making a DynamoDB table unnecessary. Regardless of which approach you choose, configuring a remote backend is the first step toward team collaboration.
Calling the module in the production environment (envs/production/main.tf):
data "grafana_data_source" "prometheus" {
name = "grafanacloud-prom"
}
module "checkout_slo" {
source = "../../modules/grafana-slo"
service_name = "checkout-api"
team = "payments"
slo_target = 0.999
success_metric = "http_requests_total{service=\"checkout\",status!~\"5..\"}"
total_metric = "http_requests_total{service=\"checkout\"}"
datasource_uid = data.grafana_data_source.prometheus.uid
}
module "auth_slo" {
source = "../../modules/grafana-slo"
service_name = "auth-service"
team = "identity"
slo_target = 0.9995
success_metric = "http_requests_total{service=\"auth\",status!~\"5..\"}"
total_metric = "http_requests_total{service=\"auth\"}"
datasource_uid = data.grafana_data_source.prometheus.uid
}

How to verify PromQL metric names: The metric names that go into success_metric and total_metric vary by service. It is recommended to look up the actual metrics via kubectl exec -it <pod> -- curl localhost:8080/metrics or the Grafana Explore panel before putting them into code.
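If you prefer scripting the lookup, Prometheus exposes every metric name it knows about via its standard /api/v1/label/__name__/values endpoint. A minimal sketch, where the localhost:9090 address is a placeholder for your Prometheus instance:

```python
import json
import urllib.request

PROM_URL = "http://localhost:9090"  # placeholder: point at your Prometheus

def filter_names(names: list[str], prefix: str) -> list[str]:
    """Pure helper: keep only metric names starting with the prefix."""
    return [n for n in names if n.startswith(prefix)]

def list_metric_names(prefix: str = "") -> list[str]:
    """Fetch all metric names via the standard label-values endpoint,
    optionally filtered by prefix."""
    url = f"{PROM_URL}/api/v1/label/__name__/values"
    with urllib.request.urlopen(url) as resp:
        names = json.load(resp)["data"]
    return filter_names(names, prefix)

# e.g. list_metric_names("http_") against a live server; offline demo:
print(filter_names(["http_requests_total", "go_goroutines"], "http_"))
# ['http_requests_total']
```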
| Component | Role |
|---|---|
| Separate variables.tf | Separates variable definitions from resource logic, improving module reusability |
| Separate outputs.tf | Makes the SLO ID referenceable from other modules (alerts, dashboards) |
| Separate backend.tf | Clearly manages state file paths per environment |
| managed = "terraform" label | A guardrail to identify SLOs that have been modified without authorization in the UI |
Example 2: GitHub Actions GitOps Pipeline
This pipeline runs terraform plan for both staging and production when a PR is opened, posts the changes as a PR comment, and then deploys sequentially (staging first, then production) when the PR is merged into the main branch.
name: SLO Deploy
on:
push:
branches: [main]
paths: ["envs/**", "modules/**"]
pull_request:
branches: [main]
paths: ["envs/**", "modules/**"]
env:
TF_VAR_grafana_url: ${{ secrets.GRAFANA_URL }}
TF_VAR_grafana_service_account_token: ${{ secrets.GRAFANA_SERVICE_ACCOUNT_TOKEN }}
jobs:
plan:
runs-on: ubuntu-latest
if: github.event_name == 'pull_request'
strategy:
matrix:
env: [staging, production]
steps:
- uses: actions/checkout@v4
- uses: hashicorp/setup-terraform@v3
with:
terraform_version: "1.9.x"
- name: Configure AWS Credentials (S3 backend)
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
aws-region: ap-northeast-2
- name: Terraform Init
run: terraform init
working-directory: envs/${{ matrix.env }}
- name: Terraform Plan
id: plan
run: |
terraform plan -no-color -out=tfplan
terraform show -no-color tfplan > plan_output.txt
working-directory: envs/${{ matrix.env }}
- name: Comment Plan on PR
uses: actions/github-script@v7
with:
script: |
const fs = require('fs');
let output = fs.readFileSync('envs/${{ matrix.env }}/plan_output.txt', 'utf8');
// GitHub PR comments are capped at 65,536 characters, so truncate long plans
if (output.length > 60000) {
output = output.substring(0, 60000) + '\n...(output truncated; see the Actions log for the full plan)';
}
const body = `#### Terraform Plan — ${{ matrix.env }}\n\`\`\`\n${output}\n\`\`\``;
github.rest.issues.createComment({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.repo,
body: body
});
# Apply staging first
apply-staging:
runs-on: ubuntu-latest
if: github.ref == 'refs/heads/main' && github.event_name == 'push'
environment: staging
steps:
- uses: actions/checkout@v4
- uses: hashicorp/setup-terraform@v3
- name: Configure AWS Credentials
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
aws-region: ap-northeast-2
- run: terraform init
working-directory: envs/staging
- run: terraform apply -auto-approve
working-directory: envs/staging
# Apply production after staging succeeds (sequential deployment)
apply-production:
runs-on: ubuntu-latest
needs: apply-staging
environment: production
steps:
- uses: actions/checkout@v4
- uses: hashicorp/setup-terraform@v3
- name: Configure AWS Credentials
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
aws-region: ap-northeast-2
- run: terraform init
working-directory: envs/production
- run: terraform apply -auto-approve
working-directory: envs/production
environment: production: Using GitHub Actions' Environment Protection Rules, you can add an additional gate that prevents apply from running without approval from specific reviewers. This is especially useful for sensitive changes like modifying SLO target values.
For non-AWS environments: If you use HCP Terraform (formerly Terraform Cloud) instead of an S3 backend, you can use a TFE_TOKEN secret instead of AWS_ROLE_ARN and set cli_config_credentials_token in the hashicorp/setup-terraform action. For a GCS (Google Cloud Storage) backend, you can configure authentication using the google-github-actions/auth action.
Example 3: Migrating Existing UI SLOs to Code
Since 2025, Grafana officially supports exporting SLOs created in the UI in HCL format. This formalizes a "UI first → migrate to code" brownfield transition path.
# Step 1: Export existing SLOs in HCL format
curl -H "Authorization: Bearer $GRAFANA_TOKEN" \
"https://your-org.grafana.net/api/plugins/grafana-slo-app/resources/v1/slo?format=hcl" \
-o existing_slos.tf
# Or from the Grafana UI: SLO detail page → More > Export
# Step 2: Register existing SLOs in Terraform state (start managing without recreation)
terraform import grafana_slo.api_availability <SLO_UUID>
# Step 3: Verify drift between code and actual state with plan
terraform plan

After refactoring the exported HCL to fit the module structure and registering it in state with terraform import, you can begin managing existing SLOs as code without deleting and recreating them.
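When migrating dozens of SLOs, typing each import command by hand gets tedious. Below is a minimal generator sketch; it assumes the list endpoint returns entries with uuid and name fields, so confirm the actual field names against your Grafana version before running the output:

```python
import re

def to_resource_name(slo_name: str) -> str:
    """Derive a Terraform-safe resource name from an SLO display name."""
    name = re.sub(r"[^a-zA-Z0-9]+", "_", slo_name).strip("_").lower()
    return name or "slo"

def import_commands(slos: list[dict]) -> list[str]:
    """Build one `terraform import` command per SLO. Assumes each entry
    carries `uuid` and `name` keys (verify against your API response)."""
    return [
        f"terraform import grafana_slo.{to_resource_name(s['name'])} {s['uuid']}"
        for s in slos
    ]

sample = [{"uuid": "b1c2d3", "name": "API Availability SLO"}]
print(import_commands(sample)[0])
# terraform import grafana_slo.api_availability_slo b1c2d3
```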
Advanced Topics
Knowledge Graph SLO Integration (New in 2025–2026)
Once you've mastered the basics, you can leverage Grafana's Knowledge Graph integration for richer SLO context.
GA Status Notice: Knowledge Graph SLO integration requires
grafana/grafana provider v3.7.0 or later and only works in environments where the Asserts feature is enabled in Grafana Cloud. The search_expression attribute is currently GA (Generally Available), but support may be limited to Grafana Cloud, so it is recommended to verify the latest supported scope in the official documentation before using it in a self-hosted environment.
Adding the grafana_slo_provenance = "asserts" label connects the SLO to a specific service entity, allowing you to navigate directly to the RCA Workbench for root cause analysis when an alert fires.
# Requires grafana/grafana provider >= 3.7.0, only works in environments with Asserts enabled
resource "grafana_slo" "entity_slo" {
name = "Checkout API Availability (Knowledge Graph)"
description = "SLO integrated with the Knowledge Graph"
query {
type = "ratio"
ratio {
success_metric = "http_requests_total{service=\"checkout\",status!~\"5..\"}"
total_metric = "http_requests_total{service=\"checkout\"}"
}
}
objectives {
value = 0.999
window = "30d"
}
destination_datasource {
uid = data.grafana_data_source.prometheus.uid
}
label {
key = "grafana_slo_provenance"
value = "asserts"
}
# Specify the service entity to link (provider >= 3.7.0, Asserts activation required)
search_expression = "service.name=\"checkout-api\" AND environment=\"production\""
}

Pros and Cons
Advantages
| Item | Details |
|---|---|
| Version control | SLO change history is tracked in Git history, making it auditable — when and why it changed |
| Review process | PR-based code review automates team approval workflows for SLO target and query changes |
| Environment consistency | Managing staging/production SLOs with the same module prevents configuration mismatches across environments |
| Self-service | Terraform module abstraction lets development teams create SLOs without depending on the platform team |
| Drift detection | terraform plan immediately detects unauthorized changes made in the UI |
| Automation | CI/CD automates deployments and eliminates manual click errors |
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| Learning curve | Teams unfamiliar with Terraform and PromQL face initial onboarding costs | Provide reusable modules + internal documentation |
| State file management | Terraform state may contain sensitive information | S3 encryption + DynamoDB locking (or use_lockfile for 1.10+) configuration is required |
| UI sync issues | Directly modifying in the UI causes the code and actual state to diverge | Enforce a team rule against direct UI edits; use the managed=terraform label for identification |
| PromQL complexity | Built-in variables like $__rate_interval cannot be tested outside Grafana | Validate queries in Grafana Explore before committing to code |
| Auto-generated resources | Dashboards and alert rules are not directly managed by Terraform | Verify the impact of auto-generated resources before deleting an SLO |
| Grafana Cloud dependency | Only works in environments where the SLO plugin is activated | Activate the grafana-slo-app plugin for self-hosted setups. Plugin support may differ from Cloud, so check feature availability in the official plugin docs |
| Multi-environment complexity | Managing SLOs for dozens of services makes module design complex | Define naming conventions and directory structure in advance |
⚠️ The Most Common Mistakes in Production
The following three issues occur most frequently in real production environments. Being aware of them in advance and making them team rules can reduce on-call fatigue.
- Using a local state file: Keeping terraform.tfstate locally causes state inconsistencies among team members. It is recommended to configure a remote backend (backend.tf) as the very first step when starting a project.
- Making temporary UI edits without reflecting them in code: Changing an SLO in the UI during an incident and forgetting to update the code means the change will be reverted on the next terraform apply. Running terraform plan regularly to detect drift early is important.
- Applying the same SLO target to every service: 99.9% is not always the right answer. It is recommended to differentiate targets per service based on service criticality and error budget consumption history. The slo_target variable exists precisely so you can set different values per service.
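The regular drift check mentioned above can be automated in a scheduled CI job using terraform plan -detailed-exitcode, which exits 0 when state matches the code, 1 on error, and 2 when there are pending changes. A minimal sketch:

```python
import subprocess

def classify(exit_code: int) -> str:
    """Map terraform's -detailed-exitcode convention to a human label:
    0 = state matches code, 2 = pending changes (drift), anything else = error."""
    return {0: "in-sync", 2: "drift-detected"}.get(exit_code, "plan-error")

def check_drift(env_dir: str) -> str:
    """Run `terraform plan -detailed-exitcode` in the given environment
    directory and classify the result (e.g. for a nightly CI drift job)."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-no-color"],
        cwd=env_dir, capture_output=True, text=True,
    )
    return classify(result.returncode)

print(classify(2))  # drift-detected
```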
Closing Thoughts
SLO-as-Code is the first step toward building a culture where the entire team signs off on SLO changes. By declaring reliability targets as code, reviewing them via PRs, and deploying through CI/CD, you can secure both transparency in SLO management and trust across teams.
Three steps you can take right now:
- (~5 minutes) Export your existing SLOs to HCL: The command curl "https://your-org.grafana.net/api/plugins/grafana-slo-app/resources/v1/slo?format=hcl" -H "Authorization: Bearer $GRAFANA_TOKEN" -o existing_slos.tf extracts your current SLOs as code. You can also open an SLO in the Grafana UI and click More > Export.
- (~30 minutes) Set up the reusable module and import: Use the modules/grafana-slo structure from the examples above to write your module, then register existing SLOs into state with terraform import grafana_slo.<name> <SLO_UUID> and run terraform plan to confirm there is no drift.
- (~20 minutes) Add the GitHub Actions workflow: Add .github/workflows/slo-deploy.yml to your repository and register the GRAFANA_URL, GRAFANA_SERVICE_ACCOUNT_TOKEN, and AWS_ROLE_ARN secrets, and a GitOps pipeline that automatically posts plan results as PR comments will be activated immediately.
Other Articles in This Series
Next article: Kubernetes-native SLO management with Sloth and Pyrra — comparing a Prometheus Operator and CRD-based approach against Grafana SLO.
References
If you're just getting started, these two are recommended first:
- ⭐ grafana_slo Resource | Terraform Registry — Official reference for the resource schema
- ⭐ Provision SLO resources using Terraform | Grafana Cloud Official Docs — Official setup guide
Additional resources:
- Configure Knowledge Graph SLOs using Terraform | Grafana Official Docs
- Export Grafana SLOs into HCL Format | Grafana What's New (2025)
- Best Practices for Grafana SLOs | Grafana Cloud Docs
- Alerting on SLOs | Google SRE Workbook
- Automate Terraform with GitHub Actions | HashiCorp Developer
- How to Use Terraform with GitOps | Spacelift
- grafana-slo-app Plugin Docs — Terraform Provisioning