Implementing SLO-as-Code with Terraform grafana_slo: A Step-by-Step GitOps Pipeline
Tags: DevOps / Infrastructure / SRE / Observability
"When and why did this SLO change?" — If no one can answer that question during an on-call shift, it's a sign your SLO exists quietly somewhere in the UI. The UI is a graveyard for SLOs — no change history, no reviews, no rollbacks. An SLO is a reliability contract your team must uphold, and if that contract can be modified with a few clicks, building a trust-based SRE culture becomes very difficult.
SLO-as-Code is an approach that puts this contract into Git, reviews it via PRs, and deploys it through CI/CD — securing transparency and consistency in SLO management. By leveraging the grafana_slo resource provided by Grafana's official Terraform provider (grafana/grafana), you can declaratively define SLI queries, target values, and alerting rules as HCL code and manage them under version control.
What this article covers:
- Core query types and SLO components in grafana_slo
- Reusable Terraform module design and per-environment management
- A GitHub Actions GitOps pipeline with automated PR plans and sequential deployments
- Migration procedures for converting existing UI SLOs to code
- Advanced features including Knowledge Graph integration
Prerequisites: If you have basic familiarity with Terraform syntax (resources, variables, modules) and PromQL fundamentals, you can set up the pipeline in about an hour. Complete beginners are encouraged to go through HashiCorp's Terraform Get Started tutorial first.
Core Concepts
The Four Core Components of an SLO
Before writing SLO-as-Code, it helps to have a clear understanding of the concepts that make up an SLO.
| Concept | Description | Example |
|---|---|---|
| SLI (Service Level Indicator) | A metric that measures service quality | Ratio of successful requests excluding HTTP 5xx |
| SLO (Service Level Objective) | A target value for an SLI | Maintain 99.9% availability over 30 days |
| Error Budget | 1 - SLO target, the allowable margin of error | 99.9% SLO → 0.1% (43.2 minutes/month) |
| Burn Rate | The rate at which the error budget is consumed | Burn Rate 2 = budget exhausted in 15 days |
What is Burn Rate? A Burn Rate of 1 means the error budget is consumed at exactly the rate that exhausts it over the 30-day compliance window. A Burn Rate of 2 means half the window — 15 days — is all you get before the budget is gone, requiring immediate action.
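The error-budget arithmetic in the table can be sanity-checked with a few lines of plain Python (values taken from the examples above):

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Error budget = (1 - SLO target), expressed as minutes in the window."""
    return (1 - slo_target) * window_days * 24 * 60

def days_to_exhaustion(window_days: int, burn_rate: float) -> float:
    """At a constant burn rate, the budget lasts window / burn_rate days."""
    return window_days / burn_rate

# 99.9% SLO over 30 days -> 0.1% of 43,200 minutes = 43.2 minutes of budget
print(round(error_budget_minutes(0.999, 30), 1))  # 43.2
# Burn rate 2 over a 30-day window -> budget exhausted in 15 days
print(days_to_exhaustion(30, 2))  # 15.0
```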
SLI Query Types Supported by grafana_slo
Grafana provides four query types for defining SLIs.
| Type | Description | Best for |
|---|---|---|
| ratio | Ratio of successful requests / total requests | HTTP availability, error rate measurement |
| freeform | Arbitrary PromQL expression | Latency, composite condition measurement |
| threshold | Threshold-based measurement | Determining whether a value exceeds a limit |
| grafana_queries | Reuse existing Grafana panel queries | Integrating with existing dashboard queries |
The grafana_queries type lets you reuse already-built dashboard panel queries directly as SLIs, making it especially useful during initial migrations.
# grafana_queries type: reuse existing dashboard panel queries as SLIs
resource "grafana_slo" "panel_query_slo" {
name = "Checkout API Availability (Dashboard Query)"
description = "Reuse an existing dashboard panel query as an SLI"
query {
type = "grafana_queries"
grafana_queries {
success_query {
query_type = "instant"
ref_id = "A"
expr = "sum(http_requests_total{service=\"checkout\",status!~\"5..\"})"
}
total_query {
query_type = "instant"
ref_id = "B"
expr = "sum(http_requests_total{service=\"checkout\"})"
}
}
}
objectives {
value = 0.999
window = "30d"
}
destination_datasource {
uid = var.datasource_uid
}
}

Multi-window Alerting Structure
Grafana SLO alerting follows the multi-window approach recommended in the Google SRE Workbook.
Multi-window Alerting: Detects both fastburn (severity: critical), which rapidly consumes the error budget in a short time, and slowburn (severity: warning), which steadily consumes it at a low rate. Each alert evaluates the burn rate over a short and a long window simultaneously and fires only when both windows exceed the threshold, dramatically reducing false positives.
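The multiwindow check can be sketched as follows. This is an illustration of the logic, not Grafana's internal implementation; the 14.4 threshold is the fast-page example from the Google SRE Workbook, and Grafana's actual fastburn/slowburn thresholds may differ:

```python
def should_alert(short_window_burn: float, long_window_burn: float,
                 threshold: float) -> bool:
    """Multiwindow burn-rate check: fire only when BOTH the short and the
    long window exceed the threshold. The short window lets the alert
    reset quickly once the problem stops; the long window filters out
    brief spikes that would otherwise page someone."""
    return short_window_burn > threshold and long_window_burn > threshold

# Illustrative threshold from the SRE Workbook's fast-page example
FASTBURN_THRESHOLD = 14.4

# Brief spike: short window is hot, long window is not -> no page
print(should_alert(20.0, 3.0, FASTBURN_THRESHOLD))   # False
# Sustained incident: both windows exceed the threshold -> page
print(should_alert(20.0, 16.0, FASTBURN_THRESHOLD))  # True
```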
Basic Structure of the grafana_slo Resource
Now that the concepts are clear, let's look at an actual grafana_slo resource.
The grafana_slo resource defines the SLI query, target value, datasource, labels, and alerting in a single block. Rather than hardcoding destination_datasource.uid, the best practice is to use a data "grafana_data_source" block for dynamic reference. Hardcoded UIDs are a common source of errors when migrating environments or recreating datasources.
terraform {
required_providers {
grafana = {
source = "grafana/grafana"
version = ">= 3.5.0"
}
}
}
provider "grafana" {
url = var.grafana_url
auth = var.grafana_service_account_token
}
# Dynamically reference the datasource (recommended over hardcoding)
data "grafana_data_source" "prometheus" {
name = "grafanacloud-prom"
}
# Ratio type: HTTP availability SLO
resource "grafana_slo" "api_availability" {
name = "API Availability SLO"
description = "Ratio of successful requests excluding HTTP 5xx"
query {
type = "ratio"
ratio {
success_metric = "http_requests_total{status!~\"5..\"}"
total_metric = "http_requests_total"
group_by_labels = ["service", "env"]
}
}
objectives {
value = 0.999
window = "30d"
}
destination_datasource {
uid = data.grafana_data_source.prometheus.uid
}
label {
key = "team"
value = "platform"
}
alerting {
fastburn {
annotation {
key = "summary"
value = "Error budget burning rapidly"
}
label {
key = "severity"
value = "critical"
}
}
slowburn {
annotation {
key = "summary"
value = "Error budget burning steadily"
}
label {
key = "severity"
value = "warning"
}
}
}
}

When a grafana_slo resource is created, Grafana automatically generates Recording Rules (Prometheus rules for SLI aggregation), an error budget dashboard, and fastburn/slowburn alert rules. These auto-generated resources are not directly managed by Terraform and are cleaned up when the SLO is deleted.
For SLIs that are difficult to express as a ratio — such as latency — you can use the freeform type.
# Freeform type: P99 latency SLO
resource "grafana_slo" "latency_slo" {
name = "API P99 Latency SLO"
description = "Keep P99 response time at or below 200ms"
query {
type = "freeform"
freeform {
query = "sum(rate(http_request_duration_seconds_bucket{le=\"0.2\"}[$__rate_interval])) / sum(rate(http_request_duration_seconds_count[$__rate_interval]))"
}
}
objectives {
value = 0.95
window = "7d"
}
destination_datasource {
uid = data.grafana_data_source.prometheus.uid
}
}
$__rate_interval and freeform queries: $__rate_interval is a built-in Grafana variable that automatically adjusts the rate() calculation window to match the scrape interval. After terraform apply, the Grafana backend substitutes this variable at runtime with the evaluation window used internally by the SLO engine (e.g., 5 minutes, 30 minutes). However, if you run the query directly with tools outside Grafana, such as promtool or curl /api/v1/query, this variable will not be substituted and will cause an error. It is strongly recommended to validate queries in the Grafana Explore panel before committing them to code.
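For offline validation with promtool or the raw Prometheus API, one workaround is to substitute the variable with a concrete interval first. The helper below is a hypothetical convenience, and the 5m default is an arbitrary assumption for offline checks, not the value the Grafana SLO engine uses internally:

```python
def substitute_rate_interval(promql: str, interval: str = "5m") -> str:
    """Replace Grafana's $__rate_interval with a concrete interval so the
    query can be run against the raw Prometheus API (e.g. via curl or
    promtool). The 5m default is an arbitrary choice for offline testing."""
    return promql.replace("$__rate_interval", interval)

query = ('sum(rate(http_request_duration_seconds_bucket{le="0.2"}'
         '[$__rate_interval]))')
print(substitute_rate_interval(query))
# sum(rate(http_request_duration_seconds_bucket{le="0.2"}[5m]))
```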
Practical Application
Now that the concepts are clear, let's write actual code. We'll walk through each step: module design, CI/CD pipeline, and migration.
Example 1: Designing a Reusable SLO Module
When managing SLOs for multiple services, it is recommended to abstract the grafana_slo resource into a Terraform module rather than writing it repeatedly. The pattern has the platform team provide the module while development teams fill in the variables to create SLOs as a self-service operation.
Directory structure:
slo-gitops/
├── modules/
│ └── grafana-slo/
│ ├── main.tf # grafana_slo resource definition
│ ├── variables.tf # input variable declarations
│ └── outputs.tf # output value declarations
├── envs/
│ ├── staging/
│ │ ├── backend.tf # S3 remote state configuration
│ │ ├── main.tf
│ │ └── terraform.tfvars
│ └── production/
│ ├── backend.tf
│ ├── main.tf
│ └── terraform.tfvars
└── .github/
└── workflows/
└── slo-deploy.yml

Module variable declarations (modules/grafana-slo/variables.tf):
variable "service_name" {
type = string
description = "Service identifier used in the SLO name"
}
variable "team" {
type = string
description = "Owning team name (used in labels and the description)"
}
variable "slo_target" {
type = number
default = 0.999
description = "SLO target value (0.0 ~ 1.0)"
}
variable "success_metric" {
type = string
description = "PromQL metric representing successful requests"
}
variable "total_metric" {
type = string
description = "PromQL metric representing total requests"
}
variable "datasource_uid" {
type = string
description = "UID of the Prometheus datasource"
}

Module resource definition (modules/grafana-slo/main.tf):
resource "grafana_slo" "this" {
name = "${var.service_name} Availability"
description = "Managed by Terraform - team: ${var.team}"
query {
type = "ratio"
ratio {
success_metric = var.success_metric
total_metric = var.total_metric
}
}
objectives {
value = var.slo_target
window = "30d"
}
destination_datasource {
uid = var.datasource_uid
}
label {
key = "team"
value = var.team
}
label {
key = "managed"
value = "terraform"
}
alerting {
fastburn {
label {
key = "severity"
value = "critical"
}
}
slowburn {
label {
key = "severity"
value = "warning"
}
}
}
}

Module outputs (modules/grafana-slo/outputs.tf):
output "slo_id" {
value = grafana_slo.this.id
description = "UUID of the created SLO"
}
output "slo_name" {
value = grafana_slo.this.name
description = "Name of the created SLO"
}

S3 backend configuration (envs/production/backend.tf):
terraform {
backend "s3" {
bucket = "my-terraform-state"
key = "slo/production/terraform.tfstate"
region = "ap-northeast-2"
encrypt = true
# Terraform 1.9 and below: use DynamoDB locking
dynamodb_table = "terraform-lock"
# Terraform 1.10+: native S3 locking (can be used instead of dynamodb_table above)
# use_lockfile = true
}
}

Terraform State: A file that stores the current state of the infrastructure Terraform manages. In Terraform 1.9 and below, DynamoDB locking was required to prevent concurrent modifications, but since 1.10 the use_lockfile = true option enables native S3 locking, making a DynamoDB table unnecessary. Regardless of which approach you choose, configuring a remote backend is the first step toward team collaboration.
Calling the module in the production environment (envs/production/main.tf):
data "grafana_data_source" "prometheus" {
name = "grafanacloud-prom"
}
module "checkout_slo" {
source = "../../modules/grafana-slo"
service_name = "checkout-api"
team = "payments"
slo_target = 0.999
success_metric = "http_requests_total{service=\"checkout\",status!~\"5..\"}"
total_metric = "http_requests_total{service=\"checkout\"}"
datasource_uid = data.grafana_data_source.prometheus.uid
}
module "auth_slo" {
source = "../../modules/grafana-slo"
service_name = "auth-service"
team = "identity"
slo_target = 0.9995
success_metric = "http_requests_total{service=\"auth\",status!~\"5..\"}"
total_metric = "http_requests_total{service=\"auth\"}"
datasource_uid = data.grafana_data_source.prometheus.uid
}

How to verify PromQL metric names: The metric names that go into success_metric and total_metric vary by service. It is recommended to look up the actual metrics via kubectl exec -it <pod> -- curl localhost:8080/metrics or the Grafana Explore panel before putting them into code.
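If you prefer scripting the lookup, Prometheus exposes every metric name it knows about via its standard /api/v1/label/__name__/values endpoint. A minimal sketch, where the localhost:9090 address is a placeholder for your Prometheus instance:

```python
import json
import urllib.request

PROM_URL = "http://localhost:9090"  # placeholder: point at your Prometheus

def filter_names(names: list[str], prefix: str) -> list[str]:
    """Pure helper: keep only metric names starting with the prefix."""
    return [n for n in names if n.startswith(prefix)]

def list_metric_names(prefix: str = "") -> list[str]:
    """Fetch all metric names via the standard label-values endpoint,
    optionally filtered by prefix."""
    url = f"{PROM_URL}/api/v1/label/__name__/values"
    with urllib.request.urlopen(url) as resp:
        names = json.load(resp)["data"]
    return filter_names(names, prefix)

# e.g. list_metric_names("http_") against a live server; offline demo:
print(filter_names(["http_requests_total", "go_goroutines"], "http_"))
# ['http_requests_total']
```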
| Component | Role |
|---|---|
| Separate variables.tf | Separates variable definitions from resource logic, improving module reusability |
| Separate outputs.tf | Makes the SLO ID referenceable from other modules (alerts, dashboards) |
| Separate backend.tf | Clearly manages state file paths per environment |
| managed = "terraform" label | A guardrail to identify SLOs that have been modified without authorization in the UI |
Example 2: GitHub Actions GitOps Pipeline
This pipeline runs terraform plan for both staging and production when a PR is opened, posts the changes as a PR comment, and then deploys sequentially (staging first, then production) when the PR is merged into the main branch.
name: SLO Deploy
on:
push:
branches: [main]
paths: ["envs/**", "modules/**"]
pull_request:
branches: [main]
paths: ["envs/**", "modules/**"]
env:
TF_VAR_grafana_url: ${{ secrets.GRAFANA_URL }}
TF_VAR_grafana_service_account_token: ${{ secrets.GRAFANA_SERVICE_ACCOUNT_TOKEN }}
jobs:
plan:
runs-on: ubuntu-latest
if: github.event_name == 'pull_request'
strategy:
matrix:
env: [staging, production]
steps:
- uses: actions/checkout@v4
- uses: hashicorp/setup-terraform@v3
with:
terraform_version: "1.9.x"
- name: Configure AWS Credentials (S3 backend)
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
aws-region: ap-northeast-2
- name: Terraform Init
run: terraform init
working-directory: envs/${{ matrix.env }}
- name: Terraform Plan
id: plan
run: |
terraform plan -no-color -out=tfplan
terraform show -no-color tfplan > plan_output.txt
working-directory: envs/${{ matrix.env }}
- name: Comment Plan on PR
uses: actions/github-script@v7
with:
script: |
const fs = require('fs');
let output = fs.readFileSync('envs/${{ matrix.env }}/plan_output.txt', 'utf8');
// GitHub PR comments are capped at 65,536 characters, so truncate long plans
if (output.length > 60000) {
output = output.substring(0, 60000) + '\n...(output truncated; see the Actions log for the full plan)';
}
const body = `#### Terraform Plan — ${{ matrix.env }}\n\`\`\`\n${output}\n\`\`\``;
github.rest.issues.createComment({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.repo,
body: body
});
# Apply staging first
apply-staging:
runs-on: ubuntu-latest
if: github.ref == 'refs/heads/main' && github.event_name == 'push'
environment: staging
steps:
- uses: actions/checkout@v4
- uses: hashicorp/setup-terraform@v3
- name: Configure AWS Credentials
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
aws-region: ap-northeast-2
- run: terraform init
working-directory: envs/staging
- run: terraform apply -auto-approve
working-directory: envs/staging
# Apply production after staging succeeds (sequential deployment)
apply-production:
runs-on: ubuntu-latest
needs: apply-staging
environment: production
steps:
- uses: actions/checkout@v4
- uses: hashicorp/setup-terraform@v3
- name: Configure AWS Credentials
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
aws-region: ap-northeast-2
- run: terraform init
working-directory: envs/production
- run: terraform apply -auto-approve
working-directory: envs/production
environment: production: Using GitHub Actions' Environment Protection Rules, you can add an additional gate that prevents apply from running without approval from specific reviewers. This is especially useful for sensitive changes like modifying SLO target values.
For non-AWS environments: If you use HCP Terraform (formerly Terraform Cloud) instead of an S3 backend, you can use a TFE_TOKEN secret instead of AWS_ROLE_ARN and set cli_config_credentials_token in the hashicorp/setup-terraform action. For a GCS (Google Cloud Storage) backend, you can configure authentication using the google-github-actions/auth action.
Example 3: Migrating Existing UI SLOs to Code
Since 2025, Grafana officially supports exporting SLOs created in the UI in HCL format. This formalizes a "UI first → migrate to code" brownfield transition path.
# Step 1: Export existing SLOs in HCL format
curl -H "Authorization: Bearer $GRAFANA_TOKEN" \
"https://your-org.grafana.net/api/plugins/grafana-slo-app/resources/v1/slo?format=hcl" \
-o existing_slos.tf
# Or from the Grafana UI: SLO detail page → More > Export
# Step 2: Register existing SLOs in Terraform state (start managing without recreation)
terraform import grafana_slo.api_availability <SLO_UUID>
# Step 3: Verify drift between code and actual state with plan
terraform plan

After refactoring the exported HCL to fit the module structure and registering it in state with terraform import, you can begin managing existing SLOs as code without deleting and recreating them.
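When migrating dozens of SLOs, typing each import command by hand gets tedious. Below is a minimal generator sketch; it assumes the list endpoint returns entries with uuid and name fields, so confirm the actual field names against your Grafana version before running the output:

```python
import re

def to_resource_name(slo_name: str) -> str:
    """Derive a Terraform-safe resource name from an SLO display name."""
    name = re.sub(r"[^a-zA-Z0-9]+", "_", slo_name).strip("_").lower()
    return name or "slo"

def import_commands(slos: list[dict]) -> list[str]:
    """Build one `terraform import` command per SLO. Assumes each entry
    carries `uuid` and `name` keys (verify against your API response)."""
    return [
        f"terraform import grafana_slo.{to_resource_name(s['name'])} {s['uuid']}"
        for s in slos
    ]

sample = [{"uuid": "b1c2d3", "name": "API Availability SLO"}]
print(import_commands(sample)[0])
# terraform import grafana_slo.api_availability_slo b1c2d3
```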
Advanced Topics
Knowledge Graph SLO Integration (New in 2025–2026)
Once you've mastered the basics, you can leverage Grafana's Knowledge Graph integration for richer SLO context.
GA Status Notice: Knowledge Graph SLO integration requires
grafana/grafana provider v3.7.0 or later and only works in environments where the Asserts feature is enabled in Grafana Cloud. The search_expression attribute is currently GA (Generally Available), but support may be limited to Grafana Cloud, so it is recommended to verify the latest supported scope in the official documentation before using it in a self-hosted environment.
Adding the grafana_slo_provenance = "asserts" label connects the SLO to a specific service entity, allowing you to navigate directly to the RCA Workbench for root cause analysis when an alert fires.
# Requires grafana/grafana provider >= 3.7.0, only works in environments with Asserts enabled
resource "grafana_slo" "entity_slo" {
name = "Checkout API Availability (Knowledge Graph)"
description = "SLO integrated with the Knowledge Graph"
query {
type = "ratio"
ratio {
success_metric = "http_requests_total{service=\"checkout\",status!~\"5..\"}"
total_metric = "http_requests_total{service=\"checkout\"}"
}
}
objectives {
value = 0.999
window = "30d"
}
destination_datasource {
uid = data.grafana_data_source.prometheus.uid
}
label {
key = "grafana_slo_provenance"
value = "asserts"
}
# Specify the service entity to link (provider >= 3.7.0, Asserts activation required)
search_expression = "service.name=\"checkout-api\" AND environment=\"production\""
}

Pros and Cons
Advantages
| Item | Details |
|---|---|
| Version control | SLO change history is tracked in Git history, making it auditable — when and why it changed |
| Review process | PR-based code review automates team approval workflows for SLO target and query changes |
| Environment consistency | Managing staging/production SLOs with the same module prevents configuration mismatches across environments |
| Self-service | Terraform module abstraction lets development teams create SLOs without depending on the platform team |
| Drift detection | terraform plan immediately detects unauthorized changes made in the UI |
| Automation | CI/CD automates deployments and eliminates manual click errors |
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| Learning curve | Teams unfamiliar with Terraform and PromQL face initial onboarding costs | Provide reusable modules + internal documentation |
| State file management | Terraform state may contain sensitive information | S3 encryption + DynamoDB locking (or use_lockfile for 1.10+) configuration is required |
| UI sync issues | Directly modifying in the UI causes the code and actual state to diverge | Enforce a team rule against direct UI edits; use the managed=terraform label for identification |
| PromQL complexity | Built-in variables like $__rate_interval cannot be tested outside Grafana | Validate queries in Grafana Explore before committing to code |
| Auto-generated resources | Dashboards and alert rules are not directly managed by Terraform | Verify the impact of auto-generated resources before deleting an SLO |
| Grafana Cloud dependency | Only works in environments where the SLO plugin is activated | Activate the grafana-slo-app plugin for self-hosted setups. Plugin support may differ from Cloud, so check feature availability in the official plugin docs |
| Multi-environment complexity | Managing SLOs for dozens of services makes module design complex | Define naming conventions and directory structure in advance |
⚠️ The Most Common Mistakes in Production
The following three issues occur most frequently in real production environments. Being aware of them in advance and making them team rules can reduce on-call fatigue.
- Using a local state file: Keeping terraform.tfstate locally causes state inconsistencies among team members. It is recommended to configure a remote backend (backend.tf) as the very first step when starting a project.
- Making temporary UI edits without reflecting them in code: Changing an SLO in the UI during an incident and forgetting to update the code means the change will be reverted on the next terraform apply. Running terraform plan regularly to detect drift early is important.
- Applying the same SLO target to every service: 99.9% is not always the right answer. It is recommended to differentiate targets per service based on service criticality and error budget consumption history. The slo_target variable exists precisely so you can set different values per service.
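The regular drift check mentioned above can be automated in a scheduled CI job using terraform plan -detailed-exitcode, which exits 0 when state matches the code, 1 on error, and 2 when there are pending changes. A minimal sketch:

```python
import subprocess

def classify(exit_code: int) -> str:
    """Map terraform's -detailed-exitcode convention to a human label:
    0 = state matches code, 2 = pending changes (drift), anything else = error."""
    return {0: "in-sync", 2: "drift-detected"}.get(exit_code, "plan-error")

def check_drift(env_dir: str) -> str:
    """Run `terraform plan -detailed-exitcode` in the given environment
    directory and classify the result (e.g. for a nightly CI drift job)."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-no-color"],
        cwd=env_dir, capture_output=True, text=True,
    )
    return classify(result.returncode)

print(classify(2))  # drift-detected
```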
Closing Thoughts
SLO-as-Code is the first step toward building a culture where the entire team signs off on SLO changes. By declaring reliability targets as code, reviewing them via PRs, and deploying through CI/CD, you can secure both transparency in SLO management and trust across teams.
Three steps you can take right now:
- (~5 minutes) Export your existing SLOs to HCL: The command curl "https://your-org.grafana.net/api/plugins/grafana-slo-app/resources/v1/slo?format=hcl" -H "Authorization: Bearer $GRAFANA_TOKEN" -o existing_slos.tf extracts your current SLOs as code. You can also open an SLO in the Grafana UI and click More > Export.
- (~30 minutes) Set up the reusable module and import: Use the modules/grafana-slo structure from the examples above to write your module, then register existing SLOs into state with terraform import grafana_slo.<name> <SLO_UUID> and run terraform plan to confirm there is no drift.
- (~20 minutes) Add the GitHub Actions workflow: Add .github/workflows/slo-deploy.yml to your repository and register the GRAFANA_URL, GRAFANA_SERVICE_ACCOUNT_TOKEN, and AWS_ROLE_ARN secrets, and a GitOps pipeline that automatically posts plan results as PR comments will be activated immediately.
Other Articles in This Series
Next article: Kubernetes-native SLO management with Sloth and Pyrra — comparing a Prometheus Operator and CRD-based approach against Grafana SLO.
References
If you're just getting started, these two are recommended first:
- ⭐ grafana_slo Resource | Terraform Registry — Official reference for the resource schema
- ⭐ Provision SLO resources using Terraform | Grafana Cloud Official Docs — Official setup guide
Additional resources:
- Configure Knowledge Graph SLOs using Terraform | Grafana Official Docs
- Export Grafana SLOs into HCL Format | Grafana What's New (2025)
- Best Practices for Grafana SLOs | Grafana Cloud Docs
- Alerting on SLOs | Google SRE Workbook
- Automate Terraform with GitHub Actions | HashiCorp Developer
- How to Use Terraform with GitOps | Spacelift
- grafana-slo-app Plugin Docs — Terraform Provisioning