Putting Error Rate per Second & P99 Latency on a Grafana Loki Dashboard with LogQL — rate, sum by, and quantile_over_time Query Patterns
"Searching" logs and "extracting insights" from logs are quite different things. You've probably been there: your service goes down, you frantically stare at logs in the Grafana Explore tab (left menu → Explore), and eventually give up and go back to plain grep. The frustration usually comes from using LogQL only as a filter tool.
When you use LogQL properly, the story changes. You can aggregate tens of millions of log lines in real time to produce per-second error rates, P99 latency, and per-service traffic heatmaps. If you already know PromQL, the aggregation syntax will feel familiar — and even if this is your first time, the concepts themselves aren't hard.
This article walks through the two pillars of LogQL — stream selectors and log pipelines — step by step, then organizes metric aggregation operators like rate, sum by, and quantile_over_time into query patterns you can drop straight into Grafana dashboard panels.
Prerequisites: This article assumes Grafana Loki is already deployed and connected to your environment. Loki installation is not covered here.
Table of Contents
- Core Concepts
- Practical Examples
- Example 1: Per-Service Error Rate Monitoring (Time Series Panel)
- Example 2: P99 Latency Trend (Example 1 with unwrap Added)
- Example 3: Extracting Response Time from AWS ALB Access Logs
- Example 4: Debugging Parse Errors
- Example 5: Authentication Failure Alerts (for Grafana Alerting)
- Example 6: Log Volume Heatmap (Distribution by Log Level)
- Pros and Cons
- Closing Thoughts
- References
Core Concepts
The Two Components of LogQL
LogQL is Grafana Loki's dedicated query language, designed with inspiration from PromQL. The structure is simple: a stream selector that specifies which log streams to look at, and a log pipeline that filters and transforms those streams — that's all there is.
{app="nginx", env="production"} |= "error" | json | status >= 500
└─── stream selector ─────────┘ └──────── log pipeline ──────────────┘
Stream selectors leverage the label index stored in Loki. This is why narrowing the scope dramatically reduces query cost: filtering inside the pipeline is a full scan that happens after chunks are already read.
Stream selector: A {key="value"} expression that selects logs stored in Loki by label. Because Loki only indexes labels, not log content, the narrower the selector, the lower the I/O cost.
LogQL queries come in two types depending on their return type:
| Type | Return Value | Primary Use |
|---|---|---|
| Log Query | Log lines (strings) | Grafana Logs panel, Explore browsing |
| Metric Query | Numeric time series | Graph, Stat, Alert panels |
A Metric Query is created by wrapping a Log Query with aggregation functions like rate() or count_over_time(). The two sections below walk through this flow in order.
Writing Log Queries: Understanding Each Pipeline Stage
A pipeline is a sequence of stages connected by |. Once you know what each stage does, the right ordering — and why it matters for performance — becomes obvious.
Stage 1 — Line Filter: Filter the Log Lines Themselves
{app="api"} |= "ERROR" # line contains the string
{app="api"} != "healthcheck" # line does not contain the string
{app="api"} |~ "5[0-9]{2}" # line matches the regex
{app="api"} !~ "GET|POST" # line does not match the regex
Line filters are the cheapest stage in the pipeline. They work on the raw line and need no parsing, so place them at the very front, before any parser. I used to add filters after | json and wondered why queries felt sluggish; swapping the order made a noticeable difference.
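To see why the order matters, here is a minimal Python sketch (not Loki code) of a toy two-stage pipeline that counts how many lines get parsed. The run_pipeline function and the sample lines are hypothetical, made up for illustration.

```python
import json

def run_pipeline(lines, needle, filter_first):
    """Return (matching_records, parse_calls) for a toy two-stage pipeline."""
    parse_calls = 0
    matches = []
    for line in lines:
        if filter_first and needle not in line:
            continue  # cheap substring check skips the parse entirely
        record = json.loads(line)
        parse_calls += 1
        if not filter_first and needle not in line:
            continue  # same result, but the parse cost was already paid
        matches.append(record)
    return matches, parse_calls

# Hypothetical sample: one error line among 1,000 JSON log lines.
lines = [json.dumps({"level": "info", "msg": f"request {i} ok"}) for i in range(999)]
lines.append(json.dumps({"level": "error", "msg": "ERROR upstream timeout"}))

fast_matches, fast_parses = run_pipeline(lines, "ERROR", filter_first=True)
slow_matches, slow_parses = run_pipeline(lines, "ERROR", filter_first=False)
print(fast_parses, slow_parses)
```

Both orderings return the same records. Only the number of parse calls differs, and that is the cost the line filter saves.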
Stage 2 — Parser: Structure Unstructured Logs into Labels
# JSON log parsing
{app="api"} | json
# key=value format (logfmt)
{app="worker"} | logfmt
# Position-based pattern extraction
{app="nginx"} | pattern `<ip> - <user> [<_>] "<method> <path> <_>" <status> <bytes>`
# Extract named groups with regex
{app="legacy"} | regexp `(?P<level>\w+) (?P<message>.+)`
Once the parser runs, fields from the log line are registered as labels, making them referenceable by names like status or duration in later stages. If you've ever been puzzled by "why isn't the field there?", it's often because the label filter was placed before the parser stage, where the extracted labels don't exist yet.
Stage 3 — Label Filter: Filter on Parsed Fields
{app="api"} | json | status >= 500
{app="api"} | json | level = "error" | duration > 200ms
{app="api"} | json | __error__ != "" # debugging parse errors
The __error__ label is an internal label Loki automatically attaches when parsing fails. When JSON parsing isn't working, __error__ != "" lets you see which logs are failing to parse. It's a tip buried even in the official docs, but it's invaluable for tracking down issues when a log format changes.
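The idea behind __error__ can be sketched in a few lines of Python. This is only a toy model, not Loki's parser: json_stage and the sample lines are hypothetical, and the error value simply mirrors the JSONParserErr string used later in this article.

```python
import json

def json_stage(line):
    """Toy parser stage: return extracted labels, or an __error__ label on failure."""
    try:
        fields = json.loads(line)
    except json.JSONDecodeError:
        fields = None
    if not isinstance(fields, dict):
        # The line is kept and only tagged, so a later filter can select it.
        return {"__error__": "JSONParserErr"}
    return {key: str(value) for key, value in fields.items()}

lines = ['{"level": "error", "status": 500}', "plain text after a format change"]

# Equivalent in spirit to the label filter __error__ != ""
failed = [line for line in lines if json_stage(line).get("__error__", "") != ""]
print(failed)
```

The point of the design is that a failed parse does not drop the line. It stays in the result with an extra label, which is what makes the failures searchable.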
Stage 4 — Line Format / Label Format: Redefine Output
# Reformat output lines
{app="api"} | json | line_format "{{.level}} | {{.message}}"
# Rename labels
{app="api"} | json | label_format response_time=duration, svc=app
Commonly used to improve readability in the Logs panel or to unify label names before aggregation.
Writing Metric Queries: Range Aggregation and Vector Aggregation
Once you've filtered the desired logs through a pipeline, wrapping it with aggregation functions produces a Metric Query.
Range Aggregation: Extracting Numbers from Logs
There are two kinds, distinguished by whether unwrap is used.
Without unwrap — aggregate the log lines themselves
# Number of log lines over 5 minutes
count_over_time({app="api"}[5m])
# Per-second log occurrence rate
rate({app="api"}[5m])
# Total bytes over 5 minutes / per-second byte rate
bytes_over_time({app="api"}[5m])
bytes_rate({app="api"}[5m])
rate() is count_over_time() divided by the number of seconds in the range window. Because it normalizes to "per second," comparisons remain valid even if the range vector size changes. bytes_over_time and bytes_rate calculate the total bytes and per-second bytes of log lines, respectively, which is useful for monitoring log volume in capacity units.
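The relationship between count_over_time and rate is plain arithmetic, which this minimal Python sketch illustrates. It is not Loki code: the timestamps are a hypothetical stream, and treating the window as open on the left and closed on the right is a simplifying assumption.

```python
def count_over_time(timestamps, end, window_s):
    """Count entries whose timestamp falls in (end - window_s, end]."""
    return sum(1 for t in timestamps if end - window_s < t <= end)

def rate(timestamps, end, window_s):
    """Per-second rate: the count divided by the window length in seconds."""
    return count_over_time(timestamps, end, window_s) / window_s

# Hypothetical stream: one log line every 2 seconds for 10 minutes.
timestamps = [float(t) for t in range(1, 601, 2)]

count_5m = count_over_time(timestamps, end=600, window_s=300)
count_1m = count_over_time(timestamps, end=600, window_s=60)
rate_5m = rate(timestamps, end=600, window_s=300)
rate_1m = rate(timestamps, end=600, window_s=60)
print(count_5m, count_1m, rate_5m, rate_1m)
```

The two counts differ by a factor of five because the windows do, while the two rates come out the same. That is why rate is the safer choice when panels use different range sizes.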
With unwrap — extract a numeric field, then aggregate
# Average response time
avg_over_time({app="api"} | logfmt | unwrap duration [5m])
# 99th percentile latency
quantile_over_time(0.99, {app="api"} | json | unwrap response_time [5m])
# Maximum value
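max_over_time({app="api"} | json | unwrap response_ms [5m])
unwrap: A keyword that pulls a parsed label value out as a number for use in aggregation. unwrap duration converts the duration label's value to a float, so the value must already be a plain number. Time-unit values like 1.5s or 200ms need the conversion function form, unwrap duration(duration) or unwrap duration_seconds(duration), which converts them to seconds. Values that cannot be converted get an __error__ label, just like parse failures.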
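For a feel of what converting a time-unit value to seconds involves, here is a minimal Python sketch. It is an illustration only, not Loki's implementation: go_duration_to_seconds is a hypothetical helper, and the unit table is my assumption of the common Go-style units.

```python
import re

# Go-style duration units expressed in seconds (assumed set, for illustration).
_UNIT_SECONDS = {"ns": 1e-9, "us": 1e-6, "µs": 1e-6, "ms": 1e-3,
                 "s": 1.0, "m": 60.0, "h": 3600.0}
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)(ns|us|µs|ms|s|m|h)")

def go_duration_to_seconds(text):
    """Convert strings such as '200ms' or '1m30s' to float seconds."""
    pos, total = 0, 0.0
    for match in _TOKEN.finditer(text):
        if match.start() != pos:
            raise ValueError(f"not a duration: {text!r}")
        total += float(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        pos = match.end()
    if pos == 0 or pos != len(text):
        raise ValueError(f"not a duration: {text!r}")
    return total

print(go_duration_to_seconds("200ms"), go_duration_to_seconds("1m30s"))

# A plain float conversion rejects the same input.
try:
    float("200ms")
except ValueError as err:
    print("plain float conversion fails:", err)
```

The last few lines show the failure mode to watch for: a value with a unit suffix is not a valid float, so it has to go through a duration-aware conversion first.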
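quantile_over_time works on the raw values unwrapped from log lines, whereas Prometheus's histogram_quantile estimates quantiles from histogram buckets. Loki can also approximate the quantile with a sketch in some configurations (for example, when quantile queries are sharded). Either way, the result can differ slightly from quantiles computed via Prometheus, so keep that in mind if P99 values look unexpected.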
Vector Aggregation: Grouping Series with sum by and topk
This stage takes the time series produced by Range Aggregation and groups or aggregates them by label.
# Per-service per-second error rate
sum by (service) (rate({env="prod"} |= "error" [5m]))
# Top 10 endpoints by error count
topk(10, sum by (path) (count_over_time({app="api"} | json | status >= 500 [10m])))
# Request count by HTTP status code
sum by (status) (count_over_time({app="nginx"} | json [5m]))
Calculating an error rate is identical to the PromQL approach: put the error-filtered expression in the numerator and total requests in the denominator. Honestly, knowing just this one pattern solves half the alerts you'll ever need in production. However, if you want per-service rates, both the numerator and denominator need sum by (service). A plain sum on both sides collapses everything into one overall series instead of per-service series.
# Per-service error rate (%) calculation
sum by (service) (rate({app="api"} | json | status >= 500 [5m]))
/
sum by (service) (rate({app="api"} | json [5m]))
* 100
The aggregation operators are nearly identical to PromQL. In practice, sum by and topk cover most cases, and the full list is available in the Loki official documentation.
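The per-service division can be sketched in Python to show what grouping both sides by the same label buys you. This is a toy model with hypothetical records, not Loki code. The window length is left out because it appears in both the numerator and the denominator and cancels.

```python
from collections import Counter

# Hypothetical parsed log records for one 5-minute window: (service, status).
records = (
    [("checkout", 200)] * 95 + [("checkout", 503)] * 5
    + [("search", 200)] * 198 + [("search", 500)] * 2
)

def error_percent_by_service(records):
    """Group both sides by service, then divide series with matching labels."""
    total = Counter(service for service, _ in records)
    errors = Counter(service for service, status in records if status >= 500)
    return {service: 100 * errors[service] / total[service] for service in total}

def error_percent_overall(records):
    """Without grouping, everything collapses into a single number."""
    errors = sum(1 for _, status in records if status >= 500)
    return 100 * errors / len(records)

per_service = error_percent_by_service(records)
overall = error_percent_overall(records)
print(per_service, overall)
```

The overall figure sits between the two services and hides the fact that checkout fails five times as often as search, which is the detail the grouped version keeps.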
Practical Examples
The examples are ordered so that each one builds on the last. Adding unwrap to the basic error rate query from Example 1 gives you the latency query in Example 2, and so on.
Example 1: Per-Service Error Rate Monitoring (Time Series Panel)
Use this in a microservices environment when you want to instantly see "which service is spiking with errors right now?"
sum by (service, method) (
rate(
{namespace="production"} | json | status =~ "5.." [5m]
)
)
| Component | Role |
|---|---|
| {namespace="production"} | Selects only logs from the production namespace |
| \| json | Parses logs as JSON, turning status, method, etc. into labels |
| status =~ "5.." | Filters to only 5xx status codes using regex |
| rate(... [5m]) | Per-second error rate over a 5-minute window |
| sum by (service, method) | Aggregates by service + method combination |
When attached to a Grafana Time series panel, each service/method combination appears as a separate series. You can immediately spot when POST errors from a specific service suddenly spike.
Example 2: P99 Latency Trend (Example 1 with unwrap Added)
Built on the same structure as Example 1, but uses quantile_over_time instead of rate and uses unwrap to pull out a numeric field. Frequently used for SLA management and performance regression detection.
quantile_over_time(0.99,
{app="api"} | json | unwrap response_time_ms [5m]
) by (service)
| Component | Role |
|---|---|
| \| json | JSON parsing to turn the response_time_ms field into a label |
| unwrap response_time_ms | Pulls the label value out as a number |
| quantile_over_time(0.99, ...[5m]) | Computes the 99th percentile of values collected over 5 minutes |
| by (service) | Splits by service |
quantile_over_time(φ, ...): Computes the φ (0–1) quantile of values in the range. 0.99 gives P99, 0.5 gives the median. Unlike avg_over_time, a quantile is not diluted by the bulk of fast requests: the median ignores a handful of outliers, and P99 shows the slow tail that an average hides. Results may differ slightly from Prometheus histogram-based quantiles, which are estimated from bucket boundaries rather than raw values.
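To see how average, median, and P99 diverge on the same data, here is a minimal Python sketch. The quantile function uses linear interpolation between the closest ranks, which is one common definition and is not claimed to be Loki's exact algorithm. The latency values are hypothetical.

```python
from statistics import mean

def quantile(phi, values):
    """phi-quantile with linear interpolation between the closest ranks."""
    ordered = sorted(values)
    rank = phi * (len(ordered) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)

# Hypothetical response times in ms: 98 fast requests and 2 very slow ones.
latencies_ms = [20.0] * 98 + [2000.0] * 2

avg = mean(latencies_ms)
p50 = quantile(0.5, latencies_ms)
p99 = quantile(0.99, latencies_ms)
print(avg, p50, p99)
```

The average lands somewhere no real request experienced, the median reports the typical request, and P99 surfaces the slow tail. That difference is the reason a latency panel usually tracks a high quantile rather than the mean alone.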
Example 3: Extracting Response Time from AWS ALB Access Logs
ALB access logs are space-delimited plain text rather than JSON, which makes the pattern parser extremely useful here.
avg_over_time(
{job="alb-access"}
| pattern `<_> <time> <elb> <client_ip>:<client_port> <_> <request_processing_time> <backend_processing_time> <_> <elb_status_code> <_> <received_bytes> <sent_bytes> "<request>" <_>`
| backend_processing_time >= 0
| unwrap backend_processing_time [1m]
) by (elb_status_code)
<_> is a wildcard that skips fields you don't need to extract. Since you only name the fields you care about, the pattern parser is far more readable than regex for access log formats with a fixed column order. The leading <_> skips the connection type field that ALB writes before the timestamp, and the label named backend_processing_time here captures the field ALB documents as target_processing_time.
ALB records backend_processing_time as -1 when the backend connection fails. Using avg_over_time without the | backend_processing_time >= 0 filter lets those -1 values mix in and skew the average. The filter uses >= rather than > so that legitimate 0.000 values from very fast responses are kept. This is a trap almost everyone steps into at least once in production.
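The size of the skew is easy to underestimate, so here is a small Python illustration with hypothetical sample values (three successful requests and two failed connections recorded as -1).

```python
from statistics import mean

# Hypothetical backend processing times in seconds; -1.0 marks a failed connection.
samples = [0.120, 0.080, 0.100, -1.0, -1.0]

avg_all = mean(samples)                            # sentinel values included
avg_valid = mean([t for t in samples if t >= 0])   # sentinel values dropped
print(avg_all, avg_valid)
```

With the sentinels included the average turns negative, which is obviously wrong on a graph. With only a few failures among thousands of requests the distortion is smaller and much harder to notice.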
Example 4: Debugging Parse Errors
When a newly ingested log format has changed, you can quickly identify which lines are failing to parse.
{app="api"} | json | __error__ = "JSONParserErr"
| line_format "PARSE_ERR: {{__line__}}"
__line__ is a template function that returns the full original log line. Since you can see the raw source of whatever is failing to parse, it's extremely useful for tracking format changes.
Example 5: Authentication Failure Alerts (for Grafana Alerting)
sum(
count_over_time({app="auth-service"} |= "authentication failed" [1m])
)
Attach this query to a Grafana Alert rule and set the threshold to 100 in the rule's Threshold expression. If you embed > 100 directly in the query, the query returns no series at all while the count is below the threshold, and Grafana Alerting treats that as No Data rather than a healthy state. It's safer to leave the comparison to the alert rule. This is a pattern you'll encounter frequently in production for brute-force attack detection.
Example 6: Log Volume Heatmap (Distribution by Log Level)
sum by (level) (
count_over_time({app="api"} | json [1m])
)
How you attach this query differs slightly depending on the Grafana panel type:
- Bar chart panel: Attach the query as-is. Stacked bars for info/warn/error will be displayed per time period. This is more intuitive for checking whether error-level logs spike right after a deployment.
- Heatmap panel: Heatmaps work best when label values are numeric buckets. Representing a string-label-based distribution as a Heatmap requires additional configuration in panel settings. For log level distributions, a Bar chart panel is more appropriate.
Pros and Cons
Advantages
| Item | Details |
|---|---|
| Low indexing cost | Only labels are indexed, not log content → dramatically reduced storage costs |
| PromQL-friendly | Aggregation syntax is similar to PromQL → gentle learning curve for existing Prometheus users |
| Pipeline flexibility | JSON/logfmt/regexp parsing happens at query time → no need to pre-design index schemas |
| Unified observability | Metrics (Prometheus) + Logs (Loki) + Traces (Tempo) can all be connected within Grafana |
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| Full-scan cost | A broad label range means reading a large number of chunks | Keep stream selectors as narrow as possible |
| Cardinality explosion | Overusing high-cardinality extracted labels in by causes series explosion | Avoid using labels like user_id or request_id as aggregation keys |
| count_over_time + unwrap incompatible | count_over_time does not accept unwrapped values | Use sum_over_time for summing |
| Regex performance | \|~ regex filters are slower than \|= string filters | Prefer string filters where possible |
| Pipeline order sensitivity | Placing line filters after parsing incurs unnecessary parsing cost | Always place line filters at the front of the pipeline |
| quantile_over_time vs. Prometheus | Computed from raw log values, in some configurations via an approximating sketch; may differ slightly from Prometheus quantiles | Consider using Prometheus histogram metrics when the numbers need to match your Prometheus dashboards |
Cardinality: The number of unique values a label can hold. env has few values like production and staging (low cardinality), but user_id can have millions of unique values (high cardinality). Using high-cardinality labels as aggregation keys causes memory and storage costs to explode.
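A quick way to internalize this is to count how many output series a grouping key would produce. The Python sketch below does that for hypothetical log entries. The field names level and user_id are examples, not a required schema.

```python
LEVELS = ["info", "warn", "error"]

# Hypothetical parsed log entries: every request comes from a different user.
logs = [{"level": LEVELS[i % 3], "user_id": f"user-{i}"} for i in range(10_000)]

def series_count(logs, key):
    """Number of distinct series a 'sum by (key)' style grouping would return."""
    return len({entry[key] for entry in logs})

by_level = series_count(logs, "level")
by_user = series_count(logs, "user_id")
print(by_level, by_user)
```

Grouping by level yields a handful of series no matter how many lines arrive, while grouping by user_id grows with traffic. The second case is the one that overwhelms a panel and the querier behind it.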
Most Common Mistakes in Production
- Setting the stream selector too broadly: Leaving it empty like {} or using a wide range forces Loki to scan all chunks, causing timeouts. Always narrow the scope first with labels like app, namespace, or env.
- Placing line filters after parsing in the pipeline: |= "timeout" | json | status >= 500 is significantly faster than | json | status >= 500 |= "timeout". Pre-filtering with a line filter before parsing reduces the number of logs that need to be parsed in the first place.
- Setting the range vector too short: On low-volume services, using [1m] can result in sparse or zero aggregation output, and a wider range such as [5m] smooths this out. Also, if the Grafana panel's Step value is larger than the range vector, log lines that fall between evaluation windows are never counted. Keep Step no larger than the range vector.
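The Step versus range interaction in the last item is easier to see with numbers. This minimal Python sketch models a query evaluated every step_s seconds, where each evaluation looks back range_s seconds. It is a simplified model with hypothetical events, not Grafana's or Loki's actual scheduling.

```python
def counted_events(event_times, range_s, step_s, end_s):
    """Collect events covered by windows (t - range_s, t] evaluated every step_s."""
    covered = set()
    t = step_s
    while t <= end_s:
        covered.update(e for e in event_times if t - range_s < e <= t)
        t += step_s
    return covered

# Hypothetical stream: one event per second for 10 minutes.
events = list(range(1, 601))

with_gaps = counted_events(events, range_s=60, step_s=300, end_s=600)
no_gaps = counted_events(events, range_s=300, step_s=300, end_s=600)
print(len(with_gaps), len(no_gaps))
```

With a 60-second range and a 300-second step, four out of every five minutes never fall inside any window, so a burst of errors in that gap would not show up on the panel at all.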
Closing Thoughts
I started out just using |= "error" in Explore, and learning sum by alone changed how I think about building dashboards entirely. It feels like shifting from a tool that "displays" logs to one that "measures" them.
If you're already running Loki, I'd encourage you to put at least one LogQL-based panel up on a Grafana dashboard right now.
Three steps to get started immediately:
- Start by verifying your stream selector in Grafana Explore: Type {app="your-service"} into the Explore tab (left menu → Explore) and confirm logs are being collected correctly. Understanding what labels are attached is the starting point for every query.
- Copy the error rate query and paste it into a Time series panel: Take the query from Example 1 and just swap the namespace and service labels to match your environment. If the sum by (service) result shows up as series in the panel, you're done.
- Extend it to a P99 latency query: If your logs include a response time field, combine unwrap with quantile_over_time(0.99, ...) to add an SLA monitoring panel. Placing it alongside avg_over_time lets you see the difference between average and tail latency at a glance, which is useful for detecting anomalies.
References
Good starting points
- Metric queries | Grafana Loki official documentation
- How to use LogQL range aggregations in Loki | Grafana Labs Blog
Additional references
- Query Loki | Grafana Loki official documentation
- Log queries | Grafana Loki official documentation
- LogQL Reference | Grafana Loki official documentation
- Loki v3.5 Release Notes | Grafana Loki official documentation
- Grafana Loki: LogQL and Recording Rules from AWS ALB logs | ITNEXT
- A Comprehensive Guide to LogQL | DEV Community
- Introduction to Loki Workshop — Metric Queries Lab
- Loki LogQL Cheat Sheet | logit.io
- LogQL Cheat Sheet | FusionReactor