"It's internal, so it should be fine" is over — Applying Zero Trust to Microservice APIs: mTLS, JWT, and OPA in Practice
To be honest, I used to think, "It's an internal network anyway, why bother with such tight security?" I believed that as long as the outside of the API Gateway was well-protected, the inside would take care of itself. But in reality, incidents always originate from the inside. Scenarios where a trusted service account gets hijacked, or a single internal microservice gets compromised and the attacker moves laterally through the entire system — this isn't just a hypothetical. According to the Salt Security 2023 API Security Report, 71% of web traffic already flows through APIs, and 17% of security vulnerabilities originate from APIs. APIs themselves are the new security perimeter.
If you're already using JWT, you can jump straight to the "API Gateway JWT Hardening" section. If you want to extend security to service-to-service communication, start with the "mTLS + SPIRE" section. If this is your first time, just read in order.
After reading this, you'll have the concepts and code needed to add JWT revocation checks to an existing NestJS service and stand up an mTLS pilot between two services. Keywords like mTLS, JWT, OPA, and SPIRE will be explained one by one as they appear, so don't worry if you're encountering them for the first time. That said, familiarity with basic Kubernetes concepts (pods, namespaces) will make the latter examples easier to follow.
Core Concepts
The Three Pillars of Zero Trust
Zero Trust is a security model based on the principle of "Never Trust, Always Verify." Unlike traditional perimeter security that implicitly trusted internal networks, it authenticates and authorizes every request regardless of network location.
Understanding this makes the rest of the technology choices fall naturally into place.
| Principle | Meaning | Application in API Design |
|---|---|---|
| Verify Explicitly | Validate every request based on identity, device, and context | Verify tokens/certificates on every request; never trust IP addresses |
| Least Privilege | Allow only what is strictly necessary | Scope-based fine-grained authorization, per-resource access control |
| Assume Breach | Design as if the interior is already compromised | Encrypt and authenticate service-to-service communication; block lateral movement |
NIST SP 800-207: The official standard framework for Zero Trust Architecture, which recommends phased migration. A "big bang" all-at-once transition carries too much practical risk.
Traditional Perimeter Security vs. Zero Trust
This is a situation frequently encountered in practice. The traditional approach looks like this:
```
[External] — Firewall — [Internal Network]
              ↳ Service A ↔ Service B ↔ Service C
                 (implicitly trust each other)
```

Once inside the firewall, services treat each other as "on the same side." Zero Trust, on the other hand, looks like this:
```
[External] — API Gateway (Auth & Authz) — [Service A]
                                              ↓ mTLS + identity verification
                                          [Service B]
                                              ↓ mTLS + identity verification
                                          [Service C]
```

Verification occurs at every hop. Even when Service A calls Service B, it confirms "are you really A?"
What Is Workload Identity?
I was confused by this at first too — just as people have IDs, services (workloads) also need unique identities. This is the core of workload identity.
SPIFFE (Secure Production Identity Framework For Everyone): A CNCF project that provides a standard for assigning consistent workload identities even in dynamic container and serverless environments. SPIRE is the runtime that actually implements this standard. Think of them as the relationship between a standard and its implementation.
A SPIFFE URI looks like this:
```
spiffe://trust-domain/path/to/service
e.g.: spiffe://mycompany.com/backend/payment-service
```

Based on this identity, short-lived X.509 certificates called SVIDs (SPIFFE Verifiable Identity Documents) are automatically issued and used for mTLS communication. Since certificates are automatically renewed upon expiration, there's no need to manage them manually like static API keys. I once experienced an early-morning outage due to certificate expiration when first implementing mTLS — I wish SPIRE had been there to spare me that headache.
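To make the identity format concrete, here is a small stdlib-only Python sketch. It is illustrative, not part of SPIRE or the pyspiffe SDK — the function name and trust domain are made up. It performs the same basic check an mTLS peer validator applies to the SPIFFE ID found in a certificate's URI SAN: is this workload from a trust domain we accept?

```python
# Illustrative only: a minimal SPIFFE ID check, not the pyspiffe API.
from urllib.parse import urlparse

def workload_path(spiffe_id: str, expected_trust_domain: str) -> str:
    """Return the workload path if the ID belongs to our trust domain, else raise."""
    parsed = urlparse(spiffe_id)
    if parsed.scheme != "spiffe":
        raise ValueError("not a SPIFFE ID")
    if parsed.netloc != expected_trust_domain:
        raise ValueError(f"untrusted trust domain: {parsed.netloc}")
    return parsed.path

print(workload_path("spiffe://mycompany.com/backend/payment-service", "mycompany.com"))
# → /backend/payment-service
```

In real deployments this check is done for you by the SPIFFE libraries during the mTLS handshake; the point is that authorization can key off a stable identity rather than an IP address.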
Architecture Choices Based on Whether a Service Mesh Is Present
Getting a handle on the branching below will make the context clearer as you follow the examples.
| Aspect | Without a Service Mesh | With a Service Mesh (e.g., Istio) |
|---|---|---|
| mTLS Configuration | Managed directly in app code or via cert-manager | Handled automatically by sidecar proxy |
| Policy Enforcement Point | App-level Guard or API Gateway | OPA integrated into Envoy sidecar |
| Operational Complexity | Low (simple, but difficult to maintain consistency) | High (heavy initial setup, smoother ongoing operations) |
| Recommended Starting Point | ✅ Small teams, fast pilots | Large-scale microservice environments |
Sidecar: An auxiliary container that runs alongside the app container inside a Kubernetes Pod (the unit that runs containers). It intercepts network traffic on behalf of the app to handle mTLS, logging, policy enforcement, and more.
Practical Application
API Gateway JWT Hardening: Closing the Gaps in Token Validation
This is the most realistic starting point. It can be applied immediately without a service mesh, which is why it's recommended as the first step.
```
External Client (OAuth2 Bearer Token)
        ↓
[API Gateway] → Token validation (iss, aud, exp, scope, revocation)
        ↓
Internal Service (converted to standardized internal JWT)
```

There are parts of JWT validation that are easy to miss. Many implementations only check `exp` (expiration) and stop there, but revocation status checks are also essential. The language may differ, but the principle is the same. Here is a NestJS TypeScript example.
```typescript
// Dependencies: @nestjs/jwt, ioredis (Redis client)
import { Injectable, UnauthorizedException, ForbiddenException } from '@nestjs/common';
import { JwtService } from '@nestjs/jwt';
// Local providers (implementations not shown): TokenBlacklistService wraps the
// Redis jti blacklist; JwtPayload describes the expected claim set.
import { TokenBlacklistService } from './token-blacklist.service';
import { JwtPayload } from './jwt-payload.interface';

@Injectable()
export class AuthService {
  constructor(
    private readonly jwtService: JwtService,
    private readonly tokenBlacklistService: TokenBlacklistService,
  ) {}

  async validateToken(token: string): Promise<JwtPayload> {
    let payload: JwtPayload;
    try {
      payload = this.jwtService.verify(token, {
        issuer: 'https://auth.mycompany.com',
        audience: 'api.mycompany.com',
        // exp, iss, aud are automatically checked by verify()
      });
    } catch (e) {
      throw new UnauthorizedException('Invalid or expired token');
    }

    if (!payload.scope?.includes('api:read')) {
      throw new ForbiddenException('Insufficient scope');
    }

    // Revocation check — even short-lived tokens must be immediately invalidatable
    const isRevoked = await this.tokenBlacklistService.isRevoked(payload.jti);
    if (isRevoked) {
      throw new UnauthorizedException('Token has been revoked');
    }

    return payload;
  }
}
```

| Validation Item | Meaning | Risk if Omitted |
|---|---|---|
| `iss` (Issuer) | Verifies the token issuer | Forged tokens may be accepted |
| `aud` (Audience) | Verifies the token is intended for this API | Tokens issued for other services can be reused |
| `exp` (Expiry) | Verifies the expiration time | Perpetually valid tokens are allowed |
| `scope` | Verifies the permission scope | Excessive access is permitted |
| `jti` + revocation | Enables per-token invalidation | Leaked tokens cannot be immediately blocked |
There is one tradeoff worth noting. A blacklist based on a centralized store like Redis is convenient, but a Redis failure becomes a single point of failure that translates directly to an authentication outage. To prepare for this, consider a Redis sentinel or cluster configuration, or explicitly define a fallback policy (allow-by-default vs. deny-by-default) for when the blacklist lookup fails.
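That fallback policy is worth isolating in one place so the fail-open/fail-closed choice is deliberate rather than accidental. Here is an illustrative Python sketch (the names are made up, not from any library):

```python
# Illustrative sketch: making the blacklist-outage fallback an explicit decision.
class BlacklistUnavailableError(Exception):
    """Raised when the revocation store (e.g. Redis) cannot be reached."""

def is_token_usable(jti: str, is_revoked, fail_open: bool = False) -> bool:
    """is_revoked: callable that queries the blacklist; fail_open: outage policy."""
    try:
        return not is_revoked(jti)
    except BlacklistUnavailableError:
        # deny-by-default (fail_open=False) is the safer Zero Trust choice;
        # fail_open=True trades safety for availability during an outage
        return fail_open
```

With short-lived tokens (15 minutes or less), denying requests during a brief Redis outage is usually an acceptable cost; the important part is that the choice is written down, not implied by an unhandled exception.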
mTLS + SPIRE Service-to-Service Authentication: Encrypting Internal Communication
If you want to extend Zero Trust into service-to-service communication, this pattern is its core. The language may differ, but the principle is the same. The following is a Python example — note that this is pseudocode illustrating a concept, not a reflection of the actual pyspiffe library API.
```python
# pseudocode: conceptual example of obtaining an SVID via SPIRE and establishing an mTLS connection
# may differ from the actual pyspiffe SDK API
import grpc

async def call_payment_service(request_data: dict):
    # Fetch the auto-renewed SVID (SPIFFE Verifiable Identity Document) from the SPIRE Agent
    svid = get_svid_from_spire_agent()  # see pyspiffe WorkloadApiClient for real implementation

    # Configure mTLS channel based on the SVID
    credentials = grpc.ssl_channel_credentials(
        root_certificates=svid.bundle.x509_authorities(),
        private_key=svid.private_key,
        certificate_chain=svid.cert_chain,
    )
    async with grpc.aio.secure_channel(
        'payment-service:443', credentials
    ) as channel:
        stub = PaymentServiceStub(channel)
        return await stub.ProcessPayment(request_data)
```

The key point here is that there are no hardcoded certificates or keys anywhere in the code. Since the SPIRE Agent periodically renews certificates, there's no need to worry about expiration. When I first saw this structure, I thought "so where are the certificates?" — and that's exactly what makes SPIRE so elegant.
The latency overhead of mTLS is in the range of a few milliseconds for the initial handshake, but it varies greatly depending on the network environment, certificate size, and hardware. With connection pooling and session reuse applied, it becomes nearly negligible for repeated requests.
East-West Traffic: Unlike client↔server traffic (North-South), this refers to traffic exchanged between internal services. In Kubernetes environments, this is the biggest blind spot for Zero Trust. Using a service mesh like Istio automatically applies mTLS to inter-pod communication.
OPA Authorization Policy Separation: Moving Permission Logic Outside of Code
Authentication ("who are you?") and authorization ("are you allowed to do this?") are different things. Yet in practice, the two are often tangled together inside service code. OPA lets you manage complex authorization policies as code.
The key is that policies live outside of the service code. Being able to change policies without a deployment is more powerful than it might seem.
```rego
# OPA Rego policy — order retrieval permissions
# Dependencies: OPA server (docker run openpolicyagent/opa)
package api.orders

default allow = false

allow {
    # It is a GET request,
    input.method == "GET"

    # the path is /api/v1/orders/{id},
    [_, _, _, order_id] = input.path

    # the token has the orders:read scope,
    input.token.payload.scope[_] == "orders:read"

    # and the requester is the owner of the order
    input.token.payload.sub == data.order_owners[order_id]
}
```

```typescript
// OPA integration in NestJS — delegating all authorization decisions to OPA
// Dependencies: @nestjs/axios, rxjs
import { Injectable, CanActivate, ExecutionContext } from '@nestjs/common';
import { HttpService } from '@nestjs/axios';
import { firstValueFrom } from 'rxjs';

@Injectable()
export class OpaGuard implements CanActivate {
  constructor(private readonly httpService: HttpService) {}

  async canActivate(context: ExecutionContext): Promise<boolean> {
    const request = context.switchToHttp().getRequest();
    const { data } = await firstValueFrom(
      this.httpService.post('http://opa:8181/v1/data/api/orders/allow', {
        input: {
          method: request.method,
          path: request.path.split('/').filter(Boolean),
          token: { payload: request.user },
        },
      }),
    );
    return data.result === true;
  }
}
```

When a new permission rule is needed, only the OPA policy needs to be updated — no service redeployment required. Multiple services can point to the same OPA instance, so policies don't become scattered. At first I thought, "can't we just handle this with if-statements inside the service?" — but once the number of services exceeded ten and I saw each one making authorization decisions by different criteria, the value of OPA became clear.
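To see exactly what the guard sends and what the policy decides, here is a plain-Python mirror of the Rego rule (illustrative, for local reasoning only; in production OPA evaluates the Rego, not a function like this):

```python
# Illustrative: a Python rendering of the Rego `allow` rule, useful for
# reasoning about the decision input the guard posts to OPA.
def allow(inp: dict, order_owners: dict) -> bool:
    path = inp["path"]                      # e.g. ["api", "v1", "orders", "42"]
    if inp["method"] != "GET" or len(path) != 4:
        return False
    order_id = path[3]
    payload = inp["token"]["payload"]
    if "orders:read" not in payload.get("scope", []):
        return False                        # missing scope
    return payload.get("sub") == order_owners.get(order_id)  # ownership check

request = {
    "method": "GET",
    "path": ["api", "v1", "orders", "42"],
    "token": {"payload": {"sub": "alice", "scope": ["orders:read"]}},
}
print(allow(request, {"42": "alice"}))  # → True
print(allow(request, {"42": "bob"}))   # → False
```

Writing the rule out this way also makes a good unit test fixture: the same `input` document can be fed to `opa eval` to confirm the Rego policy agrees.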
Pros and Cons Analysis
Advantages
| Item | Details |
|---|---|
| Blocks lateral movement | Even if one service is compromised, the attacker's range of movement is significantly reduced |
| Fine-grained audit logs | All API calls are recorded, making forensics and compliance response much easier |
| Suited for cloud and multi-cloud | Consistent security policies can be maintained even in distributed environments with blurry perimeters |
| Shadow API detection | Unregistered API endpoints are naturally surfaced during continuous verification |
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| Performance overhead | Initial mTLS handshake adds a few milliseconds of latency | Can be mitigated with connection pooling and session reuse |
| Operational complexity | Certificate lifecycle management, token renewal, and policy updates all increase — honestly, it's a hassle | Recommend automating with SPIRE and HashiCorp Vault |
| Cross-team consistency | Mixing OAuth, JWT, and mTLS across teams breaks standardization | Best to standardize on a single approach at the API Gateway |
| East-West traffic | Kubernetes pod-to-pod communication has no encryption by default | Automatic mTLS can be applied by introducing the Istio service mesh |
| Token blacklist scalability | A single Redis node failure can take authentication down with it | Reduce dependency with a cluster configuration or short-lived tokens |
| Initial adoption cost | PKI infrastructure, Identity Provider, and API Gateway setup are required | Realistically, start with a single API Gateway and expand incrementally |
PKI (Public Key Infrastructure): The entire infrastructure for issuing, managing, and revoking certificates. This infrastructure must be in place to operate mTLS.
The Most Common Mistakes in Practice
- Validating only `exp` and skipping revocation — if a token is stolen, there's no way to stop it until it expires. Using a `jti`-based blacklist or short-lived tokens (within 15 minutes) together is recommended.
- Granting excessive permissions to service accounts — the strategy of "give broad access for now and tighten it later" almost never gets tightened in practice. Starting with least privilege from the beginning is far wiser.
- Leaving internal service APIs open with no authentication — the assumption that "external access isn't possible anyway" is the most dangerous mindset. Assuming the internal network is already compromised is the starting point of Zero Trust.
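The fix for the first mistake starts at issuance time. Here is a stdlib-only Python sketch (the issuer and audience values are illustrative, and actual signing is left to a JWT library) of minting claims with a 15-minute lifetime and a unique `jti` that a blacklist can later target:

```python
# Illustrative: issuing short-lived claims with a revocable jti (stdlib only;
# signing into an actual JWT is left to a library such as PyJWT/jsonwebtoken).
import time
import uuid

def short_lived_claims(sub: str, scope: list, ttl_seconds: int = 900) -> dict:
    now = int(time.time())
    return {
        "iss": "https://auth.mycompany.com",  # must match the gateway's expected issuer
        "aud": "api.mycompany.com",
        "sub": sub,
        "scope": scope,
        "iat": now,
        "exp": now + ttl_seconds,             # 15-minute lifetime by default
        "jti": str(uuid.uuid4()),             # unique ID enabling per-token revocation
    }
```

Because every token carries a distinct `jti`, revoking one leaked token never requires invalidating a whole user or service account.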
Closing Thoughts
Zero Trust is not a project completed all at once — it is a journey of gradually embedding the principle that "all trust must be proven" across the entire system.
In practice, NIST also recommends phased migration over a big-bang transition. There's no need to overhaul the entire infrastructure right now. It's best to start step by step in the order below.
- You can start by hardening JWT validation — check your current code to see if all four of `iss`, `aud`, `exp`, and `scope` are being validated. If `jti` revocation is missing, switching to short-lived tokens (15 minutes) is the priority.
- You can try applying an mTLS pilot between just two services — if Istio feels like too much, starting with `cert-manager` is perfectly sufficient.
- You can experiment with OPA locally first — spin it up with `docker run openpolicyagent/opa` and run `curl -X POST http://localhost:8181/v1/data/...` to verify that policy queries work, and you'll get a feel for it.
How does your team handle authentication between internal services? Share it in the comments and let's discuss together.
Next post: A step-by-step guide to building microservice mTLS in a Kubernetes environment using only Envoy + SPIRE, without a service mesh
References
- NIST SP 800-207: Zero Trust Architecture | NIST
- NIST SP 800-207A: Zero Trust for Multi-Cloud Native Environments | NIST
- Implementing Zero Trust APIs | Curity
- Zero Trust API Security Explained | A10 Networks
- Zero Trust with Envoy, SPIRE and OPA | Styra
- SPIFFE Official Site
- Zero Trust is Not Enough: Evolving Cloud Security in 2025 | CSA
- API Gateway Authentication Patterns: JWT, OAuth2, mTLS | Elysiate
- Zero Trust Microservices Security | Springfuse