Harness Engineering: Environment Design Guide for AI Agent Production
If you have ever deployed an AI agent into production, you have likely experienced this at least once: the model is smart enough, but the results are far from reliable, and no matter how much you refine the prompts, the fundamental instability does not go away. By the end of this article, you will understand the real cause of that instability and be able to design your own agent harness and apply it to your team's codebase right away. This post summarizes the harness engineering paradigm (which is rapidly taking hold now that OpenAI, Anthropic, and Thoughtworks have each published official articles on it in early 2026), with practical examples, for teams introducing AI agents for the first time as well as teams stabilizing agents already in operation.
Key Concepts
"It's not the horse, it's the harness"
The term "harness" is a metaphor derived from the harness worn to control a horse. It can be expressed as a formula:

AI model (the horse) + harness (environment and control) = agent

To start with the conclusion: an agent's true competitiveness comes not from the model itself, but from the system designed around it. Whether it is Claude, GPT-4, or Gemini, the major models are converging to similar levels as of 2026, so swapping the model does not significantly change performance. What makes the real difference is the harness.
This perspective is not entirely separate from existing MLOps or DevOps practice. Just as MLOps engineers the model training and deployment pipeline, harness engineering engineers the entire execution environment in which an already deployed inference model operates. The difference is the target: not the pipeline, but the context and authority of the agents.
What a harness includes
- The list of tools the agent can call
- The source and form of the information (context) the agent accesses
- How the agent's decisions are verified
- Criteria for when the agent must stop
- The overall development environment: repository structure, CI setup, linters, formatters, and so on
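As a rough sketch (the field names here are illustrative, not taken from any of the cited articles), these components can be captured as a single declarative spec that the rest of the harness reads from:

```python
from dataclasses import dataclass, field

@dataclass
class HarnessSpec:
    """Illustrative container for the harness components listed above."""
    tools: list[str] = field(default_factory=list)            # tools the agent may call
    context_sources: list[str] = field(default_factory=list)  # where the agent gets information
    verifiers: list[str] = field(default_factory=list)        # how decisions are checked (tests, linters, CI)
    stop_conditions: list[str] = field(default_factory=list)  # when the agent must halt

spec = HarnessSpec(
    tools=["read_file", "run_tests"],
    context_sources=["/docs/adr", "/docs/openapi.yaml"],
    verifiers=["pytest", "ruff"],
    stop_conditions=["secret detected in diff", "test suite fails twice"],
)
```

Keeping the spec in one place makes it diffable and reviewable, which matters once the harness itself starts to evolve.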
Evolutionary Flow: Prompt → Context → Harness
- Prompt engineering → design what you say to the model
- Context engineering → design what information the model can see
- Harness engineering → design the entire environment in which the model operates

What is a context window? It is the maximum span of text an LLM can process at once. Information that does not fit within this window cannot be referenced by the agent. Team knowledge buried in Slack threads or Google Docs falls into this category.
As the OpenAI team put it, give the agent a map, not a 1,000-page manual. Knowledge that lives outside the codebase might as well not exist to the agent.
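One way to build such a "map" (a minimal sketch, not from any of the cited articles) is to generate a compact directory index the agent can hold in context, instead of feeding it every file's full contents:

```python
from pathlib import Path

def build_repo_map(root: str, max_depth: int = 2) -> str:
    """Produce a short directory 'map' for the agent's context,
    rather than the 1,000-page manual of full file contents."""
    root_path = Path(root)
    lines = []
    for path in sorted(root_path.rglob("*")):
        rel = path.relative_to(root_path)
        depth = len(rel.parts) - 1
        if depth >= max_depth:          # keep the map shallow and cheap
            continue
        indent = "  " * depth
        suffix = "/" if path.is_dir() else ""
        lines.append(f"{indent}{rel.name}{suffix}")
    return "\n".join(lines)
```

The agent reads the map first, then requests only the specific files it needs, which keeps the context window for actual work.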
The Three Main Components of a Harness (Birgitta Böckeler Classification)
This classification was organized by Birgitta Böckeler, Distinguished Engineer at Thoughtworks, in an article on martinfowler.com. It is often referred to as the "Martin Fowler classification," but the actual author is Böckeler.
| Component | Description |
|---|---|
| Context engineering | A knowledge base embedded in the codebase, plus dynamic sources such as observability data and browser navigation |
| Architectural constraints | LLM-based guardrails plus deterministic structural tests |
| Garbage collection | Automatic and manual detection and removal of dead code, unnecessary files, and convention drift |
ArchUnit, mentioned under architectural constraints, is a testing tool that automatically verifies code dependency rules. It runs in CI to enforce architectural rules an LLM might violate, such as "the service layer must not directly reference controllers."
Practical Application
Designing Harnesses with the IMPACT Framework
This is a checklist that can be used when initially designing the agent harness. You can see how each item is implemented in the code examples that follow.
| Element | Description | Practical Question |
|---|---|---|
| Intent | Define the agent's purpose and goals | What should this agent do? |
| Memory | Manage short-term and long-term memory | What needs to be remembered between sessions? |
| Planning | Decompose and plan the work | How should a large task be broken down? |
| Authority | Restrict the agent's scope of authority | What must the agent not touch? |
| Control Flow | Handle execution flow and errors | How do we recover from failure? |
| Tools | Define the available tools | What is the minimum set of tools required? |
Example 1: Short-term Job — 2-Agent Research & Writing Pipeline
This is the most basic harness pattern: research and writing are separated. `shared_state` acts as shared memory between the agents. In the code below, `research_agent` writes its result to `shared_state`, and `writing_agent` reads it back.
```python
import anthropic

client = anthropic.Anthropic()

# Shared state: the data-handoff hub between agents (Memory)
shared_state: dict = {
    "topic": "",
    "research_result": "",
}

def research_agent(topic: str) -> str:
    """Intent: research the topic and store a structured summary in shared state."""
    try:
        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=2048,
            system="""You are a technical research specialist.
Research the given topic and summarize the key points as bullet points.
Output the result in a structured form that the next agent can consume.""",
            messages=[{"role": "user", "content": f"Research the following topic: {topic}"}]
        )
        result = response.content[0].text
        shared_state["research_result"] = result  # Memory: save to shared state
        return result
    except anthropic.APIError as e:
        print(f"[research_agent] API error: {e}")
        raise

def writing_agent(tone: str = "technical blog") -> str:
    """Intent: read the research result from shared state and draft a blog post."""
    research = shared_state.get("research_result", "")
    if not research:
        raise ValueError("No research result found. Run research_agent first.")
    try:
        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=4096,
            system=f"""You are a professional {tone} writer.
Write a reader-friendly blog post based on the provided research material.
Use Markdown formatting and include code examples.""",
            messages=[{
                "role": "user",
                "content": f"Write a blog post from the following research material:\n\n{research}"
            }]
        )
        return response.content[0].text
    except anthropic.APIError as e:
        print(f"[writing_agent] API error: {e}")
        raise

def run_blog_harness(topic: str) -> str:
    shared_state["topic"] = topic
    research_agent(topic)          # → updates shared_state["research_result"]
    blog_post = writing_agent()    # ← reads shared_state["research_result"]
    return blog_post

result = run_blog_harness("harness engineering")
```

Analysis of this harness from an IMPACT perspective:
| Element | Implementation Method |
|---|---|
| Intent | Two agents, each performing a single role (role separation) |
| Memory | shared_state["research_result"] serves as shared memory between the agents |
| Authority | Each agent can only call the Claude API; there is no file-system access |
| Control Flow | If research_agent fails, writing_agent cannot proceed (order is enforced) |
| Tools | A single tool per agent (the Claude API), following the principle of least privilege |
Example 2: Long-Running Work — Anthropic's 3-Agent Harness
This pattern is used for tasks that exceed a single agent's context window, such as coding sessions that run for hours. The core mechanism is context isolation: each agent has an independent context window, and only summarized output is passed between stages. As a result, the pipeline as a whole does not stall even when one agent's context fills up.
```
[Planning Agent]   →   [Generation Agent]     →   [Evaluation Agent]
- build task list      - write the code           - verify results
- set priorities       - modify/create files      - run tests
- summarize context    - call APIs                - generate feedback
       ↑                                                  |
       └──────────────────── feedback loop ───────────────┘
```

Looking at each stage from an IMPACT perspective:
- Planning: owns work decomposition (P) and intent (I).
- Generation: owns tool (T) usage and control flow (C).
- Evaluation: owns authority-boundary (A) checks and the feedback loop (C).
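The loop above can be sketched as follows. This is a minimal skeleton with stub functions standing in for the real LLM-backed agents (the stubs and their return values are illustrative, not Anthropic's implementation); the point is the shape of the control flow, where only short summaries cross stage boundaries:

```python
def run_three_agent_harness(task: str, max_rounds: int = 3) -> dict:
    """Skeleton of the plan → generate → evaluate loop.
    Each agent would run in its own context window; only short
    summaries cross the boundaries between stages."""
    plan_summary = plan(task)              # Planning agent: task list + priorities, summarized
    for round_no in range(max_rounds):
        artifact = generate(plan_summary)  # Generation agent sees only the plan summary
        feedback = evaluate(artifact)      # Evaluation agent sees only the artifact
        if feedback["passed"]:
            return {"artifact": artifact, "rounds": round_no + 1}
        plan_summary = f"{plan_summary}\nFix: {feedback['notes']}"  # feedback loop
    return {"artifact": artifact, "rounds": max_rounds}

# Stub agents standing in for real LLM calls:
def plan(task: str) -> str:
    return f"1. implement {task}\n2. add tests"

def generate(plan_summary: str) -> str:
    return f"code for: {plan_summary.splitlines()[0]}"

def evaluate(artifact: str) -> dict:
    return {"passed": "implement" in artifact, "notes": "missing implementation"}
```

Because each stage consumes only a summary of the previous one, a full context in one agent never poisons the others.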
Example 3: Team Environment — Setting Up a Claude Code Harness with CLAUDE.md
If you use Claude Code as the team's agent, CLAUDE.md is its harness configuration file. The IMPACT items can be mapped directly onto its sections.
```markdown
# CLAUDE.md (example harness configuration)

## Purpose of this agent (Intent)
- Maintain and extend the backend API of this repository.
- Frontend changes are out of scope.

## Permitted tool scope (Authority)
- File read/write: src/ directory only
- Never modify: .env, secrets/, prisma/migrations/
- Any DB schema change requires human review

## Context guide (Context Engineering)
- Architecture decisions: see /docs/adr/
- API specification: see /docs/openapi.yaml
- Coding conventions: see /docs/conventions.md

## Control flow rules (Control Flow)
- Stop immediately if a possible external API key exposure is detected
- Never modify business logic without tests
```

Pros and Cons Analysis
Pros: Why Harness Engineering Now?
| Item | Content |
|---|---|
| Predictability | Stable results because agent behavior is governed by environment design |
| Scalability | Performance can improve by upgrading only the harness, without replacing the model |
| Productivity | OpenAI reports roughly 10x speedups over manual work in its own experiments |
| Model independence | Competitiveness lives at the system level, with no dependence on a specific LLM |
| Long-running sessions | The 3-agent separation structurally works around context-window limits |
Disadvantages and Precautions
| Item | Content | Mitigation |
|---|---|---|
| Design complexity | The harness itself adds engineering complexity | Start with 2 agents and scale as needed |
| Garbage accumulation | Dead code and unnecessary files generated by the agent | Schedule regular garbage-collection routines |
| Context-design burden | The information the agent accesses must be structured | Start by refining CLAUDE.md, ADRs, and OpenAPI specs |
| Excessive control flow | Complex conditional branching confuses the agent | Stick to the atomic tool design principle |
| Learning cost | A different mindset from traditional prompt engineering | Approach it systematically with the IMPACT framework |
Atomic tool design is the principle that each tool performs exactly one task. It is better to separate `search_file()` and `update_file()` than to bundle them into a `search_and_update_file()`. When a tool bundles multiple functions, the agent is prone to producing unintended side effects.
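A minimal sketch of the contrast, using the function names from the paragraph above (the bodies are illustrative):

```python
# Compound tool -- the agent cannot express "just search",
# so every call carries write-side effects:
def search_and_update_file(path: str, pattern: str, replacement: str) -> None:
    ...  # search and mutation are inseparable

# Atomic tools -- one job each; the harness decides how to compose them:
def search_file(path: str, pattern: str) -> list[int]:
    """Return the line numbers where pattern occurs (read-only)."""
    with open(path) as f:
        return [i for i, line in enumerate(f, 1) if pattern in line]

def update_file(path: str, old: str, new: str) -> None:
    """Replace old with new -- the only tool with write access."""
    with open(path) as f:
        text = f.read()
    with open(path, "w") as f:
        f.write(text.replace(old, new))
```

With the atomic versions, the Authority section of the harness can grant `search_file` freely while gating `update_file` behind review.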
The Most Common Mistakes in Practice
- Giving the agent too many tools — hand it every tool and it will wander. Stick to the principle of least privilege.
- Keeping context outside the codebase — knowledge found only in Slack or Confluence might as well not exist for the agent.
- Designing the harness only once — as the codebase changes, the harness must evolve with it. Regular reviews are essential.
In Conclusion
Harness engineering can be summarized in a single line: "The success or failure of an AI agent depends not on the model, but on the environment surrounding it." The era of refining prompts is over. Now, the core competency is designing an environment where the agent can make the right decisions.
Three steps to start right now:
1. Context audit — run `git grep -r "TODO\|FIXME\|slack.com\|notion.so"` in the terminal to find knowledge links that have escaped the codebase. Also open the team onboarding documentation and move every item that says "you have to ask someone directly" into `/docs`.
2. Create an IMPACT checklist — apply the six IMPACT items to the agent you currently use and identify the gaps. In particular, if Authority is empty, start by defining the list of files the agent must not touch.
3. Start small — begin with the 2-agent pattern (research + write, or plan + execute) and expand to the 3-agent pattern once you are comfortable.
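The context audit in step 1 can also be scripted so it runs in CI rather than once by hand. A minimal sketch (the marker patterns and file extensions are illustrative) that counts external-knowledge markers per file:

```python
import re
from pathlib import Path

# Markers suggesting knowledge that lives outside the codebase
EXTERNAL_KNOWLEDGE = re.compile(r"TODO|FIXME|slack\.com|notion\.so")

def audit_context(root: str, extensions=(".py", ".md")) -> dict[str, int]:
    """Count external-knowledge markers per file under root."""
    hits = {}
    for path in Path(root).rglob("*"):
        if path.suffix not in extensions or not path.is_file():
            continue
        count = len(EXTERNAL_KNOWLEDGE.findall(path.read_text(errors="ignore")))
        if count:
            hits[str(path)] = count
    return hits
```

Wiring this into CI turns "keep the context inside the codebase" from a resolution into an enforced rule.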
In a world where models are standardized, the team that designs the harness well wins.
Next Post: Multi-Agent Orchestration Patterns — How to Coordinate When There Are More Than 3 Agents? Comparing the State Machine Approaches of LangGraph and CrewAI.
Reference Materials
- Harness engineering: leveraging Codex in an agent-first world | OpenAI
- Effective harnesses for long-running agents | Anthropic Engineering
- Harness engineering for coding agent users | martinfowler.com (Birgitta Böckeler)
- OpenAI Introduces Harness Engineering: Codex Agents Power Large-Scale Software Development | InfoQ
- Anthropic Designs Three-Agent Harness for Long-Running AI Development | InfoQ
- The Anatomy of an Agent Harness | Daily Dose of Data Science
- Agent Engineering: Harness Patterns, IMPACT Framework & Coding Agent Architecture
- What is Harness Engineering? Why It Is Emerging as the Core of AI Agent Development by 2026 | Channel.io
- Unlocking the Codex harness: how we built the App Server | OpenAI