Harness Engineering: Environment Design Guide for AI Agent Production
If you have ever deployed an AI agent into production, you have likely experienced this at least once: the model is smart enough, but the results are far from reliable, and no matter how much you refine the prompts, the fundamental instability does not go away. By the end of this article, you will understand the real cause of that instability and be able to design your own agent harness and apply it to your team's codebase right away. This post summarizes the harness engineering paradigm (which is rapidly taking hold now that OpenAI, Anthropic, and Thoughtworks have each published official articles on it in early 2026), with practical examples, for teams introducing AI agents for the first time as well as teams stabilizing agents already in operation.
Key Concepts
"It's not the horse, it's the harness"
The term "harness" is a metaphor derived from the harness worn to control a horse. It can be expressed as a formula:

AI model (the horse) + harness (environment and control) = agent

To start with the conclusion: an agent's true competitiveness comes not from the model itself, but from the system designed around it. Whether it is Claude, GPT-4, or Gemini, the major models are converging to similar levels as of 2026, so swapping the model does not significantly change performance. What makes the real difference is the harness.
This perspective is not entirely separate from existing MLOps or DevOps practice. Just as MLOps engineers the model training and deployment pipeline, harness engineering engineers the entire execution environment in which an already deployed inference model operates. The difference is the target: not the pipeline, but the context and authority of the agents.
What a harness includes
- The list of tools the agent can call
- The source and form of the information (context) the agent accesses
- How the agent's decisions are verified
- Criteria for when the agent must stop
- The overall development environment: repository structure, CI setup, linters, formatters, and so on
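As a rough sketch (the field names here are illustrative, not taken from any of the cited articles), these components can be captured as a single declarative spec that the rest of the harness reads from:

```python
from dataclasses import dataclass, field

@dataclass
class HarnessSpec:
    """Illustrative container for the harness components listed above."""
    tools: list[str] = field(default_factory=list)            # tools the agent may call
    context_sources: list[str] = field(default_factory=list)  # where the agent gets information
    verifiers: list[str] = field(default_factory=list)        # how decisions are checked (tests, linters, CI)
    stop_conditions: list[str] = field(default_factory=list)  # when the agent must halt

spec = HarnessSpec(
    tools=["read_file", "run_tests"],
    context_sources=["/docs/adr", "/docs/openapi.yaml"],
    verifiers=["pytest", "ruff"],
    stop_conditions=["secret detected in diff", "test suite fails twice"],
)
```

Keeping the spec in one place makes it diffable and reviewable, which matters once the harness itself starts to evolve.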
Evolutionary Flow: Prompt → Context → Harness
- Prompt engineering → design what you say to the model
- Context engineering → design what information the model can see
- Harness engineering → design the entire environment in which the model operates

What is a context window? It is the maximum span of text an LLM can process at once. Information that does not fit within this window cannot be referenced by the agent. Team knowledge buried in Slack threads or Google Docs falls into this category.
As the OpenAI team put it, give the agent a map, not a 1,000-page manual. Knowledge that lives outside the codebase might as well not exist to the agent.
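One way to build such a "map" (a minimal sketch, not from any of the cited articles) is to generate a compact directory index the agent can hold in context, instead of feeding it every file's full contents:

```python
from pathlib import Path

def build_repo_map(root: str, max_depth: int = 2) -> str:
    """Produce a short directory 'map' for the agent's context,
    rather than the 1,000-page manual of full file contents."""
    root_path = Path(root)
    lines = []
    for path in sorted(root_path.rglob("*")):
        rel = path.relative_to(root_path)
        depth = len(rel.parts) - 1
        if depth >= max_depth:          # keep the map shallow and cheap
            continue
        indent = "  " * depth
        suffix = "/" if path.is_dir() else ""
        lines.append(f"{indent}{rel.name}{suffix}")
    return "\n".join(lines)
```

The agent reads the map first, then requests only the specific files it needs, which keeps the context window for actual work.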
The Three Main Components of a Harness (Birgitta Böckeler Classification)
This classification was organized by Birgitta Böckeler, Distinguished Engineer at Thoughtworks, in an article on martinfowler.com. It is often referred to as the "Martin Fowler classification," but the actual author is Böckeler.
| Component | Description |
|---|---|
| Context engineering | A knowledge base embedded in the codebase, plus dynamic sources such as observability data and browser navigation |
| Architectural constraints | LLM-based guardrails plus deterministic structural tests |
| Garbage collection | Automatic and manual detection and removal of dead code, unnecessary files, and convention drift |
ArchUnit, mentioned under architectural constraints, is a testing tool that automatically verifies code dependency rules. It runs in CI to enforce architectural rules an LLM might violate, such as "the service layer must not directly reference controllers."
Practical Application
Designing Harnesses with the IMPACT Framework
This is a checklist that can be used when initially designing the agent harness. You can see how each item is implemented in the code examples that follow.
| Element | Description | Practical Question |
|---|---|---|
| Intent | Define the agent's purpose and goals | What should this agent do? |
| Memory | Manage short-term and long-term memory | What needs to be remembered between sessions? |
| Planning | Decompose and plan the work | How should a large task be broken down? |
| Authority | Restrict the agent's scope of authority | What must the agent not touch? |
| Control Flow | Handle execution flow and errors | How do we recover from failure? |
| Tools | Define the available tools | What is the minimum set of tools required? |
Example 1: Short-term Job — 2-Agent Research & Writing Pipeline
This is the most basic harness pattern: research and writing are separated. `shared_state` acts as shared memory between the agents. In the code below, `research_agent` writes its result to `shared_state`, and `writing_agent` reads it back.
```python
import anthropic

client = anthropic.Anthropic()

# Shared state: the data-handoff hub between agents (Memory)
shared_state: dict = {
    "topic": "",
    "research_result": "",
}

def research_agent(topic: str) -> str:
    """Intent: research the topic and store a structured summary in shared state."""
    try:
        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=2048,
            system="""You are a technical research specialist.
Research the given topic and summarize the key points as bullet points.
Output the result in a structured form that the next agent can consume.""",
            messages=[{"role": "user", "content": f"Research the following topic: {topic}"}]
        )
        result = response.content[0].text
        shared_state["research_result"] = result  # Memory: save to shared state
        return result
    except anthropic.APIError as e:
        print(f"[research_agent] API error: {e}")
        raise

def writing_agent(tone: str = "technical blog") -> str:
    """Intent: read the research result from shared state and draft a blog post."""
    research = shared_state.get("research_result", "")
    if not research:
        raise ValueError("No research result found. Run research_agent first.")
    try:
        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=4096,
            system=f"""You are a professional {tone} writer.
Write a reader-friendly blog post based on the provided research material.
Use Markdown formatting and include code examples.""",
            messages=[{
                "role": "user",
                "content": f"Write a blog post from the following research material:\n\n{research}"
            }]
        )
        return response.content[0].text
    except anthropic.APIError as e:
        print(f"[writing_agent] API error: {e}")
        raise

def run_blog_harness(topic: str) -> str:
    shared_state["topic"] = topic
    research_agent(topic)          # → updates shared_state["research_result"]
    blog_post = writing_agent()    # ← reads shared_state["research_result"]
    return blog_post

result = run_blog_harness("harness engineering")
```

Analysis of this harness from an IMPACT perspective:
| Element | Implementation Method |
|---|---|
| Intent | Two agents, each performing a single role (role separation) |
| Memory | shared_state["research_result"] serves as shared memory between the agents |
| Authority | Each agent can only call the Claude API; there is no file-system access |
| Control Flow | If research_agent fails, writing_agent cannot proceed (order is enforced) |
| Tools | A single tool per agent (the Claude API), following the principle of least privilege |
Example 2: Long-Running Work — Anthropic's 3-Agent Harness
This pattern is used for tasks that exceed a single agent's context window, such as coding sessions that run for hours. The core mechanism is context isolation: each agent has an independent context window, and only summarized output is passed between stages. As a result, the pipeline as a whole does not stall even when one agent's context fills up.
```
[Planning Agent]   →   [Generation Agent]     →   [Evaluation Agent]
- build task list      - write the code           - verify results
- set priorities       - modify/create files      - run tests
- summarize context    - call APIs                - generate feedback
       ↑                                                  |
       └──────────────────── feedback loop ───────────────┘
```

Looking at each stage from an IMPACT perspective:
- Planning: owns work decomposition (P) and intent (I).
- Generation: owns tool (T) usage and control flow (C).
- Evaluation: owns authority-boundary (A) checks and the feedback loop (C).
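The loop above can be sketched as follows. This is a minimal skeleton with stub functions standing in for the real LLM-backed agents (the stubs and their return values are illustrative, not Anthropic's implementation); the point is the shape of the control flow, where only short summaries cross stage boundaries:

```python
def run_three_agent_harness(task: str, max_rounds: int = 3) -> dict:
    """Skeleton of the plan → generate → evaluate loop.
    Each agent would run in its own context window; only short
    summaries cross the boundaries between stages."""
    plan_summary = plan(task)              # Planning agent: task list + priorities, summarized
    for round_no in range(max_rounds):
        artifact = generate(plan_summary)  # Generation agent sees only the plan summary
        feedback = evaluate(artifact)      # Evaluation agent sees only the artifact
        if feedback["passed"]:
            return {"artifact": artifact, "rounds": round_no + 1}
        plan_summary = f"{plan_summary}\nFix: {feedback['notes']}"  # feedback loop
    return {"artifact": artifact, "rounds": max_rounds}

# Stub agents standing in for real LLM calls:
def plan(task: str) -> str:
    return f"1. implement {task}\n2. add tests"

def generate(plan_summary: str) -> str:
    return f"code for: {plan_summary.splitlines()[0]}"

def evaluate(artifact: str) -> dict:
    return {"passed": "implement" in artifact, "notes": "missing implementation"}
```

Because each stage consumes only a summary of the previous one, a full context in one agent never poisons the others.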
Example 3: Team Environment — Setting Up a Claude Code Harness with CLAUDE.md
If you use Claude Code as the team's agent, CLAUDE.md is its harness configuration file. The IMPACT items can be mapped directly onto its sections.
```markdown
# CLAUDE.md (example harness configuration)

## Purpose of this agent (Intent)
- Maintain and extend the backend API of this repository.
- Frontend changes are out of scope.

## Permitted tool scope (Authority)
- File read/write: src/ directory only
- Never modify: .env, secrets/, prisma/migrations/
- Any DB schema change requires human review

## Context guide (Context Engineering)
- Architecture decisions: see /docs/adr/
- API specification: see /docs/openapi.yaml
- Coding conventions: see /docs/conventions.md

## Control flow rules (Control Flow)
- Stop immediately if a possible external API key exposure is detected
- Never modify business logic without tests
```

Pros and Cons Analysis
Pros: Why Harness Engineering Now?
| Item | Content |
|---|---|
| Predictability | Stable results because agent behavior is governed by environment design |
| Scalability | Performance can improve by upgrading only the harness, without replacing the model |
| Productivity | OpenAI reports roughly 10x speedups over manual work in its own experiments |
| Model independence | Competitiveness lives at the system level, with no dependence on a specific LLM |
| Long-running sessions | The 3-agent separation structurally works around context-window limits |
Disadvantages and Precautions
| Item | Content | Mitigation |
|---|---|---|
| Design complexity | The harness itself adds engineering complexity | Start with 2 agents and scale as needed |
| Garbage accumulation | Dead code and unnecessary files generated by the agent | Schedule regular garbage-collection routines |
| Context-design burden | The information the agent accesses must be structured | Start by refining CLAUDE.md, ADRs, and OpenAPI specs |
| Excessive control flow | Complex conditional branching confuses the agent | Stick to the atomic tool design principle |
| Learning cost | A different mindset from traditional prompt engineering | Approach it systematically with the IMPACT framework |
Atomic tool design is the principle that each tool performs exactly one task. It is better to separate `search_file()` and `update_file()` than to bundle them into a `search_and_update_file()`. When a tool bundles multiple functions, the agent is prone to producing unintended side effects.
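A minimal sketch of the contrast, using the function names from the paragraph above (the bodies are illustrative):

```python
# Compound tool -- the agent cannot express "just search",
# so every call carries write-side effects:
def search_and_update_file(path: str, pattern: str, replacement: str) -> None:
    ...  # search and mutation are inseparable

# Atomic tools -- one job each; the harness decides how to compose them:
def search_file(path: str, pattern: str) -> list[int]:
    """Return the line numbers where pattern occurs (read-only)."""
    with open(path) as f:
        return [i for i, line in enumerate(f, 1) if pattern in line]

def update_file(path: str, old: str, new: str) -> None:
    """Replace old with new -- the only tool with write access."""
    with open(path) as f:
        text = f.read()
    with open(path, "w") as f:
        f.write(text.replace(old, new))
```

With the atomic versions, the Authority section of the harness can grant `search_file` freely while gating `update_file` behind review.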
The Most Common Mistakes in Practice
- Giving the agent too many tools — hand it every tool and it will wander. Stick to the principle of least privilege.
- Keeping context outside the codebase — knowledge found only in Slack or Confluence might as well not exist for the agent.
- Designing the harness only once — as the codebase changes, the harness must evolve with it. Regular reviews are essential.
In Conclusion
Harness engineering can be summarized in a single line: "The success or failure of an AI agent depends not on the model, but on the environment surrounding it." The era of refining prompts is over. Now, the core competency is designing an environment where the agent can make the right decisions.
Three steps to start right now:
1. Context audit — run `git grep -r "TODO\|FIXME\|slack.com\|notion.so"` in the terminal to find knowledge links that have escaped the codebase. Also open the team onboarding documentation and move every item that says "you have to ask someone directly" into `/docs`.
2. Create an IMPACT checklist — apply the six IMPACT items to the agent you currently use and identify the gaps. In particular, if Authority is empty, start by defining the list of files the agent must not touch.
3. Start small — begin with the 2-agent pattern (research + write, or plan + execute) and expand to the 3-agent pattern once you are comfortable.
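The context audit in step 1 can also be scripted so it runs in CI rather than once by hand. A minimal sketch (the marker patterns and file extensions are illustrative) that counts external-knowledge markers per file:

```python
import re
from pathlib import Path

# Markers suggesting knowledge that lives outside the codebase
EXTERNAL_KNOWLEDGE = re.compile(r"TODO|FIXME|slack\.com|notion\.so")

def audit_context(root: str, extensions=(".py", ".md")) -> dict[str, int]:
    """Count external-knowledge markers per file under root."""
    hits = {}
    for path in Path(root).rglob("*"):
        if path.suffix not in extensions or not path.is_file():
            continue
        count = len(EXTERNAL_KNOWLEDGE.findall(path.read_text(errors="ignore")))
        if count:
            hits[str(path)] = count
    return hits
```

Wiring this into CI turns "keep the context inside the codebase" from a resolution into an enforced rule.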
In a world where models are standardized, the team that designs the harness well wins.
Next Post: Multi-Agent Orchestration Patterns — How to Coordinate When There Are More Than 3 Agents? Comparing the State Machine Approaches of LangGraph and CrewAI.
Reference Materials
- Harness engineering: leveraging Codex in an agent-first world | OpenAI
- Effective harnesses for long-running agents | Anthropic Engineering
- Harness engineering for coding agent users | martinfowler.com (Birgitta Böckeler)
- OpenAI Introduces Harness Engineering: Codex Agents Power Large-Scale Software Development | InfoQ
- Anthropic Designs Three-Agent Harness for Long-Running AI Development | InfoQ
- The Anatomy of an Agent Harness | Daily Dose of Data Science
- Agent Engineering: Harness Patterns, IMPACT Framework & Coding Agent Architecture
- What is Harness Engineering? Why It Is Emerging as the Core of AI Agent Development by 2026 | Channel.io
- Unlocking the Codex harness: how we built the App Server | OpenAI