AI Agent Security in Code: A Practical Guide to Defending Against Target Hijacking, Memory Poisoning, and Cascading Failures
If you are building a system where AI agents formulate plans, invoke tools, and delegate tasks to other agents, you must be aware that there are threat categories that cannot be captured by existing static analysis (SAST) or dynamic analysis (DAST) tools alone. A structure where a single user request from an agent is broken down into dozens of internal actions, with the result of each action serving as the input for the next plan, creates an attack surface that traditional request-response-centric security models never assumed. The OWASP Top 10 for Agentic Applications, published by the OWASP GenAI Security Project in December 2025, is a framework designed to fill this very gap. While the existing LLM Top 10 focused on threats involving single prompt-response pairs, the Agentic Top 10 treats the entire multi-stage behavioral cycle—where an agent repeats planning, tool invocation, memory persistence, and delegation—as a threat model.
This article focuses on three of the ten items that have either been issued an actual CVE (ASI01: EchoLeak CVE-2025-32711, CVSS 9.3) or are verified by reproducible research papers (ASI06: arXiv:2603.20357): ASI01 (Goal Hijacking), ASI06 (Memory Poisoning), and ASI08 (Cascading Failure). These three threats do not exist independently but form a chain structure where ASI01 acts as an initial entry point to contaminate memory via ASI06, and that contamination propagates throughout the entire system via ASI08. After reading this article, you will have an architectural standard for deploying the three patterns of GoalLock, Layer 5 Memory Isolation, and Circuit Breaker as defense lines against each threat. This article is intended for backend and full-stack developers who are already building or planning to build agentic AI systems.
Key Concepts
Why Agentic AI Creates Threats That Traditional Security Tools Cannot Catch
Traditional SAST/DAST tools detect static code structures or anomalies in HTTP requests and responses. However, agentic AI attacks bypass byte-level validation because they infiltrate the results of tool calls written in natural language, the embedding layers of vector databases, and the message payloads between agents.
| Classification | Traditional Prompt Injection | Agentic Goal Hijacking |
|---|---|---|
| Span of Impact | Single Response Contamination | Full Control of Agent Planning Engine |
| Persistence | Limited to this request | Weaponizes all subsequent actions |
| Detection Difficulty | Relatively Easy | Natural Language Based, Bypasses Schema Validation |
| Propagation Path | None | Propagate to Memory/Sub-agents |
The Core Paradox of Agentic Security: The factors that make an agent more powerful (planning ability, long-term memory, cooperation with other agents) are also the factors that exponentially expand the attack surface.
Chain Structure of Three Threats: ASI01 → ASI06 → ASI08
These three threats are not independent individual events. In actual attack scenarios, they occur in a chain as follows.
[외부 입력] → ASI01(목표 하이재킹)
↓ 오염된 목표로 메모리 쓰기
ASI06(메모리 포이즈닝)
↓ 오염된 메모리를 공유 RAG에 저장
ASI08(계단식 실패)
↓ 공유 메모리를 읽는 모든 에이전트로 전파
[시스템 전체 장애]Each defense layer blocks one step of this chain. If GoalLock blocks ASI01 entry, ASI06 and ASI08 lose their firing conditions.
ASI01 — Agent Goal Hijacking
Goal hijacking is an attack in which malicious content overwrites the agent's original goal and plan path itself. The attack enters the context window through an external source trusted by the agent. The specific entry paths are as follows:
- Web Search Results: Directives inserted in the
descriptionfield orsnippetarea of the Bing/Google API response - Recipient Email: Hidden text part processed by CSS
display:nonein the HTML body - External Documents: PDF footnotes, comment areas in Word documents, HTML comments in Markdown files(
<!-- ... -->)
As officially acknowledged by OpenAI, complete blocking is theoretically impossible due to the structural limitation that trusted and untrusted inputs are processed in the same context window.
A representative example is EchoLeak (CVE-2025-32711, CVSS 9.3), discovered in Microsoft 365 Copilot in mid-2025. The moment Copilot processed the HTML body part of a crafted email as context, sensitive data was automatically leaked externally without user intervention. The agent's goal was immediately switched from "normal business processing" to "data extraction."
ASI06 — Memory & Context Poisoning
RAG (Retrieval-Augmented Generation): This is a pattern where an agent searches an external knowledge base (documents stored in a vector database) and utilizes it to generate answers. It operates as the agent's "long-term memory," and a structure where multiple agents share the same vector database is common.
Memory poisoning is an attack that inserts malicious instructions into this RAG repository, permanently affecting all subsequent interactions. Unlike prompt injection, which contaminates a single conversation, once successful, all agents accessing that RAG repository are contaminated.
As confirmed in the PoisonedRAG study (arXiv:2603.20357), a single malicious document can propagate throughout the entire system within hours via a shared vector DB. If an agent forms a false belief (e.g., the perception that a specific security policy is invalid), subsequent decisions based on this belief are cascaded and corrupted.
ASI08 — Cascading Failures
Cascading failure is a phenomenon where corruption in one agent propagates to connected tools, shared memory, and subordinate agents, driving the entire system into a failure state. A key risk factor is that natural language-based errors pass type checking or schema validation. Existing monitoring tools struggle to detect agents that deliver semantically incorrect instructions while returning a valid JSON schema. According to Galileo AI's analysis of multi-agent system failures, temporal compounding—where corrupted memory continuously contaminates future operations—is identified as the most significant detrimental factor.
Zero Trust for AI: This is an architectural concept that applies the Zero Trust principle—"Never trust, always verify"—to AI agents. Internal agents are treated as verification targets just like external inputs, and both Cisco and Microsoft released Zero Trust architecture guidelines dedicated to AI agents in the first quarter of 2026.
MCP (Model Context Protocol): This is a protocol designed for AI agents to interact with external tools and services in a standardized manner. The enforcement of least privilege through MCP gateways has become the current industry standard pattern, which will be covered in detail in the following article.
Practical Application
Example 1: Defending Against Goal Hijacking with the GoalLock Mechanism
This is a pattern where the agent signs the initial goal with HMAC and compares the current goal with the initial signature every time after processing external input. If the goal is modified without authorization, execution stops immediately.
import hashlib
import hmac
import re
from dataclasses import dataclass
from typing import Optional
SECRET_KEY = b"your-secret-key-stored-in-env" # 실제 환경에서는 환경 변수로 관리
@dataclass
class GoalLock:
original_goal: str
signature: str
@staticmethod
def create(goal: str) -> "GoalLock":
# hmac.new의 첫 번째 인자는 bytes, digestmod는 키워드 인자로 명시
sig = hmac.new(
SECRET_KEY, goal.encode(), digestmod=hashlib.sha256
).hexdigest()
return GoalLock(original_goal=goal, signature=sig)
def verify(self, current_goal: str) -> bool:
expected = hmac.new(
SECRET_KEY, self.original_goal.encode(), digestmod=hashlib.sha256
).hexdigest()
# hmac.compare_digest: 타이밍 공격(timing attack) 방어
if not hmac.compare_digest(self.signature, expected):
return False # 서명 자체가 조작됨
return current_goal.strip() == self.original_goal.strip()
class SecureAgent:
def __init__(self, goal: str):
self.goal_lock = GoalLock.create(goal)
self.current_goal = goal
def process_external_input(self, external_content: str) -> str:
"""외부 입력을 처리하기 전 목표 무결성 확인"""
sanitized = self._sanitize_input(external_content)
if not self.goal_lock.verify(self.current_goal):
raise SecurityError(
f"Goal integrity violation detected. "
f"Original: '{self.goal_lock.original_goal}' "
f"Current: '{self.current_goal}'"
)
return sanitized
def _sanitize_input(self, content: str) -> str:
"""
다계층 입력 sanitization.
완전 차단 대신 '[SUSPICIOUS_CONTENT_DETECTED]' 마킹을 사용하는 이유:
- 완전 차단 시, LLM은 입력이 잘렸다는 사실을 모른 채 불완전한
컨텍스트로 계획을 진행해 오히려 예측 불가능한 동작을 유발할 수 있습니다.
- 마킹 방식은 LLM이 의심 콘텐츠의 존재를 인식하고 적절히
무시하거나 경고를 포함한 응답을 생성하도록 유도합니다.
- 단, 이 방식은 LLM이 마킹 자체를 무시하거나 학습 컨텍스트로
처리할 수 있으므로, 임베딩 거리 기반 의미론적 탐지와 병행하는
것을 권장합니다.
"""
injection_patterns = [
r"(?i)ignore\s+(all\s+)?previous\s+instructions?",
r"(?i)you\s+are\s+now\s+",
r"(?i)act\s+as\s+",
r"(?i)forget\s+your\s+(previous\s+)?instructions?",
r"(?i)new\s+goal\s*:",
r"(?i)override\s+(previous\s+)?instructions?",
]
for pattern in injection_patterns:
if re.search(pattern, content):
content = f"[SUSPICIOUS_CONTENT_DETECTED] {content}"
break
return content
class SecurityError(Exception):
pass| Code Components | Roles |
|---|---|
GoalLock.create() |
Sign initial target with HMAC-SHA256, create immutable baseline |
GoalLock.verify() |
Verify current goal and signature match before every action |
hmac.compare_digest() |
Constant Time Comparison for Timing Attack Defense |
_sanitize_input() |
Mark after detecting known injection patterns (not complete block) |
SecurityError |
Stop execution immediately upon target tampering detection |
Note: Regular expression-based sanitization only defends against known patterns. Since attackers can bypass static rules through encoding, bypass expressions, and indirect injection, it is recommended to use it in conjunction with embedding distance measurement-based semantic detection.
Example 2: Defending Against Poisoning with 5-Layer Memory Isolation
The key to defending against memory poisoning is to store the source and trustworthiness of each memory entry and verify them at the time of the query.
// Node.js 14.17.0+ 환경 기준
// randomUUID와 createHash 모두 "crypto" 모듈에서 명시적으로 임포트
import { createHash, randomUUID } from "crypto";
interface MemoryEntry {
id: string;
content: string;
// Provenance Tracking: 출처 메타데이터
provenance: {
source: string; // 예: "user_upload" | "web_search" | "agent_internal"
timestamp: number;
agentId: string;
trustLevel: "high" | "medium" | "low" | "untrusted";
};
// Temporal Decay: 만료 정보
ttl: number; // Unix timestamp (ms), 0이면 영구
contentHash: string; // 무결성 검증용 SHA-256 해시
}
class SecureMemoryStore {
private partitions: Map<string, MemoryEntry[]> = new Map();
// 1. Memory Partitioning — 에이전트별 격리
private getPartition(key: string): MemoryEntry[] {
if (!this.partitions.has(key)) {
this.partitions.set(key, []);
}
return this.partitions.get(key)!;
}
// 2. Provenance Tracking — 출처 메타데이터 포함 저장
async store(
agentId: string,
content: string,
source: string,
trustLevel: MemoryEntry["provenance"]["trustLevel"],
ttlHours: number = 24
): Promise<string> {
const entry: MemoryEntry = {
id: randomUUID(), // "crypto" 모듈에서 임포트한 randomUUID 사용
content,
provenance: {
source,
timestamp: Date.now(),
agentId,
trustLevel,
},
ttl: ttlHours > 0 ? Date.now() + ttlHours * 3_600_000 : 0,
contentHash: createHash("sha256").update(content).digest("hex"),
};
// 낮은 신뢰도 콘텐츠는 별도 격리 파티션에 저장
const partitionKey =
trustLevel === "untrusted" ? `${agentId}:quarantine` : agentId;
this.getPartition(partitionKey).push(entry);
return entry.id;
}
// 3. Context Isolation + 4. Temporal Decay — 쿼리 시 실시간 검증
async query(
agentId: string,
minTrustLevel: MemoryEntry["provenance"]["trustLevel"] = "medium"
): Promise<MemoryEntry[]> {
const trustHierarchy: Record<
MemoryEntry["provenance"]["trustLevel"],
number
> = { high: 3, medium: 2, low: 1, untrusted: 0 };
const minScore = trustHierarchy[minTrustLevel];
const now = Date.now();
return this.getPartition(agentId).filter((entry) => {
// Temporal Decay: 만료된 항목 제외
if (entry.ttl > 0 && entry.ttl < now) return false;
// Context Isolation: 신뢰 수준 필터링
if (trustHierarchy[entry.provenance.trustLevel] < minScore) return false;
// 무결성 검증: 저장 후 변조 여부 확인
const currentHash = createHash("sha256")
.update(entry.content)
.digest("hex");
if (currentHash !== entry.contentHash) {
console.error(`Memory integrity violation detected: entry ${entry.id}`);
return false;
}
return true;
});
}
// 5. Behavioral Monitoring — 이상 패턴 탐지
async detectAnomalies(agentId: string): Promise<string[]> {
const warnings: string[] = [];
const partition = this.getPartition(agentId);
// 단시간 내 대량 저장 시도 탐지
const recentEntries = partition.filter(
(e) => Date.now() - e.provenance.timestamp < 60_000
);
if (recentEntries.length > 50) {
warnings.push(
`Anomaly: ${recentEntries.length} entries written in last 60s`
);
}
// 비신뢰 소스 비율 탐지
const untrustedRatio =
partition.filter((e) => e.provenance.trustLevel === "untrusted").length /
Math.max(partition.length, 1);
if (untrustedRatio > 0.3) {
warnings.push(
`Anomaly: ${(untrustedRatio * 100).toFixed(1)}% entries from untrusted sources`
);
}
return warnings;
}
}| Layer | Implementation Point | Defense Effect |
|---|---|---|
| Memory Partitioning | getPartition(agentId) |
Blocking Cross-Contamination Between Agents |
| Context Isolation | minTrustLevel Filter |
Automatic isolation of low confidence items |
| Provenance Tracking | provenance Metadata |
Post-attack path tracing possible |
| Temporal Decay | ttl Expiration Check |
Automatically Deletion of Old Contaminated Items |
| Behavioral Monitoring | detectAnomalies() |
Early Detection of Mass Insertion Attacks |
Example 3: Defending Against Cascading Failures with Circuit Breaker Patterns
Circuit Breaker Pattern: A software engineering pattern designed to prevent chaining of external service call failures by automatically terminating the connection when a failure exceeds a threshold and attempting recovery after a certain period. In agent systems, the same principle is applied to isolate a compromised agent and restore it to its last healthy state (SafeMode snapshot).
import asyncio
import time
import json
from enum import Enum
from dataclasses import dataclass, field
from typing import Callable, Any, Optional
class CircuitState(Enum):
CLOSED = "closed" # 정상 동작
OPEN = "open" # 차단 (요청 즉시 거부)
HALF_OPEN = "half_open" # 복구 테스트 중
@dataclass
class AgentCircuitBreaker:
agent_id: str
failure_threshold: int = 5 # 실패 N회 시 OPEN
recovery_timeout: float = 30.0 # N초 후 HALF_OPEN 시도
success_threshold: int = 2 # HALF_OPEN에서 성공 N회 시 CLOSED 복귀
state: CircuitState = CircuitState.CLOSED
failure_count: int = 0
success_count: int = 0
last_failure_time: float = 0.0
safe_snapshot: Optional[dict] = None
def save_snapshot(self, state: dict) -> None:
"""정상 동작 시 SafeMode 스냅샷 저장"""
self.safe_snapshot = {
"timestamp": time.time(),
"agent_id": self.agent_id,
"state": json.dumps(state),
}
def restore_from_snapshot(self) -> Optional[dict]:
"""장애 시 마지막 정상 스냅샷으로 복구"""
if self.safe_snapshot:
print(
f"[RECOVERY] Agent {self.agent_id}: "
f"restoring snapshot (saved at {self.safe_snapshot['timestamp']:.0f})"
)
return json.loads(self.safe_snapshot["state"])
return None
async def call(self, func: Callable, *args, **kwargs) -> Any:
"""에이전트 액션 실행 전 회로 상태 확인"""
if self.state == CircuitState.OPEN:
elapsed = time.time() - self.last_failure_time
if elapsed >= self.recovery_timeout:
self.state = CircuitState.HALF_OPEN
self.success_count = 0
print(f"[CIRCUIT] Agent {self.agent_id}: HALF_OPEN (recovery test)")
else:
restored = self.restore_from_snapshot()
raise CircuitOpenError(
f"Agent {self.agent_id} is OPEN. "
f"Retry after {self.recovery_timeout - elapsed:.1f}s",
restored_state=restored,
)
try:
result = await func(*args, **kwargs)
self._on_success()
return result
except Exception:
self._on_failure()
raise
def _on_success(self) -> None:
self.failure_count = 0
if self.state == CircuitState.HALF_OPEN:
self.success_count += 1
if self.success_count >= self.success_threshold:
self.state = CircuitState.CLOSED
print(f"[CIRCUIT] Agent {self.agent_id}: CLOSED (recovered)")
def _on_failure(self) -> None:
self.failure_count += 1
self.last_failure_time = time.time()
if self.state == CircuitState.HALF_OPEN:
self.state = CircuitState.OPEN
elif self.failure_count >= self.failure_threshold:
self.state = CircuitState.OPEN
print(
f"[CIRCUIT] Agent {self.agent_id}: OPEN "
f"(failures: {self.failure_count})"
)
class MultiAgentOrchestrator:
def __init__(self, agent_ids: list[str]):
self.breakers = {
aid: AgentCircuitBreaker(agent_id=aid) for aid in agent_ids
}
async def get_available_agent_decision(
self,
query: str,
agent_handlers: dict[str, Callable],
availability_threshold: float = 0.66,
) -> dict:
"""
다중 에이전트 가용성 기반 결정.
주의: 이 구현은 '응답한 에이전트 비율'이 임계치 이상일 때
첫 번째 응답을 반환하는 가용성(availability) 검증입니다.
응답 내용의 의미론적 일치도를 검사하는 진정한 합의(consensus)가
아닙니다. 프로덕션 환경에서는 응답 간 임베딩 유사도 비교나
다수결 로직을 추가로 구현하는 것을 권장합니다.
"""
responses = []
for agent_id, handler in agent_handlers.items():
try:
breaker = self.breakers[agent_id]
result = await breaker.call(handler, query)
responses.append({"agent_id": agent_id, "result": result})
except CircuitOpenError:
print(f"[DECISION] Agent {agent_id} skipped (circuit open)")
if not responses:
raise RuntimeError("All agents unavailable — system in SafeMode")
# 가용한 에이전트 비율 확인 (의미론적 일치도 검증은 별도 구현 필요)
availability_ratio = len(responses) / len(agent_handlers)
if availability_ratio < availability_threshold:
raise RuntimeError(
f"Availability threshold not met: "
f"only {availability_ratio:.0%} agents responded"
)
return responses[0]["result"]
class CircuitOpenError(Exception):
def __init__(self, message: str, restored_state: Optional[dict] = None):
super().__init__(message)
self.restored_state = restored_stateCircuit Breaker State Transitions: Transitions occur in the order of CLOSED → (Failure Threshold Exceeded) → OPEN → (Recovery Timeout) → HALF_OPEN → (Success Threshold Reached) → CLOSED. In the OPEN state, requests are immediately rejected, and recovery is attempted to the last SafeMode snapshot.
Pros and Cons Analysis
Advantages
| Item | Content |
|---|---|
| Hierarchical Defense | GoalLock → Memory Isolation → Circuit Breaker each block the ASI01/06/08 chain at each stage |
| Auditability | Attack paths can be traced retrospectively with Provenance Tracking |
| Automatic Recovery | Partial recovery without operator intervention using SafeMode snapshots |
| Regulatory Alignment | OWASP Agentic Top 10 Automatically Maps to EU AI Act, HIPAA, and SOC2 Requirements |
| Runtime Framework | Utilize open-source tools such as the Microsoft Agent Governance Toolkit to collectively apply GoalLock, Memory Isolation, and Circuit Breaker as runtime policies |
Disadvantages and Precautions
| Item | Content | Response Plan |
|---|---|---|
| Semantic Opacity | Natural Language Communication Between LLMs Cannot Be Validated at the Byte Level | Concurrent Embedding Distance-Based Anomaly Detection |
| Increased Latency | Validation, Consensus, and Approval Flow Causes Response Speed Degradation | Utilizes AgentOS Submillisecond Policy Engine, Separates Asynchronous Validation |
| False Positives | Risk of excessive policies blocking normal agent operations | Apply after policy tuning in initial shadow mode |
| Emergent Behavior | Unpredictable interactions occur due to individual safety designs during multi-agent collaboration | Regular Red Team Testing (Promptfoo, DeepTeam) |
| Structural Limitations | Mixed Handling of Trusted and Untrusted Inputs Cannot Be Fully Resolved | Protecting High-Risk Actions with Human-in-the-Loop Checkpoints |
| Availability vs. Consensus | CircuitBreaker's get_available_agent_decision verifies availability rather than semantic consensus |
Separate implementation of logic to compare embedding similarity between responses required |
The Most Common Mistakes in Practice
- Apply Security Only in Test Environments: Skipping security verification during development and attempting to add it just before production will cause architecture change costs to skyrocket. It is recommended to embed GoalLock and Provenance Tracking from the early stages of agent design.
- Insufficient Isolation of Shared RAG DB: If multiple agents share the same vector DB without distinguishing between read and write permissions, a single infection will spread throughout the entire system. It is highly recommended to apply agent-specific partitioning and trust-level-based filters.
- Trust Static Rules Only: It is a common misconception to believe that defense is complete with only regular expression filters for known injection patterns. Since attackers bypass static rules through encoding, evasion expressions, and indirect injection, it is recommended to conduct regular red team tests using Promptfoo or DeepTeam.
In Conclusion
In this article, we examined a structure in which GoalLock blocks goal tampering at the entry point of external inputs, Layer 5 memory isolation prevents persistent contamination through the shared RAG storage, and Circuit Breaker isolates compromised agents and restores them to a final healthy state, thereby breaking the ASI01 → ASI06 → ASI08 chain at each stage. While each pattern is valid independently, combining the three layers effectively blocks the propagation path of the chain attack itself.
You can select a starting point for the 3 steps below depending on your team's situation.
- [Team operating agent system] Run Red Team Test: After installing Promptfoo (
pnpm add -g promptfoo), you can automatically scan the current agent for OWASP Agentic Top 10 vulnerabilities using thepromptfoo redteam run --plugins owasp:agenticcommand. It is recommended to check the ASI01, ASI06, and ASI08 scores first in the results report. - [Teams using RAG/Vector DB] Memory Isolation Audit: Please check if each memory entry in your current agent system contains
source,trustLevel, andttlmetadata. If not, you can start by adding Provenance Tracking by referring to theSecureMemoryStorepattern in this article. - [Team in Agent System Design Phase] Review of Microsoft Agent Governance Toolkit Adoption: This toolkit, released by Microsoft under the MIT license in April 2026, provides the
AgentOSpolicy engine, which can collectively apply runtime policies corresponding to the previously discussed GoalLock, Memory Isolation, and Circuit Breaker with sub-millisecond latency. You can find LangChain and CrewAI integration examples in the official GitHub repository (microsoft/agent-governance-toolkit).
Next Post: Designing an MCP (Model Context Protocol) Gateway in Practice — A Step-by-Step Guide to Applying Zero Trust Least Privilege Architecture to AI Agents
Reference Materials
- OWASP Top 10 for Agentic Applications 공식 발표 (2025.12) | OWASP
- OWASP Top 10 for Agentic Applications 2026 전문 | OWASP
- OWASP Agentic AI Threats and Mitigations 가이드 | OWASP
- Promptfoo — OWASP Agentic AI Red Team Guide | Promptfoo
- Microsoft Agent Governance Toolkit GitHub | Microsoft
- Microsoft Agent Governance Toolkit Official Blog | Microsoft Open Source
- Microsoft — OWASP Agentic Top 10 대응 (Copilot Studio) | Microsoft Security Blog
- Adversa AI — OWASP ASI08 Complete Guide to Cascading Failure | Adversa AI
- Agentic AI Security: Threats·Defense·Assessment·Challenges | arXiv
- System-level Indirect Prompt Injection Defense Architecture | arXiv
- Memory Poisoning and Secure Multi-Agent Systems (PoisonedRAG) | arXiv
- Zero Trust for AI Agents | Cisco
- In-depth Analysis of AI Agent Memory Poisoning | MintMCP
- Galileo AI — Causes and Prevention of Multi-Agent System Failure | Galileo AI
- NIST — AI Agent Hijacking Assessment Enhancement Technology Blog | NIST
- RAG Data Poisoning Core Concepts | Promptfoo
- Palo Alto Networks — OWASP Agentic AI 2026 Response Strategy | Palo Alto Networks