AI Agent Security in Code: A Practical Guide to Defending Against Target Hijacking, Memory Poisoning, and Cascading Failures

If you are building a system where AI agents formulate plans, invoke tools, and delegate tasks to other agents, you must be aware that there are threat categories that cannot be captured by existing static analysis (SAST) or dynamic analysis (DAST) tools alone. A structure where a single user request from an agent is broken down into dozens of internal actions, with the result of each action serving as the input for the next plan, creates an attack surface that traditional request-response-centric security models never assumed. The OWASP Top 10 for Agentic Applications, published by the OWASP GenAI Security Project in December 2025, is a framework designed to fill this very gap. While the existing LLM Top 10 focused on threats involving single prompt-response pairs, the Agentic Top 10 treats the entire multi-stage behavioral cycle—where an agent repeats planning, tool invocation, memory persistence, and delegation—as a threat model.

This article focuses on three of the ten items that have either been issued an actual CVE (ASI01: EchoLeak CVE-2025-32711, CVSS 9.3) or are verified by reproducible research papers (ASI06: arXiv:2603.20357): ASI01 (Goal Hijacking), ASI06 (Memory Poisoning), and ASI08 (Cascading Failure). These three threats do not exist independently but form a chain structure where ASI01 acts as an initial entry point to contaminate memory via ASI06, and that contamination propagates throughout the entire system via ASI08. After reading this article, you will have an architectural standard for deploying the three patterns of GoalLock, Layer 5 Memory Isolation, and Circuit Breaker as defense lines against each threat. This article is intended for backend and full-stack developers who are already building or planning to build agentic AI systems.

Key Concepts

Why Agentic AI Creates Threats That Traditional Security Tools Cannot Catch

Traditional SAST/DAST tools detect static code structures or anomalies in HTTP requests and responses. However, agentic AI attacks bypass byte-level validation because they infiltrate the results of tool calls written in natural language, the embedding layers of vector databases, and the message payloads between agents.

Classification	Traditional Prompt Injection	Agentic Goal Hijacking
Span of Impact	Single Response Contamination	Full Control of Agent Planning Engine
Persistence	Limited to this request	Weaponizes all subsequent actions
Detection Difficulty	Relatively Easy	Natural Language Based, Bypasses Schema Validation
Propagation Path	None	Propagate to Memory/Sub-agents

The Core Paradox of Agentic Security: The factors that make an agent more powerful (planning ability, long-term memory, cooperation with other agents) are also the factors that exponentially expand the attack surface.

Chain Structure of Three Threats: ASI01 → ASI06 → ASI08

These three threats are not independent individual events. In actual attack scenarios, they occur in a chain as follows.

[외부 입력] → ASI01(목표 하이재킹)
                  ↓ 오염된 목표로 메모리 쓰기
              ASI06(메모리 포이즈닝)
                  ↓ 오염된 메모리를 공유 RAG에 저장
              ASI08(계단식 실패)
                  ↓ 공유 메모리를 읽는 모든 에이전트로 전파
              [시스템 전체 장애]

Each defense layer blocks one step of this chain. If GoalLock blocks ASI01 entry, ASI06 and ASI08 lose their firing conditions.

ASI01 — Agent Goal Hijacking

Goal hijacking is an attack in which malicious content overwrites the agent's original goal and plan path itself. The attack enters the context window through an external source trusted by the agent. The specific entry paths are as follows:

Web Search Results: Directives inserted in the description field or snippet area of the Bing/Google API response
Recipient Email: Hidden text part processed by CSS display:none in the HTML body
External Documents: PDF footnotes, comment areas in Word documents, HTML comments in Markdown files()

As officially acknowledged by OpenAI, complete blocking is theoretically impossible due to the structural limitation that trusted and untrusted inputs are processed in the same context window.

A representative example is EchoLeak (CVE-2025-32711, CVSS 9.3), discovered in Microsoft 365 Copilot in mid-2025. The moment Copilot processed the HTML body part of a crafted email as context, sensitive data was automatically leaked externally without user intervention. The agent's goal was immediately switched from "normal business processing" to "data extraction."

ASI06 — Memory & Context Poisoning

RAG (Retrieval-Augmented Generation): This is a pattern where an agent searches an external knowledge base (documents stored in a vector database) and utilizes it to generate answers. It operates as the agent's "long-term memory," and a structure where multiple agents share the same vector database is common.

Memory poisoning is an attack that inserts malicious instructions into this RAG repository, permanently affecting all subsequent interactions. Unlike prompt injection, which contaminates a single conversation, once successful, all agents accessing that RAG repository are contaminated.

As confirmed in the PoisonedRAG study (arXiv:2603.20357), a single malicious document can propagate throughout the entire system within hours via a shared vector DB. If an agent forms a false belief (e.g., the perception that a specific security policy is invalid), subsequent decisions based on this belief are cascaded and corrupted.

ASI08 — Cascading Failures

Cascading failure is a phenomenon where corruption in one agent propagates to connected tools, shared memory, and subordinate agents, driving the entire system into a failure state. A key risk factor is that natural language-based errors pass type checking or schema validation. Existing monitoring tools struggle to detect agents that deliver semantically incorrect instructions while returning a valid JSON schema. According to Galileo AI's analysis of multi-agent system failures, temporal compounding—where corrupted memory continuously contaminates future operations—is identified as the most significant detrimental factor.

Zero Trust for AI: This is an architectural concept that applies the Zero Trust principle—"Never trust, always verify"—to AI agents. Internal agents are treated as verification targets just like external inputs, and both Cisco and Microsoft released Zero Trust architecture guidelines dedicated to AI agents in the first quarter of 2026.

MCP (Model Context Protocol): This is a protocol designed for AI agents to interact with external tools and services in a standardized manner. The enforcement of least privilege through MCP gateways has become the current industry standard pattern, which will be covered in detail in the following article.

Practical Application

Example 1: Defending Against Goal Hijacking with the GoalLock Mechanism

This is a pattern where the agent signs the initial goal with HMAC and compares the current goal with the initial signature every time after processing external input. If the goal is modified without authorization, execution stops immediately.

python

import hashlib
import hmac
import re
from dataclasses import dataclass
from typing import Optional
 
SECRET_KEY = b"your-secret-key-stored-in-env"  # 실제 환경에서는 환경 변수로 관리
 
 
@dataclass
class GoalLock:
    original_goal: str
    signature: str
 
    @staticmethod
    def create(goal: str) -> "GoalLock":
        # hmac.new의 첫 번째 인자는 bytes, digestmod는 키워드 인자로 명시
        sig = hmac.new(
            SECRET_KEY, goal.encode(), digestmod=hashlib.sha256
        ).hexdigest()
        return GoalLock(original_goal=goal, signature=sig)
 
    def verify(self, current_goal: str) -> bool:
        expected = hmac.new(
            SECRET_KEY, self.original_goal.encode(), digestmod=hashlib.sha256
        ).hexdigest()
        # hmac.compare_digest: 타이밍 공격(timing attack) 방어
        if not hmac.compare_digest(self.signature, expected):
            return False  # 서명 자체가 조작됨
        return current_goal.strip() == self.original_goal.strip()
 
 
class SecureAgent:
    def __init__(self, goal: str):
        self.goal_lock = GoalLock.create(goal)
        self.current_goal = goal
 
    def process_external_input(self, external_content: str) -> str:
        """외부 입력을 처리하기 전 목표 무결성 확인"""
        sanitized = self._sanitize_input(external_content)
 
        if not self.goal_lock.verify(self.current_goal):
            raise SecurityError(
                f"Goal integrity violation detected. "
                f"Original: '{self.goal_lock.original_goal}' "
                f"Current: '{self.current_goal}'"
            )
        return sanitized
 
    def _sanitize_input(self, content: str) -> str:
        """
        다계층 입력 sanitization.
 
        완전 차단 대신 '[SUSPICIOUS_CONTENT_DETECTED]' 마킹을 사용하는 이유:
        - 완전 차단 시, LLM은 입력이 잘렸다는 사실을 모른 채 불완전한
          컨텍스트로 계획을 진행해 오히려 예측 불가능한 동작을 유발할 수 있습니다.
        - 마킹 방식은 LLM이 의심 콘텐츠의 존재를 인식하고 적절히
          무시하거나 경고를 포함한 응답을 생성하도록 유도합니다.
        - 단, 이 방식은 LLM이 마킹 자체를 무시하거나 학습 컨텍스트로
          처리할 수 있으므로, 임베딩 거리 기반 의미론적 탐지와 병행하는
          것을 권장합니다.
        """
        injection_patterns = [
            r"(?i)ignore\s+(all\s+)?previous\s+instructions?",
            r"(?i)you\s+are\s+now\s+",
            r"(?i)act\s+as\s+",
            r"(?i)forget\s+your\s+(previous\s+)?instructions?",
            r"(?i)new\s+goal\s*:",
            r"(?i)override\s+(previous\s+)?instructions?",
        ]
        for pattern in injection_patterns:
            if re.search(pattern, content):
                content = f"[SUSPICIOUS_CONTENT_DETECTED] {content}"
                break
        return content
 
 
class SecurityError(Exception):
    pass

Code Components	Roles
`GoalLock.create()`	Sign initial target with HMAC-SHA256, create immutable baseline
`GoalLock.verify()`	Verify current goal and signature match before every action
`hmac.compare_digest()`	Constant Time Comparison for Timing Attack Defense
`_sanitize_input()`	Mark after detecting known injection patterns (not complete block)
`SecurityError`	Stop execution immediately upon target tampering detection

Note: Regular expression-based sanitization only defends against known patterns. Since attackers can bypass static rules through encoding, bypass expressions, and indirect injection, it is recommended to use it in conjunction with embedding distance measurement-based semantic detection.

Example 2: Defending Against Poisoning with 5-Layer Memory Isolation

The key to defending against memory poisoning is to store the source and trustworthiness of each memory entry and verify them at the time of the query.

typescript

// Node.js 14.17.0+ 환경 기준
// randomUUID와 createHash 모두 "crypto" 모듈에서 명시적으로 임포트
import { createHash, randomUUID } from "crypto";
 
interface MemoryEntry {
  id: string;
  content: string;
  // Provenance Tracking: 출처 메타데이터
  provenance: {
    source: string;       // 예: "user_upload" | "web_search" | "agent_internal"
    timestamp: number;
    agentId: string;
    trustLevel: "high" | "medium" | "low" | "untrusted";
  };
  // Temporal Decay: 만료 정보
  ttl: number;            // Unix timestamp (ms), 0이면 영구
  contentHash: string;    // 무결성 검증용 SHA-256 해시
}
 
class SecureMemoryStore {
  private partitions: Map<string, MemoryEntry[]> = new Map();
 
  // 1. Memory Partitioning — 에이전트별 격리
  private getPartition(key: string): MemoryEntry[] {
    if (!this.partitions.has(key)) {
      this.partitions.set(key, []);
    }
    return this.partitions.get(key)!;
  }
 
  // 2. Provenance Tracking — 출처 메타데이터 포함 저장
  async store(
    agentId: string,
    content: string,
    source: string,
    trustLevel: MemoryEntry["provenance"]["trustLevel"],
    ttlHours: number = 24
  ): Promise<string> {
    const entry: MemoryEntry = {
      id: randomUUID(), // "crypto" 모듈에서 임포트한 randomUUID 사용
      content,
      provenance: {
        source,
        timestamp: Date.now(),
        agentId,
        trustLevel,
      },
      ttl: ttlHours > 0 ? Date.now() + ttlHours * 3_600_000 : 0,
      contentHash: createHash("sha256").update(content).digest("hex"),
    };
 
    // 낮은 신뢰도 콘텐츠는 별도 격리 파티션에 저장
    const partitionKey =
      trustLevel === "untrusted" ? `${agentId}:quarantine` : agentId;
 
    this.getPartition(partitionKey).push(entry);
    return entry.id;
  }
 
  // 3. Context Isolation + 4. Temporal Decay — 쿼리 시 실시간 검증
  async query(
    agentId: string,
    minTrustLevel: MemoryEntry["provenance"]["trustLevel"] = "medium"
  ): Promise<MemoryEntry[]> {
    const trustHierarchy: Record<
      MemoryEntry["provenance"]["trustLevel"],
      number
    > = { high: 3, medium: 2, low: 1, untrusted: 0 };
    const minScore = trustHierarchy[minTrustLevel];
    const now = Date.now();
 
    return this.getPartition(agentId).filter((entry) => {
      // Temporal Decay: 만료된 항목 제외
      if (entry.ttl > 0 && entry.ttl < now) return false;
 
      // Context Isolation: 신뢰 수준 필터링
      if (trustHierarchy[entry.provenance.trustLevel] < minScore) return false;
 
      // 무결성 검증: 저장 후 변조 여부 확인
      const currentHash = createHash("sha256")
        .update(entry.content)
        .digest("hex");
      if (currentHash !== entry.contentHash) {
        console.error(`Memory integrity violation detected: entry ${entry.id}`);
        return false;
      }
      return true;
    });
  }
 
  // 5. Behavioral Monitoring — 이상 패턴 탐지
  async detectAnomalies(agentId: string): Promise<string[]> {
    const warnings: string[] = [];
    const partition = this.getPartition(agentId);
 
    // 단시간 내 대량 저장 시도 탐지
    const recentEntries = partition.filter(
      (e) => Date.now() - e.provenance.timestamp < 60_000
    );
    if (recentEntries.length > 50) {
      warnings.push(
        `Anomaly: ${recentEntries.length} entries written in last 60s`
      );
    }
 
    // 비신뢰 소스 비율 탐지
    const untrustedRatio =
      partition.filter((e) => e.provenance.trustLevel === "untrusted").length /
      Math.max(partition.length, 1);
 
    if (untrustedRatio > 0.3) {
      warnings.push(
        `Anomaly: ${(untrustedRatio * 100).toFixed(1)}% entries from untrusted sources`
      );
    }
 
    return warnings;
  }
}

Layer	Implementation Point	Defense Effect
Memory Partitioning	`getPartition(agentId)`	Blocking Cross-Contamination Between Agents
Context Isolation	`minTrustLevel` Filter	Automatic isolation of low confidence items
Provenance Tracking	`provenance` Metadata	Post-attack path tracing possible
Temporal Decay	`ttl` Expiration Check	Automatically Deletion of Old Contaminated Items
Behavioral Monitoring	`detectAnomalies()`	Early Detection of Mass Insertion Attacks

Example 3: Defending Against Cascading Failures with Circuit Breaker Patterns

Circuit Breaker Pattern: A software engineering pattern designed to prevent chaining of external service call failures by automatically terminating the connection when a failure exceeds a threshold and attempting recovery after a certain period. In agent systems, the same principle is applied to isolate a compromised agent and restore it to its last healthy state (SafeMode snapshot).

python

import asyncio
import time
import json
from enum import Enum
from dataclasses import dataclass, field
from typing import Callable, Any, Optional
 
 
class CircuitState(Enum):
    CLOSED = "closed"       # 정상 동작
    OPEN = "open"           # 차단 (요청 즉시 거부)
    HALF_OPEN = "half_open" # 복구 테스트 중
 
 
@dataclass
class AgentCircuitBreaker:
    agent_id: str
    failure_threshold: int = 5       # 실패 N회 시 OPEN
    recovery_timeout: float = 30.0   # N초 후 HALF_OPEN 시도
    success_threshold: int = 2       # HALF_OPEN에서 성공 N회 시 CLOSED 복귀
 
    state: CircuitState = CircuitState.CLOSED
    failure_count: int = 0
    success_count: int = 0
    last_failure_time: float = 0.0
    safe_snapshot: Optional[dict] = None
 
    def save_snapshot(self, state: dict) -> None:
        """정상 동작 시 SafeMode 스냅샷 저장"""
        self.safe_snapshot = {
            "timestamp": time.time(),
            "agent_id": self.agent_id,
            "state": json.dumps(state),
        }
 
    def restore_from_snapshot(self) -> Optional[dict]:
        """장애 시 마지막 정상 스냅샷으로 복구"""
        if self.safe_snapshot:
            print(
                f"[RECOVERY] Agent {self.agent_id}: "
                f"restoring snapshot (saved at {self.safe_snapshot['timestamp']:.0f})"
            )
            return json.loads(self.safe_snapshot["state"])
        return None
 
    async def call(self, func: Callable, *args, **kwargs) -> Any:
        """에이전트 액션 실행 전 회로 상태 확인"""
        if self.state == CircuitState.OPEN:
            elapsed = time.time() - self.last_failure_time
            if elapsed >= self.recovery_timeout:
                self.state = CircuitState.HALF_OPEN
                self.success_count = 0
                print(f"[CIRCUIT] Agent {self.agent_id}: HALF_OPEN (recovery test)")
            else:
                restored = self.restore_from_snapshot()
                raise CircuitOpenError(
                    f"Agent {self.agent_id} is OPEN. "
                    f"Retry after {self.recovery_timeout - elapsed:.1f}s",
                    restored_state=restored,
                )
 
        try:
            result = await func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise
 
    def _on_success(self) -> None:
        self.failure_count = 0
        if self.state == CircuitState.HALF_OPEN:
            self.success_count += 1
            if self.success_count >= self.success_threshold:
                self.state = CircuitState.CLOSED
                print(f"[CIRCUIT] Agent {self.agent_id}: CLOSED (recovered)")
 
    def _on_failure(self) -> None:
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.state == CircuitState.HALF_OPEN:
            self.state = CircuitState.OPEN
        elif self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN
            print(
                f"[CIRCUIT] Agent {self.agent_id}: OPEN "
                f"(failures: {self.failure_count})"
            )
 
 
class MultiAgentOrchestrator:
    def __init__(self, agent_ids: list[str]):
        self.breakers = {
            aid: AgentCircuitBreaker(agent_id=aid) for aid in agent_ids
        }
 
    async def get_available_agent_decision(
        self,
        query: str,
        agent_handlers: dict[str, Callable],
        availability_threshold: float = 0.66,
    ) -> dict:
        """
        다중 에이전트 가용성 기반 결정.
 
        주의: 이 구현은 '응답한 에이전트 비율'이 임계치 이상일 때
        첫 번째 응답을 반환하는 가용성(availability) 검증입니다.
        응답 내용의 의미론적 일치도를 검사하는 진정한 합의(consensus)가
        아닙니다. 프로덕션 환경에서는 응답 간 임베딩 유사도 비교나
        다수결 로직을 추가로 구현하는 것을 권장합니다.
        """
        responses = []
        for agent_id, handler in agent_handlers.items():
            try:
                breaker = self.breakers[agent_id]
                result = await breaker.call(handler, query)
                responses.append({"agent_id": agent_id, "result": result})
            except CircuitOpenError:
                print(f"[DECISION] Agent {agent_id} skipped (circuit open)")
 
        if not responses:
            raise RuntimeError("All agents unavailable — system in SafeMode")
 
        # 가용한 에이전트 비율 확인 (의미론적 일치도 검증은 별도 구현 필요)
        availability_ratio = len(responses) / len(agent_handlers)
        if availability_ratio < availability_threshold:
            raise RuntimeError(
                f"Availability threshold not met: "
                f"only {availability_ratio:.0%} agents responded"
            )
 
        return responses[0]["result"]
 
 
class CircuitOpenError(Exception):
    def __init__(self, message: str, restored_state: Optional[dict] = None):
        super().__init__(message)
        self.restored_state = restored_state

Circuit Breaker State Transitions: Transitions occur in the order of CLOSED → (Failure Threshold Exceeded) → OPEN → (Recovery Timeout) → HALF_OPEN → (Success Threshold Reached) → CLOSED. In the OPEN state, requests are immediately rejected, and recovery is attempted to the last SafeMode snapshot.

Pros and Cons Analysis

Advantages

Item	Content
Hierarchical Defense	GoalLock → Memory Isolation → Circuit Breaker each block the ASI01/06/08 chain at each stage
Auditability	Attack paths can be traced retrospectively with Provenance Tracking
Automatic Recovery	Partial recovery without operator intervention using SafeMode snapshots
Regulatory Alignment	OWASP Agentic Top 10 Automatically Maps to EU AI Act, HIPAA, and SOC2 Requirements
Runtime Framework	Utilize open-source tools such as the Microsoft Agent Governance Toolkit to collectively apply GoalLock, Memory Isolation, and Circuit Breaker as runtime policies

Disadvantages and Precautions

Item	Content	Response Plan
Semantic Opacity	Natural Language Communication Between LLMs Cannot Be Validated at the Byte Level	Concurrent Embedding Distance-Based Anomaly Detection
Increased Latency	Validation, Consensus, and Approval Flow Causes Response Speed Degradation	Utilizes AgentOS Submillisecond Policy Engine, Separates Asynchronous Validation
False Positives	Risk of excessive policies blocking normal agent operations	Apply after policy tuning in initial shadow mode
Emergent Behavior	Unpredictable interactions occur due to individual safety designs during multi-agent collaboration	Regular Red Team Testing (Promptfoo, DeepTeam)
Structural Limitations	Mixed Handling of Trusted and Untrusted Inputs Cannot Be Fully Resolved	Protecting High-Risk Actions with Human-in-the-Loop Checkpoints
Availability vs. Consensus	CircuitBreaker's `get_available_agent_decision` verifies availability rather than semantic consensus	Separate implementation of logic to compare embedding similarity between responses required

The Most Common Mistakes in Practice

Apply Security Only in Test Environments: Skipping security verification during development and attempting to add it just before production will cause architecture change costs to skyrocket. It is recommended to embed GoalLock and Provenance Tracking from the early stages of agent design.
Insufficient Isolation of Shared RAG DB: If multiple agents share the same vector DB without distinguishing between read and write permissions, a single infection will spread throughout the entire system. It is highly recommended to apply agent-specific partitioning and trust-level-based filters.
Trust Static Rules Only: It is a common misconception to believe that defense is complete with only regular expression filters for known injection patterns. Since attackers bypass static rules through encoding, evasion expressions, and indirect injection, it is recommended to conduct regular red team tests using Promptfoo or DeepTeam.

In Conclusion

In this article, we examined a structure in which GoalLock blocks goal tampering at the entry point of external inputs, Layer 5 memory isolation prevents persistent contamination through the shared RAG storage, and Circuit Breaker isolates compromised agents and restores them to a final healthy state, thereby breaking the ASI01 → ASI06 → ASI08 chain at each stage. While each pattern is valid independently, combining the three layers effectively blocks the propagation path of the chain attack itself.

You can select a starting point for the 3 steps below depending on your team's situation.

[Team operating agent system] Run Red Team Test: After installing Promptfoo (pnpm add -g promptfoo), you can automatically scan the current agent for OWASP Agentic Top 10 vulnerabilities using the promptfoo redteam run --plugins owasp:agentic command. It is recommended to check the ASI01, ASI06, and ASI08 scores first in the results report.
[Teams using RAG/Vector DB] Memory Isolation Audit: Please check if each memory entry in your current agent system contains source, trustLevel, and ttl metadata. If not, you can start by adding Provenance Tracking by referring to the SecureMemoryStore pattern in this article.
[Team in Agent System Design Phase] Review of Microsoft Agent Governance Toolkit Adoption: This toolkit, released by Microsoft under the MIT license in April 2026, provides the AgentOS policy engine, which can collectively apply runtime policies corresponding to the previously discussed GoalLock, Memory Isolation, and Circuit Breaker with sub-millisecond latency. You can find LangChain and CrewAI integration examples in the official GitHub repository (microsoft/agent-governance-toolkit).

Next Post: Designing an MCP (Model Context Protocol) Gateway in Practice — A Step-by-Step Guide to Applying Zero Trust Least Privilege Architecture to AI Agents

Reference Materials

AI Agent Security in Code: A Practical Guide to Defending Against Target Hijacking, Memory Poisoning, and Cascading Failures | DEV BAK - 기술블로그

AI Agent Security in Code: A Practical Guide to Defending Against Target Hijacking, Memory Poisoning, and Cascading Failures

Key Concepts

Why Agentic AI Creates Threats That Traditional Security Tools Cannot Catch

Classification	Traditional Prompt Injection	Agentic Goal Hijacking
Span of Impact	Single Response Contamination	Full Control of Agent Planning Engine
Persistence	Limited to this request	Weaponizes all subsequent actions
Detection Difficulty	Relatively Easy	Natural Language Based, Bypasses Schema Validation
Propagation Path	None	Propagate to Memory/Sub-agents

Chain Structure of Three Threats: ASI01 → ASI06 → ASI08

These three threats are not independent individual events. In actual attack scenarios, they occur in a chain as follows.

[외부 입력] → ASI01(목표 하이재킹)
                  ↓ 오염된 목표로 메모리 쓰기
              ASI06(메모리 포이즈닝)
                  ↓ 오염된 메모리를 공유 RAG에 저장
              ASI08(계단식 실패)
                  ↓ 공유 메모리를 읽는 모든 에이전트로 전파
              [시스템 전체 장애]

Each defense layer blocks one step of this chain. If GoalLock blocks ASI01 entry, ASI06 and ASI08 lose their firing conditions.

ASI01 — Agent Goal Hijacking

Web Search Results: Directives inserted in the description field or snippet area of the Bing/Google API response
Recipient Email: Hidden text part processed by CSS display:none in the HTML body
External Documents: PDF footnotes, comment areas in Word documents, HTML comments in Markdown files()

As officially acknowledged by OpenAI, complete blocking is theoretically impossible due to the structural limitation that trusted and untrusted inputs are processed in the same context window.

ASI06 — Memory & Context Poisoning

ASI08 — Cascading Failures

Practical Application

Example 1: Defending Against Goal Hijacking with the GoalLock Mechanism

python

import hashlib
import hmac
import re
from dataclasses import dataclass
from typing import Optional
 
SECRET_KEY = b"your-secret-key-stored-in-env"  # 실제 환경에서는 환경 변수로 관리
 
 
@dataclass
class GoalLock:
    original_goal: str
    signature: str
 
    @staticmethod
    def create(goal: str) -> "GoalLock":
        # hmac.new의 첫 번째 인자는 bytes, digestmod는 키워드 인자로 명시
        sig = hmac.new(
            SECRET_KEY, goal.encode(), digestmod=hashlib.sha256
        ).hexdigest()
        return GoalLock(original_goal=goal, signature=sig)
 
    def verify(self, current_goal: str) -> bool:
        expected = hmac.new(
            SECRET_KEY, self.original_goal.encode(), digestmod=hashlib.sha256
        ).hexdigest()
        # hmac.compare_digest: 타이밍 공격(timing attack) 방어
        if not hmac.compare_digest(self.signature, expected):
            return False  # 서명 자체가 조작됨
        return current_goal.strip() == self.original_goal.strip()
 
 
class SecureAgent:
    def __init__(self, goal: str):
        self.goal_lock = GoalLock.create(goal)
        self.current_goal = goal
 
    def process_external_input(self, external_content: str) -> str:
        """외부 입력을 처리하기 전 목표 무결성 확인"""
        sanitized = self._sanitize_input(external_content)
 
        if not self.goal_lock.verify(self.current_goal):
            raise SecurityError(
                f"Goal integrity violation detected. "
                f"Original: '{self.goal_lock.original_goal}' "
                f"Current: '{self.current_goal}'"
            )
        return sanitized
 
    def _sanitize_input(self, content: str) -> str:
        """
        다계층 입력 sanitization.
 
        완전 차단 대신 '[SUSPICIOUS_CONTENT_DETECTED]' 마킹을 사용하는 이유:
        - 완전 차단 시, LLM은 입력이 잘렸다는 사실을 모른 채 불완전한
          컨텍스트로 계획을 진행해 오히려 예측 불가능한 동작을 유발할 수 있습니다.
        - 마킹 방식은 LLM이 의심 콘텐츠의 존재를 인식하고 적절히
          무시하거나 경고를 포함한 응답을 생성하도록 유도합니다.
        - 단, 이 방식은 LLM이 마킹 자체를 무시하거나 학습 컨텍스트로
          처리할 수 있으므로, 임베딩 거리 기반 의미론적 탐지와 병행하는
          것을 권장합니다.
        """
        injection_patterns = [
            r"(?i)ignore\s+(all\s+)?previous\s+instructions?",
            r"(?i)you\s+are\s+now\s+",
            r"(?i)act\s+as\s+",
            r"(?i)forget\s+your\s+(previous\s+)?instructions?",
            r"(?i)new\s+goal\s*:",
            r"(?i)override\s+(previous\s+)?instructions?",
        ]
        for pattern in injection_patterns:
            if re.search(pattern, content):
                content = f"[SUSPICIOUS_CONTENT_DETECTED] {content}"
                break
        return content
 
 
class SecurityError(Exception):
    pass

Code Components	Roles
`GoalLock.create()`	Sign initial target with HMAC-SHA256, create immutable baseline
`GoalLock.verify()`	Verify current goal and signature match before every action
`hmac.compare_digest()`	Constant Time Comparison for Timing Attack Defense
`_sanitize_input()`	Mark after detecting known injection patterns (not complete block)
`SecurityError`	Stop execution immediately upon target tampering detection

Example 2: Defending Against Poisoning with 5-Layer Memory Isolation

The key to defending against memory poisoning is to store the source and trustworthiness of each memory entry and verify them at the time of the query.

typescript

// Node.js 14.17.0+ 환경 기준
// randomUUID와 createHash 모두 "crypto" 모듈에서 명시적으로 임포트
import { createHash, randomUUID } from "crypto";
 
interface MemoryEntry {
  id: string;
  content: string;
  // Provenance Tracking: 출처 메타데이터
  provenance: {
    source: string;       // 예: "user_upload" | "web_search" | "agent_internal"
    timestamp: number;
    agentId: string;
    trustLevel: "high" | "medium" | "low" | "untrusted";
  };
  // Temporal Decay: 만료 정보
  ttl: number;            // Unix timestamp (ms), 0이면 영구
  contentHash: string;    // 무결성 검증용 SHA-256 해시
}
 
class SecureMemoryStore {
  private partitions: Map<string, MemoryEntry[]> = new Map();
 
  // 1. Memory Partitioning — 에이전트별 격리
  private getPartition(key: string): MemoryEntry[] {
    if (!this.partitions.has(key)) {
      this.partitions.set(key, []);
    }
    return this.partitions.get(key)!;
  }
 
  // 2. Provenance Tracking — 출처 메타데이터 포함 저장
  async store(
    agentId: string,
    content: string,
    source: string,
    trustLevel: MemoryEntry["provenance"]["trustLevel"],
    ttlHours: number = 24
  ): Promise<string> {
    const entry: MemoryEntry = {
      id: randomUUID(), // "crypto" 모듈에서 임포트한 randomUUID 사용
      content,
      provenance: {
        source,
        timestamp: Date.now(),
        agentId,
        trustLevel,
      },
      ttl: ttlHours > 0 ? Date.now() + ttlHours * 3_600_000 : 0,
      contentHash: createHash("sha256").update(content).digest("hex"),
    };
 
    // 낮은 신뢰도 콘텐츠는 별도 격리 파티션에 저장
    const partitionKey =
      trustLevel === "untrusted" ? `${agentId}:quarantine` : agentId;
 
    this.getPartition(partitionKey).push(entry);
    return entry.id;
  }
 
  // 3. Context Isolation + 4. Temporal Decay — 쿼리 시 실시간 검증
  async query(
    agentId: string,
    minTrustLevel: MemoryEntry["provenance"]["trustLevel"] = "medium"
  ): Promise<MemoryEntry[]> {
    const trustHierarchy: Record<
      MemoryEntry["provenance"]["trustLevel"],
      number
    > = { high: 3, medium: 2, low: 1, untrusted: 0 };
    const minScore = trustHierarchy[minTrustLevel];
    const now = Date.now();
 
    return this.getPartition(agentId).filter((entry) => {
      // Temporal Decay: 만료된 항목 제외
      if (entry.ttl > 0 && entry.ttl < now) return false;
 
      // Context Isolation: 신뢰 수준 필터링
      if (trustHierarchy[entry.provenance.trustLevel] < minScore) return false;
 
      // 무결성 검증: 저장 후 변조 여부 확인
      const currentHash = createHash("sha256")
        .update(entry.content)
        .digest("hex");
      if (currentHash !== entry.contentHash) {
        console.error(`Memory integrity violation detected: entry ${entry.id}`);
        return false;
      }
      return true;
    });
  }
 
  // 5. Behavioral Monitoring — 이상 패턴 탐지
  async detectAnomalies(agentId: string): Promise<string[]> {
    const warnings: string[] = [];
    const partition = this.getPartition(agentId);
 
    // 단시간 내 대량 저장 시도 탐지
    const recentEntries = partition.filter(
      (e) => Date.now() - e.provenance.timestamp < 60_000
    );
    if (recentEntries.length > 50) {
      warnings.push(
        `Anomaly: ${recentEntries.length} entries written in last 60s`
      );
    }
 
    // 비신뢰 소스 비율 탐지
    const untrustedRatio =
      partition.filter((e) => e.provenance.trustLevel === "untrusted").length /
      Math.max(partition.length, 1);
 
    if (untrustedRatio > 0.3) {
      warnings.push(
        `Anomaly: ${(untrustedRatio * 100).toFixed(1)}% entries from untrusted sources`
      );
    }
 
    return warnings;
  }
}

Layer	Implementation Point	Defense Effect
Memory Partitioning	`getPartition(agentId)`	Blocking Cross-Contamination Between Agents
Context Isolation	`minTrustLevel` Filter	Automatic isolation of low confidence items
Provenance Tracking	`provenance` Metadata	Post-attack path tracing possible
Temporal Decay	`ttl` Expiration Check	Automatically Deletion of Old Contaminated Items
Behavioral Monitoring	`detectAnomalies()`	Early Detection of Mass Insertion Attacks

Example 3: Defending Against Cascading Failures with Circuit Breaker Patterns

python

import asyncio
import time
import json
from enum import Enum
from dataclasses import dataclass, field
from typing import Callable, Any, Optional
 
 
class CircuitState(Enum):
    CLOSED = "closed"       # 정상 동작
    OPEN = "open"           # 차단 (요청 즉시 거부)
    HALF_OPEN = "half_open" # 복구 테스트 중
 
 
@dataclass
class AgentCircuitBreaker:
    agent_id: str
    failure_threshold: int = 5       # 실패 N회 시 OPEN
    recovery_timeout: float = 30.0   # N초 후 HALF_OPEN 시도
    success_threshold: int = 2       # HALF_OPEN에서 성공 N회 시 CLOSED 복귀
 
    state: CircuitState = CircuitState.CLOSED
    failure_count: int = 0
    success_count: int = 0
    last_failure_time: float = 0.0
    safe_snapshot: Optional[dict] = None
 
    def save_snapshot(self, state: dict) -> None:
        """정상 동작 시 SafeMode 스냅샷 저장"""
        self.safe_snapshot = {
            "timestamp": time.time(),
            "agent_id": self.agent_id,
            "state": json.dumps(state),
        }
 
    def restore_from_snapshot(self) -> Optional[dict]:
        """장애 시 마지막 정상 스냅샷으로 복구"""
        if self.safe_snapshot:
            print(
                f"[RECOVERY] Agent {self.agent_id}: "
                f"restoring snapshot (saved at {self.safe_snapshot['timestamp']:.0f})"
            )
            return json.loads(self.safe_snapshot["state"])
        return None
 
    async def call(self, func: Callable, *args, **kwargs) -> Any:
        """에이전트 액션 실행 전 회로 상태 확인"""
        if self.state == CircuitState.OPEN:
            elapsed = time.time() - self.last_failure_time
            if elapsed >= self.recovery_timeout:
                self.state = CircuitState.HALF_OPEN
                self.success_count = 0
                print(f"[CIRCUIT] Agent {self.agent_id}: HALF_OPEN (recovery test)")
            else:
                restored = self.restore_from_snapshot()
                raise CircuitOpenError(
                    f"Agent {self.agent_id} is OPEN. "
                    f"Retry after {self.recovery_timeout - elapsed:.1f}s",
                    restored_state=restored,
                )
 
        try:
            result = await func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise
 
    def _on_success(self) -> None:
        self.failure_count = 0
        if self.state == CircuitState.HALF_OPEN:
            self.success_count += 1
            if self.success_count >= self.success_threshold:
                self.state = CircuitState.CLOSED
                print(f"[CIRCUIT] Agent {self.agent_id}: CLOSED (recovered)")
 
    def _on_failure(self) -> None:
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.state == CircuitState.HALF_OPEN:
            self.state = CircuitState.OPEN
        elif self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN
            print(
                f"[CIRCUIT] Agent {self.agent_id}: OPEN "
                f"(failures: {self.failure_count})"
            )
 
 
class MultiAgentOrchestrator:
    def __init__(self, agent_ids: list[str]):
        self.breakers = {
            aid: AgentCircuitBreaker(agent_id=aid) for aid in agent_ids
        }
 
    async def get_available_agent_decision(
        self,
        query: str,
        agent_handlers: dict[str, Callable],
        availability_threshold: float = 0.66,
    ) -> dict:
        """
        다중 에이전트 가용성 기반 결정.
 
        주의: 이 구현은 '응답한 에이전트 비율'이 임계치 이상일 때
        첫 번째 응답을 반환하는 가용성(availability) 검증입니다.
        응답 내용의 의미론적 일치도를 검사하는 진정한 합의(consensus)가
        아닙니다. 프로덕션 환경에서는 응답 간 임베딩 유사도 비교나
        다수결 로직을 추가로 구현하는 것을 권장합니다.
        """
        responses = []
        for agent_id, handler in agent_handlers.items():
            try:
                breaker = self.breakers[agent_id]
                result = await breaker.call(handler, query)
                responses.append({"agent_id": agent_id, "result": result})
            except CircuitOpenError:
                print(f"[DECISION] Agent {agent_id} skipped (circuit open)")
 
        if not responses:
            raise RuntimeError("All agents unavailable — system in SafeMode")
 
        # 가용한 에이전트 비율 확인 (의미론적 일치도 검증은 별도 구현 필요)
        availability_ratio = len(responses) / len(agent_handlers)
        if availability_ratio < availability_threshold:
            raise RuntimeError(
                f"Availability threshold not met: "
                f"only {availability_ratio:.0%} agents responded"
            )
 
        return responses[0]["result"]
 
 
class CircuitOpenError(Exception):
    def __init__(self, message: str, restored_state: Optional[dict] = None):
        super().__init__(message)
        self.restored_state = restored_state

Pros and Cons Analysis

Advantages

Item	Content
Hierarchical Defense	GoalLock → Memory Isolation → Circuit Breaker each block the ASI01/06/08 chain at each stage
Auditability	Attack paths can be traced retrospectively with Provenance Tracking
Automatic Recovery	Partial recovery without operator intervention using SafeMode snapshots
Regulatory Alignment	OWASP Agentic Top 10 Automatically Maps to EU AI Act, HIPAA, and SOC2 Requirements
Runtime Framework	Utilize open-source tools such as the Microsoft Agent Governance Toolkit to collectively apply GoalLock, Memory Isolation, and Circuit Breaker as runtime policies

Disadvantages and Precautions

Item	Content	Response Plan
Semantic Opacity	Natural Language Communication Between LLMs Cannot Be Validated at the Byte Level	Concurrent Embedding Distance-Based Anomaly Detection
Increased Latency	Validation, Consensus, and Approval Flow Causes Response Speed Degradation	Utilizes AgentOS Submillisecond Policy Engine, Separates Asynchronous Validation
False Positives	Risk of excessive policies blocking normal agent operations	Apply after policy tuning in initial shadow mode
Emergent Behavior	Unpredictable interactions occur due to individual safety designs during multi-agent collaboration	Regular Red Team Testing (Promptfoo, DeepTeam)
Structural Limitations	Mixed Handling of Trusted and Untrusted Inputs Cannot Be Fully Resolved	Protecting High-Risk Actions with Human-in-the-Loop Checkpoints
Availability vs. Consensus	CircuitBreaker's `get_available_agent_decision` verifies availability rather than semantic consensus	Separate implementation of logic to compare embedding similarity between responses required

The Most Common Mistakes in Practice

Apply Security Only in Test Environments: Skipping security verification during development and attempting to add it just before production will cause architecture change costs to skyrocket. It is recommended to embed GoalLock and Provenance Tracking from the early stages of agent design.
Insufficient Isolation of Shared RAG DB: If multiple agents share the same vector DB without distinguishing between read and write permissions, a single infection will spread throughout the entire system. It is highly recommended to apply agent-specific partitioning and trust-level-based filters.
Trust Static Rules Only: It is a common misconception to believe that defense is complete with only regular expression filters for known injection patterns. Since attackers bypass static rules through encoding, evasion expressions, and indirect injection, it is recommended to conduct regular red team tests using Promptfoo or DeepTeam.

In Conclusion

You can select a starting point for the 3 steps below depending on your team's situation.

[Team operating agent system] Run Red Team Test: After installing Promptfoo (pnpm add -g promptfoo), you can automatically scan the current agent for OWASP Agentic Top 10 vulnerabilities using the promptfoo redteam run --plugins owasp:agentic command. It is recommended to check the ASI01, ASI06, and ASI08 scores first in the results report.
[Teams using RAG/Vector DB] Memory Isolation Audit: Please check if each memory entry in your current agent system contains source, trustLevel, and ttl metadata. If not, you can start by adding Provenance Tracking by referring to the SecureMemoryStore pattern in this article.
[Team in Agent System Design Phase] Review of Microsoft Agent Governance Toolkit Adoption: This toolkit, released by Microsoft under the MIT license in April 2026, provides the AgentOS policy engine, which can collectively apply runtime policies corresponding to the previously discussed GoalLock, Memory Isolation, and Circuit Breaker with sub-millisecond latency. You can find LangChain and CrewAI integration examples in the official GitHub repository (microsoft/agent-governance-toolkit).

Next Post: Designing an MCP (Model Context Protocol) Gateway in Practice — A Step-by-Step Guide to Applying Zero Trust Least Privilege Architecture to AI Agents

Key Concepts

Why Agentic AI Creates Threats That Traditional Security Tools Cannot Catch

Chain Structure of Three Threats: ASI01 → ASI06 → ASI08

ASI01 — Agent Goal Hijacking

ASI06 — Memory & Context Poisoning

ASI08 — Cascading Failures

Practical Application

Example 1: Defending Against Goal Hijacking with the GoalLock Mechanism

Example 2: Defending Against Poisoning with 5-Layer Memory Isolation

Example 3: Defending Against Cascading Failures with Circuit Breaker Patterns

Pros and Cons Analysis

Advantages

Disadvantages and Precautions

The Most Common Mistakes in Practice

In Conclusion

Reference Materials

Key Concepts

Why Agentic AI Creates Threats That Traditional Security Tools Cannot Catch

Chain Structure of Three Threats: ASI01 → ASI06 → ASI08

ASI01 — Agent Goal Hijacking

ASI06 — Memory & Context Poisoning

ASI08 — Cascading Failures

Practical Application

Example 1: Defending Against Goal Hijacking with the GoalLock Mechanism

Example 2: Defending Against Poisoning with 5-Layer Memory Isolation

Example 3: Defending Against Cascading Failures with Circuit Breaker Patterns

Pros and Cons Analysis

Advantages

Disadvantages and Precautions

The Most Common Mistakes in Practice

In Conclusion

Reference Materials

Recommended Posts

How to Design an MCP Gateway as a Zero Trust PEP — Implementing Least Privilege with OAuth 2.1, OPA, and Epimeral Tokens

MCP Multi-Agent Delegation Pattern: Designing Agent Chain Security with RFC 8693 Token Exchange and Audit Logs

MCP Agent Security Hardening: Practical Defense Guide to Prompt Injection and Tool Poisoning

AI Agent Security Monitored at the Kernel — In-depth Analysis of eBPF-Based Runtime Governance Architecture

Applying OAuth 2.1 Authentication, Token Rate Limiting, and Team Cost Attribution to MCP Servers with Kong AI Gateway 3.12 Without Code Modification

The Complete Guide to MCP Server Observability: From Prometheus Metrics and Distributed Trace to Anomaly Detection