How a TypeScript AI Agent Maintains Conversational Context Across Sessions — Designing Mastra's Memory Layer
When building AI agents, you eventually hit a wall. A user clearly stated their name in a previous conversation, but when a new session opens, the agent has no memory of it whatsoever. To fix this, you might first think "can't I just attach RAG (Retrieval-Augmented Generation: a method that performs a vector search on every query to pull in relevant documents and include them in the context)?" — but when you try it, the vector search results vary slightly each turn, causing inconsistent agent responses. Before long, you end up writing more memory management code than business logic, manually wiring up a DB, stuffing in context, and assembling prompts.
Mastra is a TypeScript-native agent framework that solves this problem at the framework level. Unlike LangChain Memory, which is centered on the Python ecosystem, the @mastra/memory package alone provides a layered strategy covering short-term message history, structured user state management, semantics-based long-term retrieval, and — released in early 2026 — compression-based Observational Memory. This article walks through how Mastra's four memory types work, which strategy to choose in which situation, and shows real code throughout.
Core Concepts
Mastra Memory's Two-Level Scoping Structure
To understand Mastra memory, you first need to grasp two axes: resource and thread.
- resource: An identifier for a user or entity. Holds memory shared across all conversations for the same user.
- thread: The ID of an individual conversation session. Even for the same user, different threads are managed independently.
await agent.generate('Do you remember what I said before?', {
  memory: { resource: 'user-001', thread: 'session-abc' },
})
For a customer support chatbot, you might use a customer ID as resource and a ticket ID as thread. This cleanly separates "what has this customer inquired about before?" (resource level) from "the conversation flow for this ticket" (thread level).
Term definition
resource is the identifier for "who," and thread is the identifier for "which conversation." Calls that pass the same resource and thread pair always attach to the same memory context.
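To make the two axes concrete, here is a minimal in-memory sketch of resource and thread scoping. It is not Mastra's storage code; the `ScopedStore` class and its method names are hypothetical and exist only to show how a thread-level view differs from a resource-level view.

```typescript
// Hypothetical model of two-level scoping (illustration only, not Mastra internals).
type Message = { role: 'user' | 'assistant'; text: string }

class ScopedStore {
  // resource -> thread -> ordered messages
  private data = new Map<string, Map<string, Message[]>>()

  append(resource: string, thread: string, message: Message): void {
    const threads = this.data.get(resource) ?? new Map<string, Message[]>()
    const messages = threads.get(thread) ?? []
    messages.push(message)
    threads.set(thread, messages)
    this.data.set(resource, threads)
  }

  // Thread-level view: a single conversation of a single user.
  threadMessages(resource: string, thread: string): Message[] {
    return this.data.get(resource)?.get(thread) ?? []
  }

  // Resource-level view: everything stored for this user, across all threads.
  resourceMessages(resource: string): Message[] {
    const all: Message[] = []
    this.data.get(resource)?.forEach((messages) => all.push(...messages))
    return all
  }
}

const store = new ScopedStore()
store.append('user-001', 'session-abc', { role: 'user', text: 'My name is Kim.' })
store.append('user-001', 'session-xyz', { role: 'user', text: 'I have a billing question.' })
store.append('user-002', 'session-abc', { role: 'user', text: 'Hello.' })

console.log(store.threadMessages('user-001', 'session-abc').length) // 1
console.log(store.resourceMessages('user-001').length) // 2
```

Note that `user-002` reuses the thread ID `session-abc` without seeing the other user's messages, because the lookup always goes through the resource first.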
Four Memory Types
1. Message History — Short-Term Memory
The most basic form: includes the most recent N messages directly in the context window so the LLM can reference the preceding conversation. The default is the last 10 messages, adjustable via lastMessages.
new Memory({
  options: {
    lastMessages: 20,
  },
})
Note
Setting lastMessages too high increases token costs linearly. For short sessions, 10–20 is usually sufficient; for longer conversations, it is recommended to use the other types described below in combination.
Simple, but powerful enough for short sessions. Once conversations grow long, tokens are consumed rapidly, so other strategies are needed beyond that point.
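The linear cost growth is easy to see in a small sketch. The helper names below (`lastMessagesWindow`, `estimateTokens`) are hypothetical, and the four-characters-per-token rule is a rough assumption rather than a real tokenizer.

```typescript
type ChatMessage = { role: 'user' | 'assistant'; text: string }

// Keep only the most recent n messages, preserving their order.
function lastMessagesWindow(messages: ChatMessage[], n: number): ChatMessage[] {
  return n <= 0 ? [] : messages.slice(-n)
}

// Rough estimate of about 4 characters per token (an assumption, not a tokenizer).
function estimateTokens(messages: ChatMessage[]): number {
  return messages.reduce((sum, m) => sum + Math.ceil(m.text.length / 4), 0)
}

// 50 turns of identical 40-character messages, so each costs 10 estimated tokens.
const chatHistory: ChatMessage[] = []
for (let i = 0; i < 50; i++) {
  chatHistory.push({
    role: i % 2 === 0 ? 'user' : 'assistant',
    text: 'Please check the status of my last order',
  })
}

console.log(estimateTokens(lastMessagesWindow(chatHistory, 10))) // 100
console.log(estimateTokens(lastMessagesWindow(chatHistory, 20))) // 200
```

Doubling the window doubles the tokens sent on every single turn, which is why a larger window alone does not scale to long conversations.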
2. Working Memory — The Agent's Sticky Note
Think of it as an "active scratchpad." It stores information that must persist across sessions — user names, preferences, ongoing tasks. Type validation via Zod schema is also supported.
import { z } from 'zod'
import { Memory } from '@mastra/memory'

const memory = new Memory({
  options: {
    // workingMemory is configured under `options`, alongside lastMessages
    workingMemory: {
      enabled: true,
      schema: z.object({
        userName: z.string().optional(),
        preferredLanguage: z.string().optional(),
        ongoingTask: z.string().optional(),
      }),
    },
  },
})
The mechanism feels unfamiliar at first but becomes intuitive with use. When the agent determines during a conversation that Working Memory needs to be updated, it includes an XML-format update block in its response. The Mastra runtime parses this, saves it to storage, and from the next conversation onward automatically injects the current state into the system prompt. In other words, you don't call a separate write API; the update happens as part of the LLM's response generation.
For a code generation agent, storing "this person prefers a functional style and is currently working on the payment module" in Working Memory means the next session doesn't require explaining the full context from scratch.
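The parse, save, and inject cycle described above can be sketched in a few lines. This is an illustration under stated assumptions, not Mastra's parser: the `<working_memory>` tag name, the JSON payload, and all function names are hypothetical.

```typescript
type WorkingState = { userName?: string; preferredLanguage?: string; ongoingTask?: string }

// Assumed tag name for the update block; the real format may differ.
const UPDATE_BLOCK = /<working_memory>([\s\S]*?)<\/working_memory>/

// Split a raw model response into user-visible text and an optional state update.
function extractUpdate(raw: string): { visible: string; update: Partial<WorkingState> | null } {
  const match = UPDATE_BLOCK.exec(raw)
  if (!match) return { visible: raw.trim(), update: null }
  const visible = raw.replace(UPDATE_BLOCK, '').trim()
  try {
    return { visible, update: JSON.parse(match[1]) as Partial<WorkingState> }
  } catch {
    return { visible, update: null } // malformed payload: keep the previous state
  }
}

// Merge the update into the stored state; fields not mentioned persist.
function applyUpdate(current: WorkingState, update: Partial<WorkingState> | null): WorkingState {
  return update ? { ...current, ...update } : current
}

// Render the state for injection into the next system prompt.
function renderForPrompt(current: WorkingState): string {
  return `Current working memory:\n${JSON.stringify(current, null, 2)}`
}

let memoryState: WorkingState = { preferredLanguage: 'TypeScript' }
const rawResponse =
  'Nice to meet you, Kim! <working_memory>{"userName":"Kim","ongoingTask":"payment module"}</working_memory>'
const parsed = extractUpdate(rawResponse)
memoryState = applyUpdate(memoryState, parsed.update)

console.log(parsed.visible) // Nice to meet you, Kim!
console.log(renderForPrompt(memoryState))
```

The user only ever sees `parsed.visible`, while the merged state is what would be prepended to the system prompt in the next session.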
3. Semantic Recall — Semantics-Based Long-Term Memory
Retrieves past messages by semantic similarity rather than keyword matching. The key is being able to surface a relevant conversation even when you don't remember the exact words — like "that time we talked about the payment error."
new Memory({
  options: {
    semanticRecall: {
      topK: 5, // Retrieve the top 5 most similar messages
      messageRange: 2, // Include 2 messages before and after each retrieved message (context preservation)
      scope: 'resource', // Search across all threads for the same user
    },
  },
})
Term definition
messageRange fetches not just the matched message but also surrounding context. This is useful for preserving conversational flow rather than retrieving isolated sentences.
Internally it uses a vector DB and embeddings. The default embedding model is OpenAI's text-embedding-3 series, which can be swapped out through the embedder setting on the Memory instance. If using PostgreSQL as storage, the pgvector extension must be installed; with LibSQL (Turso), it works out of the box with no additional setup.
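How topK and messageRange interact is easier to see with toy data. The sketch below uses hand-written two-dimensional vectors in place of real embeddings and a hypothetical `recall` function; it illustrates the retrieval idea only and is not Mastra's implementation.

```typescript
type StoredMessage = { id: number; text: string; embedding: number[] }

function cosine(a: number[], b: number[]): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB)
}

// Return the topK most similar messages plus `messageRange` neighbours on each side,
// de-duplicated and in chronological order.
function recall(log: StoredMessage[], query: number[], topK: number, messageRange: number): StoredMessage[] {
  const hits = log
    .map((m, index) => ({ index, score: cosine(m.embedding, query) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
  const keep = new Set<number>()
  for (const hit of hits) {
    const from = Math.max(0, hit.index - messageRange)
    const to = Math.min(log.length - 1, hit.index + messageRange)
    for (let i = from; i <= to; i++) keep.add(i)
  }
  return log.filter((_, i) => keep.has(i))
}

// Toy embeddings: the second dimension stands for "payment-related".
const threadLog: StoredMessage[] = [
  { id: 1, text: 'Hi there', embedding: [1, 0] },
  { id: 2, text: 'How do I export a report?', embedding: [0.9, 0.1] },
  { id: 3, text: 'My card was declined at checkout', embedding: [0, 1] },
  { id: 4, text: 'The error code was 402', embedding: [0.2, 0.9] },
  { id: 5, text: 'Thanks, that fixed it', embedding: [1, 0.1] },
  { id: 6, text: 'Can I change my avatar?', embedding: [1, 0] },
]

const recalled = recall(threadLog, [0, 1], 1, 1)
console.log(recalled.map((m) => m.id)) // [ 2, 3, 4 ]
```

With `topK: 1` the only direct hit is message 3, and `messageRange: 1` pulls in messages 2 and 4 so the model sees the exchange around the match rather than one isolated sentence.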
4. Observational Memory — Compression-Based Long-Term Memory
Honestly, this is the most interesting part. Released in February 2026, this new feature has an Observer agent running in the background, automatically compressing old conversations into dense, structured notes.
It divides the context window into two regions:
| Region | Contents |
|---|---|
| Observation block | Key information from previous conversations, compressed by the Observer agent |
| Current session raw messages | Recent, uncompressed conversation |
When unobserved messages reach 30,000 tokens (configurable), the Observer agent automatically runs compression. On the LongMemEval benchmark, it achieved 94.87% with gpt-4.1-mini.
How to read the benchmark
LongMemEval measures accuracy of correctly recalling specific facts from long conversations. Note that the comparison RAG approach (80.05% with GPT-4o) uses a different model scale. It is interesting that a smaller model (gpt-4.1-mini) outperformed a larger model's (GPT-4o) RAG approach, but isolating the pure effect of the memory architecture alone is difficult. You can read the directional signal, but it is recommended to validate on your own domain before adopting in production.
Unlike RAG, which performs a vector search every turn causing the context to keep changing, Observational Memory maintains a stable structure. This makes it a good fit for prompt caching — a feature in AI APIs that discounts costs when the same prompt prefix repeats — enabling additional token cost savings.
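The threshold trigger and the stable prefix can both be shown with a small simulation. This is a sketch under assumptions: `ObservationalBuffer` is a hypothetical class, and the `compress` callback is a stand-in for the Observer agent, which in reality is an LLM call.

```typescript
type Turn = { text: string; tokens: number }

class ObservationalBuffer {
  observations: string[] = [] // compressed notes, only ever appended to
  unobserved: Turn[] = [] // raw messages not yet compressed
  private threshold: number
  private compress: (turns: Turn[]) => string

  constructor(threshold: number, compress: (turns: Turn[]) => string) {
    this.threshold = threshold
    this.compress = compress
  }

  add(turn: Turn): void {
    this.unobserved.push(turn)
    const pending = this.unobserved.reduce((sum, t) => sum + t.tokens, 0)
    if (pending >= this.threshold) {
      // Threshold reached: fold the raw turns into one observation.
      this.observations.push(this.compress(this.unobserved))
      this.unobserved = []
    }
  }

  // Context layout: observation block first, raw recent messages after it.
  context(): string {
    return this.observations.concat(this.unobserved.map((t) => t.text)).join('\n')
  }
}

// Stand-in compressor; a real Observer would produce structured notes.
const omBuffer = new ObservationalBuffer(30000, (turns) => `Observation: ${turns.length} turns summarized`)
omBuffer.add({ text: 'turn 1', tokens: 12000 })
omBuffer.add({ text: 'turn 2', tokens: 12000 })
omBuffer.add({ text: 'turn 3', tokens: 12000 }) // 36,000 pending tokens: compression fires
const afterCompression = omBuffer.context()
omBuffer.add({ text: 'turn 4', tokens: 12000 })
const afterNextTurn = omBuffer.context()

console.log(afterCompression) // Observation: 3 turns summarized
console.log(afterNextTurn.indexOf(afterCompression) === 0) // true
```

Because new turns are only appended after the observation block, the earlier context remains an exact prefix of the later one, which is the property prompt caching relies on.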
Practical Application
Example 1: Basic Memory Setup — Starting with the Development Environment
The recommended starting stack is LibSQL. It lets you test immediately with in-memory storage, no separate DB required, keeping the barrier to entry low.
pnpm add @mastra/core @mastra/memory @mastra/libsql @ai-sdk/openai

import { Mastra } from '@mastra/core'
import { Agent } from '@mastra/core/agent'
import { Memory } from '@mastra/memory'
import { LibSQLStore } from '@mastra/libsql'
import { openai } from '@ai-sdk/openai'

export const assistantAgent = new Agent({
  name: 'assistant',
  instructions: 'You are an assistant that remembers user information and responds in context.',
  model: openai('gpt-4o-mini'), // Mastra specifies models using Vercel AI SDK adapters
  memory: new Memory({
    options: {
      lastMessages: 20,
      // Depending on your Mastra version, semanticRecall may also need a vector store and embedder set on Memory
      semanticRecall: {
        topK: 5,
        messageRange: 2,
        scope: 'resource',
      },
    },
  }),
})

// Register the agent so its Memory can use the storage configured on the Mastra instance
export const mastra = new Mastra({
  agents: { assistantAgent },
  storage: new LibSQLStore({
    url: process.env.DATABASE_URL ?? ':memory:', // In-memory is sufficient during development
  }),
})

// First message: pass user information
const res1 = await assistantAgent.generate(
  'My name is Kim Cheolsu, and I am a full-stack developer working in TypeScript.',
  { memory: { resource: 'user-001', thread: 'session-abc' } }
)
console.log(res1.text)
// → "Hello, Kim Cheolsu! So you work as a TypeScript full-stack developer..."

// Calling with the same resource + thread references the previous context as-is
const res2 = await assistantAgent.generate(
  'What kind of developer did I say I was?',
  { memory: { resource: 'user-001', thread: 'session-abc' } }
)
console.log(res2.text)
// → "You said you are a TypeScript-based full-stack developer."

| Code point | Description |
|---|---|
| ':memory:' | In-memory SQLite for development and testing. Data is lost on server restart. |
| openai('gpt-4o-mini') | Mastra specifies models using the Vercel AI SDK adapter pattern. Requires the @ai-sdk/openai package. |
| resource: 'user-001' | User identifier. Use the actual user ID if you have a login system. |
| thread: 'session-abc' | Conversation session ID. Generate a new thread ID each time a new chat is opened. |
Example 2: Customer Support Bot — Maintaining Customer Info with Working Memory
This is a situation frequently encountered in practice. The mistake I made the first time I used this pattern was setting resource and thread to the same value. Separating the customer ID and ticket ID cleanly divides "what has this customer inquired about before?" (resource level) from "the conversation for this ticket" (thread level).
import { z } from 'zod'
import { Agent } from '@mastra/core/agent'
import { Memory } from '@mastra/memory'
import { openai } from '@ai-sdk/openai'

const customerMemory = new Memory({
  options: {
    lastMessages: 15,
    // workingMemory is configured under `options`
    workingMemory: {
      enabled: true,
      scope: 'resource', // Keep customer info across all tickets for this customer
      schema: z.object({
        customerName: z.string().optional(),
        subscriptionPlan: z.string().optional(),
        previousIssues: z.array(z.string()).optional(),
        preferredContactTime: z.string().optional(),
      }),
    },
    semanticRecall: {
      topK: 3,
      messageRange: 1,
      scope: 'resource', // Search across all previous tickets for this customer
    },
  },
})

export const supportAgent = new Agent({
  name: 'support',
  instructions: `You are a customer support agent.
Check the customer information in Working Memory,
and refer to previous inquiries to avoid repeating guidance.`,
  model: openai('gpt-4o'),
  memory: customerMemory,
})

async function handleNewTicket(customerId: string, ticketId: string, message: string) {
  return await supportAgent.generate(message, {
    memory: {
      resource: customerId, // Shared memory at the customer level
      thread: ticketId, // Independent conversation at the ticket level
    },
  })
}

// Customer sends the first message
const res1 = await handleNewTicket(
  'customer-001',
  'ticket-2026-001',
  'Hello, I am Kim Cheolsu and I am on the Pro plan. I ran into a payment error.'
)
// → When generating a response, the agent automatically saves customerName and subscriptionPlan to Working Memory

// Even when a different ticket is opened, customer info is read from Working Memory
const res2 = await handleNewTicket(
  'customer-001',
  'ticket-2026-002', // New ticket ID
  'I have another problem.'
)
// → "Kim Cheolsu, is this about your Pro plan? I see you also had a payment error before..."

In this structure, customer information stored in Working Memory (subscriptionPlan, previousIssues, etc.) persists across ticket changes, and Semantic Recall with scope: 'resource' includes conversations from previous tickets in its search scope.
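The effect of the recall scope can be sketched as a simple filter over the messages that are eligible for similarity search. The `searchable` function and the record shape are hypothetical and stand in for whatever the vector store does internally.

```typescript
type IndexedMessage = { resource: string; thread: string; text: string }
type RecallScope = 'thread' | 'resource'

// Which stored messages may be searched for the current conversation?
function searchable(
  all: IndexedMessage[],
  current: { resource: string; thread: string },
  scope: RecallScope,
): IndexedMessage[] {
  return all.filter(
    (m) => m.resource === current.resource && (scope === 'resource' || m.thread === current.thread),
  )
}

const indexedMessages: IndexedMessage[] = [
  { resource: 'customer-001', thread: 'ticket-2026-001', text: 'Payment failed on the Pro plan' },
  { resource: 'customer-001', thread: 'ticket-2026-002', text: 'I have another problem' },
  { resource: 'customer-002', thread: 'ticket-2026-003', text: 'How do I reset my password?' },
]
const currentTicket = { resource: 'customer-001', thread: 'ticket-2026-002' }

console.log(searchable(indexedMessages, currentTicket, 'thread').length) // 1
console.log(searchable(indexedMessages, currentTicket, 'resource').length) // 2
```

With the thread scope, the second ticket cannot see the earlier payment failure; with the resource scope it can, while another customer's messages stay excluded in both cases.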
Example 3: Long-Running Agent — Enabling Observational Memory
Suitable for cases where conversations grow very long, such as medical records or long-term projects. The usage is identical to the standard generate() call pattern, and compression is handled automatically in the background.
import { Agent } from '@mastra/core/agent'
import { Memory } from '@mastra/memory'
import { openai } from '@ai-sdk/openai'

const longTermMemory = new Memory({
  observationalMemory: {
    enabled: true,
    compressionThreshold: 30000, // Auto-compress when reaching 30,000 tokens
  },
  options: {
    lastMessages: 10,
  },
})

export const medicalAgent = new Agent({
  name: 'medical-assistant',
  instructions: `You are an assistant that manages a patient's medical history.
Refer to the compressed medical records and analyze how they relate to the current symptoms.`,
  model: openai('gpt-4o'),
  memory: longTermMemory,
})

// The calling pattern is identical to a regular agent
const res = await medicalAgent.generate(
  'I have a headache today. What medication was I prescribed last month?',
  { memory: { resource: 'patient-001', thread: 'visit-2026-05-21' } }
)
// If the Observer agent has already compressed previous medical records,
// even tens of thousands of tokens of records are referenced efficiently from the observation block

Caution
Observational Memory is a structure that concentrates sensitive information into long-term artifacts. In regulated environments such as healthcare or finance, the blast radius of a data breach can be wider than with a RAG approach. It is recommended to design encryption and access control policies alongside this feature.
Pros and Cons Analysis
Advantages
| Item | Description |
|---|---|
| Structured memory layers | Short-term (Message History) → Working (Working Memory) → Long-term (Semantic Recall / Observational) layers are clearly separated, enabling strategy selection by purpose |
| Cost efficiency | Observational Memory offers 5–40x text compression + prompt caching compatibility, enabling up to 10x token cost reduction |
| Long-term memory without a vector DB | Observational Memory maintains long-term context without a vector DB, reducing infrastructure complexity |
| TypeScript-native | Zod schema validation, type-safe configuration, no Python server management required |
| Flexible storage options | Choose from LibSQL (including for development), PostgreSQL, MongoDB, or Upstash to match your environment |
Disadvantages and Caveats
Personally, the item I am most concerned about is the concentration of security risk. The fact that Observational Memory compresses and archives all conversations into long-term artifacts means that a single breach exposes months of a user's conversation history in its entirety. This contrasts with RAG, which distributes risk across individual queries.
| Item | Description | Mitigation |
|---|---|---|
| Concentrated security risk | Observational Memory stores sensitive information in long-term artifacts → wide blast radius on breach | Storage encryption + strict access control design required |
| Observer agent cost | Background LLM calls occur during compression → additional costs with frequent compression | Set compressionThreshold sufficiently high |
| Compression loss | Detail may be lost during 5–40x compression | For regulated environments requiring original preservation, Semantic Recall is recommended |
| Benchmark source | The 94.87% figure is from Mastra's own measurements; independent third-party verification is lacking | Self-validation on your own domain required before production adoption |
| PostgreSQL pgvector | Separate installation of the pgvector extension required when using Semantic Recall with PostgreSQL | Prepare a CREATE EXTENSION vector; migration in advance |
| Framework lock-in | Design deeply integrated into the Mastra ecosystem → high migration cost to other frameworks | Review long-term maintenance plans before adopting |
Term supplement
pgvector is a PostgreSQL extension that supports vector operations. Semantic Recall internally uses this vector index when computing semantic similarity.
After laying out the pros and cons, you realize that choosing a memory strategy is ultimately not a purely technical decision. Data security requirements, cost structure, and the team's infrastructure operation capabilities all need to be factored in together.
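The trade-off between Observer cost and compression loss in the table above comes down to simple arithmetic. The sketch below assumes, for illustration, that compression fires once per threshold's worth of unobserved tokens and that the 5–40x range applies uniformly; the function names are hypothetical.

```typescript
// Number of Observer runs if compression fires every `threshold` unobserved tokens.
function observerRuns(totalTokens: number, threshold: number): number {
  return Math.floor(totalTokens / threshold)
}

// Tokens remaining after compressing raw text at a given ratio.
function compressedTokens(rawTokens: number, ratio: number): number {
  return Math.ceil(rawTokens / ratio)
}

const conversationTokens = 300000

console.log(observerRuns(conversationTokens, 30000)) // 10
console.log(observerRuns(conversationTokens, 60000)) // 5
console.log(compressedTokens(conversationTokens, 5)) // 60000
console.log(compressedTokens(conversationTokens, 40)) // 7500
```

Doubling the threshold halves the number of background LLM calls, while the compression ratio decides how much detail survives: at 40x, a 300,000-token history is reduced to about 7,500 tokens of notes.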
The Most Common Mistakes in Practice
- Setting resource and thread to the same value: all conversations merge into a single thread, mixing different contexts together. The user ID must go into resource and the conversation session ID into thread separately.
- Using the :memory: storage from the development environment in production: all memory is lost on server restart. In production, you must switch to a LibSQL file path or PostgreSQL.
- Relying solely on Semantic Recall and omitting Working Memory: for information that always needs to be referenced, such as user names and preferences, storing it in a structured way in Working Memory is far more stable and less costly than pulling it in with a vector search each time.
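The first two mistakes are mechanical enough to catch with a startup check. The following is a minimal sketch with a hypothetical `checkMemorySetup` helper; it is not part of Mastra, and the third mistake is a design choice that a check like this cannot detect.

```typescript
type MemoryScope = { resource: string; thread: string }

// Return a list of human-readable problems; an empty list means the setup looks sane.
function checkMemorySetup(scope: MemoryScope, storageUrl: string, isProduction: boolean): string[] {
  const problems: string[] = []
  if (scope.resource === scope.thread) {
    problems.push('resource and thread are identical; use a user ID and a separate session ID')
  }
  if (isProduction && storageUrl === ':memory:') {
    problems.push('in-memory storage in production; all memory is lost on restart')
  }
  return problems
}

console.log(checkMemorySetup({ resource: 'user-001', thread: 'user-001' }, ':memory:', true).length) // 2
console.log(checkMemorySetup({ resource: 'user-001', thread: 'session-abc' }, 'file:./memory.db', true).length) // 0
```

Running a check like this once at boot, and failing fast in production, is cheaper than discovering merged threads after data has accumulated.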
Closing Thoughts
Memory design is an architectural decision that is difficult to change later with "I'll fix it sometime." If you don't establish the resource and thread layers properly from the start, you end up with problems where user data gets mixed together or another user's context bleeds in unintentionally. Treating memory layer design as part of agent design from the beginning costs far less than refactoring after data has accumulated. It is recommended to lock in the layer structure right now.
If you're trying this for the first time, here is a suggested approach:
- Add dependencies with pnpm add @mastra/core @mastra/memory @mastra/libsql @ai-sdk/openai, then test resource + thread combinations with url: ':memory:' in-memory storage. Start by directly confirming that context is preserved even as sessions change.
- Define a Zod schema in Working Memory and try storing structured information like user preferences or progress. You'll immediately feel the difference compared to using only Message History.
- After enough conversations have accumulated, activate Semantic Recall and check whether past context is retrieved correctly when the same topic is phrased differently. You'll get a feel for which topK and messageRange values suit your domain.
- Simulate a scenario where conversations accumulate beyond 30,000 tokens and directly examine the compression output from Observational Memory. Seeing which information remains in the observation block and which is lost will give you a basis for the right compressionThreshold for your domain.