Horizontally Scaling a Yjs Collaboration Server with Hocuspocus + Redis: Sticky Session and Document Persistence Strategies
Have you ever had real-time collaborative editing work perfectly on a single server, only to see things go wrong the moment you added a second instance? User A types some text that disappears and reappears on user B's screen, or two people open the same document and see different content depending on which server they're on. I've been there. Hocuspocus handles conflict resolution itself thanks to Yjs CRDT, but the real problem is how to share state across servers.
This post is for teams who started with a single Hocuspocus server and are now thinking about scaling out, or for those building a Tiptap-based SaaS and designing their infrastructure for the first time. I'll walk through how I arrived at a setup that uses Redis Pub/Sub to synchronize updates between instances and the Database extension to guarantee persistence: for safe horizontal scaling, Redis handles only inter-instance synchronization, while document persistence lives in the Database extension. I'll also discuss honestly the real-world limitations of sticky sessions and the tradeoffs behind each choice, including a layered persistence strategy using PostgreSQL and S3.
Core Concepts
Yjs CRDT and the Division of Responsibilities in Hocuspocus
Yjs is a library that implements CRDTs (Conflict-free Replicated Data Types). A CRDT is a mathematical structure that automatically merges edits made by multiple clients to the same document, even simultaneous ones; if the network drops and reconnects, everything still merges without conflicts.
CRDT (Conflict-free Replicated Data Type): A data structure in distributed systems where multiple nodes can independently modify data and always converge to the same result. It merges automatically later, even without a server or with the network disconnected.
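To make the convergence property concrete, here is a minimal sketch of one of the simplest CRDTs, a grow-only set. This is not the algorithm Yjs uses (Yjs implements a far richer sequence CRDT), just an illustration of the merge-and-converge idea:

```typescript
// Minimal grow-only set (G-Set) CRDT: the simplest structure with the CRDT
// convergence property. Merging is a set union, which is commutative,
// associative, and idempotent, so replicas always converge.
class GSet<T> {
  private items = new Set<T>()

  add(item: T): void {
    this.items.add(item)
  }

  // Merge another replica's state into this one (set union).
  merge(other: GSet<T>): void {
    for (const item of other.items) this.items.add(item)
  }

  values(): T[] {
    return [...this.items].sort()
  }
}

// Two replicas edit independently while "offline"...
const replicaA = new GSet<string>()
const replicaB = new GSet<string>()
replicaA.add('hello')
replicaB.add('world')

// ...then sync in either order: both converge to the same state.
replicaA.merge(replicaB)
replicaB.merge(replicaA)
console.log(replicaA.values()) // ['hello', 'world']
```

Yjs gives the same guarantee for ordered text, which is a much harder problem, but the mental model of "merge is deterministic regardless of order" carries over directly.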
Hocuspocus is a WebSocket backend for Yjs. It runs on Node.js 18+, Bun, Deno, and Cloudflare Workers, and functionality can be composed via extensions.
| Component | Role |
|---|---|
| `@hocuspocus/server` | WebSocket connection management, Yjs document coordination |
| `@hocuspocus/extension-redis` | Propagates updates between instances via Redis Pub/Sub |
| `@hocuspocus/extension-database` | Stores documents in external DBs like PostgreSQL, MySQL, MongoDB |
| Redis Pub/Sub | Acts as a broadcast channel — does not store data |
How the Redis Extension Works
What the Redis extension does is simpler than you might think. When a client connected to instance A edits a document, A publishes that update to a Redis channel. Instances B and C are subscribed to the same channel, receive the update, and forward it to the clients connected to them.
The important point is that data is not stored in Redis. If Redis restarts, any messages in between are simply gone. I've personally seen cases in production where Redis was mistakenly treated as a persistence store — I'll revisit this point below.
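The relay pattern can be sketched with an in-memory stand-in for the Redis channel (illustration only; a real deployment uses an actual Redis client, and `Channel`, `Instance`, and `handleLocalUpdate` are invented names for this sketch):

```typescript
// In-memory stand-in for a Redis Pub/Sub channel, to illustrate the relay
// pattern. Messages are fire-and-forget: nothing is retained, so a restarted
// broker (or an absent subscriber) simply misses them.
type Handler = (message: string) => void

class Channel {
  private subscribers: Handler[] = []

  subscribe(handler: Handler): void {
    this.subscribers.push(handler)
  }

  publish(message: string): void {
    for (const handler of this.subscribers) handler(message)
  }
}

// Each "instance" forwards updates it receives from the channel to the
// clients connected to it (here just collected in an array).
class Instance {
  received: string[] = []

  constructor(private channel: Channel) {
    channel.subscribe((msg) => this.received.push(msg))
  }

  // A local client edit: broadcast so other instances can forward it on.
  handleLocalUpdate(update: string): void {
    this.channel.publish(update)
  }
}

const channel = new Channel()
const instanceA = new Instance(channel)
const instanceB = new Instance(channel)

instanceA.handleLocalUpdate('yjs-update-1')
console.log(instanceB.received) // ['yjs-update-1']
```

Notice there is no storage anywhere in this picture: if the channel goes away, in-flight messages go with it, which is exactly why the Database extension is still needed.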
Redlock and Distributed Locking
With multiple instances, this situation can occur: client A is on instance 1, client B is on instance 2, and both edit the same document almost simultaneously. If both instances try to save the same document to the database at the same time, one write risks overwriting the other. I got burned by Redlock once when the server clocks were nearly 30 seconds apart.
The Hocuspocus Redis extension uses the Redlock algorithm internally to address this. It requests locks from multiple Redis nodes simultaneously and only considers the lock acquired when a majority succeed, so even if a single Redis node fails, the lock can be maintained.
Note: Redlock requires that system clocks across servers don't drift significantly. NTP synchronization in a distributed environment is not optional — it's mandatory.
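The quorum-and-validity rule at the heart of Redlock can be sketched like this (a simplified illustration only; `AcquireResult` and `isLockValid` are invented names for this sketch, and real clients also handle retries, lock extension, and release):

```typescript
// Simplified core of the Redlock validity check. A lock only counts as held
// when (1) a majority of independent Redis nodes granted it, and (2) the TTL
// minus acquisition time minus an assumed clock-drift margin is still
// positive. The drift margin is exactly why large clock skew is dangerous.
interface AcquireResult {
  granted: number     // how many Redis nodes granted the lock
  totalNodes: number  // how many nodes were asked
  elapsedMs: number   // wall-clock time spent acquiring across all nodes
}

function isLockValid(result: AcquireResult, ttlMs: number, driftMs: number): boolean {
  const quorum = Math.floor(result.totalNodes / 2) + 1
  const validityMs = ttlMs - result.elapsedMs - driftMs
  return result.granted >= quorum && validityMs > 0
}

// 3 of 5 nodes granted quickly with a 10s TTL: the lock is held.
console.log(isLockValid({ granted: 3, totalNodes: 5, elapsedMs: 50 }, 10_000, 100)) // true
// Only 2 of 5 granted: no quorum, no lock.
console.log(isLockValid({ granted: 2, totalNodes: 5, elapsedMs: 50 }, 10_000, 100)) // false
// Quorum reached, but assumed drift eats the entire TTL: the lock cannot be trusted.
console.log(isLockValid({ granted: 3, totalNodes: 5, elapsedMs: 50 }, 10_000, 9_960)) // false
```

The third case is the 30-seconds-of-clock-drift scenario in miniature: with enough skew, a "held" lock may already have expired on another node's clock, so two instances can both believe they own the write.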
Practical Application
Example 1: Standard Horizontal Scaling Setup (Hocuspocus + Redis + PostgreSQL)
This is the configuration most teams start with. If the number of concurrently open documents is in the hundreds or fewer and DB write load isn't particularly high, this pattern is sufficient.
```
[Client A] ──WebSocket──▶ [Load Balancer (Sticky Session)]
[Client B] ──WebSocket──▶        │               │
                        [Hocuspocus #1]   [Hocuspocus #2]
                                │               │
                          [Redis Pub/Sub (sync)]
                                │               │
                        [Database Extension] [Database Extension]
                                │               │
                                └───────┬───────┘
                                  [PostgreSQL]
```

It's important that both instances independently access PostgreSQL.
```typescript
// hocuspocus-server.ts (run identically on each instance)
import { Server } from '@hocuspocus/server'
import { Redis } from '@hocuspocus/extension-redis'
import { Database } from '@hocuspocus/extension-database'
import { Pool } from 'pg'

const pool = new Pool({ connectionString: process.env.DATABASE_URL })

const server = Server.configure({
  port: Number(process.env.PORT) || 1234,
  // Using both onLoadDocument and the Database extension's fetch simultaneously
  // will apply Y.applyUpdate twice to the same document, corrupting its state.
  // Use only one of the two. Here we use only the Database extension.
  extensions: [
    new Redis({
      host: process.env.REDIS_HOST ?? 'localhost',
      port: 6379,
    }),
    new Database({
      fetch: async ({ documentName }) => {
        // Add try/catch in production.
        // If fetch fails, the document may open in an empty state.
        const result = await pool.query(
          'SELECT data FROM documents WHERE name = $1',
          [documentName]
        )
        return result.rows[0]?.data ?? null
      },
      store: async ({ documentName, state }) => {
        // Add try/catch in production.
        await pool.query(
          `INSERT INTO documents (name, data, updated_at)
           VALUES ($1, $2, NOW())
           ON CONFLICT (name) DO UPDATE
           SET data = EXCLUDED.data, updated_at = NOW()`,
          [documentName, Buffer.from(state)]
        )
      },
    }),
  ],
})

server.listen()
console.log(`Hocuspocus running on port ${server.configuration.port}`)
```

Here is a sticky session configuration using HAProxy. Since WebSocket connections are long-lived, the `timeout tunnel` value needs to be set large enough.
```
# haproxy.cfg
frontend ws_frontend
    bind *:80
    default_backend ws_backend
    # option http-server-close disables HTTP keep-alive.
    # This is fine for a WebSocket-only frontend,
    # but may degrade performance if also handling regular HTTP traffic.

backend ws_backend
    balance leastconn
    cookie SERVERID insert indirect nocache
    timeout connect 5s
    timeout client 1h
    timeout tunnel 24h
    server hocuspocus1 hocuspocus-1:1234 check cookie s1
    server hocuspocus2 hocuspocus-2:1234 check cookie s2
```

If you prefer Nginx, you can use the following approach.
```nginx
upstream hocuspocus_backend {
    hash $remote_addr consistent;
    server hocuspocus-1:1234;
    server hocuspocus-2:1234;
}

server {
    listen 80;
    location / {
        proxy_pass http://hocuspocus_backend;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_read_timeout 86400s;
    }
}
```

Key Insight: The Redis extension and the Database extension must be used together. With Redis alone, synchronization works, but data is lost on server restart.
Example 2: Layered Document Persistence Strategy (High-Load Environments)
From this example onward, the content targets high-load environments or situations requiring audit logs and version history. Consider this when the number of concurrently open documents exceeds several thousand, or when monitoring shows DB write latency starting to climb. It's recommended to start with Example 1 and switch only after identifying a real bottleneck.
One thing worth clarifying: the state received by the Database extension's store callback is not a delta of the last change — it is the full state of the Y.Doc. The entire document is passed on every call. Understanding this also simplifies the fetch logic: you only need to retrieve the single most recent record to fully restore the state.
A layered strategy that takes advantage of this works as follows: always upsert the latest state into the documents table, while periodically recording a recovery-point snapshot into a separate table.
```typescript
import { Server } from '@hocuspocus/server'
import { Redis } from '@hocuspocus/extension-redis'
import { Database } from '@hocuspocus/extension-database'
import { Pool } from 'pg'

const pool = new Pool({ connectionString: process.env.DATABASE_URL })

// Update counter (in-memory, operates independently per instance).
// Note: with multiple instances, each has its own counter,
// so snapshot timing will vary across instances.
// In high-load environments, it's safer to move this to a separate scheduler process.
const updateCounters = new Map<string, number>()
const SNAPSHOT_INTERVAL = 500

const server = Server.configure({
  extensions: [
    new Redis({ host: process.env.REDIS_HOST ?? 'localhost', port: 6379 }),
    new Database({
      fetch: async ({ documentName }) => {
        // Since state is always the full state, fetching the single latest record is sufficient.
        const result = await pool.query(
          'SELECT data FROM documents WHERE name = $1',
          [documentName]
        )
        return result.rows[0]?.data ?? null
      },
      store: async ({ documentName, state }) => {
        // Add try/catch in production.
        // state is the full Y.Doc state (not a delta).
        const count = (updateCounters.get(documentName) ?? 0) + 1
        updateCounters.set(documentName, count)

        // Always upsert to the latest state
        await pool.query(
          `INSERT INTO documents (name, data, updated_at) VALUES ($1, $2, NOW())
           ON CONFLICT (name) DO UPDATE SET data = EXCLUDED.data, updated_at = NOW()`,
          [documentName, Buffer.from(state)]
        )

        // Periodically record a recovery-point snapshot
        if (count % SNAPSHOT_INTERVAL === 0) {
          await pool.query(
            `INSERT INTO document_snapshots (name, data, created_at) VALUES ($1, $2, NOW())`,
            [documentName, Buffer.from(state)]
          )
          // Snapshots older than the previous one can be moved to S3 by a separate archiving process
        }
      },
    }),
  ],
})

server.listen()
```

| Storage Layer | Contents | Access Pattern |
|---|---|---|
| PostgreSQL `documents` | Latest full state per document (upsert) | On document open |
| PostgreSQL `document_snapshots` | Periodic recovery points | Disaster recovery, version history |
| S3 | Cold archive of old snapshots | Accessed only for auditing or long-term retention |
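For reference, here is a schema sketch for the two PostgreSQL tables the examples assume. Column names match the queries above; the types, the surrogate key, and the index are my assumptions, so adjust them to your needs:

```sql
-- Latest full state per document (upserted on every store call).
-- The primary key on name is what ON CONFLICT (name) relies on.
CREATE TABLE documents (
  name        TEXT PRIMARY KEY,
  data        BYTEA NOT NULL,
  updated_at  TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Periodic recovery points; old rows can be archived to S3 later.
CREATE TABLE document_snapshots (
  id          BIGSERIAL PRIMARY KEY,
  name        TEXT NOT NULL,
  data        BYTEA NOT NULL,
  created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Speeds up "latest snapshots for a document" lookups during recovery.
CREATE INDEX document_snapshots_name_idx
  ON document_snapshots (name, created_at DESC);
```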
Example 3: y-redis Worker Separation Pattern (Alternative Architecture)
This example is also intended for intermediate and above environments. It's useful as a reference when you want a broader view of architectural options.
y-redis takes a different approach from Hocuspocus's Redis extension. It uses Redis Streams as a message queue from the start, with a separate Worker process asynchronously flushing to S3 or PostgreSQL.
```
[WebSocket Server] ──update──▶ [Redis Streams]
                                     │
                              [y-redis Worker]
                                     │
                            [S3 or PostgreSQL]
```

The advantages are that the WebSocket server doesn't hold Y.Doc in memory, making it more memory-efficient, and the server can be restarted at any time with state recoverable from Redis Streams.
One thing worth mentioning: before adopting y-redis for a commercial service, check the license. It uses a dual AGPL / commercial license structure, which, unlike Hocuspocus (MIT), may involve additional costs. This is why many teams choose the Hocuspocus + Redis combination even when y-redis looks architecturally more elegant.
Pros and Cons
Advantages
| Item | Details |
|---|---|
| Automatic CRDT merging | Automatic conflict-free merging on reconnection after a network partition — no custom conflict resolution logic needed |
| Incremental scaling | Start with a single instance, then add horizontal scaling by simply adding the Redis extension |
| Multiple runtime support | Works identically on Node.js, Bun, Deno, and Cloudflare Workers |
| Rich ecosystem | Official Tiptap support, various Database drivers including PostgreSQL, MySQL, MongoDB |
| Offline editing | Automatic synchronization when a client that edited offline reconnects |
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| Sticky session fragility | If a server fails, all connections to that server are dropped | Configure graceful drain, transition to stateless design |
| Data loss on Redis restart | The Redis extension handles only synchronization — no storage | Must use the Database extension in parallel |
| No CPU load distribution | All instances process all updates, so CPU is not distributed | Shard by document ID into separate instances (out of scope for this post) |
| Redlock clock dependency | Large clock drift between servers can cause lock malfunction | NTP synchronization is mandatory |
Sticky session fragility isn't much of a problem for small teams, but the moment you use the Horizontal Pod Autoscaler (HPA) in Kubernetes, forced connection termination during scale-in becomes the first barrier you'll hit. Attempts to use Redis for CPU load distribution are also common, but the Redis extension is for message propagation, not load balancing. If that's the goal, look into document-ID-based sharding or a separate architecture like y-redis.
Sticky Session: A technique where a load balancer always routes a specific client's requests to the same server. Required when state is kept in server memory (as with WebSockets), but connections can drop during server failures or dynamic scaling.
Redis Pub/Sub vs Streams: If immediacy matters and message loss is acceptable, Pub/Sub is appropriate. If you need message reprocessing or consumer group management, Redis Streams is recommended.
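This difference can be sketched with two in-memory stand-ins (illustration only; the real operations are Redis `PUBLISH`/`SUBSCRIBE` versus `XADD`/`XREADGROUP`, and the class names here are invented):

```typescript
// Pub/Sub: fire-and-forget. A consumer that was not subscribed at publish
// time (or that had crashed) never sees the message.
class PubSub {
  private handlers: ((m: string) => void)[] = []
  subscribe(h: (m: string) => void): void { this.handlers.push(h) }
  publish(m: string): void { for (const h of this.handlers) h(m) }
}

// Stream: entries are retained, so a late or restarted consumer can resume
// reading from its last acknowledged offset.
class Stream {
  private entries: string[] = []
  add(m: string): void { this.entries.push(m) }
  readFrom(offset: number): string[] { return this.entries.slice(offset) }
}

const pubsub = new PubSub()
pubsub.publish('update-1')        // nobody is listening yet
const late: string[] = []
pubsub.subscribe((m) => late.push(m))
console.log(late)                 // [] (the message is gone for good)

const stream = new Stream()
stream.add('update-1')            // retained in the stream
console.log(stream.readFrom(0))   // ['update-1'] (a late consumer still sees it)
```

This is also why y-redis can treat Redis as a recoverable buffer while the Hocuspocus Redis extension cannot: Streams retain, Pub/Sub does not.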
The Most Common Mistakes in Practice
- **Mistaking the Redis extension for a persistence store**: This is something I personally witnessed. The Redis extension only relays messages between instances — it does not store data. If Redis restarts, there is no way to recover the changes that occurred in between. The Database extension is not optional.
- **Using `onLoadDocument` and the Database extension's `fetch` simultaneously**: Both paths read the document from the DB and apply `Y.applyUpdate`. When an update is applied twice to the same document, the state becomes corrupted and leads to a bug that is extremely difficult to trace. You must use only one or the other.
- **Running multiple instances without sticky sessions**: Document updates are synchronized via Redis, but if clients for the same document are on different instances, awareness state (cursor positions, user colors, etc.) may not propagate correctly. Sticky sessions are required until a full stateless transition is made.
- **Managing update counters only in per-instance in-memory state**: Because each instance has its own counter, one instance might take a snapshot at update 300 while another does it at 700. In the worst case, a scenario can occur where state accumulates indefinitely without any snapshots being taken. In high-load environments, it's safer to move snapshot interval management to a separate scheduler process.
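The fix for the last item can be sketched by putting the counter behind a single atomic increment. Here a Map stands in for Redis `INCR` (which is atomic across instances); `shouldSnapshot` and the key format are invented for this sketch:

```typescript
const SNAPSHOT_INTERVAL = 500

// Stand-in for Redis INCR: increments a key and returns the new value.
// In production this Map would be replaced by `await redis.incr(key)`,
// which is atomic across all instances.
const sharedCounters = new Map<string, number>()
function incr(key: string): number {
  const next = (sharedCounters.get(key) ?? 0) + 1
  sharedCounters.set(key, next)
  return next
}

// Every instance runs the same check against the shared counter, so exactly
// one store call triggers each snapshot, regardless of which instance it hits.
function shouldSnapshot(documentName: string): boolean {
  return incr(`doc:${documentName}:updates`) % SNAPSHOT_INTERVAL === 0
}

let snapshots = 0
for (let i = 0; i < 1000; i++) {
  // 1000 store calls arriving interleaved from any number of instances
  if (shouldSnapshot('design-doc')) snapshots++
}
console.log(snapshots) // 2 (snapshots taken at update 500 and at update 1000)
```

With per-instance counters the same 1000 updates split across two servers would trigger one snapshot at best and none at worst; the shared counter makes the interval deterministic.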
Closing Thoughts
The core is simple: stick to the design principle of clearly separating Redis as a synchronization channel and the Database extension as a persistence layer. Follow this principle, and the architecture stays the same no matter how many instances you add.
Here are 3 steps you can start on right now.
1. Install with `pnpm add @hocuspocus/server @hocuspocus/extension-database` and connect it to PostgreSQL to save documents. Open the same document in two browsers, type something, and if both sides reflect the changes, step 1 is done. Getting the `documents` table schema solid at this stage makes later scaling much easier.
2. Add the Redis extension with `pnpm add @hocuspocus/extension-redis` and use `docker-compose` to run `hocuspocus-1`, `hocuspocus-2`, and `redis` containers together. Watch the logs of one container in the terminal; when you can confirm that messages from a client connected to a different instance are being delivered, step 2 is done.
3. Configure sticky sessions with HAProxy or Nginx and test failure scenarios. Force-stop one instance with `docker stop`; if the client reconnects and the document content is preserved, step 3 is done. This process gives you a real feel for what role the Database extension actually plays.
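For step 2, the local topology could look like the following `docker-compose.yml` sketch. The image tags, build context, and environment variable names are my assumptions; adapt them to your project:

```yaml
# docker-compose.yml (sketch: service names match the steps above,
# image tags and build context are assumptions)
services:
  redis:
    image: redis:7-alpine

  hocuspocus-1:
    build: .
    environment:
      REDIS_HOST: redis
      DATABASE_URL: ${DATABASE_URL}
      PORT: "1234"
    depends_on: [redis]

  hocuspocus-2:
    build: .
    environment:
      REDIS_HOST: redis
      DATABASE_URL: ${DATABASE_URL}
      PORT: "1234"
    depends_on: [redis]

  haproxy:
    image: haproxy:2.9
    volumes:
      - ./haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg:ro
    ports:
      - "80:80"
    depends_on: [hocuspocus-1, hocuspocus-2]
```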
References
- Hocuspocus Official Docs - Redis Extension | Tiptap
- Hocuspocus Official Docs - Scalability Guide | Tiptap
- Hocuspocus Official Docs - Database Extension | Tiptap
- GitHub - ueberdosis/hocuspocus
- GitHub - hocuspocus Redis.ts source
- y-redis README - Yjs Official
- y-redis Official Yjs Docs
- Yjs Community - y-redis alternative backend discussion
- Redis Official Docs - Distributed Locking
- HAProxy - WebSocket Sticky Session Configuration
- WebSocket Sticky Sessions vs Distributed State Architecture Comparison | Scale with Chintan
- Redis Pub/Sub vs Streams Comparison - Redis Official Blog
- PostgreSQL + Yjs CRDT Collaborative Editing - PowerSync
- GitHub - hackmdio/y-socketio-redis
- @hocuspocus/extension-redis - npm