When to Switch from Ollama to vLLM? — LLM Serving Decision Criteria Based on Concurrent Users
I thought Ollama could handle everything at first, too. You spin up a model with ollama run llama3, hook up the OpenAI-compatible API, build an internal team chatbot, and think "hey, this actually works." But once users start growing, the story changes. The moment 10 people connect simultaneously, the response queue turns into a terrifying 45-second wait. You start wondering "is my model just slow?" — only to realize later that it's not the model, it's the architecture. I was one of those people who figured it out the hard way.
This post answers the question "when should you switch from Ollama to vLLM?" using a single criterion: concurrent user count. It covers why the two tools differ internally, at what point switching becomes meaningful, and how to actually migrate. If you're currently on Ollama and starting to feel the slowdown, this post will help you make the decision.
Core Concepts
Why Ollama Struggles with Concurrent Requests
Ollama is a single-binary server built on llama.cpp. Installation takes about 2 minutes, it supports Apple Silicon's Metal backend and CPU inference, and handles everything from model download to format conversion to serving in one tool. In terms of developer experience, it honestly has no competition.
GGUF: A name combining the initials of llama.cpp developer Georgi Gerganov (GG) and Universal Format (UF). It's a quantization-friendly model format designed to run efficiently on CPU and Apple Silicon environments, used by both Ollama and llama.cpp.
The problem is how it handles requests. Ollama's default is sequential processing. You can raise the number of parallel requests per model with the OLLAMA_NUM_PARALLEL environment variable, and keep several models loaded at once with OLLAMA_MAX_LOADED_MODELS. But even with parallelism enabled, each parallel slot gets its own KV cache allocation, so memory use grows with the slot count and throughput plateaus quickly as concurrent users increase.
KV Cache (Key-Value Cache): A space where a Transformer model stores the attention information of previously processed tokens in memory. It speeds up inference by avoiding recalculation from scratch each time, but when there are many concurrent requests, this cache occupies large amounts of memory.
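To get a feel for the numbers, the per-request cache size can be estimated from the model shape. The sketch below assumes the published Llama-3.1-8B configuration (32 layers, 8 KV heads, head dimension 128) and an FP16 cache. The function name is illustrative and not part of any library, and real servers add their own overhead on top.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: one tensor for keys and one for values in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes


# Assumed Llama-3.1-8B shape: 32 layers, 8 KV heads (GQA), head_dim 128, FP16.
per_request = kv_cache_bytes(32, 8, 128, seq_len=4096)
per_request_mib = per_request / 2**20
twenty_requests_gib = 20 * per_request / 2**30

print(f"{per_request_mib:.0f} MiB per 4096-token request")
print(f"{twenty_requests_gib:.0f} GiB for 20 such requests")
```

Under these assumptions a single full-length request pins about half a gigabyte, which is why per-request reservation stops scaling well before the GPU runs out of compute.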
How vLLM Handles Concurrent Requests
vLLM is a production inference engine that originated at UC Berkeley. Two core technologies create its performance gap over Ollama.
PagedAttention manages the KV cache as non-contiguous pages, similar to OS virtual memory. The traditional approach reserves memory up to the maximum sequence length per request, wasting a lot of memory that never actually gets used. PagedAttention allocates pages only as needed, eliminating memory fragmentation and enabling more requests to be processed simultaneously.
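The allocation idea can be illustrated with a toy block allocator. This is a simplified sketch of on-demand paging, not vLLM's actual implementation: the class and method names are hypothetical, and the block size of 16 tokens is an assumption chosen for illustration.

```python
import math


class PagedKVCache:
    """Toy block allocator: fixed-size pages handed out on demand."""

    def __init__(self, total_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(total_blocks))
        self.block_tables = {}  # request id -> list of physical block ids
        self.token_counts = {}  # request id -> tokens stored so far

    def append_token(self, request_id):
        """Record one more token, taking a new block only when the last one is full."""
        count = self.token_counts.get(request_id, 0)
        if count % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; request must wait")
            block = self.free_blocks.pop()
            self.block_tables.setdefault(request_id, []).append(block)
        self.token_counts[request_id] = count + 1

    def release(self, request_id):
        """Return a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.token_counts.pop(request_id, None)


def reserved_blocks(max_seq_len, block_size=16):
    """Blocks a max-length reservation would pin for one request."""
    return math.ceil(max_seq_len / block_size)


cache = PagedKVCache(total_blocks=256)
for _ in range(100):  # a request that has produced 100 tokens so far
    cache.append_token("req-1")

used = len(cache.block_tables["req-1"])
print(used, "blocks used vs", reserved_blocks(4096), "reserved up front")
```

The blocks in a request's table need not be adjacent, which is the "non-contiguous pages" part: any free block will do, so short requests leave room for others instead of holding a full-length reservation.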
Continuous Batching doesn't wait for an entire batch to finish. It inserts new requests into the GPU pipeline immediately at each token generation iteration.
Forward Pass: A single processing pass where input tokens travel through all layers of the model to compute the probability of the next token. Continuous Batching inserts waiting requests at each forward pass, minimizing GPU idle time.
The net result: with 20 concurrent requests, Ollama's default configuration queues 19 of them, while vLLM can batch all 20 into the same forward pass, GPU memory permitting.
# Example: running a vLLM server (single GPU setup)
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 4096 \
  --tensor-parallel-size 1  # change to the number of GPUs if using multiple

Continuous Batching: Unlike traditional static batching, which waits until all requests in a batch finish, continuous batching dynamically inserts new requests into an in-progress batch. This dramatically reduces GPU idle time.
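The difference between the two batching strategies shows up even in a tiny simulation. The sketch below counts decode steps only and assumes every request needs at least one token; the function names are made up for illustration, and a real scheduler also has to account for prefill and memory limits.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Free slots are refilled from the queue before every decode step."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each active request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1  # one forward pass yields one token per active request
        running = [left - 1 for left in running if left > 1]
    return steps


output_lengths = [2, 8, 2, 2]  # tokens each request will generate
print("static:", static_batching_steps(output_lengths, batch_size=2))
print("continuous:", continuous_batching_steps(output_lengths, batch_size=2))
```

With these inputs the static scheduler spends 10 steps because the short requests wait behind the 8-token one, while the continuous scheduler finishes in 8 by reusing the freed slot.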
The Key Metric for Switching: Concurrent User Count
In terms of peak throughput, vLLM reaches 793 tok/s versus Ollama's 41 tok/s — roughly a 19x difference. But that number isn't always meaningful. If you can only choose one criterion, it's concurrent user count.
| Concurrent Users | Ollama P99 Latency | vLLM P99 Latency | Recommendation |
|---|---|---|---|
| 1–3 | ~2s | ~80ms | Stay on Ollama |
| 4–10 | 2–45s | ~80ms | Consider switching to vLLM |
| 50+ | 24.7s+ | ~80ms | Switching to vLLM required |
| 128+ | Effectively unusable | ~80ms | vLLM only |
For an internal tool used by 1–3 people, vLLM's throughput advantage is essentially irrelevant. There's no reason to absorb the added installation complexity for a switch like that.
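The table can be written down as a small decision helper. This is only a sketch of the thresholds above: the function name is hypothetical, and treating the unlisted 11 to 49 range the same as 50+ is an assumption, since the table does not cover it.

```python
def recommend_backend(peak_concurrent_users, has_nvidia_gpu):
    """Map peak concurrency to the recommendation column of the table."""
    if not has_nvidia_gpu:
        # Without a supported GPU the switch is not an option in practice.
        return "stay on Ollama"
    if peak_concurrent_users <= 3:
        return "stay on Ollama"
    if peak_concurrent_users <= 10:
        return "consider switching to vLLM"
    # Assumption: the table skips 11 to 49 users; they are treated like 50+.
    return "switch to vLLM"


for users in (2, 8, 60):
    print(users, "->", recommend_backend(users, has_nvidia_gpu=True))
```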
Pros and Cons
To be honest, vLLM is dominant on performance, but there are clear reasons to keep using Ollama.
Advantages
Ollama
| Item | Detail |
|---|---|
| Installation ease | Single binary, 2-minute install |
| Platform support | Apple Silicon and CPU inference both supported |
| Model management | Unified management with ollama pull/push |
| Cold start | 3.2s (faster than vLLM's 8.7s) |
| Learning curve | Start immediately without MLOps knowledge |
vLLM
| Item | Detail |
|---|---|
| Throughput | Peak 793 tok/s (19x vs. Ollama's 41 tok/s) |
| Latency | P99 80ms (at 128 concurrent connections) |
| Cost efficiency | 10–50x cost reduction per request at scale |
| Features | Speculative Decoding, LoRA, multimodal support |
| Latest updates | Disaggregated Prefill/Decode, Model Runner V2 |
Speculative Decoding: A technique that reduces latency by having a small draft model predict multiple tokens ahead, while a larger target model verifies them all at once. vLLM supports EAGLE, DFlash, n-gram, and suffix methods.
Drawbacks and Caveats
The most common mistake in practice is not checking the model format beforehand (see the vLLM GGUF row below). Plenty of people get stuck trying to bring Ollama's GGUF files over directly.
| Item | Detail | Mitigation |
|---|---|---|
| Ollama: sequential by default | Queuing under concurrent requests | Consider vLLM switch at 4+ users |
| Ollama: GGUF only | No AWQ/GPTQ support | Use vLLM if those formats are needed |
| vLLM: no Metal/MPS backend | No GPU acceleration on Apple Silicon (macOS support is experimental and CPU-only) | Use Ollama + MLX on Mac |
| vLLM: configuration complexity | Tensor parallelism, quantization format conversion required | Abstract with LiteLLM proxy |
| vLLM: limited GGUF support | GGUF loading is experimental and under-optimized; Ollama's model store is not a drop-in source | Use AWQ models from HuggingFace Hub |
| vLLM: cold start | 8.7s (slower than Ollama's 3.2s) | Use container warm-up scripts |
AWQ (Activation-aware Weight Quantization): A method that protects important channels by considering activation distribution when quantizing weights to 4-bit. With AutoAWQ now deprecated, llm-compressor (v0.10.0.1) is the current successor tool.
Most Common Mistakes in Practice
- Attempting migration without checking the model format: Many people hit a wall trying to use Ollama's GGUF models directly with vLLM. vLLM's GGUF support is experimental and under-optimized, so it's better to find the same model on HuggingFace Hub in AWQ or original format.
- Leaving --max-model-len at its default: By default, vLLM tries to use the full maximum context length supported by the model. If GPU memory is insufficient, the server will either run out of memory or fail to start at all. It's best to constrain this to a value that fits your actual usage patterns.
- Forcing vLLM adoption without the MLOps capacity: Operating vLLM requires infrastructure knowledge, including tensor parallelism configuration, quantization format conversion, and monitoring setup. Many teams start the migration without sufficient capability and end up abandoning it. In this case, introducing a proxy layer like LiteLLM first, or evaluating a managed inference service, is a more realistic choice.
Practical Application
Example 1: E-commerce Customer Support Chatbot — Ollama to vLLM
This is a situation you frequently encounter in practice — things work fine at first, then collapse as traffic grows. You experience response times shooting from 2 seconds to 45+ seconds at peak concurrent access of around 10 users. The moment you think "yes, this is exactly my situation" is the right time to consider switching.
The good news is that the migration itself is not complex. Both Ollama and vLLM provide OpenAI-compatible APIs, so you only need to change two lines: base_url and model.
# Before: Ollama
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)
response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Hello"}],
)

# After: vLLM (only base_url and model change)
from openai import OpenAI

client = OpenAI(
    base_url="http://your-vllm-server:8000/v1",  # changed
    api_key="token-abc123",
)
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # changed
    messages=[{"role": "user", "content": "Hello"}],
)

| Changed Item | Ollama | vLLM |
|---|---|---|
| base_url | http://localhost:11434/v1 | http://{serverIP}:8000/v1 |
| model | GGUF model name (llama3.1:8b) | HuggingFace model name |
| api_key | Any string | Configured token or any string |
One thing to verify beforehand is the model format. Ollama uses GGUF, while vLLM is built around original HuggingFace checkpoints and AWQ or GPTQ quantized models; its GGUF support is experimental, so don't plan the migration around it. AWQ-quantized versions are often already available on HuggingFace Hub, so you can use them directly without converting yourself.
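One way to keep the switch reversible is to read the backend from the environment instead of editing the two values by hand. The sketch below is an assumption-heavy illustration: the LLM_BACKEND and LLM_BASE_URL variable names and the helper functions are hypothetical, and the URLs and model names are simply the ones from the example above.

```python
import os

# Settings copied from the before/after example; adjust to your deployment.
BACKENDS = {
    "ollama": {
        "base_url": "http://localhost:11434/v1",
        "model": "llama3.1:8b",
        "api_key": "ollama",
    },
    "vllm": {
        "base_url": "http://your-vllm-server:8000/v1",
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "api_key": "token-abc123",
    },
}


def backend_settings(env=None):
    """Pick connection settings from LLM_BACKEND (hypothetical variable name)."""
    env = os.environ if env is None else env
    name = env.get("LLM_BACKEND", "ollama")
    if name not in BACKENDS:
        raise ValueError(f"unknown backend: {name}")
    settings = dict(BACKENDS[name])
    # Optional override, e.g. for a staging server.
    settings["base_url"] = env.get("LLM_BASE_URL", settings["base_url"])
    return settings


def chat(prompt, env=None):
    """Send one chat message to whichever backend is configured."""
    from openai import OpenAI  # imported lazily; only needed for real calls

    settings = backend_settings(env)
    client = OpenAI(base_url=settings["base_url"], api_key=settings["api_key"])
    response = client.chat.completions.create(
        model=settings["model"],
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


print(backend_settings({"LLM_BACKEND": "vllm"})["model"])
```

Calling chat("Hello") with LLM_BACKEND unset would talk to Ollama, and setting LLM_BACKEND=vllm would route the same call to the vLLM server, assuming both are reachable at the addresses shown.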
Example 2: Small Team Copilot — Keep Ollama or Use a Dual Setup
For a 5-person dev team running an internal code completion tool, peak concurrent users will be around 5 at most. In this case, there's no real reason to switch away from Ollama.
However, if there's any chance of scaling to production later, it's worth planning for an Ollama (dev) + vLLM (production) dual setup from the start. Using LiteLLM as a proxy lets you swap the backend without touching application code.
# docker-compose.yml: unified integration via LiteLLM proxy
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4000:4000"
    volumes:
      - ./litellm_config.yaml:/app/config.yaml
    command: ["--config", "/app/config.yaml"]

The key is that app code always uses a single model name, "llama3". Switching backends is just a matter of swapping the mounted config file.
# litellm_config_dev.yaml: development environment (Ollama backend)
model_list:
  - model_name: llama3  # app code only ever uses this name
    litellm_params:
      model: ollama/llama3.1:8b
      # If LiteLLM runs in the container above, localhost is the container itself;
      # point this at the Docker host or an Ollama service instead.
      api_base: http://localhost:11434

# litellm_config_prod.yaml: production environment (vLLM backend, no app code changes)
model_list:
  - model_name: llama3  # same name, backend swapped
    litellm_params:
      model: openai/meta-llama/Llama-3.1-8B-Instruct
      api_base: http://vllm-server:8000/v1  # vLLM serves its OpenAI-compatible routes under /v1
      api_key: "token-abc123"

Example 3: Apple Silicon Environment — MLX Instead of vLLM
If you're on an Apple Silicon Mac, vLLM is effectively off the table for GPU inference. Mainline vLLM targets CUDA and ROCm GPUs (plus a few other accelerators and CPU) and has no Metal/MPS backend.
In March 2026, Ollama announced a Preview adopting MLX as its inference engine on Apple Silicon, replacing the existing llama.cpp Metal backend. On an M4 Pro (64GB), MLX achieved approximately 130 tok/s versus llama.cpp's 43 tok/s on Qwen3-Coder-30B-A3B — roughly a 3x difference.
# Enable Ollama MLX preview
OLLAMA_USE_MLX=1 ollama serve

Note that this feature is currently in Preview. It is disabled by default in stable releases, and the list of supported models is limited. If enabling it produces no change, the model likely doesn't support the MLX backend yet. It's worth checking the current supported model list in Ollama's official release notes before trying it. For Mac environments, the Ollama + MLX combination is the most realistic option, and for most use cases it's more than sufficient.
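Which of the three examples applies depends mostly on the host. A rough sketch of that check follows, assuming that a reachable nvidia-smi binary is a good enough signal for an NVIDIA GPU; the function name and return strings are illustrative only.

```python
import platform
import shutil


def detect_serving_target(system=None, machine=None, has_nvidia_smi=None):
    """Suggest a serving stack from basic host facts (arguments override detection)."""
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine
    if has_nvidia_smi is None:
        # Assumption: nvidia-smi on PATH means an NVIDIA driver is installed.
        has_nvidia_smi = shutil.which("nvidia-smi") is not None

    if system == "Darwin" and machine == "arm64":
        return "ollama (Apple Silicon, MLX preview optional)"
    if has_nvidia_smi:
        return "vllm-capable"
    return "ollama"


print(detect_serving_target())
```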
Closing Thoughts
If your concurrent user count is exceeding 4–5 and you have an NVIDIA GPU, it's time to consider switching to vLLM. Until that point, Ollama is the better choice.
Here are some first steps you can take right now.
- Measure your peak concurrent request count. Check application logs or use a simple load testing tool (k6, locust, etc.). If you don't have tooling yet, the command below gives a quick read on how the server responds under concurrent connections. Note that /api/tags only lists installed models, so it exercises the HTTP layer rather than token generation.

  # wrk load test: 10 concurrent connections for 30 seconds
  wrk -t4 -c10 -d30s --timeout 60s http://localhost:11434/api/tags

  If you're approaching the threshold (~4 users), now is the time to start preparing for a switch.
- Check your hardware first. Run nvidia-smi to see if you have an NVIDIA GPU. If you're on Apple Silicon, trying Ollama's MLX preview (OLLAMA_USE_MLX=1) is also a solid option.
- You can test the switch with almost no code changes. Spin up a vLLM server locally with Docker, change just the two parameters, base_url and model, in your existing code, and verify it works. Thanks to the OpenAI-compatible API, everything else stays the same.
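To measure generation latency rather than HTTP responsiveness, a few lines of Python are enough. The sketch below is a minimal harness under stated assumptions: run_load_test and percentile are hypothetical helpers, and fake_request is a stand-in that only sleeps so the script runs without a server.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def run_load_test(send_request, concurrency, total_requests):
    """Call send_request from `concurrency` threads and return per-call latencies."""

    def timed_call(_):
        start = time.perf_counter()
        send_request()
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed_call, range(total_requests)))


def fake_request():
    """Stand-in for a real chat completion call; replace before measuring."""
    time.sleep(0.01)


latencies = run_load_test(fake_request, concurrency=10, total_requests=50)
print(f"p50={percentile(latencies, 50):.3f}s  p99={percentile(latencies, 99):.3f}s")
```

To test a real server, swap fake_request for a function that calls client.chat.completions.create with a short prompt against your base_url, then compare the P99 figure at 1, 4, and 10 concurrent workers.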