Five Inference Optimization Techniques to Double or Quadruple LLM Serving Throughput on the Same GPU — From Quantization to Speculative Decoding
I still remember the shock of deploying a GPT-4-class model to production for the first time and watching it take over three seconds to return a single response. The model was clearly intelligent, but it was too slow to use in a real service — and the cost was a serious problem too. Today, throughput on the same GPU has multiplied several times over, and inference costs for GPT-4-class models have fallen roughly 50x (from ~$20/million tokens at GPT-4's launch to ~$0.40/million tokens as of 2025, per the DigitalOcean LLM Inference Trilemma).
Behind this shift lies a remarkable advancement in LLM inference optimization techniques — not simply using faster GPUs, but fundamentally changing how models generate tokens. This article covers five key techniques you can apply in production right now — quantization, PagedAttention, Speculative Decoding, and more — along with guidance on which combination to choose for your workload. By the end, you'll have the code and decision criteria you need to launch a vLLM server with optimized settings.
Whether you're self-hosting LLMs, trying to cut API costs, or simply curious about this space — let's dig into something quite practical.
Core Concepts
The Two Phases and Three Bottlenecks of LLM Inference
LLM token generation proceeds in two broad phases:
- Prefill: Processes the entire input prompt at once to produce the initial KV cache. The KV cache is the memory space where the model stores the computed results (Key and Value matrices) from previous tokens.
- Decoding: Generates tokens one by one in an autoregressive fashion.
Prefill can be parallelized and is relatively fast, but decoding is slow because it is inherently serial: each token can only be generated after the previous one is determined. Three major bottlenecks arise here:
| Bottleneck Type | Cause | Impact |
|---|---|---|
| Memory bandwidth | GPU must reload tens of GB of weights at every step | Throughput degradation |
| KV cache fragmentation | Fixed memory reserved per sequence for KV tensors | VRAM waste |
| Sequential decoding | Autoregressive structure generates one token at a time | Hard lower bound on latency |
The techniques at the heart of modern LLM serving optimization each attack one or more of these three bottlenecks.
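To get a feel for why the KV cache in particular eats VRAM so quickly, here is a rough back-of-envelope sketch. The dimensions below assume a Llama-3.1-8B-class model (32 layers, 8 KV heads via GQA, head dim 128, 16-bit cache); plug in your own model's config to adapt it.

```python
# Rough KV cache sizing (assumed Llama-3.1-8B-like dimensions; adjust for your model)
num_layers = 32        # transformer blocks
num_kv_heads = 8       # KV heads under GQA (fewer than the 32 query heads)
head_dim = 128
bytes_per_elem = 2     # FP16 / BF16 cache

# Both K and V are cached for every layer, KV head, and token
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
print(f"{kv_bytes_per_token / 1024:.0f} KiB per token")  # ~128 KiB

context_len = 8192
print(f"{kv_bytes_per_token * context_len / 2**30:.1f} GiB per max-length sequence")  # ~1.0 GiB
```

At roughly 1 GiB for a fully used 8K sequence, a few dozen concurrent long requests can consume more VRAM than the weights themselves, which is why several of the techniques below target the cache rather than the model.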
The Inference Trilemma: It is impossible to simultaneously maximize Throughput, Latency, and Cost. The starting point for optimization is deciding which of the three to prioritize for your workload. The techniques below each target different vertices of this trilemma.
Here is a quick overview of which technique combinations suit each workload type:
| Workload Type | Trilemma Priority | Key Technique Combination |
|---|---|---|
| Real-time chatbot | Minimize latency | Quantization + Prefix Caching + Speculative Decoding |
| RAG pipeline | Throughput + cost reduction | RadixAttention + PagedAttention |
| Batch processing | Maximize throughput | Quantization + Continuous Batching + large batches |
| Edge / on-device | Cost (VRAM constraint) | GGUF quantization + llama.cpp |
Technique 1: Quantization — Cutting VRAM by Half or More
Quantization represents model weights at a lower bit width, simultaneously reducing VRAM usage and memory bandwidth pressure.
My initial reaction was "doesn't lower precision make the model dumber?" — but in practice it holds up better than you'd expect. That said, there are task-specific differences worth watching out for.
```python
# vLLM >= 0.6.x
from vllm import LLM, SamplingParams
llm = LLM(
model="Qwen/Qwen2.5-7B-Instruct-AWQ",
quantization="awq",
dtype="auto",
gpu_memory_utilization=0.90,
)
sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["LLM 추론 최적화를 설명해주세요."], sampling_params)
print(outputs[0].outputs[0].text)Here is a comparison of the mainstream quantization approaches:
| Method | Bits | Speed Gain | Accuracy Loss | Recommended Use |
|---|---|---|---|---|
| FP8 | 8-bit | ~2x (vs BF16) | Negligible | Data centers, H100/H200 |
| AWQ INT4 | 4-bit | ~3x | Within 5–10% | Production serving |
| GPTQ INT4 | 4-bit | ~3x | Similar to AWQ | Powerful when combined with Marlin kernel |
| GGUF Q4_K_M | 4-bit | Environment-dependent | Balanced | Edge / on-device |
The Q number in GGUF refers to bits per weight. Q8_0 stores at 8 bits, Q4_K_M at 4 bits, Q2_K at 2 bits — lower numbers mean smaller files and faster speeds, but also lower accuracy.
The Rise of FP8: As of 2025, NVIDIA Hopper (H100/H200) GPUs support FP8 at the hardware level. Delivering 2x throughput over BF16 with negligible accuracy loss, FP8 is establishing itself as the default precision for data center serving.
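On Hopper-class GPUs, vLLM can also apply FP8 quantization to a regular BF16 checkpoint at load time, with no separately quantized model required. A minimal sketch, assuming an H100/H200 and a standard BF16 checkpoint:

```python
# Minimal sketch: online FP8 weight quantization in vLLM (Hopper-class GPU assumed)
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # regular BF16 checkpoint
    quantization="fp8",                        # weights are quantized to FP8 at load time
    gpu_memory_utilization=0.90,
)
```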
Workloads where this technique shines: When you want to reduce both latency and cost, especially for edge deployment or VRAM-constrained environments. For math reasoning and code generation tasks, always run a benchmark before switching to INT4.
Technique 2: PagedAttention + Continuous Batching — Keeping the GPU Busy
One of the biggest sources of waste in traditional LLM serving systems was pre-reserving KV cache memory up to the maximum sequence length. If you reserved space for 1,024 tokens but only used 200, the remaining 824 tokens' worth of VRAM was simply wasted.
PagedAttention applies the virtual memory paging concept from operating systems to the KV cache, dynamically allocating only as many blocks as are actually used. Combined with Continuous Batching, the moment one request finishes, the next is pushed into the batch, leaving the GPU with almost no idle time. (Runpod — vLLM PagedAttention Guide)
```bash
# vLLM >= 0.6.x
# PagedAttention and Continuous Batching are enabled by default
# --enable-prefix-caching: reuses KV cache for common system prompts
# --max-num-seqs: maximum number of sequences to process concurrently
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--max-model-len 8192 \
--gpu-memory-utilization 0.90 \
--enable-prefix-caching \
--max-num-seqs 256
```

Comparing before and after vLLM, it is common to see a 2–4x increase in the number of concurrent requests a single GPU can handle.
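The server above exposes an OpenAI-compatible API, so existing clients work unchanged. A quick sanity check with the official openai package might look like this (the api_key value is a placeholder, since vLLM does not check it by default):

```python
# Query the vLLM OpenAI-compatible server started above
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain PagedAttention in one sentence."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```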
Workloads where this technique shines: Services with many concurrent users (chatbots, API gateways). This is a baseline technique applied in almost every throughput-first environment.
Technique 3: Speculative Decoding — Draft Ahead, Verify at Once
Honestly, when I first encountered this technique I thought "does this even make sense?" The idea seems almost too simple.
A small draft model (or auxiliary head) speculatively predicts several tokens ahead. Then the large target model validates all of those guesses in a single forward pass. Accepted tokens are kept as-is; generation resumes from the first rejected token.
Key guarantee: Accepted tokens follow mathematically the same distribution as if the target model had generated them directly. In other words, output quality is unchanged — only speed increases. (NVIDIA Developer — Introduction to Speculative Decoding)
The most widely adopted approach today is EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency), which leverages the target model's own features to predict the next token rather than requiring a separate draft model. EAGLE-3 delivers a 2.5–2.8x speedup at roughly 80% acceptance rate, and vLLM, SGLang, and TensorRT-LLM all include native support. (EAGLE-3 GitHub)
```python
# vLLM >= 0.6.x
# speculative_model="[ngram]": generates drafts from N-gram patterns without a separate model
from vllm import LLM, SamplingParams
llm = LLM(
model="meta-llama/Llama-3.1-8B-Instruct",
speculative_model="[ngram]",
num_speculative_tokens=5,
ngram_prompt_lookup_max=4,  # n-gram window size looked up in the prompt
use_v2_block_manager=True,
)
```

ngram vs. EAGLE selection guide: ngram requires no additional model and has zero memory overhead — you can enable it instantly. It works well for tasks with repetitive patterns (template-based output, code comments). EAGLE uses a separate draft head and requires some setup, but achieves much higher acceptance rates on general conversation and summarization tasks.
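For comparison, an EAGLE setup in vLLM mostly swaps the draft source: speculative_model points at an EAGLE draft head trained for the target model instead of "[ngram]". A sketch under the assumption that a matching draft-head checkpoint is available (the repository id below is illustrative; use the one published for your model):

```python
# Sketch: EAGLE-style speculative decoding in vLLM (draft-head repo id is illustrative)
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_model="yuhuili/EAGLE-LLaMA3.1-Instruct-8B",  # illustrative: pick the head matching your target model
    num_speculative_tokens=5,
)
```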
Workloads where this technique shines: Real-time chatbots where latency is the top priority. For tasks with highly unpredictable outputs (creative writing, code), acceptance rates tend to be low and can actually introduce overhead — use with caution.
Technique 4: Prefix Caching / RadixAttention — Caching Repeated Context
If you have ever operated a chatbot with a long system prompt or a RAG pipeline, you have felt the pain of re-prefilling the same document or instructions on every request.
I ran into this myself — I enabled prefix caching and got a much lower cache hit rate than expected. It turned out that a timestamp at the end of the system prompt was making each request subtly different. Even a one-byte difference in the prefix causes a cache miss. Clean up your prompt structure first.
Prefix Caching saves the KV cache for common prefix tokens and reuses it on subsequent requests. In vLLM it is enabled with a single --enable-prefix-caching flag.
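The practical rule from the anecdote above: keep everything that is shared byte-identical at the front of the prompt, and push anything request-specific (timestamps, user ids, retrieved chunks) behind it. A small sketch of that layout:

```python
# Prompt layout that keeps the shared prefix byte-identical across requests
import datetime

SYSTEM_PROMPT = "You are a support assistant for the ACME billing team."  # stable -> cacheable

def build_prompt(user_question: str) -> str:
    # Anything that varies per request goes AFTER the shared prefix,
    # so it can no longer invalidate the cached system-prompt KV blocks.
    now = datetime.datetime.now().isoformat(timespec="seconds")
    return f"{SYSTEM_PROMPT}\n\n[request time: {now}]\nUser: {user_question}\nAssistant:"
```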
SGLang's RadixAttention goes a step further, managing shared prefixes in a Radix Tree structure so that even partial matches can be reused. For RAG pipelines or agents where a system prompt plus document chunks repeat across requests, you can expect more than 2x throughput improvement. (Introl Blog — KV Cache Optimization)
Workloads where this technique shines: Chatbots, RAG, and agents where multiple requests share the same system prompt or document chunks. For batch processing where context differs completely across requests, the benefit is minimal.
Technique 5: MLA — How DeepSeek Cut KV Cache by 90%
Multi-Head Latent Attention (MLA) is an architectural innovation introduced in DeepSeek-V2. My first reaction reading the paper was "isn't this just KV cache compression?" — but serving it in practice feels different.
Practical summary: When serving MLA-based models like DeepSeek-V3 or Kimi K2, you can fit far longer contexts or handle far more concurrent requests on the same VRAM — thanks to up to a 90% reduction in KV cache size.
Architectural background: Whereas standard MHA (Multi-Head Attention) caches both Key and Value matrices, MLA represents both using a single compressed latent vector. Retrofitting MLA onto existing MHA models requires fine-tuning, so in practice it is a factor you consider when choosing new models.
| Method | Cache Size per Head | Characteristics |
|---|---|---|
| MHA (standard) | 2d (K + V) | General purpose; supported by all existing models |
| GQA | 2d / G (shared across G heads) | Adopted by Llama 3; reduces KV cache without additional training |
| MLA | ~d/2 or less | Adopted by DeepSeek-V3, Kimi K2; maximum compression |
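To make the compression concrete, here is a rough per-layer, per-token count using approximate DeepSeek-V3-style dimensions (128 heads of dim 128 for standard MHA, versus a 512-dim compressed latent plus a 64-dim decoupled RoPE key for MLA). The figures are illustrative; real end-to-end savings are smaller, which is where the ~90% number above comes from.

```python
# Back-of-envelope: cached elements per token per layer, MHA vs. MLA (DeepSeek-V3-like dims, illustrative)
num_heads, head_dim = 128, 128
kv_lora_rank, rope_dim = 512, 64          # MLA: compressed KV latent + decoupled RoPE key

mha_elems = 2 * num_heads * head_dim      # full K and V for every head
mla_elems = kv_lora_rank + rope_dim       # only the latent (and small RoPE key) is cached

print(mha_elems, mla_elems)               # 32768 vs. 576
print(f"{1 - mla_elems / mha_elems:.1%} fewer cached elements")
```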
FlashMLA is a FlashAttention-style kernel purpose-built for MLA, achieving 660 TFlops in BF16 on NVIDIA H800. Given that H800's theoretical dense BF16 peak is roughly 990 TFlops (the often-quoted 1,979 TFlops figure assumes 2:4 structured sparsity), this works out to roughly two-thirds of peak compute; for the decoding phase, which is memory-bound and dominated by KV cache access, that is a substantively strong number. (DeepSeek FlashMLA GitHub)
Workloads where this technique shines: Serving models that use MLA architecture, such as DeepSeek-V3 or Kimi K2. Direct application to existing MHA models is not feasible.
Practical Application
Example 1: Conversational Chatbot — Targeting TTFT Under 100ms
For real-time chatbots, the metric users feel most acutely is TTFT (Time-to-First-Token) — the time until the first token appears. Once TTFT exceeds 200ms, users start to perceive the system as slow. This is a latency-minimization scenario in the trilemma.
```python
# vLLM >= 0.6.x
from vllm import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
engine_args = AsyncEngineArgs(
model="Qwen/Qwen2.5-14B-Instruct-AWQ",
quantization="awq",
enable_prefix_caching=True,
max_num_seqs=128,
gpu_memory_utilization=0.85,
speculative_model="[ngram]",
num_speculative_tokens=4,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```

| Optimization | TTFT Contribution |
|---|---|
| AWQ INT4 quantization | ~3x memory bandwidth improvement |
| Prefix Caching | Eliminates system prompt recomputation; 50–90% TTFT reduction |
| Speculative Decoding | 2.5–2.8x decoding latency reduction |
Example 2: RAG Pipeline — Reusing Document KV to Boost Throughput
RAG (Retrieval-Augmented Generation) pipelines typically repeat the structure [system prompt + retrieved document chunks + user question] across requests. When document chunks are shared across multiple requests, KV cache reuse reaches its maximum effectiveness. This is a scenario targeting both throughput and cost reduction in the trilemma.
```bash
# SGLang — RadixAttention is enabled by default
# --chunked-prefill-size: splits long document prefills into multiple chunks for processing
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--port 30000 \
--mem-fraction-static 0.85 \
--chunked-prefill-size 512
```

```python
# vLLM >= 0.6.x
# Combining LMCache + vLLM for persistent KV caching in multi-turn QA
# pip install lmcache vllm
import lmcache.integration.vllm # noqa: F401
# This import patches vLLM's internal classes with LMCache-integrated versions.
# The KV cache persistence feature is activated purely as a side effect of this import, with no further code changes required.
from vllm import LLM
llm = LLM(
model="meta-llama/Llama-3.1-8B-Instruct",
kv_cache_dtype="auto",
enable_prefix_caching=True,
)
# Subsequent requests referencing the same document are served without recomputing KV
```

Combining LMCache with vLLM has been reported to yield up to 15x throughput improvement for multi-turn QA and document analysis workloads.
Example 3: Edge / On-Device Deployment — Running Without a Server GPU
This is a cost (VRAM-constrained) scenario in the trilemma. If you need to run an LLM on a MacBook or small device without a server GPU, llama.cpp is the de facto standard.
```bash
# After installing Ollama
# Q4_K_M: Medium variant of K-quants 4-bit quantization — balanced speed and accuracy
ollama pull qwen2.5:7b-instruct-q4_K_M
# Or using llama.cpp directly
# --n-gpu-layers: set to the maximum (or -1) on Apple Silicon to offload all layers to the Metal GPU backend
./llama-cli \
-m ./models/qwen2.5-7b-instruct-Q4_K_M.gguf \
-n 512 \
--n-gpu-layers 35 \
-p "LLM 추론 최적화를 설명해줘"| GGUF Quantization Level | Model Size (7B) | Accuracy | Recommended Scenario |
|---|---|---|---|
| Q8_0 (8-bit) | ~7.7 GB | Nearly identical to BF16 | When VRAM is plentiful |
| Q4_K_M (4-bit) | ~4.4 GB | Balanced | General edge deployment |
| Q2_K (2-bit) | ~2.7 GB | Noticeable degradation | Extreme memory constraints |
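If you want the same GGUF model from Python instead of the CLI, the llama-cpp-python bindings wrap llama.cpp with a similar interface. A minimal sketch (the model path is illustrative):

```python
# Minimal llama-cpp-python sketch for a local GGUF model (pip install llama-cpp-python)
from llama_cpp import Llama

llm = Llama(
    model_path="./models/qwen2.5-7b-instruct-Q4_K_M.gguf",
    n_gpu_layers=-1,   # offload every layer to Metal / CUDA when available
    n_ctx=4096,
)
out = llm("Explain LLM inference optimization in two sentences.", max_tokens=128)
print(out["choices"][0]["text"])
```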
Example 4: Batch Processing — Maximizing Throughput Configuration
For large-scale batch jobs where real-time responsiveness is not required, choose maximum throughput in the trilemma. Tokens processed per hour matters more than latency.
```python
# vLLM >= 0.6.x
from vllm import LLM, SamplingParams
llm = LLM(
model="Qwen/Qwen2.5-72B-Instruct-AWQ",
quantization="awq",
tensor_parallel_size=4,
gpu_memory_utilization=0.95,
max_num_seqs=512,
)
# Passing thousands of prompts at once lets the engine build optimal internal batches
prompts = [f"문서 {i}를 요약해주세요: ..." for i in range(5000)]
outputs = llm.generate(prompts, SamplingParams(max_tokens=256))Intelligent Model Routing: Routing simple tasks to smaller models (7B or below) and only complex tasks to larger models can save an additional 30–60% in cost. Implementable with LiteLLM or a custom router.
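A custom router does not need to be sophisticated to pay off; even a crude length- or keyword-based heuristic in front of two vLLM endpoints captures much of the saving. A toy sketch, where the endpoints, model names, and thresholds are all illustrative:

```python
# Toy model router: cheap model for simple prompts, large model for hard ones (all values illustrative)
from openai import OpenAI

SMALL = ("http://small-endpoint:8000/v1", "Qwen/Qwen2.5-7B-Instruct-AWQ")
LARGE = ("http://large-endpoint:8000/v1", "Qwen/Qwen2.5-72B-Instruct-AWQ")
HARD_KEYWORDS = ("prove", "refactor", "step by step")

def route(prompt: str):
    # Crude difficulty heuristic: long prompts or "hard" keywords go to the large model
    hard = len(prompt) > 2000 or any(k in prompt.lower() for k in HARD_KEYWORDS)
    base_url, model = LARGE if hard else SMALL
    client = OpenAI(base_url=base_url, api_key="EMPTY")
    return client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
```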
Pros and Cons Analysis
Advantages
| Item | Detail |
|---|---|
| Cost reduction | GPT-4-class inference costs have fallen ~50x since GPT-4's 2023 launch; optimization can yield additional savings |
| Responsiveness | Speculative Decoding + Prefix Caching combination dramatically reduces TTFT |
| Scalability | 2–4x more concurrent requests on the same GPU |
| Edge viability | INT4 quantization enables serving 70B models on small servers |
| Quality preservation | Minimal quality degradation for most tasks with FP8 and AWQ |
Drawbacks and Caveats
| Item | Detail | Mitigation |
|---|---|---|
| Quantization accuracy | INT4 can cause >10% degradation on math reasoning and code generation | Prefer FP8 where hardware allows, choose AWQ over naive INT4, and always benchmark |
| Speculative Decoding variance | Low acceptance rates on unpredictable tasks like creative writing or code | Monitor per-task acceptance rate before committing |
| Memory overhead | Loading a separate draft model increases VRAM usage | Consider ngram (no extra model) or EAGLE (lightweight draft head) |
| Hardware dependency | TensorRT-LLM and FP8 optimizations are NVIDIA-only | Use llama.cpp / ROCm for AMD / Apple Silicon |
| MLA adoption difficulty | Applying MLA to existing MHA models requires fine-tuning | Factor in MLA architecture when selecting new models |
GQA (Grouped-Query Attention): An approach in which several query heads share a single Key-Value head, placing it between MHA and MLA in KV cache footprint. Adopted by the Llama 3 series, it is a practical option for reducing KV cache size within an existing architecture without additional training.
The Most Common Mistakes in Practice
- Choosing a technique before analyzing the workload: If you have not first clarified whether you are latency-first or throughput-first, even the best technique can work against you. Applying batch optimizations to a chatbot can actually increase TTFT.
- Skipping the benchmark after quantization: "INT4 is usually fine" is broadly true, but I have personally seen larger-than-expected degradation on certain instruction-following tasks. Math reasoning and code generation in particular deserve a dedicated check.
- Not monitoring Speculative Decoding acceptance rate: I once skipped the acceptance rate check and only discovered much later that Speculative Decoding was actually slowing things down. When acceptance rate is low, the draft generation overhead becomes pure cost. Periodically check spec_decode_acceptance_rate from vLLM's /metrics endpoint (see the sketch after this list).
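The check can start as small as pulling vLLM's Prometheus text output and filtering for the speculative-decoding counters. Exact metric names vary a little between vLLM versions, so the sketch below filters on a keyword rather than hard-coding one:

```python
# Sketch: surface speculative-decoding metrics from vLLM's /metrics endpoint
import requests

text = requests.get("http://localhost:8000/metrics", timeout=5).text
for line in text.splitlines():
    # keep only data lines mentioning speculative decoding (names vary by vLLM version)
    if "spec_decode" in line and not line.startswith("#"):
        print(line)
```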
Closing Thoughts
The key to LLM serving optimization is not "one magic technique" but finding the right combination of techniques for your workload's characteristics. And to find that combination, you need to see the numbers first. Optimizing without monitoring is like driving long-distance without a fuel gauge. The first step is not spinning up a server — it is measuring what is slow right now.
Three steps you can take immediately:
- Install vLLM and start serving with an AWQ model. Run pip install vllm and then python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-7B-Instruct-AWQ --quantization awq --enable-prefix-caching. That single line brings up a server with PagedAttention, Continuous Batching, and Prefix Caching all enabled.
- Determine whether your workload is latency-first or throughput-first, then reference the configuration from whichever of Examples 1–4 most closely matches your case. You can add --enable-chunked-prefill (for RAG or long context) or Speculative Decoding (--speculative-model "[ngram]") one at a time and observe how the metrics change.
- Start by running curl http://localhost:8000/metrics to see the numbers vLLM exposes. GPU utilization, batch size, and Speculative Decoding acceptance rate all stream out as plain text. Once those numbers are visible, the direction for your next optimization becomes self-evident. When you have bandwidth, connect Prometheus + Grafana to build a dashboard.
References
Quantization
- LLM Inference Optimization Official Docs | HuggingFace
- LLM Model Quantization Optimization for AWS Inferentia | AWS Tech Blog
PagedAttention / Continuous Batching
- vLLM: PagedAttention, Continuous Batching Guide | Runpod
- LLM Inference Performance Engineering Best Practices | Databricks
Speculative Decoding
- Speculative Decoding for AI Inference Latency | NVIDIA Developer
- Speculative Decoding 2025 Guide | Introl Blog
- Applying Speculative Decoding to HyperCLOVA X | CLOVA Tech Blog
Prefix Caching / KV Cache Optimization
- KV Cache Optimization | Introl Blog
MLA / Attention Optimization
- FlashMLA | DeepSeek GitHub
General / Trade-offs
- The LLM Inference Trilemma | DigitalOcean