Five Inference Optimization Techniques to Double or Quadruple LLM Serving Throughput on the Same GPU — From Quantization to Speculative Decoding
I still remember the shock of deploying a GPT-4-class model to production for the first time and watching it take over three seconds to return a single response. The model was clearly intelligent, but it was too slow to use in a real service — and the cost was a serious problem too. Today, throughput on the same GPU has multiplied several times over, and inference costs for GPT-4-class models have fallen roughly 50x (from ~$20/million tokens at GPT-4's launch to ~$0.40/million tokens as of 2025, per the DigitalOcean LLM Inference Trilemma).
Behind this shift lies a remarkable advancement in LLM inference optimization techniques — not simply using faster GPUs, but fundamentally changing how models generate tokens. This article covers five key techniques you can apply in production right now — quantization, PagedAttention, Speculative Decoding, and more — along with guidance on which combination to choose for your workload. By the end, you'll have the code and decision criteria you need to launch a vLLM server with optimized settings.
Whether you're self-hosting LLMs, trying to cut API costs, or simply curious about this space — let's dig into something quite practical.
Core Concepts
The Two Phases and Three Bottlenecks of LLM Inference
LLM token generation proceeds in two broad phases:
- Prefill: Processes the entire input prompt at once to produce the initial KV cache. The KV cache is the memory space where the model stores the computed results (Key and Value matrices) from previous tokens.
- Decoding: Generates tokens one by one in an autoregressive fashion.
Prefill can be parallelized and is relatively fast, but decoding is slow because it is inherently serial: each token can only be generated after the previous one is determined. Three major bottlenecks arise here:
| Bottleneck Type | Cause | Impact |
|---|---|---|
| Memory bandwidth | GPU must reload tens of GB of weights at every step | Throughput degradation |
| KV cache fragmentation | Fixed memory reserved per sequence for KV tensors | VRAM waste |
| Sequential decoding | Autoregressive structure generates one token at a time | Hard lower bound on latency |
The techniques at the heart of modern LLM serving optimization each attack one or more of these three bottlenecks.
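To get a feel for why the KV cache in particular eats VRAM so quickly, here is a rough back-of-envelope sketch. The dimensions below assume a Llama-3.1-8B-class model (32 layers, 8 KV heads via GQA, head dim 128, 16-bit cache); plug in your own model's config to adapt it.

```python
# Rough KV cache sizing (assumed Llama-3.1-8B-like dimensions; adjust for your model)
num_layers = 32        # transformer blocks
num_kv_heads = 8       # KV heads under GQA (fewer than the 32 query heads)
head_dim = 128
bytes_per_elem = 2     # FP16 / BF16 cache

# Both K and V are cached for every layer, KV head, and token
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
print(f"{kv_bytes_per_token / 1024:.0f} KiB per token")  # ~128 KiB

context_len = 8192
print(f"{kv_bytes_per_token * context_len / 2**30:.1f} GiB per max-length sequence")  # ~1.0 GiB
```

At roughly 1 GiB for a fully used 8K sequence, a few dozen concurrent long requests can consume more VRAM than the weights themselves, which is why several of the techniques below target the cache rather than the model.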
The Inference Trilemma: It is impossible to simultaneously maximize Throughput, Latency, and Cost. The starting point for optimization is deciding which of the three to prioritize for your workload. The techniques below each target different vertices of this trilemma.
Here is a quick overview of which technique combinations suit each workload type:
| Workload Type | Trilemma Priority | Key Technique Combination |
|---|---|---|
| Real-time chatbot | Minimize latency | Quantization + Prefix Caching + Speculative Decoding |
| RAG pipeline | Throughput + cost reduction | RadixAttention + PagedAttention |
| Batch processing | Maximize throughput | Quantization + Continuous Batching + large batches |
| Edge / on-device | Cost (VRAM constraint) | GGUF quantization + llama.cpp |
Technique 1: Quantization — Cutting VRAM by Half or More
Quantization represents model weights at a lower bit width, simultaneously reducing VRAM usage and memory bandwidth pressure.
My initial reaction was "doesn't lower precision make the model dumber?" — but in practice it holds up better than you'd expect. That said, there are task-specific differences worth watching out for.
```python
# vLLM >= 0.6.x
from vllm import LLM, SamplingParams
llm = LLM(
model="Qwen/Qwen2.5-7B-Instruct-AWQ",
quantization="awq",
dtype="auto",
gpu_memory_utilization=0.90,
)
sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["LLM 추론 최적화를 설명해주세요."], sampling_params)
print(outputs[0].outputs[0].text)Here is a comparison of the mainstream quantization approaches:
| Method | Bits | Speed Gain | Accuracy Loss | Recommended Use |
|---|---|---|---|---|
| FP8 | 8-bit | ~2x (vs BF16) | Negligible | Data centers, H100/H200 |
| AWQ INT4 | 4-bit | ~3x | Within 5–10% | Production serving |
| GPTQ INT4 | 4-bit | ~3x | Similar to AWQ | Powerful when combined with Marlin kernel |
| GGUF Q4_K_M | 4-bit | Environment-dependent | Balanced | Edge / on-device |
The Q number in GGUF refers to bits per weight. Q8_0 stores at 8 bits, Q4_K_M at 4 bits, Q2_K at 2 bits — lower numbers mean smaller files and faster speeds, but also lower accuracy.
The Rise of FP8: As of 2025, NVIDIA Hopper (H100/H200) GPUs support FP8 at the hardware level. Delivering 2x throughput over BF16 with negligible accuracy loss, FP8 is establishing itself as the default precision for data center serving.
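On Hopper-class GPUs, vLLM can also apply FP8 quantization to a regular BF16 checkpoint at load time, with no separately quantized model required. A minimal sketch, assuming an H100/H200 and a standard BF16 checkpoint:

```python
# Minimal sketch: online FP8 weight quantization in vLLM (Hopper-class GPU assumed)
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # regular BF16 checkpoint
    quantization="fp8",                        # weights are quantized to FP8 at load time
    gpu_memory_utilization=0.90,
)
```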
Workloads where this technique shines: When you want to reduce both latency and cost, especially for edge deployment or VRAM-constrained environments. For math reasoning and code generation tasks, always run a benchmark before switching to INT4.
Technique 2: PagedAttention + Continuous Batching — Keeping the GPU Busy
One of the biggest sources of waste in traditional LLM serving systems was pre-reserving KV cache memory up to the maximum sequence length. If you reserved space for 1,024 tokens but only used 200, the remaining 824 tokens' worth of VRAM was simply wasted.
PagedAttention applies the virtual memory paging concept from operating systems to the KV cache, dynamically allocating only as many blocks as are actually used. Combined with Continuous Batching, the moment one request finishes, the next is pushed into the batch, leaving the GPU with almost no idle time. (Runpod — vLLM PagedAttention Guide)
```bash
# vLLM >= 0.6.x
# PagedAttention and Continuous Batching are enabled by default
# --enable-prefix-caching: reuses KV cache for common system prompts
# --max-num-seqs: maximum number of sequences to process concurrently
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--max-model-len 8192 \
--gpu-memory-utilization 0.90 \
--enable-prefix-caching \
--max-num-seqs 256
```

Comparing before and after vLLM, it is common to see a 2–4x increase in the number of concurrent requests a single GPU can handle.
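The server above exposes an OpenAI-compatible API, so existing clients work unchanged. A quick sanity check with the official openai package might look like this (the api_key value is a placeholder, since vLLM does not check it by default):

```python
# Query the vLLM OpenAI-compatible server started above
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain PagedAttention in one sentence."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```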
Workloads where this technique shines: Services with many concurrent users (chatbots, API gateways). This is a baseline technique applied in almost every throughput-first environment.
Technique 3: Speculative Decoding — Draft Ahead, Verify at Once
Honestly, when I first encountered this technique I thought "does this even make sense?" The idea seems almost too simple.
A small draft model (or auxiliary head) speculatively predicts several tokens ahead. Then the large target model validates all of those guesses in a single forward pass. Accepted tokens are kept as-is; generation resumes from the first rejected token.
Key guarantee: Accepted tokens follow mathematically the same distribution as if the target model had generated them directly. In other words, output quality is unchanged — only speed increases. (NVIDIA Developer — Introduction to Speculative Decoding)
The most widely adopted approach today is EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency), which leverages the target model's own features to predict the next token rather than requiring a separate draft model. EAGLE-3 delivers a 2.5–2.8x speedup at roughly 80% acceptance rate, and vLLM, SGLang, and TensorRT-LLM all include native support. (EAGLE-3 GitHub)
```python
# vLLM >= 0.6.x
# speculative_model="[ngram]": generates drafts from N-gram patterns without a separate model
from vllm import LLM, SamplingParams
llm = LLM(
model="meta-llama/Llama-3.1-8B-Instruct",
speculative_model="[ngram]",
num_speculative_tokens=5,
ngram_prompt_lookup_max=4,  # n-gram window size looked up in the prompt
use_v2_block_manager=True,
)
```

ngram vs. EAGLE selection guide: ngram requires no additional model and has zero memory overhead — you can enable it instantly. It works well for tasks with repetitive patterns (template-based output, code comments). EAGLE uses a separate draft head and requires some setup, but achieves much higher acceptance rates on general conversation and summarization tasks.
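For comparison, an EAGLE setup in vLLM mostly swaps the draft source: speculative_model points at an EAGLE draft head trained for the target model instead of "[ngram]". A sketch under the assumption that a matching draft-head checkpoint is available (the repository id below is illustrative; use the one published for your model):

```python
# Sketch: EAGLE-style speculative decoding in vLLM (draft-head repo id is illustrative)
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_model="yuhuili/EAGLE-LLaMA3.1-Instruct-8B",  # illustrative: pick the head matching your target model
    num_speculative_tokens=5,
)
```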
Workloads where this technique shines: Real-time chatbots where latency is the top priority. For tasks with highly unpredictable outputs (creative writing, code), acceptance rates tend to be low and can actually introduce overhead — use with caution.
Technique 4: Prefix Caching / RadixAttention — Caching Repeated Context
If you have ever operated a chatbot with a long system prompt or a RAG pipeline, you have felt the pain of re-prefilling the same document or instructions on every request.
I ran into this myself — I enabled prefix caching and got a much lower cache hit rate than expected. It turned out that a timestamp at the end of the system prompt was making each request subtly different. Even a one-byte difference in the prefix causes a cache miss. Clean up your prompt structure first.
Prefix Caching saves the KV cache for common prefix tokens and reuses it on subsequent requests. In vLLM it is enabled with a single --enable-prefix-caching flag.
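The practical rule from the anecdote above: keep everything that is shared byte-identical at the front of the prompt, and push anything request-specific (timestamps, user ids, retrieved chunks) behind it. A small sketch of that layout:

```python
# Prompt layout that keeps the shared prefix byte-identical across requests
import datetime

SYSTEM_PROMPT = "You are a support assistant for the ACME billing team."  # stable -> cacheable

def build_prompt(user_question: str) -> str:
    # Anything that varies per request goes AFTER the shared prefix,
    # so it can no longer invalidate the cached system-prompt KV blocks.
    now = datetime.datetime.now().isoformat(timespec="seconds")
    return f"{SYSTEM_PROMPT}\n\n[request time: {now}]\nUser: {user_question}\nAssistant:"
```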
SGLang's RadixAttention goes a step further, managing shared prefixes in a Radix Tree structure so that even partial matches can be reused. For RAG pipelines or agents where a system prompt plus document chunks repeat across requests, you can expect more than 2x throughput improvement. (Introl Blog — KV Cache Optimization)
Workloads where this technique shines: Chatbots, RAG, and agents where multiple requests share the same system prompt or document chunks. For batch processing where context differs completely across requests, the benefit is minimal.
Technique 5: MLA — How DeepSeek Cut KV Cache by 90%
Multi-Head Latent Attention (MLA) is an architectural innovation introduced in DeepSeek-V2. My first reaction reading the paper was "isn't this just KV cache compression?" — but serving it in practice feels different.
Practical summary: When serving MLA-based models like DeepSeek-V3 or Kimi K2, you can fit far longer contexts or handle far more concurrent requests on the same VRAM — thanks to up to a 90% reduction in KV cache size.
Architectural background: Whereas standard MHA (Multi-Head Attention) caches both Key and Value matrices, MLA represents both using a single compressed latent vector. Retrofitting MLA onto existing MHA models requires fine-tuning, so in practice it is a factor you consider when choosing new models.
| Method | Cache Size per Head | Characteristics |
|---|---|---|
| MHA (standard) | 2d (K + V) | General purpose; supported by all existing models |
| GQA | 2d / G (shared across G heads) | Adopted by Llama 3; reduces KV cache without additional training |
| MLA | ~d/2 or less | Adopted by DeepSeek-V3, Kimi K2; maximum compression |
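To make the compression concrete, here is a rough per-layer, per-token count using approximate DeepSeek-V3-style dimensions (128 heads of dim 128 for standard MHA, versus a 512-dim compressed latent plus a 64-dim decoupled RoPE key for MLA). The figures are illustrative; real end-to-end savings are smaller, which is where the ~90% number above comes from.

```python
# Back-of-envelope: cached elements per token per layer, MHA vs. MLA (DeepSeek-V3-like dims, illustrative)
num_heads, head_dim = 128, 128
kv_lora_rank, rope_dim = 512, 64          # MLA: compressed KV latent + decoupled RoPE key

mha_elems = 2 * num_heads * head_dim      # full K and V for every head
mla_elems = kv_lora_rank + rope_dim       # only the latent (and small RoPE key) is cached

print(mha_elems, mla_elems)               # 32768 vs. 576
print(f"{1 - mla_elems / mha_elems:.1%} fewer cached elements")
```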
FlashMLA is a FlashAttention-style kernel purpose-built for MLA, achieving 660 TFlops in BF16 on NVIDIA H800. Given that H800's theoretical dense BF16 peak is roughly 990 TFlops (the often-quoted 1,979 TFlops figure assumes 2:4 structured sparsity), this works out to roughly two-thirds of peak compute; for the decoding phase, which is memory-bound and dominated by KV cache access, that is a substantively strong number. (DeepSeek FlashMLA GitHub)
Workloads where this technique shines: Serving models that use MLA architecture, such as DeepSeek-V3 or Kimi K2. Direct application to existing MHA models is not feasible.
Practical Application
Example 1: Conversational Chatbot — Targeting TTFT Under 100ms
For real-time chatbots, the metric users feel most acutely is TTFT (Time-to-First-Token) — the time until the first token appears. Once TTFT exceeds 200ms, users start to perceive the system as slow. This is a latency-minimization scenario in the trilemma.
```python
# vLLM >= 0.6.x
from vllm import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
engine_args = AsyncEngineArgs(
model="Qwen/Qwen2.5-14B-Instruct-AWQ",
quantization="awq",
enable_prefix_caching=True,
max_num_seqs=128,
gpu_memory_utilization=0.85,
speculative_model="[ngram]",
num_speculative_tokens=4,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```

| Optimization | TTFT Contribution |
|---|---|
| AWQ INT4 quantization | ~3x memory bandwidth improvement |
| Prefix Caching | Eliminates system prompt recomputation; 50–90% TTFT reduction |
| Speculative Decoding | 2.5–2.8x decoding latency reduction |
Example 2: RAG Pipeline — Reusing Document KV to Boost Throughput
RAG (Retrieval-Augmented Generation) pipelines typically repeat the structure [system prompt + retrieved document chunks + user question] across requests. When document chunks are shared across multiple requests, KV cache reuse reaches its maximum effectiveness. This is a scenario targeting both throughput and cost reduction in the trilemma.
```bash
# SGLang — RadixAttention is enabled by default
# --chunked-prefill-size: splits long document prefills into multiple chunks for processing
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--port 30000 \
--mem-fraction-static 0.85 \
--chunked-prefill-size 512
```

```python
# vLLM >= 0.6.x
# Combining LMCache + vLLM for persistent KV caching in multi-turn QA
# pip install lmcache vllm
import lmcache.integration.vllm # noqa: F401
# This import patches vLLM's internal classes with LMCache-integrated versions.
# The KV cache persistence feature is activated purely as a side effect of this import, with no further code changes required.
from vllm import LLM
llm = LLM(
model="meta-llama/Llama-3.1-8B-Instruct",
kv_cache_dtype="auto",
enable_prefix_caching=True,
)
# Subsequent requests referencing the same document are served without recomputing KV
```

Combining LMCache with vLLM has been reported to yield up to 15x throughput improvement for multi-turn QA and document analysis workloads.
Example 3: Edge / On-Device Deployment — Running Without a Server GPU
This is a cost (VRAM-constrained) scenario in the trilemma. If you need to run an LLM on a MacBook or small device without a server GPU, llama.cpp is the de facto standard.
```bash
# After installing Ollama
# Q4_K_M: Medium variant of K-quants 4-bit quantization — balanced speed and accuracy
ollama pull qwen2.5:7b-instruct-q4_K_M
# Or using llama.cpp directly
# --n-gpu-layers: set to the maximum (or -1) on Apple Silicon to offload all layers to the Metal GPU backend
./llama-cli \
-m ./models/qwen2.5-7b-instruct-Q4_K_M.gguf \
-n 512 \
--n-gpu-layers 35 \
-p "LLM 추론 최적화를 설명해줘"| GGUF Quantization Level | Model Size (7B) | Accuracy | Recommended Scenario |
|---|---|---|---|
| Q8_0 (8-bit) | ~7.7 GB | Nearly identical to BF16 | When VRAM is plentiful |
| Q4_K_M (4-bit) | ~4.4 GB | Balanced | General edge deployment |
| Q2_K (2-bit) | ~2.7 GB | Noticeable degradation | Extreme memory constraints |
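If you want the same GGUF model from Python instead of the CLI, the llama-cpp-python bindings wrap llama.cpp with a similar interface. A minimal sketch (the model path is illustrative):

```python
# Minimal llama-cpp-python sketch for a local GGUF model (pip install llama-cpp-python)
from llama_cpp import Llama

llm = Llama(
    model_path="./models/qwen2.5-7b-instruct-Q4_K_M.gguf",
    n_gpu_layers=-1,   # offload every layer to Metal / CUDA when available
    n_ctx=4096,
)
out = llm("Explain LLM inference optimization in two sentences.", max_tokens=128)
print(out["choices"][0]["text"])
```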
Example 4: Batch Processing — Maximizing Throughput Configuration
For large-scale batch jobs where real-time responsiveness is not required, choose maximum throughput in the trilemma. Tokens processed per hour matters more than latency.
```python
# vLLM >= 0.6.x
from vllm import LLM, SamplingParams
llm = LLM(
model="Qwen/Qwen2.5-72B-Instruct-AWQ",
quantization="awq",
tensor_parallel_size=4,
gpu_memory_utilization=0.95,
max_num_seqs=512,
)
# Passing thousands of prompts at once lets the engine build optimal internal batches
prompts = [f"문서 {i}를 요약해주세요: ..." for i in range(5000)]
outputs = llm.generate(prompts, SamplingParams(max_tokens=256))Intelligent Model Routing: Routing simple tasks to smaller models (7B or below) and only complex tasks to larger models can save an additional 30–60% in cost. Implementable with LiteLLM or a custom router.
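A custom router does not need to be sophisticated to pay off; even a crude length- or keyword-based heuristic in front of two vLLM endpoints captures much of the saving. A toy sketch, where the endpoints, model names, and thresholds are all illustrative:

```python
# Toy model router: cheap model for simple prompts, large model for hard ones (all values illustrative)
from openai import OpenAI

SMALL = ("http://small-endpoint:8000/v1", "Qwen/Qwen2.5-7B-Instruct-AWQ")
LARGE = ("http://large-endpoint:8000/v1", "Qwen/Qwen2.5-72B-Instruct-AWQ")
HARD_KEYWORDS = ("prove", "refactor", "step by step")

def route(prompt: str):
    # Crude difficulty heuristic: long prompts or "hard" keywords go to the large model
    hard = len(prompt) > 2000 or any(k in prompt.lower() for k in HARD_KEYWORDS)
    base_url, model = LARGE if hard else SMALL
    client = OpenAI(base_url=base_url, api_key="EMPTY")
    return client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
```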
Pros and Cons Analysis
Advantages
| Item | Detail |
|---|---|
| Cost reduction | GPT-4-class inference costs have fallen ~50x since GPT-4's 2023 launch; optimization can yield additional savings |
| Responsiveness | Speculative Decoding + Prefix Caching combination dramatically reduces TTFT |
| Scalability | 2–4x more concurrent requests on the same GPU |
| Edge viability | INT4 quantization enables serving 70B models on small servers |
| Quality preservation | Minimal quality degradation for most tasks with FP8 and AWQ |
Drawbacks and Caveats
| Item | Detail | Mitigation |
|---|---|---|
| Quantization accuracy | INT4 can cause >10% degradation on math reasoning and code generation | Prefer FP8 where hardware allows, choose AWQ over naive INT4, and always benchmark |
| Speculative Decoding variance | Low acceptance rates on unpredictable tasks like creative writing or code | Monitor per-task acceptance rate before committing |
| Memory overhead | Loading a separate draft model increases VRAM usage | Consider ngram (no extra model) or EAGLE (lightweight draft head) |
| Hardware dependency | TensorRT-LLM and FP8 optimizations are NVIDIA-only | Use llama.cpp / ROCm for AMD / Apple Silicon |
| MLA adoption difficulty | Applying MLA to existing MHA models requires fine-tuning | Factor in MLA architecture when selecting new models |
GQA (Grouped-Query Attention): An approach in which several query heads share a single Key-Value head, placing it between MHA and MLA in KV cache footprint. Adopted by the Llama 3 series, it is a practical option for reducing KV cache size within an existing architecture without additional training.
The Most Common Mistakes in Practice
- Choosing a technique before analyzing the workload: If you have not first clarified whether you are latency-first or throughput-first, even the best technique can work against you. Applying batch optimizations to a chatbot can actually increase TTFT.
- Skipping the benchmark after quantization: "INT4 is usually fine" is broadly true, but I have personally seen larger-than-expected degradation on certain instruction-following tasks. Math reasoning and code generation in particular deserve a dedicated check.
- Not monitoring Speculative Decoding acceptance rate: I once skipped the acceptance rate check and only discovered much later that Speculative Decoding was actually slowing things down. When acceptance rate is low, the draft generation overhead becomes pure cost. Periodically check spec_decode_acceptance_rate from vLLM's /metrics endpoint (see the sketch after this list).
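The check can start as small as pulling vLLM's Prometheus text output and filtering for the speculative-decoding counters. Exact metric names vary a little between vLLM versions, so the sketch below filters on a keyword rather than hard-coding one:

```python
# Sketch: surface speculative-decoding metrics from vLLM's /metrics endpoint
import requests

text = requests.get("http://localhost:8000/metrics", timeout=5).text
for line in text.splitlines():
    # keep only data lines mentioning speculative decoding (names vary by vLLM version)
    if "spec_decode" in line and not line.startswith("#"):
        print(line)
```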
Closing Thoughts
The key to LLM serving optimization is not "one magic technique" but finding the right combination of techniques for your workload's characteristics. And to find that combination, you need to see the numbers first. Optimizing without monitoring is like driving long-distance without a fuel gauge. The first step is not spinning up a server — it is measuring what is slow right now.
Three steps you can take immediately:
- Install vLLM and start serving with an AWQ model. Run pip install vllm and then python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-7B-Instruct-AWQ --quantization awq --enable-prefix-caching. That single line brings up a server with PagedAttention, Continuous Batching, and Prefix Caching all enabled.
- Determine whether your workload is latency-first or throughput-first, then reference the configuration from whichever of Examples 1–4 most closely matches your case. You can add --enable-chunked-prefill (for RAG or long context) or Speculative Decoding (--speculative-model "[ngram]") one at a time and observe how the metrics change.
- Start by running curl http://localhost:8000/metrics to see the numbers vLLM exposes. GPU utilization, batch size, and Speculative Decoding acceptance rate all stream out as plain text. Once those numbers are visible, the direction for your next optimization becomes self-evident. When you have bandwidth, connect Prometheus + Grafana to build a dashboard.
References
Quantization
- LLM Inference Optimization Official Docs | HuggingFace
- LLM Model Quantization Optimization for AWS Inferentia | AWS Tech Blog
PagedAttention / Continuous Batching
- vLLM: PagedAttention, Continuous Batching Guide | Runpod
- LLM Inference Performance Engineering Best Practices | Databricks
Speculative Decoding
- Speculative Decoding for AI Inference Latency | NVIDIA Developer
- Speculative Decoding 2025 Guide | Introl Blog
- Applying Speculative Decoding to HyperCLOVA X | CLOVA Tech Blog
Prefix Caching / KV Cache Optimization
- KV Cache Optimization | Introl Blog
MLA / Attention Optimization
- FlashMLA | DeepSeek GitHub
General / Trade-offs
- The LLM Inference Trilemma | DigitalOcean