Migrating AI Inference Servers After Fly.io GPU Shutdown — Modal · RunPod · Google Cloud Run Cost Comparison & Cold Start Benchmarks
When Fly.io announced it was shutting down its GPU service, I honestly thought, "No way." Fly.io — famous for millisecond container starts — published an official blog post admitting its own limitations ("We Were Wrong About GPUs"), and its GPU service will be terminated entirely on July 31, 2026. If you're running an LLM inference server on Fly.io right now, your requests could start failing after that date. This is no longer something you can put off.
In this post, I'll share my hands-on experience with three platforms — Modal, RunPod, and Google Cloud Run — covering:
- Per-platform H100 cost and per-second billing comparison
- Measured cold start data (based on Beam Cloud 2025 benchmarks)
- Copy-paste-ready deployment code (based on vLLM serverless migration)
- Recommended platform by workload and a common migration checklist
Cold start, cost, and developer experience vary significantly across platforms, so choosing based on your workload characteristics is key. I'll walk through everything concretely from start to finish so you can build your migration plan right away.
Core Concepts
AI Inference Migration Background and Key Metrics
In practice, it's quite common to compare only GPU specs when preparing a migration ("the GPU specs look good"), then get hit with a surprise bill later. Comparing by these three metrics from the start leads to much more rational decisions.
| Metric | Description | Why It Matters |
|---|---|---|
| Cold Start | Time for a GPU container to become ready from idle state to first request | Directly impacts response latency and user experience |
| GPU Cost per Hour | On-demand vs. usage-based per-second billing | Determines total cost with variable traffic |
| Scale-to-Zero | Ability to bring cost to $0 when there's no traffic | Reduces cost during low-traffic periods |
Scale-to-Zero: A model that completely shuts down containers during periods with no requests, stopping billing. For services with uneven 24-hour traffic, this can dramatically cut costs.
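To put numbers on this, here is a small self-contained simulation of billed GPU-seconds under scale-to-zero versus always-on. The traffic trace, 5-second request duration, 300-second idle timeout, and hourly rate are all assumed values for illustration, not measurements from any platform:

```python
# Illustrative sketch: billed GPU-seconds for scale-to-zero vs. always-on.
# The request trace, timings, and hourly rate below are assumed values.

def billed_seconds_scale_to_zero(request_times, duration_s, idle_timeout_s):
    """Total billed seconds: each request runs for duration_s, and the
    container stays warm (billed) for idle_timeout_s after it finishes."""
    billed = 0.0
    warm_until = float("-inf")  # when the current container will shut down
    for t in sorted(request_times):
        window_end = t + duration_s + idle_timeout_s
        if t > warm_until:
            billed += duration_s + idle_timeout_s  # cold start: new container
        else:
            billed += window_end - warm_until      # warm: extend the idle window
        warm_until = window_end
    return billed

HOURLY_RATE = 3.95                        # assumed H100 rate in $/hr
requests = [0, 30, 60, 3600, 3605, 7200]  # assumed bursty trace (seconds)

billed = billed_seconds_scale_to_zero(requests, duration_s=5, idle_timeout_s=300)
always_on = 7505                          # keep a GPU up for the same ~2.1h window

print(f"scale-to-zero: {billed:.0f}s billed -> ${billed / 3600 * HOURLY_RATE:.2f}")
print(f"always-on:     {always_on}s billed -> ${always_on / 3600 * HOURLY_RATE:.2f}")
```

With this made-up trace, scale-to-zero bills 980 seconds instead of 7,505, roughly an 87% reduction; a denser trace shrinks the gap.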
Fly.io's core value proposition was "millisecond machine startup," but GPU workloads are inherently at odds with this philosophy. The GPU preparation process itself — CUDA driver initialization, model weight loading, etc. — requires anywhere from several seconds to tens of seconds. Fly.io honestly acknowledged this point, and the result was the decision to shut down the service. I actually viewed this decision positively. Rather than cramming GPU infrastructure into a general-purpose platform, having a GPU-specialized managed serverless platform fill that role is better for users too.
All three platforms officially support vLLM-based deployment. Thanks to a technology called PagedAttention that efficiently manages GPU memory and significantly increases throughput, vLLM has become one of the most widely adopted LLM serving engines today (competing engines like SGLang and TGI are also actively used).
vLLM: An LLM serving engine that dynamically allocates GPU memory in page units to maximize batch processing efficiency. The fact that it provides an OpenAI-compatible endpoint by default is practical — it means you can use existing OpenAI SDK code with almost no modifications.
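Since all three platforms end up exposing the same OpenAI wire format, a migration is mostly a base-URL swap. Here's a minimal standard-library sketch; the endpoint URL, API key, and model name are placeholders, not a real deployment:

```python
# Sketch of calling any OpenAI-compatible vLLM endpoint with only the stdlib.
# BASE_URL and the API key are placeholders; substitute your deployment's values.
import json
import urllib.request

BASE_URL = "https://your-endpoint.example.com/v1"  # hypothetical endpoint
MODEL = "Qwen/Qwen3-8B-FP8"

def build_chat_request(prompt: str) -> urllib.request.Request:
    """Build a /v1/chat/completions request in the OpenAI wire format."""
    body = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer YOUR_API_KEY"},  # placeholder key
    )

if __name__ == "__main__":
    # Only BASE_URL changes when you switch platforms; the payload is identical.
    with urllib.request.urlopen(build_chat_request("Hello")) as resp:
        print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

The same applies if you use the official OpenAI SDK: pointing its `base_url` at the new platform is typically the only client-side change.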
Practical Application
Example 1: Modal — Spinning Up a vLLM Server with Python Code Alone
I was skeptical at first — "Define GPU infrastructure with just Python, no YAML?" — but after actually using it, you really feel that one file of code handles everything without configuration files. It's especially powerful when an ML team needs to serve models directly without an infrastructure engineer.
The @app.cls decorator abstracts an entire Python class into a single GPU container. Simply declaring the class makes "an instance of this class = one GPU container."
```python
import modal

app = modal.App("vllm-inference")

# GPU container image definition — pip_install layers are cached
image = (
    modal.Image.debian_slim()
    .pip_install("vllm", "huggingface_hub")
)

# Modal Volume for storing model weights — key to reducing cold starts
volume = modal.Volume.from_name("model-weights", create_if_missing=True)
WEIGHTS_PATH = "/vol/models"


@app.cls(
    gpu="H100",
    image=image,
    container_idle_timeout=300,  # auto-terminate after 5 idle minutes → scale-to-zero
    volumes={WEIGHTS_PATH: volume},
)
class Model:
    @modal.enter()
    def load_model(self):
        from vllm import LLM

        # Reading weights cached in the Volume skips repeated downloads
        self.llm = LLM(
            model="Qwen/Qwen3-8B-FP8",
            download_dir=WEIGHTS_PATH,
        )

    @modal.method()
    def generate(self, prompt: str):
        # In production, adding OOM and timeout handling is recommended
        return self.llm.generate(prompt)
```

```bash
modal deploy inference.py
```

| Code Element | Role |
|---|---|
| `@app.cls(gpu="H100")` | Declares allocation of one H100 GPU |
| `container_idle_timeout=300` | Automatically terminates the container after 5 minutes with no requests |
| `volumes={WEIGHTS_PATH: volume}` | Mounts the Modal Volume — key to weight caching |
| `@modal.enter()` | Runs once at container startup — model loading |
| `@modal.method()` | Method to be exposed as an HTTP endpoint |
A single modal deploy line instantly creates an OpenAI-compatible API endpoint. The key here is modal.Volume. Early on I included model weights directly in the image and once waited nearly 40 minutes per deployment. By caching weights downloaded once to a Volume, you completely skip the repeated download time in subsequent cold starts.
Example 2: RunPod — Getting Started Fastest and Cheapest
If cost is your top priority, RunPod is currently the cheapest option on the market for H100. From the RunPod Hub, you can choose the vLLM Worker template and spin up major models like Llama 3, Mistral, Qwen3, DeepSeek-R1, and Phi-4 as serverless endpoints with a single click. No need to build Docker images yourself.
If you need a custom image, it's recommended to use the RunPod official vLLM base image (runpod/vllm:latest) as your starting point. Here's an example of deploying via CLI:
```bash
# Deploy a custom Docker image with the RunPod CLI.
# Base image: starting from runpod/vllm:latest is recommended.
runpodctl create endpoint \
  --name my-llm \
  --image my-org/my-vllm:latest \
  --gpu-type H100 \
  --workers-min 0 \
  --workers-max 10
```

| Option | Description |
|---|---|
| `--workers-min 0` | Enables scale-to-zero |
| `--workers-max 10` | Scales up to 10 workers during traffic spikes |
| `--gpu-type` | Wide selection available: A4000, A100, H100, AMD, etc. |
According to Beam Cloud's 2025 Serverless GPU Benchmark, thanks to FlashBoot technology, 48% of small containers complete cold starts in under 200ms. Cases of 77% cost savings at moderate traffic levels compared to traditional dedicated Pods have also been reported (source: RunPod official blog, 2025).
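For custom (non-OpenAI-compatible) workers, RunPod serverless endpoints are invoked through REST routes such as `/runsync`, which wrap the worker's input in an `{"input": ...}` envelope. A minimal stdlib sketch with a placeholder endpoint ID and API key:

```python
# Sketch of a synchronous call to a RunPod serverless endpoint.
# ENDPOINT_ID and the API key are placeholders for your own values.
import json
import urllib.request

ENDPOINT_ID = "your-endpoint-id"  # hypothetical

def build_runsync_request(prompt: str) -> urllib.request.Request:
    """RunPod serverless wraps worker input in an {"input": ...} envelope."""
    body = {"input": {"prompt": prompt, "max_tokens": 128}}
    return urllib.request.Request(
        f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer YOUR_RUNPOD_API_KEY"},  # placeholder
    )

if __name__ == "__main__":
    with urllib.request.urlopen(build_runsync_request("Hello")) as resp:
        print(json.loads(resp.read()))
```

There is also an asynchronous `/run` route that returns a job ID for polling, which suits longer generations.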
Example 3: Google Cloud Run — Continuing Seamlessly Within the GCP Ecosystem
If you're already on GCP, Cloud Run is the most natural choice since IAM, Cloud Monitoring, and Artifact Registry connect directly. You can use NVIDIA NIM containers directly, so major models like Llama and Gemma come up without any extra configuration.
Cloud Run is built on Knative, a framework that adds a serverless abstraction layer on top of Kubernetes. Because startup includes GPU driver initialization, cold starts tend to be longer than on the other platforms. Including model loading, it takes about 19 seconds (measured with gemma3:4b), so it's important to configure a startupProbe so traffic doesn't arrive before the model is ready.
```yaml
# cloud-run-gpu.yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: llm-inference
spec:
  template:
    metadata:
      annotations:
        run.googleapis.com/execution-environment: gen2
    spec:
      containers:
      - image: nvcr.io/nim/meta/llama-3.1-8b-instruct:latest
        resources:
          limits:
            nvidia.com/gpu: "1"
            memory: 32Gi
        startupProbe:
          httpGet:
            path: /v1/models
            port: 8000
          initialDelaySeconds: 20  # initial wait to account for model loading time
          periodSeconds: 5
          failureThreshold: 10
```

```bash
gcloud run services replace cloud-run-gpu.yaml --region us-central1
```

The L4 GPU (~$0.67/hr) offers good price-to-performance and is sufficient for mid-scale model serving. However, keep in mind that it has the longest overall cold start of the three platforms. For latency-sensitive services, it's recommended to pair this with a minimum-instances setting that keeps at least one instance warm.
Common Migration Checklist
Regardless of which platform you're moving to from Fly.io, here are five things to take care of in common.
- Migrate model weight storage (this takes the longest): Fly Volumes → Modal Volume / RunPod Network Volume / GCS. Depending on weight size, the migration itself can take several hours, so it's best to start this first.
- Rebuild Docker images: remove Fly.io-specific `fly.toml` settings and switch to a standard Dockerfile.
- Migrate environment variables and secrets: Fly secrets → each platform's secrets manager.
- Replace endpoint URLs: update base URLs in client code.
- Re-configure scaling policies: redefine min/max worker counts and idle timeout.
Pros and Cons Analysis
Platform Cost at a Glance
| Platform | GPU | Cost per Hour | Billing Unit | Scale-to-Zero |
|---|---|---|---|---|
| RunPod | H100 SXM | $1.99/hr | Per second | Supported |
| Modal | H100 | $3.95/hr | Per second | Supported |
| Google Cloud Run | L4 (24GB) | ~$0.67/hr | Per second | Supported |
| Google Cloud Run | A100 40GB | ~$3.67/hr | Per second | Supported |
Cold Start Comparison
| Platform | Cold Start | Technical Background |
|---|---|---|
| RunPod | Under 200ms for small containers (48%), 6–12s for large | FlashBoot |
| Modal | Consistently 2–4s (Beam Cloud 2025 benchmark, Qwen3-8B basis) | gVisor lightweight VM + container pooling |
| Google Cloud Run | Under 5s for GPU driver, ~19s including model loading | Knative-based, gVisor sandbox |
Container Pooling: A technique used by Modal that maintains a pool of pre-warmed containers and assigns them immediately when a request arrives. This is why cold starts come out "consistently" at 2–4s. gVisor is a lightweight kernel isolation layer that allows containers to start much faster than regular VMs while maintaining security.
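A toy model makes the pooling effect visible: requests served from a pre-warmed pool see a small, consistent latency, and only pool exhaustion forces a full cold boot. All numbers below are illustrative assumptions, not benchmark figures:

```python
# Toy model of container pooling: a pool of pre-warmed containers absorbs
# cold starts. All timings here are illustrative assumptions, not measurements.
COLD_BOOT_S = 20.0   # assumed full boot: VM + CUDA init + model load
WARM_ASSIGN_S = 3.0  # assumed cost of attaching a request to a pooled container

def request_latencies(n_requests: int, pool_size: int):
    """Startup latency seen by each request with pool_size pre-warmed containers."""
    warm = pool_size
    latencies = []
    for _ in range(n_requests):
        if warm > 0:
            warm -= 1
            latencies.append(WARM_ASSIGN_S)  # consistent: the pool absorbs the boot
        else:
            latencies.append(COLD_BOOT_S)    # pool exhausted: full cold start
    return latencies

print(request_latencies(5, pool_size=3))
```

Modal's "consistently 2–4s" corresponds to keeping the pool rarely exhausted; sizing and refilling that pool is the platform's job, which is one way to read the price premium.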
Strengths
| Platform | Strengths |
|---|---|
| Modal | Python-native API, stable cold starts, excellent developer experience (DX), automatic scaling |
| RunPod | Lowest GPU price ($1.99/hr), fastest cold starts (FlashBoot), wide GPU selection (including AMD) |
| Google Cloud Run | Full GCP ecosystem integration (IAM · Monitoring · Artifact Registry), enterprise-grade SLA, L4 price competitiveness |
Weaknesses and Caveats
| Platform | Weaknesses | Mitigation |
|---|---|---|
| Modal | ~2x the cost of RunPod for H100 | For variable workloads with under 50% GPU utilization, per-second billing can actually result in lower total cost — it's worth calculating against your actual usage pattern |
| RunPod | Cold start 6–12s for large custom images, stability less certain compared to large clouds | For latency-critical services, it's recommended to run in always-on Pod mode |
| Google Cloud Run | Longest overall cold start (~19s), limited high-performance GPU options | Pre-upload models to GCS and fast-mount in startup script; keeping a minimum of 1 instance running together helps |
Recommended Platform by Workload
| Situation | Recommended Platform | Reason |
|---|---|---|
| Python ML team, rapid prototyping | Modal | Code-centric deployment without YAML, reliable DX |
| Cost-first, variable traffic | RunPod Serverless | Lowest GPU price + FlashBoot |
| Leveraging existing GCP infrastructure | Google Cloud Run | Ecosystem integration, minimal reconfiguration |
| Low-latency production requirements | RunPod Pod (always-on) | Dedicated instance with no cold start |
Most Common Mistakes in Practice
- Including model weights in the Docker image: The image size balloons to tens of GB, causing build and deployment times to explode. I once included weights in the image early on and had deployments taking 40 minutes. It's strongly recommended to keep weights in external storage (Modal Volume, RunPod Network Volume, GCS) and mount them at runtime.
- Setting idle timeout too short: Scale-to-zero saves costs, but a higher cold start frequency degrades user experience. It's much safer to analyze your traffic patterns first, then set the idle timeout.
- Deciding based on a single metric (cost per hour): Looking only at H100 hourly cost makes RunPod seem overwhelmingly better, but for actual variable traffic with low GPU utilization, Modal's per-second billing can result in lower total costs. It's strongly recommended to always calculate based on your actual usage patterns.
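That last point is easy to turn into arithmetic. A quick sketch using the hourly rates from the cost table above (the 30% utilization figure is an assumed example; plug in your own measurement):

```python
# Sketch: monthly cost of an always-on GPU vs. per-second serverless billing.
# Rates come from the comparison table; the utilization figure is an assumption.
HOURS_PER_MONTH = 730

def monthly_cost_always_on(rate_per_hr: float) -> float:
    """An always-on instance bills every hour of the month."""
    return rate_per_hr * HOURS_PER_MONTH

def monthly_cost_per_second(rate_per_hr: float, utilization: float) -> float:
    """Per-second billing with scale-to-zero: you only pay for busy time."""
    return rate_per_hr * HOURS_PER_MONTH * utilization

runpod_always_on = monthly_cost_always_on(1.99)         # $1.99/hr H100 Pod
modal_serverless = monthly_cost_per_second(3.95, 0.30)  # 30% utilization (assumed)

print(f"RunPod always-on Pod: ${runpod_always_on:,.0f}/mo")
print(f"Modal serverless:     ${modal_serverless:,.0f}/mo")
# Break-even utilization is 1.99 / 3.95 ≈ 50%: above roughly half-time busy,
# the cheaper always-on Pod wins; below it, per-second billing wins.
```

Swap in your own rates and measured utilization before deciding; the crossover moves with every price change.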
Closing Thoughts
The Fly.io GPU shutdown is another confirmation of why you should choose a GPU-specialized platform. Modal, RunPod, and Cloud Run are all platforms far better suited to GPU inference workloads than Fly.io, and the migration is also an opportunity to move to better infrastructure.
Three steps you can start right now:
- Check your daily average GPU usage hours and traffic variance, and compare against the workload-based platform recommendation table above to decide on your destination first.
- Proceed in the order of the common migration checklist (model weight migration → Dockerfile cleanup → secrets migration → URL replacement → scaling reconfiguration). Starting with the model weights is recommended: depending on their size, that step takes the longest. For Modal, the Python code example above is a good starting point; for RunPod, the `runpodctl create endpoint` command; and for Cloud Run, the YAML above.
- Validate in a staging environment with plenty of time before July 31, 2026. It's much safer to confirm cold start measurements and cost simulations yourself before switching to production.
If you're unsure which platform suits your workload, leave your situation in the comments. We can think it through together.
Next post: An in-depth guide to model weight caching strategies for reducing cold starts to under 1 second with vLLM + Modal Volume
References
- "We Were Wrong About GPUs" — Fly.io official blog
- Fly.io GPU migration community announcement (shutdown date: 2026-07-31)
- Modal official vLLM deployment guide
- Modal — vLLM deployment how-to blog post
- RunPod Serverless LLM 2025 update
- RunPod vLLM getting started guide
- Google Cloud Run GPU official launch blog
- NVIDIA: Google Cloud Run L4 GPU support announcement
- Serverless GPU platform cold start benchmarks 2025 — Beam Cloud
- Google Cloud Run GPU pricing official documentation
- Modal NVIDIA L4 pricing analysis