Migrating AI Inference Servers After Fly.io GPU Shutdown — Modal · RunPod · Google Cloud Run Cost Comparison & Cold Start Benchmarks
When Fly.io announced it was shutting down its GPU service, I honestly thought, "No way." Fly.io — famous for millisecond container starts — published an official blog post admitting its own limitations ("We Were Wrong About GPUs"), and its GPU service will be terminated entirely on July 31, 2026. If you're running an LLM inference server on Fly.io right now, your requests could start failing after that date. This is no longer something you can put off.
In this post, I'll share my hands-on experience with three platforms — Modal, RunPod, and Google Cloud Run — covering:
- Per-platform H100 cost and per-second billing comparison
- Measured cold start data (based on Beam Cloud 2025 benchmarks)
- Copy-paste-ready deployment code (based on vLLM serverless migration)
- Recommended platform by workload and a common migration checklist
Cold start, cost, and developer experience vary significantly across platforms, so choosing based on your workload characteristics is key. I'll walk through everything concretely from start to finish so you can build your migration plan right away.
Core Concepts
AI Inference Migration Background and Key Metrics
In practice, it's quite common to compare only GPU specs when preparing a migration ("the GPU specs look good"), then get hit with a surprise bill later. Comparing by these three metrics from the start leads to much more rational decisions.
| Metric | Description | Why It Matters |
|---|---|---|
| Cold Start | Time for a GPU container to become ready from idle state to first request | Directly impacts response latency and user experience |
| GPU Cost per Hour | On-demand vs. usage-based per-second billing | Determines total cost with variable traffic |
| Scale-to-Zero | Ability to bring cost to $0 when there's no traffic | Reduces cost during low-traffic periods |
Scale-to-Zero: A model that completely shuts down containers during periods with no requests, stopping billing. For services with uneven 24-hour traffic, this can dramatically cut costs.
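To put numbers on this, here is a small self-contained simulation of billed GPU-seconds under scale-to-zero versus always-on. The traffic trace, 5-second request duration, 300-second idle timeout, and hourly rate are all assumed values for illustration, not measurements from any platform:

```python
# Illustrative sketch: billed GPU-seconds for scale-to-zero vs. always-on.
# The request trace, timings, and hourly rate below are assumed values.

def billed_seconds_scale_to_zero(request_times, duration_s, idle_timeout_s):
    """Total billed seconds: each request runs for duration_s, and the
    container stays warm (billed) for idle_timeout_s after it finishes."""
    billed = 0.0
    warm_until = float("-inf")  # when the current container will shut down
    for t in sorted(request_times):
        window_end = t + duration_s + idle_timeout_s
        if t > warm_until:
            billed += duration_s + idle_timeout_s  # cold start: new container
        else:
            billed += window_end - warm_until      # warm: extend the idle window
        warm_until = window_end
    return billed

HOURLY_RATE = 3.95                        # assumed H100 rate in $/hr
requests = [0, 30, 60, 3600, 3605, 7200]  # assumed bursty trace (seconds)

billed = billed_seconds_scale_to_zero(requests, duration_s=5, idle_timeout_s=300)
always_on = 7505                          # keep a GPU up for the same ~2.1h window

print(f"scale-to-zero: {billed:.0f}s billed -> ${billed / 3600 * HOURLY_RATE:.2f}")
print(f"always-on:     {always_on}s billed -> ${always_on / 3600 * HOURLY_RATE:.2f}")
```

With this made-up trace, scale-to-zero bills 980 seconds instead of 7,505, roughly an 87% reduction; a denser trace shrinks the gap.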
Fly.io's core value proposition was "millisecond machine startup," but GPU workloads are inherently at odds with this philosophy. The GPU preparation process itself — CUDA driver initialization, model weight loading, etc. — requires anywhere from several seconds to tens of seconds. Fly.io honestly acknowledged this point, and the result was the decision to shut down the service. I actually viewed this decision positively. Rather than cramming GPU infrastructure into a general-purpose platform, having a GPU-specialized managed serverless platform fill that role is better for users too.
All three platforms officially support vLLM-based deployment. Thanks to a technology called PagedAttention that efficiently manages GPU memory and significantly increases throughput, vLLM has become one of the most widely adopted LLM serving engines today (competing engines like SGLang and TGI are also actively used).
vLLM: An LLM serving engine that dynamically allocates GPU memory in page units to maximize batch processing efficiency. The fact that it provides an OpenAI-compatible endpoint by default is practical — it means you can use existing OpenAI SDK code with almost no modifications.
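Since all three platforms end up exposing the same OpenAI wire format, a migration is mostly a base-URL swap. Here's a minimal standard-library sketch; the endpoint URL, API key, and model name are placeholders, not a real deployment:

```python
# Sketch of calling any OpenAI-compatible vLLM endpoint with only the stdlib.
# BASE_URL and the API key are placeholders; substitute your deployment's values.
import json
import urllib.request

BASE_URL = "https://your-endpoint.example.com/v1"  # hypothetical endpoint
MODEL = "Qwen/Qwen3-8B-FP8"

def build_chat_request(prompt: str) -> urllib.request.Request:
    """Build a /v1/chat/completions request in the OpenAI wire format."""
    body = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer YOUR_API_KEY"},  # placeholder key
    )

if __name__ == "__main__":
    # Only BASE_URL changes when you switch platforms; the payload is identical.
    with urllib.request.urlopen(build_chat_request("Hello")) as resp:
        print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

The same applies if you use the official OpenAI SDK: pointing its `base_url` at the new platform is typically the only client-side change.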
Practical Application
Example 1: Modal — Spinning Up a vLLM Server with Python Code Alone
I was skeptical at first — "Define GPU infrastructure with just Python, no YAML?" — but after actually using it, you really feel that one file of code handles everything without configuration files. It's especially powerful when an ML team needs to serve models directly without an infrastructure engineer.
The @app.cls decorator abstracts an entire Python class into a single GPU container. Simply declaring the class makes "an instance of this class = one GPU container."
```python
import modal

app = modal.App("vllm-inference")

# GPU container image definition — pip_install layers are cached
image = (
    modal.Image.debian_slim()
    .pip_install("vllm", "huggingface_hub")
)

# Modal Volume for storing model weights — key to reducing cold starts
volume = modal.Volume.from_name("model-weights", create_if_missing=True)
WEIGHTS_PATH = "/vol/models"


@app.cls(
    gpu="H100",
    image=image,
    container_idle_timeout=300,  # auto-terminate after 5 idle minutes → scale-to-zero
    volumes={WEIGHTS_PATH: volume},
)
class Model:
    @modal.enter()
    def load_model(self):
        from vllm import LLM

        # Reading weights cached in the Volume skips repeated downloads
        self.llm = LLM(
            model="Qwen/Qwen3-8B-FP8",
            download_dir=WEIGHTS_PATH,
        )

    @modal.method()
    def generate(self, prompt: str):
        # In production, adding OOM and timeout handling is recommended
        return self.llm.generate(prompt)
```

```bash
modal deploy inference.py
```

| Code Element | Role |
|---|---|
| `@app.cls(gpu="H100")` | Declares allocation of one H100 GPU |
| `container_idle_timeout=300` | Automatically terminates the container after 5 minutes with no requests |
| `volumes={WEIGHTS_PATH: volume}` | Mounts the Modal Volume — key to weight caching |
| `@modal.enter()` | Runs once at container startup — model loading |
| `@modal.method()` | Method to be exposed as an HTTP endpoint |
A single modal deploy line instantly creates an OpenAI-compatible API endpoint. The key here is modal.Volume. Early on I included model weights directly in the image and once waited nearly 40 minutes per deployment. By caching weights downloaded once to a Volume, you completely skip the repeated download time in subsequent cold starts.
Example 2: RunPod — Getting Started Fastest and Cheapest
If cost is your top priority, RunPod is currently the cheapest option on the market for H100. From the RunPod Hub, you can choose the vLLM Worker template and spin up major models like Llama 3, Mistral, Qwen3, DeepSeek-R1, and Phi-4 as serverless endpoints with a single click. No need to build Docker images yourself.
If you need a custom image, it's recommended to use the RunPod official vLLM base image (runpod/vllm:latest) as your starting point. Here's an example of deploying via CLI:
```bash
# Deploy a custom Docker image with the RunPod CLI.
# Base image: starting from runpod/vllm:latest is recommended.
runpodctl create endpoint \
  --name my-llm \
  --image my-org/my-vllm:latest \
  --gpu-type H100 \
  --workers-min 0 \
  --workers-max 10
```

| Option | Description |
|---|---|
| `--workers-min 0` | Enables scale-to-zero |
| `--workers-max 10` | Scales up to 10 workers during traffic spikes |
| `--gpu-type` | Wide selection available: A4000, A100, H100, AMD, etc. |
According to Beam Cloud's 2025 Serverless GPU Benchmark, thanks to FlashBoot technology, 48% of small containers complete cold starts in under 200ms. Cases of 77% cost savings at moderate traffic levels compared to traditional dedicated Pods have also been reported (source: RunPod official blog, 2025).
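For custom (non-OpenAI-compatible) workers, RunPod serverless endpoints are invoked through REST routes such as `/runsync`, which wrap the worker's input in an `{"input": ...}` envelope. A minimal stdlib sketch with a placeholder endpoint ID and API key:

```python
# Sketch of a synchronous call to a RunPod serverless endpoint.
# ENDPOINT_ID and the API key are placeholders for your own values.
import json
import urllib.request

ENDPOINT_ID = "your-endpoint-id"  # hypothetical

def build_runsync_request(prompt: str) -> urllib.request.Request:
    """RunPod serverless wraps worker input in an {"input": ...} envelope."""
    body = {"input": {"prompt": prompt, "max_tokens": 128}}
    return urllib.request.Request(
        f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer YOUR_RUNPOD_API_KEY"},  # placeholder
    )

if __name__ == "__main__":
    with urllib.request.urlopen(build_runsync_request("Hello")) as resp:
        print(json.loads(resp.read()))
```

There is also an asynchronous `/run` route that returns a job ID for polling, which suits longer generations.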
Example 3: Google Cloud Run — Continuing Seamlessly Within the GCP Ecosystem
If you're already on GCP, Cloud Run is the most natural choice since IAM, Cloud Monitoring, and Artifact Registry connect directly. You can use NVIDIA NIM containers directly, so major models like Llama and Gemma come up without any extra configuration.
Cloud Run is built on Knative, a framework that adds a serverless abstraction layer on top of Kubernetes. Because startup includes GPU driver initialization, cold starts tend to be longer than on the other platforms. Including model loading, it takes about 19 seconds (measured with gemma3:4b), so it's important to configure a startupProbe so traffic doesn't arrive before the model is ready.
```yaml
# cloud-run-gpu.yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: llm-inference
spec:
  template:
    metadata:
      annotations:
        run.googleapis.com/execution-environment: gen2
    spec:
      containers:
      - image: nvcr.io/nim/meta/llama-3.1-8b-instruct:latest
        resources:
          limits:
            nvidia.com/gpu: "1"
            memory: 32Gi
        startupProbe:
          httpGet:
            path: /v1/models
            port: 8000
          initialDelaySeconds: 20  # initial wait to account for model loading time
          periodSeconds: 5
          failureThreshold: 10
```

```bash
gcloud run services replace cloud-run-gpu.yaml --region us-central1
```

The L4 GPU (~$0.67/hr) offers good price-to-performance and is sufficient for mid-scale model serving. However, keep in mind that it has the longest overall cold start of the three platforms. For latency-sensitive services, it's recommended to pair this with a minimum-instances setting that keeps at least one instance warm.
Common Migration Checklist
Regardless of which platform you're moving to from Fly.io, here are five things to take care of in common.
- Migrate model weight storage (this takes the longest): Fly Volumes → Modal Volume / RunPod Network Volume / GCS. Depending on weight size, the migration itself can take several hours, so it's best to start this first.
- Rebuild Docker images: remove Fly.io-specific `fly.toml` settings and switch to a standard Dockerfile.
- Migrate environment variables and secrets: Fly secrets → each platform's secrets manager.
- Replace endpoint URLs: update base URLs in client code.
- Re-configure scaling policies: redefine min/max worker counts and idle timeout.
Pros and Cons Analysis
Platform Cost at a Glance
| Platform | GPU | Cost per Hour | Billing Unit | Scale-to-Zero |
|---|---|---|---|---|
| RunPod | H100 SXM | $1.99/hr | Per second | Supported |
| Modal | H100 | $3.95/hr | Per second | Supported |
| Google Cloud Run | L4 (24GB) | ~$0.67/hr | Per second | Supported |
| Google Cloud Run | A100 40GB | ~$3.67/hr | Per second | Supported |
Cold Start Comparison
| Platform | Cold Start | Technical Background |
|---|---|---|
| RunPod | Under 200ms for small containers (48%), 6–12s for large | FlashBoot |
| Modal | Consistently 2–4s (Beam Cloud 2025 benchmark, Qwen3-8B basis) | gVisor lightweight VM + container pooling |
| Google Cloud Run | Under 5s for GPU driver, ~19s including model loading | Knative-based, gVisor sandbox |
Container Pooling: A technique used by Modal that maintains a pool of pre-warmed containers and assigns them immediately when a request arrives. This is why cold starts come out "consistently" at 2–4s. gVisor is a lightweight kernel isolation layer that allows containers to start much faster than regular VMs while maintaining security.
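A toy model makes the pooling effect visible: requests served from a pre-warmed pool see a small, consistent latency, and only pool exhaustion forces a full cold boot. All numbers below are illustrative assumptions, not benchmark figures:

```python
# Toy model of container pooling: a pool of pre-warmed containers absorbs
# cold starts. All timings here are illustrative assumptions, not measurements.
COLD_BOOT_S = 20.0   # assumed full boot: VM + CUDA init + model load
WARM_ASSIGN_S = 3.0  # assumed cost of attaching a request to a pooled container

def request_latencies(n_requests: int, pool_size: int):
    """Startup latency seen by each request with pool_size pre-warmed containers."""
    warm = pool_size
    latencies = []
    for _ in range(n_requests):
        if warm > 0:
            warm -= 1
            latencies.append(WARM_ASSIGN_S)  # consistent: the pool absorbs the boot
        else:
            latencies.append(COLD_BOOT_S)    # pool exhausted: full cold start
    return latencies

print(request_latencies(5, pool_size=3))
```

Modal's "consistently 2–4s" corresponds to keeping the pool rarely exhausted; sizing and refilling that pool is the platform's job, which is one way to read the price premium.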
Strengths
| Platform | Strengths |
|---|---|
| Modal | Python-native API, stable cold starts, excellent developer experience (DX), automatic scaling |
| RunPod | Lowest GPU price ($1.99/hr), fastest cold starts (FlashBoot), wide GPU selection (including AMD) |
| Google Cloud Run | Full GCP ecosystem integration (IAM · Monitoring · Artifact Registry), enterprise-grade SLA, L4 price competitiveness |
Weaknesses and Caveats
| Platform | Weaknesses | Mitigation |
|---|---|---|
| Modal | ~2x the cost of RunPod for H100 | For variable workloads with under 50% GPU utilization, per-second billing can actually result in lower total cost — it's worth calculating against your actual usage pattern |
| RunPod | Cold start 6–12s for large custom images, stability less certain compared to large clouds | For latency-critical services, it's recommended to run in always-on Pod mode |
| Google Cloud Run | Longest overall cold start (~19s), limited high-performance GPU options | Pre-upload models to GCS and fast-mount in startup script; keeping a minimum of 1 instance running together helps |
Recommended Platform by Workload
| Situation | Recommended Platform | Reason |
|---|---|---|
| Python ML team, rapid prototyping | Modal | Code-centric deployment without YAML, reliable DX |
| Cost-first, variable traffic | RunPod Serverless | Lowest GPU price + FlashBoot |
| Leveraging existing GCP infrastructure | Google Cloud Run | Ecosystem integration, minimal reconfiguration |
| Low-latency production requirements | RunPod Pod (always-on) | Dedicated instance with no cold start |
Most Common Mistakes in Practice
- Including model weights in the Docker image: The image size balloons to tens of GB, causing build and deployment times to explode. I once included weights in the image early on and had deployments taking 40 minutes. It's strongly recommended to keep weights in external storage (Modal Volume, RunPod Network Volume, GCS) and mount them at runtime.
- Setting idle timeout too short: Scale-to-zero saves costs, but a higher cold start frequency degrades user experience. It's much safer to analyze your traffic patterns first, then set the idle timeout.
- Deciding based on a single metric (cost per hour): Looking only at H100 hourly cost makes RunPod seem overwhelmingly better, but for actual variable traffic with low GPU utilization, Modal's per-second billing can result in lower total costs. It's strongly recommended to always calculate based on your actual usage patterns.
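That last point is easy to turn into arithmetic. A quick sketch using the hourly rates from the cost table above (the 30% utilization figure is an assumed example; plug in your own measurement):

```python
# Sketch: monthly cost of an always-on GPU vs. per-second serverless billing.
# Rates come from the comparison table; the utilization figure is an assumption.
HOURS_PER_MONTH = 730

def monthly_cost_always_on(rate_per_hr: float) -> float:
    """An always-on instance bills every hour of the month."""
    return rate_per_hr * HOURS_PER_MONTH

def monthly_cost_per_second(rate_per_hr: float, utilization: float) -> float:
    """Per-second billing with scale-to-zero: you only pay for busy time."""
    return rate_per_hr * HOURS_PER_MONTH * utilization

runpod_always_on = monthly_cost_always_on(1.99)         # $1.99/hr H100 Pod
modal_serverless = monthly_cost_per_second(3.95, 0.30)  # 30% utilization (assumed)

print(f"RunPod always-on Pod: ${runpod_always_on:,.0f}/mo")
print(f"Modal serverless:     ${modal_serverless:,.0f}/mo")
# Break-even utilization is 1.99 / 3.95 ≈ 50%: above roughly half-time busy,
# the cheaper always-on Pod wins; below it, per-second billing wins.
```

Swap in your own rates and measured utilization before deciding; the crossover moves with every price change.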
Closing Thoughts
The Fly.io GPU shutdown is another confirmation of why you should choose a GPU-specialized platform. Modal, RunPod, and Cloud Run are all platforms far better suited to GPU inference workloads than Fly.io, and the migration is also an opportunity to move to better infrastructure.
Three steps you can start right now:
- Check your daily average GPU usage hours and traffic variance, and compare against the workload-based platform recommendation table above to decide on your destination first.
- Proceed in the order of the common migration checklist (model weight migration → Dockerfile cleanup → secrets migration → URL replacement → scaling reconfiguration). Starting with the model weights is recommended: depending on their size, that step takes the longest. For Modal, the Python code example above is a good starting point; for RunPod, the `runpodctl create endpoint` command; and for Cloud Run, the YAML above.
- Validate in a staging environment with plenty of time before July 31, 2026. It's much safer to confirm cold start measurements and cost simulations yourself before switching to production.
If you're unsure which platform suits your workload, leave your situation in the comments. We can think it through together.
Next post: An in-depth guide to model weight caching strategies for reducing cold starts to under 1 second with vLLM + Modal Volume
References
- "We Were Wrong About GPUs" — Fly.io official blog
- Fly.io GPU migration community announcement (shutdown date: 2026-07-31)
- Modal official vLLM deployment guide
- Modal — vLLM deployment how-to blog post
- RunPod Serverless LLM 2025 update
- RunPod vLLM getting started guide
- Google Cloud Run GPU official launch blog
- NVIDIA: Google Cloud Run L4 GPU support announcement
- Serverless GPU platform cold start benchmarks 2025 — Beam Cloud
- Google Cloud Run GPU pricing official documentation
- Modal NVIDIA L4 pricing analysis