Migrating a Replit Agent App to a Fly.io GPU Inference Server
— Practical vLLM·Ollama Deployment and What You Need to Know Before July 2026
In July 2026, Fly.io GPUs will be gone. Here's why learning this pattern now means you can reuse it no matter which platform you migrate to.
I spent a long time scratching my head after quickly building a chatbot app with Replit Agent, wondering "how do I connect this to a GPU server?" Even though Replit can generate your app code in an instant, the infrastructure side of wiring up AI inference still needs to be done by hand. This post walks through the entire workflow connecting those two worlds, complete with real-world code.
What this post covers:
- The full two-stage deployment pattern: Replit → GitHub → Fly.io GPU
- How to containerize vLLM and Ollama as LLM inference servers
- The reality of Fly.io GPU sunset (2026-07-31) and alternative platforms to prepare for what comes after
Core Concepts
Replit Agent: An AI Coding Agent That Builds Full-Stack Apps from Natural Language
Replit Agent is a tool that generates complete code — from a FastAPI backend to a React frontend — from natural language prompts alone, and runs it directly in the Replit environment. The generated code can be pushed straight to a GitHub repository or packaged as a Docker image, making it relatively straightforward to connect to external infrastructure.
Key idea: If you isolate the LLM API endpoint URL as an environment variable in the backend code Replit Agent generates, you can switch the inference route simply by swapping in a Fly.io GPU server address. You won't need to touch a single line of client code.
Fly.io GPU Machine: Container-Friendly GPU Servers (Until 2026-07-31)
Fly.io GPU Machines are containerized VMs offering NVIDIA A10, L40S, A100-40GB, and A100-80GB. The biggest selling point was that specifying vm.size in fly.toml brought up a machine with CUDA drivers pre-installed.
There is one thing you absolutely need to know, however. In February 2025, Fly.io acknowledged in an official blog post that its GPU strategy had failed and announced that GPU Machines would be fully deprecated as of July 31, 2026. The reason given was that the GPU workload market turned out to be a far narrower niche than expected, and that most developers prefer APIs like OpenAI and Anthropic over running their own GPUs. Honestly, it's a decision that makes sense.
That doesn't mean this pattern is useless right now. It's still valid through July 2026, and the skills you'll build here — writing Dockerfiles, separating environment variables, configuring CI/CD — transfer directly when migrating to Modal or RunPod.
It's also worth knowing that GPU-available regions are limited.
- A100-80GB: only the `iad`, `sjc`, `syd`, and `ams` regions
- A10·L40S: only the `ord` region
AI Inference Engines: Should You Choose vLLM or Ollama?
Both tools expose a /v1/chat/completions endpoint, so client code using the OpenAI SDK connects as-is with just a URL change. The difference is in their purpose.
vLLM is a high-throughput inference engine that uses PagedAttention (a technique that dynamically allocates GPU memory in pages to greatly improve concurrent request handling efficiency), making it suited for production environments. Ollama is llama.cpp-based, easy to configure, and great for quickly spinning up a variety of models, making it a better fit for development, testing, or small-to-medium services. If you're trying this stack for the first time, it's convenient to start with Ollama and move up to vLLM if you need to.
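To make that interchangeability concrete, here is a minimal sketch. The localhost ports are the usual defaults (8000 for vLLM, 11434 for Ollama's OpenAI-compatible API) and the model names are placeholders — adjust both to your own setup. The point is that swapping engines changes only the base URL and model name, never the call site:

```python
from dataclasses import dataclass

@dataclass
class InferenceTarget:
    base_url: str  # OpenAI-compatible endpoint root
    model: str     # default model served by this engine

# Usual default ports: vLLM on 8000, Ollama on 11434 (assumption — adjust to your deployment)
TARGETS = {
    "vllm": InferenceTarget("http://localhost:8000/v1", "meta-llama/Llama-3-8B-Instruct"),
    "ollama": InferenceTarget("http://localhost:11434/v1", "llama3"),
}

def client_kwargs(engine: str) -> dict:
    """Kwargs you would pass to openai.AsyncOpenAI() — identical shape for both engines."""
    target = TARGETS[engine]
    return {"base_url": target.base_url, "api_key": "none"}
```

The rest of the client code (the `chat.completions.create` call) stays byte-for-byte identical across engines.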
Practical Application
Example 1: Exporting Code from Replit
First, generate your app with Replit Agent, then link your repository and push using the Connect to GitHub button in the dashboard. The single most important thing to do at this point is to isolate the inference endpoint URL as an environment variable. That way you won't need to touch the code when you swap in the Fly.io address later.
# Inference client — isolating the URL as an environment variable saves headaches later
import os
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url=os.environ["INFERENCE_BASE_URL"],
    api_key=os.environ.get("INFERENCE_API_KEY", "none"),
)

async def chat(message: str) -> str:
    response = await client.chat.completions.create(
        model=os.environ.get("MODEL_NAME", "llama3"),
        messages=[{"role": "user", "content": message}],
    )
    return response.choices[0].message.content

Example 2: Fly.io GPU Server Deployment Configuration
Installing flyctl and Initializing the App
curl -L https://fly.io/install.sh | sh
fly auth login
# --no-deploy: generates config files without deploying immediately
fly launch --no-deploy --region ord

fly.toml Configuration
app = "my-ai-inference-app"
primary_region = "ord"

[vm]
size = "a10"

[[mounts]]
source = "model_storage"
destination = "/models"

[[services]]
internal_port = 8000
protocol = "tcp"

[[services.ports]]
port = 443
handlers = ["tls", "http"]

[[services.ports]]
port = 80
handlers = ["http"]

| Setting | Example Value | Description |
|---|---|---|
| `vm.size` | `a10`, `l40s`, `a100-40gb`, `a100-80gb` | GPU machine type |
| `primary_region` | `ord`, `iad`, `sjc` | Only GPU-available regions can be specified |
| `mounts.destination` | `/models` | Persistent storage path for large model files |
Without a [[services]] block, the app cannot receive HTTP requests from the outside. When fly.toml is first generated, this block is sometimes missing — it's worth double-checking.
vLLM Dockerfile
FROM vllm/vllm-openai:latest

ENV MODEL_NAME="meta-llama/Llama-3-8B-Instruct"
EXPOSE 8000

CMD ["python", "-m", "vllm.entrypoints.openai.api_server", \
     "--model", "meta-llama/Llama-3-8B-Instruct", \
     "--host", "0.0.0.0", \
     "--port", "8000"]

Important: Writing your HuggingFace token directly in the Dockerfile leaves the token exposed in the image layers. Always inject secrets using the method below.
fly secrets set HF_TOKEN=hf_your_token_here

Inside the container, access it via os.environ["HF_TOKEN"].
Ollama Dockerfile
There's a pitfall here. Running ollama serve alone will start the container, but because no model has been pulled, any request for llama3 will fail with a model-not-found error. You need an entrypoint that pulls the model in advance.
FROM ollama/ollama:latest
EXPOSE 11434
ENTRYPOINT ["/bin/sh", "-c", "ollama serve & sleep 5 && ollama pull llama3 && wait"]

Running the Deployment
fly deploy

Once deployment is complete, just swap the environment variables in your Replit app.
INFERENCE_BASE_URL=https://my-ai-inference-app.fly.dev/v1
MODEL_NAME=meta-llama/Llama-3-8B-Instruct

Example 3: Reducing Idle GPU Costs with Scale-to-Zero
Since costs run around $2.5–3.5 per hour for an A100, scaling the VM down during periods with no requests is a commonly used pattern.
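Concretely, here is a rough back-of-the-envelope calculation (assuming a flat $3/hr, which sits in the middle of that range — check the official pricing page for actual figures):

```python
HOURLY_RATE = 3.0  # assumed mid-range A100 rate in USD/hr — verify against Fly.io pricing

def monthly_cost(active_hours_per_day: float, days: int = 30) -> float:
    """Estimated monthly GPU bill if the machine runs only during active hours."""
    return active_hours_per_day * days * HOURLY_RATE

print(monthly_cost(24))  # always-on: 2160.0 USD/month
print(monthly_cost(8))   # scaled to zero outside an 8-hour window: 720.0 USD/month
```

Scaling to zero for two-thirds of the day cuts the bill by two-thirds — which is why this pattern is so common despite the cold-start cost discussed below.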
# Reduce VM instance count to 0 to stop billing (the app is not deleted)
fly scale count 0
# Start it up again when needed
fly scale count 1

If you need automatic restart, you can use the Fly.io Machines API to wake the VM when an HTTP request comes in. However, you have to accept that cold starts can mean waiting anywhere from tens of seconds to several minutes. In practice, this delay feels larger than you'd expect. For latency-sensitive services, it's better to keep at least one instance running at all times.
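The wake-up call itself is a single POST to the Machines API. Below is a hedged sketch using only the standard library; the app name and machine ID are placeholders, and the endpoint shape follows the public Machines API, so verify it against the current reference before relying on it:

```python
import urllib.request

MACHINES_API = "https://api.machines.dev/v1"

def start_machine_request(app_name: str, machine_id: str, token: str) -> urllib.request.Request:
    """Build the POST that starts a stopped Fly Machine (constructs the request only)."""
    url = f"{MACHINES_API}/apps/{app_name}/machines/{machine_id}/start"
    return urllib.request.Request(
        url,
        method="POST",
        headers={"Authorization": f"Bearer {token}"},
    )

# Sending it (needs a valid FLY_API_TOKEN and a real machine ID — both placeholders here):
# import os
# req = start_machine_request("my-ai-inference-app", "<machine-id>", os.environ["FLY_API_TOKEN"])
# urllib.request.urlopen(req)
```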
Example 4: CI/CD Automation with GitHub Actions
# .github/workflows/fly-deploy.yml
name: Deploy to Fly.io
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: superfly/flyctl-actions/setup-flyctl@master
      - name: Deploy
        run: fly deploy --remote-only
        env:
          FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}

Pros and Cons
Pros
| Item | Details |
|---|---|
| Rapid prototyping | App code complete in minutes with Replit Agent, immediately scalable to a GPU server |
| Container-friendly | Reproduce a CUDA-driver-included environment with a single Dockerfile |
| OpenAI-compatible API | Both vLLM and Ollama support /v1/chat/completions — switch without modifying client code |
| Scale-to-Zero | Completely eliminate idle GPU costs |
| Persistent model storage | Reuse large models without re-downloading via Fly Volumes (persistent block storage that survives VM restarts) |
Cons and Caveats
| Item | Details | Mitigation |
|---|---|---|
| Imminent service shutdown | Fly.io GPU deprecated as of 2026-07-31 | Plan migration to Modal, RunPod, or Replicate in advance |
| Limited GPU regions | A100-80GB in only 4 regions; A10·L40S in `ord` only | Run `fly platform vm-sizes` before deploying to confirm available regions |
| Cold start latency | Tens of seconds to several minutes to restart after scale-to-zero | Keep at least 1 instance running for latency-sensitive services |
| Model file size | LLMs are tens of GBs → volume configuration required on initial deploy | Connect a Fly Volume with `[[mounts]]` |
| Cost | ~$2.5–3.5/hr for A100-40GB | Check the official Pricing page for the latest rates |
| Replit code quality | Agent-generated code requires production security and optimization review | Manually review auth, input validation, and error handling |
| GPU driver compatibility | Compatibility varies depending on the base image | Use Ubuntu 22.04-based images (`nvidia/cuda:12.x`) |
The Most Common Mistakes in Practice
- Running `fly launch` without checking GPU region availability — The app gets created in a region that doesn't support GPUs, causing deployment failure. Explicitly specifying the `--region` flag with `fly launch` avoids this problem.
- Deploying a large model without a Fly Volume mount — Every time the VM restarts, it re-downloads tens of GBs of model data. Leaving out the `[[mounts]]` configuration is a waste of both money and time.
- Pushing Replit Agent-generated code to production without review — Cases of missing authentication, SQL injection vulnerabilities, and hardcoded secrets are not uncommon. It's safer to run a security review on Agent-generated code before shipping.
Closing Thoughts
For teams building a GPU inference server for the first time, I recommend learning this pattern on Fly.io and creating a Modal or RunPod account before July 2026. When you migrate, your Dockerfile and environment variable configuration will carry over almost unchanged, so what you learn now won't go to waste.
Three steps you can start right now:
- Use Replit Agent to generate a FastAPI + React chatbot app, isolate the inference endpoint URL as the `INFERENCE_BASE_URL` environment variable, and connect it to GitHub.
- Install `flyctl`, initialize with `fly launch --no-deploy --region ord`, add the `vm.size`, `[[mounts]]`, and `[[services]]` blocks to `fly.toml`, then run `fly deploy`.
- Read through the Fly.io GPU deprecation announcement thread and take a look at the free tiers for Modal or RunPod.
Next post: After Fly.io GPU Sunset — How to Migrate Your AI Inference Server to Modal, RunPod, and Google Cloud Run, with a Platform-by-Platform Cost and Cold Start Comparison