Migrating a Replit Agent App to a Fly.io GPU Inference Server
— Practical vLLM·Ollama Deployment and What You Need to Know Before July 2026
In July 2026, Fly.io GPUs will be gone. Here's why learning this pattern now means you can reuse it no matter which platform you migrate to.
I spent a long time scratching my head after quickly building a chatbot app with Replit Agent, wondering "how do I connect this to a GPU server?" Even though Replit can generate your app code in an instant, the infrastructure side of wiring up AI inference still needs to be done by hand. This post walks through the entire workflow connecting those two worlds, complete with real-world code.
What this post covers:
- The full two-stage deployment pattern: Replit → GitHub → Fly.io GPU
- How to containerize vLLM and Ollama as LLM inference servers
- The reality of Fly.io GPU sunset (2026-07-31) and alternative platforms to prepare for what comes after
Core Concepts
Replit Agent: An AI Coding Agent That Builds Full-Stack Apps from Natural Language
Replit Agent is a tool that generates complete code — from a FastAPI backend to a React frontend — from natural language prompts alone, and runs it directly in the Replit environment. The generated code can be pushed straight to a GitHub repository or packaged as a Docker image, making it relatively straightforward to connect to external infrastructure.
Key idea: If you isolate the LLM API endpoint URL as an environment variable in the backend code Replit Agent generates, you can switch the inference route simply by swapping in a Fly.io GPU server address. You won't need to touch a single line of client code.
Fly.io GPU Machine: Container-Friendly GPU Servers (Until 2026-07-31)
Fly.io GPU Machines are containerized VMs offering NVIDIA A10, L40S, A100-40GB, and A100-80GB. The biggest selling point was that specifying vm.size in fly.toml brought up a machine with CUDA drivers pre-installed.
There is one thing you absolutely need to know, however. In February 2025, Fly.io acknowledged in an official blog post that its GPU strategy had failed and announced that GPU Machines would be fully deprecated as of July 31, 2026. The reason given was that the GPU workload market turned out to be a far narrower niche than expected, and that most developers prefer APIs like OpenAI and Anthropic over running their own GPUs. Honestly, it's a decision that makes sense.
That doesn't mean this pattern is useless right now. It's still valid through July 2026, and the skills you'll build here — writing Dockerfiles, separating environment variables, configuring CI/CD — transfer directly when migrating to Modal or RunPod.
It's also worth knowing that GPU-available regions are limited.
- A100-80GB: only the `iad`, `sjc`, `syd`, and `ams` regions
- A10·L40S: only the `ord` region
AI Inference Engines: Should You Choose vLLM or Ollama?
Both tools expose a /v1/chat/completions endpoint, so client code using the OpenAI SDK connects as-is with just a URL change. The difference is in their purpose.
vLLM is a high-throughput inference engine that uses PagedAttention (a technique that dynamically allocates GPU memory in pages to greatly improve concurrent request handling efficiency), making it suited for production environments. Ollama is llama.cpp-based, easy to configure, and great for quickly spinning up a variety of models, making it a better fit for development, testing, or small-to-medium services. If you're trying this stack for the first time, it's convenient to start with Ollama and move up to vLLM if you need to.
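To make that interchangeability concrete, here is a minimal sketch. The localhost ports are the usual defaults (8000 for vLLM, 11434 for Ollama's OpenAI-compatible API) and the model names are placeholders — adjust both to your own setup. The point is that swapping engines changes only the base URL and model name, never the call site:

```python
from dataclasses import dataclass

@dataclass
class InferenceTarget:
    base_url: str  # OpenAI-compatible endpoint root
    model: str     # default model served by this engine

# Usual default ports: vLLM on 8000, Ollama on 11434 (assumption — adjust to your deployment)
TARGETS = {
    "vllm": InferenceTarget("http://localhost:8000/v1", "meta-llama/Llama-3-8B-Instruct"),
    "ollama": InferenceTarget("http://localhost:11434/v1", "llama3"),
}

def client_kwargs(engine: str) -> dict:
    """Kwargs you would pass to openai.AsyncOpenAI() — identical shape for both engines."""
    target = TARGETS[engine]
    return {"base_url": target.base_url, "api_key": "none"}
```

The rest of the client code (the `chat.completions.create` call) stays byte-for-byte identical across engines.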
Practical Application
Example 1: Exporting Code from Replit
First, generate your app with Replit Agent, then link your repository and push using the Connect to GitHub button in the dashboard. The single most important thing to do at this point is to isolate the inference endpoint URL as an environment variable. That way you won't need to touch the code when you swap in the Fly.io address later.
# Inference client — isolating the URL as an environment variable saves headaches later
import os
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url=os.environ["INFERENCE_BASE_URL"],
    api_key=os.environ.get("INFERENCE_API_KEY", "none"),
)

async def chat(message: str) -> str:
    response = await client.chat.completions.create(
        model=os.environ.get("MODEL_NAME", "llama3"),
        messages=[{"role": "user", "content": message}],
    )
    return response.choices[0].message.content

Example 2: Fly.io GPU Server Deployment Configuration
Installing flyctl and Initializing the App
curl -L https://fly.io/install.sh | sh
fly auth login
# --no-deploy: generates config files without deploying immediately
fly launch --no-deploy --region ord

fly.toml Configuration
app = "my-ai-inference-app"
primary_region = "ord"

[vm]
size = "a10"

[[mounts]]
source = "model_storage"
destination = "/models"

[[services]]
internal_port = 8000
protocol = "tcp"

[[services.ports]]
port = 443
handlers = ["tls", "http"]

[[services.ports]]
port = 80
handlers = ["http"]

| Setting | Example Value | Description |
|---|---|---|
| `vm.size` | `a10`, `l40s`, `a100-40gb`, `a100-80gb` | GPU machine type |
| `primary_region` | `ord`, `iad`, `sjc` | Only GPU-available regions can be specified |
| `mounts.destination` | `/models` | Persistent storage path for large model files |
Without a [[services]] block, the app cannot receive HTTP requests from the outside. When fly.toml is first generated, this block is sometimes missing — it's worth double-checking.
vLLM Dockerfile
FROM vllm/vllm-openai:latest

ENV MODEL_NAME="meta-llama/Llama-3-8B-Instruct"
EXPOSE 8000

CMD ["python", "-m", "vllm.entrypoints.openai.api_server", \
     "--model", "meta-llama/Llama-3-8B-Instruct", \
     "--host", "0.0.0.0", \
     "--port", "8000"]

Important: Writing your HuggingFace token directly in the Dockerfile leaves the token exposed in the image layers. Always inject secrets using the method below.
fly secrets set HF_TOKEN=hf_your_token_here

Inside the container, access it via os.environ["HF_TOKEN"].
Ollama Dockerfile
There's a pitfall here. Running ollama serve alone will start the container, but because no model has been pulled, any request for llama3 will fail with a model-not-found error. You need an entrypoint that pulls the model in advance.
FROM ollama/ollama:latest
EXPOSE 11434
ENTRYPOINT ["/bin/sh", "-c", "ollama serve & sleep 5 && ollama pull llama3 && wait"]

Running the Deployment
fly deploy

Once deployment is complete, just swap the environment variables in your Replit app.
INFERENCE_BASE_URL=https://my-ai-inference-app.fly.dev/v1
MODEL_NAME=meta-llama/Llama-3-8B-Instruct

Example 3: Reducing Idle GPU Costs with Scale-to-Zero
Since costs run around $2.5–3.5 per hour for an A100, scaling the VM down during periods with no requests is a commonly used pattern.
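Concretely, here is a rough back-of-the-envelope calculation (assuming a flat $3/hr, which sits in the middle of that range — check the official pricing page for actual figures):

```python
HOURLY_RATE = 3.0  # assumed mid-range A100 rate in USD/hr — verify against Fly.io pricing

def monthly_cost(active_hours_per_day: float, days: int = 30) -> float:
    """Estimated monthly GPU bill if the machine runs only during active hours."""
    return active_hours_per_day * days * HOURLY_RATE

print(monthly_cost(24))  # always-on: 2160.0 USD/month
print(monthly_cost(8))   # scaled to zero outside an 8-hour window: 720.0 USD/month
```

Scaling to zero for two-thirds of the day cuts the bill by two-thirds — which is why this pattern is so common despite the cold-start cost discussed below.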
# Reduce VM instance count to 0 to stop billing (the app is not deleted)
fly scale count 0
# Start it up again when needed
fly scale count 1

If you need automatic restart, you can use the Fly.io Machines API to wake the VM when an HTTP request comes in. However, you have to accept that cold starts can mean waiting anywhere from tens of seconds to several minutes. In practice, this delay feels larger than you'd expect. For latency-sensitive services, it's better to keep at least one instance running at all times.
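The wake-up call itself is a single POST to the Machines API. Below is a hedged sketch using only the standard library; the app name and machine ID are placeholders, and the endpoint shape follows the public Machines API, so verify it against the current reference before relying on it:

```python
import urllib.request

MACHINES_API = "https://api.machines.dev/v1"

def start_machine_request(app_name: str, machine_id: str, token: str) -> urllib.request.Request:
    """Build the POST that starts a stopped Fly Machine (constructs the request only)."""
    url = f"{MACHINES_API}/apps/{app_name}/machines/{machine_id}/start"
    return urllib.request.Request(
        url,
        method="POST",
        headers={"Authorization": f"Bearer {token}"},
    )

# Sending it (needs a valid FLY_API_TOKEN and a real machine ID — both placeholders here):
# import os
# req = start_machine_request("my-ai-inference-app", "<machine-id>", os.environ["FLY_API_TOKEN"])
# urllib.request.urlopen(req)
```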
Example 4: CI/CD Automation with GitHub Actions
# .github/workflows/fly-deploy.yml
name: Deploy to Fly.io
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: superfly/flyctl-actions/setup-flyctl@master
      - name: Deploy
        run: fly deploy --remote-only
        env:
          FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}

Pros and Cons
Pros
| Item | Details |
|---|---|
| Rapid prototyping | App code complete in minutes with Replit Agent, immediately scalable to a GPU server |
| Container-friendly | Reproduce a CUDA-driver-included environment with a single Dockerfile |
| OpenAI-compatible API | Both vLLM and Ollama support /v1/chat/completions — switch without modifying client code |
| Scale-to-Zero | Completely eliminate idle GPU costs |
| Persistent model storage | Reuse large models without re-downloading via Fly Volumes (persistent block storage that survives VM restarts) |
Cons and Caveats
| Item | Details | Mitigation |
|---|---|---|
| Imminent service shutdown | Fly.io GPU deprecated as of 2026-07-31 | Plan migration to Modal, RunPod, or Replicate in advance |
| Limited GPU regions | A100-80GB in only 4 regions; A10·L40S in `ord` only | Run `fly platform vm-sizes` before deploying to confirm available regions |
| Cold start latency | Tens of seconds to several minutes to restart after scale-to-zero | Keep at least 1 instance running for latency-sensitive services |
| Model file size | LLMs are tens of GBs → volume configuration required on initial deploy | Connect a Fly Volume with `[[mounts]]` |
| Cost | ~$2.5–3.5/hr for A100-40GB | Check the official Pricing page for the latest rates |
| Replit code quality | Agent-generated code requires production security and optimization review | Manually review auth, input validation, and error handling |
| GPU driver compatibility | Compatibility varies depending on the base image | Use Ubuntu 22.04-based images (`nvidia/cuda:12.x`) |
The Most Common Mistakes in Practice
- Running `fly launch` without checking GPU region availability — The app gets created in a region that doesn't support GPUs, causing deployment failure. Explicitly specifying the `--region` flag with `fly launch` avoids this problem.
- Deploying a large model without a Fly Volume mount — Every time the VM restarts, it re-downloads tens of GBs of model data. Leaving out the `[[mounts]]` configuration is a waste of both money and time.
- Pushing Replit Agent-generated code to production without review — Cases of missing authentication, SQL injection vulnerabilities, and hardcoded secrets are not uncommon. It's safer to run a security review on Agent-generated code before shipping.
Closing Thoughts
For teams building a GPU inference server for the first time, I recommend learning this pattern on Fly.io and creating a Modal or RunPod account before July 2026. When you migrate, your Dockerfile and environment variable configuration will carry over almost unchanged, so what you learn now won't go to waste.
Three steps you can start right now:
- Use Replit Agent to generate a FastAPI + React chatbot app, isolate the inference endpoint URL as the `INFERENCE_BASE_URL` environment variable, and connect it to GitHub.
- Install `flyctl`, initialize with `fly launch --no-deploy --region ord`, add the `vm.size`, `[[mounts]]`, and `[[services]]` blocks to `fly.toml`, then run `fly deploy`.
- Read through the Fly.io GPU deprecation announcement thread and take a look at the free tiers for Modal or RunPod.
Next post: After Fly.io GPU Sunset — How to Migrate Your AI Inference Server to Modal, RunPod, and Google Cloud Run, with a Platform-by-Platform Cost and Cold Start Comparison