Building a Local LLM Infrastructure with Ollama + Hermes — $0 API Costs, Zero Data Leakage
There's probably been at least one moment where you looked at your monthly OpenAI bill and thought, "Is this really worth it?" I had the same experience — I had GPT-4 API hooked up to a side project and watched tens of thousands of won disappear in a single month. What bothered me even more was that my test data had some internal company text mixed in. It was hard to even verify whether sensitive information was being sent off to an external server.
After wrestling with that problem, I switched to a combination of Ollama and NousResearch's Hermes model. Connecting these two lets you build an AI agent pipeline that runs entirely on your own machine, with no external cloud involved. Let's walk through how that's possible, whether you can switch without touching your existing OpenAI code, and the pitfalls I ran into along the way.
Core Concepts
Ollama: The Docker of AI Models
Think back to the first time you used Docker. Remember that refreshing feeling of being able to spin up any service with a single docker run, without worrying about environment setup? Ollama is exactly that for LLMs.
Running an LLM locally used to require quite a bit of prep work — quantization settings, VRAM allocation, CUDA/Metal acceleration configuration. Ollama abstracts all of that away. With a single command, you download the model, it automatically detects your GPU and applies acceleration, and it spins up an OpenAI-compatible REST API on port 11434.
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Run a model (auto-downloads on first run)
ollama run llama3.1:8b
# Call it the same way you'd call the OpenAI SDK
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1:8b",
"messages": [{"role": "user", "content": "Hello!"}]
}'
OpenAI-Compatible API: Ollama exposes a /v1/chat/completions endpoint that follows OpenAI's request and response format. In existing OpenAI SDK code, you typically only need to change base_url to http://localhost:11434/v1 and the model name; the rest of the code stays untouched and runs against a local model.
Quantization: A technique that compresses model weights from 16- or 32-bit floating point down to 4-bit or 8-bit integers. This dramatically reduces model size and memory requirements at the cost of a slight loss in accuracy. In Ollama, you choose a quantization level through the model tag, for example llama3.1:8b-instruct-q4_0 (available tags vary by model, so check the model's page in the Ollama library).
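To make the size reduction concrete, here is a rough back-of-the-envelope estimate of how much memory the weights alone take at different bit widths. Treat it as a sketch rather than Ollama's own accounting: the helper name estimate_weight_gb is made up for illustration, and it ignores the KV cache, activations, and the per-block scaling metadata that real quantized formats carry, so actual downloads and memory use come out somewhat larger.

```python
def estimate_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9  # bits -> bytes -> GB


if __name__ == "__main__":
    # Compare an 8B-parameter model at common precisions.
    for bits in (32, 16, 8, 4):
        print(f"8B parameters at {bits}-bit: ~{estimate_weight_gb(8, bits):.0f} GB")
```

At 4 bits an 8B model needs roughly 4 GB for weights, which is in the same ballpark as the ~5GB download size once format overhead is included.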
The table below summarizes which models work best for different hardware setups.
| Model | Size | Recommended RAM | Use Case |
|---|---|---|---|
| llama3.2:3b | ~2GB | 8GB | Entry-level / low-spec |
| llama3.1:8b | ~5GB | 16GB | General-purpose recommendation |
| qwen2.5-coder:7b | ~5GB | 16GB | Coding-focused |
| nous-hermes | ~5GB | 16GB | Optimized for function calling & agent tasks |
| mistral:7b | ~4GB | 16GB | Balance of fast responses & high quality |
Hermes: NousResearch's Fine-Tuned Model Optimized for Agent Tasks
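Hermes is a series of fine-tuned LLMs developed by NousResearch. The series has evolved through Hermes 2, Hermes 3, and beyond, and is available directly from the Ollama library. This post uses the nous-hermes tag; newer generations are published under separate names such as nous-hermes2 and hermes3, so check which generation a tag resolves to before you pull it.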
Where Hermes stands out compared to general base models is in function calling and structured output. When you need to invoke tools or generate responses conforming to a JSON schema in LangChain or a custom agent pipeline, it behaves far more reliably than general-purpose models. I started out building an agent with llama3.1:8b, kept running into failures parsing tool calls, switched to nous-hermes, and the problem went away immediately.
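The "failures parsing tool calls" problem is easier to picture with code. Below is a minimal sketch of the defensive parsing I mean: it scans a model reply for the first JSON object that looks like a tool call and rejects anything else. The expected shape (a name string plus an arguments object) and the get_weather tool are assumptions for illustration, not Hermes's official output format, so adapt the check to whatever schema your pipeline prompts for.

```python
import json
from typing import Any, Optional


def extract_tool_call(text: str) -> Optional[dict[str, Any]]:
    """Return the first JSON object in `text` that looks like a tool call, else None."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char != "{":
            continue
        try:
            # raw_decode parses one JSON value starting at `start` and ignores trailing text.
            candidate, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue  # not valid JSON from this brace; try the next one
        if (
            isinstance(candidate, dict)
            and isinstance(candidate.get("name"), str)
            and isinstance(candidate.get("arguments"), dict)
        ):
            return candidate
    return None


reply = 'Calling the tool now: {"name": "get_weather", "arguments": {"city": "Seoul"}}'
print(extract_tool_call(reply))
```

A model that is weak at structured output tends to return None here (prose instead of JSON, or JSON missing a field), which is a useful signal for retrying or falling back to another model.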
NousResearch is also developing a separate agent framework called Hermes Agent, which advertises self-improving loops and cross-session memory persistence. In this post, I'll focus on the verifiable nous-hermes model. If you're interested in the Hermes Agent framework, I'd recommend checking the latest release status directly on NousResearch's official GitHub.
Ollama + Hermes: The Beauty of Separation of Concerns
Connecting the two tools gives you a clean separation of responsibilities.
┌─────────────────────────────────┐
│ Agent Orchestration │ ← LangChain / custom pipeline
│ (memory · tools · workflow) │ function calling, RAG, multi-step tasks
└──────────────┬──────────────────┘
│ OpenAI-compatible API
│ http://localhost:11434/v1
┌──────────────▼──────────────────┐
│ Ollama │ ← model serving layer
│ (quantization · GPU · API) │ nous-hermes, llama3.1, etc.
└─────────────────────────────────┘
Agent Orchestration: The layer responsible for deciding which tools an agent uses, in what order, and for managing intermediate state and memory. LangChain is a typical orchestration layer; Ollama sits below it, handling the actual model inference.
You can swap the orchestration layer from LangChain to another framework without touching Ollama, and conversely, you can replace Ollama with a different serving layer like vLLM without affecting the code above it.
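One way to keep that boundary honest is to make the serving layer a small piece of configuration that the orchestration code receives, rather than something it hardcodes. The sketch below assumes both servers expose an OpenAI-style /v1/chat/completions endpoint; the ServingConfig class, the vLLM URL, and the model placeholder are illustrative assumptions to adjust for your own deployment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServingConfig:
    """Everything the orchestration layer needs to know about the serving layer."""
    base_url: str
    model: str
    api_key: str = "not-needed-locally"  # placeholder; local servers usually ignore it


# The Ollama URL comes from this post. The vLLM URL and model name are assumptions.
OLLAMA = ServingConfig(base_url="http://localhost:11434/v1", model="nous-hermes")
VLLM = ServingConfig(base_url="http://localhost:8000/v1", model="your-model-name")


def chat_request(config: ServingConfig, user_message: str) -> tuple[str, dict]:
    """Build the URL and JSON body for an OpenAI-style chat completion call."""
    url = f"{config.base_url.rstrip('/')}/chat/completions"
    body = {
        "model": config.model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return url, body


url, body = chat_request(OLLAMA, "Hello!")
print(url)
```

Swapping OLLAMA for VLLM changes only the URL and model name; the message-building code above it does not need to know which server answers.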
Practical Examples
Example 1: Installing Ollama and Running the nous-hermes Model
This is the fastest way to get started. All you need is Ollama and the nous-hermes model.
# 1. Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# 2. Download the nous-hermes model
ollama pull nous-hermes # optimized for function calling & agent tasks
ollama pull llama3.1:8b # general-purpose backup model
# 3. Verify installation
ollama list # list downloaded models
# Verify API is working
curl http://localhost:11434/v1/models
# Test function calling with nous-hermes
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nous-hermes",
"messages": [
{"role": "user", "content": "If you need to look up the current weather, respond with a tool call in JSON format."}
]
}'
| Step | Command | How to Verify |
|---|---|---|
| Check Ollama is running | ollama list | Prints list of downloaded models |
| Check API is working | curl http://localhost:11434/v1/models | Returns model JSON response |
| Chat with model directly | ollama run nous-hermes | Enters terminal chat interface |
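If you'd rather run the "is the API up?" check from Python (for example in a startup script), something like the following works as a starting point. It uses only the standard library and assumes the /v1/models response follows the OpenAI-style list shape, an object with a data array whose entries carry an id. The function names are mine, not part of any SDK.

```python
import json
import urllib.request


def parse_model_ids(payload: dict) -> list[str]:
    """Pull model names out of an OpenAI-style model list response."""
    return [entry["id"] for entry in payload.get("data", []) if "id" in entry]


def list_local_models(base_url: str = "http://localhost:11434/v1", timeout: float = 5.0) -> list[str]:
    """Return model ids from the server, or an empty list if it is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as response:
            return parse_model_ids(json.load(response))
    except (OSError, ValueError):
        # OSError covers connection errors and timeouts; ValueError covers invalid JSON.
        return []
```

Calling list_local_models() and checking that your model name appears in the result gives you the same information as the curl command in the table, in a form a script can act on.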
Example 2: Migrating Existing OpenAI Code to Local
If you already have code written with the OpenAI SDK, this is the first pattern to try, and it's the one used most often in practice. You don't need to rewrite your codebase: change the client constructor (base_url and api_key) and point model at a local model.
from openai import OpenAI
# Before
client = OpenAI(api_key="sk-...")
# After — that's all there is to it
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # no real key needed locally, but it can't be an empty string
)
response = client.chat.completions.create(
    model="nous-hermes",
    messages=[
        {"role": "system", "content": "You are a friendly code reviewer."},
        {"role": "user", "content": "Tell me what could be improved in this function."},
    ],
)
print(response.choices[0].message.content)
Note on api_key: Ollama's local environment doesn't require authentication, but the OpenAI SDK will raise an exception if api_key is an empty string. Just pass any arbitrary string like "ollama" and it'll work. I spent a while confused about why I was getting an AuthenticationError before I figured this out.
Example 3: Setting Up a Team Environment with Docker
This pattern fits when you need an internal AI chat environment that your whole team can share, not just yourself.
# docker-compose.yml
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    # Only enable the block below for GPU environments (remove this entire section for CPU-only)
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
volumes:
  ollama_data:
# Start
docker compose up -d
# Initial model download (once only)
docker compose exec ollama ollama pull llama3.1:8b
docker compose exec ollama ollama pull nous-hermes
# Open WebUI is accessible at http://localhost:3000
CPU-only environment note: If you're on a MacBook or a server without a GPU, remove the entire deploy block (the deploy.resources.reservations.devices section). It will still work without a GPU, but response times will be significantly slower. For a team server without a GPU, it's recommended to start with a lighter model like llama3.2:3b.
With this setup, team members can access http://server-ip:3000 and use it just like ChatGPT — an internal AI chat environment. When I first deployed this setup for my team, the most common question was "What's the difference from real ChatGPT?" The biggest difference is that data never leaves the team's server.
Example 4: RAG Pipeline Integrated with LangChain
This pattern fits when you need AI-powered search over internal documents or a domain-specific Q&A system. Connecting Ollama with LangChain to build document-based Q&A is another frequently used setup.
from langchain_ollama import OllamaLLM, OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import DirectoryLoader
# Initialize local models
llm = OllamaLLM(model="nous-hermes", base_url="http://localhost:11434")
embeddings = OllamaEmbeddings(model="nomic-embed-text")
# Load documents and create vector store
loader = DirectoryLoader("./docs", glob="**/*.md")
docs = loader.load()
vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=embeddings,
    persist_directory="./chroma_db",
)
# Build RAG chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
)
# Query against internal documents
result = qa_chain.invoke({"query": "What does the deployment process look like?"})
print(result["result"])
Both embedding and inference happen entirely locally (pull the embedding model first with ollama pull nomic-embed-text). Connecting your company's internal wiki or technical documentation this way lets you run an AI-powered search system without sending sensitive content to an external service.
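If the retriever step feels like a black box, this toy version shows the idea behind search_kwargs={"k": 3}: embed the query, score every document vector against it, and keep the k best. It is a conceptual sketch only. The 2-D vectors stand in for real embeddings, cosine similarity is chosen for illustration, and Chroma's actual indexing and distance metric differ.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 3) -> list[int]:
    """Indices of the k document vectors most similar to the query, best first."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-D "embeddings" standing in for what an embedding model would produce.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
print(top_k([1.0, 0.0], docs, k=3))  # → [0, 1, 3]
```

The three selected documents are what the "stuff" chain type pastes into the prompt, which is why k and document size together decide how much of the context window a single question consumes.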
Pros and Cons
Advantages
| Item | Details |
|---|---|
| Complete privacy | Prompts and responses stay on your machine; nothing is sent to a third-party API |
| Cost structure | No per-token API fees after the initial hardware investment (electricity aside). Can save hundreds of thousands of won per year vs. cloud |
| Predictable latency | Consistent response speed with no network dependency (approximately 50–80 tokens/sec for 7B models on Apple Silicon M-series; varies by chip and quantization) |
| Offline operation | Fully functional without an internet connection. Ideal for IoT and remote field environments |
| OpenAI-compatible API | Migrate to local with minimal code changes |
| Function calling quality | nous-hermes is more stable than general-purpose models of the same size for tool invocation and JSON output |
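The cost row is easier to judge with a quick break-even calculation: how many months of avoided API bills it takes to pay for the hardware. The helper below is a sketch of that arithmetic. The function name and every figure you pass to it are your own inputs, not measurements from this post.

```python
import math


def break_even_months(hardware_cost: float, monthly_api_bill: float, monthly_power_cost: float = 0.0) -> float:
    """Months until hardware spend is recovered; math.inf if it never is."""
    monthly_saving = monthly_api_bill - monthly_power_cost
    if monthly_saving <= 0:
        return math.inf  # running locally costs as much as, or more than, the API
    return hardware_cost / monthly_saving


# Placeholder numbers in any single currency: 1200 of hardware, 110/month API bill, 10/month power.
print(break_even_months(1200, 110, 10))  # → 12.0
```

If the result is longer than the useful life of the hardware, the privacy and offline benefits have to carry the decision on their own.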
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| Hardware requirements | Recommended 8–16GB+ GPU VRAM. Agent use cases work best with long-context models (64K+ tokens) | Apple Silicon Macs (unified memory) offer a relatively affordable starting point |
| Model performance gap | Complex reasoning is weaker compared to latest cloud models like GPT-4.5 or Claude Opus | Hybrid strategy: local for simple tasks, escalate to cloud only for complex reasoning |
| Upfront hardware cost | Purchasing a GPU server or high-spec workstation | Validate at small scale on existing hardware (MacBook M-series, etc.) before scaling |
| Model management overhead | Must handle updates and version management yourself | Automate ollama pull scripts with cron |
| Context window limitations | Smaller models struggle to maintain sufficient context for multi-step agent tasks | Use 8B+ models, or supplement context length limitations with RAG |
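The hybrid strategy from the table can be as simple as a routing function that decides, per request, which endpoint to call. Here is a minimal sketch under stated assumptions: the Route class, the flags, and the 4000-character threshold are all hypothetical, and the cloud model name is a placeholder. The one rule taken from this post is that sensitive text stays local.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    base_url: str
    model: str


LOCAL = Route(base_url="http://localhost:11434/v1", model="nous-hermes")
# Hypothetical cloud target; substitute whichever provider and model you actually use.
CLOUD = Route(base_url="https://api.openai.com/v1", model="your-cloud-model")


def choose_route(
    prompt: str,
    needs_deep_reasoning: bool = False,
    contains_sensitive_data: bool = False,
    max_local_chars: int = 4000,
) -> Route:
    """Prefer local; escalate only when the task is hard or long and safe to send out."""
    if contains_sensitive_data:
        return LOCAL  # privacy wins: sensitive text never leaves the machine
    if needs_deep_reasoning or len(prompt) > max_local_chars:
        return CLOUD
    return LOCAL


print(choose_route("Summarize this commit message.").model)
```

In practice the two flags would come from your own classification step (or simply from which feature of your app is making the call); the point is that the decision lives in one place.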
Most Common Mistakes in Practice
These are also the most frequent questions I got when teams first adopted this stack.
- Building an agent with too small a model: connecting a 3B model to an agent pipeline will cause frequent failures in tool call parsing and multi-step reasoning. For agent use cases, a minimum of 8B is recommended, preferably a nous-hermes-series model.
- Setting api_key to an empty string: the OpenAI SDK raises an exception when api_key="". In a local environment, just pass an arbitrary string like "ollama".
- Mistaking the first model load time for response latency: the first request before the model is loaded into memory can take tens of seconds. Solve this by pre-loading the model with ollama run nous-hermes, or by sending one warm-up request at service startup.
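For the warm-up request, one option is Ollama's native /api/generate endpoint: a request that names a model but carries no prompt asks the server to load that model into memory, and keep_alive controls how long it stays loaded. The sketch below builds and sends such a request with the standard library. The function names, the 30m keep-alive, and the timeout are my own choices, so check them against the Ollama API docs for your version.

```python
import json
import urllib.request


def build_warmup_request(
    model: str, host: str = "http://localhost:11434", keep_alive: str = "30m"
) -> urllib.request.Request:
    """Build a generate request with no prompt, which asks Ollama to load the model."""
    body = json.dumps({"model": model, "keep_alive": keep_alive}).encode("utf-8")
    return urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def warm_up(model: str, timeout: float = 120.0) -> bool:
    """Send the warm-up request; return True on success, False if the call fails."""
    try:
        with urllib.request.urlopen(build_warmup_request(model), timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers connection errors, timeouts, and HTTP errors such as an unknown model.
        return False
```

Calling warm_up("nous-hermes") once at service startup moves the tens-of-seconds load out of the first user request.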
Closing Thoughts
When I first set up a local AI pipeline, I kept thinking, "There's so much configuration — is this actually going to be usable?" But after loading nous-hermes into Ollama and connecting my existing OpenAI code by changing just the base_url, the moment I got that first fully local response, I had the feeling of "this actually works." After that, I could experiment freely without worrying about cloud bills.
The Ollama + nous-hermes combination is a practical local AI infrastructure choice that simultaneously solves privacy protection and cost reduction, while letting you reuse your existing OpenAI code almost entirely as-is.
Three steps you can start with right now:
- Install Ollama and run llama3.1:8b: you can start with a single command, curl -fsSL https://ollama.com/install.sh | sh && ollama run llama3.1:8b. If you have less than 16GB of RAM, try ollama run llama3.2:3b first.
- Just change the base_url in your existing OpenAI code: switch the client to OpenAI(base_url="http://localhost:11434/v1", api_key="ollama"), set model to a local model name, and the same code will call the local model. Experiencing the response quality and speed firsthand is worthwhile.
- Connect an agent pipeline with nous-hermes: after ollama pull nous-hermes, hook it up to a LangChain agent and give it tool-calling tasks. You'll immediately feel the difference in function calling reliability compared to general-purpose models.