Building a Local LLM Infrastructure with Ollama + Hermes — $0 API Costs, Zero Data Leakage

There's probably been at least one moment when you looked at your monthly OpenAI bill and thought, "Is this really worth it?" I had the same experience: I had the GPT-4 API hooked up to a side project and watched tens of thousands of won disappear in a single month. What bothered me even more was that my test data had some internal company text mixed in, and I had no easy way to verify whether sensitive information was being sent to an external server.

After wrestling with that problem, I switched to a combination of Ollama and NousResearch's Hermes model. Connecting these two lets you build an AI agent pipeline that runs entirely on your own machine, with no external cloud involved. Let's walk through how that's possible, whether you can switch without touching your existing OpenAI code, and the pitfalls I ran into along the way.

Core Concepts

Ollama: The Docker of AI Models

Think back to the first time you used Docker. Remember that refreshing feeling of being able to spin up any service with a single docker run, without worrying about environment setup? Ollama is exactly that for LLMs.

Running an LLM locally used to require quite a bit of prep work — quantization settings, VRAM allocation, CUDA/Metal acceleration configuration. Ollama abstracts all of that away. With a single command, you download the model, it automatically detects your GPU and applies acceleration, and it spins up an OpenAI-compatible REST API on port 11434.

bash

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
 
# Run a model (auto-downloads on first run)
ollama run llama3.1:8b
 
# Call the OpenAI-compatible endpoint with curl
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

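OpenAI-Compatible API: Ollama exposes a /v1/chat/completions endpoint that follows the request and response format of OpenAI's Chat Completions API. In existing OpenAI SDK code, you point base_url at http://localhost:11434/v1 and pass a local model name; the rest of the calling code can usually stay as it is. Compatibility covers the commonly used fields rather than every OpenAI feature, so check Ollama's documentation if you depend on a less common parameter.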
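To make the wire format concrete, here is a minimal sketch that builds the same chat request with only Python's standard library. The helper names build_chat_request and send_chat are hypothetical, and the sketch assumes an Ollama server is listening on the default port with llama3.1:8b already pulled.

```python
import json
import urllib.request

OLLAMA_BASE_URL = "http://localhost:11434/v1"  # default local endpoint from the text


def build_chat_request(model: str, user_message: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request (hypothetical helper)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{OLLAMA_BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(request: urllib.request.Request) -> str:
    """Send the request and return the assistant's reply text.

    Needs a running Ollama server; raises URLError otherwise.
    """
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    # OpenAI-style responses put the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, send_chat(build_chat_request("llama3.1:8b", "Hello!")) returns the reply text. Building the request separately from sending it keeps the payload easy to inspect without a live server.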

Quantization: A technique that stores model weights at lower precision, typically converting 16-bit or 32-bit floating point values to 4-bit or 8-bit integers. This cuts model size and memory requirements substantially at the cost of a small loss in accuracy. In Ollama, the quantization level is part of the model tag, for example llama3.1:8b-instruct-q4_0; the tags available for each model are listed on its Ollama library page.
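The size reduction is simple arithmetic: parameter count times bits per weight, divided by eight to get bytes. The sketch below is a back-of-the-envelope estimate for the weights alone. The function name is hypothetical, and real model files come out somewhat larger because quantization formats store extra scaling data and inference needs additional memory for context.

```python
def estimate_weight_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in gigabytes (10^9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Compare an 8B-parameter model at three precisions.
for bits in (16, 8, 4):
    size = estimate_weight_size_gb(8, bits)
    print(f"8B model at {bits}-bit: ~{size:.0f} GB")
```

For an 8B model this gives about 16 GB at 16-bit and about 4 GB at 4-bit, which is why a quantized 8B model fits on a 16GB machine while the unquantized one does not.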

The table below summarizes which models work best for different hardware setups.

Model	Size	Recommended RAM	Use Case
`llama3.2:3b`	~2GB	8GB	Entry-level / low-spec
`llama3.1:8b`	~5GB	16GB	General-purpose recommendation
`qwen2.5-coder:7b`	~5GB	16GB	Coding-focused
`nous-hermes`	~5GB	16GB	Optimized for function calling & agent tasks
`mistral:7b`	~4GB	16GB	Balance of fast responses & high quality

Hermes: NousResearch's Fine-Tuned Model Optimized for Agent Tasks

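Hermes is a series of fine-tuned LLMs developed by NousResearch. The series has evolved through Hermes 2, Hermes 3, and beyond. The Ollama library publishes several Hermes entries (including nous-hermes, nous-hermes2, and hermes3), and this post uses the nous-hermes name throughout. Because each entry maps to a different generation and base model, check the library page to confirm which one you are pulling and whether it supports tool calling.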

Where Hermes stands out compared to general base models is in function calling and structured output. When you need to invoke tools or generate responses conforming to a JSON schema in LangChain or a custom agent pipeline, it behaves far more reliably than general-purpose models. I started out building an agent with llama3.1:8b, kept running into failures parsing tool calls, switched to nous-hermes, and the problem went away immediately.
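For readers who have not seen the tool-calling format, here is a rough sketch of what the agent layer sends and what it has to parse. It uses the OpenAI function-calling shape: a tools array in the request and tool_calls in the response. The get_weather tool is hypothetical, the sample response is hand-written rather than captured model output, and whether a particular model tag accepts the tools field is an assumption you should verify for your setup.

```python
import json

# Hypothetical tool definition in the OpenAI function-calling format.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_tool_payload(model: str, user_message: str) -> dict:
    """Request body for /v1/chat/completions with one tool attached."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "tools": [WEATHER_TOOL],
    }


def extract_tool_calls(response: dict) -> list[tuple[str, dict]]:
    """Return (function name, parsed arguments) pairs from a chat response."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        # In the OpenAI response format, arguments arrive as a JSON string.
        calls.append((function["name"], json.loads(function["arguments"])))
    return calls


# Hand-written sample in the OpenAI response shape, not captured model output.
sample_response = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": "",
            "tool_calls": [{
                "id": "call_0",
                "type": "function",
                "function": {"name": "get_weather", "arguments": '{"city": "Seoul"}'},
            }],
        }
    }]
}

print(extract_tool_calls(sample_response))
```

The json.loads step is where the parsing failures described above tend to surface: if the model emits malformed JSON in arguments, this line raises, and the agent loop has to retry or give up.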

NousResearch is also developing a separate agent framework called Hermes Agent, which advertises self-improving loops and cross-session memory persistence. In this post, I'll focus on the verifiable nous-hermes model. If you're interested in the Hermes Agent framework, I'd recommend checking the latest release status directly on NousResearch's official GitHub.

Ollama + Hermes: The Beauty of Separation of Concerns

Connecting the two tools gives you a clean separation of responsibilities.

┌─────────────────────────────────┐
│       Agent Orchestration       │  ← LangChain / custom pipeline
│  (memory · tools · workflow)    │     function calling, RAG, multi-step tasks
└──────────────┬──────────────────┘
               │ OpenAI-compatible API
               │ http://localhost:11434/v1
┌──────────────▼──────────────────┐
│             Ollama              │  ← model serving layer
│  (quantization · GPU · API)     │     nous-hermes, llama3.1, etc.
└─────────────────────────────────┘

Agent Orchestration: The layer responsible for deciding which tools an agent uses, in what order, and for managing intermediate state and memory. LangChain is a typical orchestration layer; Ollama sits below it, handling the actual model inference.

You can swap the orchestration layer from LangChain to another framework without touching Ollama, and conversely, you can replace Ollama with a different serving layer like vLLM without affecting the code above it.
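One way to keep that separation honest in code is to treat the serving layer as configuration. The sketch below reads the endpoint and model from environment variables. The variable names LLM_BASE_URL and LLM_MODEL, the ServingBackend class, and the second server's URL and model name are all hypothetical placeholders.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ServingBackend:
    """Where the orchestration layer sends OpenAI-style requests."""
    base_url: str
    model: str

    @property
    def chat_url(self) -> str:
        return f"{self.base_url.rstrip('/')}/chat/completions"


def backend_from_env(env=None) -> ServingBackend:
    """Read the backend from environment variables (names are hypothetical)."""
    env = os.environ if env is None else env
    return ServingBackend(
        base_url=env.get("LLM_BASE_URL", "http://localhost:11434/v1"),
        model=env.get("LLM_MODEL", "nous-hermes"),
    )


# Default: the local Ollama server described in this post.
print(backend_from_env({}).chat_url)

# Swapping the serving layer becomes a configuration change.
# The URL and model name are placeholders for another OpenAI-compatible server.
other = backend_from_env({"LLM_BASE_URL": "http://localhost:8000/v1", "LLM_MODEL": "my-model"})
print(other.chat_url)
```

Nothing above the ServingBackend object needs to know which server is answering, as long as both speak the same OpenAI-style API.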

Practical Examples

Example 1: Installing Ollama and Running the nous-hermes Model

This is the fastest way to get started. All you need is Ollama and the nous-hermes model.

bash

# 1. Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
 
# 2. Download the nous-hermes model
ollama pull nous-hermes        # optimized for function calling & agent tasks
ollama pull llama3.1:8b        # general-purpose backup model
 
# 3. Verify installation
ollama list                    # list downloaded models

bash

# Verify API is working
curl http://localhost:11434/v1/models
 
# Ask nous-hermes for a tool call in JSON (prompt-only test, no tools field)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nous-hermes",
    "messages": [
      {"role": "user", "content": "If you need to look up the current weather, respond with a tool call in JSON format."}
    ]
  }'

Step	Command	How to Verify
Check Ollama is running	`ollama list`	Prints list of downloaded models
Check API is working	`curl http://localhost:11434/v1/models`	Returns model JSON response
Chat with model directly	`ollama run nous-hermes`	Enters terminal chat interface
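The same checks can be scripted, which is handy before starting an agent process. This is a minimal sketch using the standard library; the function names are hypothetical, and it assumes /v1/models returns the OpenAI-style list shape with model names under data[].id, including an explicit tag such as nous-hermes:latest.

```python
import json
import urllib.request


def parse_model_ids(models_response: dict) -> list[str]:
    """Extract model names from an OpenAI-style model list response."""
    return [entry["id"] for entry in models_response.get("data", [])]


def fetch_model_ids(base_url: str = "http://localhost:11434/v1") -> list[str]:
    """Query the running server; raises URLError if Ollama is not reachable."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=10) as response:
        return parse_model_ids(json.load(response))


def missing_models(required: list[str], available: list[str]) -> list[str]:
    """Return required models that have not been pulled yet."""
    # Assumption: models are listed with an explicit tag, so a bare name
    # is compared against "<name>:latest".
    normalized = {name if ":" in name else f"{name}:latest" for name in required}
    return sorted(normalized - set(available))


# Hand-written sample response, used here so the sketch runs without a server.
sample = {"object": "list", "data": [{"id": "llama3.1:8b"}, {"id": "nous-hermes:latest"}]}
available = parse_model_ids(sample)
print(missing_models(["nous-hermes", "llama3.1:8b", "nomic-embed-text"], available))
```

Against a live server, replace the sample with fetch_model_ids() and run ollama pull for whatever the check reports as missing.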

Example 2: Migrating Existing OpenAI Code to Local

If you already have code written with the OpenAI SDK, this is the first pattern to try, and the most commonly used one in practice. You don't need to rewrite your codebase: change how the client is constructed and which model name you pass.

python

from openai import OpenAI
 
# Before
client = OpenAI(api_key="sk-...")
 
# After — that's all there is to it
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # no real key needed locally, but can't be an empty string
)
 
response = client.chat.completions.create(
    model="nous-hermes",
    messages=[
        {"role": "system", "content": "You are a friendly code reviewer."},
        {"role": "user", "content": "Tell me what could be improved in this function."}
    ]
)
 
print(response.choices[0].message.content)

Note on api_key: Ollama's local environment doesn't require authentication, but the OpenAI SDK will raise an exception if api_key is an empty string. Just pass any arbitrary string like "ollama" and it'll work. I spent a while confused about why I was getting an AuthenticationError before I figured this out.

Example 3: Setting Up a Team Environment with Docker

The right pattern when you need an internal AI chat environment that your whole team can share, not just yourself.

yaml

# docker-compose.yml
 
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    # Only enable the block below for GPU environments (remove this entire section for CPU-only)
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
 
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
 
volumes:
  ollama_data:

bash

# Start
docker compose up -d
 
# Initial model download (once only)
docker compose exec ollama ollama pull llama3.1:8b
docker compose exec ollama ollama pull nous-hermes
 
# Open WebUI is accessible at http://localhost:3000

CPU-only environment note: If you're on a MacBook or a server without a GPU, just remove the entire deploy.resources.reservations.devices block. It will still work without a GPU, but response times will be significantly slower. For a team server without a GPU, it's recommended to start with a lighter model like llama3.2:3b.

With this setup, team members can access http://server-ip:3000 and use it just like ChatGPT — an internal AI chat environment. When I first deployed this setup for my team, the most common question was "What's the difference from real ChatGPT?" The biggest difference is that data never leaves the team's server.
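After docker compose up -d, the API may not answer for the first few seconds, which matters if a script pulls models or starts an agent right away. Here is a small readiness check as a sketch. The function names and retry values are hypothetical, and the probe is injectable so the loop can be exercised without a server.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str) -> bool:
    """Return True if the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until_ready(
    url: str,
    probe: Callable[[str], bool] = http_probe,
    attempts: int = 30,
    delay_seconds: float = 2.0,
) -> bool:
    """Poll the URL until the probe succeeds or attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return False
```

A deployment script could call wait_until_ready("http://localhost:11434/v1/models") and only proceed to the model pulls once it returns True.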

Example 4: RAG Pipeline Integrated with LangChain

The right pattern when you need AI-powered search over internal documents or a domain-specific Q&A system. Connecting Ollama with LangChain to build a document-based Q&A system is another frequently used setup.

python

from langchain_ollama import OllamaLLM, OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import DirectoryLoader
 
# Initialize local models
# The embedding model must be pulled first: ollama pull nomic-embed-text
llm = OllamaLLM(model="nous-hermes", base_url="http://localhost:11434")
embeddings = OllamaEmbeddings(model="nomic-embed-text")
 
# Load documents and create vector store
loader = DirectoryLoader("./docs", glob="**/*.md")
docs = loader.load()
 
vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=embeddings,
    persist_directory="./chroma_db"
)
 
# Build RAG chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3})
)
 
# Query against internal documents
result = qa_chain.invoke({"query": "What does the deployment process look like?"})
print(result["result"])

Both embedding and inference happen entirely locally. Connecting your company's internal wiki or technical documentation this way lets you run an AI-powered search system without any risk of sensitive content leaking externally.
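If the retriever feels like a black box, the sketch below shows the idea behind k=3: score every document vector against the query vector and keep the three closest. Cosine similarity and the toy three-dimensional vectors are illustrative assumptions; the metric a real vector store uses depends on its configuration, and real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 3) -> list[str]:
    """Return the names of the k documents most similar to the query."""
    scored = sorted(
        doc_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]


# Toy vectors stand in for real embeddings; file names are made up.
docs = {
    "deploy.md": [0.9, 0.1, 0.0],
    "onboarding.md": [0.1, 0.9, 0.0],
    "oncall.md": [0.7, 0.2, 0.1],
    "holidays.md": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]
print(top_k(query, docs, k=3))
```

The retrieved documents are what the "stuff" chain type pastes into the prompt, so a larger k means more context for the model and a longer prompt to process.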

Pros and Cons

Advantages

Item	Details
Complete privacy	Prompts and responses stay on your machine; nothing is sent to a third-party API
Cost structure	$0 API cost after the initial hardware investment (electricity aside). Can save hundreds of thousands of won per year vs. cloud
Predictable latency	No network round trip or provider-side rate limits (roughly 50–80 tokens/sec for 7B models on Apple Silicon M-series, depending on chip, quantization, and context length)
Offline operation	Fully functional without an internet connection. Ideal for IoT and remote field environments
OpenAI-compatible API	Migrate to local with minimal code changes
Function calling quality	nous-hermes is more stable than general-purpose models of the same size for tool invocation and JSON output

Disadvantages and Caveats

Item	Details	Mitigation
Hardware requirements	Recommended 8–16GB+ of GPU VRAM or unified memory. Agent use cases work best with models that support long contexts (64K+ tokens), which raises memory use further	Apple Silicon Macs (unified memory) offer a relatively affordable starting point
Model performance gap	Complex reasoning is weaker compared to latest cloud models like GPT-4.5 or Claude Opus	Hybrid strategy: local for simple tasks, escalate to cloud only for complex reasoning
Upfront hardware cost	Purchasing a GPU server or high-spec workstation	Validate at small scale on existing hardware (MacBook M-series, etc.) before scaling
Model management overhead	Must handle updates and version management yourself	Automate `ollama pull` scripts with cron
Context window limitations	Smaller models struggle to maintain sufficient context for multi-step agent tasks	Use 8B+ models, or supplement context length limitations with RAG
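The hybrid strategy from the table can be as simple as a routing function in front of the client. The following is only a sketch: the task labels, the Route class, and the rule that sensitive data always stays local are my assumptions for illustration, and the cloud model name is deliberately left as a placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    base_url: str
    model: str


LOCAL = Route("http://localhost:11434/v1", "nous-hermes")
# Placeholder cloud route; fill in the model you actually use.
CLOUD = Route("https://api.openai.com/v1", "<cloud-model>")

# Hypothetical labels for tasks considered too hard for the local model.
COMPLEX_TASKS = {"multi_step_planning", "long_document_analysis"}


def choose_route(task_type: str, contains_sensitive_data: bool) -> Route:
    """Pick a backend: local by default, cloud only for complex, non-sensitive work."""
    if contains_sensitive_data:
        return LOCAL  # assumption: sensitive data never leaves the machine
    if task_type in COMPLEX_TASKS:
        return CLOUD
    return LOCAL


print(choose_route("summarize", contains_sensitive_data=False).model)
print(choose_route("multi_step_planning", contains_sensitive_data=False).model)
print(choose_route("multi_step_planning", contains_sensitive_data=True).model)
```

Checking sensitivity first means a misclassified task can only cost answer quality, not privacy.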

Most Common Mistakes in Practice

These are also the most frequent questions I got when teams first adopted this stack.

Building an agent with too small a model: connecting a 3B model to an agent pipeline leads to frequent failures in tool call parsing and multi-step reasoning. For agent use cases, a minimum of 8B is recommended, preferably a nous-hermes-series model.
Setting api_key to an empty string: the OpenAI SDK raises an exception when api_key="". In a local environment, just pass an arbitrary string like "ollama".
Mistaking the first model load time for response latency: the first request, before the model is loaded into memory, can take tens of seconds. Pre-load the model with ollama run nous-hermes, or send one warm-up request at service startup. Ollama also unloads idle models after a few minutes by default, so the same delay can return after a quiet period.
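A warm-up request can be sketched like this. It uses Ollama's native /api/generate endpoint, where a request with no prompt loads the model into memory and keep_alive controls how long it stays loaded. The helper names and the 30m value are hypothetical choices.

```python
import json
import urllib.request


def build_warmup_request(
    model: str,
    keep_alive: str = "30m",
    host: str = "http://localhost:11434",
) -> urllib.request.Request:
    """Build a request that loads the model without generating any text."""
    # No "prompt" field: the server only loads the model into memory.
    payload = {"model": model, "keep_alive": keep_alive}
    return urllib.request.Request(
        url=f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def warm_up(model: str) -> None:
    """Send the warm-up request; needs a running Ollama server."""
    with urllib.request.urlopen(build_warmup_request(model), timeout=300) as response:
        response.read()  # drain the response so the connection closes cleanly
```

Calling warm_up("nous-hermes") once at service startup moves the load time out of the first user-facing request.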

Closing Thoughts

When I first set up a local AI pipeline, I kept thinking, "There's so much configuration — is this actually going to be usable?" But after loading nous-hermes into Ollama and connecting my existing OpenAI code by changing just the base_url, the moment I got that first fully local response, I had the feeling of "this actually works." After that, I could experiment freely without worrying about cloud bills.

The Ollama + nous-hermes combination is a practical local AI infrastructure choice that simultaneously solves privacy protection and cost reduction, while letting you reuse your existing OpenAI code almost entirely as-is.

Three steps you can start with right now:

Install Ollama and run llama3.1:8b — you can start with a single command: curl -fsSL https://ollama.com/install.sh | sh && ollama run llama3.1:8b. If you have less than 16GB of RAM, try ollama run llama3.2:3b first.
Just change the base_url in your existing OpenAI code — modify to OpenAI(base_url="http://localhost:11434/v1", api_key="ollama") and the same code will call the local model. Experiencing the response quality and speed firsthand is worthwhile.
Connect an agent pipeline with nous-hermes — after ollama pull nous-hermes, hook it up to a LangChain agent and give it tool-calling tasks. You'll immediately feel the difference in function calling reliability compared to general-purpose models.
