Implementing In-House Document Q&A Without API Costs Using Ollama and LangChain: Privacy and Search Quality Together with Hybrid Search and Reranking
Two things nag at you when using cloud LLM APIs. One is the invoice that arrives at the end of the month, and the other is the anxiety of "our internal documents are being uploaded to OpenAI's servers, aren't they?" I still remember breaking into a cold sweat when the legal team asked, "Isn't this contract content going outside the company?" the first time I put a RAG pipeline into production.
Ollama keeps everything from LLM inference to embedding generation entirely on your local machine, and provides an OpenAI-compatible endpoint so you can connect it to existing code with almost no modifications. That said, it's not a perfect solution. In a CPU-only environment, a 7B model runs at around 1–5 tokens/s, which makes real-time responses unrealistic, and without a GPU, models of 14B and above are too slow for practical use. Knowing those trade-offs before you start is important.
This article is aimed at readers with a basic knowledge of Python who want to try local RAG for the first time. It covers a three-stage pipeline using Ollama + LangChain to index in-house documents and boost search quality with hybrid search and reranking. By the end, you'll have a chatbot that can answer questions about contracts and internal documents running on your local machine.
Table of Contents
- Core Concepts
- Prerequisites
- Hands-On Implementation
- Pros and Cons Analysis
- Closing Thoughts
- Quick Start
- References
Core Concepts
The Problem RAG Solves
RAG (Retrieval-Augmented Generation) is an architectural pattern that works around the limitation of LLMs having knowledge frozen at training time. It retrieves external documents in real time and injects them into the prompt, allowing you to reference data beyond the training cutoff and significantly reducing hallucinations since responses are grounded in retrieved facts.
| Problem | How RAG Addresses It |
|---|---|
| Knowledge freshness | References external documents in real time, regardless of training cutoff |
| Hallucination | Reduces factual errors by generating answers based on retrieved facts |
| Cost & privacy | Zero API costs and no external data leakage through fully local operation |
Understanding the Pipeline in Three Stages
The Ollama RAG pipeline is structurally divided into three stages. Simply knowing which stage a quality problem originates from can cut your debugging time in half.
Stage 1 — Indexing
Documents are split into chunks of a fixed size, vectorized using the Ollama embedding model, and stored in a local vector database. This is typically run once, or re-run only when documents are updated.
Stage 2 — Retrieval
The user query is vectorized using the same embedding model, and chunks with high similarity are retrieved from the vector database. Retrieval quality largely determines final answer quality.
Stage 3 — Generation
The retrieved chunks are injected into the prompt, and Ollama generates a response using the local LLM.
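To make the three stages concrete, here is a minimal, dependency-free sketch. It assumes a toy bag-of-words function (`toy_embed`, a hypothetical stand-in for a real embedding model) and an in-memory list instead of a vector database, and it stops at prompt assembly rather than calling an LLM. The sample sentences are made up for illustration.

```python
import math
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Stage 1: indexing (vectorize each chunk once and store it)
chunks = [
    "Annual leave requests must be submitted five days in advance.",
    "The contract may be terminated with thirty days written notice.",
    "Expense reports are due at the end of each month.",
]
index = [(chunk, toy_embed(chunk)) for chunk in chunks]

# Stage 2: retrieval (vectorize the query the same way, rank by similarity)
def retrieve(query: str, k: int = 2) -> list[str]:
    q = toy_embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# Stage 3: generation (here only the prompt that would be sent to the LLM)
def build_prompt(query: str) -> str:
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}"
```

If an answer is wrong, printing the output of `retrieve` first tells you whether the problem is in Stage 2 (the right chunk never arrived) or Stage 3 (the chunk arrived but the model ignored it).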
Core Ollama interfaces:
ollama pull (download a model), ollama run (interactive execution), and the HTTP REST API (port 11434). Ollama also supports OpenAI-compatible endpoints (/v1/embeddings, /v1/chat/completions), so it integrates with existing RAG frameworks like LangChain and LlamaIndex with almost no code changes.
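As a rough illustration of the OpenAI-compatible endpoint, the sketch below builds a request for /v1/embeddings using only the standard library. The helper names (`build_embedding_request`, `embed`) are hypothetical, and the payload and response shapes are assumed to follow the OpenAI embeddings format that the endpoint mirrors. Calling `embed` requires a running Ollama server.

```python
import json
import urllib.request

OLLAMA_BASE_URL = "http://localhost:11434"  # default Ollama port mentioned above

def build_embedding_request(texts, model="qwen3-embedding:8b"):
    """Build (but do not send) a POST request in the OpenAI embeddings format."""
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{OLLAMA_BASE_URL}/v1/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def embed(texts, model="qwen3-embedding:8b"):
    """Send the request and return one vector per input text (needs a running server)."""
    with urllib.request.urlopen(build_embedding_request(texts, model)) as resp:
        body = json.load(resp)
    return [item["embedding"] for item in body["data"]]
```

Because the request shape matches the OpenAI one, code written against a cloud endpoint usually only needs its base URL and model name changed.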
Choosing an Embedding Model
Honestly, I started by just throwing in nomic-embed-text and moving on, but when working with Korean documents I found the search results were unsatisfactory and ended up diving back into benchmarks.
| Model | Parameters | Context | Vector Dimensions | Recommended When |
|---|---|---|---|---|
| qwen3-embedding:8b | 8B | 8192 | 4096 | Korean / multilingual mixed documents (MTEB multilingual #1) |
| nomic-embed-text | 137M | 8192 | 768 | English documents, prioritizing fast processing speed |
| bge-m3 | 567M | 8192 | 1024 | Balanced Chinese/Korean/English multilingual |
| mxbai-embed-large | 335M | 512 | 1024 | English-focused, prioritizing quality |
Released in June 2025, qwen3-embedding:8b ranked #1 on the MTEB multilingual leaderboard with a score of 70.58. Scores in the 70s compete with OpenAI's text-embedding-3-large, a meaningful leap compared to previous local embedding models that hovered in the low-to-mid 60s. In my experience, for documents that include Korean, the difference in search hit rate versus nomic-embed-text is perceptible.
MTEB (Massive Text Embedding Benchmark): The standard benchmark for embedding models, evaluating retrieval, classification, and clustering performance across multiple datasets. Higher scores indicate better performance across diverse search scenarios.
One important caveat: once you choose an embedding model, switching later requires re-indexing the entire vector database. Choose carefully from the start, and it's strongly recommended to keep model names in a configuration file rather than hardcoded in your code.
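One way to keep model names out of the code is to read them from environment variables with sensible defaults. The sketch below is only an illustration: the variable names (`RAG_EMBEDDING_MODEL`, `RAG_CHAT_MODEL`, `RAG_PERSIST_DIR`) and the `RagConfig` class are assumptions made up for this example, not part of Ollama or LangChain.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class RagConfig:
    embedding_model: str
    chat_model: str
    persist_directory: str

def load_config(env=None) -> RagConfig:
    """Read settings from the environment (hypothetical variable names)."""
    env = os.environ if env is None else env
    return RagConfig(
        embedding_model=env.get("RAG_EMBEDDING_MODEL", "qwen3-embedding:8b"),
        chat_model=env.get("RAG_CHAT_MODEL", "qwen3:14b"),
        persist_directory=env.get("RAG_PERSIST_DIR", "./db"),
    )
```

Every place that needs a model name then reads `load_config().embedding_model`, so switching models is a one-line configuration change followed by a re-index.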
Chunking Strategy Matters More Than You Think
A recent arXiv study (2505.21700) produced an interesting finding: the performance gap from changing chunking strategy (approximately 7%) is larger than the performance gap between embedding models (approximately 5%). It's worth reviewing your chunking before switching to a more expensive embedding model.
| Scenario | Recommended Chunk Size | Approach |
|---|---|---|
| FAQ, short queries | 128–256 characters | Fixed size |
| Technical docs, manuals | 512–1024 characters | Fixed size + overlap |
| Long-form analysis, legal docs | Parent 1024 + child 256 characters | Hierarchical chunking |
There's a reason the unit is specified as "characters" here. LangChain's RecursiveCharacterTextSplitter calculates chunk size by character count by default (length_function=len). 512 characters is roughly 170–250 tokens in Korean. If you think of it as a "512-token chunk," you'll end up with much smaller chunks in practice, which can impact search quality. If you want to work in tokens, you need to pass tiktoken or a HuggingFace tokenizer directly into length_function.
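The strategies in the table can be sketched in a few lines of plain Python. This is a simplified illustration that cuts purely by character count. RecursiveCharacterTextSplitter additionally tries to break on separators such as paragraphs and sentences, so its output will differ. The function names are hypothetical.

```python
def chunk_text(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Fixed-size chunking measured in characters (len), not tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # each new chunk starts this many characters later
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

def parent_child_chunks(text: str, parent_size: int = 1024, child_size: int = 256):
    """Hierarchical chunking: each small child keeps a pointer to its larger parent."""
    pairs = []
    for parent_id, parent in enumerate(chunk_text(text, parent_size)):
        for child in chunk_text(parent, child_size):
            pairs.append({"parent_id": parent_id, "child": child, "parent": parent})
    return pairs
```

In the hierarchical variant, the idea is to search over the small `child` chunks for precision and then hand the matching `parent` text to the LLM for context.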
Prerequisites
Have the following environment set up before following the code examples.
1. Install Ollama and prepare models
# After installing Ollama from ollama.com
ollama pull qwen3-embedding:8b # Embedding model (~5GB)
ollama pull qwen3:14b # Generation model (~9GB)
# The Ollama server must be running before executing code
ollama serve
If you don't run ollama serve first, you'll get a ConnectionRefusedError in the code. It's helpful to open another terminal to keep the server running, or configure it to start automatically at system startup.
2. Install Python packages
pip install langchain-ollama langchain-chroma langchain-community \
    rank-bm25 sentence-transformers pypdf
langchain-ollama and langchain-chroma are prone to API changes between versions. If you encounter version conflicts, check the LangChain official docs for compatible versions.
Hardware notes: In a CPU-only environment, a 7B model runs at around 1–5 tokens/s, which is slow. For faster responses, or to run a 14B or larger model, a GPU with 16GB+ VRAM is recommended. The example code below will run in any environment, but response speed will vary significantly depending on hardware.
Hands-On Implementation
Building a Basic Q&A Chatbot
This is the most common pattern in environments with legal and compliance requirements. It indexes contracts, policy documents, and manuals, answering questions without sending queries externally.
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain_ollama import OllamaEmbeddings, ChatOllama
from langchain_chroma import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
# Load document (PDF or text file)
loader = PyPDFLoader("contracts/sample.pdf")
# loader = TextLoader("docs/manual.txt", encoding="utf-8") # for text files
docs = loader.load()
# Local embedding model
embeddings = OllamaEmbeddings(model="qwen3-embedding:8b")
# Chunk splitting: 512 characters, 50-character overlap
# length_function=len is character-count based (different from token count — 512 Korean chars ≈ 170–250 tokens)
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    length_function=len,
)
chunks = splitter.split_documents(docs)
# Record as metadata so the embedding model used for indexing can be tracked later
for chunk in chunks:
    chunk.metadata["embedding_model"] = "qwen3-embedding:8b"
# Vector storage (persisted to local disk)
# Re-running with the same persist_directory adds the chunks to the existing
# collection again, so delete ./db first if you want a clean re-index
vectorstore = Chroma.from_documents(
    chunks,
    embeddings,
    persist_directory="./db",
)
# MMR retrieval: returns relevant, non-redundant results
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 5, "fetch_k": 20},
)
prompt = ChatPromptTemplate.from_template("""
Please answer the question based only on the context below.
If the answer cannot be found in the context, respond with "This cannot be confirmed in the document."
Context: {context}
Question: {question}
""")
llm = ChatOllama(model="qwen3:14b")
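def format_docs(docs):
    # Join chunk texts so the prompt receives plain text instead of Document reprs
    return "\n\n".join(doc.page_content for doc in docs)

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)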
response = chain.invoke("What is the procedure for requesting annual leave?")
print(response)

| Code Point | Description |
|---|---|
| search_type="mmr" | Maximal Marginal Relevance: considers both relevance and diversity to filter out duplicate chunks |
| fetch_k=20 | Internally retrieves 20 candidates, then uses the MMR algorithm to select the final 5 |
| persist_directory | ChromaDB saves to disk so re-indexing is unnecessary after server restarts |
| embedding_model metadata | Allows tracking which model was used for indexing when the embedding model is swapped |
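To show what the MMR step does with the fetched candidates, here is a small pure-Python sketch of the selection loop. It is an illustration of the general algorithm, not LangChain's or Chroma's actual code. `lambda_mult` mirrors the parameter of the same name that can be passed through search_kwargs: 1.0 means pure relevance, and lower values favor diversity.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr(query_vec, doc_vecs, k=5, lambda_mult=0.5):
    """Return indices of up to k documents, balancing relevance and diversity."""
    selected = []
    remaining = list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Penalize similarity to anything already selected
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With two near-identical candidates and one different one, the loop picks the most relevant candidate first and then skips its near-duplicate in favor of the different chunk, which is the behavior you want when overlapping chunks would otherwise crowd the context.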
Why ChromaDB? Qdrant and Weaviate are excellent too, but ChromaDB works immediately with just a Python package — no Docker required. The biggest advantage for a first RAG attempt is minimizing environment setup. When you need advanced filtering or higher performance in production, that's the time to consider migrating to Qdrant.
Improving Accuracy with BM25 + Semantic Search Hybrid
A common situation in practice is that pure vector search alone fails to catch keywords like "Article 3" or "SKU-2024-001." Hybrid search combining BM25 (keyword search) with Dense Embedding (semantic search) helps in these cases.
The code below reuses the chunks created in the basic Q&A chatbot implementation. You can run it directly without separate document loading.
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
from langchain_ollama import OllamaEmbeddings
from langchain_chroma import Chroma
embeddings = OllamaEmbeddings(model="qwen3-embedding:8b")
# Dense retriever: semantics-based search
# Separate persist_directory so these chunks are not added on top of the DB from the previous example
vectorstore = Chroma.from_documents(
    chunks,  # reusing chunks split in the previous example
    embeddings,
    persist_directory="./db_hybrid",
)
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
# BM25 retriever: keyword-based search
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 10
# Ensemble: merges results from both using RRF (Reciprocal Rank Fusion)
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, dense_retriever],
    weights=[0.4, 0.6],  # higher weight on semantic search
)
results = ensemble_retriever.invoke("Article 3 contract termination conditions")

RRF (Reciprocal Rank Fusion): An algorithm that merges multiple search result lists into one final ranking. Each document earns a score from every list based on the reciprocal of its rank in that list, and the scores are summed. Because it uses ranks rather than raw scores, it can combine BM25 and vector similarity without normalizing them, and it tends to produce more robust results than either search approach alone.
weights=[0.4, 0.6] is a starting point, not the definitive answer. For legal or regulatory documents where keyword matching matters, try raising the BM25 weight; for FAQ-style content with many natural language questions, lean toward the Dense side. In my experience, starting at 50:50 or 40:60 and comparing Ragas scores gives you a feel for the right balance.
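The fusion step itself is small enough to sketch by hand. The version below is a minimal weighted RRF over lists of document ids. The constant `c=60` is the value commonly used for RRF and is assumed here. Treat the function as an illustration of the idea rather than a copy of EnsembleRetriever's internals.

```python
from collections import defaultdict

def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists of doc ids; each list contributes weight / (rank + c)."""
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["a", "b", "c"]   # hypothetical keyword-search order
dense_ranking = ["b", "d", "a"]  # hypothetical semantic-search order
fused = weighted_rrf([bm25_ranking, dense_ranking], weights=[0.4, 0.6])
```

Documents that appear in both lists ("a" and "b" here) rise to the top, while the weights decide which retriever wins when the two disagree.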
Improving Top-Result Precision with Cross-Encoder Reranking
Rather than passing top-k retrieval results directly to the LLM, re-sorting them with a cross-encoder noticeably improves precision. The code below operates on the ensemble_retriever created in the BM25 + semantic search hybrid section.
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
# Cross-encoder reranker (runs locally; model is auto-downloaded on first run)
model = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-v2-m3")
compressor = CrossEncoderReranker(model=model, top_n=5)
# Retrieve top-20 → cross-encoder reranking → return top-5
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=ensemble_retriever,  # hybrid retriever from the previous example
)
reranked_results = compression_retriever.invoke("Penalty provisions upon contract termination")

Cross-Encoder: A model that takes a query and a document together as a single input and computes a relevance score. It is more accurate than simple vector similarity, but requires inference for every document, making it slower. A two-stage approach (quickly retrieving the top-20 first, then using the cross-encoder to narrow down to top-5) balances speed and quality effectively.
Reranking runs locally so there are no API cost concerns, but it introduces additional inference overhead that increases response latency. If you have a large number of documents or real-time response is critical, measure the latency before deciding whether to adopt it.
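The two-stage idea and the latency check can be sketched together. In the example below, `overlap_score` is a hypothetical toy scorer standing in for a real cross-encoder, and `timed` is a small helper for measuring how long any retrieval call takes. In practice you would wrap the retriever's invoke call with the same helper.

```python
import time

def rerank(query, candidates, score_fn, top_n=5):
    """Score every (query, candidate) pair and keep the top_n highest."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def overlap_score(query, doc):
    """Hypothetical toy scorer: number of shared words. Swap in a real cross-encoder."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

candidates = [
    "penalty for late payment",
    "contract termination penalty provisions",
    "office hours",
]
top, seconds = timed(
    rerank, "penalty provisions upon contract termination", candidates, overlap_score, top_n=2
)
```

Comparing the elapsed time with and without the reranking stage on a handful of representative queries gives a concrete basis for the adopt-or-skip decision.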
Pros and Cons Analysis
Advantages
| Item | Details |
|---|---|
| Complete data privacy | All queries and documents processed locally only; satisfies legal compliance |
| Zero API costs | No operating costs after initial setup; economical for large-scale document processing |
| Offline operation | No network dependency; suitable for isolated environments (factories, hospitals, military facilities) |
| OpenAI-compatible API | Cloud → local migration possible with minimal code changes |
| Model variety | 100+ models supported including Llama, Qwen, Mistral, DeepSeek, Gemma |
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| Hardware requirements | 7B: possible on CPU with 8GB of system RAM (1–5 tokens/s); 14B+: GPU with 16GB VRAM recommended | Choose model size to match resources; use quantized models |
| Concurrent request limits | Parallel processing performance of a single instance is limited compared to cloud | Multiple Ollama instances + load balancer setup |
| Initial model download | First pull takes time for models ranging from Llama3.2:3b (2GB) to Qwen3:32b (20GB) | Pre-pull in CI/CD; use a local registry |
| Embedding migration cost | Switching embedding models requires re-indexing the entire vector DB | Choose initial model carefully; externalize model name to config |
Honestly, the most painful "disadvantage" for me was the embedding migration cost. I started quickly with nomic-embed-text and later switched to qwen3-embedding:8b, which meant re-indexing tens of thousands of documents over a considerable amount of time. That experience taught me just how important it is to put model names in environment variables from the start and record them as metadata in the vector database.
Quantization: A technique that reduces model weights from float32 to lower precision formats like int8 or int4. This reduces VRAM usage and speeds up inference at a slight cost to quality. In Ollama, quantized versions can be selected using model tags that include a suffix such as q4_K_M or q8_0.
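A back-of-the-envelope calculation shows why quantization matters for the hardware requirements above. The sketch assumes nominal bit widths and counts the weights only. Real model files and runtime memory are somewhat larger, since quantized formats mix precisions and inference also needs room for the context cache.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8

# Nominal precisions; actual quantized formats use slightly more bits on average
for label, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("int4", 4)]:
    print(f"14B @ {label}: ~{weight_memory_gb(14, bits):.0f} GB")
```

By this estimate a 14B model shrinks from roughly 28 GB at float16 to roughly 7 GB at 4-bit precision, which is the difference between not fitting and fitting on a 16GB GPU.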
Most Common Mistakes in Practice
- Reaching for a bigger embedding model first: When search quality is low, the instinct is to immediately swap to a larger embedding model. However, as the arXiv study mentioned earlier shows, optimizing your chunking strategy yields larger performance gains than swapping embedding models. Try adjusting chunk size and overlap first.
- Relying solely on vector search: Pure semantic search is weak at exact keyword matching for terms like "Article 3" or "SKU-2024-001." For production, hybrid search combined with BM25 consistently produces better results.
- Hardcoding embedding model names: Scattering embedding model names throughout your code means that switching models requires code changes on top of re-indexing. Keep model names in a config file or environment variable, and store them as metadata in the vector database (see chunk.metadata["embedding_model"] in the basic Q&A chatbot code).
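Storing the model name as metadata pays off when you check it before querying. The guard below is a minimal sketch: `check_embedding_model` is a hypothetical helper, and it assumes you can obtain the stored metadata as a list of dicts.

```python
def check_embedding_model(stored_metadatas, current_model):
    """Raise if any stored chunk was indexed with a different embedding model."""
    found = {meta.get("embedding_model", "<missing>") for meta in stored_metadatas}
    mismatched = found - {current_model}
    if mismatched:
        raise RuntimeError(
            f"Index was built with {sorted(mismatched)}, but the current model is "
            f"{current_model!r}; re-index before querying."
        )

# Example with hand-written metadata
check_embedding_model(
    [{"embedding_model": "qwen3-embedding:8b"}, {"embedding_model": "qwen3-embedding:8b"}],
    "qwen3-embedding:8b",
)
```

With the Chroma store from the chatbot example, the metadata list could plausibly come from something like `vectorstore.get(include=["metadatas"])["metadatas"]`. Verify that call against your installed langchain-chroma version before relying on it.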
Closing Thoughts
After reading this, you might be thinking: "A better embedding model would still improve quality, right?" That's true, but the most counterintuitive point I want to highlight in this article is that chunking strategy delivers a larger quality improvement than switching embedding models. Before hunting for a more expensive embedding model, check whether your current chunk size is suited to the structure of your documents.
Ollama-based local RAG is a practical stack that eliminates API costs, keeps your data from leaving your environment, and can be equipped with hybrid search and reranking. With a model of 7B or smaller, you can get started on a laptop CPU, and with a GPU, stepping up to 14B or above brings a noticeable improvement in perceived quality.
Quick Start
Three steps to get started right now.
- Install Ollama and prepare models: Install Ollama from ollama.com, then run ollama pull qwen3-embedding:8b && ollama pull qwen3:14b to download a combination optimized for Korean documents. Start the server with ollama serve before running your code.
- Try your first indexing with ChromaDB: Install packages with pip install langchain-ollama langchain-chroma langchain-community pypdf, then try your first indexing using the basic Q&A chatbot code with a few of your own PDF or text files. ChromaDB works immediately with just the Python package, with no separate server needed.
- Measure search quality with Ragas: Install with pip install ragas, then measure the three metrics: Faithfulness, Answer Relevancy, and Context Precision. Managing quality by the numbers rather than by gut feeling ("seems to be working") lets you clearly verify the effect of changing chunking strategies or introducing reranking.
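Before wiring up Ragas, a hand-rolled retrieval check can already put a number on the search hit rate. The sketch below is not a Ragas metric. It assumes you have a small set of questions labeled with the id of the chunk that should be retrieved, and a `retrieve` callable (hypothetical) that returns ranked chunk ids.

```python
def hit_rate_at_k(retrieve, labeled_queries, k=5):
    """Fraction of queries whose expected chunk id appears in the top-k results."""
    if not labeled_queries:
        raise ValueError("labeled_queries must not be empty")
    hits = sum(
        1 for query, expected_id in labeled_queries if expected_id in retrieve(query)[:k]
    )
    return hits / len(labeled_queries)

# Toy stand-in for a retriever: canned rankings per query
fake_results = {
    "annual leave procedure": ["c7", "c2", "c9"],
    "termination penalty": ["c4", "c1", "c3"],
}
labeled = [("annual leave procedure", "c2"), ("termination penalty", "c8")]
score = hit_rate_at_k(lambda q: fake_results[q], labeled, k=3)
```

Running the same labeled set before and after a chunking or reranking change gives a quick comparison, which the Ragas metrics can then confirm in more depth.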
References
- Ollama Embedding Models | Ollama Blog
- Multimodal Models | Ollama Blog
- Ollama Model Library
- Local RAG Tutorial: LangChain, Ollama & ChromaDB with Ragas | Medium
- Building a Robust RAG Pipeline with LangChain, Ollama, and Chroma | Codes and Chips
- Local RAG System for Privacy with Ollama and Weaviate | Weaviate Blog
- Building a Private RAG System with Ollama | Markaicode
- Ollama Embedding Models: Benchmarks, VRAM, and Which to Use | Morph
- Advanced RAG Techniques: Hybrid Search and Re-ranking | dasroot.net
- Why Your RAG Pipeline Returns Wrong Answers: Chunk Size & Embedding | Medium
- Best RAG Frameworks 2025: LangChain vs LlamaIndex vs Haystack vs RAGFlow | LangCopilot
- How to Optimize RAG Retrieval Accuracy with Ollama: 7 Proven Techniques | Markaicode
- Rethinking Chunk Size For Long-Document Retrieval: A Multi-Dataset Analysis | arXiv
- 오프라인 RAG 시스템 구축 가이드: Ollama 및 Python 활용 (Guide to Building an Offline RAG System with Ollama and Python) | Toolify