Building a Local RAG System with MongoDB, Ollama & Streamlit
A step-by-step guide to building a fully local AI-powered notes assistant — no cloud, no API costs.
What I Built
A RAG (Retrieval-Augmented Generation) system that lets you ask questions about your own notes and documents, powered entirely by local AI models.
Tech Stack
| Layer | Tool |
|---|---|
| Database | MongoDB 7.0 (Docker) |
| Embeddings | nomic-embed-text (Ollama) |
| LLM | llama3.2 (Ollama) |
| Backend | Python (pymongo, numpy, requests) |
| UI | Streamlit |
| Infrastructure | Docker + Docker Compose |
Step 1 — Set Up MongoDB with Docker
- Install Docker Desktop from docker.com
- Pull the MongoDB image: `docker pull mongo:7.0`
- Run a 3-node replica set using Docker Compose
- Initialize the replica set via `rs.initiate()` in mongosh
Key point: Use Docker Compose for a clean, reproducible setup. Replica sets are required for transactions and change streams.
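As an illustrative sketch, one node of the three-node Compose setup might look like this (service names, volume names, and ports are my assumptions, not the original file); `mongo2` and `mongo3` repeat the same pattern on different host ports:

```yaml
services:
  mongo1:
    image: mongo:7.0
    # Join replica set "rs0"; bind on all interfaces so peers can connect
    command: ["mongod", "--replSet", "rs0", "--bind_ip_all"]
    ports:
      - "27017:27017"
    volumes:
      - mongo1_data:/data/db
volumes:
  mongo1_data:
```

After `docker-compose up -d`, connect with mongosh and run `rs.initiate()` with a config listing all three members.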
Step 2 — Install Ollama & Pull Models
- Download Ollama from ollama.com
- Pull the LLM: `ollama pull llama3.2`
- Pull the embedding model: `ollama pull nomic-embed-text`
Key point: Ollama runs models fully locally — no API keys, no costs, no data leaving your machine.
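As a sketch, getting an embedding is a plain HTTP POST to Ollama's local `/api/embeddings` endpoint (the helper names below are mine; Ollama listens on port 11434 by default):

```python
import requests

OLLAMA_URL = "http://localhost:11434"

def build_embed_request(text):
    # Payload for POST /api/embeddings; "prompt" carries the input text
    return {"model": "nomic-embed-text", "prompt": text}

def embed(text):
    # Returns the embedding vector as a list of floats
    resp = requests.post(f"{OLLAMA_URL}/api/embeddings",
                         json=build_embed_request(text), timeout=60)
    resp.raise_for_status()
    return resp.json()["embedding"]
```

The same `embed()` helper can be shared by `ingest.py` and `rag.py`, which also guarantees both sides use the same model.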
Step 3 — Build the Ingestion Pipeline (ingest.py)
- Split text into overlapping chunks (500 words, 50-word overlap)
- Generate vector embeddings for each chunk using `nomic-embed-text`
- Store chunks and embeddings in MongoDB
Key point: Chunking with overlap ensures context is not lost at chunk boundaries.
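The overlapping word-based chunker described above can be sketched like this (the function name and exact slicing are illustrative, not the original `ingest.py`):

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks of `chunk_size` words,
    each sharing `overlap` words with the previous chunk."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap  # advance 450 words per chunk by default
    for start in range(0, len(words), step):
        chunk = words[start:start + chunk_size]
        if chunk:
            chunks.append(" ".join(chunk))
        if start + chunk_size >= len(words):
            break  # last chunk reached the end of the text
    return chunks
```

A 1,000-word document yields three chunks, with the last 50 words of each chunk repeated at the start of the next.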
Step 4 — Build the RAG Query Pipeline (rag.py)
- Embed the user's question using the same model
- Compute cosine similarity against all stored chunks in MongoDB
- Retrieve top-K most relevant chunks as context
- Send context + question to llama3.2 and return the answer
Key point: Always use the same embedding model for both ingestion and querying - mismatched dimensions will cause errors.
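The retrieval step above can be sketched with numpy (field names like `"embedding"` and `"text"` are assumptions about the MongoDB document schema):

```python
import numpy as np

def top_k_chunks(question_vec, docs, k=3):
    """Rank stored chunks by cosine similarity to the question embedding
    and return the text of the top-k matches."""
    q = np.asarray(question_vec, dtype=float)
    q = q / np.linalg.norm(q)  # normalize once, reuse for every comparison
    scored = []
    for doc in docs:
        v = np.asarray(doc["embedding"], dtype=float)
        score = float(np.dot(q, v / np.linalg.norm(v)))  # cosine similarity
        scored.append((score, doc["text"]))
    scored.sort(key=lambda s: s[0], reverse=True)
    return [text for _, text in scored[:k]]
```

This is a brute-force scan over all chunks, which is fine for a personal notes collection; at larger scale you would switch to MongoDB Atlas Vector Search or another index.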
Step 5 — Build the Streamlit Chat UI (app.py)
- Dark-themed chat interface in the browser
- Sidebar to paste text or upload `.txt` files
- Real-time ingestion with chunk counter
- Chat history with source document chips per answer
Key point: Use `st.set_page_config()` as the very first Streamlit command in your script.
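A minimal skeleton of the chat UI might look like this (widget labels, session keys, and the point where the RAG pipeline would be called are illustrative assumptions, not the original `app.py`):

```python
import streamlit as st

# Must be the very first Streamlit command in the script
st.set_page_config(page_title="Local Notes RAG", layout="wide")

with st.sidebar:
    uploaded = st.file_uploader("Upload notes", type=["txt"])
    pasted = st.text_area("Or paste text")
    # On submit, run the ingestion pipeline and show a chunk counter here

if "history" not in st.session_state:
    st.session_state.history = []

question = st.chat_input("Ask about your notes")
if question:
    st.session_state.history.append(("user", question))
    # The answer would come from the RAG pipeline in rag.py

for role, msg in st.session_state.history:
    with st.chat_message(role):
        st.write(msg)
```

`st.session_state` keeps the chat history alive across Streamlit's rerun-on-every-interaction model.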
Architecture
Your Notes → Chunk & Embed → MongoDB
↓
Question → Embed → Cosine Search → Top Chunks
↓
llama3.2 → Answer
Challenges & Fixes
- Port conflicts — Stop old containers before starting new ones
- torch DLL crash on Windows — Replace sentence-transformers with Ollama embeddings
- Embedding dimension mismatch (384 vs 768) — Clear MongoDB and re-ingest when switching embedding models
- Windows Long Path error — Enable long paths via PowerShell or Registry
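The dimension-mismatch bug is easiest to catch with an explicit guard before searching (a hypothetical helper, not part of the original code):

```python
def check_dims(query_vec, stored_vec):
    """Fail fast if the query embedding and stored embeddings disagree
    in length (e.g. 384-dim vs 768-dim from different models)."""
    if len(query_vec) != len(stored_vec):
        raise ValueError(
            f"Embedding dimension mismatch: query has {len(query_vec)} dims, "
            f"stored chunks have {len(stored_vec)}. "
            "Clear the collection and re-ingest with the same model."
        )
```

Calling this on the first stored document before the similarity scan turns a cryptic numpy shape error into an actionable message.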
Run It Yourself
```bash
# 1. Start MongoDB
docker-compose up -d

# 2. Ingest your notes
python ingest.py

# 3. Launch the UI
streamlit run app.py

# Open http://localhost:8501
```
Built entirely locally — your data never leaves your machine. 🍃