· 2 min read

Why I'm building RAG without an API tax.

  • ai
  • rag
  • engineering

Most RAG tutorials still assume a paid LLM and a paid embedding API in the loop. That's fine for a demo. It's a poor default for the rest of us — hobbyists, students, anyone hosting on their own machine, and any team that doesn't want their documents leaving their network.

FreeRAG is my attempt at a sensible default: a retrieval-augmented generation stack that runs on local Hugging Face embeddings, a swappable vector store, and a FastAPI service that owns the grounding loop. No per-token cost. No data egress.

The shape of it

docs → chunker → embeddings (HF, local) → vector store
                                          ↓
            user query → retriever → reranker → LLM (local or hosted)
                                                 ↓
                                            grounded answer + citations

Every component is a seam — you can swap the embedding model, the vector store, the reranker, or the LLM without touching the rest. That's the point. Most "RAG frameworks" hide these seams behind a one-line API; I wanted them exposed.
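One way to picture those seams is as four small interfaces plus a single function that owns the grounding loop. This is a minimal sketch, not FreeRAG's actual code: every name here (`Embedder`, `VectorStore`, `Reranker`, `LLM`, `answer_query`, and the toy stand-in classes) is hypothetical, and the stand-ins exist only so the example runs without downloading a model.

```python
import math
import zlib
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str


@dataclass(frozen=True)
class Answer:
    text: str
    citations: list[str]  # doc_ids of the chunks shown to the LLM


# The four seams (hypothetical names). Any object with matching methods fits.
class Embedder(Protocol):
    def embed(self, texts: Sequence[str]) -> list[list[float]]: ...

class VectorStore(Protocol):
    def add(self, chunks: Sequence[Chunk], vectors: Sequence[Sequence[float]]) -> None: ...
    def search(self, vector: Sequence[float], k: int) -> list[Chunk]: ...

class Reranker(Protocol):
    def rerank(self, query: str, chunks: Sequence[Chunk], top_n: int) -> list[Chunk]: ...

class LLM(Protocol):
    def generate(self, prompt: str) -> str: ...


def answer_query(query: str, embedder: Embedder, store: VectorStore,
                 reranker: Reranker, llm: LLM, k: int = 8, top_n: int = 3) -> Answer:
    """Retrieve k candidates, rerank down to top_n, then generate with citations."""
    query_vector = embedder.embed([query])[0]
    candidates = store.search(query_vector, k)
    context = reranker.rerank(query, candidates, top_n)
    sources = "\n".join(f"[{c.doc_id}] {c.text}" for c in context)
    prompt = f"Answer using only these sources.\n{sources}\n\nQuestion: {query}"
    return Answer(text=llm.generate(prompt), citations=[c.doc_id for c in context])


# Toy stand-ins so the sketch runs without any model downloads.
class HashingEmbedder:
    def __init__(self, dim: int = 64) -> None:
        self.dim = dim

    def embed(self, texts: Sequence[str]) -> list[list[float]]:
        vectors = []
        for text in texts:
            vec = [0.0] * self.dim
            for word in text.lower().split():
                vec[zlib.crc32(word.encode()) % self.dim] += 1.0
            norm = math.sqrt(sum(x * x for x in vec)) or 1.0
            vectors.append([x / norm for x in vec])
        return vectors


class InMemoryStore:
    def __init__(self) -> None:
        self.rows: list[tuple[Chunk, list[float]]] = []

    def add(self, chunks, vectors) -> None:
        self.rows.extend((c, list(v)) for c, v in zip(chunks, vectors))

    def search(self, vector, k):
        # Vectors are unit length, so the dot product is cosine similarity.
        scored = sorted(self.rows, key=lambda row: -sum(a * b for a, b in zip(vector, row[1])))
        return [chunk for chunk, _ in scored[:k]]


class OverlapReranker:
    def rerank(self, query, chunks, top_n):
        # Crude word-overlap score standing in for a cross-encoder.
        q = set(query.lower().split())
        return sorted(chunks, key=lambda c: -len(q & set(c.text.lower().split())))[:top_n]


class EchoLLM:
    def generate(self, prompt: str) -> str:
        return f"(stub answer for a prompt of {len(prompt)} characters)"


chunks = [Chunk("a.md", "FastAPI serves the grounding loop"),
          Chunk("b.md", "Cats sleep most of the day")]
embedder, store = HashingEmbedder(), InMemoryStore()
store.add(chunks, embedder.embed([c.text for c in chunks]))
result = answer_query("what serves the grounding loop", embedder, store,
                      OverlapReranker(), EchoLLM(), k=2, top_n=1)
print(result.citations)
```

Because `answer_query` only depends on the four protocols, swapping the toy embedder for a local Hugging Face model, or the overlap scorer for a real cross-encoder, would mean writing one new class and changing one argument.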

What it taught me

  1. Embedding choice is half the result. Two HF models on the same corpus can produce dramatically different retrieval quality. Test them on your own documents before you commit to one.
  2. Reranking is cheap and underrated. A small cross-encoder pass after vector retrieval lifts answer relevance more than I expected.
  3. Citations aren't optional. If the user can't see where the answer came from, you've built a confident liar.
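The first lesson is easy to act on with a tiny evaluation harness. The sketch below is an illustration under assumptions, not part of FreeRAG: `recall_at_k`, `EmbedFn`, and the two toy embedders are hypothetical names, and the labelled queries are made up. In practice each `embed` callable would wrap a local Hugging Face model, and the labelled set would come from your own corpus.

```python
import math
from typing import Callable, Sequence

# An embedder here is any callable from texts to vectors (a hypothetical seam).
EmbedFn = Callable[[Sequence[str]], list[list[float]]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_at_k(embed: EmbedFn, docs: dict[str, str],
                labelled: Sequence[tuple[str, str]], k: int) -> float:
    """Fraction of (query, expected_doc_id) pairs whose doc lands in the top k."""
    doc_ids = list(docs)
    doc_vecs = dict(zip(doc_ids, embed([docs[d] for d in doc_ids])))
    hits = 0
    for query, expected in labelled:
        query_vec = embed([query])[0]
        ranked = sorted(doc_ids, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
        hits += expected in ranked[:k]
    return hits / len(labelled)


# Two toy embedders standing in for two candidate models.
def letter_counts(texts: Sequence[str]) -> list[list[float]]:
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    return [[float(t.lower().count(ch)) for ch in alphabet] for t in texts]


def constant(texts: Sequence[str]) -> list[list[float]]:
    # Deliberately uninformative baseline: every text gets the same vector.
    return [[1.0] for _ in texts]


docs = {"rerank.md": "reranking with a cross encoder",
        "chunk.md": "chunking markdown files"}
labelled = [("cross encoder reranking", "rerank.md"),
            ("markdown chunking", "chunk.md")]

for name, embed in {"letter_counts": letter_counts, "constant": constant}.items():
    print(name, recall_at_k(embed, docs, labelled, k=1))
```

Even a few dozen labelled queries are usually enough to show whether one candidate retrieves clearly better than another. The same harness can be rerun with the reranker in the loop to see how much that pass adds.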

The repo is on GitHub. A longer case study lives at /work/freerag.