RAG Implementation Guide for SMBs (2026)
How to ship a Retrieval-Augmented Generation system that actually works for SMBs — chunking, embeddings, evaluation, and the mistakes that cost us six weeks.

Why this guide exists
Most RAG tutorials stop at "load documents → embed → search → ask LLM." That works for a demo. It does not work in production.
This guide is what we wished we'd had two years ago when we shipped our first RAG system. It's specifically aimed at SMBs — teams without an in-house ML platform, but with real users who'll notice when the agent hallucinates.
We'll cover what works, what doesn't, and the trade-offs that aren't in the docs.
The honest baseline
Naive RAG looks like this (sketched in code after the list):
- Split documents into chunks
- Embed each chunk with a model (OpenAI, Cohere, BGE)
- Store embeddings in a vector DB
- At query time: embed the question, find top-k similar chunks, stuff them into the LLM prompt
- Return the answer
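To make those five steps concrete, here is a minimal sketch using the OpenAI Python SDK and an in-memory store. The model names, the 2,000-character split, and the prompt are placeholders, not recommendations.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

documents = ["Your raw document text goes here."]  # placeholder corpus

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# 1-2. Split into fixed-size chunks (naive on purpose) and embed them
chunks = [doc[i:i + 2000] for doc in documents for i in range(0, len(doc), 2000)]
chunk_vectors = embed(chunks)

def answer(question: str, k: int = 5) -> str:
    # 3. Embed the question and take the top-k chunks by cosine similarity
    q = embed([question])[0]
    sims = chunk_vectors @ q / (np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q))
    context = "\n---\n".join(chunks[i] for i in np.argsort(-sims)[:k])
    # 4-5. Stuff the chunks into the prompt and return the LLM's answer
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"}],
    )
    return resp.choices[0].message.content
```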
That pipeline handles roughly 70% of questions when:
- Documents are clean and reasonably homogeneous
- Questions are short and well-formed
- Your domain isn't full of jargon
- Users don't ask multi-hop questions
For the remaining 30%, things break. That's where the work is.
Chunking — the most underrated decision
The default tutorial says "split by 500 tokens with 50 overlap." This is almost always wrong.
Better defaults:
- Markdown / structured docs: split by heading hierarchy. A chunk = one logical section. Use the heading text as a preamble so context is retained (see the sketch after this list).
- Long-form text (articles, transcripts): semantic chunking — split where topics change. Libraries like LangChain's semantic chunker work, or use a small LLM as a splitter.
- Code or contracts: chunk by structural unit (function, clause). Keep parent context.
- Tables: don't chunk. Embed each row as its own chunk with column headers as context.
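For the Markdown case, a minimal heading-based chunker fits in a few lines of standard-library Python. The heading-path preamble is one way to keep context; adapt the regex to your own heading conventions.

```python
import re

def chunk_markdown(text: str) -> list[str]:
    # One chunk per section, with the heading path prepended as a preamble
    chunks, path, buf = [], [], []

    def flush():
        if buf:
            preamble = " > ".join(path)
            chunks.append((preamble + "\n\n" if preamble else "") + "\n".join(buf).strip())
            buf.clear()

    for line in text.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:
            flush()                                   # close the previous section
            level = len(m.group(1))
            path[:] = path[:level - 1] + [m.group(2).strip()]
        else:
            buf.append(line)
    flush()
    return [c for c in chunks if c.strip()]
```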
We've seen retrieval quality go from 60% to 88% just by fixing chunking. It's the highest-leverage change you can make.
Embeddings — pick once, evaluate constantly
In 2026, the default is OpenAI text-embedding-3-small (1536 dimensions) or text-embedding-3-large (3072 dimensions). Both are good. large is better, but it doubles your storage and costs roughly 6× more per embedded token.
For French content specifically, BGE-M3 and mistral-embed are competitive with OpenAI and sometimes better on French-only corpora. Test both on your data before committing.
The mistake we made: picking embeddings based on benchmarks instead of our own data. MTEB scores don't translate to your domain. Build a small eval set (50 questions, ground-truth answers) and test 2–3 embedding models. Pick what wins on YOUR set.
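A bake-off can be as simple as measuring recall@k on your own eval set for each candidate model. Here is a sketch assuming the OpenAI SDK; eval_set and chunks stand in for your own questions and corpus.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

chunks = ["..."]                       # your corpus chunks
eval_set = [                           # 50+ (question, index of correct chunk) pairs
    {"question": "...", "relevant_chunk": 0},
]

def embed(texts: list[str], model: str) -> np.ndarray:
    resp = client.embeddings.create(model=model, input=texts)
    return np.array([d.embedding for d in resp.data])

def recall_at_k(model: str, k: int = 5) -> float:
    # How often does the ground-truth chunk land in the top-k results?
    doc_vecs = embed(chunks, model)
    hits = 0
    for item in eval_set:
        q = embed([item["question"]], model)[0]
        sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
        if item["relevant_chunk"] in np.argsort(-sims)[:k]:
            hits += 1
    return hits / len(eval_set)

for model in ["text-embedding-3-small", "text-embedding-3-large"]:
    print(model, recall_at_k(model))
```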
Vector store — pgvector is enough until it isn't
We default to pgvector on Postgres for SMBs. Reasons:
- You probably already have Postgres
- Up to ~5M vectors with HNSW index, queries are fast (<100ms)
- Filtering on metadata is just SQL — no vendor-specific query DSL
- Cheaper and simpler than dedicated vector DBs at this scale
Switch to Pinecone or Weaviate when:
- You hit 10M+ vectors
- You need sub-50ms p95 latency at high QPS
- You want serverless scaling
- You need hybrid (vector + lexical) search out of the box
For 90% of SMB RAG projects, pgvector is the right call.
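For reference, here is roughly what the setup looks like with psycopg (v3) and the pgvector-python adapter. The table layout, DSN, and 1536-dimension column are assumptions to adapt to your schema and embedding model.

```python
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("postgresql://localhost/rag")   # hypothetical DSN
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
register_vector(conn)

conn.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        id        bigserial PRIMARY KEY,
        doc_id    text,
        content   text,
        embedding vector(1536)
    )
""")
# HNSW index on cosine distance; build it after the initial bulk load
conn.execute(
    "CREATE INDEX IF NOT EXISTS chunks_embedding_idx "
    "ON chunks USING hnsw (embedding vector_cosine_ops)"
)
conn.commit()

def insert_chunk(doc_id: str, content: str, embedding: np.ndarray) -> None:
    conn.execute(
        "INSERT INTO chunks (doc_id, content, embedding) VALUES (%s, %s, %s)",
        (doc_id, content, embedding),
    )

def search(query_embedding: np.ndarray, doc_id: str, k: int = 10):
    # Metadata filtering is plain SQL: here we restrict to a single document
    return conn.execute(
        """
        SELECT id, content, embedding <=> %s AS distance
        FROM chunks
        WHERE doc_id = %s
        ORDER BY embedding <=> %s
        LIMIT %s
        """,
        (query_embedding, doc_id, query_embedding, k),
    ).fetchall()
```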
Reranking — the easy 10% gain
After top-k retrieval, rerank the results with a cross-encoder (Cohere Rerank, BGE-reranker) or, failing that, a second pass with a larger embedding model such as text-embedding-3-large. Retrieve top-50 → rerank → keep top-10. Almost always improves precision.
Cost: one extra API call per query. Latency: +100–300ms. Worth it.
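The retrieve-then-rerank step is a few lines with the Cohere Python SDK. The model name and the top-50 to top-10 cut are illustrative; check Cohere's current model list before shipping.

```python
import cohere

co = cohere.Client(api_key="YOUR_COHERE_API_KEY")  # hypothetical key

def rerank(query: str, candidates: list[str], top_n: int = 10) -> list[str]:
    # candidates = the top-50 chunks returned by vector (or hybrid) search
    resp = co.rerank(
        model="rerank-multilingual-v3.0",
        query=query,
        documents=candidates,
        top_n=top_n,
    )
    # Each result carries the index of the original candidate plus a relevance score
    return [candidates[r.index] for r in resp.results]
```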
Hybrid search — don't skip this
Pure vector search misses exact-match queries (product codes, names, numbers). Pure keyword search misses semantic similarity. Combine both with Reciprocal Rank Fusion or a learned ranker.
In Postgres: pgvector for vectors + tsvector for full-text search, fused at query time. We ship this pattern by default for SMB RAG.
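Here is a sketch of that fusion, assuming the chunks table from the pgvector example above: one ranked list from vector search, one from full-text search, merged with Reciprocal Rank Fusion. The 'french' text-search config and the k=60 constant are assumptions; in production, precompute a tsvector column with a GIN index rather than computing it per query.

```python
import numpy as np

def rrf_fuse(ranked_lists: list[list[int]], k: int = 60) -> list[int]:
    # Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per chunk id
    scores: dict[int, float] = {}
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_search(conn, query: str, query_embedding: np.ndarray, k: int = 50) -> list[int]:
    vector_ids = [row[0] for row in conn.execute(
        "SELECT id FROM chunks ORDER BY embedding <=> %s LIMIT %s",
        (query_embedding, k),
    )]
    keyword_ids = [row[0] for row in conn.execute(
        """
        SELECT id FROM chunks
        WHERE to_tsvector('french', content) @@ plainto_tsquery('french', %s)
        ORDER BY ts_rank(to_tsvector('french', content), plainto_tsquery('french', %s)) DESC
        LIMIT %s
        """,
        (query, query, k),
    )]
    return rrf_fuse([vector_ids, keyword_ids])[:k]
```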
Evaluation — the part everyone skips
You can't improve what you don't measure. Build an eval set:
- Collect 50–100 real user questions (from logs, or synthetic from your corpus)
- Have a domain expert write the correct answer for each
- Run your RAG pipeline, score each answer with an LLM-as-judge (GPT-4o or Claude) on:
  - Faithfulness (no hallucination)
  - Relevance (answers the question)
  - Citations (correct source chunks)
Run this eval after every change. If the score drops, revert.
Tools: Ragas, DeepEval, or roll your own with a JSON eval set. Don't skip this. It's the difference between a RAG system that gets better and one that drifts.
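If you roll your own, the judge loop is short. This sketch assumes the OpenAI SDK and a run_rag_pipeline() function from your own codebase; the judge prompt and the 1-5 scale are illustrative.

```python
import json
from openai import OpenAI

client = OpenAI()

def judge(question: str, reference: str, answer: str, context: str) -> dict:
    # LLM-as-judge: score one answer against the reference and retrieved context
    prompt = (
        "Score the ANSWER from 1 to 5 on faithfulness (claims supported by CONTEXT), "
        "relevance (answers the QUESTION), and citations (sources named correctly), "
        "given the REFERENCE answer. Reply as JSON with keys faithfulness, relevance, citations.\n\n"
        f"QUESTION: {question}\nREFERENCE: {reference}\nCONTEXT: {context}\nANSWER: {answer}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

with open("eval_set.json") as f:   # [{"question": ..., "reference": ...}, ...]
    eval_set = json.load(f)

scores = []
for item in eval_set:
    answer, context = run_rag_pipeline(item["question"])   # your own pipeline
    scores.append(judge(item["question"], item["reference"], answer, context))

for key in ("faithfulness", "relevance", "citations"):
    print(key, sum(s[key] for s in scores) / len(scores))
```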
Hallucination control
Three layers:
- Prompt-level: instruct the LLM to answer ONLY from the context, and to say "I don't know" if the context doesn't contain the answer.
- Citation enforcement: require the LLM to cite chunk IDs for each claim. Reject answers without citations.
- Post-hoc verification: a second LLM call that checks each claim against the source. Slower but catches the worst cases.
For high-stakes domains (legal, medical, finance), use all three.
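The first two layers fit in one function. This sketch assumes a [chunk:ID] citation format and the OpenAI SDK; the post-hoc verification layer would be a second, similar LLM call over each claim.

```python
import re
from openai import OpenAI

client = OpenAI()

SYSTEM = (
    "Answer ONLY from the provided context. Cite the chunk id in the form "
    "[chunk:ID] after every claim. If the context does not contain the answer, "
    "reply exactly: I don't know."
)

def answer_with_citations(question: str, chunks: dict[int, str]) -> str:
    context = "\n\n".join(f"[chunk:{cid}] {text}" for cid, text in chunks.items())
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    answer = resp.choices[0].message.content
    cited = {int(m) for m in re.findall(r"\[chunk:(\d+)\]", answer)}
    # Citation enforcement: reject answers with no citations or unknown chunk ids
    if answer.strip() != "I don't know" and (not cited or not cited <= chunks.keys()):
        return "I don't know."
    return answer
```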
Cost control
A typical RAG query at our scale:
- Embedding: $0.00001 (negligible)
- Vector search: $0 (self-hosted pgvector)
- Reranker: $0.001
- LLM call: $0.005–0.02 (depends on context size)
Your cost is dominated by the LLM, which is dominated by context size. Aggressive top-k truncation, prompt compression, or a smaller LLM (Claude Haiku, GPT-4o-mini) can cut costs 5–10× with minimal quality loss.
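A back-of-the-envelope calculator makes the point: context tokens dominate the bill. The per-token rates below are illustrative; plug in your provider's current pricing.

```python
# Illustrative per-token rates (check current provider pricing before relying on these)
EMBED_PER_TOKEN = 0.02 / 1_000_000      # e.g. text-embedding-3-small
LLM_IN_PER_TOKEN = 2.50 / 1_000_000     # e.g. GPT-4o input
LLM_OUT_PER_TOKEN = 10.00 / 1_000_000   # e.g. GPT-4o output
RERANK_PER_QUERY = 0.001

def query_cost(question_tokens: int, context_tokens: int, answer_tokens: int) -> float:
    return (
        question_tokens * EMBED_PER_TOKEN             # embed the question
        + RERANK_PER_QUERY                            # reranker call
        + (question_tokens + context_tokens) * LLM_IN_PER_TOKEN
        + answer_tokens * LLM_OUT_PER_TOKEN
    )

# 10 retrieved chunks of ~500 tokens each is where the money goes:
print(f"${query_cost(50, 5_000, 400):.4f} per query")
```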
What a production RAG looks like
The system we shipped for Opportunix — which analyzes French public tenders for SMBs — runs:
- ~50k chunks (mostly tender documents)
- pgvector with HNSW
- BGE-M3 embeddings (better on French than OpenAI)
- Hybrid search (vector + tsvector)
- Cohere Rerank
- GPT-4o for synthesis with mandatory citations
- PostHog for query logs and feedback collection
- Weekly eval run on a 200-question set
P95 query latency: 2.1s. Hallucination rate (measured on eval set): under 4%.
When NOT to use RAG
- The corpus is small (<100 documents): just put it in the prompt
- The corpus changes hourly: caching strategies break, costs explode
- Users need answers from data that requires reasoning across many sources at once: RAG isn't a replacement for analytical agents
Closing thoughts
RAG is a tool, not a product. The hard parts are chunking, evaluation, and cost. The exciting parts (embeddings, vector DBs) are the easy parts.
If you're shipping RAG for an SMB, start with pgvector, BGE-M3 or text-embedding-3-small, hybrid search, a reranker, and an eval set. Iterate weekly.
Need help? Talk to us — we ship RAG systems for SMBs, public sector, and fintech.