RAG Implementation Guide for SMBs (2026)
How to ship a Retrieval-Augmented Generation system that actually works for SMBs — chunking, embeddings, evaluation, and the mistakes that cost us six weeks.

Why this guide exists
Most RAG tutorials stop at "load documents → embed → search → ask LLM." That works for a demo. It does not work in production.
This guide is what we wished we'd had two years ago when we shipped our first RAG system. It's specifically aimed at SMBs — teams without an in-house ML platform, but with real users who'll notice when the agent hallucinates.
We'll cover what works, what doesn't, and the trade-offs that aren't in the docs.
The honest baseline
Naive RAG looks like this (sketched in code after the list):
- Split documents into chunks
- Embed each chunk with a model (OpenAI, Cohere, BGE)
- Store embeddings in a vector DB
- At query time: embed the question, find top-k similar chunks, stuff them into the LLM prompt
- Return the answer
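To make those five steps concrete, here is a minimal sketch using the OpenAI Python SDK and an in-memory store. The model names, the 2,000-character split, and the prompt are placeholders, not recommendations.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

documents = ["Your raw document text goes here."]  # placeholder corpus

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# 1-2. Split into fixed-size chunks (naive on purpose) and embed them
chunks = [doc[i:i + 2000] for doc in documents for i in range(0, len(doc), 2000)]
chunk_vectors = embed(chunks)

def answer(question: str, k: int = 5) -> str:
    # 3. Embed the question and take the top-k chunks by cosine similarity
    q = embed([question])[0]
    sims = chunk_vectors @ q / (np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q))
    context = "\n---\n".join(chunks[i] for i in np.argsort(-sims)[:k])
    # 4-5. Stuff the chunks into the prompt and return the LLM's answer
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"}],
    )
    return resp.choices[0].message.content
```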
That pipeline handles roughly 70% of questions when:
- Documents are clean and reasonably homogeneous
- Questions are short and well-formed
- Your domain isn't full of jargon
- Users don't ask multi-hop questions
For the remaining 30%, things break. That's where the work is.
Chunking — the most underrated decision
The default tutorial says "split by 500 tokens with 50 overlap." This is almost always wrong.
Better defaults:
- Markdown / structured docs: split by heading hierarchy. A chunk = one logical section. Use the heading text as a preamble so context is retained (see the sketch after this list).
- Long-form text (articles, transcripts): semantic chunking — split where topics change. Libraries like LangChain's semantic chunker work, or use a small LLM as a splitter.
- Code or contracts: chunk by structural unit (function, clause). Keep parent context.
- Tables: don't chunk. Embed each row as its own chunk with column headers as context.
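For the Markdown case, a minimal heading-based chunker fits in a few lines of standard-library Python. The heading-path preamble is one way to keep context; adapt the regex to your own heading conventions.

```python
import re

def chunk_markdown(text: str) -> list[str]:
    # One chunk per section, with the heading path prepended as a preamble
    chunks, path, buf = [], [], []

    def flush():
        if buf:
            preamble = " > ".join(path)
            chunks.append((preamble + "\n\n" if preamble else "") + "\n".join(buf).strip())
            buf.clear()

    for line in text.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:
            flush()                                   # close the previous section
            level = len(m.group(1))
            path[:] = path[:level - 1] + [m.group(2).strip()]
        else:
            buf.append(line)
    flush()
    return [c for c in chunks if c.strip()]
```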
We've seen retrieval quality go from 60% to 88% just by fixing chunking. It's the highest-leverage change you can make.
Embeddings — pick once, evaluate constantly
In 2026, the default is OpenAI text-embedding-3-small (1536 dimensions) or text-embedding-3-large (3072 dimensions). Both are good. large is better, but it doubles your storage and costs roughly 6× more per embedded token.
For French content specifically, BGE-M3 and mistral-embed are competitive with OpenAI and sometimes better on French-only corpora. Test both on your data before committing.
The mistake we made: picking embeddings based on benchmarks instead of our own data. MTEB scores don't translate to your domain. Build a small eval set (50 questions, ground-truth answers) and test 2–3 embedding models. Pick what wins on YOUR set.
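A bake-off can be as simple as measuring recall@k on your own eval set for each candidate model. Here is a sketch assuming the OpenAI SDK; eval_set and chunks stand in for your own questions and corpus.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

chunks = ["..."]                       # your corpus chunks
eval_set = [                           # 50+ (question, index of correct chunk) pairs
    {"question": "...", "relevant_chunk": 0},
]

def embed(texts: list[str], model: str) -> np.ndarray:
    resp = client.embeddings.create(model=model, input=texts)
    return np.array([d.embedding for d in resp.data])

def recall_at_k(model: str, k: int = 5) -> float:
    # How often does the ground-truth chunk land in the top-k results?
    doc_vecs = embed(chunks, model)
    hits = 0
    for item in eval_set:
        q = embed([item["question"]], model)[0]
        sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
        if item["relevant_chunk"] in np.argsort(-sims)[:k]:
            hits += 1
    return hits / len(eval_set)

for model in ["text-embedding-3-small", "text-embedding-3-large"]:
    print(model, recall_at_k(model))
```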
Vector store — pgvector is enough until it isn't
We default to pgvector on Postgres for SMBs. Reasons:
- You probably already have Postgres
- Up to ~5M vectors with HNSW index, queries are fast (<100ms)
- Filtering on metadata is just SQL — no vendor-specific query DSL
- Cheaper and simpler than dedicated vector DBs at this scale
Switch to Pinecone or Weaviate when:
- You hit 10M+ vectors
- You need sub-50ms p95 latency at high QPS
- You want serverless scaling
- You need hybrid (vector + lexical) search out of the box
For 90% of SMB RAG projects, pgvector is the right call.
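For reference, here is roughly what the setup looks like with psycopg (v3) and the pgvector-python adapter. The table layout, DSN, and 1536-dimension column are assumptions to adapt to your schema and embedding model.

```python
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("postgresql://localhost/rag")   # hypothetical DSN
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
register_vector(conn)

conn.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        id        bigserial PRIMARY KEY,
        doc_id    text,
        content   text,
        embedding vector(1536)
    )
""")
# HNSW index on cosine distance; build it after the initial bulk load
conn.execute(
    "CREATE INDEX IF NOT EXISTS chunks_embedding_idx "
    "ON chunks USING hnsw (embedding vector_cosine_ops)"
)
conn.commit()

def insert_chunk(doc_id: str, content: str, embedding: np.ndarray) -> None:
    conn.execute(
        "INSERT INTO chunks (doc_id, content, embedding) VALUES (%s, %s, %s)",
        (doc_id, content, embedding),
    )

def search(query_embedding: np.ndarray, doc_id: str, k: int = 10):
    # Metadata filtering is plain SQL: here we restrict to a single document
    return conn.execute(
        """
        SELECT id, content, embedding <=> %s AS distance
        FROM chunks
        WHERE doc_id = %s
        ORDER BY embedding <=> %s
        LIMIT %s
        """,
        (query_embedding, doc_id, query_embedding, k),
    ).fetchall()
```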
Reranking — the easy 10% gain
After top-k retrieval, rerank the results with a cross-encoder (Cohere Rerank, BGE-reranker) or, failing that, a second pass with a larger embedding model such as text-embedding-3-large. Retrieve top-50 → rerank → keep top-10. Almost always improves precision.
Cost: one extra API call per query. Latency: +100–300ms. Worth it.
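The retrieve-then-rerank step is a few lines with the Cohere Python SDK. The model name and the top-50 to top-10 cut are illustrative; check Cohere's current model list before shipping.

```python
import cohere

co = cohere.Client(api_key="YOUR_COHERE_API_KEY")  # hypothetical key

def rerank(query: str, candidates: list[str], top_n: int = 10) -> list[str]:
    # candidates = the top-50 chunks returned by vector (or hybrid) search
    resp = co.rerank(
        model="rerank-multilingual-v3.0",
        query=query,
        documents=candidates,
        top_n=top_n,
    )
    # Each result carries the index of the original candidate plus a relevance score
    return [candidates[r.index] for r in resp.results]
```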
Hybrid search — don't skip this
Pure vector search misses exact-match queries (product codes, names, numbers). Pure keyword search misses semantic similarity. Combine both with Reciprocal Rank Fusion or a learned ranker.
In Postgres: pgvector for vectors + tsvector for full-text search, fused at query time. We ship this pattern by default for SMB RAG.
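Here is a sketch of that fusion, assuming the chunks table from the pgvector example above: one ranked list from vector search, one from full-text search, merged with Reciprocal Rank Fusion. The 'french' text-search config and the k=60 constant are assumptions; in production, precompute a tsvector column with a GIN index rather than computing it per query.

```python
import numpy as np

def rrf_fuse(ranked_lists: list[list[int]], k: int = 60) -> list[int]:
    # Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per chunk id
    scores: dict[int, float] = {}
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_search(conn, query: str, query_embedding: np.ndarray, k: int = 50) -> list[int]:
    vector_ids = [row[0] for row in conn.execute(
        "SELECT id FROM chunks ORDER BY embedding <=> %s LIMIT %s",
        (query_embedding, k),
    )]
    keyword_ids = [row[0] for row in conn.execute(
        """
        SELECT id FROM chunks
        WHERE to_tsvector('french', content) @@ plainto_tsquery('french', %s)
        ORDER BY ts_rank(to_tsvector('french', content), plainto_tsquery('french', %s)) DESC
        LIMIT %s
        """,
        (query, query, k),
    )]
    return rrf_fuse([vector_ids, keyword_ids])[:k]
```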
Evaluation — the part everyone skips
You can't improve what you don't measure. Build an eval set:
- Collect 50–100 real user questions (from logs, or synthetic from your corpus)
- Have a domain expert write the correct answer for each
- Run your RAG pipeline, score each answer with an LLM-as-judge (GPT-4o or Claude) on:
  - Faithfulness (no hallucination)
  - Relevance (answers the question)
  - Citations (correct source chunks)
Run this eval after every change. If the score drops, revert.
Tools: Ragas, DeepEval, or roll your own with a JSON eval set. Don't skip this. It's the difference between a RAG system that gets better and one that drifts.
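If you roll your own, the judge loop is short. This sketch assumes the OpenAI SDK and a run_rag_pipeline() function from your own codebase; the judge prompt and the 1-5 scale are illustrative.

```python
import json
from openai import OpenAI

client = OpenAI()

def judge(question: str, reference: str, answer: str, context: str) -> dict:
    # LLM-as-judge: score one answer against the reference and retrieved context
    prompt = (
        "Score the ANSWER from 1 to 5 on faithfulness (claims supported by CONTEXT), "
        "relevance (answers the QUESTION), and citations (sources named correctly), "
        "given the REFERENCE answer. Reply as JSON with keys faithfulness, relevance, citations.\n\n"
        f"QUESTION: {question}\nREFERENCE: {reference}\nCONTEXT: {context}\nANSWER: {answer}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

with open("eval_set.json") as f:   # [{"question": ..., "reference": ...}, ...]
    eval_set = json.load(f)

scores = []
for item in eval_set:
    answer, context = run_rag_pipeline(item["question"])   # your own pipeline
    scores.append(judge(item["question"], item["reference"], answer, context))

for key in ("faithfulness", "relevance", "citations"):
    print(key, sum(s[key] for s in scores) / len(scores))
```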
Hallucination control
Three layers:
- Prompt-level: instruct the LLM to answer ONLY from the context, and to say "I don't know" if the context doesn't contain the answer.
- Citation enforcement: require the LLM to cite chunk IDs for each claim. Reject answers without citations.
- Post-hoc verification: a second LLM call that checks each claim against the source. Slower but catches the worst cases.
For high-stakes domains (legal, medical, finance), use all three.
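The first two layers fit in one function. This sketch assumes a [chunk:ID] citation format and the OpenAI SDK; the post-hoc verification layer would be a second, similar LLM call over each claim.

```python
import re
from openai import OpenAI

client = OpenAI()

SYSTEM = (
    "Answer ONLY from the provided context. Cite the chunk id in the form "
    "[chunk:ID] after every claim. If the context does not contain the answer, "
    "reply exactly: I don't know."
)

def answer_with_citations(question: str, chunks: dict[int, str]) -> str:
    context = "\n\n".join(f"[chunk:{cid}] {text}" for cid, text in chunks.items())
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    answer = resp.choices[0].message.content
    cited = {int(m) for m in re.findall(r"\[chunk:(\d+)\]", answer)}
    # Citation enforcement: reject answers with no citations or unknown chunk ids
    if answer.strip() != "I don't know" and (not cited or not cited <= chunks.keys()):
        return "I don't know."
    return answer
```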
Cost control
A typical RAG query at our scale:
- Embedding: $0.00001 (negligible)
- Vector search: $0 (self-hosted pgvector)
- Reranker: $0.001
- LLM call: $0.005–0.02 (depends on context size)
Your cost is dominated by the LLM, which is dominated by context size. Aggressive top-k truncation, prompt compression, or a smaller LLM (Claude Haiku, GPT-4o-mini) can cut costs 5–10× with minimal quality loss.
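A back-of-the-envelope calculator makes the point: context tokens dominate the bill. The per-token rates below are illustrative; plug in your provider's current pricing.

```python
# Illustrative per-token rates (check current provider pricing before relying on these)
EMBED_PER_TOKEN = 0.02 / 1_000_000      # e.g. text-embedding-3-small
LLM_IN_PER_TOKEN = 2.50 / 1_000_000     # e.g. GPT-4o input
LLM_OUT_PER_TOKEN = 10.00 / 1_000_000   # e.g. GPT-4o output
RERANK_PER_QUERY = 0.001

def query_cost(question_tokens: int, context_tokens: int, answer_tokens: int) -> float:
    return (
        question_tokens * EMBED_PER_TOKEN             # embed the question
        + RERANK_PER_QUERY                            # reranker call
        + (question_tokens + context_tokens) * LLM_IN_PER_TOKEN
        + answer_tokens * LLM_OUT_PER_TOKEN
    )

# 10 retrieved chunks of ~500 tokens each is where the money goes:
print(f"${query_cost(50, 5_000, 400):.4f} per query")
```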
What a production RAG looks like
The system we shipped for Opportunix — which analyzes French public tenders for SMBs — runs:
- ~50k chunks (mostly tender documents)
- pgvector with HNSW
- BGE-M3 embeddings (better on French than OpenAI)
- Hybrid search (vector + tsvector)
- Cohere Rerank
- GPT-4o for synthesis with mandatory citations
- PostHog for query logs and feedback collection
- Weekly eval run on a 200-question set
P95 query latency: 2.1s. Hallucination rate (measured on eval set): under 4%.
When NOT to use RAG
- The corpus is small (<100 documents): just put it in the prompt
- The corpus changes hourly: caching strategies break, costs explode
- Users need answers from data that requires reasoning across many sources at once: RAG isn't a replacement for analytical agents
Closing thoughts
RAG is a tool, not a product. The hard parts are chunking, evaluation, and cost. The exciting parts (embeddings, vector DBs) are the easy parts.
If you're shipping RAG for an SMB, start with pgvector, BGE-M3 or text-embedding-3-small, hybrid search, a reranker, and an eval set. Iterate weekly.
Need help? Talk to us — we ship RAG systems for SMBs, public sector, and fintech.