RAG · 12 min read

RAG Implementation Guide for SMBs (2026)

How to ship a Retrieval-Augmented Generation system that actually works for SMBs — chunking, embeddings, evaluation, and the mistakes that cost us six weeks.

Frédéric Magnin
Founder & AI Engineer at Ikki

Why this guide exists

Most RAG tutorials stop at "load documents → embed → search → ask LLM." That works for a demo. It does not work in production.

This guide is what we wished we'd had two years ago when we shipped our first RAG system. It's specifically aimed at SMBs — teams without an in-house ML platform, but with real users who'll notice when the agent hallucinates.

We'll cover what works, what doesn't, and the trade-offs that aren't in the docs.

The honest baseline

Naive RAG looks like:

  1. Split documents into chunks
  2. Embed each chunk with a model (OpenAI, Cohere, BGE)
  3. Store embeddings in a vector DB
  4. At query time: embed the question, find top-k similar chunks, stuff them into the LLM prompt
  5. Return the answer

This works for 70% of questions when:

  • Documents are clean and reasonably homogeneous
  • Questions are short and well-formed
  • Your domain isn't full of jargon
  • Users don't ask multi-hop questions

For the remaining 30%, things break. That's where the work is.
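
Before we get into the fixes, here's the whole naive pipeline end to end, just so the baseline is concrete. A minimal sketch, assuming the OpenAI Python SDK; model names and the prompt are illustrative, not a recommendation.

  # Minimal naive RAG sketch — illustrative only, not production code.
  # Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment.
  from openai import OpenAI
  import numpy as np

  client = OpenAI()

  def embed(texts: list[str]) -> np.ndarray:
      resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
      return np.array([d.embedding for d in resp.data])

  def naive_rag(question: str, chunks: list[str], k: int = 5) -> str:
      chunk_vecs = embed(chunks)            # steps 2-3: embed and "store" the chunks
      q_vec = embed([question])[0]          # step 4: embed the question
      sims = chunk_vecs @ q_vec / (
          np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q_vec))
      top = [chunks[i] for i in np.argsort(-sims)[:k]]
      context = "\n\n".join(top)
      answer = client.chat.completions.create(   # step 5: stuff context, ask the LLM
          model="gpt-4o-mini",
          messages=[
              {"role": "system", "content": "Answer only from the provided context."},
              {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
          ],
      )
      return answer.choices[0].message.content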

Chunking — the most underrated decision

The default tutorial says "split by 500 tokens with 50 overlap." This is almost always wrong.

Better defaults:

  • Markdown / structured docs: split by heading hierarchy. A chunk = one logical section. Use the heading text as a preamble so context is retained.
  • Long-form text (articles, transcripts): semantic chunking — split where topics change. Libraries like LangChain semantic chunker work, or use a small LLM as a splitter.
  • Code or contracts: chunk by structural unit (function, clause). Keep parent context.
  • Tables: don't chunk. Embed each row as its own chunk with column headers as context.

We've seen retrieval quality go from 60% to 88% just by fixing chunking. It's the highest-leverage change you can make.
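
For markdown-style docs, heading-aware chunking fits in a few lines. A minimal sketch; the regex and the "heading path as preamble" format are assumptions to adapt to your corpus.

  # Sketch: split a markdown document on headings and prepend the heading path
  # so each chunk keeps its context. Regex and output format are illustrative.
  import re

  def chunk_markdown(doc: str) -> list[str]:
      chunks: list[str] = []
      path: list[str] = []
      buf: list[str] = []

      def flush():
          body = "".join(buf).strip()
          if body:
              preamble = " > ".join(path)
              chunks.append(f"{preamble}\n\n{body}" if preamble else body)
          buf.clear()

      for line in doc.splitlines(keepends=True):
          m = re.match(r"^(#{1,6})\s+(.+)", line)
          if m:
              flush()
              level, title = len(m.group(1)), m.group(2).strip()
              path[:] = path[: level - 1] + [title]  # keep ancestor headings, replace deeper ones
          else:
              buf.append(line)
      flush()
      return chunks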

Embeddings — pick once, evaluate constantly

In 2026, the default is OpenAI text-embedding-3-small (1536 dim) or text-embedding-3-large (3072 dim). Both are good. large scores better but costs roughly 6× more per token to embed and doubles your storage footprint.

For French content specifically, BGE-M3 and mistral-embed are competitive with OpenAI and sometimes better on French-only corpora. Test both on your data before committing.

The mistake we made: picking embeddings based on benchmarks instead of our own data. MTEB scores don't translate to your domain. Build a small eval set (50 questions, ground-truth answers) and test 2–3 embedding models. Pick what wins on YOUR set.
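
The comparison harness can stay tiny. A sketch, assuming an eval_set of (question, index-of-the-gold-chunk) pairs and any embedding model wrapped behind the same embed_fn interface.

  # Sketch: compare embedding models on YOUR eval set with recall@k.
  # The eval_set format and the embed_fn callables are assumptions — plug in
  # OpenAI, BGE-M3, mistral-embed, etc. behind the same interface.
  import numpy as np

  def recall_at_k(embed_fn, eval_set, chunks, k: int = 5) -> float:
      # eval_set: list of (question, index of the chunk containing the answer)
      chunk_vecs = np.array(embed_fn(chunks))
      hits = 0
      for question, gold_idx in eval_set:
          q = np.array(embed_fn([question]))[0]
          sims = chunk_vecs @ q / (
              np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
          if gold_idx in np.argsort(-sims)[:k]:
              hits += 1
      return hits / len(eval_set)

  # Hypothetical wrappers — one per candidate model:
  # for name, fn in {"text-embedding-3-small": embed_openai, "bge-m3": embed_bge}.items():
  #     print(name, recall_at_k(fn, eval_set, chunks))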

Vector store — pgvector is enough until it isn't

We default to pgvector on Postgres for SMBs. Reasons:

  • You probably already have Postgres
  • Up to ~5M vectors with HNSW index, queries are fast (<100ms)
  • Filtering on metadata is just SQL — no vendor-specific query DSL
  • Cheaper and simpler than dedicated vector DBs at this scale

Switch to Pinecone or Weaviate when:

  • You hit 10M+ vectors
  • You need sub-50ms p95 latency at high QPS
  • You want serverless scaling
  • You need hybrid (vector + lexical) search out of the box

For 90% of SMB RAG projects, pgvector is the right call.
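
The setup is plain SQL. A sketch assuming the pgvector extension and the psycopg 3 driver; table, column, and filter names are illustrative.

  # Sketch: pgvector table + HNSW index, with metadata filtering as plain SQL.
  # Assumes the pgvector extension and the psycopg 3 driver; names are illustrative.
  import psycopg

  query_embedding = [0.0] * 1536  # placeholder: use the real query embedding here

  with psycopg.connect("postgresql://localhost/rag") as conn:
      conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
      conn.execute("""
          CREATE TABLE IF NOT EXISTS chunks (
              id        bigserial PRIMARY KEY,
              doc_id    text,
              content   text,
              embedding vector(1536)
          )
      """)
      conn.execute(
          "CREATE INDEX IF NOT EXISTS chunks_hnsw ON chunks "
          "USING hnsw (embedding vector_cosine_ops)"
      )

      rows = conn.execute(
          "SELECT id, content FROM chunks "
          "WHERE doc_id = %s "                      # metadata filter = plain SQL
          "ORDER BY embedding <=> %s::vector "      # cosine distance
          "LIMIT 10",
          ("tender-2026-001", str(query_embedding)),
      ).fetchall()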

Reranking — the easy 10% gain

After top-k retrieval, rerank the results with a cross-encoder (Cohere Rerank, BGE-reranker), or failing that a second pass with a stronger embedding model like text-embedding-3-large. Top-50 → rerank → top-10. Almost always improves precision.

Cost: one extra API call per query. Latency: +100–300ms. Worth it.
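
If you'd rather self-host the reranker, sentence-transformers' CrossEncoder works. A sketch; the BGE reranker checkpoint is one common choice, not the only one.

  # Sketch: rerank top-50 retrieved chunks down to top-10 with a cross-encoder.
  # Uses sentence-transformers; the BGE checkpoint is one common choice.
  from sentence_transformers import CrossEncoder

  reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

  def rerank(query: str, candidates: list[str], top_n: int = 10) -> list[str]:
      scores = reranker.predict([(query, c) for c in candidates])
      ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
      return [c for c, _ in ranked[:top_n]]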

Hybrid search — don't skip this

Pure vector search misses exact-match queries (product codes, names, numbers). Pure keyword search misses semantic similarity. Combine both with Reciprocal Rank Fusion or a learned ranker.

In Postgres: pgvector for vectors + tsvector for full-text search, fused at query time. We ship this pattern by default for SMB RAG.
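
The fusion step itself is short. A sketch of Reciprocal Rank Fusion over two ranked lists of chunk IDs; k=60 is the constant most implementations default to.

  # Sketch: Reciprocal Rank Fusion over two ranked lists of chunk ids
  # (one from pgvector, one from tsvector full-text search).
  def rrf_fuse(vector_ids: list[int], keyword_ids: list[int], k: int = 60) -> list[int]:
      scores: dict[int, float] = {}
      for ranking in (vector_ids, keyword_ids):
          for rank, chunk_id in enumerate(ranking, start=1):
              scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
      return sorted(scores, key=scores.get, reverse=True)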

Evaluation — the part everyone skips

You can't improve what you don't measure. Build an eval set:

  1. Collect 50–100 real user questions (from logs, or synthetic from your corpus)
  2. Have a domain expert write the correct answer for each
  3. Run your RAG pipeline, score each answer with an LLM-as-judge (GPT-4o or Claude) on:
    • Faithfulness (no hallucination)
    • Relevance (answers the question)
    • Citations (correct source chunks)

Run this eval after every change. If the score drops, revert.

Tools: Ragas, DeepEval, or roll your own with a JSON eval set. Don't skip this. It's the difference between a RAG system that gets better and one that drifts.
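
If you do roll your own, the judge call is small. A sketch of a faithfulness judge; the prompt and the 1–5 scale are assumptions, and Ragas/DeepEval implement more robust versions of the same idea.

  # Sketch: LLM-as-judge scoring for faithfulness. The prompt and the 1-5 scale
  # are assumptions; Ragas or DeepEval give you more robust versions of this.
  import json
  from openai import OpenAI

  client = OpenAI()

  def judge_faithfulness(question: str, answer: str, context: str) -> int:
      prompt = (
          "Rate from 1 to 5 how faithful the answer is to the context "
          "(5 = every claim supported, 1 = mostly unsupported). "
          'Reply as JSON: {"score": <int>, "reason": "<short>"}\n\n'
          f"Question: {question}\n\nContext:\n{context}\n\nAnswer:\n{answer}"
      )
      resp = client.chat.completions.create(
          model="gpt-4o",
          response_format={"type": "json_object"},
          messages=[{"role": "user", "content": prompt}],
      )
      return json.loads(resp.choices[0].message.content)["score"]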

Hallucination control

Three layers:

  1. Prompt-level: instruct the LLM to answer ONLY from the context, and to say "I don't know" if the context doesn't contain the answer.
  2. Citation enforcement: require the LLM to cite chunk IDs for each claim. Reject answers without citations.
  3. Post-hoc verification: a second LLM call that checks each claim against the source. Slower but catches the worst cases.

For high-stakes domains (legal, medical, finance), use all three.
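
Layer 2 is mostly string checking. A sketch, assuming answers cite sources as [#chunk-id] markers; adapt the convention to your prompt.

  # Sketch of layer 2 (citation enforcement): require [#chunk-id] markers and
  # reject answers that cite nothing or cite chunks that were never retrieved.
  import re

  def enforce_citations(answer: str, retrieved_ids: set[str]) -> str:
      cited = set(re.findall(r"\[#([\w-]+)\]", answer))
      if not cited or not cited.issubset(retrieved_ids):
          return "I don't know — the retrieved documents don't support a cited answer."
      return answer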

Cost control

A typical RAG query at our scale:

  • Embedding: $0.00001 (negligible)
  • Vector search: $0 (self-hosted pgvector)
  • Reranker: $0.001
  • LLM call: $0.005–0.02 (depends on context size)

Your cost is dominated by the LLM, which is dominated by context size. Aggressive top-k truncation, prompt compression, or a smaller LLM (Claude Haiku, GPT-4o-mini) can cut costs 5–10× with minimal quality loss.
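
A back-of-the-envelope check makes the point, using the per-query figures above; the query volume is an assumption.

  # Back-of-the-envelope monthly cost from the per-query figures above.
  # Query volume is an assumption; plug in your own numbers.
  PER_QUERY = {"embedding": 0.00001, "rerank": 0.001, "llm": 0.01}  # USD, mid-range LLM cost
  queries_per_month = 20_000                                         # assumption

  monthly = sum(PER_QUERY.values()) * queries_per_month
  print(f"${monthly:,.0f}/month")  # ≈ $220 — and ~91% of it is the LLM call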

What a production RAG looks like

The system we shipped for Opportunix — which analyzes French public tenders for SMBs — runs:

  • ~50k chunks (mostly tender documents)
  • pgvector with HNSW
  • BGE-M3 embeddings (better on French than OpenAI)
  • Hybrid search (vector + tsvector)
  • Cohere Rerank
  • GPT-4o for synthesis with mandatory citations
  • Posthog for query logs and feedback collection
  • Weekly eval run on a 200-question set

P95 query latency: 2.1s. Hallucination rate (measured on eval set): under 4%.

When NOT to use RAG

  • The corpus is small (<100 documents): just put it in the prompt
  • The corpus changes hourly: caching strategies break, costs explode
  • Users need answers from data that requires reasoning across many sources at once: RAG isn't a replacement for analytical agents

Closing thoughts

RAG is a tool, not a product. The hard parts are chunking, evaluation, and cost. The exciting parts (embeddings, vector DBs) are the easy parts.

If you're shipping RAG for an SMB, start with pgvector, BGE-M3 or text-embedding-3-small, hybrid search, a reranker, and an eval set. Iterate weekly.

Need help? Talk to us — we ship RAG systems for SMBs, public sector, and fintech.


