RAG · 12 min read

RAG Implementation Guide for SMBs (2026)

How to ship a Retrieval-Augmented Generation system that actually works for SMBs — chunking, embeddings, evaluation, and the mistakes that cost us six weeks.

Frédéric Magnin
Founder & AI Engineer at Ikki

Why this guide exists

Most RAG tutorials stop at "load documents → embed → search → ask LLM." That works for a demo. It does not work in production.

This guide is what we wished we'd had two years ago when we shipped our first RAG system. It's specifically aimed at SMBs — teams without an in-house ML platform, but with real users who'll notice when the agent hallucinates.

We'll cover what works, what doesn't, and the trade-offs that aren't in the docs.

The honest baseline

Naive RAG looks like this (sketched in code after the list):

  1. Split documents into chunks
  2. Embed each chunk with a model (OpenAI, Cohere, BGE)
  3. Store embeddings in a vector DB
  4. At query time: embed the question, find top-k similar chunks, stuff them into the LLM prompt
  5. Return the answer
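
In Python, the whole naive pipeline fits on one screen. This is a minimal illustration, not production code: it assumes the OpenAI SDK, keeps vectors in memory instead of a vector DB, and every name is a placeholder.

  # Naive RAG end to end: in-memory vectors, fixed-size chunks, no reranking.
  from openai import OpenAI
  import numpy as np

  client = OpenAI()  # reads OPENAI_API_KEY from the environment

  def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
      # The "default tutorial" chunking: fixed windows with overlap (see caveats below).
      words = text.split()
      return [" ".join(words[i:i + size]) for i in range(0, len(words), size - overlap)]

  def embed(texts: list[str]) -> np.ndarray:
      resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
      return np.array([d.embedding for d in resp.data])

  def ask(question: str, chunks: list[str], vectors: np.ndarray, k: int = 5) -> str:
      q = embed([question])[0]
      sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
      context = "\n\n".join(chunks[i] for i in np.argsort(sims)[::-1][:k])
      resp = client.chat.completions.create(
          model="gpt-4o-mini",
          messages=[
              {"role": "system", "content": "Answer only from the provided context. "
                                            "If the context does not contain the answer, say so."},
              {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
          ],
      )
      return resp.choices[0].message.content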

This works for 70% of questions when:

  • Documents are clean and reasonably homogeneous
  • Questions are short and well-formed
  • Your domain isn't full of jargon
  • Users don't ask multi-hop questions

For the remaining 30%, things break. That's where the work is.

Chunking — the most underrated decision

The default tutorial says "split by 500 tokens with 50 overlap." This is almost always wrong.

Better defaults:

  • Markdown / structured docs: split by heading hierarchy. A chunk = one logical section. Use the heading text as a preamble so context is retained (see the sketch after this list).
  • Long-form text (articles, transcripts): semantic chunking — split where topics change. LangChain's semantic chunker works, or use a small LLM as a splitter.
  • Code or contracts: chunk by structural unit (function, clause). Keep parent context.
  • Tables: don't split them blindly. Embed each row as its own chunk, with the column headers prepended as context.
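
For the Markdown case, a rough heading-aware splitter looks like this. It's a sketch, not a library API; it keeps the heading path as a preamble so each chunk carries its own context.

  # Split Markdown by heading hierarchy: one chunk per section, heading path as preamble.
  import re

  def chunk_markdown(md: str) -> list[str]:
      chunks: list[str] = []
      path: list[str] = []   # current heading path, e.g. ["Pricing", "Enterprise plan"]
      buf: list[str] = []

      def flush() -> None:
          body = "".join(buf).strip()
          if body:
              preamble = " > ".join(path)
              chunks.append(f"{preamble}\n\n{body}" if preamble else body)
          buf.clear()

      for line in md.splitlines(keepends=True):
          heading = re.match(r"^(#{1,6})\s+(.+)", line)
          if heading:
              flush()
              level, title = len(heading.group(1)), heading.group(2).strip()
              del path[level - 1:]   # drop headings at the same or deeper level
              path.append(title)
          else:
              buf.append(line)
      flush()
      return chunks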

We've seen retrieval quality go from 60% to 88% just by fixing chunking. It's the highest-leverage change you can make.

Embeddings — pick once, evaluate constantly

In 2026, the default is OpenAI text-embedding-3-small (1536 dim) or text-embedding-3-large (3072 dim). Both are good. large is better, but it doubles your storage and costs roughly 6× more per embedded token.

For French content specifically, BGE-M3 and mistral-embed are competitive with OpenAI and sometimes better on French-only corpora. Test both on your data before committing.

The mistake we made: picking embeddings based on benchmarks instead of our own data. MTEB scores don't translate to your domain. Build a small eval set (50 questions, ground-truth answers) and test 2–3 embedding models. Pick what wins on YOUR set.
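
A bare-bones version of that comparison is sketched below. It assumes each eval question is labelled with the ID of the chunk that should be retrieved, and that embed_fn wraps whichever model is under test; all names are placeholders.

  # Recall@k on your own eval set, run once per candidate embedding model.
  import numpy as np

  def recall_at_k(embed_fn, questions, gold_ids, chunks, chunk_ids, k=5):
      # embed_fn: callable(list[str]) -> np.ndarray with one row per input text
      vecs = embed_fn(chunks)
      vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
      hits = 0
      for question, gold_id in zip(questions, gold_ids):
          q = embed_fn([question])[0]
          q = q / np.linalg.norm(q)
          top = np.argsort(vecs @ q)[::-1][:k]
          hits += gold_id in {chunk_ids[i] for i in top}
      return hits / len(questions)

  # Run for OpenAI, BGE-M3 and mistral-embed wrappers; keep whichever wins on your set.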

Vector store — pgvector is enough until it isn't

We default to pgvector on Postgres for SMBs. Reasons:

  • You probably already have Postgres
  • Up to ~5M vectors with an HNSW index, queries stay fast (<100ms)
  • Filtering on metadata is just SQL — no vendor-specific query DSL (see the example after this list)
  • Cheaper and simpler than dedicated vector DBs at this scale
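
For example, a filtered similarity query is one plain SQL statement. This is a sketch assuming a chunks table with a client_id column and an HNSW index on the embedding column; the table and column names are placeholders.

  # pgvector: cosine distance (<=>) plus an ordinary SQL metadata filter.
  # Index: CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);
  import psycopg

  def search(conn: psycopg.Connection, query_vec: list[float], client_id: str, k: int = 10):
      vec = "[" + ",".join(f"{x:.6f}" for x in query_vec) + "]"
      return conn.execute(
          """
          SELECT id, content, embedding <=> %s::vector AS distance
          FROM chunks
          WHERE client_id = %s
          ORDER BY distance
          LIMIT %s
          """,
          (vec, client_id, k),
      ).fetchall()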

Switch to Pinecone or Weaviate when:

  • You hit 10M+ vectors
  • You need sub-50ms p95 latency at high QPS
  • You want serverless scaling
  • You need hybrid (vector + lexical) search out of the box

For 90% of SMB RAG projects, pgvector is the right call.

Reranking — the easy 10% gain

After top-k retrieval, rerank the results with a cross-encoder (Cohere Rerank or BGE-reranker; a second-pass similarity score with text-embedding-3-large is a weaker substitute). Top-50 → rerank → top-10. Almost always improves precision.

Cost: one extra API call per query. Latency: +100–300ms. Worth it.
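
Here is a sketch of the retrieve-then-rerank step with a local BGE cross-encoder via sentence-transformers; Cohere's hosted Rerank API is a drop-in alternative, and the model name and sizes below are illustrative.

  # Over-retrieve, then let a cross-encoder re-score (query, chunk) pairs.
  from sentence_transformers import CrossEncoder

  reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

  def rerank(query: str, candidates: list[str], top_n: int = 10) -> list[str]:
      scores = reranker.predict([(query, c) for c in candidates])
      ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
      return [text for text, _ in ranked[:top_n]]

  # candidates = vector_search(query, k=50)        # hypothetical retrieval call
  # context_chunks = rerank(query, candidates, 10)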

Hybrid search — don't skip this

Pure vector search misses exact-match queries (product codes, names, numbers). Pure keyword search misses semantic similarity. Combine both with Reciprocal Rank Fusion or a learned ranker.

In Postgres: pgvector for vectors + tsvector for full-text search, fused at query time. We ship this pattern by default for SMB RAG.
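
The fusion step itself is tiny. A sketch: it takes the ranked ID lists from the vector query and the tsvector query and merges them, with k=60 as the usual RRF constant.

  # Reciprocal Rank Fusion: score(doc) = sum over rankers of 1 / (k + rank).
  from collections import defaultdict

  def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
      scores: dict[str, float] = defaultdict(float)
      for ranking in rankings:
          for rank, doc_id in enumerate(ranking, start=1):
              scores[doc_id] += 1.0 / (k + rank)
      return sorted(scores, key=scores.get, reverse=True)

  # fused_ids = rrf([vector_ids, fulltext_ids])[:50]   # then hand off to the reranker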

Evaluation — the part everyone skips

You can't improve what you don't measure. Build an eval set:

  1. Collect 50–100 real user questions (from logs, or synthetic from your corpus)
  2. Have a domain expert write the correct answer for each
  3. Run your RAG pipeline, score each answer with an LLM-as-judge (GPT-4o or Claude) on:
    • Faithfulness (no hallucination)
    • Relevance (answers the question)
    • Citations (correct source chunks)

Run this eval after every change. If the score drops, revert.

Tools: Ragas, DeepEval, or roll your own with a JSON eval set. Don't skip this. It's the difference between a RAG system that gets better and one that drifts.
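
If you roll your own, the judge call is short. In this sketch the rubric prompt, the JSON eval-set format, and the choice of gpt-4o as judge are all ours for illustration.

  # LLM-as-judge over a JSON eval set: one {"question", "context", "expected", "answer"} per item.
  import json
  from openai import OpenAI

  client = OpenAI()

  JUDGE_PROMPT = """Rate the answer from 1 to 5 on faithfulness to the context,
  relevance to the question, and correctness against the expected answer.
  Reply with JSON: {{"faithfulness": n, "relevance": n, "correctness": n}}

  Question: {question}
  Context: {context}
  Expected: {expected}
  Answer: {answer}"""

  def judge(item: dict) -> dict:
      resp = client.chat.completions.create(
          model="gpt-4o",
          response_format={"type": "json_object"},
          messages=[{"role": "user", "content": JUDGE_PROMPT.format(**item)}],
      )
      return json.loads(resp.choices[0].message.content)

  # scores = [judge(item) for item in json.load(open("eval_set.json"))]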

Hallucination control

Three layers:

  1. Prompt-level: instruct the LLM to answer ONLY from the context, and to say "I don't know" if the context doesn't contain the answer.
  2. Citation enforcement: require the LLM to cite chunk IDs for each claim. Reject answers without citations.
  3. Post-hoc verification: a second LLM call that checks each claim against the source. Slower but catches the worst cases.

For high-stakes domains (legal, medical, finance), use all three.
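
Layer 2 is easy to enforce mechanically. In this sketch the [chunk:ID] convention and the regex are ours for illustration, not a standard.

  # Require [chunk:ID] citations and reject answers whose citations are missing or unknown.
  import re

  CITATION = re.compile(r"\[chunk:([A-Za-z0-9_-]+)\]")

  def validate_citations(answer: str, allowed_ids: set[str]) -> bool:
      cited = set(CITATION.findall(answer))
      if not cited:
          return False                   # no citations at all -> reject
      return cited <= allowed_ids        # every citation must point at a retrieved chunk

  # if not validate_citations(answer, {c.id for c in retrieved_chunks}):
  #     retry with a stricter prompt, or fall back to "I don't know"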

Cost control

A typical RAG query at our scale:

  • Embedding: $0.00001 (negligible)
  • Vector search: $0 (self-hosted pgvector)
  • Reranker: $0.001
  • LLM call: $0.005–0.02 (depends on context size)

Your cost is dominated by the LLM, which is dominated by context size. Aggressive top-k truncation, prompt compression, or a smaller LLM (Claude Haiku, GPT-4o-mini) can cut costs 5–10× with minimal quality loss.
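
The back-of-the-envelope math is worth writing down. In this sketch the per-million-token prices are illustrative placeholders, not current list prices.

  # Per-query LLM cost: context size dominates.
  def llm_cost_usd(context_tokens: int, output_tokens: int = 300,
                   price_in: float = 2.50, price_out: float = 10.00) -> float:
      # Prices are USD per 1M tokens -- placeholders, check your provider's current pricing.
      return (context_tokens * price_in + output_tokens * price_out) / 1_000_000

  # 10 chunks x 800 tokens of context -> roughly $0.02 per query at these example rates
  # llm_cost_usd(context_tokens=8_000)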

What a production RAG looks like

The system we shipped for Opportunix — which analyzes French public tenders for SMBs — runs:

  • ~50k chunks (mostly tender documents)
  • pgvector with HNSW
  • BGE-M3 embeddings (better on French than OpenAI)
  • Hybrid search (vector + tsvector)
  • Cohere Rerank
  • GPT-4o for synthesis with mandatory citations
  • Posthog for query logs and feedback collection
  • Weekly eval run on a 200-question set

P95 query latency: 2.1s. Hallucination rate (measured on eval set): under 4%.

When NOT to use RAG

  • The corpus is small (<100 documents): just put it in the prompt
  • The corpus changes hourly: caching strategies break, costs explode
  • Users need answers from data that requires reasoning across many sources at once: RAG isn't a replacement for analytical agents

Closing thoughts

RAG is a tool, not a product. The hard parts are chunking, evaluation, and cost. The exciting parts (embeddings, vector DBs) are the easy parts.

If you're shipping RAG for an SMB, start with pgvector, BGE-M3 or text-embedding-3-small, hybrid search, a reranker, and an eval set. Iterate weekly.

Need help? Talk to us — we ship RAG systems for SMBs, public sector, and fintech.


