RAG · 12 min read

RAG Implementation Guide for SMBs (2026)

How to ship a Retrieval-Augmented Generation system that actually works for SMBs — chunking, embeddings, evaluation, and the mistakes that cost us six weeks.

Frédéric Magnin
Founder & AI Engineer at Ikki

Why this guide exists

Most RAG tutorials stop at "load documents → embed → search → ask LLM." That works for a demo. It does not work in production.

This guide is what we wished we'd had two years ago when we shipped our first RAG system. It's specifically aimed at SMBs — teams without an in-house ML platform, but with real users who'll notice when the agent hallucinates.

We'll cover what works, what doesn't, and the trade-offs that aren't in the docs.

The honest baseline

Naive RAG looks like this (sketched in code after the list):

  1. Split documents into chunks
  2. Embed each chunk with a model (OpenAI, Cohere, BGE)
  3. Store embeddings in a vector DB
  4. At query time: embed the question, find top-k similar chunks, stuff them into the LLM prompt
  5. Return the answer
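
In Python, the whole naive pipeline fits on one screen. This is a minimal illustration, not production code: it assumes the OpenAI SDK, keeps vectors in memory instead of a vector DB, and every name is a placeholder.

  # Naive RAG end to end: in-memory vectors, fixed-size chunks, no reranking.
  from openai import OpenAI
  import numpy as np

  client = OpenAI()  # reads OPENAI_API_KEY from the environment

  def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
      # The "default tutorial" chunking: fixed windows with overlap (see caveats below).
      words = text.split()
      return [" ".join(words[i:i + size]) for i in range(0, len(words), size - overlap)]

  def embed(texts: list[str]) -> np.ndarray:
      resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
      return np.array([d.embedding for d in resp.data])

  def ask(question: str, chunks: list[str], vectors: np.ndarray, k: int = 5) -> str:
      q = embed([question])[0]
      sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
      context = "\n\n".join(chunks[i] for i in np.argsort(sims)[::-1][:k])
      resp = client.chat.completions.create(
          model="gpt-4o-mini",
          messages=[
              {"role": "system", "content": "Answer only from the provided context. "
                                            "If the context does not contain the answer, say so."},
              {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
          ],
      )
      return resp.choices[0].message.content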

This works for 70% of questions when:

  • Documents are clean and reasonably homogeneous
  • Questions are short and well-formed
  • Your domain isn't full of jargon
  • Users don't ask multi-hop questions

For the remaining 30%, things break. That's where the work is.

Chunking — the most underrated decision

The default tutorial says "split by 500 tokens with 50 overlap." This is almost always wrong.

Better defaults:

  • Markdown / structured docs: split by heading hierarchy. A chunk = one logical section. Use the heading text as a preamble so context is retained (see the sketch after this list).
  • Long-form text (articles, transcripts): semantic chunking — split where topics change. LangChain's semantic chunker works, or use a small LLM as a splitter.
  • Code or contracts: chunk by structural unit (function, clause). Keep parent context.
  • Tables: don't split them blindly. Embed each row as its own chunk, with the column headers prepended as context.
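
For the Markdown case, a rough heading-aware splitter looks like this. It's a sketch, not a library API; it keeps the heading path as a preamble so each chunk carries its own context.

  # Split Markdown by heading hierarchy: one chunk per section, heading path as preamble.
  import re

  def chunk_markdown(md: str) -> list[str]:
      chunks: list[str] = []
      path: list[str] = []   # current heading path, e.g. ["Pricing", "Enterprise plan"]
      buf: list[str] = []

      def flush() -> None:
          body = "".join(buf).strip()
          if body:
              preamble = " > ".join(path)
              chunks.append(f"{preamble}\n\n{body}" if preamble else body)
          buf.clear()

      for line in md.splitlines(keepends=True):
          heading = re.match(r"^(#{1,6})\s+(.+)", line)
          if heading:
              flush()
              level, title = len(heading.group(1)), heading.group(2).strip()
              del path[level - 1:]   # drop headings at the same or deeper level
              path.append(title)
          else:
              buf.append(line)
      flush()
      return chunks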

We've seen retrieval quality go from 60% to 88% just by fixing chunking. It's the highest-leverage change you can make.

Embeddings — pick once, evaluate constantly

In 2026, the default is OpenAI text-embedding-3-small (1536 dim) or text-embedding-3-large (3072 dim). Both are good. large is better, but it doubles your storage and costs roughly 6× more per embedded token.

For French content specifically, BGE-M3 and mistral-embed are competitive with OpenAI and sometimes better on French-only corpora. Test both on your data before committing.

The mistake we made: picking embeddings based on benchmarks instead of our own data. MTEB scores don't translate to your domain. Build a small eval set (50 questions, ground-truth answers) and test 2–3 embedding models. Pick what wins on YOUR set.
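
A bare-bones version of that comparison is sketched below. It assumes each eval question is labelled with the ID of the chunk that should be retrieved, and that embed_fn wraps whichever model is under test; all names are placeholders.

  # Recall@k on your own eval set, run once per candidate embedding model.
  import numpy as np

  def recall_at_k(embed_fn, questions, gold_ids, chunks, chunk_ids, k=5):
      # embed_fn: callable(list[str]) -> np.ndarray with one row per input text
      vecs = embed_fn(chunks)
      vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
      hits = 0
      for question, gold_id in zip(questions, gold_ids):
          q = embed_fn([question])[0]
          q = q / np.linalg.norm(q)
          top = np.argsort(vecs @ q)[::-1][:k]
          hits += gold_id in {chunk_ids[i] for i in top}
      return hits / len(questions)

  # Run for OpenAI, BGE-M3 and mistral-embed wrappers; keep whichever wins on your set.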

Vector store — pgvector is enough until it isn't

We default to pgvector on Postgres for SMBs. Reasons:

  • You probably already have Postgres
  • Up to ~5M vectors with an HNSW index, queries stay fast (<100ms)
  • Filtering on metadata is just SQL — no vendor-specific query DSL (see the example after this list)
  • Cheaper and simpler than dedicated vector DBs at this scale
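
For example, a filtered similarity query is one plain SQL statement. This is a sketch assuming a chunks table with a client_id column and an HNSW index on the embedding column; the table and column names are placeholders.

  # pgvector: cosine distance (<=>) plus an ordinary SQL metadata filter.
  # Index: CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);
  import psycopg

  def search(conn: psycopg.Connection, query_vec: list[float], client_id: str, k: int = 10):
      vec = "[" + ",".join(f"{x:.6f}" for x in query_vec) + "]"
      return conn.execute(
          """
          SELECT id, content, embedding <=> %s::vector AS distance
          FROM chunks
          WHERE client_id = %s
          ORDER BY distance
          LIMIT %s
          """,
          (vec, client_id, k),
      ).fetchall()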

Switch to Pinecone or Weaviate when:

  • You hit 10M+ vectors
  • You need sub-50ms p95 latency at high QPS
  • You want serverless scaling
  • You need hybrid (vector + lexical) search out of the box

For 90% of SMB RAG projects, pgvector is the right call.

Reranking — the easy 10% gain

After top-k retrieval, rerank the results with a cross-encoder (Cohere Rerank or BGE-reranker; a second-pass similarity score with text-embedding-3-large is a weaker substitute). Top-50 → rerank → top-10. Almost always improves precision.

Cost: one extra API call per query. Latency: +100–300ms. Worth it.
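
Here is a sketch of the retrieve-then-rerank step with a local BGE cross-encoder via sentence-transformers; Cohere's hosted Rerank API is a drop-in alternative, and the model name and sizes below are illustrative.

  # Over-retrieve, then let a cross-encoder re-score (query, chunk) pairs.
  from sentence_transformers import CrossEncoder

  reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

  def rerank(query: str, candidates: list[str], top_n: int = 10) -> list[str]:
      scores = reranker.predict([(query, c) for c in candidates])
      ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
      return [text for text, _ in ranked[:top_n]]

  # candidates = vector_search(query, k=50)        # hypothetical retrieval call
  # context_chunks = rerank(query, candidates, 10)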

Hybrid search — don't skip this

Pure vector search misses exact-match queries (product codes, names, numbers). Pure keyword search misses semantic similarity. Combine both with Reciprocal Rank Fusion or a learned ranker.

In Postgres: pgvector for vectors + tsvector for full-text search, fused at query time. We ship this pattern by default for SMB RAG.
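
The fusion step itself is tiny. A sketch: it takes the ranked ID lists from the vector query and the tsvector query and merges them, with k=60 as the usual RRF constant.

  # Reciprocal Rank Fusion: score(doc) = sum over rankers of 1 / (k + rank).
  from collections import defaultdict

  def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
      scores: dict[str, float] = defaultdict(float)
      for ranking in rankings:
          for rank, doc_id in enumerate(ranking, start=1):
              scores[doc_id] += 1.0 / (k + rank)
      return sorted(scores, key=scores.get, reverse=True)

  # fused_ids = rrf([vector_ids, fulltext_ids])[:50]   # then hand off to the reranker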

Evaluation — the part everyone skips

You can't improve what you don't measure. Build an eval set:

  1. Collect 50–100 real user questions (from logs, or synthetic from your corpus)
  2. Have a domain expert write the correct answer for each
  3. Run your RAG pipeline, score each answer with an LLM-as-judge (GPT-4o or Claude) on:
    • Faithfulness (no hallucination)
    • Relevance (answers the question)
    • Citations (correct source chunks)

Run this eval after every change. If the score drops, revert.

Tools: Ragas, DeepEval, or roll your own with a JSON eval set. Don't skip this. It's the difference between a RAG system that gets better and one that drifts.
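
If you roll your own, the judge call is short. In this sketch the rubric prompt, the JSON eval-set format, and the choice of gpt-4o as judge are all ours for illustration.

  # LLM-as-judge over a JSON eval set: one {"question", "context", "expected", "answer"} per item.
  import json
  from openai import OpenAI

  client = OpenAI()

  JUDGE_PROMPT = """Rate the answer from 1 to 5 on faithfulness to the context,
  relevance to the question, and correctness against the expected answer.
  Reply with JSON: {{"faithfulness": n, "relevance": n, "correctness": n}}

  Question: {question}
  Context: {context}
  Expected: {expected}
  Answer: {answer}"""

  def judge(item: dict) -> dict:
      resp = client.chat.completions.create(
          model="gpt-4o",
          response_format={"type": "json_object"},
          messages=[{"role": "user", "content": JUDGE_PROMPT.format(**item)}],
      )
      return json.loads(resp.choices[0].message.content)

  # scores = [judge(item) for item in json.load(open("eval_set.json"))]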

Hallucination control

Three layers:

  1. Prompt-level: instruct the LLM to answer ONLY from the context, and to say "I don't know" if the context doesn't contain the answer.
  2. Citation enforcement: require the LLM to cite chunk IDs for each claim. Reject answers without citations.
  3. Post-hoc verification: a second LLM call that checks each claim against the source. Slower but catches the worst cases.

For high-stakes domains (legal, medical, finance), use all three.
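
Layer 2 is easy to enforce mechanically. In this sketch the [chunk:ID] convention and the regex are ours for illustration, not a standard.

  # Require [chunk:ID] citations and reject answers whose citations are missing or unknown.
  import re

  CITATION = re.compile(r"\[chunk:([A-Za-z0-9_-]+)\]")

  def validate_citations(answer: str, allowed_ids: set[str]) -> bool:
      cited = set(CITATION.findall(answer))
      if not cited:
          return False                   # no citations at all -> reject
      return cited <= allowed_ids        # every citation must point at a retrieved chunk

  # if not validate_citations(answer, {c.id for c in retrieved_chunks}):
  #     retry with a stricter prompt, or fall back to "I don't know"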

Cost control

A typical RAG query at our scale:

  • Embedding: $0.00001 (negligible)
  • Vector search: $0 (self-hosted pgvector)
  • Reranker: $0.001
  • LLM call: $0.005–0.02 (depends on context size)

Your cost is dominated by the LLM, which is dominated by context size. Aggressive top-k truncation, prompt compression, or a smaller LLM (Claude Haiku, GPT-4o-mini) can cut costs 5–10× with minimal quality loss.
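
The back-of-the-envelope math is worth writing down. In this sketch the per-million-token prices are illustrative placeholders, not current list prices.

  # Per-query LLM cost: context size dominates.
  def llm_cost_usd(context_tokens: int, output_tokens: int = 300,
                   price_in: float = 2.50, price_out: float = 10.00) -> float:
      # Prices are USD per 1M tokens -- placeholders, check your provider's current pricing.
      return (context_tokens * price_in + output_tokens * price_out) / 1_000_000

  # 10 chunks x 800 tokens of context -> roughly $0.02 per query at these example rates
  # llm_cost_usd(context_tokens=8_000)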

What a production RAG looks like

The system we shipped for Opportunix — which analyzes French public tenders for SMBs — runs:

  • ~50k chunks (mostly tender documents)
  • pgvector with HNSW
  • BGE-M3 embeddings (better on French than OpenAI)
  • Hybrid search (vector + tsvector)
  • Cohere Rerank
  • GPT-4o for synthesis with mandatory citations
  • Posthog for query logs and feedback collection
  • Weekly eval run on a 200-question set

P95 query latency: 2.1s. Hallucination rate (measured on eval set): under 4%.

When NOT to use RAG

  • The corpus is small (<100 documents): just put it in the prompt
  • The corpus changes hourly: caching strategies break, costs explode
  • Users need answers from data that requires reasoning across many sources at once: RAG isn't a replacement for analytical agents

Closing thoughts

RAG is a tool, not a product. The hard parts are chunking, evaluation, and cost. The exciting parts (embeddings, vector DBs) are the easy parts.

If you're shipping RAG for an SMB, start with pgvector, BGE-M3 or text-embedding-3-small, hybrid search, a reranker, and an eval set. Iterate weekly.

Need help? Talk to us — we ship RAG systems for SMBs, public sector, and fintech.


