RAG vs Agentic — How to Choose, and How to Ship RAG When You Need It (2026)

Most teams reach for RAG by default. Most don't need it. Here's how to decide between RAG and an agentic + tool-call architecture, and how to ship RAG correctly when it's the right call.

Ikki
Last verified · April 8, 2026
Why this guide exists

Most RAG tutorials stop at "load documents → embed → search → ask LLM." That works for a demo. It does not work in production.

But there's an earlier step that almost nobody covers: should you be building RAG at all? Across the products we've shipped, RAG was rarely the right default. Only one product in the portfolio genuinely needed it (a long-form narrative engine on a large markdown corpus where every question is semantic). For everything else (the SaaS dashboards, the workflow agents, the operational copilots), an agentic architecture with structured tool calls to the existing database beat RAG on every metric we cared about.

This guide is what we wished we'd had two years ago. It covers (1) when to choose RAG vs an agentic + tool-call approach, and (2) how to ship RAG well when it's the right call.

It's aimed at SMBs — teams without an in-house ML platform, but with real users who'll notice when the agent hallucinates.

When NOT to RAG — the agentic alternative

Before you embed a single chunk, ask: is your data textual and unstructured, or relational and already in a database?

If it's relational, RAG is almost always the wrong tool. The instinctive flow — chunk all your tables and emails into 500-token slices, embed them, search over them — destroys the structure that made the data useful in the first place. A question like "which records are missing a required field this month" is a db.find(), not a vector similarity search over fragmented text.

The alternative we ship by default for SaaS / internal-ops products:

  • An agent built on Anthropic's claude-agent-sdk (or equivalent), with strongly typed tool calls that translate the user's question into structured queries on the existing data store
  • Documentary context (policy docs, SOPs, knowledge base) lives in a small set of markdown files and is injected directly into the system prompt with prompt caching — not chunked, not embedded
  • Tools return structured data that the LLM formats; no semantic retrieval over fragmented text is involved

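This pattern handles "how many active records this month", "which account missed last month's invoice", "show me the unsigned contracts from this week" with exact answers, because every number comes from a query result rather than from the model's reading of fragmented text. The remaining risk is the model misreporting what the tool returned, which is far easier to test for than retrieval errors. RAG would lose every one of those queries.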
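The tool-call half of this pattern can be sketched as follows. This is a minimal illustration, not any specific SDK's API: the `records` table, the `count_active_records` tool, and the `dispatch` helper are all hypothetical, and Python's built-in `sqlite3` stands in for the app's real data store.

```python
import sqlite3

# Hypothetical schema and rows, standing in for the app's existing database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (id INTEGER PRIMARY KEY, status TEXT, created_month TEXT)"
)
conn.executemany(
    "INSERT INTO records (status, created_month) VALUES (?, ?)",
    [("active", "2026-04"), ("active", "2026-04"),
     ("archived", "2026-04"), ("active", "2026-03")],
)

# Tool description handed to the agent: a name plus a JSON Schema for the arguments.
COUNT_ACTIVE_TOOL = {
    "name": "count_active_records",
    "description": "Count active records created in a given month (YYYY-MM).",
    "input_schema": {
        "type": "object",
        "properties": {"month": {"type": "string", "pattern": r"^\d{4}-\d{2}$"}},
        "required": ["month"],
    },
}

def count_active_records(month: str) -> dict:
    """Run a parameterized query; the number comes from the DB, not the model."""
    row = conn.execute(
        "SELECT COUNT(*) FROM records WHERE status = 'active' AND created_month = ?",
        (month,),
    ).fetchone()
    return {"month": month, "active_count": row[0]}

TOOLS = {"count_active_records": count_active_records}

def dispatch(tool_name: str, arguments: dict) -> dict:
    """Route a tool call emitted by the agent to the matching Python function."""
    if tool_name not in TOOLS:
        raise ValueError(f"unknown tool: {tool_name}")
    return TOOLS[tool_name](**arguments)

result = dispatch("count_active_records", {"month": "2026-04"})
```

The model only chooses the tool and its arguments; the count itself is computed by the database and returned as structured data for the model to format.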

A grid for choosing

| Signal | RAG | Agentic + tool calls |
|---|---|---|
| Data lives in a structured DB (Mongo, Postgres) | ❌ | ✅ |
| Data is unstructured text (transcripts, contracts, lore, articles) | ✅ | ❌ |
| Documentary corpus < 1 MB total | ❌ (inject in prompt with caching) | ✅ |
| Documentary corpus 1–5 MB | ⚠️ borderline | ✅ if questions are relational, otherwise RAG |
| Documentary corpus > 5 MB of non-relational text | ✅ | ❌ |
| Questions are relational ("how many", "which one", "compute X") | ❌ | ✅ |
| Questions need paraphrase / semantic similarity ("what was the chapter where...") | ✅ | ❌ |
| Need multilingual semantic search on UGC | ✅ | ❌ |
| Need answers cited to specific chunks of source documents | ✅ | ⚠️ harder |

The decision is per-product, not per-agency. A long-form narrative engine over a large markdown corpus is a textbook RAG case. An ops-management bot on a structured schema is a textbook agentic case.

If you're tempted to RAG everything: stop, list your top 20 user questions, and check how many of them are actually relational. The number is usually higher than expected.
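As a rough sketch, the question audit and the corpus-size signals can be folded into one small helper. The function name, the return labels, and the 0.5 cut-off for "mostly relational" are assumptions made for illustration, and the borderline 1–5 MB band is resolved purely by question type here.

```python
def recommend(structured_db: bool, corpus_mb: float, relational_share: float) -> str:
    """Return 'agentic', 'prompt', or 'rag' from three coarse signals.

    relational_share is the fraction of top user questions that are relational
    ("how many", "which one", "compute X"). The 0.5 cut-off is an assumption.
    """
    mostly_relational = relational_share >= 0.5
    if structured_db and mostly_relational:
        return "agentic"
    if corpus_mb < 1:
        return "prompt"  # small docs: inject in the system prompt with caching
    return "agentic" if mostly_relational else "rag"

# Hand-labelled sample of user questions (True = relational).
labels = [True, True, False, True, True, True, False, True, True, True]
share = sum(labels) / len(labels)

choice = recommend(structured_db=True, corpus_mb=0.3, relational_share=share)
```

The labelling is the useful part: it forces the team to read real questions before committing to an architecture.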

The honest RAG baseline

If you've checked the grid and RAG is the right call, here's how to do it without losing six weeks.

Naive RAG looks like:

  1. Split documents into chunks
  2. Embed each chunk with a model (OpenAI, Cohere, BGE)
  3. Store embeddings in a vector DB
  4. At query time: embed the question, find top-k similar chunks, stuff them into the LLM prompt
  5. Return the answer
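The five steps above can be sketched end to end in plain Python. To stay self-contained, this version swaps in a toy bag-of-words "embedding" and an in-memory list instead of a real embedding model and vector DB, and it stops at building the prompt rather than calling an LLM. All function names are illustrative.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts. A real system calls an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def chunk(doc: str, size: int = 40) -> list[str]:
    """Step 1: naive fixed-size split on words."""
    words = doc.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def build_index(docs: list[str]) -> list[tuple[str, Counter]]:
    """Steps 2-3: embed every chunk and keep it in an in-memory 'vector store'."""
    return [(c, embed(c)) for d in docs for c in chunk(d)]

def retrieve(index: list[tuple[str, Counter]], question: str, k: int = 2) -> list[str]:
    """Step 4a: embed the question and return the top-k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question: str, chunks: list[str]) -> str:
    """Step 4b: stuff the retrieved chunks into the prompt sent to the LLM."""
    context = "\n---\n".join(chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

docs = [
    "Refunds are issued within 14 days of the return being received.",
    "Shipping to Canada takes five to eight business days.",
]
index = build_index(docs)
top = retrieve(index, "How long do refunds take?", k=1)
prompt = build_prompt("How long do refunds take?", top)
```

Everything later in this guide is a refinement of one of these functions: better `chunk`, better `embed`, a better ranking inside `retrieve`.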

This handles roughly 70% of questions when:

  • Documents are clean and reasonably homogeneous
  • Questions are short and well-formed
  • Your domain isn't full of jargon
  • Users don't ask multi-hop questions

For the remaining 30%, things break. That's where the work is.

Chunking — the most underrated decision

The default tutorial says "split by 500 tokens with 50 overlap." This is almost always wrong.

Better defaults:

  • Markdown / structured docs: split by heading hierarchy. A chunk = one logical section. Use the heading text as a preamble so context is retained.
  • Long-form text (articles, transcripts): semantic chunking — split where topics change. We use a small LLM as the splitter (Claude Haiku or gpt-4o-mini); off-the-shelf semantic chunkers from frameworks work too if you'd rather not roll your own.
  • Code or contracts: chunk by structural unit (function, clause). Keep parent context.
  • Tables: don't chunk. Embed each row as its own chunk with column headers as context.

We've seen retrieval quality go from 60% to 88% just by fixing chunking. It's the highest-leverage change you can make.
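For the markdown case, heading-based splitting might look like the sketch below. It assumes ATX-style headings (`#`, `##`, ...) and does not special-case `#` lines inside fenced code blocks; the function name and chunk shape are illustrative.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

def chunk_by_headings(markdown: str) -> list[dict]:
    """Split markdown into one chunk per section, prefixed with its heading path."""
    chunks: list[dict] = []
    path: list[str] = []   # current heading hierarchy, e.g. ["Policy", "Timing"]
    body: list[str] = []   # lines collected since the last heading

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            prefix = " > ".join(path)
            chunks.append({
                "path": prefix,
                "text": f"{prefix}\n{text}" if prefix else text,
            })
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]            # drop headings at this depth or deeper
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks

doc = """# Refund policy
Intro text.
## Eligibility
Items must be unused.
## Timing
Refunds take 14 days.
"""
sections = chunk_by_headings(doc)
```

Each chunk carries its full heading path, so a section titled only "Timing" still embeds as "Refund policy > Timing".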

Embeddings — pick once, evaluate constantly

In 2026 the realistic managed options are OpenAI text-embedding-3-small / text-embedding-3-large, Voyage AI's voyage-3 family, Mistral mistral-embed, and Google's gemini-embedding. All four providers are good. Pick the one that wins on your data, not on MTEB.

A note on what to avoid for SMBs: hosting a model like BGE-M3 yourself. The open-source quality is excellent but you're paying for a GPU (or eating 200ms+ of CPU latency per embedding) for marginal gains over text-embedding-3-small at $0.02 per million tokens, managed, sub-50ms. Self-hosted embeddings make sense for very large enterprises with privacy mandates or millions of QPS — not for the projects this guide is for.

For French-heavy corpora, mistral-embed is the strongest test point — it's competitive with OpenAI on FR-only data. We've used it inside French OCR pipelines on long-form regulatory PDFs; even there it's worth A/B-ing against text-embedding-3-small on YOUR documents before locking in.

The mistake to avoid: picking embeddings based on benchmarks instead of your own data. MTEB scores don't translate to your domain. Build a small eval set (50 questions, ground-truth answers) and test 2–3 embedding models. Pick what wins on YOUR set.
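A minimal harness for that comparison could look like this. It scores each candidate by recall@k against an expected chunk ID, which is one reasonable metric rather than the only one, and it ranks by dot product, which assumes normalized vectors. The two embedders are toy stand-ins so the example runs offline; in practice each would wrap a provider's API.

```python
from typing import Callable, Sequence

Vector = Sequence[float]

def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))

def recall_at_k(embed: Callable[[str], Vector], chunks: dict[str, str],
                eval_set: list[dict], k: int = 3) -> float:
    """Share of questions whose expected chunk id appears in the top-k results."""
    chunk_vecs = {cid: embed(text) for cid, text in chunks.items()}
    hits = 0
    for item in eval_set:
        q = embed(item["question"])
        ranked = sorted(chunk_vecs, key=lambda cid: dot(q, chunk_vecs[cid]), reverse=True)
        hits += item["expected_chunk"] in ranked[:k]
    return hits / len(eval_set)

def compare(models: dict[str, Callable[[str], Vector]], chunks: dict[str, str],
            eval_set: list[dict], k: int = 3) -> dict[str, float]:
    """Run the same eval set through every candidate embedder."""
    return {name: recall_at_k(fn, chunks, eval_set, k) for name, fn in models.items()}

# Toy stand-ins for real embedding models (hypothetical, for illustration only).
VOCAB = ["refund", "shipping", "warranty"]

def keyword_embedder(text: str) -> list[float]:
    return [float(word in text.lower()) for word in VOCAB]

def constant_embedder(text: str) -> list[float]:
    return [1.0, 1.0, 1.0]

chunks = {
    "c1": "Refund within 14 days",
    "c2": "Shipping takes 5 days",
    "c3": "Warranty lasts 2 years",
}
eval_set = [
    {"question": "When do I get my refund?", "expected_chunk": "c1"},
    {"question": "How long is the warranty?", "expected_chunk": "c3"},
]
scores = compare({"keyword": keyword_embedder, "constant": constant_embedder},
                 chunks, eval_set, k=1)
```

Keeping the eval set and chunk IDs fixed across runs is what makes the numbers comparable between models.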

Vector store — use what your app already runs on

Most SMB RAG projects don't need a dedicated vector DB. Match the store to the existing stack:

  • MongoDB Atlas Vector Search when the app is on MongoDB. Atlas exposes $vectorSearch as an aggregation stage with HNSW indexes, metadata filtering as standard Mongo query syntax, and hybrid search via Atlas full-text. Zero new infra. (We use $vectorSearch in production today on non-RAG matching use cases — feature vectors on structured records — so we know the operational profile well.)
  • pgvector when the app is on Postgres. Same story, different DB.
  • Pinecone is a strong choice for narrative / unstructured-text products, especially when paired with LangChain. For corpora that are large, textual, and semantic — long-form narrative engines, content libraries, knowledge bases of articles — it pairs naturally with RecursiveCharacterTextSplitter and a managed embedding model.

Step up to dedicated vector DBs (Pinecone, Weaviate, Qdrant) when:

  • You hit 10M+ vectors
  • You need sub-50ms p95 latency at high QPS
  • You want serverless scaling and hybrid search baked in

For 90% of SMB RAG projects, the database the app already runs on is the right call. Don't introduce Postgres just to use pgvector if your stack is Mongo — that's a useless infra fork. Atlas Vector Search is comparable in capability for the volumes SMBs ship.
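On MongoDB, the retrieval query is an aggregation pipeline. The sketch below only builds the pipeline as plain Python data; the index name, field names, and the `tenant_id` filter are hypothetical, and the filter field would need to be declared as a filter field in the vector index definition.

```python
def vector_search_pipeline(query_vector: list[float], tenant_id: str,
                           limit: int = 10, num_candidates: int = 200) -> list[dict]:
    """Build an Atlas $vectorSearch pipeline with a metadata pre-filter."""
    if num_candidates < limit:
        raise ValueError("numCandidates must be at least as large as limit")
    return [
        {"$vectorSearch": {
            "index": "chunks_vector_index",    # hypothetical index name
            "path": "embedding",               # hypothetical field holding the vector
            "queryVector": query_vector,
            "numCandidates": num_candidates,   # ANN candidates considered before the cut
            "limit": limit,
            "filter": {"tenant_id": {"$eq": tenant_id}},
        }},
        {"$project": {
            "_id": 0,
            "text": 1,
            "source": 1,
            "score": {"$meta": "vectorSearchScore"},
        }},
    ]

pipeline = vector_search_pipeline([0.1, 0.2, 0.3], tenant_id="acme", limit=5)
# With pymongo, this would be executed as: collection.aggregate(pipeline)
```

Because the filter is ordinary query syntax, tenant isolation and access rules live in the same place as the rest of the app's queries.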

Reranking — only when you have something to rerank

The standard advice is "always rerank with Cohere or a cross-encoder". The honest version is: rerank when you have more than ~50 candidates after retrieval. Below that threshold, the LLM can read all of them and the extra API call buys you nothing.

If you do need to rerank: a small "judge" LLM call (Haiku, gpt-4o-mini) is usually enough — top-50 → rerank → top-10. A managed reranker like Cohere's makes sense above ~200 candidates or with high-QPS latency budgets. Below those thresholds it's overkill.

Cost: one extra API call per query. Latency: +100–300ms. Often worth it. Sometimes not.
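The threshold rule can be expressed as a small gate in front of whatever reranker you use. In this sketch the scorer is a toy word-overlap function standing in for a judge LLM or a managed reranker; the function names are illustrative, and the defaults mirror the ~50 candidate threshold and top-10 cut mentioned above.

```python
from typing import Callable

def maybe_rerank(question: str, candidates: list[str],
                 score: Callable[[str, str], float],
                 threshold: int = 50, keep: int = 10) -> list[str]:
    """Rerank only when there are more candidates than the LLM can simply read."""
    if len(candidates) <= threshold:
        return candidates  # small set: pass everything through, no extra call
    ranked = sorted(candidates, key=lambda c: score(question, c), reverse=True)
    return ranked[:keep]

def overlap_score(question: str, candidate: str) -> float:
    """Toy scorer: number of shared words. Stand-in for a real reranker call."""
    return float(len(set(question.lower().split()) & set(candidate.lower().split())))

small = ["refund policy details", "shipping times", "warranty terms"]
large = [f"filler passage {i}" for i in range(59)] + ["refund policy details"]

kept_small = maybe_rerank("refund policy", small, overlap_score)
kept_large = maybe_rerank("refund policy", large, overlap_score)
```

The small set comes back untouched; only the 60-candidate set pays for scoring and gets cut to ten.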

Hybrid search — don't skip this

Pure vector search misses exact-match queries (product codes, names, numbers). Pure keyword search misses semantic similarity. Combine both with Reciprocal Rank Fusion or a learned ranker.

In MongoDB: Atlas Vector Search + Atlas full-text search, fused at query time. In Postgres: pgvector for vectors + tsvector for full-text. The MongoDB variant pairs naturally with apps already running on Mongo (no extra infra), which is the most common starting point.
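Reciprocal Rank Fusion itself is only a few lines: each document scores the sum of 1 / (k + rank) over every ranking it appears in. The sketch below uses k = 60, the constant commonly used for RRF; the document IDs are made up.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into one ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]    # from vector search
keyword_hits = ["doc_c", "doc_a", "doc_d"]   # from full-text search
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

RRF works on ranks only, so it needs no score normalization between the vector and keyword sides, which is why it is a convenient default before trying a learned ranker.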

Evaluation — the part everyone skips

You can't improve what you don't measure. Build an eval set:

  1. Collect 50–100 real user questions (from logs, or synthetic from your corpus)
  2. Have a domain expert write the correct answer for each
  3. Run your RAG pipeline, score each answer with an LLM-as-judge (GPT-4o or Claude) on:
    • Faithfulness (no hallucination)
    • Relevance (answers the question)
    • Citations (correct source chunks)

Run this eval after every change. If the score drops, revert.

Tools: Ragas, DeepEval, or roll your own with a JSON eval set. Don't skip this. It's the difference between a RAG system that gets better and one that drifts.
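If you roll your own, the core loop is small. In this sketch the pipeline and the judge are stubs so the code runs offline; a real judge would be an LLM call returning a score per metric. The JSON field names, the 0 to 1 score range, and the 0.02 tolerance are assumptions.

```python
import json
from statistics import mean
from typing import Callable

METRICS = ("faithfulness", "relevance", "citations")

def run_eval(eval_set: list[dict], pipeline: Callable[[str], str],
             judge: Callable[[str, str, str], dict]) -> dict[str, float]:
    """Score every answer on each metric and return the mean per metric."""
    rows = []
    for item in eval_set:
        answer = pipeline(item["question"])
        rows.append(judge(item["question"], item["expected"], answer))
    return {m: mean(r[m] for r in rows) for m in METRICS}

def should_revert(baseline: dict[str, float], current: dict[str, float],
                  tolerance: float = 0.02) -> bool:
    """Revert when any metric drops by more than the tolerance."""
    return any(current[m] < baseline[m] - tolerance for m in METRICS)

# Stubs standing in for the real RAG pipeline and the LLM judge.
def stub_pipeline(question: str) -> str:
    answers = {"Refund window?": "14 days [c1]", "Warranty length?": "I don't know"}
    return answers[question]

def stub_judge(question: str, expected: str, answer: str) -> dict:
    return {
        "faithfulness": 1.0,
        "relevance": float(expected in answer),
        "citations": float("[" in answer),
    }

eval_set = json.loads(
    '[{"question": "Refund window?", "expected": "14 days"},'
    ' {"question": "Warranty length?", "expected": "2 years"}]'
)
current = run_eval(eval_set, stub_pipeline, stub_judge)
baseline = {"faithfulness": 1.0, "relevance": 0.75, "citations": 0.5}
revert = should_revert(baseline, current)
```

Storing the baseline scores next to the eval set makes "if the score drops, revert" a mechanical check rather than a judgment call.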

Hallucination control

Three layers:

  1. Prompt-level: instruct the LLM to answer ONLY from the context, and to say "I don't know" if the context doesn't contain the answer.
  2. Citation enforcement: require the LLM to cite chunk IDs for each claim. Reject answers without citations.
  3. Post-hoc verification: a second LLM call that checks each claim against the source. Slower but catches the worst cases.

For high-stakes domains (legal, medical, finance), use all three.
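Citation enforcement (layer 2) is the easiest of the three to automate. The sketch below assumes the prompt asks the model to tag each sentence with markers like `[chunk:c1]`; that marker format, the sentence splitting, and the function name are all assumptions for illustration.

```python
import re

CITATION = re.compile(r"\[chunk:([A-Za-z0-9_-]+)\]")

def check_citations(answer: str, retrieved_ids: set[str]) -> tuple[bool, str]:
    """Accept an answer only if every sentence cites a chunk that was retrieved."""
    if answer.strip() == "I don't know":
        return True, "abstained"
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    if not sentences:
        return False, "empty answer"
    for sentence in sentences:
        cited = set(CITATION.findall(sentence))
        if not cited:
            return False, f"uncited claim: {sentence}"
        unknown = cited - retrieved_ids
        if unknown:
            return False, f"unknown chunk ids: {sorted(unknown)}"
    return True, "ok"

ok = check_citations(
    "Refunds take 14 days [chunk:c1]. Items must be unused [chunk:c2].",
    {"c1", "c2"},
)
missing = check_citations("Refunds take 14 days.", {"c1"})
invented = check_citations("Refunds take 14 days [chunk:c9].", {"c1"})
```

This only proves that a citation exists and points at a retrieved chunk. Whether the chunk actually supports the claim is what the post-hoc verification layer is for.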

Cost control

A typical RAG query at our scale:

  • Embedding: $0.00001 (negligible)
  • Vector search: $0 (already running on MongoDB Atlas, included in your tier)
  • Reranker: $0.001
  • LLM call: $0.005–0.02 (depends on context size)

Your cost is dominated by the LLM, which is dominated by context size. Aggressive top-k truncation, prompt compression, or a smaller LLM (Claude Haiku, GPT-4o-mini) can cut costs 5–10× with minimal quality loss.
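Adding up the per-query figures above shows how lopsided the split is: the LLM call accounts for roughly 83% of the total at the low end and 95% at the high end. The monthly volume in the sketch is a hypothetical number chosen only to show the scaling.

```python
# Per-query figures quoted above, in USD.
EMBEDDING = 0.00001
VECTOR_SEARCH = 0.0
RERANKER = 0.001
LLM_LOW, LLM_HIGH = 0.005, 0.02

def total(llm_cost: float) -> float:
    """Total cost of one query for a given LLM call cost."""
    return EMBEDDING + VECTOR_SEARCH + RERANKER + llm_cost

def llm_share(llm_cost: float) -> float:
    """Fraction of the per-query cost that goes to the LLM call."""
    return llm_cost / total(llm_cost)

low, high = total(LLM_LOW), total(LLM_HIGH)

# Hypothetical volume, for illustration only.
QUERIES_PER_MONTH = 100_000
monthly_low = low * QUERIES_PER_MONTH
monthly_high = high * QUERIES_PER_MONTH

print(f"per query: ${low:.5f} to ${high:.5f}")
print(f"LLM share: {llm_share(LLM_LOW):.0%} to {llm_share(LLM_HIGH):.0%}")
print(f"monthly at {QUERIES_PER_MONTH:,} queries: ${monthly_low:,.0f} to ${monthly_high:,.0f}")
```

Since the embedding and vector-search terms are nearly zero, trimming context size or switching to a smaller model is the only lever that moves the total.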

A production RAG case — when it's the right call

The clearest "RAG is the right call" pattern we ship looks like this: a long-form narrative or content product with a large, growing markdown corpus. New entries are produced regularly, must stay coherent with everything the corpus has previously established, and the questions are inherently semantic ("given everything we know about this character / topic / arc, generate the next chapter / answer / continuation").

This is a textbook RAG case: the corpus is textual, large, and the questions are semantic, not relational. There's nothing to db.find() for "produce a new chapter that respects the lore".

A reasonable stack on this profile:

  • A managed vector store (Pinecone, or Atlas Vector Search if you're already on MongoDB)
  • RecursiveCharacterTextSplitter with chunkSize: 1000 and chunkOverlap: 100 — basic but effective for narrative markdown
  • A managed embedding model (Google or OpenAI), benchmarked against your own corpus before locking in
  • Top-5 retrieval as narrative context, injected into the prompt
  • A reasoning-friendly LLM with thinking enabled for the generation pass; if the product needs visual continuity, a separate vision call on reference images
  • Hardcoded narrative guardrails in the system prompt (forbidden tropes, mandatory anchors, voice constraints)
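The last three bullets meet in the prompt. Here is a minimal sketch of that assembly step, assuming retrieval has already returned chunks ranked best-first; the guardrail strings, function name, and prompt wording are invented for illustration.

```python
# Hypothetical guardrails; a real product would hardcode its own.
GUARDRAILS = [
    "Never resurrect a character the corpus has established as dead.",
    "Keep the narrator in third person, past tense.",
    "Every chapter must reference at least one established location.",
]

def build_generation_prompt(task: str, ranked_chunks: list[str], top_k: int = 5) -> dict:
    """Combine hardcoded guardrails with the top-k retrieved chunks."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(ranked_chunks[:top_k], start=1)
    )
    system = "You continue a long-form narrative.\nRules:\n" + "\n".join(
        f"- {rule}" for rule in GUARDRAILS
    )
    user = f"Established context:\n{context}\n\nTask: {task}"
    return {"system": system, "user": user}

retrieved = [f"Lore passage {n}" for n in range(1, 8)]   # 7 chunks, best first
prompt = build_generation_prompt("Write the next chapter.", retrieved)
```

Keeping guardrails in the system prompt and retrieved lore in the user turn means the rules stay constant while the context changes per request.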

This profile is the exception. Most products get an agentic + tool-call architecture instead — and they're more reliable and cheaper to operate for it.

The lesson: let the data shape decide the architecture, not the team's preferred stack.

When NOT to use RAG

  • The corpus is small (<100 documents): just put it in the prompt
  • The corpus changes hourly: caching strategies break, costs explode
  • Users need answers from data that requires reasoning across many sources at once: RAG isn't a replacement for analytical agents

Closing thoughts

RAG is a tool, not a product. The first job is to decide whether your problem actually needs it. Most don't — most need an agent that can interrogate the data you already have, with tool calls and prompt caching, and the documentary context injected directly. RAG enters the picture when you have a real corpus of unstructured text whose semantic content matters.

If you've decided RAG is the right call: start with whatever DB your app already uses (Atlas Vector Search or pgvector), text-embedding-3-small or mistral-embed for embeddings (don't self-host BGE for an SMB), hybrid search, rerank only above ~50 candidates, and an eval set from day one. Iterate weekly.

If you're not sure which side of the decision you're on, or if you want a sanity check on a stack that's drifting toward over-engineering, get in touch. We've shipped both, and we've shipped enough of the wrong one in the wrong place to know the difference.

