Lessons from Shipping AI Products
What we learned shipping voice agents, RAG platforms, fintech engines, civic AI, and immersive web — the patterns that worked, the ones that didn't, and the things nobody told us.

Why this post
Over the last 24 months we've shipped a portfolio of products to production. They span fintech, civic tech, home services SaaS, fantasy sports prediction, immersive sci-fi, autonomous trading, and event PWAs — most with AI in the loop, one fully transactional.
This post is what we wish we'd known before product 1. It's biased toward what surprised us — the patterns that worked across the portfolio, and the ones that broke when we tried to copy them blindly.
The shape of the portfolio
The products span very different domains: narrative and content engines, B2B analytics platforms, ops-facing SaaS with voice or chat in the loop, prediction-and-reasoning UX, civic-tech tooling, multi-tenant PWAs, and autonomous loops on time-series data.
They look unrelated. They're not. The same set of lessons applied to all of them.
Lesson 1: the AI part is rarely the hard part
When we started, we assumed shipping AI products meant deep ML problems — model selection, fine-tuning, eval. In reality, the AI is 15-25% of the work. The rest is:
- Auth, billing, multi-tenancy, RBAC
- Data pipelines (ingest, normalize, version)
- Observability (logs, traces, eval drift, cost monitoring)
- UI/UX for a product whose behavior is non-deterministic
- Customer onboarding and guardrails
- Compliance (GDPR for EU, sector-specific for fintech and civic)
Most "AI startups" ship AI demos. To ship AI products, the surrounding 75-85% has to be there. We've watched competitors with much fancier models lose to teams with simpler models and better infrastructure.
Lesson 2: defaults matter more than capabilities
On a voice-agent project we spent two weeks tuning the LLM prompt. We then realized 80% of the conversation quality came from one decision: the agent's first sentence.
On a RAG platform: the chunking strategy mattered more than the embedding model.
On a quantitative engine: the position-sizing default mattered more than the prediction model.
The lesson: in AI products, defaults are the product. Picking sensible defaults and letting users override them is almost always better than asking them to configure 14 things up front.
Lesson 3: build the eval before the feature
This was a hard lesson. On one of the RAG products, we shipped the system, demoed it, and then had no way to know if it was getting better or worse over time. We added eval six weeks in. By then, we'd made changes we couldn't measure.
Now, on every AI product, we build the eval first:
- A small (50-100 example) curated dataset
- A scoring function (LLM-as-judge or rule-based)
- A CI hook that runs eval on every PR
Without this, you're flying blind. Cost: 1-2 days. Value: every subsequent decision is informed.
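As a rough illustration of the three pieces above, here is a minimal sketch of a rule-based scorer, a dataset runner, and a gate that a CI job could call. Every name here (`EvalCase`, `runEval`, `assertEvalGate`, `fakeSystem`) and the two sample cases are hypothetical; a real harness would call the actual pipeline asynchronously and might swap the rule-based scorer for an LLM judge.

```typescript
// Minimal eval harness sketch. All names and sample data are hypothetical.
type EvalCase = { input: string; mustContain: string[] };
type EvalReport = { passed: number; total: number; passRate: number; failures: string[] };

// Rule-based scorer: a case passes when every required phrase appears in the output.
function scoreCase(output: string, testCase: EvalCase): boolean {
  const normalized = output.toLowerCase();
  return testCase.mustContain.every(
    (phrase) => normalized.indexOf(phrase.toLowerCase()) !== -1
  );
}

// Runs the system under test over the whole dataset and aggregates the results.
function runEval(dataset: EvalCase[], system: (input: string) => string): EvalReport {
  const failures = dataset
    .filter((testCase) => !scoreCase(system(testCase.input), testCase))
    .map((testCase) => testCase.input);
  const total = dataset.length;
  const passed = total - failures.length;
  return { passed, total, passRate: total === 0 ? 0 : passed / total, failures };
}

// CI gate: an uncaught error makes the job exit non-zero, which fails the PR check.
function assertEvalGate(report: EvalReport, minPassRate: number): void {
  if (report.passRate < minPassRate) {
    throw new Error(
      `Eval gate failed: ${report.passed}/${report.total} passed, need ${minPassRate * 100}%`
    );
  }
}

// Stand-in for the real pipeline, which would be async and call the model.
const fakeSystem = (input: string): string =>
  input.indexOf("refund") !== -1 ? "Refunds are processed within 14 days." : "I don't know.";

const sampleCases: EvalCase[] = [
  { input: "How long does a refund take?", mustContain: ["14 days"] },
  { input: "What is your address?", mustContain: ["street"] },
];

const evalReport = runEval(sampleCases, fakeSystem);
console.log(`${evalReport.passed}/${evalReport.total} cases passed`);
```

In a PR pipeline the last step would be a call such as `assertEvalGate(evalReport, 0.9)`, with the threshold chosen from the current baseline rather than picked up front.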
Lesson 4: production traffic teaches you everything
We had a synthetic test set for one of the voice agents. It scored 92% in offline eval. We launched. The first day, real users broke it in ways we hadn't anticipated:
- Background noise (kids, traffic, dogs)
- Strong regional accents we hadn't included in the test set
- Users who interrupted constantly
- Users who answered questions before they were asked
The fix wasn't a better model. It was better production logging so we could see what was actually happening, then iterating on edge cases week by week. Six weeks after launch, the agent was at 96% completion rate. That gain came from real traffic, not synthetic eval.
Tooling we use now: PostHog for transcripts + replay, custom dashboards for cost and latency drift, and manual review of 10 random calls per week.
Lesson 5: the cost surprise is always voice
Across the portfolio, the products that fire the most LLM calls per minute (autonomous loops, batch analytics) are not the most expensive ones to operate. The voice products are — even though they make 1/100th the LLM calls. They cost a few times more per active user.
Why? Voice. Voice synthesis is 10–100× more expensive than the LLM brain. We talked about this in our voice AI cost guide.
Lesson: model your unit economics carefully if voice is in the loop. The LLM is rarely the cost driver — voice and telephony are.
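One way to make that modeling concrete is a per-call cost breakdown. The sketch below is only an illustration: the types, the function names, and above all the rates are placeholders chosen so the arithmetic is easy to follow, not prices quoted by any provider. Plug in your own vendor rates and call profile.

```typescript
// Per-call cost model sketch. Every rate below is a placeholder, not a quoted price.
type CallProfile = {
  minutes: number;
  llmTokensPerMinute: number; // prompt + completion tokens
  ttsCharsPerMinute: number; // characters sent to speech synthesis
};

type Rates = {
  llmUsdPerMillionTokens: number;
  ttsUsdPerMillionChars: number;
  telephonyUsdPerMinute: number;
};

type CostBreakdown = { llm: number; voice: number; telephony: number; total: number };

function costPerCall(profile: CallProfile, rates: Rates): CostBreakdown {
  const llm = (profile.minutes * profile.llmTokensPerMinute * rates.llmUsdPerMillionTokens) / 1e6;
  const voice = (profile.minutes * profile.ttsCharsPerMinute * rates.ttsUsdPerMillionChars) / 1e6;
  const telephony = profile.minutes * rates.telephonyUsdPerMinute;
  return { llm, voice, telephony, total: llm + voice + telephony };
}

// Names the largest line item, which is the one worth optimizing first.
function dominantCost(cost: CostBreakdown): "llm" | "voice" | "telephony" {
  if (cost.voice >= cost.llm && cost.voice >= cost.telephony) return "voice";
  return cost.telephony >= cost.llm ? "telephony" : "llm";
}

// Hypothetical four-minute call with placeholder rates.
const sampleProfile: CallProfile = { minutes: 4, llmTokensPerMinute: 1000, ttsCharsPerMinute: 500 };
const placeholderRates: Rates = {
  llmUsdPerMillionTokens: 1,
  ttsUsdPerMillionChars: 100,
  telephonyUsdPerMinute: 0.01,
};

const sampleCost = costPerCall(sampleProfile, placeholderRates);
console.log(sampleCost, "dominant:", dominantCost(sampleCost));
```

With these placeholder inputs, synthesis costs 50 times as much as the LLM line, which sits inside the 10 to 100 times range described above. The value of the exercise is seeing which line dominates before launch, not the specific numbers.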
Lesson 6: pick a stack and reuse it
For our first products we picked technologies à la carte. Different frontend (React then Vue), different backend (Express then Fastify), different DBs.
The result: every onboarding ramp was three weeks. Code didn't transfer between projects. Bug fixes didn't compound.
We later picked a stack and stuck to it: Nuxt 4 + Fastify + MongoDB + BullMQ + Redis + Vercel/DO. We wrote shared utilities (auth middleware, multi-tenancy, billing, observability hooks) once and reused them across products.
This is unfashionable advice, and the right framework choice depends on the team. But the underlying principle (stop optimizing per project, start optimizing across projects) is the highest-leverage decision an agency can make.
Lesson 7: the AI feature is the easy lock-in story to tell, but it's almost never the moat
We've pitched clients "you'll have an AI moat." We were wrong every time.
Real moats we've watched build up:
- Data ownership: a year of clean labeled data is harder to copy than a model
- Workflow integration: once the agent is wired into your CRM, telephony, and back office, switching cost is high
- Trust: SLAs and uptime track records compound
The AI part is replaceable. The everything-around-it isn't. We now sell the integration, not the model.
Lesson 8: ship behind a flag, always
Every AI feature we ship goes behind a flag (LaunchDarkly, GrowthBook, or homebrew). Reasons:
- Bad output discovered in prod → flip the flag, no rollback
- A/B test the AI version vs the rule-based fallback → measure the lift
- Roll out to 1% of traffic for cost / quality monitoring before going wide
- Sales conversations: "we shipped it last week, currently in beta with our top customers"
This single practice has saved us six times. It's table stakes.
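For the homebrew case, the core of a flag is small. The sketch below assumes a hypothetical `Flag` shape and helper names; it shows a deterministic percentage rollout plus a fallback to the rule-based path when the flag is off or the AI path throws. A hosted service such as LaunchDarkly or GrowthBook would replace `isEnabledFor` with its own SDK call.

```typescript
// Percentage rollout with a rule-based fallback. All names are hypothetical.
type Flag = { enabled: boolean; rolloutPercent: number };
type Responder = (question: string) => string;

// Simple stable string hash, so a given user always lands in the same bucket (0..99).
function bucketOf(userId: string): number {
  let hash = 0;
  for (let i = 0; i < userId.length; i++) {
    hash = (hash * 31 + userId.charCodeAt(i)) >>> 0; // keep it an unsigned 32-bit int
  }
  return hash % 100;
}

function isEnabledFor(flag: Flag, userId: string): boolean {
  return flag.enabled && bucketOf(userId) < flag.rolloutPercent;
}

// Serves the AI path only inside the rollout, and falls back if the AI path throws.
function respond(
  flag: Flag,
  userId: string,
  question: string,
  ai: Responder,
  fallback: Responder
): string {
  if (!isEnabledFor(flag, userId)) return fallback(question);
  try {
    return ai(question);
  } catch {
    return fallback(question);
  }
}

// Stand-ins for the two code paths.
const aiPath: Responder = (q) => `AI answer to: ${q}`;
const rulePath: Responder = (q) => `Rule-based answer to: ${q}`;

const onePercent: Flag = { enabled: true, rolloutPercent: 1 };
console.log(respond(onePercent, "user-42", "Where is my order?", aiPath, rulePath));
```

Setting `enabled` to false is the "flip the flag" case from the list above: every user goes back to the rule-based path without a deploy.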
Lesson 9: dual-write the data layer
For four products, we use LLMs in the write path: an agent processes input, structures it, and writes to the DB. The temptation is to trust the LLM output.
Don't. We always dual-write: store both the raw input AND the LLM-structured output, with a version. When (not if) the prompt changes and the structure shifts, we can re-run on raw data without losing history.
This adds 10% to storage. It's worth it 100% of the time.
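A minimal sketch of the record shape this implies, under stated assumptions: the types, the in-memory array standing in for the database, and the two keyword-based extractors standing in for two prompt versions are all hypothetical. The point is that `rawInput` is never rewritten and each prompt version appends its own output.

```typescript
// Dual-write sketch. Types and extractors are hypothetical stand-ins; in a real
// system `extract` would call the model and the store would be a database.
type Structured = { category: string; urgent: boolean };
type Extractor = { version: string; extract: (raw: string) => Structured };

type StoredRecord = {
  id: string;
  rawInput: string; // exactly what came in, never rewritten
  outputs: { promptVersion: string; structured: Structured }[]; // append-only history
};

function writeRecord(
  store: StoredRecord[],
  id: string,
  rawInput: string,
  extractor: Extractor
): StoredRecord {
  const record: StoredRecord = {
    id,
    rawInput,
    outputs: [{ promptVersion: extractor.version, structured: extractor.extract(rawInput) }],
  };
  store.push(record);
  return record;
}

// Re-runs a new prompt version over the raw inputs, appending instead of overwriting.
function reprocess(store: StoredRecord[], extractor: Extractor): number {
  let updated = 0;
  for (const record of store) {
    if (record.outputs.some((o) => o.promptVersion === extractor.version)) continue;
    record.outputs.push({
      promptVersion: extractor.version,
      structured: extractor.extract(record.rawInput),
    });
    updated++;
  }
  return updated;
}

function latest(record: StoredRecord): Structured {
  return record.outputs[record.outputs.length - 1].structured;
}

const hasWord = (text: string, word: string): boolean =>
  text.toLowerCase().indexOf(word) !== -1;

const extractorV1: Extractor = {
  version: "v1",
  extract: (raw) => ({ category: "support", urgent: hasWord(raw, "urgent") }),
};
const extractorV2: Extractor = {
  version: "v2",
  extract: (raw) => ({
    category: "support",
    urgent: hasWord(raw, "urgent") || hasWord(raw, "asap"),
  }),
};

const ticketStore: StoredRecord[] = [];
writeRecord(ticketStore, "t1", "Boiler is leaking, need someone ASAP", extractorV1);
const reprocessedCount = reprocess(ticketStore, extractorV2);
console.log(reprocessedCount, latest(ticketStore[0]));
```

Because the `v1` output stays in `outputs`, it remains possible to compare what each prompt version produced for the same raw input.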
Lesson 10: the team you ship with matters more than the model
We've shipped products with junior teams, with senior teams, with mixed teams. The single biggest predictor of project success was not the model, framework, or budget — it was whether the team had someone who'd shipped before.
Specifically: someone who'd seen a system through launch, scale, and sunset. They know which corners are safe to cut and which ones break later. That person is the difference between a 6-week project and a 6-month one.
Hire for that. Invest in growing it.
Lesson 11: choose agentic over RAG when your data is relational
This is the most contrarian lesson in the list, and the one that has saved us the most weeks.
Across the portfolio, only one product ended up running real RAG — a narrative-engine corpus where questions are genuinely semantic ("produce a chapter consistent with everything we've established about this character"). The rest handle questions like "how many active records this month", "which account missed last month's invoice", "what's the next scheduled item for this user". Those are relational questions, not semantic ones — and a db.find() call from an agent beats a vector retrieval over fragmented text every time on accuracy, latency, and cost.
The architecture that worked across the other six: an agent built on claude-agent-sdk (or the plain Anthropic Messages API for synchronous flows), with strongly typed tool calls that translate the user's question into structured queries. Documentary context (SOPs, policies, escalation rules) lives in a small set of markdown files injected directly into the system prompt, marked with cache_control: { type: 'ephemeral' } so that repeated calls read the prompt from cache and the cost stays flat across the day. No embeddings, no chunking, no reranker.
Before you embed a single chunk, list your top 20 user questions. If most are relational, you don't need RAG. You need an agent with tools.
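To illustrate what "strongly typed tool calls" can mean in practice, here is a sketch of the dispatch layer only, with the model call left out. The tool names, the field names, and the in-memory arrays standing in for database collections are all hypothetical. The model's output is validated at the boundary, then mapped to a structured query.

```typescript
// Typed tool dispatch sketch. Tool names, fields, and the in-memory "tables" are
// hypothetical; in production each branch would be a real database query.
type Account = { id: string; active: boolean };
type Invoice = { accountId: string; month: string; paid: boolean };
type Db = { accounts: Account[]; invoices: Invoice[] };

// Closed set of tools the agent may call, each with its own typed input.
type ToolCall =
  | { name: "count_active_accounts"; input: Record<string, never> }
  | { name: "find_unpaid_invoices"; input: { month: string } };

type ToolResult = { count: number } | { accountIds: string[] };

// Validates whatever the model returned before it gets anywhere near the data layer.
function parseToolCall(name: string, input: unknown): ToolCall {
  const args = (typeof input === "object" && input !== null ? input : {}) as Record<string, unknown>;
  if (name === "count_active_accounts") {
    return { name: "count_active_accounts", input: {} };
  }
  const month = args["month"];
  if (name === "find_unpaid_invoices" && typeof month === "string" && /^\d{4}-\d{2}$/.test(month)) {
    return { name: "find_unpaid_invoices", input: { month } };
  }
  throw new Error(`Rejected tool call: ${name}`);
}

// Each tool is a structured query, not a similarity search over text chunks.
function runTool(call: ToolCall, db: Db): ToolResult {
  switch (call.name) {
    case "count_active_accounts":
      return { count: db.accounts.filter((a) => a.active).length };
    case "find_unpaid_invoices": {
      const month = call.input.month;
      return {
        accountIds: db.invoices
          .filter((inv) => inv.month === month && !inv.paid)
          .map((inv) => inv.accountId),
      };
    }
  }
}

// Made-up sample data.
const sampleDb: Db = {
  accounts: [
    { id: "a1", active: true },
    { id: "a2", active: false },
    { id: "a3", active: true },
  ],
  invoices: [
    { accountId: "a1", month: "2024-05", paid: true },
    { accountId: "a3", month: "2024-05", paid: false },
    { accountId: "a3", month: "2024-04", paid: false },
  ],
};

const unpaid = runTool(parseToolCall("find_unpaid_invoices", { month: "2024-05" }), sampleDb);
console.log(JSON.stringify(unpaid));
```

The JSON result would be sent back to the model as the tool result. Anything outside the closed set, or with malformed arguments, is rejected before a query runs.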
Lesson 12: forced tool calling is the cure for the "almost-right" sentence
The failure mode that takes down most production chatbots isn't outright hallucination. It's the sentence that sounds confident and is almost right: a fake escalation, an invented SLA, a signature that impersonates an ops manager, an "I'll get back to you by 5pm" promise the system can't keep. Standard prompt engineering doesn't fix this. Adding more rules to the system prompt rarely helps, because the LLM is great at producing prose that feels compliant.
The fix that worked for us: don't let the LLM write the user-facing prose at all.
We use Anthropic's tool_choice: { type: 'any' } to force the model to answer with a tool call from a closed palette on every turn (the API guarantees exactly one call per turn only when disable_parallel_tool_use: true is also set). The LLM's only output is a tool name and structured arguments. The reply that lands on the user's screen is generated by deterministic templates filled with the structured args. The templates are versioned, reviewable, and A/B-able, and their wording is structurally impossible to hallucinate.
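The rendering side of this pattern can be sketched as follows. The tool names, argument fields, and template wording are hypothetical. `requestOptions` only mirrors the `tool_choice` parameter discussed above, with `disable_parallel_tool_use` added as an assumption to limit the model to a single call; tool definitions and the API call itself are omitted.

```typescript
// Deterministic reply rendering sketch. Tool names, fields, and wording are hypothetical.
// `requestOptions` mirrors the Messages API `tool_choice` parameter; the call is omitted.
const requestOptions = {
  tool_choice: { type: "any", disable_parallel_tool_use: true },
};

// The closed palette: the model can only return one of these shapes.
type ToolUse =
  | { name: "confirm_visit"; input: { date: string; window: string } }
  | { name: "escalate_to_human"; input: { team: string } }
  | { name: "stay_silent"; input: Record<string, never> };

// Bump this when wording changes, so replies can be traced back to a template set.
const TEMPLATE_VERSION = "v1";

// Every user-facing sentence lives here, not in the model output.
function renderReply(toolUse: ToolUse): string | null {
  switch (toolUse.name) {
    case "confirm_visit":
      return `Your visit is confirmed for ${toolUse.input.date}, ${toolUse.input.window}.`;
    case "escalate_to_human":
      return `I've passed this on to our ${toolUse.input.team} team.`;
    case "stay_silent":
      return null; // nothing is sent to the user
  }
}

const reply = renderReply({
  name: "confirm_visit",
  input: { date: "Tuesday", window: "9am to 11am" },
});
console.log(TEMPLATE_VERSION, requestOptions.tool_choice.type, reply);
```

The model can still pick the wrong tool or wrong arguments, so argument validation is still needed, but it cannot invent a promise that has no template.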
When the chosen tool is "stay silent" (acknowledgments, salutations), a smaller second model (Haiku) runs a safety re-check: "is there an actionable signal in the last message that the first model just suppressed?" If yes, the silence is overridden. This catches the rare cases where Sonnet decides "merci, à demain" ("thanks, see you tomorrow") is just an ack but an earlier message in the same turn contained "I'm sick".
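The control flow of that re-check can be sketched as below. The `Recheck` function type stands in for the call to the second model. The keyword-based version exists only so the example runs offline; it is an assumption for illustration, not how the second model decides.

```typescript
// Silence re-check sketch. `Recheck` stands in for the call to the smaller second
// model; the keyword version below is a hypothetical offline stand-in.
type Decision = { tool: "stay_silent" | "reply" | "escalate_to_human"; reason?: string };
type Verdict = { actionable: boolean; reason: string };
type Recheck = (turnMessages: string[]) => Verdict;

// Only silence is second-guessed; any other decision passes through untouched.
function resolveDecision(primary: Decision, turnMessages: string[], recheck: Recheck): Decision {
  if (primary.tool !== "stay_silent") return primary;
  const verdict = recheck(turnMessages);
  return verdict.actionable
    ? { tool: "escalate_to_human", reason: verdict.reason }
    : primary;
}

const keywordRecheck: Recheck = (turnMessages) => {
  const hits = turnMessages.filter((m) => /\b(sick|injured|accident)\b/i.test(m));
  return hits.length > 0
    ? { actionable: true, reason: `possible signal: "${hits[0]}"` }
    : { actionable: false, reason: "" };
};

const resolved = resolveDecision(
  { tool: "stay_silent" },
  ["I'm sick", "merci, à demain"],
  keywordRecheck
);
console.log(resolved.tool, resolved.reason);
```

Running the second check only on silence keeps its cost proportional to how often the first model chooses to say nothing.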
This pattern is more work to design than a free-text chatbot. It's much less work to operate. We'd recommend it for any conversational agent talking to operators, customers, or workers in a context where wrong answers are expensive.
Closing thoughts
If you're shipping your first AI product, the temptation is to spend 80% of your time on the AI. Resist it.
Spend 25% on the AI, 25% on the eval, 25% on the surrounding infrastructure, and 25% on listening to the first 100 real users. That's how products ship.
If you'd like help applying these lessons to your project — get in touch. We've made these mistakes so you don't have to.