Engineering · 9 min read

RAG vs fine-tuning in 2026: default to RAG

Default to RAG. Reach for fine-tuning when retrieval can't close the gap. The six-axis decision matrix we run on every client LLM build.

The Hayaiti team
#llm #rag #ai


The short version

Default to RAG. Reach for fine-tuning when you've measured a specific gap that retrieval can't close.

That's the headline. The rest of this post is the decision matrix we use as our build playbook, plus the ways we've watched both approaches fail in production in prior roles.

What each one actually is

RAG (retrieval-augmented generation) is plumbing. You take the user's question, fetch relevant documents from a vector store or keyword index, stuff them into the prompt, and let a foundation model answer using that context. The model itself is unchanged.
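Here's roughly what that plumbing looks like as code. A minimal sketch, assuming the OpenAI Node SDK, an in-memory index, and placeholder model names; in production you'd swap in a real vector store and your own prompt:

```ts
import OpenAI from "openai";

const client = new OpenAI();

// Toy in-memory "index": each doc stored with its embedding.
type Doc = { id: string; text: string; embedding: number[] };

const cosine = (a: number[], b: number[]) => {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] ** 2; nb += b[i] ** 2; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
};

async function embed(text: string): Promise<number[]> {
  const res = await client.embeddings.create({ model: "text-embedding-3-small", input: text });
  return res.data[0].embedding;
}

async function answer(question: string, index: Doc[]) {
  // 1. Retrieve: rank stored docs against the question embedding, keep the top 5.
  const q = await embed(question);
  const top = [...index]
    .sort((a, b) => cosine(b.embedding, q) - cosine(a.embedding, q))
    .slice(0, 5);

  // 2. Stuff the retrieved chunks into the prompt. The model itself is unchanged.
  const context = top.map((d) => `[${d.id}] ${d.text}`).join("\n\n");
  const res = await client.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      { role: "system", content: "Answer using only the provided context. Cite sources by [id]." },
      { role: "user", content: `Context:\n${context}\n\nQuestion: ${question}` },
    ],
  });
  return res.choices[0].message.content;
}
```

Almost everything interesting in a system like this lives in the retrieval half of that function; the generation call barely changes from project to project.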

Fine-tuning changes the model. You take a base model and train it on examples of input → desired output until it shifts its weights. The model now answers differently — even with the same prompt.
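Fine-tuning, by contrast, is mostly a data exercise. A sketch of what the training set looks like, assuming the chat-style JSONL format hosted fine-tuning APIs accept; the examples themselves are made up:

```ts
import { writeFileSync } from "node:fs";

// Each line of the JSONL is one example of input → desired output.
// Training shifts the model's weights toward producing these outputs.
const examples = [
  {
    messages: [
      { role: "system", content: "Classify the support ticket intent." },
      { role: "user", content: "I was charged twice this month." },
      { role: "assistant", content: "billing_duplicate_charge" },
    ],
  },
  {
    messages: [
      { role: "system", content: "Classify the support ticket intent." },
      { role: "user", content: "How do I rotate my API key?" },
      { role: "assistant", content: "account_api_keys" },
    ],
  },
];

writeFileSync("train.jsonl", examples.map((e) => JSON.stringify(e)).join("\n"));
```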

These solve different problems. People conflate them because both make LLM output "more domain-specific," but the mechanism is completely different.

The decision matrix

We score every LLM feature on six axes before picking an approach:

  • Data freshness. Does the answer change weekly? Daily? RAG wins on anything that updates faster than a fine-tune cycle. Fine-tuning a model on Q1 docs and serving it in Q3 will quietly hallucinate stale policy.
  • Factuality + attribution. RAG can cite the source it pulled. A fine-tuned model can't — the knowledge is baked into weights. If the use case is "what does our internal policy say about X," you almost always want the citation.
  • Latency budget. RAG adds a retrieval step (usually 50-200ms). Fine-tuned models skip it. For sub-200ms total budgets, the math matters.
  • Cost at scale. RAG is cheap to set up, expensive per call (longer context = more tokens). Fine-tuning is expensive to set up, cheaper per call. There's a crossover point — usually around 1M+ requests on the same domain; a back-of-envelope version of that math is sketched after this list.
  • Knowledge boundary. Is the knowledge a finite, documented corpus (good for RAG) or a fuzzy "speak in our voice" thing (good for FT)?
  • Behavior vs facts. Fine-tuning is great at *behavior* — JSON formatting, classification labels, tone, refusal patterns. RAG is great at *facts* — what's in the docs, the contract, the catalog.
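That cost crossover is worth computing rather than guessing. A back-of-envelope sketch; every number in it is an illustrative placeholder you'd replace with your own pricing and token counts:

```ts
// All numbers are illustrative placeholders, not real pricing.
const ragExtraTokensPerCall = 3_000;    // retrieved context stuffed into the prompt
const pricePerMillionTokens = 0.50;     // $ per 1M input tokens on the base model
const ragCostPerCall = (ragExtraTokensPerCall / 1_000_000) * pricePerMillionTokens;

const fineTuneSetupCost = 1_500;        // training run + labeling + eval work
const fineTunedPremiumPerCall = 0.0001; // extra per-call cost of the tuned model, if any

// Fine-tuning pays off once the saved context tokens outweigh the setup cost.
const savingsPerCall = ragCostPerCall - fineTunedPremiumPerCall;
const breakEvenCalls = savingsPerCall > 0 ? fineTuneSetupCost / savingsPerCall : Infinity;

console.log(`RAG marginal cost/call: $${ragCostPerCall.toFixed(6)}`);
console.log(`Break-even at ~${Math.round(breakEvenCalls).toLocaleString()} calls`);
```

With these placeholder numbers the break-even lands just over a million calls, which is roughly where we usually see it; your pricing will move that point.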

Recommended defaults

Here's what we actually pick by use case:

  • Customer-support bot over your help center? RAG. The docs change weekly, you need citations, and the latency is fine.
  • Internal Q&A over a wiki or Notion? RAG. Same reasons.
  • Domain classifier (intent → label)? Fine-tune. The label space is fixed, you need consistent output, latency matters, and there's no factual lookup involved.
  • JSON extraction from messy text? Fine-tune (or just structured output via the API). Behavior, not facts.
  • "Write code in our codebase style"? RAG over the codebase. Maybe fine-tune later if you have 10K+ accepted PRs to train on.
  • A model that talks like your CEO? Fine-tune. Voice is behavior.

Where each one fails

RAG fails quietly when retrieval is bad. If your chunks are too big, you waste tokens. Too small, you lose context. Wrong embedding model for the domain (e.g. medical text on a general embedder), and your top-5 results are noise. The model still answers — confidently — just wrong. We've seen RAG systems where the retrieval recall was under 40% and nobody noticed for months because the model "sounded right."
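The antidote is to measure retrieval on its own, separately from answer quality. A sketch, where `retrieve` is a stand-in for whatever search you actually run and the gold set is a small hand-built list of question → known-relevant-doc pairs:

```ts
// Hypothetical retrieval function: returns the ids of the top-k chunks for a query.
declare function retrieve(query: string, k: number): Promise<string[]>;

type GoldExample = { query: string; relevantDocIds: string[] };

// Recall@k: how often at least one known-relevant doc shows up in the top k.
async function recallAtK(gold: GoldExample[], k = 5): Promise<number> {
  let hits = 0;
  for (const ex of gold) {
    const topIds = await retrieve(ex.query, k);
    if (ex.relevantDocIds.some((id) => topIds.includes(id))) hits++;
  }
  return hits / gold.length;
}

// If this number is 0.4, no prompt tweak will save the answers built on top of it.
```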

Fine-tuning fails when the data is wrong. Train on 500 examples where the "correct" label was actually inconsistent across labelers, and the model learns the inconsistency. Train on synthetic data that doesn't reflect production distribution, and it overfits to a world that doesn't exist. The model won't tell you. It'll just degrade silently on real traffic.
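One cheap check before any training run: find inputs that show up more than once with different labels. A sketch over a flat list of labeled examples (the field names are ours, not any particular tool's):

```ts
type LabeledExample = { input: string; label: string };

// Group examples by normalized input and keep only the inputs where labelers disagreed.
function findLabelConflicts(examples: LabeledExample[]): Map<string, Set<string>> {
  const byInput = new Map<string, Set<string>>();
  for (const ex of examples) {
    const key = ex.input.trim().toLowerCase();
    const labels = byInput.get(key) ?? new Set<string>();
    labels.add(ex.label);
    byInput.set(key, labels);
  }
  return new Map([...byInput].filter(([, labels]) => labels.size > 1));
}
```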

Both fail at evaluation. "Looks fine to a human reading 10 samples" is not an eval. Build a holdout set. Score with at least one adversarial check. Re-score every time you change anything.
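A minimal version of that harness, with the system under test (`answer`) left as a hypothetical stand-in; the adversarial cases here just check that the model refuses when the answer isn't in the corpus:

```ts
// Hypothetical system under test.
declare function answer(question: string): Promise<string>;

type EvalCase = {
  question: string;
  mustContain: string[];   // substrings a correct answer should include
  expectRefusal?: boolean; // adversarial cases: the answer is NOT in the corpus
};

async function runEval(cases: EvalCase[]) {
  let passed = 0;
  for (const c of cases) {
    const out = (await answer(c.question)).toLowerCase();
    const ok = c.expectRefusal
      ? /don't know|not covered|no information/.test(out)         // crude refusal check
      : c.mustContain.every((s) => out.includes(s.toLowerCase())); // crude correctness check
    if (ok) passed++;
    else console.log(`FAIL: ${c.question}`);
  }
  console.log(`${passed}/${cases.length} passed`);
  return passed / cases.length;
}
// Re-run this on every change: prompt, chunking, embedder, model version.
```

The checks are crude on purpose; a harness you actually run beats a sophisticated one you don't.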

The hybrid case

The interesting systems are usually hybrid: a fine-tuned classifier that decides *which* RAG corpus to query, then a base model that generates the answer with retrieved context. Or a small fine-tuned reranker on top of vector search to push the right chunks into the top-3.
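Sketched out, that shape looks like this, with the classifier, search, reranker, and generator all as hypothetical stand-ins:

```ts
// Hypothetical pieces of the hybrid pipeline.
declare function classifyCorpus(q: string): Promise<"billing" | "product" | "legal">; // fine-tuned classifier
declare function search(corpus: string, q: string, k: number): Promise<string[]>;     // vector search per corpus
declare function rerank(q: string, chunks: string[]): Promise<string[]>;              // small fine-tuned reranker
declare function generate(q: string, context: string[]): Promise<string>;             // base model + retrieved context

async function hybridAnswer(question: string): Promise<string> {
  const corpus = await classifyCorpus(question);          // fine-tuning handles behavior (routing)
  const candidates = await search(corpus, question, 20);  // retrieval handles facts
  const top = (await rerank(question, candidates)).slice(0, 3);
  return generate(question, top);
}
```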

We rarely see "pure FT" or "pure RAG" win on hard problems. The win comes from picking the right tool for each step.

What we ship

Most LLM features we build for clients are RAG-first. Faster to ship, easier to debug, citations come for free. We add fine-tuning when we have:

  1. A measurable gap retrieval can't close (usually behavior, not facts).
  2. Enough labeled data (500+ high-quality examples minimum).
  3. A stable evaluation harness so we'll know if it gets worse.

If you're trying to figure out which approach fits your use case, the free audit covers AI features too — we'll tell you which one we'd reach for and why, in writing, within 24 hours.


Hayaiti

Hayaiti is a productized engineering studio. We ship web, software, iOS, and cybersecurity work on fixed prices and calendar-day timelines. The team takes turns on the shipping log.


Want help shipping this?

We turn posts like this into production code. Fixed price. Calendar-day timelines. Source code in your repo on day one.