LLM Architecture Patterns That Scale

When the demo looked perfect but production disagreed
A team ships a conversational assistant that crushes it in demos. Retrieval pulls back relevant context, the model answers with confidence, the PM is happy. Three weeks after launch the support queue fills up with confident wrong answers. They dig in and find the cause: one choice made during the prototype. Every user query, no matter the type, runs through the same retrieval path with the same prompt template.
That choice was fine for a demo. It's wrong for a production system serving real people who want different things.
Below are the patterns that separate a demo-grade LLM system from one you can run in production.
The query router pattern
The most basic decision in a system that scales is whether every request should take the same path. In most real apps, it shouldn't.
A query router reads the incoming request, figures out intent, and hands it to a specialized handler:
- Factual lookups go to retrieval-augmented generation with strict grounding
- Analytical questions go to chains that reason in steps and show their intermediate work
- Chit-chat goes to a simpler completion flow with no expensive retrieval
Routing doesn't need an LLM. A lightweight classifier trained on a few hundred labeled examples is often accurate enough for routing and costs a fraction of full inference. Save the generative models for the paths where nothing else will do.
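Here is a minimal sketch of the router shape in Python. The keyword sets, handler names, and intent labels are all hypothetical; in practice `classify_intent` would wrap whatever trained classifier you use, and the handlers would call real pipelines.

```python
import re
from typing import Callable, Dict

# Hypothetical cue words standing in for a trained lightweight classifier.
ANALYTICAL_CUES = {"why", "compare", "analyze", "explain"}
FACTUAL_CUES = {"what", "when", "where", "who", "which"}


def classify_intent(query: str) -> str:
    """Return one of 'analytical', 'factual', or 'chitchat'."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    if words & ANALYTICAL_CUES:
        return "analytical"
    if words & FACTUAL_CUES:
        return "factual"
    return "chitchat"


# Placeholder handlers; real ones would call RAG, a reasoning chain, or a plain completion.
def handle_factual(query: str) -> str:
    return f"[rag] {query}"


def handle_analytical(query: str) -> str:
    return f"[reasoning-chain] {query}"


def handle_chitchat(query: str) -> str:
    return f"[plain-completion] {query}"


HANDLERS: Dict[str, Callable[[str], str]] = {
    "factual": handle_factual,
    "analytical": handle_analytical,
    "chitchat": handle_chitchat,
}


def route(query: str) -> str:
    """Classify the query, then dispatch to the matching handler."""
    intent = classify_intent(query)
    # Unknown intents fall back to the cheapest path.
    return HANDLERS.get(intent, handle_chitchat)(query)
```

Keeping the classifier behind a single function means you can swap the keyword rules for a trained model without touching the dispatch table.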
Retrieval pipeline design
Your output is only as good as the context you feed it. Most teams obsess over the embedding model and skip the three things that matter more: how you chunk, how you score, and how you rerank.
Chunking drives precision. Small overlapping chunks with sentence-level boundaries beat paragraph or page-sized chunks for question answering, consistently. The right size depends on your corpus, but 200 to 400 tokens with a 20% overlap is a sane place to start and measure from.
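As a rough illustration, the sketch below builds sentence-bounded chunks with overlap. It makes two simplifying assumptions: tokens are approximated by whitespace-separated words, and sentences are split with a naive regex. The function name and defaults are illustrative, not from any particular library.

```python
import re
from typing import List


def chunk_sentences(text: str, max_tokens: int = 300,
                    overlap_ratio: float = 0.2) -> List[str]:
    """Group whole sentences into chunks of roughly max_tokens words.

    Each new chunk starts with trailing sentences from the previous one,
    up to max_tokens * overlap_ratio words. A single sentence longer than
    max_tokens is kept whole, so chunks can exceed the target size.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    budget = int(max_tokens * overlap_ratio)
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for sent in sentences:
        n = len(sent.split())
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward while they fit the overlap budget.
            carried: List[str] = []
            carried_len = 0
            for prev in reversed(current):
                prev_len = len(prev.split())
                if carried_len + prev_len > budget:
                    break
                carried.insert(0, prev)
                carried_len += prev_len
            current, current_len = carried, carried_len
        current.append(sent)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

With a real tokenizer you would replace `len(sent.split())` with the tokenizer's count, then measure retrieval quality at a few sizes rather than trusting the defaults.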
Scoring should blend dense and sparse signals. Dense retrieval alone whiffs on keyword-critical queries. BM25 alone whiffs on semantic variation. Hybrid retrieval with weights you can tune usually beats either one by itself, especially on technical docs.
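One simple way to blend the two signals is a weighted sum over normalized scores. The sketch below assumes you already have per-document scores from a dense retriever and from BM25; the min-max normalization and the `alpha` weight are one possible choice among several (reciprocal rank fusion is a common alternative).

```python
from typing import Dict


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so dense and sparse values are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_scores(dense: Dict[str, float], sparse: Dict[str, float],
                  alpha: float = 0.5) -> Dict[str, float]:
    """Blend normalized scores; alpha weights dense, (1 - alpha) weights sparse.

    A document missing from one retriever's results contributes 0 for that signal.
    """
    d, s = min_max(dense), min_max(sparse)
    return {
        doc_id: alpha * d.get(doc_id, 0.0) + (1 - alpha) * s.get(doc_id, 0.0)
        for doc_id in set(d) | set(s)
    }
```

Treat `alpha` as a tunable: sweep it against a labeled query set instead of fixing it at 0.5.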
Reranking is the lever nobody pulls. Run a cross-encoder reranker over your top-20 candidates before you pick the final context, and you'll often see the grounded answer rate jump 15 to 30 percent. The compute cost is small next to generation.
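The reranking step itself is small. In this sketch the scoring function is pluggable: `overlap_score` is a toy stand-in (plain term overlap), and a real system would pass in a function that calls a cross-encoder model instead. All names here are hypothetical.

```python
from typing import Callable, List


def overlap_score(query: str, passage: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query terms found in the passage."""
    q_terms = set(query.lower().split())
    p_terms = set(passage.lower().split())
    return len(q_terms & p_terms) / len(q_terms) if q_terms else 0.0


def rerank(query: str, candidates: List[str],
           score_fn: Callable[[str, str], float] = overlap_score,
           top_n: int = 20, keep: int = 3) -> List[str]:
    """Rescore the first top_n candidates and return the best `keep` passages."""
    pool = candidates[:top_n]
    ranked = sorted(pool, key=lambda passage: score_fn(query, passage), reverse=True)
    return ranked[:keep]
```

Because the scorer sees the query and the passage together, it can catch relevance that first-stage retrieval scores miss, which is the point of adding this stage.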
Guardrail placement
Guardrails belong at two points in the request lifecycle, not one.
Input guardrails check the query before it hits anything expensive:
- Length bounds limit prompt injection by context saturation
- Topic classification flags off-policy requests before retrieval runs
- PII detection gates queries that would bake sensitive input into the prompt
Output guardrails check the response before the user sees it:
- Grounding checks confirm claims against the retrieved context
- Policy filters catch responses that should go to a human first
- Confidence thresholds pull low-certainty answers back to a canned response
The mistake is treating guardrails as an output-only job. Input validation is cheaper and faster, and it keeps material out of the pipeline that should never have been there.
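A sketch of the two checkpoints, under heavy simplifying assumptions: the length bound is an arbitrary number, PII detection is reduced to an email regex, and the grounding check is a crude word-overlap ratio. Production systems would use dedicated classifiers for each; the function and constant names are illustrative.

```python
import re
from dataclasses import dataclass

MAX_QUERY_CHARS = 2000  # assumed bound; tune for your context budget
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")  # toy PII detector


@dataclass
class Verdict:
    allowed: bool
    reason: str = ""


def check_input(query: str) -> Verdict:
    """Cheap checks that run before retrieval or generation."""
    if len(query) > MAX_QUERY_CHARS:
        return Verdict(False, "too_long")
    if EMAIL_RE.search(query):
        return Verdict(False, "pii_detected")
    return Verdict(True)


def check_output(answer: str, context: str, min_overlap: float = 0.6) -> Verdict:
    """Crude grounding check: share of the answer's longer words present in the context."""
    words = [w for w in re.findall(r"[a-z0-9]+", answer.lower()) if len(w) > 3]
    if not words:
        return Verdict(False, "empty_answer")
    context_words = set(re.findall(r"[a-z0-9]+", context.lower()))
    ratio = sum(w in context_words for w in words) / len(words)
    if ratio >= min_overlap:
        return Verdict(True)
    return Verdict(False, "ungrounded")
```

Returning a reason code with each verdict makes guardrail trigger rates easy to break down later.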

Fallback chain design
A production LLM system needs a fallback chain, written down. When primary generation fails or comes back shaky, what happens next?
A minimal three-tier chain:
- Primary: full RAG with reranking and the primary model
- Secondary: simplified retrieval on a cheaper model, returning only the single most confident matched document
- Tertiary: a deterministic response that points the user toward other support
That third tier isn't a failure state. It's a feature. A system that quietly routes to a human when it can't answer with confidence earns more trust than one that invents an answer and states it flatly.
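The chain can be expressed as an ordered list of tiers tried in sequence. In this sketch each tier is a hypothetical callable returning an answer with a confidence score, or `None`; the threshold and the fallback message are assumptions.

```python
from typing import Callable, List, Optional, Tuple

# A tier takes a query and returns (answer, confidence), or None if it has nothing.
Tier = Callable[[str], Optional[Tuple[str, float]]]

FALLBACK_MESSAGE = "I couldn't find a reliable answer. Please contact support."


def run_chain(query: str, tiers: List[Tuple[str, Tier]],
              min_confidence: float = 0.7) -> Tuple[str, str]:
    """Try each tier in order; return (tier_name, answer).

    A tier is skipped if it raises, returns None, or reports low confidence.
    If every tier is skipped, the deterministic tertiary response is returned.
    """
    for name, tier in tiers:
        try:
            result = tier(query)
        except Exception:
            continue  # in production, log the failure before moving on
        if result is not None and result[1] >= min_confidence:
            return name, result[0]
    return "tertiary", FALLBACK_MESSAGE
```

Returning the tier name alongside the answer lets you track how often each tier actually serves traffic.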
Observability requirements
You can't improve what you can't see. LLM observability asks for more than standard service instrumentation.
At a minimum, capture per request:
- The retrieval candidates and their scores
- How much of the context window you used
- Which model got called, with input and output token counts kept separate
- The results of each guardrail evaluation
- The final answer the user received
Roll these into two dashboards. One for operations: latency, error rate, guardrail trigger rate. One for quality: grounded answer rate plus satisfaction proxies like follow-up query rate.
When quality slips, you want to know whether it's retrieval, context selection, or generation within minutes. Not days.
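The per-request fields listed above fit naturally into one structured record. The sketch below is one possible shape; the field names are assumptions, and context usage is computed here as input tokens over the model's window size.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Tuple


@dataclass
class RequestTrace:
    request_id: str
    model: str
    input_tokens: int
    output_tokens: int
    context_window: int  # maximum tokens the model accepts
    retrieval_candidates: List[Tuple[str, float]] = field(default_factory=list)
    guardrail_results: Dict[str, bool] = field(default_factory=dict)
    final_answer: str = ""

    @property
    def context_utilization(self) -> float:
        """Share of the context window consumed by the input."""
        if not self.context_window:
            return 0.0
        return self.input_tokens / self.context_window

    def to_json(self) -> str:
        """Serialize the trace as one JSON line for a log pipeline."""
        record = asdict(self)
        record["context_utilization"] = round(self.context_utilization, 4)
        return json.dumps(record)
```

Emitting one record per request keeps the operations and quality dashboards reading from the same source.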
Caching strategy
Inference costs money. A thoughtful cache cuts operating costs without hurting quality.
Semantic caching matches a new query against recent query-response pairs by embedding similarity. When the match clears your threshold and the source documents haven't changed, you serve the cached answer straight back. It works well for FAQ-style traffic where lots of people ask the same thing in different words.
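A minimal in-memory sketch of that lookup follows. It assumes an `embed` function you supply, a single `doc_version` string that changes whenever source documents change, and a linear scan over entries; a real deployment would use a vector index and an eviction policy. The 0.92 threshold is a placeholder to tune.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, embed: Callable[[str], Sequence[float]],
                 threshold: float = 0.92) -> None:
        self.embed = embed
        self.threshold = threshold
        # Each entry: (query embedding, cached answer, document version)
        self.entries: List[Tuple[Sequence[float], str, str]] = []

    def put(self, query: str, answer: str, doc_version: str) -> None:
        self.entries.append((self.embed(query), answer, doc_version))

    def get(self, query: str, doc_version: str) -> Optional[str]:
        """Return the most similar cached answer, or None on a miss."""
        q = self.embed(query)
        best, best_sim = None, self.threshold
        for emb, answer, version in self.entries:
            if version != doc_version:
                continue  # source documents changed, so the entry is stale
            sim = cosine(q, emb)
            if sim >= best_sim:
                best, best_sim = answer, sim
        return best
```

Set the threshold conservatively at first: a false cache hit serves a wrong answer, while a miss only costs one more inference call.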
Retrieval caching keeps the retrieval cache separate from the generation cache. Caching the retrieved chunks by document id and query type is often safer than caching whole generated answers. You cut retrieval cost without risking stale generated content.
Version and rollback strategy
LLM systems are harder to roll back than ordinary apps. A prompt tweak, an embedding model swap, a chunk boundary shift, or a reranker threshold change can each move output quality on its own.
Version each of them separately:
- Prompts get semantic version numbers and live in source control
- Embedding models get versioned and evaluated before you migrate the index
- Reranker thresholds are config values, not hard-coded constants
- Rollback procedures are documented and tested before the first production deploy
Teams that version these pieces independently can find and undo a quality regression in minutes. Teams that treat the whole thing as one artifact spend days digging out.
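One way to make independent versioning concrete is a single immutable record that names each component's version. The field names and version strings below are hypothetical.

```python
from dataclasses import dataclass, fields, replace
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class PipelineVersions:
    prompt: str               # semantic version of the prompt template
    embedding_model: str      # identifier of the model behind the index
    chunking: str             # version of the chunk boundary rules
    reranker_threshold: float # config value, not a hard-coded constant


def diff(old: PipelineVersions, new: PipelineVersions) -> Dict[str, Tuple[Any, Any]]:
    """List the components that changed between two deployments."""
    return {
        f.name: (getattr(old, f.name), getattr(new, f.name))
        for f in fields(old)
        if getattr(old, f.name) != getattr(new, f.name)
    }


stable = PipelineVersions("1.4.2", "embed-v3", "sentence-300-20", 0.35)
current = replace(stable, prompt="2.0.0")          # deploy a prompt change only
rolled_back = replace(current, prompt=stable.prompt)  # undo just that component
```

Logging this record with every request lets a regression be tied to the one component that changed, then reverted on its own.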
For small and medium-sized businesses
If you're running a smaller shop, the payoff here is concrete. You move faster, you carry less operational risk, and your budget goes further. Nobody's asking you to chase every new tool. The move is to put web platform improvements and AI-assisted workflows exactly where they change the numbers.
Start with one workflow where the economics are clear. Set a baseline. Improve it in 30-day increments. Risk stays contained while your team builds the confidence and the skills to do more.