Enterprise AI architecture patterns that hold up

Five enterprise AI architecture patterns (gateway, retrieval grounding, human-in-the-loop, evals, model portfolio): when each applies, plus the anti-patterns.

By Pavel Glukhikh November 24, 2025 7 min read

Executive summary

Enterprise AI architecture is the set of structural decisions that let an organization run AI systems with the same reliability, security, and cost discipline as the rest of its estate. Five patterns cover most of what actually works: a gateway in front of all model access, retrieval grounding for knowledge tasks, human-in-the-loop gates at points of irreversibility, eval pipelines wired into delivery, and a deliberately managed model portfolio. None of them is exotic, which is the point. This article explains each pattern, when it applies, when it is overkill, and the anti-patterns (direct-to-API sprawl, fine-tuning as a first resort, agent maximalism) that generate most of the cleanup work I see.

Enterprise AI architecture is mostly not about models.

The model is a component you rent or run. The architecture is everything that makes it safe to depend on: where requests enter, how knowledge gets in, where humans sit, how behavior is tested, and how you avoid marrying a single provider. Get those five decisions right and the model becomes swappable, which is exactly what you want, because it will need swapping sooner than you think. Providers reprice, deprecate, and leapfrog each other on a quarterly cadence. Your architecture is what decides whether that cadence is a line item or a crisis.

On accounts I’ve led, the AI conversations that went badly were never about model choice. They were about the forty applications calling three providers directly with keys in config files, and nobody able to say what it cost or what data left the building. Architecture is how you avoid ever sitting in that meeting.

Pattern 1: The gateway

Put a service you control between every application and every model. All model traffic (internal tools, product features, experiments) flows through one broker that handles:

Authentication and key custody. Applications authenticate to the gateway with their own identity; provider keys live in one vault, not forty repos.
Logging and audit. Every prompt and completion recorded once, consistently, under one retention and access policy.
Cost attribution and quotas. Per-team, per-application metering. The first month of gateway data almost always finds one experiment quietly burning half the budget. It is never the experiment anyone suspected.
Policy enforcement. Allowed models per data classification, PII redaction, data-handling rules, enforced in code at the choke point.
Provider abstraction and failover. Routing, retries, and the ability to move workloads when a provider degrades or reprices.

This is the same architectural instinct as an API gateway in front of microservices or a firewall between network zones: create one enforcement point you can reason about. It applies from the second team onward. The only context where I skip it is a single-team product with one provider, and even then the client goes behind an interface so a gateway can arrive later without an application rewrite.
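To make the choke point concrete, here is a minimal sketch of a gateway in Python. Everything in it is an assumption for illustration: the `Gateway` class, the `ALLOWED_MODELS` policy table, the model names, and the two provider stubs are hypothetical, and metering by character count stands in for real token accounting. It shows policy enforcement, failover, metering, and audit logging happening in one place rather than in each application.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical policy table: data classification -> models approved to see it.
ALLOWED_MODELS = {
    "public": ["provider_a/large", "provider_b/large"],
    "confidential": ["provider_b/large"],
}


class ProviderError(Exception):
    """Raised by a provider stub to simulate an outage."""


@dataclass
class Gateway:
    # Model name -> callable that sends a prompt and returns a completion.
    providers: dict[str, Callable[[str], str]]
    audit_log: list[dict] = field(default_factory=list)
    usage_by_team: dict[str, int] = field(default_factory=dict)

    def complete(self, team: str, classification: str, prompt: str) -> str:
        candidates = ALLOWED_MODELS.get(classification, [])
        if not candidates:
            raise PermissionError(f"no model approved for {classification!r} data")
        for model in candidates:  # policy order doubles as failover order
            try:
                completion = self.providers[model](prompt)
            except ProviderError:
                continue  # provider degraded: try the next approved model
            # Crude metering by character count; a real meter would count tokens.
            used = len(prompt) + len(completion)
            self.usage_by_team[team] = self.usage_by_team.get(team, 0) + used
            self.audit_log.append(
                {"team": team, "model": model, "prompt": prompt, "completion": completion}
            )
            return completion
        raise RuntimeError("all approved providers failed")


def provider_a(prompt: str) -> str:
    raise ProviderError("simulated outage")


def provider_b(prompt: str) -> str:
    return f"echo: {prompt}"


gateway = Gateway(providers={"provider_a/large": provider_a, "provider_b/large": provider_b})
answer = gateway.complete(team="billing", classification="public", prompt="hello")
```

In this sketch the calling application never holds a provider key and never learns that the first provider failed; it only sees the completion, while the audit log and the per-team meter are filled in as a side effect.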

One warning. A gateway that adds 300 ms and a queue for “prompt review” is not infrastructure; it is bureaucracy with an API. If your gateway slows teams down more than a load balancer does, they will route around it, and then you have sprawl again, except now with a false sense of coverage.

Pattern 2: Retrieval grounding

When the task depends on knowledge, retrieve it at request time instead of hoping the model knows it. Retrieval-augmented generation keeps knowledge current without retraining and makes answers citable to sources. The under-appreciated part is that it keeps knowledge governable: a document you can remove from an index is a fact you can retract. Knowledge baked into weights can be neither removed nor retracted.

Use retrieval grounding when answers must reflect data that changes (policies, tickets, product docs), when answers must be attributable (“per section 4.2 of the current SOW”), or when access to knowledge must respect the caller’s permissions.
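A minimal sketch of what permission-aware, citable retrieval can look like follows. The `Doc` type, the three-document `INDEX`, the group names, and the keyword-overlap scoring are all assumptions chosen to keep the example self-contained; a production system would typically use a vector or hybrid index. The point it illustrates is the ordering: filter by the caller's permissions first, rank second, and carry source ids into the prompt so the answer can cite them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    text: str
    allowed_groups: frozenset[str]


# Hypothetical index contents, for illustration only.
INDEX = [
    Doc("sow-4.2", "Payment terms are net 30 days from invoice date", frozenset({"finance", "legal"})),
    Doc("hr-7", "Parental leave is sixteen weeks", frozenset({"hr"})),
    Doc("kb-12", "Invoice disputes go to the billing queue", frozenset({"finance", "support"})),
]


def retrieve(query: str, caller_groups: set[str], k: int = 2) -> list[Doc]:
    query_terms = set(query.lower().split())
    # Permission filter comes first: documents the caller cannot read are never scored.
    visible = [d for d in INDEX if d.allowed_groups & caller_groups]
    # Toy relevance score: number of query terms that appear in the document.
    scored = [(len(query_terms & set(d.text.lower().split())), d) for d in visible]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
    return [d for _, d in ranked[:k]]


def build_prompt(query: str, docs: list[Doc]) -> str:
    # Each source keeps its id so the model can cite it and a reader can check it.
    sources = "\n".join(f"[{d.doc_id}] {d.text}" for d in docs)
    return f"Answer using only these sources and cite their ids.\n{sources}\n\nQuestion: {query}"


docs = retrieve("what are the payment terms", caller_groups={"finance"})
prompt = build_prompt("what are the payment terms", docs)
```

Removing a document from `INDEX` retracts it from every future answer, which is the governability property described above.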

Skip it when the task is transformation rather than knowledge. Summarizing text the user supplied, reformatting, classifying: retrieval adds nothing there except latency and a new failure surface. And treat fine-tuning as the specialist tool it is, for form and format rather than facts. The full pipeline design, chunking tradeoffs and permission-aware retrieval included, is in RAG architecture for the enterprise.

Pattern 3: Human-in-the-loop gates

Place a human between the model and any irreversible action. The architecture question is not whether humans should review AI output. It is where review is load-bearing and where it is theater, because a gate that is theater is worse than no gate: it produces an approval log that implies oversight nobody actually performed.

Action class	Examples	Control
Reversible, internal	Draft, summarize, tag, route	Autonomous + sampled review
Reversible, external	Suggested reply a human sends	Human approves by design
Irreversible or regulated	Payments, records, customer comms at scale, destructive ops	Explicit HITL gate, logged approver
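The table can be enforced in code by keying the gate on the action class rather than on the feature that triggered the action. The sketch below is one way to do that under stated assumptions: the action names, the `ACTION_CLASSES` mapping, and the `execute` function are hypothetical, and unknown actions are treated as the strictest class by default.

```python
from enum import Enum
from typing import Optional


class ActionClass(Enum):
    REVERSIBLE_INTERNAL = "reversible, internal"
    REVERSIBLE_EXTERNAL = "reversible, external"
    IRREVERSIBLE_OR_REGULATED = "irreversible or regulated"


# Hypothetical mapping from concrete actions to the classes in the table.
ACTION_CLASSES = {
    "draft": ActionClass.REVERSIBLE_INTERNAL,
    "summarize": ActionClass.REVERSIBLE_INTERNAL,
    "suggest_reply": ActionClass.REVERSIBLE_EXTERNAL,
    "send_payment": ActionClass.IRREVERSIBLE_OR_REGULATED,
    "delete_records": ActionClass.IRREVERSIBLE_OR_REGULATED,
}


def execute(action: str, approver: Optional[str] = None) -> dict:
    # Actions nobody classified fall into the strictest class, not the loosest.
    action_class = ACTION_CLASSES.get(action, ActionClass.IRREVERSIBLE_OR_REGULATED)
    needs_human = action_class is not ActionClass.REVERSIBLE_INTERNAL
    if needs_human and approver is None:
        raise PermissionError(f"{action!r} requires a named approver")
    # The returned record doubles as the audit entry with the logged approver.
    return {"action": action, "class": action_class.value, "approver": approver}
```

Calling `execute("summarize")` runs without a human, while `execute("send_payment")` raises until a named approver is supplied.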

Two design rules keep gates honest. First, the approver must see enough context to actually judge; an approve button next to an answer with no sources trains people to click approve, and training people to click approve is precisely what you built the gate to prevent.

Second, measure the override rate. A gate approving 99.9% of actions is either guarding a genuinely good system, which you verify with sampling, or it has decayed into a rubber stamp. It is usually the rubber stamp. Override-rate telemetry belongs on the same dashboards as your other observability signals, because that is what it is.
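Override-rate telemetry is a small computation. The sketch below assumes each gate decision is logged as one of the strings `"approved"`, `"rejected"`, or `"edited"`; those labels, the function names, and both thresholds are hypothetical and would need tuning against your own gates.

```python
def override_rate(decisions: list[str]) -> float:
    """Share of gate decisions where the approver rejected or edited the action."""
    if not decisions:
        return 0.0
    overrides = sum(1 for d in decisions if d != "approved")
    return overrides / len(decisions)


def gate_health(decisions: list[str], min_rate: float = 0.005, min_sample: int = 200) -> str:
    # Both thresholds are illustrative assumptions, not recommended values.
    if len(decisions) < min_sample:
        return "insufficient_data"
    if override_rate(decisions) < min_rate:
        # A near-zero override rate is a prompt to sample and audit, not proof of decay.
        return "audit_for_rubber_stamp"
    return "ok"
```

A gate that approved 999 of its last 1,000 actions would come back as `audit_for_rubber_stamp` here, which is the signal to go and sample what was approved.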

Pattern 4: Eval pipelines in the delivery path

Behavioral tests gate releases the way unit tests do. Every prompt change, model upgrade, or retrieval-index rebuild runs against a golden set before it ships; production outputs are sampled and scored after.

Here is what happens without this. A provider ships a routine model version bump, and the tone and format of your customer-facing output changes overnight. No deploy on your side, no code change, every health check green. I have watched exactly that play out in systems with no evals to catch it, and the first detection mechanism was a customer.

A vendor’s silent upgrade is your unannounced production change. Evals are how you find out before your users do.

The pattern is important enough to get its own article, evaluating AI systems in production. Architecturally the requirement is compact: evals run in CI, eval data is version-controlled, and the gateway from Pattern 1 supplies the production traffic sample worth scoring.
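As a rough illustration of that requirement, the sketch below runs a candidate against a golden set and produces a pass/fail result a CI step could act on. The golden cases, the `passes` checks (required phrases and a length cap), and the `candidate` stand-in for the prompt and model under test are all assumptions; real eval suites usually mix checks like these with scored or model-graded criteria.

```python
from typing import Callable

# Hypothetical golden set; in practice it would live in a version-controlled file.
GOLDEN_SET = [
    {"input": "Reset my password", "must_contain": ["reset link"], "max_chars": 200},
    {"input": "Cancel my order", "must_contain": ["order"], "max_chars": 200},
]


def passes(case: dict, output: str) -> bool:
    text = output.lower()
    within_length = len(output) <= case["max_chars"]
    return within_length and all(phrase in text for phrase in case["must_contain"])


def run_evals(generate: Callable[[str], str], threshold: float = 1.0) -> tuple[float, bool]:
    results = [passes(case, generate(case["input"])) for case in GOLDEN_SET]
    pass_rate = sum(results) / len(results)
    return pass_rate, pass_rate >= threshold


def candidate(prompt: str) -> str:
    # Stand-in for the prompt, model, and retrieval configuration under test.
    if "password" in prompt.lower():
        return "We have emailed you a reset link."
    return "Your order has been cancelled."


pass_rate, ok = run_evals(candidate)
exit_code = 0 if ok else 1  # a CI step would pass this to sys.exit()
```

The same `run_evals` call works against a new model version or a rebuilt index, which is what lets a silent upstream change fail a build instead of reaching a customer.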

Pattern 5: Model portfolio management

Treat models like any other supplier category: a managed portfolio, not a marriage. Capabilities shift quarterly. Prices move. Providers have outages and deprecation schedules, and none of them consult your roadmap first.

The portfolio approach in practice:

Two or three API providers, with the gateway abstracting the differences that can be abstracted and evals catching the ones that cannot.
One open-weight escape hatch, validated against your highest-sensitivity or highest-volume workload. The reasoning is the same as the cloud vs on-prem decision framework: the option has value even if you never exercise it, and the vendor across the negotiating table knows whether you hold it.
Right-sized models per task. Classification and extraction on a small cheap model, complex reasoning on a frontier one. Routing by task is routinely a 10x cost difference at volume, and it is the least glamorous 10x available anywhere in the stack.
A deprecation calendar. Every model in the portfolio has an owner watching its lifecycle, because providers retire versions on their schedule, not yours.
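The right-sizing item is easy to check with arithmetic. The sketch below compares routing by task against sending everything to a frontier model. The prices, tier names, task types, and monthly token volumes are invented for illustration and are not any provider's actual rates; only the shape of the calculation carries over.

```python
from typing import Callable

# Hypothetical prices in dollars per million tokens, chosen to be 10x apart.
PRICE_PER_MILLION_TOKENS = {"small": 0.50, "frontier": 5.00}

# Hypothetical routing table: task type -> model tier.
ROUTES = {"classification": "small", "extraction": "small", "reasoning": "frontier"}


def route(task_type: str) -> str:
    # Unclassified tasks default to the most capable tier rather than the cheapest.
    return ROUTES.get(task_type, "frontier")


def monthly_cost(volumes: dict[str, int], router: Callable[[str], str]) -> float:
    return sum(
        tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS[router(task)]
        for task, tokens in volumes.items()
    )


# Hypothetical monthly token volumes per task type.
volumes = {"classification": 800_000_000, "extraction": 150_000_000, "reasoning": 50_000_000}

routed = monthly_cost(volumes, route)
everything_on_frontier = monthly_cost(volumes, lambda task: "frontier")
```

With these made-up numbers, `routed` comes to 725.0 and `everything_on_frontier` to 5000.0. Each routed task is 10x cheaper, and the blended saving depends on how much of the volume genuinely needs the frontier tier.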

What makes a portfolio real rather than aspirational is the eval suite. Switching providers is only cheap if you can prove equivalence in an afternoon instead of a quarter of UAT. Without evals, the portfolio is a slide.

Anti-patterns I keep meeting

Direct-to-API sprawl. Every team with a corporate card integrates a provider directly. You discover it through the finance report or, worse, through a data-handling question nobody can answer. The fix is the gateway plus an amnesty period: punish nothing, migrate everything. Punishment drives the next integration further underground.

Fine-tuning as a first resort. Teams reach for training because it feels like real ML work. The result is a snapshot of last year’s knowledge, a pipeline to maintain, and no citation trail. Retrieval first. Tune for form later, if the evals say you need to.

Agent maximalism. Wiring a model to a dozen tools with standing credentials and calling it automation. Autonomy has to be earned per action class through eval evidence, with tool scopes issued per task. The LLM security threat model spells out why standing tool access plus untrusted input is a breach waiting for a prompt.
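One way to picture tool scopes issued per task is a short-lived grant that names the tools a single task may call. The sketch below is an assumption-laden illustration, not a security design: `ToolGrant`, `issue_grant`, `authorize`, the tool names, and the five-minute default lifetime are all hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToolGrant:
    task_id: str
    tools: frozenset[str]
    expires_at: float  # Unix timestamp after which the grant is dead


def issue_grant(task_id: str, tools: set[str], ttl_seconds: float = 300,
                now: Optional[float] = None) -> ToolGrant:
    now = time.time() if now is None else now
    return ToolGrant(task_id=task_id, tools=frozenset(tools), expires_at=now + ttl_seconds)


def authorize(grant: ToolGrant, tool: str, now: Optional[float] = None) -> bool:
    now = time.time() if now is None else now
    # The tool must be named in this task's grant and the grant must still be live.
    return tool in grant.tools and now < grant.expires_at


grant = issue_grant("ticket-triage-001", {"search_tickets"}, ttl_seconds=60, now=1000.0)
```

Under this sketch an agent handling `ticket-triage-001` can search tickets for a minute and nothing else; a prompt injection that asks for a refund tool finds no credential to use.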

The pilot that never graduates. A demo becomes load-bearing without ever passing through the patterns above. The test is simple: if people would notice its absence, it is production. Retrofit the inventory entry, the evals, and the gateway routing now, on your schedule, rather than during the incident, on the incident’s schedule.

Decisions to write down

For each AI system, record four things: which patterns apply and which were deliberately skipped, with the reason; the action classes and where the HITL gates sit; the primary and fallback models, with the eval evidence behind the fallback; and the data classification the system is approved to touch. Four short decision records. The “deliberately skipped” entries are the ones your successor will thank you for.
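The four records fit in one small structure that can be stored next to the system's code. The sketch below is one possible shape, with every field name, the `problems` check, and the example values being assumptions; the check simply flags skipped patterns with no reason and fallbacks with no eval evidence.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class DecisionRecord:
    system: str
    patterns_applied: list[str]
    patterns_skipped: dict[str, str]  # pattern -> reason it was deliberately skipped
    hitl_gates: dict[str, str]        # action class -> control
    primary_model: str
    fallback_model: str
    fallback_eval_evidence: str       # pointer to the eval run backing the fallback
    approved_data_classification: str

    def problems(self) -> list[str]:
        issues = [
            f"skipped pattern {pattern!r} has no reason"
            for pattern, reason in self.patterns_skipped.items()
            if not reason.strip()
        ]
        if not self.fallback_eval_evidence.strip():
            issues.append("fallback model has no eval evidence")
        return issues


# Hypothetical example values throughout.
record = DecisionRecord(
    system="support-reply-assistant",
    patterns_applied=["gateway", "retrieval grounding", "evals"],
    patterns_skipped={"model portfolio": ""},
    hitl_gates={"reversible, external": "human approves by design"},
    primary_model="provider_a/large",
    fallback_model="provider_b/large",
    fallback_eval_evidence="",
    approved_data_classification="confidential",
)
issues = record.problems()
serialized = json.dumps(asdict(record), indent=2)
```

Here `issues` lists two gaps, the unexplained skip and the unproven fallback, which is the kind of omission worth catching in review rather than during a provider outage.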

None of these patterns is exotic, and that is the point worth sitting with. Enterprise AI architecture succeeds by applying the boring disciplines (choke points, grounding, gates, tests, supplier management) to a component the whole industry is currently tempted to treat as magic. Components come and go. The disciplines are what hold.

Frequently asked questions

What is an AI gateway and do I need one?

An AI gateway is a broker service between every application and every model provider, handling authentication, logging, rate limits, cost attribution, policy enforcement, and failover in one place. You need one the moment a second team or a second provider appears. Without it, each application reimplements those controls its own way, which in practice means most of them implement nothing and you find out during an incident.

Should we fine-tune a model or use RAG?

Default to retrieval. RAG handles knowledge that changes, keeps data governable and citable, and needs no training pipeline. Fine-tuning earns its cost when the problem is form rather than facts: consistent style, a constrained output format, a narrow classification task, or latency budgets that rule out long contexts. Teams that start with fine-tuning routinely discover they have encoded last year's knowledge in a format they can no longer update.

Where does human-in-the-loop actually belong?

At the point of irreversibility. Drafting, summarizing, classifying, and recommending can run autonomously with sampled review. Sending money, contacting a customer, changing a record of legal significance, or executing a destructive operation gets a human gate. Tie the gate to the action class, not to the AI feature, because gates attached to features rather than consequences erode into rubber stamps within a quarter.

How many model providers should an enterprise use?

Deliberately more than one, chaotically no more than a few. Two or three API providers plus one validated open-weight option covers capability differences, price movement, and provider outages. The gateway is what makes the portfolio cheap to hold; without one, every additional provider multiplies your integration and audit surface instead of your leverage. The eval suite is what makes switching real rather than theoretical.