Private inference is often treated as a separate infrastructure strategy: cloud versus on-premise, API versus local model, hosted versus firm-controlled execution.
That framing misses the more useful point: for legal work, private inference is a routing decision. It is one possible route for work that carries a higher confidentiality, privilege, regulatory or client sensitivity profile. Some tasks should run in a firm-controlled environment, some may need a private cloud or isolated execution layer, some are fine on a standard commercial API with the right logging, review and release controls, and some should not use AI at all.
The harder question is not whether a firm believes in private inference, but when the work actually requires it.
Trust becomes weaker as work gets closer to the matter
Most current legal AI deployments still rely on a combination of trust, contracts, access controls and vendor assurances. For a large amount of work, that may be reasonable. Public legal research, internal drafting support, low-sensitivity summarisation and general knowledge tasks do not all need the same infrastructure posture.
The position changes as the work moves closer to live client material. Internal investigations, regulatory exposure analysis, pre-transaction material, privileged notes, litigation strategy and sensitive board-level advice all carry a different risk profile. At that point, the issue is not only whether a provider has good contractual terms, but whether the firm can justify where the data went, who could access it, what technical boundaries applied and why that route was acceptable for that task.
Private inference matters because it gives firms a stronger route for work where trust alone is not enough.
The route should depend on task shape, not model preference
The routing layer should decide where work runs based on the nature of the task, not on a general preference for one model or deployment pattern.
That decision should take account of:
- the client and matter sensitivity
- the jurisdiction and data handling constraints
- the type of document or material being processed
- the intended destination of the output
- whether the work may later be relevant to disclosure, regulatory review or internal investigation
- the level of reliance being placed on the result
- whether human review is required before the output can be used
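To make this concrete, the criteria above can be sketched as a simple routing function. This is a minimal illustration, not a real product API: the `Route` values, the `Task` fields and the thresholds are all hypothetical, and a real policy would weigh far more factors.

```python
from dataclasses import dataclass
from enum import Enum

# All names below are illustrative, not a real routing product.

class Route(Enum):
    COMMERCIAL_API = "commercial_api"   # standard API with logging and release controls
    PRIVATE_CLOUD = "private_cloud"     # isolated execution layer
    ON_PREM = "on_prem"                 # firm-controlled environment
    BLOCKED = "blocked"                 # should not use AI at all

@dataclass
class Task:
    sensitivity: int           # 0 = public, up to 3 = privileged / board-level
    cross_border: bool         # jurisdiction or data-handling constraint applies
    disclosure_relevant: bool  # may later matter to disclosure or regulatory review
    review_required: bool      # human review needed before the output can be used

def route_task(task: Task) -> Route:
    """Decide where the work runs from the shape of the task, not model preference."""
    if task.sensitivity >= 3 and not task.review_required:
        return Route.BLOCKED        # high-reliance privileged work without review
    if task.sensitivity >= 3:
        return Route.ON_PREM        # privileged material stays firm-controlled
    if task.sensitivity == 2 or task.cross_border or task.disclosure_relevant:
        return Route.PRIVATE_CLOUD
    return Route.COMMERCIAL_API
```

The point of the sketch is only that the decision is a function of task attributes: the same model can sit behind more than one route, and the route, not the model, is what the policy constrains.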
A model summarising a public consultation paper is not the same as a model reviewing privileged investigation material. A tool helping a lawyer draft an internal first pass is not the same as a system producing client-facing advice. The same model may be acceptable in one route and unacceptable in another, depending on context, destination and consequence.
That is why private inference belongs inside the routing layer, rather than sitting outside the workflow as a specialist environment that people remember to use only when a matter feels unusually sensitive.
Private inference is a control, not a guarantee
Private inference can strengthen the technical boundary around sensitive work. In some designs, models run inside isolated environments where even the infrastructure provider cannot inspect the data being processed. In others, the value is firm-controlled deployment, local execution or tighter control over retention, access and monitoring.
That is a meaningful shift from pure policy control to technical control, but it does not make the workflow safe by itself. A private model can still produce poor analysis. A local deployment can still be badly tested. An on-premise system can still expose risk if the output is copied into the wrong place, sent to the wrong audience or relied on without review.
Private inference reduces one category of risk, but it does not remove the need for judgement, evaluation or execution gates.
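An execution gate of the kind mentioned here can be as simple as a release check that runs before any output leaves the workflow. The function and labels below are hypothetical, assuming a three-way decision (approve, block, escalate) of the sort described later in this piece.

```python
def execution_gate(sensitivity: str, reviewed: bool) -> str:
    """Illustrative release check, run before an output is used or sent on."""
    if sensitivity == "privileged" and not reviewed:
        return "blocked"     # privileged output is never released unreviewed
    if sensitivity == "privileged":
        return "escalated"   # reviewed, but senior sign-off is still required
    if reviewed or sensitivity == "public":
        return "approved"
    return "escalated"       # unreviewed non-public output goes up, not out
```

Even a gate this crude addresses the failure mode described above: a technically well-contained output copied into the wrong place or relied on without review.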
Smaller models may be the operational answer
There is a tendency to treat private inference as a debate about whether open or smaller models can compete with frontier systems. For many legal workflows, that is the wrong comparison.
Many legal workflows are structured, repetitive and bounded. They do not always require the strongest available general-purpose model. They require a model that performs a defined task consistently, within a controlled workflow, with known failure modes and measurable review outcomes.
For some work, a frontier model will still be the right route. For other work, a smaller model running in a controlled environment may be better aligned to the job because it is faster, cheaper, easier to constrain and easier to evaluate against a narrow task.
The routing layer should make that choice explicit. Capability matters, but so do confidentiality, cost, latency, evidence, repeatability and control.
Evaluation becomes part of the routing decision
Once a firm chooses to run work through a model it controls, the burden shifts. It is no longer enough to rely on a provider’s general capability claims or a broad sense that the output looks good.
The firm needs evidence that the route works for the task, which means defining the work clearly, testing against representative examples, setting measurable thresholds and monitoring whether the system degrades over time. Evaluation should not only ask whether the first answer looked plausible. It should measure rework, variance, missed issues, escalation rates and whether reviewers are repeatedly correcting the same failure patterns.
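The monitoring measures above reduce to simple arithmetic over review outcomes. The record format below is invented for illustration: each review is a tuple of task id, outcome and the failure pattern a reviewer corrected, if any.

```python
from collections import Counter

# Hypothetical review log: (task_id, outcome, failure_tag)
# outcome is "accepted", "reworked" or "escalated"; failure_tag is None if clean.
reviews = [
    ("t1", "accepted", None),
    ("t2", "reworked", "missed_clause"),
    ("t3", "reworked", "missed_clause"),
    ("t4", "escalated", "wrong_jurisdiction"),
    ("t5", "accepted", None),
]

total = len(reviews)
rework_rate = sum(1 for _, o, _ in reviews if o == "reworked") / total
escalation_rate = sum(1 for _, o, _ in reviews if o == "escalated") / total

# Repeated failure patterns: the same correction made more than once
tags = Counter(t for _, _, t in reviews if t)
repeated = {tag: n for tag, n in tags.items() if n > 1}

print(f"rework {rework_rate:.0%}, escalation {escalation_rate:.0%}, repeated: {repeated}")
```

Thresholds on figures like these, rather than a sense that "the first answer looked plausible", are what give a firm evidence that a route still works for the task.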
This matters for all AI routes, not only private inference. The difference is that private inference makes the firm’s choices more visible. If the firm selects a smaller model, a private deployment or a particular controlled route, it should be able to explain why that route was appropriate and what evidence supported the decision.
What the routing layer should record
A useful routing layer should not simply send the task to a model and store the answer. It should record the decision around the work, including:
- the task type
- the matter or workflow context
- the sensitivity profile
- the permitted inference environment
- the model or tool route selected
- the policy version applied
- the evidence or test basis supporting that route
- the review requirement
- the destination of the output
- whether an execution gate approved, blocked or escalated use
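The record described by the list above maps naturally onto a small structured object. The field names and example values here are illustrative only, mirroring the bullets rather than any real system's schema.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json

@dataclass
class RoutingRecord:
    # Hypothetical schema: one field per item in the list above.
    task_type: str
    matter_context: str
    sensitivity_profile: str
    permitted_environment: str
    route_selected: str
    policy_version: str
    evidence_basis: str
    review_requirement: str
    output_destination: str
    gate_decision: str   # "approved", "blocked" or "escalated"
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

record = RoutingRecord(
    task_type="privileged_summarisation",
    matter_context="internal-investigation",
    sensitivity_profile="privileged",
    permitted_environment="on_prem",
    route_selected="local-model",
    policy_version="routing-policy-7.1",
    evidence_basis="eval-run-2024-q3",
    review_requirement="partner_review",
    output_destination="internal_memo",
    gate_decision="approved",
)
print(json.dumps(asdict(record), indent=2))
```

Stored alongside the prompt log, a record like this answers the question the next paragraph poses: not just what the model said, but why this route was allowed for this work.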
That record matters because legal AI governance cannot depend on isolated prompt logs. The question is not only "what did the model say?" It is "why was this route allowed for this work, and what controls applied before the result was used?"
The real question
Private inference should not become a blanket answer. Not every task needs it, and treating all legal AI work as equally sensitive creates cost, delay and unnecessary complexity.
The opposite mistake is worse: pushing every task through the same hosted model because it is convenient, then treating contractual comfort as enough for material that deserves stronger protection.
Private inference should be available where the work demands it, bypassed where it does not, and tied to evidence rather than instinct. It should sit alongside commercial APIs, private cloud deployments, local models, retrieval systems, human review and blocked routes as part of the firm’s wider orchestration model.
The real control point is not the infrastructure choice in isolation, but the routing decision that explains why a particular route was allowed for a particular piece of work.