Orchestrate.legal

Evaluating legal AI beyond accuracy

A practical framework for testing whether AI outputs can be relied on, not just whether they look correct.

evaluation · working

This reflects current thinking and may change as the approach develops.

Who it is for

Teams responsible for deciding whether an AI workflow can move from testing into live legal work.

If your current evaluation is based on a handful of prompts and a general sense that "it looks right", this will challenge that.

Why it matters

Accuracy is an input metric: it tells you how the model performs on isolated test items, not whether the workflow built around it can be relied on.

Legal risk shows up in how work unfolds:

  • what gets missed
  • what gets assumed
  • what gets escalated too late
  • what gets used without enough evidence

A system can produce convincing text and still fail operationally.

Most failures are not obvious in a single response. They emerge across a sequence of decisions, often under time pressure, often with partial context.

If you only test outputs in isolation, you are not testing the system you are actually deploying.

The shift to make

Stop evaluating answers.

Start evaluating decisions in context.

That means:

  • what the task is trying to achieve
  • what state the matter is in
  • what the output will be used for
  • what happens if the output is wrong or incomplete

Evaluation needs to reflect how legal work actually moves, not how prompts are written.

Practical steps

1. Define the task in terms of the decision it supports

Do not start with "summarise this document".

Start with:

  • what decision depends on this output?
  • what happens if the output is incomplete or wrong?
  • what level of reliance is expected?

A clause summary used for internal orientation is not the same as one feeding into advice.

This definition anchors everything that follows.
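To make this concrete, the definition can be captured as a small record rather than left implicit. A minimal sketch in Python; the field names and the reliance levels (which mirror step 2 below) are illustrative, not a standard:

```python
# A minimal sketch of a task definition record. Field names and the
# Reliance levels are illustrative, not a standard; adapt to your tooling.
from dataclasses import dataclass
from enum import Enum

class Reliance(Enum):
    INFORM_ONLY = "inform-only"  # orients a human, nothing more
    ASSISTIVE = "assistive"      # drafts work a human revises
    ADOPTED = "adopted"          # used largely as-is
    BINDING = "binding"          # feeds directly into advice

@dataclass
class TaskDefinition:
    name: str                # e.g. "clause summary for renewal review"
    decision_supported: str  # the downstream decision this output feeds
    failure_impact: str      # what happens if wrong or incomplete
    reliance: Reliance       # expected level of reliance

task = TaskDefinition(
    name="clause summary",
    decision_supported="whether to escalate non-standard indemnity terms",
    failure_impact="unusual indemnity exposure reaches negotiation unflagged",
    reliance=Reliance.ASSISTIVE,
)
```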

2. Build scenario sets that reflect real work

Most teams overfit to prompt variations.

Instead, structure scenarios across:

  • matter phase (intake, review, negotiation, execution, monitoring)
  • sensitivity (standard, confidential, privileged, high-risk)
  • reliance level (inform-only, assistive, adopted, binding)

This gives you coverage of how the system behaves under different conditions, not just different wording.

A small, well-structured scenario bank is more valuable than hundreds of loosely defined prompts.
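One way to structure the bank, as a sketch: treat the three dimensions as axes and enumerate the grid, filling in only the cells that occur in real work. The Scenario fields below are hypothetical placeholders:

```python
# A minimal sketch of a structured scenario bank. The axes mirror the
# three dimensions above; the Scenario fields are hypothetical.
import itertools
from dataclasses import dataclass

PHASES = ["intake", "review", "negotiation", "execution", "monitoring"]
SENSITIVITIES = ["standard", "confidential", "privileged", "high-risk"]
RELIANCE = ["inform-only", "assistive", "adopted", "binding"]

@dataclass
class Scenario:
    scenario_id: str
    phase: str
    sensitivity: str
    reliance: str
    prompt: str                 # the actual input given to the system
    expected_points: list[str]  # facts or clauses the output must capture

# Enumerate the grid, then fill in only the cells that occur in real
# work. Even partial coverage makes the gaps visible.
grid = list(itertools.product(PHASES, SENSITIVITIES, RELIANCE))
print(f"{len(grid)} conditions to consider")  # 5 x 4 x 4 = 80
```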

3. Score what actually matters

Text quality alone is insufficient.

Score across dimensions that reflect legal use:

  • Factuality
    Are statements grounded in the source material?

  • Completeness
    What was omitted that should have been captured?

  • Traceability
    Can outputs be linked back to source clauses or evidence?

  • Uncertainty handling
    Does the system flag ambiguity, gaps, or assumptions?

  • Policy compliance
    Does the output respect defined constraints, including jurisdiction and confidentiality?

You are not looking for perfect scores. You are looking for predictable behaviour.
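A minimal scoring sketch, assuming reviewer ratings on a simple 0-4 scale per dimension. The scale and normalisation are illustrative; the point is that every dimension is recorded for every scenario run, so behaviour can be compared run to run:

```python
# A minimal scoring sketch, one record per scenario run. The 0-4 scale
# is illustrative; what matters is that every dimension is always scored.
DIMENSIONS = ["factuality", "completeness", "traceability",
              "uncertainty_handling", "policy_compliance"]

def score_output(ratings: dict[str, int]) -> dict[str, float]:
    """Normalise reviewer ratings (0-4) to 0.0-1.0 per dimension."""
    missing = set(DIMENSIONS) - set(ratings)
    if missing:
        raise ValueError(f"unscored dimensions: {missing}")
    return {dim: ratings[dim] / 4.0 for dim in DIMENSIONS}

scores = score_output({
    "factuality": 4, "completeness": 2, "traceability": 3,
    "uncertainty_handling": 3, "policy_compliance": 4,
})
# High factuality with low completeness is the classic legal failure
# mode: everything stated is true, but something material was missed.
```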

4. Track operational metrics

This is where most evaluation frameworks fall short.

You need to measure how the system behaves once humans interact with it:

  • Rework rate
    How often outputs need correction before use

  • Reviewer disagreement
    Where human reviewers diverge on whether something is acceptable

  • Escalation frequency
    How often outputs trigger higher-level review

  • Time-to-release
    How long it takes for an output to pass through gates and be used

These metrics tell you whether the system reduces or introduces friction.
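All four can be computed directly from a pilot review log. A sketch, assuming your review tooling records something like the fields below (the names are hypothetical):

```python
# A minimal sketch of the four metrics computed from a pilot review log.
# The ReviewEvent fields are hypothetical; map them to whatever your
# review tooling actually records.
from dataclasses import dataclass

@dataclass
class ReviewEvent:
    output_id: str
    needed_rework: bool            # corrected before use
    reviewer_verdicts: list[bool]  # accept/reject per reviewer
    escalated: bool                # triggered higher-level review
    hours_to_release: float        # time through all gates

def operational_metrics(events: list[ReviewEvent]) -> dict[str, float]:
    n = len(events)
    split = sum(1 for e in events if len(set(e.reviewer_verdicts)) > 1)
    return {
        "rework_rate": sum(e.needed_rework for e in events) / n,
        "reviewer_disagreement": split / n,
        "escalation_frequency": sum(e.escalated for e in events) / n,
        # upper median keeps the sketch simple
        "hours_to_release": sorted(e.hours_to_release for e in events)[n // 2],
    }
```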

5. Test stability over time

One-off evaluation gives false confidence.

Rerun the same scenarios:

  • across model updates
  • across prompt or policy changes
  • at regular intervals

You are looking for drift:

  • silent degradation
  • inconsistent behaviour
  • changes in what gets omitted or flagged

Stability is as important as raw performance.
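A drift check can be as simple as re-running the scenario bank and comparing per-dimension scores against the stored baseline. A sketch, with an illustrative tolerance:

```python
# A minimal drift check: re-run the scenario bank, compare per-dimension
# scores against the stored baseline. The 0.05 tolerance is illustrative.
def drift_report(baseline: dict[str, float],
                 rerun: dict[str, float],
                 tolerance: float = 0.05) -> dict[str, float]:
    """Return dimensions whose score moved by more than `tolerance`."""
    return {dim: round(rerun[dim] - baseline[dim], 3)
            for dim in baseline
            if abs(rerun[dim] - baseline[dim]) > tolerance}

baseline = {"factuality": 0.95, "completeness": 0.80, "traceability": 0.90}
rerun    = {"factuality": 0.94, "completeness": 0.68, "traceability": 0.91}
print(drift_report(baseline, rerun))
# {'completeness': -0.12} -> silent degradation: omissions increased
# even though spot checks still look fine.
```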

6. Define explicit go or no-go thresholds

Most pilots fail here.

Without thresholds, decisions default to enthusiasm or pressure to ship.

Define in advance:

  • minimum acceptable scores across dimensions
  • maximum tolerated rework or escalation rates
  • conditions that require redesign rather than iteration

If thresholds are not met, the system does not move forward. That needs to be a real outcome, not a theoretical one.
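A sketch of what that looks like in practice: thresholds written down as data, and a gate that returns a hard answer with reasons. Every value here is illustrative, not a recommendation:

```python
# A minimal go/no-go gate. Thresholds are illustrative; the point is
# that they are written down before the pilot, and the gate returns a
# hard answer with explicit reasons.
MIN_SCORES = {"factuality": 0.95, "completeness": 0.85,
              "traceability": 0.90, "uncertainty_handling": 0.85,
              "policy_compliance": 1.00}
MAX_REWORK_RATE = 0.15
MAX_ESCALATION_RATE = 0.20

def release_decision(scores: dict[str, float],
                     rework_rate: float,
                     escalation_rate: float) -> tuple[bool, list[str]]:
    reasons = [f"{dim}: {scores.get(dim, 0.0):.2f} below {floor:.2f}"
               for dim, floor in MIN_SCORES.items()
               if scores.get(dim, 0.0) < floor]
    if rework_rate > MAX_REWORK_RATE:
        reasons.append(f"rework rate {rework_rate:.0%} over limit")
    if escalation_rate > MAX_ESCALATION_RATE:
        reasons.append(f"escalation rate {escalation_rate:.0%} over limit")
    return (not reasons, reasons)  # a no-go always carries its reasons
```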

Minimum artefacts

A workable evaluation setup requires a small set of concrete artefacts:

  • Evaluation rubric
    The scoring model across factuality, completeness, traceability, uncertainty, and compliance

  • Scenario bank
    A structured set of test cases aligned to matter phase, sensitivity, and reliance

  • Failure taxonomy
    A defined set of failure types with clear escalation paths

  • Monitoring plan
    How performance will be tracked and reassessed post-release

If these are not written down, evaluation will drift into opinion.
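As an example, the failure taxonomy can be sketched in the same style. The categories and escalation routes below are illustrative:

```python
# A minimal sketch of a failure taxonomy with escalation paths. The
# categories and routes are illustrative; the artefact only works if
# every pilot failure is filed under exactly one entry.
FAILURE_TAXONOMY = {
    "hallucinated_fact":   "block release; route to system owner",
    "material_omission":   "block release; route to legal reviewer",
    "broken_traceability": "flag; require source links before reuse",
    "unflagged_ambiguity": "flag; review uncertainty-handling behaviour",
    "policy_breach":       "block release; route to risk or compliance",
}

def escalation_path(failure_type: str) -> str:
    if failure_type not in FAILURE_TAXONOMY:
        # An unclassifiable failure means the taxonomy needs extending,
        # which is itself a finding worth recording.
        raise KeyError(f"unknown failure type: {failure_type}")
    return FAILURE_TAXONOMY[failure_type]
```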

Where this usually breaks

Common patterns:

  • Evaluation focuses on best-case examples rather than realistic scenarios
  • Failure is treated as an exception rather than something to classify and learn from
  • Prompt tweaks are used to mask underlying issues in task design
  • Operational metrics are ignored because they are harder to measure
  • Model updates are deployed without re-running baseline scenarios

None of these are technical limitations. They are process gaps.

What good looks like

A defensible evaluation setup has a few clear properties:

  • Tasks are defined in terms of downstream decisions, not abstract outputs
  • Scenarios reflect how matters evolve over time
  • Scoring captures omission and uncertainty, not just correctness
  • Human interaction is measured, not assumed
  • Stability is tested continuously, not once
  • Release decisions are tied to evidence, not confidence

At that point, evaluation becomes part of the system, not a one-off exercise.

Testing cadence

A simple cadence is enough if it is followed consistently:

  1. Pre-pilot baseline
    Establish initial performance across the scenario bank

  2. Weekly pilot review
    Track operational metrics and emerging failure patterns

  3. Pre-production decision
    Assess against defined thresholds

  4. Monthly production checks
    Re-run scenarios and review drift

Skip any of these and issues will accumulate quietly.
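One way to keep the cadence honest, as a sketch: record when each recurring check last ran and flag anything overdue. Intervals are illustrative:

```python
# A minimal sketch that makes the cadence enforceable: record when each
# recurring check last ran and flag anything overdue.
from datetime import date, timedelta

CADENCE = {
    "weekly pilot review": timedelta(days=7),
    "monthly production check": timedelta(days=30),
}

def overdue(last_run: dict[str, date], today: date) -> list[str]:
    return [check for check, interval in CADENCE.items()
            if today - last_run.get(check, date.min) > interval]

print(overdue({"weekly pilot review": date(2024, 1, 1)},
              today=date(2024, 1, 15)))
# ['weekly pilot review', 'monthly production check']
```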

Checklist

  • Tasks are defined in terms of decisions and reliance
  • Scenario bank reflects matter phase, sensitivity, and usage
  • Scoring includes completeness and uncertainty, not just accuracy
  • Operational metrics are tracked during pilot
  • Stability tests are scheduled and executed
  • Go or no-go thresholds are defined and enforced
  • Failures are classified and linked to escalation paths
  • Production monitoring is planned before release

Related

Pair this with matter-state workflows.

If your scenarios do not reflect how work changes over time, your evaluation will miss the failures that actually matter.