Who it is for
Teams responsible for deciding whether an AI workflow can move from testing into live legal work.
If your current evaluation is based on a handful of prompts and a general sense that "it looks right", this will challenge that.
Why it matters
Accuracy is an input metric, not an outcome.
Legal risk shows up in how work unfolds:
- what gets missed
- what gets assumed
- what gets escalated too late
- what gets used without enough evidence
A system can produce convincing text and still fail operationally.
Most failures are not obvious in a single response. They emerge across a sequence of decisions, often under time pressure, often with partial context.
If you only test outputs in isolation, you are not testing the system you are actually deploying.
The shift to make
Stop evaluating answers.
Start evaluating decisions in context.
That means:
- what the task is trying to achieve
- what state the matter is in
- what the output will be used for
- what happens if the output is wrong or incomplete
Evaluation needs to reflect how legal work actually moves, not how prompts are written.
Practical steps
1. Define the task in terms of the decision it supports
Do not start with "summarise this document".
Start with:
- what decision depends on this output?
- what happens if the output is incomplete or wrong?
- what level of reliance is expected?
A clause summary used for internal orientation is not the same as one feeding into advice.
This definition anchors everything that follows.
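One way to make this anchoring concrete is to record the answers per task in a small structured record. The field names and reliance labels below are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskDefinition:
    task: str                 # the surface task, e.g. "clause summary"
    decision_supported: str   # what decision depends on this output?
    failure_impact: str       # what happens if the output is incomplete or wrong?
    reliance_level: str       # e.g. "inform-only", "assistive", "adopted", "binding"


# The same surface task, defined twice: the decision and reliance differ,
# so the evaluation bar should differ too.
orientation = TaskDefinition(
    task="clause summary",
    decision_supported="internal orientation on deal terms",
    failure_impact="reviewer re-reads the source document",
    reliance_level="inform-only",
)

advice_input = TaskDefinition(
    task="clause summary",
    decision_supported="advice to client on termination rights",
    failure_impact="incorrect advice reaches the client",
    reliance_level="adopted",
)
```

Two records with the same task string but different decisions are different evaluation targets.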
2. Build scenario sets that reflect real work
Most teams overfit to prompt variations.
Instead, structure scenarios across:
- matter phase (intake, review, negotiation, execution, monitoring)
- sensitivity (standard, confidential, privileged, high-risk)
- reliance level (inform-only, assistive, adopted, binding)
This gives you coverage of how the system behaves under different conditions, not just different wording.
A small, well-structured scenario bank is more valuable than hundreds of loosely defined prompts.
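The three axes above define a coverage grid, which makes "well-structured" measurable. A minimal sketch, using the axis values from the list (the `coverage` helper is an assumption, not a standard tool):

```python
from itertools import product

MATTER_PHASES = ["intake", "review", "negotiation", "execution", "monitoring"]
SENSITIVITIES = ["standard", "confidential", "privileged", "high-risk"]
RELIANCE_LEVELS = ["inform-only", "assistive", "adopted", "binding"]

# Full grid is 5 * 4 * 4 = 80 cells. A small bank that touches each cell
# once beats hundreds of prompt variations piled into a few cells.
scenario_grid = [
    {"phase": p, "sensitivity": s, "reliance": r}
    for p, s, r in product(MATTER_PHASES, SENSITIVITIES, RELIANCE_LEVELS)
]


def coverage(scenarios):
    """Fraction of grid cells hit by at least one scenario."""
    hit = {(s["phase"], s["sensitivity"], s["reliance"]) for s in scenarios}
    return len(hit) / (len(MATTER_PHASES) * len(SENSITIVITIES) * len(RELIANCE_LEVELS))
```

Reporting coverage against the grid exposes the overfitting failure mode directly: fifty prompt variations in one cell still score as one cell covered.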
3. Score what actually matters
Text quality alone is insufficient.
Score across dimensions that reflect legal use:
- Factuality: are statements grounded in the source material?
- Completeness: what was omitted that should have been captured?
- Traceability: can outputs be linked back to source clauses or evidence?
- Uncertainty handling: does the system flag ambiguity, gaps, or assumptions?
- Policy compliance: does the output respect defined constraints, including jurisdiction and confidentiality?
You are not looking for perfect scores. You are looking for predictable behaviour.
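"Predictable behaviour" can be operationalised as low spread across repeated runs of the same scenario. A sketch, assuming a 0-4 per-dimension scale (the scale and helper names are assumptions):

```python
DIMENSIONS = ["factuality", "completeness", "traceability",
              "uncertainty_handling", "policy_compliance"]


def validate_scores(scores: dict) -> dict:
    """Normalise a raw score dict: any dimension a reviewer skipped
    defaults to the worst score, so omissions are never silent."""
    return {d: scores.get(d, 0) for d in DIMENSIONS}


def spread(runs: list) -> dict:
    """Per-dimension max-minus-min across repeated runs of one scenario.
    Large spread means unpredictable behaviour even if averages look fine."""
    return {
        d: max(r[d] for r in runs) - min(r[d] for r in runs)
        for d in DIMENSIONS
    }
```

A scenario that scores 4 one day and 1 the next on completeness is a bigger release risk than one that scores a steady 3.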
4. Track operational metrics
This is where most evaluation frameworks fall short.
You need to measure how the system behaves once humans interact with it:
- Rework rate: how often outputs need correction before use
- Reviewer disagreement: where human reviewers diverge on whether something is acceptable
- Escalation frequency: how often outputs trigger higher-level review
- Time-to-release: how long it takes for an output to pass through gates and be used
These metrics tell you whether the system reduces or introduces friction.
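All four metrics fall out of an ordinary review log. A minimal sketch, assuming each review record carries `corrected`, `escalated`, and a list of per-reviewer accept verdicts (field names are assumptions):

```python
def operational_metrics(reviews: list) -> dict:
    """Compute pilot metrics from review records of the assumed shape
    {'corrected': bool, 'escalated': bool, 'verdicts': [bool, ...]}."""
    n = len(reviews)
    disagreements = sum(1 for r in reviews if len(set(r["verdicts"])) > 1)
    return {
        "rework_rate": sum(r["corrected"] for r in reviews) / n,
        "escalation_frequency": sum(r["escalated"] for r in reviews) / n,
        "reviewer_disagreement": disagreements / n,
    }
```

The point is that these numbers come from logging decisions reviewers already make; no new evaluation machinery is needed, only the discipline to record outcomes.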
5. Test stability over time
One-off evaluation gives false confidence.
Rerun the same scenarios:
- across model updates
- across prompt or policy changes
- at regular intervals
You are looking for drift:
- silent degradation
- inconsistent behaviour
- changes in what gets omitted or flagged
Stability is as important as raw performance.
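Drift detection here is just a per-scenario comparison against the stored baseline. A sketch, with an assumed tolerance (tune it to your rubric's scale):

```python
def detect_drift(baseline: dict, current: dict, tolerance: float = 0.5) -> list:
    """Flag scenario ids whose aggregate score moved by more than
    `tolerance` between a baseline run and the current run.
    Both inputs map scenario id -> aggregate score."""
    return sorted(
        sid for sid in baseline
        if abs(current.get(sid, 0.0) - baseline[sid]) > tolerance
    )
```

Note that a scenario missing from the current run scores as 0.0 and is flagged, which catches silent degradation by omission, not just by lower scores.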
6. Define explicit go or no-go thresholds
Most pilots fail here.
Without thresholds, decisions default to enthusiasm or pressure to ship.
Define in advance:
- minimum acceptable scores across dimensions
- maximum tolerated rework or escalation rates
- conditions that require redesign rather than iteration
If thresholds are not met, the system does not move forward. That needs to be a real outcome, not a theoretical one.
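Making "does not move forward" a real outcome is easiest when the gate is mechanical. A sketch with illustrative threshold values (agree on your own before the pilot starts):

```python
THRESHOLDS = {  # illustrative numbers, not recommendations
    "min_scores": {"factuality": 3.5, "completeness": 3.0, "traceability": 3.0,
                   "uncertainty_handling": 3.0, "policy_compliance": 4.0},
    "max_rework_rate": 0.15,
    "max_escalation_frequency": 0.10,
}


def go_no_go(scores: dict, metrics: dict, thresholds: dict = THRESHOLDS):
    """Return (go?, failed conditions). An empty failure list is the only
    path to 'go'; anything missing or below floor blocks by default."""
    failures = [
        f"score:{dim}" for dim, floor in thresholds["min_scores"].items()
        if scores.get(dim, 0.0) < floor
    ]
    if metrics["rework_rate"] > thresholds["max_rework_rate"]:
        failures.append("rework_rate")
    if metrics["escalation_frequency"] > thresholds["max_escalation_frequency"]:
        failures.append("escalation_frequency")
    return (not failures, failures)
```

Because the function returns the specific failed conditions, a no-go comes with its own redesign agenda rather than a bare refusal.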
Minimum artefacts
A workable evaluation setup requires a small set of concrete artefacts:
- Evaluation rubric: the scoring model across factuality, completeness, traceability, uncertainty, and compliance
- Scenario bank: a structured set of test cases aligned to matter phase, sensitivity, and reliance
- Failure taxonomy: a defined set of failure types with clear escalation paths
- Monitoring plan: how performance will be tracked and reassessed post-release
If these are not written down, evaluation will drift into opinion.
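Of the four artefacts, the failure taxonomy is the one teams most often leave as opinion. Written down, it is just a closed set of failure types, each with a severity and an escalation path. The categories and owners below are entirely illustrative:

```python
# Illustrative taxonomy: failure type -> severity and escalation path.
FAILURE_TAXONOMY = {
    "hallucinated_fact":   {"severity": "high",   "escalate_to": "supervising lawyer"},
    "material_omission":   {"severity": "high",   "escalate_to": "supervising lawyer"},
    "untraceable_claim":   {"severity": "medium", "escalate_to": "reviewer"},
    "unflagged_ambiguity": {"severity": "medium", "escalate_to": "reviewer"},
    "policy_breach":       {"severity": "high",   "escalate_to": "risk team"},
}


def classify(failure_type: str) -> dict:
    """An unknown failure type is a finding in itself: it means the
    taxonomy needs extending, so it escalates rather than disappearing."""
    return FAILURE_TAXONOMY.get(
        failure_type,
        {"severity": "unclassified", "escalate_to": "evaluation owner"},
    )
```

The default branch matters most: failures that fall outside the taxonomy go to whoever owns the evaluation, which is what turns failure into something to classify and learn from.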
Where this usually breaks
Common patterns:
- Evaluation focuses on best-case examples rather than realistic scenarios
- Failure is treated as an exception rather than something to classify and learn from
- Prompt tweaks are used to mask underlying issues in task design
- Operational metrics are ignored because they are harder to measure
- Model updates are deployed without re-running baseline scenarios
None of these are technical limitations. They are process gaps.
What good looks like
A defensible evaluation setup has a few clear properties:
- Tasks are defined in terms of downstream decisions, not abstract outputs
- Scenarios reflect how matters evolve over time
- Scoring captures omission and uncertainty, not just correctness
- Human interaction is measured, not assumed
- Stability is tested continuously, not once
- Release decisions are tied to evidence, not confidence
At that point, evaluation becomes part of the system, not a one-off exercise.
Testing cadence
A simple cadence is enough if it is followed consistently:
- Pre-pilot baseline: establish initial performance across the scenario bank
- Weekly pilot review: track operational metrics and emerging failure patterns
- Pre-production decision: assess against defined thresholds
- Monthly production checks: re-run scenarios and review drift
Skip any of these and issues will accumulate quietly.
Checklist
- Tasks are defined in terms of decisions and reliance
- Scenario bank reflects matter phase, sensitivity, and usage
- Scoring includes completeness and uncertainty, not just accuracy
- Operational metrics are tracked during pilot
- Stability tests are scheduled and executed
- Go or no-go thresholds are defined and enforced
- Failures are classified and linked to escalation paths
- Production monitoring is planned before release
Related
Pair this with matter-state workflows.
If your scenarios do not reflect how work changes over time, your evaluation will miss the failures that actually matter.