Orchestrate.legal

Evaluating legal AI beyond accuracy

A practical framework for testing whether AI outputs can be relied on, not just whether they look correct.

evaluation · working

This reflects current thinking and may change as the approach develops.

Who it is for

Teams responsible for deciding whether an AI workflow can move from testing into live legal work.

If your current evaluation is based on a handful of prompts and a general sense that "it looks right", this will challenge that.

Why it matters

Accuracy is an input metric: it tells you how the model performs on isolated test items, not whether the workflow built around it can be relied on.

Legal risk shows up in how work unfolds:

  • what gets missed
  • what gets assumed
  • what gets escalated too late
  • what gets used without enough evidence

A system can produce convincing text and still fail operationally.

Most failures are not obvious in a single response. They emerge across a sequence of decisions, often under time pressure, often with partial context.

If you only test outputs in isolation, you are not testing the system you are actually deploying.

The shift to make

Stop evaluating answers.

Start evaluating decisions in context.

That means:

  • what the task is trying to achieve
  • what state the matter is in
  • what the output will be used for
  • what happens if the output is wrong or incomplete

Evaluation needs to reflect how legal work actually moves, not how prompts are written.

Practical steps

1. Define the task in terms of the decision it supports

Do not start with "summarise this document".

Start with:

  • what decision depends on this output?
  • what happens if the output is incomplete or wrong?
  • what level of reliance is expected?

A clause summary used for internal orientation is not the same as one feeding into advice.

This definition anchors everything that follows.
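To make this concrete, the definition can be captured as a small record rather than left implicit. A minimal sketch in Python; the field names and the reliance levels (which mirror step 2 below) are illustrative, not a standard:

```python
# A minimal sketch of a task definition record. Field names and the
# Reliance levels are illustrative, not a standard; adapt to your tooling.
from dataclasses import dataclass
from enum import Enum

class Reliance(Enum):
    INFORM_ONLY = "inform-only"  # orients a human, nothing more
    ASSISTIVE = "assistive"      # drafts work a human revises
    ADOPTED = "adopted"          # used largely as-is
    BINDING = "binding"          # feeds directly into advice

@dataclass
class TaskDefinition:
    name: str                # e.g. "clause summary for renewal review"
    decision_supported: str  # the downstream decision this output feeds
    failure_impact: str      # what happens if wrong or incomplete
    reliance: Reliance       # expected level of reliance

task = TaskDefinition(
    name="clause summary",
    decision_supported="whether to escalate non-standard indemnity terms",
    failure_impact="unusual indemnity exposure reaches negotiation unflagged",
    reliance=Reliance.ASSISTIVE,
)
```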

2. Build scenario sets that reflect real work

Most teams overfit to prompt variations.

Instead, structure scenarios across:

  • matter phase (intake, review, negotiation, execution, monitoring)
  • sensitivity (standard, confidential, privileged, high-risk)
  • reliance level (inform-only, assistive, adopted, binding)

This gives you coverage of how the system behaves under different conditions, not just different wording.

A small, well-structured scenario bank is more valuable than hundreds of loosely defined prompts.
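One way to structure the bank, as a sketch: treat the three dimensions as axes and enumerate the grid, filling in only the cells that occur in real work. The Scenario fields below are hypothetical placeholders:

```python
# A minimal sketch of a structured scenario bank. The axes mirror the
# three dimensions above; the Scenario fields are hypothetical.
import itertools
from dataclasses import dataclass

PHASES = ["intake", "review", "negotiation", "execution", "monitoring"]
SENSITIVITIES = ["standard", "confidential", "privileged", "high-risk"]
RELIANCE = ["inform-only", "assistive", "adopted", "binding"]

@dataclass
class Scenario:
    scenario_id: str
    phase: str
    sensitivity: str
    reliance: str
    prompt: str                 # the actual input given to the system
    expected_points: list[str]  # facts or clauses the output must capture

# Enumerate the grid, then fill in only the cells that occur in real
# work. Even partial coverage makes the gaps visible.
grid = list(itertools.product(PHASES, SENSITIVITIES, RELIANCE))
print(f"{len(grid)} conditions to consider")  # 5 x 4 x 4 = 80
```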

3. Score what actually matters

Text quality alone is insufficient.

Score across dimensions that reflect legal use:

  • Factuality
    Are statements grounded in the source material?

  • Completeness
    What was omitted that should have been captured?

  • Traceability
    Can outputs be linked back to source clauses or evidence?

  • Uncertainty handling
    Does the system flag ambiguity, gaps, or assumptions?

  • Policy compliance
    Does the output respect defined constraints, including jurisdiction and confidentiality?

You are not looking for perfect scores. You are looking for predictable behaviour.
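A minimal scoring sketch, assuming reviewer ratings on a simple 0-4 scale per dimension. The scale and normalisation are illustrative; the point is that every dimension is recorded for every scenario run, so behaviour can be compared run to run:

```python
# A minimal scoring sketch, one record per scenario run. The 0-4 scale
# is illustrative; what matters is that every dimension is always scored.
DIMENSIONS = ["factuality", "completeness", "traceability",
              "uncertainty_handling", "policy_compliance"]

def score_output(ratings: dict[str, int]) -> dict[str, float]:
    """Normalise reviewer ratings (0-4) to 0.0-1.0 per dimension."""
    missing = set(DIMENSIONS) - set(ratings)
    if missing:
        raise ValueError(f"unscored dimensions: {missing}")
    return {dim: ratings[dim] / 4.0 for dim in DIMENSIONS}

scores = score_output({
    "factuality": 4, "completeness": 2, "traceability": 3,
    "uncertainty_handling": 3, "policy_compliance": 4,
})
# High factuality with low completeness is the classic legal failure
# mode: everything stated is true, but something material was missed.
```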

4. Track operational metrics

This is where most evaluation frameworks fall short.

You need to measure how the system behaves once humans interact with it:

  • Rework rate
    How often outputs need correction before use

  • Reviewer disagreement
    Where human reviewers diverge on whether something is acceptable

  • Escalation frequency
    How often outputs trigger higher-level review

  • Time-to-release
    How long it takes for an output to pass through gates and be used

These metrics tell you whether the system reduces or introduces friction.
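All four can be computed directly from a pilot review log. A sketch, assuming your review tooling records something like the fields below (the names are hypothetical):

```python
# A minimal sketch of the four metrics computed from a pilot review log.
# The ReviewEvent fields are hypothetical; map them to whatever your
# review tooling actually records.
from dataclasses import dataclass

@dataclass
class ReviewEvent:
    output_id: str
    needed_rework: bool            # corrected before use
    reviewer_verdicts: list[bool]  # accept/reject per reviewer
    escalated: bool                # triggered higher-level review
    hours_to_release: float        # time through all gates

def operational_metrics(events: list[ReviewEvent]) -> dict[str, float]:
    n = len(events)
    split = sum(1 for e in events if len(set(e.reviewer_verdicts)) > 1)
    return {
        "rework_rate": sum(e.needed_rework for e in events) / n,
        "reviewer_disagreement": split / n,
        "escalation_frequency": sum(e.escalated for e in events) / n,
        # upper median keeps the sketch simple
        "hours_to_release": sorted(e.hours_to_release for e in events)[n // 2],
    }
```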

5. Test stability over time

One-off evaluation gives false confidence.

Rerun the same scenarios:

  • across model updates
  • across prompt or policy changes
  • at regular intervals

You are looking for drift:

  • silent degradation
  • inconsistent behaviour
  • changes in what gets omitted or flagged

Stability is as important as raw performance.
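A drift check can be as simple as re-running the scenario bank and comparing per-dimension scores against the stored baseline. A sketch, with an illustrative tolerance:

```python
# A minimal drift check: re-run the scenario bank, compare per-dimension
# scores against the stored baseline. The 0.05 tolerance is illustrative.
def drift_report(baseline: dict[str, float],
                 rerun: dict[str, float],
                 tolerance: float = 0.05) -> dict[str, float]:
    """Return dimensions whose score moved by more than `tolerance`."""
    return {dim: round(rerun[dim] - baseline[dim], 3)
            for dim in baseline
            if abs(rerun[dim] - baseline[dim]) > tolerance}

baseline = {"factuality": 0.95, "completeness": 0.80, "traceability": 0.90}
rerun    = {"factuality": 0.94, "completeness": 0.68, "traceability": 0.91}
print(drift_report(baseline, rerun))
# {'completeness': -0.12} -> silent degradation: omissions increased
# even though spot checks still look fine.
```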

6. Define explicit go or no-go thresholds

Most pilots fail here.

Without thresholds, decisions default to enthusiasm or pressure to ship.

Define in advance:

  • minimum acceptable scores across dimensions
  • maximum tolerated rework or escalation rates
  • conditions that require redesign rather than iteration

If thresholds are not met, the system does not move forward. That needs to be a real outcome, not a theoretical one.
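A sketch of what that looks like in practice: thresholds written down as data, and a gate that returns a hard answer with reasons. Every value here is illustrative, not a recommendation:

```python
# A minimal go/no-go gate. Thresholds are illustrative; the point is
# that they are written down before the pilot, and the gate returns a
# hard answer with explicit reasons.
MIN_SCORES = {"factuality": 0.95, "completeness": 0.85,
              "traceability": 0.90, "uncertainty_handling": 0.85,
              "policy_compliance": 1.00}
MAX_REWORK_RATE = 0.15
MAX_ESCALATION_RATE = 0.20

def release_decision(scores: dict[str, float],
                     rework_rate: float,
                     escalation_rate: float) -> tuple[bool, list[str]]:
    reasons = [f"{dim}: {scores.get(dim, 0.0):.2f} below {floor:.2f}"
               for dim, floor in MIN_SCORES.items()
               if scores.get(dim, 0.0) < floor]
    if rework_rate > MAX_REWORK_RATE:
        reasons.append(f"rework rate {rework_rate:.0%} over limit")
    if escalation_rate > MAX_ESCALATION_RATE:
        reasons.append(f"escalation rate {escalation_rate:.0%} over limit")
    return (not reasons, reasons)  # a no-go always carries its reasons
```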

Minimum artefacts

A workable evaluation setup requires a small set of concrete artefacts:

  • Evaluation rubric
    The scoring model across factuality, completeness, traceability, uncertainty, and compliance

  • Scenario bank
    A structured set of test cases aligned to matter phase, sensitivity, and reliance

  • Failure taxonomy
    A defined set of failure types with clear escalation paths

  • Monitoring plan
    How performance will be tracked and reassessed post-release

If these are not written down, evaluation will drift into opinion.
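As an example, the failure taxonomy can be sketched in the same style. The categories and escalation routes below are illustrative:

```python
# A minimal sketch of a failure taxonomy with escalation paths. The
# categories and routes are illustrative; the artefact only works if
# every pilot failure is filed under exactly one entry.
FAILURE_TAXONOMY = {
    "hallucinated_fact":   "block release; route to system owner",
    "material_omission":   "block release; route to legal reviewer",
    "broken_traceability": "flag; require source links before reuse",
    "unflagged_ambiguity": "flag; review uncertainty-handling behaviour",
    "policy_breach":       "block release; route to risk or compliance",
}

def escalation_path(failure_type: str) -> str:
    if failure_type not in FAILURE_TAXONOMY:
        # An unclassifiable failure means the taxonomy needs extending,
        # which is itself a finding worth recording.
        raise KeyError(f"unknown failure type: {failure_type}")
    return FAILURE_TAXONOMY[failure_type]
```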

Where this usually breaks

Common patterns:

  • Evaluation focuses on best-case examples rather than realistic scenarios
  • Failure is treated as an exception rather than something to classify and learn from
  • Prompt tweaks are used to mask underlying issues in task design
  • Operational metrics are ignored because they are harder to measure
  • Model updates are deployed without re-running baseline scenarios

None of these are technical limitations. They are process gaps.

What good looks like

A defensible evaluation setup has a few clear properties:

  • Tasks are defined in terms of downstream decisions, not abstract outputs
  • Scenarios reflect how matters evolve over time
  • Scoring captures omission and uncertainty, not just correctness
  • Human interaction is measured, not assumed
  • Stability is tested continuously, not once
  • Release decisions are tied to evidence, not confidence

At that point, evaluation becomes part of the system, not a one-off exercise.

Testing cadence

A simple cadence is enough if it is followed consistently:

  1. Pre-pilot baseline
    Establish initial performance across the scenario bank

  2. Weekly pilot review
    Track operational metrics and emerging failure patterns

  3. Pre-production decision
    Assess against defined thresholds

  4. Monthly production checks
    Re-run scenarios and review drift

Skip any of these and issues will accumulate quietly.
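One way to keep the cadence honest, as a sketch: record when each recurring check last ran and flag anything overdue. Intervals are illustrative:

```python
# A minimal sketch that makes the cadence enforceable: record when each
# recurring check last ran and flag anything overdue.
from datetime import date, timedelta

CADENCE = {
    "weekly pilot review": timedelta(days=7),
    "monthly production check": timedelta(days=30),
}

def overdue(last_run: dict[str, date], today: date) -> list[str]:
    return [check for check, interval in CADENCE.items()
            if today - last_run.get(check, date.min) > interval]

print(overdue({"weekly pilot review": date(2024, 1, 1)},
              today=date(2024, 1, 15)))
# ['weekly pilot review', 'monthly production check']
```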

Checklist

  • Tasks are defined in terms of decisions and reliance
  • Scenario bank reflects matter phase, sensitivity, and usage
  • Scoring includes completeness and uncertainty, not just accuracy
  • Operational metrics are tracked during pilot
  • Stability tests are scheduled and executed
  • Go or no-go thresholds are defined and enforced
  • Failures are classified and linked to escalation paths
  • Production monitoring is planned before release

Related

Pair this with matter-state workflows.

If your scenarios do not reflect how work changes over time, your evaluation will miss the failures that actually matter.