Inspect AI: the UK AISI's eval framework everyone copied

There is a quiet pattern in 2026’s AI infrastructure landscape that nobody seems to have written up properly. The most important open-source projects in agent evaluation are not coming from the labs, the hyperscalers, or the VC-funded eval startups. They are coming from a government safety institute that releases everything under MIT, hosts the code on the UK government’s GitHub org, and ships it with the seriousness that drug regulators bring to clinical-trial protocols.

That project is Inspect AI, released in May 2024 by the UK AI Safety Institute (renamed the AI Security Institute in February 2025, still abbreviated AISI). By the spring of 2026 it is the framework Anthropic, OpenAI, and Google DeepMind use for the safety evaluations they publish on their system cards. It is what the UK AISI itself uses for the pre-deployment testing that frontier labs voluntarily submit their models to. It is increasingly what enterprise teams use when they build their own evaluation pipelines.

This post is about why it won, what its architecture actually is, and where the alternatives — OpenAI’s evals, EleutherAI’s lm-eval-harness, Braintrust — still have an edge.

The shape of an Inspect AI evaluation

The architecture is the centrepiece, so it goes first. An Inspect eval is a Task composed of three things (a Dataset, a Solver, and a Scorer) plus a target model.

The four pieces of an Inspect evaluation. Each one is composable and packageable as a normal Python module — you can publish a Solver and share it the way you’d publish a NumPy function.

A minimal eval, copied roughly from the Inspect docs:

from inspect_ai import Task, task
from inspect_ai.dataset import csv_dataset
from inspect_ai.scorer import match
from inspect_ai.solver import chain_of_thought, generate

@task
def grade_school_math() -> Task:
    return Task(
        dataset=csv_dataset("gsm8k.csv"),
        solver=[chain_of_thought(), generate()],
        scorer=match(numeric=True),
    )

Run it from the command line:

inspect eval grade_school_math.py --model anthropic/claude-sonnet-4-5
inspect eval grade_school_math.py --model openai/gpt-5
inspect eval grade_school_math.py --model vllm/local-qwen-72b

What that gets you is a structured eval log (in .eval format, viewable in the bundled Inspect View web UI or a VS Code extension), a token cost report, and a scored result across the dataset. The model flag is the only thing that changes between providers — Anthropic, OpenAI, Google, Groq, Mistral, xAI, AWS Bedrock, Azure AI, Together, Cloudflare, Goodfire, plus local vLLM, Ollama, and llama-cpp are all first-class. That model-agnosticism turns out to be the single biggest reason labs picked Inspect over rolling their own.
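One way to picture why a single flag can switch providers: the model string is a provider prefix plus a provider-specific model name. The sketch below is a toy dispatcher in plain Python, not Inspect's implementation; `split_model_name`, `generate`, and the `PROVIDERS` table are hypothetical names, and the "clients" only echo their input.

```python
from typing import Callable


def split_model_name(model: str) -> tuple[str, str]:
    """Split 'provider/model-name' on the first slash only.

    Splitting once matters because some model names contain
    further slashes of their own.
    """
    provider, sep, name = model.partition("/")
    if not sep or not provider or not name:
        raise ValueError(f"expected 'provider/model', got {model!r}")
    return provider, name


# Hypothetical stand-ins for provider clients; each one just echoes the prompt.
PROVIDERS: dict[str, Callable[[str, str], str]] = {
    "anthropic": lambda name, prompt: f"[anthropic:{name}] {prompt}",
    "openai": lambda name, prompt: f"[openai:{name}] {prompt}",
    "vllm": lambda name, prompt: f"[vllm:{name}] {prompt}",
}


def generate(model: str, prompt: str) -> str:
    """Route a prompt to the client registered for the provider prefix."""
    provider, name = split_model_name(model)
    if provider not in PROVIDERS:
        raise ValueError(f"unknown provider {provider!r}")
    return PROVIDERS[provider](name, prompt)
```

The eval code only ever calls `generate`; everything provider-specific sits behind the prefix lookup, which is the property that makes the same task file reusable across models.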

The architecture’s deeper move: each piece is a normal Python module you can publish and share. A Solver is a function. A Scorer is a function. A Dataset is a function. The composition is just Python. This matters because eval suites get reused across labs — the Inspect Evals repo (maintained jointly by the UK AISI, Arcadia Impact, and the Vector Institute) is now over 100 implemented benchmarks, all sharing the same primitives.
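To make "the composition is just Python" concrete, here is a toy model of the same shape in plain Python. It deliberately does not import inspect_ai: `Sample`, `run_task`, `chain_of_thought_prompt`, `numeric_match`, and `fake_model` are all hypothetical stand-ins, and the real framework's solvers are asynchronous and operate on a richer task state than a bare prompt string.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Sample:
    input: str
    target: str


Solver = Callable[[str], str]        # rewrites the prompt before generation
Scorer = Callable[[str, str], bool]  # (model output, target) -> correct?
Model = Callable[[str], str]         # prompt -> completion


def chain_of_thought_prompt(prompt: str) -> str:
    """Toy solver: append a step-by-step instruction to the prompt."""
    return prompt + "\nThink step by step, then give the final answer."


def numeric_match(output: str, target: str) -> bool:
    """Toy scorer: compare the last number in the output with the target."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", output)
    return bool(numbers) and float(numbers[-1]) == float(target)


def run_task(dataset: list[Sample], solvers: list[Solver],
             model: Model, scorer: Scorer) -> float:
    """Apply the solver chain, call the model, score, and return accuracy."""
    if not dataset:
        raise ValueError("dataset is empty")
    correct = 0
    for sample in dataset:
        prompt = sample.input
        for solve in solvers:
            prompt = solve(prompt)
        correct += scorer(model(prompt), sample.target)
    return correct / len(dataset)


def fake_model(prompt: str) -> str:
    """Canned model: gets '2 + 2' right and answers 0 to everything else."""
    return "The answer is 4" if "2 + 2" in prompt else "The answer is 0"


dataset = [Sample("What is 2 + 2?", "4"), Sample("What is 3 + 5?", "8")]
accuracy = run_task(dataset, [chain_of_thought_prompt], fake_model, numeric_match)
```

Because every piece is an ordinary callable, swapping the scorer or adding a solver to the chain is a one-argument change, which is what makes sharing them across teams cheap.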

How the labs use it

The most striking endorsement of Inspect is what shows up in the system cards. The Claude Sonnet 4.5 system card and the Claude Opus 4.5 system card both describe pre-deployment safety evaluations run through Inspect: multi-turn agentic evals, sandboxed code execution scenarios, deception probes. The evaluation harness for these isn’t proprietary — it’s the same Inspect Python that anyone can install with pip.

OpenAI’s recent safety reports follow the same pattern. Google DeepMind uses Inspect for capability and safety evaluations on Gemini. The common thread: when three frontier labs all need to publish reproducible evaluation methodology, and an independent third party (the UK AISI itself) needs to check their results, Inspect is the lingua franca that lets them compare results without arguing about harness differences.

What Inspect specifically does for safety teams that hand-rolled harnesses don’t:

Sandboxed execution. Built-in Docker sandbox per sample, with optional Kubernetes or Proxmox adapters for evals that run risky tools. This matters because most serious safety evals involve running model-generated code or letting the model touch a file system — and you do not want that on your laptop.
Multi-turn and agentic workflows. Solvers can be arbitrary multi-step Python with tool calls and intermediate scoring. The bibliographies in Anthropic’s system cards now include named “agentic eval” tasks that are Solvers Anthropic published back to Inspect Evals.
A standard log format. Every Inspect run produces a .eval file with the full transcript, tool calls, scores, and metadata. The Inspect View UI renders these into something an auditor can read. No proprietary serialisation.
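Subset filtering and run tags. When you want to re-run only the “biorisk” subset of a thousand-sample eval, you don’t write a SQL query: you filter the dataset on sample metadata in the task definition, or pick individual samples on the command line with --sample-id. The --tags option is related but different: it attaches labels to a run so you can find it later, and does not select samples.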
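Re-running a subset, as in the last item above, comes down to a predicate over sample metadata. A minimal sketch in plain Python, not using Inspect itself: the `Sample` class, the `categories` metadata key, and `filter_by_category` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    id: str
    input: str
    metadata: dict = field(default_factory=dict)


def filter_by_category(samples: list[Sample], category: str) -> list[Sample]:
    """Keep only samples whose metadata lists the given category."""
    return [s for s in samples if category in s.metadata.get("categories", ())]


samples = [
    Sample("s1", "prompt one", {"categories": ["biorisk"]}),
    Sample("s2", "prompt two", {"categories": ["cyber"]}),
    Sample("s3", "prompt three", {"categories": ["biorisk", "cyber"]}),
    Sample("s4", "prompt four"),  # no metadata at all
]

biorisk_ids = [s.id for s in filter_by_category(samples, "biorisk")]
```

The design point is that the category lives on the sample, not in a separate index, so the same dataset file can serve the full run and any subset run.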

The combination of those four turns Inspect from “a Python library” into “what you write when you want a regulator to take your eval seriously.” The UK AISI has been running pre-deployment evaluations on frontier models with this stack since 2024, and the labs that submit to those evaluations have an obvious incentive to use the same framework internally.

The alternatives, honestly

Three other frameworks deserve naming, because each one wins in a specific neighbourhood Inspect doesn’t quite cover:

OpenAI Evals — the original open-source eval framework, registry-based, designed around reproducible benchmark-style evals. Simpler than Inspect. Works beautifully for “I want to run a single eval against several OpenAI models.” Less flexible for multi-turn agentic evals or custom solvers. Adoption has plateaued — it solved the 2023 problem cleanly and didn’t extend into the 2025 problem of agent evals.
lm-evaluation-harness (EleutherAI) — the academic default. Hundreds of pre-built benchmarks, the harness that every open-weight model release runs against, the basis of most leaderboards. If you need to report on standard benchmarks (MMLU, GSM8K, HumanEval, etc.) for a paper, this is the right tool. It’s less suited to bespoke business evals because the abstraction is benchmark-shaped, not task-shaped.
Braintrust — the commercial eval platform. Dataset-first, with annotation queues, dataset versioning, sandboxed Python scorers and a polished UI. The trade-off is obvious — it’s a SaaS, the data leaves your environment, and you pay for the platform. Many teams pair Inspect (for the unit-level CI checks) with Braintrust or LangSmith (for the production trace annotation and dataset management).

There is also a long tail of newer eval-focused libraries (DeepEval, Promptfoo, Pydantic Evals), each of which carves out a useful niche. None of them has matched Inspect’s combination of “rigorous primitives + sandboxing + provider neutrality + government-institute backing.”

The 2026 working pattern looks roughly like this:

The 2026 eval stack is rarely one tool. Inspect AI is the rigorous-eval anchor; commercial platforms own the annotation/dataset UX; lm-eval-harness owns the benchmark-leaderboard story.

Replicating a frontier-lab safety eval

A useful concrete exercise: take a published safety evaluation from a recent system card and reproduce it in Inspect. Most of them are easier than they look, because the architecture mirrors how the labs describe the evals on the page.

Imagine reproducing the “agentic CTF challenge” eval pattern that shows up in Anthropic’s and DeepMind’s recent reports — where the model is placed in a sandboxed shell with a tool budget and asked to solve a capture-the-flag style task. Roughly:

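from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import includes
from inspect_ai.solver import basic_agent, system_message
from inspect_ai.tool import bash, python

@task
def ctf_challenges() -> Task:
    return Task(
        # Assumes each sample's target is the flag string for that challenge.
        dataset=json_dataset("ctf_set.jsonl"),
        solver=basic_agent(
            init=system_message("You are a CTF solver..."),
            # Tool calls execute inside the sample's sandbox.
            tools=[bash(), python()],
            # Caps the number of submitted answers, not the number of tool calls.
            max_attempts=12,
        ),
        # Correct if the target flag appears in the submitted answer.
        scorer=includes(),
        sandbox="docker",
    )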

The interesting bits:

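basic_agent is the built-in ReAct-style solver that handles the tool-loop, observations, and reflection. A handful of lines instead of a few hundred. Note that max_attempts limits how many answers the agent may submit; it is not a cap on tool calls.
tools=[bash(), python()] are Inspect’s built-in sandboxed tools. Each tool call runs inside the per-sample Docker container, so the model’s commands act on the container’s file system rather than your laptop’s. Container isolation greatly reduces the risk; it is not an absolute guarantee.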
sandbox="docker" is the activation flag. Add a compose.yaml for more complex environments (the Inspect sandbox docs walk through Kubernetes deployments for multi-VM evaluations).
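For readers who have not seen a ReAct-style loop written out, the sketch below shows the general pattern that a solver like basic_agent packages up: ask the model for an action, run the tool, append the observation, and stop once a submission is correct or the attempt budget is spent. It is a toy in plain Python under stated assumptions, not Inspect's implementation: `run_agent`, `scripted_model`, and `fake_bash` are hypothetical, and the "shell" returns a canned string instead of executing anything.

```python
from typing import Callable

Action = tuple[str, str]  # (tool name or "submit", argument)


def run_agent(model: Callable[[list[str]], Action],
              tools: dict[str, Callable[[str], str]],
              task: str,
              check: Callable[[str], bool],
              max_attempts: int = 3,
              max_steps: int = 20) -> tuple[bool, list[str]]:
    """Run a tool loop until a correct submission or a budget runs out."""
    transcript = [f"task: {task}"]
    attempts = 0
    for _ in range(max_steps):
        action, arg = model(transcript)
        if action == "submit":
            attempts += 1
            if check(arg):
                return True, transcript
            if attempts >= max_attempts:
                return False, transcript
            transcript.append("observation: incorrect, try again")
        elif action in tools:
            transcript.append(f"{action}({arg!r}) -> {tools[action](arg)}")
        else:
            transcript.append(f"observation: unknown tool {action!r}")
    return False, transcript


def fake_bash(command: str) -> str:
    """Stand-in for a sandboxed shell; nothing is actually executed."""
    return "flag{demo}" if command == "cat flag.txt" else ""


def scripted_model(transcript: list[str]) -> Action:
    """Hypothetical policy: look around, guess wrong once, then submit the flag."""
    script = [("bash", "cat flag.txt"), ("submit", "flag{wrong}"), ("submit", "flag{demo}")]
    return script[len(transcript) - 1]


solved, transcript = run_agent(
    scripted_model, {"bash": fake_bash}, "find the flag",
    check=lambda answer: answer == "flag{demo}",
)
```

Two budgets appear here on purpose: `max_attempts` bounds submissions, while `max_steps` bounds the loop as a whole, so a model that never submits still terminates.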

In practice, the “replication” exercise becomes a curriculum exercise. Anthropic’s safety team contributed back several of their evals to the public Inspect Evals repo, so reproducing an eval is often git clone && inspect eval. That’s what “everyone copied” looks like in 2026 — the labs don’t just use Inspect, they publish their evals back into the shared registry.

What the wider lesson is

Inspect AI is the most successful example of a pattern that I think will define a lot of 2026–2028 infrastructure. The pattern: a government safety institute, well-funded and politically motivated to make AI evaluation legible, ships open-source infrastructure with the quality and seriousness that the labs themselves can’t (because they’re busy shipping models) and the eval-startup ecosystem won’t (because their business model doesn’t permit it).

The result is a kind of regulatory-flywheel open source:

Labs adopt Inspect because using the same framework as the regulator reduces friction with pre-deployment evaluation.
Researchers adopt Inspect because the labs publish their evals in Inspect format.
Enterprises adopt Inspect because the open-source ecosystem of evals is now richer in Inspect than in any commercial framework.
The UK AISI keeps building Inspect because it makes its core mission easier.

It’s the kind of feedback loop that makes a project the de facto standard in three years from a base of zero. We are already most of the way there.

What to take away

Three things:

Inspect AI is the eval framework to learn in 2026 if you don’t know any of them yet. Tasks + Solvers + Scorers + Models is the right abstraction; the sandboxing is what most home-brewed harnesses get wrong; and the ecosystem of published evals is uniquely rich.
The alternatives still matter in their lanes. OpenAI Evals for registry-based simple evals, lm-eval-harness for benchmark leaderboards, Braintrust/LangSmith for dataset annotation and production trace management. The 2026 eval stack is usually two tools, not one.
A government safety institute is now producing the canonical open-source infrastructure for the field. That’s not a sentence anyone would have predicted in 2022, but it explains why Inspect won, and probably tells us something about where the next wave of shared LLM infra will come from.

The frontier labs’ safety teams are running Inspect. The UK AISI is running Inspect. Enterprise eval engineers are quietly reaching for it because it’s the closest thing the open-source world has to a standard. The lesson is the one that always shows up when good infrastructure wins — the project that solved the problem rigorously, under the right licence, with the right institutional backing, became the default. There isn’t a more interesting story than that.

Further reading: the Inspect AI documentation, the Inspect Evals registry, the UK AISI pre-deployment evaluation page, and the Claude Sonnet 4.5 system card as a concrete example of an Inspect-driven safety report.