MedCalc-Bench
Evaluates whether language models can perform clinical calculations from free-text patient notes. Each sample provides a patient note and asks the model to compute a specific clinical value across 55 calculators (lab tests, physical quantities, risk/severity scores, dosages, and dates). Rule-based calculators are scored by exact match; equation-based calculators are scored within the authors’ tolerance band. Faithful Inspect AI port of the NeurIPS 2024 benchmark.
Overview
⚠️ External evaluation. Code lives in an upstream repository. inspect_evals lists it for discoverability; review the upstream repo and pinned commit before running.
Source: azrabano23/medcalc-bench-inspect@e497f0e
Evaluates whether language models can perform clinical calculations from free-text patient notes. Each sample provides a patient note and asks the model to compute a specific clinical value across 55 calculators (lab tests, physical quantities, risk/severity scores, dosages, and dates). Rule-based calculators are scored by exact match; equation-based calculators are scored within the authors’ tolerance band. Faithful Inspect AI port of the NeurIPS 2024 benchmark.
Usage
Installation
This is an externally-maintained evaluation. Clone the upstream repository at the pinned commit and install its dependencies:
git clone https://github.com/azrabano23/medcalc-bench-inspect
cd medcalc-bench-inspect
git checkout e497f0e327b2e9245f3e832badb320ce84d6ddb5
uv syncRunning evaluations
CLI
uv run inspect eval src/medcalc_bench/medcalc_bench.py@medcalc_bench --model openai/gpt-5-nanoPython
from inspect_ai import eval
from medcalc_bench.medcalc_bench import medcalc_bench
eval(medcalc_bench(), model="openai/gpt-5-nano")View logs
uv run inspect viewMore information
For the dataset, scorer, task parameters, and validation, see the upstream repo: azrabano23/medcalc-bench-inspect.
Options
You can control a variety of options from the command line. For example:
uv run inspect eval src/medcalc_bench/medcalc_bench.py@medcalc_bench --limit 10
uv run inspect eval src/medcalc_bench/medcalc_bench.py@medcalc_bench --max-connections 10
uv run inspect eval src/medcalc_bench/medcalc_bench.py@medcalc_bench --temperature 0.5See uv run inspect eval --help for all available options.
More command-line options: Inspect docs ↗
Evaluation Report
Timestamp: June 2026
Commit: 5508e9b
uv run inspect eval src/medcalc_bench/medcalc_bench.py@medcalc_bench --model openai/gpt-4o-mini --limit 100| Model | Provider | Accuracy | Stderr | Time |
|---|---|---|---|---|
| openai/gpt-4o-mini | OpenAI | 0.590 | 0.049 | 1m 5s |
Notes:
- Run on 100 samples from the test split (zero_shot_cot prompt), 1 epoch.