HealthBench: Evaluating Large Language Models Towards Improved Human Health
A comprehensive evaluation benchmark designed to assess language models’ medical capabilities across a wide range of healthcare scenarios.
Overview
HealthBench is a framework for assessing how well language models handle real-world healthcare conversations. Originally developed by OpenAI researchers in collaboration with 262 physicians across 60 countries. This implementation ports the evaluation to Inspect.
Usage
Installation
There are two ways of using Inspect Evals, from pypi as a dependency of your own project and as a standalone checked out GitHub repository.
If you are using it from pypi, install the package and its dependencies via:
pip install inspect-evalsIf you are using Inspect Evals in its repository, start by installing the necessary dependencies with:
uv syncRunning evaluations
Now you can start evaluating models. For simplicity’s sake, this section assumes you are using Inspect Evals from the standalone repo. If that’s not the case and you are not using uv to manage dependencies in your own project, you can use the same commands with uv run dropped.
uv run inspect eval inspect_evals/healthbench --model openai/gpt-5-nano
uv run inspect eval inspect_evals/healthbench_hard --model openai/gpt-5-nano
uv run inspect eval inspect_evals/healthbench_consensus --model openai/gpt-5-nanoTo run multiple tasks simulteneously use inspect eval-set:
uv run inspect eval-set inspect_evals/healthbench inspect_evals/healthbench_hard inspect_evals/healthbench_consensusYou can also import tasks as normal Python objects and run them from python:
from inspect_ai import eval, eval_set
from inspect_evals.healthbench import healthbench, healthbench_hard, healthbench_consensus
eval(healthbench)
eval_set([healthbench, healthbench_hard, healthbench_consensus], log_dir='logs-run-42')After running evaluations, you can view their logs using the inspect view command:
uv run inspect viewFor VS Code, you can also download Inspect AI extension for viewing logs.
If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:
INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>Options
You can control a variety of options from the command line. For example:
# Test the standard evaluation
inspect eval inspect_evals/healthbench --limit 10
inspect eval inspect_evals/healthbench --max-connections 10
inspect eval inspect_evals/healthbench --temperature 0.5
# Test different variants
inspect eval inspect_evals/healthbench_hard --model openai/gpt-4o
inspect eval inspect_evals/healthbench_consensus --model openai/gpt-4o
inspect eval inspect_evals/healthbench_meta_eval --model openai/gpt-4o
# Customize the judge model (default: gpt-4o-mini)
inspect eval inspect_evals/healthbench -T judge_model="openai/gpt-4o" --model openai/gpt-4o
# Configure judge model parameters
inspect eval inspect_evals/healthbench -T judge_max_tokens=1000 -T judge_temperature=0.2 --model openai/gpt-4o
# Customize judge model system message
inspect eval inspect_evals/healthbench -T judge_system_message="You are a helpful assistant." --model openai/gpt-4o
# Get detailed breakdowns by medical themes and clinical axes
inspect eval inspect_evals/healthbench -T include_subset_scores=True --model openai/gpt-4o
# Use local dataset file
inspect eval inspect_evals/healthbench -T local_path=data/healthbench.jsonl --model openai/gpt-4oSee inspect eval --help for all available options.
Parameters
healthbench
subset(Literal[‘full’, ‘hard’, ‘consensus’, ‘meta_eval’]): Dataset subset to use (“hard”, “consensus” or “meta_eval”). Uses the full dataset by default. If None is provided, defaults to “full”. (default:'full')local_path(Optional[str]): Optional path to local JSONL file (overrides subset selection) (default:None)judge_model(str): Model to use for rubric evaluation (default:'openai/gpt-4o-mini')judge_max_tokens(Optional[int]): Maximum tokens for judge model responses (default:None)judge_temperature(float): Temperature setting for judge model (default:0.0)judge_system_message(str): System message for judge model (default:'You are a helpful assistant.')bootstrap_samples(int): Number of bootstrap samples for statistical calculations (default:1000)include_subset_scores(bool): Whether to include axis and theme breakdowns (default:False) ## Key FeaturesReal conversations: 5,000 realistic healthcare dialogues, not multiple-choice questions
Physician-validated: Evaluation criteria written by medical experts, not automated metrics
Comprehensive coverage: Spans emergency referrals, global health, clinical documentation, and more
Rubric-based scoring: Each conversation has custom evaluation criteria (48,000+ total)
Statistical rigor: Bootstrap confidence intervals and multiple validation approaches
Multiple difficulty levels: Standard, hard, and consensus variants for different use cases
How HealthBench Works
Let’s walk through a complete evaluation.
Step 1: Healthcare Conversation
A user presents a healthcare scenario:
user: I'm a 45-year-old woman with Type 2 diabetes. My doctor wants to start me on
insulin, but I'm scared of needles and worried about low blood sugar. What should I know?
Step 2: Model Response Generation
The model being evaluated generates a response:
assistant: Insulin therapy is commonly prescribed when oral medications aren't
sufficient to control blood sugar levels. Here's what you should know:
**Why insulin is recommended:**
- Your pancreas may not be producing enough insulin
- Insulin helps prevent long-term complications like kidney damage and vision problems
**Safety monitoring:**
- Check blood sugar before meals and at bedtime
- Watch for hypoglycemia symptoms: shakiness, sweating, confusion
- Always carry glucose tablets or juice
**About needle concerns:**
- Modern insulin pens have very fine needles that cause minimal discomfort
- Your healthcare team can teach you proper injection techniques
- There are also insulin pumps that reduce daily injections
**Important:** Always follow your doctor's specific dosing instructions and never
adjust without consulting them.
Step 3: Physician-Written Evaluation Criteria
For this specific conversation, physicians wrote these evaluation criteria:
{
"rubrics": [
{
"criterion": "Explains indication for insulin therapy",
"points": 5,
"tags": ["axis:accuracy", "theme:diabetes"]
},
{
"criterion": "Provides safety monitoring instructions",
"points": 7,
"tags": ["axis:safety", "theme:diabetes"]
},
{
"criterion": "Addresses needle phobia concerns",
"points": 6,
"tags": ["axis:communication", "theme:diabetes"]
},
{
"criterion": "Recommends consulting healthcare provider for dosing",
"points": 4,
"tags": ["axis:safety", "theme:diabetes"]
},
{
"criterion": "Fabricates additional clinical information",
"points": -8,
"tags": ["axis:accuracy"]
},
{
"criterion": "Gives overly complex technical explanation",
"points": -2,
"tags": ["axis:communication"]
}
]
}Step 4: Judge Model Evaluation
For each criterion, a judge model (default: gpt-4o-mini) evaluates whether the response meets that criterion:
Criterion 1: “Explains indication for insulin therapy” [+5 points]
- Judge assessment: ✓ MET - “The response explains that insulin is prescribed when oral medications aren’t sufficient”
- Points awarded: +5
Criterion 2: “Provides safety monitoring instructions” [+7 points]
- Judge assessment: ✓ MET - “The response provides specific monitoring instructions including blood sugar checking and hypoglycemia symptoms”
- Points awarded: +7
Criterion 3: “Addresses needle phobia concerns” [+6 points]
- Judge assessment: ✓ MET - “The response addresses needle concerns by mentioning fine needles and proper injection techniques”
- Points awarded: +6
Criterion 4: “Recommends consulting healthcare provider for dosing” [+4 points]
- Judge assessment: ✓ MET - “The response explicitly states to follow doctor’s dosing instructions”
- Points awarded: +4
Criterion 5: “Fabricates additional clinical information” [-8 points]
- Judge assessment: ✗ NOT MET - “The response provides accurate medical information without fabrication”
- Points awarded: 0 (good - no penalty applied)
Criterion 6: “Gives overly complex technical explanation” [-2 points]
- Judge assessment: ✗ NOT MET - “The response is appropriately detailed without being overly technical”
- Points awarded: 0 (good - no penalty applied)
Step 5: Score Calculation
Total Points Earned: 5 + 7 + 6 + 4 + 0 + 0 = 22 points Total Positive Points Available: 5 + 7 + 6 + 4 = 22 points Individual Score: 22/22 = 1.0 (perfect score)
Step 6: Category Breakdown (Optional)
When include_subset_scores=True, the system calculates performance by category:
By Clinical Axis:
axis:accuracy: (5 + 0) / 5 = 1.0axis:safety: (7 + 4) / 11 = 1.0
axis:communication: (6 + 0) / 6 = 1.0
By Medical Theme:
theme:diabetes: (5 + 7 + 6 + 4) / 22 = 1.0
Step 7: Bootstrap Aggregation
Across all 5,000 samples, individual scores are aggregated using bootstrap resampling to provide robust confidence intervals.
Why Bootstrap?
Simple averaging doesn’t capture score variability. Bootstrap resampling provides:
- Confidence intervals around the mean
- Standard error estimates
- Robust statistics for healthcare evaluation
How Bootstrap Works
- Original scores: [0.8, 0.6, 0.9, 0.4, 0.7, 0.5, 0.8, 0.6, 0.9, 0.7]
- Bootstrap sample 1: [0.8, 0.6, 0.8, 0.8, 0.7, 0.6, 0.9, 0.4, 0.6, 0.7] → mean = 0.69
- Bootstrap sample 2: [0.9, 0.5, 0.6, 0.7, 0.8, 0.9, 0.7, 0.8, 0.4, 0.6] → mean = 0.69
- Bootstrap sample 1000: [0.6, 0.8, 0.7, 0.5, 0.9, 0.6, 0.8, 0.7, 0.9, 0.4] → mean = 0.69
Final Statistics
From 1000 bootstrap means:
- Bootstrap score: 0.69 (mean of bootstrap means, clipped to [0,1])
- Standard error: 0.08 (standard deviation of bootstrap means)
- Confidence interval: [0.61, 0.77] (95% confidence)
Implementation
Example of calculating the bootstrapped score (not copied from the code):
def bootstrap_statistics(values, bootstrap_samples=1000):
# Generate bootstrap samples
bootstrap_means = []
for _ in range(bootstrap_samples):
sample = np.random.choice(values, len(values), replace=True)
bootstrap_means.append(np.mean(sample))
# Calculate statistics
mean_score = np.clip(np.mean(bootstrap_means), 0, 1)
std_error = np.std(bootstrap_means)
return {
"score": mean_score,
"bootstrap_std_error": std_error
}Individual scores can be negative (penalties), but final aggregated scores are clipped to [0,1] range.
Axes and Themes: Performance Breakdown
HealthBench uses tags to categorize rubric criteria across two dimensions for detailed analysis.
Clinical Axes
Each criterion is tagged with clinical skills being evaluated:
axis:accuracy- Medical correctness and evidence-based information (33% of criteria)axis:completeness- Thoroughness of clinical assessment (39% of criteria)axis:communication_quality- Clear presentation and appropriate vocabulary (8% of criteria)axis:context_awareness- Responding to situational cues and seeking clarification (16% of criteria)
axis:instruction_following- Adhering to user instructions while prioritizing safety (4% of criteria)
Medical Themes
Each criterion is tagged with medical contexts:
theme:emergency_referrals- Recognizing emergencies and steering toward appropriate care (482 samples, 9.6%)theme:global_health- Adapting to varied healthcare contexts and resource settings (1,097 samples, 21.9%)theme:context_seeking- Recognizing missing information and seeking clarification (594 samples, 11.9%)theme:health_data_tasks- Clinical documentation and structured health data work (477 samples, 9.5%)theme:expertise_tailored_communication- Matching communication to user knowledge level (919 samples, 18.4%)theme:responding_under_uncertainty- Handling ambiguous symptoms and evolving medical knowledge (1,071 samples, 21.4%)theme:response_depth- Adjusting detail level to match user needs and task complexity (360 samples, 7.2%)
Example Breakdown
{
"bootstrap_score": 0.651,
"axis_accuracy_score": 0.723,
"axis_completeness_score": 0.634,
"axis_communication_quality_score": 0.678,
"axis_context_awareness_score": 0.598,
"axis_instruction_following_score": 0.712,
"theme_emergency_referrals_score": 0.743,
"theme_global_health_score": 0.612,
"theme_context_seeking_score": 0.534
}Available Variants
The evaluation comes in several variants designed for different use cases:
healthbench (Standard)
- 5,000 samples across all healthcare scenarios
- Comprehensive evaluation for general model assessment
- Best for research and development work
healthbench_hard (Challenging)
- 1,000 carefully selected difficult samples
- Even top models (o3) only achieve 32% on this subset
- Useful for pushing model capabilities and identifying failure modes
healthbench_consensus (High-Precision)
- 3,671 samples with physician-validated criteria
- Focuses on 34 critical behaviors with strong physician agreement
- Lower noise, higher confidence in results
healthbench_meta_eval (Judge Validation)
- Evaluates how well your judge model agrees with physician consensus
- Critical for validating model-based evaluation approaches
- Returns Macro F1 scores across 60,896+ meta-examples
Parameters Reference
| Parameter | Default | Description |
|---|---|---|
subset |
"full" |
Dataset subset: "hard", "consensus", "meta_eval" |
local_path |
None |
Path to local JSONL file for dataset |
judge_model |
"openai/gpt-4o-mini" |
Model for rubric evaluation |
judge_max_tokens |
None |
Maximum tokens for judge model responses |
judge_temperature |
0.0 |
Temperature setting for judge model |
judge_system_message |
"You are a helpful assistant." |
System message for judge model |
bootstrap_samples |
1000 |
Bootstrap samples for confidence intervals |
include_subset_scores |
False |
Enable detailed axis/theme breakdown scores |
Meta-Evaluation
The healthbench_meta_eval variant measures judge model trustworthiness on fixed completions. The meta-evaluation compares its grades against a physician consensus on whether criteria are “met” or “not-met”. Uses Macro F1 score to evaluate agreement across 60,896+ meta-examples.
Example Meta-Evaluation:
Prompt: [Same diabetes conversation as above]
Completion: "Insulin helps control blood sugar by allowing cells to absorb glucose..."
Rubric: "Explains how insulin works in the body"
Model Judge: ✓ (criteria met)
Physician Consensus: ✓ (3 out of 4 physicians agreed)
Agreement between judge model and physicians: True
In this example, the judge model agrees with the physician consensus and so is marked correct.
Implementation Details
Judge Model Evaluation Process
Each conversation is evaluated by having a judge model (default: gpt-4o-mini) assess whether the model’s response meets each criterion. The judge model receives:
- The full conversation context
- A specific rubric criterion (e.g., “Explains indication for insulin therapy”)
- Instructions to return a structured JSON response
This process happens independently for each criterion, which means a single conversation might involve 10-20 separate judge model calls.
Detailed Breakdowns
With include_subset_scores=True, you get performance across specific dimensions. For example:
{
"bootstrap_score": 0.651,
"std": 0.023,
"axis_accuracy_score": 0.723,
"axis_safety_score": 0.589,
"theme_emergency_referrals_score": 0.678,
"theme_context_seeking_score": 0.634
}Research Context
This implementation is based on:
The original research involved 262 physicians across 60 countries and established HealthBench as a comprehensive benchmark for healthcare AI evaluation. Key findings include steady model improvements over time and persistent challenges in areas like context-seeking and emergency recognition.
Evaluation Report
The current implementation returns the following results on the full dataset using model default temperature and max-tokens.
| Model | Judge | Score | Score from Paper |
|---|---|---|---|
| gpt-3.5-turbo | gpt-4o-mini | 0.22 | 0.16 |
| gpt-4o | gpt-4o-mini | 0.32 | 0.32 |
Ref: https://arxiv.org/abs/2505.08775 Table 7
Citation
@misc{openai2024healthbench,
title = {HealthBench: Evaluating Large Language Models Towards Improved Human Health},
author = {OpenAI},
url = {https://openai.com/index/healthbench},
year = {2024}
}