HealthBench: Evaluating Large Language Models Towards Improved Human Health
A comprehensive evaluation benchmark designed to assess language models’ medical capabilities across a wide range of healthcare scenarios.
Overview
HealthBench is a framework for assessing how well language models handle real-world healthcare conversations. It was originally developed by OpenAI researchers in collaboration with 262 physicians across 60 countries. This implementation ports the evaluation to Inspect.
Usage
First, install the `inspect_ai` and `inspect_evals` Python packages with:
pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals
Or, if developing on a clone of the `inspect_evals` repo, you can install the package in editable mode with:
pip install -e ".[dev]"
Then, evaluate against one or more models with:
inspect eval inspect_evals/healthbench --model openai/gpt-4o
inspect eval inspect_evals/healthbench_hard --model openai/gpt-4o
inspect eval inspect_evals/healthbench_consensus --model openai/gpt-4o
After running evaluations, you can view their logs using the `inspect view` command:
inspect view
If you don’t want to specify the `--model` each time you run an evaluation, create a `.env` configuration file in your working directory that defines the `INSPECT_EVAL_MODEL` environment variable along with your API key. For example:
INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>
Options
You can control a variety of options from the command line. For example:
# Test the standard evaluation
inspect eval inspect_evals/healthbench --limit 10
inspect eval inspect_evals/healthbench --max-connections 10
inspect eval inspect_evals/healthbench --temperature 0.5
# Test different variants
inspect eval inspect_evals/healthbench_hard --model openai/gpt-4o
inspect eval inspect_evals/healthbench_consensus --model openai/gpt-4o
inspect eval inspect_evals/healthbench_meta_eval --model openai/gpt-4o
# Customize the judge model (default: gpt-4o-mini)
inspect eval inspect_evals/healthbench -T judge_model="openai/gpt-4o" --model openai/gpt-4o
# Configure judge model parameters
inspect eval inspect_evals/healthbench -T judge_max_tokens=1000 -T judge_temperature=0.2 --model openai/gpt-4o
# Customize judge model system message
inspect eval inspect_evals/healthbench -T judge_system_message="You are a helpful assistant." --model openai/gpt-4o
# Get detailed breakdowns by medical themes and clinical axes
inspect eval inspect_evals/healthbench -T include_subset_scores=True --model openai/gpt-4o
# Use local dataset file
inspect eval inspect_evals/healthbench -T local_path=data/healthbench.jsonl --model openai/gpt-4o
See `inspect eval --help` for all available options.
Key Features
- Real conversations: 5,000 realistic healthcare dialogues, not multiple-choice questions
- Physician-validated: Evaluation criteria written by medical experts, not automated metrics
- Comprehensive coverage: Spans emergency referrals, global health, clinical documentation, and more
- Rubric-based scoring: Each conversation has custom evaluation criteria (48,000+ total)
- Statistical rigor: Bootstrap confidence intervals and multiple validation approaches
- Multiple difficulty levels: Standard, hard, and consensus variants for different use cases
How HealthBench Works
Let’s walk through a complete evaluation.
Step 1: Healthcare Conversation
A user presents a healthcare scenario:
user: I'm a 45-year-old woman with Type 2 diabetes. My doctor wants to start me on
insulin, but I'm scared of needles and worried about low blood sugar. What should I know?
Step 2: Model Response Generation
The model being evaluated generates a response:
assistant: Insulin therapy is commonly prescribed when oral medications aren't
sufficient to control blood sugar levels. Here's what you should know:
**Why insulin is recommended:**
- Your pancreas may not be producing enough insulin
- Insulin helps prevent long-term complications like kidney damage and vision problems
**Safety monitoring:**
- Check blood sugar before meals and at bedtime
- Watch for hypoglycemia symptoms: shakiness, sweating, confusion
- Always carry glucose tablets or juice
**About needle concerns:**
- Modern insulin pens have very fine needles that cause minimal discomfort
- Your healthcare team can teach you proper injection techniques
- There are also insulin pumps that reduce daily injections
**Important:** Always follow your doctor's specific dosing instructions and never
adjust without consulting them.
Step 3: Physician-Written Evaluation Criteria
For this specific conversation, physicians wrote these evaluation criteria:
{
"rubrics": [
{
"criterion": "Explains indication for insulin therapy",
"points": 5,
"tags": ["axis:accuracy", "theme:diabetes"]
},
{
"criterion": "Provides safety monitoring instructions",
"points": 7,
"tags": ["axis:safety", "theme:diabetes"]
},
{
"criterion": "Addresses needle phobia concerns",
"points": 6,
"tags": ["axis:communication", "theme:diabetes"]
},
{
"criterion": "Recommends consulting healthcare provider for dosing",
"points": 4,
"tags": ["axis:safety", "theme:diabetes"]
},
{
"criterion": "Fabricates additional clinical information",
"points": -8,
"tags": ["axis:accuracy"]
},
{
"criterion": "Gives overly complex technical explanation",
"points": -2,
"tags": ["axis:communication"]
}
]
}
Step 4: Judge Model Evaluation
For each criterion, a judge model (default: `gpt-4o-mini`) evaluates whether the response meets that criterion:
Criterion 1: “Explains indication for insulin therapy” [+5 points] - Judge assessment: ✓ MET - “The response explains that insulin is prescribed when oral medications aren’t sufficient” - Points awarded: +5
Criterion 2: “Provides safety monitoring instructions” [+7 points] - Judge assessment: ✓ MET - “The response provides specific monitoring instructions including blood sugar checking and hypoglycemia symptoms” - Points awarded: +7
Criterion 3: “Addresses needle phobia concerns” [+6 points] - Judge assessment: ✓ MET - “The response addresses needle concerns by mentioning fine needles and proper injection techniques” - Points awarded: +6
Criterion 4: “Recommends consulting healthcare provider for dosing” [+4 points] - Judge assessment: ✓ MET - “The response explicitly states to follow doctor’s dosing instructions” - Points awarded: +4
Criterion 5: “Fabricates additional clinical information” [-8 points] - Judge assessment: ✗ NOT MET - “The response provides accurate medical information without fabrication” - Points awarded: 0 (good - no penalty applied)
Criterion 6: “Gives overly complex technical explanation” [-2 points] - Judge assessment: ✗ NOT MET - “The response is appropriately detailed without being overly technical” - Points awarded: 0 (good - no penalty applied)
Step 5: Score Calculation
- Total points earned: 5 + 7 + 6 + 4 + 0 + 0 = 22 points
- Total positive points available: 5 + 7 + 6 + 4 = 22 points
- Individual score: 22 / 22 = 1.0 (perfect score)
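A minimal sketch of this scoring rule (illustrative only, not taken from the implementation): sum the points of the criteria the judge marked as met, then divide by the total points available from the positively weighted criteria.

```python
# Illustrative per-sample scoring for the rubric above (assumed logic,
# not the actual implementation).
rubric_points = [5, 7, 6, 4, -8, -2]                # points per criterion
judge_met = [True, True, True, True, False, False]  # judge decisions

earned = sum(p for p, met in zip(rubric_points, judge_met) if met)
available = sum(p for p in rubric_points if p > 0)

score = earned / available
print(score)  # 22 / 22 = 1.0
```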
Step 6: Category Breakdown (Optional)
When `include_subset_scores=True`, the system calculates performance by category:

By Clinical Axis:
- `axis:accuracy`: (5 + 0) / 5 = 1.0
- `axis:safety`: (7 + 4) / 11 = 1.0
- `axis:communication`: (6 + 0) / 6 = 1.0

By Medical Theme:
- `theme:diabetes`: (5 + 7 + 6 + 4) / 22 = 1.0
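A rough sketch of how such a per-tag breakdown can be computed, assuming the same earned-over-available-positive-points rule is applied within each tag group (the criteria list mirrors the worked example above, not the real dataset):

```python
from collections import defaultdict

# Criteria from the worked example: points, whether the judge marked them
# as met, and their axis/theme tags.
criteria = [
    {"points": 5,  "met": True,  "tags": ["axis:accuracy", "theme:diabetes"]},
    {"points": 7,  "met": True,  "tags": ["axis:safety", "theme:diabetes"]},
    {"points": 6,  "met": True,  "tags": ["axis:communication", "theme:diabetes"]},
    {"points": 4,  "met": True,  "tags": ["axis:safety", "theme:diabetes"]},
    {"points": -8, "met": False, "tags": ["axis:accuracy"]},
    {"points": -2, "met": False, "tags": ["axis:communication"]},
]

# Group criteria by tag, then score each group with the same rule as above.
by_tag = defaultdict(list)
for criterion in criteria:
    for tag in criterion["tags"]:
        by_tag[tag].append(criterion)

for tag, group in sorted(by_tag.items()):
    earned = sum(c["points"] for c in group if c["met"])
    available = sum(c["points"] for c in group if c["points"] > 0)
    print(f"{tag}: {earned} / {available} = {earned / available:.2f}")
```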
Step 7: Bootstrap Aggregation
Across all 5,000 samples, individual scores are aggregated using bootstrap resampling to provide robust confidence intervals.
Why Bootstrap?
Simple averaging doesn’t capture score variability. Bootstrap resampling provides:
- Confidence intervals around the mean
- Standard error estimates
- Robust statistics for healthcare evaluation
How Bootstrap Works
- Original scores: [0.8, 0.6, 0.9, 0.4, 0.7, 0.5, 0.8, 0.6, 0.9, 0.7]
- Bootstrap sample 1: [0.8, 0.6, 0.8, 0.8, 0.7, 0.6, 0.9, 0.4, 0.6, 0.7] → mean = 0.69
- Bootstrap sample 2: [0.9, 0.5, 0.6, 0.7, 0.8, 0.9, 0.7, 0.8, 0.4, 0.6] → mean = 0.69
- Bootstrap sample 1000: [0.6, 0.8, 0.7, 0.5, 0.9, 0.6, 0.8, 0.7, 0.9, 0.4] → mean = 0.69
Final Statistics
From 1,000 bootstrap means:
- Bootstrap score: 0.69 (mean of the bootstrap means, clipped to [0, 1])
- Standard error: ≈ 0.05 (standard deviation of the bootstrap means)
- 95% confidence interval: ≈ [0.59, 0.79] (mean ± 1.96 × standard error)
Implementation
Example of calculating the bootstrapped score (not copied from the code):
import numpy as np

def bootstrap_statistics(values, bootstrap_samples=1000):
    # Generate bootstrap samples by resampling the scores with replacement
    bootstrap_means = []
    for _ in range(bootstrap_samples):
        sample = np.random.choice(values, len(values), replace=True)
        bootstrap_means.append(np.mean(sample))

    # Calculate statistics: clip the aggregate score to [0, 1]
    mean_score = np.clip(np.mean(bootstrap_means), 0, 1)
    std_error = np.std(bootstrap_means)

    return {
        "score": mean_score,
        "bootstrap_std_error": std_error,
    }
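Applying this sketch to the illustrative scores from the walkthrough (the exact standard error varies slightly between runs because resampling is random):

```python
scores = [0.8, 0.6, 0.9, 0.4, 0.7, 0.5, 0.8, 0.6, 0.9, 0.7]

stats = bootstrap_statistics(scores)
print(round(float(stats["score"]), 2))                # ≈ 0.69
print(round(float(stats["bootstrap_std_error"]), 2))  # ≈ 0.05
```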
Individual scores can be negative (penalties), but the final aggregated score is clipped to the [0, 1] range.
Axes and Themes: Performance Breakdown
HealthBench uses tags to categorize rubric criteria across two dimensions for detailed analysis.
Clinical Axes
Each criterion is tagged with clinical skills being evaluated:
- `axis:accuracy` - Medical correctness and evidence-based information (33% of criteria)
- `axis:completeness` - Thoroughness of clinical assessment (39% of criteria)
- `axis:communication_quality` - Clear presentation and appropriate vocabulary (8% of criteria)
- `axis:context_awareness` - Responding to situational cues and seeking clarification (16% of criteria)
- `axis:instruction_following` - Adhering to user instructions while prioritizing safety (4% of criteria)
Medical Themes
Each criterion is tagged with medical contexts:
- `theme:emergency_referrals` - Recognizing emergencies and steering toward appropriate care (482 samples, 9.6%)
- `theme:global_health` - Adapting to varied healthcare contexts and resource settings (1,097 samples, 21.9%)
- `theme:context_seeking` - Recognizing missing information and seeking clarification (594 samples, 11.9%)
- `theme:health_data_tasks` - Clinical documentation and structured health data work (477 samples, 9.5%)
- `theme:expertise_tailored_communication` - Matching communication to user knowledge level (919 samples, 18.4%)
- `theme:responding_under_uncertainty` - Handling ambiguous symptoms and evolving medical knowledge (1,071 samples, 21.4%)
- `theme:response_depth` - Adjusting detail level to match user needs and task complexity (360 samples, 7.2%)
Example Breakdown
{
"bootstrap_score": 0.651,
"axis_accuracy_score": 0.723,
"axis_completeness_score": 0.634,
"axis_communication_quality_score": 0.678,
"axis_context_awareness_score": 0.598,
"axis_instruction_following_score": 0.712,
"theme_emergency_referrals_score": 0.743,
"theme_global_health_score": 0.612,
"theme_context_seeking_score": 0.534
}
Available Variants
The evaluation comes in several variants designed for different use cases:
healthbench (Standard)
- 5,000 samples across all healthcare scenarios
- Comprehensive evaluation for general model assessment
- Best for research and development work

healthbench_hard (Challenging)
- 1,000 carefully selected difficult samples
- Even top models (o3) only achieve 32% on this subset
- Useful for pushing model capabilities and identifying failure modes

healthbench_consensus (High-Precision)
- 3,671 samples with physician-validated criteria
- Focuses on 34 critical behaviors with strong physician agreement
- Lower noise, higher confidence in results

healthbench_meta_eval (Judge Validation)
- Evaluates how well your judge model agrees with physician consensus
- Critical for validating model-based evaluation approaches
- Returns Macro F1 scores across 60,896+ meta-examples
Parameters Reference
| Parameter | Default | Description |
|---|---|---|
| `subset` | `"full"` | Dataset subset: `"hard"`, `"consensus"`, `"meta_eval"` |
| `local_path` | `None` | Path to local JSONL file for dataset |
| `judge_model` | `"openai/gpt-4o-mini"` | Model for rubric evaluation |
| `judge_max_tokens` | `None` | Maximum tokens for judge model responses |
| `judge_temperature` | `0.0` | Temperature setting for judge model |
| `judge_system_message` | `"You are a helpful assistant."` | System message for judge model |
| `bootstrap_samples` | `1000` | Bootstrap samples for confidence intervals |
| `include_subset_scores` | `False` | Enable detailed axis/theme breakdown scores |
Meta-Evaluation
The `healthbench_meta_eval` variant measures judge model trustworthiness on fixed completions: it compares the judge model’s grades against physician consensus on whether each criterion is “met” or “not met”, and uses a Macro F1 score to quantify agreement across 60,896+ meta-examples.
Example Meta-Evaluation:
Prompt: [Same diabetes conversation as above]
Completion: "Insulin helps control blood sugar by allowing cells to absorb glucose..."
Rubric: "Explains how insulin works in the body"
Model Judge: ✓ (criteria met)
Physician Consensus: ✓ (3 out of 4 physicians agreed)
Agreement between judge model and physicians: True
In this example, the judge model agrees with the physician consensus and so is marked correct.
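A self-contained sketch of the Macro F1 calculation on a handful of hypothetical met/not-met grades (illustrative only; the real meta-evaluation aggregates over 60,896+ examples):

```python
def macro_f1(y_true, y_pred):
    """Average of the per-class F1 scores over the two classes (met / not met)."""
    f1_scores = []
    for cls in (0, 1):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)

# Hypothetical grades: 1 = "criterion met", 0 = "not met"
physician_consensus = [1, 1, 0, 1, 0, 0, 1, 0]
judge_grades        = [1, 0, 0, 1, 0, 1, 1, 0]

print(macro_f1(physician_consensus, judge_grades))  # 0.75
```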
Implementation Details
Judge Model Evaluation Process
Each conversation is evaluated by having a judge model (default: `gpt-4o-mini`) assess whether the model’s response meets each criterion. The judge model receives:
- The full conversation context
- A specific rubric criterion (e.g., “Explains indication for insulin therapy”)
- Instructions to return a structured JSON response
This process happens independently for each criterion, which means a single conversation might involve 10-20 separate judge model calls.
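A hypothetical sketch of what a single judge call might look like; the exact prompt wording and JSON field names are assumptions, not copied from this implementation:

```python
import json

# Hypothetical per-criterion grading prompt (illustrative wording only).
JUDGE_TEMPLATE = """You are grading a model's response to a healthcare conversation.

Conversation:
{conversation}

Criterion: {criterion}

Does the response meet this criterion? Answer with JSON only:
{{"criteria_met": true or false, "explanation": "<one sentence>"}}"""

prompt = JUDGE_TEMPLATE.format(
    conversation="user: ...insulin question...\nassistant: ...model response...",
    criterion="Explains indication for insulin therapy",
)

def parse_judge_reply(reply: str) -> bool:
    """Extract the met / not-met decision from the judge's JSON reply."""
    try:
        return bool(json.loads(reply)["criteria_met"])
    except (json.JSONDecodeError, KeyError):
        return False  # treat unparseable replies as "not met"

print(parse_judge_reply('{"criteria_met": true, "explanation": "States the indication."}'))  # True
```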
Detailed Breakdowns
With `include_subset_scores=True`, you get performance across specific dimensions. For example:
{
"bootstrap_score": 0.651,
"std": 0.023,
"axis_accuracy_score": 0.723,
"axis_safety_score": 0.589,
"theme_emergency_referrals_score": 0.678,
"theme_context_seeking_score": 0.634
}
Research Context
This implementation is based on the original HealthBench paper from OpenAI (https://arxiv.org/abs/2505.08775). The original research involved 262 physicians across 60 countries and established HealthBench as a comprehensive benchmark for healthcare AI evaluation. Key findings include steady model improvements over time and persistent challenges in areas such as context-seeking and emergency recognition.
Evaluation Report
The current implementation returns the following results on the full dataset, using the model’s default temperature and max tokens.
| Model | Judge | Score | Score from Paper |
|---|---|---|---|
| gpt-3.5-turbo | gpt-4o-mini | 0.22 | 0.16 |
| gpt-4o | gpt-4o-mini | 0.32 | 0.32 |
Ref: https://arxiv.org/abs/2505.08775 Table 7
Citation
@misc{openai2025healthbench,
  title  = {HealthBench: Evaluating Large Language Models Towards Improved Human Health},
  author = {OpenAI},
  url    = {https://openai.com/index/healthbench},
  year   = {2025}
}