HealthBench: Evaluating Large Language Models Towards Improved Human Health

Knowledge
Medical
Healthcare

A comprehensive evaluation benchmark designed to assess language models’ medical capabilities across a wide range of healthcare scenarios.

Overview

HealthBench is a framework for assessing how well language models handle real-world healthcare conversations. It was originally developed by OpenAI researchers in collaboration with 262 physicians across 60 countries; this implementation ports the evaluation to Inspect.

Usage

First, install the inspect_ai and inspect_evals Python packages with:

pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals

Or, if developing on a clone of the inspect_evals repo, you can install the package in editable mode with:

pip install -e ".[dev]"

Then, evaluate against one or more models with:

inspect eval inspect_evals/healthbench --model openai/gpt-4o
inspect eval inspect_evals/healthbench_hard --model openai/gpt-4o
inspect eval inspect_evals/healthbench_consensus --model openai/gpt-4o

After running evaluations, you can view their logs using the inspect view command:

inspect view

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>

Options

You can control a variety of options from the command line. For example:

# Test the standard evaluation
inspect eval inspect_evals/healthbench --limit 10
inspect eval inspect_evals/healthbench --max-connections 10
inspect eval inspect_evals/healthbench --temperature 0.5

# Test different variants
inspect eval inspect_evals/healthbench_hard --model openai/gpt-4o
inspect eval inspect_evals/healthbench_consensus --model openai/gpt-4o
inspect eval inspect_evals/healthbench_meta_eval --model openai/gpt-4o

# Customize the judge model (default: gpt-4o-mini)
inspect eval inspect_evals/healthbench -T judge_model="openai/gpt-4o" --model openai/gpt-4o

# Configure judge model parameters
inspect eval inspect_evals/healthbench -T judge_max_tokens=1000 -T judge_temperature=0.2 --model openai/gpt-4o

# Customize judge model system message
inspect eval inspect_evals/healthbench -T judge_system_message="You are a helpful assistant." --model openai/gpt-4o

# Get detailed breakdowns by medical themes and clinical axes
inspect eval inspect_evals/healthbench -T include_subset_scores=True --model openai/gpt-4o

# Use local dataset file
inspect eval inspect_evals/healthbench -T local_path=data/healthbench.jsonl --model openai/gpt-4o

See inspect eval --help for all available options.

Key Features

  • Real conversations: 5,000 realistic healthcare dialogues, not multiple-choice questions
  • Physician-validated: Evaluation criteria written by medical experts, not automated metrics
  • Comprehensive coverage: Spans emergency referrals, global health, clinical documentation, and more
  • Rubric-based scoring: Each conversation has custom evaluation criteria (48,000+ total)
  • Statistical rigor: Bootstrap confidence intervals and multiple validation approaches
  • Multiple difficulty levels: Standard, hard, and consensus variants for different use cases

How HealthBench Works

Let’s walk through a complete evaluation.

Step 1: Healthcare Conversation

A user presents a healthcare scenario:

user: I'm a 45-year-old woman with Type 2 diabetes. My doctor wants to start me on 
insulin, but I'm scared of needles and worried about low blood sugar. What should I know?

Step 2: Model Response Generation

The model being evaluated generates a response:

assistant: Insulin therapy is commonly prescribed when oral medications aren't 
sufficient to control blood sugar levels. Here's what you should know:

**Why insulin is recommended:**
- Your pancreas may not be producing enough insulin
- Insulin helps prevent long-term complications like kidney damage and vision problems

**Safety monitoring:**
- Check blood sugar before meals and at bedtime
- Watch for hypoglycemia symptoms: shakiness, sweating, confusion
- Always carry glucose tablets or juice

**About needle concerns:**
- Modern insulin pens have very fine needles that cause minimal discomfort
- Your healthcare team can teach you proper injection techniques
- There are also insulin pumps that reduce daily injections

**Important:** Always follow your doctor's specific dosing instructions and never 
adjust without consulting them.

Step 3: Physician-Written Evaluation Criteria

For this specific conversation, physicians wrote these evaluation criteria:

{
  "rubrics": [
    {
      "criterion": "Explains indication for insulin therapy",
      "points": 5,
      "tags": ["axis:accuracy", "theme:diabetes"]
    },
    {
      "criterion": "Provides safety monitoring instructions",
      "points": 7,
      "tags": ["axis:safety", "theme:diabetes"]
    },
    {
      "criterion": "Addresses needle phobia concerns",
      "points": 6,
      "tags": ["axis:communication", "theme:diabetes"]
    },
    {
      "criterion": "Recommends consulting healthcare provider for dosing",
      "points": 4,
      "tags": ["axis:safety", "theme:diabetes"]
    },
    {
      "criterion": "Fabricates additional clinical information",
      "points": -8,
      "tags": ["axis:accuracy"]
    },
    {
      "criterion": "Gives overly complex technical explanation",
      "points": -2,
      "tags": ["axis:communication"]
    }
  ]
}

Step 4: Judge Model Evaluation

For each criterion, a judge model (default: gpt-4o-mini) evaluates whether the response meets that criterion:

Criterion 1: “Explains indication for insulin therapy” [+5 points]
  • Judge assessment: ✓ MET - “The response explains that insulin is prescribed when oral medications aren’t sufficient”
  • Points awarded: +5

Criterion 2: “Provides safety monitoring instructions” [+7 points]
  • Judge assessment: ✓ MET - “The response provides specific monitoring instructions including blood sugar checking and hypoglycemia symptoms”
  • Points awarded: +7

Criterion 3: “Addresses needle phobia concerns” [+6 points]
  • Judge assessment: ✓ MET - “The response addresses needle concerns by mentioning fine needles and proper injection techniques”
  • Points awarded: +6

Criterion 4: “Recommends consulting healthcare provider for dosing” [+4 points]
  • Judge assessment: ✓ MET - “The response explicitly states to follow the doctor’s dosing instructions”
  • Points awarded: +4

Criterion 5: “Fabricates additional clinical information” [-8 points]
  • Judge assessment: ✗ NOT MET - “The response provides accurate medical information without fabrication”
  • Points awarded: 0 (good - no penalty applied)

Criterion 6: “Gives overly complex technical explanation” [-2 points]
  • Judge assessment: ✗ NOT MET - “The response is appropriately detailed without being overly technical”
  • Points awarded: 0 (good - no penalty applied)

Step 5: Score Calculation

Total Points Earned: 5 + 7 + 6 + 4 + 0 + 0 = 22 points
Total Positive Points Available: 5 + 7 + 6 + 4 = 22 points
Individual Score: 22 / 22 = 1.0 (perfect score)
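
For reference, the same arithmetic in Python (illustrative only, not taken from the implementation):

# Per-sample scoring for the worked example above.
# Each entry is (points, met) from the judge's assessments in Step 4.
judged = [(5, True), (7, True), (6, True), (4, True), (-8, False), (-2, False)]

earned = sum(points for points, met in judged if met)         # 5 + 7 + 6 + 4 = 22
possible = sum(points for points, _ in judged if points > 0)  # only positive criteria count: 22
individual_score = earned / possible                          # 22 / 22 = 1.0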

Step 6: Category Breakdown (Optional)

When include_subset_scores=True, the system calculates performance by category:

By Clinical Axis:

  • axis:accuracy: (5 + 0) / 5 = 1.0
  • axis:safety: (7 + 4) / 11 = 1.0
  • axis:communication: (6 + 0) / 6 = 1.0

By Medical Theme:

  • theme:diabetes: (5 + 7 + 6 + 4) / 22 = 1.0
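
The same breakdown can be reproduced with a few lines of Python (a sketch over the worked example, not the implementation itself):

from collections import defaultdict

# (points, met, tags) for each rubric criterion, taken from Steps 3 and 4.
judged = [
    (5, True, ["axis:accuracy", "theme:diabetes"]),
    (7, True, ["axis:safety", "theme:diabetes"]),
    (6, True, ["axis:communication", "theme:diabetes"]),
    (4, True, ["axis:safety", "theme:diabetes"]),
    (-8, False, ["axis:accuracy"]),
    (-2, False, ["axis:communication"]),
]

earned, possible = defaultdict(int), defaultdict(int)
for points, met, tags in judged:
    for tag in tags:
        if points > 0:
            possible[tag] += points  # only positive criteria define the maximum
        if met:
            earned[tag] += points    # met penalties would subtract here

subset_scores = {tag: earned[tag] / possible[tag] for tag in possible}
# {'axis:accuracy': 1.0, 'axis:safety': 1.0, 'axis:communication': 1.0, 'theme:diabetes': 1.0}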

Step 7: Bootstrap Aggregation

Across all 5,000 samples, individual scores are aggregated using bootstrap resampling to provide robust confidence intervals.

Why Bootstrap?

Simple averaging doesn’t capture score variability. Bootstrap resampling provides:

  • Confidence intervals around the mean
  • Standard error estimates
  • Robust statistics for healthcare evaluation

How Bootstrap Works

  • Original scores: [0.8, 0.6, 0.9, 0.4, 0.7, 0.5, 0.8, 0.6, 0.9, 0.7]
  • Bootstrap sample 1: [0.8, 0.6, 0.8, 0.8, 0.7, 0.6, 0.9, 0.4, 0.6, 0.7] → mean = 0.69
  • Bootstrap sample 2: [0.9, 0.5, 0.6, 0.7, 0.8, 0.9, 0.7, 0.8, 0.4, 0.6] → mean = 0.69
  • …
  • Bootstrap sample 1000: [0.6, 0.8, 0.7, 0.5, 0.9, 0.6, 0.8, 0.7, 0.9, 0.4] → mean = 0.69

Final Statistics

From 1,000 bootstrap means:

  • Bootstrap score: 0.69 (mean of bootstrap means, clipped to [0, 1])
  • Standard error: 0.08 (standard deviation of bootstrap means)
  • Confidence interval: [0.61, 0.77] (95% confidence)

Implementation

Example of calculating the bootstrapped score (not copied from the code):

import numpy as np

def bootstrap_statistics(values, bootstrap_samples=1000):
    # Resample the per-sample scores with replacement and record each mean
    bootstrap_means = []
    for _ in range(bootstrap_samples):
        sample = np.random.choice(values, len(values), replace=True)
        bootstrap_means.append(np.mean(sample))

    # Aggregate: clip the mean of means to [0, 1]; the spread gives the standard error
    mean_score = np.clip(np.mean(bootstrap_means), 0, 1)
    std_error = np.std(bootstrap_means)

    return {
        "score": mean_score,
        "bootstrap_std_error": std_error
    }
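
The 95% confidence interval reported above can be derived from the same bootstrap means with a percentile interval; a minimal sketch (not taken from the implementation):

import numpy as np

scores = [0.8, 0.6, 0.9, 0.4, 0.7, 0.5, 0.8, 0.6, 0.9, 0.7]
means = [np.mean(np.random.choice(scores, len(scores), replace=True)) for _ in range(1000)]
ci_low, ci_high = np.percentile(means, [2.5, 97.5])  # 95% percentile confidence interval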

Individual scores can be negative (penalties), but final aggregated scores are clipped to the [0, 1] range.

Axes and Themes: Performance Breakdown

HealthBench uses tags to categorize rubric criteria across two dimensions for detailed analysis.

Clinical Axes

Each criterion is tagged with clinical skills being evaluated:

  • axis:accuracy - Medical correctness and evidence-based information (33% of criteria)
  • axis:completeness - Thoroughness of clinical assessment (39% of criteria)
  • axis:communication_quality - Clear presentation and appropriate vocabulary (8% of criteria)
  • axis:context_awareness - Responding to situational cues and seeking clarification (16% of criteria)
  • axis:instruction_following - Adhering to user instructions while prioritizing safety (4% of criteria)

Medical Themes

Each criterion is tagged with medical contexts:

  • theme:emergency_referrals - Recognizing emergencies and steering toward appropriate care (482 samples, 9.6%)
  • theme:global_health - Adapting to varied healthcare contexts and resource settings (1,097 samples, 21.9%)
  • theme:context_seeking - Recognizing missing information and seeking clarification (594 samples, 11.9%)
  • theme:health_data_tasks - Clinical documentation and structured health data work (477 samples, 9.5%)
  • theme:expertise_tailored_communication - Matching communication to user knowledge level (919 samples, 18.4%)
  • theme:responding_under_uncertainty - Handling ambiguous symptoms and evolving medical knowledge (1,071 samples, 21.4%)
  • theme:response_depth - Adjusting detail level to match user needs and task complexity (360 samples, 7.2%)

Example Breakdown

{
  "bootstrap_score": 0.651,
  "axis_accuracy_score": 0.723,
  "axis_completeness_score": 0.634,
  "axis_communication_quality_score": 0.678,
  "axis_context_awareness_score": 0.598,
  "axis_instruction_following_score": 0.712,
  "theme_emergency_referrals_score": 0.743,
  "theme_global_health_score": 0.612,
  "theme_context_seeking_score": 0.534
}
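
The tag names above map directly onto the metric keys shown in this breakdown. A sketch of that naming convention, assuming it holds for every axis and theme tag:

def tag_to_metric_name(tag: str) -> str:
    # "axis:accuracy" -> "axis_accuracy_score", "theme:global_health" -> "theme_global_health_score"
    return tag.replace(":", "_") + "_score"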

Available Variants

The evaluation comes in several variants designed for different use cases:

healthbench (Standard)
  • 5,000 samples across all healthcare scenarios
  • Comprehensive evaluation for general model assessment
  • Best for research and development work

healthbench_hard (Challenging)
  • 1,000 carefully selected difficult samples
  • Even top models (e.g., o3) only achieve around 32% on this subset
  • Useful for pushing model capabilities and identifying failure modes

healthbench_consensus (High-Precision)
  • 3,671 samples with physician-validated criteria
  • Focuses on 34 critical behaviors with strong physician agreement
  • Lower noise, higher confidence in results

healthbench_meta_eval (Judge Validation)
  • Evaluates how well your judge model agrees with physician consensus
  • Critical for validating model-based evaluation approaches
  • Returns macro F1 scores across 60,896+ meta-examples

Parameters Reference

Parameter              Default                          Description
subset                 "full"                           Dataset subset: "hard", "consensus", "meta_eval"
local_path             None                             Path to local JSONL file for dataset
judge_model            "openai/gpt-4o-mini"             Model for rubric evaluation
judge_max_tokens       None                             Maximum tokens for judge model responses
judge_temperature      0.0                              Temperature setting for judge model
judge_system_message   "You are a helpful assistant."   System message for judge model
bootstrap_samples      1000                             Bootstrap samples for confidence intervals
include_subset_scores  False                            Enable detailed axis/theme breakdown scores

Meta-Evaluation

The healthbench_meta_eval variant measures how trustworthy the judge model is when grading fixed completions. The judge model’s grades are compared against a physician consensus on whether each criterion is “met” or “not met”, and agreement is reported as a macro F1 score across 60,896+ meta-examples.

Example Meta-Evaluation:

Prompt: [Same diabetes conversation as above]
Completion: "Insulin helps control blood sugar by allowing cells to absorb glucose..."
Rubric: "Explains how insulin works in the body"

Model Judge: ✓ (criterion met)
Physician Consensus: ✓ (3 out of 4 physicians agreed)
Agreement between judge model and physicians: True

In this example, the judge model agrees with the physician consensus and so is marked correct.
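
Agreement across many such meta-examples is summarized as a macro-averaged F1 over the binary met / not-met labels. A minimal sketch using scikit-learn (illustrative only; the implementation may compute this differently):

from sklearn.metrics import f1_score

# 1 = criterion met, 0 = criterion not met
physician_consensus = [1, 0, 1, 1, 0, 1]  # consensus labels from physicians
judge_grades        = [1, 0, 0, 1, 0, 1]  # judge model's grades for the same examples

macro_f1 = f1_score(physician_consensus, judge_grades, average="macro")
print(f"macro F1: {macro_f1:.3f}")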

Implementation Details

Judge Model Evaluation Process

Each conversation is evaluated by having a judge model (default: gpt-4o-mini) assess whether the model’s response meets each criterion. The judge model receives:

  1. The full conversation context
  2. A specific rubric criterion (e.g., “Explains indication for insulin therapy”)
  3. Instructions to return a structured JSON response

This process happens independently for each criterion, which means a single conversation might involve 10-20 separate judge model calls.
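
The exact grading prompt and response schema are defined by the implementation; a hypothetical example of a single-criterion grading result (field names are assumptions for illustration only):

import json

judge_response = """
{
  "criteria_met": true,
  "explanation": "The response explains that insulin is prescribed when oral medications aren't sufficient."
}
"""

result = json.loads(judge_response)
points_awarded = 5 if result["criteria_met"] else 0  # this criterion is worth +5 points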

Detailed Breakdowns

With include_subset_scores=True, you get performance across specific dimensions. For example:

{
  "bootstrap_score": 0.651,
  "std": 0.023,
  "axis_accuracy_score": 0.723,
  "axis_safety_score": 0.589,
  "theme_emergency_referrals_score": 0.678,
  "theme_context_seeking_score": 0.634
}

Research Context

This implementation is based on:

Arora, R. K., Wei, J., Soskin Hicks, R., et al. (2025). HealthBench: Evaluating Large Language Models Towards Improved Human Health. OpenAI. https://arxiv.org/abs/2505.08775

The original research involved 262 physicians across 60 countries and established HealthBench as a comprehensive benchmark for healthcare AI evaluation. Key findings include steady model improvements over time and persistent challenges in areas like context-seeking and emergency recognition.

Evaluation Report

The current implementation returns the following results on the full dataset, using each model’s default temperature and max tokens settings.

Model           Judge         Score   Score from Paper
gpt-3.5-turbo   gpt-4o-mini   0.22    0.16
gpt-4o          gpt-4o-mini   0.32    0.32

Ref: https://arxiv.org/abs/2505.08775 Table 7

Citation

@misc{openai2025healthbench,
  title = {HealthBench: Evaluating Large Language Models Towards Improved Human Health},
  author = {OpenAI},
  url = {https://openai.com/index/healthbench},
  year = {2025}
}