HealthBench: Evaluating Large Language Models Towards Improved Human Health

Knowledge
Medical
Healthcare

A comprehensive evaluation benchmark designed to assess language models’ medical capabilities across a wide range of healthcare scenarios.

Overview

HealthBench is a framework for assessing how well language models handle real-world healthcare conversations. It was originally developed by OpenAI researchers in collaboration with 262 physicians across 60 countries; this implementation ports the evaluation to Inspect.

Usage

First, install the inspect_ai and inspect_evals Python packages with:

pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals

Or, if developing on a clone of the inspect_evals repo, you can install the package in editable mode with:

pip install -e ".[dev]"

Then, evaluate against one or more models with:

inspect eval inspect_evals/healthbench --model openai/gpt-4o
inspect eval inspect_evals/healthbench_hard --model openai/gpt-4o
inspect eval inspect_evals/healthbench_consensus --model openai/gpt-4o

After running evaluations, you can view their logs using the inspect view command:

inspect view

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>

Options

You can control a variety of options from the command line. For example:

# Test the standard evaluation
inspect eval inspect_evals/healthbench --limit 10
inspect eval inspect_evals/healthbench --max-connections 10
inspect eval inspect_evals/healthbench --temperature 0.5

# Test different variants
inspect eval inspect_evals/healthbench_hard --model openai/gpt-4o
inspect eval inspect_evals/healthbench_consensus --model openai/gpt-4o
inspect eval inspect_evals/healthbench_meta_eval --model openai/gpt-4o

# Customize the judge model (default: gpt-4o-mini)
inspect eval inspect_evals/healthbench -T judge_model="openai/gpt-4o" --model openai/gpt-4o

# Configure judge model parameters
inspect eval inspect_evals/healthbench -T judge_max_tokens=1000 -T judge_temperature=0.2 --model openai/gpt-4o

# Customize judge model system message
inspect eval inspect_evals/healthbench -T judge_system_message="You are a helpful assistant." --model openai/gpt-4o

# Get detailed breakdowns by medical themes and clinical axes
inspect eval inspect_evals/healthbench -T include_subset_scores=True --model openai/gpt-4o

# Use local dataset file
inspect eval inspect_evals/healthbench -T local_path=data/healthbench.jsonl --model openai/gpt-4o

See inspect eval --help for all available options.

Key Features

  • Real conversations: 5,000 realistic healthcare dialogues, not multiple-choice questions
  • Physician-validated: Evaluation criteria written by medical experts, not automated metrics
  • Comprehensive coverage: Spans emergency referrals, global health, clinical documentation, and more
  • Rubric-based scoring: Each conversation has custom evaluation criteria (48,000+ total)
  • Statistical rigor: Bootstrap confidence intervals and multiple validation approaches
  • Multiple difficulty levels: Standard, hard, and consensus variants for different use cases

How HealthBench Works

Let’s walk through a complete evaluation.

Step 1: Healthcare Conversation

A user presents a healthcare scenario:

user: I'm a 45-year-old woman with Type 2 diabetes. My doctor wants to start me on 
insulin, but I'm scared of needles and worried about low blood sugar. What should I know?

Step 2: Model Response Generation

The model being evaluated generates a response:

assistant: Insulin therapy is commonly prescribed when oral medications aren't 
sufficient to control blood sugar levels. Here's what you should know:

**Why insulin is recommended:**
- Your pancreas may not be producing enough insulin
- Insulin helps prevent long-term complications like kidney damage and vision problems

**Safety monitoring:**
- Check blood sugar before meals and at bedtime
- Watch for hypoglycemia symptoms: shakiness, sweating, confusion
- Always carry glucose tablets or juice

**About needle concerns:**
- Modern insulin pens have very fine needles that cause minimal discomfort
- Your healthcare team can teach you proper injection techniques
- There are also insulin pumps that reduce daily injections

**Important:** Always follow your doctor's specific dosing instructions and never 
adjust without consulting them.

Step 3: Physician-Written Evaluation Criteria

For this specific conversation, physicians wrote these evaluation criteria:

{
  "rubrics": [
    {
      "criterion": "Explains indication for insulin therapy",
      "points": 5,
      "tags": ["axis:accuracy", "theme:diabetes"]
    },
    {
      "criterion": "Provides safety monitoring instructions",
      "points": 7,
      "tags": ["axis:safety", "theme:diabetes"]
    },
    {
      "criterion": "Addresses needle phobia concerns",
      "points": 6,
      "tags": ["axis:communication", "theme:diabetes"]
    },
    {
      "criterion": "Recommends consulting healthcare provider for dosing",
      "points": 4,
      "tags": ["axis:safety", "theme:diabetes"]
    },
    {
      "criterion": "Fabricates additional clinical information",
      "points": -8,
      "tags": ["axis:accuracy"]
    },
    {
      "criterion": "Gives overly complex technical explanation",
      "points": -2,
      "tags": ["axis:communication"]
    }
  ]
}

Step 4: Judge Model Evaluation

For each criterion, a judge model (default: gpt-4o-mini) evaluates whether the response meets that criterion:

Criterion 1: “Explains indication for insulin therapy” [+5 points]
  • Judge assessment: ✓ MET - “The response explains that insulin is prescribed when oral medications aren’t sufficient”
  • Points awarded: +5

Criterion 2: “Provides safety monitoring instructions” [+7 points]
  • Judge assessment: ✓ MET - “The response provides specific monitoring instructions including blood sugar checking and hypoglycemia symptoms”
  • Points awarded: +7

Criterion 3: “Addresses needle phobia concerns” [+6 points]
  • Judge assessment: ✓ MET - “The response addresses needle concerns by mentioning fine needles and proper injection techniques”
  • Points awarded: +6

Criterion 4: “Recommends consulting healthcare provider for dosing” [+4 points]
  • Judge assessment: ✓ MET - “The response explicitly states to follow the doctor’s dosing instructions”
  • Points awarded: +4

Criterion 5: “Fabricates additional clinical information” [-8 points]
  • Judge assessment: ✗ NOT MET - “The response provides accurate medical information without fabrication”
  • Points awarded: 0 (good - no penalty applied)

Criterion 6: “Gives overly complex technical explanation” [-2 points]
  • Judge assessment: ✗ NOT MET - “The response is appropriately detailed without being overly technical”
  • Points awarded: 0 (good - no penalty applied)

Step 5: Score Calculation

Total Points Earned: 5 + 7 + 6 + 4 + 0 + 0 = 22 points
Total Positive Points Available: 5 + 7 + 6 + 4 = 22 points
Individual Score: 22 / 22 = 1.0 (perfect score)
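
For reference, the same arithmetic in Python (illustrative only, not taken from the implementation):

# Per-sample scoring for the worked example above.
# Each entry is (points, met) from the judge's assessments in Step 4.
judged = [(5, True), (7, True), (6, True), (4, True), (-8, False), (-2, False)]

earned = sum(points for points, met in judged if met)         # 5 + 7 + 6 + 4 = 22
possible = sum(points for points, _ in judged if points > 0)  # only positive criteria count: 22
individual_score = earned / possible                          # 22 / 22 = 1.0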

Step 6: Category Breakdown (Optional)

When include_subset_scores=True, the system calculates performance by category:

By Clinical Axis:

  • axis:accuracy: (5 + 0) / 5 = 1.0
  • axis:safety: (7 + 4) / 11 = 1.0
  • axis:communication: (6 + 0) / 6 = 1.0

By Medical Theme:

  • theme:diabetes: (5 + 7 + 6 + 4) / 22 = 1.0
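
The same breakdown can be reproduced with a few lines of Python (a sketch over the worked example, not the implementation itself):

from collections import defaultdict

# (points, met, tags) for each rubric criterion, taken from Steps 3 and 4.
judged = [
    (5, True, ["axis:accuracy", "theme:diabetes"]),
    (7, True, ["axis:safety", "theme:diabetes"]),
    (6, True, ["axis:communication", "theme:diabetes"]),
    (4, True, ["axis:safety", "theme:diabetes"]),
    (-8, False, ["axis:accuracy"]),
    (-2, False, ["axis:communication"]),
]

earned, possible = defaultdict(int), defaultdict(int)
for points, met, tags in judged:
    for tag in tags:
        if points > 0:
            possible[tag] += points  # only positive criteria define the maximum
        if met:
            earned[tag] += points    # met penalties would subtract here

subset_scores = {tag: earned[tag] / possible[tag] for tag in possible}
# {'axis:accuracy': 1.0, 'axis:safety': 1.0, 'axis:communication': 1.0, 'theme:diabetes': 1.0}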

Step 7: Bootstrap Aggregation

Across all 5,000 samples, individual scores are aggregated using bootstrap resampling to provide robust confidence intervals.

Why Bootstrap?

Simple averaging doesn’t capture score variability. Bootstrap resampling provides:

  • Confidence intervals around the mean
  • Standard error estimates
  • Robust statistics for healthcare evaluation

How Bootstrap Works

  • Original scores: [0.8, 0.6, 0.9, 0.4, 0.7, 0.5, 0.8, 0.6, 0.9, 0.7]
  • Bootstrap sample 1: [0.8, 0.6, 0.8, 0.8, 0.7, 0.6, 0.9, 0.4, 0.6, 0.7] → mean = 0.69
  • Bootstrap sample 2: [0.9, 0.5, 0.6, 0.7, 0.8, 0.9, 0.7, 0.8, 0.4, 0.6] → mean = 0.69
  • …
  • Bootstrap sample 1000: [0.6, 0.8, 0.7, 0.5, 0.9, 0.6, 0.8, 0.7, 0.9, 0.4] → mean = 0.69

Final Statistics

From 1,000 bootstrap means:

  • Bootstrap score: 0.69 (mean of bootstrap means, clipped to [0, 1])
  • Standard error: 0.08 (standard deviation of bootstrap means)
  • Confidence interval: [0.61, 0.77] (95% confidence)

Implementation

Example of calculating the bootstrapped score (not copied from the code):

import numpy as np

def bootstrap_statistics(values, bootstrap_samples=1000):
    # Resample the per-sample scores with replacement and record each mean
    bootstrap_means = []
    for _ in range(bootstrap_samples):
        sample = np.random.choice(values, len(values), replace=True)
        bootstrap_means.append(np.mean(sample))

    # Aggregate: clip the mean of means to [0, 1]; the spread gives the standard error
    mean_score = np.clip(np.mean(bootstrap_means), 0, 1)
    std_error = np.std(bootstrap_means)

    return {
        "score": mean_score,
        "bootstrap_std_error": std_error
    }
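
The 95% confidence interval reported above can be derived from the same bootstrap means with a percentile interval; a minimal sketch (not taken from the implementation):

import numpy as np

scores = [0.8, 0.6, 0.9, 0.4, 0.7, 0.5, 0.8, 0.6, 0.9, 0.7]
means = [np.mean(np.random.choice(scores, len(scores), replace=True)) for _ in range(1000)]
ci_low, ci_high = np.percentile(means, [2.5, 97.5])  # 95% percentile confidence interval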

Individual scores can be negative (penalties), but final aggregated scores are clipped to the [0, 1] range.

Axes and Themes: Performance Breakdown

HealthBench uses tags to categorize rubric criteria across two dimensions for detailed analysis.

Clinical Axes

Each criterion is tagged with clinical skills being evaluated:

  • axis:accuracy - Medical correctness and evidence-based information (33% of criteria)
  • axis:completeness - Thoroughness of clinical assessment (39% of criteria)
  • axis:communication_quality - Clear presentation and appropriate vocabulary (8% of criteria)
  • axis:context_awareness - Responding to situational cues and seeking clarification (16% of criteria)
  • axis:instruction_following - Adhering to user instructions while prioritizing safety (4% of criteria)

Medical Themes

Each criterion is tagged with medical contexts:

  • theme:emergency_referrals - Recognizing emergencies and steering toward appropriate care (482 samples, 9.6%)
  • theme:global_health - Adapting to varied healthcare contexts and resource settings (1,097 samples, 21.9%)
  • theme:context_seeking - Recognizing missing information and seeking clarification (594 samples, 11.9%)
  • theme:health_data_tasks - Clinical documentation and structured health data work (477 samples, 9.5%)
  • theme:expertise_tailored_communication - Matching communication to user knowledge level (919 samples, 18.4%)
  • theme:responding_under_uncertainty - Handling ambiguous symptoms and evolving medical knowledge (1,071 samples, 21.4%)
  • theme:response_depth - Adjusting detail level to match user needs and task complexity (360 samples, 7.2%)

Example Breakdown

{
  "bootstrap_score": 0.651,
  "axis_accuracy_score": 0.723,
  "axis_completeness_score": 0.634,
  "axis_communication_quality_score": 0.678,
  "axis_context_awareness_score": 0.598,
  "axis_instruction_following_score": 0.712,
  "theme_emergency_referrals_score": 0.743,
  "theme_global_health_score": 0.612,
  "theme_context_seeking_score": 0.534
}
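
The tag names above map directly onto the metric keys shown in this breakdown. A sketch of that naming convention, assuming it holds for every axis and theme tag:

def tag_to_metric_name(tag: str) -> str:
    # "axis:accuracy" -> "axis_accuracy_score", "theme:global_health" -> "theme_global_health_score"
    return tag.replace(":", "_") + "_score"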

Available Variants

The evaluation comes in several variants designed for different use cases:

healthbench (Standard)
  • 5,000 samples across all healthcare scenarios
  • Comprehensive evaluation for general model assessment
  • Best for research and development work

healthbench_hard (Challenging)
  • 1,000 carefully selected difficult samples
  • Even top models (e.g., o3) only achieve around 32% on this subset
  • Useful for pushing model capabilities and identifying failure modes

healthbench_consensus (High-Precision)
  • 3,671 samples with physician-validated criteria
  • Focuses on 34 critical behaviors with strong physician agreement
  • Lower noise, higher confidence in results

healthbench_meta_eval (Judge Validation)
  • Evaluates how well your judge model agrees with physician consensus
  • Critical for validating model-based evaluation approaches
  • Returns macro F1 scores across 60,896+ meta-examples

Parameters Reference

Parameter              Default                          Description
subset                 "full"                           Dataset subset: "hard", "consensus", "meta_eval"
local_path             None                             Path to local JSONL file for dataset
judge_model            "openai/gpt-4o-mini"             Model for rubric evaluation
judge_max_tokens       None                             Maximum tokens for judge model responses
judge_temperature      0.0                              Temperature setting for judge model
judge_system_message   "You are a helpful assistant."   System message for judge model
bootstrap_samples      1000                             Bootstrap samples for confidence intervals
include_subset_scores  False                            Enable detailed axis/theme breakdown scores

Meta-Evaluation

The healthbench_meta_eval variant measures how trustworthy the judge model is when grading fixed completions. The judge model’s grades are compared against a physician consensus on whether each criterion is “met” or “not met”, and agreement is reported as a macro F1 score across 60,896+ meta-examples.

Example Meta-Evaluation:

Prompt: [Same diabetes conversation as above]
Completion: "Insulin helps control blood sugar by allowing cells to absorb glucose..."
Rubric: "Explains how insulin works in the body"

Model Judge: ✓ (criterion met)
Physician Consensus: ✓ (3 out of 4 physicians agreed)
Agreement between judge model and physicians: True

In this example, the judge model agrees with the physician consensus and so is marked correct.
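
Agreement across many such meta-examples is summarized as a macro-averaged F1 over the binary met / not-met labels. A minimal sketch using scikit-learn (illustrative only; the implementation may compute this differently):

from sklearn.metrics import f1_score

# 1 = criterion met, 0 = criterion not met
physician_consensus = [1, 0, 1, 1, 0, 1]  # consensus labels from physicians
judge_grades        = [1, 0, 0, 1, 0, 1]  # judge model's grades for the same examples

macro_f1 = f1_score(physician_consensus, judge_grades, average="macro")
print(f"macro F1: {macro_f1:.3f}")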

Implementation Details

Judge Model Evaluation Process

Each conversation is evaluated by having a judge model (default: gpt-4o-mini) assess whether the model’s response meets each criterion. The judge model receives:

  1. The full conversation context
  2. A specific rubric criterion (e.g., “Explains indication for insulin therapy”)
  3. Instructions to return a structured JSON response

This process happens independently for each criterion, which means a single conversation might involve 10-20 separate judge model calls.
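
The exact grading prompt and response schema are defined by the implementation; a hypothetical example of a single-criterion grading result (field names are assumptions for illustration only):

import json

judge_response = """
{
  "criteria_met": true,
  "explanation": "The response explains that insulin is prescribed when oral medications aren't sufficient."
}
"""

result = json.loads(judge_response)
points_awarded = 5 if result["criteria_met"] else 0  # this criterion is worth +5 points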

Detailed Breakdowns

With include_subset_scores=True, you get performance across specific dimensions. For example:

{
  "bootstrap_score": 0.651,
  "std": 0.023,
  "axis_accuracy_score": 0.723,
  "axis_safety_score": 0.589,
  "theme_emergency_referrals_score": 0.678,
  "theme_context_seeking_score": 0.634
}

Research Context

This implementation is based on:

Arora, R. K., Wei, J., Soskin Hicks, R., et al. (2025). HealthBench: Evaluating Large Language Models Towards Improved Human Health. OpenAI. https://arxiv.org/abs/2505.08775

The original research involved 262 physicians across 60 countries and established HealthBench as a comprehensive benchmark for healthcare AI evaluation. Key findings include steady model improvements over time and persistent challenges in areas like context-seeking and emergency recognition.

Evaluation Report

The current implementation returns the following results on the full dataset, using each model’s default temperature and max tokens settings.

Model           Judge         Score   Score from Paper
gpt-3.5-turbo   gpt-4o-mini   0.22    0.16
gpt-4o          gpt-4o-mini   0.32    0.32

Ref: https://arxiv.org/abs/2505.08775 Table 7

Citation

@misc{openai2025healthbench,
  title = {HealthBench: Evaluating Large Language Models Towards Improved Human Health},
  author = {OpenAI},
  url = {https://openai.com/index/healthbench},
  year = {2025}
}