VQA-RAD: Visual Question Answering for Radiology

Multimodal

inspect-evals

VQA-RAD is the first manually constructed VQA dataset in radiology, where clinicians asked naturally occurring questions about radiology images and provided reference answers. It contains 315 radiology images (head CTs/MRIs, chest X-rays, abdominal CTs) with question-answer pairs spanning 11 question types including modality, plane, organ system, abnormality, and object/condition presence. Questions are either closed-ended (yes/no) or open-ended free-text answers.

Contributed By

@MattFisher

Code

src/inspect_evals/vqa_rad

Paper

https://doi.org/10.1038/sdata.2018.251

Overview

VQA-RAD is the first manually constructed Visual Question Answering dataset in radiology, introduced by Lau et al. (2018). Clinicians were shown radiology images from MedPix and asked naturally occurring questions they would ask a colleague or radiologist — the kind of questions that arise in routine clinical decision making. The dataset covers 315 images (104 head CTs/MRIs, 107 chest X-rays, 104 abdominal CTs) with 2,244 question-answer pairs spanning 11 question types: modality, plane, organ system, abnormality, object/condition presence, positional reasoning, color, size, attribute (other), counting, and other.

This implementation evaluates models on the test split (451 QA pairs total). Questions are either closed-ended (yes/no answers) or open-ended (free-text answers). The model is prompted to provide its answer in ANSWER: $ANSWER format. Closed-ended questions are scored by exact string match; open-ended questions are scored by a tool-calling grader model that evaluates semantic equivalence, similar to the manual human judgment employed in the original paper. Each sample includes an answer_type metadata field ("closed" or "open") inferred from whether the ground-truth answer is yes/no.

Usage

Installation

There are two ways of using Inspect Evals, from pypi as a dependency of your own project and as a standalone checked out GitHub repository.

If you are using it from pypi, install the package and its dependencies via:

pip install inspect-evals

If you are using Inspect Evals in its repository, start by installing the necessary dependencies with:

uv sync

Running evaluations

Now you can start evaluating models. For simplicity’s sake, this section assumes you are using Inspect Evals from the standalone repo. If that’s not the case and you are not using uv to manage dependencies in your own project, you can use the same commands with uv run dropped.

uv run inspect eval inspect_evals/vqa_rad --model openai/gpt-5-nano

You can also import tasks as normal Python objects and run them from python:

from inspect_ai import eval
from inspect_evals.vqa_rad import vqa_rad
eval(vqa_rad)

After running evaluations, you can view their logs using the inspect view command:

uv run inspect view

For VS Code, you can also download Inspect AI extension for viewing logs.

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>

Options

You can control a variety of options from the command line. For example:

uv run inspect eval inspect_evals/vqa_rad --limit 10
uv run inspect eval inspect_evals/vqa_rad --max-connections 10
uv run inspect eval inspect_evals/vqa_rad --temperature 0.5

See uv run inspect eval --help for all available options.

Parameters

`vqa_rad`

shuffle (bool): Whether to shuffle the dataset before evaluation. (default: False)

Dataset

The dataset is loaded from flaviagiammarino/vqa-rad on Hugging Face. The original dataset and images are available from the OSF project page (VQA_RAD Dataset Public.json). The images are sourced from MedPix, an open-access database. Each record has:

image: a radiology image (JPEG)
question: a clinician-generated question about the image
answer: the reference answer

The original OSF dataset includes per-sample metadata (question type, image organ, question ID) that is not present in the HuggingFace version. The bundled lookup files vqa_rad_question_type.json and vqa_rad_md5_to_organ.json are derived from that original data. Because the same question text appears for multiple different images and the HuggingFace dataset does not include question IDs, image organ is mapped via the MD5 hash of the image bytes.

Example from the dataset:

Field	Value
question	“are regions of the brain infarcted?”
answer	“yes”
answer_type (inferred)	closed

Field	Value
question	“which organ system is abnormal in this image?”
answer	“cardiovascular”
answer_type (inferred)	open

Scoring

The model is prompted to answer in the format ANSWER: $ANSWER. The scorer uses a hybrid approach:

Closed-ended (yes/no) questions: Exact string match after lowercasing.
Open-ended questions: A tool-calling grader model evaluates semantic equivalence, handling terminology variation common in radiology (e.g. “x-ray” vs “radiograph”, “CT” vs “computed tomography”). This is similar to the manual human judgment employed in the original paper.

To specify a grader model:

uv run inspect eval inspect_evals/vqa_rad \
    --model-role 'grader={"model": "openai/gpt-5-mini-2025-08-07"}'

Metrics reported:

accuracy: proportion of correct answers
accuracy_closed: proportion of correct answers for closed-ended questions
accuracy_open: proportion of correct answers for open-ended questions
stderr: standard error of the accuracy estimate

The answer_type field in sample metadata ("closed" for yes/no questions, "open" for all others) can be used to analyse performance separately on closed-ended vs open-ended questions.

The paper employs multiple scoring methods (simple accuracy, mean accuracy, and BLEU) to demonstrate the limitations of current metrics for medical VQA. The paper’s baselines in Tables 2–3 are evaluated on the 300 free-form test questions (not the full 451 which includes 151 paraphrased questions) using manual human judgment: in addition to exact match, full credit is given for semantically equivalent answers, and half credit for partial answers. The best-performing model (MCB_RAD) achieved:

Closed-ended: 60.6% simple accuracy (Table 2)
Open-ended: 25.4% simple accuracy (Table 3)

Evaluation Report

Implementation Details

The scorer extracts the model’s answer from the ANSWER: $ANSWER format using a regex pattern. For closed-ended questions, it compares case-insensitively to the reference string. For open-ended questions, a tool-calling grader model evaluates semantic equivalence, handling radiology terminology variation (e.g. “x-ray” / “radiograph”, “CT” / “computed tomography”). This approach is similar to the manual human judgment with semantic equivalence and partial credit employed in the original paper.

The original paper describes no system prompt — the baseline models (MCB, SAN) were CNN/LSTM networks that received raw image features directly. The radiologist persona prompt used here is our addition to adapt the task for instruction-tuned VLMs.

Results

Full test-split results (April 2026):

Model	Samples	Overall	Closed-ended	Open-ended
`openai/gpt-5.2-2025-12-11`	451	0.683	0.773	0.570
`anthropic/claude-opus-4-6`	451	0.670	0.781	0.530

Paper baselines (MCB_RAD, Lau et al. 2018, Tables 2–3, 300 free-form questions, manual judgment):

Metric	Closed-ended	Open-ended
MCB_RAD simple accuracy	0.606	0.254

Notes:

The paper’s baselines are not directly comparable to our results: the paper used manual human judgment (with semantic equivalence and partial credit) on 300 free-form questions, while our scorer runs on all 451 test questions (300 free-form + 151 paraphrased). Our graded approach is the closest analogue to the paper’s manual evaluation.
Closed-ended accuracy (78.1%) substantially exceeds the MCB_RAD baseline (60.6%), consistent with the large capability gap between modern VLMs and 2018-era CNN/LSTM models.
Open-ended accuracy (53–57%) far exceeds the MCB_RAD baseline (25.4%), demonstrating the capability of modern VLMs with semantic scoring.

Reproducibility Information

Date: April 2026
Eval version: 2 (comparability version), 2-B (full version)
Dataset: flaviagiammarino/vqa-rad @ bcf91e7654fb9d51c8ab6a5b82cacf3fafd2fae9, test split
Commands:

# gpt-5.2
uv run inspect eval inspect_evals/vqa_rad --model openai/gpt-5.2-2025-12-11 --model-role='{grader: "openai/gpt-5-mini-2025-08-07"}'

# Claude Opus 4
uv run inspect eval inspect_evals/vqa_rad --model anthropic/claude-opus-4-6 --model-role='{grader: "openai/gpt-5-mini-2025-08-07"}'

Changelog

[2-B] - 2026-04-17

Simplified scoring to use a single graded scorer that handles both yes/no (exact match) and open-ended (model-graded) questions. Removed the scorer parameter.