VQA-RAD: Visual Question Answering for Radiology

Multimodal

VQA-RAD is the first manually constructed VQA dataset in radiology, where clinicians asked naturally occurring questions about radiology images and provided reference answers. It contains 315 radiology images (head CTs/MRIs, chest X-rays, abdominal CTs) with question-answer pairs spanning 11 question types including modality, plane, organ system, abnormality, and object/condition presence. Questions are either closed-ended (yes/no) or open-ended free-text answers.

Contributed By

@MattFisher

Code

src/inspect_evals/vqa_rad

Paper

https://doi.org/10.1038/sdata.2018.251

Overview

VQA-RAD is the first manually constructed Visual Question Answering dataset in radiology, introduced by Lau et al. (2018). Clinicians were shown radiology images from MedPix and asked naturally occurring questions they would ask a colleague or radiologist — the kind of questions that arise in routine clinical decision making. The dataset covers 315 images (104 head CTs/MRIs, 107 chest X-rays, 104 abdominal CTs) with 2,244 question-answer pairs spanning 11 question types: modality, plane, organ system, abnormality, object/condition presence, positional reasoning, color, size, attribute (other), counting, and other.

This implementation evaluates models on the test split (451 QA pairs total). Questions are either closed-ended (yes/no answers) or open-ended (free-text answers). The model is prompted to provide its answer in ANSWER: $ANSWER format and scored by exact string match after lowercasing and stripping whitespace. Each sample includes an answer_type metadata field ("closed" or "open") inferred from whether the ground-truth answer is yes/no.

Note that some open-ended answers are multi-word, so a clinically correct response can still fail exact match if its wording differs from the reference.
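The answer_type inference described above can be sketched as follows. This is a minimal illustration, not the code shipped in src/inspect_evals/vqa_rad: the function name infer_answer_type is hypothetical, and normalising the reference answer before the yes/no check is an assumption.

```python
def infer_answer_type(answer: str) -> str:
    """Return "closed" for yes/no reference answers and "open" otherwise."""
    # Assumption: the reference answer is normalised before the yes/no check.
    normalised = answer.strip().lower()
    return "closed" if normalised in {"yes", "no"} else "open"


if __name__ == "__main__":
    for reference in ["yes", "No ", "cardiovascular"]:
        print(repr(reference), "->", infer_answer_type(reference))
```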

Usage

Installation

There are two ways to use Inspect Evals: from PyPI as a dependency of your own project, or as a standalone checked-out GitHub repository.

If you are using it from PyPI, install the package and its dependencies via:

pip install inspect-evals

If you are using Inspect Evals in its repository, start by installing the necessary dependencies with:

uv sync

Running evaluations

Now you can start evaluating models. For simplicity’s sake, this section assumes you are using Inspect Evals from the standalone repo. If that’s not the case and you are not using uv to manage dependencies in your own project, you can use the same commands with uv run dropped.

uv run inspect eval inspect_evals/vqa_rad --model openai/gpt-5-nano

You can also import tasks as normal Python objects and run them from Python:

from inspect_ai import eval
from inspect_evals.vqa_rad import vqa_rad
eval(vqa_rad)

After running evaluations, you can view their logs using the inspect view command:

uv run inspect view

For VS Code, you can also install the Inspect AI extension for viewing logs.

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>

Options

You can control a variety of options from the command line. For example:

uv run inspect eval inspect_evals/vqa_rad --limit 10
uv run inspect eval inspect_evals/vqa_rad --max-connections 10
uv run inspect eval inspect_evals/vqa_rad --temperature 0.5

See uv run inspect eval --help for all available options.

Parameters

`vqa_rad`

scorer (Literal['exact', 'graded']): Scorer to use. “exact” uses exact string matching, faithful to the paper’s scoring metric. “graded” uses a tool-calling grader model for open-ended questions (with exact match retained for yes/no questions). (default: 'exact')
shuffle (bool): Whether to shuffle the dataset before evaluation. (default: False)

Dataset

The dataset is loaded from flaviagiammarino/vqa-rad on Hugging Face. The original dataset and images are available from the OSF project page (VQA_RAD Dataset Public.json). The images are sourced from MedPix, an open-access database. Each record has:

image: a radiology image (JPEG)
question: a clinician-generated question about the image
answer: the reference answer

The original OSF dataset includes per-sample metadata (question type, image organ, question ID) that is not present in the HuggingFace version. The bundled lookup files vqa_rad_question_type.json and vqa_rad_md5_to_organ.json are derived from that original data. Because the same question text appears for multiple different images and the HuggingFace dataset does not include question IDs, image organ is mapped via the MD5 hash of the image bytes.
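The MD5-based organ lookup described above might look like the sketch below. It assumes the lookup is a flat mapping from hex digest to organ label; the real layout of vqa_rad_md5_to_organ.json is not shown here, and the function name organ_for_image and the "HEAD" label are hypothetical.

```python
import hashlib
from typing import Mapping, Optional


def organ_for_image(image_bytes: bytes, md5_to_organ: Mapping[str, str]) -> Optional[str]:
    """Look up the image organ by the MD5 hex digest of the raw image bytes."""
    digest = hashlib.md5(image_bytes).hexdigest()
    # Returns None when the image is not present in the lookup table.
    return md5_to_organ.get(digest)


if __name__ == "__main__":
    # Toy stand-ins: placeholder bytes instead of a real JPEG, and a
    # one-entry lookup with a hypothetical organ label.
    fake_image = b"placeholder image bytes"
    lookup = {hashlib.md5(fake_image).hexdigest(): "HEAD"}
    print(organ_for_image(fake_image, lookup))
    print(organ_for_image(b"some other image", lookup))
```

Hashing the bytes ties the metadata to the image itself, which avoids the ambiguity of matching on question text that repeats across images.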

Example from the dataset:

Field	Value
question	“are regions of the brain infarcted?”
answer	“yes”
answer_type (inferred)	closed

Field	Value
question	“which organ system is abnormal in this image?”
answer	“cardiovascular”
answer_type (inferred)	open

Scoring

The model is prompted to answer in the format ANSWER: $ANSWER. Two scorers are available via -T scorer=<choice>:

scorer=exact (default, paper-faithful): The predicted answer is extracted with a regex pattern and compared to the ground-truth answer by exact string match after lowercasing and stripping whitespace. Some clinician-authored multi-part descriptions will not be reproduced verbatim: 20 samples have answers of more than 4 words, which LLMs are highly unlikely to reproduce exactly.
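A minimal sketch of this extract-and-compare step is shown below. The regex, the function names, and the choice to use the last ANSWER: line when several appear are all assumptions made for illustration; the pattern used by the actual scorer may differ.

```python
import re
from typing import Optional

# Assumed pattern: "ANSWER:" followed by the rest of that line.
ANSWER_PATTERN = re.compile(r"ANSWER\s*:\s*(.+)", re.IGNORECASE)


def extract_answer(completion: str) -> Optional[str]:
    """Return the text after the last 'ANSWER:' marker, or None if absent."""
    matches = ANSWER_PATTERN.findall(completion)
    return matches[-1].strip() if matches else None


def exact_match(completion: str, target: str) -> bool:
    """Compare the extracted answer to the target after lowercasing and stripping."""
    predicted = extract_answer(completion)
    if predicted is None:
        return False
    return predicted.strip().lower() == target.strip().lower()


if __name__ == "__main__":
    print(exact_match("The film shows an infarct.\nANSWER: Yes", "yes"))
    print(exact_match("ANSWER: radiograph", "x-ray"))
```

The second call illustrates the limitation: a synonym of the reference answer is scored as incorrect under exact match.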

scorer=graded: Uses a tool-calling grader model for open-ended questions, which handles terminology variation common in radiology (e.g. “x-ray” vs “radiograph”, “CT” vs “computed tomography”). Yes/no (closed-ended) questions are still scored by exact match. To specify a grader model:

uv run inspect eval inspect_evals/vqa_rad -T scorer=graded \
    --model-role 'grader={"model": "openai/gpt-5-mini-2025-08-07"}'
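The split between exact match for closed-ended questions and model grading for open-ended ones can be pictured as a simple dispatch. This is only a sketch: score_graded and toy_grader are hypothetical names, and the small synonym table is a stand-in for the tool-calling grader model, which is not reproduced here.

```python
from typing import Callable, List, Set


def _norm(text: str) -> str:
    return text.strip().lower()


def score_graded(predicted: str, target: str,
                 grade_open: Callable[[str, str], bool]) -> bool:
    """Exact match for yes/no targets; defer to a grader for everything else."""
    if _norm(target) in {"yes", "no"}:
        return _norm(predicted) == _norm(target)
    return grade_open(predicted, target)


# Stand-in for the grader model: a tiny synonym table (illustration only).
SYNONYMS: List[Set[str]] = [{"x-ray", "radiograph"}, {"ct", "computed tomography"}]


def toy_grader(predicted: str, target: str) -> bool:
    p, t = _norm(predicted), _norm(target)
    return p == t or any(p in group and t in group for group in SYNONYMS)


if __name__ == "__main__":
    print(score_graded("Radiograph", "x-ray", toy_grader))
    print(score_graded("no", "yes", toy_grader))
```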

Metrics reported:

accuracy: proportion of correct answers
accuracy_closed: proportion of correct answers for closed-ended questions
accuracy_open: proportion of correct answers for open-ended questions
stderr: standard error of the accuracy estimate

The answer_type field in sample metadata ("closed" for yes/no questions, "open" for all others) can be used to analyse performance separately on closed-ended vs open-ended questions.
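As an illustration of how the reported metrics relate to per-sample results, the sketch below computes them from (answer_type, correct) pairs. The function name summarise is hypothetical, and the standard error formula (sample standard deviation divided by the square root of n) is an assumption; the framework's own stderr computation may differ in detail.

```python
import math
from typing import Dict, List, Tuple


def summarise(results: List[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute overall and per-answer_type accuracy plus a standard error."""

    def accuracy(flags: List[bool]) -> float:
        return sum(flags) / len(flags) if flags else float("nan")

    all_flags = [correct for _, correct in results]
    closed = [correct for kind, correct in results if kind == "closed"]
    open_ended = [correct for kind, correct in results if kind == "open"]

    n = len(all_flags)
    mean = accuracy(all_flags)
    if n > 1:
        # Assumed formula: sample variance (n - 1 denominator) over n.
        variance = sum((float(flag) - mean) ** 2 for flag in all_flags) / (n - 1)
        stderr = math.sqrt(variance / n)
    else:
        stderr = 0.0

    return {
        "accuracy": mean,
        "accuracy_closed": accuracy(closed),
        "accuracy_open": accuracy(open_ended),
        "stderr": stderr,
    }


if __name__ == "__main__":
    toy = [("closed", True), ("closed", True), ("closed", False),
           ("open", True), ("open", False), ("open", False)]
    print(summarise(toy))
```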

The paper’s baselines (MCB and SAN models trained on VQA-RAD) achieved:

Closed-ended: ~61% simple accuracy (MCB_RAD)
Open-ended: ~25% simple accuracy (MCB_RAD)

Evaluation Report

Implementation Details

The implementation follows the paper’s simple accuracy metric for scorer=exact: extract the model’s answer from the ANSWER: $ANSWER format using a regex pattern and compare case-insensitively to the reference string.

A second scorer=graded mode is also provided that uses a tool-calling grader model for open-ended questions, handling radiology terminology variation (e.g. “x-ray” / “radiograph”, “CT” / “computed tomography”) that exact match cannot capture. Closed-ended (yes/no) questions use exact match in both modes.

The original paper describes no system prompt — the baseline models (MCB, SAN) were CNN/LSTM networks that received raw image features directly. The radiologist persona prompt used here is our addition to adapt the task for instruction-tuned VLMs.

Results

Full test-split results for openai/gpt-5.2-2025-12-11 (April 2026):

Scorer	Samples	Overall	Closed-ended	Open-ended
`exact` (paper-faithful)	451	0.521	0.781	0.195
`graded`	451	0.683	0.773	0.570

Comparison with the paper’s baseline (MCB_RAD, Lau et al. 2018):

Model / metric	Closed-ended	Open-ended
MCB_RAD simple accuracy	~0.61	~0.25
gpt-5.2 `scorer=exact`	0.781	0.195

Notes:

Closed-ended accuracy (78.1%) substantially exceeds the MCB_RAD baseline (~61%), consistent with the large capability gap between modern VLMs and 2018-era CNN/LSTM models.
Open-ended exact-match accuracy (19.5%) falls below the MCB_RAD baseline (~25%). This most likely reflects the strictness of exact matching rather than a genuine gap in model capability: with the graded scorer, open-ended accuracy is 57.0%, far above the baseline, although the two figures use different scoring methods and are not directly comparable.
The large gap between exact and graded open-ended scores (19.5% vs 57.0%) illustrates why exact match is a poor metric for open-ended radiology questions, where synonym variation is common (e.g. “x-ray” vs “plain film”, “cardiac enlargement” vs “cardiomegaly”).
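The overall accuracy in the results table is a sample-weighted average of the closed-ended and open-ended accuracies, so the reported figures can be checked for internal consistency. The sketch below solves for the number of closed-ended samples implied by each row; the function name is hypothetical, and the result is an inference from rounded figures, not a count reported in this document.

```python
def implied_closed_count(overall: float, closed: float, open_: float, total: int) -> float:
    """Solve overall * total = closed * n_closed + open_ * (total - n_closed)."""
    return (overall - open_) * total / (closed - open_)


if __name__ == "__main__":
    rows = [
        ("exact", 0.521, 0.781, 0.195),
        ("graded", 0.683, 0.773, 0.570),
    ]
    for name, overall, closed, open_ in rows:
        n_closed = implied_closed_count(overall, closed, open_, total=451)
        print(f"{name}: about {n_closed:.1f} closed-ended samples implied")
```

Both rows imply roughly the same closed/open split, which suggests the overall figures are consistent with the per-type figures.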

Reproducibility Information

Date: April 2026
Eval version: 1 (comparability version), 1-A (full version)
Dataset: flaviagiammarino/vqa-rad @ bcf91e7654fb9d51c8ab6a5b82cacf3fafd2fae9, test split
Commands:

# Paper-faithful exact scorer
uv run inspect eval inspect_evals/vqa_rad --model openai/gpt-5.2-2025-12-11

# Graded scorer
uv run inspect eval inspect_evals/vqa_rad --model openai/gpt-5.2-2025-12-11 -T scorer=graded --model-role='{grader: "openai/gpt-5-mini-2025-08-07"}'