PubMedQA: A Dataset for Biomedical Research Question Answering

Overview

PubMedQA is a biomedical question answering (QA) dataset collected from PubMed abstracts.

Usage

First, install the dependencies:

uv sync

Then, evaluate against one or more models with:

uv run inspect eval inspect_evals/pubmedqa --model openai/gpt-5-nano

You can also import tasks as normal Python objects and run them from Python:

from inspect_ai import eval
from inspect_evals.pubmedqa import pubmedqa
eval(pubmedqa)

After running evaluations, you can view their logs using the inspect view command:

uv run inspect view

For VS Code, you can also install the Inspect AI extension for viewing logs.

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>

Options

You can control a variety of options from the command line. For example:

uv run inspect eval inspect_evals/pubmedqa --limit 10
uv run inspect eval inspect_evals/pubmedqa --max-connections 10
uv run inspect eval inspect_evals/pubmedqa --temperature 0.5

See uv run inspect eval --help for all available options.

Paper

Title: PubMedQA: A Dataset for Biomedical Research Question Answering

Authors: Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W. Cohen, Xinghua Lu

Abstract: https://arxiv.org/abs/1909.06146

Homepage: https://pubmedqa.github.io/

PubMedQA is a biomedical question answering (QA) dataset collected from PubMed abstracts. The task of PubMedQA is to answer research questions with yes/no/maybe (e.g.: Do preoperative statins reduce atrial fibrillation after coronary artery bypass grafting?) using the corresponding abstracts. PubMedQA has 1k expert-annotated, 61.2k unlabeled and 211.3k artificially generated QA instances.

Each PubMedQA instance is composed of (a rough sketch of a single record follows this list)

  1. a question which is either an existing research article title or derived from one,
  2. a context which is the corresponding abstract without its conclusion,
  3. a long answer, which is the conclusion of the abstract and, presumably, answers the research question, and
  4. a yes/no/maybe answer which summarizes the conclusion.
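
For illustration, a single labeled instance looks roughly like the dictionary below. The field names follow the HuggingFace qiaojin/PubMedQA release and the values are placeholders, so treat the exact schema as a sketch rather than a specification:

# Illustrative shape of one labeled PubMedQA record (placeholder values).
record = {
    "pubid": 12345678,                        # PubMed ID (dummy value)
    "question": "…",                          # research question derived from the article title
    "context": {
        "contexts": ["…", "…"],               # abstract sections, with the conclusion removed
        "labels": ["BACKGROUND", "RESULTS"],  # section label for each context
    },
    "long_answer": "…",                       # the abstract's conclusion
    "final_decision": "yes",                  # the yes/no/maybe label
}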

The PubMedQA dataset consists of three subsets (a loading sketch follows this list):

  1. PubMedQA Labeled (PQA-L): 1k manually annotated yes/no/maybe QA instances collected from PubMed articles.
  2. PubMedQA Artificial (PQA-A): 211.3k PubMed articles with questions automatically generated from their statement titles and yes/no answer labels generated by a simple heuristic.
  3. PubMedQA Unlabeled (PQA-U): 61.2k unlabeled context-question pairs collected from PubMed articles.
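
The original HuggingFace release exposes these subsets as separate configurations. A minimal loading sketch is shown below; the repository and configuration names ("qiaojin/PubMedQA", "pqa_labeled", "pqa_artificial", "pqa_unlabeled") are assumptions based on that release rather than something this benchmark requires:

# Sketch: loading the three original PubMedQA subsets from HuggingFace.
from datasets import load_dataset

pqa_l = load_dataset("qiaojin/PubMedQA", "pqa_labeled", split="train")     # 1k expert-annotated samples
pqa_a = load_dataset("qiaojin/PubMedQA", "pqa_artificial", split="train")  # ~211.3k generated samples
pqa_u = load_dataset("qiaojin/PubMedQA", "pqa_unlabeled", split="train")   # ~61.2k unlabeled samples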

Implementation

The original dataset is available on HuggingFace. It contains 1,000 labeled samples, but only 500 are used to evaluate the model; the remaining samples can be used to train models. The original GitHub page provides the IDs of the samples in the test split, and our preprocessing uses that list to select the test subset. A lightly reshaped version of the dataset was released by BigBio. This implementation previously used the BigBio release, because the original release incorrectly exposes all 1,000 labeled samples as the test set when it should contain only 500. However, as of version 4.0.0 of the datasets package, dataset loading scripts are no longer supported, which breaks the BigBio release and prevents its use. We have therefore gone back to using the original dataset. See this GitHub issue.
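
A minimal sketch of that preprocessing step is shown below; the local file name holding the test IDs is a placeholder, and the exact field names are assumptions rather than the benchmark's actual code:

# Sketch: restrict the 1k labeled samples to the 500 official test IDs.
import json
from datasets import load_dataset

# Assumed: a local JSON file listing the official test PubMed IDs,
# copied from the original PubMedQA GitHub repository.
with open("test_ids.json") as f:
    test_ids = {int(pmid) for pmid in json.load(f)}

labeled = load_dataset("qiaojin/PubMedQA", "pqa_labeled", split="train")
test_set = labeled.filter(lambda record: record["pubid"] in test_ids)
print(len(test_set))  # expected: 500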

The BigBio release provides two versions of the dataset, “source” and “bigbio_qa”. The “source” subsets contain the original qiaojin version but do not include sample IDs. The “bigbio_qa” subsets contain the reshaped version of the same data with sample IDs included; the “bigbio_qa” version of the labeled test set was used in earlier versions of this implementation. The following subsets are available, with their respective splits (a loading sketch for one fold follows the list):

  • 'pubmed_qa_labeled_fold{i}_[source|bigbio_qa]' for i in {0..9}: Each of these 10 cross-validation folds takes the 1000 samples from PubMedQA Labeled (PQA-L) and splits them into the same 500 samples for the test set, and different sets of 450 and 50 samples from the remaining 500 for the training and validation sets, respectively.
    • ‘test’: 500 samples, common across all folds. This is the test set used for the benchmark.
    • ‘train’: 450 samples
    • ‘validation’: 50 samples
  • 'pubmed_qa_artificial_[source|bigbio_qa]': PubMedQA Artificial (PQA-A)
    • ‘train’: 200,000 samples
    • ‘validation’: 11,269 samples
  • 'pubmed_qa_unlabeled_[source|bigbio_qa]': PubMedQA Unlabeled (PQA-U)
    • ‘train’: 61,249 samples
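
For reference, with an older datasets version (before 4.0.0, while loading scripts were still supported), one of these folds could be loaded roughly as follows; the "bigbio/pubmed_qa" repository name is an assumption:

# Sketch: loading fold 0 of the labeled subset from the BigBio release.
# Requires datasets < 4.0.0, since the BigBio release relies on a loading script.
from datasets import load_dataset

fold0 = load_dataset(
    "bigbio/pubmed_qa",
    name="pubmed_qa_labeled_fold0_bigbio_qa",
    trust_remote_code=True,  # needed for script-based datasets
)
test_split = fold0["test"]   # 500 samples, common across all folds
print(test_split[0]["id"])   # bigbio_qa records include sample ids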

Dataset

Here is an example from the dataset (cropped for brevity):

Context: Gallbladder carcinoma is characterized by delayed diagnosis, ineffective treatment and poor prognosis. …

Question: Is external palliative radiotherapy for gallbladder carcinoma effective?

  1. yes
  2. no
  3. maybe

Prompt

There isn’t an official prompt template for this dataset, as the authors intended users to engineer their own prompts for their specific model.

Here is an example of our prompt:

Answer the following multiple choice question about medical knowledge given the context. The entire content of your response should be of the following format: ‘ANSWER: $LETTER’ (without quotes) where LETTER is one of A,B,C.

Context: Apparent life-threatening events in infants are a difficult and frequent problem in pediatric practice. The prognosis is uncertain because of risk of sudden infant death syndrome.Eight infants aged 2 to 15 months were admitted during a period of 6 years; they suffered from similar maladies in the bath: on immersion, they became pale, hypotonic, still and unreactive; recovery took a few seconds after withdrawal from the bath and stimulation. Two diagnoses were initially considered: seizure or gastroesophageal reflux but this was doubtful. The hypothesis of an equivalent of aquagenic urticaria was then considered; as for patients with this disease, each infant’s family contained members suffering from dermographism, maladies or eruption after exposure to water or sun. All six infants had dermographism. We found an increase in blood histamine levels after a trial bath in the two infants tested. The evolution of these “aquagenic maladies” was favourable after a few weeks without baths. After a 2-7 year follow-up, three out of seven infants continue to suffer from troubles associated with sun or water.

Question: Syncope during bathing in infants, a pediatric form of water-induced urticaria?

  1. yes

  2. no

  3. maybe

Note: If you want to alter the prompt format, you can do so in the record_to_sample function and the TEMPLATE object in the eval code.
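
As a rough illustration of that mapping (not the benchmark’s actual code; the field names and the answer-to-letter mapping are assumptions), a record_to_sample function might look like this:

# Sketch: converting a raw PubMedQA record into an Inspect Sample.
from inspect_ai.dataset import Sample

ANSWER_TO_LETTER = {"yes": "A", "no": "B", "maybe": "C"}

def record_to_sample(record: dict) -> Sample:
    return Sample(
        input=record["question"],
        choices=["yes", "no", "maybe"],
        target=ANSWER_TO_LETTER[record["final_decision"]],
        id=record["pubid"],
        metadata={"context": " ".join(record["context"]["contexts"])},
    )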

Scoring

A simple accuracy is calculated over the datapoints.

Dataset (samples)     Model                                  Accuracy   StdErr   Date
pubmedqa (500)        anthropic/claude-3-5-haiku-20241022    0.748      0.019    September 2025
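
As a sanity check on the row above (assuming the reported StdErr is the standard error of the mean of 500 binary correct/incorrect scores, which is an assumption about how it was computed):

# Sketch: the reported StdErr is consistent with the standard error of
# a mean of 500 binary (correct/incorrect) scores.
import math

accuracy = 0.748
n = 500
stderr = math.sqrt(accuracy * (1 - accuracy) / n)
print(round(stderr, 3))  # 0.019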