# PubMedQA: A Dataset for Biomedical Research Question Answering
Novel biomedical question answering (QA) dataset collected from PubMed abstracts.
## Overview
PubMedQA is a biomedical question answering (QA) dataset collected from PubMed abstracts.
## Usage
First, install the `inspect_ai` and `inspect_evals` Python packages with:
```bash
pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals
```
Then, evaluate against one or more models with:
```bash
inspect eval inspect_evals/pubmedqa --model openai/gpt-4o
```
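The same evaluation can also be run from Python via `inspect_ai`'s `eval()` function; a minimal sketch:

```python
# A minimal sketch: run the same evaluation from Python rather than the CLI.
from inspect_ai import eval

logs = eval("inspect_evals/pubmedqa", model="openai/gpt-4o")
```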
After running evaluations, you can view their logs using the `inspect view` command:

```bash
inspect view
```
If you don't want to specify the `--model` each time you run an evaluation, create a `.env` configuration file in your working directory that defines the `INSPECT_EVAL_MODEL` environment variable along with your API key. For example:
```bash
INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>
```
## Options
You can control a variety of options from the command line. For example:
```bash
inspect eval inspect_evals/pubmedqa --limit 10
inspect eval inspect_evals/pubmedqa --max-connections 10
inspect eval inspect_evals/pubmedqa --temperature 0.5
```
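These flags correspond to keyword arguments of the Python `eval()` function, so the equivalent run can be configured in code (a sketch; argument names assumed to mirror the CLI flags):

```python
from inspect_ai import eval

# Equivalent of the CLI flags above (names assumed to mirror the flags).
eval(
    "inspect_evals/pubmedqa",
    limit=10,            # --limit: evaluate only the first 10 samples
    max_connections=10,  # --max-connections: cap concurrent model requests
    temperature=0.5,     # --temperature: generation temperature
)
```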
See `inspect eval --help` for all available options.
## Paper
- Title: PubMedQA: A Dataset for Biomedical Research Question Answering
- Authors: Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W. Cohen, Xinghua Lu
- Abstract: https://arxiv.org/abs/1909.06146
- Homepage: https://pubmedqa.github.io/
PubMedQA is a biomedical question answering (QA) dataset collected from PubMed abstracts. The task of PubMedQA is to answer research questions with yes/no/maybe (e.g., "Do preoperative statins reduce atrial fibrillation after coronary artery bypass grafting?") using the corresponding abstracts. PubMedQA has 1k expert-annotated, 61.2k unlabeled, and 211.3k artificially generated QA instances.
Each PubMedQA instance is composed of:
- a question which is either an existing research article title or derived from one,
- a context which is the corresponding abstract without its conclusion,
- a long answer, which is the conclusion of the abstract and, presumably, answers the research question, and
- a yes/no/maybe answer which summarizes the conclusion.
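For illustration, a single labeled instance can be sketched as the following record (the question is the example from the paper; field names are illustrative and vary between releases):

```python
# Schematic of one labeled PubMedQA instance (field names illustrative).
instance = {
    "question": "Do preoperative statins reduce atrial fibrillation after "
                "coronary artery bypass grafting?",
    "context": "...",      # the abstract without its conclusion
    "long_answer": "...",  # the abstract's conclusion
    "answer": "yes",       # the yes/no/maybe summary of the conclusion
}
```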
The PubMedQA dataset consists of 3 subsets:
- PubMedQA Labeled (PQA-L): 1k manually annotated yes/no/maybe QA instances collected from PubMed articles.
- PubMedQA Artificial (PQA-A): 211.3k PubMed articles with questions automatically generated from the statement titles and yes/no answer labels generated using a simple heuristic.
- PubMedQA Unlabeled (PQA-U): 61.2k context-question pairs collected from PubMed articles, without answer labels.
## Implementation
The original dataset is available on HuggingFace. A lightly reshaped version of the dataset was released by BigBio. This implementation uses the BigBio release, as the original release incorrectly uses all 1000 labeled samples as the test set, when in fact only 500 of them should be. See this GitHub issue.
The BigBio release provides two versions of the dataset: “source” and “bigbio_qa”. The “source” subsets contain the original qiaojin version but do not include sample ids; the “bigbio_qa” subsets contain the reshaped version of the same data with sample ids included. This implementation uses the “bigbio_qa” version of the labeled test set. The following subsets are available, with their respective splits:
- `pubmed_qa_labeled_fold{i}_[source|bigbio_qa]` for `i` in 0..9: Each of these 10 cross-validation folds takes the 1000 samples from PubMedQA Labeled (PQA-L) and splits them into the same 500 samples for the test set, and different sets of 450 and 50 samples from the remaining 500 for the training and validation sets, respectively.
  - `test`: 500 samples, common across all folds. This is the test set used for the benchmark.
  - `train`: 450 samples
  - `validation`: 50 samples
- `pubmed_qa_artificial_[source|bigbio_qa]`: PubMedQA Artificial (PQA-A)
  - `train`: 200,000 samples
  - `validation`: 11,269 samples
- `pubmed_qa_unlabeled_[source|bigbio_qa]`: PubMedQA Unlabeled (PQA-U)
  - `train`: 61,249 samples
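For reference, the labeled test split can be loaded directly with the HuggingFace `datasets` library (a sketch, assuming the BigBio release is published under the `bigbio/pubmed_qa` dataset id):

```python
from datasets import load_dataset

# Load fold 0 of the BigBio release; the 500-sample test split is
# identical across all 10 folds.
ds = load_dataset(
    "bigbio/pubmed_qa",
    name="pubmed_qa_labeled_fold0_bigbio_qa",
    split="test",
)
print(len(ds))  # expected: 500
```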
## Dataset
Here is an example from the dataset (cropped for brevity):
Context: Gallbladder carcinoma is characterized by delayed diagnosis, ineffective treatment and poor prognosis. …

Question: Is external palliative radiotherapy for gallbladder carcinoma effective?

- yes
- no
- maybe
## Scoring
A simple accuracy is calculated over the datapoints.
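In other words, the score is the fraction of exact matches between the model's yes/no/maybe choice and the gold label; a minimal sketch:

```python
def accuracy(predictions: list[str], labels: list[str]) -> float:
    """Fraction of yes/no/maybe predictions matching the gold labels."""
    assert len(predictions) == len(labels) and labels
    return sum(p == g for p, g in zip(predictions, labels)) / len(labels)

# Example: two of three predictions match -> 0.666...
print(accuracy(["yes", "no", "maybe"], ["yes", "no", "yes"]))
```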