MaCBench: Probing the limitations of multimodal language models for chemistry and materials research

Knowledge
inspect-evals

Overview

MaCBench is a comprehensive benchmark for evaluating how vision-language models handle real-world chemistry and materials science tasks across three core aspects: data extraction, experimental understanding, and results interpretation. The dataset comprises over 1100 high-quality multimodal questions manually curated by chemistry and materials experts across 34 subsets.

Usage

Installation

There are two ways of using Inspect Evals: from PyPI, as a dependency of your own project, or from a standalone checkout of the GitHub repository.

If you are using it from PyPI, install the package and its dependencies via:

pip install inspect-evals

If you are using Inspect Evals in its repository, start by installing the necessary dependencies with:

uv sync

Running evaluations

Now you can start evaluating models. For simplicity's sake, this section assumes you are using Inspect Evals from the standalone repository. If that's not the case and you are not using uv to manage dependencies in your own project, run the same commands with the uv run prefix dropped.

uv run inspect eval inspect_evals/macbench --model openai/gpt-5-nano

You can also import tasks as normal Python objects and run them from Python:

from inspect_ai import eval
from inspect_evals.macbench import macbench
eval(macbench)

After running evaluations, you can view their logs using the inspect view command:

uv run inspect view

For VS Code, you can also install the Inspect AI extension for viewing logs.

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-opus-4-1-20250805
ANTHROPIC_API_KEY=<anthropic-api-key>

Options

You can control a variety of options from the command line. For example:

uv run inspect eval inspect_evals/macbench --limit 10
uv run inspect eval inspect_evals/macbench --max-connections 10
uv run inspect eval inspect_evals/macbench --temperature 0.5

See uv run inspect eval --help for all available options.

Parameters

macbench

  • task_name (str | None): Name of a specific subset to evaluate. If None, all subsets are loaded and merged. Must be one of SUBSET_DATA_FILES. (default: None)
  • cot (bool): Whether to enable chain-of-thought prompting. (default: False)
  • shuffle (bool): Whether to shuffle the dataset samples. (default: True)

You can run the eval on a single subset by specifying its name via the -T task_name=<name> command-line argument, for example:

uv run inspect eval inspect_evals/macbench --model openai/gpt-5-nano -T task_name=afm-image

The valid task names are: “afm-image”, “chem-lab-basic”, “chem-lab-comparison”, “chem-lab-equipments”, “chirality”, “cif-atomic-species”, “cif-crystal-system”, “cif-density”, “cif-symmetry”, “cif-volume”, “electronic-structure”, “handdrawn-molecules”, “isomers”, “mof-adsorption-strength-comparison”, “mof-adsorption-strength-order”, “mof-capacity-comparison”, “mof-capacity-order”, “mof-capacity-value”, “mof-henry-constant-comparison”, “mof-henry-constant-order”, “mof-working-capacity-comparison”, “mof-working-capacity-order”, “mof-working-capacity-value”, “organic-molecules”, “org-schema”, “org-schema-wo-smiles”, “spectral-analysis”, “tables-qa”, “us-patent-figures”, “us-patent-plots”, “xrd-pattern-matching”, “xrd-pattern-shape”, “xrd-peak-position”, “xrd-relative-intensity”.

An optional cot parameter can be set to True to enable chain-of-thought prompting:

uv run inspect eval inspect_evals/macbench --model openai/gpt-5-nano -T cot=True
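
The same parameters can also be passed when invoking the task from Python. A minimal sketch, assuming the parameter names documented above; the subset and model are only examples:

from inspect_ai import eval
from inspect_evals.macbench import macbench

# Evaluate only the XRD peak-position subset, with chain-of-thought prompting
# enabled and sample shuffling disabled.
eval(macbench(task_name="xrd-peak-position", cot=True, shuffle=False), model="openai/gpt-5-nano")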

Dataset

MaCBench includes 34 subsets consisting of both multiple-choice and open-ended questions, totaling 1,153 samples. All questions contain both text and image components that need to be analyzed together to arrive at the correct answer.

The subsets are organized into three categories:

  • Data Extraction: Hand-drawn molecules, organic chemistry (chirality, isomers, organic molecules, reaction schemas), tables and plots
  • In Silico and Lab Experiments: Lab QA (safety, comparison, equipment), crystal structure analysis (atomic species, density, symmetry, volume, crystal system)
  • Data Interpretation: AFM image analysis, adsorption isotherms (MOF capacity/Henry constant/adsorption strength/working capacity), electronic structure, NMR/MS spectra, XRD QA (pattern matching, shape, peak position, relative intensity)

You can download the dataset from Hugging Face.
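For a quick look at the raw data, the sketch below loads it with the Hugging Face datasets library, pinned to the revision listed in the Notes section. This is only an illustration: if the subsets are exposed as separate configurations, a config name may also be required.

from datasets import load_dataset

# Load MaCBench at the revision pinned by this eval; inspect the returned
# object to see the available splits and columns.
ds = load_dataset(
    "jablonkagroup/MaCBench",
    revision="ca13e7aff2f9a40cb638360c3e569de173ae1fdd",
)
print(ds)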

Scoring

Multiple-choice questions are scored with a regex-based pattern scorer using the option letters extracted from target_scores. Numeric questions are scored with pattern_mae from chembench. In the MaCBench dataset, the relative_tolerance field is treated as a precomputed absolute acceptance window; when that field is missing, the Inspect implementation falls back to ChemBench’s 1% relative-tolerance convention. Accuracy is reported for each subset via subset_name, together with overall accuracy and standard error.
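
For illustration only, the numeric acceptance logic described above can be sketched as follows; this is not the actual pattern_mae scorer, and the function name is hypothetical:

# Hypothetical sketch of the numeric tolerance handling described above,
# not the chembench pattern_mae implementation.
def numeric_answer_accepted(prediction: float, target: float, abs_window: float | None = None) -> bool:
    if abs_window is not None:
        # relative_tolerance field treated as a precomputed absolute acceptance window
        return abs(prediction - target) <= abs_window
    # fallback: ChemBench's 1% relative-tolerance convention
    return abs(prediction - target) <= 0.01 * abs(target)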

Evaluation Report

Commands used:

uv run inspect eval inspect_evals/macbench --model openai/gpt-5-nano
uv run inspect eval inspect_evals/macbench --model openai/gpt-5.2

Evaluation version: 1-A

Results:

Metrics are rounded to 4 significant figures.

| Date       | Model             | Samples   | Overall Accuracy (all) | Standard Error |
|------------|-------------------|-----------|------------------------|----------------|
| 2026-03-04 | openai/gpt-5-nano | 1153/1153 | 0.6878                 | 0.01365        |
| 2026-04-23 | openai/gpt-5.2    | 1153/1153 | 0.7259                 | 0.01314        |

The available local run artifacts preserve the provider aliases above rather than dated OpenAI snapshot IDs, so the table and commands report the exact model identifiers recorded in the logs.

Comparison to the original implementation

We compared this Inspect implementation against the original internal MaCBench implementation using the corresponding openai/gpt-5.2 summary output from the original benchmark codebase.

Metrics are rounded to 4 significant figures.

| Source                    | Model          | Samples   | Overall Accuracy (all) | Standard Error |
|---------------------------|----------------|-----------|------------------------|----------------|
| Original MaCBench summary | openai/gpt-5.2 | 1153/1153 | 0.7259                 | 0.01160        |
| Inspect eval              | openai/gpt-5.2 | 1153/1153 | 0.7259                 | 0.01314        |

For the original summary, the standard error is derived from per-topic all_correct means and counts.
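
One plausible reconstruction of that aggregation, assuming per-topic accuracies are pooled by topic size and per-topic binomial variances are combined (the original summary code may aggregate differently), is:

import math

# Hypothetical reconstruction: weight per-topic accuracies by topic size,
# then combine per-topic binomial variances into one standard error.
def pooled_accuracy_and_se(topic_means: list[float], topic_counts: list[int]) -> tuple[float, float]:
    n_total = sum(topic_counts)
    overall = sum(p * n for p, n in zip(topic_means, topic_counts)) / n_total
    variance = sum((n / n_total) ** 2 * p * (1 - p) / n for p, n in zip(topic_means, topic_counts))
    return overall, math.sqrt(variance)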

The report includes two models to satisfy the repository reporting guidance. The direct implementation-to-implementation comparison is currently strongest for openai/gpt-5.2, because that is the model for which an original-summary artifact was supplied for side-by-side checking.

Because the original artifact available here is a subset-aggregated summary rather than a per-sample trace, this comparison is strongest for overall score reproduction and weaker for sample-level trajectory parity.

Notes

  • Grouped metrics were reported for each subset via subset_name.
  • No samples were left unscored in the April 23 openai/gpt-5.2 run.
  • The dataset is loaded from jablonkagroup/MaCBench pinned to revision ca13e7aff2f9a40cb638360c3e569de173ae1fdd.