MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark

Reasoning
Multimodal

Multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines, demanding college-level subject knowledge and deliberate reasoning. Demonstrates multimodal inputs.

Contributed By

Overview

MMMU is a benchmark for evaluating multimodal models on college-level tasks across a variety of subjects. 11.5k problems are gathered across 30 subjects and 30 heterogeneous image types. Around 93% of questions in the evaluation dataset are multiple-choice, with the remainder being open-ended. GPT-4o is reported to achieve 63-69% accuracy on MMMU.

Usage

First, install the inspect_ai and inspect_evals Python packages with:

pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals

Then, evaluate against one or more models with:

inspect eval inspect_evals/mmmu_multiple_choice --model openai/gpt-4o
inspect eval inspect_evals/mmmu_open --model openai/gpt-4o
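
Evaluations can also be run from Python. The following is a minimal sketch assuming inspect_ai's eval() function, using the same task and model names as the CLI examples above:

from inspect_ai import eval

# Run the MMMU multiple-choice task against GPT-4o
# (equivalent to the first CLI command above)
eval("inspect_evals/mmmu_multiple_choice", model="openai/gpt-4o")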

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>

Options

You can control a variety of options from the command line. For example:

inspect eval inspect_evals/mmmu_multiple_choice --limit 10
inspect eval inspect_evals/mmmu_open --max-connections 10
inspect eval inspect_evals/mmmu_multiple_choice --temperature 0.5

See inspect eval --help for all available options.

Dataset

Here is an example from the dataset:

Question: The double-tiered columns allowed for all of the following EXCEPT (<image 1>)

Options:
A) barrel-vaulted roofing
B) decorative rhythm and repetition
C) a higher roof to make up for the short columns
D) the entrance of light and air into the hall

The model is required to choose the correct answer from the given options and attached image.
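
For illustration, a multimodal multiple-choice sample like this could be represented with inspect_ai's dataset and content types roughly as follows. This is only a sketch: the image path and target are placeholders, and the actual MMMU record-to-sample conversion in inspect_evals may differ.

from inspect_ai.dataset import Sample
from inspect_ai.model import ChatMessageUser, ContentImage, ContentText

sample = Sample(
    input=[
        ChatMessageUser(
            content=[
                ContentText(text="The double-tiered columns allowed for all of the following EXCEPT"),
                ContentImage(image="images/image_1.png"),  # hypothetical image path
            ]
        )
    ],
    choices=[
        "barrel-vaulted roofing",
        "decorative rhythm and repetition",
        "a higher roof to make up for the short columns",
        "the entrance of light and air into the hall",
    ],
    target="A",  # target shown for illustration only
)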

Scoring

All evaluation is zero-shot. For multiple-choice questions, the prompt includes the question, four answer options, and up to seven images. For open-ended questions, the options are omitted. Prompts follow those used by the original authors. The original paper reports micro-averaged accuracy, while this implementation returns Inspect's accuracy metric separately for the multiple-choice and open-ended tasks.
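
For reference, micro-averaged accuracy pools every question across subjects before averaging, so subjects with more questions carry proportionally more weight. A minimal sketch (the results_by_subject structure is hypothetical):

def micro_accuracy(results_by_subject: dict[str, list[bool]]) -> float:
    # results_by_subject maps subject name -> per-question correctness flags
    # (hypothetical structure used only for this illustration)
    all_results = [correct for results in results_by_subject.values() for correct in results]
    # Pool all questions across subjects, then take the overall proportion correct
    return sum(all_results) / len(all_results)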