MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark
Multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines and demanding college-level subject knowledge and deliberate reasoning. Demonstrates multimodal inputs.
Overview
MMMU is a benchmark for evaluating multimodal models on college-level tasks across a variety of subjects. Its 11.5K problems are gathered across 30 subjects and 30 heterogeneous image types. Around 93% of the questions in the evaluation dataset are multiple-choice, with the remainder being open-ended. GPT-4o is reported to achieve 63-69% accuracy on MMMU.
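For reference, the underlying dataset is distributed on Hugging Face as MMMU/MMMU, with one configuration per subject. The sketch below shows one way to inspect a few validation records with the datasets package; the field names follow the published dataset card and are assumptions rather than part of this implementation.

```python
# Sketch: peek at a few MMMU validation records via Hugging Face datasets.
# Assumes the dataset is hosted at "MMMU/MMMU" with one config per subject
# (e.g. "Art_Theory"), as in the original release.
from datasets import load_dataset

art_theory = load_dataset("MMMU/MMMU", "Art_Theory", split="validation")

for record in art_theory.select(range(3)):
    print(record["id"], record["question_type"])
    print(record["question"])
    print(record["options"], "->", record["answer"])
```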
Usage
First, install the inspect_ai and inspect_evals Python packages with:
pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals
Then, evaluate against one or more models with:
inspect eval inspect_evals/mmmu_multiple_choice --model openai/gpt-4o
inspect eval inspect_evals/mmmu_open --model openai/gpt-4o
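If you prefer to drive the evaluation from Python rather than the CLI, inspect_ai also exposes an eval() function. A minimal sketch, with an example model string and an arbitrary sample limit:

```python
# Sketch: run the MMMU multiple-choice task from Python instead of the CLI.
from inspect_ai import eval

# "openai/gpt-4o" is an example model string; any Inspect-supported model works.
# limit=10 just keeps the run small for a quick check.
logs = eval("inspect_evals/mmmu_multiple_choice", model="openai/gpt-4o", limit=10)
```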
After running evaluations, you can view their logs using the inspect view command:
inspect view
If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:
INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>
Options
You can control a variety of options from the command line. For example:
inspect eval inspect_evals/mmmu_multiple_choice --limit 10
inspect eval inspect_evals/mmmu_open --max-connections 10
inspect eval inspect_evals/mmmu_multiple_choice --temperature 0.5
See inspect eval --help for all available options.
Dataset
Here is an example from the dataset:
Question: The double-tiered columns allowed for all of the following EXCEPT (<image 1>)
Options: A) barrel-vaulted roofing
B) decorative rhythm and repetition
C) a higher roof to make up for the short columns
D) the entrance of light and air into the hall
The model must choose the correct answer from the given options, using the question text and the attached image.
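Under the hood, a question like this maps onto Inspect's chat-based sample format, with the image passed as part of the user message. The sketch below is illustrative only: the field values, image path, and target are not the implementation's exact prompt or answer, and exact import locations may vary between inspect_ai versions.

```python
# Sketch: how an image-based multiple-choice question can be represented
# as an Inspect Sample. Values are illustrative only.
from inspect_ai.dataset import Sample
from inspect_ai.model import ChatMessageUser, ContentImage, ContentText

sample = Sample(
    input=[
        ChatMessageUser(
            content=[
                ContentText(
                    text="The double-tiered columns allowed for all of the following EXCEPT (<image 1>)"
                ),
                ContentImage(image="images/image_1.png"),  # hypothetical local path
            ]
        )
    ],
    choices=[
        "barrel-vaulted roofing",
        "decorative rhythm and repetition",
        "a higher roof to make up for the short columns",
        "the entrance of light and air into the hall",
    ],
    target="A",  # illustrative; the real answer comes from the dataset
)
```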
Scoring
All evaluation is zero-shot. For multiple-choice questions, the model is given one question, four options, and up to seven images; for open-ended questions, the options are omitted. Prompts follow those used by the original authors. The original paper reports micro-averaged accuracy, while this implementation simply returns Inspect's accuracy metric separately for the multiple-choice and open-ended tasks.
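To make the difference concrete: micro-averaged accuracy pools every question before dividing, whereas this implementation reports one accuracy per task. A small illustrative sketch with made-up counts (not real results):

```python
# Sketch: micro-averaged accuracy (pool all questions) versus per-task accuracy,
# using made-up counts purely for illustration.
results = {
    "multiple_choice": {"correct": 620, "total": 1000},
    "open_ended": {"correct": 40, "total": 100},
}

# Per-task accuracy, as returned by this implementation (one number per task).
per_task = {name: r["correct"] / r["total"] for name, r in results.items()}

# Micro-averaged accuracy, as reported in the MMMU paper (pool before dividing).
micro = sum(r["correct"] for r in results.values()) / sum(
    r["total"] for r in results.values()
)

print(per_task)  # {'multiple_choice': 0.62, 'open_ended': 0.4}
print(micro)     # 0.6 (weighted by question counts)
```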