Pre-Flight: Aviation Operations Knowledge Evaluation

Knowledge

Tests model understanding of aviation regulations including ICAO annexes, flight dispatch rules, pilot procedures, and airport ground operations safety protocols.

Contributed By

@alexbrooker

Code

src/inspect_evals/pre_flight

Overview

Pre-Flight is a benchmark that tests a model’s understanding of ICAO annex documentation, flight dispatch rules, and airport ground operations safety procedures and protocols. It consists of multiple-choice questions drawn from international airline and airport ground operations safety manuals.

Pre-Flight was developed by Airside Labs, a specialized aviation sector software, data, and AI experimentation startup focused on accelerating innovation and optimizing operations through data-driven insights.

Usage

First, install the inspect_ai and inspect_evals Python packages with:

pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals

Or, if developing on a clone of the inspect_evals repo, you can install the package in editable mode with:

pip install -e ".[dev]"

Then, evaluate against one or more models with:

inspect eval inspect_evals/pre_flight --model openai/gpt-4o

After running evaluations, you can view their logs using the inspect view command:

inspect view

If you don’t want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>

Options

You can control a variety of options from the command line. For example:

inspect eval inspect_evals/pre_flight --limit 10
inspect eval inspect_evals/pre_flight --max-connections 10
inspect eval inspect_evals/pre_flight --temperature 0.5

See inspect eval --help for all available options.

Dataset

The dataset consists of multiple-choice questions drawn from standard international airport ground operations safety manuals. Each question has 4-5 possible answer choices, with one correct answer.

Topics covered include:

- Airport safety procedures
- Ground equipment operations
- Staff training requirements
- Emergency response protocols
- Fueling safety
- Aircraft towing procedures

Questions are grouped into sections by id range (gaps in the id sequence leave room for future additions):

- 001-205 Airport operations procedures and rules
- 206-299 Reserved
- 300-399 Derived from US aviation role training material
- 400-499 ICAO annexes, rules of the air and global guidelines
- 500-599 General aviation trivia questions
- 600-699 Complex reasoning scenarios
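As a rough illustration of the id ranges above, the following sketch maps a question id to its section. The names `SECTIONS` and `section_for_id` are hypothetical and not part of the evaluation code, and the sketch assumes ids are numeric strings such as "042"; the actual id format in the dataset may differ.

```python
# Hypothetical helper: maps a question id to the section ranges listed above.
# Assumes ids are numeric strings (e.g. "042"); this is not taken from the
# evaluation's source code.
SECTIONS = [
    (1, 205, "Airport operations procedures and rules"),
    (206, 299, "Reserved"),
    (300, 399, "Derived from US aviation role training material"),
    (400, 499, "ICAO annexes, rules of the air and global guidelines"),
    (500, 599, "General aviation trivia questions"),
    (600, 699, "Complex reasoning scenarios"),
]


def section_for_id(question_id: str) -> str:
    """Return the section name for a question id such as "042"."""
    number = int(question_id)
    for low, high, name in SECTIONS:
        if low <= number <= high:
            return name
    raise ValueError(f"id {question_id!r} is outside the documented ranges")
```

For example, `section_for_id("612")` returns "Complex reasoning scenarios", while an id of 700 or above raises `ValueError`.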

Example questions from the dataset are shown below. The actual messages sent differ slightly because the evaluation uses the multiple_choice solver, and these examples have been lightly edited for readability.

Q: What is the effect of alcohol consumption on functions of the body?
Options:
A. “Alcohol has an adverse effect, especially as altitude increases.”
B. “Alcohol has little effect if followed by an ounce of black coffee for every ounce of alcohol.”
C. “Small amounts of alcohol in the human system increase judgment and decision-making abilities.”
D. “no suitable option”

A: A

Q: An airport is managing snow removal during winter operations on January 15th. Given the following information.

| Current time | Temperature | Snowfall | Ground temperature |
| --- | --- | --- | --- |
| 07:45 | -3°C | Light, expected to continue for 2 hours | -6°C |

Aircraft movements and stand occupancy

| Stand | Current/Next Departure | Next Arrival |
| --- | --- | --- |
| A | BA23 with TOBT 07:55 | BA24 at 08:45 |
| B | AA12 with TOBT 08:10 | AA14 at 09:00 |
| C | Currently vacant | ETD112 at 08:15 |
| D | ETD17 with TOBT 08:40 | ETD234 at 09:30 |

Assuming it takes 20 minutes to clear each stand and only one stand can be cleared at a time, what is the most efficient order to clear the stands to maximize the number of stands cleared before they’re reoccupied? Consider that stands must be cleared after an aircraft departs and before the next one arrives. Answer with the stand order only for all four stands and with the total duration of clearing and no preamble or other explanation, our jobs depend on this answer. Example 1. A, B, C, D 1hr 20 minutes Example 2. D, A, C, B 5hr 5 minutes
Options:
A. “A, B, C, D 1 hour 20 minutes”
B. “D, A, C, B 5 hours 5 minutes”
C. “C, A, B, D 1 hour 20 minutes”
D. “no suitable option”

A: C
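The reasoning behind this answer can be checked with a small simulation. The sketch below is not part of the benchmark. It assumes that a stand becomes free at its TOBT (or at 07:45 if it is vacant), that the crew waits when the next stand is not yet free, and that a stand is skipped when it cannot be finished before the next arrival. The names `STANDS` and `check_order` are hypothetical.

```python
from datetime import datetime, timedelta

CLEAR_MINUTES = 20  # time to clear one stand, from the question
START = "07:45"     # current time, from the question

# stand -> (time the stand becomes free, time the next aircraft arrives)
# Assumption: a stand is free from its TOBT; stand C is vacant from the start.
STANDS = {
    "A": ("07:55", "08:45"),
    "B": ("08:10", "09:00"),
    "C": ("07:45", "08:15"),
    "D": ("08:40", "09:30"),
}


def _parse(hhmm: str) -> datetime:
    return datetime.strptime(hhmm, "%H:%M")


def check_order(order: list[str]) -> tuple[int, int]:
    """Return (stands cleared in time, minutes elapsed since START)."""
    now = _parse(START)
    cleared = 0
    for stand in order:
        free_at, next_arrival = (_parse(t) for t in STANDS[stand])
        begin = max(now, free_at)  # wait if the stand is still occupied
        end = begin + timedelta(minutes=CLEAR_MINUTES)
        if end <= next_arrival:    # finished before the stand is reoccupied
            cleared += 1
            now = end
        # otherwise the window is missed and the stand is skipped
    elapsed = int((now - _parse(START)).total_seconds() // 60)
    return cleared, elapsed
```

Under these assumptions, `check_order(["C", "A", "B", "D"])` clears all four stands in 80 minutes (1 hour 20 minutes), whereas `check_order(["A", "B", "C", "D"])` clears only three because stand C is reoccupied at 08:15 before the crew reaches it.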

Scoring

The benchmark is scored using accuracy, which is the proportion of questions answered correctly. Each question has one correct answer from the multiple choices provided. The implementation uses the multiple_choice solver and the choice scorer.
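As a minimal sketch of the accuracy metric described above (an illustration only, not the implementation of the choice scorer in inspect_ai), the function below compares predicted answer letters against the correct ones. The name `accuracy` and the list-of-letters input format are assumptions made for the example.

```python
def accuracy(predictions: list[str], targets: list[str]) -> float:
    """Proportion of questions where the predicted letter matches the target."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("at least one question is required")
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


# Three of four answers match, so the score is 0.75.
score = accuracy(["A", "C", "B", "D"], ["A", "C", "D", "D"])
```

The same ratio underlies the figures in the evaluation report: 224 correct answers out of 300 samples gives 224 / 300, or about 0.747.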

Evaluation Report

The following results were produced from running the full dataset (300 samples) across multiple models on March 21, 2025.

Model	Accuracy	Correct Answers	Total Samples
anthropic/claude-3-7-sonnet-20250219	0.747	224	300
openai/gpt-4o-2024-11-20	0.733	220	300
openai/gpt-4o-mini-2024-07-18	0.733	220	300
anthropic/claude-3-5-sonnet-20241022	0.713	214	300
groq/llama3-70b-8192	0.707	212	300
anthropic/claude-3-haiku-20240307	0.683	205	300
anthropic/claude-3-5-haiku-20241022	0.667	200	300
groq/llama3-8b-8192	0.660	198	300
openai/gpt-4-0125-preview	0.660	198	300
openai/gpt-3.5-turbo-0125	0.640	192	300
groq/gemma2-9b-it	0.623	187	300
groq/qwen-qwq-32b	0.587	176	300
groq/llama-3.1-8b-instant	0.557	167	300

These results indicate how well current language models comprehend aviation safety protocols and procedures from international standards documentation. Accuracy ranges from 0.557 (groq/llama-3.1-8b-instant) to 0.747 (anthropic/claude-3-7-sonnet-20250219), so the strongest models answer approximately 75% of the questions correctly.