Pre-Flight: Aviation Operations Knowledge Evaluation
Tests model understanding of aviation regulations including ICAO annexes, flight dispatch rules, pilot procedures, and airport ground operations safety protocols.
Overview
Pre-Flight is an evaluation benchmark that tests the model’s understanding of ICAO annex documentation, flight dispatch rules and airport ground operations safety procedures and protocols. The evaluation consists of multiple-choice questions related to international airline and airport ground operations safety manuals.
Developed by Airside Labs, a specialized aviation sector software, data, and AI experimentation startup focused on accelerating innovation and optimizing operations through data-driven insights.
Usage
First, install the inspect_ai
and inspect_evals
Python packages with:
pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals
Or, if developing on a clone of the inspect_evals
repo, you can install the package in editable mode with:
pip install -e ".[dev]"
Then, evaluate against one or more models with:
inspect eval inspect_evals/pre_flight --model openai/gpt-4o
After running evaluations, you can view their logs using the inspect view
command:
inspect view
If you don’t want to specify the --model
each time you run an evaluation, create a .env
configuration file in your working directory that defines the INSPECT_EVAL_MODEL
environment variable along with your API key. For example:
INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>
Options
You can control a variety of options from the command line. For example:
inspect eval inspect_evals/pre_flight --limit 10
inspect eval inspect_evals/pre_flight --max-connections 10
inspect eval inspect_evals/pre_flight --temperature 0.5
See inspect eval --help
for all available options.
Dataset
The dataset consists of multiple-choice questions drawn from standard international airport ground operations safety manuals. Each question has 4-5 possible answer choices, with one correct answer.
Topics covered include: - Airport safety procedures - Ground equipment operations - Staff training requirements - Emergency response protocols - Fueling safety - Aircraft towing procedures
Specifically by section denoted by id (note there are gaps in the id sequence to enable future additions): - 001-205 Airport operations procedures and rules - 206-299 reserved - 300-399 Derived from US aviation role training material - 400-499 ICAO annexes, rules of the air and global guidelines - 500-599 General aviation trivia questions - 600-699 Complex reasoning scenarios
Example questions from the dataset, note that the actual messages sent differ slightly as we use the multiple_choice solver, these examples have been lightly edited for readability:
Q: What is the effect of alcohol consumption on functions of the body?
Options: A. “Alcohol has an adverse effect, especially as altitude increases.” B. “Alcohol has little effect if followed by an ounce of black coffee for every ounce of alcohol.” C. “Small amounts of alcohol in the human system increase judgment and decision-making abilities.” D. “no suitable option”A: A
Q: An airport is managing snow removal during winter operations on January 15th. Given the following information. | Current time | Temperature | Snowfall | Ground temperature | | ———— | ———– | ————————————— | —————— | | 07:45 | -3°C | Light, expected to continue for 2 hours | -6°C |
Aircraft movements and stand occupancy | Stand | Current/Next Departure | Next Arrival | | —– | ———————- | ————— | | A | BA23 with TOBT 07:55 | BA24 at 08:45 | | B | AA12 with TOBT 08:10 | AA14 at 09:00 | | C | Currently vacant | ETD112 at 08:15 | | D | ETD17 with TOBT 08:40 | ETD234 at 09:30 |
Assuming it takes 20 minutes to clear each stand and only one stand can be cleared at a time, what is the most efficient order to clear the stands to maximize the number of stands cleared before they’re reoccupied? Consider that stands must be cleared after an aircraft departs and before the next one arrives. Answer with the stand order only for all four stands and with the total duration of clearing and no preamble or other explanation, our jobs depend on this answer. Example 1. A, B, C, D 1hr 20 minutes Example 2. D, A, C, B 5hr 5 minutes
Options: A. “A, B, C, D 1 hour 20 minutes” B. “D, A, C, B 5 hours 5 minutes” C. “C, A, B, D 1 hour 20 minutes” D. “no suitable option”A: C
Scoring
The benchmark is scored using accuracy, which is the proportion of questions answered correctly. Each question has one correct answer from the multiple choices provided. The implementation uses the multiple_choice solver and the choice scorer.
Evaluation Report
The following results were produced from running the full dataset (300 samples) across multiple models on March 21, 2025.
Model | Accuracy | Correct Answers | Total Samples |
---|---|---|---|
anthropic/claude-3-7-sonnet-20250219 | 0.747 | 224 | 300 |
openai/gpt-4o-2024-11-20 | 0.733 | 220 | 300 |
openai/gpt-4o-mini-2024-07-18 | 0.733 | 220 | 300 |
anthropic/claude-3-5-sonnet-20241022 | 0.713 | 214 | 300 |
groq/llama3-70b-8192 | 0.707 | 212 | 300 |
anthropic/claude-3-haiku-20240307 | 0.683 | 205 | 300 |
anthropic/claude-3-5-haiku-20241022 | 0.667 | 200 | 300 |
groq/llama3-8b-8192 | 0.660 | 198 | 300 |
openai/gpt-4-0125-preview | 0.660 | 198 | 300 |
openai/gpt-3.5-turbo-0125 | 0.640 | 192 | 300 |
groq/gemma2-9b-it | 0.623 | 187 | 300 |
groq/qwen-qwq-32b | 0.587 | 176 | 300 |
groq/llama-3.1-8b-instant | 0.557 | 167 | 300 |
These results demonstrate the capability of modern language models to comprehend aviation safety protocols and procedures from international standards documentation. The strongest models achieve approximately 75% accuracy on the dataset.