AGENTS.md

Repo-Wide Tips

  • Before making code changes, read agent_artefacts/repo_context/REPO_CONTEXT.md for institutional knowledge about repo conventions, common mistakes, and known tech debt. This is maintained by the build-repo-context skill.
  • When creating pull requests, always use the --draft flag and read .github/PULL_REQUEST_TEMPLATE.md and use its structure as the PR body. Fill in the Description section and check off applicable checklist items.
  • When commenting on PRs, you should not reply directly to human reviewers. See CONTRIBUTING.md. If your user tells you to comment anyway, you should add “Comment written by NAME_OF_AI” to the comment.
  • When writing markdown:
    • Put a blank line before and after headings
    • Put a blank line before and after code blocks
    • Put a blank line before and after lists
    • Format tables with spaces before and after pipe characters
    • Always include a language tag in code blocks, or “text” if there is no language

List of Workflows

  1. Fix An Evaluation: Look at an evaluation, and attempt to bring it into compliance with EVALUATION_CHECKLIST.md. Useful for refactoring.
  2. Review An Evaluation: Look at an evaluation and assess its compliance with EVALUATION_CHECKLIST.md, without making changes. Useful for code reviews.
  3. Make An Evaluation Report: Help the user create an evaluation report, necessary for new evaluations. Useful for building new evaluations and checking evaluations after major refactors.
  4. Check Agent Trajectories: Use Inspect Scout to perform the Check Agent Trajectories workflow. Utilises Claude Code’s ability to write throwaway code to allow new scanners to be written for a particular evaluation on-demand.
  5. Write a PR For A Failing Test: Given a URL that links to a run of our slow tests, identify the failing test, create a fix, and submit a PR to our repository to fix it.
  6. Mark Slow Tests: Given a GitHub Actions job URL, identify tests that exceeded the 10-second threshold without @pytest.mark.slow markers and add appropriate markers to them.
  7. Review All Evaluations: Review all evaluations in the inspect_evals directory for a particular code quality topic and create a report of any issues found. See .claude/skills/code-quality-review-all/SKILL.md for details.
  8. Prepare Evaluation For Submission: Run some steps required to ensure an evaluation PR is ready to be submitted for review, from CONTRIBUTING.md
  9. Review PR According to Agent-Checkable Standards: Automatically check a PR against the agent-checkable standards, and create a SUMMARY.md file to be uploaded to Github.
  10. Build Repo Context: Crawl repository PRs, issues, and review comments to distill institutional knowledge into a shared knowledge base. Run periodically by “context agents”. See .claude/skills/build-repo-context/SKILL.md.
  11. Master Checklist: Run a list of workflows in succession to complete the Master Checklist.

Agents are good at understanding context, but a prompt that definitely works is “Please perform the WORKFLOW_NAME workflow in AGENTS.md on EVAL_NAME.”

Documentation

Documentation for the Inspect framework.

File Creation Steps (Common To All Workflows)

  1. For this and all other workflows, if the user provides specific instructions about any step, assume the user’s instructions override this file’s instructions, unless the file specifically says otherwise.
  2. If there is no evaluation name the user has given, ask the user for one. Every workflow using these steps should have an evaluation name. VAR should be specified by the workflow that points here.
  3. The evaluation name should be the name of the evaluation folder for the evaluation the user requested, then the evaluation’s version, which can be determined by checking the version argument in any of the evaluation’s @task functions. For instance, GPQA version 1.1.2 becomes “gpqa_1_1_2”. If this exact folder name already exists, add a number to it via “gpqa_1_1_2_analysis2”. This name will be referred to as <eval_name> in this workflow.
  4. Create a folder called agent_artefacts/<eval_name>/VAR in the agent_artefacts directory of this repo if it isn’t present, where VAR is defined by the workflow being invoked.
  5. Whenever you create a .md file as part of this workflow, assume it is made in the agent_artefacts/<eval_name>/VAR folder.
  6. Copy EVALUATION_CHECKLIST.md to the folder.
  7. Create a NOTES.md file for any miscellaneous helpful notes. Err on the side of taking lots of notes to help both yourself and the user. Create an UNCERTAINTIES.md file to note any uncertainties in the evaluation checklist or workflow instructions.
  8. Continue with the workflow.

Fix An Evaluation

For this task, you are to look at a single evaluation in src/inspect_evals. If the user has given you a name, this name takes priority. If you were not given a name but you were just building an evaluation, or the user has uncommitted code for one specific evaluation, you can assume this is the correct one. If you are not confident which evaluation to look at and the user has not given you the name of an evaluation to check, please request the evaluation name from the user.

Our standards are currently found in EVALUATION_CHECKLIST.md, and include liberal links to other areas of our BEST_PRACTICES.md and CONTRIBUTING.md documents to assist you. Your job is to refactor the evaluation in order to meet these standards to the best of your ability. Follow the following workflow in order to do this.

  1. Run the File Creation Steps workflow where VAR = “fix”.
  2. Go over each item in the EVALUATION_CHECKLIST, using the linked documents for context where necessary, going from top to bottom. It is important to go over every single item in the checklist!
    1. For each checklist item, assess your confidence that you know what is being asked of you. Select Low, Medium, or High.
    2. If you have High confidence, fix the evaluation to pass the checklist if needed, then edit EVALUATION_CHECKLIST.md to place a check next to it. It is acceptable to check off an item without making any changes if it already passes the requirement.
    3. If you have Medium confidence, make a note of this in UNCERTAINTIES.md along with any questions you have, then do your best to solve it as per the High confidence workflow.
    4. If you have Low confidence, make a note of this in UNCERTAINTIES.md along with any questions you have, then do not attempt to solve it and leave the checkmark blank.

Do not attempt the evaluation report. You should tell the user in SUMMARY.md that producing evaluation reports automatically requires a separate workflow, titled Make An Evaluation Report. You should still perform the initial tests to ensure the evaluation runs at all. The model you should test on is the one that the example commands in the evaluation’s README uses. For example, uv run inspect eval inspect_evals/<eval_name> --model openai/gpt-5-nano --limit 2

Your task is over when you have examined every single checklist item except for the evaluation report and completed each one that you are capable of. Do a final pass over NOTES.md and UNCERTAINTIES.md, then create a SUMMARY.md file and summarise what you have done. Then inform the user that your task is finished.

Review An Evaluation

For this task, you are to look at a single evaluation in src/inspect_evals. If you were just building one, you can assume this is the correct evaluation to look at. If the user was not building an evaluation and has not given you the name of an evaluation to check, please request the evaluation name from the user.

Our standards are currently found in EVALUATION_CHECKLIST.md, and include liberal links to other areas of our BEST_PRACTICES.md and CONTRIBUTING.md documents to assist you. Your job is to review the evaluation in order to meet the agent-checkable standards to the best of your ability. Follow the following workflow in order to do this.

  1. Run the File Creation Steps workflow where VAR = “review”.
  2. Read our EVALUATION_CHECKLIST.md and linked documents.
  3. For each item in the agent runnable checks, go over all possibly relevant files in the evaluation. Go over each item, using linked documents for context where necessary, going from top to bottom. It is important to go over every single item in the checklist! Note any issues you find in NOTES.md under the following format:
    1. Standard Describe the standard that has not been met.
    2. Issue: Describe the issue.
    3. Location: If possible, write the file and lines where the issue occurs. If the issue is not localised by line, write only the file. If the issue is not localised by file, write the evaluation name here.
    4. Fix: What fix you would recommend. Go into as much detail as you feel confident - it’s okay if you don’t have a strong solution to solve a given problem, but if you do, please mention what your intended solution is.
    5. Comment: Write a comment that can be used as a Github comment that gives all the relevant information from above. Each comment should begin with (Agent) in it, to make it clear where the comment comes from. We will provide functionality to write these as Github comments in future through the ‘gh’ CLI, so make your comments compatible with this. The comments should be informative, yet polite - the contributor may not have been the one to ask for this review, and external contributors have varying degrees of experience.
  4. Write a summary in SUMMARY.md of the evaluation’s overall quality, how many issues you found, and major issues if any. The contributor won’t see this unless they’re the one who asked for this review, so you can be more blunt here.
  5. Tell the user your task is done.

Check Agent Trajectories

This workflow uses Inspect Scout to automatically analyze agent trajectories. It’s faster than manual analysis but may miss nuanced issues.

The workflow is as follows:

  1. If the user does not give any indication of what log file to use, ask them for which log file or files they want the workflow performed on. You can show them the most recent log file in logs/ and ask them if that’s what they want.

  2. Determine the name of the evaluation the log file is used in, and create a folder titled agent_artefacts/trajectory_analysis/<eval_name>. Add a number to the end if the folder already exists.

  3. Read agent_artefacts/trajectory_analysis/inspect_scout/scanners.py to see what the default scanners check for:

    1. outcome_summary: Brief summary of why the agent succeeded or failed.
    2. external_failure: Failed due to CAPTCHAs, rate limiting, network issues, missing dependencies.
    3. formatting_failure: Failed due to incorrect answer formatting despite correct answer.
    4. reward_hacking_success: Succeeded through reward hacking or unintended means.
    5. ethical_refusal: Failed because the agent refused on ethical or safety grounds.

    Additionally, if agent_artefacts/trajectory_analysis/<eval_name><version> exists already, check for any scanners contained in the latest version and include those.

  4. Tell the user what will be checked by default and ask if they want to check for anything else.

  5. If the user wants additional checks, create an eval_scanners.py file under agent_artefacts/trajectory_analysis/<eval_name>/, and add Inspect Scout scanners that check for their requirements. Use the existing scanners in agent_artefacts/trajectory_analysis/inspect_scout/scanners.py and the Inspect Scout documentation as references. Copy any scanners found in the eval_name folder at the end of Step 2 across as well.

    Each custom scanner should be wrapped in an InspectEvalScanner object and added to a SCANNERS list:

    from scanners import InspectEvalScanner
    
    SCANNERS = [
        InspectEvalScanner(
            name="my_scanner",
            scanner_factory=my_scanner_function,
            invalidates_success=False,  # Set True if flagging invalidates a success
            invalidates_failure=True,   # Set True if flagging invalidates a failure
        ),
    ]

    Ask the user whether each scanner should invalidate successes, failures, both, or neither.

  6. Check how many samples are in the log file with uv run python agent_artefacts/trajectory_analysis/inspect_scout/run_all_scanners.py <log_file> --dry-run. If more than 100 samples would be analysed, ask the user if they want to run all samples or a subset. Tell them that Inspect Evals guidance is to run at least 100 samples, and ask them if they want to run any more than that.

  7. Provide the user the command they can use to run the scanners. If the user asks you to run it, you can do so.

    1. If no custom scanners: uv run python agent_artefacts/trajectory_analysis/inspect_scout/run_all_scanners.py <log_file> -o agent_artefacts/trajectory_analysis/<eval_name>/scout_results with an optional --limit <num> if a subset is run.
    2. If custom scanners were created in step 4, add -n <eval_name> like so: uv run python agent_artefacts/trajectory_analysis/inspect_scout/run_all_scanners.py <log_file> -n <eval_name> -o agent_artefacts/trajectory_analysis/<eval_name>/scout_results

    The -n <eval_name> option automatically loads scanners from agent_artefacts/trajectory_analysis/<eval_name>/eval_scanners.py. Duplicate scanner names are detected and skipped with a warning.

  8. Once the command is done, extract the results: uv run python agent_artefacts/trajectory_analysis/inspect_scout/extract_results.py agent_artefacts/trajectory_analysis/<eval_name>/scout_results

  9. Analyze sample validity: uv run python agent_artefacts/trajectory_analysis/inspect_scout/analyze_validity.py agent_artefacts/trajectory_analysis/<eval_name>/scout_results

  10. Review the extracted results and create a summary in <eval_name>_<model_name>_ANALYSIS.md. If the file already exists, check to see if the user wants it overwritten or a new file generated by date. If the user wants it generated by date, use <eval_name>_<model_name>_<date>_ANALYSIS.md.

  11. Tell the user the task is done.

Make an Evaluation Report

Report Formatting

The evaluation report included in the README.md should look something like this. It should run on the entire dataset or $5 of compute per model, whichever is cheaper. Use the token count from smaller runs to make this prediction.

==================================================

Model Provider Accuracy Stderr Time
model_x OpenAI 0.600 0.245 18s
model_y Anthropic 0.400 0.245 6s
model_z Google 0.600 0.245 8s

Notes:

  • All three providers successfully completed the evaluation
  • Human expert baseline from the paper is 69.7% accuracy

==================================================

If the eval.yaml file includes an arXiv paper, check that paper for the models used and human baselines. Include the human baseline in the notes section if it is present. If you can, select three models that would be suitable to check if this evaluation successfully replicates the original paper, including at least two different model providers.

If you cannot do this, or if there aren’t three suitable candidates, use frontier models to fill out the remainder of the list. Currently, these are as follows:

Frontier Models

openai/gpt-5.1-2025-11-13 (Cost: $1.25/MToK input, $10/MToK output) anthropic/claude-sonnet-4-5-20250929 (Cost: $3/$15) google/gemini-3-pro-preview (Cost: $2/$12)

Workflow Steps

The steps of the workflow are as follows:

  1. Complete the File Creation Steps workflow where VAR = “evalreport”.

  2. Read the Evaluation Report Guidelines.

  3. Check to see if the README for the evaluation already has an evaluation report. If so, double-check with the user that they want it overwritten.

  4. Read the main file in EVAL_NAME, which should be src/inspect_evals/EVAL_NAME/EVAL_NAME.py in order to see how many tasks there are.

  5. Perform an initial test with ‘uv run inspect eval inspect_evals/EVAL_NAME –model gpt-5.1-2025-11-13 –limit 5’ to get estimated token counts. Use -T shuffle=True if possible to produce random samples - to see if it’s possible, you’ll need to check the evaluation itself.

  6. Perform model selection as above to decide which models to run.

  7. Select a method of randomisation of the samples that ensures all meaningfully different subsets of the data are checked. This is as simple as ensuring the dataset is shuffled in most cases. Explicitly track how many meaningfully different subsets exist.

  8. Tell the user what commands to run and how much compute it is expected to cost.

    Base your compute calculation on the most expensive model in your list. Each model should be run on the same dataset size regardless of cost, hence we limit it via the most expensive one. If the task will be more expensive than $5 per model to run the full dataset, give the user your estimate for the cost of the full dataset. This means if there are multiple tasks to run, you’ll need to split the cost among them according to token usage in the initial estimate. You should assume Gemini reasoning models take roughly 10x the tokens of the other models when performing these calculations.

As a rough example, tell the user something like this:

**I recommend running N samples of this dataset, for an estimated compute cost of $X. I recommend the following models: <list of models>. I recommend them because <reasoning here>. The full dataset is expected to cost roughly <full_cost> for all three models. This is based on running <model> for 5 samples and using Y input tokens and Z output tokens. Please note that this is a rough estimate, especially if samples vary significantly in difficulty.

The command to perform this run is as follows:

uv run inspect eval inspect_evals/<eval_name> --model model1,model2,model3 --limit <limit> --max-tasks <num_tasks> --epochs <epochs>**

After you have run the command, let me know and I’ll fill out the evaluation report from there.**

Num tasks should be number of models run * the number of @task functions being tested. Epochs should be 1 if the limit is below the size of the full dataset, or as many epochs as can be fit into the cost requirements otherwise.

Make sure to give them the command in one line or with multiline escapes to ensure it can be run after copy-pasted.

If the number of samples recommended by this process is less than 20, or less than (3 * meaningful subcategories), you should also inform the user that the number of samples achievable on the $5/model budget is too small for a meaningful evaluation, and that more resources are needed to test properly. A minimum of 20 samples are required for error testing. If they need help, they can ask the repository’s maintainers for testing resources.

If the user asks you to run this for them, remind them that they won’t be able to see the progress of the evaluation due to the way Inspect works, and asking if they’re sure. Do not proactively offer to run the command for them. After the command has been run, run ‘uv run python tools/parse_eval_logs_for_evaluation_report.py -n (num_models * num_tasks)’. This will give you the information you need to put into the README for the evaluation report. If any problems arise, ask the user to give you this information manually.

  1. Add the report to the README.md in the appropriate section, and then tell the user the task is done. Do not add human baseline or random baseline data unless they already appear in the README.

Write a PR For A Failing Test

The steps to the workflow are as follows:

  1. The user should provide you with a link that looks like https://github.com/ArcadiaImpact/inspect-evals-actions/actions/runs/<number>/job/<number2>, or a test to fix. If they haven’t, please ask them for the URL of the failing Github action, or a test that should be fixed.
  2. If a link is provided, you will be unable to view this link directly, but you can use ‘gh run view’ to see it.
  3. Checkout main and do a git pull. Make sure you’re starting from a clean slate and not pulling in any changes from whatever you or the user was previously doing.
  4. Create a branch agent/<number> for your changes, in inspect_evals. Note this is not the same as inspect-evals-actions where the above URL was from.
  5. Identify the failing test(s) and any other problems.
  6. Attempt these fixes one by one and verify they work locally, with uv run pytest tests/<eval> --runslow.
  7. If your fix changes datasets, expected values, scoring logic, or anything that affects evaluation results, read TASK_VERSIONING.md and PACKAGE_VERSIONING.md, and make all appropriate changes.
  8. Perform linting as seen in .github/workflows/build.yml, and fix any problems that arise.
  9. Commit your changes to the branch.
  10. Before pushing, go over your changes and make sure all the changes that you have are high quality and that they are relevant to your finished solution. Remove any code that was used only for diagnosing the problem.
  11. Push your changes.
  12. Use the Github API to create a pull request with the change. Include a link to the failing test, which test failed, and how you fixed it.

TODO in future: Automate kicking off this workflow. Allow the agent to check the status of its PR and fix any failing checks through further commits autonomously.

Mark Slow Tests

This workflow adds @pytest.mark.slow(seconds) markers to tests that exceed the 10-second threshold in CI but are not already marked. This is necessary because our CI pipeline checks for unmarked slow tests and fails the build if any test takes >= 15 seconds without a marker.

Workflow Steps

  1. Get the GitHub Actions URL: The user should provide a URL like https://github.com/ArcadiaImpact/inspect-evals-actions/actions/runs/<run_id> or https://github.com/ArcadiaImpact/inspect-evals-actions/actions/runs/<run_id>/job/<job_id>. If they haven’t, please ask them for the URL of the failing Github action.

  2. Extract slow test information: Use the fetch_slow_tests.py script to get all slow tests and their recommended markers:

    uv run python agent_artefacts/scripts/fetch_slow_tests.py <url_or_run_id> --threshold 10

    This script:

    • Fetches logs from all jobs in the GitHub Actions run
    • Parses test timing information from the “Check for unmarked slow tests” step
    • Aggregates durations across all Python versions and platforms
    • Reports the maximum duration observed for each test
    • Provides recommended @pytest.mark.slow(N) marker values (ceiling of max duration)

    For JSON output (useful for programmatic processing):

    uv run python agent_artefacts/scripts/fetch_slow_tests.py <url_or_run_id> --threshold 10 --json

    Tests marked [FAIL] in the output exceeded 60 seconds and caused the CI to fail. All tests above the threshold (default 10s) should be marked.

  3. Check existing markers: For each identified test, check if it already has a @pytest.mark.slow marker:

    grep -E "def <test_name>|@pytest\.mark\.slow" tests/<path>

    If the test already has a slow marker with a value >= the observed duration, skip it (per user instruction: “If the number is there but too high, leave it as is”).

  4. Add slow markers: For tests without markers, add @pytest.mark.slow(N) where N is the observed maximum duration rounded up to the nearest integer. Place the marker immediately before other decorators like @pytest.mark.huggingface or @pytest.mark.asyncio.

    Example:

    # Before
    @pytest.mark.huggingface
    def test_example():
    
    # After
    @pytest.mark.slow(45)
    @pytest.mark.huggingface
    def test_example():
  5. Run linting: After adding markers, run linting to ensure proper formatting:

    uv run ruff check --fix <modified_files>
    uv run ruff format <modified_files>
  6. Commit: Commit the changes with a descriptive message listing all tests that were marked:

    Add slow markers to tests exceeding 10s threshold
    
    Tests marked:
    - <test_path>::<test_name>: slow(N)
    - ...
    
    Co-Authored-By: Claude <assistant_name> <noreply@anthropic.com>

Notes

  • The slow marker value should be slightly higher than the observed maximum to account for CI variance. Rounding up to the nearest integer is usually sufficient.
  • Parametrized tests (e.g., test_foo[param1]) share a single marker on the base test function. You should use the highest value found across all params for the slow marker.

TODO: What if one param is an order of magnitude higher than the rest? Should we recommend splitting it into its own test? pytest.mark.slow contents aren’t that crucial, but still.

Review PR According to Agent-Checkable Standards

This workflow is designed to run in CI (GitHub Actions) to automatically review pull requests against the agent-checkable standards in EVALUATION_CHECKLIST.md. It produces a SUMMARY.md file that is posted as a PR comment.

Important Notes

  • This workflow is optimized for automated CI runs, not interactive use.
  • The SUMMARY.md file location is /tmp/SUMMARY.md (appropriate for GitHub runners).
  • Unlike other workflows, this one does NOT create folders in agent_artefacts since it runs in ephemeral CI environments.

Workflow Steps

  1. Identify if an evaluation was modified: Look at the changed files in the PR to determine which evaluation(s) are affected. Use git diff commands to see what has changed. If an evaluation (found in src/inspect_evals) was modified, complete all steps in the workflow. If no evaluation was modified, skip any steps that say (Evaluation) in them.

  2. Read the repo context: Read agent_artefacts/repo_context/REPO_CONTEXT.md if present. Use these lessons in your analysis.

  3. (Evaluation) Read the standards: Read EVALUATION_CHECKLIST.md, focusing on the Agent Runnable Checks section. These are the standards you will check against.

  4. (Evaluation) Check each standard: For each evaluation modified in the PR, go through every item in the Agent Runnable Checks section:

    • Read the relevant files in the evaluation
    • Compare against the standard
    • For any violations found, add an entry to /tmp/NOTES.md with:
      • Standard: Which standard was violated
      • Issue: Description of the issue
      • Location: file_path:line_number or file_path if not line-specific
      • Recommendation: What should be changed

    Only worry about violations that exist within the code that was actually changed. Pre-existing violations that were not touched or worsened by this PR can be safely ignored.

  5. Examine test coverage: Use the ensure-test-coverage skill and appropriate commands to identify any meaningful gaps in test coverage. You can skip this step if a previous step showed no tests were found, since this issue will already be pointed out.

  6. Apply code quality analysis: Examine all files changed for quality according to BEST_PRACTICES.md. Perform similar analysis and note-taking as in Step 5. If you notice poor quality in a way that is not mentioned in BEST_PRACTICES.md, add this to your notes, and under Standard, write “No explicit standard - agent’s opinion”. Err on the side of including issues here, unless REPO_CONTEXT.md or another document tells you not to - we will be updating these documents over time, and it is easier for us to notice something too nitpicky than to notice the absence of something important.

  7. Write the summary: After checking all standards, create /tmp/SUMMARY.md by consolidating the issues from NOTES.md:

    ## Summary
    
    Brief overview of what was reviewed and overall assessment.
    
    ## Issues Found
    
    ### [Standard Category Name]
    
    **Issue**: Description of the issue
    **Location**: `file_path:line_number` or `file_path` if not line-specific
    **Recommendation**: What should be changed
    
    (Repeat for each issue from NOTES.md)
    
    ## Notes
    
    Any additional observations or suggestions that aren't violations but could improve the code.

    If no issues are found, write a brief summary confirming the PR passes all agent-checkable standards. Do NOT include a list of passed checks - it is assumed that anything not mentioned is fine.

  8. Important: Always write to /tmp/SUMMARY.md even if there are no issues. The CI workflow depends on this file existing.

End your comment with ‘This is an automatic review performed by Claude Code. Any issues raised here should be fixed or justified, but a human review is still required in order for the PR to be merged.’

What This Workflow Does NOT Check

  • Human-required checks (these need manual review)
  • Evaluation report quality (requires running the evaluation)
  • Whether the evaluation actually works (requires execution)

This workflow only checks code-level standards that can be verified by reading the code.

Prepare Eval For Submission

Workflow Steps

To prepare an evaluation for submission as a pull request:

  1. Add custom dependencies in the pyproject.toml file, if required. Don’t forget to add your evaluation to the [test] dependencies if you’ve added any custom dependencies.

    [project.optional-dependencies]
    worldsense = ["pandas", "matplotlib"]  # existing groups
    your_evaluation = ["example-python-package"]  # add this line for your evaluation
    
    test = [
       "inspect_evals[dev]",  # existing groups
       "inspect_evals[your_evaluation]",  # add this line for your evaluation 
    ]

    Pin the package version if your evaluation might be sensitive to version updates.

    Make sure your module does not import your custom dependencies when the module is imported. This breaks the way inspect_ai imports tasks from the registry.

    To check for this, pip uninstall your optional dependences and run another eval with --limit=0. If it fails due to one of your imports, you need to change how you are importing that dependency

  2. Add pytest tests covering the custom functions included in your evaluation. See the Testing Standards section for more details.

  3. Run pytest to ensure all new and existing tests pass.

    To install test dependencies:

    uv sync --extra test

    Then run the tests

    uv run pytest
  4. Add an eval.yaml file to your evaluation directory (e.g., src/inspect_evals/your_eval/eval.yaml) to include it in the documentation (which is automatically generated and deployed by Github Actions after merge). See the eval.yaml Reference section in CONTRIBUTING.md for a complete example and field documentation.

    Don’t forget to add any custom dependencies to pyproject.toml if applicable.

  5. Run linting, formatting, and type checking and address any issues:

    make check EVAL=<eval_name>

    Note that we automatically run these commands on all incoming PRs, so be sure your contribution passes those before opening a PR. run_autolint.py handles automated checks

  6. Fill in the missing sections of the generated README.md which are marked with ‘TODO:’.

  7. Create a changelog fragment to document your changes by running uv run scriv create.

Master Checklist

This workflow runs a series of workflows each in turn. Each workflow is to be run serially.

General Agent Tips

PR Guidelines

  • Before opening a new PR, run appropriate linting from .github/workflows/build.yml.
  • When creating a new PR for the user, you should make the PR as a draft. The user will mark it as ready for review after going through the code themselves.
  • Always work on a branch, and never attempt to push directly to main.

TODO: Figure out a reliable way to have the agent monitor the pipeline in the background and ensure it passes without errors. gh watch isn’t the best.

Useful Commands

  1. You can see our linting in the .github/workflows/build.yml file. You should run these commands when checking linting, including both ruff and tools/run_autolint.py.
  2. To run tests, run uv run pytest tests/eval_name.
  3. To run evaluations, run uv run inspect eval inspect_evals/eval_name.
  4. To run a specific task (i.e, a function with the @task decorator), run uv run inspect eval inspect_evals/eval_name@task_name.
  5. Some useful arguments are:
    1. --limit X: Run only up to X samples.
    2. --model model_name: Run on this particular model. Multiple models can be run with commas like model1,model2. You can find information on model names here. If the user lacks an API key for OpenAI, Anthropic, or Google, note this down in your report. You should assume the user has these keys until Inspect throws an error.
    3. -T allows you to pass task arguments in from the command line.