Benchmark Implementation Methodology
This methodology guide provides a structured approach for implementing benchmarks in the Inspect Evals repository. It focuses on the design and methodological aspects of developing a new benchmark, complementing the technical requirements (testing, CI, code style) outlined in CONTRIBUTING.md.
For quality expectations and review criteria, see BEST_PRACTICES.md and EVALUATION_CHECKLIST.md.
Getting Help
Need assistance or have questions? Open a GitHub issue to connect with the Inspect Evals team and get support from the community, for example if you’re unsure whether a benchmark is in scope or how to interpret the original paper.
We especially welcome early “design review” issues that share your benchmark choice, planned scope, and open questions before you start coding.
Implementation Methodology
Phase 1: Research and Planning
Getting Started with Inspect
- Read the Inspect documentation to understand the framework
- Review existing evals in the inspect_evals repository to see examples of implementations, such as MMLU-Pro or InterCode
- Consider starting with a Jupyter notebook to explore the dataset and prototype your approach (see the sketch below)
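As a sketch of that kind of exploration (the dataset id and column names below are placeholders, not a real benchmark):

```python
# Illustrative exploration of a hypothetical Hugging Face dataset; swap in
# your benchmark's real dataset id and column names.
from collections import Counter

from datasets import load_dataset

ds = load_dataset("some-org/some-benchmark", split="test")  # placeholder id

print(ds)            # number of rows and column names
print(ds[0])         # inspect one raw record
print(ds.features)   # field names and types

# Check the answer distribution before designing prompts and the scorer
# (assumes an "answer" column; adjust to the real schema).
print(Counter(ds["answer"]).most_common(10))
```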
Benchmark Selection Strategy
When choosing a benchmark to implement, prefer one that:
- Has a published paper or widely used dataset
- Uses publicly available data (or a clearly documented access protocol)
- Fills a gap in existing Inspect Evals coverage (capability, domain, or safety dimension)
- Has at least one set of published or reference results to validate against
Also review CONTRIBUTING.md for acceptance criteria and priority guidelines.
Before proposing a new benchmark, review open issues and PRs for existing implementations; if none exist, open a new issue to discuss your proposal.
Look for existing implementations in other frameworks that could be adapted, e.g.:
- Custom implementation in a GitHub repository linked from the paper or website
- openai/simple-evals
- EleutherAI/lm-evaluation-harness
- HELM
- OpenCompass
Consider the benchmark’s complexity and your familiarity with the domain when selecting what to implement. We recommend implementing simpler Q&A style benchmarks before complex agentic / tool-using benchmarks.
Inspect Evals aims for compatibility with the Agentic Benchmark Checklist framework to ensure a high standard of rigor.
Phase 2: Benchmark Design Review
Before starting implementation, conduct a thorough design review by answering these key questions:
Core Benchmark Understanding
- Purpose & Goals
- What is the primary objective of the benchmark?
- Which specific capabilities or behaviors does it measure?
- What domain or tasks does it focus on?
- Dataset Analysis
- What is the size and source of the dataset?
- How and when was the data collected and labeled?
- Are there any known biases or limitations in the data?
- What are the key characteristics and distributions in the data?
- What is the dataset’s license and are there any usage restrictions?
- How is the data split (train / validation / test), and which split(s) will you use?
- Is there any risk of test-set contamination (e.g. widely used benchmarks that might be in model training data)? How will you document or mitigate this?
- Prompt Engineering
- How are prompts structured and formatted?
- Are there specific system prompts or instructions?
- How is few-shot or zero-shot prompting implemented?
- Are there any prompt variations or templates?
- Evaluation Methodology
- What metrics are used to evaluate performance?
- How are scores calculated and aggregated?
- What constitutes a passing or failing result?
- How is statistical significance determined?
- How many model runs will you perform (e.g. repeated runs with different seeds) for reliable estimates?
- Do you need additional quantification of uncertainty beyond standard error?
- Are there any budget constraints (tokens / runtime) that limit full-dataset evaluation, and how will you sample if so?
- Prior Work & Baselines
- Are there existing implementations of this benchmark? Feel free to import or adapt code from other implementations, with appropriate attribution.
- What models have been evaluated on this benchmark?
- What were their reported results?
- How does this benchmark compare to similar evaluations?
- What are the state-of-the-art results?
- What baseline could be achieved by random guessing or a trivial agent (e.g. returning empty strings, or guessing an answer on every question)? See the sketch after this list.
- Risks & Safety Considerations
- Does this benchmark expose sensitive or hazardous capabilities?
- Are there access controls or redaction steps required for the data or prompts?
- How might this evaluation be misused, and how does your implementation mitigate that?
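For the baseline question above, a minimal sketch of estimating a random-guess baseline (option counts are placeholders):

```python
# Estimate the random-guess baseline for a multiple-choice benchmark.
# With k uniformly guessed options, expected accuracy is 1/k; if the number
# of options varies per question, average 1/k_i over the dataset.
options_per_question = [4, 4, 5, 3]  # placeholder counts from your dataset

baseline = sum(1 / k for k in options_per_question) / len(options_per_question)
print(f"expected random-guess accuracy: {baseline:.3f}")

# For free-form answers, a trivial baseline is scoring an empty string (or the
# most common answer) with your actual scorer and reporting that accuracy.
```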
Implementation Considerations
- Document any uncertainties or questions that need clarification
- Note any assumptions made about the evaluation design
- Identify potential edge cases or failure modes
- Plan how to validate the implementation against known results
- Consider potential data leakage or contamination (e.g. if models have likely been trained on this test set) and how you will document that risk.
- Decide how you will handle versioning (dataset versions, paper revisions, scoring changes, required packages) and record this in your design decisions log.
Design Decisions Log
Maintain a log of key design decisions, including:
- The decision made
- Rationale behind it
- Alternative approaches considered
- Potential impact on evaluation results
- We recommend creating a docs/<eval_name>_design.md file in your eval’s directory and keeping this log there so it can evolve alongside the implementation.
Example format:
| Date | Decision | Rationale | Impact |
|---|---|---|---|
| YYYY-MM-DD | Decision made | Why this approach was chosen | How it affects the evaluation |
Phase 3: Implementation Planning
Create a detailed technical implementation plan covering all aspects of the benchmark. Use the following table as a template to document your approach:
| Aspect | Implementation Details | Dependencies | Validation: How will you… |
|---|---|---|---|
| Data Loading | How will you retrieve the dataset (e.g., from a URL or Hugging Face)? | Dependencies needed | validate data loading? Compare counts against the source paper / dataset docs; spot-check random samples; optionally add checksum or schema validation? (see the sketch after this table) |
| Benchmark Type | Is this a simple Q&A task, completion, multiple choice, or an agentic evaluation (multi-turn, tools, sandboxed environment)? Note any special requirements (e.g., Docker, browser automation, external APIs). | - | - |
| Task Definition | Describe the tasks and dataset subsets involved. | - | - |
| Data Transformation | Explain how raw data will be converted into usable samples. | - | - |
| Prompt Engineering | How will prompts be constructed and formatted? | Any prompt templates | test prompt effectiveness? Run a small “dry run” with a cheap model on 5-10 samples; manually inspect prompts and outputs to ensure they match the paper’s description? |
| Evaluation Metrics | How will responses be scored and evaluated? | Scoring criteria | verify scoring accuracy? Recompute scores for a small subset manually or against another implementation; verify edge cases (ties, invalid answers, partial credit)? |
| Error Handling | How will errors and edge cases be handled? | - | test error conditions and error handling logic? |
| Performance | Any performance considerations or optimizations? | - | measure and validate performance? |
| Model Choice | Which model(s) will be used for validation? | - | - |
| Reproducibility | How will you set seeds, sampling options, and concurrency to allow repeatable runs? | - | verify that repeated runs under the same config reproduce results within expected variance? Additionally, how will you verify LLM-as-a-judge evaluations in benchmarks which do not use substring matching? |
| Cost & Resources | Estimate token / compute cost, and whether you will use sampling or subsetting for validation runs. | Hardware / provider constraints | ensure your plan fits contributor budgets? |
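To illustrate the Data Loading row, here is a hedged sketch using Inspect's hf_dataset helper; the dataset id, field names, and expected count are placeholders to be replaced with your benchmark's actual values:

```python
from inspect_ai.dataset import Sample, hf_dataset

def record_to_sample(record: dict) -> Sample:
    # Placeholder field names: map these to your dataset's actual schema.
    return Sample(
        input=record["question"],
        target=record["answer"],
        metadata={"subject": record.get("subject")},
    )

dataset = hf_dataset(
    path="some-org/some-benchmark",  # placeholder dataset id
    split="test",
    sample_fields=record_to_sample,
)

# Cheap integrity check: compare counts against the source paper / dataset docs.
EXPECTED_SAMPLES = 1234  # placeholder: the count reported in the paper
assert len(dataset) == EXPECTED_SAMPLES, "sample count does not match the source"
```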
Phase 4: Implementation Execution
Core Implementation
- Project Setup
- Initialize project structure following repository conventions
- Data Pipeline
- Implement data loading and preprocessing
- Add data validation and integrity checks
- Create data sampling utilities for testing
- Core Functionality
- Implement prompt construction logic
- Develop evaluation metrics and scoring
- Run a small smoke test (e.g. --limit 5 with a cheap model) to verify data loading, prompt construction, and scoring all work end-to-end before running full evaluations (see the sketch below)
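A minimal sketch of an end-to-end task wired for such a smoke test, assuming a simple Q&A benchmark; the inline sample, task name, and model are placeholders, and match() stands in for whatever scorer your benchmark actually requires:

```python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate

@task
def my_benchmark() -> Task:
    # Placeholder inline sample: in a real eval the dataset comes from your
    # data-loading code (e.g. hf_dataset, as sketched in Phase 3).
    samples = [Sample(input="What is 2 + 2?", target="4")]
    return Task(dataset=samples, solver=generate(), scorer=match())

# Smoke test from the command line (model is a placeholder; pick a cheap one):
#   uv run inspect eval my_benchmark.py --limit 5 --model <provider/model>
```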
Testing & Validation
- Test Development
- Write unit tests for all components (see the pytest sketch below)
- Implement integration tests for the full pipeline
- Add benchmark-specific test cases
- For further guidance, refer to EVALUATION_CHECKLIST.md
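For instance, a small pytest sketch for the hypothetical record_to_sample converter from the Phase 3 data-loading example:

```python
# test_my_benchmark.py -- unit test for the hypothetical record_to_sample
# converter from the Phase 3 data-loading sketch.
from my_benchmark import record_to_sample

def test_record_to_sample_maps_fields():
    record = {"question": "What is 2 + 2?", "answer": "4", "subject": "math"}
    sample = record_to_sample(record)
    assert sample.input == "What is 2 + 2?"
    assert sample.target == "4"
    assert sample.metadata["subject"] == "math"
```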
- Validation
- Run against known reference implementations or published baselines where available and compare aggregate scores
- Verify results match expected outputs
- Document any discrepancies and their resolutions, noting which are acceptable (e.g. minor differences due to prompt formatting or randomization)
- Where feasible, perform repeated runs or paired comparisons across models to estimate uncertainty rather than relying on single-run point estimates (see the sketch at the end of this section)
- Code Review
- Conduct internal code review
- Address all feedback and make necessary improvements
- Ensure code meets project standards
- Define specific versions for packages used
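To put numbers on run-to-run variance during validation, a minimal sketch (the per-run accuracies are placeholders, not real results):

```python
# Summarize repeated validation runs with a mean, standard error, and a
# normal-approximation 95% confidence interval.
import statistics

run_scores = [0.712, 0.705, 0.718, 0.709, 0.715]  # placeholder per-run accuracies

mean = statistics.mean(run_scores)
stderr = statistics.stdev(run_scores) / len(run_scores) ** 0.5
print(f"accuracy: {mean:.3f} ± {1.96 * stderr:.3f} (95% CI over {len(run_scores)} runs)")
```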
Documentation & Submission
- Documentation
- Write comprehensive README with usage examples (see CONTRIBUTING.md for details)
- Overview: 2-3 sentences describing what the eval measures and why it matters.
- Data: dataset source, license, splits used, preprocessing, sample prompt(s) with input and output details.
- Usage: 1-2 example uv run inspect eval … commands.
- Scoring: explain how individual components are scored and how an aggregate score is calculated.
- Validation: brief summary of how implementation was validated (e.g. comparison table vs paper).
- Limitations & Known Issues: any caveats, missing features, or deviations from the original benchmark.
- Document API and configuration options
- Add inline code documentation
- Evaluation Report
- Document implementation details and decisions
- Present results and analysis
- Include limitations and future work
- Include a small table comparing your results against the paper / other frameworks (e.g., model name, score, difference); an illustrative template follows this list.
- Summarize any methodological deviations from the original paper (different splits, prompt formats, scoring tweaks).
- Report uncertainties (e.g. standard error, 95% CI) where feasible.
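An illustrative shape for that comparison table (all entries are placeholders, not real results):

| Model | Reference score | This implementation | Difference |
|---|---|---|---|
| model-name | x.xx | x.xx | ±x.xx |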
- PR Preparation
- See CONTRIBUTING.md for technical submission requirements
- Create a clear, descriptive PR title and description
- Include test results and validation outcomes
- Reference related issues and discussions
- Link to your design document (e.g. docs/<eval_name>_design.md) and evaluation report in the PR description
- Confirm you have run through EVALUATION_CHECKLIST.md and note any items that are intentionally not applicable
PR Best Practices
- Draft PRs for Early Feedback
- Create a draft PR early to get initial feedback
- Mark as “Ready for Review” when all tests pass and documentation is complete
For examples of well-structured PRs, see the merged implementation PRs in the repository.
Remember that high-quality contributions are more important than speed. Take the time to ensure your implementation is correct, well-tested, and well-documented.