Benchmark Implementation Methodology
This methodology guide provides a structured approach for implementing benchmarks in the Inspect Evals repository. It focuses on the design and methodological aspects of developing a new benchmark, complementing the technical requirements (testing, CI, code style) outlined in CONTRIBUTING.md.
For quality expectations and review criteria, see BEST_PRACTICES.md and EVALUATION_CHECKLIST.md.
Getting Help
Need assistance or have questions? Open a GitHub issue to connect with the Inspect Evals team and get support from the community, for example if you’re unsure whether a benchmark is in scope or how to interpret the original paper.
We especially welcome early “design review” issues that share your benchmark choice, planned scope, and open questions before you start coding.
Implementation Methodology
Phase 1: Research and Planning
Getting Started with Inspect
- Read the Inspect documentation to understand the framework
- Review existing evals in the inspect_evals repository to see examples of implementations, such as MMLU-Pro or InterCode
- Consider starting with a Jupyter notebook to explore the dataset and prototype your approach (see the sketch below)
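As a sketch of that kind of exploration (the dataset id and column names below are placeholders, not a real benchmark):

```python
# Illustrative exploration of a hypothetical Hugging Face dataset; swap in
# your benchmark's real dataset id and column names.
from collections import Counter

from datasets import load_dataset

ds = load_dataset("some-org/some-benchmark", split="test")  # placeholder id

print(ds)            # number of rows and column names
print(ds[0])         # inspect one raw record
print(ds.features)   # field names and types

# Check the answer distribution before designing prompts and the scorer
# (assumes an "answer" column; adjust to the real schema).
print(Counter(ds["answer"]).most_common(10))
```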
Benchmark Selection Strategy
When choosing a benchmark to implement, prefer one that:
- Has a published paper or widely used dataset
- Uses publicly available data (or a clearly documented access protocol)
- Fills a gap in existing Inspect Evals coverage (capability, domain, or safety dimension)
- Has at least one set of published or reference results to validate against
Also review CONTRIBUTING.md for acceptance criteria and priority guidelines.
Before proposing a new benchmark, review open issues and PRs for existing implementations; if none exist, open a new issue to discuss your proposal.
Look for existing implementations in other frameworks that could be adapted, e.g.:
- Custom implementation in a GitHub repository linked from the paper or website
- openai/simple-evals
- EleutherAI/lm-evaluation-harness
- HELM
- OpenCompass
Consider the benchmark’s complexity and your familiarity with the domain when selecting what to implement. We recommend implementing simpler Q&A style benchmarks before complex agentic / tool-using benchmarks.
Inspect Evals aims for compatibility with the Agentic Benchmark Checklist framework to ensure a high standard of rigor.
Phase 2: Benchmark Design Review
Before starting implementation, conduct a thorough design review by answering these key questions:
Core Benchmark Understanding
- Purpose & Goals
- What is the primary objective of the benchmark?
- Which specific capabilities or behaviors does it measure?
- What domain or tasks does it focus on?
- Dataset Analysis
- What is the size and source of the dataset?
- How and when was the data collected and labeled?
- Are there any known biases or limitations in the data?
- What are the key characteristics and distributions in the data?
- What is the dataset’s license and are there any usage restrictions?
- How is the data split (train / validation / test), and which split(s) will you use?
- Is there any risk of test-set contamination (e.g. widely used benchmarks that might be in model training data)? How will you document or mitigate this?
- Prompt Engineering
- How are prompts structured and formatted?
- Are there specific system prompts or instructions?
- How is few-shot or zero-shot prompting implemented?
- Are there any prompt variations or templates?
- Evaluation Methodology
- What metrics are used to evaluate performance?
- How are scores calculated and aggregated?
- What constitutes a passing or failing result?
- How is statistical significance determined?
- How many model runs will you perform (e.g. repeated runs with different seeds) for reliable estimates?
- Do you need additional quantification of uncertainty beyond standard error?
- Are there any budget constraints (tokens / runtime) that limit full-dataset evaluation, and how will you sample if so?
- Prior Work & Baselines
- Are there existing implementations of this benchmark? Feel free to import or adapt code from other implementations, with appropriate attribution.
- What models have been evaluated on this benchmark?
- What were their reported results?
- How does this benchmark compare to similar evaluations?
- What are the state-of-the-art results?
- What baseline could be achieved by random guessing or a trivial agent (e.g. returning empty strings, or guessing an answer on every question)? See the sketch after this list.
- Risks & Safety Considerations
- Does this benchmark expose sensitive or hazardous capabilities?
- Are there access controls or redaction steps required for the data or prompts?
- How might this evaluation be misused, and how does your implementation mitigate that?
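For the baseline question above, a minimal sketch of estimating a random-guess baseline (option counts are placeholders):

```python
# Estimate the random-guess baseline for a multiple-choice benchmark.
# With k uniformly guessed options, expected accuracy is 1/k; if the number
# of options varies per question, average 1/k_i over the dataset.
options_per_question = [4, 4, 5, 3]  # placeholder counts from your dataset

baseline = sum(1 / k for k in options_per_question) / len(options_per_question)
print(f"expected random-guess accuracy: {baseline:.3f}")

# For free-form answers, a trivial baseline is scoring an empty string (or the
# most common answer) with your actual scorer and reporting that accuracy.
```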
Implementation Considerations
- Document any uncertainties or questions that need clarification
- Note any assumptions made about the evaluation design
- Identify potential edge cases or failure modes
- Plan how to validate the implementation against known results
- Consider potential data leakage or contamination (e.g. if models have likely been trained on this test set) and how you will document that risk.
- Decide how you will handle versioning (dataset versions, paper revisions, scoring changes, required packages) and record this in your design decisions log.
Design Decisions Log
Maintain a log of key design decisions, including:
- The decision made
- Rationale behind it
- Alternative approaches considered
- Potential impact on evaluation results
- We recommend creating a docs/<eval_name>_design.md file in your eval’s directory and keeping this log there so it can evolve alongside the implementation.
Example format:
| Date | Decision | Rationale | Impact |
|---|---|---|---|
| YYYY-MM-DD | Decision made | Why this approach was chosen | How it affects the evaluation |
Phase 3: Implementation Planning
Create a detailed technical implementation plan covering all aspects of the benchmark. Use the following table as a template to document your approach:
| Aspect | Implementation Details | Dependencies | Validation: How will you… |
|---|---|---|---|
| Data Loading | How will you retrieve the dataset (e.g., from a URL or Hugging Face)? | Dependencies needed | validate data loading? Compare counts against the source paper / dataset docs; spot-check random samples; optionally add checksum or schema validation? (see the sketch after this table) |
| Benchmark Type | Is this a simple Q&A task, completion, multiple choice, or an agentic evaluation (multi-turn, tools, sandboxed environment)? Note any special requirements (e.g., Docker, browser automation, external APIs). | - | - |
| Task Definition | Describe the tasks and dataset subsets involved. | - | - |
| Data Transformation | Explain how raw data will be converted into usable samples. | - | - |
| Prompt Engineering | How will prompts be constructed and formatted? | Any prompt templates | test prompt effectiveness? Run a small “dry run” with a cheap model on 5-10 samples; manually inspect prompts and outputs to ensure they match the paper’s description? |
| Evaluation Metrics | How will responses be scored and evaluated? | Scoring criteria | verify scoring accuracy? Recompute scores for a small subset manually or against another implementation; verify edge cases (ties, invalid answers, partial credit)? |
| Error Handling | How will errors and edge cases be handled? | - | test error conditions and error handling logic? |
| Performance | Any performance considerations or optimizations? | - | measure and validate performance? |
| Model Choice | Which model(s) will be used for validation? | - | - |
| Reproducibility | How will you set seeds, sampling options, and concurrency to allow repeatable runs? | - | verify that repeated runs under the same config reproduce results within expected variance? Additionally, how will you verify LLM-as-a-judge evaluations in benchmarks which do not use substring matching? |
| Cost & Resources | Estimate token / compute cost, and whether you will use sampling or subsetting for validation runs. | Hardware / provider constraints | ensure your plan fits contributor budgets? |
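To illustrate the Data Loading row, here is a hedged sketch using Inspect's hf_dataset helper; the dataset id, field names, and expected count are placeholders to be replaced with your benchmark's actual values:

```python
from inspect_ai.dataset import Sample, hf_dataset

def record_to_sample(record: dict) -> Sample:
    # Placeholder field names: map these to your dataset's actual schema.
    return Sample(
        input=record["question"],
        target=record["answer"],
        metadata={"subject": record.get("subject")},
    )

dataset = hf_dataset(
    path="some-org/some-benchmark",  # placeholder dataset id
    split="test",
    sample_fields=record_to_sample,
)

# Cheap integrity check: compare counts against the source paper / dataset docs.
EXPECTED_SAMPLES = 1234  # placeholder: the count reported in the paper
assert len(dataset) == EXPECTED_SAMPLES, "sample count does not match the source"
```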
Phase 4: Implementation Execution
Core Implementation
- Project Setup
- Initialize project structure following repository conventions
- Data Pipeline
- Implement data loading and preprocessing
- Add data validation and integrity checks
- Create data sampling utilities for testing
- Core Functionality
- Implement prompt construction logic
- Develop evaluation metrics and scoring
- Run a small smoke test (e.g. --limit 5 with a cheap model) to verify data loading, prompt construction, and scoring all work end-to-end before running full evaluations (see the sketch below)
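A minimal sketch of an end-to-end task wired for such a smoke test, assuming a simple Q&A benchmark; the inline sample, task name, and model are placeholders, and match() stands in for whatever scorer your benchmark actually requires:

```python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate

@task
def my_benchmark() -> Task:
    # Placeholder inline sample: in a real eval the dataset comes from your
    # data-loading code (e.g. hf_dataset, as sketched in Phase 3).
    samples = [Sample(input="What is 2 + 2?", target="4")]
    return Task(dataset=samples, solver=generate(), scorer=match())

# Smoke test from the command line (model is a placeholder; pick a cheap one):
#   uv run inspect eval my_benchmark.py --limit 5 --model <provider/model>
```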
Testing & Validation
- Test Development
- Write unit tests for all components (see the pytest sketch below)
- Implement integration tests for the full pipeline
- Add benchmark-specific test cases
- For further guidance, refer to EVALUATION_CHECKLIST.md
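For instance, a small pytest sketch for the hypothetical record_to_sample converter from the Phase 3 data-loading example:

```python
# test_my_benchmark.py -- unit test for the hypothetical record_to_sample
# converter from the Phase 3 data-loading sketch.
from my_benchmark import record_to_sample

def test_record_to_sample_maps_fields():
    record = {"question": "What is 2 + 2?", "answer": "4", "subject": "math"}
    sample = record_to_sample(record)
    assert sample.input == "What is 2 + 2?"
    assert sample.target == "4"
    assert sample.metadata["subject"] == "math"
```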
- Validation
- Run against known reference implementations or published baselines where available and compare aggregate scores
- Verify results match expected outputs
- Document any discrepancies and their resolutions, noting which are acceptable (e.g. minor differences due to prompt formatting or randomization)
- Where feasible, perform repeated runs or paired comparisons across models to estimate uncertainty rather than relying on single-run point estimates (see the sketch at the end of this section)
- Code Review
- Conduct internal code review
- Address all feedback and make necessary improvements
- Ensure code meets project standards
- Define specific versions for packages used
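To put numbers on run-to-run variance during validation, a minimal sketch (the per-run accuracies are placeholders, not real results):

```python
# Summarize repeated validation runs with a mean, standard error, and a
# normal-approximation 95% confidence interval.
import statistics

run_scores = [0.712, 0.705, 0.718, 0.709, 0.715]  # placeholder per-run accuracies

mean = statistics.mean(run_scores)
stderr = statistics.stdev(run_scores) / len(run_scores) ** 0.5
print(f"accuracy: {mean:.3f} ± {1.96 * stderr:.3f} (95% CI over {len(run_scores)} runs)")
```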
Documentation & Submission
- Documentation
- Write comprehensive README with usage examples (see CONTRIBUTING.md for details)
- Overview: 2-3 sentences describing what the eval measures and why it matters.
- Data: dataset source, license, splits used, preprocessing, sample prompt(s) with input and output details.
- Usage: 1-2 example uv run inspect eval … commands.
- Scoring: explain how individual components are scored and how an aggregate score is calculated.
- Validation: brief summary of how implementation was validated (e.g. comparison table vs paper).
- Limitations & Known Issues: any caveats, missing features, or deviations from the original benchmark.
- Document API and configuration options
- Add inline code documentation
- Evaluation Report
- Document implementation details and decisions
- Present results and analysis
- Include limitations and future work
- Include a small table comparing your results against the paper / other frameworks (e.g., model name, score, difference); an illustrative template follows this list.
- Summarize any methodological deviations from the original paper (different splits, prompt formats, scoring tweaks).
- Report uncertainties (e.g. standard error, 95% CI) where feasible.
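An illustrative shape for that comparison table (all entries are placeholders, not real results):

| Model | Reference score | This implementation | Difference |
|---|---|---|---|
| model-name | x.xx | x.xx | ±x.xx |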
- PR Preparation
- See CONTRIBUTING.md for technical submission requirements
- Create a clear, descriptive PR title and description
- Include test results and validation outcomes
- Reference related issues and discussions
- Link to your design document (e.g. docs/<eval_name>_design.md) and evaluation report in the PR description
- Confirm you have run through EVALUATION_CHECKLIST.md and note any items that are intentionally not applicable
PR Best Practices
- Draft PRs for Early Feedback
- Create a draft PR early to get initial feedback
- Mark as “Ready for Review” when all tests pass and documentation is complete
For examples of well-structured PRs, see the merged implementation PRs in the repository.
Remember that high-quality contributions are more important than speed. Take the time to ensure your implementation is correct, well-tested, and well-documented.