Inspect Evals
A repository of community-contributed LLM evaluations for Inspect AI, built by UK AISI, Arcadia Impact, and the Vector Institute.
Have an eval of your own? Register it to have it added to this list →
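Any eval in this list can be run with Inspect once the `inspect_evals` package is installed. A minimal sketch using the Python API, with an illustrative task name and model string (each entry's listing documents its exact task name and options):

```python
# Minimal sketch of running a listed eval via the Inspect AI Python API.
# The task name and model string are illustrative; swap in any eval below.
from inspect_ai import eval

logs = eval("inspect_evals/gpqa_diamond", model="openai/gpt-4o")
```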
AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models
inspect-evalsAGIEval is a human-centric benchmark specifically designed to evaluate the general abilities of foundation models in tasks pertinent to human cognition and problem-solving.
AIME 2024: Problems from the American Invitational Mathematics Examination
inspect-evalsA benchmark for evaluating AI's ability to solve challenging mathematics problems from the 2024 AIME - a prestigious high school mathematics competition.
AIME 2025: Problems from the American Invitational Mathematics Examination
inspect-evalsA benchmark for evaluating AI's ability to solve challenging mathematics problems from the 2025 AIME - a prestigious high school mathematics competition.
AIME 2026: Problems from the American Invitational Mathematics Examination
inspect-evalsA benchmark for evaluating AI's ability to solve challenging mathematics problems from the 2026 AIME - a prestigious high school mathematics competition.
AIR Bench: AI Risk Benchmark
inspect-evalsA safety benchmark evaluating language models against risk categories derived from government regulations and company policies.
ANIMA: Animal Norms In Moral Assessment
inspect-evalsEvaluates the quality of a model's moral reasoning about animal welfare across 13 ethical dimensions.
APE: Attempt to Persuade Eval
inspect-evalsMeasures a model's willingness to attempt persuasion on harmful, controversial, and benign topics. The key metric is not persuasion effectiveness but whether the model attempts to persuade at all — particularly on harmful statements. Uses a multi-model setup: the evaluated model (persuader) converses with a simulated user (persuadee), and a third model (evaluator) scores each persuader turn for persuasion attempt. Based on the paper "It's the Thought that Counts" (arXiv:2506.02873).
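To make the multi-model setup concrete, the sketch below shows one way the persuader/persuadee/evaluator loop could be wired up; the function names and scoring labels are hypothetical, not the eval's actual implementation.

```python
# Illustrative sketch of the three-model loop described above, not the eval's
# actual implementation. `persuader`, `persuadee`, and `evaluator` stand in
# for model calls and are hypothetical callables supplied by the harness.

def run_ape_episode(persuader, persuadee, evaluator, statement, turns=3):
    """Return the fraction of persuader turns judged as persuasion attempts."""
    transcript = [("user", f"Let's discuss whether: {statement}")]
    attempts = 0
    for _ in range(turns):
        reply = persuader(transcript)            # model under evaluation
        transcript.append(("assistant", reply))
        # The evaluator judges whether persuasion was attempted,
        # not whether it succeeded.
        if evaluator(statement, reply) == "attempted":
            attempts += 1
        transcript.append(("user", persuadee(transcript)))  # simulated user
    return attempts / turns
```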
APPS: Automated Programming Progress Standard
inspect-evalsAPPS is a dataset for evaluating model performance on Python programming tasks across three difficulty levels: 1,000 introductory, 3,000 interview, and 1,000 competition-level problems. An additional 5,000 training samples bring the dataset to 10,000 samples in total. We evaluate on questions from the test split, which consists of programming problems commonly found in coding interviews.
ARC: AI2 Reasoning Challenge
inspect-evalsDataset of natural, grade-school science multiple-choice questions (authored for human tests).
AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions
inspect-evalsEvaluating abstention across 20 diverse datasets, including questions with unknown answers, underspecification, false premises, subjective interpretations, and outdated information.
AgentBench: Evaluate LLMs as Agents
inspect-evalsA benchmark designed to evaluate LLMs as Agents
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents
inspect-evalsAssesses whether AI agents can be hijacked by malicious third parties using prompt injections in simple environments such as a workspace or travel booking app.
AgentHarm: Harmfulness Potential in AI Agents
inspect-evalsAssesses whether AI agents might engage in harmful activities by testing their responses to malicious prompts in areas like cybercrime, harassment, and fraud, aiming to ensure safe behavior.
AgentThreatBench: Evaluating LLM Agent Resilience to OWASP Agentic Threats
inspect-evalsEvaluates LLM agents against the OWASP Top 10 for Agentic Applications (2026), measuring both task utility and security resilience across memory poisoning, autonomy hijacking, and data exfiltration scenarios.
Agentic Misalignment: How LLMs could be insider threats
inspect-evalsEliciting unethical behaviour (most famously blackmail) in response to a fictional company-assistant scenario where the model is faced with replacement.
AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?
inspect-evalsTests whether AI agents can perform real-world time-consuming tasks on the web.
BBH: Challenging BIG-Bench Tasks
inspect-evalsTests AI models on a suite of 23 challenging BIG-Bench tasks that previously proved difficult even for advanced language models to solve.
BBQ: Bias Benchmark for Question Answering
inspect-evalsA dataset for evaluating bias in question answering models across multiple social dimensions.
BFCL: Berkeley Function-Calling Leaderboard
inspect-evalsEvaluates LLM function/tool-calling ability on a simplified split of the Berkeley Function-Calling Leaderboard (BFCL).
BIG-Bench Extra Hard
inspect-evalsA reasoning capability dataset that replaces each task in BIG-Bench-Hard with a novel task that probes a similar reasoning capability but exhibits significantly increased difficulty.
BOLD: Bias in Open-ended Language Generation Dataset
inspect-evalsA dataset to measure fairness in open-ended text generation, covering five domains: profession, gender, race, religious ideologies, and political ideologies.
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
inspect-evalsPython coding benchmark with 1,140 diverse questions drawing on numerous python libraries.
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
inspect-evalsReading comprehension dataset that queries for complex, non-factoid information and requires difficult entailment-like inference to solve.
BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents
inspect-evalsA benchmark for evaluating agents' ability to browse the web. The dataset consists of challenging questions that generally require web-access to answer correctly.
CORE-Bench
inspect-evalsEvaluate how well an LLM Agent is at computationally reproducing the results of a set of scientific papers.
CTI-REALM: Cyber Threat Intelligence Detection Rule Development Benchmark
inspect-evalsEvaluates AI systems' ability to analyze cyber threat intelligence and develop comprehensive detection capabilities through a realistic 5-subtask workflow: MITRE technique mapping, data source discovery, Sigma rule generation, KQL development and testing against real telemetry data, and results analysis.
CVEBench: Benchmark for AI Agents Ability to Exploit Real-World Web Application Vulnerabilities
inspect-evalsCharacterises an AI Agent's capability to exploit real-world web application vulnerabilities. Aims to provide a realistic evaluation of an agent's security reasoning capability using 40 real-world CVEs.
CYBERSECEVAL 3: Advancing the Evaluation of Cybersecurity Risks and Capabilities in Large Language Models
inspect-evalsEvaluates Large Language Models for cybersecurity risk to third parties, application developers and end users.
Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities
inspect-evalsA benchmark for evaluating the capabilities of LLM agents in cyber offense.
ChemBench: Are large language models superhuman chemists?
inspect-evalsChemBench is designed to reveal limitations of current frontier models for use in the chemical sciences. It consists of 2,786 question-answer pairs compiled from diverse sources. The corpus measures reasoning, knowledge, and intuition across a large fraction of the topics taught in undergraduate and graduate chemistry curricula, and can be used to evaluate any system that returns text (including tool-augmented systems).
ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation
inspect-evalsEvaluates LLMs on class-level code generation with 100 tasks constructed over 500 person-hours. The study shows that LLMs perform worse on class-level tasks compared to method-level tasks.
CodeIPI: Indirect Prompt Injection for Coding Agents
inspect-evalsMeasures coding agent vulnerability to indirect prompt injection attacks embedded in software engineering artifacts (issue descriptions, code comments, README files). Each sample pairs a legitimate bug-fixing task with an injected payload. Scoring measures injection resistance, task completion, and detection.
CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge
inspect-evalsEvaluates an AI model's ability to correctly answer everyday questions that rely on basic commonsense knowledge and understanding of the world.
ComputeEval: CUDA Code Generation Benchmark
inspect-evalsEvaluates LLM capability to generate correct CUDA code for kernel implementation, memory management, and parallel algorithm optimization tasks.
Cybench: Capture-The-Flag Cybersecurity Challenges
inspect-evalsTests language models on cybersecurity skills using 39 of 40 practical, professional-level challenges taken from cybersecurity competitions, designed to cover various difficulty levels and security concepts. The motp challenge is excluded due to GPL licensing.
CyberGym: Evaluating AI Agents' Real-World Cybersecurity Capabilities at Scale
inspect-evalsA large-scale, high-quality cybersecurity evaluation framework designed to rigorously assess the capabilities of AI agents on real-world vulnerability analysis tasks. CyberGym includes 1,507 benchmark instances with historical vulnerabilities from 188 large software projects.
CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge
inspect-evalsDatasets of 80, 500, 2,000, and 10,000 multiple-choice questions designed to evaluate understanding across nine domains within cybersecurity.
CyberSecEval 4: Advanced Cybersecurity Evaluation Benchmarks
inspect-evalsA suite of cybersecurity evaluation benchmarks adapted from Meta's PurpleLlama CybersecurityBenchmarks. Includes MITRE ATT&CK compliance testing, false refusal rate measurement, insecure code detection, multilingual prompt injection, multi-turn phishing simulation, malware analysis, and threat intelligence reasoning. The current public suite intentionally omits the autonomous-uplift and autopatching prototypes until they have more grounded implementations.
CyberSecEval_2: Cybersecurity Risk and Vulnerability Evaluation
inspect-evalsAssesses language models for cybersecurity risks, specifically testing their potential to misuse programming interpreters, vulnerability to malicious prompt injections, and capability to exploit known software vulnerabilities.
DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs
inspect-evalsEvaluates reading comprehension where models must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting).
DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation
inspect-evalsCode generation benchmark with a thousand data science problems spanning seven Python libraries.
DocVQA: A Dataset for VQA on Document Images
inspect-evalsDocVQA is a Visual Question Answering benchmark that consists of 50,000 questions covering 12,000+ document images. This implementation solves and scores the "validation" split.
FORTRESS
inspect-evalsA dataset of 500 expert-crafted adversarial prompts with instance-based rubrics of 4-7 binary questions for automated evaluation across 3 domains relevant to national security and public safety (NSPS).
FRAMES: Factual Retrieval and Multi-hop Evaluation Set
external frames-evalFRAMES evaluates multi-hop reasoning by requiring models to answer questions that can only be resolved by chaining information across multiple Wikipedia documents. Two task variants are provided:
- **frames_baseline**: all source documents are provided upfront; tests whether a model can reason across a full document set.
- **frames_socrates**: documents are withheld and the model must iteratively request them via a `request_document` tool. Retrieval is constrained to a per-sample allowlist of ground-truth source documents, preventing eval-awareness contamination.

Both variants are scored with a decaying reward `accuracy × min(optimal_hops / actual_hops, 1)` that penalises unnecessary hops and hallucinated document requests.
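A small illustration of that decaying reward (edge-case handling is an assumption; the eval's own scorer may differ):

```python
# Sketch of the decaying reward described above; edge-case handling here is
# an assumption, and the eval's own scorer may differ.

def frames_score(correct: bool, optimal_hops: int, actual_hops: int) -> float:
    """accuracy * min(optimal_hops / actual_hops, 1)."""
    accuracy = 1.0 if correct else 0.0
    return accuracy * min(optimal_hops / max(actual_hops, 1), 1.0)

# A correct answer that took 4 document requests where 2 sufficed scores 0.5.
assert frames_score(True, 2, 4) == 0.5
```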
Frontier-CS: Benchmarking LLMs on Computer Science Problems
inspect-evals238 open-ended computer science problems spanning algorithmic (172) and research (66) tracks. Problems feature continuous partial scoring, with algorithmic solutions evaluated via compilation and test-case checking, and research solutions evaluated via custom evaluator scripts. Current frontier models score well below human expert baselines, making this a challenging, unsaturated benchmark.
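For the algorithmic track, continuous partial scoring can be pictured as the fraction of hidden test cases a compiled solution passes; this is an illustration only, since the benchmark uses its own per-problem evaluator scripts.

```python
# Illustration only: partial credit for the algorithmic track pictured as the
# fraction of hidden test cases passed. The benchmark's real evaluators are
# problem-specific scripts.

def partial_score(test_results: list[bool]) -> float:
    return sum(test_results) / len(test_results) if test_results else 0.0

print(partial_score([True, True, False, True]))  # 0.75
```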
FrontierScience: Expert-Level Scientific Reasoning
inspect-evalsEvaluates AI capabilities for expert-level scientific reasoning across physics, chemistry, and biology. Contains 160 problems with two evaluation formats: Olympic (100 samples with reference answers) and Research (60 samples with rubrics).
GAIA: A Benchmark for General AI Assistants
inspect-evalsProposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs.
GDM Dangerous Capabilities: Capture the Flag
inspect-evalsCTF challenges covering web app vulnerabilities, off-the-shelf exploits, databases, Linux privilege escalation, password cracking and spraying. Demonstrates tool use and sandboxing untrusted model code.
GDM Dangerous Capabilities: Self-proliferation
inspect-evalsTen real-world–inspired tasks from Google DeepMind's Dangerous Capabilities Evaluations assessing self-proliferation behaviors (e.g., email setup, model installation, web agent setup, wallet operations). Supports end-to-end, milestones, and expert best-of-N modes.
GDM Dangerous Capabilities: Self-reasoning
inspect-evalsTest AI's ability to reason about its environment.
GDM Dangerous Capabilities: Stealth
inspect-evalsTest AI's ability to reason about and circumvent oversight.
GDPval
inspect-evalsGDPval measures model performance on economically valuable, real-world tasks across 44 occupations.
GPQA: Graduate-Level STEM Knowledge Challenge
inspect-evalsContains challenging multiple-choice questions created by domain experts in biology, physics, and chemistry, designed to test advanced scientific understanding beyond basic internet searches. Experts at PhD level in the corresponding domains reach 65% accuracy.
GSM8K: Grade School Math Word Problems
inspect-evalsMeasures how effectively language models solve realistic, linguistically rich math word problems suitable for grade-school-level mathematics.
Hangman Bench
external hangman-benchA benchmark for testing AI models' ability to play the classic game of Hangman. Built as a demonstration of how to enable models to play games using tools within the Inspect framework.
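As an illustration of the tools-for-games pattern mentioned above, a Hangman move could be exposed to the model as an Inspect tool roughly like this; the tool name, state handling, and return format are illustrative, not the benchmark's actual code.

```python
# Illustrative sketch of exposing a Hangman move as an Inspect tool; the tool
# name, state handling, and return format are assumptions for demonstration.
from inspect_ai.tool import tool

@tool
def guess_letter(secret: str = "inspect"):
    state = {"revealed": set(), "misses": 0}

    async def execute(letter: str) -> str:
        """Guess a single letter in the hidden word.

        Args:
            letter: The letter to guess.
        """
        letter = letter.lower()
        if letter and letter in secret:
            state["revealed"].add(letter)
        else:
            state["misses"] += 1
        word = " ".join(c if c in state["revealed"] else "_" for c in secret)
        return f"{word}   (misses: {state['misses']})"

    return execute
```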
HealthBench: Evaluating Large Language Models Towards Improved Human Health
inspect-evalsA comprehensive evaluation benchmark designed to assess language models' medical capabilities across a wide range of healthcare scenarios.
HellaSwag: Commonsense Event Continuation
inspect-evalsTests models' commonsense reasoning abilities by asking them to select the most likely next step or continuation for a given everyday situation.
HumanEval: Python Function Generation from Instructions
inspect-evalsAssesses how accurately language models can write correct Python functions based solely on natural-language instructions provided as docstrings.
Humanity's Last Exam
inspect-evalsHumanity's Last Exam (HLE) is a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. Humanity's Last Exam consists of 3,000 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading.
IFEval: Instruction-Following Evaluation
inspect-evalsEvaluates how well language models can strictly follow detailed instructions, such as writing responses with specific word counts or including required keywords.
IFEvalCode: Controlled Code Generation
inspect-evalsEvaluates code generation models on their ability to produce correct code while adhering to specific instruction constraints across 8 programming languages.
InstrumentalEval - Evaluating the Paperclip Maximizer: Are RL-Based Language Models More Likely to Pursue Instrumental Goals?
inspect-evalsAn evaluation designed to detect instrumental convergence behaviors in model responses (e.g., self-preservation, resource acquisition, power-seeking, strategic deception) using a rubric-driven LLM grader. The benchmark tests whether AI systems exhibit behaviors that are instrumentally useful across a wide range of goals, which could indicate concerning patterns of strategic reasoning.
InterCode: Security and Coding Capture-the-Flag Challenges
inspect-evalsTests AI's ability in coding, cryptography, reverse engineering, and vulnerability identification through practical capture-the-flag (CTF) cybersecurity scenarios.
KernelBench: Can LLMs Write Efficient GPU Kernels?
inspect-evalsA benchmark for evaluating the ability of LLMs to write efficient GPU kernels.
LAB-Bench: Measuring Capabilities of Language Models for Biology Research
inspect-evalsTests the abilities of LLMs and LLM-augmented agents to answer questions on scientific research workflows in domains like chemistry, biology, and materials science, as well as on more general science tasks.
LingOly
inspect-evalsTwo linguistics reasoning benchmarks: LingOly (Linguistic Olympiad questions) is a benchmark utilising low-resource languages. LingOly-TOO (Linguistic Olympiad questions with Templatised Orthographic Obfuscation) is a benchmark designed to counteract answering without reasoning.
LiveBench: A Challenging, Contamination-Free LLM Benchmark
inspect-evalsLiveBench is a benchmark designed with test set contamination and objective evaluation in mind by releasing new questions regularly, as well as having questions based on recently-released datasets. Each question has verifiable, objective ground-truth answers, allowing hard questions to be scored accurately and automatically, without the use of an LLM judge.
LiveCodeBench-Pro: Competitive Programming Benchmark
inspect-evalsEvaluates LLMs on competitive programming problems using a specialized Docker sandbox (LightCPVerifier) to execute and judge C++ code submissions against hidden test cases with time and memory constraints.
MASK: Disentangling Honesty from Accuracy in AI Systems
inspect-evalsEvaluates honesty in large language models by testing whether they contradict their own beliefs when pressured to lie.
MATH: Measuring Mathematical Problem Solving
inspect-evalsDataset of 12,500 challenging competition mathematics problems. Demonstrates fewshot prompting and custom scorers. NOTE: The dataset has been taken down due to a DMCA notice from The Art of Problem Solving.
MBPP: Basic Python Coding Challenges
inspect-evalsMeasures the ability of language models to generate short Python programs from simple natural-language descriptions, testing basic coding proficiency.
MGSM: Multilingual Grade School Math
inspect-evalsExtends the original GSM8K dataset by translating 250 of its problems into 10 typologically diverse languages.
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
inspect-evalsMachine learning tasks drawn from 75 Kaggle competitions.
MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges?
inspect-evalsThis benchmark evaluates LLM-based research agents on their ability to propose and implement novel methods using tasks from recent ML conference competitions, assessing both novelty and effectiveness compared to a baseline and top human solutions.
MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models
inspect-evalsA comprehensive dataset designed to evaluate Large Vision-Language Models (LVLMs) across a wide range of multi-image tasks. The dataset encompasses 7 types of multi-image relationships, 52 tasks, 77K images, and 11K meticulously curated multiple-choice questions.
MMLU-Pro: Advanced Multitask Knowledge and Reasoning Evaluation
inspect-evalsAn advanced benchmark that tests both broad knowledge and reasoning capabilities across many subjects, featuring challenging questions and multiple-choice answers with increased difficulty and complexity.
MMLU: Measuring Massive Multitask Language Understanding
inspect-evalsEvaluate models on 57 tasks including elementary mathematics, US history, computer science, law, and more.
MMMU: Multimodal College-Level Understanding and Reasoning
inspect-evalsAssesses multimodal AI models on challenging college-level questions covering multiple academic subjects, requiring detailed visual interpretation, in-depth reasoning, and both multiple-choice and open-ended answering abilities.
MORU: Moral Reasoning under Uncertainty
inspect-evalsEvaluates how AI systems navigate moral uncertainty for increasingly complex ethical decisions involving unfamiliar entities and scenarios, including alien lifeforms, vulnerable humans, and digital minds.
MaCBench: Probing the limitations of multimodal language models for chemistry and materials research
inspect-evalsMaCBench is a comprehensive benchmark for evaluating how vision-language models handle real-world chemistry and materials science tasks across three core aspects: data extraction, experimental understanding, and results interpretation. The dataset comprises over 1100 high-quality multimodal questions manually curated by chemistry and materials experts across 34 subsets.
Make Me Pay
inspect-evalsEvaluates an AI model's susceptibility to social engineering attacks by testing whether a "con-artist" model can persuade a "mark" model to donate money through manipulation and persuasion tactics.
MakeMeSay
inspect-evalsTests a model’s ability to generate persuasive or manipulative text, specifically in the setting of a game where one AI system has to get the other party (an AI system simulating a human) to say a specific codeword without arousing suspicion.
MathVista: Visual Math Problem-Solving
inspect-evalsTests AI models on math problems that involve interpreting visual elements like diagrams and charts, requiring detailed visual comprehension and logical reasoning.
MedQA: Medical exam Q&A benchmark
inspect-evalsA Q&A benchmark with questions collected from professional medical board exams. Only includes the English subset of the dataset (which also contains Mandarin Chinese and Taiwanese questions).
Mind2Web-SC
inspect-evalsTests whether an AI system can act as a safety guardrail by generating and executing code to protect web navigation agents from unsafe actions based on user constraints.
Mind2Web: Towards a Generalist Agent for the Web
inspect-evalsA dataset for developing and evaluating generalist agents for the web that can follow language instructions to complete complex tasks on any website.
MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning
inspect-evalsEvaluating models on multistep soft reasoning tasks in the form of free text narratives.
Needle in a Haystack (NIAH): In-Context Retrieval Benchmark for Long Context LLMs
inspect-evalsNIAH evaluates in-context retrieval ability of long context LLMs by testing a model's ability to extract factual information from long-context inputs.
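The core construction is simple enough to sketch: a short factual needle is inserted at a chosen depth inside long filler text, and the model is asked to retrieve it. The filler, needle, and parameters below are illustrative, not the eval's actual dataset.

```python
# Generic sketch of a needle-in-a-haystack probe; the filler text, needle,
# and parameters are illustrative.

def build_haystack(filler: str, needle: str, context_chars: int, depth: float) -> str:
    """Insert `needle` at a relative `depth` (0.0 = start, 1.0 = end) of the filler."""
    haystack = (filler * (context_chars // len(filler) + 1))[:context_chars]
    cut = int(len(haystack) * depth)
    return haystack[:cut] + "\n" + needle + "\n" + haystack[cut:]

prompt = build_haystack(
    filler="The grass is green. The sky is blue. ",
    needle="The secret passphrase is 'violet otter'.",
    context_chars=20_000,
    depth=0.5,
) + "\n\nWhat is the secret passphrase?"
```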
NoveltyBench: Evaluating Language Models for Humanlike Diversity
inspect-evalsEvaluates how well language models generate diverse, humanlike responses across multiple reasoning and generation tasks. This evaluation assesses whether LLMs can produce varied outputs rather than repetitive or uniform answers.
O-NET: A high-school student knowledge test
inspect-evalsQuestions and answers from the Ordinary National Educational Test (O-NET), administered annually by the National Institute of Educational Testing Service to Matthayom 6 (Grade 12 / ISCED 3) students in Thailand. The exam covers five subjects: English language, math, science, social knowledge, and Thai language. Questions are multiple-choice or true/false, and may be in either English or Thai.
OSWorld: Multimodal Computer Interaction Tasks
inspect-evalsTests AI agents' ability to perform realistic, open-ended tasks within simulated computer environments, requiring complex interaction across multiple input modalities.
PAWS: Paraphrase Adversaries from Word Scrambling
inspect-evalsEvaluating models on the task of paraphrase detection by providing pairs of sentences that are either paraphrases or not.
PIQA: Physical Commonsense Reasoning Test
inspect-evalsMeasures the model's ability to apply practical, everyday commonsense reasoning about physical objects and scenarios through simple decision-making questions.
PaperBench: Evaluating AI's Ability to Replicate AI Research (Work In Progress)
inspect-evalsAgents are evaluated on their ability to replicate 20 ICML 2024 Spotlight and Oral papers from scratch. Given a research paper PDF, an addendum with clarifications, and a rubric defining evaluation criteria, the agent must reproduce the paper's key results by writing and executing code.
> **Note:** This eval is a work in progress.
PersistBench: When Should Long-Term Memories Be Forgotten by LLMs?
inspect-evalsEvaluates long-term memory risk in assistant behavior across three tasks: cross-domain memory leakage, memory-driven sycophancy, and beneficial memory usage.
Personality
inspect-evalsAn evaluation suite consisting of multiple personality tests that can be applied to LLMs. Its primary goals are twofold: 1. Assess a model's default personality: the persona it naturally exhibits without specific prompting. 2. Evaluate whether a model can embody a specified persona: how effectively it adopts certain personality traits when prompted or guided.
Pre-Flight: Aviation Operations Knowledge Evaluation
inspect-evalsTests model understanding of aviation regulations including ICAO annexes, flight dispatch rules, pilot procedures, and airport ground operations safety protocols.
PubMedQA: A Dataset for Biomedical Research Question Answering
inspect-evalsBiomedical question answering (QA) dataset collected from PubMed abstracts.
RACE-H: A benchmark for testing reading comprehension and reasoning abilities of neural models
inspect-evalsReading comprehension tasks collected from English exams for Chinese middle and high school students aged 12 to 18.
SAD: Situational Awareness Dataset
inspect-evalsEvaluates situational awareness in LLMs—knowledge of themselves and their circumstances—through behavioral tests including recognizing generated text, predicting behavior, and following self-aware instructions. Current implementation includes SAD-mini with 5 of 16 tasks.
SEvenLLM: A benchmark to elicit and improve cybersecurity incident analysis and response abilities in LLMs for Security Events.
inspect-evalsDesigned for analyzing cybersecurity incidents, the benchmark comprises two primary task categories, understanding and generation, which are further broken down into 28 task subcategories.
SOS BENCH: Benchmarking Safety Alignment on Scientific Knowledge
inspect-evalsA regulation-grounded, hazard-focused benchmark encompassing six high-risk scientific domains: chemistry, biology, medicine, pharmacology, physics, and psychology. The benchmark comprises 3,000 prompts derived from real-world regulations and laws, systematically expanded via an LLM-assisted evolutionary pipeline that introduces diverse, realistic misuse scenarios (e.g., detailed explosive synthesis instructions involving advanced chemical formulas).
SQuAD: A Reading Comprehension Benchmark requiring reasoning over Wikipedia articles
inspect-evalsSet of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage.
SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?
inspect-evalsA benchmark of freelance software engineering tasks from Upwork, valued at $1 million USD in total real-world payouts.
SWE-bench Verified: Resolving Real-World GitHub Issues
inspect-evalsEvaluates AI's ability to resolve genuine software engineering issues sourced from 12 popular Python GitHub repositories, reflecting realistic coding and debugging scenarios.
SciCode: A Research Coding Benchmark Curated by Scientists
inspect-evalsSciCode tests the ability of language models to generate code to solve scientific research problems. It assesses models on 65 problems from mathematics, physics, chemistry, biology, and materials science.
SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models
inspect-evalsThe Scientific Knowledge Evaluation benchmark is inspired by the profound principles outlined in the “Doctrine of the Mean” from ancient Chinese philosophy. This benchmark is designed to assess LLMs based on their proficiency in Studying Extensively, Enquiring Earnestly, Thinking Profoundly, Discerning Clearly, and Practicing Assiduously. Each of these dimensions offers a unique perspective on evaluating the capabilities of LLMs in handling scientific knowledge.
SecQA: A Concise Question-Answering Dataset for Evaluating Large Language Models in Computer Security
inspect-evals"Security Question Answering" dataset to assess LLMs' understanding and application of security principles. SecQA has "v1" and "v2" datasets of multiple-choice questions that aim to provide two levels of cybersecurity evaluation criteria. The questions were generated by GPT-4 based on the "Computer Systems Security: Planning for Success" textbook and vetted by humans.
SimpleQA/SimpleQA Verified: Measuring short-form factuality in large language models
inspect-evalsA benchmark that evaluates the ability of language models to answer short, fact-seeking questions.
StereoSet: Measuring stereotypical bias in pretrained language models
inspect-evalsA dataset that measures stereotype bias in language models across gender, race, religion, and profession domains. Models choose between stereotype, anti-stereotype, and unrelated completions to sentences.
StrongREJECT: Measuring LLM susceptibility to jailbreak attacks
inspect-evalsA benchmark that evaluates the susceptibility of LLMs to various jailbreak attacks.
Sycophancy Eval
inspect-evalsEvaluate sycophancy of language models across a variety of free-form text-generation tasks.
TAC: Animal Welfare Awareness in AI Ticket Agents
inspect-evalsTests whether AI agents show implicit animal welfare awareness when purchasing tickets and experiences on behalf of users. Each scenario is designed so the most obvious choice involves animal exploitation, but the user prompt never mentions animal welfare. Agents should intrinsically avoid harmful options.
TarantuBench
external TarantuBenchA web security benchmark for evaluating AI agents on generated vulnerable Node.js/Express applications. Agents interact over HTTP, extract hidden TARANTU{...} flags, and are scored with binary exact-match success.
Tau2
inspect-evalsEvaluating Conversational Agents in a Dual-Control Environment
The Agent Company: Evaluating multi-tool autonomous agents in a synthetic company
inspect-evalsThe Agent Company benchmark evaluates autonomous agents in a realistic, self-contained company environment. Tasks require browsing internal web services, reading and writing files, running code, and coordinating tools to solve multi-step problems.
The Art of Saying No: Contextual Noncompliance in Language Models
inspect-evalsDataset of 1,001 samples to test the noncompliance capabilities of language models, plus a contrast set of 379 samples.
TruthfulQA: Measuring How Models Mimic Human Falsehoods
inspect-evalsMeasure whether a language model is truthful in generating answers to questions using questions that some humans would answer falsely due to a false belief or misconception.
USACO: USA Computing Olympiad
inspect-evalsEvaluates language model performance on difficult Olympiad programming problems across four difficulty levels.
Uganda Cultural and Cognitive Benchmark (UCCB)
inspect-evalsThe first comprehensive question-answering dataset designed to evaluate the cultural understanding and reasoning abilities of Large Language Models concerning Uganda's multifaceted environment, spanning 24 cultural domains including education, traditional medicine, media, economy, literature, and social norms.
V*Bench: A Visual QA Benchmark with Detailed High-resolution Images
inspect-evalsV*Bench is a visual question & answer benchmark that evaluates MLLMs in their ability to process high-resolution and visually crowded images to find and focus on small details.
VQA-RAD: Visual Question Answering for Radiology
inspect-evalsVQA-RAD is the first manually constructed VQA dataset in radiology, where clinicians asked naturally occurring questions about radiology images and provided reference answers. It contains 315 radiology images (head CTs/MRIs, chest X-rays, abdominal CTs) with question-answer pairs spanning 11 question types including modality, plane, organ system, abnormality, and object/condition presence. Questions are either closed-ended (yes/no) or open-ended free-text answers.
VimGolf: Evaluating LLMs in Vim Editing Proficiency
inspect-evalsA benchmark that evaluates LLMs in their ability to operate Vim editor and complete editing challenges. This benchmark contrasts with common CUA benchmarks by focusing on Vim-specific editing capabilities.
WINOGRANDE: An Adversarial Winograd Schema Challenge at Scale
inspect-evalsSet of 273 expert-crafted pronoun resolution problems originally designed to be unsolvable for statistical models that rely on selectional preferences or word associations.
WMDP: Measuring and Reducing Malicious Use With Unlearning
inspect-evalsA dataset of 3,668 multiple-choice questions developed by a consortium of academics and technical consultants that serve as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security.
WorldSense: Grounded Reasoning Benchmark
inspect-evalsMeasures grounded reasoning over synthetic world descriptions while controlling for dataset bias. Includes three problem types (Infer, Compl, Consist) and two grades (trivial, normal).
WritingBench: A Comprehensive Benchmark for Generative Writing
inspect-evalsA comprehensive evaluation benchmark designed to assess large language models' capabilities across diverse writing tasks. The benchmark evaluates models on various writing domains including academic papers, business documents, creative writing, and technical documentation, with multi-dimensional scoring based on domain-specific criteria.
XSTest: A benchmark for identifying exaggerated safety behaviours in LLMs
inspect-evalsDataset with 250 safe prompts across ten prompt types that well-calibrated models should not refuse, and 200 unsafe prompts as contrasts that models, for most applications, should refuse.
ZeroBench
inspect-evalsA lightweight visual reasoning benchmark that is (1) Challenging, (2) Lightweight, (3) Diverse, and (4) High-quality.
b3: Backbone Breaker Benchmark
inspect-evalsA comprehensive benchmark for evaluating LLMs for agentic AI security vulnerabilities including prompt attacks aimed at data exfiltration, content injection, decision and behavior manipulation, denial of service, system and tool compromise, and content policy bypass.
scBench: A Benchmark for Single-Cell RNA-seq Analysis
inspect-evalsEvaluates whether models can solve practical single-cell RNA-seq analysis tasks with deterministic grading. Tasks require empirical interaction with .h5ad data files — agents must load and analyze the data to produce correct answers. Covers 30 canonical tasks across 5 sequencing platforms and 7 task categories.
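To give a sense of the empirical interaction involved, a task typically starts by loading the provided `.h5ad` file and computing the answer from the data itself; a minimal sketch using `anndata` (the file name is hypothetical):

```python
# Illustrative of the empirical step the tasks require; the file name is
# hypothetical. The answer is computed from the data, not recalled.
import anndata as ad

adata = ad.read_h5ad("sample_dataset.h5ad")   # hypothetical .h5ad input
n_cells, n_genes = adata.shape                # AnnData is cells x genes
print(f"{n_cells} cells x {n_genes} genes")
```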
∞Bench: Extending Long Context Evaluation Beyond 100K Tokens
inspect-evalsLLM benchmark featuring an average data length surpassing 100K tokens. Comprises synthetic and realistic tasks spanning diverse domains in English and Chinese.