Inspect Evals
Welcome to Inspect Evals, a repository of community-contributed LLM evaluations for Inspect AI.
Inspect Evals was created collaboratively by the UK AISI, Arcadia Impact, and the Vector Institute. If you'd like to make a contribution, please see the Contributor Guide for details on submitting new evaluations.
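Each evaluation in the repository is packaged as an Inspect task that can be run from the command line or from Python. The snippet below is a minimal sketch of that workflow, assuming Inspect AI and this repository are installed as Python packages; the module path, task name, model identifier, and install commands shown are illustrative assumptions rather than guaranteed names.

```python
# Minimal sketch of running an eval with the Inspect AI Python API.
# Assumes (illustrative): `pip install inspect-ai` plus this repository installed
# as `inspect_evals`, and credentials configured for the chosen model provider.
from inspect_ai import eval            # note: shadows the builtin eval, as in Inspect's docs
from inspect_evals.gsm8k import gsm8k  # import the task you want to run (name assumed)

# Run the task against a model; `limit` keeps the example run small.
logs = eval(gsm8k(), model="openai/gpt-4o", limit=10)

# Roughly equivalent CLI invocation (illustrative):
#   inspect eval inspect_evals/gsm8k --model openai/gpt-4o --limit 10
```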
Coding
- Measures the functional correctness of synthesizing programs from docstrings (see the scoring sketch after this list).
- Measures the ability to synthesize short Python programs from natural language descriptions.
- Software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories.
- Code generation benchmark with a thousand data science problems spanning seven Python libraries.
- Python coding benchmark with 1,140 diverse questions drawing on numerous Python libraries.
- Evaluates LLMs on class-level code generation with 100 tasks constructed over 500 person-hours. The study shows that LLMs perform worse on class-level tasks than on method-level tasks.
- SciCode tests the ability of language models to generate code to solve scientific research problems. It assesses models on 65 problems from mathematics, physics, chemistry, biology, and materials science.
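Benchmarks of this kind typically score "functional correctness" by executing the generated program against unit tests rather than comparing strings. The sketch below illustrates that pattern with a hypothetical custom scorer built on Inspect AI's scorer and sandbox APIs; the scorer name, the assumption that the sample target holds a test snippet, and the exact pass/fail rule are inventions for illustration, not the repository's implementation.

```python
# Illustrative sketch (not the repository's implementation): scoring functional
# correctness by executing generated code plus tests inside the task's sandbox.
# The enclosing Task must be configured with a sandbox (e.g. sandbox="docker").
from inspect_ai.scorer import CORRECT, INCORRECT, Score, Target, accuracy, scorer, stderr
from inspect_ai.solver import TaskState
from inspect_ai.util import sandbox


@scorer(metrics=[accuracy(), stderr()])
def execution_scorer():
    async def score(state: TaskState, target: Target) -> Score:
        # Assumption for this sketch: the sample's target contains unit tests and
        # the model's completion is bare Python code (no markdown fences).
        program = state.output.completion + "\n\n" + target.text
        result = await sandbox().exec(cmd=["python", "-c", program], timeout=30)
        return Score(
            value=CORRECT if result.success else INCORRECT,
            explanation=None if result.success else result.stderr,
        )

    return score
```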
Assistants
- Proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and general tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs.
- Benchmark for multimodal agents performing open-ended tasks in real computer environments.
- Tests whether AI agents can perform real-world, time-consuming tasks on the web.
Cybersecurity
- 40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions, chosen to be recent, meaningful, and spanning a wide range of difficulties.
- Datasets containing 80, 500, 2000 and 10000 multiple-choice questions, designed to evaluate understanding across nine domains within cybersecurity.
- Evaluates large language models for risky capabilities in cybersecurity.
- Measures expertise in coding, cryptography (e.g. binary exploitation, forensics), reverse engineering, and recognizing security vulnerabilities. Demonstrates tool use and sandboxing untrusted model code.
- CTF challenges covering web app vulnerabilities, off-the-shelf exploits, databases, Linux privilege escalation, and password cracking and spraying. Demonstrates tool use and sandboxing untrusted model code (a task sketch follows this list).
- Benchmark designed for analyzing cybersecurity incidents, comprising two primary task categories (understanding and generation), further broken down into 28 subcategories of tasks.
- "Security Question Answering" dataset to assess LLMs' understanding and application of security principles. SecQA has "v1" and "v2" datasets of multiple-choice questions that aim to provide two levels of cybersecurity evaluation criteria. The questions were generated by GPT-4 based on the "Computer Systems Security: Planning for Success" textbook and vetted by humans.
Safeguards
- Diverse set of 110 explicitly malicious agent tasks (440 with augmentations), covering 11 harm categories including fraud, cybercrime, and harassment.
- A dataset of 3,668 multiple-choice questions, developed by a consortium of academics and technical consultants, that serves as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security.
Mathematics
- Dataset of 12,500 challenging competition mathematics problems. Demonstrates few-shot prompting and custom scorers (a prompting sketch follows this list).
- Dataset of 8.5K high-quality, linguistically diverse grade school math word problems.
- Diverse mathematical and visual tasks that require fine-grained, deep visual understanding and compositional reasoning. Demonstrates multimodal inputs and custom scorers.
- Extends the original GSM8K dataset by translating 250 of its problems into 10 typologically diverse languages.
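Few-shot prompting for math tasks usually means prepending worked examples to the question before generation. The sketch below shows one way to do that with Inspect AI's `prompt_template` solver and `match` scorer, under the assumption that answers end with a final numeric line; the template, example problem, and task name are invented, and the repository's math tasks build their few-shot examples from the datasets themselves.

```python
# Illustrative sketch of few-shot prompting for a math word-problem task.
# The template, example problem, and task name are invented for this sketch.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate, prompt_template

# Invented few-shot template; {prompt} is filled with the sample's input.
FEWSHOT = """Solve the problem and end with "ANSWER: <number>".

Problem: A book costs $4 and a pen costs $1. How much do 2 books and 3 pens cost?
ANSWER: 11

Problem: {prompt}
"""


@task
def toy_math():
    return Task(
        dataset=[Sample(input="What is 17 + 25?", target="42")],
        solver=[prompt_template(FEWSHOT), generate()],
        scorer=match(numeric=True),  # compare the final numeric answer
    )
```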
Reasoning
- V*Bench is a visual question & answer benchmark that evaluates MLLMs on their ability to process high-resolution and visually crowded images to find and focus on small details.
- Dataset of natural, grade-school science multiple-choice questions (authored for human tests).
- Evaluates commonsense natural language inference: given an event description such as "A woman sits at a piano," a machine must select the most likely follow-up.
- Measures physical commonsense reasoning (e.g. "To apply eyeshadow without a brush, should I use a cotton swab or a toothpick?").
- LLM benchmark featuring an average data length surpassing 100K tokens. Comprises synthetic and realistic tasks spanning diverse domains in English and Chinese.
- Suite of 23 challenging BIG-Bench tasks for which prior language model evaluations did not outperform the average human rater.
- Reading comprehension dataset that queries for complex, non-factoid information and requires difficult entailment-like inference to solve.
- DocVQA is a Visual Question Answering benchmark that consists of 50,000 questions covering 12,000+ document images. This implementation solves and scores the "validation" split.
- Evaluates reading comprehension where models must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting).
- Set of 273 expert-crafted pronoun resolution problems originally designed to be unsolvable for statistical models that rely on selectional preferences or word associations.
- Reading comprehension tasks collected from the English exams for Chinese middle and high school students aged 12 to 18.
- Multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines and demanding college-level subject knowledge and deliberate reasoning. Demonstrates multimodal inputs.
- Set of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage.
- Evaluates the ability to follow a set of "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times". Demonstrates custom scoring (a scorer sketch follows this list).
- Evaluates models on multistep soft reasoning tasks in the form of free-text narratives.
- NIAH evaluates the in-context retrieval ability of long-context LLMs by testing a model's ability to extract factual information from long inputs.
- Evaluates models on paraphrase detection by providing pairs of sentences that are either paraphrases or not.
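"Verifiable instructions" lend themselves to programmatic scoring: the checker inspects the completion directly instead of comparing it to a reference answer. The sketch below shows a hypothetical custom scorer for the "write in more than 400 words" instruction using Inspect AI's scorer decorator; the scorer name, threshold parameter, and scoring rule are assumptions for illustration and only cover one instruction type, unlike the actual evaluation.

```python
# Illustrative sketch of custom scoring for one "verifiable instruction"
# ("write in more than 400 words"). Names and rules are invented for this sketch.
from inspect_ai.scorer import CORRECT, INCORRECT, Score, Target, accuracy, scorer, stderr
from inspect_ai.solver import TaskState


@scorer(metrics=[accuracy(), stderr()])
def min_word_count(min_words: int = 400):
    async def score(state: TaskState, target: Target) -> Score:
        words = len(state.output.completion.split())
        return Score(
            value=CORRECT if words > min_words else INCORRECT,
            explanation=f"completion contains {words} words",
        )

    return score
```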
Knowledge
- Evaluates models on 57 tasks including elementary mathematics, US history, computer science, law, and more.
- An enhanced dataset designed to extend the mostly knowledge-driven MMLU benchmark by integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options.
- Challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry (experts at PhD level in the corresponding domains reach 65% accuracy). See the multiple-choice task sketch after this list.
- Measures question answering with commonsense prior knowledge.
- Measures whether a language model is truthful in generating answers to questions, using questions that some humans would answer falsely due to a false belief or misconception.
- Dataset with 250 safe prompts across ten prompt types that well-calibrated models should not refuse, and 200 unsafe prompts as contrasts that models, for most applications, should refuse.
- Biomedical question answering (QA) dataset collected from PubMed abstracts.
- AGIEval is a human-centric benchmark specifically designed to evaluate the general abilities of foundation models in tasks pertinent to human cognition and problem-solving.
- Questions and answers from the Ordinary National Educational Test (O-NET), administered annually by the National Institute of Educational Testing Service to Matthayom 6 (Grade 12 / ISCED 3) students in Thailand. The exam covers five subjects: English language, math, science, social knowledge, and Thai language. Questions may be multiple-choice or true/false, and can be in either English or Thai.
- A benchmark that evaluates the ability of language models to answer short, fact-seeking questions.
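Most of the knowledge evaluations above are multiple-choice, a format Inspect AI supports directly. The sketch below shows the general shape of such a task using the `multiple_choice` solver and `choice` scorer, with an invented question, choices, and task name; it is a minimal illustration, not how any particular benchmark above is implemented.

```python
# Illustrative sketch of a multiple-choice knowledge task. The question,
# choices, and task name are invented for this sketch.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import choice
from inspect_ai.solver import multiple_choice


@task
def toy_mcq():
    return Task(
        dataset=[
            Sample(
                input="Which gas makes up most of Earth's atmosphere?",
                choices=["Oxygen", "Nitrogen", "Carbon dioxide", "Argon"],
                target="B",  # letter of the correct choice
            )
        ],
        solver=multiple_choice(),
        scorer=choice(),
    )
```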