Inspect Evals
Welcome to Inspect Evals, a collection of LLM evaluations for Inspect AI published by the UK AI Safety Institute and created in collaboration with Arcadia Impact and the Vector Institute.
Community contributions are welcome and encouraged! Please see the Contributor Guide for details on submitting new evaluations.
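As a quick orientation, here is a minimal sketch of running one of the evals listed below via the Inspect AI Python API. It assumes the inspect_evals package is installed alongside inspect_ai and that you have credentials for the chosen model provider; the task, model, and sample limit are illustrative.

```python
# Minimal sketch: run the GSM8K eval on a handful of samples.
from inspect_ai import eval

from inspect_evals.gsm8k import gsm8k

# Any Inspect AI model identifier works here (provider/model);
# `limit` restricts the run to the first few samples as a smoke test.
eval(gsm8k(), model="openai/gpt-4o", limit=10)
```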
Coding
- HumanEval: Evaluating correctness for synthesizing Python programs from docstrings. Demonstrates custom scorers and sandboxing untrusted model code (see the scorer sketch after this list).
- MBPP: Measuring the ability of models to synthesize short Python programs from natural language descriptions. Demonstrates custom scorers and sandboxing untrusted model code.
- SWE-bench: Software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. Demonstrates sandboxing untrusted model code.
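To give a flavour of the custom scorers these coding evals demonstrate, here is a minimal sketch of an Inspect AI scorer. The scorer name and string-containment logic are hypothetical; the actual coding evals execute generated programs against test cases in a sandbox rather than string-matching.

```python
# Hypothetical minimal scorer: marks a sample correct when the target
# string appears in the model's completion. Real coding evals instead
# run the generated code against test cases inside a sandbox.
from inspect_ai.scorer import CORRECT, INCORRECT, Score, Target, accuracy, scorer
from inspect_ai.solver import TaskState

@scorer(metrics=[accuracy()])
def contains_target():
    async def score(state: TaskState, target: Target) -> Score:
        completion = state.output.completion
        return Score(
            value=CORRECT if target.text in completion else INCORRECT,
            answer=completion,
        )
    return score
```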
Assistants
- GAIA: Real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and general tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs.
Cybersecurity
- InterCode CTF: Measure expertise in coding, cryptography (e.g. binary exploitation, forensics), reverse engineering, and recognizing security vulnerabilities. Demonstrates tool use and sandboxing untrusted model code.
- GDM CTF: CTF challenges covering web app vulnerabilities, off-the-shelf exploits, databases, Linux privilege escalation, password cracking and spraying. Demonstrates tool use and sandboxing untrusted model code (see the task sketch after this list).
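The tool use and sandboxing these CTF evals demonstrate combine roughly as in the sketch below. The task name, sample, and flag are hypothetical placeholders; the real evals ship per-challenge Docker environments.

```python
# Hypothetical toy CTF task: a basic agent gets a bash tool whose
# commands run inside a Docker sandbox, isolating untrusted model code.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import includes
from inspect_ai.solver import basic_agent
from inspect_ai.tool import bash

@task
def toy_ctf():
    return Task(
        dataset=[Sample(input="Find the flag hidden on this machine.", target="flag{...}")],
        solver=basic_agent(tools=[bash(timeout=180)]),
        scorer=includes(),
        sandbox="docker",
    )
```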
Safeguards
- AgentHarm: Diverse set of 110 explicitly malicious agent tasks (440 with augmentations), covering 11 harm categories including fraud, cybercrime, and harassment.
- WMDP: A dataset of 3,668 multiple-choice questions developed by a consortium of academics and technical consultants that serve as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security.
Mathematics
- MATH: Dataset of 12,500 challenging competition mathematics problems. Demonstrates few-shot prompting and custom scorers.
- GSM8K: Dataset of 8.5K high-quality, linguistically diverse grade school math word problems. Demonstrates few-shot prompting (see the sketch after this list).
- MathVista: Diverse mathematical and visual tasks that require fine-grained, deep visual understanding and compositional reasoning. Demonstrates multimodal inputs and custom scorers.
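The few-shot prompting that MATH and GSM8K demonstrate is controlled through task parameters. A sketch, assuming the GSM8K task exposes a `fewshot` argument (check the task's signature in inspect_evals for the exact options it accepts):

```python
# Sketch: rerun GSM8K with a different number of few-shot examples.
# Assumes the task accepts a `fewshot` parameter controlling how many
# worked examples are prepended to each prompt.
from inspect_ai import eval

from inspect_evals.gsm8k import gsm8k

eval(gsm8k(fewshot=5), model="openai/gpt-4o", limit=10)
```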
Reasoning
- ARC: Dataset of natural, grade-school science multiple-choice questions (authored for human tests).
- HellaSwag: Evaluating commonsense natural language inference: given an event description such as "A woman sits at a piano," a machine must select the most likely follow-up.
- PIQA: Measure physical commonsense reasoning (e.g. "To apply eyeshadow without a brush, should I use a cotton swab or a toothpick?").
- BoolQ: Reading comprehension dataset that queries for complex, non-factoid information and requires difficult entailment-like inference to solve.
- DROP: Evaluates reading comprehension where models must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting).
- WINOGRANDE: Set of 273 expert-crafted pronoun resolution problems originally designed to be unsolvable for statistical models that rely on selectional preferences or word associations.
- RACE-H: Reading comprehension tasks collected from English exams for middle and high school Chinese students aged 12 to 18.
- MMMU: Multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines and demanding college-level subject knowledge and deliberate reasoning. Demonstrates multimodal inputs.
- SQuAD: Set of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage.
- IFEval: Evaluates the ability to follow a set of "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times". Demonstrates custom scoring.
Knowledge
- MMLU: Evaluates models on 57 tasks including elementary mathematics, US history, computer science, law, and more.
- MMLU-Pro: An enhanced dataset designed to extend the mostly knowledge-driven MMLU benchmark by integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options.
- GPQA: Challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry (experts at PhD level in the corresponding domains reach 65% accuracy).
- CommonsenseQA: Measures question answering that relies on commonsense prior knowledge.
- TruthfulQA: Measures whether a language model is truthful in generating answers to questions, using questions that some humans would answer falsely due to false beliefs or misconceptions.
- XSTest: Dataset with 250 safe prompts across ten prompt types that well-calibrated models should not refuse, and 200 unsafe prompts as contrasts that models, for most applications, should refuse.
- PubMedQA: Biomedical question answering (QA) dataset collected from PubMed abstracts.
- AGIEval: A human-centric benchmark specifically designed to evaluate the general abilities of foundation models in tasks pertinent to human cognition and problem-solving.