# WritingBench: A Comprehensive Benchmark for Generative Writing
## Overview
📃 Paper • 🚀 Github Repo • 🏆 Leaderboard • 📏 Critic Model • ✍️ Writing Model
WritingBench is a comprehensive evaluation benchmark designed to assess large language models’ capabilities across diverse writing tasks. The benchmark evaluates models on various writing domains including academic papers, business documents, creative writing, and technical documentation, with multi-dimensional scoring based on domain-specific criteria. The evaluation is multilingual, with prompts in both Chinese and English.
The evaluation uses a multi-scorer approach where each writing sample is assessed across 5 distinct criteria using model-based judges, providing detailed insights into different aspects of writing quality.
## Usage

First, install the `inspect_ai` and `inspect_evals` Python packages with:

```bash
pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals
```
Or, if developing on a clone of the `inspect_evals` repo, you can install the package in editable mode with:

```bash
pip install -e ".[dev]"
```
Then, evaluate against one or more models with:

```bash
inspect eval inspect_evals/writingbench --model openai/gpt-4o
```
After running evaluations, you can view their logs using the `inspect view` command:

```bash
inspect view
```
If you don’t want to specify the `--model` each time you run an evaluation, create a `.env` configuration file in your working directory that defines the `INSPECT_EVAL_MODEL` environment variable along with your API key. For example:

```bash
INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>
```
## Options

You can control a variety of options from the command line. For example:

```bash
inspect eval inspect_evals/writingbench --limit 10
inspect eval inspect_evals/writingbench --max-connections 10
inspect eval inspect_evals/writingbench --temperature 0.5
```
See `inspect eval --help` for all available options. You can also specify WritingBench-specific options:

```bash
# Specify the judge model used for scoring
inspect eval inspect_evals/writingbench --model openai/gpt-4o -T judge_model="anthropic/claude-3-5-sonnet-20240620"

# Filter by primary domain (e.g., Academic & Engineering, Business, Creative Writing)
inspect eval inspect_evals/writingbench --model openai/gpt-4o -T 'domain1=["Academic & Engineering", "Business"]'

# Filter by specific subdomain (e.g., Paper Outline, Short Story)
inspect eval inspect_evals/writingbench --model openai/gpt-4o -T 'domain2=["Paper Outline", "Short Story"]'

# Combine multiple options
inspect eval inspect_evals/writingbench --model openai/gpt-4o --limit 50 -T judge_model="openai/gpt-4o-mini" -T 'domain1=["Creative Writing"]'
```
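The same evaluation can also be launched from Python via `inspect_ai`'s `eval()` function. A minimal sketch, assuming the task function is exported as `writingbench` from `inspect_evals.writingbench` (check the package for the exact import path):

```python
from inspect_ai import eval
from inspect_evals.writingbench import writingbench  # assumed import path

# Equivalent to the combined CLI invocation above: task options passed
# with -T on the command line become keyword arguments to the task function.
eval(
    writingbench(
        judge_model="openai/gpt-4o-mini",
        domain1=["Creative Writing"],
    ),
    model="openai/gpt-4o",
    limit=50,
)
```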
## Dataset
WritingBench contains 1,000 diverse writing prompts spanning multiple domains and task types. Each sample includes:
- Query: The writing prompt or task description
- Domain1: Primary domain category (e.g., “Academic & Engineering”, “Business”, “Creative Writing”)
- Domain2: Specific subdomain (e.g., “Paper Outline”, “Marketing Copy”, “Short Story”)
- Checklist: Five specific evaluation criteria for the task
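Put together, a single record might look like the following sketch (hypothetical values; the field names follow the list above, but consult the dataset itself for the exact schema):

```python
# A hypothetical WritingBench record, for illustration only.
sample = {
    "query": "Write a launch announcement email for a new note-taking app...",
    "domain1": "Business",
    "domain2": "Marketing Copy",
    "checklist": [
        "Clarity of the product's value proposition",
        "Tone appropriate to the target audience",
        "Persuasive structure with a clear call to action",
        "Concise, well-organized formatting",
        "Originality of the copy",
    ],
}
```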
## Leaderboard

The current WritingBench leaderboard is available on Hugging Face, where you can compare the performance of various models across different writing domains and tasks.

The leaderboard provides:

- Overall scores for each model
- Detailed breakdowns by domain category
- Performance on specific writing tasks
- Comparative analysis across evaluation criteria
This allows for comprehensive analysis of model strengths and weaknesses in different writing contexts.
## Domain Categories
The benchmark covers various writing domains including:
- Academic & Engineering: Research papers, technical reports, academic proposals
- Business: Marketing copy, business plans, professional communications
- Creative Writing: Stories, poetry, creative non-fiction
- Technical Documentation: User manuals, API documentation, technical guides
- Education: Lesson plans, course designs, feedback, assignments
- Advertising & Marketing: Social media scripts, advertising copy, brand narratives
Each domain has several subdomains. The full list of subdomains can be found on the WritingBench leaderboard.
## Example Task

Here are two example samples from the Academic & Engineering domain:
Sample 1

> Query: “我想撰写一篇机器学习在智能制造质量控制中的应用研究论文,准备投稿《智能制造》期刊。论文突出机器学习算法在工业场景下的改进创新点和实际工程应用效果。请按照《智能制造》期刊格式要求,帮我写一个论文大纲。”
>
> (Translation: “I want to write a research paper on applying machine learning to quality control in intelligent manufacturing, for submission to the journal 《智能制造》. The paper should highlight the algorithmic improvements and innovations of machine learning in industrial settings and their practical engineering results. Please write a paper outline following the journal’s formatting requirements.”)
>
> Domain1: Academic & Engineering
>
> Domain2: Paper Outline
Sample 2

> Query: “I am conducting research on intelligent building energy-saving control. I have a set of operational data from an office building’s air conditioning system spanning 6 months, including measured data such as indoor temperature, humidity, CO2 concentration, energy consumption, and optimization records of the air conditioning system control algorithm based on deep reinforcement learning. I hope to write a paper following the submission guidelines of the journal Building and Environment. Please provide a detailed outline for the paper.”
>
> Domain1: Academic & Engineering
>
> Domain2: Paper Outline
In these samples, the model is expected to generate a comprehensive paper outline following academic standards and journal requirements.
## Scoring
WritingBench uses a grouped scoring approach:
### Multi-Scorer Framework
Each writing sample is evaluated using 5 distinct model-based scorers, where each scorer focuses on a specific aspect from the task’s checklist criteria. The scoring process:
- Individual Criterion Assessment: Each of the 5 scorers evaluates one specific criterion from the checklist
- Detailed Justification: Scorers provide specific reasons referencing exact text passages
- Strict Evaluation: Emphasis on thorough evaluation beyond superficial appearances
- Score Range: Integer scores from 1-10 with detailed rubric
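In outline, the per-criterion judging step might look like the following sketch. This is illustrative only; names such as `build_judge_prompt` are hypothetical and not the benchmark's actual code:

```python
# Illustrative sketch of the multi-scorer loop: one judge call per
# checklist criterion, five calls per writing sample.
def build_judge_prompt(query: str, response: str, criterion: str) -> str:
    return (
        "Evaluate the writing sample below against a single criterion.\n"
        f"Task: {query}\n"
        f"Criterion: {criterion}\n"
        f"Writing sample:\n{response}\n\n"
        'Reply in JSON: {"score": <integer 1-10>, "reason": "..."}'
    )

def score_sample(query, response, checklist, judge):
    # `judge` is any callable that sends a prompt to the judge model and
    # returns its text completion.
    return [judge(build_judge_prompt(query, response, c)) for c in checklist]
```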
An example of a rubric provided to one of the 5 scorers:
```json
{
  "name": "Methodological Rigor",
  "criteria_description": "Evaluates the outline's treatment of research methodology, particularly regarding the deep reinforcement learning approach for HVAC control optimization and experimental validation.",
  "1-2": "Outline contains no meaningful discussion of methodological approaches, experimental design, or validation techniques for the DRL control system.",
  "3-4": "Outline mentions methodology but with significant gaps in explaining DRL implementation, validation processes, or comparative analysis approaches for building energy optimization.",
  "5-6": "Outline includes standard methodology sections with basic coverage of DRL application to building systems, but lacks detailed explanation of algorithm design, training procedures, or validation metrics.",
  "7-8": "Outline presents a robust methodology framework covering DRL implementation, validation protocols, and performance metrics, with only minor gaps in experimental design or comparative analysis.",
  "9-10": "Outline demonstrates exceptional methodological rigor with comprehensive coverage of DRL architecture selection, hyperparameter optimization, validation methodology, baseline comparisons, and statistical analysis approaches specifically tailored to building energy optimization."
}
```
The scoring model returns a JSON response that the scorer parses:

```json
{
  "score": 6,
  "reason": "The outline demonstrates moderate technical depth with references to HVAC systems and reinforcement learning concepts, but lacks specific technical details in several key areas..."
}
```
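Judge models do not always return clean JSON, so score extraction needs to be tolerant of surrounding text. A minimal sketch of such extraction, making no assumptions about the benchmark's actual parsing code:

```python
import json
import re

def extract_score(judge_output: str) -> tuple[int, str]:
    """Pull a score and reason out of a judge completion.

    Tries to parse the first {...} span as JSON, then falls back to a
    regex search for a bare "score" field. Illustrative only.
    """
    match = re.search(r"\{.*\}", judge_output, re.DOTALL)
    if match:
        try:
            parsed = json.loads(match.group(0))
            return int(parsed["score"]), str(parsed.get("reason", ""))
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            pass
    # Last resort: look for `"score": N` anywhere in the text.
    fallback = re.search(r'"score"\s*:\s*(\d+)', judge_output)
    if fallback:
        return int(fallback.group(1)), ""
    raise ValueError("no score found in judge output")
```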
### Metrics

The evaluation calculates:

- Overall Mean Score: average across all criteria and samples
- Individual Criterion Scores: detailed breakdown by evaluation aspect via the domain hierarchy

Hierarchical scoring example:

- Overall Mean: mean across all samples (aggregating by domain1 or by domain2 yields the same overall result)
- Mean “Academic & Engineering” score: aggregates all documents under this primary domain, including “Paper Outline”, “Research Proposal”, “Technical Report”, etc.
- Mean “Paper Outline” score: aggregates only documents specifically tagged as “Paper Outline” tasks

This allows comparing performance across broad categories versus specific task types.
For example:
| Mean | Academic & Engineering | Paper Outline |
| --- | --- | --- |
| 7.404 | 7.306 | 7.40 |
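A minimal sketch of this hierarchical aggregation, assuming each scored sample carries its domain tags and a per-sample mean over the five criterion scores (a hypothetical structure, not the eval's internal code):

```python
from collections import defaultdict
from statistics import mean

def hierarchical_means(scored: list[dict]) -> dict:
    # Each entry in `scored` looks like
    # {"domain1": ..., "domain2": ..., "score": ...}, where "score" is the
    # sample's mean over its five criterion scores.
    by_d1, by_d2 = defaultdict(list), defaultdict(list)
    for s in scored:
        by_d1[s["domain1"]].append(s["score"])
        by_d2[s["domain2"]].append(s["score"])
    return {
        "overall": mean(s["score"] for s in scored),
        "domain1": {k: mean(v) for k, v in by_d1.items()},
        "domain2": {k: mean(v) for k, v in by_d2.items()},
    }
```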
## Evaluation Report

The evaluation was run on several subsets of 200 samples, using anthropic/claude-3-7-sonnet-20250219 as both the main model and the judge model. These subsets cover most, but not all, of the domains. The results were within an acceptable range of the leaderboard scores at the time of evaluation.
| Eval Run | Model | Judge Model | Overall Mean | Academic & Engineering | Finance & Business | Politics & Law | Literature & Arts |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Eval Report | anthropic/claude-3-7-sonnet-20250219 | anthropic/claude-3-7-sonnet-20250219 | 7.4 | 7.31 | 7.41 | 7.32 | 7.68 |
| Leaderboard | anthropic/claude-3-7-sonnet-20250219 | anthropic/claude-3-7-sonnet-20250219 | 7.848 | 7.824 | 7.793 | 7.651 | 7.937 |
## Validation Notes

The evaluation has been tested for:

- End-to-end execution across diverse writing tasks
- Correct metric calculation and aggregation
- Proper handling of multilingual content (Chinese and English)
- Robust JSON parsing and score extraction
- Error-free evaluation logs without unexpected failures
The benchmark successfully distinguishes writing quality across different domains and provides meaningful feedback through detailed criterion-based scoring.