# WritingBench: A Comprehensive Benchmark for Generative Writing
## Overview
📃 Paper • 🚀 Github Repo • 🏆 Leaderboard • 📏 Critic Model • ✍️ Writing Model
WritingBench is a comprehensive evaluation benchmark designed to assess large language models’ capabilities across diverse writing tasks. The benchmark evaluates models on various writing domains including academic papers, business documents, creative writing, and technical documentation, with multi-dimensional scoring based on domain-specific criteria. The evaluation is multilingual, with prompts in both Chinese and English.
The evaluation uses a multi-scorer approach where each writing sample is assessed across 5 distinct criteria using model-based judges, providing detailed insights into different aspects of writing quality.
## Usage

First, install the `inspect_ai` and `inspect_evals` Python packages with:

```bash
pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals
```
Or, if developing on a clone of the `inspect_evals` repo, you can install the package in editable mode with:

```bash
pip install -e ".[dev]"
```
Then, evaluate against one or more models with:

```bash
inspect eval inspect_evals/writingbench --model openai/gpt-4o
```
After running evaluations, you can view their logs using the `inspect view` command:

```bash
inspect view
```
If you don’t want to specify the `--model` each time you run an evaluation, create a `.env` configuration file in your working directory that defines the `INSPECT_EVAL_MODEL` environment variable along with your API key. For example:

```bash
INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>
```
## Options

You can control a variety of options from the command line. For example:

```bash
inspect eval inspect_evals/writingbench --limit 10
inspect eval inspect_evals/writingbench --max-connections 10
inspect eval inspect_evals/writingbench --temperature 0.5
```
See `inspect eval --help` for all available options. You can also specify WritingBench-specific options:

```bash
# Specify the judge model used for scoring
inspect eval inspect_evals/writingbench --model openai/gpt-4o -T judge_model="anthropic/claude-3-5-sonnet-20240620"

# Filter by primary domain (e.g., Academic & Engineering, Business, Creative Writing)
inspect eval inspect_evals/writingbench --model openai/gpt-4o -T 'domain1=["Academic & Engineering", "Business"]'

# Filter by specific subdomain (e.g., Paper Outline, Short Story)
inspect eval inspect_evals/writingbench --model openai/gpt-4o -T 'domain2=["Paper Outline", "Short Story"]'

# Combine multiple options
inspect eval inspect_evals/writingbench --model openai/gpt-4o --limit 50 -T judge_model="openai/gpt-4o-mini" -T 'domain1=["Creative Writing"]'
```
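The same evaluation can also be launched from Python via `inspect_ai`'s `eval()` function. A minimal sketch, assuming the task function is exported as `writingbench` from `inspect_evals.writingbench` (check the package for the exact import path):

```python
from inspect_ai import eval
from inspect_evals.writingbench import writingbench  # assumed import path

# Equivalent to the combined CLI invocation above: task options passed
# with -T on the command line become keyword arguments to the task function.
eval(
    writingbench(
        judge_model="openai/gpt-4o-mini",
        domain1=["Creative Writing"],
    ),
    model="openai/gpt-4o",
    limit=50,
)
```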
## Dataset
WritingBench contains 1,000 diverse writing prompts spanning multiple domains and task types. Each sample includes:
- Query: The writing prompt or task description
- Domain1: Primary domain category (e.g., “Academic & Engineering”, “Business”, “Creative Writing”)
- Domain2: Specific subdomain (e.g., “Paper Outline”, “Marketing Copy”, “Short Story”)
- Checklist: Five specific evaluation criteria for the task
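Put together, a single record might look like the following sketch (hypothetical values; the field names follow the list above, but consult the dataset itself for the exact schema):

```python
# A hypothetical WritingBench record, for illustration only.
sample = {
    "query": "Write a launch announcement email for a new note-taking app...",
    "domain1": "Business",
    "domain2": "Marketing Copy",
    "checklist": [
        "Clarity of the product's value proposition",
        "Tone appropriate to the target audience",
        "Persuasive structure with a clear call to action",
        "Concise, well-organized formatting",
        "Originality of the copy",
    ],
}
```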
## Leaderboard

The current WritingBench leaderboard is available on Hugging Face, where you can compare the performance of various models across different writing domains and tasks.

The leaderboard provides:

- Overall scores for each model
- Detailed breakdowns by domain category
- Performance on specific writing tasks
- Comparative analysis across evaluation criteria
This allows for comprehensive analysis of model strengths and weaknesses in different writing contexts.
## Domain Categories
The benchmark covers various writing domains including:
- Academic & Engineering: Research papers, technical reports, academic proposals
- Business: Marketing copy, business plans, professional communications
- Creative Writing: Stories, poetry, creative non-fiction
- Technical Documentation: User manuals, API documentation, technical guides
- Education: Lesson plans, course designs, feedback, assignments
- Advertising & Marketing: Social media scripts, advertising copy, brand narratives
Each domain has several subdomains. The full list of subdomains can be found on the WritingBench leaderboard.
## Example Task

Here are two example samples from the Academic & Engineering domain:
Sample 1

> Query: “我想撰写一篇机器学习在智能制造质量控制中的应用研究论文,准备投稿《智能制造》期刊。论文突出机器学习算法在工业场景下的改进创新点和实际工程应用效果。请按照《智能制造》期刊格式要求,帮我写一个论文大纲。”
>
> (Translation: “I want to write a research paper on applying machine learning to quality control in intelligent manufacturing, for submission to the journal 《智能制造》. The paper should highlight the algorithmic improvements and innovations of machine learning in industrial settings and their practical engineering results. Please write a paper outline following the journal’s formatting requirements.”)
>
> Domain1: Academic & Engineering
>
> Domain2: Paper Outline
Sample 2

> Query: “I am conducting research on intelligent building energy-saving control. I have a set of operational data from an office building’s air conditioning system spanning 6 months, including measured data such as indoor temperature, humidity, CO2 concentration, energy consumption, and optimization records of the air conditioning system control algorithm based on deep reinforcement learning. I hope to write a paper following the submission guidelines of the journal Building and Environment. Please provide a detailed outline for the paper.”
>
> Domain1: Academic & Engineering
>
> Domain2: Paper Outline
In these samples, the model is expected to generate a comprehensive paper outline following academic standards and journal requirements.
## Scoring
WritingBench uses a grouped scoring approach:
### Multi-Scorer Framework
Each writing sample is evaluated using 5 distinct model-based scorers, where each scorer focuses on a specific aspect from the task’s checklist criteria. The scoring process:
- Individual Criterion Assessment: Each of the 5 scorers evaluates one specific criterion from the checklist
- Detailed Justification: Scorers provide specific reasons referencing exact text passages
- Strict Evaluation: Emphasis on thorough evaluation beyond superficial appearances
- Score Range: Integer scores from 1-10 with detailed rubric
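In outline, the per-criterion judging step might look like the following sketch. This is illustrative only; names such as `build_judge_prompt` are hypothetical and not the benchmark's actual code:

```python
# Illustrative sketch of the multi-scorer loop: one judge call per
# checklist criterion, five calls per writing sample.
def build_judge_prompt(query: str, response: str, criterion: str) -> str:
    return (
        "Evaluate the writing sample below against a single criterion.\n"
        f"Task: {query}\n"
        f"Criterion: {criterion}\n"
        f"Writing sample:\n{response}\n\n"
        'Reply in JSON: {"score": <integer 1-10>, "reason": "..."}'
    )

def score_sample(query, response, checklist, judge):
    # `judge` is any callable that sends a prompt to the judge model and
    # returns its text completion.
    return [judge(build_judge_prompt(query, response, c)) for c in checklist]
```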
An example of a rubric provided to one of the 5 scorers:
```json
{
  "name": "Methodological Rigor",
  "criteria_description": "Evaluates the outline's treatment of research methodology, particularly regarding the deep reinforcement learning approach for HVAC control optimization and experimental validation.",
  "1-2": "Outline contains no meaningful discussion of methodological approaches, experimental design, or validation techniques for the DRL control system.",
  "3-4": "Outline mentions methodology but with significant gaps in explaining DRL implementation, validation processes, or comparative analysis approaches for building energy optimization.",
  "5-6": "Outline includes standard methodology sections with basic coverage of DRL application to building systems, but lacks detailed explanation of algorithm design, training procedures, or validation metrics.",
  "7-8": "Outline presents a robust methodology framework covering DRL implementation, validation protocols, and performance metrics, with only minor gaps in experimental design or comparative analysis.",
  "9-10": "Outline demonstrates exceptional methodological rigor with comprehensive coverage of DRL architecture selection, hyperparameter optimization, validation methodology, baseline comparisons, and statistical analysis approaches specifically tailored to building energy optimization."
}
```
The scoring model returns a JSON response that the scorer parses:

```json
{
  "score": 6,
  "reason": "The outline demonstrates moderate technical depth with references to HVAC systems and reinforcement learning concepts, but lacks specific technical details in several key areas..."
}
```
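Judge models do not always return clean JSON, so score extraction needs to be tolerant of surrounding text. A minimal sketch of such extraction, making no assumptions about the benchmark's actual parsing code:

```python
import json
import re

def extract_score(judge_output: str) -> tuple[int, str]:
    """Pull a score and reason out of a judge completion.

    Tries to parse the first {...} span as JSON, then falls back to a
    regex search for a bare "score" field. Illustrative only.
    """
    match = re.search(r"\{.*\}", judge_output, re.DOTALL)
    if match:
        try:
            parsed = json.loads(match.group(0))
            return int(parsed["score"]), str(parsed.get("reason", ""))
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            pass
    # Last resort: look for `"score": N` anywhere in the text.
    fallback = re.search(r'"score"\s*:\s*(\d+)', judge_output)
    if fallback:
        return int(fallback.group(1)), ""
    raise ValueError("no score found in judge output")
```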
### Metrics

The evaluation calculates:

- Overall Mean Score: average across all criteria and samples
- Individual Criterion Scores: detailed breakdown by evaluation aspect via the domain hierarchy

Hierarchical scoring example:

- Overall Mean: mean across all samples (aggregating by domain1 or by domain2 yields the same overall result)
- Mean “Academic & Engineering” score: aggregates all documents under this primary domain, including “Paper Outline”, “Research Proposal”, “Technical Report”, etc.
- Mean “Paper Outline” score: aggregates only documents specifically tagged as “Paper Outline” tasks

This allows comparing performance across broad categories versus specific task types.
For example:
| Mean | Academic & Engineering | Paper Outline |
| --- | --- | --- |
| 7.404 | 7.306 | 7.40 |
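A minimal sketch of this hierarchical aggregation, assuming each scored sample carries its domain tags and a per-sample mean over the five criterion scores (a hypothetical structure, not the eval's internal code):

```python
from collections import defaultdict
from statistics import mean

def hierarchical_means(scored: list[dict]) -> dict:
    # Each entry in `scored` looks like
    # {"domain1": ..., "domain2": ..., "score": ...}, where "score" is the
    # sample's mean over its five criterion scores.
    by_d1, by_d2 = defaultdict(list), defaultdict(list)
    for s in scored:
        by_d1[s["domain1"]].append(s["score"])
        by_d2[s["domain2"]].append(s["score"])
    return {
        "overall": mean(s["score"] for s in scored),
        "domain1": {k: mean(v) for k, v in by_d1.items()},
        "domain2": {k: mean(v) for k, v in by_d2.items()},
    }
```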
## Evaluation Report

The evaluation was run on several subsets of 200 samples, using anthropic/claude-3-7-sonnet-20250219 as both the main model and the judge model. These subsets cover most, but not all, of the domains. The results were within an acceptable range of the leaderboard scores at the time of evaluation.
| Eval Run | Model | Judge Model | Overall Mean | Academic & Engineering | Finance & Business | Politics & Law | Literature & Arts |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Eval Report | anthropic/claude-3-7-sonnet-20250219 | anthropic/claude-3-7-sonnet-20250219 | 7.4 | 7.31 | 7.41 | 7.32 | 7.68 |
| Leaderboard | anthropic/claude-3-7-sonnet-20250219 | anthropic/claude-3-7-sonnet-20250219 | 7.848 | 7.824 | 7.793 | 7.651 | 7.937 |
## Validation Notes

The evaluation has been tested for:

- End-to-end execution across diverse writing tasks
- Correct metric calculation and aggregation
- Proper handling of multilingual content (Chinese and English)
- Robust JSON parsing and score extraction
- Error-free evaluation logs without unexpected failures
The benchmark successfully distinguishes writing quality across different domains and provides meaningful feedback through detailed criterion-based scoring.