WorldSense: Grounded Reasoning Benchmark

Reasoning

Measures grounded reasoning over synthetic world descriptions while controlling for dataset bias. Includes three problem types (Infer, Compl, Consist) and two grades (trivial, normal).

Overview

WorldSense is a benchmark that measures reasoning over a world model while controlling for dataset bias.

Dataset

Here is an example prompt (“description”) from the dataset (problemname == "Compl.trivial"):

“Alice is enrolled in 3 courses per week: Alice takes history before computer science and economics before history. Choose one of the following alternatives: (1) Alice takes history in between economics and computer science, (2) Alice takes history outside of the time range between economics and computer science, or (3) it is impossible to decide. Think carefully, and only respond with one of these possible options (1), (2), or (3).”

There are three problem types:

  1. “Infer” (inference): Determine whether a given statement about a description is true or false.
  2. “Compl” (completion): Select which of three statements about a description is true, including an option for when it is not possible to decide.
  3. “Consist” (consistency): Determine whether a statement about a description is possible or impossible.

In addition, there are two grades:

  1. “trivial”: can be solved from the statements alone
  2. “normal”: requires a world model to solve

A problemname is formed by concatenating the type and grade as “{type}.{grade}”.
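
For concreteness, the six problemnames can be enumerated from the types and grades listed above. The list variables in this sketch are illustrative and not part of the dataset schema.

```python
# Enumerate all problemnames as "{type}.{grade}" (illustrative variable names).
types = ["Infer", "Compl", "Consist"]
grades = ["trivial", "normal"]

problemnames = [f"{t}.{g}" for t in types for g in grades]
print(problemnames)
# ['Infer.trivial', 'Infer.normal', 'Compl.trivial', 'Compl.normal',
#  'Consist.trivial', 'Consist.normal']
```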

Scoring

  • Simple accuracy
  • Standard error
  • Weighted accuracy by tuple_ID, problemname, and problemsize (as reported in the paper)
  • Weighted bias by tuple_ID, problemname, and problemsize (as reported in the paper)

In addition to the built-in metrics, the main results are weighted accuracy and bias. The primary unit is a tuple_ID, which corresponds to one “description” (scenario) as above. For each tuple_ID, up to three answer option sets are provided so that every possible correct answer is selected exactly once across the sets. Weights are used to average over multiple tuple_IDs.
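
The sketch below illustrates one way this weighting can work: each tuple_ID contributes once, regardless of how many option sets it has. The field names (`tuple_ID`, `correct`) are assumptions for illustration, not the actual record schema.

```python
from collections import defaultdict

def weighted_accuracy(items):
    """items: dicts with an assumed 'tuple_ID' and boolean 'correct' field."""
    by_tuple = defaultdict(list)
    for item in items:
        by_tuple[item["tuple_ID"]].append(1.0 if item["correct"] else 0.0)
    # Each tuple_ID contributes the mean correctness over its option sets.
    per_tuple = [sum(v) / len(v) for v in by_tuple.values()]
    return sum(per_tuple) / len(per_tuple)

items = [
    {"tuple_ID": "t1", "correct": True},
    {"tuple_ID": "t1", "correct": False},
    {"tuple_ID": "t2", "correct": True},
]
print(weighted_accuracy(items))  # 0.75: t1 counts once (0.5), t2 counts once (1.0)
```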

Each answer option is coded as positive or negative (in the example above, option 1: 1, option 2: 1, option 3: -1), and all option sets share the same arrangement of codings. Bias is calculated from these codings, weighted in the same way.
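
One reading of this is that bias reflects the weighted average coding of the options a model chooses. The sketch below is an assumption about the formula rather than a transcription of it; `chosen_coding` and `weight` are assumed field names.

```python
# Assumed bias formula: weighted mean of the +1/-1 codings of the chosen options.
def weighted_bias(records):
    total_weight = sum(r["weight"] for r in records)
    return sum(r["chosen_coding"] * r["weight"] for r in records) / total_weight

records = [
    {"chosen_coding": 1, "weight": 0.5},   # model picked a positively coded option
    {"chosen_coding": -1, "weight": 0.5},  # model picked a negatively coded option
]
print(weighted_bias(records))  # 0.0 -> no systematic preference for positive options
```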

Problem size corresponds to the number of entities in the description; in the example above, problemsize == 3. Within a problemname, the score is the grand average over problem sizes (scores are first averaged within each problem size, then averaged across sizes). If multiple problemnames are evaluated (including when none is specified), the overall score is likewise the grand average over problemnames.
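
The aggregation order can be shown as a short sketch. The mapping used here (per problemname and problemsize to a list of per-item accuracies) is an illustrative data structure, not the benchmark's actual storage format.

```python
from collections import defaultdict
from statistics import mean

def grand_average(results):
    """results: maps (problemname, problemsize) -> list of per-item accuracies."""
    # Average within each problem size, then across sizes for each problemname,
    # then across problemnames.
    by_name = defaultdict(dict)
    for (name, size), accs in results.items():
        by_name[name][size] = mean(accs)
    per_name = {name: mean(sizes.values()) for name, sizes in by_name.items()}
    return mean(per_name.values())

results = {
    ("Compl.trivial", 3): [1.0, 0.0, 1.0],
    ("Compl.trivial", 4): [1.0, 1.0],
    ("Infer.normal", 3): [0.0, 1.0],
}
print(grand_average(results))  # sizes averaged first, then problemnames
```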