# WorldSense: Grounded Reasoning Benchmark

Measures grounded reasoning over synthetic world descriptions while controlling for dataset bias. Includes three problem types (Infer, Compl, Consist) and two grades (trivial, normal).
## Overview
WorldSense is a benchmark that measures reasoning over a world model while controlling for dataset bias.
## Dataset
Here is an example prompt (“description”) from the dataset (`problemname == "Compl.trivial"`):

“Alice is enrolled in 3 courses per week: Alice takes history before computer science and economics before history. Choose one of the following alternatives: (1) Alice takes history in between economics and computer science, (2) Alice takes history outside of the time range between economics and computer science, or (3) it is impossible to decide. Think carefully, and only respond with one of these possible options (1), (2), or (3).”
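Since the prompt instructs the model to respond with exactly one of (1), (2), or (3), a response can be scored by extracting that marker. A minimal parsing sketch, assuming free-text responses (the function name and regex are illustrative, not the benchmark's actual scorer):

```python
import re

def extract_option(response: str) -> int | None:
    """Return 1, 2, or 3 if the response contains an option marker
    like "(2)" or a bare "2"; otherwise None. (Illustrative only.)"""
    match = re.search(r"\(?([123])\)?", response)
    return int(match.group(1)) if match else None

print(extract_option("(1)"))          # -> 1
print(extract_option("I choose 3."))  # -> 3
```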
There are three problem types:
- “Infer” (inference): Determine whether a given statement about a description is true or false.
- “Compl” (completion): Select which of three statements is true about a description, including an option for when it is not possible to decide.
- “Consist” (consistency): Determine whether a statement about a description is possible or impossible.
In addition, there are two grades:
- “trivial”: can be solved from the statements alone
- “normal”: requires a world model to solve
A `problemname` is formed by concatenating the two as `"{type}.{grade}"` (e.g., `Compl.trivial`).
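For concreteness, a short sketch enumerating the six resulting `problemname` values (the constant names below are illustrative, not part of the benchmark's code):

```python
from itertools import product

TYPES = ["Infer", "Compl", "Consist"]
GRADES = ["trivial", "normal"]

# Every problemname is "{type}.{grade}", giving six combinations.
PROBLEMNAMES = [f"{t}.{g}" for t, g in product(TYPES, GRADES)]
# ['Infer.trivial', 'Infer.normal', 'Compl.trivial',
#  'Compl.normal', 'Consist.trivial', 'Consist.normal']
```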
## Scoring
- Simple accuracy
- Standard error
- Weighted accuracy by `tuple_ID`, `problemname`, and `problemsize` (as reported in the paper)
- Weighted bias by `tuple_ID`, `problemname`, and `problemsize` (as reported in the paper)
In addition to the built-in metrics, the main results are weighted accuracy and weighted bias. The primary unit is a `tuple_ID`, which corresponds to one “description” (scenario) as above. For each `tuple_ID`, up to three answer option sets are provided to ensure that every possible correct answer to the options is selected exactly once. Weights are used to average over multiple `tuple_ID`s.
All answer options are coded as positive or negative (in the example above, options 1 and 2 are coded +1 and option 3 is coded -1), and all option sets share the same arrangement of option codings. Bias is calculated from these codings, weighted in the same way as accuracy.
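A hedged sketch of how such a bias score could be computed (the record layout and function name are assumptions for illustration, not the benchmark's actual code): option sets are averaged within each `tuple_ID` first, and the result is the mean over `tuple_ID`s, so 0 indicates no directional preference.

```python
import statistics
from collections import defaultdict

def weighted_bias(choices):
    """choices: iterable of (tuple_ID, chosen_coding) pairs, where
    chosen_coding is the +1/-1 coding of the option the model picked.
    (Assumed layout, for illustration only.)"""
    per_tuple = defaultdict(list)
    for tuple_id, coding in choices:
        per_tuple[tuple_id].append(coding)
    # Average option sets within each tuple_ID first, so every scenario
    # contributes equally regardless of how many option sets it has.
    return statistics.mean(
        statistics.mean(codings) for codings in per_tuple.values()
    )

# Scenario 1 balances out (mean 0), scenario 2 contributes +1,
# so the overall bias is 0.5.
print(weighted_bias([(1, +1), (1, -1), (2, +1)]))  # -> 0.5
```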
Problem size corresponds to the number of entities in the description; in the example above, `problemsize == 3`. Within a `problemname`, the score is the grand average over problem sizes (the mean of per-size means). If multiple `problemname`s are specified (or none, in which case all are included), the final score is in turn the grand average over `problemname`s. A sketch of this weighting follows.
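A minimal sketch of this grand-average weighting, assuming per-sample results carry the fields named above (the tuple layout and function name are illustrative, not the benchmark's real interface):

```python
import statistics
from collections import defaultdict

def grand_average(results):
    """results: iterable of (problemname, problemsize, tuple_ID, score)
    tuples, where score is 1.0 if the answer was correct, else 0.0.
    (Assumed layout, for illustration only.)"""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for name, size, tuple_id, score in results:
        tree[name][size][tuple_id].append(score)

    name_means = []
    for sizes in tree.values():
        size_means = []
        for tuples in sizes.values():
            # Mean over option sets within each tuple_ID,
            # then over tuple_IDs within this problem size.
            size_means.append(
                statistics.mean(statistics.mean(s) for s in tuples.values())
            )
        # Mean over problem sizes within this problemname.
        name_means.append(statistics.mean(size_means))
    # Finally, mean over problemnames.
    return statistics.mean(name_means)
```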