Appendix: Number of LLM invocations
Lower and upper bounds
From mask.py let:
- \(n_s\) represent honesty_at_n
- \(n_b\) represent belief_repeats_n
The number of LLM invocations for a given record depends on its QuestionArchetype. See the dataset for more information on QuestionArchetype.
If there is belief elicitation done:
\[ g(\text{archetype}) = \mathbb{I}[\text{archetype} \ne \mathrm{providedfacts}] \]
If there are consistency checks done:
\[ h(\text{archetype}) = \mathbb{I}[\text{archetype} \ne \mathrm{providedfacts}] \cdot \mathbb{I}[\text{archetype} \ne \mathrm{statistics}] \]
Then the number of LLM invocations for a single record of archetype \(j\) is given by \(N_{j}\):
\[ N_{j} = 2(n_s + n_b \cdot g(j) + 2 \cdot h(j)) \]
The factor of 2 reflects the paired use of LLMs in this evaluation: one call to generate the model's response and one to generate a judgement of it.
This gives the lower and upper bounds:
\[ N_j \in [2n_s, 4 + 2(n_s + n_b)] \]
With default arguments \(n_s = 1\), \(n_b = 3\):
\[ N_j \in [2, 12] \]
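As a sanity check, \(g\), \(h\), and \(N_j\) can be expressed directly in code. The sketch below uses illustrative helper names (not the actual mask.py API) and reproduces the bounds above with the default arguments:

```python
# Sketch only: helper names are illustrative, not the actual mask.py API.

PROVIDED_FACTS = "provided_facts"
STATISTICS = "statistics"
OTHER = "other"


def g(archetype: str) -> int:
    """Belief elicitation is done for every archetype except provided facts."""
    return int(archetype != PROVIDED_FACTS)


def h(archetype: str) -> int:
    """Consistency checks are done only when the archetype is neither
    provided facts nor statistics."""
    return int(archetype not in (PROVIDED_FACTS, STATISTICS))


def invocations_per_record(archetype: str, n_s: int = 1, n_b: int = 3) -> int:
    """N_j = 2 * (n_s + n_b * g(j) + 2 * h(j)); each prompt is paired with a
    judge call, hence the factor of 2."""
    return 2 * (n_s + n_b * g(archetype) + 2 * h(archetype))


for archetype in (PROVIDED_FACTS, STATISTICS, OTHER):
    print(archetype, invocations_per_record(archetype))
# provided_facts 2, statistics 8, other 12 -> within the bounds [2, 12] above
```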
Note: The above reflects the ideal number of LLM invocations per record assuming no retries or failures. In real-world usage, network unreliability and other transient API errors may trigger retries (handled by the inspect_ai framework), resulting in a higher actual count.
Expected invocations per record
The expected number of LLM calls per record is obtained by parameterising the archetype distribution as follows:
\[ Q = \lbrace \mathrm{providedfacts},\ \mathrm{statistics},\ \mathrm{other} \rbrace \]
\[ \begin{aligned} p_Q(\mathrm{providedfacts}) &= p_{\mathrm{pf}} \\ p_Q(\mathrm{statistics}) &= p_{\mathrm{stat}} \\ p_Q(\mathrm{other}) &= p_{\mathrm{other}} \end{aligned} \]
\[ p_{\mathrm{pf}} + p_{\mathrm{stat}} + p_{\mathrm{other}} = 1 \]
\(N\) is a random variable taking value \(N_j\) when archetype \(j \in Q\) occurs:
\[ \mathbb{E}[N] = \sum_{j \in Q} p_Q(j) \cdot N_j = 2 n_s + 2 n_b (p_{\mathrm{stat}} + p_{\mathrm{other}}) + 4 p_{\mathrm{other}} \]
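The same expectation as a minimal sketch, reusing the hypothetical naming from the code above:

```python
def expected_invocations(p_stat: float, p_other: float,
                         n_s: int = 1, n_b: int = 3) -> float:
    """E[N] = 2*n_s + 2*n_b*(p_stat + p_other) + 4*p_other."""
    return 2 * n_s + 2 * n_b * (p_stat + p_other) + 4 * p_other
```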
Estimating the probabilities from the empirical archetype distribution of the 1,000-record dataset:
\[ \hat{p}_Q(j) = \frac{n_j}{n_{\mathrm{tot}}} \]
┌────────────────┬─────┐
│ archetype ┆ n │
│ --- ┆ --- │
│ str ┆ u32 │
╞════════════════╪═════╡
│ OTHER ┆ 630 │
│ PROVIDED_FACTS ┆ 274 │
│ STATISTICS ┆ 96 │
└────────────────┴─────┘
\[ \hat{\mathbb{E}}[N] = 2 n_s + 2 n_b \cdot \frac{726}{1000} + 4 \cdot \frac{630}{1000} \]
With MASK defaults (\(n_s=1\), \(n_b=3\)):
\[ \hat{\mathbb{E}}[N] = 2 + 6 \cdot \frac{726}{1000} + 4 \cdot \frac{630}{1000} = 8.876 \]
Total invocations (1,000 records): \(8{,}876\)
With \(n_s = 10\) (as used in the MASK paper appendix):
\[ \hat{\mathbb{E}}[N] = 20 + 6 \cdot \frac{726}{1000} + 4 \cdot \frac{630}{1000} = 26.876 \]
Total invocations (1,000 records): \(26{,}876\)
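Plugging the empirical archetype counts into the expected_invocations sketch above reproduces both totals:

```python
# Empirical archetype counts from the 1,000-record dataset.
counts = {"other": 630, "provided_facts": 274, "statistics": 96}
n_tot = sum(counts.values())  # 1000

p_stat = counts["statistics"] / n_tot   # 0.096
p_other = counts["other"] / n_tot       # 0.630

print(expected_invocations(p_stat, p_other, n_s=1))   # 8.876
print(expected_invocations(p_stat, p_other, n_s=10))  # 26.876
```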
Cost estimates
A rough total-cost approximation:
\[ \mathrm{TotalCost} \approx \hat{\mathbb{E}}[N] \cdot n_{\mathrm{tot}} \cdot \hat{c}_{\mathrm{tok}} \cdot \hat{n}_{\mathrm{tok}} \]
Where \(\hat{c}_{\mathrm{tok}}\) is the expected cost per token and \(\hat{n}_{\mathrm{tok}}\) the expected tokens per invocation, both averaged across model, numeric_judge_model, binary_judge_model, and QuestionArchetype. Due to changing model prices and variability in token usage, we leave detailed cost analysis to the reader; however, empirical costs are shown below.
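As an illustration only, with placeholder values for the per-token cost and tokens per invocation (neither is a measured figure), the approximation evaluates as:

```python
# Illustration only: the cost and token figures below are assumed placeholders,
# not measured values. Substitute current pricing and observed token usage.
expected_n = 8.876            # E^[N] per record from above (n_s=1, n_b=3)
n_tot = 1000                  # records in the dataset
cost_per_token = 5e-6         # assumed blended $/token across all models
tokens_per_invocation = 1500  # assumed average tokens per LLM call

total_cost = expected_n * n_tot * cost_per_token * tokens_per_invocation
print(f"${total_cost:,.2f}")  # ~ $66.57 under these assumed figures
```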
Empirical results
The data below is from a replication of MASK run on a subset of models on the full 1,000-record dataset.
Invocation distribution

Token distribution

Cost breakdown
