Appendix: Number of LLM invocations
Lower and upper bounds
From `mask.py`, let:

- \(n_s\) represent `honesty_at_n`
- \(n_b\) represent `belief_repeats_n`
The number of LLM invocations for a given record is conditional on the `config`. See the dataset for more information on `config`.
Let \(g\) indicate whether belief elicitation is performed:
\[ g(\text{config}) = \mathbb{I}[\text{config} \ne \mathrm{providedfacts}] \]
Let \(h\) indicate whether consistency checks are performed:
\[ h(\text{config}) = \mathbb{I}[\text{config} \ne \mathrm{providedfacts}] \cdot \mathbb{I}[\text{config} \ne \mathrm{statistics}] \]
Then the number of LLM invocations for a single record of config \(j\) is given by \(N_{j}\):
\[ N_{j} = 2(n_s + n_b \cdot g(j) + 2 \cdot h(j)) \]
The factor of 2 reflects the paired use of LLMs in this evaluation: one call to generate a solve and one to generate a judgement.
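As a minimal Python sketch of this counting rule (the function name and config strings are illustrative, not taken from `mask.py`):

```python
def llm_invocations_per_record(config: str, n_s: int = 1, n_b: int = 3) -> int:
    """Ideal number of LLM invocations for one record (no retries).

    `n_s` corresponds to honesty_at_n and `n_b` to belief_repeats_n.
    The config strings used here are illustrative placeholders.
    """
    g = config != "provided_facts"                      # belief elicitation performed?
    h = config not in ("provided_facts", "statistics")  # consistency checks performed?
    # Each prompt costs two calls: one to solve and one to judge (the factor of 2).
    return 2 * (n_s + n_b * int(g) + 2 * int(h))
```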
This gives lower and upper bounds of:
\[ N_j \in [2n_s, 4 + 2(n_s + n_b)] \]
With default arguments \(n_s = 1\), \(n_b = 3\):
\[ N_j \in [2, 12] \]
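Using the sketch above, these bounds can be checked directly (config names remain illustrative):

```python
assert llm_invocations_per_record("provided_facts", n_s=1, n_b=3) == 2   # lower bound
assert llm_invocations_per_record("statistics", n_s=1, n_b=3) == 8
assert llm_invocations_per_record("other", n_s=1, n_b=3) == 12           # upper bound
```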
Note: The above analysis reflects the ideal number of LLM invocations per record assuming no retries or failures. In real-world usage, network unreliability and other transient API errors may trigger retries (which are handled by the inspect_ai framework), resulting in a higher actual number of calls.
Expected number of LLM invocations per record
The expected number of LLM calls per record is obtained by parameterising the `config` distribution as follows:
\[ C = \{\mathrm{providedfacts},\ \mathrm{statistics},\ \mathrm{other}\} \]
\[ p_C(\mathrm{providedfacts}) = p_{\mathrm{pf}} \]
\[ p_C(\mathrm{statistics}) = p_{\mathrm{stat}} \]
\[ p_C(\mathrm{other}) = p_{\mathrm{other}} \]
\[ p_{\mathrm{pf}} + p_{\mathrm{stat}} + p_{\mathrm{other}} = 1 \]
\(N\) is a random variable taking value \(N_j\) when config \(j \in C\) occurs.
\[ \mathbb{E}[N] = \sum_{j \in C} p_C(j) \cdot N_j \]
\[ \mathbb{E}[N] = 2 n_s + 2 n_b (p_{stat} + p_{other}) + 4 p_{other} \]
Estimate the probabilities as \(\hat{p}_C(j) = \frac{n_j}{n_{tot}}\), where \(n_j\) is the empirical count of config type \(j\) and \(n_{tot}\) is the total number of records in the sample.
\[ \mathbb{\hat{E}}[N] = 2 n_s + 2 n_b \cdot \frac{n_{\text{stat}} + n_{\text{other}}}{n_{\text{tot}}} + 4 \cdot \frac{n_{\text{other}}}{n_{\text{tot}}} \]
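Continuing the illustrative sketch, this estimator is a weighted sum of \(N_j\) over the empirical config counts:

```python
def expected_invocations_per_record(
    counts: dict[str, int], n_s: int = 1, n_b: int = 3
) -> float:
    """Empirical estimate of E[N]: sum over configs of (n_j / n_tot) * N_j."""
    n_tot = sum(counts.values())
    return sum(
        (n_j / n_tot) * llm_invocations_per_record(config, n_s, n_b)
        for config, n_j in counts.items()
    )
```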
The empirical distribution used in this evaluation, based on the 1,000-record dataset, is:
| config | count |
| --- | --- |
| OTHER | 630 |
| PROVIDED_FACTS | 274 |
| STATISTICS | 96 |
Therefore, the expected number of LLM calls per record is:
\[ \mathbb{\hat{E}}[N | n_s, n_b] = 2 n_s + 2 n_b \cdot \frac{726}{1000} + 4 \cdot \frac{630}{1000} \]
With MASK defaults
\[ \mathbb{\hat{E}}[N | n_s=1, n_b=3] = 2 + 6 \cdot \frac{726}{1000} + 4 \cdot \frac{630}{1000} = 8.876 \]
Total invocations (1,000 records): \(8,876\)
With \(n_s = 10\) (as used in the MASK paper appendix)
\[ \mathbb{\hat{E}}[N | n_s=10, n_b=3] = 20 + 6 \cdot \frac{726}{1000} + 4 \cdot \frac{630}{1000} = 26.876 \]
Total invocations (1,000 records): \(26,876\)
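These figures can be reproduced with the illustrative estimator above and the empirical counts from the table:

```python
counts = {"provided_facts": 274, "statistics": 96, "other": 630}

print(expected_invocations_per_record(counts, n_s=1, n_b=3))   # 8.876
print(expected_invocations_per_record(counts, n_s=10, n_b=3))  # 26.876
# Multiplying by the 1,000 records gives 8,876 and 26,876 total invocations.
```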
Note on estimating LLM invocation costs
To estimate the total cost of an evaluation run, you may wish to parameterise further by:

- The `model` assessed in the evaluation.
- The `numeric_judge_model` and the `binary_judge_model`.
- The distribution of the number of tokens per LLM call, which may vary:
  - Across different `configs`
  - Between different models
  - Across different invocation types
For rough estimates, it is often sufficient to use average values for token count and cost per token. A potentially useful approximation (sketched in code after the definitions below), which assumes that the average number of tokens per invocation and the cost per token are roughly stable across invocation types and models, is:
\[ \mathrm{TotalCost} \approx \hat{\mathbb{E}}[N] \cdot n_{tot} \cdot \hat{c}_{tok} \cdot \hat{n}_{tok} \]
Where:

- \(\hat{\mathbb{E}}[N]\) is the expected number of LLM invocations per record, defined in Expected number of LLM invocations per record.
- \(n_{tot}\) is the number of records, as defined in Expected number of LLM invocations per record.
- \(\hat{c}_{tok}\) is the expected cost per token (across all models {`model`, `numeric_judge_model`, `binary_judge_model`} and `configs`).
- \(\hat{n}_{tok}\) is the expected number of tokens per invocation (across all models and `configs` as above).
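A minimal sketch of this approximation (the function name, and the prices and token counts in the usage example, are purely illustrative):

```python
def estimate_total_cost(
    expected_invocations: float,   # \hat{E}[N], e.g. from the estimator above
    n_records: int,                # n_tot, the number of records
    avg_cost_per_token: float,     # \hat{c}_tok, average cost per token (e.g. USD)
    avg_tokens_per_call: float,    # \hat{n}_tok, average tokens per invocation
) -> float:
    """Rough total-cost estimate assuming stable per-token cost and call length."""
    return expected_invocations * n_records * avg_cost_per_token * avg_tokens_per_call


# Hypothetical figures: 8.876 calls/record, 1,000 records,
# $2e-06 per token, and 1,500 tokens per call on average.
print(estimate_total_cost(8.876, 1_000, 2e-06, 1_500))  # ~26.63 (USD)
```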
Due to changing model prices and variability in token usage, we leave detailed cost analysis to the reader.