Scorers
Overview
Scorers evaluate whether solvers were successful in finding the right output for the target defined in the dataset, and in what measure. Scorers generally take one of the following forms:

- Extracting a specific answer out of a model’s completion output using a variety of heuristics.
- Applying a text similarity algorithm to see if the model’s completion is close to what is set out in the target.
- Using another model to assess whether the model’s completion satisfies a description of the ideal answer in target.
- Using another rubric entirely (e.g. did the model produce a valid version of a file format, etc.)
Scorers also define one or more metrics which are used to aggregate scores (e.g. accuracy(), which computes what percentage of scores are correct, or mean(), which provides an average for scores that exist on a continuum).
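As a rough illustration of what a metric does, here is a simplified sketch of accuracy-style aggregation over correct/incorrect grade values (this is not the library’s implementation, just the idea):

```python
# simplified sketch: aggregate "C"/"I" grade values into a proportion,
# as a metric like accuracy() does internally
def simple_accuracy(values: list[str]) -> float:
    scores = [1.0 if v == "C" else 0.0 for v in values]
    return sum(scores) / len(scores)

print(simple_accuracy(["C", "C", "I", "C"]))  # 0.75
```

The built-in accuracy() metric additionally knows how to handle partial credit and numeric values.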
Built-In Scorers
Inspect includes some simple text matching scorers as well as a couple of model graded scorers. Built in scorers can be imported from the inspect_ai.scorer module. Below is a summary of these scorers. There is not (yet) reference documentation on these functions so the best way to learn about how they can be customised, etc. is to use the Go to Definition command in your source editor.

includes()
Determine whether the target from the Sample appears anywhere inside the model output. Can be case sensitive or insensitive (defaults to the latter).

match()
Determine whether the target from the Sample appears at the beginning or end of model output (defaults to looking at the end). Has options for ignoring case, white-space, and punctuation (all are ignored by default).

pattern()
Extract the answer from model output using a regular expression.

answer()
Scorer for model output that precedes answers with “ANSWER:”. Can extract letters, words, or the remainder of the line.

model_graded_qa()
Have another model assess whether the model output is a correct answer based on the grading guidance contained in target. Has a built-in template that can be customised.

model_graded_fact()
Have another model assess whether the model output contains a fact that is set out in target. This is a more narrow assessment than model_graded_qa(), and is used when model output is too complex to be assessed using a simple match() or pattern() scorer.
Scorers provide one or more built-in metrics (each of the scorers above provides accuracy as a metric). You can also provide your own custom metrics in Task definitions. For example:

```python
Task(
    dataset=dataset,
    plan=[
        system_message(SYSTEM_MESSAGE),
        multiple_choice()
    ],
    scorer=match(),
    metrics=[custom_metric()]
)
```
Model Graded
Model graded scorers are well suited to assessing open ended answers as well as factual answers that are embedded in a longer narrative. The built-in model graded scorers can be customised in several ways—you can also create entirely new model scorers (see the model graded example below for a starting point).
Here is the declaration for the model_graded_qa() function:

```python
@scorer(metrics=[accuracy(), bootstrap_std()])
def model_graded_qa(
    template: str | None = None,
    instructions: str | None = None,
    grade_pattern: str | None = None,
    partial_credit: bool = False,
    model: list[str | Model] | str | Model | None = None,
) -> Scorer:
    ...
```
The default model graded QA scorer is tuned to grade answers to open ended questions. The default template and instructions ask the model to produce a grade in the format GRADE: C or GRADE: I, and this grade is extracted using the default grade_pattern regular expression. The grading is by default done with the model currently being evaluated. There are a few ways you can customise the default behaviour:
- Provide alternate instructions. The default instructions ask the model to use chain of thought reasoning and provide grades in the format GRADE: C or GRADE: I. Note that if you provide instructions that ask the model to format grades in a different way, you will also want to customise the grade_pattern.
- Specify partial_credit = True to prompt the model to assign partial credit to answers that are not entirely right but come close (metrics by default convert this to a value of 0.5). Note that this parameter is only valid when using the default instructions.
- Specify an alternate model to perform the grading (e.g. a more powerful model or a model fine tuned for grading).
- Specify a different template. Note that templates are passed these variables: question, criterion, answer, and instructions.
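To make the relationship between instructions and grade_pattern concrete, here is a sketch of how a custom pattern would extract a grade from a grader model’s completion (the alternate grade format shown here is purely illustrative, not an Inspect default):

```python
import re

# hypothetical alternate grade format ("Grade: correct"/"Grade: incorrect")
# and a matching pattern; both are illustrative examples
custom_grade_pattern = r"Grade:\s*(correct|incorrect)"

completion = "The answer covers the key points.\nGrade: correct"
match = re.search(custom_grade_pattern, completion)
print(match.group(1))  # correct
```

If your instructions asked for this format, you would pass the matching pattern via the grade_pattern parameter.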
The model_graded_fact() scorer works identically to model_graded_qa(), and simply provides an alternate template oriented around judging whether a fact is included in the model output.

If you want to understand how the default templates for model_graded_qa() and model_graded_fact() work, see their source code.
Multiple Models
The built-in model graded scorers also support using multiple grader models (whereby the final grade is chosen by majority vote). For example, here we specify that 3 models should be used for grading:

```python
model_graded_qa(
    model=[
        "google/gemini-1.0-pro",
        "anthropic/claude-3-opus-20240229",
        "together/meta-llama/Llama-3-70b-chat-hf",
    ]
)
```

The implementation of multiple grader models takes advantage of the multi_scorer() and majority_vote() functions, both of which can be used in your own scorers (as described in the Multi Scorer section below).
Custom Scorers
Custom scorers are functions that take a TaskState and Target, and yield a Score.

```python
async def score(state: TaskState, target: Target):
    # Compare state / model output with target
    # to yield a score
    return Score(value=...)
```
First we’ll talk about the core Score and Value objects, then provide some examples of custom scorers to make things more concrete.

Note that score() above is declared as an async function. When creating custom scorers, it’s critical that you understand Inspect’s concurrency model. More specifically, if your scorer is doing non-trivial work (e.g. calling REST APIs, executing external processes, etc.) please review Parallelism before proceeding.
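As a toy sketch of why scorers are async, the following shows how async functions allow scoring work (which often involves model calls or REST requests) to run concurrently rather than sequentially. This is plain asyncio, not the Inspect API; real scorers receive TaskState/Target rather than strings:

```python
import asyncio

# toy scorer-like function; the sleep stands in for a model call
async def score_one(completion: str, target: str) -> float:
    await asyncio.sleep(0)
    return 1.0 if target in completion else 0.0

# score many samples concurrently via asyncio.gather
async def score_all(pairs: list[tuple[str, str]]) -> list[float]:
    return await asyncio.gather(*(score_one(c, t) for c, t in pairs))

results = asyncio.run(score_all([
    ("Paris is the capital of France", "Paris"),
    ("I am not sure", "Paris"),
]))
print(results)  # [1.0, 0.0]
```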
Score
The components of Score include:

| Field | Type | Description |
|---|---|---|
| value | Value | Value assigned to the sample (e.g. “C” or “I”, or a raw numeric value). |
| answer | str | Text extracted from model output for comparison (optional). |
| explanation | str | Explanation of score, e.g. full model output or grader model output (optional). |
| metadata | dict[str,Any] | Additional metadata about the score to record in the log file (optional). |
For example, the following are all valid Score objects:

```python
Score(value="C")
Score(value="I")
Score(value=0.6)

Score(
    value="C" if extracted == target.text else "I",
    answer=extracted,
    explanation=state.output.completion
)
```
If you are extracting an answer from within a completion (e.g. looking for text using a regex pattern, looking at the beginning or end of the completion, etc.) you should strive to always return an answer as part of your Score, as this makes it much easier to understand the details of scoring when viewing the eval log file.
Value
Value is a union over the main scalar types as well as a list or dict of the same types:

```python
Value = Union[
    str | int | float | bool,
    list[str | int | float | bool],
    dict[str, str | int | float | bool],
]
```
The vast majority of scorers will use str (e.g. for correct/incorrect via “C” and “I”) or float (the other types are there to meet more complex scenarios). One thing to keep in mind is that whatever Value type you use in a scorer must be supported by the metrics declared for the scorer (more on this below).
Next, we’ll take a look at the source code for a couple of the built in scorers as a jumping off point for implementing your own scorers. If you are working on custom scorers, you should also review the Scorer Workflow section below for tips on optimising your development process.
Example: Includes
Here is the source code for the built-in includes() scorer:

```python
@scorer(metrics=[accuracy(), bootstrap_std()])  # (1)
def includes(ignore_case: bool = True):
    async def score(state: TaskState, target: Target):  # (2)
        # check for correct
        answer = state.output.completion
        target = target.text  # (3)
        if ignore_case:
            correct = answer.lower().rfind(target.lower()) != -1
        else:
            correct = answer.rfind(target) != -1

        # return score
        return Score(
            value=CORRECT if correct else INCORRECT,  # (4)
            answer=answer  # (5)
        )

    return score
```
1. The function applies the @scorer decorator and registers two metrics for use with the scorer.
2. The score() function is declared as async. This is so that it can participate in Inspect’s optimised scheduling for expensive model generation calls (this scorer doesn’t call a model but others will).
3. We make use of the text property on the Target. This is a convenience property to get a simple text value out of the Target (as targets can technically be a list of strings).
4. We use the special constants CORRECT and INCORRECT for the score value, as the accuracy() and bootstrap_std() metrics know how to convert these special constants to float values (1.0 and 0.0 respectively).
5. We provide the full model completion as the answer for the score (answer is optional, but highly recommended as it is often useful to refer to during evaluation development).
Example: Model Grading
Here’s a somewhat simplified version of the code for the model_graded_qa() scorer:

```python
@scorer(metrics=[accuracy(), bootstrap_std()])
def model_graded_qa(
    template: str = DEFAULT_MODEL_GRADED_QA_TEMPLATE,
    instructions: str = DEFAULT_MODEL_GRADED_QA_INSTRUCTIONS,
    grade_pattern: str = DEFAULT_GRADE_PATTERN,
    model: str | Model | None = None,
) -> Scorer:
    # resolve grading template and instructions
    # (as they could be file paths or URLs)
    template = resource(template)
    instructions = resource(instructions)

    # resolve model
    grader_model = get_model(model)

    async def score(state: TaskState, target: Target) -> Score:
        # format the model grading template
        score_prompt = template.format(
            question=state.input_text,
            answer=state.output.completion,
            criterion=target.text,
            instructions=instructions,
        )

        # query the model for the score
        result = await grader_model.generate(score_prompt)

        # extract the grade
        match = re.search(grade_pattern, result.completion)
        if match:
            return Score(
                value=match.group(1),
                answer=match.group(0),
                explanation=result.completion,
            )
        else:
            return Score(
                value=INCORRECT,
                explanation="Grade not found in model output: "
                + f"{result.completion}",
            )

    return score
```
Note that the call to grader_model.generate() is done with await. This is critical to ensure that the scorer participates correctly in the scheduling of generation work.

Note also that we use the input_text property of the TaskState to access a string version of the original user input to substitute into the grading template. Using input_text has two benefits: (1) it is guaranteed to cover the original input from the dataset (rather than a transformed prompt in messages); and (2) it normalises the input to a string (as it could have been a message list).
Multi Scorer
It’s possible to use multiple scorers in parallel, then combine their output into a final overall score. This is done using the multi_scorer() function. For example, this is roughly how the built in model graders use multiple models for grading:

```python
multi_scorer(
    scorers=[model_graded_qa(model=model) for model in models],
    reducer=majority_vote
)
```
Use of multi_scorer() requires both a list of scorers as well as a reducer, which is a function that takes a list of scores and turns it into a single score. In this case we use the built in majority_vote() reducer, which returns the score that appeared most frequently in the answers.
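The core of majority voting can be sketched in a few lines of plain Python (this is an illustration of the idea, not the library’s majority_vote() implementation):

```python
from collections import Counter

# illustrative majority vote over grade values: return the most
# frequently occurring value
def majority(values: list[str]) -> str:
    return Counter(values).most_common(1)[0][0]

print(majority(["C", "I", "C"]))  # C
```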
You can imagine a variety of different strategies for reducing scores (take the average, take the high or low, majority vote, etc.). For example, here’s a reducer that computes the average score:
```python
import numpy as np

def average_score(scores: list[Score]) -> Score:
    values = [score.as_float() for score in scores]
    avg = np.mean(values).item()
    return Score(
        value=avg,
        explanation=f"average of {', '.join(str(v) for v in values)}"
    )
```
Further, you will need to wrap your use of multi_scorer() inside a @scorer decorated function (with the requisite metrics specified). For example:

```python
@scorer(metrics=[mean()])
def multi_model_graded(models):
    return multi_scorer(
        scorers=[model_graded_qa(model=model) for model in models],
        reducer=average_score
    )
```
Metrics
Each scorer provides one or more built-in metrics (typically accuracy and bootstrap_std). In addition, you can specify other metrics (either built-in or custom) to compute when defining a Task:

```python
Task(
    dataset=dataset,
    plan=[
        system_message(SYSTEM_MESSAGE),
        multiple_choice()
    ],
    scorer=match(),
    metrics=[custom_metric()]
)
```
Built-In Metrics
Inspect includes some simple built in metrics for calculating accuracy, mean, etc. Built in metrics can be imported from the inspect_ai.scorer module. Below is a summary of these metrics. There is not (yet) reference documentation on these functions so the best way to learn about how they can be customised, etc. is to use the Go to Definition command in your source editor.

accuracy()
Compute proportion of total answers which are correct. For correct/incorrect scores assigned 1 or 0, can optionally assign 0.5 for partially correct answers.

mean()
Mean of all scores.

var()
Variance over all scores.

bootstrap_std()
Standard deviation of a bootstrapped estimate of the mean. 1000 samples are taken by default (modify this using the num_samples option).
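The idea behind bootstrap_std() can be sketched with the standard library: repeatedly resample the scores with replacement, compute the mean of each resample, and take the standard deviation of those means (this is an illustration of the technique, not the library’s implementation):

```python
import random
import statistics

# illustrative bootstrapped standard deviation of the mean;
# num_samples mirrors the option described above
def bootstrap_std_sketch(values: list[float], num_samples: int = 1000) -> float:
    rng = random.Random(0)  # seeded so the example is deterministic
    means = [
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(num_samples)
    ]
    return statistics.pstdev(means)

est = bootstrap_std_sketch([1.0, 0.0, 1.0, 1.0, 0.0])
print(round(est, 3))
```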
Custom Metrics
You can also add your own metrics with @metric decorated functions. For example, here is the implementation of the variance metric:

```python
import numpy as np
from inspect_ai.scorer import Metric, Score, metric

@metric
def var() -> Metric:
    """Compute variance over all scores."""
    def metric(scores: list[Score]) -> float:
        return np.var([score.as_float() for score in scores]).item()
    return metric
```
Note that the Score class contains a Value that is a union over several scalar and collection types. As a convenience, Score includes a set of accessor methods to treat the value as a simpler form (e.g. above we use the score.as_float() accessor).
Workflow
Score Command
By default, model output in evaluations is automatically scored. However, you can separate generation and scoring by using the --no-score option. For example:

```bash
inspect eval popularity.py --model openai/gpt-4 --no-score
```
You can score an evaluation previously run this way using the inspect score command:

```bash
# score last eval
inspect score popularity.py

# score specific log file
inspect score popularity.py ./logs/2024-02-23_task_gpt-4_TUhnCn473c6.json
```
Using a distinct scoring step is particularly useful during scorer development, as it bypasses the entire generation phase, saving lots of time and inference costs.
Log Overwriting
By default, inspect score overwrites the file it scores. If you don’t want to overwrite target files, pass the --no-overwrite flag:

```bash
inspect score popularity.py --no-overwrite
```
When specifying --no-overwrite, a -scored suffix will be added to the original log file name:

```
./logs/2024-02-23_task_gpt-4_TUhnCn473c6-scored.json
```
Note that the --no-overwrite flag does not apply to log files that already have the -scored suffix; those files are always overwritten by inspect score. If you plan on scoring multiple times and you want to save each scoring output, you will want to copy the log to another location before re-scoring.
Python API
If you are exploring the performance of different scorers, you might find it more useful to call the score() function using varying scorers or scorer options. For example:

```python
log = eval(popularity, model="openai/gpt-4")[0]

grader_models = [
    "openai/gpt-4",
    "anthropic/claude-3-opus-20240229",
    "google/gemini-1.0-pro",
    "mistral/mistral-large-latest"
]

scoring_logs = [score(log, model_graded_qa(model=model))
                for model in grader_models]

plot_results(scoring_logs)
```