Tools

Overview

Many models now have the ability to interact with client-side Python functions in order to expand their capabilities. This enables you to equip models with your own set of custom tools so they can perform a wider variety of tasks.

Inspect natively supports registering Python functions as tools and providing these tools to models that support them (currently OpenAI, Claude 3, Google Gemini, and Mistral). Inspect also includes one built-in tool (web search).

Tools and Agents

One application of tools is to run them within an agent scaffold that pursues an objective over multiple interactions with a model. The scaffold uses the model to help make decisions about which tools to use and when, and orchestrates calls to the model to use the tools. This is covered in more depth in the Agents section.

Tool Basics

To demonstrate the use of tools, we’ll define a simple tool that adds two numbers. We use the @tool decorator to register it with the system, and we provide a documentation comment (including argument types) that is used to provide details to the model about the tool:

@tool(prompt="""
    If you are given a math problem of any kind,
    please use the add tool to compute the result."""
)
def add():
    async def execute(x: int, y: int):
        """
        Tool for adding two numbers.

        Args:
            x (int): First number to add.
            y (int): Second number to add.

        Returns:
            The sum of the two numbers.
        """
        return x + y

    return execute

We can use this tool in an evaluation by passing it to the use_tools() Solver:

@task
def addition_problem():
    return Task(
        dataset=[Sample(input="What is 1 + 1?", target=["2"])],
        plan=[
            use_tools(add()), 
            generate()
        ],
        scorer=match(numeric=True),
    )

Note that this tool doesn’t make network requests or do heavy computation, so is fine to run as inline Python code. If your tool does do more elaborate things, you’ll want to make sure it plays well with Inspect’s concurrency scheme. For network requests, this amounts to using async HTTP calls with httpx. For heavier computation, tools should use subprocesses as described in the next section.

Note that when using tools with models, the models do not call the Python function directly. Rather, the model generates a structured request which includes function parameters, and then Inspect calls the function and returns the result to the model.

Tool Errors

If an exception occurs during tool execution, it will by default result in a runtime error that exits the evaluation task. This is often the correct behaviour (as a malfunctioning tool invalidates the assumptions of the eval), however sometimes there are “expected” errors. For example, if you give the model a tool to execute bash or python code, it may submit invalid code. In this case you want to report the error to the model and have it adapt.

You can report errors to the model from tools by raising a ToolError exception. These exceptions will not terminate the eval, but rather will report back to the model and that an error has occurred and allow it to recover. For example, here is the code for the built in bash() tool:

@tool(prompt="Use the bash tool for shell commands.")
def bash(timeout: int | None = None) -> Tool:
   
    async def execute(cmd: str) -> str:
        """
        Execute a bash command.

        Args:
          cmd (str): The bash command to execute.

        Returns:
          The output of the command.
        """
        result = await tool_environment().exec(
            cmd=["bash", "-c", cmd], 
            timeout=timeout
        )
        if result.success:
            return result.stdout
        else:
            raise ToolError(result.stderr)

    return execute

We expect that some bash commands will result in an error status, so we explicitly raise ToolError rather than a normal exception.

Tool Choice

By default models will use a tool if they think it’s appropriate for the given task. You can override this behaviour using the tool_choice parameter of the use_tools() Solver. For example:

# let the model decide whether to use the tool
use_tools(addition(), tool_choice="auto")

# force the use of a tool
use_tools(addition(), tool_choice=ToolFunction(name="addition"))

# prevent use of tools
use_tools(addition(), tool_choice="none")

The last form (tool_choice="none") would typically be used to turn off tool usage after an initial generation where the tool used. For example:

plan = [
  use_tools(addition(), tool_choice=ToolFunction(name="addition")),
  generate(),
  follow_up_prompt(),
  use_tools(tool_choice="none"),
  generate()
]

Built-In Tools

Inspect has several built-in tools, including:

web_search(), which uses the Google Search API to execute and summarise web searches.
bash() and python(), for executing arbitrary shell and Python code.

Web Search

The web_search() tool provides models the ability to enhance their context window by performing a search. By default web searches retrieve 10 results from a provider, uses a model to determine if the contents is relevant then returns the top 3 relevant search results to the main model. Here is the definition of the web_search() function:

def web_search(
    provider: Literal["google"] = "google",
    num_results: int = 3,
    max_provider_calls: int = 3,
    max_connections: int = 10,
    model: str | Model | None = None,
) -> Tool:
    ...

You can use the web_search() tool in a plan like this:

plan=[
    use_tools(web_search()), 
    generate()
],

Web search options include:

provider—Web search provider (currently only Google is supported, see below for instructions on setup and configuration for Google).
num_results—How many search results to return to the main model (defaults to 5).
max_provider_calls—Number of times to retrieve more links from the search provider in case previous ones were irrelevant (defaults to 3).
max_connections—Maximum number of concurrent connections to the search API provider (defaults to 10).
model—Model to use to determine if search results are relevant (defaults to the model currently being evaluated).

Google Provider

The web_search() tool uses Google Programmable Search Engine. To use it you will therefore need to setup your own Google Programmable Search Engine and also enable the Programmable Search Element Paid API. Then, ensure that the following environment variables are defined:

GOOGLE_CSE_ID — Google Custom Search Engine ID
GOOGLE_CSE_API_KEY — Google API key used to enable the Search API

Bash and Python

The bash() and python() tools enable execution of arbitrary shell commands and Python code, respectively. These tools require the use of a Tool Environment, which can provide sandboxing for untrusted code. For example, here is how you might use them in an evaluation where the model is asked to write code in order to solve capture the flag (CTF) challenges:

CMD_TIMEOUT = 180

@task
def intercode_ctf():
    return Task(
        dataset=read_dataset(),
        plan=[
            system_message("system.txt"),
            use_tools([
                bash(CMD_TIMEOUT), 
                python(CMD_TIMEOUT)
            ]),
            generate(),
        ],
        scorer=includes(),
        max_messages=30,
        tool_environment="docker",
    )

We specify a 3-minute timeout for execution of the bash and python tools to ensure that they don’t perform extremely long running operations.

See the Agents section for more details on how to build evaluations that allow models to take arbitrary actions over a longer time horizon.