Caching

Overview

Caching stores model output so that identical requests can be served without additional API calls, saving both time and expense. Caching is also often useful during development; for example, when you are iterating on a scorer you may want model outputs served from the cache both to save time and to increase determinism.

Caching Basics

Use the cache parameter on calls to generate() to enable use of the cache. The cache key (what determines whether a request can be fulfilled from the cache) is composed of the following:

  • Model name and base URL (e.g. openai/gpt-4-turbo)
  • Model prompt (i.e. message history)
  • Epoch number (for ensuring distinct generations per epoch)
  • Generate configuration (e.g. temperature, top_p, etc.)
  • Active tools and tool_choice

If all of these inputs are identical, then the model response will be served from the cache. By default, model responses are cached for 1 week (see Cache Policy below for details on customising this).
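
To illustrate, here is a minimal sketch of these cache keys in action, using the lower-level model API covered below (the prompt and temperature values are arbitrary):

from inspect_ai.model import GenerateConfig, get_model

model = get_model("openai/gpt-4-turbo")

# identical model, prompt, and config: the second call is served from the cache
first = await model.generate("Explain the theory of mind.", cache=True)
second = await model.generate("Explain the theory of mind.", cache=True)

# changing any generate config option (e.g. temperature) yields a different
# cache key, so this call goes to the model rather than the cache
third = await model.generate(
    "Explain the theory of mind.",
    config=GenerateConfig(temperature=0.7),
    cache=True,
)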

For example, here we are iterating on our self critique template, so we cache the main call to generate():

@task
def theory_of_mind():
    return Task(
        dataset=example_dataset("theory_of_mind"),
        plan=[
            chain_of_thought(),
            generate(cache=True),
            self_critique(CRITIQUE_TEMPLATE)
        ],
        scorer=model_graded_fact(),
    )

You can similarly do this with the generate function passed into a Solver:

@solver
def custom_solver(cache):

    async def solve(state, generate):

        # (custom solver logic prior to generate)

        return await generate(state, cache=cache)

    return solve

You don’t strictly need to provide a cache argument for a custom solver that uses caching, but it’s generally good practice to enable users of the function to control caching behaviour.
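
A task could then opt in to caching via this solver. For example (a sketch reusing custom_solver along with the dataset and scorer from the earlier example):

@task
def theory_of_mind():
    return Task(
        dataset=example_dataset("theory_of_mind"),
        plan=[custom_solver(cache=True)],
        scorer=model_graded_fact(),
    )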

You can also use caching with lower-level generate() calls (e.g. on a model instance you have obtained with get_model()). For example:

model = get_model("anthropic/claude-3-opus-20240229")
output = model.generate(input, cache = True)

Model Versions

The model name (e.g. openai/gpt-4-turbo) is used as part of the cache key. Note though that many model names are aliases to specific model versions. For example, gpt-4 and gpt-4-turbo may resolve to different model versions over time as updates are released.

If you want cache entries to be invalidated when the model is updated, it’s better to use an explicitly versioned model name. For example:

$ inspect eval ctf.py --model openai/gpt-4-turbo-2024-04-09

If you do this, then when a new version of gpt-4-turbo is deployed and you update your model name accordingly, the changed name produces a new cache key, so calls go to the model rather than being resolved from the cache.
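
The same explicitly versioned model name can be used when obtaining a model directly in Python:

model = get_model("openai/gpt-4-turbo-2024-04-09")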

Cache Policy

By default, if you specify cache=True then cached entries will expire after 1 week. You can customise this by passing a CachePolicy rather than a boolean. For example:

cache = CachePolicy(expiry="3h")
cache = CachePolicy(expiry="4D")
cache = CachePolicy(expiry="2W")
cache = CachePolicy(expiry="3M")

You can use s (seconds), m (minutes), h (hours), D (days), W (weeks), M (months), and Y (years) as abbreviations for expiry values.

If you want the cache to never expire, specify expiry=None. For example:

cache = CachePolicy(expiry=None)

You can also define custom scopes for the cache (e.g. scoping entries to a specific task or usage pattern). Use the scopes parameter to add named scopes to the cache key:

cache = CachePolicy(
    expiry="1M",
    scopes={"role": "attacker", "team": "red"}
)

As noted above, caching is by default done per epoch (i.e. each epoch has its own cache scope). You can disable the default behaviour by setting per_epoch=False. For example:

cache = CachePolicy(per_epoch=False)
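
Note that a CachePolicy can be passed anywhere a boolean cache argument is accepted. For example, here is a sketch combining the options above with the lower-level generate() call shown earlier:

cache = CachePolicy(expiry="1M", per_epoch=False)
output = await model.generate(input, cache=cache)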

Management

Use the inspect cache command to view the current contents of the cache, prune expired entries, or clear entries entirely. For example:

# list the current contents of the cache
$ inspect cache list

# clear the cache (globally or by model)
$ inspect cache clear
$ inspect cache clear --model openai/gpt-4-turbo-2024-04-09

# list and prune expired entries from the cache
$ inspect cache list --pruneable
$ inspect cache prune
$ inspect cache prune --model openai/gpt-4-turbo-2024-04-09

See inspect cache --help for further details on management commands.