Data Processors

Transforming loaded data for Bayesian modelling

Processors in hibayes transform your loaded data into the format required by your statistical models. They extract features, create indices for categorical variables, and prepare coordinates for posterior analysis. hibayes ships with several built-in processors, and you can also write your own.

What makes up a processor?

Processors operate on the AnalysisState and transform the processed_data DataFrame into model-ready features, coords, and dims. They return the updated AnalysisState.

from hibayes.process import DataProcessor, process

from jax import numpy as jnp


@process
def my_processor(
    feature_name: str = "score",  # (1)
) -> DataProcessor:
    """
    A processor which extracts a feature from the data.
    """
    def process(state, display):  # (2)
        if feature_name not in state.processed_data.columns:
            raise ValueError(f"Column '{feature_name}' not found")  # (3)

        features = {
            "obs": jnp.array(state.processed_data[feature_name].values)
        }

        if state.features:
            state.features.update(features)  # (4)
        else:
            state.features = features

        if display:
            display.logger.info(f"Extracted '{feature_name}' -> 'obs'")  # (5)

        return state
    return process

The numbered notes below refer to the example above:

1: The outer function's arguments (here, feature_name) are what the user can configure through the config file.
2: The inner function receives the AnalysisState and an optional display object for logging.
3: Raise an error with a helpful message when a required column is missing.
4: Update existing features rather than overwriting them, so that processors can be chained.
5: Use the display logger to report progress during processing.
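
To see why note 4 matters, the sketch below chains two processors on a stand-in state. FakeState and extract_column are hypothetical names used only for illustration and are not part of hibayes; the real AnalysisState holds a DataFrame and more fields than shown here.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class FakeState:
    """Hypothetical stand-in for AnalysisState (not the hibayes class)."""
    processed_data: Dict[str, List[Any]]
    features: Optional[Dict[str, Any]] = None


def extract_column(column: str, target: str) -> Callable[[FakeState], FakeState]:
    """Build a processor that copies one column into features[target]."""
    def run(state: FakeState) -> FakeState:
        if column not in state.processed_data:
            raise ValueError(f"Column '{column}' not found")
        new = {target: list(state.processed_data[column])}
        # Merge into existing features so earlier processors' output survives.
        if state.features:
            state.features.update(new)
        else:
            state.features = new
        return state
    return run


state = FakeState(processed_data={"score": [1, 0, 1], "difficulty": [0.2, 0.9, 0.5]})
for processor in (extract_column("score", "obs"), extract_column("difficulty", "difficulty")):
    state = processor(state)

print(sorted(state.features))  # ['difficulty', 'obs']
```

If the second processor assigned state.features directly instead of updating it, the "obs" entry from the first processor would be lost.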

Built-in processors

Feature extraction

extract_observed_feature - Extract a single column as the observed outcome (obs). Use for simple cases where you just need the response variable.

extract_features - The main processor for GLM-style models. Extracts:

Categorical features: factorised indices, counts, coords, and dims
Continuous features: raw arrays (optionally standardised)
Interactions: categorical × categorical dims, continuous × categorical slopes

Supports effect coding, custom reference categories, custom category ordering, and train/test splitting.
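
As a rough illustration of what the categorical outputs look like, the sketch below factorises one column by hand. The helper factorise is hypothetical, the key names follow the model_index / num_model pattern used elsewhere on this page, and the ordering rule (alphabetical, with the reference category moved to position 0) is an assumption rather than a description of how extract_features orders categories.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def factorise(
    name: str,
    values: Sequence[str],
    reference: Optional[str] = None,
) -> Tuple[Dict[str, object], Dict[str, List[str]]]:
    """Turn a categorical column into integer indices plus named coords."""
    categories = sorted(set(values))
    if reference is not None:
        if reference not in categories:
            raise ValueError(f"Reference '{reference}' not in column '{name}'")
        # Assumed convention: the reference category sits at position 0.
        categories.remove(reference)
        categories.insert(0, reference)
    position = {cat: i for i, cat in enumerate(categories)}
    features = {
        f"{name}_index": [position[v] for v in values],
        f"num_{name}": len(categories),
    }
    coords = {name: categories}
    return features, coords


features, coords = factorise(
    "model", ["gpt-4", "claude", "gpt-4", "gemini"], reference="gpt-4"
)
print(features)  # {'model_index': [0, 1, 0, 2], 'num_model': 3}
print(coords)    # {'model': ['gpt-4', 'claude', 'gemini']}
```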

Data transformation

map_columns - Rename columns in the data to match expected names (e.g., rename "accuracy" to "score").

groupby - Aggregate data by specified columns for binomial models. Creates n_correct and n_total columns from individual score observations.

drop_rows_with_missing_features - Remove rows with missing values in specified columns.
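
The aggregation that groupby performs can be pictured as follows. This is a plain-Python sketch, assuming a binary score column where 1 means correct; the function aggregate_scores and the row layout are hypothetical and do not reflect the hibayes implementation.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def aggregate_scores(
    rows: Sequence[Dict[str, object]],
    groupby_columns: Sequence[str],
    score_column: str = "score",
) -> List[Dict[str, object]]:
    """Collapse per-observation scores into n_correct / n_total per group."""
    totals: Dict[Tuple[object, ...], List[int]] = defaultdict(lambda: [0, 0])
    for row in rows:
        key = tuple(row[col] for col in groupby_columns)
        totals[key][0] += int(row[score_column])  # successes
        totals[key][1] += 1                       # trials
    return [
        {**dict(zip(groupby_columns, key)), "n_correct": c, "n_total": n}
        for key, (c, n) in totals.items()
    ]


rows = [
    {"model": "a", "task": "t1", "score": 1},
    {"model": "a", "task": "t1", "score": 0},
    {"model": "b", "task": "t1", "score": 1},
]
grouped = aggregate_scores(rows, ["model", "task"])
for group in grouped:
    print(group)
```

The resulting n_correct and n_total pair is the shape a binomial likelihood expects: a number of successes out of a number of trials for each group.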

Configuring processors

Processors are configured in your hibayes.yaml under the process section:

process:
  processors:
    - map_columns:
        column_mapping:
          accuracy: score
          llm: model
    - groupby:
        groupby_columns: ["model", "task"]
    - extract_observed_feature:
        feature_name: n_correct
    - extract_features:
        categorical_features: ["model", "task"]
        interactions: true

Or use the simpler list format for processors with default arguments:

process:
  processors:
    - extract_observed_feature
    - extract_features
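
Both shapes describe the same thing: a processor name plus optional keyword arguments. The sketch below normalises a parsed processors list into (name, kwargs) pairs. It works on the Python objects a YAML loader would typically produce, and it only illustrates the two formats; normalise_processors is a hypothetical helper, not the hibayes config loader, and mixing both shapes in one list is done here purely for demonstration.

```python
from typing import Any, Dict, List, Tuple, Union

Entry = Union[str, Dict[str, Dict[str, Any]]]


def normalise_processors(entries: List[Entry]) -> List[Tuple[str, Dict[str, Any]]]:
    """Convert string and single-key dict entries into (name, kwargs) pairs."""
    result: List[Tuple[str, Dict[str, Any]]] = []
    for entry in entries:
        if isinstance(entry, str):
            # Bare name: the processor runs with its default arguments.
            result.append((entry, {}))
        elif isinstance(entry, dict) and len(entry) == 1:
            # Single-key mapping: the key is the name, the value holds arguments.
            ((name, kwargs),) = entry.items()
            result.append((name, dict(kwargs or {})))
        else:
            raise ValueError(f"Unrecognised processor entry: {entry!r}")
    return result


parsed = [
    "extract_observed_feature",
    {"extract_features": {"categorical_features": ["model", "task"]}},
]
normalised = normalise_processors(parsed)
print(normalised)
```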

Train/test splitting

The extract_features processor supports train/test data splitting via the test parameter. This is useful for model validation and out-of-sample prediction.

process:
  processors:
    - extract_features:
        categorical_features: ["model", "task"]
        continuous_features: ["difficulty"]
        test: is_test  # Column name containing True/False for test rows
        standardise: true

When test is specified:

Features are split: Training features go to state.features, test features go to state.test_features
Standardisation uses train statistics: The mean and standard deviation of each continuous feature are computed from the training data only, and the same transformation is then applied to both train and test sets
Category ordering is consistent: Categories are ordered based on the full dataset, ensuring train and test use the same index mappings
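
The standardisation rule above can be sketched in a few lines. This is an illustration only: standardise_split is a hypothetical helper, and the use of the population standard deviation (pstdev) is an assumption, since this page does not say which estimator hibayes uses.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def standardise_split(
    values: Sequence[float],
    is_test: Sequence[bool],
) -> Tuple[List[float], List[float]]:
    """Standardise train and test values using train-only mean and std."""
    train = [v for v, t in zip(values, is_test) if not t]
    test = [v for v, t in zip(values, is_test) if t]
    mu = mean(train)
    sigma = pstdev(train)  # assumed estimator; see the note above
    if sigma == 0:
        raise ValueError("Training values have zero variance")
    # The same (mu, sigma) is applied to both subsets.
    return [(v - mu) / sigma for v in train], [(v - mu) / sigma for v in test]


difficulty = [1.0, 3.0, 5.0, 9.0]
is_test = [False, False, False, True]
train_z, test_z = standardise_split(difficulty, is_test)
print(train_z)
print(test_z)
```

Because the test row never contributes to mu or sigma, no information from held-out data leaks into the features the model is fitted on.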

Custom processors

To use a custom processor, create a Python file with your processor definition and reference it in your config:

process:
  path: path/to/my_processors.py
  processors:
    - my_processor:
        feature_name: custom_score

The @process decorator registers your function so it can be accessed by name in the config.

Features, coords, and dims

Processors populate three key structures on the AnalysisState:

Features: A dictionary of arrays passed to the NumPyro model (e.g., {"obs": [...], "model_index": [...], "num_model": 3})
Coords: Named coordinates for ArviZ (e.g., {"model": ["gpt-4", "claude", "gemini"]})
Dims: Dimension names for parameters (e.g., {"model_effects": ["model"]})

These enable meaningful posterior summaries with named indices rather than integer positions.
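
To make the benefit concrete, the sketch below labels a vector of per-category values using coords and dims. The numbers are placeholders, not real results, and label_parameter is a hypothetical helper; in practice ArviZ performs this labelling when it is given coords and dims.

```python
from typing import Dict, List, Sequence

coords = {"model": ["gpt-4", "claude", "gemini"]}
dims = {"model_effects": ["model"]}


def label_parameter(
    name: str,
    values: Sequence[float],
    coords: Dict[str, List[str]],
    dims: Dict[str, List[str]],
) -> Dict[str, float]:
    """Attach coordinate labels to a one-dimensional parameter."""
    (dim,) = dims[name]  # this sketch only handles 1-D parameters
    labels = coords[dim]
    if len(labels) != len(values):
        raise ValueError(
            f"'{name}' has {len(values)} values but '{dim}' has {len(labels)} labels"
        )
    return dict(zip(labels, values))


# Placeholder values for illustration only.
labelled = label_parameter("model_effects", [0.1, -0.3, 0.2], coords, dims)
print(labelled)  # {'gpt-4': 0.1, 'claude': -0.3, 'gemini': 0.2}
```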

Built-in processors reference

| Processor | Purpose | Key parameters |
|---|---|---|
| `extract_observed_feature` | Extract a single column as the observed outcome (`obs`) | `feature_name`, `test` (optional split column) |
| `extract_features` | Full GLM feature extraction with categorical/continuous support | `categorical_features`, `continuous_features`, `interactions`, `effect_coding`, `standardise`, `test` |
| `map_columns` | Rename columns to match expected names | `column_mapping` (dict) |
| `groupby` | Aggregate data for binomial models | `groupby_columns` (creates `n_correct`, `n_total`) |
| `drop_rows_with_missing_features` | Remove rows with missing values | `features` (list of columns to check) |