Data Processors
Processors in hibayes transform your loaded data into the format required by your statistical models. They extract features, create indices for categorical variables, and prepare coordinates for posterior analysis. Several built-in processors are available, and you can also write your own.
What makes up a processor?
Processors operate on the AnalysisState and transform the processed_data DataFrame into model-ready features, coords, and dims. They return the updated AnalysisState.
from hibayes.process import DataProcessor, process
from jax import numpy as jnp


@process
def my_processor(
    feature_name: str = "score",  # (1)
) -> DataProcessor:
    """
    A processor which extracts a feature from the data.
    """

    def process(state, display):  # (2)
        if feature_name not in state.processed_data.columns:
            raise ValueError(f"Column '{feature_name}' not found")  # (3)

        features = {
            "obs": jnp.array(state.processed_data[feature_name].values)
        }
        if state.features:
            state.features.update(features)  # (4)
        else:
            state.features = features

        if display:
            display.logger.info(f"Extracted '{feature_name}' -> 'obs'")  # (5)
        return state

    return process

1. Any arguments the user can configure through the config file.
2. The inner function receives the AnalysisState and an optional display for logging.
3. Provide helpful error messages when required columns are missing.
4. Update existing features rather than overwriting them, because processors can be chained.
5. Use the display logger to provide feedback during processing.
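The point about updating features rather than overwriting them is easiest to see with two processors run back to back. The following is a minimal sketch that does not use hibayes: `SketchState` is a hypothetical stand-in for AnalysisState with only the two attributes used above, `extract_column` is an illustrative helper, and a list of dicts stands in for the DataFrame.

```python
from dataclasses import dataclass, field


@dataclass
class SketchState:
    """Hypothetical stand-in for AnalysisState (not the hibayes class)."""

    processed_data: list
    features: dict = field(default_factory=dict)


def extract_column(column: str, target: str):
    """Build a processor that copies one column into state.features[target]."""

    def run(state: SketchState) -> SketchState:
        if not all(column in row for row in state.processed_data):
            raise ValueError(f"Column '{column}' not found")
        # update() merges into whatever earlier processors already stored.
        state.features.update(
            {target: [row[column] for row in state.processed_data]}
        )
        return state

    return run


state = SketchState(
    processed_data=[
        {"score": 1, "difficulty": 0.2},
        {"score": 0, "difficulty": 0.9},
    ]
)

# Run the processors in sequence; each one receives the previous state.
for processor in (
    extract_column("score", "obs"),
    extract_column("difficulty", "difficulty"),
):
    state = processor(state)

print(sorted(state.features))  # ['difficulty', 'obs']
```

Had the second processor assigned a fresh dictionary to `state.features`, the `obs` entry written by the first would have been lost.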
Built-in processors
Feature extraction
extract_observed_feature - Extract a single column as the observed outcome (obs). Use for simple cases where you just need the response variable.
extract_features - The main processor for GLM-style models. Extracts:
- Categorical features: factorised indices, counts, coords, and dims
- Continuous features: raw arrays (optionally standardised)
- Interactions: categorical × categorical dims, continuous × categorical slopes
Supports effect coding, custom reference categories, custom category ordering, and train/test splitting.
Data transformation
map_columns - Rename columns in the data to match expected names (e.g., rename "accuracy" to "score").
groupby - Aggregate data by specified columns for binomial models. Creates n_correct and n_total columns from individual score observations.
drop_rows_with_missing_features - Remove rows with missing values in specified columns.
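The aggregation that groupby performs can be pictured as follows. This is a plain-Python sketch of the idea, not the hibayes implementation: the `model` and `task` columns and the sample rows are invented for illustration, and it assumes each `score` is 0 or 1.

```python
from collections import defaultdict

# Hypothetical per-item results: one row per scored sample.
rows = [
    {"model": "a", "task": "math", "score": 1},
    {"model": "a", "task": "math", "score": 0},
    {"model": "a", "task": "math", "score": 1},
    {"model": "b", "task": "math", "score": 1},
]
groupby_columns = ["model", "task"]

# Accumulate successes and trials for each combination of group values.
totals = defaultdict(lambda: {"n_correct": 0, "n_total": 0})
for row in rows:
    key = tuple(row[col] for col in groupby_columns)
    totals[key]["n_correct"] += row["score"]
    totals[key]["n_total"] += 1

# One output row per group, in the shape a binomial likelihood expects.
aggregated = [
    {**dict(zip(groupby_columns, key)), **counts}
    for key, counts in sorted(totals.items())
]

for row in aggregated:
    print(row)
```

Each aggregated row carries the number of successes and the number of trials, which is why this step pairs with binomial models rather than per-observation Bernoulli ones.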
Configuring processors
Processors are configured in your hibayes.yaml under the process section:
Or use the simpler list format for processors with default arguments:
Train/test splitting
The extract_features processor supports train/test data splitting via the test parameter. This is useful for model validation and out-of-sample prediction.
When test is specified:
- Features are split: training features go to state.features and test features go to state.test_features.
- Standardisation uses train statistics: continuous features are standardised using the mean and standard deviation of the training data only, and that same transformation is then applied to both the train and test sets.
- Category ordering is consistent: Categories are ordered based on the full dataset, ensuring train and test use the same index mappings
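The three rules above can be sketched in plain Python. This is an illustration under stated assumptions, not the hibayes code: the `is_test` flag column, the `model` and `difficulty` columns, and the use of the population standard deviation are all choices made for this example.

```python
from statistics import fmean, pstdev

# Hypothetical rows; "is_test" marks held-out observations.
rows = [
    {"model": "b", "difficulty": 1.0, "is_test": False},
    {"model": "a", "difficulty": 3.0, "is_test": False},
    {"model": "a", "difficulty": 5.0, "is_test": True},
    {"model": "c", "difficulty": 2.0, "is_test": True},
]

# Category order comes from the full dataset, so indices agree across splits
# even for a category ("c") that only appears in the test rows.
categories = sorted({row["model"] for row in rows})
index_of = {name: i for i, name in enumerate(categories)}

train = [row for row in rows if not row["is_test"]]
test = [row for row in rows if row["is_test"]]

# Mean and standard deviation are computed from training rows only.
train_values = [row["difficulty"] for row in train]
mean = fmean(train_values)
std = pstdev(train_values)  # assumption: population standard deviation


def build_features(split):
    """Apply the shared index mapping and train statistics to one split."""
    return {
        "model_index": [index_of[row["model"]] for row in split],
        "difficulty": [(row["difficulty"] - mean) / std for row in split],
    }


features = build_features(train)
test_features = build_features(test)

print(features)
print(test_features)
```

Because the test values are scaled with the training mean and standard deviation, they are not centred on zero themselves; that is expected and avoids leaking test information into the fit.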
Custom processors
To use a custom processor, create a Python file with your processor definition and reference it in your config:
The @process decorator registers your function so it can be accessed by name in the config.
Features, Coords, and Dims
Processors populate three key structures on the AnalysisState:
- Features: A dictionary of arrays passed to the NumPyro model (e.g., {"obs": [...], "model_index": [...], "num_model": 3})
- Coords: Named coordinates for ArviZ (e.g., {"model": ["gpt-4", "claude", "gemini"]})
- Dims: Dimension names for parameters (e.g., {"model_effects": ["model"]})
These enable meaningful posterior summaries with named indices rather than integer positions.
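To show how the three structures relate, here is a small sketch that builds them for a single categorical column. It uses plain Python lists in place of arrays and reuses the key names from the examples above; the sample data and the first-seen category ordering are assumptions made for illustration, and hibayes may order categories differently.

```python
# Hypothetical observations: which model produced each score.
models = ["gpt-4", "claude", "gemini", "claude", "gpt-4"]
scores = [1, 0, 1, 1, 0]

# Unique labels in first-seen order (dict keys preserve insertion order).
labels = list(dict.fromkeys(models))
index_of = {label: i for i, label in enumerate(labels)}

features = {
    "obs": scores,
    "model_index": [index_of[m] for m in models],  # integer code per row
    "num_model": len(labels),  # length of the per-model effect vector
}
coords = {"model": labels}  # label for each index position
dims = {"model_effects": ["model"]}  # parameter name -> named dimension

# Position i of a "model_effects" parameter corresponds to coords["model"][i].
for i, label in enumerate(coords["model"]):
    print(i, label)
```

The model only ever sees the integer codes in `model_index`; coords and dims carry the labels alongside so that posterior summaries can be reported by name.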
Built-in Processors Reference
| Processor | Purpose | Key Parameters |
|---|---|---|
| extract_observed_feature | Extract a single column as observed outcome (obs) | feature_name, test (optional split column) |
| extract_features | Full GLM feature extraction with categorical/continuous support | categorical_features, continuous_features, interactions, effect_coding, standardise, test |
| map_columns | Rename columns to match expected names | column_mapping (dict) |
| groupby | Aggregate data for binomial models | groupby_columns (creates n_correct, n_total) |
| drop_rows_with_missing_features | Remove rows with missing values | features (list of columns to check) |