Analysis State

The AnalysisState is the central data structure that flows through the entire hibayes pipeline. It carries your data, processed features, fitted models, and the results from checkers and communicators from one stage to the next.

AnalysisState

The main container for your analysis. It holds:

  • data: The raw loaded DataFrame
  • processed_data: Transformed data after processors run
  • features: Extracted arrays for model fitting (e.g., obs, model_index)
  • coords: Named coordinates for ArviZ (e.g., {"model": ["gpt-4", "claude"]})
  • dims: Dimension mappings for parameters (e.g., {"model_effects": ["model"]})
  • models: List of ModelAnalysisState objects (one per fitted model)
  • communicate: Plots and tables generated by communicators

from hibayes.analysis_state import AnalysisState

# Access data at different stages
state.data              # raw loaded data
state.processed_data    # after processors run
state.features          # extracted model features
state.coords            # coordinate labels
state.dims              # dimension names

# Access specific items
state.feature("obs")           # get a specific feature
state.coord("model")           # get coordinate labels for "model"
state.models                   # list of ModelAnalysisState objects
state.get_model("two_level_group_binomial")  # get model by name

# Get the best model by a diagnostic metric
best = state.get_best_model(with_respect_to="elpd_waic")

Adding outputs

# Add plots and tables for the communicate stage
state.add_plot(fig, "posterior_forest")
state.add_table(df, "model_comparison")

# Add logs for a specific stage
state.add_log("Processed 1000 rows", stage="process")

Persistence

The AnalysisState can be saved and loaded for reproducibility:

from pathlib import Path

# Save entire analysis
state.save(Path("./results"))

# Load from disk
state = AnalysisState.load(Path("./results"))

Automatic saving between stages

When using the CLI (hibayes full --config hibayes.yaml --out ./results), the state is automatically saved after each major stage. This enables:

  • Checkpointing: Resume from a failed run without restarting from scratch
  • Inspection: Examine intermediate results at any point
  • DVC integration: Track outputs with version control

The CLI pipeline (see src/hibayes/cli/full.py) saves after each stage:

analysis_state = load_data(config=config.data_loader, display=display)
analysis_state = process_data(analysis_state=analysis_state, ...)
analysis_state.save(path=out)

analysis_state = model(analysis_state=analysis_state, ...)
analysis_state.save(path=out)

analysis_state = communicate(analysis_state=analysis_state, ...)
analysis_state.save(path=out)
  1. After processing: data, processed_data, features, coords, dims saved
  2. After modelling: models with inference_data, diagnostics added
  3. After communicate: plots and tables added to communicate/

Each save() call overwrites the previous state, building up the complete analysis incrementally.
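This overwrite-and-extend checkpoint pattern can be sketched with the standard library. The following is an illustration of the idea only, not hibayes internals:

```python
import pickle
import tempfile
from pathlib import Path


def save_checkpoint(state: dict, out: Path) -> None:
    """Overwrite the checkpoint so it always reflects the latest stage."""
    out.mkdir(parents=True, exist_ok=True)
    (out / "state.pkl").write_bytes(pickle.dumps(state))


def load_checkpoint(out: Path) -> dict:
    return pickle.loads((out / "state.pkl").read_bytes())


out = Path(tempfile.mkdtemp())

# Stage 1: load data, then checkpoint
state = {"data": [1, 2, 3]}
save_checkpoint(state, out)

# Stage 2: process, then checkpoint again (overwrites the previous save)
state["processed"] = [x * 2 for x in state["data"]]
save_checkpoint(state, out)

# Crash recovery: a later run resumes from the last checkpoint
resumed = load_checkpoint(out)
print(resumed["processed"])  # [2, 4, 6]
```

Because each checkpoint contains the full accumulated state, resuming never requires replaying earlier stages.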

Frequent saves (default)

By default, the CLI saves state after each model fit and each communicator run. This provides:

  • Crash recovery: If a long-running analysis fails mid-way, you won’t lose completed work
  • Progress monitoring: Inspect partial results while the analysis is still running
  • Debugging: Identify exactly which step caused a failure

To disable frequent saves (save only at stage boundaries):

hibayes-full --config config.yaml --out ./results --no-frequent-save
hibayes-model --config config.yaml --analysis-state ./input --out ./results --no-frequent-save
hibayes-communicate --config config.yaml --analysis-state ./input --out ./results --no-frequent-save

Incremental saves

When frequent saves are enabled, hibayes uses incremental saving to avoid repeatedly writing large, unchanged files. After the initial save, subsequent saves skip:

  • data.parquet and processed_data.parquet (data doesn’t change after processing)
  • features.pkl, coords.json, dims.json (static after extraction)
  • model.pkl for each model (model function doesn’t change)
  • inference_data.nc if the posterior samples haven’t changed

Files that may change are always saved:

  • diagnostics.json (checkers add new diagnostics)
  • Diagnostic figures and summary files
  • inference_data.nc when new groups are added (e.g., posterior predictive samples)

This typically provides a 10x or greater speedup for frequent saves compared to full saves, making crash recovery essentially free in terms of performance overhead.

# Programmatic usage
state.save(path, incremental=False)  # Full save (default)
state.save(path, incremental=True)   # Skip unchanged files
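The skip logic behind incremental saves can be sketched with content hashing: write a file only when its bytes differ from what is already on disk. This is a minimal illustration of the technique, not the hibayes implementation:

```python
import hashlib
import tempfile
from pathlib import Path


def save_if_changed(path: Path, payload: bytes) -> bool:
    """Write payload only if on-disk content differs; return True if written."""
    if path.exists() and hashlib.sha256(path.read_bytes()).digest() == hashlib.sha256(payload).digest():
        return False  # unchanged: skip the (potentially large) write
    path.write_bytes(payload)
    return True


out = Path(tempfile.mkdtemp())
data = b"posterior samples..."

first = save_if_changed(out / "inference_data.nc", data)   # True: initial save
second = save_if_changed(out / "inference_data.nc", data)  # False: unchanged, skipped

# Appending new groups (e.g. posterior predictive) changes the bytes, so it is rewritten
third = save_if_changed(out / "inference_data.nc", data + b" + posterior predictive")
print(first, second, third)  # True False True
```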

Folder layout

results/
├── data.parquet
├── processed_data.parquet
├── features.pkl
├── coords.json
├── dims.json
├── logs/
│   ├── logs_load.txt
│   ├── logs_process.txt
│   └── logs_model.txt
├── communicate/
│   ├── forest_plot.png
│   └── summary_table.csv
└── models/
    └── two_level_group_binomial/
        ├── metadata.json
        ├── model_config.json
        ├── inference_data.nc
        └── diagnostics.json
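Because the layout uses standard formats (Parquet, JSON, NetCDF, pickle), the artifacts can be inspected without hibayes. For example, a model's diagnostics.json can be read with the standard library; the folder and file contents below are illustrative:

```python
import json
import tempfile
from pathlib import Path

# Illustrative results folder; in practice this is the --out directory
model_dir = Path(tempfile.mkdtemp()) / "models" / "two_level_group_binomial"
model_dir.mkdir(parents=True)
(model_dir / "diagnostics.json").write_text(
    json.dumps({"rhat_max": 1.01, "elpd_waic": -123.4})  # example values
)

diagnostics = json.loads((model_dir / "diagnostics.json").read_text())
print(diagnostics["rhat_max"])  # 1.01
```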

ModelAnalysisState

Each fitted model gets its own ModelAnalysisState, which stores:

  • model: The NumPyro model function
  • model_config: Configuration (fit settings, link function, tag)
  • features: Model-specific features (inherited from AnalysisState if not set)
  • coords/dims: Coordinate and dimension info for ArviZ
  • inference_data: ArviZ InferenceData with posterior samples
  • diagnostics: Results from checkers (e.g., R-hat, WAIC)
  • is_fitted: Whether the model has been fitted

# Access model state
model_state = state.get_model("two_level_group_binomial")

model_state.model_name       # name (+ tag if set)
model_state.model            # the NumPyro model function
model_state.model_config     # ModelConfig object
model_state.inference_data   # ArviZ InferenceData
model_state.is_fitted        # True if fitted

# Access diagnostics from checkers
model_state.diagnostics                    # all diagnostics dict
model_state.diagnostic("rhat_max")         # specific diagnostic
model_state.add_diagnostic("my_stat", 0.5) # add custom diagnostic

# Get features for prior predictive (obs set to None)
model_state.prior_features

# Link function for transforming parameters
prob = model_state.link_function(logit_values)

Using in custom components

When writing custom processors, checkers, or communicators, you receive the state and return it after modifications:

from hibayes.process import DataProcessor, process


@process
def my_processor() -> DataProcessor:
    def processor(state, display):  # named to avoid shadowing the imported @process decorator
        # Read from state
        df = state.processed_data

        # Modify and update
        state.features = {"obs": ...}
        state.coords = {"group": [...]}

        return state

    return processor
Always return the state: it flows to the next stage in the pipeline.