MORU (Moral Reasoning under Uncertainty)

What This Eval Does

Tests whether LLMs demonstrate moral reasoning under uncertainty across ethical dilemmas involving alien lifeforms (Europa), human welfare considerations, and digital mind welfare. Uses a per-dimension scoring approach where a grader LLM evaluates responses independently on each applicable ethical dimension.

Datasets

  • Questions: sentientfutures/moru-benchmark on HuggingFace
    • 201 unique questions (record_to_sample factory deduplicates by (id, language))
    • 3 languages: English (en), Malay (ms), Hindi (hi) — 67 questions per language
    • Key fields: question, dimension_tags, language
  • Dimensions: sentientfutures/moru-benchmark-dimensions on HuggingFace
    • 17 scoring dimensions (e.g., “Moral Consideration”, “Harm Minimization”, “Epistemic Humility”)
    • Each has: dimension, guiding_question, scoring, observable_indicators
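
A minimal sketch of pulling both datasets with the datasets library (repo IDs as above; the split name and revision tags are assumptions, and the eval itself goes through the inspect_evals.utils.huggingface wrappers instead):

from datasets import load_dataset

# Questions: one record per (id, language); key fields are
# question, dimension_tags, and language.
questions = load_dataset(
    "sentientfutures/moru-benchmark",
    split="train",     # assumed split name
    revision="1.2.0",  # pin a dataset tag (see Dataset Versioning below)
)

# Scoring dimensions: dimension, guiding_question, scoring, observable_indicators.
dimensions = load_dataset(
    "sentientfutures/moru-benchmark-dimensions",
    split="train",     # assumed split name
    revision="1.1.0",
)

# e.g. keep only the English questions
english = questions.filter(lambda r: r["language"] == "en")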

Scoring Architecture (follows ahb pattern)

  1. Dimensions loaded from separate HF dataset into MoruDimension objects
  2. Sample targets store dimension tags as JSON: {"tags": [...]}
  3. Custom scorer (moru_scorer in scorer.py) evaluates each applicable dimension independently using model_graded_qa with a per-dimension prompt template containing the guiding question and observable indicators
  4. Score aggregation: per-dimension scores are averaged across graders, then a simple mean across dimensions gives the overall score (sketched after this list)
  5. Score value is a dict: {"overall": 0.75, "Moral Consideration": 0.8, ...}
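
A sketch of the aggregation in steps 4 and 5, assuming each grader returns a score in [0, 1] per dimension (function and variable names here are illustrative, not the actual scorer.py internals):

from statistics import mean

def aggregate(grader_scores: dict[str, list[float]]) -> dict[str, float]:
    # grader_scores maps each applicable dimension to one score per grader, e.g.
    # {"Moral Consideration": [0.8, 0.8], "Harm Minimization": [0.7, 0.75]}
    per_dimension = {dim: mean(scores) for dim, scores in grader_scores.items()}
    # Overall is an unweighted mean of the per-dimension averages.
    return {"overall": mean(per_dimension.values()), **per_dimension}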

File Layout

File         Purpose
moru.py      @task definition. Params: dataset_repo_id, dimensions_repo_id, dataset_revision, dimensions_revision, grader_models, grader_max_connections, grader_temperature, grader_max_tokens, grader_max_retries, epochs, language, shuffle
dataset.py   Loads questions + dimensions from HuggingFace via hf_dataset/load_dataset utils. Core: load_dataset_from_hf(), load_dimensions(), record_to_sample() (factory pattern)
scorer.py    moru_scorer — per-dimension evaluation with PER_DIMENSION_PROMPT template (includes dataset revision)
types.py     Pydantic model: MoruDimension
metrics.py   Two metrics: overall_mean, avg_by_dimension
utils.py     remove_nones() helper for building GenerateConfig (sketched below)
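
remove_nones() is presumably a small dict-filtering helper; a plausible sketch (the real signature in utils.py may differ):

from inspect_ai.model import GenerateConfig

def remove_nones(**kwargs):
    # Drop unset options so GenerateConfig only receives explicit values.
    return {k: v for k, v in kwargs.items() if v is not None}

config = GenerateConfig(**remove_nones(temperature=0.0, max_tokens=None))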

Dataset Versioning

Tags are managed via huggingface_hub (install it into the project venv if unavailable; do not install it into the system Python). To cut a new tag after pushing dataset changes to main:

uv run python -c "
from huggingface_hub import HfApi
api = HfApi()
# Tag both repos; substitute the next version numbers and a real change description.
api.create_tag('sentientfutures/moru-benchmark', repo_type='dataset', tag='1.2.0', tag_message='Description of change')
api.create_tag('sentientfutures/moru-benchmark-dimensions', repo_type='dataset', tag='1.1.0', tag_message='Description of change')
"

Dataset versions are decoupled from the eval version — bump dataset tags only when data changes. When cutting a notable eval release, note the current dataset tags in the CHANGELOG entry.
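
Assuming the task is registered as inspect_evals/moru, pinned tags can then be passed through the task parameters at eval time, e.g.:

uv run inspect eval inspect_evals/moru \
  -T dataset_revision=1.2.0 \
  -T dimensions_revision=1.1.0 \
  --model openai/gpt-4o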

Important Details

  • Sample IDs are {id}_{language} to ensure uniqueness across languages (see the sketch after this list)
  • Records with no dimension tags, or whose tags are all unknown after valid_dimensions filtering, raise ValueError — these indicate a data integrity issue
  • The scorer uses get_model(role="grader") — tests must pass grader_models=["mockllm/model"]
  • The language parameter on the task and load_dataset_from_hf() filters to a single language; None loads all
  • Registered in _registry.py and listing.yaml under the Safeguards group
  • No weighting is applied — all dimensions contribute equally to the overall score
  • Uses hf_dataset/load_dataset from inspect_evals.utils.huggingface for exponential backoff on HF rate limits
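
A sketch of the record_to_sample() factory showing the ID scheme, tag filtering, and the ValueError mentioned above (names and details are illustrative; dataset.py is authoritative):

import json
from inspect_ai.dataset import Sample

def record_to_sample(valid_dimensions: set[str]):
    def _record_to_sample(record: dict) -> Sample:
        # Keep only tags that name a known scoring dimension.
        tags = [t for t in record["dimension_tags"] if t in valid_dimensions]
        if not tags:
            # No usable tags is a data integrity problem, not a skippable record.
            raise ValueError(f"Record {record['id']} has no valid dimension tags")
        return Sample(
            id=f"{record['id']}_{record['language']}",  # unique across languages
            input=record["question"],
            target=json.dumps({"tags": tags}),
        )
    return _record_to_sample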