MORU (Moral Reasoning under Uncertainty)
What This Eval Does
Tests whether LLMs demonstrate moral reasoning under uncertainty across ethical dilemmas involving alien lifeforms (Europa), human welfare considerations, and digital mind welfare. Uses a per-dimension scoring approach where a grader LLM evaluates responses independently on each applicable ethical dimension.
Datasets
- Questions: `sentientfutures/moru-benchmark` on HuggingFace
  - 201 unique questions (`record_to_sample` factory deduplicates by `(id, language)`)
  - 3 languages: English (`en`), Malay (`ms`), Hindi (`hi`), with 67 questions per language
  - Key fields: `question`, `dimension_tags`, `language`
- Dimensions: `sentientfutures/moru-benchmark-dimensions` on HuggingFace
  - 17 scoring dimensions (e.g., “Moral Consideration”, “Harm Minimization”, “Epistemic Humility”)
  - Each has: `dimension`, `guiding_question`, `scoring`, `observable_indicators`
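
The deduplication by `(id, language)` can be pictured as a closure that remembers which keys it has already seen. This is only a minimal sketch of the factory idea, not the code in `dataset.py`: the name `make_record_filter` and the record layout below are illustrative assumptions.

```python
from typing import Any, Callable, Optional

Record = dict[str, Any]


def make_record_filter() -> Callable[[Record], Optional[Record]]:
    """Hypothetical factory: returns a function that keeps only the first
    record seen for each (id, language) pair."""
    seen: set[tuple[Any, str]] = set()

    def keep_first(record: Record) -> Optional[Record]:
        key = (record["id"], record["language"])
        if key in seen:
            return None  # duplicate of an earlier record, drop it
        seen.add(key)
        return record

    return keep_first


# Illustrative records only; the real dataset rows carry more fields.
records = [
    {"id": 1, "language": "en", "question": "..."},
    {"id": 1, "language": "ms", "question": "..."},
    {"id": 1, "language": "en", "question": "..."},  # repeats (1, "en")
]
keep_first = make_record_filter()
unique = [r for r in records if keep_first(r) is not None]
```

The state lives in the closure, so each call to the factory starts with an empty `seen` set and separate loads do not interfere with each other.
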
Scoring Architecture (follows ahb pattern)
- Dimensions loaded from separate HF dataset into `MoruDimension` objects
- Sample targets store dimension tags as JSON: `{"tags": [...]}`
- Custom scorer (`moru_scorer` in `scorer.py`) evaluates each applicable dimension independently using `model_graded_qa` with a per-dimension prompt template containing the guiding question and observable indicators
- Score aggregation: per-dimension scores averaged across graders, then simple mean for overall score
- Score value is a dict: `{"overall": 0.75, "Moral Consideration": 0.8, ...}`
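
The aggregation step described above can be sketched in plain Python. This is a rough illustration under assumptions, not the logic in `scorer.py`: the function name `aggregate_scores` and the input shape (dimension name mapped to one score per grader) are hypothetical.

```python
from statistics import mean


def aggregate_scores(grader_scores: dict[str, list[float]]) -> dict[str, float]:
    """Hypothetical helper: average each dimension across graders, then take
    an unweighted mean of the per-dimension averages as the overall score."""
    per_dimension = {
        dimension: mean(scores) for dimension, scores in grader_scores.items()
    }
    overall = mean(list(per_dimension.values()))
    return {"overall": overall, **per_dimension}


# Two graders scoring two applicable dimensions (made-up numbers).
value = aggregate_scores(
    {
        "Moral Consideration": [1.0, 0.5],
        "Epistemic Humility": [0.5, 0.5],
    }
)
# value["Moral Consideration"] is 0.75, value["Epistemic Humility"] is 0.5,
# and value["overall"] is their simple mean, 0.625.
```

Because the overall score is a mean of dimension means, a dimension counts equally no matter how many graders scored it.
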
File Layout
| File | Purpose |
|---|---|
| `moru.py` | `@task` definition. Params: `dataset_repo_id`, `dimensions_repo_id`, `dataset_revision`, `dimensions_revision`, `grader_models`, `grader_max_connections`, `grader_temperature`, `grader_max_tokens`, `grader_max_retries`, `epochs`, `language`, `shuffle` |
| `dataset.py` | Loads questions + dimensions from HuggingFace via `hf_dataset`/`load_dataset` utils. Core: `load_dataset_from_hf()`, `load_dimensions()`, `record_to_sample()` (factory pattern) |
| `scorer.py` | `moru_scorer`: per-dimension evaluation with `PER_DIMENSION_PROMPT` template (includes dataset revision) |
| `types.py` | Pydantic model: `MoruDimension` |
| `metrics.py` | Two metrics: `overall_mean`, `avg_by_dimension` |
| `utils.py` | `remove_nones()` helper for building `GenerateConfig` |
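
The actual signature of `remove_nones()` is not shown in this document. A minimal sketch, assuming it takes a dict of optional grader settings and drops the unset ones before they are passed on as keyword arguments, might look like this:

```python
from typing import Any


def remove_nones(params: dict[str, Any]) -> dict[str, Any]:
    """Assumed behavior: drop keys whose value is None, keep everything else.

    The check is `is not None` rather than truthiness, so legitimate falsy
    settings such as a temperature of 0.0 survive.
    """
    return {key: value for key, value in params.items() if value is not None}


# Illustrative grader settings; None means "not provided by the caller".
grader_settings = remove_nones(
    {"temperature": 0.0, "max_tokens": None, "max_retries": 3}
)
```

Filtering this way lets unset parameters fall back to the defaults of whatever config object receives them, instead of overriding those defaults with `None`.
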
Dataset Versioning
Tags are managed via `huggingface_hub` (if it is unavailable, install it into the project venv, not directly to the system). To cut a new tag after pushing dataset changes to `main`:

    uv run python -c "
    from huggingface_hub import HfApi
    api = HfApi()
    api.create_tag('sentientfutures/moru-benchmark', repo_type='dataset', tag='1.2.0', tag_message='Description of change')
    api.create_tag('sentientfutures/moru-benchmark-dimensions', repo_type='dataset', tag='1.1.0', tag_message='Description of change')
    "

Dataset versions are decoupled from the eval version; bump dataset tags only when data changes. When cutting a notable eval release, note the current dataset tags in the CHANGELOG entry.
Important Details
- Sample IDs are `{id}_{language}` to ensure uniqueness across languages
- Records with no dimension tags, or whose tags are all unknown after `valid_dimensions` filtering, raise `ValueError`; these indicate a data integrity issue
- The scorer uses `get_model(role="grader")`, so tests must pass `grader_models=["mockllm/model"]`
- The `language` parameter on the task and `load_dataset_from_hf()` filters to a single language; `None` loads all
- Registered in `_registry.py` and `listing.yaml` under the Safeguards group
- No weighting is applied: all dimensions contribute equally to the overall score
- Uses `hf_dataset`/`load_dataset` from `inspect_evals.utils.huggingface` for exponential backoff on HF rate limits
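
The sample ID and tag validation rules above can be illustrated with a small standalone function. This is a sketch under assumptions rather than the real `record_to_sample()`: the helper name `build_sample_fields`, the returned dict layout, and the idea that `dimension_tags` arrives as a list of strings are all hypothetical.

```python
import json
from typing import Any


def build_sample_fields(
    record: dict[str, Any], valid_dimensions: set[str]
) -> dict[str, str]:
    """Hypothetical helper: build the ID and JSON target for one record,
    rejecting records that end up with no known dimension tags."""
    tags = [
        tag for tag in (record.get("dimension_tags") or [])
        if tag in valid_dimensions
    ]
    if not tags:
        # Either no tags at all, or every tag was filtered out as unknown.
        raise ValueError(
            f"Record {record['id']!r} ({record['language']}) has no known dimension tags"
        )
    return {
        "id": f"{record['id']}_{record['language']}",
        "input": record["question"],
        "target": json.dumps({"tags": tags}),
    }


valid = {"Moral Consideration", "Harm Minimization", "Epistemic Humility"}
fields = build_sample_fields(
    {
        "id": 7,
        "language": "hi",
        "question": "...",
        "dimension_tags": ["Harm Minimization", "Not A Dimension"],
    },
    valid,
)
```

Unknown tags are dropped silently as long as at least one known tag remains; only a record left with nothing to score is treated as an error.
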