MORU (Moral Reasoning under Uncertainty)

What This Eval Does

Tests whether LLMs demonstrate moral reasoning under uncertainty across ethical dilemmas involving alien lifeforms (Europa), human welfare considerations, and digital mind welfare. Uses a per-dimension scoring approach where a grader LLM evaluates responses independently on each applicable ethical dimension.

Datasets

  • Questions: sentientfutures/moru-benchmark on HuggingFace
    • 201 unique questions (record_to_sample factory deduplicates by (id, language))
    • 3 languages: English (en), Malay (ms), Hindi (hi) — 67 questions per language
    • Key fields: question, dimension_tags, language
  • Dimensions: sentientfutures/moru-benchmark-dimensions on HuggingFace
    • 17 scoring dimensions (e.g., “Moral Consideration”, “Harm Minimization”, “Epistemic Humility”)
    • Each has: dimension, guiding_question, scoring, observable_indicators
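
A minimal sketch of pulling both datasets with the datasets library (repo IDs as above; the split name and revision tags are assumptions, and the eval itself goes through the inspect_evals.utils.huggingface wrappers instead):

from datasets import load_dataset

# Questions: one record per (id, language); key fields are
# question, dimension_tags, and language.
questions = load_dataset(
    "sentientfutures/moru-benchmark",
    split="train",     # assumed split name
    revision="1.2.0",  # pin a dataset tag (see Dataset Versioning below)
)

# Scoring dimensions: dimension, guiding_question, scoring, observable_indicators.
dimensions = load_dataset(
    "sentientfutures/moru-benchmark-dimensions",
    split="train",     # assumed split name
    revision="1.1.0",
)

# e.g. keep only the English questions
english = questions.filter(lambda r: r["language"] == "en")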

Scoring Architecture (follows ahb pattern)

  1. Dimensions loaded from separate HF dataset into MoruDimension objects
  2. Sample targets store dimension tags as JSON: {"tags": [...]}
  3. Custom scorer (moru_scorer in scorer.py) evaluates each applicable dimension independently using model_graded_qa with a per-dimension prompt template containing the guiding question and observable indicators
  4. Score aggregation: per-dimension scores are averaged across graders, then a simple mean across dimensions gives the overall score (sketched after this list)
  5. Score value is a dict: {"overall": 0.75, "Moral Consideration": 0.8, ...}
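
A sketch of the aggregation in steps 4 and 5, assuming each grader returns a score in [0, 1] per dimension (function and variable names here are illustrative, not the actual scorer.py internals):

from statistics import mean

def aggregate(grader_scores: dict[str, list[float]]) -> dict[str, float]:
    # grader_scores maps each applicable dimension to one score per grader, e.g.
    # {"Moral Consideration": [0.8, 0.8], "Harm Minimization": [0.7, 0.75]}
    per_dimension = {dim: mean(scores) for dim, scores in grader_scores.items()}
    # Overall is an unweighted mean of the per-dimension averages.
    return {"overall": mean(per_dimension.values()), **per_dimension}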

File Layout

File         Purpose
moru.py      @task definition. Params: dataset_repo_id, dimensions_repo_id, dataset_revision, dimensions_revision, grader_models, grader_max_connections, grader_temperature, grader_max_tokens, grader_max_retries, epochs, language, shuffle
dataset.py   Loads questions + dimensions from HuggingFace via hf_dataset/load_dataset utils. Core: load_dataset_from_hf(), load_dimensions(), record_to_sample() (factory pattern)
scorer.py    moru_scorer — per-dimension evaluation with PER_DIMENSION_PROMPT template (includes dataset revision)
types.py     Pydantic model: MoruDimension
metrics.py   Two metrics: overall_mean, avg_by_dimension
utils.py     remove_nones() helper for building GenerateConfig (sketched below)
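
remove_nones() is presumably a small dict-filtering helper; a plausible sketch (the real signature in utils.py may differ):

from inspect_ai.model import GenerateConfig

def remove_nones(**kwargs):
    # Drop unset options so GenerateConfig only receives explicit values.
    return {k: v for k, v in kwargs.items() if v is not None}

config = GenerateConfig(**remove_nones(temperature=0.0, max_tokens=None))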

Dataset Versioning

Tags are managed via huggingface_hub (install it into the project venv if unavailable; do not install it into the system Python). To cut a new tag after pushing dataset changes to main:

uv run python -c "
from huggingface_hub import HfApi
api = HfApi()
# Tag both repos; substitute the next version numbers and a real change description.
api.create_tag('sentientfutures/moru-benchmark', repo_type='dataset', tag='1.2.0', tag_message='Description of change')
api.create_tag('sentientfutures/moru-benchmark-dimensions', repo_type='dataset', tag='1.1.0', tag_message='Description of change')
"

Dataset versions are decoupled from the eval version — bump dataset tags only when data changes. When cutting a notable eval release, note the current dataset tags in the CHANGELOG entry.
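
Assuming the task is registered as inspect_evals/moru, pinned tags can then be passed through the task parameters at eval time, e.g.:

uv run inspect eval inspect_evals/moru \
  -T dataset_revision=1.2.0 \
  -T dimensions_revision=1.1.0 \
  --model openai/gpt-4o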

Important Details

  • Sample IDs are {id}_{language} to ensure uniqueness across languages (see the sketch after this list)
  • Records with no dimension tags, or whose tags are all unknown after valid_dimensions filtering, raise ValueError — these indicate a data integrity issue
  • The scorer uses get_model(role="grader") — tests must pass grader_models=["mockllm/model"]
  • The language parameter on the task and load_dataset_from_hf() filters to a single language; None loads all
  • Registered in _registry.py and listing.yaml under the Safeguards group
  • No weighting is applied — all dimensions contribute equally to the overall score
  • Uses hf_dataset/load_dataset from inspect_evals.utils.huggingface for exponential backoff on HF rate limits
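
A sketch of the record_to_sample() factory showing the ID scheme, tag filtering, and the ValueError mentioned above (names and details are illustrative; dataset.py is authoritative):

import json
from inspect_ai.dataset import Sample

def record_to_sample(valid_dimensions: set[str]):
    def _record_to_sample(record: dict) -> Sample:
        # Keep only tags that name a known scoring dimension.
        tags = [t for t in record["dimension_tags"] if t in valid_dimensions]
        if not tags:
            # No usable tags is a data integrity problem, not a skippable record.
            raise ValueError(f"Record {record['id']} has no valid dimension tags")
        return Sample(
            id=f"{record['id']}_{record['language']}",  # unique across languages
            input=record["question"],
            target=json.dumps({"tags": tags}),
        )
    return _record_to_sample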