Uganda Cultural and Cognitive Benchmark (UCCB)

Knowledge

The first comprehensive question-answering dataset designed to evaluate cultural understanding and reasoning abilities of Large Language Models concerning Uganda’s multifaceted environment across 24 cultural domains including education, traditional medicine, media, economy, literature, and social norms.

Overview

The first comprehensive question-answering dataset designed to evaluate the cultural understanding and reasoning abilities of Large Language Models (LLMs) concerning Uganda’s multifaceted environment across 24 cultural domains.

Usage

# Evaluate full dataset
inspect eval inspect_evals/uccb

# Evaluate with subset for testing
inspect eval inspect_evals/uccb --limit 100

# Evaluate with chain of thought
inspect eval inspect_evals/uccb -T cot=true

# Custom max tokens
inspect eval inspect_evals/uccb -T max_tokens=1024

# Custom scorer model
inspect eval inspect_evals/uccb -T scorer_model=anthropic/claude-3-5-sonnet-20240620

Dataset

  • Source: CraneAILabs/UCCB
  • Samples: 1,039 total samples
  • Format: Open-ended question answering
  • Languages: English (Ugandan English) with multilingual elements
  • License: CC BY-NC-SA 4.0

Categories

The benchmark covers 24 cultural domains including:

  • Education (70 samples) - Educational systems and practices
  • Ugandan Herbs (65 samples) - Traditional medicinal plants
  • Media (65 samples) - Media landscape and outlets
  • Economy (62 samples) - Economic structures and practices
  • Notable Key Figures (59 samples) - Important historical and contemporary figures
  • Literature (53 samples) - Literary works and traditions
  • Architecture (51 samples) - Building styles and structures
  • Folklore (49 samples) - Traditional stories and beliefs
  • Language (47 samples) - Linguistic diversity and usage
  • Religion (44 samples) - Religious practices and beliefs
  • Values & Social Norms (43 samples) - Cultural values and social expectations
  • Sports (23 samples) - Sports culture and achievements
  • And 12 additional domains covering attire, customs, festivals, food, geography, demographics, history, traditions, street life, music, value addition, and slang.

Scoring

The evaluation uses model-based grading appropriate for cultural knowledge assessment:

  • model_graded_qa: Uses a configurable model (default: GPT-4o-mini) to evaluate whether submitted answers demonstrate correct understanding of Ugandan culture, history, and society
  • accuracy(): Overall accuracy metric based on model grading
  • stderr(): Standard error calculation for statistical significance

Grading Criteria

The model grader evaluates submissions based on: - Factual accuracy: Contains correct information about Ugandan culture/history - Cultural understanding: Demonstrates appropriate cultural knowledge and respect - Key information coverage: Includes main points from the reference answer - Cultural context: Shows understanding of relevant Ugandan cultural context

Answers are considered correct even if they don’t match word-for-word, as long as they convey the same key cultural insights and factual information.

Evaluation Report

Implementation Details

  • Based on the original UCCB dataset from Crane AI Labs
  • Uses generative QA approach with cultural context system message
  • Supports both standard and chain-of-thought evaluation modes
  • Temperature set to 0.0 for reproducible results

Results

To run evaluations and populate this table with actual results, use the following commands:

# Evaluate with OpenAI models
uv run inspect eval inspect_evals/uccb --model openai/gpt-4o --limit 100
uv run inspect eval inspect_evals/uccb --model openai/gpt-4o-mini --limit 100

# Evaluate with Anthropic models
uv run inspect eval inspect_evals/uccb --model anthropic/claude-3-5-sonnet-20240620 --limit 100

# Evaluate with custom scorer model
uv run inspect eval inspect_evals/uccb --model openai/gpt-4o -T scorer_model=anthropic/claude-3-5-sonnet-20240620 --limit 100

Results

Model Accuracy Avg Score (1-5) Samples Date Notes
anthropic/claude-sonnet-4.5 82.0% 4.10 1039 2025-10-11 Best performing model - excellent cultural reasoning and context awareness
x-ai/grok-4 77.6% 3.88 1039 2025-10-11 Strong performance on cultural understanding tasks
cohere/command-a 77.0% 3.85 1039 2025-10-11 Solid baseline with good factual accuracy
openai/gpt-5 55.0% 2.75 1039 2025-10-11 Moderate performance on cultural context
google/gemini-2.5-pro 23.2% 1.16 1039 2025-10-11 Requires improvement on Ugandan cultural understanding

Evaluation Details: - Scoring Method: Model-based grading using GPT-4o - Total Samples: 1,039 across 24 cultural domains - Temperature: 0.0 (reproducible results) - Max Tokens: 512 - Success Rate: 100% completion for all models - Evaluation Date: October 11, 2025

Performance Metrics: - Processing Speed: Ranged from 0.37 items/s (Grok-4) to 1.43 items/s (Gemini-2.5-Pro) - Average Evaluation Time: 728s - 2,798s depending on model

Note: Accuracy percentages are calculated from average scores (Accuracy = Avg Score / 5 × 100). Scores reflect semantic understanding of Ugandan cultural context, not exact text matching.

Citation

@misc{uccb_2025,
  author    = {Lwanga Caleb and Gimei Alex and Kavuma Lameck and Kato Steven Mubiru and Roland Ganafa and Sibomana Glorry and Atuhaire Collins and JohnRoy Nangeso and Bronson Bakunga},
  title     = {The Ugandan Cultural Context Benchmark (UCCB) Suite},
  year      = {2025},
  url       = {https://huggingface.co/datasets/CraneAILabs/UCCB}
}