UCCB Benchmark Evaluation Results
- Date: October 11, 2025
- Dataset: CraneAILabs/UCCB (Ugandan Cultural Context Benchmark)
- Total Questions: 1,039 across 24 cultural categories
- Judge Model: openai/gpt-4o via OpenRouter
- Evaluation Method: LLM-as-a-Judge with a 5-point scoring rubric
Executive Summary
Five leading language models were evaluated on their understanding of Ugandan cultural context across 24 diverse categories including Education, Herbs, Media, Economy, Notable Figures, Literature, Architecture, Folklore, Language, and Religion.
Winner: Anthropic Claude Sonnet 4.5 demonstrated the strongest cultural understanding, with an average score of 4.10/5.0 and consistently strong results across all categories, scoring highest in Value Addition (4.80), Customs (4.58), and Architecture (4.57).
Notable Finding: Google Gemini 2.5 Pro significantly underperformed with a score of 1.16/5.0, suggesting potential issues with cultural context understanding or model configuration.
Overall Performance Rankings
| Rank | Model | Average Score | Success Rate | Total Time | Speed (items/s) |
|---|---|---|---|---|---|
| 🥇 1 | Anthropic Claude Sonnet 4.5 | 4.10 / 5.0 | 100.0% | 1,173s (19.5 min) | 0.89 |
| 🥈 2 | xAI Grok 4 | 3.88 / 5.0 | 100.0% | 2,798s (46.6 min) | 0.37 |
| 🥉 3 | Cohere Command A | 3.85 / 5.0 | 100.0% | 734s (12.2 min) | 1.42 |
| 4 | OpenAI GPT-5 | 2.75 / 5.0 | 100.0% | 1,584s (26.4 min) | 0.66 |
| 5 | Google Gemini 2.5 Pro | 1.16 / 5.0 | 100.0% | 729s (12.1 min) | 1.43 |
Total Benchmark Runtime: ~118 minutes (~2 hours)
Total API Cost: ~$32.27
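The Speed column is simply the question count divided by each model's total wall-clock time. A quick check of the table's arithmetic (the dictionary keys are just labels, not API identifiers):

```python
# Sketch: how the Speed (items/s) and runtime columns relate for 1,039 questions.
n_questions = 1039
total_seconds = {
    "Claude Sonnet 4.5": 1173,
    "Grok 4": 2798,
    "Command A": 734,
    "GPT-5": 1584,
    "Gemini 2.5 Pro": 729,
}
for model, secs in total_seconds.items():
    print(f"{model}: {n_questions / secs:.2f} items/s over {secs / 60:.1f} min")
# e.g. 1039 / 1173 ≈ 0.89 items/s, matching the table above.
```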
Detailed Model Analysis
🏆 1st Place: Anthropic Claude Sonnet 4.5
Score: 4.10 / 5.0
Strengths:
- Most balanced performance across all 24 categories
- Exceptional cultural nuance understanding
- Strong factual accuracy with deep contextual awareness
Top Categories:
- Value Addition: 4.80
- Customs: 4.58
- Architecture: 4.57
- Religion: 4.27
Analysis: Claude Sonnet 4.5 demonstrated superior understanding of Ugandan cultural context, consistently providing accurate, culturally sensitive responses with appropriate use of local terminology and awareness of social dynamics.
🥈 2nd Place: xAI Grok 4
Score: 3.88 / 5.0
Strengths:
- Strong performance across diverse categories
- Good cultural awareness and factual accuracy
- Comprehensive answers
Top Categories:
- Sports: 4.26
- Customs: 4.25
- Geography: 4.24
- Festivals: 4.14
Weaknesses:
- Slowest execution time (47 minutes)
- Lower speed may indicate more verbose responses or processing overhead
Analysis: Grok 4 performed very well overall, showing strong cultural understanding particularly in physical and social categories like Sports and Customs.
🥉 3rd Place: Cohere Command A
Score: 3.85 / 5.0
Strengths:
- Fastest execution time (12 minutes)
- Excellent efficiency with strong scores
- Best cost-to-performance ratio
Top Categories:
- Value Addition: 4.85 (highest single category score)
- Architecture: 4.47
- Economy: 4.13
- Festivals: 4.06
Analysis: Command A delivered impressive performance with the best speed-to-quality ratio. Its exceptional score in Value Addition (4.85) indicates strong understanding of Ugandan economic and entrepreneurial contexts.
4th Place: OpenAI GPT-5
Score: 2.75 / 5.0
Strengths:
- Consistent performance across categories
- Good at factual questions
Top Categories:
- Ugandan Herbs: 3.34
- Music: 3.32
- Geography: 3.21
- History: 3.19
Weaknesses:
- Below-average cultural nuance understanding
- Demographics scored only 1.31
- Literature scored only 1.85
Analysis: GPT-5 showed mid-range performance with notable weaknesses in categories requiring deep cultural understanding. Strong in factual domains (Geography, History) but struggled with nuanced cultural topics.
5th Place: Google Gemini 2.5 Pro
Score: 1.16 / 5.0 ⚠️
Critical Issues:
- Severe underperformance across all categories
- Highest category score was only 1.70 (Sports)
- Consistently scored around 1.0-1.3 across most domains
Analysis: Gemini 2.5 Pro's performance suggests one or more fundamental issues:
- Weak understanding of Ugandan cultural context
- Poor answer relevance and accuracy
- Possible model configuration problems
- A need for specialized prompting or fine-tuning on cultural benchmarks
Recommendation: Further investigation needed to determine if this is a model limitation or configuration issue.
Category Performance Breakdown
Top Performing Categories (All Models Average)
| Category | Claude 4.5 | Grok 4 | Command A | GPT-5 | Gemini 2.5 Pro |
|---|---|---|---|---|---|
| Value Addition | 4.80 | 4.42 | 4.85 | 3.05 | 1.18 |
| Customs | 4.58 | 4.25 | 3.83 | 2.77 | 1.15 |
| Architecture | 4.57 | 3.90 | 4.47 | 2.67 | 1.12 |
| Religion | 4.27 | 3.84 | 3.84 | 2.59 | 1.14 |
| Economy | 4.32 | 4.06 | 4.13 | 2.53 | 1.19 |
Challenging Categories (Lower Average Scores)
| Category | Claude 4.5 | Grok 4 | Command A | GPT-5 | Gemini 2.5 Pro |
|---|---|---|---|---|---|
| Demographics | 3.91 | 3.75 | 3.69 | 1.31 | 1.22 |
| Literature | 3.74 | 3.75 | 3.75 | 1.85 | 1.17 |
| Folklore | 3.78 | 3.45 | 3.04 | 2.39 | 1.06 |
| Music | 3.48 | 3.40 | 3.68 | 3.32 | 1.16 |
| Streetlife | 3.66 | 3.38 | 3.66 | 2.62 | 1.00 |
Complete Category Results
| Category | Claude 4.5 | Grok 4 | Command A | GPT-5 | Gemini 2.5 Pro |
|---|---|---|---|---|---|
| Architecture | 4.57 | 3.90 | 4.47 | 2.67 | 1.12 |
| Attires and Dress Culture | 4.00 | 3.59 | 3.46 | 3.10 | 1.02 |
| Customs | 4.58 | 4.25 | 3.83 | 2.77 | 1.15 |
| Demographics | 3.91 | 3.75 | 3.69 | 1.31 | 1.22 |
| Economy | 4.32 | 4.06 | 4.13 | 2.53 | 1.19 |
| Education | 4.24 | 3.86 | 3.91 | 2.39 | 1.16 |
| Festivals | 4.23 | 4.14 | 4.06 | 2.57 | 1.03 |
| Folklore | 3.78 | 3.45 | 3.04 | 2.39 | 1.06 |
| Food and Culinary Practices | 3.85 | 3.88 | 3.82 | 3.15 | 1.09 |
| Geography | 4.30 | 4.24 | 3.97 | 3.21 | 1.15 |
| History | 4.29 | 3.68 | 3.68 | 3.19 | 1.00 |
| Language | 4.09 | 3.66 | 3.72 | 2.47 | 1.28 |
| Literature | 3.74 | 3.75 | 3.75 | 1.85 | 1.17 |
| Media | 3.78 | 3.97 | 3.89 | 3.02 | 1.29 |
| Music | 3.48 | 3.40 | 3.68 | 3.32 | 1.16 |
| Notable Key Figures | 4.00 | 4.03 | 3.90 | 3.17 | 1.29 |
| Religion | 4.27 | 3.84 | 3.84 | 2.59 | 1.14 |
| Slang & Local Expressions | 3.76 | 3.84 | 2.87 | 3.08 | 1.13 |
| Sports | 3.83 | 4.26 | 3.70 | 2.43 | 1.70 |
| Streetlife | 3.66 | 3.38 | 3.66 | 2.62 | 1.00 |
| Traditions and Rituals | 4.03 | 3.61 | 3.61 | 3.00 | 1.03 |
| Ugandan Herbs | 4.18 | 3.98 | 4.11 | 3.34 | 1.22 |
| Value Addition | 4.80 | 4.42 | 4.85 | 3.05 | 1.18 |
| Values and Social Norms | 4.30 | 3.79 | 4.05 | 2.93 | 1.05 |
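The category comparisons above can be ranked by a cross-model average using the category_comparison.csv listed under Files Included. A minimal sketch, assuming the file has a "Category" column plus one numeric column per model (the column layout is an assumption, not confirmed by the file itself):

```python
# Sketch: ranking categories by their average score across all five models.
import pandas as pd

scores = pd.read_csv("category_comparison.csv")
scores["all_models_avg"] = scores.drop(columns=["Category"]).mean(axis=1)
print(scores.sort_values("all_models_avg", ascending=False).head(10))
```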
Scoring Rubric
The evaluation used GPT-4o as judge with the following 5-point rubric:
Evaluation Criteria:
- Accuracy (50% weight): Factual correctness and direct relevance
- Cultural Nuance (30% weight): Understanding of Ugandan context, local terminology, social dynamics
- Completeness & Relevance (20% weight): Answer completeness and focus
Score Interpretation:
- 5 (Excellent): Fully accurate, deep cultural understanding, complete
- 4 (Good): Correct and relevant, minor cultural nuances missing
- 3 (Acceptable): Generally correct, superficial understanding
- 2 (Poor): Significant inaccuracies or cultural misunderstanding
- 1 (Very Poor): Incorrect, irrelevant, or nonsensical
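A minimal sketch of how such a judge call can be made through OpenRouter. The prompt wording, the reference-answer field, and the bare-integer reply format are illustrative assumptions, not the exact setup used by uccb_eval_threaded.py:

```python
# Sketch of an LLM-as-a-Judge call following the rubric above (assumptions noted).
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="...")  # OpenRouter key

RUBRIC = (
    "Score the candidate answer to the question on a 1-5 scale, weighting "
    "Accuracy 50%, Cultural Nuance 30%, and Completeness & Relevance 20%. "
    "5 = excellent, 1 = very poor. Reply with the integer score only."
)

def judge(question: str, reference: str, answer: str) -> int:
    resp = client.chat.completions.create(
        model="openai/gpt-4o",   # judge model
        temperature=0.3,         # judge temperature from Technical Details
        max_tokens=10,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": (
                f"Question: {question}\nReference answer: {reference}\n"
                f"Candidate answer: {answer}"
            )},
        ],
    )
    return int(resp.choices[0].message.content.strip())
```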
Cost Analysis
Token Usage (Actual)
Per Test Model (1,039 questions):
- Input: 61,301 tokens (0.061M)
- Output: 311,700 tokens (0.312M)
Judge Model (5,195 evaluations):
- Input: 2,659,840 tokens (2.660M)
- Output: 623,400 tokens (0.623M)
Cost Breakdown by Model
| Model | Input Cost | Output Cost | Total Cost |
|---|---|---|---|
| Claude Sonnet 4.5 | $0.18 | $4.68 | $4.86 |
| Grok 4 | $0.18 | $4.68 | $4.86 |
| Command A | $0.15 | $3.12 | $3.27 |
| GPT-5 | $0.08 | $3.12 | $3.20 |
| Gemini 2.5 Pro | $0.08 | $3.12 | $3.20 |
| Judge (GPT-4o) | $6.65 | $6.23 | $12.88 |
| TOTAL | $7.32 | $24.95 | $32.27 |
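The cost table can be reconstructed from the token counts above. The per-million-token rates below are assumptions inferred from the table itself, not quoted provider pricing; per-model figures may differ from the table by a cent due to rounding:

```python
# Sketch: cost = input_tokens/1M * input_rate + output_tokens/1M * output_rate.
usage = {  # (input_tokens, output_tokens, $/M input, $/M output) -- rates inferred
    "Claude Sonnet 4.5": (61_301, 311_700, 3.00, 15.00),
    "Grok 4":            (61_301, 311_700, 3.00, 15.00),
    "Command A":         (61_301, 311_700, 2.50, 10.00),
    "GPT-5":             (61_301, 311_700, 1.25, 10.00),
    "Gemini 2.5 Pro":    (61_301, 311_700, 1.25, 10.00),
    "Judge (GPT-4o)":    (2_659_840, 623_400, 2.50, 10.00),
}
total = 0.0
for name, (tin, tout, rate_in, rate_out) in usage.items():
    cost = tin / 1e6 * rate_in + tout / 1e6 * rate_out
    total += cost
    print(f"{name}: ${cost:.2f}")
print(f"TOTAL: ${total:.2f}")  # ≈ $32.27
```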
Key Insights
1. Claude Sonnet 4.5 Dominance
Claude demonstrated superior cultural understanding across nearly all categories, making it the strongest choice in this benchmark for culturally nuanced tasks.
2. Speed vs Quality Trade-off
- Fastest: Command A (1.42 items/s) and Gemini 2.5 Pro (1.43 items/s)
- Slowest: Grok 4 (0.37 items/s)
- Speed did not correlate with quality (Gemini fastest but lowest score)
3. Cultural Categories Challenge All Models
Categories like Folklore, Music, and Streetlife were challenging for all models, suggesting they require deeper cultural immersion or more representative training data.
4. Economic/Practical Knowledge Strength
Most models performed well on Value Addition, Economy, and Architecture—categories with more concrete, documented knowledge.
5. Gemini 2.5 Pro Anomaly
The severe underperformance requires investigation. Possible factors:
- Prompt engineering mismatch
- Model behavior differences via OpenRouter
- Cultural bias or training data limitations
- Temperature/sampling parameters
Recommendations
For Production Use
- Cultural Tasks: Use Claude Sonnet 4.5 for best results
- Budget-Conscious: Use Command A for excellent speed/quality balance
- Comprehensive Analysis: Consider Grok 4 when depth matters more than speed
For Further Research
- Investigate Gemini 2.5 Pro performance with:
  - Direct API access (non-OpenRouter)
  - Modified system prompts
  - Different temperature settings
  - Native Google AI Studio interface
- GPT-5 Improvement Opportunities:
  - Focus on cultural nuance training
  - Demographic and literature domain knowledge
  - Consider few-shot examples for cultural context
- Benchmark Expansion:
  - Test additional models (Claude Opus, GPT-4 Turbo, Llama 4)
  - Evaluate multilingual responses (include Luganda responses)
  - Add human expert validation for judge scores
Technical Details
Configuration:
- Threading: 10 concurrent workers
- Max tokens per response: 300
- Temperature: 0.7 (test models), 0.3 (judge)
- Retry attempts: 3 with exponential backoff
- Timeout: 60 seconds per request
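A minimal sketch of this request pattern (10 workers, 3 retries with exponential backoff, 60-second timeout). Function names, the sample question, and the OpenRouter model slug are illustrative, not taken from uccb_eval_threaded.py:

```python
import time
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="...")

def ask_model(question: str, model: str) -> str:
    for attempt in range(3):                      # retry attempts: 3
        try:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": question}],
                max_tokens=300,                   # max tokens per response
                temperature=0.7,                  # test-model temperature
                timeout=60,                       # seconds per request
            )
            return resp.choices[0].message.content
        except Exception:
            time.sleep(2 ** attempt)              # back off 1s, 2s, 4s
    raise RuntimeError("request failed after 3 attempts")

questions = ["What is matooke and how is it prepared?"]  # real items come from UCCB
with ThreadPoolExecutor(max_workers=10) as pool:         # 10 concurrent workers
    answers = list(pool.map(lambda q: ask_model(q, "anthropic/claude-sonnet-4.5"), questions))
```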
Infrastructure:
- API: OpenRouter (https://openrouter.ai/api/v1)
- Dataset: Hugging Face Hub (CraneAILabs/UCCB)
- Evaluation Script: `uccb_eval_threaded.py` with Python threading
Success Rate: 100% for all models (no failed evaluations)
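Loading the benchmark from the Hugging Face Hub is a one-liner with the `datasets` library. The split name ("train") and column name ("category") below are assumptions; check the dataset card for the actual schema:

```python
# Sketch: pulling CraneAILabs/UCCB from the Hugging Face Hub.
from datasets import load_dataset

uccb = load_dataset("CraneAILabs/UCCB", split="train")  # split name assumed
print(len(uccb))                      # expected: 1,039 questions
print(sorted(set(uccb["category"])))  # expected: 24 cultural categories
```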
Files Included
- `evaluation.md` - This comprehensive report
- `overall_summary.csv` - Cross-model comparison metrics
- `category_comparison.csv` - Per-category scores for all models
- `{model_name}/detailed_results.json` - Full Q&A pairs with scores and justifications
- `{model_name}/category_scores.csv` - Per-category breakdown for each model
- `{model_name}/summary.json` - Overall statistics per model
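A short sketch of recomputing one model's per-category averages from its detailed_results.json. The directory name and the field names ("category", "score") are assumptions; check the JSON for the actual keys:

```python
import json
from collections import defaultdict

with open("anthropic_claude_sonnet_4.5/detailed_results.json") as f:  # path illustrative
    results = json.load(f)

by_category = defaultdict(list)
for item in results:
    by_category[item["category"]].append(item["score"])

for category, scores in sorted(by_category.items()):
    print(f"{category}: {sum(scores) / len(scores):.2f}")
```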
Citation
Dataset:
@dataset{uccb2024,
title={UCCB: Ugandan Cultural Context Benchmark},
author={CraneAI Labs},
year={2024},
publisher={Hugging Face},
url={https://huggingface.co/datasets/CraneAILabs/UCCB}
}

Evaluation Date: October 11, 2025
Evaluation Tool: UCCB Threaded Evaluation Script v1.0
Conclusion
This comprehensive evaluation of five leading language models on the Ugandan Cultural Context Benchmark reveals significant performance differences in cultural understanding. Claude Sonnet 4.5 emerged as the clear leader with a 4.10/5.0 average score, demonstrating superior cultural nuance, factual accuracy, and completeness across all 24 categories.
The benchmark highlights the importance of cultural context in AI evaluation and identifies areas where current models excel (Value Addition, Customs, Architecture) and struggle (Folklore, Demographics, Streetlife). The results provide valuable guidance for model selection in culturally sensitive applications and identify opportunities for improvement in cultural AI training.
Generated with UCCB Evaluation Script | Powered by OpenRouter