UCCB Benchmark Evaluation Results

Date: October 11, 2025 Dataset: CraneAILabs/UCCB (Ugandan Cultural Context Benchmark) Total Questions: 1,039 across 24 cultural categories Judge Model: openai/gpt-4o via OpenRouter Evaluation Method: LLM-as-a-Judge with 5-point scoring rubric

Executive Summary

Five leading language models were evaluated on their understanding of Ugandan cultural context across 24 diverse categories including Education, Herbs, Media, Economy, Notable Figures, Literature, Architecture, Folklore, Language, and Religion.

Winner: Anthropic Claude Sonnet 4.5 demonstrated the strongest cultural understanding with an average score of 4.10/5.0, showing exceptional performance across all categories, particularly excelling in Value Addition (4.80), Customs (4.58), and Architecture (4.57).

Notable Finding: Google Gemini 2.5 Pro significantly underperformed with a score of 1.16/5.0, suggesting potential issues with cultural context understanding or model configuration.

Overall Performance Rankings

Rank	Model	Average Score	Success Rate	Total Time	Speed (items/s)
🥇 1	Anthropic Claude Sonnet 4.5	4.10 / 5.0	100.0%	1,173s (19.5 min)	0.89
🥈 2	xAI Grok 4	3.88 / 5.0	100.0%	2,798s (46.6 min)	0.37
🥉 3	Cohere Command A	3.85 / 5.0	100.0%	734s (12.2 min)	1.42
4	OpenAI GPT-5	2.75 / 5.0	100.0%	1,584s (26.4 min)	0.66
5	Google Gemini 2.5 Pro	1.16 / 5.0	100.0%	729s (12.1 min)	1.43

Total Benchmark Runtime: ~118 minutes (~2 hours) Total API Cost: ~$32.27

Detailed Model Analysis

🏆 1st Place: Anthropic Claude Sonnet 4.5

Score: 4.10 / 5.0

Strengths:

Most balanced performance across all 24 categories
Exceptional cultural nuance understanding
Strong factual accuracy with deep contextual awareness

Top Categories:

Value Addition: 4.80
Customs: 4.58
Architecture: 4.57
Religion: 4.27

Analysis: Claude Sonnet 4.5 demonstrated superior understanding of Ugandan cultural context, consistently providing accurate, culturally-sensitive responses with appropriate use of local terminology and awareness of social dynamics.

🥈 2nd Place: xAI Grok 4

Score: 3.88 / 5.0

Strengths:

Strong performance across diverse categories
Good cultural awareness and factual accuracy
Comprehensive answers

Top Categories:

Sports: 4.26
Customs: 4.25
Geography: 4.24
Festivals: 4.14

Weaknesses:

Slowest execution time (47 minutes)
Lower speed may indicate more verbose responses or processing overhead

Analysis: Grok 4 performed very well overall, showing strong cultural understanding particularly in physical and social categories like Sports and Customs.

🥉 3rd Place: Cohere Command A

Score: 3.85 / 5.0

Strengths:

Fastest execution time (12 minutes)
Excellent efficiency with strong scores
Best cost-to-performance ratio

Top Categories:

Value Addition: 4.85 (highest single category score)
Architecture: 4.47
Economy: 4.13
Festivals: 4.06

Analysis: Command A delivered impressive performance with the best speed-to-quality ratio. Its exceptional score in Value Addition (4.85) indicates strong understanding of Ugandan economic and entrepreneurial contexts.

4th Place: OpenAI GPT-5

Score: 2.75 / 5.0

Strengths:

Consistent performance across categories
Good at factual questions

Top Categories:

Ugandan Herbs: 3.34
Music: 3.32
Geography: 3.21
History: 3.19

Weaknesses:

Below-average cultural nuance understanding
Demographics scored only 1.31
Literature scored only 1.85

Analysis: GPT-5 showed mid-range performance with notable weaknesses in categories requiring deep cultural understanding. Strong in factual domains (Geography, History) but struggled with nuanced cultural topics.

5th Place: Google Gemini 2.5 Pro

Score: 1.16 / 5.0 ⚠️

Critical Issues:

Severe underperformance across all categories
Highest category score was only 1.70 (Sports)
Consistently scored around 1.0-1.3 across most domains

Analysis: Gemini 2.5 Pro’s performance suggests fundamental issues with:

Cultural context understanding
Answer relevance and accuracy
Possible model configuration issues
May require specialized prompting or fine-tuning for cultural benchmarks

Recommendation: Further investigation needed to determine if this is a model limitation or configuration issue.

Category Performance Breakdown

Top Performing Categories (All Models Average)

Category	Claude 4.5	Grok 4	Command A	GPT-5	Gemini 2.5 Pro
Value Addition	4.80	4.42	4.85	3.05	1.18
Customs	4.58	4.25	3.83	2.77	1.15
Architecture	4.57	3.90	4.47	2.67	1.12
Religion	4.27	3.84	3.84	2.59	1.14
Economy	4.32	4.06	4.13	2.53	1.19

Challenging Categories (Lower Average Scores)

Category	Claude 4.5	Grok 4	Command A	GPT-5	Gemini 2.5 Pro
Demographics	3.91	3.75	3.69	1.31	1.22
Literature	3.74	3.75	3.75	1.85	1.17
Folklore	3.78	3.45	3.04	2.39	1.06
Music	3.48	3.40	3.68	3.32	1.16
Streetlife	3.66	3.38	3.66	2.62	1.00

Complete Category Results

Category	Claude 4.5	Grok 4	Command A	GPT-5	Gemini 2.5 Pro
Architecture	4.57	3.90	4.47	2.67	1.12
Attires and Dress Culture	4.00	3.59	3.46	3.10	1.02
Customs	4.58	4.25	3.83	2.77	1.15
Demographics	3.91	3.75	3.69	1.31	1.22
Economy	4.32	4.06	4.13	2.53	1.19
Education	4.24	3.86	3.91	2.39	1.16
Festivals	4.23	4.14	4.06	2.57	1.03
Folklore	3.78	3.45	3.04	2.39	1.06
Food and Culinary Practices	3.85	3.88	3.82	3.15	1.09
Geography	4.30	4.24	3.97	3.21	1.15
History	4.29	3.68	3.68	3.19	1.00
Language	4.09	3.66	3.72	2.47	1.28
Literature	3.74	3.75	3.75	1.85	1.17
Media	3.78	3.97	3.89	3.02	1.29
Music	3.48	3.40	3.68	3.32	1.16
Notable Key Figures	4.00	4.03	3.90	3.17	1.29
Religion	4.27	3.84	3.84	2.59	1.14
Slang & Local Expressions	3.76	3.84	2.87	3.08	1.13
Sports	3.83	4.26	3.70	2.43	1.70
Streetlife	3.66	3.38	3.66	2.62	1.00
Traditions and Rituals	4.03	3.61	3.61	3.00	1.03
Ugandan Herbs	4.18	3.98	4.11	3.34	1.22
Value Addition	4.80	4.42	4.85	3.05	1.18
Values and Social Norms	4.30	3.79	4.05	2.93	1.05

Scoring Rubric

The evaluation used GPT-4o as judge with the following 5-point rubric:

Evaluation Criteria:

Accuracy (50% weight): Factual correctness and direct relevance
Cultural Nuance (30% weight): Understanding of Ugandan context, local terminology, social dynamics
Completeness & Relevance (20% weight): Answer completeness and focus

Score Interpretation:

5 (Excellent): Fully accurate, deep cultural understanding, complete
4 (Good): Correct and relevant, minor cultural nuances missing
3 (Acceptable): Generally correct, superficial understanding
2 (Poor): Significant inaccuracies or cultural misunderstanding
1 (Very Poor): Incorrect, irrelevant, or nonsensical

Cost Analysis

Token Usage (Actual)

Per Test Model (1,039 questions):

Input: 61,301 tokens (0.061M)
Output: 311,700 tokens (0.312M)

Judge Model (5,195 evaluations):

Input: 2,659,840 tokens (2.660M)
Output: 623,400 tokens (0.623M)

Cost Breakdown by Model

Model	Input Cost	Output Cost	Total Cost
Claude Sonnet 4.5	$0.18	$4.68	$4.86
Grok 4	$0.18	$4.68	$4.86
Command A	$0.15	$3.12	$3.27
GPT-5	$0.08	$3.12	$3.20
Gemini 2.5 Pro	$0.08	$3.12	$3.20
Judge (GPT-4o)	$6.65	$6.23	$12.88
TOTAL	$7.32	$24.95	$32.27

Key Insights

1. Claude Sonnet 4.5 Dominance

Claude demonstrated superior cultural understanding across nearly all categories, justifying its position as the leading model for culturally-nuanced tasks.

2. Speed vs Quality Trade-off

Fastest: Command A (1.42 items/s) and Gemini 2.5 Pro (1.43 items/s)
Slowest: Grok 4 (0.37 items/s)
Speed did not correlate with quality (Gemini fastest but lowest score)

3. Cultural Categories Challenge All Models

Categories like Folklore, Music, and Streetlife were challenging for all models, suggesting these require deeper cultural immersion or training data.

4. Economic/Practical Knowledge Strength

Most models performed well on Value Addition, Economy, and Architecture—categories with more concrete, documented knowledge.

5. Gemini 2.5 Pro Anomaly

The severe underperformance requires investigation. Possible factors:

Prompt engineering mismatch
Model behavior differences via OpenRouter
Cultural bias or training data limitations
Temperature/sampling parameters

Recommendations

For Production Use

Cultural Tasks: Use Claude Sonnet 4.5 for best results
Budget-Conscious: Use Command A for excellent speed/quality balance
Comprehensive Analysis: Consider Grok 4 when depth matters more than speed

For Further Research

Investigate Gemini 2.5 Pro performance with:
- Direct API access (non-OpenRouter)
- Modified system prompts
- Different temperature settings
- Native Google AI Studio interface
GPT-5 Improvement Opportunities:
- Focus on cultural nuance training
- Demographic and literature domain knowledge
- Consider few-shot examples for cultural context
Benchmark Expansion:
- Test additional models (Claude Opus, GPT-4 Turbo, Llama 4)
- Evaluate multilingual responses (include Luganda responses)
- Add human expert validation for judge scores

Technical Details

Configuration:

Threading: 10 concurrent workers
Max tokens per response: 300
Temperature: 0.7 (test models), 0.3 (judge)
Retry attempts: 3 with exponential backoff
Timeout: 60 seconds per request

Infrastructure:

API: OpenRouter (https://openrouter.ai/api/v1)
Dataset: Hugging Face Hub (CraneAILabs/UCCB)
Evaluation Script: uccb_eval_threaded.py with Python threading

Success Rate: 100% for all models (no failed evaluations)

Files Included

evaluation.md - This comprehensive report
overall_summary.csv - Cross-model comparison metrics
category_comparison.csv - Per-category scores for all models
{model_name}/detailed_results.json - Full Q&A pairs with scores and justifications
{model_name}/category_scores.csv - Per-category breakdown for each model
{model_name}/summary.json - Overall statistics per model

Citation

Dataset:

@dataset{uccb2024,
  title={UCCB: Ugandan Cultural Context Benchmark},
  author={CraneAI Labs},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/CraneAILabs/UCCB}
}

Evaluation Date: October 11, 2025 Evaluation Tool: UCCB Threaded Evaluation Script v1.0

Conclusion

This comprehensive evaluation of five leading language models on the Ugandan Cultural Context Benchmark reveals significant performance differences in cultural understanding. Claude Sonnet 4.5 emerged as the clear leader with a 4.10/5.0 average score, demonstrating superior cultural nuance, factual accuracy, and completeness across all 24 categories.

The benchmark highlights the importance of cultural context in AI evaluation and identifies areas where current models excel (Value Addition, Customs, Architecture) and struggle (Folklore, Demographics, Streetlife). The results provide valuable guidance for model selection in culturally-sensitive applications and identify opportunities for improvement in cultural AI training.

Generated with UCCB Evaluation Script | Powered by OpenRouter