Design Decisions Log

Maintain a log of key design decisions, including:

The decision made
Rationale behind it
Alternative approaches considered
Potential impact on evaluation results

Example format:

Date	Decision	Rationale	Impact
YYYY-MM-DD	Decision made	Why this approach was chosen	How it affects the evaluation
2025-11-19	Only implement SAD-mini	Implementing SAD-lite is labour-intensive and non-trivial. SAD-mini is also useful as is.	Only 5 of the 16 tasks will be implemented
2025-11-19	Load needed zip files from GitHub API	The cloned repository is over 1 GB, while the SAD-mini files alone are about 2.5 MB.	Will break if the location of these files changes in the reference repository. There is no official Hugging Face or similar source.
2025-11-19	The order of choices is randomized for all SAD-mini benchmarks	It is not explicit in the report, but it seems reasonable, and all source files have `choices_order: null`. This was checked by searching with the regex `choices_order: (?!null)` and finding no matches. It is also unclear how we would enforce an order, given that we only have fields for the correct and incorrect responses.	Will break if the location of these files changes in the reference repository. There is no official Hugging Face or similar source.
2025-11-20	Choose uniformly at random between placing the text above or below, and between question option 1 or 2, for stages_full.	The report does not state when each positioning or question option is used, so I assume one is picked at random.	The record-to-sample conversion for stages_full is not deterministic unless a seed is provided.
2025-11-24	Use the built-in MCQ solver and scorer	It is not obvious from the paper how the MCQ questions are laid out.	Questions are formatted as A), B) instead of (A), (B) as in the paper. This also means that the answer assist is just ‘Answer:’
2025-12-02	Don’t use the built-in MCQ solver. Format choices as (A), (B), …	All examples in the report use this format, and the answer assist is stated as ‘Answer: (’, which is evidence that the alternatives should start with ‘(’.
2025-12-02	Grade leniently with respect to answer format.	The grading is not exactly specified in the report and we don’t emphasize format clearly in the prompt, so I think it makes sense to be lenient in the grading as well.
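
The choice randomization, the (A), (B) formatting and the lenient grading logged above can be illustrated with a minimal Python sketch. It uses only the standard library and is not the actual implementation: the names `build_question`, `extract_letter` and `ANSWER_ASSIST` are hypothetical, and the set of answer forms accepted by the lenient parser is an assumption, since the report does not specify the grading exactly.

```python
import random
import re
import string
from typing import Optional

# Answer assist as stated in the report; how it is passed to the model is not shown here.
ANSWER_ASSIST = "Answer: ("


def build_question(
    question: str, correct: str, incorrect: list[str], rng: random.Random
) -> tuple[str, str]:
    """Shuffle the choices, label them (A), (B), ... and return (prompt, target letter)."""
    choices = [correct, *incorrect]
    rng.shuffle(choices)  # order depends only on the state of the supplied rng
    letters = string.ascii_uppercase[: len(choices)]
    lines = [f"({letter}) {choice}" for letter, choice in zip(letters, choices)]
    target = letters[choices.index(correct)]
    return "\n".join([question, *lines]), target


def extract_letter(completion: str, n_choices: int) -> Optional[str]:
    """Leniently read the chosen letter from the start of a completion.

    Assumed accepted forms: "B", "B)", "(B)" with optional leading whitespace.
    A letter that merely starts a word (e.g. "Because") is rejected.
    """
    valid = string.ascii_uppercase[:n_choices]
    match = re.match(rf"\s*\(?([{valid}])\)?(?![A-Za-z])", completion)
    return match.group(1) if match else None


if __name__ == "__main__":
    rng = random.Random(0)  # a fixed seed makes the conversion reproducible
    prompt, target = build_question("Which one is it?", "yes", ["no", "maybe"], rng)
    print(prompt)
    print(ANSWER_ASSIST)
    print("correct:", extract_letter(f"{target}) yes", 3) == target)
```

Passing an explicit `random.Random` instance rather than using the module-level functions keeps the shuffling reproducible when a seed is supplied, which is the determinism concern noted for stages_full.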