Design Decisions Log

Maintain a log of key design decisions, including:

  • The decision made
  • Rationale behind it
  • Alternative approaches considered
  • Potential impact on evaluation results

Example format:

| Date | Decision | Rationale | Impact |
|---|---|---|---|
| YYYY-MM-DD | Decision made | Why this approach was chosen | How it affects the evaluation |
| 2025-11-19 | Only implement SAD-mini. | Implementing SAD-lite is labour-intensive and non-trivial. SAD-mini is also useful as is. | Only 5 of the 16 tasks will be implemented. |
| 2025-11-19 | Load the needed zip files from the GitHub API. | The cloned repository is over 1 GB, while the SAD-mini files alone are about 2.5 MB. | Will break if the location of these files changes in the reference repository. There is no official Hugging Face or similar source. |
| 2025-11-19 | The order of choices is randomized for all SAD-mini benchmarks. | This is not explicit in the report, but it seems reasonable, and the data sources have choice_order: null in all source files. This was checked by searching with the regex choices_order: (?!null) and finding no matches. It is also unclear how we would enforce an order, given that we only have fields for the correct and incorrect responses. | Will break if the location of these files changes in the reference repository. There is no official Hugging Face or similar source. |
| 2025-11-20 | Choose uniformly at random between text above/below and between question option 1 or 2 for stages_full. | The report does not state when each positioning or question option is used, so I assume they pick one at random. | The record-to-sample conversion for stages_full is not deterministic unless a seed is provided. |
| 2025-11-24 | Use the built-in MCQ Solver and Scorer. | I don’t find it obvious how the MCQ questions are laid out in the paper. | We are formatting questions as A), B) instead of (A), (B) as they do. This also means that the answer assist is just ‘Answer:’. |
| 2025-12-02 | Don’t use the built-in MCQ solver. Format choices as (A), (B), … | This is the case in all examples in the report, and the answer assist is stated as ‘Answer: (’, which is evidence that the alternatives should start with ‘(’. | |
| 2025-12-02 | Be lenient when grading whether an answer has the correct format. | The grading is not exactly specified in the report, and we don’t emphasize the format clearly in the prompt, so I think it makes sense to be lenient in the grading as well. | |
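
The choice-order, formatting, and grading decisions above can be sketched roughly as follows. This is a minimal illustration, not the implementation used in this evaluation: the function names (shuffle_choices, format_prompt, extract_letter, is_correct) are hypothetical, and the exact leniency rules are an assumption, since the report does not specify them. Only the (A), (B) layout and the ‘Answer: (’ answer assist come from the log.

```python
import random
import re
import string

# Answer assist as recorded in the log; the model continues after the "(".
ANSWER_ASSIST = "Answer: ("

# Assumed leniency: optional "Answer:" prefix, optional "(", one capital
# letter, then ")" or a word boundary. "Banana" is therefore not read as "B".
_LETTER_RE = re.compile(r"^\s*(?:(?i:answer)\s*:\s*)?\(?([A-Z])(?:\)|\b)")


def shuffle_choices(correct, incorrect, seed=None):
    """Return (choices, index_of_correct) with the order randomized.

    Passing a seed makes the record-to-sample conversion reproducible.
    """
    rng = random.Random(seed)
    choices = [correct] + list(incorrect)
    order = list(range(len(choices)))
    rng.shuffle(order)
    # Position 0 in the original list is the correct response.
    return [choices[i] for i in order], order.index(0)


def format_prompt(question, choices):
    """Lay out the choices as (A), (B), ... below the question."""
    lines = [question, ""]
    for letter, choice in zip(string.ascii_uppercase, choices):
        lines.append(f"({letter}) {choice}")
    return "\n".join(lines)


def extract_letter(completion, n_choices):
    """Leniently pull a choice letter out of a completion, or return None."""
    match = _LETTER_RE.match(completion)
    if match is None:
        return None
    letter = match.group(1)
    # Reject letters that do not correspond to an offered choice.
    if string.ascii_uppercase.index(letter) >= n_choices:
        return None
    return letter


def is_correct(completion, correct_index, n_choices):
    """Grade a completion against the index of the correct choice."""
    return extract_letter(completion, n_choices) == string.ascii_uppercase[correct_index]
```

With this sketch, completions such as "A) because ...", "(A)", and "Answer: (A)" are all read as choice A, while a letter outside the offered range is treated as no answer. A stricter parser would be just as defensible; the point is only that the accepted formats should be written down so the grading is reproducible.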