Design Decisions Log
Maintain a log of key design decisions, including:
- The decision made
- Rationale behind it
- Alternative approaches considered
- Potential impact on evaluation results
Example format:
| Date | Decision | Rationale | Impact |
|---|---|---|---|
| YYYY-MM-DD | Decision made | Why this approach was chosen | How it affects the evaluation |
| 2025-11-19 | Only implement SAD-mini | Implementing SAD-lite is labour-intensive and non-trivial, and SAD-mini is useful on its own. | Only 5 of the 16 tasks will be implemented |
| 2025-11-19 | Load the needed zip files via the GitHub API | The cloned repository is over 1 GB, while the SAD-mini files alone are about 2.5 MB. | Will break if the location of these files changes in the reference repository. There is no official Hugging Face or similar source. |
| 2025-11-19 | The order of choices is randomized for all Mini benchmarks | This is not explicit in the report, but it seems reasonable: the data sources have `choice_order: null` in all source files. This was checked by searching with the regex `choices_order: (?!null)` and finding no matches. It is also unclear how we would enforce an order, given that we only have fields for the correct and incorrect responses. | |
| 2025-11-20 | Choose uniformly at random between text above/below and between question option 1 or 2 for stages_full. | The report does not state explicitly when each positioning or question option is used, so I assume one is picked at random. | The record-to-sample conversion for stages_full is not deterministic unless a seed is provided. |
| 2025-11-24 | Use the built-in MCQ Solver and Scorer | It is not obvious to me how the MCQ questions are laid out in the paper. | We are formatting choices as A), B) instead of (A), (B) as they do. This also means that the answer assist is just ‘Answer:’ |
| 2025-12-02 | Don’t use the built-in MCQ solver. Format choices as (A), (B), … | All examples in the report use this format, and the answer assist is stated as ‘Answer: (’, which is evidence that the alternatives should start with ‘(’. | |
| 2025-12-02 | Grade leniently with respect to whether an answer has a correct format. | The grading is not exactly specified in the report and we don’t clearly emphasize format in the prompt, so I think it makes sense to be lenient in the grading as well. | |
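
The two 2025-12-02 decisions can be illustrated with a minimal sketch. It is not the implementation used in this evaluation: the function names `format_choices`, `extract_answer`, and `grade` are hypothetical, and the exact leniency rules are an assumption, since the report does not specify the grading. The sketch assumes the prompt ends with the answer assist ‘Answer: (’, so a completion may begin with `B)`, `(B)`, or just `B`.

```python
import re
import string


def format_choices(choices):
    """Render choices as '(A) text' lines, one per choice."""
    return "\n".join(
        f"({string.ascii_uppercase[i]}) {text}" for i, text in enumerate(choices)
    )


def extract_answer(completion, num_choices):
    """Leniently extract a choice letter from the start of a completion.

    Assumed leniency (not taken from the report): optional leading
    whitespace, an optional opening parenthesis, then one valid uppercase
    letter followed by ')' or a word boundary.
    """
    letters = string.ascii_uppercase[:num_choices]
    match = re.match(rf"\s*\(?([{letters}])(?:\)|\b)", completion)
    return match.group(1) if match else None


def grade(completion, target, num_choices):
    """Return True if the extracted letter equals the target letter."""
    return extract_answer(completion, num_choices) == target


if __name__ == "__main__":
    print(format_choices(["yes", "no"]))
    print(grade("B) because ...", "B", num_choices=2))
```

Anchoring the match at the start of the completion keeps a word such as "Because" from being read as the letter B, while still accepting the formats a model is likely to produce after ‘Answer: (’.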