I wonder how it's even possible to evaluate this kind of thing without data leakage. Correct answers to specific, factual questions are only possible if the model has seen those answers in its training data, so how reliable can the benchmark be if the test set overlaps with what the model was trained on?
Or is the assumption that the training set is so big it doesn't matter?
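(For what it's worth, the usual mitigation I've seen benchmark authors describe is a decontamination pass: flag any test item whose word n-grams also show up in the training corpus. A rough sketch of that idea in Python is below; the 13-gram window, names, and threshold-free overlap check are purely illustrative, not any particular benchmark's actual method.)

    from typing import Iterable, Set

    def ngrams(text: str, n: int = 13) -> Set[tuple]:
        """Return the set of word n-grams in a text (lowercased)."""
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def build_train_index(train_docs: Iterable[str], n: int = 13) -> Set[tuple]:
        """Precompute the n-gram set over the (possibly huge) training corpus."""
        index: Set[tuple] = set()
        for doc in train_docs:
            index |= ngrams(doc, n)
        return index

    def is_contaminated(test_item: str, train_index: Set[tuple], n: int = 13) -> bool:
        """Flag a test item if any of its n-grams also appear in the training corpus."""
        return bool(ngrams(test_item, n) & train_index)

Of course, that only catches near-verbatim overlap; it says nothing about whether the underlying fact was seen in some paraphrased form, which is exactly the worry above.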