Agreed. We need to talk about these issues as a community and come up with effective strategies to prevent data contamination. We've done this for web development, and the result was a panoply of tools to validate form input. Now we need to do the same for our back-end tools.
For instance, statistical programming languages like R, which generally operate on nice tabular data, should come with built-in methods to enforce data validity. I should be able to tell R that I expect certain columns to contain only values between 0 and 1 and no NAs. This is an easy example because languages like R are somewhat dictatorial about how they want you to input and process data, but one can imagine the same sort of methods built into the base libraries of general purpose languages.
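As a rough sketch of what that could look like today, here is a hypothetical validate_column() helper in plain R. The function name, its arguments, and the example data frame are all assumptions for illustration, not an existing API:

    # Sketch of a column validity check; validate_column() is a
    # hypothetical helper, not part of base R.
    validate_column <- function(df, col, min = 0, max = 1, allow_na = FALSE) {
      x <- df[[col]]
      if (!allow_na && any(is.na(x))) {
        stop(sprintf("column '%s' contains NA values", col))
      }
      out_of_range <- !is.na(x) & (x < min | x > max)
      if (any(out_of_range)) {
        stop(sprintf("column '%s' has %d value(s) outside [%g, %g]",
                     col, sum(out_of_range), min, max))
      }
      invisible(TRUE)
    }

    df <- data.frame(score = c(0.2, 0.9, NA, 1.3))
    validate_column(df, "score")  # stops: column 'score' contains NA values

The point is less the specific helper and more that declaring expectations up front turns silent contamination into a loud, immediate error.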
On the other hand, maybe we need to be more rigorous about data validation across the board. We accept that automated/continuous testing is an effective mechanism for preventing bugs. We need the same kind of automated systems to check data files and flag them when there are issues.
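As a loose sketch of that idea, the snippet below scans a CSV for NAs and exits non-zero so a CI job would flag the file. The file name, function name, and the single rule checked here are placeholders rather than any real tool's interface:

    # Sketch of an automated data check run alongside a test suite;
    # check_data_file() and "data.csv" are placeholders.
    check_data_file <- function(path) {
      df <- read.csv(path)
      issues <- character(0)
      for (col in names(df)) {
        n_na <- sum(is.na(df[[col]]))
        if (n_na > 0) {
          issues <- c(issues, sprintf("%s: %d NA value(s)", col, n_na))
        }
      }
      if (length(issues) > 0) {
        cat("Issues in", path, ":\n", paste(" -", issues, collapse = "\n"), "\n")
        quit(status = 1)  # non-zero exit fails the pipeline run
      }
      cat("No issues found in", path, "\n")
    }

    check_data_file("data.csv")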