> The job of a scientist is really not to ship software, that's what a team of engineers would do.
I think that this is the real problem - in academia there is this idea that learning good practices is like a 'dirty' thing that is not required, whereas it would actually speed up the work and make it more reliable. If you look at chemistry or medicine, researchers there have good practices for managing the lab and respect them.
> in academia there is this idea that learning good practices is like a 'dirty' thing that is not required
I think you got me wrong. Shipping quality software is not 'dirty', but it requires a specialised focus. You cannot do everything by yourself - science and engineering are complementary skills. In your example of chemistry, the chemist who designs a molecule does not spend time shipping the molecule to the world.
Except that it wouldn't speed things up at all. Academia writes run-once code, which changes spec fourteen times in one week. Their use case is orthogonal to industry.
Have you considered that maybe the academics actually know what they are doing?
Lol I spent 5 years in academia, and I have a PhD in CS - I know what I'm talking about. Specs change in academia just as they do in industry, and I was still able to write unit tests and document my code in academia. And I know that in medicine and chemistry the time to publish is much longer, but that has nothing to do with the fact that they know how to properly use a microscope, clean the lab, and keep an inventory.
If you don't write unit tests, how do you hedge against the possibility of having bugs in your code?
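For what it's worth, a unit test for run-once analysis code can be very small. A minimal sketch using Python's standard `unittest` module; the helper `running_mean` and the wrapper `run_tests` are hypothetical names made up for illustration, not anything from a real project:

```python
import math
import unittest


def running_mean(values):
    """Return the cumulative mean after each element (hypothetical helper)."""
    means = []
    total = 0.0
    for count, value in enumerate(values, start=1):
        total += value
        means.append(total / count)
    return means


class RunningMeanTest(unittest.TestCase):
    def test_known_values(self):
        # Hand-computed case: means of [2], [2, 4], [2, 4, 6].
        self.assertEqual(running_mean([2, 4, 6]), [2.0, 3.0, 4.0])

    def test_empty_input(self):
        # Edge case that is easy to forget at 3am.
        self.assertEqual(running_mean([]), [])

    def test_constant_series(self):
        # The mean of a constant series should stay at that constant.
        for mean in running_mean([0.1] * 10):
            self.assertTrue(math.isclose(mean, 0.1))


def run_tests():
    """Run the test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(RunningMeanTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_tests()` before launching an experiment takes a fraction of a second and catches off-by-one and empty-input mistakes in this kind of helper.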
Most scientists have no training in computer science, much less engineering, but still need to write code sometimes to build experiments. They've largely taught themselves. You are not the norm.
I've taught dozens of grad students enough programming to get the job done, and it would have been a total waste of time to make the code that robust. They need experimental results next week, not a product demo, and only one computer is ever expected to run the code.
The software isn't their research project, it's a nuisance that they have to deal with. Accordingly they neither want to nor have time to do it perfectly. I cannot blame them.
That said, there should be a system to encourage actual trained programmers to get involved, including coauthorship and consideration in tenure decisions. The current system is bad, I'm just saying it's not the scientists' fault here. This is just literally not their domain of interest or expertise, and I would rather they focus on the thing they're uniquely good at.
> if you look at chemistry or medicine, researchers there have good practices for managing the lab and respect them.
Their studies / experiments last years.
In CS/ML/Applied Math you sometimes have to write an experiment with a deadline next week. Excuse me if, when I'm scrambling for a deadline at 3am, my mind isn't on TDD or on neatly packaging everything in a Docker container.
Hey, I feel you - and I understand the pressure - I have been in that situation. The point is that this:
> you sometimes have to write an experiment with a deadline next week.
shouldn't happen. And yes, at the moment it is like this - sometimes you will have to hack. But if the whole community starts to push for proper practices, instead of just saying "it is what it is", there will be fewer papers, of higher quality.
I keep mine in clear view on the shelf in the hope that its collected wisdom will radiate outward and suffuse into my code. That hasn't happened yet, but perhaps it has a useful psychological effect as a shrine to algorithms; whenever I am tempted by a quick, cheap hack I see the books and am steered back to the righteous path.
Actually, I usually just do the cheap hack anyway, but it is reassuring to know that it is there.
"a general search can be machine learning" I don't get this sentence. Machine learning is about building a mathematical model based on sample data, known as "training data".
Actually, any unsupervised method, including clustering, still has training data. The only difference is that the training set has no target y variable to measure an error metric against, hence the name unsupervised.
But the definition you mention is right. Any dataset that you use to fit your model is your training set, even if you don't have a train/test split or the like, because it is the data you trained your model on.
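To make that concrete, here is a minimal sketch of 1-D k-means (Lloyd's algorithm) in plain Python. It is not any particular library's implementation; the function name `kmeans_1d`, the starting centers, and the toy data are all made up for illustration. The point is that the list passed in is the training data, and no target y appears anywhere:

```python
def kmeans_1d(points, centers, iterations=10):
    """Fit 1-D k-means by Lloyd's algorithm, starting from the given centers."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: each training point goes to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its assigned points.
        # An empty cluster keeps its previous center.
        centers = [
            sum(cluster) / len(cluster) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers


# Training data: features only, no labels.
X_train = [1.0, 1.2, 0.8, 9.0, 9.5, 10.5]
centers = kmeans_1d(X_train, centers=[0.0, 5.0])
```

With this toy data the two centers settle near 1.0 and 9.67, the means of the two obvious groups. The model was fit on `X_train`, so `X_train` is the training set even though there was never a y to predict.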