For my master's thesis, I implemented a new and fancy algorithm. The code seemed fine and dandy for the usual, simple test cases, but once I tried more elaborate test cases, I found cases that didn't work well.
After contacting the author, who indicated he hadn't had such problems, I literally spent months trying to debug the code. When I finally gave up, rewrote my implementation basically from scratch, and found the same problems, I contacted the author again. He then acknowledged that there was indeed a problem with the method for these cases, that he understood it, and that he had found a way to fix it. In hindsight, the problem was not hard to understand (but still, the claims in the paper were unwarranted, IMO).
Conclusion? I wish I were a math prodigy; then I would have spotted the problem instantly. Also, be wary of claims made in papers.
Most papers are wrong. Not necessarily wrong in an essential way, but most of them overstate their claims.
There's a rule of thumb in the machine learning community: if you want an algorithm that does X well, find a paper on doing X, and implement the method that the paper compares its proposal against. That way you're near state of the art, but you're also using a method that many people have used and tested.
I know it's a bit of a tangent, but proactive, large-scale logging for models like these (such as those used in machine learning) may become desirable to meet the requirements of the GDPR. If you have to be able to explain how an algorithm made a decision, you need to be able to pull up data like this somehow.
I am a pragmatist looking for ways to work within the law, not an idealist looking to change it, so it doesn't matter what I believe about people with power. They make decisions, and we work out ways to operate within or around them.
A while back I was working on a system doing fairly complex engineering calculations and I implemented detailed logging of both the values used and the actual calculations performed.
This allowed me to generate a spreadsheet (with the values and calculations in place) that could show a non-developer exactly how the outputs had been calculated (you could use Excel's features to add visual annotations of precedents and dependents).
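Not the original system, but a minimal sketch of the idea using openpyxl, with made-up quantities and formulas: inputs from the calculation log are written as plain cell values, and the derived result is written as a live Excel formula, so a reviewer can use Trace Precedents/Dependents on it.

    # Minimal sketch (invented field names): dump a logged calculation
    # trace into a spreadsheet, inputs as values, the calculation itself
    # as a live Excel formula.
    from openpyxl import Workbook

    inputs = [("beam_length_m", 12.5), ("load_kN", 80.0), ("safety_factor", 1.5)]

    wb = Workbook()
    ws = wb.active
    ws.append(["Quantity", "Value"])
    for label, value in inputs:
        ws.append([label, value])

    # Design moment = load * length * safety_factor, written as a formula so
    # Excel's Trace Precedents shows exactly which cells fed into the result.
    ws.append(["design_moment_kNm", "=B3*B2*B4"])
    wb.save("calc_trace.xlsx")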
This might seem normal, but perhaps not in mathematics? There's the story of homotopy type theory, which first had to be invented, along with supporting proof-assistant software, to formalize and automatically error-check proofs, and which is still a niche topic in maths (because it seems faster to just think, I gather. Edit: and because of the problem of bootstrapping such a system, and Not Invented Here syndrome).
I implemented a library that does exactly this -- https://github.com/IGITUGraz/SimRecorder (in case anyone finds it useful). It supports storing data in both HDF5 and Redis (although I wouldn't recommend Redis for storing large numpy arrays).
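For anyone who only wants the HDF5 half of that without a library, the underlying pattern with h5py looks roughly like this (this is not SimRecorder's API, just a generic sketch with invented dataset names):

    # Generic sketch of recording simulation arrays to HDF5 with h5py;
    # not SimRecorder's API, just the basic pattern such a library wraps.
    import h5py
    import numpy as np

    with h5py.File("run_0001.h5", "w") as f:
        f.attrs["learning_rate"] = 0.01                      # run metadata as attributes
        f.create_dataset("weights/layer0", data=np.random.randn(128, 64))
        f.create_dataset("spike_times", data=np.array([0.1, 0.4, 0.9]))

    # Reading back later:
    with h5py.File("run_0001.h5", "r") as f:
        w0 = f["weights/layer0"][...]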
I don't know the author's use-case specifically, but that could fall over in two ways:
1) if you've got models that are re-generated periodically based on new inputs/algorithm tweaks, then you can potentially end up with quite a few of these as you scale.
2) if you want to trace/debug the reason your production system made a given decision, you need to log not just your model but all of the parameters that went into that decision. If that type of decision happens many times a day, you can end up with some pretty massive logs to go through.
In either case, storing that historical data in your transactional database can be a bit of a load, so it's ideal to keep it separate if you get any kind of volume. I've actually bumped into 2) at one job.
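To make 2) concrete, here's a rough sketch (hypothetical field names, not anyone's production schema) of an append-only decision log kept outside the transactional database: one JSON line per decision, recording the model version, the exact inputs, and the output, so a given decision can be replayed later.

    # Hypothetical example: append-only decision log, one JSON line per
    # decision, kept out of the transactional database (here just a local
    # file; in practice it might go to object storage or a log pipeline).
    import json
    import datetime

    def log_decision(model_version, features, output, path="decisions.jsonl"):
        record = {
            "timestamp": datetime.datetime.utcnow().isoformat(),
            "model_version": model_version,  # which model artifact produced this
            "features": features,            # the exact inputs used
            "output": output,                # what the system decided
        }
        with open(path, "a") as f:
            f.write(json.dumps(record) + "\n")

    # Usage: every production decision writes one line it can be replayed from.
    log_decision("credit-model-v7", {"income": 52000, "tenure_months": 18}, "approved")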