Glib answer: Don't work alone! There's two ways to think about this: 1 - Your pr...

dkersten · on Nov 6, 2019

My point was less about the specific issue I hit and more that 1) external circumstances that a restart won't resolve can cause failures, because 2) we're human and no matter how hard we try, even with a large team, things do slip through.

The difference with having a large team is less that all possible failure cases will get protected against (although more eyes and code review do help), and more that someone can always be available to fix things when something unexpected happens.

In my particular case, the majority of the system kept running fine. The part that failed was a streaming system which receives updates in realtime from an external system. The error was actually localised to one particular type of update, but that type stopped working because I didn't protect defensively enough against errors in that one particular case (I do have my database queries protected against errors, but this one slipped through). This caused other systems to not get these updates, so things that relied on them stopped working. It's not that they crashed, they just never received the updates they were waiting for.

Of course the fix is to trap all exceptions, log/notify, ignore and continue, so that at least one bad update doesn't affect other updates, but again, my main point was that we're human, so we can't possibly protect against everything that might cause a non-recoverable (without human intervention) error.
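For illustration only, the "trap, log, ignore and continue" approach could look roughly like the sketch below. It is not the actual code from the system described; `process_updates`, `handle` and the dict-shaped updates are hypothetical names and shapes chosen for the example, and a real version would probably also send a notification rather than only log.

```python
import logging
from typing import Callable, Iterable

logger = logging.getLogger("stream")


def process_updates(updates: Iterable[dict], handle: Callable[[dict], None]) -> int:
    """Apply handle() to every update and return how many failed.

    `handle` and the dict-shaped updates are assumptions for this sketch.
    """
    failed = 0
    for update in updates:
        try:
            handle(update)
        except Exception:
            # Log with the traceback, then carry on so that one bad
            # update cannot stop the updates that come after it.
            logger.exception("skipping bad update: %r", update)
            failed += 1
    return failed
```

Catching `Exception` rather than using a bare `except:` keeps Ctrl-C and interpreter shutdown working, and returning the failure count gives a caller something to alert on.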

> Finally, I have basic sanity checks for things like making sure a string is really a UUID

Yes, I did add this too after I hit this issue and it's a good point: validate EVERYTHING, even if you generate it yourself and think you can assume it will be good.
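As a rough sketch of that kind of sanity check (the helper name `is_uuid` is made up for this example, and insisting on the canonical hyphenated form is an assumption, not something stated in the thread):

```python
import uuid


def is_uuid(value: object) -> bool:
    """Return True only for strings in canonical 8-4-4-4-12 UUID form."""
    if not isinstance(value, str):
        return False
    try:
        parsed = uuid.UUID(value)
    except ValueError:
        return False
    # uuid.UUID() also accepts braces, a "urn:uuid:" prefix and
    # hyphen-less hex, so compare against the canonical rendering.
    return str(parsed) == value.lower()
```

Dropping the final comparison and returning `True` would give a looser check that accepts the other formats `uuid.UUID` understands.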

> Don't work alone!

That's the real solution, but sometimes it's not possible.

Thanks for your detailed response, though, it's appreciated.

gwbas1c · on Nov 6, 2019

> In my particular case, the majority of the system kept running fine. The part that failed was a streaming system which receives updates in realtime from an external system.

Is your product too complicated for a single-person business?

As a solo programmer, I can write and develop extremely complicated systems. These systems can be so complicated that I don't have time to run them, find customers, support customers, etc.

That, ultimately, is why I don't see myself running a single-person business anytime soon. I really enjoy complicated programming, and if I also have to handle ops, support, sales, etc., then what I program needs to be too simple to remain interesting.