"Be specific" sounds like a high-stakes interview, so I'd better answer. :)
* SITUATION: Factory reports MVP factory stations for pilot customer "not turning on" for the day, reason unknown.
* TASK: Get stations up in time for production line, so startup doesn't go out of business.
* ACTION: Determined the cause was an unexplained networking problem outside our control, which was tripping some startup-time connectivity checks. The stations were resilient enough that I could carefully edit the code on them to relax some timeouts (see the sketch after this list), and enough packets were getting through that we might be able to operate anyway. After that worked for one station, carefully changed the remaining ones.
* RESULT: The factory stations worked for that production day, and our startup was therefore still in business. We were later advised of the submarine cable failure, and the factory acknowledged the connectivity problems. Our internal post-mortem discussion covered why the newer boot code hadn't been installed, and revisited backup connectivity options. From there, thanks to assorted early-startup engineering magic and an exciting mix of good and bad luck, we eventually finished a year-long contract on a high-quality brand's factory production line successfully.
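(To give a flavor of the kind of startup-time check I mean, here's a minimal hypothetical sketch, not our actual station code; the host name, timeout, and retry values are all made up for illustration.)

```python
import socket
import time

# Hypothetical startup-time connectivity check, roughly the kind of thing
# whose timeouts got relaxed on the stations. All names and values here
# are invented for illustration.
CHECK_HOST = "backend.example.com"   # stand-in for the AWS Singapore endpoint
CHECK_PORT = 443
CONNECT_TIMEOUT_S = 30   # relaxed from something much smaller
MAX_ATTEMPTS = 10        # more attempts, since the link was lossy, not dead

def wait_for_backend():
    """Block station startup until the backend answers at least once."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            with socket.create_connection((CHECK_HOST, CHECK_PORT),
                                          timeout=CONNECT_TIMEOUT_S):
                return True
        except OSError:
            # Lossy link: back off briefly and keep trying, rather than
            # declaring the station dead at boot.
            time.sleep(min(5 * attempt, 30))
    return False

if __name__ == "__main__":
    if not wait_for_backend():
        raise SystemExit("backend unreachable; refusing to start")
```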
No technical wizardry in the immediate story, but a lot of smart things we'd done proactively (including triaging what we did and didn't spend precious, overextended time on), plus some luck (and of course the fact that some Internet infrastructure heroes were managing to route any packets at all)... all came together... and got us through a freak failure of a submarine cable we'd never heard of, which could've ended our startup right there.
(Details on Action, IIRC: Assumed command of the incident, and activated Astronaut Mode manner. Attempted remote access, and found the network very poor. Alerted the factory of a network problem in their connectivity to AWS Singapore, but they initially thought there was no problem. Could tell there were some very spotty connectivity problems (probably including by using `mtr-tiny`), so focused troubleshooting of the stations on that assumption. Realized how a station would behave in this situation, and that the boot-ish checks for various network connectivity were probably timing out. Or, less likely, there could be a bug in handling the exceptional situation. Investigating, very slowly due to the poor Internet, found that the stations didn't have the current version of the boot code, which would've reported diagnostics better on the station display, so factory personnel might see it and tell us. Using `vi`, made careful, minimal changes to the timeouts directly on one station, in the old version of the code on there (in either Python or Bash; I forget). Restarted the station, and it worked. Carefully did the same to the other stations. Everything worked, and other parts of the station software had been built smartly enough that they could cope with the production day's demands, despite the poor connectivity and the need to make network requests for each production item that passed through the station before it could move on.)
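(To illustrate that last bit about coping with the lossy link: my guess at the general shape is per-item requests wrapped in generous timeouts and retries, something conceptually like the sketch below. This is not a description of the actual station code; the endpoint, function names, and values are made up.)

```python
import time
import requests

API_URL = "https://backend.example.com/api/items"   # made-up endpoint

def report_item(item_id, payload, attempts=8, timeout_s=20):
    """Report one production item over a lossy link, retrying with backoff.

    Hypothetical sketch only; the real station code's resilience may have
    looked quite different.
    """
    for attempt in range(1, attempts + 1):
        try:
            resp = requests.post(f"{API_URL}/{item_id}", json=payload,
                                 timeout=timeout_s)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            if attempt == attempts:
                raise   # give up and let the station surface the failure
            time.sleep(min(2 * attempt, 15))   # simple backoff, keep the line moving
```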