I was working on a large project for a wafer fab company, and occasionally the compiler would crash during full builds with SIGILL (illegal instruction, for those who aren’t familiar with the signal). Compiler bugs are never fun, and this was particularly vexing because it was so inconsistent.
It took me awhile, but eventually I got around to thinking: What could cause the compiler to execute an illegal instruction? What could cause an illegal instruction at all?
I removed the outer case from my computer, and sure enough, all of the fans had died. The CPU was overheating during intense, long-running builds. Replaced the fans and the “bug” went away!
*This is my first comment since I created my account in 2009. I hope I did it right! ;-)
A friend of mine recently bought a computer with a really decent GPU (he needs to process significant amounts of non-English content through Whisper, which requires him to use the large model), and Whisper was running much, much slower than expected. It was a custom build, assembled by a small-ish company here in Poland. He opened the machine up, and it turned out there was some kind of foam inside that the company put there to secure all the components during transport. Thankfully, it was discovered and removed early enough not to cause any damage, but we were all very surprised.
It was a hand-me-down K6-II with (I think) 233 MHz clock rate. The thing had a tiny fan on top of the cooler that was roughly 4x4cm (if even that). The poor thing generally worked quite nicely, but had stuck bearings so it required a little nudge to spin every time the machine was turned on. I didn't have the side panel on because that was just too much of a hassle and usually would just reach in and start the thing blindly.
I forgot that one day and the machine had been running for about an hour (low load, so nothing too bad). I reached in and promptly gave myself quite a nasty burn blister because I touched the cooler instead of the fan.
Nice story. I've had something similar happen to myself. My computer generally worked well, the few times I tried to game though, it crashed after a while like clockwork. Turns out the fans on the PSU had died. Replaced it and never had those issues again.
Not a hardware bug, but in embedded I ran into a fun one early into my first job. I set up a CI pipeline that took a PR number and used it as the build number in a MAJOR.MINOR.BUILD scheme for our application code. CI pipeline done, everything worked hunky-dory for a while, project continued on. A few months later, our regression tests started failing seemingly randomly. A clue to the issue was that closing the PR and opening a new one with the exact same changes would cause the tests to pass. I don’t remember exactly what paths I went down in investigation, but the build number ended up being one of them. Taking the artifacts and testing them manually, build number 100 failed to boot and failed regression, build 101 passed. Every time.
Our application was stored at (example) flash address 0x8008000 or something. The linker script stored the version information in the first few bytes so the bootloader could read the stored app version, then came the reset vector and some more static information before getting to the executable code. Well, it turns out the bootloader wasn’t reading the reset vector; it was jumping to the first address of the application flash and executing the data there. The firmware version at the beginning of the app was being executed as instructions. For many values of the firmware version, the instructions the data represented were just garbage ADD r0 to r1 or something, and the rest of the static data before getting to the first executable code also didn’t happen to have any side effects, but SOMETIMES the build number would be read as an instruction that would send the micro off into lala land, hard fault or some other illegal operation.
Fixed the bootloader to dereference the reset vector as a pointer to a function and moved on!
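For anyone curious what that fix looks like, here's a minimal sketch in C. The base address, vector offset, and names are invented for illustration and are not the actual project's layout:

    #include <stdint.h>

    /* Hypothetical app-flash layout, loosely following the description above:
       version info first, then the stored reset vector. All addresses and
       offsets here are made up. */
    #define APP_BASE        0x08008000UL
    #define RESET_VEC_OFFS  0x08UL

    typedef void (*app_entry_t)(void);

    static void jump_to_app(void)
    {
        /* Read the reset vector as data and call through it, instead of
           executing whatever bytes (e.g. the version number) happen to sit
           at the start of the application flash. */
        uint32_t entry = *(volatile uint32_t *)(APP_BASE + RESET_VEC_OFFS);
        ((app_entry_t)entry)();
    }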
From CI pipeline to bootloader would make me about turn and nope out of embedded so fast if that was my first job.
That level of skill requirement is like a department in one. Hopefully that company had some patient seniors.
Early 90s, doing the first implementation of scheduler activations in a real kernel on a real machine. There's an occasional bug that shows up, we think it's a race condition or something. After lots and lots of debugging and thinking, end up in the debugger approaching a line where we think the bug manifests (not caused, but manifests). Looks something like this:
int g = 2;
if (g) {
    printf ("yes\n");
} else {
    printf ("no\n");
}
Obviously most of the time we see "yes", but every once in a while we see "no". Even in the debugger, using stepi, we hit the conditional, we confirm with the debugger that g is indeed non-zero. Totally impossible for the conditional to ever print "no", right?
------------
Well, when you're writing a re-entrant kernel context switch (as scheduler activations requires), you'd better damn well remember to restore ALL the registers on the processor, in particular the one that stores the result of a recent compare instruction.
We had skimped on this tiny step (IIRC, one extra instruction in the context switch code); the kernel is interrupted after the compare instruction but before the jump; scheduler activations dictates switching to a new thread; when we come back to the original thread, the apparent result of the comparison is reversed, and we print "no".
At least the paper got an award at Usenix that year :)
Once upon a time, we got a panicked email from a customer whose OmniOutliner file would no longer open. He’d written a novel in it and was understandably keen to not lose his work.
Sure enough, when we opened his file with the debugger attached, it crashed immediately. Curiously, the crash was deep inside Apple’s XML parsing code, which we used indirectly by saving the file in their XML-variant of a property list.
Looking at the file in a text editor, we eventually found a funny-looking character where there should’ve been an angle bracket (an opening or closing bracket of an XML element). Inspecting it in a hex editor revealed that the difference between the actual character and what it should’ve been was precisely one bit.
How on Earth could that happen?! A bit more sleuthing (haha) uncovered more of these aberrations, and it didn’t take long before we realized that they occurred at regular intervals.
We patched it up, emailed it back to the customer, and suggested he check his RAM. He soon replied, thanking us but then asking, “How did you know I had bad RAM from my novel?!”
I encountered a similar issue once. The first indication something was wrong was weird corruption issues across a variety of services in our kubernetes cluster. In particular I focussed in on a service that took gzipped messages from a queue, which was reporting that some messages could not be decompressed.
First I confirmed that I could pull the corrupt message from the queue and it was in fact corrupt - so the problem was not in the consumer (which was throwing the error) or (probably) the queue, but rather the producer which created the compressed message.
On a hunch, I took a corrupted message (about 64KB in total) and wrote a quick program that took each bit of the message and tried the decompress operation with that bit flipped. Sure enough, there was one bit at offset 13000 or so which, if flipped, made the message decompress and at least visually appear intact.
Anyway, it turned out to be a single node with a hardware issue of some kind - rather than diagnose it fully we ended up just replacing the node. Repairing all the corrupted stuff that services on that node sent out was a much bigger concern.
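The bit-flipping brute force is simple enough to sketch; something like the following, where try_decompress() is a stand-in for whatever gzip/zlib routine is actually in use (an assumption, not a real API):

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Stand-in: returns 1 if the buffer decompresses cleanly. */
    int try_decompress(const uint8_t *buf, size_t len);

    void find_flipped_bit(const uint8_t *msg, size_t len)
    {
        uint8_t work[64 * 1024];
        if (len > sizeof(work))
            return;

        for (size_t bit = 0; bit < len * 8; bit++) {
            memcpy(work, msg, len);
            work[bit / 8] ^= (uint8_t)(1u << (bit % 8));   /* flip one bit */
            if (try_decompress(work, len))
                printf("candidate single-bit error at bit offset %zu\n", bit);
        }
    }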
> under JDK1.4.1 once 2036 files are open any subsequent opens will delete the file that was supposed to be opened.
Obviously this is bad.
It was worse to debug. "Opening files" includes opening Java class files or JARs, so we'd see a system with some class files or jars missing and spent ages trying to work out why deployment was failing.
Then I saw class files disappear in front of me while I was using the system. That was one of the biggest WTF moments of my career. I assumed someone else was on the computer, then I assumed a virus, then hardware corruption.
It didn't occur to us to think the JVM would delete files instead of opening them for a long time.
My most memorable hardware bug was nowhere near as hard as this, but I'll never forget it.
Intel was trying to sell the 960s and sent us a dev board with that CPU. Nobody in the company could get it to boot up. It would power up but nothing would show up on the serial port. Eventually it was my turn to look and for some reason I happened to notice a pull-up *capacitor* on the UART VCC. I looked at the schematics and indeed it was there. A simple jumper to bypass it (back in those days we had big, manly components; none of that surface mount shit) and what do you know: the serial console responded. It had booted up just fine, but was mute.
After that we could do development but it was immediately clear to me that the 960 was DoA. It's not like we were the first to get that board!
I was debugging a TI DSP based board that I designed. It would come up, execute some of my code, and die. It took 3 weeks and lots of back and forth with TIs tech support before I found out that some of the ground pins were left disconnected. The guy that laid out the PCB didn't connect them though they were connected in the schematic. We went back to the PCB editor, zoomed way in, and lo and behold there was this tiny unconnected segment.
I don't remember the details any more (this was like 30+ years ago) but I think TI's support was on the right track and I was convinced this can't be true because I had the connection in the schematic. If you probe this with a DVM or a scope it will look connected (because the pins are connected internally) and so it's really hard to find out.
This taught me a valuable lesson that anything can be debugged with enough time and persistence. Some things take longer and that's life.
Fucking floating pins, man. They're the worst, because they work during bringup and development, and only fuck you over when scaling to mass production
Instead of a pull-up resistor between the uart vcc pin and vcc, there was an electrolytic cap. That is why I put “capacitor” in italics — I was trying to emphasize how nonsensical it was. To the DC power to the chip the cap was of course an open circuit and thus by shorting it out I powered the chip up.
The chip was supposed to read out a unique ID, but instead read out all zeros. Doubly weird, because it was a flash chip. You’d expect a blank flash chip to spit out all 0xff, not all 0x00.
I ran it past the lead EE, and the lead software engineer, and the chip co FAEs, and they all said I must have done something wrong.
But they all came back later having repro’ed my demo.
Two months of kicking it up the chip co later, I got a nice note from the CEO of that chip company saying “Thanks for the bugfix” - with a bottle of Dom Perignon.
My recent weird "bug" was when I installed a new Linux distro, just last week, to get away from weird graphical issues with KDE (switched to PopOS for hardware support).
On boot, my mouse started moving really erratically. I would try to move it and it would just jump all around the screen, but only with my Razer mouse, not my Logitech one.
Great, I think, I traded display issues for mouse driver issues. But it was weird, because it was fine during the live USB.
I spent a bit of time debugging inputs etc., thinking maybe it was a weird driver issue.
I suddenly remembered how, back in my HS days, the school ordered new mousepads which had bright yellow lines on them from some logo, making them incompatible with the "new" laser mice.
I was working on an Rx drug pricing system right out of college. I couldn’t always get my price calculations to match what a major insurance carrier came up with, and the contract clearly stated the formula. Turned out the big carrier had a bug in their calculations that surfaced only under a specific set of circumstances. I felt very proud of myself for figuring out their bug and did a detailed write up and submitted it to the carrier. Their response was “yeah we know, we’re not going to fix it though”. That floored me but I was right out of college and pretty naive hah.
I once (about 10y ago) experienced hardware that got tired. A customer replaced the usual hard disks with shiny new Seagate SMR drives, because they had more storage capacity. Funny thing is that they could not handle the sustained 100MB/s we were feeding them. So after about 20 minutes they started slowing down and after half an hour they stopped working for about 20 minutes and then they were fine again. Obviously the customer complained about our storage product and forgot to mention this small fact. Once we figured it out we had good laugh.
That's interesting. My old server about 10 years ago had a Seagate black which died. I replaced it with a Seagate green. I noticed things slowing down more and more when the disc writes got heavy. It could freeze up for minutes at a time, then recover without any errors. It took me weeks to realise what was happening because… Because I don't actually know why. In hindsight it was obvious. Maybe the Seagate green was an SMR drive. Either way, it was nasty and caused a lot of frustration.
A quick check just now and it seems that the Seagate green were SMR. Fuckers never put that on the box did they. Bastards.
I bought one of the first versions of those shiny Seagate SMR drives, specifically to store my (encrypted) backups. It failed after a couple of months, so I returned it and got a replacement. Which failed after a couple of months. So I joined the large chorus of "I'll never buy Seagate again".
We do try to space them out a bit, to avoid too much repetition, but anything up to once a year is fine. This one hasn't had a thread since 2017, so completely ok.
This story is fascinating in a lot of ways, but one which jumps out at me is: I don’t think the particular pre-“aha!” wondering about timing would ever occur to me in the domains I’ve worked. I guess maybe I’d discover it in the repro isolation process because that elimination is often very illuminating (it’s basically how I taught myself to program!), but it wouldn’t ever come to mind unless I was staring at it while debugging.
Say what you want about the ills of high level abstractions, but not having to think about the implementation details of clock sync all the way down to the metal is a pretty nice convenience when you can afford it.
I was implementing a TCP split proxy (using Adam Dunkels' lwIP stack) on a custom SoC with a 16-way multi-core (ARM+MIPS ISA mishmash) for the data plane. Memory was divided into different regions, each with a specific set of policies. I had gotten my single-core proxy working and then added a Mutex to the TCP control block to parallelize my code across all the cores. Testing resulted in a fatal crash. After rolling back the checkins one by one, I narrowed down the problem to the load-link/store-conditional instructions (LL/SC) used to implement the Mutex. Now I was stuck with no clue as to why executing these instructions resulted in a crash. Cue me cursing everything about the chip in my cubicle. One of the senior engineers who was there in the beginning during the design of the SoC and hence knew its quirks heard my lamentation, came over, took a look, and promptly solved the problem. Remember the different policies for the different memory regions I mentioned earlier? It turns out that I had placed my TCP control block, and hence the Mutex in it, in a certain region of memory where LL/SC instructions were inadmissible, thus resulting in the crash. Shifting that data structure to a different region of memory solved the problem.
Lesson learned: When working on a custom SoC take nothing for granted even hardware instructions.
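For context, a lock like that boils down to something like this minimal sketch using C11 atomics (not the actual code): on ARM/MIPS the test-and-set typically compiles to an LL/SC (e.g. LDREX/STREX) pair, which is exactly the part that faults if the lock variable sits in a memory region where exclusive accesses aren't permitted.

    #include <stdatomic.h>

    typedef struct {
        atomic_flag flag;   /* initialize with ATOMIC_FLAG_INIT; must live in
                               memory where LL/SC accesses are allowed */
    } spinlock_t;

    static void spin_lock(spinlock_t *l)
    {
        /* Typically compiles to an LL/SC retry loop on ARM/MIPS targets. */
        while (atomic_flag_test_and_set_explicit(&l->flag, memory_order_acquire))
            ;   /* spin */
    }

    static void spin_unlock(spinlock_t *l)
    {
        atomic_flag_clear_explicit(&l->flag, memory_order_release);
    }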
Having spent the better part of 30 years working on/with/around embedded systems, I can't even count how many bugs I've bumped into that were hiding in between software and hardware. Or between software and compiler/tools/OS. Or between hardware and spooky RF black magic.
I got one. Embedded linux board with a radio transceiver that would occasionally crash when the radio sent a packet out.
Found two problems.
The reset push button was acting like an antenna. Caused the reset chip to assert reset. Confusingly the reset pin was bidirectional so hard to tell who was asserting it. 10k pullup fixed that.
But the problem persisted at a much reduced level.
Finally figured out the RF was tripping the over-current protection on the low-dropout regulator for the processor's 1.1V core supply. Voltage would sag about 50mV which was enough to send the uP into the weeds.
The first one was actually a surprise since often you don't see that. Most random stuff just makes for really, really bad antennas.
The second one is pretty simple. The voltage regulator has an internal circuit to detect over-current and turn off the regulator. So if the current is too high or the part gets too hot, it turns it off, preventing the regulator from letting the smoke out. In this case somehow the RF was tripping that circuit for about 50ms.
My favorite bug this month was while setting up a development environment with the AVR-ICE.
I tried to save some company money by not buying the (optional) case and programming cable assembly -- figured I could just use another not-80$ SWD cable (also 3d-printed a case and a SOT-23-6 programming adapter).
After much cursing and hair pulling, I noticed that the header for the SWD cable was installed upside-down on the PCB. So the red wire on the ribbon cable was pin 10 instead of pin 1. In their defense, they did correctly indicate this on the solder mask, I just didn't see it through the (opaque) case.
My best guess as to why the cable assembly costs 80$ is that they again reverse the pin order on it to silently fix the bug on the PCB instead of just shipping a standard cable.
It turned out to be worth the engineering time to deal with the bug, but not by as much as I hoped. It's a pretty neat product despite this bug, definitely more modern than the venerable STK500 that I used previously (which itself had been converted to a USB device after the level converter failed).
Worked on a chart generating service in Java some 20 years ago. At that time IBM released their JVM. Upon first tests it worked perfectly and significantly faster than Sun's JVM. After testing it further, making tens of thousands of charts, we deployed it to production. However, in production it would stop mysteriously after some time. Added a lot of logging; there were no issues in our code. After a while I realized it failed somewhere after 65536 charts were made! That was pretty suspicious. There's nothing in our code that would overflow some 16-bit counter, it worked under another JVM, and the crash was not a Java exception. If I remember correctly it was not even a crash at all but the entire process would freeze.
It turned out it was a problem with that specific IBM JVM. We created a new thread for each chart, and that JVM froze after 65536 created threads! Moral of the story: if you already test with tens of thousands of requests, make sure it's at least 64k of them.
A decade ago I worked with sortation devices for mail-order companies, and one of our clients reported that they sometimes had issues with items that were sorted wrong, but were unable to reproduce it. They used trays for sorting, and each tray had a barcode with a unique ID.
I spent a LONG time looking at logs until I ended up enabling debug logging, and because the site was on a 1200 baud modem I had the client burn the logs to DVD and ship them to us.
I ended up writing a piece of Perl code to parse the logs and insert them into a MySQL database where I could then trace the individual sorter trays by ID, and by some obscure miracle of sleep deprivation and too much coffee, I managed to find a correlation.
Turns out the bug only showed up when a tray had been used for sorting inbound items, then reused for sorting outbound items, and then used for sorting inbound items again (not outbound, which would have reset it); that's when the bug would happen.
The fix was traced to a single line in an if/else statement.
Time to fix : around 1 hour including tests.
Time to find bug : around 300 hours.
On something more relevant to the article: I used to write operating systems for mobile phones, and we spent A LONG time debugging an issue where our brand new display driver was acting up.
After attaching a Lauterbach debugger we finally managed to track it down to the compiler.
Turns out :
int i = 1+2+3;
would mean i=3 in the code as the compiler only considered the first two variables in the assignment list.
Another fun feature of that compiler was the fact that when you incremented heap memory past the memory page, it would forget to increment the page pointer, meaning it simply just wrapped to 0 and the memory you referenced was nowhere near what you'd expect :)
I once did some low-level GPU programming on a project aimed at the Samsung Galaxy S8. It was a phone case with extra features like an iris and fingerprint scanner, connected via the USB port.
It would work perfectly on our test phone and occasionally crash on other phones. Long story short, we narrowed it down to it crashing on phones with a specific SoC that was used in other parts of the world.
For some reason, when you copied an image straight from the phone camera (used to recognize and align eyes compared to the infrared iris camera) to the GPU and tried to access it, it would segfault in the non-western SoC. The data wasn't initialized yet.
My (hurried, we were releasing next month) fix was to add a rsdebug("This fixes a crash!\0"); to the code. The extra delay to go to the kernel and back fixed the race condition almost all the time. Someone later fixed my code from 99.99% stable to 100%, but I was in another project by that time, so I have no idea what they did.
My most horrible hardware "bug" that drove me up the wall was a long-forgotten wireless keyboard stuffed in a closet, acting up with my PC whenever the cat decided to visit. The distance was just right that it was very intermittent.
I had that with a wireless mouse I'd forgotten about.
It was made harder because every time I began trying to figure things out, I'd disturb the cat. Who then left. It was a black mouse on a black surface in a poorly lit area.
I laughed so hard when I realized what had been causing my problem.
> As a programmer, you learn to blame your code first, second, and third... and somewhere around 10,000th you blame the compiler. Well down the list after that, you blame the hardware.
I wish this were the case. The average programmer blames whatever library/third-party/etc. they're using, then somewhere around the 10,000th they might blame their own code.
(I run a third-party service and everything is always my fault, even syntax errors.)
This feels like it might be selection bias. I’m cautious to say that with more certainty, I hope my caveat below makes that clear.
No doubt these people exist, and I’ve worked with people who were similarly quick to stop debugging and come bother me with their problems that had nothing to do with whatever underlying thing I’d built. But the vast majority of my users have not come to me at all, because either they haven’t had any problems or they’ve done the appropriate legwork to investigate their own stuff first. A few come to me with legitimate bugs.
Caveat: these relative proportions have varied a lot depending on context, of course. But I can’t think of any context where the worst case was even conceivably close to average.
You’re right, I probably am biased. I run an HTTP API, and a lot of devs are unfamiliar with how to use HTTP APIs, especially when interfacing with an API without an SDK.
But take a look at any moderately popular OSS and the issues will be littered with “issues” that are actually due to the app code, not the OSS.
With that being said -- it was probably hyperbolic to use the term “average”, but meh.
Sorry, the particular bias I wanted to point out could be more clear. I think it’s possible that many of your users solve their own problems before they ever report anything at all. You won’t hear from them, so their problem solving demeanor won’t register unless you’re assuming some users are self-serving their problems. The “selection” part of this bias is applying your experience from people who report issues to people who experience issues. It’s very easy to get a very wrong picture of your user base by assuming the people who show up are representative of the people who use the thing you made!
It’s also very possible the thing you made is so effortless or free of potential problems that the only people having them are having their own problems! That introduces another bias: the only problems are self error. Unfortunately that means you built the thing well and you only get crap feedback, and unfortunately for you it makes feedback itself seem suspect.
If there's a bug in 3p code, they'd need to open up a PR to the open source library and be stalled on 3 weeks for the maintainer to see it. If it's a one-line bug in their own code, it's one glance at a stack trace.
There are many strange assumptions here. Why in your example is the library open source? Even when it is, why would the developer be expected to know how to fix it? Why in your example is the bug in the developer's code a "one-line" bug fixable by "one glance at a stack trace"?
The point is that, if the bug's cause is not immediately obvious, some developers tend to jump to "it's the 3rd party library", because in many cases they can then claim to be unable to fix it, or offload the responsibility to the 3rd party.
The assumptions were to minimize the amount of work for the developer in this case. If it wasn't an open source library then they'd need to roll their own, or scrap the entire feature if it isn't workable, both of which involve a good deal more work than opening up a PR.
I also like to espouse a philosophy that problems should be investigated from inside out. Start with what you had direct control over, assume the issue is with something you did. Then work your way out.
However, I have watched more than one person do the exact opposite: assume everything else was wrong before even looking at their own contributions.
And this holds not just for programming, but for any endeavor.
My first job in the 1990s was programming VB for MS Access. I learned the hard way that if the bug wasn't obvious, it was probably Microsoft's fault. And over and over again I was able to demonstrate that it really WAS their fault.
My next job was in Perl. It took me a couple of years to find a bug that wasn't my own mistake.
A friend of mine learned to write Linux kernel drivers. We had to write tests for hardware from a big-name manufacturer. We had a USB CD-ROM drive that inexplicably failed some test. We contacted the engineers and they were very responsive about the firmware; we also performed many mechanical tests. My friend decided to flex his driver-writing muscles and spent a whole day modifying the Linux kernel to carefully investigate, bit by bit, what was being sent from the CD-ROM. After a very long investigation he categorically said: "It can only be the cable."
It was the cable indeed. And we could have discovered that with much less effort.
I was writing the motor controller code for a new submersible robot my PhD lab was building. We had bought one of the very first compact PCI boards on the market, and it was so new we couldn't find any cPCI motor controller cards, so we bought a different format card and a motherboard that converted between compact PCI bus signals and the signals on the controller boards. The controller boards themselves were based around the LM629, an old but widely used motor controller chip.
To interface with the LM629 you have to write to 8-bit registers that are mapped to memory addresses and then read back the result. The 8-bit part is important, because some of the registers are read or write only, and reading or writing to a register that cannot be read from or written to throws the chip into an error state.
LM629s are dead simple, but my code didn't work. It. Did. Not. Work. The chip kept erroring out. I had no idea why. It's almost trivially easy to issue 8-bit reads and writes to specific memory addresses in C. I had been coding in C since I was fifteen years old. I banged my head against it for two weeks.
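To be concrete about what "trivially easy" means here, the access pattern is roughly the following (the register map is invented for illustration, not the real LM629 layout):

    #include <stdint.h>

    /* Hypothetical memory-mapped registers, for illustration only. */
    #define LM629_BASE    0x40000000UL
    #define LM629_STATUS  (*(volatile uint8_t *)(LM629_BASE + 0x0)) /* read-only  */
    #define LM629_COMMAND (*(volatile uint8_t *)(LM629_BASE + 0x1)) /* write-only */

    static uint8_t lm629_read_status(void)
    {
        /* A single volatile uint8_t read: the compiler emits an 8-bit load,
           and the adjacent write-only register should never be touched. */
        return LM629_STATUS;
    }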
Eventually we packed up the entire thing in a shipping crate and flew to Minneapolis, the site of the company that made the cards. They looked at my code. They thought it was fine.
After three days the CEO had pity on us poor grad students and detailed his highly paid digital logic analyst to us for an hour. He carted in a crate of electronics that were probably worth about a million dollars. Hooked everything up. Ran my code.
"You're issuing a sixteen-bit read, which is reading both the correct read-only register and the next adjacent register, which is write-only", he said.
I showed him in my code where the read in question was very clearly a *CHAR*. 8 bits.
"I dunno," he said - "I can only say what the digital logic analyzer shows, which is that you're issuing a sixteen bit read."
Eventually, we found it. The Intel bridge chip that did the bus conversion had a known bug, which was clearly documented in an 8-point footnote on page 79 of the manual: 8-bit reads were translated to 16-bit reads on the cPCI bus, and then the 8 most significant bits were thrown away.
In other words, a hardware bug. One that would only manifest in these very specific circumstances.
We fixed it by taking a razor knife to the bus address lines and shifting them to the right by one, and then taking the least significant line and mapping it all the way over to the left, so that even and odd addresses resolved to completely different memory banks. Thus, reads to odd addresses resolved to addresses way outside those the chip was mapped to, and it never saw them. Adjusted the code to the (new) correct address range. Worked like a charm.
But I feel bad for the next grad student who had to work on that robot. "You are not expected to understand this."
Heh messing with the traces is pretty nuts. In the book Where Wizards Stay Up Late about the origins of the internet there was a similar story. Some grad student needed a delay in some execution path so he cut the trace and used a very long wire to introduce the delay. I can’t remember the story exactly but it was like the ultimate brute force fix, using the laws of physics as the hack.
This is a much better war story than most of its kind, thanks. I was terrified that you were just blithely doing a 16 bit read, I'm glad there was a much better explanation.
Please do get started. This sounds like an awesome story! (For HN, that is, maybe not so much for dinner parties).
Just thinking about the thought process you had is entertaining on my end, because my procrastinating self would have delayed the part about going into the ceiling as the last possible resort... So what evidence was strong enough that you figured there must have been something wrong with a particular cable, or all cables, and just a specific section?
That particular section of ethernet, which served about half of a floor of an entire building was intermittent. It would work the vast majority of the time, but would have mysterious periodic outages, depending on the phase of the moon, or other mysterious cosmic events.
Working at a university, no one would pay for me to have a time-domain reflectometer (aka TDR), which would have helped determine if there was a bad spot in the Ethernet cable somewhere.
One of the assistant directors of the lab told me that I didn't need a TDR, because I could just diagnose the problem with a signal generator and an oscilloscope. (Which is what a TDR basically is, but just bundled up to be convenient to use.) I ended up doing just that, and was able to determine that there was a reflection happening on that section of Ethernet cable, but IIRC, narrowing down the location relied on me having more knowledge than I had about the speed of light in an RG-58 cable. And also, wheeling around and working with the oscilloscope was a PITA.
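For what it's worth, the arithmetic itself is just round-trip time times propagation speed, divided by two; the missing piece of knowledge mentioned above is the cable's velocity factor (around 0.66 for RG-58). A back-of-the-envelope sketch with an invented echo delay:

    #include <stdio.h>

    int main(void)
    {
        const double c  = 3.0e8;            /* speed of light, m/s            */
        const double vf = 0.66;             /* typical velocity factor, RG-58 */
        const double round_trip = 500e-9;   /* example: 500 ns echo delay     */

        /* Distance to the fault: half the round trip at the cable's speed. */
        printf("fault roughly %.1f m down the cable\n",
               c * vf * round_trip / 2.0);
        return 0;
    }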
Eventually I found someone to loan me a real TDR, and was able to determine how far down the cable the problem was occurring. Of course, even with that knowledge, determining where the problem was occurring was a challenge, since the cable snaked in a out of everyone's offices.
I followed the cable, applying the TDR at various points, until I got close to where the reflection was occurring, and it seemed to be occurring where the cable ran through the ceiling for a while.
I should note that all while I'm doing this, people are griping heavily, since it required disconnecting that section of Ethernet, meaning that people couldn't get their work done.
In any case, I got a ladder, pried up some ceiling tiles, looked up into the ceiling, and found a section of the Ethernet cable that had been spliced. At first I figured that one of the splices was bad, but eventually I noticed that the cable that had been spliced in was RG-59.
In case you don't already know, RG-58 and RG-59 look almost identical to each other. IIRC, the only real way to tell the difference was by reading the print on the cable.
Whoever spliced in that piece of cable should be drawn and quartered, but once I replaced that bit of cable with RG-58, everything then worked fine from then on, with no more intermittent outages.
1. Mac build crashes with "illegal instruction" due to AVX512 instruction that the Mac CPU doesn't support. Problem is though that the AVX512 code is in its own file, and this particular function is only called if AVX512 is supported by the CPU. So this code should never even run, and in fact it doesn't, so what gives?
Turns out that the AVX file is compiled with -mavx512f (sensible enough), and that this file includes a header that defines:
const float SQUARE_ROOT_OF_2 = (float)sqrt(2.0f);
Turns out that GCC compiles this to code including AVX512 instructions, which get executed completely bypassing the "if AVX512 is supported" check.
Fix: Change the constant to a numeric value.
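Roughly what that looked like (the constant is from the story above; the rest is assumed). In a translation unit built with -mavx512f, the initializer for the constant is itself compiled with AVX-512 instructions and runs at startup, before any runtime CPU-feature check gets a chance to execute:

    /* Before: generates startup code in a file built with -mavx512f. */
    /* const float SQUARE_ROOT_OF_2 = (float)sqrt(2.0f); */

    /* After: a plain compile-time literal, nothing executes at load time. */
    const float SQUARE_ROOT_OF_2 = 1.41421356f;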
2. Code crashes oddly on shutdown. Debugging shows destructors run twice.
Turns out the project is split into a large number of libraries one of which is 'shared', and includes very general purpose stuff like logging. 'shared' then gets linked into other libraries, which get linked into the resulting binary.
When linking statically this has the fun result that libfoo links to libshared, libbar links to libshared, and then libfoo and libbar make up the binary. Now there are two copies of libshared that end up in the binary, and this results in static variables being constructed and destructed twice.
Software, but a nice bug: A really long time ago I worked at a company creating a nice portal application in ASP.NET (version 1.1) for a client. Was cool to build. The client did not follow our guide on how to install the application. We told them it should run on a separate machine and they just crammed it together with 7 other web applications. Since the portal had a login feature where people could edit their resumes, it was quite sensitive. At a certain point we got a call that people saw each other's resumes. We used most of the desktops in our office to simulate the issue and wrote scripts to simulate different users. It took weeks to finally reproduce it.
As it turned out it was not our problem but a bug in ASP.NET... we had lots of calls with multiple offices of Microsoft. At some point we heard nothing back from them. We wrote our own state manager to avoid the issue but that also did not solve it.
A few months later .NET 2.0 came out. One of the items in the release notes was a fix for an issue where too many requests on an ASP.NET server (IIS) would make the http.dll (not sure of the name) serve a cached version of the previous request.
We lost the account and 100K of work that was never paid, and we almost went to court over this one...
> As a programmer, you learn to blame your code first, second, and third... and somewhere around 10,000th you blame the compiler. Well down the list after that, you blame the hardware.
So, first - in many settings, the hardware is more likely to be the source of the problem than your compiler; the question is what has more churn - the compiler code or the chip you run on.
But regardless - the compiler is much higher than the 10,000'th item on the blame list. Even mature, popular compilers have bugs! Hell, they have many known, open bugs! The subtle ones, which don't manifest easily, can stay open for quite a long time. See:
I personally have encountered and even filed several of them, and it's not like I was trying. Some of these were even the result of "Why does my code not work?" questions on StackOverflow.
One tip, though: Play one compiler against another when you begin suspecting your compiler, or the hardware. The buggy behavior will often be different. And of course run multiple times to check for variation in behavior, like the author had.
After the 10,000th bug in your program that turned out to be a bug in your code, not the compiler, the compiler ends up a long way down the list.
Even if compilers do have bugs, the code you write (OK, the code I write) has a lot more bugs than the compiler does.
"It's never the compiler" is like "It's never lupus". Sure, sometimes it is actually the compiler, or lupus. But 9,999 times out of 10,000, no, it's not. You want to think it is, because a bug in someone else's code is much easier on the ego than yet another bug in your code. But reality and experience is a harsh teacher, and cares not for your ego.
If you've found several compiler bugs without even trying, and often enough compared to the number of bugs your write yourself that the compiler is a likely culprit when you come across something hinky, well, by my reckoning that puts you in pretty rarefied company. Congrats on being a 100x developer, or whatever. For the rest of us mortals, it's never the compiler.
It really depends what sorts of compiler bugs you're referring to. I think you're thinking of codegen bugs, which are indeed rare. But on the front-end/parsing side for something like C++, it's really not uncommon to encounter buggy compiler behavior. This isn't an indictment of their developers though - the spec is just super complicated and confusing with weird corner cases and ambiguities.
Can confirm that I routinely run into situations where I think the C++ compiler should be able to _nearly_ trivially infer the types of the code I'm writing, but for some reason or another, it can't.
I have no idea if these situations are due to the spec, bugs, or a mixture of both, but they do happen, and pretty frequently at that.
That depends a lot on your programming environment
Today, for major compilers like gcc, I'd agree. But I've worked with a lesser known compiler in the '90s, that would miscompile things on a monthly basis. Sometimes just renaming a local variable would 'fix' it.
Another world is embedded, where you might work with a compiler provided by the hardware vendor. These tend to be focused on hardware, with software like compilers as an afterthought. I've heard plenty of stories of decades-old vendor-patched gcc compilers or even some half-tested monstrosity from the vendor itself. Stumbling upon compiler bugs is distressingly easy over there.
Same goes for hardware. Any x86 based desktop/laptop/server is ridiculously stable, even considering overclocking and ram bit flips. But embedded hardware is a whole different can of worms. I've stabilized crashing software by adding a capacitor to a hardware board.
None of this means I'm a 100x dev, just that different parts of IT have different experience.
> But regardless - the compiler is much higher than the 10,000'th item on the blame list. Even mature, popular compilers have bugs! Hell, they have many known, open bugs!
I don't even understand when compilers started being thought of as these perfect, bug-free programs. It's been some kind of gradual change over the decades. A lot of people seem surprised when I mention that around 15 years ago -O3 in gcc was practically unusable. I don't mean "it would actually degrade performance", I mean "it would break your program".
One of the GCC 3.3.x releases (possibly 6?) was completely broken if you attempted to build from source because they applied some last minute patches that didn't compile on most systems and never tested it.
That said, the reputation in those days was that the compiler was essentially perfect circa GCC 2.7, and the rewrite to support C99 created innumerable bugs that would never be fixed, especially if you used that sketchy -O2 option. -O0 for correctness and -O1 for ahem, "speed".
It took awhile to get back there, but mainstream compilers are probably the most correct and reliable software in common use today, often exceeding large parts of the OS they're running on. It's only with weird, obscure targets and new features that you tend to run into the rough edges that are actually broken.
I would bet that a substantial fraction of the rumoured GCC 3 bugs were originally caused by undefined behaviour in the compiled source, as new advanced compiler optimizations were being introduced in the GCC 3 series. (On the other hand, those newly introduced optimizations likely did have many bugs to begin with...)
At the time of GCC 2.7, no one really expected pointer access through the wrong type or signed integer overflow to cause anything weird. The CPU would just read whatever was in memory at the pointed address, and integers would wrap around just like in assembly.
There are comments in studiomdl (Valve Software's model compiling program) that say stack fixups in Visual Studio 2002 or 2003 broke Release builds so optimisations are turned off for that project.
I certainly remember a reasonable stretch of time where the advice was that the Linux kernel would not compile correctly with the latest gcc with optimisations on, and you should use the known good version of gcc instead.
TBH, I'm surprised by that. I would have thought compiler authors would not have released optimization options in this state - when such breakage is encountered by testers of nightlies or beta releases.
A good chunk of what people (used to) term "optimization bugs" fell under that umbrella. Program works w/ optimization off, but fails w/ optimization on. Or optimization would "fail" because of program bugs like uninitialize memory that "works" w/ optimization off.
Even with old compilers, if you turned optimization on and saw problems, it was almost always an issue with the code being compiled, not the compiler. But that's often not how the blame was laid.
My favorite one:
The company I once worked for used an outdated version of sqlite (3.8.6) in one of their products. The databases involved got bigger and bigger, and in a very big project one of the "already known to be slow" queries took more than an hour on my laptop, making the tool unusable.
On a quiet day, I was able to save the temporary table used as part of the process and run the problematic query against it in an isolated fashion.
The query returned an extremely high number of results and when I discovered this I questioned my SQL-fu, my sanity and my trust in computers.
I found that we were hit by a bug that was fixed 6 years before I discovered it (https://sqlite.org/src/info/6f2222d550f5b0ee7ed). Sqlite's query planner assumed that a field with a not null constraint can never be null, which isn't the case for the right hand table in a left join.
I fixed it by adding a not null check in the query and then later by updating the library. After that the 1 hour query ran in ~700 ms.
Not a "hard" bug but a useful lesson in any case. I worked on a set of stress tests for a major middleware product and came into the office on a Monday morning to check the 72-hour over-weekend runs. We were getting close to release date and things were settling down so I wasn't expecting anything major. Except they'd ALL failed. It took us far longer than I'd care to admit to figure out what had gone wrong - I wasn't working on it non-stop but I definitely remember it taking quite some time. I think it was a colleague who figured it out later that week.
Anyway, what had happened was that our Perl test harness was tracking time elapsed in the 72-hour run as seconds since the Unix epoch, but was comparing them using the lexicographical order operator (lt versus <). Everything worked until the time ticked over from 999,999,999 seconds to 1,000,000,000.
I just looked up those timestamps to check my memory, and I can now see why fixing it wasn't our top priority that week... the 999999999/1000000000 transition happened the weekend before 9/11.
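For anyone who wants to see the failure mode in compilable form, here's the same trap transposed to C (the original was Perl's lt versus <); lexicographically, "999999999" sorts after "1000000000" even though it is numerically smaller:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *before = "999999999";    /* one second before the rollover */
        const char *after  = "1000000000";

        /* Lexicographic: '9' > '1', so "999999999" compares greater. Wrong. */
        printf("strcmp says before %s after\n",
               strcmp(before, after) < 0 ? "<" : ">=");

        /* Numeric: the correct ordering. */
        printf("numerically, %ld < %ld\n", 999999999L, 1000000000L);
        return 0;
    }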
Many years ago I was working on a device driver for a position sensor. After deployment the customer complained that every time they started another process on the monitoring machine where the driver was installed the position sensor readings registered a slight movement of the object the position sensor was attached to. The object was about 20m away from the monitoring machine and weighed many tons (20?). After hearing the report I remarked that it looked like the first ever documented case of telekinesis. They were not amused.
After a cross-continent trip to the customer site to instrument everything it turned out that due to the physics of the sensor (ultrasound waves traveling in a metal rod, reflecting from a magnet) the exact reading was slightly sensitive to the time elapsed from the previous reading, which in turn depended on the CPU load on the host machine. I fixed that in the driver by making sure the probing of the sensor happened in fixed time intervals, independent of the sensor reading frequency from the user space.
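A user-space sketch of that idea (the real fix lived in the driver; read_sensor() and the 10 ms period are assumptions): probe the sensor on a fixed cadence and let readers consume the most recent value, so the sampling interval no longer depends on how often user space asks.

    #define _GNU_SOURCE
    #include <stdint.h>
    #include <time.h>

    extern uint32_t read_sensor(void);        /* assumed hardware access call */
    static volatile uint32_t latest_reading;  /* consumers read this          */

    void sampling_loop(void)
    {
        struct timespec next;
        clock_gettime(CLOCK_MONOTONIC, &next);

        for (;;) {
            latest_reading = read_sensor();    /* always probed at a fixed rate */

            next.tv_nsec += 10 * 1000 * 1000;  /* assumed 10 ms period */
            if (next.tv_nsec >= 1000000000L) {
                next.tv_nsec -= 1000000000L;
                next.tv_sec  += 1;
            }
            clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
        }
    }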
My weirdest was a server that would randomly stop responding to traffic. Debugging for multiple days (including full factory resets) only to figure out that the clip had broken and would disconnect occasionally depending on air flow through the rack. The link light would stay on so there was no way to tell by looking at it :(
Working in embedded systems nowadays, it's funny to think of a time when a hardware designer would claim it's impossible for it to be a HW bug.
Perhaps it was rarer back then, but these days cross-talk is carefully addressed, and my HW designer friends have nightmares of these kinds of issues slipping through.
One of my worst ones was a compiler bug for a PLC which would cause a floating point operation that underflowed to become NaN instead of 0.0 (which is very common if you are writing set-point tracking code!) and then throw out the loop, so the controller would reach set point and then slowly start drifting as it accumulated error, but only for the rest of the "line"* of that code. So if you split your calculations across multiple lines then you were fine, but if you tried to group your operations sensibly then it no longer worked.
PLCs do the loop for you, you only write the body
*it was some bastardised version of ladder logic (itself a bastard representation of code) with functional blocks, so "line" = rung. I no longer work with PLCs.
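For reference, this is what IEEE 754 is supposed to do with an underflowing multiply: flush toward zero (possibly via denormals), never NaN. A tiny standalone check, nothing PLC-specific:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        float tiny = 1e-30f;
        float r = tiny * tiny;   /* below the smallest denormal, so r == 0.0f */

        /* Expect: r = 0, isnan = 0. The buggy PLC compiler produced NaN here,
           which then poisoned the rest of the rung. */
        printf("r = %g, isnan = %d\n", r, isnan(r));
        return 0;
    }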
My memory is pretty crap, and I've been around the block a few times so I'm not stating that this truly is my worst bug but it was ... bad.
I was working in embedded, developing part of the control software for ... something. The microcontroller had a vendor-developed C compiler, with some extensions. It was a 16-bit chip more or less, so addressing large areas of memory was complicated. The flash was larger than 64 KB, so in order to write all of it you had to use more than 16 bits.
Luckily, the vendor compiler had an extension like in the DOS days, where you could add the proprietary "far" modifier to a pointer in order to signal that you wanted lots of range. Like this:
unsigned char far *ptr = FLASH_BASE;
or something. Imagine then my surprise when I was looping over the flash (I think I was computing a CRC to validate software integrity, or something) and
ptr++;
simply failed to reach all of it. I read the generated code, and the compiler was emitting 16-bit arithmetic, completely ignoring the fantastic "far" modifier. I changed it to something like
ptr = (unsigned char *) ((uint32_t) ptr + 1);
and got the proper code, and it worked.
At that point, I was like "whoa, I found a compiler bug, gonna report it!" and sent off the details to our field applications engineer from the vendor.
...
Who came back with "yeah, we know, but we choose this behavior since it gives better performance" or something along those lines.
That just completely killed my trust in that vendor, and any interest in working with them again. As a person who has been writing code for close to 40 years, that kind of attitude just blows my mind, and really makes me upset. You're supposed to trust the compiler, a compiler bug should be rare. Correctness is important, these here programming things are hard enough without having the compiler lie to you.
Gosh, it makes me upset even now just thinking about it. Heh.
My hardest bug ever was also hardware related.
Back in 1998, I was working on a game called "Trucks".
When playing with the network, I noticed that the game was sometimes desynchronized.
To understand the problem, I had to save tons of logs and manually compared them, in order to find what happened.
After a large effort, I discovered that some floating-point values were different.
Then, I realized that some of our computers were Pentium with the FDIV bug.
I faced a similar quantum bug in my teen years. I was very much into Android custom ROMs and flashing phones. I got my hands on a Galaxy Y, a cheap Android phone running Gingerbread. So, while flashing the phone, the flash process always failed after around 20-30%. I suspected a loose cable connection and tried again. The subsequent flashes failed even earlier, around 5%. So I waited for a while and tried again. The same loop started: first flash fails around 20%, subsequent flashes fail around 5%.
During this, I noticed the phone was getting hotter than anything I'd experienced before. I suspected the motherboard might be faulty. Then a random idea struck me and I put the phone in the freezer wrapped in a cloth. After an hour or so, I started the flash process again. The phone stayed wrapped in the cloth beside the table to keep it cool during the process. And lo and behold, the process completed without a hitch.
In later years I realised it was surely due to the cheap and substandard flash memory that Samsung supplied to phones in other countries compared to their Western counterparts.
I've seen things you people wouldn't believe... JVMs leaking memory on Ericsson and Motorola dumbphones... I watched devs work without debuggers or console. All those moments will be lost in time, like tears in rain... time to retire.
Outside of really weird shit™, the hardest part of software development is working with broken systems. Just today, I was trying to do something that is critical for a large chunk of our software working. I had to:
- Realize the API wasn't working in strange ways
- Talk to the team, who are unhelpful
- Try to figure out what is wrong, but fail
- Try alternative ways to do what we need to do
- Come to a lot of dead ends, or working solutions that were not viable
- Discuss more with our teams
- Eventually realize what needed to be done to get the correct output (API is confirmed broken in strange ways)
- Implement this, just to continue to do what I am actually trying to accomplish
I feel this one for sure. Lately I've been working on something that pulls data from a system that was part of an acquisition by a much larger company. That large company would clearly rather shut this system down, but regulators would be unhappy with them. So instead they've just let it decay such that it has become quite flaky. Most of my code is compensating for or working around deficiencies.
I've never seen such annoying ads on any website: the ad size changes every ~30 seconds which rearranges the text flow of the article completely and I get lost.