> Bad…link speed…and bad…link width…were other concerning PCIe faults. These faults can be difficult to detect without some sort of automated tool…
Most modern PCIe PHYs can now raise an interrupt when down-training occurs, much like they would for AER errors or hard link failures. Does FB use their own silicon in these data centers? Having this feature enabled is crucial once you get up to Gen4 speeds. Weirdly, I don’t see any detail about which PCIe generation is used here, though.
OP's claim is that it shouldn’t be “difficult to detect”, because the hardware is working: most commercially sold host controller chips would generate an interrupt and report the error, unless Facebook is using something nonstandard that doesn’t.
The hardware is reporting the errors to the kernel but not crashing the system. It's "difficult to detect" because unless you are specifically monitoring for those stats, the only issue you'll see is degraded performance on an occasional machine (assuming you are watching carefully enough to even discern the performance delta). Some of the error counters are even predictive of an issue rather than something that is actively impacting performance. The FB software is basically scraping those messages and bus stats into JSON that can be consumed by their monitoring infrastructure.
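For anyone wondering what that kind of scraping might look like, here is a rough Python sketch. It is not FB's tool. It assumes a Linux host, where sysfs exposes `current_link_speed`, `max_link_speed`, `current_link_width` and `max_link_width` for each PCIe device. The function name `find_degraded_links` and the JSON field names are made up for illustration.

```python
import json
import os


def read_attr(dev_dir, name):
    """Return the stripped contents of a sysfs attribute, or None if unreadable."""
    try:
        with open(os.path.join(dev_dir, name)) as f:
            return f.read().strip()
    except OSError:
        return None


def parse_speed(text):
    """Parse e.g. '8.0 GT/s PCIe' into 8.0; None for 'Unknown' or missing values."""
    if not text:
        return None
    try:
        return float(text.split()[0])
    except ValueError:
        return None


def parse_width(text):
    """Parse a lane count such as '16' into an int; None if missing or malformed."""
    try:
        return int(text)
    except (TypeError, ValueError):
        return None


def find_degraded_links(root="/sys/bus/pci/devices"):
    """List devices whose negotiated link speed or width is below their maximum.

    Hypothetical helper: `root` is a parameter so the function can be pointed
    at a fake directory tree for testing.
    """
    if not os.path.isdir(root):
        return []
    degraded = []
    for bdf in sorted(os.listdir(root)):
        dev = os.path.join(root, bdf)
        cur_speed = parse_speed(read_attr(dev, "current_link_speed"))
        max_speed = parse_speed(read_attr(dev, "max_link_speed"))
        cur_width = parse_width(read_attr(dev, "current_link_width"))
        max_width = parse_width(read_attr(dev, "max_link_width"))
        # Devices without link attributes (or with unknown values) are skipped.
        if None in (cur_speed, max_speed, cur_width, max_width):
            continue
        if cur_speed < max_speed or cur_width < max_width:
            degraded.append({
                "device": bdf,
                "speed_gts": cur_speed,
                "max_speed_gts": max_speed,
                "width": cur_width,
                "max_width": max_width,
            })
    return degraded
```

Calling `print(json.dumps(find_degraded_links(), indent=2))` dumps the result as JSON for a monitoring agent to pick up. One caveat: a link sitting below its maximum is not always a fault. Some devices deliberately drop link speed when idle to save power, and a card in a slot with fewer lanes than it supports will always report a narrower width, so a real checker would need a per-platform baseline of expected values.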