> Bad…link speed…and bad…link width…were other concerning PCIe faults. These fau...

maccam94 · on June 4, 2021

FB doesn't want hardware to run at lower than rated speeds. Their tool allows them to detect when it happens and remediate the issue.

numpad0 · on June 4, 2021

OP claims it shouldn’t be “difficult to detect (...) because the hardware is working”, since most commercially sold host controller chips would generate an interrupt and report errors, unless Facebook is using something nonstandard that doesn’t.

maccam94 · on June 4, 2021

The hardware is reporting the errors to the kernel but not crashing the system. It's "difficult to detect" because unless you are specifically monitoring for those stats, the only issue you'll see is degraded performance on an occasional machine (assuming you are watching carefully enough to even discern the performance delta). Some of the error counters are even predictive of an issue rather than something that is actively impacting performance. The FB software is basically scraping those messages and bus stats into JSON that can be consumed by their monitoring infrastructure.
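
To make the "scraping bus stats into JSON" part concrete, here is a minimal sketch of that kind of check. It is not FB's tool. It assumes a Linux host that exposes the standard sysfs attributes `current_link_speed`, `max_link_speed`, `current_link_width` and `max_link_width` under `/sys/bus/pci/devices/<address>/`. The function name `scan_pci_links` and the JSON field names are made up for illustration.

```python
import json
import re
from pathlib import Path

# Standard Linux sysfs attributes for PCIe link state.
LINK_ATTRS = (
    "current_link_speed",
    "max_link_speed",
    "current_link_width",
    "max_link_width",
)


def _read(path):
    """Return the stripped file contents, or None if it cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def _speed_gts(text):
    """Parse a value such as '8.0 GT/s PCIe' into 8.0; None if unparseable."""
    match = re.match(r"\s*([\d.]+)\s*GT/s", text or "")
    return float(match.group(1)) if match else None


def _width(text):
    """Parse a lane count such as '16' into 16; None if unparseable."""
    try:
        return int(text)
    except (TypeError, ValueError):
        return None


def scan_pci_links(root="/sys/bus/pci/devices"):
    """Hypothetical helper: report link speed/width for each PCIe device."""
    report = []
    root = Path(root)
    if not root.is_dir():
        return report
    for dev in sorted(root.iterdir()):
        raw = {attr: _read(dev / attr) for attr in LINK_ATTRS}
        cur_s = _speed_gts(raw["current_link_speed"])
        max_s = _speed_gts(raw["max_link_speed"])
        cur_w = _width(raw["current_link_width"])
        max_w = _width(raw["max_link_width"])
        # Skip devices that do not expose usable link attributes.
        if None in (cur_s, max_s, cur_w, max_w):
            continue
        report.append({
            "device": dev.name,
            "current_speed_gts": cur_s,
            "max_speed_gts": max_s,
            "current_width": cur_w,
            "max_width": max_w,
            # Flag links running below what the device says it supports.
            "degraded": cur_s < max_s or cur_w < max_w,
        })
    return report


if __name__ == "__main__":
    print(json.dumps(scan_pci_links(), indent=2))
```

One caveat: a link below its maximum is not always a fault. Some devices may drop link speed at idle for power management, and a card's maximum can exceed what the slot supports, so a real monitor would likely compare against an expected value per slot rather than the device's own maximum.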