AFAIK you have to think about how many different 512-bit paths are being driven when this happens. In the steady-state case, assuming the core can issue two vfmadd132ps per cycle, each cycle is simultaneously:
- Capturing 2x512b from the L1D cache
- Sending 2x512b to the vector register file
- Capturing 4x512b values from the vector register file
- Actually multiplying 4x512b values
- Sending 2x512b results to the vector register file
... and probably more. That's already something like 14*512 = 7168 wires, switching constantly at 5 GHz, and there are probably even more intermediate stages.
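
To put a number on that, here's a rough back-of-the-envelope tally of the stages listed above. It's only a sketch: the stage names and the `STAGES` table are my own labels for the bullets, the 5 GHz clock is just the figure quoted above, and it treats every wire as active every cycle, which is an upper bound rather than a measurement.

```python
# Back-of-the-envelope count of 512-bit datapaths driven per cycle.
# Stage names are illustrative labels for the bullets above, not real
# microarchitectural unit names.

PATH_WIDTH_BITS = 512
CLOCK_HZ = 5_000_000_000  # 5 GHz, assumed from the figure quoted above

# stage -> number of 512-bit paths driven per cycle (two FMAs per cycle)
STAGES = {
    "capture from L1D": 2,
    "send loads to register file": 2,
    "capture operands from register file": 4,
    "multiplier inputs": 4,
    "send results to register file": 2,
}


def total_paths(stages):
    """Number of 512-bit paths active in one cycle."""
    return sum(stages.values())


def total_wires(stages, width_bits=PATH_WIDTH_BITS):
    """Number of individual wires across all active paths."""
    return total_paths(stages) * width_bits


def bit_transfers_per_second(stages, width_bits=PATH_WIDTH_BITS, clock_hz=CLOCK_HZ):
    """Upper bound on bit transfers per second if every wire is used every cycle."""
    return total_wires(stages, width_bits) * clock_hz


if __name__ == "__main__":
    print("paths per cycle:", total_paths(STAGES))
    print("wires:", total_wires(STAGES))
    print("bit transfers per second:", bit_transfers_per_second(STAGES))
```

Not every wire actually toggles every cycle, so the real switching activity depends on the data, but it gives a sense of the scale involved.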