Perhaps a better way to describe it: during the additions, the accumulator only ever needs to be shifted right to align with the next number to add, never left. Bits shifted out of the right end are final, so they can be compressed into a single sticky bit. The accumulator can therefore be 53+3 bits wide (the significand plus guard, round, and sticky bits) and carry at each step a 53+3-bit representation of the partial sum computed so far, and this is enough to produce the correctly rounded result after all the numbers have been added.
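Here is a minimal sketch of that idea in Python. The bit-serial shift-add loop and all names are illustrative, not how any real multiplier is wired: it multiplies two normalized m-bit significands while keeping only an m-bit accumulator, two tail bits below it, and one sticky bit, then checks that rounding that narrow state to nearest-even matches rounding the full 2m-bit product.

```python
def rne_reference(p, m):
    # Reference: round the full product p to m significant bits,
    # round-to-nearest, ties to even.
    drop = p.bit_length() - m
    kept = p >> drop
    rem = p & ((1 << drop) - 1)
    half = 1 << (drop - 1)
    if rem > half or (rem == half and (kept & 1)):
        kept += 1
    return kept

def rne_narrow(a, b, m):
    # Bit-serial shift-add multiply of m-bit significands a and b.
    # The accumulator is only ever shifted right; each bit shifted out is
    # final, so we keep just the two most recent tail bits below the
    # accumulator and OR everything older into one sticky bit:
    # an m+3-bit datapath in total, never the full 2m-bit product.
    acc, tail, sticky = 0, 0, 0
    for i in range(m):
        if (b >> i) & 1:
            acc += a
        out = acc & 1                # this bit can never change again
        acc >>= 1
        sticky |= tail & 1           # oldest tail bit retires into sticky
        tail = (tail >> 1) | (out << 1)
    if acc.bit_length() == m:        # product had 2m bits
        kept, rnd, stk = acc, tail >> 1, (tail & 1) | sticky
    else:                            # product had 2m-1 bits: reclaim one tail bit
        kept, rnd, stk = (acc << 1) | (tail >> 1), tail & 1, sticky
    if rnd and (stk or (kept & 1)):
        kept += 1                    # round to nearest, ties to even
    return kept
```

The two tail bits are needed precisely because the product of two normalized significands has either 2m or 2m-1 bits, so the rounding position can shift by one during normalization. Exhaustively checking all normalized 6-bit significands reproduces the correctly rounded result, ties-to-even included, and the same argument scales to m = 53.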
This explanation assumes that the N partial products are added sequentially. I don't know what techniques are used to achieve the 4- or 5-cycle latencies of modern desktop processors. Implementations with more parallelism, and/or ones that are part of an FMA unit, have a harder time, but I still expect that they don't naively compute a full-width partial result before rounding it (in the case of FMA, that would be impractical, as the partial result could be thousands of bits wide).
Looks like you're actually computing the full product. The only optimization you're doing is forgetting the low-order bits as soon as you're done with them. The author is asking to keep them, which sounds reasonable. You initially claimed they were "never computed in the first place", and we're asking you to back up that claim if you still think it's true.