There are at least two reasons (plus one non-reason) why it is slow.
1. L1 cache lines are 64 bytes long. By fetching column-wise, you waste the bandwidth between main memory and L1: the cache always fetches a full 64-byte line, but a column-wise traversal uses only one element of that line before jumping to the next one. By moving "with" the cache lines, you make the main memory <-> L1 data transfers far more efficient.
2. Virtual addresses are translated to physical addresses through the TLB before the actual value is returned. Moving within a 4 kB page reuses the same TLB entry; a large column stride can cross a page boundary on every access, causing far more TLB misses.
The non-reason:
* The hardware prefetcher probably still works, even on column-oriented data: modern prefetchers can recognize constant-stride access patterns.
All of these reasons hold even if the SIMD auto-vectorizer fails. If the vectorizer is working, you'll load/store to L1 cache more efficiently, but this is likely a memory-bound problem, so optimizing the compute core isn't as important.