The "sum of outer products" interpretation (actually all these interpretations consist in choosing a nesting order for the nested loops that must be used for computing a matrix-matrix product, a.k.a. DGEMM or SGEMM in BLAS) is the most important interpretation for computing the matrix-matrix product with a computer.
The reason is that the outer product of 2 vectors requires a number of multiplications equal to the product of the lengths of the 2 vectors, but a number of memory reads equal only to the sum of their lengths.
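A minimal sketch of such a rank-1 update in C (the function name and row-major layout are illustrative assumptions; the count refers to reads of the vector elements, since in a real kernel the accumulators for C stay in registers, as described below):

    /* Rank-1 update: C += x * y^T, with C an m x n row-major matrix.
       m*n multiply-adds, but only m + n reads of vector elements. */
    void rank1_update(int m, int n, double *C, const double *x, const double *y)
    {
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++)
                C[i * n + j] += x[i] * y[j];
    }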
This "outer product" is much better called the tensor product of 2 vectors, because the outer product as originally defined by Grassmann, and used in countless mathematical works with the Grassmann meaning, is a different quantity, which is related to what is frequently called the vector product of 2 vectors. Not even "tensor product" is correct historically. The correct term would be "Zehfuss product", but nowadays few people remember Zehfuss. The term "tensor" was originally applied only to symmetric matrices, where it made sense etymologically, but for an unknown reason Einstein has used it for general arrays and the popularity of the relativity theory after WWI has prompted many mathematicians to change their terminology, following the usage initiated by Einstein.
For long vectors, the product of the 2 lengths is much bigger than their sum. This high ratio of multiplications to memory reads, when the result is kept in registers, allows a throughput close to the maximum possible on modern CPUs.
Because the result must be kept in registers, the product of big matrices must be assembled from the products of small sub-matrices. For instance, suppose that the registers can hold 16 = 4 * 4 values, i.e. the tensor/outer product of 2 vectors of length 4. Each tensor/outer product, which is one additive term in the matrix-matrix product of two 4x4 sub-matrices, can then be computed with 4 * 4 = 16 fused multiply-add operations (the addition accumulates the current tensor product onto the previous ones), but only 4 + 4 = 8 memory reads.
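A plain-C sketch of such a register-blocked microkernel (the name, row-major layout and leading-dimension arguments are illustrative assumptions, not an actual BLAS interface):

    /* Accumulate A (4 x K) * B (K x 4) into C (4 x 4), row-major.
       The 16 accumulators stay in registers; each iteration of the
       k-loop is one outer product: 8 element reads, 16 multiply-adds. */
    void dgemm_4x4(int K, const double *A, int lda,
                   const double *B, int ldb, double *C, int ldc)
    {
        double c[4][4] = {{0}};
        for (int k = 0; k < K; k++) {
            double a0 = A[0*lda + k], a1 = A[1*lda + k],
                   a2 = A[2*lda + k], a3 = A[3*lda + k];   /* 4 reads */
            double b0 = B[k*ldb + 0], b1 = B[k*ldb + 1],
                   b2 = B[k*ldb + 2], b3 = B[k*ldb + 3];   /* 4 reads */
            c[0][0] += a0*b0; c[0][1] += a0*b1; c[0][2] += a0*b2; c[0][3] += a0*b3;
            c[1][0] += a1*b0; c[1][1] += a1*b1; c[1][2] += a1*b2; c[1][3] += a1*b3;
            c[2][0] += a2*b0; c[2][1] += a2*b1; c[2][2] += a2*b2; c[2][3] += a2*b3;
            c[3][0] += a3*b0; c[3][1] += a3*b1; c[3][2] += a3*b2; c[3][3] += a3*b3;
        }
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++)
                C[i*ldc + j] += c[i][j];
    }

A real microkernel would use SIMD registers and FMA intrinsics for wider blocks, but the ratio of reads to multiply-adds is the same idea.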
On the other hand, if the matrix-matrix product were computed with scalar products of vector pairs instead of tensor products of vector pairs, it would require twice as many memory reads as fused multiply-add operations, so it would be much slower.
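For contrast, a sketch of the inner-product formulation, where every multiply-add consumes 2 fresh element reads and the kernel is bound by memory traffic rather than arithmetic:

    /* Scalar (dot) product of two length-K vectors:
       K multiply-adds, but 2*K element reads. */
    double dot(int K, const double *a, const double *b)
    {
        double s = 0.0;
        for (int k = 0; k < K; k++)
            s += a[k] * b[k];   /* 2 reads per multiply-add */
        return s;
    }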