I've been waiting for this for six months. Thanks to Sheldon Axler for making it available for free. This is intended to be a second book on Linear Algebra.
For a first book I suggest "Linear Algebra: Theory, Intuition, Code" by Mike X Cohen. It's a bit different from a typical math textbook: it leans more on conversational explanations in words, although it does have plenty of proofs as well. The book also has a lot of code examples, which I didn't work through, but I did appreciate the discussions related to computing; for example, the book explains that several calculations that can be done by hand are numerically unstable when done on computers (those darn floats are tricky). For the HN crowd, this is the right focus: math for the sake of computing, rather than math for the sake of math.
One insight I gained from the book was the 4 different perspectives on matrix multiplication. I had never encountered this, not even in the oft-suggested "Essence of Linear Algebra" YouTube series. Everything I had seen explained only one of the 4 views, and then I'd encounter a calculation that was better understood by another view and would be confused. It still bends my mind that all these different perspectives describe the same calculation; they're just different ways of interpreting it.
At the risk of spamming a bit, I'll put my notes here, because this is something I've never seen written down elsewhere. The book has more explanation; these are just my condensed notes (a small NumPy sketch checking all 4 views follows the list).
4 perspectives on matrix multiplication
=======================================
1 Element perspective (all possible dot / inner products)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
(row count × row length) × (column length × column count)
In AB, every element is the dot product of the corresponding row of A
and column of B.
The rows in A are the same length as the columns in B and thus have
dot products.
2 Layer perspective (sum of outer product layers)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
(column length × column count) × (row count × row length)
AB is the sum of every outer product between the corresponding columns
in A and rows in B.
The column count in A is the same as the row count in B, thus the
columns and rows pair up exactly for the outer product operation. The
outer product does not require vectors to be the same length.
3 Column perspective (weighted sums / linear combinations)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
(column length × column count) × (column length × column count)
In AB, every column is a weighted sum of the columns in A; the weights
come from the columns in B.
The weight count in the columns of B must match the column count in A.
4 Row perspective (weighted sums / linear combinations)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
(row count × row length) × (row count × row length)
In AB, every row is a weighted sum of the rows in B; the weights come
from the rows in A.
The weight count in the rows of A must match the row count in B.
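To make the 4 views concrete, here is a minimal NumPy sketch (my own, not from the book) that computes the same product `A @ B` four different ways; the shapes are arbitrary examples.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))   # 3x4
B = rng.standard_normal((4, 2))   # 4x2
ref = A @ B                       # reference product, 3x2

# 1 Element perspective: every entry is a dot product of a row of A and a column of B.
elem = np.array([[A[i, :] @ B[:, j] for j in range(B.shape[1])]
                 for i in range(A.shape[0])])

# 2 Layer perspective: sum of outer products of the columns of A with the rows of B.
layer = sum(np.outer(A[:, k], B[k, :]) for k in range(A.shape[1]))

# 3 Column perspective: each column of AB is a weighted sum of the columns of A,
#   with weights taken from the corresponding column of B.
col = np.column_stack([A @ B[:, j] for j in range(B.shape[1])])

# 4 Row perspective: each row of AB is a weighted sum of the rows of B,
#   with weights taken from the corresponding row of A.
row = np.vstack([A[i, :] @ B for i in range(A.shape[0])])

for name, M in [("element", elem), ("layer", layer), ("column", col), ("row", row)]:
    assert np.allclose(M, ref), name
```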
The most important interpretation IMO is that a matrix is a specification for a linear map. A linear map is determined by what it does to a basis, and the columns of a matrix are just the list of outputs for each basis element (e.g. the first column is `f(b_1)`. The nth column is `f(b_n)`). If A is the matrix for f and B the matrix for g (for some chosen bases), then BA is the matrix for the composition x -> g(f(x)). i.e. the nth column is `g(f(b_n))`.
The codomain of f has to match the ___domain of g for composition to make sense, which means dimensions have to match (i.e. row count of A must be column count of B).
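A quick sanity check of this as a NumPy sketch (the matrices here are arbitrary small examples I made up): build the matrix of the composition column by column from `g(f(b_n))` and verify it equals `BA`.

```python
import numpy as np

# Two linear maps (arbitrary example shapes):
# f: R^3 -> R^2 with matrix A (2x3), g: R^2 -> R^4 with matrix B (4x2).
A = np.array([[1., 2., 0.],
              [0., 1., 3.]])
B = np.array([[1., 0.],
              [2., 1.],
              [0., 1.],
              [1., 1.]])

f = lambda x: A @ x
g = lambda y: B @ y

# The nth column of the composition's matrix is g(f(b_n)),
# where b_n is the nth standard basis vector.
basis = np.eye(3)
composed = np.column_stack([g(f(basis[:, n])) for n in range(3)])

assert np.allclose(composed, B @ A)
```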
It's debatable what the "most important" perspective is. For example, if I need a bunch of dot products between two sets of vectors, that doesn't seem like a linear map or a change of basis (not to me at least), and yet it's exactly what matrix multiplication is, just calculating a bunch of dot products between two sets of vectors.
Or when I think about the singular value decomposition, I'm not thinking about linear maps and change of basis, but I am thinking about a sum of many outer product layers.
If you don't have a linear map in mind, why do you write your dot products with one set of column vectors and another set of row vectors? Computationally, the best way to do dot products would be to walk all of your arrays in contiguous memory order, so the row/column thing is an unnecessary complication. And if you have more than 2 matrices to multiply/"steps of dot products to do in a pipeline", there's almost certainly a relevant interpretation as linear maps lurking.
Outer products are one way to define a "simple" linear map. What SVD tells you is that every (finite dimensional) linear map is a sum of outer products; there are no other possibilities.
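A minimal NumPy sketch of that statement (my own illustration, with an arbitrary random matrix): reassemble a matrix from the rank-1 outer product layers its SVD provides.

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((5, 3))

# SVD: M = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Reassemble M as a sum of rank-1 outer product layers s_k * u_k v_k^T.
layers = sum(s[k] * np.outer(U[:, k], Vt[k, :]) for k in range(len(s)))

assert np.allclose(layers, M)
```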
This is great! I eventually figured out all these interpretations as well, but it took forever. In particular, the "sum of outer products" interpretation is crucial for understanding the SVD.
The "sum of outer products" interpretation (actually all these interpretations consist in choosing a nesting order for the nested loops that must be used for computing a matrix-matrix product, a.k.a. DGEMM or SGEMM in BLAS) is the most important interpretation for computing the matrix-matrix product with a computer.
The reason is that the outer product of 2 vectors takes a number of multiplications equal to the product of the lengths of the 2 vectors, but only a number of memory reads equal to the sum of their lengths.
This "outer product" is much better called the tensor product of 2 vectors, because the outer product as originally defined by Grassmann, and used in countless mathematical works with the Grassmann meaning, is a different quantity, which is related to what is frequently called the vector product of 2 vectors. Not even "tensor product" is correct historically. The correct term would be "Zehfuss product", but nowadays few people remember Zehfuss. The term "tensor" was originally applied only to symmetric matrices, where it made sense etymologically, but for an unknown reason Einstein has used it for general arrays and the popularity of the relativity theory after WWI has prompted many mathematicians to change their terminology, following the usage initiated by Einstein.
For long vectors, the product of the 2 lengths is much bigger than their sum, and this high ratio of multiplications to memory reads, when the result is kept in registers, allows reaching a throughput close to the maximum possible on modern CPUs.
Because the result must be kept in registers, the product of big matrices must be assembled from the products of small sub-matrices. For instance, suppose the registers can hold 16 = 4 × 4 values, i.e. the tensor/outer product of 2 vectors of length 4. Then each tensor/outer product, which is one additive term in the matrix-matrix product of two 4×4 sub-matrices, can be computed with 4 × 4 = 16 fused multiply-add operations (the addition accumulates the current tensor product onto the previous ones) but only 4 + 4 = 8 memory reads.
On the other hand, if the matrix-matrix product were computed with scalar products of vector pairs instead of tensor products of vector pairs, it would require twice as many memory reads as fused multiply-add operations, and would therefore be much slower.
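Here is a toy sketch of that loop ordering in Python/NumPy (purely illustrative; a real kernel would be written with FMA intrinsics or assembly, and the tile size 4 is just an assumed register capacity): each 4×4 block of the result is accumulated as a sum of outer products, reading only 4 + 4 values per update.

```python
import numpy as np

def outer_product_block_matmul(A, B, T=4):
    """Toy GEMM that accumulates each TxT block of C as a sum of outer
    products. Illustrates the loop order only; a real kernel keeps the
    TxT accumulator in registers and uses fused multiply-adds. Assumes
    the matrix dimensions are multiples of T for simplicity."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2 and m % T == 0 and n % T == 0
    C = np.zeros((m, n))
    for i in range(0, m, T):              # block row of C
        for j in range(0, n, T):          # block column of C
            acc = np.zeros((T, T))        # stands in for a register tile
            for p in range(k):
                a = A[i:i+T, p]           # T reads from a column of A
                b = B[p, j:j+T]           # T reads from a row of B
                acc += np.outer(a, b)     # T*T multiply-adds, no extra reads
            C[i:i+T, j:j+T] = acc
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 12))
B = rng.standard_normal((12, 8))
assert np.allclose(outer_product_block_matmul(A, B), A @ B)
```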