Looks like a reasonable overview of dense linear algebra operations, with very specific applications in mind.
I feel like iterative Krylov subspace methods should be around somewhere, but to be honest I'm not sure what the applications space for linear inverse problems looks like in the deep learning ___domain. So maybe that wouldn't quite fit (despite me finding it to be pretty cool, and these methods sort of eating the lunch of most other inverse methods).
I'd round out the course maybe with a read through of Trefethen-Bau, “Numerical Linear Algebra” for those new to the topic area.
So, what applications are iterative Krylov subspace methods good for? Can you give us a sense of where they're useful, and what (in general) inverse methods in this space are?
The sibling comments are great, so I'll comment in a more direct, application-oriented way: Krylov methods expose most of the machinery of linear algebra to you by means of only forward matrix multiplication with the relevant matrix.
A cool thing is that this means you can do linear algebra operations with a matrix (including solving linear systems) without ever needing to explicitly construct said matrix. We just need a recipe describing the action of multiplying by the matrix in order to do more advanced things with it. This is why they are so popular for sparse matrices, but their usefulness goes beyond just that.
Projecting problems into Krylov spaces also tends to have a noticeable regularizing effect.
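To make that concrete, here's a minimal sketch of the matrix-free idea (my own toy example, assuming SciPy is available; the bidiagonal operator below is made up purely for illustration). We hand GMRES nothing but a matvec recipe and it solves the system anyway:

    import numpy as np
    from scipy.sparse.linalg import LinearOperator, gmres

    n = 1000

    def matvec(x):
        # action of a lower-bidiagonal matrix A: (A x)_i = 2*x_i - x_{i-1}; A is never stored
        y = 2.0 * x
        y[1:] -= x[:-1]
        return y

    A = LinearOperator((n, n), matvec=matvec)    # "A" is just the recipe above
    b = np.ones(n)
    x, info = gmres(A, b)                        # solve A x = b using only matvecs
    print(info, np.linalg.norm(matvec(x) - b))   # info == 0 means converged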
Krylov subspace methods have been the top algorithms for years for solving linear systems Ax = b and eigenvalue problems Ax = λx. They basically find x as a polynomial in A: x = a_0b + a_1Ab + a_2A^2b + ... . Why a polynomial? Because it's easy to construct, and A^{-1} can be represented as a polynomial in A (with a finite number of terms).
More generally: subspace methods find a solution x as a linear combination of a few column vectors that form a matrix V. So x = Vy, where y is a very small vector. Only if you're lucky will you find AVy = b; in general AVy ≠ b. So you require the residual to be perpendicular to V (or something like that): V'(AVy - b) = 0. In the end you solve a very small problem (V'AV)y = V'b for y.
GMRES, for instance, solves AVy = b in the least-squares sense: (AV)'AVy = (AV)'b.
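A bare-bones sketch of that projection idea (toy code I wrote for illustration, not from the course): build an orthonormal basis V of the Krylov space with Arnoldi, then solve the small projected least-squares problem, which is essentially what GMRES does.

    import numpy as np

    def arnoldi(A, b, m):
        # Orthonormal columns V spanning {b, Ab, ..., A^m b} and Hessenberg H with A V[:, :m] = V H
        n = b.size
        V = np.zeros((n, m + 1))
        H = np.zeros((m + 1, m))
        V[:, 0] = b / np.linalg.norm(b)
        for j in range(m):
            w = A @ V[:, j]
            for i in range(j + 1):            # Gram-Schmidt against previous basis vectors
                H[i, j] = V[:, i] @ w
                w -= H[i, j] * V[:, i]
            H[j + 1, j] = np.linalg.norm(w)
            V[:, j + 1] = w / H[j + 1, j]
        return V, H

    rng = np.random.default_rng(0)
    n, m = 200, 30
    A = np.eye(n) + rng.standard_normal((n, n)) / (3 * np.sqrt(n))   # made-up matrix, eigenvalues near 1
    b = rng.standard_normal(n)

    V, H = arnoldi(A, b, m)
    # GMRES step: minimise ||A V y - b|| = ||H y - beta*e1|| over the small (m+1) x m system
    e1 = np.zeros(m + 1)
    e1[0] = np.linalg.norm(b)
    y, *_ = np.linalg.lstsq(H, e1, rcond=None)
    x = V[:, :m] @ y
    print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))             # small relative residual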
(warning, I am not a numerical computation specialist)
Krylov subspaces are a (the ?) common framework to compute eigenvalues and singular values, and it works for sparse matrices as well. IOW, it is extremely useful for most problems that can be expressed as matrix factorization (e.g. recommendation using collaborative filtering, initial page rank-like algos).
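A quick sketch of what that looks like in practice (toy data; and as far as I know SciPy's sparse svds/eigsh routines are ARPACK Lanczos/Arnoldi, i.e. Krylov methods, under the hood):

    import scipy.sparse as sp
    from scipy.sparse.linalg import svds

    R = sp.random(100_000, 2_000, density=1e-4, random_state=42, format="csr")  # fake "ratings" matrix
    U, s, Vt = svds(R, k=10)     # 10 largest singular triplets, using only matvecs with R
    print(s[::-1])               # svds returns singular values in ascending order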
I remember using Conjugate Gradient (https://en.wikipedia.org/wiki/Conjugate_gradient_method) for my 4th year university project, which involved finding the steady state of large and complex systems (e.g. railway signalling systems), that were expressed as sparse matrices with millions of rows/columns.
It worked really well and executed faster than other iterative methods such as Jacobi/Gauss-Seidel, but the conjugate gradient method was a bit unstable and at times would not converge to a solution. I think it was due to the high number of multiplications involved, which may cause over-/underflow (but I can't remember).
The matrices themselves are a representation of a Markov chain.
Looks like a good course.
I think it would benefit if they added a module on implementing some basic linear-system solvers, like gradient/steepest descent, or even GMRES/MINRES. The amount of knowledge that I gained from trying to implement these was remarkable.
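For anyone curious what that exercise looks like, here's a rough from-scratch steepest-descent solver for a symmetric positive definite system (a toy setup I made up, not course material):

    import numpy as np

    def steepest_descent(A, b, tol=1e-8, maxiter=10_000):
        x = np.zeros_like(b)
        for k in range(maxiter):
            r = b - A @ x                        # residual = negative gradient of 0.5*x'Ax - b'x
            if np.linalg.norm(r) < tol * np.linalg.norm(b):
                return x, k
            alpha = (r @ r) / (r @ (A @ r))      # exact line search along the residual direction
            x = x + alpha * r
        return x, maxiter

    rng = np.random.default_rng(1)
    M = rng.standard_normal((100, 100))
    A = M @ M.T + 100 * np.eye(100)              # symmetric positive definite, modest condition number
    b = rng.standard_normal(100)
    x, iters = steepest_descent(A, b)
    print(iters, np.linalg.norm(A @ x - b))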
GMRES/Krylov spaces seem like they ought to have more of a place in deep learning. But maybe that's the mathematical part of me having a hammer and wanting to see everything as a nail.
One possibility that I haven't seen: if you have a neural network graph with a small subgraph that is not 100% explicit (i.e. nodes with a cyclic dependency that have to be solved collectively via a Newton-type method), the gradient across this subgraph can be computed at the converged state by solving a standard matrix-vector linear inverse problem applied to the gradients, without needing the AD engine to follow every operation through the iterates of the non-linear Newton solver.
For sparsity and conditioning reasons, in my current work we do something like this for graph-based specification of engineering systems using GMRES to find the gradients across these subsystems. We're not doing deep learning, but the underlying machinery is basically the same.
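Here's a hedged toy sketch of that trick, since it's not obvious at first sight (the fixed-point map, the loss, and every name below are invented for illustration, and the inner solve is a plain fixed-point iteration rather than Newton just to keep it short). The point is that the backward pass reduces to one matrix-free linear solve at the converged state, done here with GMRES:

    import numpy as np
    from scipy.sparse.linalg import LinearOperator, gmres

    rng = np.random.default_rng(2)
    n = 50
    W = 0.3 * rng.standard_normal((n, n)) / np.sqrt(n)   # small norm so the implicit subgraph is well posed
    theta = rng.standard_normal(n)

    # "Inner solver": find x with F(x, theta) = x - tanh(W x + theta) = 0
    x = np.zeros(n)
    for _ in range(200):
        x = np.tanh(W @ x + theta)

    d = 1.0 - np.tanh(W @ x + theta) ** 2        # derivative of tanh at the converged state
    dLdx = x                                     # upstream gradient for the loss L = 0.5 * ||x||^2

    # Adjoint system (dF/dx)' lam = dL/dx with dF/dx = I - diag(d) W, applied matrix-free
    JT = LinearOperator((n, n), matvec=lambda v: v - W.T @ (d * v))
    lam, info = gmres(JT, dLdx)

    grad_theta = d * lam                         # dL/dtheta = -lam' dF/dtheta, and dF/dtheta = -diag(d)

    # Finite-difference sanity check on one component
    eps, j = 1e-6, 3
    tp = theta.copy()
    tp[j] += eps
    xp = np.zeros(n)
    for _ in range(200):
        xp = np.tanh(W @ xp + tp)
    print(grad_theta[j], (0.5 * xp @ xp - 0.5 * x @ x) / eps)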
Krylov spaces often come up in derivations for optimization methods in DL, and have been studied as an explicit technique on their own. Eg see https://arxiv.org/abs/1111.4259
However, more recent theoretical and empirical results show that these kinds of approaches don't deal well with the huge saddle-point problem in DL loss functions, whereas momentum-based techniques work great.
Interesting, I'll have to look deeper at what has been done in this space. As I understand it, saddle points are handled by SGD-based methods because they are a bit more "noisy" in their steps through the problem space. In short, are the Krylov methods (or perhaps other approaches) a bit too precise in their attraction to critical points?
Absolutely not. Krylov methods are one way of solving a linear system and can absolutely be used with stochastic optimization methods. Generally speaking, most optimization algorithms require the solution of some linear system, and we can use a Krylov method, or a direct solve, or integrate the direct solve with the Krylov method as a preconditioner. However, this has nothing to do with the stochastic piece of the algorithm. If we want to use a random projection or some other technique to compute the gradient, we can absolutely do so while using a Krylov method.
Now, if we're just doing a stochastic steepest descent algorithm, there are not really any linear systems to solve, so this doesn't come up. If we want to use a stochastic Newton algorithm, it does. It comes up here specifically because forming the full Hessian may be too expensive in terms of computation or memory, so if we're smart and use a Krylov method that only requires Hessian-vector products, we can effectively use second-order information without taking the penalties in computation and memory. Yes, there is a loss here since we only use a piece of the spectrum of the Hessian, and this is where the art and science of preconditioning come into play to ensure that we use this effectively. However, we can use stochastic algorithms here. In fact, the entire field of stochastic PDE-constrained optimization deals very precisely with these tools and does so in a surprisingly effective manner.
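For concreteness, here's a loose sketch of that Hessian-vector-product pattern (truncated Newton-CG on a made-up regularised logistic regression; the problem and every name are my own, and a real implementation would add a line search or trust region):

    import numpy as np
    from scipy.sparse.linalg import LinearOperator, cg

    rng = np.random.default_rng(3)
    N, d, lam = 2000, 50, 1e-2
    X = rng.standard_normal((N, d))
    y = np.sign(X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(N))

    def grad(w):
        m = y * (X @ w)
        return -X.T @ (y / (1.0 + np.exp(m))) + lam * w      # gradient of sum(log(1+exp(-m))) + lam/2 * ||w||^2

    def hvp(w, v):
        m = y * (X @ w)
        s = 1.0 / ((1.0 + np.exp(m)) * (1.0 + np.exp(-m)))   # sigma(m) * sigma(-m)
        return X.T @ (s * (X @ v)) + lam * v                 # Hessian-vector product; H is never formed

    w = np.zeros(d)
    for it in range(10):                                     # inexact Newton iterations
        g = grad(w)
        H = LinearOperator((d, d), matvec=lambda v: hvp(w, v))
        p, _ = cg(H, -g, maxiter=50)                         # truncated CG: only a handful of HVPs
        w = w + p
        print(it, np.linalg.norm(g))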
The advantage that Krylov methods have over a direct solve is that we retain the ability to use the direct solve if we want in the context of a preconditioner. Different Krylov methods have different properties that we can use to our advantage. For example, conjugate gradient works on a minimization principle based on a quadratic form, which integrates well with trust-region methods that require us to minimize a quadratic model. GMRES minimizes the residual of the linear system, which gives us a monotonically decreasing residual at each iteration; that can be important for different algorithms. Which Krylov method to use in any particular situation depends on the properties desired and the systems involved. And, I can't stress this enough, we don't lose out on direct solves. Direct solves can be fantastic preconditioners, and if we really can compute an inverse exactly, our Krylov iteration converges in one step. As such, we lose a little bit of computation and memory, but not that much.
Anyway, sorry for coming on a little bit hard, but I wanted to try and nip some confusion in the bud quickly. Stochastic methods are naturally and fully integrated into inexact optimization schemes that rely on Krylov solves. They are the standard tool for many fields that use optimization. They may not be widely used in ML, but that has little to do with their computational efficacy.
While what you say is accurate, it has nothing to do with the comment you are replying to, which was specifically about the shape of the loss functions in deep learning and recent research on how that shape impacts optimization of these functions.
The grandparent comment asked, "In short, are the Krylov methods (or perhaps other approaches) a bit too precise in their attraction to critical points?" to which you replied, "Exactly. Saddle points are like a magnet for them!" There's little room for ambiguity here, and this is categorically false. Your linear solver has nothing to do with saddle points, and saddle points do not attract Krylov methods.
> Locality: traditional runtime computations focus on Big O, the number of operations computed. However, for modern computing, moving data around in memory can be very time-consuming
I need to nitpick here... Big O notation is a way to describe growth rates of functions. You can count data movements (or anything else) with Big O.
Moreover, "moving data around in memory can be very time-consuming" means that in the end, it is still about time, not about memory.
So the correct way would be to translate memory access to time, but that means modelling the memory hierarchy, modelling caches in general, and finally perhaps modelling the specifically used caching strategies.
The limits on memory access are physical, illustrated by Grace Hopper's famous video about nanoseconds and the speed of light.
Should computer algorithms always assume they need to model caches, since caches are never going away? When determining computational complexity, time and memory accesses are treated as equivalent, but real memory doesn't and never will behave that way.
Yes, the memory hierarchy is discussed in the course, and algorithms are shown that are designed to take account of this. Older algorithms tended to only consider computation time, not memory access time.
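A quick way to see the point (timings are obviously machine- and library-dependent; this is just a sketch): both loops below do the same O(n^2) work, but one walks memory contiguously while the other strides across it.

    import time
    import numpy as np

    n = 4000
    A = np.random.rand(n, n)                 # row-major (C-order) array

    t = time.perf_counter()
    row_sums = np.array([A[i, :].sum() for i in range(n)])   # contiguous, cache-friendly
    t_rows = time.perf_counter() - t

    t = time.perf_counter()
    col_sums = np.array([A[:, j].sum() for j in range(n)])   # strided, cache-unfriendly
    t_cols = time.perf_counter() - t

    print(f"row-wise: {t_rows:.3f}s   column-wise: {t_cols:.3f}s")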
Candid question: isn't the growth rate of functions the derivative of the Big O complexity instead?
O(1) does not mean that the growth rate is constant. It means that the number of operations is constant; since its derivative is zero, the growth is nil.
O(1) means the number of operations is bounded. f = O(g) measures growth in the sense that f/g = O(1), i.e. g grows quickly enough to cancel out any tendency of f to go to infinity.
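For reference, the textbook definition both replies are circling around (standard statement, nothing specific to this thread):

    f(n) = O(g(n)) \iff \exists\, C > 0,\ N \ge 0 \ \text{such that}\ |f(n)| \le C\,|g(n)| \ \text{for all } n \ge N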
For someone with an ancient undergrad math background and only "interested observer" level of machine learning knowledge, would it be better to do this course before tackling the deep learning one?
No, start with fast.ai's deep learning course. They have a top down approach so you start with connecting layers to solve deep learning problems and then dig down into the theory behind the thing.
If you want a bottom up approach, go look at Ng's coursera course.
Absolutely right. There's very little linear algebra required in deep learning. This Computational Linear Algebra course mainly covers decompositions, which aren't used much in deep learning.
Thanks for sharing this; it seems like a lot of interesting material is being discussed. The audience seems to be more the Hacker News visitor than the average student though, as it feels like little hand-holding is provided.
I've just started lecture 1 but I already felt some minor frustrations:
- One of the links in the first lecture is to a notebook about intro to convolutions but that notebook is just a big code dump.
- After executing the exercises, you lose the expected answer. It might be better if the answers were included as a comment in the code fragment.
- Sometimes the given answers are not actually the answer but just a computation performed as part of getting the answer. E.g. for the matrix-matrix products section in lecture 1, the suggested answer is just the resulting matrix from doing the matrix product, but according to the question in the text the answer should be the actual cheapest shop.
- Is this a USF course or a fast.ai course?
I don't know if the author is planning on improving the material, because right now it feels a bit like a beta version.
Probably to emphasize that it is a practical course. Numerical courses, at least those I've seen or completed, tend to focus on algorithms and their properties, not implementation.
> Jeremy and I developed this material for a numerical linear algebra course we taught in the University of San Francisco’s Masters of Analytics program, and it is the first ever numerical linear algebra course, to our knowledge, to be completely centered around practical applications and to use cutting edge algorithms and tools,
Like... most of it?
You can't do linear algebra "safely" without doing error analysis. So lots of decompositions and operations are very useful for proofs and finding bounds, but can't be used directly to compute values. That's why a good fraction of the stuff people do with linear algebra is numerically garbage.
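A standard example of what "numerically garbage" means in practice (my own illustration): solving a least-squares problem via the normal equations squares the condition number, so two methods that are identical on paper give very different accuracy in floating point.

    import numpy as np

    rng = np.random.default_rng(4)
    m, n = 100, 10
    U, _ = np.linalg.qr(rng.standard_normal((m, n)))
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))
    A = U @ np.diag(np.logspace(0, -7, n)) @ V.T       # condition number about 1e7
    x_true = rng.standard_normal(n)
    b = A @ x_true

    x_qr, *_ = np.linalg.lstsq(A, b, rcond=None)       # QR/SVD-based solve: error ~ cond(A) * eps
    x_ne = np.linalg.solve(A.T @ A, A.T @ b)           # normal equations: cond(A'A) ~ 1e14

    print(np.linalg.norm(x_qr - x_true) / np.linalg.norm(x_true))
    print(np.linalg.norm(x_ne - x_true) / np.linalg.norm(x_true))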
I found the part which uses Zorn's Lemma pretty un-computational :) It's equivalent to the axiom of choice and is used to show that every vector space has a basis.
- Computational as in 'using programming and computers'
- Computational as in 'solving a math problem' (as opposed to 'theoretical': definitions/theorems/proofs/lemmas).
My guess is that the posted link means the first kind, hence almost all of linear algebra texts are non-computational, i.e., you can become an expert in linear algebra without knowing how to program, and without knowing a single programming language.
For the second kind, most of beginner and intermediate linear algebra is computational, but not all. There's plenty of theory of linear algebra, and it has connections with representation theory, abstract algebra, as well as analysis, topology, and geometry. Study of infinite-dimensional vector spaces is purely non-computational.