Looks like a reasonable overview of dense linear algebra operations, with very specific applications in mind.
I feel like iterative Krylov subspace methods should be around somewhere, but to be honest I'm not sure what the applications space for linear inverse problems looks like in the deep learning ___domain. So maybe that wouldn't quite fit (despite me finding it to be pretty cool, and these methods sort of eating the lunch of most other inverse methods).
I'd round out the course maybe with a read through of Trefethen-Bau, “Numerical Linear Algebra” for those new to the topic area.
So, what applications are iterative Krylov subspace methods good for? Can you give us a sense of where they're useful, and what (in general) inverse methods in this space are?
The sibling comments are great, so I'll comment in a more direct, application-oriented way: Krylov methods expose most of the machinery of linear algebra to you by means of only forward matrix multiplication with the relevant matrix.
A cool thing is that this means you can do linear algebra operations with a matrix (including solving linear systems) without ever needing to explicitly construct said matrix. We just need a recipe describing the action of multiplying by the matrix in order to do more advanced things with it. This is why they are so popular for sparse matrices, but their usefulness goes beyond just that.
Projecting problems into Krylov spaces also tends to have a noticeable regularizing effect.
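To make that concrete, here's a minimal sketch of the matrix-free idea (my own toy example, assuming SciPy is available; the bidiagonal operator below is made up purely for illustration). We hand GMRES nothing but a matvec recipe and it solves the system anyway:

    import numpy as np
    from scipy.sparse.linalg import LinearOperator, gmres

    n = 1000

    def matvec(x):
        # action of a lower-bidiagonal matrix A: (A x)_i = 2*x_i - x_{i-1}; A is never stored
        y = 2.0 * x
        y[1:] -= x[:-1]
        return y

    A = LinearOperator((n, n), matvec=matvec)    # "A" is just the recipe above
    b = np.ones(n)
    x, info = gmres(A, b)                        # solve A x = b using only matvecs
    print(info, np.linalg.norm(matvec(x) - b))   # info == 0 means converged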
Krylov subspace methods have been the top algorithms for years for solving linear systems Ax = b and eigenvalue problems Ax = λx. They basically find x as a polynomial in A: x = a_0b + a_1Ab + a_2A^2b + ... . Why a polynomial? Because it's easy to construct, and A^{-1} can be represented as a polynomial in A (with a finite number of terms).
More generally: subspace methods find a solution x as a linear combination of a few column vectors that form a matrix V. So x = Vy, where y is a very small vector. Only if you're lucky will you find AVy = b; in general AVy ≠ b. So you require the residual to be perpendicular to V (or something like that): V'(AVy - b) = 0. In the end you solve a very small problem (V'AV)y = V'b for y.
GMRES, for instance, solves AVy = b in the least-squares sense: (AV)'AVy = (AV)'b.
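A bare-bones sketch of that projection idea (toy code I wrote for illustration, not from the course): build an orthonormal basis V of the Krylov space with Arnoldi, then solve the small projected least-squares problem, which is essentially what GMRES does.

    import numpy as np

    def arnoldi(A, b, m):
        # Orthonormal columns V spanning {b, Ab, ..., A^m b} and Hessenberg H with A V[:, :m] = V H
        n = b.size
        V = np.zeros((n, m + 1))
        H = np.zeros((m + 1, m))
        V[:, 0] = b / np.linalg.norm(b)
        for j in range(m):
            w = A @ V[:, j]
            for i in range(j + 1):            # Gram-Schmidt against previous basis vectors
                H[i, j] = V[:, i] @ w
                w -= H[i, j] * V[:, i]
            H[j + 1, j] = np.linalg.norm(w)
            V[:, j + 1] = w / H[j + 1, j]
        return V, H

    rng = np.random.default_rng(0)
    n, m = 200, 30
    A = np.eye(n) + rng.standard_normal((n, n)) / (3 * np.sqrt(n))   # made-up matrix, eigenvalues near 1
    b = rng.standard_normal(n)

    V, H = arnoldi(A, b, m)
    # GMRES step: minimise ||A V y - b|| = ||H y - beta*e1|| over the small (m+1) x m system
    e1 = np.zeros(m + 1)
    e1[0] = np.linalg.norm(b)
    y, *_ = np.linalg.lstsq(H, e1, rcond=None)
    x = V[:, :m] @ y
    print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))             # small relative residual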
(warning, I am not a numerical computation specialist)
Krylov subspaces are a (the ?) common framework to compute eigenvalues and singular values, and it works for sparse matrices as well. IOW, it is extremely useful for most problems that can be expressed as matrix factorization (e.g. recommendation using collaborative filtering, initial page rank-like algos).
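A quick sketch of what that looks like in practice (toy data; and as far as I know SciPy's sparse svds/eigsh routines are ARPACK Lanczos/Arnoldi, i.e. Krylov methods, under the hood):

    import scipy.sparse as sp
    from scipy.sparse.linalg import svds

    R = sp.random(100_000, 2_000, density=1e-4, random_state=42, format="csr")  # fake "ratings" matrix
    U, s, Vt = svds(R, k=10)     # 10 largest singular triplets, using only matvecs with R
    print(s[::-1])               # svds returns singular values in ascending order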
I remember using Conjugate Gradient (https://en.wikipedia.org/wiki/Conjugate_gradient_method) for my 4th year university project, which involved finding the steady state of large and complex systems (e.g. railway signalling systems), that were expressed as sparse matrices with millions of rows/columns.
It worked really well and executed faster than other iterative methods such as Jacobi/Gauss-Seidel, but the conjugate gradient method was a bit unstable and at times would not converge to a solution. I think it was due to the high number of multiplications involved, which may cause over-/underflow (but I can't remember).
The matrices themselves are a representation of a Markov chain.
Looks like a good course.
I think it would benefit if they added a module on implementing some basic linear-system solvers, like gradient/steepest descent, or even GMRES/MINRES. The amount of knowledge that I gained from trying to implement these was remarkable.
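For anyone curious what that exercise looks like, here's a rough from-scratch steepest-descent solver for a symmetric positive definite system (a toy setup I made up, not course material):

    import numpy as np

    def steepest_descent(A, b, tol=1e-8, maxiter=10_000):
        x = np.zeros_like(b)
        for k in range(maxiter):
            r = b - A @ x                        # residual = negative gradient of 0.5*x'Ax - b'x
            if np.linalg.norm(r) < tol * np.linalg.norm(b):
                return x, k
            alpha = (r @ r) / (r @ (A @ r))      # exact line search along the residual direction
            x = x + alpha * r
        return x, maxiter

    rng = np.random.default_rng(1)
    M = rng.standard_normal((100, 100))
    A = M @ M.T + 100 * np.eye(100)              # symmetric positive definite, modest condition number
    b = rng.standard_normal(100)
    x, iters = steepest_descent(A, b)
    print(iters, np.linalg.norm(A @ x - b))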
GMRES/Krylov spaces seem like they ought to have more of a place in deep learning. But maybe that's the mathematical part of me having a hammer and wanting to see everything as a nail.
One possibility that I haven't seen: if you have a neural network graph with a small subgraph that is not 100% explicit (i.e. nodes with a cyclic dependency that have to be solved collectively via a Newton-type method), the gradient across this subgraph can be computed at the converged state by solving a standard matrix-vector linear inverse problem applied to the gradients, without needing the AD engine to follow every operation through the iterates of the non-linear Newton solver.
For sparsity and conditioning reasons, in my current work we do something like this for graph-based specification of engineering systems using GMRES to find the gradients across these subsystems. We're not doing deep learning, but the underlying machinery is basically the same.
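Here's a hedged toy sketch of that trick, since it's not obvious at first sight (the fixed-point map, the loss, and every name below are invented for illustration, and the inner solve is a plain fixed-point iteration rather than Newton just to keep it short). The point is that the backward pass reduces to one matrix-free linear solve at the converged state, done here with GMRES:

    import numpy as np
    from scipy.sparse.linalg import LinearOperator, gmres

    rng = np.random.default_rng(2)
    n = 50
    W = 0.3 * rng.standard_normal((n, n)) / np.sqrt(n)   # small norm so the implicit subgraph is well posed
    theta = rng.standard_normal(n)

    # "Inner solver": find x with F(x, theta) = x - tanh(W x + theta) = 0
    x = np.zeros(n)
    for _ in range(200):
        x = np.tanh(W @ x + theta)

    d = 1.0 - np.tanh(W @ x + theta) ** 2        # derivative of tanh at the converged state
    dLdx = x                                     # upstream gradient for the loss L = 0.5 * ||x||^2

    # Adjoint system (dF/dx)' lam = dL/dx with dF/dx = I - diag(d) W, applied matrix-free
    JT = LinearOperator((n, n), matvec=lambda v: v - W.T @ (d * v))
    lam, info = gmres(JT, dLdx)

    grad_theta = d * lam                         # dL/dtheta = -lam' dF/dtheta, and dF/dtheta = -diag(d)

    # Finite-difference sanity check on one component
    eps, j = 1e-6, 3
    tp = theta.copy()
    tp[j] += eps
    xp = np.zeros(n)
    for _ in range(200):
        xp = np.tanh(W @ xp + tp)
    print(grad_theta[j], (0.5 * xp @ xp - 0.5 * x @ x) / eps)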
Krylov spaces often come up in derivations for optimization methods in DL, and have been studied as an explicit technique on their own. Eg see https://arxiv.org/abs/1111.4259
However, more recent theoretical and empirical results show that these kinds of approaches don't deal well with the huge saddle-point problem in DL loss functions, whereas momentum-based techniques work great.
Interesting, I'll have to look deeper at what has been done in this space. As I understand it, saddle points are handled by SGD-based methods because they are a bit more "noisy" in their steps through the problem space. In short, are the Krylov methods (or perhaps other approaches) a bit too precise in their attraction to critical points?
Absolutely not. Krylov methods are one way of solving a linear system and can absolutely be used with stochastic optimization methods. Generally speaking, most optimization algorithms require the solution of some linear system, and we can use a Krylov method, or a direct solve, or integrate the direct solve with the Krylov method as a preconditioner. However, this has nothing to do with the stochastic piece of the algorithm. If we want to use a random projection or some other technique to compute the gradient, we can absolutely do so while using a Krylov method.
Now, if we're just doing a stochastic steepest descent algorithm, there are not really any linear systems to solve, so this doesn't come up. If we want to use a stochastic Newton algorithm, it does. It comes up here specifically because forming the full Hessian may be too expensive in terms of computation or memory, so if we're smart and use a Krylov method that only requires Hessian-vector products, we can effectively use second-order information without taking the penalties in computation and memory. Yes, there is a loss here since we only use a piece of the spectrum of the Hessian, and this is where the art and science of preconditioning come into play to ensure that we use this effectively. However, we can use stochastic algorithms here. In fact, the entire field of stochastic PDE-constrained optimization deals very precisely with these tools and does so in a surprisingly effective manner.
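For concreteness, here's a loose sketch of that Hessian-vector-product pattern (truncated Newton-CG on a made-up regularised logistic regression; the problem and every name are my own, and a real implementation would add a line search or trust region):

    import numpy as np
    from scipy.sparse.linalg import LinearOperator, cg

    rng = np.random.default_rng(3)
    N, d, lam = 2000, 50, 1e-2
    X = rng.standard_normal((N, d))
    y = np.sign(X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(N))

    def grad(w):
        m = y * (X @ w)
        return -X.T @ (y / (1.0 + np.exp(m))) + lam * w      # gradient of sum(log(1+exp(-m))) + lam/2 * ||w||^2

    def hvp(w, v):
        m = y * (X @ w)
        s = 1.0 / ((1.0 + np.exp(m)) * (1.0 + np.exp(-m)))   # sigma(m) * sigma(-m)
        return X.T @ (s * (X @ v)) + lam * v                 # Hessian-vector product; H is never formed

    w = np.zeros(d)
    for it in range(10):                                     # inexact Newton iterations
        g = grad(w)
        H = LinearOperator((d, d), matvec=lambda v: hvp(w, v))
        p, _ = cg(H, -g, maxiter=50)                         # truncated CG: only a handful of HVPs
        w = w + p
        print(it, np.linalg.norm(g))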
The advantage that Krylov methods have over a direct solve is that we retain the ability to use the direct solve if we want in the context of a preconditioner. Different Krylov methods have different properties that we can use to our advantage. For example, conjugate gradient works on a minimization principle based on a quadratic form, which integrates well with trust-region methods that require us to minimize a quadratic model. GMRES minimizes the residual of the linear system, which gives us a monotonically decreasing residual at each iteration; that can be important for different algorithms. Which Krylov method to use in any particular situation depends on the properties desired and the systems involved. And, I can't stress this enough, we don't lose out on direct solves. Direct solves can be fantastic preconditioners, and if we really can compute an inverse exactly, our Krylov iteration converges in one step. As such, we lose a little bit of computation and memory, but not that much.
Anyway, sorry for coming on a little bit hard, but I wanted to try and nip some confusion in the bud quickly. Stochastic methods are naturally and fully integrated into inexact optimization schemes that rely on Krylov solves. They are the standard tool for many fields that use optimization. They may not be widely used in ML, but that has little to do with their computational efficacy.
While what you say is accurate, it has nothing to do with the comment you are replying to, which was specifically about the shape of the loss functions in deep learning and recent research on how that shape impacts optimization of these functions.
The grandparent comment asked, "In short, are the Krylov methods (or perhaps other approaches) a bit too precise in their attraction to critical points?" to which you replied, "Exactly. Saddle points are like a magnet for them!" There's little room for ambiguity here, and this is categorically false. Your linear solver has nothing to do with saddle points, and saddle points do not attract Krylov methods.
> Locality: traditional runtime computations focus on Big O, the number of operations computed. However, for modern computing, moving data around in memory can be very time-consuming
I need to nitpick here... Big O notation is a way to describe growth rates of functions. You can count data movements (or anything else) with Big O.
Moreover, "moving data around in memory can be very time-consuming" means that in the end, it is still about time, not about memory.
So the correct way would be to translate memory access to time, but that means modelling the memory hierarchy, modelling caches in general, and finally perhaps modelling the specifically used caching strategies.
The limits on memory access are physical, illustrated by Grace Hopper's famous video about nanoseconds and the speed of light.
Should computer algorithms always assume they need to model caches, since caches are never going away? When determining computational complexity, time and memory accesses are treated as equivalent, but real memory doesn't and never will behave that way.
Yes, the memory hierarchy is discussed in the course, and algorithms are shown that are designed to take account of this. Older algorithms tended to only consider computation time, not memory access time.
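A quick way to see the point (timings are obviously machine- and library-dependent; this is just a sketch): both loops below do the same O(n^2) work, but one walks memory contiguously while the other strides across it.

    import time
    import numpy as np

    n = 4000
    A = np.random.rand(n, n)                 # row-major (C-order) array

    t = time.perf_counter()
    row_sums = np.array([A[i, :].sum() for i in range(n)])   # contiguous, cache-friendly
    t_rows = time.perf_counter() - t

    t = time.perf_counter()
    col_sums = np.array([A[:, j].sum() for j in range(n)])   # strided, cache-unfriendly
    t_cols = time.perf_counter() - t

    print(f"row-wise: {t_rows:.3f}s   column-wise: {t_cols:.3f}s")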
Candid question: isn't the growth rate of functions the derivative of the Big O complexity instead?
O(1) does not mean that the growth rate is constant. It means that the number of operations is constant; since its derivative is zero, the growth is nil.
O(1) means the number of operations is bounded. f = O(g) measures growth in the sense that f/g = O(1), i.e. g grows quickly enough to cancel out any tendency of f to go to infinity.
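For reference, the textbook definition both replies are circling around (standard statement, nothing specific to this thread):

    f(n) = O(g(n)) \iff \exists\, C > 0,\ N \ge 0 \ \text{such that}\ |f(n)| \le C\,|g(n)| \ \text{for all } n \ge N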
For someone with an ancient undergrad math background and only "interested observer" level of machine learning knowledge, would it be better to do this course before tackling the deep learning one?
No, start with fast.ai's deep learning course. They have a top down approach so you start with connecting layers to solve deep learning problems and then dig down into the theory behind the thing.
If you want a bottom up approach, go look at Ng's coursera course.
Absolutely right. There's very little linear algebra required in deep learning. This Computational Linear Algebra course mainly covers decompositions, which aren't used much in deep learning.
Thanks for sharing this; it seems like a lot of interesting material is being discussed. The audience seems to be more the Hacker News visitor than the average student though, as it feels like little hand-holding is provided.
I've just started lecture 1 but I already felt some minor frustrations:
- One of the links in the first lecture is to a notebook about intro to convolutions but that notebook is just a big code dump.
- After executing the exercises, you lose the expected answer. It might be better if the answers were included as a comment in the code fragment.
- Sometimes the given answers are not actually the answer but just a computation performed as part of getting the answer. E.g. for the matrix-matrix products section in lecture 1, the suggested answer is just the resulting matrix from doing the matrix product, but according to the question in the text the answer should be the actual cheapest shop.
- Is this a USF course or a fast.ai course?
I don't know if the author is planning on improving the material, because right now it feels a bit like a beta version.
Probably to emphasize that it is a practical course. Numerical courses, at least those I've seen or completed, tend to focus on algorithms and their properties, not implementation.
> Jeremy and I developed this material for a numerical linear algebra course we taught in the University of San Francisco’s Masters of Analytics program, and it is the first ever numerical linear algebra course, to our knowledge, to be completely centered around practical applications and to use cutting edge algorithms and tools,
Like... most of it?
You can't do linear algebra "safely" without doing error analysis. So lots of decompositions and operations are very useful for proofs and finding bounds, but can't be used directly to compute values. That's why a good fraction of the stuff people do with linear algebra is numerically garbage.
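A standard example of what "numerically garbage" means in practice (my own illustration): solving a least-squares problem via the normal equations squares the condition number, so two methods that are identical on paper give very different accuracy in floating point.

    import numpy as np

    rng = np.random.default_rng(4)
    m, n = 100, 10
    U, _ = np.linalg.qr(rng.standard_normal((m, n)))
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))
    A = U @ np.diag(np.logspace(0, -7, n)) @ V.T       # condition number about 1e7
    x_true = rng.standard_normal(n)
    b = A @ x_true

    x_qr, *_ = np.linalg.lstsq(A, b, rcond=None)       # QR/SVD-based solve: error ~ cond(A) * eps
    x_ne = np.linalg.solve(A.T @ A, A.T @ b)           # normal equations: cond(A'A) ~ 1e14

    print(np.linalg.norm(x_qr - x_true) / np.linalg.norm(x_true))
    print(np.linalg.norm(x_ne - x_true) / np.linalg.norm(x_true))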
I found the part which uses Zorn's Lemma pretty un-computational :) It's equivalent to the axiom of choice and is used to show that every vector space has a basis.
- Computational as in 'using programming and computers'
- Computational as in 'solving a math problem' (as opposed to 'theoretical': definitions/theorems/proofs/lemmas).
My guess is that the posted link means the first kind, hence almost all of linear algebra texts are non-computational, i.e., you can become an expert in linear algebra without knowing how to program, and without knowing a single programming language.
For the second kind, most of beginner and intermediate linear algebra is computational, but not all. There's plenty of theory of linear algebra, and it has connections with representation theory, abstract algebra, as well as analysis, topology, and geometry. Study of infinite-dimensional vector spaces is purely non-computational.