Hacker News
Exact mapping between Variational Renormalization Group and Deep Learning (2014) (arxiv.org)
97 points by _qc3o on Nov 30, 2016 | 13 comments



Can anyone on here elaborate on a) whether this makes sense at all and b) what its implications are for physics and deep learning?


See my blog: https://charlesmartin14.wordpress.com/2015/04/01/why-deep-le...

I try to highlight the paper and some of its history and relevance in a general way, while also hitting the math hard.

The more general idea, which I am still formulating, is that DL systems are very different from traditional ML (à la VC theory).

In traditional ML systems, we tune the regularizer to optimize the capacity of the learner.

In deep learning, we would optimize both the capacity (entropy) of the learner and the optimization problem (energy function).
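Very roughly, and in my own notation rather than anything from the paper: traditional ML fixes the loss and tunes a regularization knob λ to control capacity, whereas the stat-mech picture trades energy against entropy through a temperature-like parameter T,

    \min_w \; \hat{L}(w) + \lambda\, R(w)   % regularized risk: capacity controlled by \lambda

    F = U - T S                             % free energy: energy U vs. entropy S, traded off via T

so in the second picture you are shaping both terms at once rather than just dialing one knob.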

This is also what happens in the stat mech of protein folding, where the energy is optimized even when we are at minimum capacity. This gives rise to a funneled energy landscape.

https://charlesmartin14.wordpress.com/2015/03/25/why-does-de...

Similar behavior is seen generally in stat mech near a critical point, and this is why the RG analogy is relevant for me.

We should be able to see this behavior if we simply plot the entropy vs. energy of, say, an RBM. I'm not entirely sure yet how general this is, but it works for MNIST.
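For concreteness, here's a rough sketch of the kind of plot I mean, using scikit-learn's BernoulliRBM on MNIST. The free energy below is the standard Bernoulli-RBM expression; the "entropy" is just the entropy of the hidden-unit activations, which is only a crude proxy for the capacity I have in mind, so take it as a starting point rather than the real calculation:

    # Sketch: entropy vs. energy for a Bernoulli RBM trained on MNIST.
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.special import expit  # logistic sigmoid
    from sklearn.datasets import fetch_openml
    from sklearn.neural_network import BernoulliRBM

    X, _ = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
    X = (X[:5000] / 255.0 > 0.5).astype(np.float64)  # binarize a subset

    rbm = BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=10, random_state=0)
    rbm.fit(X)

    W, b, c = rbm.components_, rbm.intercept_visible_, rbm.intercept_hidden_
    act = X @ W.T + c  # hidden pre-activations
    # Free energy of each visible vector: F(v) = -b.v - sum_j softplus(W_j.v + c_j)
    free_energy = -X @ b - np.logaddexp(0, act).sum(axis=1)
    # Entropy of the hidden-unit Bernoulli activations p_j = sigmoid(W_j.v + c_j)
    p = np.clip(expit(act), 1e-7, 1 - 1e-7)
    entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p)).sum(axis=1)

    plt.scatter(free_energy, entropy, s=2, alpha=0.3)
    plt.xlabel("RBM free energy F(v)")
    plt.ylabel("hidden-unit entropy (proxy)")
    plt.show()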

I discuss (some of) this in more detail in a video from a talk I gave this summer at MMDS https://www.youtube.com/watch?v=kIbKHIPbxiU


You might be interested to speak with David Wales (http://www.ch.cam.ac.uk/person/dw34). He has recently been looking into properties of the landscapes of neural networks and has considerable experience in energy landscapes.


Is it related to the predator/prey population problems we did in Linear Algebra?

Something like this: http://www.math.umd.edu/~jmr/246/predprey.html


Interesting. I'm curious why you made that connection. What reminded you of predator-prey systems?


Machine learning, specifically deep neural networks, is missing something rather significant as a science: some sort of universal, underlying, fundamental theory that explains neural network topologies. Physics has laws of motion, for example. Neural nets are still "twiddle these values and see what happens" -- and that kind of trial-and-error science describes more than a small number of papers.

This paper suggests that there is a close relationship ("one-to-one mapping") between neural networks and a technique in physics called the variational renormalization group. They say:

> RG plays a central role in our modern understanding of statistical physics and quantum field theory. A central finding of RG is that the long distance physics of many disparate physical systems are dominated by the same long distance fixed points. This gives rise to the idea of universality – many microscopically dissimilar systems exhibit macroscopically similar properties at long distances. Physicists have developed elaborate technical machinery for exploiting fixed points and universality to identify the salient long distance features of physics systems. It will be interesting to see what, if any, of this more complex machinery can be imported to deep learning.

If we can grasp some theory underpinning neural networks, the science of machine learning will be less "let's try this topology/architecture and see what happens" and more "if we add a layer of this size in this part of the network, we expect these results, or we expect results to change in <specific way>".

Anyway, maybe physics can teach us something about neural networks.


The RG used in statistical physics also suffers from being ad-hoc. There is a constant stream of physics papers that propose a new "architecture" for doing RG and comparing the numerics to other architectures, with very little (it seems) understanding of why these things do or do not work as well. For example: https://arxiv.org/abs/1412.0732


This is very much not my area of physics (I worked in condensed matter), but from what I know I'm not seeing anything which looks outrageously implausible.

I think the intuition is something like this, and I'm going to be really informal here so please indulge me while I have a go at the pop-science version (and experts, please correct me!)

One of the ways deep learning is useful, compared to other feature extraction schemes, is that (in most formulations) it's a multi-scale method (it learns features at various length scales from very short to very long, specifically as you ascend/descend layers in the network).

The renormalization group is a mathematical framework for reasoning about how physics changes between length scales. Very informally, the hardest unsolved problem in physics is building a single theory which works for both the very short and the very big – very big problems (astrophysics) have an effective theory in the form of general relativity, very, very short length scales have an effective theory in the form of quantum field theory (basically all of modern particle physics), and the rest of physics (e.g. quantum mechanics, which from a purely theoretical perspective just is the entirety of chemistry and materials science...) sits somewhere in between on the great cosmic measuring stick.

So far so Powers of Ten, right?

The challenge is building a mathematical structure for transforming between these length scales in such a way as to have a coherent single theory which works for all of them. This is what people mean when they talk about "unification" in physics. The best mathematical framework we have for that thus far is the renormalization group.
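The cartoon version of an RG step, for the curious, is just coarse-graining: replace each block of microscopic degrees of freedom with a single coarser variable and ask how the effective description changes. A toy majority-rule block-spin step on a random ±1 lattice looks something like this (Python, purely to show the bookkeeping, not any particular paper's scheme):

    # Toy "block-spin" RG step: coarse-grain a 2D lattice of +/-1 spins by
    # replacing each 2x2 block with the sign of its sum (majority rule,
    # random tie-break). Each step halves the linear size of the lattice.
    import numpy as np

    rng = np.random.default_rng(0)

    def block_spin_step(spins):
        L = spins.shape[0]
        blocks = spins.reshape(L // 2, 2, L // 2, 2).sum(axis=(1, 3))
        ties = rng.choice([-1, 1], size=blocks.shape)  # break 2-vs-2 ties randomly
        return np.where(blocks == 0, ties, np.sign(blocks)).astype(int)

    lattice = rng.choice([-1, 1], size=(64, 64))  # "infinite temperature" start
    for step in range(3):
        lattice = block_spin_step(lattice)
        print(f"after step {step + 1}: {lattice.shape[0]}x{lattice.shape[1]} lattice, "
              f"mean spin {lattice.mean():+.3f}")

The interesting physics is in how the couplings of the effective Hamiltonian change under repeated steps like this; the fixed points of that flow are what people mean by universality.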

So this paper – which I have no more than skimmed – purports to show that these two mathematical structures are (in some deep sense) more or less the same shape, which would make sense given that the problems they are solving are the same kind of shape (making sense of something at a range of length scales).

So it could be wrong, but it's not obviously misguided. Is it useful? In terms of practical machine learning applications right now, probably not (and ditto, but more strongly, in terms of theoretical physics). But if it means a bunch of mathematical techniques worked out in one ___domain can be ported over to the other (in practice, from particle physics to analyzing deep neural networks), that'd be useful!


The paper is pretty simple. It just shows that Hinton's scheme for introducing hidden variables, and then minimizing, is a type of variational RG, introduced by Kadanoff decades earlier.
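Schematically, and from memory (check the paper for signs and conventions): variational RG defines a coarse-grained Hamiltonian for the hidden spins h through an operator T, the RBM defines one through its marginal over the hidden units, and the two coincide for a particular choice of T:

    % Variational RG (Kadanoff): coarse-grained Hamiltonian for hidden spins h
    e^{-H^{RG}_\lambda[h]} = \mathrm{Tr}_v\, e^{\,T_\lambda(v,h) - H(v)},
    \qquad \mathrm{Tr}_h\, e^{\,T_\lambda(v,h)} = 1 \quad \text{(exactness condition)}

    % RBM: joint Boltzmann distribution and the Hamiltonian of its hidden marginal
    p(v,h) = \frac{e^{-E(v,h)}}{Z}, \qquad
    e^{-H^{RBM}_\lambda[h]} \equiv \mathrm{Tr}_v\, e^{-E(v,h)}

    % Choosing T_\lambda(v,h) = -E(v,h) + H(v) makes the two Hamiltonians equal:
    H^{RG}_\lambda[h] = H^{RBM}_\lambda[h]

So one step of Kadanoff's variational scheme looks like one RBM layer, and stacking layers iterates the coarse-graining.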

Beyond that, the analogy is not very strong, IMHO. But it is cool.


Nice to know I've not completely lost my physical intuition. (Used to study solid-solid phase transitions.)


As a complement to all the other comments so far: it is plausible that any goal-directed inference must "compress" the raw data into a small number of bits required to make a decision on the goal. For physicists, "renormalization group flow" (RG) is nothing but a theory behind that process of "simplifying" a theory enough to make predictions (towards a specific goal). It is also natural that this compression process is iterative. There are many implementations of RG that fall within the framework, each with their own pros and cons, which makes each of them suitable for solving different problems.

One of the more recent examples is the "Multi-scale Entanglement Renormalization Ansatz" (MERA) tensor network, which accounts for quantum entanglement within the RG framework (none of the previous formulations did, to my knowledge). In my opinion, that is a significant qualitative breakthrough, whose consequences are starting to bear fruit and will continue to do so for the next decade. The OP compares MERA with RBMs and claims that they're similar.

For an information-theoretic perspective on this relation between deep learning and RG, refer to this fascinating paper by Naftali Tishby and Noga Zaslavsky, which brings together ideas from different fields: https://arxiv.org/abs/1503.02406
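The core of the information bottleneck view, for those who haven't seen it, is a single variational objective: find a compressed representation T of the input X that keeps as much information as possible about the label Y,

    \min_{p(t \mid x)} \; I(X;T) \;-\; \beta\, I(T;Y)

with β setting the trade-off between compression and prediction. Roughly speaking, the paper reads the successive layers of a DNN as successive steps of this compression.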

Regarding the implications:

1. For deep learning: We might have a shot at coming up with a theory for how DNNs work. This would help us design (much) better DNNs, because today's designs are pretty ad hoc and might be far from optimal. It might also give us some explanatory power in reasoning about the way DNNs work.

2. For physics: It's not clear what the payoffs would be. There has been some recent work trying to use DNNs to solve physics-ish problems (eg: guessing the phase of the Ising model from MCMC-generated samples; a toy sketch is below). The hope is that in trying to study this relation, we uncover a better understanding of how RG works. That would have potentially tremendous consequences, for everything from designing superconductors to understanding quantum gravity.
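To make the Ising example concrete, the toy version is: generate spin configurations with Metropolis Monte Carlo at temperatures above and below T_c ≈ 2.269 and train an off-the-shelf classifier to tell the two phases apart from the raw spins. A minimal sketch (short runs, small lattice, nothing tuned):

    # Toy "learning phases of the 2D Ising model": generate spin configurations
    # with Metropolis MC at several temperatures and train a small MLP to
    # classify ordered (T < Tc) vs. disordered (T > Tc) phases from raw spins.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(1)
    L, TC = 16, 2.269  # lattice size; critical temperature in units J = kB = 1

    def metropolis_sample(T, n_sweeps=200):
        s = rng.choice([-1, 1], size=(L, L))
        for _ in range(n_sweeps * L * L):  # single-spin-flip updates
            i, j = rng.integers(L, size=2)
            nb = s[(i + 1) % L, j] + s[(i - 1) % L, j] + s[i, (j + 1) % L] + s[i, (j - 1) % L]
            dE = 2 * s[i, j] * nb  # energy change for flipping spin (i, j)
            if dE <= 0 or rng.random() < np.exp(-dE / T):
                s[i, j] *= -1
        return s.ravel()

    temps = [1.5, 1.8, 2.0, 2.6, 3.0, 3.5]
    X = np.array([metropolis_sample(T) for T in temps for _ in range(20)])
    y = np.array([int(T < TC) for T in temps for _ in range(20)])  # 1 = ordered

    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=0)
    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
    clf.fit(Xtr, ytr)
    print("test accuracy:", clf.score(Xte, yte))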

PS: I'm a physicist and I've been exploring this area over the last year or so; if anyone is interested in discussing details, feel free to ping me.


Here's an interesting follow up from a noted researcher (Tali Tishby) with one foot in the stat mech camp, and one in the information theory camp: https://arxiv.org/pdf/1503.02406.pdf


What's the decimation operator?

The authors claim an exact mapping but fail to define a few of the essential elements of any renormalization group approach.
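For anyone unfamiliar with the term: in the textbook 1D Ising example, "decimation" means tracing out every other spin, which produces a new effective coupling between the spins that remain,

    \sum_{s_2 = \pm 1} e^{K s_2 (s_1 + s_3)} \;=\; A\, e^{K' s_1 s_3},
    \qquad \tanh K' = \tanh^2 K
    \;\Longleftrightarrow\; K' = \tfrac{1}{2} \ln \cosh 2K.

That explicit coarse-graining rule, and how the couplings flow under it, is the kind of thing I'd want the authors to spell out.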





