First, thanks to the publisher and authors for making this freely available!
I retired recently after using neural networks since the 1980s. I still spend at least 10 hours a week keeping up with DL, RL, etc. It seems like the roof has blown off the field recently; progress increases exponentially. I like material that makes me think of NNs with different intuitions. I am working on a CC-licensed book consisting of my experiments, in Jupyter notebook/Colab form - expect me to be shamelessly plugging that in a few months.
In the book, I especially loved this quote:
“You can hide a lot in a large-N matrix.” – Steve Shenker, as quoted by John McGreevy
That’s such a fascinating background to have! It must be strange to retire and, far from the cliché of your skills having been made obsolete and irrelevant, instead you’re an expert on what’s probably the forefront of modern technology.
If you can spare a moment to answer: is there any knowledge from the 80s neural net ‘summer’ which you think has been forgotten now? As someone who’s concerned by both poor performance and overfitting (strongly correlated), I thought it was a shame that lots of the research around pruning (optimal brain damage, optimal brain surgeon, etc) has been forgotten[0].
[0] It feels to me as though the ML community has convinced itself - in defiance of information theory - that highly overparameterised models are totally OK, and can even successfully extrapolate with greater than random accuracy in the general case (‘double descent’). I worry that lots of these models are effectively succeeding only at interpolation problems, and only by virtue of the massive hardware advances that let us memorise the entire training set in these enormous models - basically Runge’s phenomenon writ large. People seem to be convincing themselves of magical things that are not mathematically sound.
Pruning (both structured and unstructured) is an extremely active area of research and development. Since the Ampere generation, Nvidia GPUs have had hardware support for sparsified networks.
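For anyone curious, here is a minimal sketch of what unstructured and structured pruning look like with PyTorch's built-in torch.nn.utils.prune utilities (the layer size and pruning amounts are arbitrary examples, and this doesn't produce the specific 2:4 pattern that Ampere accelerates):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(512, 512)

# Unstructured pruning: zero the 50% of weights with smallest L1 magnitude.
prune.l1_unstructured(layer, name="weight", amount=0.5)

# Structured pruning: remove 25% of entire output rows by their L2 norm.
prune.ln_structured(layer, name="weight", amount=0.25, n=2, dim=0)

# Fold the accumulated masks into the weight tensor permanently.
prune.remove(layer, "weight")
print(f"sparsity: {(layer.weight == 0).float().mean():.2f}")
```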
With respect to overparameterized models, even the largest models, such as Google’s PaLM 540B, have several orders of magnitude fewer connections than the human brain. Remembering a set of observations doesn’t forbid deducing a general pattern. Imagine trying to understand why the Pythagorean Theorem is true without first memorizing what it states! Children often benefit from seeing and even memorizing many concrete examples before being taught the abstract principles which explain them all. Kepler deduced his Three Laws of Planetary Motion by analyzing four decades of careful astronomical observations compiled by Tycho Brahe. Implicit and explicit regularization biases modern learning algorithms toward generally simpler solutions in “wide flat basins of the loss surface.” For instance, the humble skip connection helps smooth out the loss surface.
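To make the skip-connection point concrete, a minimal residual block sketch in PyTorch (dimensions are arbitrary); the identity path is what gives the loss surface its smoother, more trainable shape:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Computes y = x + F(x); the identity 'skip' path lets gradients bypass F."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.body(x)  # the skip connection

x = torch.randn(8, 256)
print(ResidualBlock()(x).shape)  # torch.Size([8, 256])
```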
It's really neat to find people on HN who've been working on those structures for such a long time.
If you can indulge me, what is a lesser known or obscure book on neural networks or an adjacent topic that you think would deserve to be read?
Perhaps some of the really old texts by Kohonen, Carver Mead, etc.?
For more modern material, there are a few good new books on Transformers. Transformers are interesting because they were designed for efficiency: layers of the same size, and both the data and the time-sequencing information encoded in each sample (so recurrent NNs aren’t required), etc.
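The "time-sequencing information in each sample" part is just positional encoding added to the token embeddings. A rough sketch of the sinusoidal version from the original Transformer paper (sizes are illustrative):

```python
import torch

def sinusoidal_positions(seq_len: int, d_model: int) -> torch.Tensor:
    """Return a (seq_len, d_model) positional-encoding matrix to add to embeddings."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (L, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # (d_model/2,)
    angles = pos / torch.pow(10000.0, i / d_model)                  # (L, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

print(sinusoidal_positions(128, 512).shape)  # torch.Size([128, 512])
```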
I have been enjoying Natural Language Processing with Transformers [1]. It's largely focused on the Huggingface library, but Chapter 3 has a very nice walkthrough that builds up the encoder portion of an encoder-decoder Transformer from "scratch" (it still uses some primitives found in PyTorch like nn.Embedding). The decoder portion is covered in less depth and they instead refer folks to Karpathy's awesome minGPT [2], which implements a decoder-only (GPT-style) Transformer in ~300 lines of nicely-commented Python+PyTorch code.
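To give a flavour of what that "from scratch" walkthrough builds toward, the heart of the encoder is scaled dot-product self-attention, which really is only a few lines (my own sketch, not the book's code; single head, no masking):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """q, k, v: (batch, seq_len, d_k) tensors. Returns (batch, seq_len, d_k)."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # (B, L, L) similarities
    weights = torch.softmax(scores, dim=-1)                   # attention weights
    return weights @ v                                        # weighted sum of values

x = torch.randn(2, 10, 64)  # self-attention: queries = keys = values = x
print(scaled_dot_product_attention(x, x, x).shape)  # torch.Size([2, 10, 64])
```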
For a higher-level conceptual view of how Transformers work, you can check out the now-classic "Illustrated Transformer" series [3] and this programmer-oriented explanation (with code in Rust) from someone at Anthropic [4].
Read up on the calculus of variations if you want to be obscure with respect to neural networks. Genetic algorithms for hyper-parameter search are also interesting.
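A toy sketch of GA-style hyper-parameter search (mutation-plus-selection only; the fitness function here is just a stand-in for "train the model with this config and return validation accuracy"):

```python
import random

def fitness(cfg):
    # Stand-in for training a model with this config and returning val accuracy.
    lr, hidden = cfg
    return -1e6 * (lr - 1e-3) ** 2 - ((hidden - 128) / 128) ** 2

def mutate(cfg):
    lr, hidden = cfg
    return (lr * random.uniform(0.5, 2.0),
            max(8, int(hidden * random.uniform(0.5, 2.0))))

# Initial population of (learning_rate, hidden_size) configurations.
population = [(10 ** random.uniform(-5, -1), random.choice([32, 64, 256, 512]))
              for _ in range(20)]

for generation in range(30):
    population.sort(key=fitness, reverse=True)
    parents = population[:5]                                    # selection
    population = parents + [mutate(random.choice(parents)) for _ in range(15)]

print("best config found:", max(population, key=fitness))
```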
The "deep" part of DNNs has basically thrown mathematicians and statisticians into an infinite loop that they can't quite compute yet. It's a brand new world and we need them to participate.
> we develop the notion of representation group flow (RG flow) to characterize the propagation of signals through the network. By tuning networks to criticality, we give a practical solution to the exploding and vanishing gradient problem
I am surprised and a bit disappointed this paper does not mention mean field theory or dynamical isometry at all.
Mean field theory applies methods from physics - namely random matrix and free probability theory - to derive an exact analytical solution for information flow through a neural network.
It turns out that simply initializing the weights of a plain CNN using a delta-orthogonal kernel allows all frequency components (Fourier modes) to propagate through the network with minimal attenuation. Specifically, networks train well when their input-output Jacobians exhibit dynamical isometry, namely the property that the entire distribution of singular values is close to 1. This technique effectively solves the exploding/vanishing gradient problem.
The impact is shocking: the time to train a NN to a given accuracy becomes independent of network depth. No tricks like batch normalization, dropout, or anything else are needed. This insight has been proven for a wide range of architectures from plain FFNs to CNNs, RNNs, and even transformers.
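For the curious, the delta-orthogonal initialization is easy to sketch by hand: the conv kernel is zero everywhere except its spatial centre, which gets an orthogonal channel-mixing matrix. A rough sketch (assuming equal in/out channel counts; see the papers below for the real construction):

```python
import torch
import torch.nn as nn

def delta_orthogonal_(conv: nn.Conv2d) -> nn.Conv2d:
    """Rough sketch: zero kernel except the centre tap, which mixes channels orthogonally."""
    out_c, in_c, kh, kw = conv.weight.shape
    assert out_c == in_c, "this sketch assumes out_channels == in_channels"
    with torch.no_grad():
        conv.weight.zero_()
        mix = torch.empty(out_c, in_c)
        nn.init.orthogonal_(mix)                      # orthogonal across channels
        conv.weight[:, :, kh // 2, kw // 2] = mix     # placed at the spatial centre
    return conv

layer = delta_orthogonal_(nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False))
```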
I highly recommend reading the papers “How to Train a 10,000-Layer Neural Network” [1] and “ReZero is All You Need: Fast Convergence at Large Depth” [2].
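The ReZero trick in [2] is almost embarrassingly small: gate each residual branch with a learned scalar initialized to zero, so the whole network starts out as the identity map. Sketch:

```python
import torch
import torch.nn as nn

class ReZeroBlock(nn.Module):
    """Computes y = x + alpha * F(x), with alpha initialized to 0 (the ReZero gate)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.alpha = nn.Parameter(torch.zeros(1))  # network starts as the identity

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.alpha * self.body(x)
```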
The level of analysis deep learning systems get is considerably out of proportion, and speaks to a superstitious view of how ML works. Namely, that it uncovers latent representational structure not present in the data.

(Incidentally, it's clear this book doesn't make this mistake, already identifying that a NN is basically kNN.)
Rather, ML algorithms are just rememberings of data coupled with various degrees of compression, with kNN having zero compression -- just a straight weights=data; and NNs having fairly significant levels, where weights=ensemble(compressions(data)).

We should therefore regard the ML step as incredibly trivial: it is just a clever process of averaging over the data given to it. The whole "magic" of ML, such as it is, is *only* in the data. And this is where the word "data" hamstrings our ability to see properly.

Everything *isn't* data; and "data" isn't some source of information. The world exists, and "data" is just what we call any measurement of any part of it by any means. "Data" is only *relevant* to the problem we're trying to solve if we do incredible amounts of experimental work to carve-the-world along its joints, ie., to have the right concepts; and incredible amounts of work to measure along its joints, ie., to have the right units. *And then* to eliminate all the coincidences and irrelevances. *And then* to provide that to a machine, which at this point does basically nothing but automate our effort.

Almost all data it is possible to collect is useless; indeed, an infinite amount is useless. The magic of ML is a sleight-of-hand trick -- we don't really need to know how its averaging of our data does anything useful -- it almost never does.

Rather, it is our "experimental design" which produces the usefulness of the system. ML algorithms are just interpolations and averages through data prepared to produce useful averages by (literally millennia) of human ingenuity.

It takes actual intelligence to do this because the world isn't data, and almost any measurement one cares to make (with eyes, even) produces endless ambiguities and coincidences that you have to "be in the world" to resolve; and resolution is a dynamic process which you "have to be here for".
"Namely, that it uncovers latent representational structure not present in the data."
Who thinks this? You're talking to straw-men. The latent structure is absolutely present in the data, but a series of transforms makes the structure more readily available. (Where else would it come from? Forest spirits whispering to the model?)
The 'availability' is made evident by training a linear regression for classification on the learned representation (keeping the base network fixed). This works decently-well for a good representation, and not at all for raw pixels. The point is that you can use 'simple' algorithms - including kNN - on this transformed view of the inputs; it's not magical at all.
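Concretely, the linear-probe test looks something like this (a sketch with a frozen torchvision backbone; the data loading and training loop are elided):

```python
import torch
import torch.nn as nn
from torchvision import models

# Frozen pretrained backbone: used only to produce the learned representation.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = nn.Identity()                 # expose the 2048-d penultimate features
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False

# The probe: one linear layer trained on top of the frozen features.
probe = nn.Linear(2048, 1000)
optimizer = torch.optim.SGD(probe.parameters(), lr=0.1)

def probe_step(images, labels):
    with torch.no_grad():
        feats = backbone(images)            # the learned representation
    loss = nn.functional.cross_entropy(probe(feats), labels)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```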
FWIW, I find the classification problem space to be riddled with conceptual holes, while other application areas are much less problematic. The use of neural networks for compression makes a lot of sense, and much of representation learning can be thought of as augmentation-invariant compression for some family of augmentations. Some of the recent big advances are really about getting away from the problems of classification.
Meanwhile, reinforcement learning, diffusion learning, recurrent networks, and GANs all bring different kinds of 'interactivity' into the learning problem, and resolve your complaint about ML algorithms working on 'just data.' The iterative processes in these models provide space to push against the data with decisions, and react accordingly.
I find your complaints to be lacking in curiosity or even knowledge about what people are actually doing in the field... Great results keep piling up and ML systems are used by billions of people every day, while you continue to call no-true-scotsman...
Representations in the animal sense can be observed as that which is in common across inferential contexts in which animals are confronted by the need to use them:
My model of "pen" can be observed in action when: I write with a pen; I'm asked to pass a pen; when I'm asked "what color pen do you like to write with?" and so on.
What is this structure "pen"? Is it a latent pattern within my experience? No! It is a means of regulating my body, imagination, emotion, etc. "The pen is mightier than the sword!" is something I can agree, or disagree with; I have a feeling about it. And in inferential contexts which require me to feel, I feel.
This is what animals acquire when they understand the world. This is what everyone believes ML is uncovering, whether they realise it or not. We just don't have these ML-like templates, so we don't realise how thin and brittle they are.

When ML is run on text, it detects character patterns which form "character templates" that other text can be compared to. I cannot ask, sincerely, what the machine thinks about whether "the pen is mightier than the sword" -- because it has no attitude towards that proposition.
The character structure of the phrase "the pen is mightier than the sword" relative to a body of text, has nothing to do with any concepts -- that structure has nothing to do with the inferential contexts in which those concepts will become useful.
Rather, this thin character-space text structure can be repeated to people in the form of an illusion: we the dumb apes who have the concepts are fooled into thinking that some machine assembling of these words had something to say.
This is a very cheap trick, and fails pretty trivially. The relevant structure to be learnt, the representation, has nothing to do with sequences of characters.
Though the industry likes to talk about "latent structure" in this data, it isn't latent. It is right there on the surface; indeed, it just is that surface.

When the machine is challenged to do anything other than report this average surface structure, it fails. ML systems don't work. They improve ROI on the margins by automating basically broken inferential rules at scale -- it is the automation here, and the speed, that is useful. Not the actual quality of inference.
"They improve ROI on the margins, but automating basically broken inferential rules at scale -- it is the automation here, and speed that is useful. Not the actual quality of inference."
This is trivially incorrect, and again shows a lack of curiosity. A couple of obvious counterexamples are AlphaGo (and descendants) and AlphaFold, both of which have quality of inference far outstripping previous efforts. In AlphaFold's case the results are pushing forward a whole field of science. Plenty of systems at this point in history outperform human experts, and not just on speed metrics.
'"The pen is mightier than the sword!" is something I can agree, or disagree with; I have a feeling about it. And in inferential contexts which require me to feel, I feel.'
What are these feelings? Are they necessary for learning? Will airplanes never fly until they learn to flap their wings like birds?
'we the dumb apes who have the concepts are fooled into thinking that some machine assembling of these words had something to say.'
Still arguing with straw men here...
'The relevant structure to be learnt, the representation, has nothing to do with sequences of characters.'
Of course it does. For BERT, the output is a conditional probability of words based on surrounding context.
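You can poke at that conditional distribution directly with the Huggingface pipeline (a quick sketch; the exact completions depend on the model version):

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("The pen is mightier than the [MASK].", top_k=3):
    print(f"{pred['token_str']:>10}  p={pred['score']:.3f}")
```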
OP makes great points that are not being addressed here. If you believe that animal intelligence is reducible to abstract symbol shuffling, then it should be possible to extend the existing approaches to actual robotic automatons; but it is clear that these models are unable to deal with the dynamic structure of the world, and there is no reasonable avenue to actually make that happen. So either intelligence is reducible to abstract symbol shuffling, or there is a semantic fib happening here that substitutes "abstract symbol shuffling" for "intelligence". The fact that we imbue these abstract symbols with meaning is where the intelligence comes from, i.e. people attribute intelligent behavior to these systems because we have learned to attribute intelligent behavior to the production of abstract symbols.
It's clear to me that current mathematics is insufficient for creating intelligence (let alone general intelligence). If you know of anyone that makes a coherent case for the mathematical basis of intelligence then I would like to see those references.
The OP makes overly broad arguments about AI that get the details wrong and have obvious counterexamples in current ML techniques and results.
On the ML side, there's been steady progress with an accumulation of very visible wins, accompanied by nearly-invisible adoption of techniques in day-to-day usage. (eg, keyboard text completion and text to speech systems.)
Much of this progress has been in specific subdomains, as combining domains requires massively more effort and investment. However, it's starting to happen; CLIP methods create joint embedding spaces for multiple modalities, and have led to the zero-shot learning we see exhibited in DALL-E. This is stuff that simply wasn't possible five years ago, building on sub-___domain tools which have greatly increased in their quality such as BERT.
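For reference, CLIP-style zero-shot classification is just cosine similarity between text and image embeddings in that joint space. A sketch with the Huggingface CLIP wrappers (the image path and labels are placeholders):

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a cat", "a photo of a dog", "a landscape painting"]
image = Image.open("example.jpg")  # placeholder path

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)  # similarities -> probabilities
print(dict(zip(labels, probs[0].tolist())))
```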
These joint embedding spaces will get better, encompass more modalities, and fuel further results. For example, as joint text-video embeddings become more powerful, we'll have embeddings which jointly encode text and physics.
As a result, I would be extremely hesitant to write off robotics applications. There's no big-headlines breakthrough on ML robotics /right now/, but we also didn't have a clear road to zero-shot image generation from text inputs a couple years ago. Notice that many of the existing 'big' results were obviously impossible until they very much were not.
Finally, I don't believe that we have a good enough definition of 'intelligence' to say with any certainty whether current mathematics is sufficient or not. Don't underestimate the potential for simple components to produce complex behaviors, though. But also, don't expect Pinocchio; intelligence may be broader than the human implementation.
OpenAI shuttered their robotics division because they did not see any viable path to commercial applications so they pivoted to generating pixel art. Similarly, DeepMind has not been making any claims of achieving AGI because they're smart enough to realize that statistical modeling of physical systems is a very small subset of what counts as intelligence.
I'm not dismissive of the progress in the field. What I find confusing is why so many people are convinced that what we are seeing with these abstract symbol shuffling systems is intelligence. All it does is confuse the average person about what these tools are capable of because at the moment they are only capable of amplifying biases in existing data sets. No statistical model can escape this trap and at the moment we essentially have automated bias amplifiers that are being sold as some kind of revolution in designing intelligent systems.
Hardware is expensive to iterate on. ML research is already expensive, without worrying about hardware. I expect we'll see plenty of additional attempts in robotics, regardless of what makes economic sense for OpenAI in the short run.
"No statistical model can escape this trap"
Your claim here is that intelligence requires innovation?
AlphaGo certainly went beyond the bounds of the existing training data. Likewise, zero-shot learning (as we see in Dall-e 2) demonstrates the ability to combine concepts combinatorially, rather than drawing from raw prior observation.
I still wouldn't call this intelligence, but it's yet another indication of how the goalposts move in the conversation. (Never mind that at this point we typically ask these systems to satisfy indicators which most humans could not satisfy...)
For just about any simple indicator of intelligence there's been a concerted effort to make a neural network with that property. And most of them have had a degree of success, more so over time. The 'confusion' comes because these simple indicators have repeatedly been set and overcome.
To me, these arguments are vaguely reminiscent of the philosophical arguments from the 80s and 90s. I also remember some people using Go as an example of a problem to which ML approaches won't work. We've gone from people giving up on computers solving Go, to human Go masters retiring because AlphaGo is impossible to beat.
I mean, AlphaZero is trained solely on self-play, with no human game data. It exists in a world where it is rewarded or punished by the 'laws of physics' of the Go board, the way we exist in an environment with physical rules that constrain and reward or punish our biology.

To say that AlphaZero is just data compression of the inputs seems hand-wavy. It is data compression only in the sense that phenomena from the world are a stream of data, and humans developing representations around that data (eg laws of physics) are a compression of it.
But AlphaZero wasn't given a huge feed of pre-played world data. Rather, it interacted with, and poked around in a simulated environment, until it was able to make good predictions on how its interactions would turn out. I learn that dropping a ball falls to the ground, and so I can make a prediction of what happens if I drop a ball. How is AlphaZero predicting the outcome of moves purely from self-play just another kNN? If so, why isn't our brain's learning just a kNN then?
My claim is that intelligence is more than just statistical associations and abstract symbol shuffling. It's impressive what large statistical models can do, but they still cannot solve Sudoku, so something is clearly missing here, because neural networks do not have feedback loops and backtracking. It's like saying all we need to do is continue building bigger and bigger abaci and stacking them in just the right way so as to emulate the statistical properties of the real world. DALL-E is dazzling, but it is still a statistical model with no symbolic understanding (it's still a giant abacus). It's obvious that people have symbolic understanding (e.g. written language, mathematics, solving Sudoku, writing code/software, etc.). So if people are the benchmark of intelligence (dubious, but let's assume so for the sake of argument), then at what point do you suppose there will be statistical models with symbolic understanding? Furthermore, what reason is there to believe that larger and larger statistical models are going to get us closer to non-human intelligent systems that do more than generate stimuli adapted to our senses?
There is also a meta-problem that no one seems to address when discussing AI. All the systems we have built rely on compositional symbolic systems (mathematics) for expressing statistical associations and human interpretation of their inputs/outputs. Clearly there is something people can do that no existing AI system can which is to generate a symbolic description of statistical models that can be adapted to various data sets.
I could say more here but none of what I'm saying is anything new. Others much more capable of describing the issues and shortcoming of the existing approaches to AI have written books exploring the issues in much more detail, e.g. Gary Marcus, Melanie Mitchell, Douglas Hofstadter, Gian-Carlo Rota, etc.
"It's impressive what large statistical models can do but they still can not solve sudoku so something is clearly missing here because neural networks do not have feedback loops and backtracking."
Again, the things you ask for exist. Recurrent networks and reinforcement learning both have feedback loops. (And there's a reasonable argument that residual networks can be interpreted as 'unrolled' recurrent networks.)
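A minimal sketch of what the feedback loop in a recurrent net looks like: the hidden state produced at one step is fed back in as state at the next.

```python
import torch
import torch.nn as nn

cell = nn.RNNCell(input_size=16, hidden_size=32)
h = torch.zeros(1, 32)                   # initial hidden state
for x_t in torch.randn(10, 1, 16):       # ten time steps of input
    h = cell(x_t, h)                     # the output feeds back in as state
```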
Here's a completely random paper on reinforcement learning for Sudoku with non-zero win rates (and a few other games): https://arxiv.org/abs/2102.06019
I'm not sure anyone's bothered to take a real crack at Sudoku specifically. It's another example of a weak indicator, though: someone will happily solve it if you're willing to call it the bar for intelligence. Given where we're at on game-playing generally, it seems very doable with current technology.
"at what point do you suppose there will be statistical models with symbolic understanding?"
Understanding again has no real definition, so this is open to endless argument. I think it's fair to say that DALL-E understands what an astronaut looks like, though.
NN == kNN is a really interesting thought, but I think it’s the wrong framing.
All learning is bound by kNN, including, I would argue, human intelligence. When we compare and contrast things, be it words, objects, or images, the very best we can do is to compare those “units” to all other examples that we have ever observed.
There is no such thing as generalising beyond kNN. The real magic in generalising is not how many nearest neighbours you are able to compare against, the magic is in _which_ qualities you use to determine the kNN. Another way to state this is: intelligence is in the distance metric/embedding/compression, not the ability to process all examples.
When we say that humans are able to generalise, what we really mean is that we are able to compare and contrast specific subsets of qualities for a given object against all other objects we have knowledge of. The way we choose and weight those specific qualities is what results in our “intelligence”.

So when we say that NN == kNN, I think this is actually a really impressive achievement for neural nets. Neural networks are able to produce kNN-level results with orders of magnitude lower compute. Humans, too, are able to do this, and if we’re being very thorough, we sometimes fall back to kNN-like processes.
Intelligence is in the properties of the data supplied to the kNN. Not in its distance function.
Yes, kNN is the bound if you formulate learning as: there exists an f_true: Pixels->Animals; we have an infinite sampling of (Pixels, Animals); kNN performs at 100%. Therefore kNN is the bound.
Whereas I don't think this has much to do with learning. Learning for me is: the function probably doesn't exist, we don't know what its ___domain would even be, and we have no idea how to sample that space. You know: the history of human civilization, but also: the history of a day of any animal's life.
We encounter problems daily whose functions can only be modelled by partial input and output domains, and our job is to find ways of resolving the inherent ambiguities in those situations. Ie., to find what parts of the world are relevant, to find ways of measuring it, and hence to find more data until the ambiguity resolves itself.
kNN being a perfect algorithm is a catastrophe for this whole formulation of AI. All it means is that the whole field of machine learning has as its limit what you can do if you already know the answer. You can see from this that it isn't learning at all.
The situation is exactly backwards: ML assumes relevant well-carved unambiguous data. The whole problem of learning is precisely to produce it!
>kNN being a perfect algorithm is a catastrophe for this whole formulation of AI. All it means is that the whole field of machine learning has as its limit what you can do if you already know the answer. You can see from this that it isn't learning at all.
Doesn't the existence of unsupervised training and models that solve previously unsolved problems somewhat invalidate the NN=kNN analogy then? The protein folding problem didn't have an already known overall answer for instance. The model clearly 'learned' representational information that we are not aware of in order to be so much more effective at solving the problem than we are.
Or similarly with the Starcraft or Dota AI, while some strategies learned were similar to humans, others were entirely novel. I think that's getting pretty close to 'learning' as you've defined it.
I'd argue that 'survive and reproduce' is not all too different as a goal from 'maximize this score' when it comes to problems as open ended as playing MOBAs. There isn't much of a dataset and there are endless numbers of ways to go about solving the problem.
I think you are probably right about needing to be "in the world" (if by that, you mean being able to interact with it and see how its state, and therefore the data you have about it, changes), but I feel you are being too hasty in ruling out any situation where humans have had any role in preparing the data.
In the evolution of natural intelligence, physics provides the ___domain distribution and Darwin has explained what the reward function is. It does not follow that no learning can occur in an environment where either or both of the ___domain and reward function are defined by humans.
As far as I can tell, this was the explicit goal of AlphaGo Zero. Even if we were to accept your position that the data it was started with "contained the solution", then the fact it does rather better than humans at finding that solution is, by itself, significant. (In this view, one might characterize the evolution of natural intelligence as finding, within the biosphere ___domain, a rather successful solution to the survival problem.)
It would seem to me to be begging the question to say that the success of AlphaGo Zero is not learning because the environment was human-specified (At the same time, as Go is a highly-constrained environment, there is no reason to suppose that its successes indicate we are anywhere near AGI.)
I don't mind people preparing data to build intelligent machines -- I mind what data they are preparing, and subsequently, what their claims about these machines are.
If you can build a hand to grasp objects, and train its substructure to grasp this-way-and-that -- fine. So long as it can, in the end, also train itself... as we all do when we type.
Nature provides something to bootstrap learning, but it isn't "data" in the sense in which ML requires data. It isn't relevant, quantified, premeasured. The "data" nature provides is in our biochemistry... how we react to our environment, etc.

If AI research can produce an intelligence which is able to formulate the very terms of the problems it wishes to solve, great. I don't see "summarising the solution" as a strategy which is even in the ballpark.
To be more specific, do you regard the data AlphaGo Zero was initialized with as being relevant, quantified, and premeasured to the point that no learning occurred between then and its defeat of expert players? I don't think the data they had as they learned to become experts was any less relevant, quantified, and premeasured - and quite possibly more so, if they read about tactics and strategy in the game.
While I agree that nature's data is less relevant, quantified, and premeasured than what current ML feeds on, I don't see that as establishing that there is a relevant qualitative distinction that renders this divide unbridgeable in principle. Every organism that senses its environment is processing data.
With regard to formulating the very terms of problems it wishes to solve, I have no difficulty in seeing that this has not been achieved yet, and personally, I don't expect it any time soon. At the same time, you seem to be very close to saying that no artificial system could do this because its goals are always, in some sense, those of its creators. To be clear, I would regard such a position as mostly avoiding the issues.
We don't. You're taking the learning out of the process you're discussing by using an infinite sample. Of course, then, the machine learning part becomes trivial, and it's just an optimization problem for minimal training error!
If we have all the data, like infinite data, then we have the classification for every one of the (finite) number of possible images already. At that point, why stop at kNN? Use a hashmap!
> Intelligence is in the properties of the data supplied to the kNN. Not in its distance function.
If this were true, then all people who observe the same training samples would learn equally well and predict equally well. But we don't. Some of us invent a better loss function than others or reject irrelevant data better. The representation formed by the data is surely derived, not fixed in its raw form. Forming a better representation of the data is the art of learning better. I'm convinced learning and remembering what you learn is more than kNN; in humans it involves data compression and recontextualization as you improve your loss function or tie your old representation to newly learned data — old raw data and representations that heretofore you had not believed were relevant to that objective (or lesson).
> all people who observe the same training samples would learn equally well and predict equally well. But we don’t.
How do you know that we don’t? It’s an impossible experiment to conduct, right? Since same training samples == every stimuli experienced since birth, and even then you have the problem of possibly-different “initial conditions” i.e. development of the brain in the womb. Maybe if we did replicate that, we would learn & predict equally well.
This comment is making a "sleight of hand" itself: how do we define the metric for kNN? That will directly impact kNN performance.

Using the Euclidean norm over pixels on ImageNet will not get you anywhere.
By recasting classification into a kNN lens you're basically (in the large data regime) recasting the problem to kernel learning under some underspecified RBF kernel with an unknown "proximity" metric instead of Euclidean.
In grad school, someone in my lab actually tried scaling kernel methods directly to ImageNet too. I don't think that ended up working naively (an interesting neural variant worked for CIFAR10, though [1]).
If all NNs are doing is efficiently learning an appropriate kNN metric (or, more idiomatically and generally put, learning kernel parameters for some implicit neural kernel), that's still really powerful and all we've done is just renamed "learning representations."
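That reframing is easy to make concrete: run the identical kNN algorithm on raw pixels and then on features from a frozen trained network, and only the representation changes (a sketch; the feature extraction itself is elided):

```python
from sklearn.neighbors import KNeighborsClassifier

def knn_accuracy(X_train, y_train, X_test, y_test, k=5):
    """Plain Euclidean kNN; X_* can be raw pixels or embeddings from a frozen net."""
    knn = KNeighborsClassifier(n_neighbors=k, metric="euclidean")
    knn.fit(X_train, y_train)
    return knn.score(X_test, y_test)

# Typically: accuracy is poor on flattened pixels and strong on learned embeddings,
# even though the kNN step is identical -- only the representation changed.
```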
DNNs end up encoding parts equivalent to type theories in their operations — extracting an “internal language” (in the TT/CAT sense) which models the data. In particular, convolutions are a form of extracting/deducing types from your input. A full DNN deduces a model of the data in which it can type the incoming data and then reason about relations on those types to decide the challenge.
What is it that you think you’re doing, besides forming a model from data in which you reason based on an internal language, which describes the data you’ve seen?
Your post comes across as ignorant of eg, the overlap between TT and DNNs, while simultaneously full of woo about how humans operate.
Sure, but all this is only within one space. The "internal language" is a formalism of that space, and not beyond.
What I think we're doing is, you know, forming concepts which are not constrained to spaces. I don't even really think our concepts, which are mostly just bundles-of-techniques, live within the space of the target object, nor its measurement space (ie., that of experience).
Eg., my "pen" includes being able to grasp a pen (indeed, the relevant motor skills are prior-to and necessary-for the abstract concept formation).
What you're talking about is a pixel-space projection of "pen" being able to function like the (animal, body-first) concept "pen". It doesn't.

Rather, "pen" in pixel space is, sure, an "internal language" of a sampling of AllPixelPatterns concerned with pixel-space projections of "pen". This I'd call a "template", and I don't think it has almost anything to do with concepts/representations/etc.

What we are doing when we acquire motor techniques which produce "ways of sensing and moving" that eventually could become reified as the abstract "Pixel Pen" is really nothing at all like sampling PixelPenSpace and deriving a template.
To see the difference, consider that our concept allows us to resolve ambiguities -- eg., if I think something might be a pen, I can go beyond one space of measurement (eg., sight) -- eg., move the pen, write with it etc. -- and thus return to the target space "Is Pen?" with sufficient confidence.
Once such "mere templates" exist, everything we do, isnt required. Indeed, a calculator can take over at that point; put it in a cupboard forever to repeat whatever thin inferences pixel templates admit.
> Sure, but all this is only within one space. The "internal language" is a formalism of that space, and not beyond.
This is incorrect:
You can swap encoders and final fully connected sections, then mildly retrain the network to substantially save on training effort — “transfer learning”.
Further, you can compose networks, eg image recognition on top of something extracting a structural map.
This implies that the structures they’re “learning” generalize to different contexts — with a bit of retraining. In much the way your knowledge of programming transfers between languages (swapping encoders) and tasks (swapping final connected network).
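In PyTorch terms the "swap the final fully connected section and mildly retrain" move is a few lines (a sketch; the 10-class head is an arbitrary example):

```python
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pretrained feature extractor...
for p in model.parameters():
    p.requires_grad = False

# ...and swap in a new task-specific head, which is all that gets (mildly) retrained.
model.fc = nn.Linear(model.fc.in_features, 10)
trainable = [p for p in model.parameters() if p.requires_grad]  # just the new head
```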
Your point seems deeply based on this incorrect understanding of DNNs.
Sure, as set up by the way we have constructed the dataset. "Green" isn't near "Leaves" because the world placed them there.
All these properties follow from properties of the data, which we have structured to produce these solutions.

We arrange the target ___domain of one dataset so that the target ___domain of another is aligned --- and in doing so, solve the only problem which requires intelligence.

Yes, the character-space structure of Y-labelled stuff aligns with the pixel-space structure of Y-labelled stuff. And feature-space templates can likewise be associated.

This is just more of the same. King − Man + Woman isn't Queen --- it isn't anything. This isn't reasoning. And likewise PicOfKing − PicOfMan + PicOfWoman isn't PicOfQueen.

This is schizophrenic pseudoscience --- it's actually the mind of a schizophrenic, who obtains sequential thoughts just by non-semantic associations.

The latent structure being learnt in both cases is just the coincidences we rig in the Y ___domain --- I have no doubt they 'transfer'; we wrote them in.
You can show a DNN a large sample of photos of animals, retain the lower sections of the network, and then show it a large sample of landscapes — and it will be able to recognize those faster precisely because a well-trained network picked up on the latent structure of shapes, and can reason about shapes in its internal language for describing pictures. It’s acquiring genuine semantic information — and learning about new experiences relative to that previous acquired understanding.
We didn’t create that similarity of form in nature — and DNNs discover that in much the way we do: finding recurring, abstract patterns to utilize in our internal language.
We labelled them -- their pixel-space geometric structure has no inherent relevance to their character-space structure --- and the only inferential contexts this will work in are those where we rig the deployment to only require these sorts of coincidences.
Animals acquire robust techniques which are not brittle to these sorts of thin contexts.
We do so by not templating pregiven data -- but by coordinating our bodies in the world, and rearranging it, so as to determine how to regulate ourselves in response to it.
Only a teen reading Cosmo would think King − Man + Woman = Queen. The reason we can even interpret this thin formalism is because of our robust understanding of the concepts that these character symbols name.
With these concepts we don't cross domains, we regulate our own imagination, and so on, so as to create representations which are thick across an infinity of domains.
Talking in terms of feature domains has it backwards -- it assumes they're available for measurement. We don't transfer thin templates across domains. Our representations aren't within a feature space --- they're techniques for self-regulation.

With these we can construct templates in an infinite number of different feature spaces -- ie., project by imagination from our techniques.
The downvotes on this one show some blind allegiance to word2vec, it seems to me; I tend towards questioning the playing field, not clamping onto a (large) set of useful vectors to be the new "hammer looking for nails".
Humans' ability for conceptual modeling is entirely constrained by their perception. Humans can't conceptualize a tesseract. All of their conceptualizations are in some form related to their perception.
When you're thinking of a pen, are you thinking of the atoms, molecule compositions and all the quantum effects going on? No. Your mind is, just like ML, working from a template constrained by your perception (visual, haptic, auditory).
Humans are more advanced than ML, but they are nothing special.
Humans can't visualize a tesseract, but they can conceptualize the Idea of a tesseract in the symbolic space, through math, physics, or in other words, through Reason.

The Symbolic is a radical simplification of all the complexities impossible to be fully sensed. Even though the simplification is always particular, contingent, and full of ambiguity (human languages) and often inaccuracy (Newton's laws vs relativity), without the simplification, without Reason, ML systems are probably like animals, eventually succumbing to the full force of the complexity of reality.
Perception in animals changes the structure of their bodies as they are coordinating with their environment, such that they acquire motor techniques and hence new ways of structuring their perceptions.
Perceiving the world isn't a passive activity in which facts strike your eyes. The world doesn't have these facts; there is no data, and nothing from which to simply "average to a template".

When light strikes your eye, there is no "keyboard" in it. Nothing in it from which to derive even a template of a keyboard.

There are only templates in datasets we prepare -- and we don't prepare them by actually "encountering datasets in reality". Rather, we arrange reality and measure it so-as-to-be-templateable.

What animals have is the ability, with effort, to engage in this dynamic -- to arrange the world to make it knowable. It is this process of arrangement which requires intelligence; not "taking the average after it has happened".

There is nothing in the world to be "perceived" in the ML sense, as in: ready for analysis. That's an illusion we construct: we make "perceivable data".

It is our bodies, and our concerns, which make the world itself perceivable. The world itself is infinitely dense with infinities stacked on infinities. There isn't "data" to be templated.

And this isn't theoretical: no ML system will ever work. All that works is our data preparation.
Could you elaborate on the connection between Deep Learning and type theory, resp. topos theory, that you allude to? It sounds fascinating. At the same time, I'm slightly surprised that I am not aware of results in this direction.

I think it mostly exists at the level of folklore right now, with preliminary bridges. I’ve spoken some with people at, e.g., HoTT 2019, and with Topos Institute speakers, but I think the state of publishing is that the other direction is being worked out fully — since it solves problems in the math community, like accelerating proof engines via GPUs. (Maybe, I hope.)
There’s three approaches to seeing the connection:
1. If the manifold hypothesis is true, then the semantic information between the original manifold and reduced manifold will be in the topology of that map — and your TT is a relabeling of the synthetic topology description.
2. For the convolution comment, take a TT statement, view it as a CuTT structure, and draw the construction diagram of that “shape”. Project that into a matrix encoding and… convolutions are right-side type division! Eg, finding a submatrix that matches “Line(…)” in your overall matrix. But it’s right-sided division since you can only remove the “innermost” factor in a product that way. (There’s also a left-handed, but it’s less relevant here.)
3. The paper on ML via hyperplane splits to create data coverings also suggests that what we really want to know are the topological relationships between kinds of data. This is furthered by, eg, “animal” in word2vec ideally being the centroid of the open ball covering animals, as they’re embedded.
The manifold hypothesis is true when the measurement bases aren't orthogonal and there are lots of them, eg., when we measure 3D objects by their relative position to 100 objects in a room.

That isn't a bridge principle between spaces which measure different properties.

The low-D structure of any high-D dataset is only a "washing out" of spare measurements of the same properties. These aren't different "property-spaces", ie., the low-D structure isn't a world one has bridged to via statistics (or grad desc, etc.).

The properties the formal structure of text captures aren't the properties of objects... they're something like the social-vocab-coincidental usage of language users in how they "orient their words" around topics. If we change our usage, models which find this structure become invalidated.
We language-users routinely change this structure, and robustly against misunderstanding, because we bridge to the relevant property spaces by being in the world.
>because we bridge to the relevant property spaces by being in the world.
Isn't there an implicit conclusion here that the extra dimensions of 'world-being' would actually make the manifold hypothesis true, and that it's just missing data in the training model? Not to say it's trivial to fix (that missing data might be derived laws of physics, smell, shared world models of a society, etc.), but nonetheless I fail to see how, conceptually, any of that couldn't be reduced to data.
If you have all possible ways of measuring the world, and all possible measurements, you can produce all possible models. A subset of those models will appear relevantly intelligent.

But that isn't intelligence. Intelligence is the solution to the problem of those "all possibles" being impossible. Namely, intelligence is how animals conduct themselves in the face of not knowing what the data is.

They do this by conducting themselves and their bodies in their environment -- they don't focus on the problem in the AI/ML/etc. sense... In this way they don't make decisions based on "data" -- they move according to self-regulatory goals.

The data which they encounter in that movement becomes useful for reasoning about the world. But it's radically more accidental, partial, and speculative than ML requires.
There might be some truth in what you say for very large image and language models that use supervised learning.
It is really hard to see how this 'it's just a lot of good data' view applies to deep reinforcement learning, where the model learns multi-step policies from raw input data (e.g. a camera on a robot) with only a rough high-level reward function to guide it.

If, therefore (as seems to be the case), you can abstract the information humans need to provide to the model/learning system to ever higher levels of reward function (and thereby vastly reduce the information provided by humans), then it seems very hard to argue that the model (and the training process) isn't doing to some degree what you describe as:
'incredible amounts of experimental work to carve-the-world along its joints, ie., to have the right concepts; and incredible amounts of work to measure along its joints, ie., to have the right units. And then to eliminate all the coincidences and irrelevances.'
For example, imagine a robot learning from scratch to pick objects up based on raw pixel data with only a scalar reward function - where in this process is the human preparing the data so the model only has to average?
> For example, imagine a robot learning from scratch to pick objects up based on raw pixel data with only a scalar reward function - where in this process is the human preparing the data so the model only has to average?
Great -- so do you have an example of such a system?
I'd be inclined, initially, to deny that it exists. If your reward function expresses a reward for the goal of "picking up objects with (pixel-space) properties etc.", you're cheating. In this case, the reward function serves the role of the data: ie., prepared by us to work. Indeed, a function is just a dataset -- and the reward function here is being sampled by the system.
You'd need to show me a system whose reward function / dataset didn't "contain the solution", in the manner of animals who respond to the world without already having all the information about it.

The relevant capacity a system needs to have, in both cases, is being able to take a profoundly ambiguous environment and produce a dataset/reward-fn which "carves along its joints". Ie., which effectively eliminates that ambiguity.

When such ambiguity & coincidence is eliminated, there's basically nothing left to do -- it's that basic nothing which we task machines with doing. Ie., running `mean(sample(unambiguous relevant well-carved data))`.
You'll note its the *properties* of the data which express intelligence & learning.
Plenty of RL systems learn to play video games just fine without fine-tuned rewards, but I see this line of thought isn't actually what you're getting at.
I would assume serious ML people would not be overly ambitious and overstep their claims beyond empirical realms. You were saying ML "uncovers latent representational structure not present in the data", but I would guess the claim, if that is what you're going against, is merely that the latent structures exist, and no Truth is really "uncovered" by ML per se, in the Heideggerian sense.
I agree ML hasn't really produced an Understanding of the world. The carving along the joints is, in other words, a symbolic abstraction of the world that is a radical simplification, of which only Reason is capable, and ML hasn't shown itself to be capable of Reason. As an aside, I also would not assume the ambiguity you refer to can be fully eliminated even by human intelligence; just see how languages are full of ambiguity, or even quantum mechanics.
But again, when philosophical critiques are launched against ML, the usual story is ML advocates would retreat to the success of ML in the empirical realms. I'm reminded of the Norvig vs Chomsky debate by this.
I think this debate has historically suffered from being conducted purely philosophically. Heidegger, Dreyfus (Merleau-Ponty et al.) needed a bit more science and mathematics to see through the show.

All we need to do to make the Heideggerian point is ask the RL researcher what his reward function is. Have him write it out, and note that it's a disjunction of properties which already carve the environment of the robot.

In other words, the failure of AI is far less of a mystery than philosophy alone seems to imply. It's a failure in a very, very simple sense if one just asks the right technical questions.
For RL, all we need ask is, "what will the machine do when it encounters an object outside of your pregiven disjunction?"
The answer, of course, is fall over.
Hardly what we fear when the wolf learns our movements, or what we love when a person shows us how to play a piano for the first time. The very thing we want, and are told we have, isn't there... and it's not "not there" philosophically... it's not there in the actual journal paper.
The Heideggerian point is a start, but I don't think it's enough to just point out a failure like this. This allegation is something like a "the answer is already encoded in the question" kind of trick, similar to the one played in Foucault's episteme, where science itself is always-already a social construction outside of which it cannot happen.

The trick is challenging at first sight, but it won't go very far, because it just tells us what ML lacks, not what ML can have and how to get there. We need a new kind of Turing test that actually reflects the power of human intellect.
I suspect even thinking there's a "test" has it wrong.
Yes, there's an experimental test -- as in, testing to see if salt is salt. But I don't think there's a formal test... as soon as you specify it, you've eliminated the need for intelligence. Intelligence is in that process of specification.

In other words, we should be able to ask the machine "what do you think of Woody Allen's films?" and, rather than just taking any answer, we need an empirical test to see if the machine has actually understood the question. Not a formal test.
There is no doubt a sequence of replies which will convince a person that the machine has understood the question: just record them, and play them back.
We're not interested in the replies. We're interested in whether the machine is actually thinking about the world. Is it evaluating the films? What are its views? What if I show it a bad film and say it wasn't by Woody Allen? What then?
There's something wrong in seeing this as a formal, rather than experimental, process. For any given machine we will need specific hypothesis tests as to its "intelligence", and we will need to treat it like any other empirical system.
OK, maybe "Turing test" was a bad hint because too often its extension turns into a philosophical rabbit hole of defining intelligence.
I want to get back to your initial statement about uncovering and structures, which I think is still grounded in the empirical realm. I think a less ambitious new test could be about the "uncovering" between analog data and the structures. To be real uncovering, the structures must be symbolic, not just transformed analog representation, and the symbolic structures must be useful, e.g. provide radical reduction of computational complexity compared to equivalent computation with analog data.
The point is to test if the machine can make the right abstraction (real uncovering) and also connect the abstraction with the data, not just games with words.
> For RL, all we need ask is, "what will the machine do when it encounters an object outside of your pregiven disjunction?"
> The answer, of course, is fall over.
There is no reason to think humans are qualitatively different in this regard; it is just that it does not happen very often. One case where it does is that human pilots, no matter how competent, are incapable of flying without external visual references or instrument proxies for them.
If I am following here, a key part of this argument is that models only represent things "in bounds" of the model, and that unsupervised, iterative approaches are especially susceptible to this. Video games are enormously constrained, artificial model environments, and therefore by definition are completely discoverable.
Meanwhile, human cognition and the real actual world, have vast and subtle detail, and also are not completely knowable at any level minus some physics or similar. Tons of possible data sets are not necessarily discoverable or constrained, yet humans can investigate and draw conclusions, sometimes in very non-obvious ways.
Falling back to pure philosophy, personally I am heavily on the side of the human, and in the wake of Kurt Gödel, believe that plenty of formal systems are never complete, nor can they be shown to be complete or incomplete.
This would be one example from Deepmind using raw pixel input to stack objects. This has a relatively detailed reward function (but is also a very complicated task) - https://arxiv.org/abs/2110.06192
There are other examples from OpenAI a while back using even just sparse rewards (i.e binary 1, 0 for success or failure over the whole task) - but these weren't pixel input if I remember correctly - https://openai.com/blog/ingredients-for-robotics-research/
I'm afraid that if you think providing any reward function is cheating, then we have fundamentally different views of what AI/ML even means/involves. It appears humans and likely all animals have largely pre-programmed reward functions developed over billions of years of evolution (pain is bad, food is good, etc.). These reward functions are ultimately what underpin what we are trying to do, what outcomes are good/bad, and to what degree we 'want' to explore vs exploit. The idea that human and animal 'intelligence' is born as a blank slate with nothing to guide it and no reward function to maximise doesn't seem to bear any resemblance to reality.

The only difference between a reward function that tells a robot 'you need to stack these objects but I'm not going to tell you where in 3D space the objects are, where they need to go to be stacked, or the shapes/forces involved' and an animal that is born with a reward function that says 'you need to find food and shelter but I'm not going to tell you how to collect the food or where to find shelter' is the level of abstraction. Fundamentally they appear the same.
You are pulling a sleight of hand when you suggest 'in the manner of animals who respond to the world without already having all the information about it' - there is a vast difference between an abstract reward function (which humans and animals also have) and 'having all the information about [the world]'.
I think NN=kNN is reductive to the point of near uselessness as a statement. You can treat human object recognition as a sort of kNN as well in that we 'detect' a large number of features, the grouping of which ultimately allows us to recognize something. But of course that doesn't actually tell us much about how we recognize things, since that explanation just changes the question to "how do we detect useful features?".
Similarly, I think most ML people recognize that NN classifiers are just learning to look for groups of features, yet that doesn't really count as an analysis because what we want to understand is why it arrives at the features it does.
>The magic of ML is a sleight-of-hand trick -- we don't really need to know how its averaging of our data does anything useful -- it almost never does.
I strongly disagree, we do need to know how the 'averaging' does what it does so we can improve our models further. A lot of big steps in NN architectures have come in part from applying theories and insights regarding the properties of the training process, so a deeper understanding of the process may allow us to make better models, ones that have better accuracy as well as ones that train faster with less data.
>Rather, it is our "experimental design" which produces the usefulness of the system. ML algorithms are just interpolations and averages through data prepared to produce useful averages by (literally millennia) of human ingenuity.

>It takes actual intelligence to do this because the world isn't data, and almost any measurement one cares to make (with eyes, even) produces endless ambiguities and coincidences that you have to "be in the world" to resolve; and resolution is a dynamic process which you "have to be here for".
I think most of the industry agrees that current ML models are sort of like being able to make the icing but not the cake (i.e. AGI), so I'm not really sure what you're trying to say. Of course the models need a general intelligence to filter general data down to a specific problem and set things up for the model to learn?
As an ML Research Scientist, I have never heard this interpretation. It's a very interesting thought that NN == kNN. It puts some of my lingering intuitions in clear wording. Thank you for this.
I think you are close to the truth. This would explain why even the largest language models can't generalize beyond the training set.
At the same time, I disagree that analysis is out of proportion. It might be some clever averaging, but it does useful and interesting things. Take a look at Google: some clever averaging can get you a long way. It would be great to understand how it works and how we can go beyond it.
I do believe we need a paradigm shift, but it does not come out of nothing.
I am myself trying to get closer to a clear formulation of this problem, which is why I'm writing here. Here's what I have so far:
ML systems (e.g., NNs) remember averages (compressions) of historical data. They are useful, wrt the problem, iff (1) the problem's target function exists; (2) the data is relevant, unambiguous, and well-carved; and (3) these properties will hold regardless of likely permutations to the problem's framing.
Systems are given data with these properties by significant amounts of experimental design, work, and effort by people. Absent these properties, data is useless.
Producing data with these properties requires intelligence, and no machine systems exist which can do it.
My issue with research into ML on the whole, is that it *assumes* these properties and then explains how the systems work. I understand why this is interesting from a formal perspective... but it fails to note that this situation is almost never how ML is used.
There is no function from Image->Animal, i.e., biologists aren't just idiots who could have just looked at some pixel patterns. Pixel patterns are radically ambiguous wrt `Animal`, and so even an infinite sampling of (Image, Animal) is not enough for ML.
... so what on earth are ML systems doing?
This is a bigger research question: to characterise how ML performs when this assumed setup fails. And you know, that research almost doesn't exist. This is an industry led by partisans to its success.
What do you think would happen if research actually talked about the dynamics of ML systems' performance when (1) the target doesn't exist; (2) the data isn't relevant and unambiguous; (3) the problem framing will permute most times it's deployed...
Suddenly we'd have an explanation of why 2016 wasn't the year self-driving cars were delivered. And indeed, likewise, of why even 2036 won't be.
One good lens I know is that a neural network is just good at approximating stuff. Trained properly, you can have it approximate a distribution. A conditional distribution like p(animal_species=dog | image=what I am seeing) (discriminative model, e.g. classifier), or even a joint one p(animal,image) (generative model, Autoencoder/GAN/VAE/Diffusion).
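To make the discriminative case concrete, here is a minimal PyTorch sketch (the network, shapes, and names are my illustrative assumptions, not anything from the paper): the softmax over the classifier's logits is the thing that, after training, approximates p(label | image).

    import torch
    import torch.nn as nn

    # A tiny discriminative model: after training, its softmax outputs
    # approximate p(label | image). Shapes and names are illustrative only.
    class TinyClassifier(nn.Module):
        def __init__(self, n_pixels=28 * 28, n_classes=10):
            super().__init__()
            self.net = nn.Sequential(
                nn.Flatten(),
                nn.Linear(n_pixels, 128),
                nn.ReLU(),
                nn.Linear(128, n_classes),
            )

        def forward(self, x):
            return self.net(x)  # unnormalized logits

    model = TinyClassifier()
    image = torch.randn(1, 1, 28, 28)            # stand-in for a real image
    probs = torch.softmax(model(image), dim=-1)  # conditional distribution over labels
    print(probs.sum())                           # sums to 1, as a distribution should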
There is also an information theoretic lens about compression, which is probably very close to what you are thinking about, but I haven't studied it yet.
Regarding Image -> Animal. An image of an animal is a projection of the animal onto a 2D plane, plus lots of noise. So there is some dependence between an image of an animal and the animal. Biologists can get a lot of information looking at a photo of an animal. In some sense they are always looking at two images from their eyes.
But the problem you are talking about is indeed serious, and far from solved. You can't understand the real world from 2D images with the current approaches. Ideally we want neural networks to build a 3D (or even 4D, with time?) model of reality. Instead we find them trying to guess labels based on patterns. My favourite example is the tiger-dog [1]. Still, there is evidence NNs are doing some clever things [2]. My guess is that the problem is that we just haven't found a way to formulate the task for the solution we want. In the current formulation it's easiest for the model to minimize the loss by sticking to patterns, so why do something else?
There is a lot of research on more applied ML that asks the questions you are asking. It's just that this paper is another attempt at a theoretical explanation.
I agree on the self-driving cars. We can't have truly self-driving cars until a model can generalize, which none can at the moment. The core question is: if we had a "dumb" model that does clever averaging and successfully covers 99% of cases, such that the car is dumb in special cases but smarter than most human drivers in usual cases, would that justify deploying the cars? If so, dumb ML might be enough for self-driving. It's definitely enough for self-driving in walled-garden conditions, so there is some evidence that with enough data we can brute-force our way to a tolerable solution.
Yes, we do require models parameterised by both space and time -- but really, we require implementations of these models -- and the implementation I'm thinking of is called a body.
Why? Well, consider the best sort of such models: physics. What is "the mass of the sun"? What is "the sun"? There is nothing in all of physics which says anything exists, nor what its boundaries are. Least of all what "the sun" is.
Physics, all of science, is counter-factual: it says only that if something exists, then such-and-such follows.
You're never going to get to "what a table is" just by table(x, t) -- because there is something in the background which asserts "tables exist" and that is the concernful actions of animals which care to partition reality this way.
Reality, in the end, measured in every possible way is still ambiguous. It still leaves open how one actually refers to any of it. Where one places a boundary. What the unit is going to be, in our descriptions.
There is no way around starting from the other direction: not with facts already provided; but with no facts at all. You have to build a system which cares, that then induces a partitioning, that can then change its caring; and so on.
This is an extremely practical concern. A car cannot drive itself, in the relevant sense, if it doesn't care about anything; and more severely, if it doesn't care like we do.
The car isn't going to be able to modify its concepts in response to being challenged -- by other people, by the environment, etc. -- because it has no reason to. There is nothing which is important to it. And hence, when confronted by the need to adapt, the car will kill people.
If that were true, a model trained on one commonsense Q&A dataset would be able to answer questions from another commonsense Q&A dataset without finetuning. But they can't. It's the same for every task you can find, but especially evident on commonsense reasoning benchmarks. At least as of the last time I actively researched the question, I hadn't found a single task where NNs definitely generalize.
When researchers dig in, they find that the neural network is learning the wrong things: word matching between answer and question, learning to model an annotator who asked a lot of questions because their answers are predictable, and so on.
There was a great review of the problem, but I can't find it, so I will have to link to this article [1], which gives an overview of issues with current NLP models.
The fact that NN does not generalize in a particular (arguably very challenging) task trained on a particular data set etc. does not mean NN never generalizes.
Don't know exactly what is meant by "generalization" in this context. I'd argue that it's very unclear what the difference between "interpolation" and "extrapolation" even is in a high dimensional and sparsely sampled space.
Would a NN trained to recognize "fur" on different kinds of dog fur also activate when it encounters wolf, cat, or even bear fur? This seems quite plausible, and seems like a kind of generalization.
It's indeed a very challenging task. I picked it because it's one of those where the lack of generalization is apparent. In fact, researchers study commonsense QA and similar tasks to find how we can reach generalization.
I agree, it's very hard to find the difference. Especially for GPT or BERT, where the training set is basically the whole internet.
It's a very good question about fur. I would suspect that it would correctly recognise all kinds of dog-like fur and sometimes fail on different furs, like bear fur. But in general NNs are very good at textures, so maybe it will just be good on all kinds of fur. One problem that might arise is that you will show it something fur-like but not really fur, and it will think it's fur.
I agree it's some kind of generalization. Here I don't have enough background to draw the line; perhaps a more theoretically oriented person could, but I can't.
I guess the most important test is that if you devise a benchmark like commonsense Q&A, a neural network fails it. Or how a Tesla will recognise a truck full of red stop signs as a real stop sign, while a "generalizing" thing like a human would definitely know that a core property of a stop sign is that it should be installed near a road. So there is a real problem.
> This would explain why even the largest language models can't generalize beyond the training set.
Why do you say this? They do. I can teach GPT about new objects I have made up and their physical properties, and it can understand them. That seems a clear example of generalizing beyond the training set, or do you not think so?
It's repeating the same patterns it learned from text. Maybe with different specific objects, but still. One good way to test this is to ask it to count. It will break very soon. A person is able to build a rule in their head: "one apple is 1, two apples are 2, three apples are... 1+2 = 3".
Prompt: "One apple is 1, two apples are 2, three apples are"
Model output: "One apple is 1, two apples are 2, three apples are 4, four apples are 5, six apples are 7, seven apples are 8, nine apples are 10, ten apples are 11 (for apples being the perfect length of life)."
Even if you use a special dataset to teach it to count, it won't be able to count beyond the examples in the training set. So it's spewing plausible-sounding gibberish at you (i.e. approximating the training-set distribution).
It doesn't generalize. Not in the sense that it can't give you a phrase that didn't exist in the training set. It can. But it can't give you a new kind of phrase, of a "kind" that didn't exist in the training set.
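If anyone wants to poke at this themselves, here is a rough sketch using a small open model via the Hugging Face pipeline (GPT-2 as a stand-in; the exact model behind the output above isn't specified, so results will differ):

    from transformers import pipeline

    # Re-run the counting probe with a small open model (GPT-2 as a stand-in).
    generator = pipeline("text-generation", model="gpt2")

    prompt = "One apple is 1, two apples are 2, three apples are"
    out = generator(prompt, max_new_tokens=30, do_sample=False)
    print(out[0]["generated_text"])  # greedy continuation of the prompt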
>But it can't give you a new kind of phrase, of a "kind" that didn't exist in the training set.
Can you give a concrete example that doesn't involve math (which GPT is a bit handicapped at, because of the way it's encoded)? I feel this is a bit like 'moving the goalposts'. It seems to me plenty of humans only ever repeat things they've heard and aren't coming up with novel, complex abstractions or ideas...
If neural networks do compression, then they are implicitly providing structure not present in the data. Given a dataset there are always many different ways to compress it, each corresponding to some implicit distribution over unseen datapoints. Most of these distributions don’t correspond to real-world generalizations. Yet neural networks do often—not always—choose models for compression that correspond to real generalizations. This is nontrivial.
However, it just moves the conceptual line between data/feature engineering and modeling so as to make the data/feature engineering encompass nearly all of the process.
By this definition of ML, one might just say that clever ML is a really clever way to engineer useful features, after which we "merely" average them.
From a certain POV, the structure of an animal's body as it learns, for example, to ride a bike can be given this kind of formulation.
Namely, that the muscle tissue and the motor cortex express a latent representation of riding which is formed under the animal's self-regulative goals of "stability", "getting to some ___location", etc.
From this POV, one can model any aspect of reality with this sort of language. I used to be annoyed by this, but less so now. Annoyed because it's pretty hopeless to describe any complex physical system this way. It only makes sense if your relevant "feature spaces" are discrete and computationally tractable... I don't think our muscles are good candidates. (Nor, likewise, is almost anything.)
Nevertheless, you are right that "ML" on a ___domain whose dimensions are infinite, in which slices through those dimensions are infinite, and whose models are chaotic -- ML with an infinity of time to run -- would produce an animal. And indeed, an intelligent system.
One has to remember here that no one is doing this. What they're doing is taking a highly regulated, non-chaotic, tractable, measured and largely unambiguous slice through these spaces --- sampling them --- and taking an average.
Who gave you the knife to take this slice? Was it your cognition, was it the world?
No, neither, it was your body -- largely; and its ability to regulate your imagination; and to guide action.
Formulating reality as if its patterns can be determined statistically leads to absurd levels of non-computability. Every possible partitioning of the world, in every possible arrangement, moving in every possible way -- what discrete model of this is there?
There isn't one. And this isn't semantic. The challenge of building an intelligent system is a (bio-)engineering one: we require a material which grows intelligently, whose internal chaos is useful. Not a dumb little calculator which comes in to average our efforts after the fact.
There are methods that build networks from scratch, including the positions of individual neurons, connections, weights, and mutation methods, but these models perform poorly, to say the least, compared to RL models in single- or multi-agent settings that learn only through interaction with the environment. From my limited perspective, we do not understand how/where/why the brain changes on a second-by-second basis at the neural level; furthermore, we have not the slightest idea how instincts are encoded in the brain through DNA.
In terms of the planes-and-birds analogy, it is like trying to build an airplane without understanding the basics of aerodynamics.
<< almost any measurement one cares to make (with eyes, even) produces endless ambiguities and coincidences that you have to "be in the world" to resolve; and resolution is a dynamic process which you "have to be here for". >>
Sounds like a magic view of the mind to me.
It's suspicious that this is never said about the function of Liver or Kidneys, only about the function of the Brain.
Maybe the brain is protecting itself, like if you ask a cow about steaks.
The way we can tell if a stick is bent, or not, in a fluid is to remove the stick.
Animals aren't magic, they're just actually present in the world.
The converse is true: if the world's structure could be uncovered by data analysis, then it would be something like The Matrix, and the whole history of science would be falsified. It would turn out the hard work of experiment was never required; one could just blindly measure anything in any way, and somehow knowledge would be produced.
> one could just blindly measure anything in any way, and somehow, knowledge would be produced.
Is this not what astronomers do? Sure they also have inputs from experiments made on Earth, but the actual investigation of "the stars" is reliant solely on data analysis.
What is the structure of the world if not the correlations in some measurement data?
So (trying to grok this): in order to "know" I have picked the box up, I can feel its weight, wave my hand in the empty space, etc. I am "in the world" and use side channels to build my own reward function for my own eyesight (presumably this is the first 9+ months of life).
If I only had eyesight, I could not build a reward function?
Does this mean AI could do this if they had other sensors like weight gauges in the servos?
Well the body has properties which enable it to formulate concepts which begin simply as motor techniques. It is unlikely that servos have the right properties, but it's an open question.
I strongly suspect the body's ability to organize itself at the cellular-to-organ level over time, i.e., to grow, is an essential component of how we develop and deploy novel motor techniques.
It is these techniques which we use to resolve the ambiguity inherent in any system of measurement -- we, of course, learn to see -- but I think "learn" here is really a complex form of motor control which ends up structuring our perception; then, reflexively, we build on that to develop more motor techniques.
We learn to see in the sense that we first learn to move, then we see better, and we cycle until we're building electron microscopes. Explicit cognition is really just a book-keeping/accounting system to tidy up this process. In modelling only this, we end up "doing all the hard work" for the machines.
Intelligence isn't what cognition does when it tidies up your concepts; it's the very having of those concepts. Cognition is a big, largely neurotic, unself-aware misdirection. It's what academics think is important -- failing to notice they have any body at all.
They only got their cherished abstractions by first moving their meat -- and whilst that can be dropped for the most pure and formal of ideas (e.g., arithmetic) -- it cannot be dropped for discovery as such.
I don't really get the first part of your comment. You say an NN is "just" compression + kNN and does no representational learning. But finding a compression (a transformation in other words) that makes kNN feasible on the data is exactly what people mean when they say it finds a hidden representation. It is a highly non-trivial task: e.g. simple distance in pixel space between images would get you useless results.
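A rough sketch of that contrast, assuming some image tensors and labels are available (the pretrained ResNet-18 feature trick, recent torchvision weights API, and the data layout are my illustrative assumptions, not anything the parent specified):

    import torch
    import torchvision.models as models
    from sklearn.neighbors import KNeighborsClassifier

    # kNN directly in pixel space: distances between raw pixels, usually near-useless.
    def pixel_space_knn(train_x, train_y, test_x):
        knn = KNeighborsClassifier(n_neighbors=5)
        knn.fit(train_x.flatten(1).numpy(), train_y.numpy())
        return knn.predict(test_x.flatten(1).numpy())

    # Same kNN, but in a learned representation: features from a pretrained backbone.
    def embedding_space_knn(train_x, train_y, test_x):
        backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        backbone.fc = torch.nn.Identity()   # keep the 512-d features, drop the head
        backbone.eval()
        with torch.no_grad():
            train_f, test_f = backbone(train_x), backbone(test_x)
        knn = KNeighborsClassifier(n_neighbors=5)
        knn.fit(train_f.numpy(), train_y.numpy())
        return knn.predict(test_f.numpy())

The non-trivial part is entirely in where the second function computes its distances.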
People have the notion that a latent representation in the animal sense, ie., a concept -- is the same thing as your "representation" in the NN sense.
That's not the case. You're right that if I find predictive compressions of faces, say F1...n, then they aren't literal "rememberings". And they seem to be able to participate in a decision process (e.g., classification) which doesn't seem to target pixel patterns.
However I think this is kind of an illusion. What `F1..n` are, are ambiguous pixel-space projections of an abstraction which isn't present in this projection. When I have the concept "this type of face", I can reason with it beyond similarity in pixel-space.
When we form representations we aren't restricted to reasoning with them in only one space (e.g., how faces look as pixels). We (perhaps superstitiously) impart to machine "representations" an actual depth which they lack.
They are templates derived from the spaces they live in, e.g., pixel-space, and have only the properties that space affords (e.g., pixel-geometry). Reasoning beyond that space, and those properties, doesn't work. People think it does. This is the illusion.
Templates derived from this data, that we provide, function like actual representations because we simplify the world for the machine -- and prepare its environment so that its pixel-space templates are good enough.
> When I have the concept "this type of face" I can reason with it beyond similarity in pixel-space.
I think there are at least two possible things that might be going on here:
1. we're "trained" on non-pixel data (to use the same framing) and so it seems obvious that we would reason with a concept like "this type of face" in a non-pixel space.
2. the experience of "reasoning with it" is an illusion, and is merely the subjective experience that we have when our brains do stuff with whatever their underlying representation is. That is, we may have no real knowledge of what space our own "face model" is built on, what it represents, what properties it considers.
So, another option is that our concepts aren't really about their targets.
For example, my "pen" is actually a bundle of largely motor techniques (also: emotional regulation techniques, etc.) which is about how I coordinate myself.
In a sense all my concepts are about me. We might say that in cognition we abstract these "primal concepts" into properties which are then about their targets, sure. This is why cognition is a bit of a charlatan.
In being able to regulate my own imagination, motor skills, emotions etc. with these techniques, I can actually discover new things about the "abstract concepts" that they imply.
Since my "pen" isn't really anything like "a pen", nor even my experiences of pens... it is rather "a way of me moving everything within me" -- I can simulate these movements and discover new things about their implied abstractions. Indeed, I can do this in the world: I can explore actual pens by moving my body differently.
This is one of the big issues with the AI paradigm as it has always existed: researchers are still talking about environments as if they had properties which were already there. Everything is formulated as if the problem were solved. The world is just some bundle of facts (data, propositions, etc.) and representations are just subsets of these.
This misses, you know, the world. The thing that hits you. And it misses, entirely, what happens when it hits you.
Sounds like a re-visit of a lot of the ideas in "situated action", popular in some AI circles in the 90s. That also included concepts about how we reduce cognitive load by storing information (and procedure) in our built environments.
I'm not sure, though, that I agree about the pen example. I mean, you're completely correct in your description of all these different elements of your "pen". But to the extent that a concept of/about something is really never anything like the thing, it's not a particularly important aspect of concepts. I think that what's important is that all of these embodied, semi-reflective elements of your "pen" concept sit in parallel with some abstract concepts about pens. Pens do have properties that are already there, but in many contexts, as you note, these are less important than the ones embodied by you.
I suspect we'll find that these abstractions are nowhere within us. We don't have ML-like templates of stuff.
Whenever we need to reason about the properties of pens, our coordinative bodily structure generates these abstractions for cognition to operate on. They're ephemeral, and live only when cognising an issue.
To model cognition then, is to model only the symptom of intelligence; not the actual process itself.
which actually goes into all the layers (by their choice of design, initialization), distance functions, optimization steps, data cleaning, splits, etc.?
And furthermore, also that
P(model simply memorizes observations + human-provided inductive cues | model finds structure) >> P(model does something beyond memorization of observations + human cues | model finds structure)?
What would be your test to determine this? For instance, how would you measure the information content (under some pragmatic-enough encoding that could be realizable and built today) of the memorization + human cues, vs memorization + human cues + the extra bit that the model concludes? Not a gotcha question, curious about your answer.
Though, I have to say something about the first bit. Suppose you were comparing the safety and probability of catastrophic failure of some of the earliest buildings made in human civilization compared to something made in 2022. And suppose you conditioned that probability on the sum total of human knowledge at the time. The same laws of physics apply, but different amounts of knowledge went into producing each artifact.
Can we say that the latent structure (human theory of engineering) in the modern building (model) reveals more about the ground truth of the world (the laws of physical reality) than is contained in the observation dataset (does it collapse or not), compared to the old model? I think we can. And I don't feel that embarrassed to say that deep learning is just a part of the wider landscape of human experience.
You may be interested in this article that proposes Rip van Winkle's Razor: if you apply a state of the art method to an old dataset, for a fair comparison, you need to measure and take into account the increase in tooling that has become available since the time the benchmark was first released: http://www.offconvex.org/2021/04/07/ripvanwinkle/
What we require is a "bridge" from any particular feature space to the target. A compression of a feature space doesn't produce this bridge. When we look at a photo, its ink-space arrangement can be bridged to its depicted situation via our ability to bridge.
We see in the ink, a world. We do this via concepts which bridge. We get those by being in the world.
We can "roughly accept" limited forms of inference which simply operate within a feature space; i.e., we can be fooled by photos "made in Photoshop" which appear to depict a situation that never really existed.
A machine operates in "ink-space" and we rig it only ever to produce "human-bridgeable" arrangements of ink. The machine itself doesn't draw a situation -- as with an artist painting a room. A machine isn't in the room.
It is as if it were in the room, and as if it had painted it. This illusion is created by us: we rig the system to average historical paintings of rooms, and only ever show the apes-at-home a good picture. They think, looking at it, that the machine has painted something.
One formulation of this, I suppose, is to add "| HumanUnderstanding(Y-as-labelled, Y-the-actual-target)" to everything
From the physics point of view, intelligence is about keeping the universe running in perfect order, with no energy wasted as heat. All accurate predictions align our current states to the future. If we look deep enough, we always find information coming from the past and the future. The only trouble is that our computational power is bounded, and computing itself generates waste heat.
Humans always feel that we have free will to design. It's likely a mistake in our intelligence, unless the universe is infinitely dimensional and we can process the energy flow in any way we want.
I have been working and thinking about data and ML models for many years and this is probably the most concise and lucid summary of a post NN philosophy I have heard so far. It also resonates with me. Thanks for sharing.
It sounds like the unique attribute you're describing in humans is nuance and contextual understanding, but we developed those as a way of doing something a neural network does inherently by means of error adjustment.
I don't think there's ever an expectation that the data is "truth" but that we have to contextualize it and adjust it. Humans adapted to those through different mechanisms than NN but the goal is still fundamentally the same: to predict something with as much accuracy as possible despite the infinite noise of the data that is the world.
Alas, the goal isn't the same, and here "predict" is the culprit. What ML systems do is offer statistical estimates, which is one very narrow meaning of "predict".
When I "predict" what will happen when a glass falls, I'm not giving you any statistical estimates. I'm running a mental simulation of the world in its relevant parts and using this to predict what will happen.
The heart of basically almost all knowledge of the world is actually counterfactual (ie., requiring simulation), it's always "if this happens (in the simulation), this will happen (in the simulation)".
To acquire this ability to predict one has to understand, ie., to have the right concepts, reason with them and so on. When a person "has the concept `7`" they can reason thus: "6 eggs would be fewer", "7 is an odd number", "7 is a whole quantity", "with 7 less of 10, i'd have 3" etc.
Ie., `7` is available to play a fundamental role in simulating scenarios and reasoning with them.
When a NN produces a mapping from {Images} -> {Labels} such that the squiggle `7` produces the label `"7"`, it doesn't do so via prediction in the meaningful sense of the word. Rather, it has found a means of averaging squiggles, and it compares one squiggle to its averages and reports what the historically similar squiggles were labelled.
This is a game which is only useful because we, who can reason with `7` profoundly about the world (etc.), have produced datasets which express our knowledge of the world in ways that don't require it. I.e., we present to the machine something whose average is useful.
The machine is not trying to, nor does it, nor can it, predict. It has no knowledge at all. It is just a correspondence table we have provided.
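To make the "averaging squiggles and comparing" picture literal, here is a deliberately crude caricature: a nearest-class-mean classifier on scikit-learn's digits dataset (the dataset and method are my own illustrative choice, not a claim about what any particular NN computes):

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split

    # Store the mean image ("average squiggle") of each digit class,
    # then label a new squiggle by its nearest class mean.
    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    class_means = np.stack([X_train[y_train == c].mean(axis=0) for c in range(10)])

    dists = ((X_test[:, None, :] - class_means[None, :, :]) ** 2).sum(axis=-1)
    preds = dists.argmin(axis=1)
    print("nearest-mean accuracy:", (preds == y_test).mean())  # well above chance, far from perfect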
>When a person "has the concept `7`" they can reason thus: "6 eggs would be fewer", "7 is an odd number", "7 is a whole quantity", "with 7 less of 10, i'd have 3" etc.
I just input this into GPT-3. Its responses are in italics; this is the first try, no rewriting or retrying anything:
This is a test of understanding of the concept of "the number 7".
Question: Is 7 odd?
Answer: Yes, 7 is an odd number.
Question: Is 6 eggs fewer than 7 eggs?
Answer: Yes, 6 eggs is fewer than 7 eggs.
Question: With 7 less of 10, what would I have?
Answer: I would have 3.
Question: Is 7 a whole quantity?
Answer: Yes, 7 is a whole quantity.
This is mostly a joke because I think I understand where you are coming from (and that you are hypothesising that gpt3's responses are an elaborate trick of sorts). But I don't believe AI has to take the same route as human intelligence, and I don't think we really understand what a concept is or how it behaves from a signal/data perspective, but I think that may be inconsequential for creating general AI.
Also, people can be really stupid sometimes and also have failures, and the concepts that people hold can be incorrect or flawed, etc. So it may be useful to compare human failures with AI failures, rather than just AI failures with human successes.
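For anyone wanting to re-run the exchange above, a rough sketch with the pre-1.0 openai Python client (the model name is an assumption, just one of the public completion models of that era; an API key must be configured):

    import openai  # pre-1.0 client interface

    prompt = (
        'This is a test of understanding of the concept of "the number 7".\n\n'
        "Question: Is 7 odd?\nAnswer:"
    )

    response = openai.Completion.create(
        model="text-davinci-002",  # assumption: any completion-style model of that era
        prompt=prompt,
        max_tokens=32,
        temperature=0,
    )
    print(response["choices"][0]["text"])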
I think this conversation gets away from itself very quickly if you run with generic terms: data, prediction, and so on.
People think that NNs/ML/etc. works, and now the relevant questions are philosophical: if it works, how.
But it doesn't work. In very silly, trivial ways. It's only taking averages over historical data. And all it can do is that.
As soon as you are alive to what thinking about things requires (i.e., a live embeddedness in an environment, imagining scenarios, etc.), it's trivial to expose the magic show. It isn't philosophical; it's quite literally just broken and doesn't work.
I cannot ask GPT, "do you like what I'm wearing?", or "what would it take to change your mind on drug legalisation?" -- and so on. Indeed, I can't ask it, "if I have seven eggs, should I make a cake with 6 and keep the 7th -- or should I just use 7?".
I can't ask it anything which actually requires it to reason with concepts, because it doesn't have any; and I can't ask it about anything happening now, because it is just summarising text that has already been produced.
The whole AI hype industry has a vested interest in the philosophy here, as it's premised on everything working. Almost nothing works. The idea that GPT presents any philosophical challenges is nearly absurd; about as many as a shredder at a library.
> I can't ask it anything which actually requires it to reason with concepts, because it doesn't have any; and I can't ask it about anything happening now, because it is just summarising text that has already been produced.
I think you might have too strong of a belief that your subjective experience of reasoning is indicative of the actual machinery. I think it would be crazy to reject this view entirely, but it also seems possible to me that the whole experience of "reasoning with concepts" is not actually what we do at all, but merely what we experience when we do something else. Essentially, some variant on Dennett's intentional stance but applied to ourselves.
I'm always really happy to get to a technical description -- because I think these philosophical replies make it seem like we're talking about a working system. And there isn't one.
So what I'm interested in is machine systems for which we have (1) no prior specification of the environment; and which, in the end, (2) anticipate permutations in the environment which aren't specified.
Intelligence is the solution to this: for (1), form concepts which carve out the environment using your body; and for (2), use these concepts to engage in counter-factual reasoning which simulates possible change.
I think a fatal issue for AI is that these aren't even in the class of problems being addressed. Talking about intentionality here gives the field vastly more credit than it deserves. We aren't even at the level of basic concept formation.
> So what I'm interested in is machine systems for which we have (1) no prior specification of the environment; and which, in the end, (2) anticipate permutations in the environment which aren't specified.
But there are no biological systems that do this either (other than to the extent that, by virtue of their embodiment, they may necessarily come with more of a built-in "understanding" of the (physical) environment).
Yes, I don't think intelligence is cognitive -- it's the content of cognition. Where does that come from? Essentially our bio-organic structure as distributed across the body, and especially the motor cortex.
In other words, the AI problem is unsolvable by any system. It's the wrong problem. Cognition, as a formal structure, isn't intelligent.
> Yes, I don't think intelligence is cognitive -- it's the content of cognition. Where does that come from? Essentially our bio-organic structure as distributed across the body, and especially the motor cortex.
All right, so where does that (the organic structure including the motor cortex that creates cognition) come from?
I like your comments, even though I fall into the DL fanboy camp.
I hope that you read Gary Marcus's and Ernest Davis's book "Rebooting AI" - I think it would resonate with you, and it made me a lot more interested in so-called "hybrid-AI" systems.
I haven't read the paper yet, but you present an interesting argument to think about. Still, I am not sure that neural-network-based approaches are mutually exclusive with experimental design; they can in fact inform it. I'd love to talk more about this.
"We have more data than ever, more good data than ever, a lower proportion of data that are good, a lack of strategic thinking about what data are needed to answer questions of interest, sub-optimal analysis of data, and an occasional tendency to do research that should not be done."
Well if you, like me, see ML as only a garbage shaft, then yes.
But a lot of people think of it as a fancy processing plant which can take sewage and make diamonds. Really, it takes carefully prepared gemstones and crushes them into a shiny powder.
Since this is Hacker News, I'm sure at least a few people reading this are the ML research engineers who write papers on architectures that achieve SoTA results on different benchmarks, or establish new datasets, or publish new tooling like different loss functions/optimizers (which are usually published alongside new SoTA results achieved by integrating them into a current SoTA architecture).
I've wondered whether that crowd (the productive crowd) gets any value from papers like these, whether reading them ever gives any insight. I don't seem to ever glean very much from them, whereas just reading the aforementioned papers, thinking for myself, and experimenting seems to give me the intuition I personally need.
I'm starting to feel there's a whole secondary field within ML of people who publish 50+ page, math-heavy, theory-only papers that yield neither actual results nor value toward achieving such results, but maybe I'm wrong. I'll ask around at the conferences this year. I'd be willing to read papers like these if it turned out they'd help me, but as time is limited, I feel I'm likely better off reading more papers that show results and their architectures (which usually have a bit of light speculating/intuiting mixed in as well) than reading this.
Also, no offence meant to the writers of the paper who might read this. If you have any thoughts on this line of inquiry as well I'd be interested to hear them. Maybe there's a lot of practical value I'm missing out on by not taking the time to read and understand papers like these.
Indirect value: there are many facets to every field of maths/science/philosophy and each facet makes progress at a different pace. This type of paper doesn’t have any impact on my day-to-day thinking but I don’t doubt that this approach will eventually lead to better methods and understanding of deep learning. Unfortunately that understanding could take 50 years for all we know.
In the meantime, progress in applications of deep learning is outpacing progress in theory by several orders of magnitude. The original GAN paper by Ian Goodfellow was published in 2014. Just 8 years later we have ML models that can generate convincing images from text prompts, arguably beyond what 99.9% of humans are capable of. Theory has no chance when engineering is moving so rapidly.
And yet none of these systems can do anything beyond generating statistical patterns adapted to our visual sense. It's certainly a worthwhile achievement but I have yet to see anyone make a convincing case that there is a way to scale these systems to anything resembling intelligence. My current test for whether these systems are showing us how intelligence actually works in the human brain is a convincing proof of Cauchy's integral formula and I'm certain this benchmark will remain unsolved for the foreseeable future.
Why would your benchmark for intelligence in the human brain be something that less than 1% of humans are capable of achieving? To me, that type of mathematical proof is an example of a very specific type of intelligence rather than general intelligence.
Because if an algorithm is capable of setting up the conceptual machinery for making sense of and explaining Cauchy's integral formula, then that algorithm is revealing something about the structure of the human brain, and as such it would be a good benchmark for understanding how human intelligence actually works (not just in the 1% of people). Moreover, mathematics is a fantastic proving ground for testing algorithms that purport to explain symbolic intelligence, because if something cannot work in a mathematical setting/context then there is no hope it will ever work in the real world, since real-world intelligence is much more than symbol shuffling. Cauchy's integral formula is a reasonable proxy for proving understanding of symbolic systems and not just juggling their statistical properties/associations.
If you think that Cauchy's integral formula is too complicated then there are probably simpler problems that would also serve as reasonable proxies of symbolic understanding, e.g. elementary group theory and linear algebra.
Could you give a simple example or proof that real-world intelligence is much more than symbol shuffling? I'm a bit unfamiliar with the idea.
I don't appreciate the sarcasm. If you're confused then say so, save the snark for someone else. If you're asking a sincere question then you can go outside and watch a few animals navigate the world. Squirrels are a good one, as are crows and ravens. Or as the kids say these days, you should touch some real grass instead of virtual ones.
It was a genuine question, I've not heard an intuitive or simple explanation of the concept. Your assumed combative tone is very reminiscent of others I've heard bring up the concept though (like Gary Marcus). I still haven't been able to hear a good and simple explanation though from your cohort despite my genuine curiosity.
You're the first to confuse it for sarcasm on this forum. Your tone seems to be presumptuous and uncordial (possibly a learned memetic behaviour from others with similar opinions when asked to elaborate with simple explanations of their view), so I'll take the hint and stop replying.
The way I see it, the utility of theory is raising the baseline. It guides you away from dead ends, so you don't waste time on approaches that could never work.
It's necessarily high-level, so you still need to learn about specific approaches to get practical things done.
I think I'd need an author of one of those (SoTA) papers to walk me through how reading part of a 50+ page theory paper helped them achieve any marginal improvement on their path towards productive contributions for me to really understand or believe it. If I could have even a single example explained to me I think I'd get it.
I'm a bit lost on who the 'they' are and what the 'paper' is in this case: The authors of this paper writing this paper, or the authors of SoTA papers writing SoTA papers.
A third option you may have meant, which makes sense to me in context, is the authors of this paper going on to write SoTA papers. I actually looked through the published works of the first two authors and didn't find any practical work; all were similarly long-form, math-notation-heavy theoretical papers, which supports what I initially worried about: that there are basically two diverging branches of ML papers, one of which I'm skeptical about the practical value of. I have actually not seen any examples of first authors on theory papers going on to publish SoTA results on applications; seeing such a thing I would also consider pretty convincing evidence of the utility of these papers and the symbiotic unification of the two groups.
I wonder if there's been similar concerns raised historically, like at the advent of electricity, between practical engineers focused on creating groundbreaking applications versus people still focused on theory, and in retrospect what contributions continued to be gained by the theoretical work afterwards in those cases.
When you pick up a SoTA algorithm and try to apply it to a particular problem, ... it does not work, or does not work well enough.
You must modify or fix it. Combine it with other approaches. Be able to know which algorithms to pick. Without knowing how the thing works, what the internal dynamics are, and what the bottlenecks and limitations are, you get nowhere. Trying to fix problems by blindly tweaking parameters is not going to work.
You don't have to be able to analyze and do research on the internals yourself, but you must understand the papers and be able to get the idea. Just as an engineer doing signal processing must know a lot about Fourier transforms, wavelets, and Laplace transforms, even if they forget the details and some of the equations.
IMHO these books are not used for research. In order to contribute to research you need to be at the cutting edge of existing contributions, which are the most recent papers. On the other hand, these kinds of books are useful because they are a huge aggregation of past research, and it is based on these kinds of resources that professors build their courses and that undergrad and graduate students learn.
In statistics, there's a joint distribution of x and f(x) to be learned. In this piece, there's a distribution over trained models that naturally depends on the initialized parameters.
It's interesting when data is considered non-random, and the probabilistic variation arises from weight initialization.
While this serves to characterize the training process, the authors see the goal of the training procedure as convergence to a target function f(x), with known values of x. This is of course not the substantive goal of the DL/ML researcher or practitioner (where the target function is unknown, and the goal is to learn the joint distribution and hopefully not fit the latent f(x) function exactly).
On the one hand, uncovering how a NN would behave with different lengths and widths for a fixed set of data seems an important step, and although I haven't read far enough to see any interesting results yet, I am sure the authors have much in store.
On the other hand, I feel like most questions of practical relevance - like generalization, model choice given data, causality, statistical properties of the entire design process and so on - all require us to think about the data generating process, rather than only the weight initialization process.
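A toy way to see the "fixed data, random initialization" viewpoint (entirely my own sketch, not from the paper): train the same small network several times on the same data, changing only the seed, and look at the spread of the trained models' predictions.

    import torch
    import torch.nn as nn

    # Fixed data; the only randomness across runs is the initialization seed.
    torch.manual_seed(0)
    x = torch.linspace(-1, 1, 64).unsqueeze(1)
    y = torch.sin(3 * x) + 0.1 * torch.randn_like(x)

    def train_one(seed, steps=2000):
        torch.manual_seed(seed)  # changes the init, not the data above
        net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
        opt = torch.optim.Adam(net.parameters(), lr=1e-2)
        for _ in range(steps):
            opt.zero_grad()
            loss = ((net(x) - y) ** 2).mean()
            loss.backward()
            opt.step()
        return net

    x_test = torch.linspace(-1.5, 1.5, 50).unsqueeze(1)
    with torch.no_grad():
        preds = torch.stack([train_one(s)(x_test) for s in range(5)])
    print(preds.std(dim=0).mean())  # init-induced spread across the ensemble of trained models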
Curious how this work relies a lot on the mathematical language of Quantum Field Theory. Wick theorem. Criticality. RG flows. As someone with background in Physics I had a permanent deja-vu reading this.
Two of the three co-authors, Dan and Sho, come from high energy physics backgrounds. Dan went from pure high energy to the intersection of HEP and quantum information. Boris comes from a pure math background but he has spent a lot of time learning physics as well. People's trainings influence their work :)
I hear this phrase everywhere: "Every layer is a change in representation of the previous layer". Is this mathematically proven, or is it an assumption from looking at the layers of image classification models?
That seems true by definition, as each layer is a functional transformation of the previous layer.
I think what they want to imply is "Each layer is a semantically meaningful representation, changed from the previous layer", which I don't think is a verifiable claim (as what is semantically meaningful isn't an objective matter).
Well, since the next layer's outputs are a linear transformation of the previous layer's outputs followed by some nonlinearity, it's a fact that it's a change in representation. But I guess the true question is broader: "Is it proven that each layer is preparing a feature representation for the next one?".
I don't know if it is mathematically proven, but you can easily see it yourself when making an image classifier with one hidden layer. That's really good evidence for the assumption; at this point I would call it at least empirical evidence.
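A small sketch that just prints the intermediate activations of an (untrained) MLP, which is all "each layer is a change of representation" literally means; the sizes and layer choices here are illustrative only:

    import torch
    import torch.nn as nn

    # Each stage maps the previous representation to a new one
    # (linear transformation plus a nonlinearity).
    layers = nn.ModuleList([
        nn.Sequential(nn.Linear(784, 256), nn.ReLU()),
        nn.Sequential(nn.Linear(256, 64), nn.ReLU()),
        nn.Linear(64, 10),
    ])

    h = torch.randn(1, 784)  # stand-in for a flattened image
    for i, layer in enumerate(layers):
        h = layer(h)
        print(f"layer {i}: representation of shape {tuple(h.shape)}")

Whether those intermediate representations are semantically meaningful is the separate, empirical question discussed above.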