There is evidence that language is fairly smooth, though. For example, we can extract directions such as a gender vector from a word embedding space learned by a recurrent neural network. That hints at the possibility that words, sentences and concepts live on a smooth, high-dimensional manifold, which is what makes them learnable for us in the first place (because in that case they can be learned by small local improvements, which seems to be required for biological plausibility). That would also explain why we often have many words for the same or similar meanings and, conversely, why formal grammars have failed at modeling language.
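To make the idea concrete, here is a minimal sketch of what "extracting a gender vector" means in practice. The four-dimensional vectors below are made up purely for illustration; real ones would come from a trained embedding model:

```python
# Toy sketch of the "gender vector" idea. The embeddings are invented
# for illustration; in practice they would come from a trained model.
import numpy as np

emb = {
    "man":   np.array([0.8, 0.1, 0.3, 0.0]),
    "woman": np.array([0.8, 0.1, 0.3, 1.0]),
    "king":  np.array([0.9, 0.9, 0.2, 0.0]),
    "queen": np.array([0.9, 0.9, 0.2, 1.0]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# If the space is smooth, a single direction encodes gender:
gender = emb["woman"] - emb["man"]

# Moving "king" along that direction should land near "queen".
shifted = emb["king"] + gender
best = max(emb, key=lambda w: cosine(emb[w], shifted))
print(best)  # -> "queen" in this toy example
```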
Arguing from the other direction, neural networks have already proven able to handle very sharp features. For example, the value and policy networks in AlphaGo pick up on subtle changes in the game position. In Go, the placement of a single stone can change the position drastically, and this is by no means handled only by the Monte Carlo tree search: without MCTS, AlphaGo still wins ~80% of its games against the best hand-crafted Go program. The value and policy networks have essentially evolved a bit of boolean logic, purely from gradients on the smooth loss surface that results from averaging over a lot of training data.
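The XOR gate is the textbook miniature of this effect: the target function is sharply boolean, yet a tiny network can carve it out of a smooth loss surface by plain gradient descent. A rough numpy sketch (the architecture and all hyperparameters are arbitrary choices, not anything from AlphaGo):

```python
# Sketch: gradient descent on a smooth loss recovering a sharp boolean
# function (XOR). Hyperparameters here are arbitrary illustrative picks.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)  # XOR truth table

W1 = rng.normal(size=(2, 8)); b1 = np.zeros(8)
W2 = rng.normal(size=(8, 1)); b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(5000):
    # Forward pass: smooth everywhere, so gradients exist everywhere.
    h = np.tanh(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)
    # Backward pass for mean squared error.
    dp = (p - y) * p * (1 - p)
    dh = (dp @ W2.T) * (1 - h**2)
    W2 -= 0.5 * (h.T @ dp); b2 -= 0.5 * dp.sum(0)
    W1 -= 0.5 * (X.T @ dh); b1 -= 0.5 * dh.sum(0)

print(np.round(p.ravel()))  # typically [0. 1. 1. 0.]: a learned boolean gate
```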
I have a pet theory that the discovery of sharp features and boolean programs might rely heavily on noise. If the error surface becomes too discrete, we basically have to fall back to pure random optimization (i.e. trying a random direction and keeping the step if it is better). That lets us skip down the energy surface even in the absence of a gradient. Of course, such noise can also lead to forgetting, but it seems that elsewhere the gradient will be non-zero again, so any mistakes will be corrected by further learning (or the step simply leads to further improvement if it went in the right direction). Our episodic memory surely helps in the absence of gradient information as well. If we encounter a complex, previously unknown Go strategy, for example, it will likely not smoothly improve all of our Go-playing abilities by a small amount. Instead, we store a discrete chain of states and actions as an episodic memory, which allows us to reuse that knowledge simply by recalling it later.
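As a sketch of what I mean, here is pure random hill climbing on a piecewise-constant loss whose gradient is zero almost everywhere; the objective and constants are just illustrative stand-ins:

```python
# Sketch of the "noise" idea: random hill climbing needs no gradient.
# The step-counting objective is a stand-in for a discrete error
# surface; names and constants are illustrative only.
import numpy as np

rng = np.random.default_rng(1)

def discrete_loss(x):
    # Piecewise-constant: the gradient is zero almost everywhere.
    return np.sum(np.floor(np.abs(x)))

x = rng.uniform(-10, 10, size=5)
best = discrete_loss(x)

for _ in range(20000):
    candidate = x + rng.normal(scale=0.5, size=x.shape)  # random direction
    loss = discrete_loss(candidate)
    if loss <= best:          # keep the step only if it is no worse
        x, best = candidate, loss

print(best)  # typically reaches 0 despite the gradient being useless
```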
> I have a pet theory that the discovery of sharp features and boolean programs might rely heavily on noise. If the error surface becomes too discrete, we basically have to fall back to pure random optimization (i.e. trying a random direction and keeping the step if it is better). That lets us skip down the energy surface even in the absence of a gradient.
It's called random optimization or random search, depending on whether you sample the random direction from a normal or a uniform distribution. Monte Carlo typically refers to any algorithm that computes approximate solutions using random numbers (as opposed to Las Vegas algorithms, which use random numbers but always compute the correct solution). So yes, RO, RS and stochastic gradient descent are all Monte Carlo local optimization algorithms.
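In code, the two differ only in the line that draws the direction; everything else is the same accept/reject loop. A rough sketch (`f`, the step size and the sampling ranges are placeholders):

```python
# The only difference between the two, as described above, is how the
# random direction is drawn; f and the constants are placeholders.
import numpy as np

rng = np.random.default_rng(2)

def step(x, f, kind="ro"):
    if kind == "ro":                          # random optimization
        d = rng.normal(size=x.shape)          # Gaussian direction
    else:                                     # random search
        d = rng.uniform(-1, 1, size=x.shape)  # uniform direction
    candidate = x + 0.1 * d
    return candidate if f(candidate) < f(x) else x

f = lambda x: np.sum(x**2)
x = np.ones(3)
for _ in range(1000):
    x = step(x, f, kind="rs")
```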
The very method of using a word embedding space assumes the manifold is smooth, so the fact that vectors extracted by a method that assumes a smooth manifold do in fact lie on a smooth manifold is circular, and not evidence of anything.