I am one of the authors. The most critical aspect is that the transformer is a "different kind of SVM". It solves an SVM that separates 'good' tokens within each input sequence from 'bad' tokens. This SVM serves as a good-token selector and is inherently different from the traditional SVM, which assigns a 0-1 label to inputs.
This also explains how attention induces sparsity through softmax: 'Bad' tokens that fall on the wrong side of the SVM decision boundary are suppressed by the softmax function, while 'good' tokens are those that end up with non-zero softmax probabilities. It is also worth mentioning this SVM arises from the exponential nature of the softmax.
The title of the paper does not make this clear but hopefully the abstract does :).
It is interesting that you have cited this paper but did not correctly acknowledge its contribution. Yeah, I get all that "they are doing X and we are doing X+1" narrative, but the fact that you have defined "good" tokens by multiplying Y_i into your head function is not much different from assigning a 0-1 label to inputs in a traditional SVM. Your "Y_i" essentially serves as a 0-1 label in the SVM.
Sounds like a mind game of re-branding existing concepts lol.
This seems related to NTK literature i.e. wide neural nets behave like kernel regression. NTK is a great tool but a notable weakness is kernel view doesn't explain how the model learns new features. Transformer is also pretty different from standard neural architectures because tokens interact with each other through attention. Our goal was capturing this interaction and we believe there is a clean insight on feature learning: Attention is running a token-selection procedure by implementing an SVM that separates tokens.
See our re-examination of the kernel equivalence. Path kernels exactly measure how models learn as their understanding of the data improves during training, and this can be expressed in terms of the gradients with respect to each training input: https://arxiv.org/abs/2308.00824
We believe that all neural networks are effectively an SVM, or more generally a reproducing kernel architecture, implicitly layering the understanding contributed during each training iteration. Do you have any comment on the RKHS or RKBS context for transformers?
When you say SVM, do you mean any classifier that finds a separating hyperplane, like a no-hidden-layer "perceptron" or Naive Bayes, instead of one which finds the maximum margin hyperplane? Or is finding the maximum margin important here? Thanks. Very interesting.
I think our own brains and nervous system use a step-function as their "activation function", so this could - optimistically - be a throwback to the roots of Rosenblatt's idea.
This SVM summarizes the training dynamics of the attention layer, so there is no hidden layer. It operates on the token embeddings of that layer. Essentially, the weights of the attention layer converge (in direction) to the maximum-margin separator between the good and bad tokens. Note that there is no label involved; instead, you are separating the tokens based on their contribution to the training loss. We can formally assign a "score" to each token for a 1-layer model, but this is tricky to do for multilayer models with MLP heads.
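Roughly, glossing over the exact norm and technical conditions (see the paper for the precise statement), the attention weights W converge in direction to the solution of a max-margin problem of the form

    \min_{W} \|W\| \quad \text{s.t.} \quad (x_{i,\mathrm{opt}_i} - x_{i,t})^\top W z_i \ \ge\ 1 \quad \text{for all sequences } i \text{ and all tokens } t \neq \mathrm{opt}_i,

where x_{i,t} are the token embeddings of sequence i, z_i is the query token, and opt_i indexes the locally optimal ("good") token being selected.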
Finally, I agree that this is more step-function like. There are caveats we discuss in the paper (i.e. how TF assigns continuous softmax probabilities over the selected tokens).
To me, summary is: Through softmax-attention, transformer is running a "feature/token selection procedure". Thanks to softmax, we can obtain a clean SVM interpretation of max-margin token separation.
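As a toy illustration of the selection effect (a minimal numpy sketch, not the construction from the paper; the embeddings and the direction w here are made up):

    import numpy as np

    def softmax(s):
        e = np.exp(s - s.max())
        return e / e.sum()

    # toy token embeddings: two 'good' tokens, two 'bad' tokens
    X = np.array([[ 2.0,  0.5],
                  [ 1.5, -0.2],
                  [-1.0,  0.3],
                  [-2.0, -0.5]])
    w = np.array([1.0, 0.0])   # stands in for the learned attention direction

    for c in [1, 5, 25]:       # scaling the weights up, as happens during training
        print(c, np.round(softmax(c * (X @ w)), 3))
    # as c grows, the softmax mass concentrates on the tokens on the positive
    # side of the separating direction; tokens on the wrong side are suppressed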
> It solves an SVM that separates 'good' tokens within each input sequence from 'bad' tokens. This SVM serves as a good-token-selector and is inherently different from the traditional SVM which assigns a 0-1 label to inputs.
Sorry, but how is separating 'good' tokens from 'bad' tokens inherently different from assigning a 0-1 label?
Standard SVM classifier: Maps an input sequence to a 0-1 label. Example: Take a paragraph and return its sentiment. During training, the label is specified.
Transformer's SVM: Takes input sequence, suppresses bad tokens and passes good tokens to the next layer. This is a token-selector rather than classifier.
Example: Take a paragraph and output the salient words in the paragraph. We don't know which words are salient during training, the model has to figure them out during training.
I have read that SVMs as machine learning model failed to take off because of their inability to scale relative to deep neural networks. Would your work point to ways of changing this?
IMO it is important to understand transformer mechanics through core ML themes like SVM and feature selection. Our results are not an interpretation; they are mathematically rigorous and numerically verifiable. That said, we have no intention of trivializing a complex model like GPT-4 as a simple SVM. That is a tall order :)
Practically speaking, does this give us anything interesting from an implementation perspective? My uneducated reading of this is that a single SVM layer is equivalent to the multiple steps in a transformer layer. I'm guessing it can't reduce the number of computations purely from an information theory argument, but doesn't it imply a radically simpler and easier to implement architecture?
I just read the abstract so I could be way off, but this sounds more like one of those papers that connect seemingly different mathematical formalisms and show their equivalence (often under some restrictions). Typically they don’t give us much immediate benefit in terms of implementation, but they add to the intuitive understanding of what we’re doing, and sometimes help others make more practical progress.
I'm not an expert in this, so hopefully someone more knowledgeable can weigh in - but SVMs are understood much better from the perspective of overfitting and things like the VC bound, while Transformers are not really understood as well. From what I remember it's quite easy to have an SVM overfit, while Transformers have fewer issues. It'd be interesting to understand why.
So if the two are somehow connected, then that could have implications for tuning and fighting overfitting
maybe it'd also be possible to design better non-overfitting SVMs
> From what I remember it's quite easy to have a SVM overfit ... It'd be interesting to understand why
SVMs with well-tuned kernels and regularization are reasonably resistant to overfitting. The problem is that you can easily end up overfitting the hyperparameters if you're not very careful about how you do performance testing.
Those equivalences can connect two different fields and allow transferring methods from one field to the other. Each field usually has developed quite a number of methods and tricks over time. So when this work shows that they are equivalent (with restrictions), you can maybe take some of the tricks of SVMs and try to use them to improve the Transformer model or its training.
Otherwise, they just help us in better understanding Transformers and SVMs.
There have been similar equivalences before, for example:
Or policy gradient methods from reinforcement learning, which are basically the same as sequence-discriminative training as it has been done in speech recognition for many years; however, they come with different tricks, and combining the tricks was helpful.
I am waiting for someone publishing the theoretical limits of these "AI" systems. They're certainly impressive language models - don't get me wrong on that. But every algorithm and every model has its limits. To know the limits turns their application from hype into engineering. And of course, the hype-sellers will try to keep that from happening as long as possible.
This theorem explains the limits. Putting it in simple terms: most architectures are universal approximators that are constrained by the inductive bias we give them. So far, the approximator architecture least constrained by its inductive bias is the transformer, so it should be able to approximate any mathematical function. The current problem is that the attention mechanism has quadratic scaling, so while it is easy to scale on text, it is pretty hard to scale on anything else to the same performance. Even if no further discoveries are made, the compute power of the future alone should let it scale in every field; even with today's techniques it gives pretty good performance on a lot of tasks.
This review by Yannic Kilcher of the paper "An Image Is Worth 16x16 Words" explains it better if you are interested.
I, as a Real Engineer, REFUSE to use ChatGPT until we have a working theory of quantum gravity. Enough of this bullshit where no one knows the fundamentals of what they’re working with.
What are the fundamental limits of language itself? Is English somehow more "emergent" than Korean? Isn't this more interesting than the actual execution mechanism?
The business of these new LLMs is next token prediction with context. This is also now a mission because it clearly works to some large extent. Where most would not have been willing to take a leap of faith prior, many can see some path now. I've been able to suspend my disbelief around language-as-computation long enough to discover new options.
You're looking for the universal approximation theorem. It's one of those cases where they can do anything in theory, so the question is more whether we are chasing a Turing tarpit or not, where everything is possible but nothing is easy.
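For reference, one classical form of the statement (Cybenko/Hornik style, single hidden layer, with σ a sigmoidal or more generally non-polynomial activation): for any continuous f on a compact set K ⊂ R^n and any ε > 0 there exist N and weights a_i, b_i, w_i such that

    \sup_{x \in K} \Big| f(x) - \sum_{i=1}^{N} a_i\,\sigma(w_i^\top x + b_i) \Big| < \varepsilon.

The theorem says nothing about how large N has to be or whether gradient descent will actually find those weights, which is where the practical questions live.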
Fully connected neural networks are hierarchies of logistic regression nodes. Transformers are networks of SVM nodes. I guess we can expect networks of other kinds of classifiers in the future. Perhaps networks of Decision Tree nodes? Mix and match?
NNs are decision trees anyway -- take any classification alg and rewind from its decision points into a disjunction of conditions.
Or, maybe more clearly: imagine taking any classification algorithm and drawing the graph of all of its predictions across its ___domain. Then just construct a decision tree which "draws splits" along the original alg's decision edges.
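A crude version of that construction, as a sketch only (a tree fit to a trained net's predictions over a grid, not an exact conversion):

    import numpy as np
    from sklearn.datasets import make_moons
    from sklearn.neural_network import MLPClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
    net = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=2000,
                        random_state=0).fit(X, y)

    # sample the net's predictions densely over its ___domain...
    xx, yy = np.meshgrid(np.linspace(-2, 3, 200), np.linspace(-1.5, 2, 200))
    grid = np.c_[xx.ravel(), yy.ravel()]
    labels = net.predict(grid)

    # ...and fit a tree whose splits follow the net's decision edges
    tree = DecisionTreeClassifier().fit(grid, labels)
    print("agreement with the net:", (tree.predict(grid) == labels).mean())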
Likewise, all ML is equivalent to a KNN parameterised on an averaging operation.
Everything here is eqv to everything else. ML is just computing an expectation over a training dataset, weighted by the model parameters.
The "value" comes from the (copyright laundering/) data. The only question is: can you find useful weights by which to control the expectation you're taking?
Various ML approaches weight the training data differently. The most successful of the latest round of AI manages to compute weights across everything ever written -- hence more useful than a naive KNN, which wouldn't terminate on 1PB of text.
> NNs are decision trees anyway -- take any classification alg and rewind from its decision points into a disjunction of conditions
By that argument, every computation can be reduced to a lookup table. Take every possible input, memorize the correct output and store it in a database of sorts.
If decision trees were truly equivalent to NNs, you would be able to solve any problem currently addressed with NNs but using only decision trees without learning from the output of the NN. Same input datasets same output quality metrics.
Not really feasible, is it?
Likewise with all the other equivalences you made here.
Sure, every computation is equivalent to a lookup table over "predetermined answers". It isn't equivalent if we don't have those answers.
Eg., "what's the US President's telephone number in 2000?" had no answer in 1900.
> If decision trees were truly equivalent to NNs
They are equivalent. And you don't need to precompute answers you don't have. You can take the weights of a NN and encode them as a DT; just as you can also transform a NN to just be k-nearest-neighbors.
The reason we don't do that is prediction efficiency.
Also, of course, such functions are basically impossible to train as a practical matter. That bears little on their equivalence.
All ML models are expressible as k-nearest-neighbors -- this is useful information because it demystifies the process. Countless papers end with "and we don't know why!" -- where the "why" is obvious if you reformulate the model.
ML is just ranking a historical dataset of size N, by similarity to some X, selecting up to N examples from it, weighting each by W and then taking an average.
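That recipe, written out literally (a sketch of the framing only, not a claim about how any particular model is implemented):

    import numpy as np

    def predict(X_train, y_train, x, similarity, weight_fn, k=None):
        # rank the historical dataset by similarity to x
        sims = np.array([similarity(x, xi) for xi in X_train])
        order = np.argsort(-sims)
        if k is not None:
            order = order[:k]            # select up to N examples
        w = weight_fn(sims[order])       # weight each example by W
        return np.sum(w * y_train[order]) / np.sum(w)   # take an average

    # plain k-NN falls out of uniform weights; in this framing other models
    # correspond to other choices of similarity and weighting
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(100, 3))
    y_train = X_train @ np.array([1.0, -2.0, 0.5])
    print(predict(X_train, y_train, X_train[0],
                  similarity=lambda a, b: -np.linalg.norm(a - b),
                  weight_fn=np.ones_like, k=5))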
> By that argument, every computation can be reduced to a lookup table. Take every possible input, memorize the correct output and store it in a database of sorts
You're playing into his argument. You are right. All computation we know of is equivalent to a lookup table, since none of our computers are actual Turing machines.
And this highlights the difference between the software engineer way of thinking and the mathematical one
If we are talking about "hierarchies of logistic regression nodes" we have to define how to extend logistic regression to multiple outputs.
The most common approach is Multinomial logistic regression: https://en.wikipedia.org/wiki/Logistic_regression#Extensions
Other times, sigmoid might be the right answer.
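A minimal sketch of the two choices (softmax for mutually exclusive classes, independent sigmoids otherwise):

    import numpy as np

    z = np.array([2.0, -1.0, 0.5])          # logits for three outputs

    # multinomial (softmax) extension: outputs compete, probabilities sum to 1
    p_softmax = np.exp(z - z.max()) / np.exp(z - z.max()).sum()

    # independent sigmoid outputs: each output is its own binary decision
    p_sigmoid = 1.0 / (1.0 + np.exp(-z))

    print(np.round(p_softmax, 3), p_softmax.sum())   # sums to 1
    print(np.round(p_sigmoid, 3))                    # each in (0, 1), sum unconstrained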
> When you train an SVM model in sklearn, the algorithm uses a random initialization of the model parameters. This is necessary to avoid getting stuck in a local minimum during the optimization process.
> The random initialization is controlled by a parameter called the random seed. The random seed is a number that is used to initialize the random number generator. This ensures that the random initialization of the model parameters is consistent across different runs of the code
An SVM is a quadratic program, which is convex. This means it should always converge, and always to the same global optimum regardless of initialization, as long as the problem is feasible, i.e. as long as the two classes can be separated by an SVM.
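A quick way to see this with sklearn (a linear kernel so the coefficients are exposed; the dataset is just a placeholder): fitting with two different seeds gives the same hyperplane.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=200, n_features=5, random_state=0)

    a = SVC(kernel="linear", random_state=0).fit(X, y)
    b = SVC(kernel="linear", random_state=42).fit(X, y)

    # same global optimum regardless of the seed (up to solver tolerance)
    print(np.allclose(a.coef_, b.coef_), np.allclose(a.intercept_, b.intercept_))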
> In quantum mechanics, separable states are quantum states belonging to a composite space that can be factored into individual states belonging to separate subspaces. A state is said to be entangled if it is not separable. In general, determining if a state is separable is not straightforward and the problem is classed as NP-hard.
An algorithm may converge upon the same wrong - or 'high error' - answer, regardless of a random seed parameter.
It looks like there is randomization for SVMs for e.g. Platt scaling [1], though I had confused Simulated Annealing with SVMs. And then I re-read Quantum Annealing; what is the ground state of the Hamiltonian, and why would I use a hyperplane instead?
The article you’ve linked is incorrect. As Dr_Birdbrain said, fitting an SVM is a convex problem with unique global optimum. sklearn.SVC relies on libsvm which initializes the weights to 0 [0]. The random state is only used to shuffle the data to make probability estimates with Platt scaling [1]. Of the random_state parameter, the sklearn documentation for SVC [2] says
Controls the pseudo random number generation for shuffling the data for probability estimates. Ignored when probability is False. Pass an int for reproducible output across multiple function calls. See Glossary.
Which article is incorrect? Indeed it looks like there is no random initialization in libsvm or thereby sklearn.svm.SVC or in sklearn.svm.*. I seem to have confused random initialization in Simulated Annealing with SVMs; though now TIL that there are annealing SVMs, and SVMs do work with wave functions (though it's optional to map the wave functions into feature space with quantum state tomography), and that there are SVMs for the D-Wave Quantum annealer QC.
Kernel-based support vector machines (SVMs) are supervised machine learning algorithms for classification and regression problems. We introduce a method to train SVMs on a D-Wave 2000Q quantum annealer and study its performance in comparison to SVMs trained on conventional computers. The method is applied to both synthetic data and real data obtained from biology experiments. We find that the quantum annealer produces an ensemble of different solutions that often generalizes better to unseen data than the single global minimum of an SVM trained on a conventional computer, especially in cases where only limited training data is available. For cases with more training data than currently fits on the quantum annealer, we show that a combination of classifiers for subsets of the data almost always produces stronger joint classifiers than the conventional SVM for the same parameters.
My apologies for the ambiguity; I assumed it would be clear from context. The article at the link, https://saturncloud.io/blog/what-is-the-random-seed-on-svm-s..., is incorrect. Whoever wrote it seems to have confused support vector machines with neural networks.
For the D-Wave paper, I'm not sure it's fair that they are comparing an ensemble with a single classifier. I think it would be more fair if they compared their ensemble with a bagging ensemble of linear SVMs which each use the Nystroem kernel approximation [0] and which are each trained using stochastic sub-gradient descent [1].
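For concreteness, here is roughly what such a baseline might look like in sklearn (a sketch only; the dataset and all hyperparameters are placeholders):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier
    from sklearn.kernel_approximation import Nystroem
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import make_pipeline

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    # each ensemble member: Nystroem kernel approximation + linear SVM
    # trained by stochastic sub-gradient descent (hinge loss)
    base = make_pipeline(Nystroem(kernel="rbf", n_components=100, random_state=0),
                         SGDClassifier(loss="hinge"))

    ensemble = BaggingClassifier(base, n_estimators=10, max_samples=0.5,
                                 random_state=0).fit(X, y)
    print(ensemble.score(X, y))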
Nystroem defaults to an rbf radial basis function and - from quantum logic - Bloch spheres are also radial. Perhaps that's nothing.
FWIU SVMs w/ kernel trick are graphical models, and NNs are too.
How much more expensive, in resource cost, is it to train an ensemble of SVMs than one graphical model with typed relations? And how does that compare to deep learning for feature synthesis and selection plus gradient boosting with xgboost to find the coefficients/exponents of the identified terms of the expression that are not prematurely excluded by feature selection?
There are algorithmic complexity and algorithmic efficiency metrics that should be relevant to AutoML solution ranking. Opcode cost may loosely correspond to algorithmic complexity.
The weight-per-datapoint thing is actually kind of orthogonal to the concept of an SVM, but the two are conflated by most introductions to SVMs. SVMs are linear models using hinge loss. In the "primal" optimization perspective (rather than the dual problem SVMs are usually formulated as), one optimizes the feature weights like normal. This is not sparse in general, but it's not like dual SVM weights are particularly sparse in practice.
If I can expand on your "kind of", it would be that because of the kernel trick, it actually does matter that the data itself can determine the "linear" (in an infinite dimensional space, that would require infinitely many parameters under the primal formulation) model.
Kernelization can be done in the primal or the dual. Due to the representer theorem, it only ever needs as many parameters as data points. In the primal with a kernel K, you're just doing a feature expansion where each data point x corresponds to a feature whose value at each data point y is just K(x, y).
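In symbols (bias term omitted), the primal hinge-loss objective and its kernelized form, writing the function as an expansion over the data points per the representer theorem:

    \min_{w}\ \frac{\lambda}{2}\|w\|^2 + \sum_{i=1}^{n} \max\big(0,\ 1 - y_i\, w^\top x_i\big)

    \min_{\alpha}\ \frac{\lambda}{2}\,\alpha^\top K \alpha + \sum_{i=1}^{n} \max\Big(0,\ 1 - y_i \sum_{j=1}^{n} \alpha_j K(x_j, x_i)\Big), \qquad K_{ij} = K(x_i, x_j)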
Yes, SVMs don’t store weights like parametric models, but they also don’t store weights “per data point”. Only the points closest to the decision boundary are stored (i.e., the “support vectors”).
The attention matrix is computed based on all tokens in the context, so it kind of functions non-parametrically (but over the batch instead of over the whole training dataset)
Fuck, imagine how many doctoral theses I could've written every time I tweaked a few lines of code to try some abstract way of recombining outputs I didn't fully understand. I missed the boat. All this jargon is absolutely for show, though. Purely intended to create the impression that there's some kind of moat to the "discovery". There are much clearer ways to express "we fucked around with putting the outputs of this black box back into the inputs", but I guess that doesn't impress the rubes.
I think this applies to everything right now. Papers like this are just ridiculous examples. In like, 6th grade I won second place at the LA county science fair for coding a simulation of a coyote's life in hypercard (with tons of graphs). Yay. Y'know what? That shit and those graphs would've been incomprehensible to the judges if it hadn't been written in plain language, in an attempt to make them understand what they were looking at. My entire career since has been an attempt to communicate and alleviate the pain points in communication between parties, by way of writing software that encapsulated their descriptions of what they needed. And likewise I never pretended to be smarter or know more than my clients did: Everything must be explained and comprehensible in normal people language. People need to know how shit works, especially if they're paying for it.
Or they should.
Or if they don't know and don't care, they're fucking negligent.
Especially if they say "wow that sounds smart, let's let these guys run our weapons program".
To your point, the reason this ornate language thrives and people get away with complacency about how their own systems work, boils down to a silent pact between managers and engineers to sweep everything under the rug out of laziness and ill-will. There's something blatantly mendacious and evil (in the banal way) about the agreement that managers approve black boxes which were approved by complex-sounding papers so that upper management can wash their hands of the results.
[edit] maybe I'm just bitter because I spent hours today pondering exactly how many engineers at Monsanto must have known about the dangers of the astroturf, and how many raised their hand, or hid behind a spreadsheet
In math, this is mostly an English problem I think. Next time you find a Wikipedia math page to be an impenetrable wall of jargon, click the Wikipedia language tool and choose another language, any will do.
Then use Chrome's tool to machine translate the foreign language version back to English. I've found invariably this makes the article more coherent than the native English language Wikipedia math page.
I know a lot of professional mathematicians who can't make their way through the Wikipedia articles of adjacent fields. The English entries appear to be written by the ghost of Bourbaki!
What I find entertaining/confounding is how difficult the abstracts to these new AI papers are to understand. It feels like academia is pushing this style, so it’s hard to blame the authors since they have to play the game.
For reference I have an undergrad degree in computer science, have been working professionally for 25 years, and am fairly data centric in my work.
I’m hoping when I run this through GPT4 to get an explanation for a mortal software developer something sensible comes out the other end.
"Not math-y enough"/ "Needs more math" is a very common feedback ML/AI researchers get when writing papers.
The other day I was watching a live-stream of a doctoral defense, as the thesis was quite relevant to my work.
So one of the committee members would really pick at and criticize the math - ask questions like "You are supposed to be the bleeding edge on this topic, why was the math so simple? Did you research more rigorous theories to explain the math?" etc. (He was awarded the doctorate, though.)
So, I dunno, if that's how things are now - it makes sense to me that the authors go overboard with complicated notation, even if they could have written it much simpler. Probably makes the work seem more rigorous and legit.
Doesn't really take that much more time, and it covers your ass from "not rigorous enough" gotchas - though at the expense of readability.
Go read any article in the first 200 years or so of philtrans. There's lots of crucial science there written in a way that doesn't have the modern trappings of the form. It's good reading. Maybe some style perturbations borrowed from earlier eras would be good
If the excuse is true and the "ornate" language really is a dense representation of information then it should be fairly trivial to have an LLM agent unsummarize it.
There could be a webservice that offers a parallel track of layman's translations of any paper.
I haven't tried other models, but if you prompt a recent ChatGPT with "academic style" and ask it to "review and provide feedback" on a paragraph you wrote, it will reword it using the most fancy, overselling words it can find.
I liked to use it for improving grammar and style, but in later iterations ChatGPT started writing garbage...
I'm not sure if that is because of training, feedback from users, or an attempt to make usage of LLMs obvious to teachers.
Yep, ~18th century.
Didn't Wittgenstein and/or Nietzsche say something similar?
Words are inadequate for communication, and all philosophy is playing with words.
But language is all we have to communicate, so I guess we are stuck with it.
I wish also. When I was young and new, I wasted so much time trying to parse 'arcane' math that was really something simple but dressed up as complicated to give it weight.
Watching the AI community rediscover automatic differentiation 20+ years after the field was considered "mature" was equal parts frustrating and fascinating. The frustrating part was watching them rewrite the history of discovery without any sort of sense or rigor... and that was also the most fascinating part!
I'm waiting for some fresh group of grad students to make a breakthrough using a reinvented version of Pearl's "do" calculus, or maybe they make some narrow breakthrough using BayesNets and everyone geeks out on those for a while.
*I do think transformers (much like ff networks + backprop from 2012-2018) are probably a lasting software architecture for inference applications until we come up with new hardware, and move beyond GPU focused computing
It's exciting to see it all working, but disheartening how a-historical these last few years have been in AI - with the exception of Brooks, Sutton, and a few other greybeards in the field who say similarly.
Programmers excel at it; you don't see that happening much with Physics, Chemistry, Math, Materials Science, etc.
Funnily enough, it has also been observed before, by Alan Kay comparing programming to a pop culture: "In the last 25 years or so, we actually got something like a pop culture."
https://queue.acm.org/detail.cfm?id=1039523#:~:text=In%20the...
I think the main motivation in ML theory that touches current SOTA is not "expressing simple ideas with jargon for show". Jargon is necessary, even if some (mostly very practical) engineers or software people cannot see that, because of how unnecessary it seems to them (as they are used to expressing themselves practically and quickly). It's a jargon for the mathematics of machine learning, which is pretty unstandardized, so to speak. So you need to define things yourself. And without a jargon and clear proofs, what you do is just brainstorming at most. The value of such work is that its statements are pretty clear, proved, and contain hypotheses which can be tested by future papers.
Here is an example: to explain the existence of adversarial examples, there are two suggestions without a jargon: 1) that the decision boundary is too nonlinear, 2) that the decision boundary is too linear. These two explanations contradict each other and are stated without any real proof, yet unfortunately both can be widely heard in most of the adversarial example papers. If we had clear formulations of these two statements, we could have tested both claims, but unfortunately the papers that suggested these theories didn't put effort into defining a jargon and stating their suggestion as a clear, formal statement.
I studied using ML just over a decade ago. I actually compared MLPs to SVMs and had a similar thought to this. It does seem like there is a regression on understanding some of the fundamentals and older tools of the trade.
I guess everyone gets focused on the newer things.
Really does seem like people rediscovering older endpoints.
There's been a huge flood of vanilla software engineers into ML, retconning it as "a subfield of computer science" (computability is a minor concern compared to the statistical underpinnings). They pretend to know the math because they can read the equations, then claim with utmost confidence that actually they're doing all the hard work in ML because they are experts in calling APIs and integrating into products, however useful or useless.
I'm referring obliquely to a specific nitpick from select CS folks who argue that because the theoretical optimum is not computable in finite time/memory that the statistical basis for understanding ML is irrelevant.
Really? Here I thought it was a flood of academics retconning neural net code into a "science" now that programmers had made it run in Python for them fast enough to be useful.
As hacky as it ends up being in practice, there are some pretty solid theoretical fundamentals to the field of statistical learning.
The problem is the theory is constrained either to the micro-scale (individual layers/"simple" models, etc.) or to the supra-scale (optimization/learning theory, etc.).
Not much concrete can be said about the macro-scale (individual networks) in theoretical terms, only that empirically they seem tend toward the things the supra-scale theory says they should do.
The current controversy in the academia v engineers tussle is 1) what exactly do the empirical results imply and 2) how much does the theory really matter given the practical outcomes. The only thing the two sides broadly agree upon is that some amount of error will always exist because NNs can be broadly understood as lossy compression machines.
There's a similar trend of accusing LLMs of not really understanding, being just pattern machines. Funny that a whole group of people get the same treatment.
I'm not sure what your point is. Many software engineers blatantly pattern-match and copy-paste code to try to get their own stuff to work without understanding what's really going on. This is a long-standing complaint in the industry.
It's a term from literature/fiction, where later installments try to explain something from earlier installments; usually in a way that's nothing like the original intent. Applying this to real history is more like "revisionism".
"RETroactive CONtinuity", in addition to what the sibling comment said.
The Wikipedia top example is Sherlock Holmes dying in a fight with Moriarty and then coming back later when the author relented and decided to write more stories.
> we show that over-parameterization catalyzes global convergence by ensuring the feasibility of the SVM problem and by guaranteeing a benign optimization landscape devoid of stationary points
does this mean 'an over-parameterized transformer problem is a convex svm problem'?
The irony is that your "simplification" uses even more "jargon."
But yes, that's how I would read that, and I also see no issue at all with the language in the paper. These terms are used for precision, and have meaning to those in the field. Papers are written for other experts, not laymen.
OK, but why do they write "benign optimization landscape devoid of stationary points" instead of "convex", other than "just for show"? In my understanding it's not better for either audience, experts or laymen. For experts it would be clearer to just say convex and they would know the implications, and if someone doesn't know what convex means they probably also aren't going to be on board with 'stationary points'. Also, I'm not trying to pick on the authors; I'm just trying to answer the question of which specific parts could be seen as 'just for show'.
I wouldn’t go this far, applied ML articles are my favorite articles. If you’re in the arena, it’s good to see things that other people have done from a practical perspective so you can ape it in your own work or not give it further consideration.
FFT describes everything with sinusoids by default. For QFT, it's wave functions. Wavelets are more like NN neurons that match (scale-invariant?) quantized sections of waveforms.
A classical universal function approximator is probably not sufficient to approximate quantum systems
(unless there is IDK a geometric breakthrough in classical-quantum correspondence similar to the Amplituhedron).
IIUC Church-Turing and Church-Turing-Deutsch say that Turing complete is enough for classical computing, and that a qubit computer can simulate the same quantum logic circuits as any qudit or qutrit computer; but is it ever shown that Quantum Logic is indeed the correct and sufficient logic for propositional calculus and also for all physical systems?
> - The rotation operators Rx(θ), Ry(θ), Rz(θ), the phase shift gate P(φ)[c] and CNOT are commonly used to form a universal quantum gate set.
> - The Clifford set {CNOT, H, S} + T gate. The Clifford set alone is not a universal quantum gate set, as it can be efficiently simulated classically according to the Gottesman–Knill theorem.
> - The Toffoli gate + Hadamard gate.[17] The Toffoli gate alone forms a set of universal gates for reversible boolean algebraic logic circuits which encompasses all classical computation.
[...]
> - The parametrized three-qubit Deutsch gate D(θ)
> A universal logic gate for reversible classical computing, the Toffoli gate, is reducible to the Deutsch gate, D(π/2), thus showing that all reversible classical logic operations can be performed on a universal quantum computer.
Turing machines can also be used as universal function approximators. But I'm not sure it makes sense to put them in the same category as the other two.
I would love to put them in the same category as the other two. In fact I’ve spent quite a lot of time thinking about it / experimenting. Wouldn’t it be great if we could somehow train on data and get a small Turing machine instead of a huge neural network?
Both RISC and CISC are usually used in the context of describing Turing complete instruction sets. I'm not sure it's relevant here?
If you want to make a comparison in this flavour: Turing machines are a bit like CPUs in that they can execute arbitrary things in sequence. All the flavours of machine learning are more like GPUs: they do well with oodles of big, parallelisable matrix multiplications interspersed with some simple non-linear transformations.
Well, we're talking about a "native" implementation of both for comparison, right? Neural nets as they're being used are just being emulated by our Turing-machine-like processors, which makes them run like ass in practice. Something like an analog circuit that adds up voltages would be a native NN implementation and would surely vastly outperform any Turing machine in wide, highly parallel, heavily memory driven tasks that are well suited for it, and either one emulating the other is slow and bloated.
The term hyperplane already assumes that the hypothesis space that your learning algorithm searches has some kind of dimension and is some variant of an Euclidean / vector space (and its generalisations). This is not the case for many forms of ML, for example grammar induction (where the hypothesis space is Chomsky-style grammars) or inductive logic programming (hypothesis space are Prolog (or similar) programs), or, more generally, program synthesis (where programs form the hypothesis space).
Note that "some sort of partitioning" isn't a hyperplane. A partition is a set-theoretic concept. A hyperplane is (a generalisation of) a geometric concept, so has much more structure.
Someday I'm going to write a paper that achieves SOTA results with a nigh-incomprehensible mishmash of diverse techniques and title it "All You Need Considered Harmful".
A hyperplane is the zero set of a multi-dimensional linear (affine) function; it splits space into two distinct regions. In the context of a classifier, it splits feature space into disjoint sub-spaces (one for each class). SVMs place a hyperplane with maximum margin, thereby separating the classes in an optimal way.
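In symbols (hard-margin case, for simplicity):

    \text{hyperplane: } \{x : w^\top x + b = 0\}, \qquad
    \min_{w,b}\ \tfrac{1}{2}\|w\|^2 \ \text{ s.t. } \ y_i\,(w^\top x_i + b) \ge 1 \ \ \forall i

Maximizing the margin 2/\|w\| between the two classes is the same as minimizing \|w\|^2 under those constraints.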
Worth keeping in mind that though it may be optimal according to some mathematical criterion, that is no guarantee that it's the best for the purposes you have in mind.
Or as the subspace of all vectors that are orthogonal to a given single vector, or as the subspace generated by any orthogonal basis with one basis vector removed, or as the kernel of a linear form, ... - but a more visual explanation is probably better as a first foray into the question.
I agree that a more visual explanation is better in general.
I was trying to hint how the visual explanation relates to the long vectors of numbers we actually feed our machine learning contraptions with. Not sure I was successful.
Yeah, but only a few are made up to seem like terms of art designed to obfuscate their actual meaning; and usually prepending "hyper-" to something is a signal that a more clear description of the thing doesn't yet exist.
I regret to inform you, I don’t think it’s the same set of people. Writing a cute “Transformers are SVMs” paper and “building chatGPT” are not the same skillset.