I am one of the authors. The most critical aspect is that the transformer is a "different kind of SVM". It solves an SVM that separates 'good' tokens within each input sequence from 'bad' tokens. This SVM serves as a good-token selector and is inherently different from the traditional SVM, which assigns a 0-1 label to inputs.
This also explains how attention induces sparsity through softmax: 'Bad' tokens that fall on the wrong side of the SVM decision boundary are suppressed by the softmax function, while 'good' tokens are those that end up with non-zero softmax probabilities. It is also worth mentioning this SVM arises from the exponential nature of the softmax.
The title of the paper does not make this clear but hopefully the abstract does :).
It is interesting that you have cited this paper but did not correctly acknowledge its contribution. Yeah, I get all that "they are doing X and we are doing X+1" narrative, but the fact that you have defined "good" tokens by multiplying Y_i into your head function is not much different from assigning a 0-1 label to inputs in a traditional SVM. Your "Y_i" essentially serves as a 0-1 label in the SVM.
Sounds like a mind game of re-branding existing concepts lol.
This seems related to NTK literature i.e. wide neural nets behave like kernel regression. NTK is a great tool but a notable weakness is kernel view doesn't explain how the model learns new features. Transformer is also pretty different from standard neural architectures because tokens interact with each other through attention. Our goal was capturing this interaction and we believe there is a clean insight on feature learning: Attention is running a token-selection procedure by implementing an SVM that separates tokens.
See our re-examination of the kernel equivalence. Path kernels exactly measure how models learn as their understanding of the data improves during training, and this can be expressed in terms of the gradients with respect to each training input: https://arxiv.org/abs/2308.00824
We believe that all neural networks are effectively an SVM, or more generally a reproducing kernel architecture, implicitly layering the understanding contributed during each training iteration. Do you have any comment on the RKHS or RKBS context for transformers?
When you say SVM, do you mean any classifier that finds a separating hyperplane, like a no-hidden-layer "perceptron" or Naive Bayes, instead of one which finds the maximum margin hyperplane? Or is finding the maximum margin important here? Thanks. Very interesting.
I think our own brains and nervous system use a step-function as their "activation function", so this could - optimistically - be a throwback to the roots of Rosenblatt's idea.
This SVM summarizes the training dynamics of the attention layer, so there is no hidden layer. It operates on the token embeddings of that layer. Essentially, the weights of the attention layer converge (in direction) to the maximum-margin separator between the good and bad tokens. Note that there is no label involved; instead, you are separating the tokens based on their contribution to the training loss. We can formally assign a "score" to each token for a 1-layer model, but this is tricky to do for multilayer models with MLP heads.
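Roughly, glossing over the exact norm and technical conditions (see the paper for the precise statement), the attention weights W converge in direction to the solution of a max-margin problem of the form

    \min_{W} \|W\| \quad \text{s.t.} \quad (x_{i,\mathrm{opt}_i} - x_{i,t})^\top W z_i \ \ge\ 1 \quad \text{for all sequences } i \text{ and all tokens } t \neq \mathrm{opt}_i,

where x_{i,t} are the token embeddings of sequence i, z_i is the query token, and opt_i indexes the locally optimal ("good") token being selected.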
Finally, I agree that this is more step-function like. There are caveats we discuss in the paper (i.e. how TF assigns continuous softmax probabilities over the selected tokens).
To me, summary is: Through softmax-attention, transformer is running a "feature/token selection procedure". Thanks to softmax, we can obtain a clean SVM interpretation of max-margin token separation.
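As a toy illustration of the selection effect (a minimal numpy sketch, not the construction from the paper; the embeddings and the direction w here are made up):

    import numpy as np

    def softmax(s):
        e = np.exp(s - s.max())
        return e / e.sum()

    # toy token embeddings: two 'good' tokens, two 'bad' tokens
    X = np.array([[ 2.0,  0.5],
                  [ 1.5, -0.2],
                  [-1.0,  0.3],
                  [-2.0, -0.5]])
    w = np.array([1.0, 0.0])   # stands in for the learned attention direction

    for c in [1, 5, 25]:       # scaling the weights up, as happens during training
        print(c, np.round(softmax(c * (X @ w)), 3))
    # as c grows, the softmax mass concentrates on the tokens on the positive
    # side of the separating direction; tokens on the wrong side are suppressed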
> It solves an SVM that separates 'good' tokens within each input sequence from 'bad' tokens. This SVM serves as a good-token-selector and is inherently different from the traditional SVM which assigns a 0-1 label to inputs.
Sorry, but how is separating 'good' tokens from 'bad' tokens inherently different from assigning a 0-1 label?
Standard SVM classifier: Maps an input sequence to a 0-1 label. Example: Take a paragraph and return its sentiment. During training, the label is specified.
Transformer's SVM: Takes input sequence, suppresses bad tokens and passes good tokens to the next layer. This is a token-selector rather than classifier.
Example: Take a paragraph and output the salient words in the paragraph. We don't know which words are salient during training, the model has to figure them out during training.
I have read that SVMs as machine learning model failed to take off because of their inability to scale relative to deep neural networks. Would your work point to ways of changing this?
IMO it is important to understand transformer mechanics through core ML themes like SVM and feature selection. Our results are not an interpretation; they are mathematically rigorous and numerically verifiable. That said, we have no intention of trivializing a complex model like GPT-4 as a simple SVM. That is a tall order :)
Practically speaking, does this give us anything interesting from an implementation perspective? My uneducated reading of this is that a single SVM layer is equivalent to the multiple steps in a transformer layer. I'm guessing it can't reduce the number of computations purely from an information theory argument, but doesn't it imply a radically simpler and easier to implement architecture?
I just read the abstract so I could be way off, but this sounds more like one of those papers that connect seemingly different mathematical formalisms and show their equivalence (often under some restrictions). Typically they don’t give us much immediate benefit in terms of implementation, but they add to the intuitive understanding of what we’re doing, and sometimes help others make more practical progress.
I'm not an expert in this, so hopefully someone more knowledgeable can weigh in - but SVMs are understood much better from the perspective of overfitting and things like the VC bound, while Transformers are not really understood as well. From what I remember it's quite easy to have an SVM overfit, while Transformers have fewer issues. It'd be interesting to understand why.
So if the two are somehow connected, then that could have implications for tuning and fighting overfitting
maybe it'd also be possible to design better non-overfitting SVMs
> From what I remember it's quite easy to have a SVM overfit ... It'd be interesting to understand why
SVMs with well-tuned kernels and regularization are reasonably resistant to overfitting. The problem is that you can easily end up overfitting the hyperparameters if you're not very careful about how you do performance testing.
Those equivalences can connect two different fields and allow transferring methods from one field to the other. Each field usually has developed quite a number of methods and tricks over time. So when this work shows that they are equivalent (with restrictions), you can maybe take some of the tricks of SVMs and try to use them to improve the Transformer model or its training.
Otherwise, they just help us in better understanding Transformers and SVMs.
There have been similar equivalences before, for example:
Or policy gradient methods from reinforcement learning, which are basically the same as sequence-discriminative training as it has been done in speech recognition for many years; however, they come with different tricks, and combining the tricks was helpful.
I am waiting for someone publishing the theoretical limits of these "AI" systems. They're certainly impressive language models - don't get me wrong on that. But every algorithm and every model has its limits. To know the limits turns their application from hype into engineering. And of course, the hype-sellers will try to keep that from happening as long as possible.
This theorem explains the limits. Putting it in simple terms: most architectures are universal approximators that are constrained by the inductive bias we give them. So far, the approximator architecture least constrained by its inductive bias is the transformer, so it should be able to approximate any mathematical function. The current problem is that the attention mechanism has quadratic scaling, so while it is easy to scale on text, it is pretty hard to scale on anything else to the same performance. Even if no further discoveries are made, the compute power of the future alone should let it scale in every field; even with today's techniques it gives pretty good performance on a lot of tasks.
This review by Yannic Kilcher of the paper "An Image Is Worth 16x16 Words" explains it better if you are interested.
I, as a Real Engineer, REFUSE to use ChatGPT until we have a working theory of quantum gravity. Enough of this bullshit where no one knows the fundamentals of what they’re working with.
What are the fundamental limits of language itself? Is English somehow more "emergent" than Korean? Isn't this more interesting than the actual execution mechanism?
The business of these new LLMs is next token prediction with context. This is also now a mission because it clearly works to some large extent. Where most would not have been willing to take a leap of faith prior, many can see some path now. I've been able to suspend my disbelief around language-as-computation long enough to discover new options.
You're looking for the universal approximation theorem. It's one of those cases where they can do anything in theory, so the question is more whether we are chasing a Turing tarpit or not, where everything is possible but nothing is easy.
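For reference, one classical form of the statement (Cybenko/Hornik style, single hidden layer, with σ a sigmoidal or more generally non-polynomial activation): for any continuous f on a compact set K ⊂ R^n and any ε > 0 there exist N and weights a_i, b_i, w_i such that

    \sup_{x \in K} \Big| f(x) - \sum_{i=1}^{N} a_i\,\sigma(w_i^\top x + b_i) \Big| < \varepsilon.

The theorem says nothing about how large N has to be or whether gradient descent will actually find those weights, which is where the practical questions live.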
Fully connected neural networks are hierarchies of logistic regression nodes. Transformers are networks of SVM nodes. I guess we can expect networks of other kinds of classifiers in the future. Perhaps networks of Decision Tree nodes? Mix and match?
NNs are decision trees anyway -- take any classification alg and rewind from its decision points into a disjunction of conditions.
Or, maybe more clearly: imagine taking any classification algorithm and drawing the graph of all of its predictions across its ___domain. Then just construct a decision tree which "draws splits" along the original alg's decision edges.
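A crude version of that construction, as a sketch only (a tree fit to a trained net's predictions over a grid, not an exact conversion):

    import numpy as np
    from sklearn.datasets import make_moons
    from sklearn.neural_network import MLPClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
    net = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=2000,
                        random_state=0).fit(X, y)

    # sample the net's predictions densely over its ___domain...
    xx, yy = np.meshgrid(np.linspace(-2, 3, 200), np.linspace(-1.5, 2, 200))
    grid = np.c_[xx.ravel(), yy.ravel()]
    labels = net.predict(grid)

    # ...and fit a tree whose splits follow the net's decision edges
    tree = DecisionTreeClassifier().fit(grid, labels)
    print("agreement with the net:", (tree.predict(grid) == labels).mean())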
Likewise, all ML is equivalent to a KNN parameterised on an averaging operation.
Everything here is eqv to everything else. ML is just computing an expectation over a training dataset, weighted by the model parameters.
The "value" comes from the (copyright laundering/) data. The only question is: can you find useful weights by which to control the expectation you're taking?
Various ML approaches weight the training data differently. The most successful of the latest round of AI manages to compute weights across everything ever written -- hence more useful than a naive KNN, which wouldn't terminate on 1PB of text.
> NNs are decision trees anyway -- take any classification alg and rewind from its decision points into a disjunction of conditions
By that argument, every computation can be reduced to a lookup table. Take every possible input, memorize the correct output and store it in a database of sorts.
If decision trees were truly equivalent to NNs, you would be able to solve any problem currently addressed with NNs but using only decision trees without learning from the output of the NN. Same input datasets same output quality metrics.
Not really feasible, is it?
Likewise with all the other equivalences you made here.
Sure, every computation is equivalent to a lookup table over "predetermined answers". It isn't equivalent if we don't have those answers.
Eg., "what's the US President's telephone number in 2000?" had no answer in 1900.
> If decision trees were truly equivalent to NNs
They are equivalent. And you don't need to precompute answers you don't have. You can take the weights of a NN and encode them as a DT; just as you can also transform a NN to just be k-nearest-neighbors.
The reason we don't do that is prediction efficiency.
Also, of course, such functions are basically impossible to train as a practical matter. That bears little on their equivalence.
All ML models are expressible as k-nearest-neighbors -- this is useful information because it demystifies the process. Countless papers end with "and we don't know why!" -- where the "why" is obvious if you reformulate the model.
ML is just ranking a historical dataset of size N, by similarity to some X, selecting up to N examples from it, weighting each by W and then taking an average.
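That recipe, written out literally (a sketch of the framing only, not a claim about how any particular model is implemented):

    import numpy as np

    def predict(X_train, y_train, x, similarity, weight_fn, k=None):
        # rank the historical dataset by similarity to x
        sims = np.array([similarity(x, xi) for xi in X_train])
        order = np.argsort(-sims)
        if k is not None:
            order = order[:k]            # select up to N examples
        w = weight_fn(sims[order])       # weight each example by W
        return np.sum(w * y_train[order]) / np.sum(w)   # take an average

    # plain k-NN falls out of uniform weights; in this framing other models
    # correspond to other choices of similarity and weighting
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(100, 3))
    y_train = X_train @ np.array([1.0, -2.0, 0.5])
    print(predict(X_train, y_train, X_train[0],
                  similarity=lambda a, b: -np.linalg.norm(a - b),
                  weight_fn=np.ones_like, k=5))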
> By that argument, every computation can be reduced to a lookup table. Take every possible input, memorize the correct output and store it in a database of sorts
You're playing into his argument. You are right. All computation we know of is equivalent to a lookup table, since none of our computers are actual Turing machines.
And this highlights the difference between the software engineer way of thinking and the mathematical one
If we are talking about "hierarchies of logistic regression nodes" we have to define how to extend logistic regression to multiple outputs.
The most common approach is Multinomial logistic regression: https://en.wikipedia.org/wiki/Logistic_regression#Extensions
Other times, sigmoid might be the right answer.
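A minimal sketch of the two choices (softmax for mutually exclusive classes, independent sigmoids otherwise):

    import numpy as np

    z = np.array([2.0, -1.0, 0.5])          # logits for three outputs

    # multinomial (softmax) extension: outputs compete, probabilities sum to 1
    p_softmax = np.exp(z - z.max()) / np.exp(z - z.max()).sum()

    # independent sigmoid outputs: each output is its own binary decision
    p_sigmoid = 1.0 / (1.0 + np.exp(-z))

    print(np.round(p_softmax, 3), p_softmax.sum())   # sums to 1
    print(np.round(p_sigmoid, 3))                    # each in (0, 1), sum unconstrained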
> When you train an SVM model in sklearn, the algorithm uses a random initialization of the model parameters. This is necessary to avoid getting stuck in a local minimum during the optimization process.
> The random initialization is controlled by a parameter called the random seed. The random seed is a number that is used to initialize the random number generator. This ensures that the random initialization of the model parameters is consistent across different runs of the code
An SVM is a quadratic program, which is convex. This means it should always converge, and always to the same global optimum regardless of initialization, as long as the problem is feasible, i.e. as long as the two classes can be separated by an SVM.
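A quick way to see this with sklearn (a linear kernel so the coefficients are exposed; the dataset is just a placeholder): fitting with two different seeds gives the same hyperplane.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=200, n_features=5, random_state=0)

    a = SVC(kernel="linear", random_state=0).fit(X, y)
    b = SVC(kernel="linear", random_state=42).fit(X, y)

    # same global optimum regardless of the seed (up to solver tolerance)
    print(np.allclose(a.coef_, b.coef_), np.allclose(a.intercept_, b.intercept_))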
> In quantum mechanics, separable states are quantum states belonging to a composite space that can be factored into individual states belonging to separate subspaces. A state is said to be entangled if it is not separable. In general, determining if a state is separable is not straightforward and the problem is classed as NP-hard.
An algorithm may converge upon the same wrong - or 'high error' - answer, regardless of a random seed parameter.
It looks like there is randomization for SVMs for e.g. Platt scaling [1], though I had confused Simulated Annealing with SVMs. And then I re-read Quantum Annealing; what is the ground state of the Hamiltonian, and why would I use a hyperplane instead?
The article you’ve linked is incorrect. As Dr_Birdbrain said, fitting an SVM is a convex problem with unique global optimum. sklearn.SVC relies on libsvm which initializes the weights to 0 [0]. The random state is only used to shuffle the data to make probability estimates with Platt scaling [1]. Of the random_state parameter, the sklearn documentation for SVC [2] says
Controls the pseudo random number generation for shuffling the data for probability estimates. Ignored when probability is False. Pass an int for reproducible output across multiple function calls. See Glossary.
Which article is incorrect? Indeed it looks like there is no random initialization in libsvm or thereby sklearn.svm.SVC or in sklearn.svm.*. I seem to have confused random initialization in Simulated Annealing with SVMs; though now TIL that there are annealing SVMs, and SVMs do work with wave functions (though it's optional to map the wave functions into feature space with quantum state tomography), and that there are SVMs for the D-Wave Quantum annealer QC.
Kernel-based support vector machines (SVMs) are supervised machine learning algorithms for classification and regression problems. We introduce a method to train SVMs on a D-Wave 2000Q quantum annealer and study its performance in comparison to SVMs trained on conventional computers. The method is applied to both synthetic data and real data obtained from biology experiments. We find that the quantum annealer produces an ensemble of different solutions that often generalizes better to unseen data than the single global minimum of an SVM trained on a conventional computer, especially in cases where only limited training data is available. For cases with more training data than currently fits on the quantum annealer, we show that a combination of classifiers for subsets of the data almost always produces stronger joint classifiers than the conventional SVM for the same parameters.
My apologies for the ambiguity; I assumed it would be clear from context. The article at the link, https://saturncloud.io/blog/what-is-the-random-seed-on-svm-s..., is incorrect. Whoever wrote it seems to have confused support vector machines with neural networks.
For the D-Wave paper, I'm not sure it's fair that they are comparing an ensemble with a single classifier. I think it would be more fair if they compared their ensemble with a bagging ensemble of linear SVMs which each use the Nystroem kernel approximation [0] and which are each trained using stochastic sub-gradient descent [1].
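For concreteness, here is roughly what such a baseline might look like in sklearn (a sketch only; the dataset and all hyperparameters are placeholders):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier
    from sklearn.kernel_approximation import Nystroem
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import make_pipeline

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    # each ensemble member: Nystroem kernel approximation + linear SVM
    # trained by stochastic sub-gradient descent (hinge loss)
    base = make_pipeline(Nystroem(kernel="rbf", n_components=100, random_state=0),
                         SGDClassifier(loss="hinge"))

    ensemble = BaggingClassifier(base, n_estimators=10, max_samples=0.5,
                                 random_state=0).fit(X, y)
    print(ensemble.score(X, y))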
Nystroem defaults to an rbf radial basis function and - from quantum logic - Bloch spheres are also radial. Perhaps that's nothing.
FWIU SVMs w/ kernel trick are graphical models, and NNs are too.
How much more expensive, in resource cost, is it to train an ensemble of SVMs than one graphical model with typed relations? And how does that compare to deep learning for feature synthesis and selection plus gradient boosting with xgboost to find the coefficients/exponents of the identified terms of the expression that are not prematurely excluded by feature selection?
There are algorithmic complexity and algorithmic efficiency metrics that should be relevant to AutoML solution ranking. Opcode cost may loosely correspond to algorithmic complexity.
The weight-per-datapoint thing is actually kind of orthogonal to the concept of an SVM, but the two are conflated by most introductions to SVMs. SVMs are linear models using hinge loss. In the "primal" optimization perspective (rather than the dual problem SVMs are usually formulated as), one optimizes the feature weights like normal. This is not sparse in general, but it's not like dual SVM weights are particularly sparse in practice.
If I can expand on your "kind of", it would be that because of the kernel trick, it actually does matter that the data itself can determine the "linear" (in an infinite dimensional space, that would require infinitely many parameters under the primal formulation) model.
Kernelization can be done in the primal or the dual. Due to the representer theorem, it only ever needs as many parameters as data points. In the primal with a kernel K, you're just doing a feature expansion where each data point x corresponds to a feature whose value at each data point y is just K(x, y).
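In symbols (bias term omitted), the primal hinge-loss objective and its kernelized form, writing the function as an expansion over the data points per the representer theorem:

    \min_{w}\ \frac{\lambda}{2}\|w\|^2 + \sum_{i=1}^{n} \max\big(0,\ 1 - y_i\, w^\top x_i\big)

    \min_{\alpha}\ \frac{\lambda}{2}\,\alpha^\top K \alpha + \sum_{i=1}^{n} \max\Big(0,\ 1 - y_i \sum_{j=1}^{n} \alpha_j K(x_j, x_i)\Big), \qquad K_{ij} = K(x_i, x_j)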
Yes, SVMs don’t store weights like parametric models, but they also don’t store weights “per data point”. Only the points closest to the decision boundary are stored (i.e., the “support vectors”).
The attention matrix is computed based on all tokens in the context, so it kind of functions non-parametrically (but over the batch instead of over the whole training dataset)
Fuck, imagine how many doctoral theses I could've written every time I tweaked a few lines of code to try some abstract way of recombining outputs I didn't fully understand. I missed the boat. All this jargon is absolutely for show, though. Purely intended to create the impression that there's some kind of moat to the "discovery". There are much clearer ways to express "we fucked around with putting the outputs of this black box back into the inputs", but I guess that doesn't impress the rubes.
I think this applies to everything right now. Papers like this are just ridiculous examples. In like, 6th grade I won second place at the LA county science fair for coding a simulation of a coyote's life in hypercard (with tons of graphs). Yay. Y'know what? That shit and those graphs would've been incomprehensible to the judges if it hadn't been written in plain language, in an attempt to make them understand what they were looking at. My entire career since has been an attempt to communicate and alleviate the pain points in communication between parties, by way of writing software that encapsulated their descriptions of what they needed. And likewise I never pretended to be smarter or know more than my clients did: Everything must be explained and comprehensible in normal people language. People need to know how shit works, especially if they're paying for it.
Or they should.
Or if they don't know and don't care, they're fucking negligent.
Especially if they say "wow that sounds smart, let's let these guys run our weapons program".
To your point, the reason this ornate language thrives and people get away with complacency about how their own systems work, boils down to a silent pact between managers and engineers to sweep everything under the rug out of laziness and ill-will. There's something blatantly mendacious and evil (in the banal way) about the agreement that managers approve black boxes which were approved by complex-sounding papers so that upper management can wash their hands of the results.
[edit] maybe I'm just bitter because I spent hours today pondering exactly how many engineers at Monsanto must have known about the dangers of the astroturf, and how many raised their hand, or hid behind a spreadsheet
In math, this is mostly an English problem I think. Next time you find a Wikipedia math page to be an impenetrable wall of jargon, click the Wikipedia language tool and choose another language, any will do.
Then use Chrome's tool to machine translate the foreign language version back to English. I've found invariably this makes the article more coherent than the native English language Wikipedia math page.
I know a lot of professional mathematicians who can't make their way through the Wikipedia articles of adjacent fields. The English entries appear to be written by the ghost of Bourbaki!
What I find entertaining/confounding is how difficult the abstracts to these new AI papers are to understand. It feels like academia is pushing this style, so it’s hard to blame the authors since they have to play the game.
For reference I have an undergrad degree in computer science, have been working professionally for 25 years, and am fairly data centric in my work.
I’m hoping when I run this through GPT4 to get an explanation for a mortal software developer something sensible comes out the other end.
"Not math-y enough"/ "Needs more math" is a very common feedback ML/AI researchers get when writing papers.
The other day I was watching a live-stream of a doctoral defense, as the thesis was quite relevant to my work.
So one of the committee members would really pick at and criticize the math - ask questions like "You are supposed to be the bleeding edge on this topic, why was the math so simple? Did you research more rigorous theories to explain the math?" etc. (He was awarded the doctorate, though.)
So, I dunno, if that's how things are now - it makes sense to me that the authors go overboard with complicated notation, even if they could have written it much simpler. Probably makes the work seem more rigorous and legit.
Doesn't really take that much more time, and it covers your ass from "not rigorous enough" gotchas - though at the expense of readability.
Go read any article in the first 200 years or so of philtrans. There's lots of crucial science there written in a way that doesn't have the modern trappings of the form. It's good reading. Maybe some style perturbations borrowed from earlier eras would be good
If the excuse is true and the "ornate" language really is a dense representation of information then it should be fairly trivial to have an LLM agent unsummarize it.
There could be a webservice that offers a parallel track of layman's translations of any paper.
I haven't tried other models, but if you prompt a recent ChatGPT with "academic style" and ask it to "review and provide feedback" on a paragraph you wrote, it will reword it using the most fancy, overselling words it can find.
I liked to use it for improving grammar and style, but in later iterations ChatGPT started writing garbage...
I'm not sure if that is because of training, feedback from users, or an attempt to make usage of LLMs obvious to teachers.
Yep, ~18th century.
Didn't Wittgenstein and/or Nietzsche say something similar?
Words are inadequate for communication, and all philosophy is playing with words.
But language is all we have to communicate, so I guess we are stuck with it.
I wish also. When I was young and new, I wasted so much time trying to parse 'arcane' math that was really something simple but dressed up as complicated to give it weight.
Watching the AI community rediscover automatic differentiation 20+ years after the field was considered "mature" was equal parts frustrating and fascinating. The frustrating part was watching them rewrite the history of discovery without any sort of sense or rigor... and that was also the most fascinating part!
I'm waiting for some fresh group of grad students to make a breakthrough using a reinvented version of Pearl's "do" calculus, or maybe they make some narrow breakthrough using BayesNets and everyone geeks out on those for a while.
*I do think transformers (much like ff networks + backprop from 2012-2018) are probably a lasting software architecture for inference applications until we come up with new hardware, and move beyond GPU focused computing
It's exciting to see it all working, but disheartening how a-historical these last few years have been in AI - with the exception of Brooks, Sutton, and a few other greybeards in the field who say similarly.
Programmers excel at it; you don't see that happening much with Physics, Chemistry, Math, Materials Science, etc.
Funnily enough, it has also been observed before, by Alan Kay comparing programming to a pop culture: "In the last 25 years or so, we actually got something like a pop culture."
https://queue.acm.org/detail.cfm?id=1039523#:~:text=In%20the...
I think the main motivation in ML theory that touches current SOTA is not "expressing simple ideas with jargon for show". Jargon is necessary, even if some (mostly very practical) engineers or software people cannot see that, because of how unnecessary it seems to them (as they are used to expressing themselves practically and quickly). It's a jargon for the mathematics of machine learning, which is pretty unstandardized, so to speak. So you need to define things yourself. And without a jargon and clear proofs, what you do is just brainstorming at most. The value of such work is that its statements are pretty clear, proved, and contain hypotheses which can be tested by future papers.
Here is an example: to explain the existence of adversarial examples, there are two suggestions without a jargon: 1) that the decision boundary is too nonlinear, 2) that the decision boundary is too linear. These two explanations contradict each other and are stated without any real proof, yet unfortunately both can be widely heard in most of the adversarial example papers. If we had clear formulations of these two statements, we could have tested both claims, but unfortunately the papers that suggested these theories didn't put effort into defining a jargon and stating their suggestion as a clear, formal statement.
I studied using ML just over a decade ago. I actually compared MLPs to SVMs and had a similar thought to this. It does seem like there is a regression on understanding some of the fundamentals and older tools of the trade.
I guess everyone gets focused on the newer things.
Really does seem like people rediscovering older endpoints.
There's been a huge flood of vanilla software engineers into ML, retconning it as "a subfield of computer science" (computability is a minor concern compared to the statistical underpinnings). They pretend to know the math because they can read the equations, then claim with utmost confidence that actually they're doing all the hard work in ML because they are experts in calling APIs and integrating into products, however useful or useless.
I'm referring obliquely to a specific nitpick from select CS folks who argue that because the theoretical optimum is not computable in finite time/memory that the statistical basis for understanding ML is irrelevant.
Really? Here I thought it was a flood of academics retconning neural net code into a "science" now that programmers had made it run in Python for them fast enough to be useful.
As hacky as it ends up being in practice, there are some pretty solid theoretical fundamentals to the field of statistical learning.
The problem is the theory is constrained either to the micro-scale (individual layers/"simple" models, etc.) or to the supra-scale (optimization/learning theory, etc.).
Not much concrete can be said about the macro-scale (individual networks) in theoretical terms, only that empirically they seem tend toward the things the supra-scale theory says they should do.
The current controversy in the academia v engineers tussle is 1) what exactly do the empirical results imply and 2) how much does the theory really matter given the practical outcomes. The only thing the two sides broadly agree upon is that some amount of error will always exist because NNs can be broadly understood as lossy compression machines.
There's a similar trend of accusing LLMs of not really understanding, being just pattern machines. Funny that a whole group of people get the same treatment.
I'm not sure what your point is. Many software engineers blatantly pattern-match and copy-paste code to try to get their own stuff to work without understanding what's really going on. This is a long-standing complaint in the industry.
It's a term from literature/fiction, where later installments try to explain something from earlier installments; usually in a way that's nothing like the original intent. Applying this to real history is more like "revisionism".
"RETroactive CONtinuity", in addition to what the sibling comment said.
The Wikipedia top example is Sherlock Holmes dying in a fight with Moriarty and then coming back later when the author relented and decided to write more stories.
> we show that over-parameterization catalyzes global convergence by ensuring the feasibility of the SVM problem and by guaranteeing a benign optimization landscape devoid of stationary points
does this mean 'an over-parameterized transformer problem is a convex svm problem'?
The irony is that your "simplification" uses even more "jargon."
But yes, that's how I would read that, and I also see no issue at all with the language in the paper. These terms are used for precision, and have meaning to those in the field. Papers are written for other experts, not laymen.
OK, but why do they write "benign optimization landscape devoid of stationary points" instead of "convex", other than "just for show"? In my understanding it's not better for either audience, experts or laymen. For experts it would be clearer to just say convex and they would know the implications, and if someone doesn't know what convex means they probably also aren't going to be on board with 'stationary points'. Also, I'm not trying to pick on the authors; I'm just trying to answer the question of which specific parts could be seen as 'just for show'.
I wouldn’t go this far, applied ML articles are my favorite articles. If you’re in the arena, it’s good to see things that other people have done from a practical perspective so you can ape it in your own work or not give it further consideration.
FFT describes everything with sinusoids by default. For QFT, it's wave functions. Wavelets are more like NN neurons that match (scale-invariant?) quantized sections of waveforms.
A classical universal function approximator is probably not sufficient to approximate quantum systems
(unless there is IDK a geometric breakthrough in classical-quantum correspondence similar to the Amplituhedron).
IIUC Church-Turing and Church-Turing-Deutsch say that Turing complete is enough for classical computing, and that a qubit computer can simulate the same quantum logic circuits as any qudit or qutrit computer; but is it ever shown that Quantum Logic is indeed the correct and sufficient logic for propositional calculus and also for all physical systems?
> - The rotation operators Rx(θ), Ry(θ), Rz(θ), the phase shift gate P(φ)[c] and CNOT are commonly used to form a universal quantum gate set.
> - The Clifford set {CNOT, H, S} + T gate. The Clifford set alone is not a universal quantum gate set, as it can be efficiently simulated classically according to the Gottesman–Knill theorem.
> - The Toffoli gate + Hadamard gate.[17] The Toffoli gate alone forms a set of universal gates for reversible boolean algebraic logic circuits which encompasses all classical computation.
[...]
> - The parametrized three-qubit Deutsch gate D(θ)
> A universal logic gate for reversible classical computing, the Toffoli gate, is reducible to the Deutsch gate, D(π/2), thus showing that all reversible classical logic operations can be performed on a universal quantum computer.
Turing machines can also be used as universal function approximators. But I'm not sure it makes sense to put them in the same category as the other two.
I would love to put them in the same category as the other two. In fact I’ve spent quite a lot of time thinking about it / experimenting. Wouldn’t it be great if we could somehow train on data and get a small Turing machine instead of a huge neural network?
Both RISC and CISC are usually used in the context of describing Turing complete instruction sets. I'm not sure it's relevant here?
If you want to make a comparison in this flavour: Turing machines are a bit like CPUs in that they can execute arbitrary things in sequence. All the flavours of machine learning are more like GPUs: they do well with oodles of big, parallelisable matrix multiplications interspersed with some simple non-linear transformations.
Well, we're talking about a "native" implementation of both for comparison, right? Neural nets as they're being used are just being emulated by our Turing-machine-like processors, which makes them run like ass in practice. Something like an analog circuit that adds up voltages would be a native NN implementation and would surely vastly outperform any Turing machine in wide, highly parallel, heavily memory driven tasks that are well suited for it, and either one emulating the other is slow and bloated.
The term hyperplane already assumes that the hypothesis space that your learning algorithm searches has some kind of dimension and is some variant of an Euclidean / vector space (and its generalisations). This is not the case for many forms of ML, for example grammar induction (where the hypothesis space is Chomsky-style grammars) or inductive logic programming (hypothesis space are Prolog (or similar) programs), or, more generally, program synthesis (where programs form the hypothesis space).
Note that "some sort of partitioning" isn't a hyperplane. A partition is a set-theoretic concept. A hyperplane is (a generalisation of) a geometric concept, so has much more structure.
Someday I'm going to write a paper that achieves SOTA results with a nigh-incomprehensible mishmash of diverse techniques and title it "All You Need Considered Harmful".
A hyperplane is the zero set of a multi-dimensional linear (affine) function; it splits space into two distinct regions. In the context of a classifier, it splits feature space into disjoint sub-spaces (one for each class). SVMs place a hyperplane with maximum margin, thereby separating the classes in an optimal way.
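In symbols (hard-margin case, for simplicity):

    \text{hyperplane: } \{x : w^\top x + b = 0\}, \qquad
    \min_{w,b}\ \tfrac{1}{2}\|w\|^2 \ \text{ s.t. } \ y_i\,(w^\top x_i + b) \ge 1 \ \ \forall i

Maximizing the margin 2/\|w\| between the two classes is the same as minimizing \|w\|^2 under those constraints.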
Worth keeping in mind that though it may be optimal according to some mathematical criterion, that is no guarantee that it's the best for the purposes you have in mind.
Or as the subspace of all vectors that are orthogonal to a given single vector, or as the subspace generated by any orthogonal basis with one basis vector removed, or as the kernel of a linear form, ... - but a more visual explanation is probably better as a first foray into the question.
I agree that a more visual explanation is better in general.
I was trying to hint how the visual explanation relates to the long vectors of numbers we actually feed our machine learning contraptions with. Not sure I was successful.
Yeah, but only a few are made up to seem like terms of art designed to obfuscate their actual meaning; and usually prepending "hyper-" to something is a signal that a more clear description of the thing doesn't yet exist.
I regret to inform you, I don’t think it’s the same set of people. Writing a cute “Transformers are SVMs” paper and “building chatGPT” are not the same skillset.