Vision Transformers Need Registers (openreview.net)
171 points by cscurmudgeon 11 months ago | 23 comments



According to the paper, the "registers" are additional learnable tokens that are appended to the input sequence of a Vision Transformer model during training.

They are added after the patch embedding layer with a learnable value, similar to the [CLS] token. At the end of the Vision Transformer, the register tokens are discarded, and only the [CLS] token and patch tokens are used as the image representation.

The register tokens provide a place for the model to store, process and retrieve global information during the forward pass, without repurposing patch tokens for this role.

Adding register tokens removes the artifacts and high-norm "outlier" tokens that otherwise appear in the feature maps of trained Vision Transformer models.

Using register tokens leads to smoother feature maps, improved performance on dense prediction tasks, and better unsupervised object discovery compared to the same models trained without the additional register tokens.

This is a neat result. For just a 2% increase in inference cost, you can significantly improve ViT model performance. Close to a free lunch.
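For the curious, here is roughly what that looks like in code. This is a minimal PyTorch sketch of my own, not the authors' implementation; the module name, dimensions, and the use of nn.TransformerEncoder are all stand-ins:

    import torch
    import torch.nn as nn

    class ViTWithRegisters(nn.Module):
        def __init__(self, dim=384, n_patches=196, n_registers=4, depth=12, heads=6):
            super().__init__()
            self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
            self.registers = nn.Parameter(torch.zeros(1, n_registers, dim))  # the new learnable tokens
            self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
            layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            self.blocks = nn.TransformerEncoder(layer, depth)
            self.n_registers = n_registers

        def forward(self, patch_tokens):  # (B, n_patches, dim), i.e. after patch embedding
            B = patch_tokens.size(0)
            x = torch.cat([self.cls_token.expand(B, -1, -1), patch_tokens], dim=1) + self.pos_embed
            x = torch.cat([x, self.registers.expand(B, -1, -1)], dim=1)  # append registers
            x = self.blocks(x)
            cls, patches = x[:, 0], x[:, 1:-self.n_registers]  # registers are simply dropped here
            return cls, patches

The registers participate in attention at every layer, so the patch tokens no longer need to double as scratch space, but they never appear in the output.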


This whole token business is very shady, and so is the whole probability theory. You add a token here and there and magic happens. Discrete math people cannot take this lightly. Stochastic regexes are one thing, but this is on a completely different level of mathematical debauchery.

Absolutely amazing this works.


Vision transformers are essentially just JPEG but with learned features rather than the Fourier transform.


I think it's important to point out, for people who might be interested in this comment, that a few things are wrong.

1. Standard JPEG compression uses the Discrete Cosine Transform, not the Fourier Transform.

2. It is easy to be dismissive of any technology by saying that it is 'just' X with Y, Z, etc. on top.

3. Vision transformers allow for much longer-range context - the magic comes in part from the ability to relate patches to one another, as well as from the learned features, which JPEG does not do.


The discrete cosine transform is the real part of a Fourier transform.
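More precisely, the DCT-II of a signal is the real part of a phase-shifted DFT of the signal mirrored onto itself. A quick NumPy/SciPy check of that identity (just a sketch):

    import numpy as np
    from scipy.fft import dct, fft

    x = np.random.rand(8)
    N = len(x)
    k = np.arange(N)

    mirrored = np.concatenate([x, x[::-1]])            # even-symmetric extension
    via_fft = np.real(np.exp(-1j * np.pi * k / (2 * N)) * fft(mirrored)[:N])

    print(np.allclose(via_fft, dct(x, type=2)))        # True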


Indeed. Kernels mashing features. Knowing JPEG helped my understanding of embeddings a lot. It's why I tell friends that talking to GPT is like talking to .ZIP files…


Rockets are essentially just fire that burns real fast.


Interesting! Can you elaborate?


The JPEG algorithm is:

1. Divide up the image into 8x8 patches

2. Take the DCT (a variant of the Fourier transform) of each patch to extract key features

3. Quantize the outputs

4. Use arithmetic encoding to compress

The ViT algorithm is:

1. Divide up the image into 16x16 patches

2. Use query/key/value attention matrices to extract key features

3. Minimize cross-entropy loss between predicted and actual next tokens. (This is equivalent to trying to minimize encoding length.)

ViTs don't have quantization baked into the algorithm, but NNs are moving towards quantization in general. Another user correctly pointed out that vision transformers are not necessarily autoregressive (i.e. they may use future patches to calculate values for previous patches), while arithmetic encoding usually is (so JPEG is), so the algorithms have a few differences, but nothing major.
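As a rough, illustrative side-by-side (not either algorithm in full; the sizes are just the usual defaults): both pipelines start by chopping the image into patches, then JPEG applies a fixed DCT per patch while a ViT applies a learned projection and lets attention mix the patches.

    import torch
    import torch.nn as nn
    from scipy.fft import dctn

    image = torch.rand(3, 224, 224)

    # JPEG-style front end: 8x8 patches, fixed 2D DCT per patch
    # (quantization and entropy coding omitted).
    patches8 = image.unfold(1, 8, 8).unfold(2, 8, 8)            # (3, 28, 28, 8, 8)
    jpeg_coeffs = torch.from_numpy(dctn(patches8.numpy(), axes=(-2, -1)))

    # ViT-style front end: 16x16 patches, learned linear projection
    # (the attention blocks that follow are omitted).
    patch_embed = nn.Conv2d(3, 384, kernel_size=16, stride=16)  # patchify + linear in one op
    vit_tokens = patch_embed(image[None]).flatten(2).transpose(1, 2)  # (1, 196, 384)

    print(jpeg_coeffs.shape, vit_tokens.shape)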

-----

I think it's pretty interesting how closely related generation and compression are. ClosedAI's Sora[^1] model uses a denoising vision transformer for their state-of-the-art video generator, while JPEG has been leading image compression for the past several decades.

[^1]: https://openai.com/index/sora/?video=big-sur


I find it wild that the training process can do such things as force the model to repurpose background areas to begin with. The authors just observed and optimized what the model was already doing by itself.


I agree, the most interesting thing about the paper is the default behavior of the network as it tries to compress the data.


The modern Alchemy.


But then… alchemy never produced gold, right? So how do we expect this thing to ever produce gold-level value? I'm sure the alchemist OpenAI of the 12th century must've also had a very high valuation.


There was an attempt to add several CLS tokens to BERT, with less spectacular results: https://arxiv.org/pdf/2210.05043


Are there lessons here for regular (non-vision) transformers? Sounds close to attention sinks/pause tokens?


For these tokens you first need to unembed the result of the final layer, then re-embed the resulting token on the next pass. Has anyone investigated passing the raw output of one pass to the input of the next?
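For concreteness, here is a toy PyTorch sketch of the two feedback strategies (no causal masking or training, and every name here is made up): the usual unembed-then-re-embed loop versus appending the raw final hidden state.

    import torch
    import torch.nn as nn

    vocab, d_model = 100, 32
    embed = nn.Embedding(vocab, d_model)
    unembed = nn.Linear(d_model, vocab, bias=False)
    layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
    body = nn.TransformerEncoder(layer, num_layers=2)

    tokens = torch.randint(0, vocab, (1, 8))                     # (batch, seq)
    x = embed(tokens)
    h = body(x)                                                  # (1, 8, d_model)

    # Standard loop: hidden state -> logits -> argmax -> embedding lookup.
    next_token = unembed(h[:, -1]).argmax(-1)                    # discretize
    x_next = torch.cat([x, embed(next_token)[:, None]], dim=1)

    # The question above: skip the discretization and feed the raw state back in.
    x_raw = torch.cat([x, h[:, -1:, :]], dim=1)

    print(x_next.shape, x_raw.shape)                             # both (1, 9, d_model)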


So is that what all the visual cues are in real life, things like fashion accessories, uniforms etc.?


Interesting. One other potential benefit is easier quantization of the activations.


Related? "Let's Think Dot by Dot: Hidden Computation in Transformer Language Models" https://arxiv.org/abs/2404.15758

> Chain-of-thought responses from language models improve performance across most benchmarks. However, it remains unclear to what extent these performance gains can be attributed to human-like task decomposition or simply the greater computation that additional tokens allow. We show that transformers can use meaningless filler tokens (e.g., '......') in place of a chain of thought to solve two hard algorithmic tasks they could not solve when responding without intermediate tokens. However, we find empirically that learning to use filler tokens is difficult and requires specific, dense supervision to converge. We also provide a theoretical characterization of the class of problems where filler tokens are useful in terms of the quantifier depth of a first-order formula. For problems satisfying this characterization, chain-of-thought tokens need not provide information about the intermediate computational steps involved in multi-token computations. In summary, our results show that additional tokens can provide computational benefits independent of token choice. The fact that intermediate tokens can act as filler tokens raises concerns about large language models engaging in unauditable, hidden computations that are increasingly detached from the observed chain-of-thought tokens.

> In this work, we demonstrate that transformers trained on the next-token prediction objective can achieve improved performance on certain tasks when given filler tokens, achieving perfect accuracy whereas the no-filler, immediate-answer setting achieves only low accuracy.

--

I wonder if we could get benefits from adding special computation/register tokens to text LLMs?

More discussion:

- https://news.ycombinator.com/item?id=40182695

- https://www.reddit.com/r/LocalLLaMA/comments/1cf2w5a/transfo...


We tested dozens (maybe >100) of papers/ideas over the last few years in vision and multimodal perception, and this is one of the rare cases where everything worked well! Neat idea and paper!

This model, for example, uses 4 register tokens, and combines them with Matryoshka-style losses for training, resulting in super-compact 64-dimensional embeddings, in case anyone is looking for CLIP alternatives: https://huggingface.co/unum-cloud/uform3-image-text-english-...


I was at ICLR and this was one of the best papers this year; that was also evident during the poster session. Congrats to the authors!!


I’ve been using DINOv2 for some months now. I’ve tried the models with 4 register tokens along with the CLS + patch tokens. I have several embeddings (tokens) from the previous model (no registers) which are part of my solution, so I didn’t adopt the newer “register” models, because the CLS tokens are not aligned between the 0-register and 4-register models. It would be nice if the CLS and patch tokens were somehow aligned between those models.
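To make the mismatch concrete, here is a small check using the torch.hub entries published in the facebookresearch/dinov2 repo (names as I recall them; for these hub models, forward() returns the CLS embedding):

    import torch

    no_reg = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14')
    with_reg = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14_reg')

    x = torch.rand(1, 3, 224, 224)
    cls_a, cls_b = no_reg(x), with_reg(x)                        # (1, 768) CLS embeddings
    # Not expected to be anywhere near 1, per the alignment issue described above.
    print(torch.nn.functional.cosine_similarity(cls_a, cls_b))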


"Attention sinks" for vision models?



