I am much more interested in whether they fall for the same tricks.
For example, whether it is easy to fool them with optical illusions, such as innocent images that look racy at first glance:
https://medium.com/@marekkcichy/does-ai-have-a-dirty-mind-to...
CW: even though it does not contain a single explicit picture, it might be considered NSFW (literally: at first glance it looks like nudity). Full disclosure: I mentored the project.
I suggest you take a look at Geometric Deep Learning. The gist is that convolutions can be thought of as translation-equivariant functions, and pooling operations as permutation-invariant combinations, all operating on a graph with as many components as the number of outputs the operation will produce, where each component consists of the pixels the operation acts on. Local information is thus gradually combined through the layers, and relative positioning can then be decoded when the representation is flattened into a dense 1-D vector.
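To make that concrete, here's a tiny NumPy illustration I put together of the two properties (a toy 1-D sketch of my own, not something from the GDL material): shifting the input of a convolution just shifts its output, and the aggregation inside a pooling window ignores the order of the pixels it sees.

    import numpy as np

    def conv1d_valid(x, k):
        """Plain 'valid' cross-correlation, the building block of a conv layer."""
        n = len(x) - len(k) + 1
        return np.array([np.dot(x[i:i + len(k)], k) for i in range(n)])

    rng = np.random.default_rng(0)
    x = rng.normal(size=32)   # a 1-D "image"
    k = rng.normal(size=3)    # a conv filter

    # Translation equivariance: shifting the input shifts the output the same way.
    # (Circular shift is used so we only need to ignore the wrap-around region.)
    lhs = conv1d_valid(np.roll(x, 5), k)
    rhs = np.roll(conv1d_valid(x, k), 5)
    print(np.allclose(lhs[5:], rhs[5:]))   # True away from the wrapped border

    # Permutation invariance of pooling: the aggregation over one pooling window
    # does not care in which order the pixels of that window arrive.
    patch = rng.normal(size=9)             # the pixels feeding one pooling output
    perm = rng.permutation(9)
    print(np.max(patch) == np.max(patch[perm]))         # True (max pooling)
    print(np.isclose(patch.sum(), patch[perm].sum()))   # True (sum/avg pooling)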
In contrast, attention mechanisms in transformers can be seen as operating on a dense graph over the whole input (at least in text; I haven't really worked with vision transformers, but if they use an attention mechanism it should be similar), along with some positional encoding and a neighborhood summary.
If they can indeed be thought of as stacking neighborhood summaries along with attention mechanisms, then they shouldn't fall for the same tricks, since they have access to "global" information instead of disconnected components.
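To make the "dense graph" point concrete, here's a bare-bones single-head self-attention in NumPy (my own toy sketch, not taken from any particular ViT implementation). The attention matrix has one weight for every pair of tokens/patches, so every position can pull in information from every other position already in the first layer:

    import numpy as np

    def self_attention(X, Wq, Wk, Wv):
        """Single-head self-attention over N token/patch embeddings."""
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(K.shape[-1])        # N x N: one weight per pair
        A = np.exp(scores - scores.max(axis=-1, keepdims=True))
        A /= A.sum(axis=-1, keepdims=True)             # row-wise softmax
        return A @ V, A

    rng = np.random.default_rng(0)
    N, d = 16, 8                                       # 16 patches, 8-dim embeddings
    X = rng.normal(size=(N, d))
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

    out, A = self_attention(X, Wq, Wk, Wv)
    print(A.shape)        # (16, 16): a fully connected (dense) graph over patches
    print((A > 0).all())  # True: every patch attends, with some weight, to every other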
But take this reply with a grain of salt as I am still learning about Geometric DL. If I misunderstood something, please correct me.
Thank you for bringing up GDL. I've been following its developments, and a really great resource is this site: https://geometricdeeplearning.com/
It contains links to the paper and lectures, and the keynote by M. Bronstein is illuminating: it discusses the operations on graphs that turn out to be equivalent to other network topologies and designs, transformers included, and more.
I believe Bronstein is onto something huge here, with massive implications. Illuminating is the best adjective to describe it; as I watched the keynote, everything clicked into place. It gave me a new language for seeing the field, one that ties everything together well beyond the way standard DL is taught:
> we do feature extraction using this function that resembles the receptive fields of the visual cortex, then we project the dense feature representation onto multiple other vectors and pass that through stacked non-linearities, and oh by the way we have a myriad of different, seemingly disconnected architectures that we are not sure why they work, but we call it inductive bias.
That's my main source, along with the papers that led up to the proto-book, so pretty much Bronstein's work plus related papers found using `connectedpapers.com`. I don't have the appropriate background, so I am grinding through abstract algebra and geometric algebra, and will then go into geometry and whatever my supervisor suggests I should read. Sure, I would like to have other people to discuss it with, but don't expect much just yet.
> I suggest you take a look at Geometric Deep Learning. The gist is that convolutions can be thought of as translation-equivariant functions, and pooling operations as permutation-invariant combinations, all operating on a graph with as many components as the number of outputs the operation will produce, where each component consists of the pixels the operation acts on. Local information is thus gradually combined through the layers, and relative positioning can then be decoded when the representation is flattened into a dense 1-D vector.
Hey, just trying to check my understanding: is this what Tacotron does? It outputs an image that can be Fourier-transformed into a soundwave, and that's the flattening into a dense 1-D vector? And the construction of that image works because the network was able to learn from examples of existing sounds transformed into images, since the transformation into an image encodes some invariance that biases the network to generalize better, or something?
I didn't do really great in math in college, but always found deep learning interesting. Not sure if anything I said above makes any sense.
The hard thing to emulate is that people quickly become aware that they are looking at an illusion. Even though you can't turn your perception off, Escher's infinite staircase doesn't actually trick you into thinking a set of stairs can go in a closed loop.
I think that because the human world is made for humans, there's a lot of value in an AI with failure modes similar to ours. Right now AI can do a good job of learning to classify images, but it fails in ways that are entirely foreign to us.
Those pictures are definitely NSFW when viewed at low res/from far away, which is how coworkers typically see your monitor contents. An argument that starts with “Well, technically” is unlikely to carry much weight in a discussion with HR (and probably rightfully so).
I think the scientific point here is that visual processing is not a one-shot process. Tasked with object detection, some scenes demand more careful processing and more computation.
Almost all neural network architectures process a given input size in the same amount of time, and some applications and datasets would benefit from an "anytime" approach, where the output is gradually refined given more time.
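As a rough sketch of what such an "anytime" approach could look like (not something from the paper, just a toy early-exit cascade with made-up stages):

    import numpy as np

    def early_exit_predict(x, stages, threshold=0.9):
        """Run a cascade of increasingly expensive stages and stop as soon as one
        is confident enough; each stage maps the input to class probabilities."""
        for depth, stage in enumerate(stages, start=1):
            probs = stage(x)
            if probs.max() >= threshold:      # confident enough: exit early
                break
        return int(probs.argmax()), depth     # prediction + compute actually spent

    # Stand-in "stages": random one-hidden-layer classifiers of growing width,
    # purely to show the control flow (a real model would share computation
    # between exits rather than recompute from scratch).
    rng = np.random.default_rng(0)
    def make_stage(width, n_classes=10, dim=64):
        W1, W2 = rng.normal(size=(dim, width)), rng.normal(size=(width, n_classes))
        def stage(x):
            logits = np.maximum(x @ W1, 0) @ W2
            e = np.exp(logits - logits.max())
            return e / e.sum()
        return stage

    stages = [make_stage(w) for w in (8, 32, 128)]
    x = rng.normal(size=64)
    print(early_exit_predict(x, stages))      # (predicted class, stages actually run)

In a real early-exit network the stages would share a backbone and the exits would be trained jointly; the toy above only shows the stop-when-confident control flow.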
I understand the point you are making, but it's kind of irrelevant. The task is to produce an answer for the image at the given resolution. It is an accident and coincidence that the neural network produces an answer that is arguably correct for a blurrier version of the image.
Probably because "Well, technically" is a really unfair way to characterize that argument.
It's not porn. It's not simulated porn. It's a hallway, and if you're not setting it as your desktop background to trick people on purpose then you're not doing anything wrong.
It's always nice to see big labs working more towards building an understanding of things instead of just chasing SOTA. On the other hand, I'm not sure there are a lot of actionable findings in here. I guess that's the trade-off with these things, though...
Somewhat off topic, but does anyone know if folks are working on combining vision and natural language in one model? I think that could yield some interesting results.
What would be really cool is neural networks with routing. Like circuit switching or packet switching. No idea how you would train such a beast though.
Like, imagine the vision part making a phone call to the natural language part to ask it for help with something.
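Something in that spirit exists as mixture-of-experts style gating, where a learned router decides which sub-network handles an input. Here's a toy sketch of the "vision phones the language part" idea (all names and shapes are made up for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    W_vis = rng.normal(size=(16, 8))   # stand-in "vision" weights (not trained)
    W_lang = rng.normal(size=(8, 5))   # stand-in "language" weights

    def vision_module(image_feat):
        """Pretend visual encoder: an embedding plus a scalar 'need help' gate."""
        embed = np.tanh(image_feat @ W_vis)
        need_help = 1.0 / (1.0 + np.exp(-embed.sum()))  # would be learned in practice
        return embed, need_help

    def language_module(query_embed):
        """Pretend language module the vision side can 'phone' for an answer."""
        return int((query_embed @ W_lang).argmax())

    def routed_forward(image_feat, gate_threshold=0.5):
        embed, need_help = vision_module(image_feat)
        if need_help > gate_threshold:                  # place the "phone call"
            return {"route": "vision -> language", "answer": language_module(embed)}
        return {"route": "vision only", "answer": int(embed.argmax())}

    print(routed_forward(rng.normal(size=16)))

The training difficulty you mention is real: the discrete routing decision isn't differentiable, so in practice people fall back on soft/weighted gating, straight-through estimators, or RL-style training of the router.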
It may make sense, but it also makes no sense. CNNs already have a full view of the entire input image. That's how discriminators are able to discriminate in GANs.
We added attention and observed no benefits at all in our GAN experiments.
They use the key-value lookup/routing mechanism from Transformers to predict pixel-wise labels in bird's-eye view (lane, car, obstacle, intersection, etc.). The motivation is that some of the predicted regions may be temporarily occluded, and for predicting those regions it helps to attend to remote parts of the input images. That requires long-range dependencies which depend strongly on the input itself (e.g., on whether there is an occlusion), which is exactly where the key-value mechanism excels. I'm not sure they even process past camera frames at this point; they only mention that later in the pipeline they have an LSTM-like NN incorporating past camera frames (Schmidhuber will be proud!).
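Roughly, I imagine the key-value lookup they describe as cross-attention where each bird's-eye-view cell issues a query against keys/values computed from the camera features. This is only my reading of the talk, with all shapes made up:

    import numpy as np

    def cross_attention(queries, keys, values):
        """Each query (one per bird's-eye-view cell) attends over every camera
        feature location, so an occluded cell can pull in information from any
        remote image region whose key matches its query."""
        scores = queries @ keys.T / np.sqrt(keys.shape[-1])      # (cells, locations)
        A = np.exp(scores - scores.max(axis=-1, keepdims=True))
        A /= A.sum(axis=-1, keepdims=True)                       # softmax per cell
        return A @ values

    rng = np.random.default_rng(0)
    n_bev_cells, n_image_locs, d, d_v = 100, 400, 32, 16
    bev_queries = rng.normal(size=(n_bev_cells, d))    # learned per-cell queries
    img_keys    = rng.normal(size=(n_image_locs, d))   # from the camera backbone
    img_values  = rng.normal(size=(n_image_locs, d_v))

    bev_features = cross_attention(bev_queries, img_keys, img_values)
    print(bev_features.shape)   # (100, 16): one feature vector per bird's-eye-view cell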
Edit: A random observation that just occurred to me: their predictions seem surprisingly temporally unstable. Observe, for example, the lane layout wildly changing while the car makes a left turn at the intersection (https://youtu.be/j0z4FweCy4M?t=2608). You can use the comma and period keys to step through the video frame by frame.
A useful HN feature would be a small space to put in a summary, like the abstract:
Convolutional neural networks (CNNs) have so far been the de-facto model for visual data. Recent work has shown that (Vision) Transformer models (ViT) can achieve comparable or even superior performance on image classification tasks. This raises a central question: how are Vision Transformers solving these tasks?
Are they acting like convolutional networks, or learning entirely different visual representations? Analyzing the internal representation structure of ViTs and CNNs on image classification benchmarks, we find striking differences between the two architectures, such as ViT having more uniform representations across all layers.
We explore how these differences arise, finding crucial roles played by self-attention, which enables early aggregation of global information, and ViT residual connections, which strongly propagate features from lower to higher layers. We study the ramifications for spatial localization, demonstrating ViTs successfully preserve input spatial information, with noticeable effects from different classification methods. Finally, we study the effect of (pretraining) dataset scale on intermediate features and transfer learning, and conclude with a discussion on connections to new architectures such as the MLP-Mixer.
Personally, I like it the way it is. Showing accompanying text for links would give anyone posting a link too much room to “force” everyone to read their comment on it. Requiring comments to be posted separately in order to be visible in the thread means that useful accompanying comments can float to the top, while a useless comment from the submitter sinks to the bottom, and the submitted link can still be voted on individually.
I guess this is really in response to all the other responses as well, but I thought the idea would be to help people decide whether they want to click the link.
So the title may not be sufficiently informative to let people know whether they can understand the article, whether they are interested in it, whether it is at the right technical level, and so on.
I think you're right that it will be abused in many instances and might not be worth it.
This would be handy, but at the same time, I think I like that the lack of a summary encourages people to click the link and read more of the article than they might otherwise.