Hacker News new | past | comments | ask | show | jobs | submit login

It kinda makes sense, doesn't it? What are the largest convolutions you've heard of -- 11 x 11 pixels? Not much more than that, surely? So how much can one part of the image influence another part 1000 pixels away? But I am not an expert in any of this, so an expert's opinion would be welcome.



Yes it makes sense a bit. Many popular convents operate on 3x3 kernels. But the number of channel increases per layer. This, coupled with the fact that the receptive field increases per layer and allows convnets to essentially see the whole image relatively early in model's depth (esp. coupled with pooling operations which increase the receptive field rapidly), makes this intuition questionable. Transformers on the other hand, operate on attention which allows them to weight each patch dynamically, but it's clear to me that this allows them to attend to all parts of the image in a way different from convnets.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: