
I wonder what it would look like in a multimodal model, if the reasoning part were an image, video, or 3D scene instead of text.

Or just embeddings that only make sense to the model. It’s really arbitrary, after all.
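A minimal sketch of what that could look like, assuming a GPT-style decoder (everything here is a toy stand-in, not any real model's API): instead of decoding a token at each reasoning step, the model appends its own last hidden state as the next input embedding, so the intermediate "thoughts" never round-trip through the vocabulary.

    import torch
    import torch.nn as nn

    # Toy decoder stand-in (hypothetical; a real LM would be much bigger
    # and causally masked -- this only illustrates the data flow).
    d_model, vocab = 64, 1000
    layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
    backbone = nn.TransformerEncoder(layer, num_layers=2)
    embed = nn.Embedding(vocab, d_model)   # token id -> embedding
    unembed = nn.Linear(d_model, vocab)    # hidden state -> logits

    prompt = torch.randint(0, vocab, (1, 8))   # (batch, seq) of token ids
    seq = embed(prompt)                        # work in embedding space

    # Latent reasoning loop: append the last hidden state itself as the
    # next input. These vectors are never decoded, so they only have to
    # make sense to the model.
    for _ in range(4):
        hidden = backbone(seq)                 # (1, seq_len, d_model)
        thought = hidden[:, -1:, :]            # last position's state
        seq = torch.cat([seq, thought], dim=1)

    # Only the final step is projected back to tokens for a readable answer.
    answer = unembed(backbone(seq)[:, -1, :]).argmax(-1)

How you'd supervise thoughts that never become tokens is the hard part; this only shows why the intermediate representation is arbitrary from the model's point of view.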

That's what I was thinking too, though with an image you could add a convolution layer and maybe that makes it imagine visually. Or actually, the reasoning is backwards: the convolution layer is what (potentially) makes that part of the latent behave like an image. It's all just raw numbers at the I/O layers, but the convolution's inductive bias could keep it from overfitting. And if you also want to give it a little binary array as a scratchpad that feeds straight into the ReLUs, why not? That seems more like human reasoning: a little language, a little visual, a little binary / unknown.
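A hypothetical continuation of that idea in the same toy PyTorch style (all names, shapes, and the straight-through trick are my own choices for illustration): one slice of the latent is reshaped into a 2D grid and run through a small conv stack, so any image-like behaviour comes from the convolution's inductive bias rather than the data, and a binary scratchpad is hard-thresholded and fed straight into ReLU layers.

    import torch
    import torch.nn as nn

    d_model = 256
    grid = 16                                  # 16 * 16 == d_model

    # Conv stack: locality + weight sharing is what would make this
    # slice of the latent behave like an image.
    visual = nn.Sequential(
        nn.Conv2d(1, 8, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv2d(8, 1, kernel_size=3, padding=1),
    )

    # Binary scratchpad: hard 0/1 values going straight into ReLUs.
    scratch_bits = 32
    scratch_mlp = nn.Sequential(
        nn.Linear(scratch_bits, d_model),
        nn.ReLU(),
    )

    thought = torch.randn(1, d_model)          # one latent "thought"

    # Treat the flat vector as a 1-channel image and convolve it.
    img = visual(thought.view(1, 1, grid, grid))

    # Hard-binarise a (stand-in) head's output, with a straight-through
    # gradient so the 0/1 values stay trainable.
    logits = torch.randn(1, scratch_bits)
    soft = logits.sigmoid()
    bits = (logits > 0).float() + soft - soft.detach()

    # Mix the pieces back into one vector for the next reasoning step.
    next_thought = thought + img.view(1, d_model) + scratch_mlp(bits)

The straight-through estimator is just one common way to keep hard bits differentiable; the broader point is that each slice of the latent gets a different inductive bias, not a different data type.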
