
I wonder what it would look like in a multimodal model, if the reasoning part were an image, video, or 3D scene instead of text.

Or just embeddings that only make sense to the model. It’s really arbitrary, after all.
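A minimal sketch of what that could look like, assuming a GPT-style decoder (everything here is a toy stand-in, not any real model's API): instead of decoding a token at each reasoning step, the model appends its own last hidden state as the next input embedding, so the intermediate "thoughts" never round-trip through the vocabulary.

    import torch
    import torch.nn as nn

    # Toy decoder stand-in (hypothetical; a real LM would be much bigger
    # and causally masked -- this only illustrates the data flow).
    d_model, vocab = 64, 1000
    layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
    backbone = nn.TransformerEncoder(layer, num_layers=2)
    embed = nn.Embedding(vocab, d_model)   # token id -> embedding
    unembed = nn.Linear(d_model, vocab)    # hidden state -> logits

    prompt = torch.randint(0, vocab, (1, 8))   # (batch, seq) of token ids
    seq = embed(prompt)                        # work in embedding space

    # Latent reasoning loop: append the last hidden state itself as the
    # next input. These vectors are never decoded, so they only have to
    # make sense to the model.
    for _ in range(4):
        hidden = backbone(seq)                 # (1, seq_len, d_model)
        thought = hidden[:, -1:, :]            # last position's state
        seq = torch.cat([seq, thought], dim=1)

    # Only the final step is projected back to tokens for a readable answer.
    answer = unembed(backbone(seq)[:, -1, :]).argmax(-1)

How you'd supervise thoughts that never become tokens is the hard part; this only shows why the intermediate representation is arbitrary from the model's point of view.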

That's what I was thinking too, though with an image you could add a convolution layer and maybe that makes it imagine visually. Or actually, the reasoning is backwards: the convolution layer is what (potentially) makes that part of the latent behave like an image. It's all just raw numbers at the I/O layers, but the convolution's inductive bias could keep it from overfitting. And if you also want to give it a little binary array as a scratchpad that feeds straight into the ReLUs, why not? That seems more like human reasoning: a little language, a little visual, a little binary / unknown.
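A hypothetical continuation of that idea in the same toy PyTorch style (all names, shapes, and the straight-through trick are my own choices for illustration): one slice of the latent is reshaped into a 2D grid and run through a small conv stack, so any image-like behaviour comes from the convolution's inductive bias rather than the data, and a binary scratchpad is hard-thresholded and fed straight into ReLU layers.

    import torch
    import torch.nn as nn

    d_model = 256
    grid = 16                                  # 16 * 16 == d_model

    # Conv stack: locality + weight sharing is what would make this
    # slice of the latent behave like an image.
    visual = nn.Sequential(
        nn.Conv2d(1, 8, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv2d(8, 1, kernel_size=3, padding=1),
    )

    # Binary scratchpad: hard 0/1 values going straight into ReLUs.
    scratch_bits = 32
    scratch_mlp = nn.Sequential(
        nn.Linear(scratch_bits, d_model),
        nn.ReLU(),
    )

    thought = torch.randn(1, d_model)          # one latent "thought"

    # Treat the flat vector as a 1-channel image and convolve it.
    img = visual(thought.view(1, 1, grid, grid))

    # Hard-binarise a (stand-in) head's output, with a straight-through
    # gradient so the 0/1 values stay trainable.
    logits = torch.randn(1, scratch_bits)
    soft = logits.sigmoid()
    bits = (logits > 0).float() + soft - soft.detach()

    # Mix the pieces back into one vector for the next reasoning step.
    next_thought = thought + img.view(1, d_model) + scratch_mlp(bits)

The straight-through estimator is just one common way to keep hard bits differentiable; the broader point is that each slice of the latent gets a different inductive bias, not a different data type.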
