That's what I was thinking too, though with an image you could add a convolution layer and, idk, maybe that lets it imagine visually. Or actually, I had that backwards: the convolution layer is what (potentially) makes that part behave like an image. It's all just raw numbers at the I/O layers. But the convolution could help keep it from overfitting, since the same small kernel is reused at every position instead of learning a separate weight per pixel. And if you also want to give it a little binary array as a scratch pad that feeds straight into the ReLUs, why not? Seems more like human reasoning: a little language, a little visual, a little binary / unknown.
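
To make the idea concrete, here's a rough sketch of what I mean, in plain Python with no ML library. Everything in it is made up for illustration: the names (`conv2d_valid`, `forward`, `init_params`), the sizes, and the choice to wire the scratch bits into a dense layer followed by a ReLU, which is just one way to read "goes straight to the ReLUs". It only does a forward pass with random weights, no training.

```python
import random

def relu(xs):
    # Elementwise max(0, x).
    return [max(0.0, x) for x in xs]

def conv2d_valid(image, kernel):
    # Slide one shared kernel over the image (stride 1, no padding).
    # Like most deep learning libraries, this is really cross-correlation.
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def dense(xs, weights, biases):
    # Plain fully connected layer: one weight row per output unit.
    return [sum(w * x for w, x in zip(row, xs)) + b
            for row, b in zip(weights, biases)]

def init_params(image_hw, text_dim, scratch_dim, hidden, k=3, seed=0):
    # Hypothetical sizes; random weights stand in for trained ones.
    rng = random.Random(seed)
    h, w = image_hw
    n_visual = (h - k + 1) * (w - k + 1)
    n_in = n_visual + text_dim + scratch_dim
    return {
        "kernel": [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(k)],
        "w": [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(hidden)],
        "b": [0.0] * hidden,
    }

def forward(image, text_vec, scratch_bits, params):
    # Only the image branch goes through the convolution.
    feature_map = conv2d_valid(image, params["kernel"])
    visual = relu([v for row in feature_map for v in row])
    # Text features and scratch bits skip the convolution and are
    # concatenated as raw numbers before the shared dense + ReLU layer.
    joined = visual + list(text_vec) + [float(b) for b in scratch_bits]
    return relu(dense(joined, params["w"], params["b"]))

params = init_params((6, 6), text_dim=4, scratch_dim=8, hidden=5)
image = [[float((i + j) % 2) for j in range(6)] for i in range(6)]
out = forward(image, [0.1, -0.2, 0.3, 0.0], [1, 0, 1, 1, 0, 0, 1, 0], params)
print(len(out))
```

The point about overfitting shows up in the parameter count: the image branch here has 9 kernel weights total, no matter how big the image is, whereas the dense layer needs one weight per input per hidden unit.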