The classifier was likely a convolutional network, so the assumption that the image is a 2D grid was baked into the architecture itself - it didn't have to be conveyed via the shape of the input for the network to use it.
I don't think so - convolutional neural networks can also operate over flat 1D vectors; the spatial relationship between pixels is only learned from the training data.
This is not true. CNNs perform 2D convolution, conceptually "sliding" a two-dimensional kernel of learnable weights across both spatial dimensions of the input image - which presupposes that the input actually has a height and a width, not just a flat list of pixels.
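
To make that concrete, here's a minimal sketch (assuming PyTorch - the thread doesn't say which framework was used): `nn.Conv2d` needs the grid shape to slide the kernel over, and rejects the same pixels passed as a flat vector.

```python
# Minimal sketch, PyTorch assumed: 2D convolution requires the grid shape.
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3)  # learnable 3x3 kernels

img = torch.randn(1, 1, 28, 28)   # (batch, channels, height, width): an explicit 2D grid
out = conv(img)                   # works: the kernel slides across both spatial dimensions
print(out.shape)                  # torch.Size([1, 8, 26, 26])

flat = img.reshape(1, 28 * 28)    # the same pixels, flattened to a 1D vector
try:
    conv(flat)                    # fails: there is no grid to slide the kernel over
except RuntimeError as err:
    print(err)                    # expected 3D/4D input to conv2d, got 2D
```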
Perhaps it wasn't a convolutional network after all, but a simple fully-connected feed-forward network taking all pixels as input? That could be viable for a toy example like MNIST.
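
For illustration, a sketch of what that would look like (again assuming PyTorch, and MNIST's 28x28 = 784 pixels): the image is flattened up front, so the network only ever sees a 784-dimensional vector, and any spatial structure has to be inferred from the training data.

```python
# Sketch, PyTorch assumed: a fully-connected MNIST classifier over flattened pixels.
import torch
import torch.nn as nn

mlp = nn.Sequential(
    nn.Flatten(),             # 28x28 image -> 784-dim vector; grid structure is discarded
    nn.Linear(28 * 28, 128),  # every pixel connects to every hidden unit
    nn.ReLU(),
    nn.Linear(128, 10),       # one logit per MNIST digit class
)

img = torch.randn(1, 1, 28, 28)   # a dummy grayscale image
logits = mlp(img)
print(logits.shape)               # torch.Size([1, 10])
```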