a multimodal llm is a general-purpose device for churning sensor inputs into a sequence of near-optimal decisions. the 'language' part exists to reduce the friction of the interface with humans; it's not an inherent limitation of the llm. it's not too far-fetched to imagine pointing at a guy in a crowd and telling a drone to go get him, with the drone working out a near-optimal sequence of decisions to make it so.
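
to make that concrete, here's a minimal sketch of the shape of such a system: a closed perception-to-action loop where a multimodal policy takes a camera frame plus a natural-language goal and emits the next action, over and over until the goal is met. everything here (`MultimodalPolicy`, the action set, the sensor/actuator hooks) is hypothetical scaffolding to illustrate the idea, not a real model or drone API.

```python
from dataclasses import dataclass
from typing import List

# hypothetical discrete action vocabulary the model decodes into
ACTIONS = ["forward", "back", "left", "right", "ascend", "descend", "done"]

@dataclass
class Observation:
    frame: bytes          # raw camera frame from the drone
    instruction: str      # the human goal, e.g. "go get that guy"

class MultimodalPolicy:
    """stand-in for a multimodal llm: sensor tokens in, action tokens out."""

    def decide(self, obs: Observation, history: List[str]) -> str:
        # a real model would jointly encode the frame and the instruction,
        # condition on the action history, and decode the next action token.
        # this stub just terminates immediately.
        return "done"

def capture_frame() -> bytes:
    return b""  # placeholder sensor read

def execute(action: str) -> None:
    pass  # placeholder hand-off to the flight controller

def run(policy: MultimodalPolicy, instruction: str, max_steps: int = 100) -> List[str]:
    """perception-to-action loop: observe, decide, act, repeat."""
    history: List[str] = []
    for _ in range(max_steps):
        obs = Observation(frame=capture_frame(), instruction=instruction)
        action = policy.decide(obs, history)
        history.append(action)
        if action == "done":
            break
        execute(action)
    return history

# e.g. run(MultimodalPolicy(), "follow the person in the red jacket")
```

the point of the sketch is that the language channel only appears at the interface (the `instruction` field); the loop itself is just sensors in, decisions out.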