Grounding language in other sense modalities (multimodal learning) is already a thing. We can even generate captions from images and images from captions, albeit not perfectly.
Another grounding source is ontologies. We are already building huge maps of facts about the world in the form "object1 relation object2".
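Concretely, these fact maps are little more than big sets of (subject, relation, object) triples. A toy sketch in Python (the facts and the helper below are invented for illustration, not taken from any real knowledge base):

```python
# Toy sketch of an ontology / knowledge graph as (subject, relation, object) triples.
# The facts below are illustrative examples, not from any real knowledge base.
triples = {
    ("cat", "is_a", "mammal"),
    ("mammal", "is_a", "animal"),
    ("cat", "has_part", "whiskers"),
    ("paris", "capital_of", "france"),
}

def objects_of(subject, relation):
    """Return every object linked to `subject` by `relation`."""
    return {o for s, r, o in triples if s == subject and r == relation}

print(objects_of("cat", "is_a"))  # {'mammal'}
```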
Another source of "common sense" is word embeddings. In fact it is possible to embed all kinds of things - shopping bags, music preferences, network topologies - as long as we can observe objects in context.
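The recipe is always the same: count which objects show up in the same context, then compress. A rough sketch with invented shopping-basket data (item names and the similarity helper are just illustrative; word2vec-style training is fancier, but the underlying signal is this co-occurrence structure):

```python
import numpy as np

# Toy "shopping basket" corpus: any objects observed in a shared context can be embedded.
baskets = [
    ["bread", "butter", "jam"],
    ["bread", "butter", "milk"],
    ["beer", "chips", "salsa"],
    ["beer", "chips", "milk"],
]

items = sorted({i for b in baskets for i in b})
idx = {item: k for k, item in enumerate(items)}

# Symmetric co-occurrence matrix: items appearing together in the same basket.
cooc = np.zeros((len(items), len(items)))
for basket in baskets:
    for a in basket:
        for b in basket:
            if a != b:
                cooc[idx[a], idx[b]] += 1

# Factorize with SVD and keep a few dimensions -- these are the "embeddings".
U, S, _ = np.linalg.svd(cooc)
embeddings = U[:, :2] * S[:2]

def similarity(a, b):
    va, vb = embeddings[idx[a]], embeddings[idx[b]]
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb) + 1e-9))

print(similarity("bread", "butter"))  # should come out high: they share contexts
print(similarity("bread", "salsa"))   # should come out lower: they never co-occur
```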
Then there is unsupervised learning from video and images. For example, take pictures, cut each into a 3x3 grid of tiles, shuffle the tiles, and then task the network with recovering the original layout. This extracts semantic information from images automatically, with no supervision. A variant is to take frames from a video, shuffle them, and then task the network with recovering the original temporal order. Using this process we can cheaply learn about the world and provide that knowledge as "common sense" for NLP tasks.
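The jigsaw pretext task is mechanically trivial, which is the point: the labels come for free. A rough numpy sketch (a random array stands in for a real photo; a real setup would feed the shuffled tiles to a CNN that predicts the permutation):

```python
import numpy as np

rng = np.random.default_rng(0)

# A random array stands in for a real photo (H x W x 3, divisible by 3).
image = rng.random((96, 96, 3))

def make_jigsaw_example(img, grid=3):
    """Cut the image into grid x grid tiles, shuffle them, and return
    (shuffled_tiles, permutation). The permutation is the training target:
    the network must predict it to recover the original layout."""
    h, w = img.shape[0] // grid, img.shape[1] // grid
    tiles = [img[r*h:(r+1)*h, c*w:(c+1)*w] for r in range(grid) for c in range(grid)]
    perm = rng.permutation(len(tiles))
    shuffled = [tiles[i] for i in perm]
    return shuffled, perm

tiles, target = make_jigsaw_example(image)
print(target)  # the "label" comes for free, no human annotation needed
# The video variant is the same idea: shuffle frames, predict their temporal order.
```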
I am not worried about grounding language. We will get there soon enough; we're just impatient. Life evolved over billions of years, while AI is only just emerging. Compare how much computing power sits in the collected brains of humanity with how much computer time we give AI to learn. AI is still starved of raw computing power and experience. Human brains would have done much worse with the same amount of computation.
Image captioning is a separate, albeit related, problem to what I'm talking about.
Ontologies are much the same; they are interesting for the problems they solve, but it's not clear how well those problems relate to the more general problem of language.
Word embeddings are also quite interesting, but again, they are typically based entirely on whatever emergent semantics can be gleaned from the structure of documents. It's not clear to me that this is any more than superficial understanding. Not that they aren't very cool and powerful; distributional semantics is a powerful tool for measuring certain characteristics of language. I'm just not sure how much more useful it will be in the future.
Unsupervised learning from video and images is a strictly different problem that seems to me to sit much lower down the hierarchy of AI Hardness: more like a fundamental task that is solvable in its own universe, without requiring complete integration of multiple other universes. Whether the information extracted by these existing techniques is actually usefully semantic in nature remains to be seen.
I agree that we'll get there, somewhat inevitably; not trying to argue for any Searlian dualistic separation between what Machines can do and what Biology can do. I'm personally interested in the 'how'. Emergent Strong AI is the most boring scenario I can imagine; I want to understand the mechanisms at play. It may just be that we need to tie together everything you've listed and more, throw enough data at it, and wait for something approximating intelligence to grow out of it. We can also take the more top-down route, and treat this as a problem in developmental psychology. Are there better ways to learn than just throwing trillions of examples at something until it hits that eureka moment?
I think the key ingredient will be reinforcement learning and, more importantly, agents being embedded in the external world.
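By "embedded" I mean the agent only ever learns through an act/observe/reward loop with a world outside itself. A bare-bones tabular Q-learning sketch against a toy corridor world of my own invention (nothing here is a real library API):

```python
import random

class Corridor:
    """Toy external world: a 1-D corridor; reward only when the agent reaches the end."""
    def __init__(self, length=6):
        self.length = length
        self.pos = 0
    def reset(self):
        self.pos = 0
        return self.pos
    def step(self, action):  # action: 0 = step left, 1 = step right
        self.pos = max(0, min(self.length, self.pos + (1 if action == 1 else -1)))
        done = self.pos == self.length
        return self.pos, (1.0 if done else 0.0), done

env = Corridor()
q = [[0.0, 0.0] for _ in range(env.length + 1)]   # tabular Q-values
alpha, gamma, eps = 0.5, 0.9, 0.2

def pick(state):
    """Epsilon-greedy action, breaking ties randomly."""
    if random.random() < eps or q[state][0] == q[state][1]:
        return random.randrange(2)
    return int(q[state][1] > q[state][0])

# The agent's only access to the world is through reset()/step() -- that is the
# "embedded" part: everything it knows is earned by acting and observing.
for episode in range(300):
    state, done = env.reset(), False
    while not done:
        action = pick(state)
        nxt, reward, done = env.step(action)
        q[state][action] += alpha * (reward + gamma * max(q[nxt]) - q[state][action])
        state = nxt

print([int(q[s][1] > q[s][0]) for s in range(env.length)])  # learned policy: mostly "go right"
```

Swap the corridor for a rich environment and the table for a function approximator, and you have the usual deep RL setup.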
Regarding the "internal world", we already see the development of AI mechanisms for attention, short-term memory (references to concepts recently used), episodic memory (autobiographical), and semantic memory (ontologies).
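Of those, attention is the easiest to show in a few lines; at some level of abstraction, the memories are just more stores read the same way. A minimal scaled dot-product attention sketch in numpy (shapes and values are arbitrary, chosen only for illustration):

```python
import numpy as np

def attention(query, keys, values):
    """Scaled dot-product attention: the query softly 'recalls' a mix of stored values,
    weighted by how well it matches each key."""
    scores = keys @ query / np.sqrt(query.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values, weights

# A tiny "memory": three stored items (keys) with associated contents (values).
keys = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
values = np.array([[10.0], [20.0], [30.0]])

recalled, weights = attention(np.array([1.0, 0.1]), keys, values)
print(weights)   # mostly on the first and third items, which match the query best
print(recalled)  # a soft blend of their values
```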