
I don't think the issue has anything to do with cognition; it has more to do with something we do so subconsciously that we don't always notice we're doing it: error correction and context setting. A big part of language is our error correction channels. In text it's a lot less obvious because we twist the language to clear things up, but speech is full of "I'm sorry, what?" and "uh, you know" and hand gestures and furrowed brows and a million other side channels to get someone to repeat something, elucidate it, or set a deeper context.

But that happens in text too: we group things into paragraphs, add a lot of punctuation, and as we read we sometimes skim a bit, return as needed, and reread what we missed the first time. (Or in texts/IMs, where our cultures are in the process of building whole new sub-dialects of error correction codes like emoji and "k?".)

A lot of people would think a machine was broken if it hemmed and hawed as much as people do in a normal conversation, or if it needed full paragraphs of text to set context and explain itself.

The biggest thing lacking in voice recognition right now is not word understanding or any of the other NLP areas of research: it's a lot of the little nuances of conversation flow. For now, most systems aren't very good at interruptions, for instance. From the easy ones like "let me respond to your question as soon as I understand what you are asking, to save us both time," to the harder but perhaps more important ones like "No [that's not what I mean]," "Wait [let me add something, or let me change my mind]," and "Uh [you really just don't get it]," and presumably really hard ones like *clears throat* [listen carefully this time].
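
To make that concrete with a toy sketch (the names and routing here are entirely hypothetical, not how any shipping assistant actually works): the basic idea is to treat those short interjections as signals about the current turn rather than as fresh queries that reset the whole exchange.

    # Hypothetical sketch: route short interjections as control signals
    # for the turn in progress instead of treating them as new requests.
    CONTROL_SIGNALS = {
        "no":   "discard",  # "that's not what I mean" -> drop the current hypothesis
        "wait": "amend",    # "let me add something / change my mind" -> hold the turn open
        "uh":   "clarify",  # "you really just don't get it" -> ask a narrower question
    }

    def route(utterance, pending_request):
        """Decide whether the user is answering or steering the conversation."""
        token = utterance.strip().lower().strip(",.!")
        if token in CONTROL_SIGNALS:
            return ("control", CONTROL_SIGNALS[token], pending_request)
        return ("content", utterance, None)

    # "Wait" shouldn't wipe the slate; it should keep the pending request around.
    print(route("Wait,", "set a timer for 10 minutes"))
    # -> ('control', 'amend', 'set a timer for 10 minutes')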

The point should not be hitting 100% accuracy: real people in real conversations don't have 100% accuracy. The issue is how you recover from failures in real time and keep it "conversational" without feeling strained or overly verbose (such as the currently common "I heard x, is that correct?" versus a simple "x?" answered with a head nod or a very quick "yup").
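
As a toy illustration of what "less verbose" could mean (the thresholds and function are made up, not pulled from any real system), you could scale the confirmation effort to how unsure the recognizer is instead of always asking the long form:

    def confirmation_prompt(heard, confidence):
        """Return a confirmation question, or None to just act on the request."""
        if confidence > 0.9:
            return None                 # act immediately; a wrong guess is cheap to undo
        if confidence > 0.6:
            return heard + "?"          # lightweight echo; a quick "yup" confirms
        return "I heard '" + heard + "', is that correct?"  # verbose fallback

    print(confirmation_prompt("call mom", 0.95))  # None -> just do it
    print(confirmation_prompt("call mom", 0.70))  # call mom?
    print(confirmation_prompt("call Tom", 0.40))  # I heard 'call Tom', is that correct?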

We don't consciously think about the error correction systems at play in a conversation, which makes them hard to judge or replicate. It's easy to imagine there's an uncanny valley waiting for us between having no "natural error correction" ability and supporting error correction in a way that works with our natural background mechanisms.

At least in my mind, the next big area to study in language recognition is probably deeper looks into things like error correction sub-channels, conversational timing (esp. interruption), and elocution ("uh", "um", "you know", "that thing", "right, the blue one"). I'd even argue that what we have today is probably already "good enough" for the long run if it didn't require us to feel like we have to be so exact, because you only get one sentence at a time and you don't have good error correcting channels.



