I think another issue is having to re-invent cues for changing who is talking in the conversation. When you're talking to someone in person, you can pick up a lot of information about whether they want to keep talking or whether it's your turn. On video conferences, those signals are missing. You might detect a lull in the conversation and start talking, but at the same time someone else does the same thing. With 100ms+ of latency, you can get well into what you're saying before realizing that someone else took their turn (or you took theirs). Then the collision-mitigation dance kicks in, and it's just as bad for your mental fatigue as it is for your WiFi's latency (which uses the same algorithm: listen before talking, and if you collide, wait a random amount of time and try again). These things just don't happen when a few people are in the same room together.
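For fun, the WiFi analogy can be sketched in a few lines of Python. This is a toy listen-before-talk loop with random backoff, in the spirit of CSMA-style protocols; the speaker names, talk probability, and backoff range are made up for illustration:

```python
import random

def resolve_collision(speakers, rng, max_rounds=10):
    """Toy listen-before-talk with random backoff.

    Each round, every speaker who isn't serving a backoff penalty
    independently decides whether to start talking. If exactly one
    talks, they win the floor. If several talk at once, that's a
    collision: everyone involved waits a random number of rounds
    before trying again.
    """
    backoff = {s: 0 for s in speakers}  # rounds each speaker must stay quiet
    for round_no in range(1, max_rounds + 1):
        talking = []
        for s in speakers:
            if backoff[s] > 0:
                backoff[s] -= 1          # still waiting out a backoff
            elif rng.random() < 0.5:     # detected a lull, start talking
                talking.append(s)
        if len(talking) == 1:
            return talking[0], round_no  # clean turn: exactly one voice
        for s in talking:                # collision: random backoff for all
            backoff[s] = rng.randint(1, 3)
    return None, max_rounds              # nobody got a clean turn

rng = random.Random(0)
winner, rounds = resolve_collision(["alice", "bob"], rng)
print(winner, rounds)
```

The point of the toy model is that resolution takes multiple rounds of wasted "air time", which is fine for packets but exhausting for humans.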
The result is that we try to use the video channel to pre-allocate timeslots for speaking. You make your eyes bigger, you raise a finger... but sometimes people aren't looking at the video, or don't understand what your newly-invented cue actually means. This is all very tiring.
The overall quality of the call is much lower than real life, as well. People do not own good microphones or cameras, so you can't actually hear them or see them very well. The noise gate intervenes and just cuts off audio from time to time. It is maddening how bad it all is.
Many years ago, when I worked for Bank of America, we had these multi-$100k Cisco videoconferencing setups. They worked really well. My friends and I were at work pretty late when a tornado warning meant we couldn't go home (we all biked), so we went to two of these conference rooms, right next to each other, set up a link between them, and had a totally normal conference. (The entire wall of each room was a video screen, backed by an array of cameras, microphones, and speakers. Everything was tuned so perfectly that the people in the other room looked like they were sitting across the table from you. There was no latency, everyone was their normal size, and the audio and video quality were flawless. Obviously, with two rooms across the hall from each other, there shouldn't be any network latency... but at least the system didn't add its own. It makes a big difference.)
Finally, I think another issue is that people just aren't used to getting work done over video conferences. I worked on a remote team at Google, so pretty much 100% of my meetings were video conferences: 1:1s with my manager, everything. The system suffered from the same quality and latency issues as anything else (though we typically did have good cameras and microphone arrays in every room), but through practice, people got good at getting stuff done despite the limitations. I never felt fatigued the way I do on calls with random people at home. (I guess my tips are: have an agenda in advance, use your screen share to show progress through the agenda, and call on people in remote locations: "Anyone from the New York room have anything to add?")