I think the problem isn't specific to SO. Text-based communication with strangers lacks two crucial emotional filters. Before speaking, a person anticipates the listener's reaction and adjusts what they say accordingly. After speaking, they pay attention to the listener's reaction to update their understanding for the future.
Without seeing faces, people just don't do this very well.