You don't need a GPT-4-sized model to count brackets; you just need to make sure your training data includes enough cases like that for the network to learn it. My point is that GPT-4 can do far more complicated things than that, so there's nothing specific about LMs that precludes them from getting this kind of thing right.
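For concreteness, here's roughly what the bracket task amounts to, as a minimal stack-based sketch (the exact task framing is my assumption, not something stated above):

```python
# Minimal sketch of the bracket-balance check under discussion.
# Assumption: "counting brackets" means verifying that (), [], {} nest properly.
def brackets_balanced(s: str) -> bool:
    pairs = {")": "(", "]": "[", "}": "{"}
    stack = []
    for ch in s:
        if ch in "([{":
            stack.append(ch)
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack

assert brackets_balanced("foo(bar[baz]{qux})")
assert not brackets_balanced("foo(bar[baz)]")
```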
Consider the fact that GPT-4 can generate valid XML (balanced tags, properly quoted attributes, etc.) in base64-encoded form, without CoT, just direct output.
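If you wanted to verify that claim yourself, the check is straightforward: decode the model's raw output and see whether it parses as well-formed XML. A hedged sketch (the sample string is illustrative, not an actual model output):

```python
# Decode a candidate base64 string and check the result is well-formed XML.
import base64
import xml.etree.ElementTree as ET

def is_base64_encoded_xml(output: str) -> bool:
    try:
        decoded = base64.b64decode(output, validate=True).decode("utf-8")
        ET.fromstring(decoded)  # raises ParseError on unbalanced tags/quotes
        return True
    except Exception:
        return False

# e.g. the base64 encoding of '<root><item id="1">ok</item></root>'
sample = base64.b64encode(b'<root><item id="1">ok</item></root>').decode()
print(is_base64_encoded_xml(sample))  # True
```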