Vision transformers are good enough that you can use them alone even on cursive handwriting. I've had amazing results with Microsoft's models and have my own little piece of wrapper software I use to transcribe blog posts I write in my notebook.
A couple of other people in the thread are using it too apparently. They're the Microsoft TROCR models. You do need a moderate amount of software to deskew, process, and segment the image before handing it to the model but after that it's typically extremely accurate in my experience.
Setting up my software online and monetizing it is next in the queue after my current side project. Although I haven't checked the model licenses.
Try again with 4o through the ChatGPT interface. Since I am getting very good results. I don't think gpt 4 was multimodal like gpt4o so must have used some other methodology?