A couple of other people in the thread are using it too, apparently. They're the Microsoft TrOCR models. You do need a moderate amount of software to deskew, process, and segment the image before handing it to the model, but after that it's typically extremely accurate in my experience.
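For anyone wondering what the "segment, then hand it to the model" part can look like, here is a minimal sketch. It is not the pipeline described above: `segment_lines` and `recognize_line` are hypothetical names, the segmentation is a plain horizontal projection profile that assumes the page is already deskewed and binarized (1 = ink), and the checkpoint name `microsoft/trocr-base-printed` is just one of the published TrOCR checkpoints picked for illustration. The TrOCR call needs `transformers` and `torch` installed and expects one cropped text line as a PIL RGB image.

```python
def segment_lines(binary, min_height=2):
    """Split a binarized page (rows of 0/1, 1 = ink) into text lines.

    Returns half-open (top, bottom) row ranges, one per run of consecutive
    rows that contain ink. Runs shorter than min_height are treated as noise.
    """
    spans = []
    start = None
    for y, row in enumerate(binary):
        has_ink = any(row)
        if has_ink and start is None:
            start = y  # first inked row of a new line
        elif not has_ink and start is not None:
            if y - start >= min_height:
                spans.append((start, y))
            start = None
    # Close a line that runs to the bottom edge of the page.
    if start is not None and len(binary) - start >= min_height:
        spans.append((start, len(binary)))
    return spans


def recognize_line(line_image, model_name="microsoft/trocr-base-printed"):
    """Run one cropped text line (PIL RGB image) through a TrOCR checkpoint."""
    # Imported lazily so the segmentation helper works without these packages.
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    processor = TrOCRProcessor.from_pretrained(model_name)
    model = VisionEncoderDecoderModel.from_pretrained(model_name)
    pixel_values = processor(images=line_image, return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

In practice you would crop the original image with each `(top, bottom)` range and pass the crops to `recognize_line` one at a time, loading the model once rather than per line. Real scans usually need more than this (deskewing, denoising, handling touching lines), which is where most of the extra software goes.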
Setting up my software online and monetizing it is next in the queue after my current side project, although I haven't checked the model licenses.
Try again with GPT-4o through the ChatGPT interface; I am getting very good results with it. I don't think GPT-4 was multimodal like GPT-4o, so it must have used some other methodology?