LLaVA specifically might not be great for OCR, though; IIRC it scales all input images to 336 x 336 - meaning it'll only spot details that are visible at that scale.
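If you want to check that yourself, here's a quick sketch against the llava-hf 1.5 checkpoint on the Hub (the repo name is just an example, not something from the thread):

    from transformers import AutoProcessor

    # Assumption: the llava-hf LLaVA 1.5 checkpoint on the Hub.
    processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

    # LLaVA 1.5's image processor is the CLIP one: resize the short edge
    # to 336 px, then center-crop to 336 x 336, so fine print is
    # downscaled away before the model ever sees it.
    print(processor.image_processor.size)       # {'shortest_edge': 336}
    print(processor.image_processor.crop_size)  # {'height': 336, 'width': 336}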
I've had very poor results using LLaVA for OCR. It's slow and usually can't transcribe more than a few words. I think this is because it just uses CLIP to encode the image into a small, fixed grid of embedding vectors for the LLM, so most fine text detail never survives the encoding.
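You can actually count what the LLM receives; with the ViT-L/14-336 tower that LLaVA 1.5 uses, every image comes out as the same 576 patch embeddings regardless of its original size. Rough sketch, checkpoint name assumed:

    import torch
    from PIL import Image
    from transformers import CLIPImageProcessor, CLIPVisionModel

    # The vision tower LLaVA 1.5 uses (ViT-L/14 at 336 px).
    tower = "openai/clip-vit-large-patch14-336"
    proc = CLIPImageProcessor.from_pretrained(tower)
    vision = CLIPVisionModel.from_pretrained(tower)

    # A large, text-dense page still gets squeezed to 336 x 336...
    page = Image.new("RGB", (2000, 3000), "white")
    inputs = proc(images=page, return_tensors="pt")

    with torch.no_grad():
        out = vision(**inputs)

    # ...and comes out as 576 patch embeddings plus one CLS token,
    # no matter how much text the original page held.
    print(out.last_hidden_state.shape)  # torch.Size([1, 577, 1024])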
The latest architecture (LLaVA-NeXT, which tiles inputs at higher resolution) is supposed to improve this, but there are better architectures if all you want is OCR.
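If straight transcription is the goal, a dedicated OCR model like TrOCR is a better fit than a general VLM; minimal sketch, with the checkpoint and input file as illustrative assumptions:

    from PIL import Image
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    # Checkpoint and input path are illustrative; note TrOCR expects a
    # single cropped line of text, not a whole page.
    processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
    model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

    line = Image.open("text_line.png").convert("RGB")
    pixel_values = processor(images=line, return_tensors="pt").pixel_values
    ids = model.generate(pixel_values)
    print(processor.batch_decode(ids, skip_special_tokens=True)[0])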
You can also search on HuggingFace for the tag "image-text-to-text" https://huggingface.co/models?pipeline_tag=image-text-to-text and find a variety of other models.
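If you'd rather script that search, the same filter is exposed through the Hub client; a sketch assuming a recent huggingface_hub where list_models accepts pipeline_tag:

    from huggingface_hub import HfApi

    # Same filter as the URL above, ordered by downloads.
    api = HfApi()
    for m in api.list_models(pipeline_tag="image-text-to-text",
                             sort="downloads", limit=10):
        print(m.id)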