I've had very poor results using LLaVA for OCR. It's slow and usually can't transcribe more than a few words. I think this is because it just uses CLIP to encode the image into a single embedding vector for the LLM.
The latest architecture is supposed to improve this, but there are better architectures if all you want is OCR.
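
To put rough numbers on the bottleneck, here is a back-of-the-envelope sketch. It assumes a CLIP ViT-L/14-style encoder (14-pixel patches, square input of 224 or 336 pixels); those figures and the helper name `patch_tokens` are my assumptions for illustration, not something I've checked against a specific LLaVA release. Even if the LLM gets one embedding per patch rather than a single pooled vector, the budget is small next to a dense page of text.

```python
def patch_tokens(image_size: int, patch_size: int = 14) -> int:
    """Count the patch embeddings a ViT-style encoder yields for a square image.

    Assumes the image is resized to image_size x image_size and split into
    non-overlapping patch_size x patch_size patches (hypothetical defaults).
    """
    per_side = image_size // patch_size
    return per_side * per_side


if __name__ == "__main__":
    for size in (224, 336):
        print(f"{size}px input: {patch_tokens(size)} patch embeddings")
```

Under these assumptions, a whole scanned page gets squeezed into a few hundred embeddings at most, which would go some way toward explaining why only a few words come back.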