
LLaVA is one LLM that takes both text and images as inputs - https://llava-vl.github.io/
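For anyone who wants to try it, here's a minimal sketch using the "llava-hf" conversion of LLaVA-1.5 on the Hub (the checkpoint name and the USER/ASSISTANT prompt template are the ones from the transformers docs; "page.png" is a placeholder for your own image):

    import torch
    from PIL import Image
    from transformers import AutoProcessor, LlavaForConditionalGeneration

    model_id = "llava-hf/llava-1.5-7b-hf"
    model = LlavaForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)

    image = Image.open("page.png")  # placeholder path
    prompt = "USER: <image>\nDescribe this image. ASSISTANT:"

    inputs = processor(text=prompt, images=image, return_tensors="pt")
    inputs = inputs.to(model.device, torch.float16)
    output = model.generate(**inputs, max_new_tokens=200)
    print(processor.decode(output[0], skip_special_tokens=True))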

LLaVA specifically might not be great for OCR, though; IIRC it scales all input images down to 336 x 336, meaning it'll only spot details that are visible at that scale.
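You can see the downscaling in the preprocessor itself. This sketch just runs LLaVA-1.5's CLIP image processor on a blank page-sized image to show what the model actually receives:

    from PIL import Image
    from transformers import CLIPImageProcessor

    # LLaVA-1.5's vision tower is CLIP ViT-L/14 at 336px.
    processor = CLIPImageProcessor.from_pretrained(
        "openai/clip-vit-large-patch14-336"
    )

    page = Image.new("RGB", (1700, 2200), "white")  # stand-in for a scanned page
    pixels = processor(images=page, return_tensors="pt")["pixel_values"]
    print(pixels.shape)  # torch.Size([1, 3, 336, 336]) -- ~5x downscale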

You can also search on HuggingFace for the tag "image-text-to-text" https://huggingface.co/models?pipeline_tag=image-text-to-tex... and find a variety of other models.
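The pipeline tags are also exposed through the Hub API, so you can list matching models programmatically. A rough sketch with huggingface_hub (I believe the pipeline tag is searchable as a plain tag via `filter`):

    from huggingface_hub import HfApi

    api = HfApi()
    # Ten most-downloaded models carrying the "image-text-to-text" tag.
    for m in api.list_models(filter="image-text-to-text",
                             sort="downloads", direction=-1, limit=10):
        print(m.id)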




I've had very poor results using LLaVA for OCR. It's slow and usually can't transcribe more than a few words. I think this is because it just uses CLIP to encode the image into a small, fixed grid of embeddings for the LLM, which throws away most fine text detail.
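For reference, here's a rough sketch (not LLaVA's actual code) of what the LLaVA-1.5 vision path does: CLIP patch features go through a small projector into the LLM's token space. The dimensions below match CLIP ViT-L/14-336 and a 7B LLaMA; the modules are stand-ins:

    import torch
    import torch.nn as nn

    clip_dim, llm_dim = 1024, 4096
    num_patches = (336 // 14) ** 2  # 576 visual tokens per image

    # LLaVA-1.5 uses a two-layer MLP as the projector.
    projector = nn.Sequential(
        nn.Linear(clip_dim, llm_dim),
        nn.GELU(),
        nn.Linear(llm_dim, llm_dim),
    )

    patch_features = torch.randn(1, num_patches, clip_dim)  # from the vision tower
    visual_tokens = projector(patch_features)               # (1, 576, 4096)
    # These tokens are concatenated with the text embeddings and fed to the
    # LLM, so the whole image has to fit in a fixed 24x24 grid of tokens.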

The latest LLaVA architecture is supposed to improve on this, but there are better-suited architectures if all you want is OCR.



