
I've had very poor results using LLaVA for OCR. It's slow and usually can't transcribe more than a few words. I think this is because it just uses CLIP to encode the image into a single embedding vector for the LLM.

The latest LLaVA architecture is supposed to improve on this, but there are better architectures if all you want is OCR.



