
LLaVA is one LLM that takes both text and images as inputs - https://llava-vl.github.io/
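For anyone who wants to try it, here's a minimal sketch using the "llava-hf" conversion of LLaVA-1.5 on the Hub (the checkpoint name and the USER/ASSISTANT prompt template are the ones from the transformers docs; "page.png" is a placeholder for your own image):

    import torch
    from PIL import Image
    from transformers import AutoProcessor, LlavaForConditionalGeneration

    model_id = "llava-hf/llava-1.5-7b-hf"
    model = LlavaForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)

    image = Image.open("page.png")  # placeholder path
    prompt = "USER: <image>\nDescribe this image. ASSISTANT:"

    inputs = processor(text=prompt, images=image, return_tensors="pt")
    inputs = inputs.to(model.device, torch.float16)
    output = model.generate(**inputs, max_new_tokens=200)
    print(processor.decode(output[0], skip_special_tokens=True))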

LLaVA specifically might not be great for OCR, though; IIRC it scales all input images down to 336 x 336, meaning it'll only spot details that are visible at that scale.
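You can see the downscaling in the preprocessor itself. This sketch just runs LLaVA-1.5's CLIP image processor on a blank page-sized image to show what the model actually receives:

    from PIL import Image
    from transformers import CLIPImageProcessor

    # LLaVA-1.5's vision tower is CLIP ViT-L/14 at 336px.
    processor = CLIPImageProcessor.from_pretrained(
        "openai/clip-vit-large-patch14-336"
    )

    page = Image.new("RGB", (1700, 2200), "white")  # stand-in for a scanned page
    pixels = processor(images=page, return_tensors="pt")["pixel_values"]
    print(pixels.shape)  # torch.Size([1, 3, 336, 336]) -- ~5x downscale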

You can also search on HuggingFace for the tag "image-text-to-text" https://huggingface.co/models?pipeline_tag=image-text-to-tex... and find a variety of other models.
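The pipeline tags are also exposed through the Hub API, so you can list matching models programmatically. A rough sketch with huggingface_hub (I believe the pipeline tag is searchable as a plain tag via `filter`):

    from huggingface_hub import HfApi

    api = HfApi()
    # Ten most-downloaded models carrying the "image-text-to-text" tag.
    for m in api.list_models(filter="image-text-to-text",
                             sort="downloads", direction=-1, limit=10):
        print(m.id)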




I've had very poor results using LLaVA for OCR. It's slow and usually can't transcribe more than a few words. I think this is because it just uses CLIP to encode the image into a small, fixed grid of embeddings for the LLM, which throws away most fine text detail.
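For reference, here's a rough sketch (not LLaVA's actual code) of what the LLaVA-1.5 vision path does: CLIP patch features go through a small projector into the LLM's token space. The dimensions below match CLIP ViT-L/14-336 and a 7B LLaMA; the modules are stand-ins:

    import torch
    import torch.nn as nn

    clip_dim, llm_dim = 1024, 4096
    num_patches = (336 // 14) ** 2  # 576 visual tokens per image

    # LLaVA-1.5 uses a two-layer MLP as the projector.
    projector = nn.Sequential(
        nn.Linear(clip_dim, llm_dim),
        nn.GELU(),
        nn.Linear(llm_dim, llm_dim),
    )

    patch_features = torch.randn(1, num_patches, clip_dim)  # from the vision tower
    visual_tokens = projector(patch_features)               # (1, 576, 4096)
    # These tokens are concatenated with the text embeddings and fed to the
    # LLM, so the whole image has to fit in a fixed 24x24 grid of tokens.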

The latest LLaVA architecture is supposed to improve on this, but there are better-suited architectures if all you want is OCR.



