
This package seems to use llama_cpp for local inference [1], so you can probably use any model supported by llama.cpp [2]. However, it appears to only pass the OCR output to the model for correction; the language model never sees the original image.

That said, there are several large language models you can run locally that do accept image input: Phi-3-Vision [3], LLaVA [4], MiniCPM-V [5], etc.

[1] - https://github.com/Dicklesworthstone/llm_aided_ocr/blob/main...

[2] - https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#de...

[3] - https://huggingface.co/microsoft/Phi-3-vision-128k-instruct

[4] - https://github.com/haotian-liu/LLaVA

[5] - https://github.com/OpenBMB/MiniCPM-V
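
For anyone curious what "passing OCR output for correction" looks like in practice, here's a minimal sketch using the llama-cpp-python bindings. The prompt wording, model path, and parameters are my own illustration, not code taken from the package:

```python
def build_correction_prompt(ocr_text: str) -> str:
    """Wrap raw OCR output in an instruction asking the model to fix errors."""
    return (
        "The following text was produced by OCR and may contain errors. "
        "Correct obvious mistakes without changing the meaning:\n\n"
        + ocr_text
    )

def correct_ocr(ocr_text: str, model_path: str) -> str:
    # Deferred import: llama_cpp is only needed once a local model is available.
    from llama_cpp import Llama

    llm = Llama(model_path=model_path, n_ctx=4096)
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": build_correction_prompt(ocr_text)}],
        temperature=0.0,  # deterministic output for a correction task
    )
    return out["choices"][0]["message"]["content"]
```

Since the model only ever receives `ocr_text`, any character the OCR engine misread is invisible to it; that's the limitation the vision models above would address.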
