
Oh hey! (This might be the first time I've been paged on HN)

I'm extremely excited by real, non-hype reasons to use LLMs, and I've also been frustrated that OCR isn't 100% accurate... I currently use Tesseract OCR in the context of UI automation of mobile apps. UI automation is already notorious for flakiness, and I don't need to add to the problem... BUT... sometimes you only have access to the visible screen and literally nothing else... or you're in a regulated environment like payments, automotive, or medical device testing, where you're required to test the user interface exactly the way a user would, and you still want to automate that -- in those cases, all options are on the table, especially if an LLM-backed OCR approach works better.

But with all that said, my "acid test" for any multimodal LLM here is to "simply" find the X,Y coordinates of "1", "2", "+", and "=" on the screenshot of a calculator app. So far in my testing, with no or minimal extra prompt engineering, GPT-4o and LLaVA 1.5 fail this test miserably. But given the pace of AI announcements these days, I look forward to this being a solved problem in a few months? Or... is the LLM-Aided OCR Project the magic I've been looking for? Tools like plain Tesseract and EasyOCR retain the X,Y locations of the source text in the scanned document image. I can't tell if that meta-information is lost when run through the LLM here.
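For reference, here's a minimal sketch of what I mean by "retain the X,Y locations", using pytesseract against a hypothetical screenshot file "calculator.png" (the filename and target set are just for illustration). This is the per-word bounding-box data that I'd hate to lose in an LLM post-processing step:

    import pytesseract
    from PIL import Image

    img = Image.open("calculator.png")  # hypothetical calculator screenshot

    # --psm 11 (sparse text) tends to behave better on UI screenshots than the
    # default page segmentation, but results vary by app and font.
    data = pytesseract.image_to_data(
        img, config="--psm 11", output_type=pytesseract.Output.DICT)

    targets = {"1", "2", "+", "="}
    for text, left, top, w, h in zip(
            data["text"], data["left"], data["top"],
            data["width"], data["height"]):
        if text.strip() in targets:
            # Center of the recognized glyph's bounding box -- the point
            # you'd feed to a tap/click in a UI automation framework.
            print(text, left + w // 2, top + h // 2)

Whether plain Tesseract actually recognizes lone "+" and "=" glyphs on a given calculator UI is its own problem, but at least the coordinate metadata is there when it does.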




> But with all that said, my "acid test" for any multimodal LLM here is to "simply" find the X,Y coordinates of "1", "2", "+", and "=" on the screenshot of a calculator app.

hugs, if you find such a thing, could you please make a post about it? I'm looking for the same thing and trying the same test.


yes!



