> But with all that said, my "acid test" for any multimodal LLM here is to "simply" find the X,Y coordinates of "1", "2", "+", and "=" on the screenshot of a calculator app.
Hugs! If you find such a thing, could you please make a post about it? I'm looking for the same thing and have been trying the same test.