Segmenting at lower resolution and then using them at higher resolution using resolution multipliers don't work as other items bleed in. FastSAM paper has some interesting ideas on doing this with CNNs which I guess SAM2 have superseded. However, the complication in the pipeline is not worth the result as I find vision LLMs are able to do almost the same task within the same OCR prompt.
I prefer to do all of this in 1 step with an LLM with a good prompt and few shots.
With so many passes with images, the costs/time will be high with ViT being slower.