Wow, this is right up my alley. I started a side project recently using Tesseract to read book spines for inventory purposes and hooked it up to ChatGPT to clean up the text, having it "fill in the blanks" so to speak. I'll definitely give this a go; using two OCR engines together, I should get better results.
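In case it's useful to anyone doing the same: a minimal sketch (names are mine, not from any library) of reconciling two OCR readings of the same spine. Where the engines agree, keep the text; where they disagree, mark the span so an LLM prompt only has to "fill in the blanks" on the ambiguous bits:

```python
# Hypothetical sketch: merge two OCR readings of the same text.
# Agreed spans are kept verbatim; disagreements are wrapped in
# markers (engine A's reading | engine B's reading) for an LLM to resolve.
import difflib

def merge_ocr(a: str, b: str) -> str:
    out = []
    sm = difflib.SequenceMatcher(None, a, b)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "equal":
            out.append(a[i1:i2])
        else:
            # The two engines disagree here; surface both candidates.
            out.append("[[" + a[i1:i2] + "|" + b[j1:j2] + "]]")
    return "".join(out)

print(merge_ocr("The Grapes of Wrath", "The Grapes of Wrnth"))
```

The disagreement markers give the LLM a much narrower task than handing it two full raw strings.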
- https://github.com/clovaai/donut -- While it's primarily an "OCR-free document understanding transformer," I think it's worth experimenting with. I think I can sort this out by letting the LLM reason through it multiple times (although this will impact performance).
- https://github.com/kakaobrain/pororo -- Yesterday I got a suggestion to consider this one. I don't think development is still active, but the results are pretty great on Korean text.
Awesome - I'm a dabbler, but any thoughts on best engines for PDF tables? I've got tons of PDFs with similar tables embedded deep in them, but all formatted slightly differently. Seems like it should be easy....but nope!
Are you able to highlight the text on the PDF? If so, I highly recommend PDF2TXT to extract text from PDFs. It would require some parsing work on your part to convert the output back into a table, but there's no risk of inference errors, since it's direct text extraction rather than OCR.
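For the "parsing work" part: if your extracted tables keep a fixed layout, something as simple as this can rebuild rows (a rough sketch, assuming columns are separated by runs of two or more spaces; the sample data is made up, and you'd tune the split rule for your own PDFs):

```python
import re

def rows_from_text(text: str) -> list[list[str]]:
    """Split each non-empty line on runs of 2+ spaces to recover columns."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            rows.append(re.split(r"\s{2,}", line.strip()))
    return rows

# Fabricated sample standing in for text extracted from a PDF table:
sample = """\
Item        Qty   Price
Widget      3     9.99
Gadget      12    4.50
"""
print(rows_from_text(sample))
```

That obviously breaks on multi-line cells or columns that vary per document, which is exactly why "should be easy... but nope" is the usual experience.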
Cool. Is there also a library that could be combined with this to automatically translate any language except the ones I understand (which I'd specify as a list of ISO codes)?
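Not the author, but I don't know of a single library that does exactly this. The glue logic is small, though, if you pair a language detector that returns ISO 639-1 codes (langdetect does) with any translation backend. A sketch of just the skip-list part, with the detector and translator stubbed out as hypothetical callables:

```python
def maybe_translate(text, detect, translate, known=frozenset({"en"})):
    """Translate only when the detected ISO 639-1 code is not in `known`."""
    code = detect(text)
    return text if code in known else translate(text)

# Stub detector/translator for illustration only -- swap in real ones.
def fake_detect(t):
    # Crude check: any Hangul syllable -> pretend it's Korean.
    return "ko" if any("\uac00" <= ch <= "\ud7a3" for ch in t) else "en"

def fake_translate(t):
    return f"[translated] {t}"

print(maybe_translate("hello", fake_detect, fake_translate))
print(maybe_translate("안녕하세요", fake_detect, fake_translate))
```

The `known` set is exactly your "list of ISO codes I understand"; everything else gets sent to the translator.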
Any plans to add other OCR engines?