Wow, this is right up my alley. I started a side project recently using Tesseract to read book spines for inventory purposes and hooked it up to ChatGPT to clean up the text, having it "fill in the blanks" so to speak. I'll definitely give this a go; using two OCR engines together, I should get better results.
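In case it's useful to anyone doing the same: a minimal sketch (names are mine, not from any library) of reconciling two OCR readings of the same spine. Where the engines agree, keep the text; where they disagree, mark the span so an LLM prompt only has to "fill in the blanks" on the ambiguous bits:

```python
# Hypothetical sketch: merge two OCR readings of the same text.
# Agreed spans are kept verbatim; disagreements are wrapped in
# markers (engine A's reading | engine B's reading) for an LLM to resolve.
import difflib

def merge_ocr(a: str, b: str) -> str:
    out = []
    sm = difflib.SequenceMatcher(None, a, b)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "equal":
            out.append(a[i1:i2])
        else:
            # The two engines disagree here; surface both candidates.
            out.append("[[" + a[i1:i2] + "|" + b[j1:j2] + "]]")
    return "".join(out)

print(merge_ocr("The Grapes of Wrath", "The Grapes of Wrnth"))
```

The disagreement markers give the LLM a much narrower task than handing it two full raw strings.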
- https://github.com/clovaai/donut -- While it's primarily an "OCR-free document understanding transformer," I think it's worth experimenting with. I think I can sort this out by letting the LLM reason through it multiple times (although this will impact performance).
- https://github.com/kakaobrain/pororo -- Yesterday I got a suggestion to consider this one. I don't think development is still active, but the results are pretty great on Korean text.
Awesome - I'm a dabbler, but any thoughts on best engines for PDF tables? I've got tons of PDFs with similar tables embedded deep in them, but all formatted slightly differently. Seems like it should be easy....but nope!
Are you able to highlight the text on the PDF? If so, I highly recommend PDF2TXT to extract text from PDFs. It would require some parsing work on your part to convert the output back into a table, but there's no risk of inference errors, since it's direct text extraction rather than OCR.
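For the "parsing work" part: if your extracted tables keep a fixed layout, something as simple as this can rebuild rows (a rough sketch, assuming columns are separated by runs of two or more spaces; the sample data is made up, and you'd tune the split rule for your own PDFs):

```python
import re

def rows_from_text(text: str) -> list[list[str]]:
    """Split each non-empty line on runs of 2+ spaces to recover columns."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            rows.append(re.split(r"\s{2,}", line.strip()))
    return rows

# Fabricated sample standing in for text extracted from a PDF table:
sample = """\
Item        Qty   Price
Widget      3     9.99
Gadget      12    4.50
"""
print(rows_from_text(sample))
```

That obviously breaks on multi-line cells or columns that vary per document, which is exactly why "should be easy... but nope" is the usual experience.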
Cool. Is there also a library that could be combined with this to automatically translate any language except the ones I understand (which I'd specify as a list of ISO codes)?
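Not the author, but I don't know of a single library that does exactly this. The glue logic is small, though, if you pair a language detector that returns ISO 639-1 codes (langdetect does) with any translation backend. A sketch of just the skip-list part, with the detector and translator stubbed out as hypothetical callables:

```python
def maybe_translate(text, detect, translate, known=frozenset({"en"})):
    """Translate only when the detected ISO 639-1 code is not in `known`."""
    code = detect(text)
    return text if code in known else translate(text)

# Stub detector/translator for illustration only -- swap in real ones.
def fake_detect(t):
    # Crude check: any Hangul syllable -> pretend it's Korean.
    return "ko" if any("\uac00" <= ch <= "\ud7a3" for ch in t) else "en"

def fake_translate(t):
    return f"[translated] {t}"

print(maybe_translate("hello", fake_detect, fake_translate))
print(maybe_translate("안녕하세요", fake_detect, fake_translate))
```

The `known` set is exactly your "list of ISO codes I understand"; everything else gets sent to the translator.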
Any plans to add other OCR engines?