You know, I’ve really looked hard at what’s out there and haven’t been able to f...

kergonath · 2024-08-09T18:11:37 1723227097

> You know, I’ve really looked hard at what’s out there and haven’t been able to find anything else that’s totally free/open, that runs well on CPU, and which has better quality output than Tesseract. I found a couple Chinese projects but had trouble getting them to work and the documentation wasn’t great. If you have any leads on others to try I’d love to hear about them.

I did more or less the same, trying to solve the same problem. I ended up biting the bullet and using Amazon Textract. The OCR is much better than Tesseract, and the layout tool is quite reliable to get linear text out of 2-columns documents (which is critical for my use case).

I would be very happy to find something as reliable that would work on a workstation without relying on anyone’s cloud.

constantinum · 2024-08-10T16:54:24 1723308864

Was this by any chance Paddle OCR https://github.com/PaddlePaddle/PaddleOCR

fred123 · 2024-08-09T17:11:34 1723223494

macOS Live Text is incredible. Mac only though

eigenvalue · 2024-08-09T18:03:02 1723226582

Yes, I imagine it's using the same OCR model as the iPhone, which is really incredibly good. In fact, it's so good that I made a little app for fun just to be able to use it for OCRing whole PDF books:

https://apps.apple.com/us/app/super-pdf-ocr/id6479674248

kergonath · 2024-08-09T18:16:41 1723227401

Interesting! I’ll give it a try, I have a couple of large books to OCR (to be honest, the name in all caps with underscores is not really encouraging).

From your experience, how does the OCR engine work with multiple-columns documents?

eigenvalue · 2024-08-09T18:25:47 1723227947

The iOS app would likely not handle two-column text very well. I really made the iOS app on a lark for personal use, the whole thing took like 2 hours, and I'd never even made a Swift or iOS app before. It actually took longer to submit it to the App Store than it did to create it from scratch, because all the hard stuff in the app uses built-in iOS APIs for file loading, PDF reading, screenshot extraction, OCR, NLP for sentence splitting, and sharing the output.

I think the project I submitted here would do that better, particularly if you revised the first prompt to include an instruction about handling two column text (like "Attempt to determine if the extracted text actually came from two columns of original text; if so, reformat accordingly.")

The beauty of this kind of prompt engineering code is that you can literally change how the program works just by editing the text in the prompt templates!

kergonath · 2024-08-09T18:51:48 1723229508

Thanks, I’ll try to play with this. Thanks also for keeping us updated, your work is very interesting!

wahnfrieden · 2024-08-09T20:58:01 1723237081

Sadly no bounding rects

fred123 · 2024-08-09T21:43:51 1723239831

You can get them through the Vision API (Swift/Objective-C/AppleScript)

_boffin_ · 2024-08-10T19:17:02 1723317422

You’re forgetting about Python and TypeScript/JavaScript. PyObjC and whatever it is for TypeScript.

wahnfrieden · 2024-08-09T22:21:14 1723242074

Yes but it's relatively shit

The Vision API can't even read vertical Japanese text

fred123 · 2024-08-10T04:54:36 1723265676

Fair enough. There are some new OCR APIs in the next macOS release. I wonder if the model has been improved.

wahnfrieden · 2024-08-10T15:53:13 1723305193

They're just a new Swift-only interface to the same underlying behaviors, no apparent improvement. I was hoping for more given the visionOS launch but alas

What I'm trying now is combining ML Kit v2 with Live Text - Apple's for the accurate paragraphs of text, and then custom indexing that against the ML Kit v2 output to add bounding rects and guessing corrections for missing/misidentified parts from ML Kit (using it only for bounding rects and expecting it will make mistakes on the text recognition)

I also investigated private APIs for extracting rects from Live Text. It looks possible, the APIs are there (it has methods or properties which give bounding rects as is obviously required for Live Text functionality), but I can't wrap my head around accessing them yet.

fred123 · 2024-08-10T20:21:22 1723321282

I feel like text detection is much better covered by the various ML models discussed elsewhere in the comments. Maybe you can combine those with Live Text. I found Tesseract pretty ok for text detection as well but I don’t know if any of the models are good for vertical text.

wahnfrieden · 2024-08-11T02:49:11 1723344551

ML Kit v2 works with vertical text better than Tessy

anonymoushn · 2024-08-09T17:05:07 1723223107

I ended up using EasyOCR. I assume it is too slow in CPU-only mode.

aidenn0 · 2024-08-09T17:30:22 1723224622

> I assume it is too slow in CPU-only mode.

So you don't have to assume: I gave up after running on 8 cores (Ryzen 7 2700) for 10 days for a single page.

fred123 · 2024-08-09T17:42:37 1723225357

Something wrong with your setup. It should be less than 30 s per page with your hardware

aidenn0 · 2024-08-09T23:20:28 1723245628

Huh, I tried with the version from pip (instead of my package manager) and it completes in 22s. Output on the only page I tested is considerably worse than tesseract, particularly with punctuation. The paragraph detection seemed to not work at all, rendering the entire thing on a single line.

Even worse for my uses, Tesseract had two mistakes on this page (part of why I picked it), and neither of them were correctly read by EasyOCR.

Partial list of mistakes:

1. Missed several full-stops at the end of sentences

2. Rendered two full-stops as colons

3. Rendered two commas as semicolons

4. Misrendered every single em-dash in various ways (e.g. "\_~")

5. Missed 4 double-quotes

6. Missed 3 apostrophes, including rendering "I'll" as "Il"

7. All 5 exclamation points were rendered as a lowercase-ell ("l"). Tesseract got 4 correct and missed one.

ein0p · 2024-08-09T20:02:23 1723233743

I use a container on a machine with an old quad core i7 and no GPU compute. This should take at most tens of seconds per page.

yard2010 · 2024-08-09T17:41:05 1723225265

...how is it so slow?

savikko · 2024-08-09T18:05:09 1723226709

I have some pretty good experiences with PaddleOCR but you may refer to this Chinese and badly documented ones.

For our use case PaddleOCR + LLM has been quite nice combo.

eigenvalue · 2024-08-09T19:03:29 1723230209

Yes, that's one of the ones I tried. It seemed to be more designed for things like receipts and menus rather than books. But in any case, I found it hard to set up and use (and it's likely slow on the CPU compared to Tesseract, which despite its low accuracy, is at least very fast on CPU).