Having tried this in the past, it can work pretty well 90% of the time. However,...

eigenvalue · 2024-08-09T18:00:50 1723226450

Agreed, this should not be used for anything mission critical unless you're going to sit there and carefully review the output by hand (although that is still going to be 100x faster than trying to manually correct the raw OCR output).

Where it's most useful to me personally is when I want to read some old book from the 1800s about the history of the Royal Navy [0] or something like that which is going to look really bad on my Kindle Oasis as a PDF, and the OCR version available from Archive.org is totally unreadable because there are 50 typos on each page. The ability to get a nice Markdown file that I can turn into an epub and read natively is really nice, and now cheap and fast.

[0] https://archive.org/details/royalnavyhistory02clowuoft/page/...

ozim · 2024-08-09T22:18:17 1723241897

Why does it have to be 100% accurate?

If you get 90% of the work done and only have to fix some numbers and names, it still saves you time, doesn't it?

choilive · 2024-08-09T22:39:35 1723243175

There's some time savings, but not a ton.

If there are 30 fields on a document at 90% accuracy, each field still needs to be validated by a human because you can't trust that it is correct. So the O(n) human step of checking each field is still there, and for fields that are long, pseudo-random-looking strings (think account numbers, numbers on invoices and receipts, instrumentation measurement values, etc.) there is almost no time savings, because the mental effort to input something like 015729042 is about the same as verifying that it is correct.

At 100% accuracy you remove that need altogether.
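
To put rough numbers on the 30-field example, here is a minimal Python sketch. It assumes that "90% accuracy" means each field is independently correct with probability 0.9, which is a simplification and not something stated above; the function names are made up for illustration.

```python
def p_all_correct(n_fields: int, per_field_accuracy: float) -> float:
    """Chance that every field on the document is right.

    Assumes field errors are independent (a simplifying assumption).
    """
    return per_field_accuracy ** n_fields


def expected_errors(n_fields: int, per_field_accuracy: float) -> float:
    """Average number of wrong fields per document."""
    return n_fields * (1.0 - per_field_accuracy)


if __name__ == "__main__":
    n, acc = 30, 0.90
    print(f"P(all {n} fields correct) = {p_all_correct(n, acc):.3f}")
    print(f"Expected wrong fields     = {expected_errors(n, acc):.1f}")
```

Under that assumption, only about 4% of documents come out fully correct and a typical document has around 3 bad fields, with no indication of which ones, so every field still has to be checked.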

kevingadd · 2024-08-10T17:43:25 1723311805

Let's say you're OCRing a contract. Odds are good that almost every part of the contract is there for an important reason, though it may not matter to you. How many errors can you tolerate in the terms of a contract that governs, e.g., your home, the car you drive to work, or your health insurance coverage? Do you want to take a gamble on those terms that could, in the worst case, result in getting kicked out of your apartment or having to pay a massive medical bill yourself?

The important question is which parts are inaccurate. If it's messing up names and numbers but is 99.9% accurate for everything else, you can just go back and check all the names and numbers at the end. But if the whole thing is only 90% accurate, you now either recheck the whole document or you risk a 'must' turning into a 'may' in a critical place that undermines the whole document.
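
The "go back and check all the names and numbers" pass could be roughly sketched like this in Python. The heuristics are assumptions for illustration only: anything containing a digit counts as a number, and any capitalized word that does not start a sentence counts as a possible name. It will miss names at the start of a sentence, and it does nothing about the 'must' versus 'may' problem.

```python
import re


def flag_for_review(text: str) -> list[tuple[int, str]]:
    """Return (offset, token) pairs a human should double-check.

    Heuristic only: tokens with digits, plus capitalized words
    that are not the first word of a sentence.
    """
    flagged = []
    for m in re.finditer(r"\S+", text):
        token = m.group().strip(".,;:()\"'")
        if not token:
            continue
        before = text[:m.start()].rstrip()
        sentence_start = before == "" or before[-1] in ".!?"
        has_digit = any(ch.isdigit() for ch in token)
        looks_like_name = token[0].isupper() and not sentence_start
        if has_digit or looks_like_name:
            flagged.append((m.start(), token))
    return flagged


if __name__ == "__main__":
    sample = ("The tenant John Smith must pay 1,250 dollars by 2024-08-09. "
              "Late fees may apply.")
    for offset, token in flag_for_review(sample):
        print(offset, token)
```

This only pays off in the first scenario, where the errors are known to cluster in names and numbers; if mistakes can land anywhere, the whole document still has to be reread.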