
Heyo! Can anyone recommend a free (as in beer) option for transforming image PDFs into OCR'd PDFs [1]? French support + macOS required, FLOSS preferred.

[1]: I'm not sure if I'm very clear 😕. Here's my use case: I have an app on my phone that scans documents to PDFs but it doesn't do any OCR. I also have a bunch of digital documents for which I don't have a paper version anymore. I'd like to OCR these documents to make them searchable and allow copy/paste.

@Crocmagnon

I'd suggest checking out imagemagick (convert command) for preprocessing the original, then using tesseract for OCR
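A rough sketch of that two-step idea for a single scanned image, driven from Python. The preprocessing flags (grayscale plus `-normalize`), the `_clean` / `_ocr` file names and the `fra` default are assumptions picked for illustration, not something recommended in this thread. It also assumes `convert` and `tesseract` are on the PATH with the French language data installed (on ImageMagick 7 the command may be `magick` instead).

```python
import shutil
import subprocess
from pathlib import Path


def build_commands(image: Path, lang: str = "fra") -> list[list[str]]:
    """Return the two commands to run: ImageMagick cleanup, then Tesseract OCR."""
    cleaned = image.with_name(image.stem + "_clean.png")  # hypothetical name
    out_base = image.with_name(image.stem + "_ocr")       # tesseract appends .pdf
    return [
        # Grayscale + contrast stretch; tune or replace these for your scans.
        ["convert", str(image), "-colorspace", "Gray", "-normalize", str(cleaned)],
        # The trailing "pdf" asks tesseract for a searchable PDF.
        ["tesseract", str(cleaned), str(out_base), "-l", lang, "pdf"],
    ]


def ocr_image(image: Path, lang: str = "fra") -> Path:
    """Run both commands and return the path of the searchable PDF."""
    for cmd in build_commands(image, lang):
        if shutil.which(cmd[0]) is None:
            raise RuntimeError(f"{cmd[0]} not found on PATH")
        subprocess.run(cmd, check=True)
    return image.with_name(image.stem + "_ocr.pdf")
```

Calling `ocr_image(Path("scan.png"))` would write `scan_clean.png` and `scan_ocr.pdf` next to the input.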

@Jase thanks for your suggestion! pdfsandwich was mentioned before and is basically a toolchain wrapping enhancement tools and tesseract.

I guess your suggestion is better suited to raw images? I don’t know if imagemagick can be used on PDFs.

@Crocmagnon

yes, although you will need ghostscript installed too.

Check out the imagemagick docs / forums for pdf ocr preprocessing

@Crocmagnon I have used PDFSandwich

http://www.tobias-elze.de/pdfsandwich/

OCR is done by tesseract, which isn't top grade but works for me.

@ben @themactep Thank you both for your suggestions! pdfsandwich seems like a cool wrapper around tesseract and other tools.
It seems to work OK for French documents.

I'd be happy to have GUI suggestions as well!

At least I have a nice CLI tool in my toolbelt now 👍

@Crocmagnon @mike I recommend this service: doxisafe.me/#/safe/start
They have their own AI named Deeper (it's not Google's Tesseract), plus an extra cloud service for free. Best I've found so far.

@kettcar64 @mike Thanks for your suggestion, but I don’t feel safe uploading my pdf to an online service I don’t have control over. Plus there are some confidential documents I’m not allowed to upload anywhere among the ones I need to process 😊

I admit that it’s a really simple and easy solution though, and it might be sufficient for some!

@Crocmagnon if you don't need them to be directly stored on your phone, you can selfhost paperless-ng! It's a great app I selfhost at home and it has mobile apps that allow you to upload scans directly from your phone. The machine hosting it will then do OCR and the web interface lets you search through tags or OCR content

@iconvacation it looks awesome! I’ll definitely check it out, thanks for the suggestion!

Do you know if it can be configured to push the final OCR’d document to a specific NextCloud folder? That would complete the loop nicely 👌🏻

@Crocmagnon This might be quite a bit of overkill but I had paperless running for a couple of years and it did its job wonderfully. It's a server architecture that ingests everything in a folder, OCRs and files it for you. While the original isn't maintained there is github.com/jonaswinkler/paperl nowadays, though I have not tried this fork.

Fosstodon

Fosstodon is an English speaking Mastodon instance that is open to anyone who is interested in technology; particularly free & open source software.