1. # apt install tesseract-ocr

2. $ tesseract input.jpg outputfile   # input.jpg can be any supported image file format

3. $ cat outputfile.txt

You have now extracted the text from an image into a .txt file.

The .txt extension is added by tesseract, so leave it off the output name.

Yes, it IS that easy.
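
If you have a whole folder of scans, the steps above can be wrapped in a small loop. This is only a sketch: it assumes the images are .jpg files in the current directory, and out_base and ocr_all are made-up helper names, not part of tesseract.

```shell
#!/bin/sh
# Sketch: run tesseract on every .jpg in the current directory.
# out_base and ocr_all are hypothetical helpers, not tesseract features.

out_base() {
    # Strip the last extension: scan.jpg -> scan
    printf '%s\n' "${1%.*}"
}

ocr_all() {
    for img in ./*.jpg; do
        # If nothing matches, the glob stays literal; skip it.
        [ -e "$img" ] || continue
        # tesseract appends .txt itself, so pass the name without extension.
        tesseract "$img" "$(out_base "$img")"
    done
}
```

Calling ocr_all in a folder of scans should leave one .txt file next to each .jpg.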

To process a PDF, first convert it with convert input.pdf output.tiff (ImageMagick), then feed the .tiff file to tesseract.
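
The PDF route can be sketched as one small function. Assumptions on my side: ImageMagick's convert is installed and allowed to read PDFs, 300 dpi is a reasonable resolution for OCR (my pick, not a requirement), and pdf_to_text is a made-up name.

```shell
#!/bin/sh
# Sketch: PDF -> TIFF -> text. pdf_to_text is a hypothetical helper.

pdf_to_text() {
    pdf=$1
    base=${pdf%.pdf}    # report.pdf -> report
    # -density before the input sets the rasterization resolution
    # (300 dpi is an assumed value, adjust to taste).
    convert -density 300 "$pdf" "$base.tiff" &&
        # tesseract writes $base.txt; it adds the extension itself.
        tesseract "$base.tiff" "$base"
}
```

Usage would be pdf_to_text report.pdf, which should leave report.tiff and report.txt next to the original.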

@alsternerd
Even better, use ocrmypdf (if it's not in the repos, it's easy to find on GitHub). It runs tesseract on input.pdf and adds a transparent text layer on top of it. This way your PDF still looks the same but is now searchable.
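
A minimal sketch of that, assuming ocrmypdf is installed; make_searchable and the _ocr suffix are made-up names for illustration.

```shell
#!/bin/sh
# Sketch: add a searchable text layer to a scanned PDF.
# make_searchable is a hypothetical wrapper around ocrmypdf.

make_searchable() {
    in=$1
    out=${in%.pdf}_ocr.pdf    # newsletter.pdf -> newsletter_ocr.pdf
    # Basic ocrmypdf usage: input file first, output file second.
    ocrmypdf "$in" "$out"
}
```

make_searchable newsletter.pdf should write newsletter_ocr.pdf and leave the original untouched.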

@vi If I've got a PDF, yeah. That's the solution for some of our internal newsletters that are printed, signed and then scanned to be sent around via email. :)

@vi @alsternerd This is on my todo for work now... I have a bunch of papers that are not searchable that I would love to try this on. Any idea how this handles scientific equations?

@flocke @alsternerd

tesseract has a language pack for math notation, but in my experience you can't rely on it.