
1. # apt install tesseract-ocr

2. $ tesseract input.jpg outputfile    (input.jpg can be any supported image file format)

3. $ cat outputfile.txt

You have now extracted the text from an image into a .txt file.

The .txt-extension is added by tesseract.

Yes, it IS that easy.

To OCR a PDF file, first convert it with "convert input.pdf output.tiff" (ImageMagick), then feed the .tiff file to tesseract.
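The PDF route above could be wrapped in a small shell function, roughly like this. It is only a sketch: pdf_to_text is a made-up helper name, the 300 DPI density is an assumption (a common choice for OCR, not something the post specifies), and it assumes both ImageMagick's convert and tesseract are on the PATH.

```shell
#!/bin/sh
# Sketch: OCR a PDF by converting it to TIFF first, then running tesseract.
# pdf_to_text is a hypothetical helper, not part of ImageMagick or tesseract.
pdf_to_text() {
    pdf="$1"               # e.g. input.pdf
    base="${pdf%.pdf}"     # strip the extension, e.g. input

    # Assumption: render at 300 DPI; low resolutions tend to hurt OCR accuracy.
    convert -density 300 "$pdf" "$base.tiff" || return 1

    # tesseract appends the .txt extension to the output name itself.
    tesseract "$base.tiff" "$base" || return 1

    # Print the path of the resulting text file.
    echo "$base.txt"
}
```

Calling pdf_to_text input.pdf should leave input.tiff and input.txt next to the original. On some distributions ImageMagick's security policy may refuse to read PDFs, in which case the convert step fails and the function returns early.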

@alsternerd
Even better, use ocrmypdf (if it's not in the repos, it's easy to find on GitHub). It runs tesseract on input.pdf and adds a transparent text layer on top of it. This way your PDF still looks the same but is now searchable.
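For a whole folder of scans, the ocrmypdf suggestion might look something like the sketch below. make_searchable and the _ocr.pdf output naming are invented for illustration; --skip-text is the ocrmypdf option that leaves pages alone if they already contain text.

```shell
#!/bin/sh
# Sketch: add a searchable text layer to each PDF passed as an argument.
# make_searchable and the "_ocr.pdf" suffix are hypothetical choices.
make_searchable() {
    for pdf in "$@"; do
        out="${pdf%.pdf}_ocr.pdf"
        # --skip-text: do not OCR pages that already have a text layer.
        ocrmypdf --skip-text "$pdf" "$out" || echo "failed: $pdf" >&2
    done
}
```

Usage would be along the lines of make_searchable *.pdf, which writes each result next to its original instead of overwriting it.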

@vi If I've got a PDF, yeah. That's the solution for some of our internal newsletters, which are printed, signed, and then scanned to be sent around via email. :)

@vi @alsternerd This is on my todo for work now... I have a bunch of papers that are not searchable that I would love to try this on. Any idea how this handles scientific equations?

@flocke @alsternerd

tesseract has a language pack for math notation, but in my experience you can't rely on it.
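To see whether a given language pack is installed before trying it, something like the following could work. has_tesseract_lang is a made-up helper; the code "equ" in the usage note is an assumption about how the math pack is named on your system, so check the output of tesseract --list-langs first.

```shell
#!/bin/sh
# Sketch: check whether tesseract knows a given language code.
# has_tesseract_lang is a hypothetical helper, not a tesseract command.
has_tesseract_lang() {
    # --list-langs prints a header line, then one language code per line.
    # Older versions print to stderr, hence the 2>&1.
    # grep -x only matches whole lines, so "en" does not match "eng".
    tesseract --list-langs 2>&1 | grep -qx "$1"
}
```

If has_tesseract_lang equ succeeds, a run such as tesseract page.png page -l eng+equ would combine English with that pack (page.png is a placeholder file name). Expect to proofread the equations by hand either way.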