trying to create high-quality tiff from PDF for OCR, but it's too resource intensive!

Post by **ninjaaron** » 2015-04-23T21:38:38-07:00

Hi. I'm trying to do OCR on some large PDFs (hundreds of pages), and to do that I need to convert all the pages to images. ImageMagick, amiright?

I started out with a command like this:

    convert some-doc.pdf some-folder/%03d.tiff

This worked, slowly but surely. However, the resulting images are pretty low-quality. I'd like to give the OCR scanner (`tesseract`) something better to work with.

I found a command like this:

    convert -density 400 some-doc.pdf -resize 25% some-folder/%03d.tiff

Such a command had apparently worked well for others. However, my files are fairly large, and my laptop only has 4 GB of RAM and an Ivy Bridge Core i3 @ 1.4 GHz. After ~9 minutes the computer begins to drag, and eventually the kernel kills the process before everything crashes.

Is there a way to do the same thing one page at a time or something else that won't crash my computer?

The script is in bash so far, but I can write Python if that makes a difference (never used PythonMagick, but how hard can it be?).

Post by **magick** » 2015-04-24T03:21:10-07:00

Add `-limit memory 20MB` to your command line, right after `convert`. It will still be slow, but it will mostly process from disk rather than memory. If that fails, add `-limit map 50MB` in addition to the previous `-limit` option.
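
For reference, here is a rough sketch of what that looks like when applied to the second command from the original post. Both limits are included up front, although the suggestion above is to add `-limit map` only if `-limit memory` alone is not enough. The function name `pdf_to_tiffs` and the `CONVERT` override variable are made up for illustration (the override just makes a dry run possible); they are not part of ImageMagick.

```bash
#!/usr/bin/env bash
# Sketch only: the poster's command with the suggested resource limits added.
# pdf_to_tiffs and CONVERT are hypothetical names, not ImageMagick features.
pdf_to_tiffs() {
    local pdf=$1 outdir=$2
    mkdir -p "$outdir" || return 1
    # The -limit options go right after 'convert', before the input file,
    # so that pixel data beyond these sizes is cached on disk instead of RAM.
    "${CONVERT:-convert}" \
        -limit memory 20MB -limit map 50MB \
        -density 400 "$pdf" -resize 25% \
        "$outdir/%03d.tiff"
}
```

Calling `pdf_to_tiffs some-doc.pdf some-folder` should then behave like the original command, only slower and with far less RAM in use.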

Post by **snibgo** » 2015-04-24T06:03:25-07:00

Typically, PDFs contain text as text. They may also contain raster images of text.

If you use IM, it will convert each page to a raster image that you can then OCR. I can't imagine why you would want to do that. It would seem more logical (and use fewer resources) to use a tool (e.g. `pdftotext`) that extracts the text directly from the PDFs.
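
A quick way to find out whether a PDF carries real text is to run `pdftotext` on it and see if anything other than whitespace comes out. The sketch below assumes the poppler/xpdf `pdftotext` tool is installed; the helper name `has_text_layer` and the `PDFTOTEXT` override variable are invented for this example.

```bash
#!/usr/bin/env bash
# Sketch only: check whether a PDF yields extractable text.
# has_text_layer and PDFTOTEXT are hypothetical names used for illustration.
has_text_layer() {
    # "-" as the output file makes pdftotext write to stdout.
    # grep succeeds if at least one non-whitespace character is found.
    "${PDFTOTEXT:-pdftotext}" "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}
```

If `has_text_layer some-doc.pdf` succeeds, something like `pdftotext -layout some-doc.pdf some-doc.txt` may be all that is needed; if it fails, the pages are probably scans and OCR is still required.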

If you want the raster images that are contained in the PDFs, `pdfimages` may be a more logical tool.
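
As a sketch of that route: pull the embedded images out with `pdfimages`, then hand each one to `tesseract`. This assumes the poppler build of `pdfimages` (its `-tiff` option is not available in every version, and the `.tif` output extension is assumed here) and the standard `tesseract image outputbase` invocation. The function name and the `PDFIMAGES`/`TESSERACT` override variables are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: extract embedded images and OCR them one file at a time.
# extract_and_ocr, PDFIMAGES and TESSERACT are hypothetical names.
extract_and_ocr() {
    local pdf=$1 outdir=$2 img
    mkdir -p "$outdir" || return 1
    # Assumes poppler's pdfimages, which writes img-000.tif, img-001.tif, ...
    "${PDFIMAGES:-pdfimages}" -tiff "$pdf" "$outdir/img" || return 1
    for img in "$outdir"/img-*.tif; do
        [ -e "$img" ] || continue          # no images were extracted
        # tesseract appends .txt to the output base name itself.
        "${TESSERACT:-tesseract}" "$img" "${img%.tif}"
    done
}
```

Because the images are extracted at their native resolution and processed one by one, nothing has to be rasterized at 400 dpi or held in memory all at once.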
