trying to create high-quality tiff from PDF for OCR, but it's too resource intensive!

Posted: 2015-04-23T21:38:38-07:00
by ninjaaron
Hi. I'm trying to do OCR on some large PDFs (hundreds of pages), and to do that, I need to convert all the pages to images. imagemagick, amiright?

I started out with a command like this:

convert some-doc.pdf some-folder/%03d.tiff

This worked, slowly but surely. However, the resulting images are pretty low-quality. I'd like to give the OCR engine (`tesseract`) something better to work with.

I found a command like this:

convert -density 400 some-doc.pdf -resize 25% some-folder/%03d.tiff

Such a command had apparently worked well for others. However, my files are fairly large, and my laptop only has 4 GB of RAM and an Ivy Bridge Core i3 @ 1.4 GHz. After ~9 minutes, the computer begins to drag, and finally the kernel kills the process before everything crashes.

Is there a way to do the same thing one page at a time or something else that won't crash my computer?

The script is in bash so far, but I can write Python if that makes a difference (never used PythonMagick, but how hard can it be?).
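
For reference, the one-page-at-a-time idea could be sketched in bash roughly as follows. This is only a sketch under a few assumptions: it relies on ImageMagick's `file.pdf[N]` page selector (zero-based), it takes the page count as an argument rather than detecting it, and the function name `pdf_to_tiff_by_page` and the `CONVERT` override are made up for illustration. Each page is rendered by its own `convert` process, so only one page should be held in memory at a time.

```shell
# Hypothetical helper: render a PDF to TIFF one page per convert call.
# Usage: pdf_to_tiff_by_page <pdf> <output-dir> <page-count>
# Set CONVERT=echo to preview the commands instead of running them.
pdf_to_tiff_by_page() {
    pdf=$1
    outdir=$2
    pages=$3
    mkdir -p "$outdir" || return 1
    i=0
    while [ "$i" -lt "$pages" ]; do
        # Zero-padded output name, matching the %03d pattern used above.
        page_file="$outdir/$(printf '%03d' "$i").tiff"
        # "[N]" asks ImageMagick to read only page N of the PDF.
        "${CONVERT:-convert}" -density 400 "${pdf}[${i}]" -resize 25% "$page_file" || return 1
        i=$((i + 1))
    done
}
```

If poppler's `pdfinfo` is installed, the page count could be supplied with something like `pdf_to_tiff_by_page some-doc.pdf some-folder "$(pdfinfo some-doc.pdf | awk '/^Pages:/ {print $2}')"`.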

Re: trying to create high-quality tiff from PDF for OCR, but it's too resource intensive!

Posted: 2015-04-24T03:21:10-07:00
by magick
Add -limit memory 20MB to your command line, right after the 'convert'. It will still be slow, but it will mostly process from disk rather than memory. If that fails, add -limit map 50MB in addition to the previous -limit option.
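
Put together with the command from the first post, that would look something like the sketch below. The file and folder names are the placeholders used earlier in the thread; the function name `pdf_to_tiff_limited` and the `CONVERT` override are only there for illustration, and both limits are shown together even though the second one may not be needed.

```shell
# Hypothetical wrapper: the original command with both -limit options
# placed right after 'convert', before the PDF is read.
# Set CONVERT=echo to preview the command instead of running it.
pdf_to_tiff_limited() {
    pdf=$1
    outdir=$2
    mkdir -p "$outdir" || return 1
    "${CONVERT:-convert}" -limit memory 20MB -limit map 50MB \
        -density 400 "$pdf" -resize 25% "$outdir/%03d.tiff"
}
```

Called as `pdf_to_tiff_limited some-doc.pdf some-folder`. Pixel data that does not fit within the limits is cached on disk, so expect heavy disk use and make sure there is enough free space for temporary files.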

Re: trying to create high-quality tiff from PDF for OCR, but it's too resource intensive!

Posted: 2015-04-24T06:03:25-07:00
by snibgo
Typically, PDFs contain text as text. They may also contain raster images of text.

If you use IM, this will convert each page to a raster image that you can then OCR. I can't imagine why you would want to do that. It would seem more logical (and use fewer resources) to use a tool (e.g. pdftotext) that extracts the text directly from the PDFs.

If you want the raster images that are contained in the PDFs, "pdfimages" may be a more logical tool.
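
As a rough sketch of both suggestions, assuming the `pdftotext` and `pdfimages` command-line tools (from poppler or Xpdf) are installed: the function name `extract_pdf_content`, the output file names, and the `PDFTOTEXT`/`PDFIMAGES` overrides are made up for illustration.

```shell
# Hypothetical helper: pull the text layer and the embedded images out of
# a PDF without rasterizing whole pages.
# Set PDFTOTEXT=echo and PDFIMAGES=echo to preview the commands.
extract_pdf_content() {
    pdf=$1
    outdir=$2
    mkdir -p "$outdir" || return 1
    # Text stored as text in the PDF goes to a plain-text file.
    "${PDFTOTEXT:-pdftotext}" "$pdf" "$outdir/text.txt" || return 1
    # Embedded raster images are written as $outdir/img-NNN.* files.
    "${PDFIMAGES:-pdfimages}" "$pdf" "$outdir/img"
}
```

If `text.txt` comes out empty, the PDF probably holds only scanned page images, in which case the files written by `pdfimages` are what would go to the OCR step.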