trying to create high-quality tiff from PDF for OCR, but it's too resource intensive!

ninjaaron
Posts: 1
Joined: 2015-04-23T21:16:19-07:00

Post by ninjaaron »

Hi. I'm trying to do OCR on some large PDFs (hundreds of pages), and to do that, I need to convert all the pages to images. imagemagick, amiright?

I started out with a command like this:

    convert some-doc.pdf some-folder/%03d.tiff

This worked, slowly but surely. However, the resulting images are pretty low-quality. I'd like to give the OCR scanner (`tesseract`) something better to work with.

I found a command like this:

    convert -density 400 some-doc.pdf -resize 25% some-folder/%03d.tiff

Such a command had apparently worked well for others. However, my files are fairly large, and my laptop only has 4 GB of RAM and an Ivy Bridge Core i3 @ 1.4 GHz. After about 9 minutes, the computer begins to drag, and finally the kernel kills the process before everything crashes.

Is there a way to do the same thing one page at a time or something else that won't crash my computer?

The script is in bash so far, but I can write Python if that makes a difference (never used PythonMagick, but how hard can it be?)
magick
Site Admin
Posts: 11064
Joined: 2003-05-31T11:32:55-07:00

Re: trying to create high-quality tiff from PDF for OCR, but it's too resource intensive!

Post by magick »

Add -limit memory 20MB to your command line, right after the 'convert'. It will still be slow, but it will mostly process from disk rather than memory. If that fails, add -limit map 50MB in addition to the previous -limit option.
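
As a rough sketch, the two -limit options combined with the original command, run one page at a time, might look like the following. It assumes poppler's pdfinfo is installed for the page count; render_page, convert_pages and the RUN dry-run variable are made-up names for illustration only. The `[N]` suffix after the file name asks ImageMagick to read just page N (zero-based).

```shell
#!/usr/bin/env bash
# Sketch only: render a PDF one page at a time so that a single page,
# not the whole document, is held in memory at once.
# Assumptions: ImageMagick `convert` and poppler `pdfinfo` are installed.
# Function names and the RUN variable are hypothetical.

# Render one zero-based page to <outdir>/NNN.tiff.
# Set RUN=echo to print the command instead of running it.
render_page() {
    local pdf=$1 page=$2 outdir=$3 out
    printf -v out '%s/%03d.tiff' "$outdir" "$page"
    ${RUN:-} convert -limit memory 20MB -limit map 50MB \
        -density 400 "${pdf}[${page}]" -resize 25% "$out"
}

# Loop over every page of the PDF.
convert_pages() {
    local pdf=$1 outdir=$2 pages page
    pages=$(pdfinfo "$pdf" | awk '/^Pages:/ {print $2}')
    [ -n "$pages" ] || return 1   # pdfinfo failed or gave no page count
    mkdir -p "$outdir"
    for ((page = 0; page < pages; page++)); do
        render_page "$pdf" "$page" "$outdir" || return 1
    done
}

# Usage (file names taken from the question):
#   convert_pages some-doc.pdf some-folder
```

Each page is a separate convert process, so memory is released between pages; the cost is that the PDF is reopened once per page.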
snibgo
Posts: 12159
Joined: 2010-01-23T23:01:33-07:00
Location: England, UK

Re: trying to create high-quality tiff from PDF for OCR, but it's too resource intensive!

Post by snibgo »

Typically, PDFs contain text as text. They may also contain raster images of text.

If you use IM, it will convert each page to a raster image that you can then OCR. I can't imagine why you would want to do that. It would seem more logical (using fewer resources) to use a tool (e.g. pdftotext) that extracts the text directly from the PDFs.
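
A minimal sketch of that approach, assuming poppler's pdftotext is installed: extract the text, then check whether anything other than whitespace came out. An empty result suggests the PDF holds only scanned images and still needs OCR. The names has_text and extract_text are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: try direct text extraction before resorting to OCR.
# Assumption: poppler's `pdftotext` is installed. Function names are made up.

# Succeeds if stdin contains at least one non-whitespace character.
# (pdftotext separates pages with form feeds, which count as whitespace.)
has_text() {
    grep -q '[^[:space:]]'
}

# Write <name>.txt next to <name>.pdf and report whether it has any text.
extract_text() {
    local pdf=$1
    local txt=${pdf%.pdf}.txt
    pdftotext -layout "$pdf" "$txt" || return 1
    if has_text < "$txt"; then
        echo "text layer found: $txt"
    else
        echo "no text layer in $pdf; it probably needs OCR" >&2
        return 1
    fi
}

# Usage:
#   extract_text some-doc.pdf
```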

If you want the raster images that are contained in the PDFs, "pdfimages" may be a more logical tool.
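
Something along these lines might work, assuming poppler's pdfimages (with its -tiff option) and tesseract are installed. The images come out at whatever resolution they were embedded at, with no re-rendering. ocr_images, extract_and_ocr, the img prefix and the RUN dry-run variable are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch only: pull the embedded raster images out of a PDF and OCR each one.
# Assumptions: poppler's `pdfimages` and `tesseract` are installed.
# Function names, the "img" prefix and the RUN variable are made up.

# Run tesseract on every .tif in a directory; output goes to <image>.txt.
# Set RUN=echo to print the commands instead of running them.
ocr_images() {
    local dir=$1 img
    for img in "$dir"/*.tif; do
        [ -e "$img" ] || continue          # no matches: skip the literal glob
        ${RUN:-} tesseract "$img" "${img%.tif}"
    done
}

# Extract images as TIFF files named <dir>/img-NNN.tif, then OCR them.
extract_and_ocr() {
    local pdf=$1 dir=$2
    mkdir -p "$dir"
    ${RUN:-} pdfimages -tiff "$pdf" "$dir/img" || return 1
    ocr_images "$dir"
}

# Usage:
#   extract_and_ocr some-doc.pdf some-folder
```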
snibgo's IM pages: im.snibgo.com