Trying to create high-quality TIFFs from a PDF for OCR, but it's too resource-intensive!
Posted: 2015-04-23T21:38:38-07:00
Hi. I'm trying to do OCR on some large PDFs (hundreds of pages), and to do that, I need to convert all the pages to images. ImageMagick, amiright?
I started out with a command like this:
```
convert some-doc.pdf some-folder/%03d.tiff
```
This worked, slowly but surely. However, the resulting images are pretty low-quality. I'd like to give the OCR engine (`tesseract`) something better to work with.
I found a command like this:
```
convert -density 400 some-doc.pdf -resize 25% some-folder/%03d.tiff
```
Such a command had apparently worked well for others. However, my files are fairly large, and my laptop only has 4 GB of RAM and an Ivy Bridge Core i3 @ 1.4 GHz. After ~9 minutes, the computer begins to drag, and finally the kernel kills the process before everything crashes.
Is there a way to do the same thing one page at a time, or something else that won't crash my computer?
The script is in bash so far, but I can write Python if that makes a difference (never used PythonMagick, but how hard can it be?)
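For reference, here is a rough sketch of what "one page at a time" might look like in shell. It relies on ImageMagick's `file.pdf[N]` syntax for reading a single zero-based page, and it assumes `pdfinfo` (from poppler-utils) is available to get the page count. Neither `pdfinfo` nor the function names `page_count` and `convert_pages` come from the commands above; they are made up for illustration, and I have not measured whether this actually keeps memory use low enough.

```shell
# Sketch only. Assumptions not in the original post: pdfinfo (poppler-utils) is
# installed, and the function names page_count / convert_pages are made up.

# Print the page count that pdfinfo reports for the PDF in $1.
page_count() {
    pdfinfo "$1" | awk '/^Pages:/ {print $2}'
}

# Render each page of PDF $1 into directory $2 as 000.tiff, 001.tiff, ...
convert_pages() {
    cp_pdf=$1
    cp_out=$2
    cp_pages=$(page_count "$cp_pdf")
    # Bail out if pdfinfo did not give us a plain number.
    case $cp_pages in
        ''|*[!0-9]*) echo "could not read page count of $cp_pdf" >&2; return 1 ;;
    esac
    mkdir -p "$cp_out" || return 1
    cp_i=0
    while [ "$cp_i" -lt "$cp_pages" ]; do
        # "file.pdf[N]" selects a single zero-based page, so each convert
        # run only has to rasterize one page.
        convert -density 400 "${cp_pdf}[${cp_i}]" -resize 25% \
            "$(printf '%s/%03d.tiff' "$cp_out" "$cp_i")" || return 1
        cp_i=$((cp_i + 1))
    done
}
```

Usage would be something like `convert_pages some-doc.pdf some-folder`. Each page gets its own `convert` process, so memory is released between pages, at the cost of re-opening the PDF every time.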