Efficiently split an extremely large (700-page) PDF into JPGs without crashing?
Posted: 2016-05-24T12:21:02-07:00
We use ImageMagick to convert user-submitted files scanned at 300 DPI into individual images for the web, and we currently support two multipage formats: TIF and PDF. We have to keep extremely high quality so the user can zoom in and see fine detail, so we cannot compress at all.
After doing some research, it seems that ImageMagick's convert command isn't the best tool for our task and that `pdfimages` might be better, but this is what we have to work with, and I need to be able to convert the submitted documents even if it is slow.
99% of all content is scanned images, and we are failing to convert PDF files that are over ~200 pages; we have received some as long as 700 pages.
This is the command we run on every file:
Code: Select all
convert -density 300x300 {FILE_NAME} -quality 100 -resize 'x1080<' -scene 1 -background white -alpha off /temp/%d.jpg
We have attempted the following to remedy the situation without any success:
-limit memory/area/map/disk: We hit the limits for memory and disk usage, and expanding the resources further is not an option. Example:
Code: Select all
convert -density 300x300 {FILE_NAME} -quality 100 -resize 'x1080<' -scene 1 -background white -alpha off -limit memory 2G -limit map 2G -limit area -2G -limit disk 5G
Below is the result we get:
Code: Select all
convert: cache resources exhausted `/tmp/magick-d1KLnTNt-00000044' @ error/cache.c/OpenPixelCache/4125.
convert: Too many IDAT's found `/tmp/magick-d1KLnTNt-00000044' @ error/png.c/MagickPNGErrorHandler/1728.
convert: corrupt image `/tmp/magick-d1KLnTNt-00000044' @ error/png.c/ReadPNGImage/3695.
Using `stream`: We tried this but could not get the resulting image to render correctly. In addition, having to specify the exact dimensions and depth for each image seemed too error-prone.
Converting 10 pages at a time: This approach also crashed the system. Example:
Code: Select all
convert -density 300x300 {FILE_NAME}[0-10] -quality 100 -resize 'x1080<' -scene 1 -background white -alpha off -limit memory 2G -limit map 2G -limit area -2G -limit disk 5G
Code: Select all
convert -density 300x300 {FILE_NAME}[11-20] -quality 100 -resize 'x1080<' -scene 1 -background white -alpha off -limit memory 2G -limit map 2G -limit area -2G -limit disk 5G
and so on for the remaining pages.
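For reference, the batching idea above can be scripted so that every batch runs in its own convert process and gets its own output numbering. This is only a sketch and has not been tested against the 700-page files. The function names page_batches and convert_batches are made up for illustration, and the total page count is assumed to be known in advance (for example from `pdfinfo`). Moving the -limit settings in front of the input file is an assumption about option ordering, not a confirmed fix, and the area limit is left out.

```shell
#!/bin/sh
# Print zero-based, inclusive page ranges ("first-last") that cover
# `total` pages in batches of at most `size` pages, one range per line.
page_batches() {
    total=$1
    size=$2
    first=0
    while [ "$first" -lt "$total" ]; do
        last=$((first + size - 1))
        if [ "$last" -ge "$total" ]; then
            last=$((total - 1))
        fi
        echo "$first-$last"
        first=$((last + 1))
    done
}

# Run one convert process per batch so memory is released between batches.
# Arguments: PDF path, total page count, batch size, output directory.
# Conversion options are copied from the command in the post; the -limit
# values are placed before the input file (an assumption, see above).
convert_batches() {
    pdf=$1
    total=$2
    size=$3
    outdir=$4
    for range in $(page_batches "$total" "$size"); do
        first=${range%-*}
        # -scene sets the first output number, so pages keep 1-based names.
        convert -limit memory 2G -limit map 2G -limit disk 5G \
            -density 300x300 "${pdf}[${range}]" \
            -quality 100 -resize 'x1080<' -scene "$((first + 1))" \
            -background white -alpha off "$outdir/%d.jpg" || return 1
    done
}

# Hypothetical usage:
#   convert_batches input.pdf 700 10 /temp
```

Called with 700 pages and a batch size of 10, this would issue 70 separate convert runs, and the loop stops at the first batch that fails. If even a 10-page batch exhausts the cache, the batch size can be lowered to 1 without changing anything else.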
EDIT: We are running this on:
ImageMagick 6.7.2-7 2015-02-27 Q16
Ghostscript 8.70 (2009-07-31)
However, we will soon be using ImageMagick-7.0.1-3; my coworker has one machine with this setup, and he is seeing the same results as the rest of us.