Efficiently split extremely 700 page PDF into JPG without crashing?

ahDev
Posts: 3
Joined: 2016-05-24T11:10:23-07:00

Efficiently split extremely 700 page PDF into JPG without crashing?

Post by ahDev »

We use ImageMagick to convert user-submitted files, scanned at 300 DPI, into individual images for the web. We currently support two multipage formats, TIF and PDF. We have to keep extremely high quality so the user can 'zoom in' and see fine detail, so we cannot compress at all.


99% of all content is scanned images, and we are failing to convert PDF files that are over ~200 pages; we have received some as long as 700 pages.

This is the command we run on every file:

convert -density 300x300 {FILE_NAME} -quality 100 -resize 'x1080<' -scene 1 -background white -alpha off /temp/%d.jpg
After doing some research it seems like ImageMagick's convert command isn't the best tool for our task, and that pdfimages might be better, but this is what we have to work with and I need to be able to convert the submitted documents, even if it is slow.

We have attempted the following to remedy the situation without any success:

-limit memory/area/map/disk: We reach the threshold for memory and disk usage, and expanding the resources further is not an option. Below is the result we get:
convert: cache resources exhausted `/tmp/magick-d1KLnTNt-00000044' @ error/cache.c/OpenPixelCache/4125.
convert: Too many IDAT's found `/tmp/magick-d1KLnTNt-00000044' @ error/png.c/MagickPNGErrorHandler/1728.
convert: corrupt image `/tmp/magick-d1KLnTNt-00000044' @ error/png.c/ReadPNGImage/3695.
example

convert -density 300x300 {FILE_NAME} -quality 100 -resize 'x1080<' -scene 1 -background white -alpha off -limit memory 2G -limit map 2G -limit area -2G -limit disk 5G

Using `Stream`: We attempted this but had issues getting the resulting image to appear correctly. In addition, mapping out the specific dimensions for each image, along with the depth, seemed like it would leave too much room for error.


Converting 10 pages at a time: This approach also crashed the system


example

convert -density 300x300 {FILE_NAME}[0-10] -quality 100 -resize 'x1080<' -scene 1 -background white -alpha off -limit memory 2G -limit map 2G -limit area -2G -limit disk 5G

convert -density 300x300 {FILE_NAME}[11-20] -quality 100 -resize 'x1080<' -scene 1 -background white -alpha off -limit memory 2G -limit map 2G -limit area -2G -limit disk 5G
etc.. etc..



EDIT: We are running this on

ImageMagick 6.7.2-7 2015-02-27 Q16
Ghostscript 8.70 (2009-07-31)

However, we will soon be using ImageMagick 7.0.1-3. My coworker has one machine with this setup and he is seeing the same results as the rest of us.
Last edited by ahDev on 2016-05-24T12:53:53-07:00, edited 1 time in total.
snibgo
Posts: 12159
Joined: 2010-01-23T23:01:33-07:00
Location: England, UK

Re: Efficiently split extremely 700 page PDF into JPG without crashing?

Post by snibgo »

You should say what versions of IM and Ghostscript you are using.
snibgo's IM pages: im.snibgo.com
fmw42
Posts: 25562
Joined: 2007-07-02T17:14:51-07:00
Location: Sunnyvale, California, USA

Re: Efficiently split extremely 700 page PDF into JPG without crashing?

Post by fmw42 »

ahDev wrote:convert -density 300x300 {FILE_NAME} -quality 100 -resize 'x1080<' -scene 1 -background white -alpha off -limit memory 2G -limit map 2G -limit area -2G -limit disk 5G
Where is the output image in the command line? You need to specify the output file.

Try putting the limits right after convert.
ahDev
Posts: 3
Joined: 2016-05-24T11:10:23-07:00

Re: Efficiently split extremely 700 page PDF into JPG without crashing?

Post by ahDev »

snibgo wrote:You should say what versions of IM and Ghostscript you are using.

Sorry, I will edit the original post as well.

We are running this on

ImageMagick 6.7.2-7 2015-02-27 Q16
Ghostscript 8.70 (2009-07-31)

However, we will soon be using ImageMagick 7.0.1-3. My coworker has one machine with this setup and he is seeing the same results as the rest of us.
fmw42 wrote:Where is the output image in the command line? You need to specify the output file.

Try putting the limits right after convert.
That was a copy-paste mistake. I attempted it again with the following and it failed in an identical way. Memory usage and disk usage for the tmp file are an issue.

convert  -limit memory 2G -limit map 2G -limit area -2G -limit disk 5G -density 300 {FILE_NAME} -quality 100 -resize 'x1080<' -scene 1 -background white -alpha off  ./tmp/%d.jpg
fmw42
Posts: 25562
Joined: 2007-07-02T17:14:51-07:00
Location: Sunnyvale, California, USA

Re: Efficiently split extremely 700 page PDF into JPG without crashing?

Post by fmw42 »

What are the memory capacities of your machine? Are you setting limits beyond the amount of memory available?

Does it work with a PDF with just a few pages?

Your Ghostscript is rather old. Have you tried upgrading Ghostscript?
snibgo
Posts: 12159
Joined: 2010-01-23T23:01:33-07:00
Location: England, UK

Re: Efficiently split extremely 700 page PDF into JPG without crashing?

Post by snibgo »

Your Ghostscript is old. I suggest you try an upgrade.

When IM has an input like x.pdf[0-9], I think it passes the entire input to Ghostscript, because IM doesn't know where the page boundaries are. It tells GS "-dFirstPage=1 -dLastPage=10".

I don't know what resources GS uses but IM can't control that. Setting IM limits won't affect GS.

If your PDF files are simply one scan per page, you are using the wrong tool.
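
A minimal sketch of what that implies: if Ghostscript does the rasterizing anyway, it can be called directly in small page batches, so that each gs process exits and frees its memory before the next one starts. This is an illustration, not something tested in this thread. The function names, the batch size of 10, the output naming, and passing the page count as an argument are all assumptions; the gs switches used (-sDEVICE=jpeg, -r300, -dJPEGQ, -dFirstPage, -dLastPage, -sOutputFile) are standard Ghostscript options.

```shell
#!/bin/sh
# Sketch: rasterize a PDF in small page batches by calling Ghostscript directly.
# Function names, batch size and output naming are hypothetical.

# Print "first last" pairs covering pages 1..total in steps of batch.
page_ranges() {
    total=$1
    batch=$2
    first=1
    while [ "$first" -le "$total" ]; do
        last=$((first + batch - 1))
        if [ "$last" -gt "$total" ]; then
            last=$total
        fi
        echo "$first $last"
        first=$((last + 1))
    done
}

# Rasterize one batch at 300 DPI with maximum JPEG quality.
# The %04d counter numbers the output pages within this run only,
# so the batch's first page is put in the file name to keep names unique.
render_batch() {
    pdf=$1; first=$2; last=$3; outdir=$4
    gs -q -dNOPAUSE -dBATCH -dSAFER \
       -sDEVICE=jpeg -r300 -dJPEGQ=100 \
       -dFirstPage="$first" -dLastPage="$last" \
       -sOutputFile="$outdir/batch-$first-%04d.jpg" \
       "$pdf"
}

# Run only when a PDF and its page count are supplied on the command line.
if [ "$#" -ge 2 ]; then
    pdf=$1; total=$2; outdir=${3:-./out}
    mkdir -p "$outdir"
    page_ranges "$total" 10 | while read -r first last; do
        # Stop the loop at the first batch that fails.
        render_batch "$pdf" "$first" "$last" "$outdir" || exit 1
    done
fi
```

Saved as a hypothetical split_pdf.sh, it would be run as: sh split_pdf.sh input.pdf 700 ./out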
snibgo's IM pages: im.snibgo.com
ahDev
Posts: 3
Joined: 2016-05-24T11:10:23-07:00

Re: Efficiently split extremely 700 page PDF into JPG without crashing?

Post by ahDev »

fmw42 wrote:What are the memory capacities of your machine? Are you setting limits beyond the amount of memory available?

Does it work with a PDF with just a few pages?

Your Ghostscript is rather old. Have you tried upgrading Ghostscript?

We are now setting the memory limit to 2G; our machine has 16GB total. It does work if the PDF has under 250 or so pages. The servers are dedicated EC2 instances that only run conversions in a Java app.

The Ghostscript version is quite old. For some reason I thought upgrading IM would also upgrade GS, but I guess not. I will attempt to upgrade GS and see what happens.
snibgo wrote:Your Ghostscript is old. I suggest you try an upgrade.

When IM has an input like x.pdf[0-9], I think it passes the entire input to Ghostscript, because IM doesn't know where the page boundaries are. It tells GS "-dFirstPage=1 -dLastPage=10".

I don't know what resources GS uses but IM can't control that. Setting IM limits won't affect GS.

If your PDF files are simply one scan per page, you are using the wrong tool.
The PDF files are usually one continuous document, and sometimes they have additional scans supporting that document. For example, someone will fill out a form that has 100 pages, then scan the 100 pages plus 30 pages of old receipts that pertain to that document, and have it all in one PDF. We then take that PDF and split it into individual images for faster loading on a web front end. I agree: the more I read about ImageMagick, the more I think it isn't the best tool for this job, but 75% of submissions are actually TIF files and it works well for those.
snibgo
Posts: 12159
Joined: 2010-01-23T23:01:33-07:00
Location: England, UK

Re: Efficiently split extremely 700 page PDF into JPG without crashing?

Post by snibgo »

IM is a powerful raster image processor. It can also (using Ghostscript) convert PDF pages into raster images, and then do any processing.

But if your PDF pages contain only embedded raster images, then:

1. IM passes the PDF to GS.
2. GS reconstructs the pages by re-sampling the embedded raster images, to construct a raster image of each page.
3. GS passes those page images back to IM.

This is a lot of work, and the re-sampling will generally mess up the images.

By contrast, pdfimages does this:

1. Extract all the raster images, as they stand, into files.

That's it. That's why, for this task, pdfimages should be used. You might then use IM to process the images, for resizing and so on.
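
As a rough sketch of that two-step approach, assuming the Poppler version of pdfimages is installed: extract the embedded scans first, then let IM resize them one file at a time, so no process ever holds the whole document. The function names and the directory names (./raw, ./web) are hypothetical; the -resize and -quality values are copied from the command in the first post. Note that pdfimages writes one file per embedded image, not per page, so a page holding several images yields several files.

```shell
#!/bin/sh
# Sketch: extract embedded scans with pdfimages, then resize them one by one.
# Function and directory names are hypothetical.

# Extract the images embedded in a PDF without re-rendering the pages.
# -j writes JPEG-encoded images as .jpg; other images come out as .ppm/.pbm.
# Files are named <root>-NNN.<ext>, here page-000.jpg, page-001.jpg, ...
extract_images() {
    pdf=$1; outdir=$2
    mkdir -p "$outdir"
    pdfimages -j "$pdf" "$outdir/page"
}

# Map an extracted file to the name of the JPG that will be produced from it.
out_name() {
    name=$(basename "$1")
    echo "${name%.*}.jpg"
}

# Resize each extracted file with its own convert process.
resize_all() {
    srcdir=$1; dstdir=$2
    mkdir -p "$dstdir"
    for f in "$srcdir"/page-*; do
        [ -e "$f" ] || continue   # skip the literal pattern if nothing matched
        convert "$f" -resize 'x1080<' -quality 100 "$dstdir/$(out_name "$f")"
    done
}

# Run only when a PDF is supplied on the command line.
if [ "$#" -ge 1 ]; then
    extract_images "$1" ./raw && resize_all ./raw ./web
fi
```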
snibgo's IM pages: im.snibgo.com