ReadImage of PDF without "density" specified

Posted: 2010-07-09T12:44:51-07:00
by kaberdude
I have a PDF file which contains pages of images. I am writing an application, which uses the magickcore API, that will read the PDF and compute a new PDF file with modified images.

By default, ImageMagick reads a PDF file at a density of "72x72". If I read a PDF whose images have a higher resolution (e.g., 150x150), the images are down sampled, and the resulting quality is very low, which is very annoying. Many others have observed this, and it came up recently in another discussion (viewtopic.php?f=1&t=16541). One solution is to specify the density before reading the file. For example:

exception = AcquireExceptionInfo();
image_info = CloneImageInfo((ImageInfo *) NULL);
/* argv[1] is the input file name; argv[0] is the program name. */
(void) strcpy(image_info->filename, argv[1]);
image_info->density = AcquireString("150x150"); // YUCK!!!!!
images = ReadImage(image_info, exception);

I would prefer not to specify the density, though, because I don't know what it will be for an arbitrary PDF file. I do know that it's possible to compute the resolution of an image from the information contained in the PDF alone: Adobe Acrobat is able to read a PDF file, determine the size of the page and of the images it contains, and display the images in the file at full resolution. Further, I do not want to ask users of my program to specify the "density". They won't know what that means, they don't care to know, and will probably get it wrong every time. They want images to be full resolution.

Looking at the ImageMagick API, I do not see an obvious way to compute the resolution. Does anyone know how to do this with the magickcore API?

If no one has a solution, I do see one, extremely yucky alternative: parse the PDF file and compute it myself. I would prefer not to do this, since ImageMagick already knows the structure of PDF files. However, it is possible using the following steps:

1) Read the PDF file as text. Find a /Page, say the first one, and get its /Contents object. E.g.,

/Type /Page
/Parent 2 0 R
/Resources <<
/XObject << /Im1 22 0 R >>
/ProcSet 20 0 R >>
/MediaBox [0 0 918.24 683.04]
/CropBox [0 0 918.24 683.04]
/Contents 18 0 R
/Thumb 25 0 R

=> object 18.

2) Assume that the content on the page is just an image object. Get scaling of content image object. E.g.,

18 0 obj
<<
/Length 19 0 R
>>
stream
q
918.24 0 0 683.04 0 0 cm
/Im1 Do
Q
endstream
endobj

=> The image object Im1 is scaled to 918.24 x 683.04 user space units.

3) Get image size in sample units. E.g.,

/Type /XObject
/Subtype /Image
/Name /Im1
/Filter [ /RunLengthDecode ]
/Width 1913
/Height 1423
/ColorSpace 24 0 R
/BitsPerComponent 8
/Length 23 0 R

=> Image is 1913 x 1423 sample units.

4) Compute resolution

1913 / 918.24 * 72 = 150

(Assumption: the default user space unit of 1/72 inch.)

(see http://www.adobe.com/devnet/acrobat/pdf ... 0_2008.pdf )
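
Steps 2 to 4 can be sketched in C. This is only a minimal sketch under the assumptions already made above: the content stream is uncompressed, the image is not rotated, and the user space unit is the default 1/72 inch. The helper names parse_cm and samples_per_inch are hypothetical; they are not part of MagickCore or Ghostscript.

```c
#include <stdio.h>

/* Parse a PDF "cm" operator line such as "918.24 0 0 683.04 0 0 cm"
   into its six matrix entries. Returns 1 on success, 0 otherwise. */
static int parse_cm(const char *line, double m[6])
{
    char op[3] = "";
    int n = sscanf(line, "%lf %lf %lf %lf %lf %lf %2s",
                   &m[0], &m[1], &m[2], &m[3], &m[4], &m[5], op);
    return n == 7 && op[0] == 'c' && op[1] == 'm' && op[2] == '\0';
}

/* Resolution along one axis: samples divided by the extent in points,
   times 72 points per inch (the assumed default user space unit). */
static double samples_per_inch(double samples, double extent_points)
{
    return samples / extent_points * 72.0;
}
```

With the numbers from the example, samples_per_inch(1913, 918.24) and samples_per_inch(1423, 683.04) both come out at about 150.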

Re: ReadImage of PDF without "density" specified

Posted: 2010-07-09T16:55:04-07:00
by snibgo
My understanding (which could be wrong) is that ImageMagick just hands the PDF over to Ghostscript. If so, your suggestion might be more usefully addressed to the Ghostscript people.

Re: ReadImage of PDF without "density" specified

Posted: 2010-07-09T19:02:07-07:00
by kaberdude
Debugging into ImageMagick, I found out a couple things.

While ImageMagick does make a call to Ghostscript, ImageMagick specifies a resolution in the call (see file ".../coders/pdf.c" in the ImageMagick source tree, line 633, variable "command"). The call essentially converts the pdf file to pnm, "portable anymap format", which is then read by ImageMagick.

The call ImageMagick makes is:

"C:/Program Files/gs/gs8.71/bin/gswin32c.exe" -q -dQUIET -dPARANOIDSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dEPSCrop -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pnmraw" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" "-sOutputFile=C:/Users/Ken/AppData/Local/Temp/magick-ia49RDTx" "-fC:/Users/Ken/AppData/Local/Temp/magick-geXgcGRV" "-fC:/Users/Ken/AppData/Local/Temp/magick-NWve6a7v"

In this call, ImageMagick requests a conversion with a resolution of 72x72 (via option string "-r72x72"), but it is not the correct resolution. It should be 150x150, at least for my input file. While pdf.c has a rudimentary parser (lines 426-535), it does not compute the optimal resolution as I suggested in my post. Unfortunately, Ghostscript also seems to have a problem, because if I call gswin32c.exe myself without the -r option, gswin32c.exe produces a bitmap that is 72x72 in resolution, the bitmap is the wrong size, and the image is down sampled. It seems that the -r option has to be specified. Great!

Since I can't change the ImageMagick code, nor the Ghostscript code, I see no alternative but to parse the PDF file myself to determine the correct resolution.

Re: ReadImage of PDF without "density" specified

Posted: 2010-07-09T19:56:11-07:00
by snibgo
ImageMagick is open source. You can make a change on your own copy, then suggest that patch is incorporated into the product.

Re: ReadImage of PDF without "density" specified

Posted: 2010-07-10T03:48:10-07:00
by Drarakel
There is no "correct resolution" that could be obtained for all PDFs.
A tool that parses the PDF could, at most, give hints about a density value that yields good quality. But in many cases, an automatic approach to obtaining a value will reach a dead end. I sometimes obtain a 'good' value myself by looking at the properties of the background image, for example. I also use a bunch of tools in some cases, and all of that could be made easier if there were a tool that integrated all of the available techniques. (There are some useful plug-ins in Adobe Acrobat, I think. But probably even Adobe can't tell a "correct resolution" for all PDFs.)
Anyway, you basically would have to write a full-fledged PDF viewer in order to (try to) compute your values.

Regarding your steps:
Even for PDFs that contain only one image, it's usually not that easy. Some images can be cropped. Some images can be centered within the page - with an additional bleed area. Things like that.
And for other PDFs, you will run into bigger problems. I'm not an expert in the PDF format, but I can say that simply looking at the content object will sometimes lead nowhere. First, the content object can be compressed (not parsable as text). You would have to decompress the file first. Then: What if it contains fonts and vector graphics? What if it contains hundreds of objects? What if images are rotated/stretched/etc.? That's not an exaggeration - that's what the PDF format was designed for (not just holding a single rasterized image).

And, yes, Ghostscript would be the better place to integrate such jobs (it already has to parse the whole PDF). But, as I said, for a big part of the PDFs, it's not easily possible or not possible at all to obtain 'the one' density value. And I also have to say that there are other, more important issues in Ghostscript, like full color management or correct antialiasing. Issues that can be solved (with a lot of work of course) with parsing alone - contrary to the density value thing.

Re: ReadImage of PDF without "density" specified

Posted: 2010-07-10T06:45:18-07:00
by kaberdude
In the many thousands of PDF files I have used with Adobe Acrobat, Acrobat seems to always display the images with full resolution (no down sampling) and at the correct size, whether they were created through LaTeX, Adobe Acrobat merge/convert, print to Adobe PDF file, Capture Perfect, or elsewhere. To me, that's pretty strong evidence that there is enough information in the PDF to scale an image with full resolution to a given size. (And, given my cursory reading of the PDF spec, this seems to be true.) But there really is no one fixed density for all images in a PDF. I have created PDFs containing multiple images of different PPI. (Adobe Acrobat has a "pre-flight" tool that down samples images if they go over a limit, which seems to me to be pretty much equivalent to the -r option of gswin32c.exe.) I agree, it isn't good to write a hack to parse a PDF to get a resolution, not only because images have different densities, but because the expertise for parsing PDFs is in Ghostscript. I will pursue this further in the Ghostscript forums to understand why gswin32c down samples images even when -r is not specified.

Re: ReadImage of PDF without "density" specified

Posted: 2010-07-10T12:04:28-07:00
by snibgo
When Adobe Acrobat rasterises a PDF for a raster device (screen or printer), it must re-sample any raster images present in that PDF, unless the input and device resolutions happen to match. This might involve up- or down-sampling. There is no escaping this, and any software has to do it.

If the PDF contains multiple raster images at different resolutions (as well as vector data at "no resolution"), some may be up-sized and others down-sized.

As a rule of thumb, the ImageMagick default of 72 dpi gives poor quality, so a user might supersample: choose a higher density then resize. I assume Adobe Acrobat does something similar without being told.

Re: ReadImage of PDF without "density" specified

Posted: 2010-07-10T12:42:53-07:00
by Drarakel
I would say, supersampling is needed for Ghostscript, because it sometimes doesn't do a complete antialiasing. With Adobe, that rather 'expensive' method shouldn't be necessary - as it does a 'regular' antialiasing for all vectorial elements. (Well, it's a bit off-topic..)

Re: ReadImage of PDF without "density" specified

Posted: 2010-07-10T18:10:17-07:00
by anthony
Ghostscript does have aliasing settings, which it uses for 'display into some X window'. This is used by many other programs that use Ghostscript as their engine, for example xpdf. Unfortunately the PDF to raster image drivers that IM uses do not have this capability!!! Arrrggghhhhh...

Re: ReadImage of PDF without "density" specified

Posted: 2010-07-10T18:26:00-07:00
by magick
ImageMagick typically uses Ghostscript's pnmraw or pngalpha device for RGB and the PAM device for CMYK. We include the -dTextAlphaBits=4 -dGraphicsAlphaBits=4 command line options. To get decent font rendering, you typically need to supersample at a density of 400 and then downsample to 25%. If you know of a better way to render PDF / PS to an image format that ImageMagick can grok, post your solution here.

Re: ReadImage of PDF without "density" specified

Posted: 2010-07-10T18:34:22-07:00
by fmw42
Per Magick's suggestion, I have used the following supersampling

convert -density 288 image.pdf -resize 25% resultimage

288 = 4 x 72 dpi, so I resize by 1/4 = 25% to get back to a nominal 72 dpi result. You can resize by a larger percentage, or not at all, if you want a bigger result, or you can use a larger dpi.
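
The arithmetic behind that command can be sketched in C. This is an illustrative sketch only; rendered_pixels and resize_percent are hypothetical helper names, and the 72 points per inch figure is the default user space unit assumed earlier in the thread.

```c
/* Pixels produced when a length given in points (1/72 inch) is
   rasterized at `density` pixels per inch. */
static double rendered_pixels(double points, double density)
{
    return points / 72.0 * density;
}

/* Percentage to give to -resize so that a page rendered at `density`
   ends up at a nominal `target_ppi`. */
static double resize_percent(double density, double target_ppi)
{
    return 100.0 * target_ppi / density;
}
```

For example, resize_percent(288, 72) gives 25, and a 918.24 point wide page rendered at 288 is roughly 3673 pixels wide before the resize.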

Re: ReadImage of PDF without "density" specified

Posted: 2010-07-11T03:21:02-07:00
by Drarakel
Supersampling can be good - but that wasn't the problem from the OP. :wink:

Re: ReadImage of PDF without "density" specified

Posted: 2010-07-11T09:26:33-07:00
by kaberdude
Yes, the purpose of Ghostscript is to render PDFs (a vector graphics format) into bitmaps. So, to perform the rendering, it has to choose some density. The fact that PDFs can also contain bitmaps raises the question: what density should they be rendered at? By default, Ghostscript renders at a pitiful 72 ppi, which causes down sampling of images with > 72 ppi. Adobe Acrobat exports to TIFF and lets the user have Acrobat choose the density automatically. Ghostscript doesn't seem to offer that option, so ImageMagick is stuck.

For me at least, an "optimal" rendering would be at a density that is at least as great as the resolution of the image with the greatest resolution, so that detail is not lost in the rendering with Ghostscript. The density could of course be greater than this, but that would result in larger bitmaps. I think I can live with the fact that some images will be expanded. Supersampling antialiasing of bitmap objects is not necessary for me (not sure what it would mean to antialias a bitmap itself), so Ghostscript rendering is fine as long as I can know a priori the "optimal" density.
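
That rule can be sketched in C as follows. This is a sketch under assumptions, not a tested solution: it takes the per-image resolutions as already known (obtaining them is the open problem), it never goes below the 72 default, and choose_density and format_density are hypothetical helper names.

```c
#include <stddef.h>
#include <stdio.h>

/* Highest resolution among the images found in the PDF, in ppi.
   Assumption: never return less than 72, the ImageMagick default,
   which is also the result when there are no images at all. */
static double choose_density(const double *image_ppi, size_t count)
{
    double best = 72.0;
    size_t i;

    for (i = 0; i < count; i++)
        if (image_ppi[i] > best)
            best = image_ppi[i];
    return best;
}

/* Write a density string such as "150x150" into buf.
   Returns whatever snprintf returns (the length it wanted to write). */
static int format_density(char *buf, size_t size, double ppi)
{
    return snprintf(buf, size, "%gx%g", ppi, ppi);
}
```

The resulting string has the same form as the "150x150" assigned to image_info->density in the first post.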

I could solve my problem if I had a tool that gave me the densities of all the bitmap images in the PDF. While Ghostscript seems to offer an API for developers, it doesn't seem to expose the parser, nor does it offer a tool to give me the properties of objects in the file. Poking around a bit, there are some PDF parsers out there, but nothing seems to jump out. Anyone know of a general purpose PDF parser that I can hack?

Re: ReadImage of PDF without "density" specified

Posted: 2010-07-15T23:24:12-07:00
by Mazin
It's hard to tell from the vague description of your application, but if all you want to do is extract the un-resized images themselves from the PDF, you can use xpdf's or poppler's `pdfimages` command, which extracts all images from a PDF as ppm or jpg without doing any PDF rendering. You can then reassemble a new PDF using data from Ghostscript or ImageMagick.

Otherwise, there's no "correct" way to compute what resolution to export at. What if a page has a 100 ppi image and a 150 ppi image? Do you upscale the 100 ppi image at a yucky 2:3 ratio? What if a page has an image stretched in one dimension? What if a page has an image rotated 30 degrees? What if a page has only vector graphics? Those are mostly arbitrary decisions you'll have to make, I think.
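
To make the first two of those cases concrete, here is a small C sketch. It covers only mixed resolutions and one-dimensional stretching; rotation is deliberately left out. The struct and function names are hypothetical.

```c
/* Resolution of one placed image along each page axis, in ppi.
   The two values differ when the image is stretched in one dimension. */
struct image_ppi {
    double x;
    double y;
};

/* Factor by which an image axis is resampled when the page is rendered
   at `density`: above 1 means upscaling, below 1 means down sampling. */
static double resample_factor(double density, double ppi)
{
    return density / ppi;
}

/* Smallest density at which neither axis of the image is down sampled. */
static double min_lossless_density(struct image_ppi r)
{
    return r.x > r.y ? r.x : r.y;
}
```

Rendering at 150 makes resample_factor(150, 100) equal to 1.5, which is the 2:3 upscale mentioned above, while the 150 ppi image is left at a factor of 1.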