fmw42 wrote:... how do you tell it is a raster image imbedded in a PDF shell rather than a vector PDF?
I do it manually. The first thing I do with any PDF is open it in Adobe Reader. This PDF contained text, but it looked grainy. When I clicked on a page, the entire page turned blue -- it was all selected. I couldn't select words of text. So the entire page was a single image.
IM's identify tells us about the images, after GS has rasterised each page. Each image has only two colours, so there is no anti-aliasing, which is a good hint that this isn't a conventional text PDF.
For an automated test, I suggest both identify and pdfimages.
Code:
f:\web\im>%IM%identify application-pdf.pdf
application-pdf.pdf[0] PDF 595x842 595x842+0+0 16-bit sRGB 4.87KB 0.000u 0:00.000
application-pdf.pdf[1] PDF 595x842 595x842+0+0 16-bit sRGB 4.87KB 0.000u 0:00.000
application-pdf.pdf[2] PDF 595x842 595x842+0+0 16-bit sRGB 4.87KB 0.000u 0:00.016
application-pdf.pdf[3] PDF 595x842 595x842+0+0 16-bit sRGB 4.87KB 0.000u 0:00.016
application-pdf.pdf[4] PDF 595x842 595x842+0+0 16-bit sRGB 4.87KB 0.000u 0:00.031
GS has made 5 pages, all the same size, as we would expect from most documents.
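For the identify half of an automated test, one option is to parse the listing and confirm that every page has the same dimensions. The following is only a sketch: the helper names `page_sizes` and `all_pages_same_size` are made up for illustration, and it assumes the default one-line-per-page identify format shown above.

```python
import re

# Matches lines such as:
#   application-pdf.pdf[0] PDF 595x842 595x842+0+0 16-bit sRGB ...
# Assumption: the default identify output format, one line per page.
LINE_RE = re.compile(r"\[(\d+)\]\s+\S+\s+(\d+)x(\d+)\s")

def page_sizes(identify_output):
    """Return a list of (width, height) tuples, one per page line found."""
    sizes = []
    for line in identify_output.splitlines():
        match = LINE_RE.search(line)
        if match:
            sizes.append((int(match.group(2)), int(match.group(3))))
    return sizes

def all_pages_same_size(sizes):
    """True when at least one page was found and all pages share one size."""
    return len(sizes) > 0 and len(set(sizes)) == 1

# First two lines of the listing above, copied verbatim.
SAMPLE_IDENTIFY = """\
application-pdf.pdf[0] PDF 595x842 595x842+0+0 16-bit sRGB 4.87KB 0.000u 0:00.000
application-pdf.pdf[1] PDF 595x842 595x842+0+0 16-bit sRGB 4.87KB 0.000u 0:00.000
"""

sizes = page_sizes(SAMPLE_IDENTIFY)
print(sizes, all_pages_same_size(sizes))
```

Uniform page size is only weak evidence on its own, since most documents have it. It is more useful as a page count and page size to compare against what pdfimages reports.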
Code:
f:\web\im>pdfimages -list application-pdf.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    1650  2338  gray    1   1  ccitt  no         7  0   200   200 1581B 0.3%
   2     1 image    1650  2338  gray    1   1  ccitt  no        12  0   200   200 13.9K 3.0%
   3     2 image    1650  2338  gray    1   1  ccitt  no        17  0   200   200 9867B 2.0%
   4     3 image    1650  2338  gray    1   1  ccitt  no        22  0   200   200 18.5K 3.9%
   5     4 image    1650  2338  gray    1   1  ccitt  no        27  0   200   200 6211B 1.3%
Pdfimages has found five raster images, all the same size (width * height), with exactly one raster per page. That pattern could arise by coincidence, because a PDF page could contain one raster image plus vector data. (I've never seen it happen by coincidence, but I don't process thousands of PDFs.)
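The "one raster per page, all the same size" check can be automated by parsing the `pdfimages -list` text. This is a minimal sketch, not a robust parser: the function names are hypothetical, and it assumes the column order shown in the listing above (page, num, type, width, height, ..., x-ppi, y-ppi, ...), with "object ID" occupying two data columns.

```python
from collections import Counter

def parse_pdfimages_list(text):
    """Parse the data rows of `pdfimages -list` output into dictionaries."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a page number; the header and dashed rule do not.
        if len(fields) < 14 or not fields[0].isdigit():
            continue
        rows.append({
            "page": int(fields[0]),
            "type": fields[2],
            "width": int(fields[3]),
            "height": int(fields[4]),
            "x_ppi": int(fields[12]),
            "y_ppi": int(fields[13]),
        })
    return rows

def looks_like_scanned_pdf(rows, page_count):
    """True if every page has exactly one image and all images share one size."""
    images = [r for r in rows if r["type"] == "image"]
    per_page = Counter(r["page"] for r in images)
    one_per_page = (len(per_page) == page_count
                    and all(n == 1 for n in per_page.values()))
    same_size = len({(r["width"], r["height"]) for r in images}) == 1
    return one_per_page and same_size

# The listing above, copied verbatim (whitespace-separated).
SAMPLE_LIST = """\
page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
1 0 image 1650 2338 gray 1 1 ccitt no 7 0 200 200 1581B 0.3%
2 1 image 1650 2338 gray 1 1 ccitt no 12 0 200 200 13.9K 3.0%
3 2 image 1650 2338 gray 1 1 ccitt no 17 0 200 200 9867B 2.0%
4 3 image 1650 2338 gray 1 1 ccitt no 22 0 200 200 18.5K 3.9%
5 4 image 1650 2338 gray 1 1 ccitt no 27 0 200 200 6211B 1.3%
"""

rows = parse_pdfimages_list(SAMPLE_LIST)
print(looks_like_scanned_pdf(rows, page_count=5))  # True for this listing
```

The `page_count` argument would come from the identify step. As noted, a pass here is a strong hint rather than proof, because a page could still carry vector data alongside its single raster.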
If we wanted to be more sure, we could extract the pages with IM, and compare each to the corresponding image from pdfimages.
EDIT: Incidentally, pdfimages tells us that every raster is 200 dpi, so this is the "-density" setting we should use for IM.
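A quick arithmetic cross-check of that density: at 200 ppi, a 1650x2338 raster should cover about the same area as the 595x842 page that identify reported. The sketch below assumes identify was run at IM's default density of 72 dpi, so its pixel counts equal PDF points (1/72 inch). The function names and the 1.5-point tolerance are arbitrary choices for illustration.

```python
POINTS_PER_INCH = 72  # PDF default user-space unit is 1/72 inch

def raster_size_in_points(width_px, height_px, x_ppi, y_ppi):
    """Convert a raster's pixel dimensions to points using its stated ppi."""
    return (width_px / x_ppi * POINTS_PER_INCH,
            height_px / y_ppi * POINTS_PER_INCH)

def matches_page(raster_pts, page_pts, tolerance=1.5):
    """True if both dimensions agree with the page size within `tolerance` points."""
    return all(abs(r - p) <= tolerance for r, p in zip(raster_pts, page_pts))

# Values taken from the pdfimages and identify listings above.
raster_pts = raster_size_in_points(1650, 2338, 200, 200)
print([round(v, 2) for v in raster_pts])      # about 594 x 841.68 points
print(matches_page(raster_pts, (595, 842)))   # True
```

If the raster covered only part of the page, the two sizes would disagree, which would be another hint that the page holds more than a single scan.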
Code:
f:\web\im>%IM%convert -density 200 application-pdf.pdf[0] -background White -layers flatten imabcd.png
f:\web\im>%IMDEV%convert abcd-000.png ( imabcd.png -trim +repage ) -process srchimg NULL:
0 @ 617,125
The trimmed IM version of page 0 (imabcd.png) exactly matches a region of the pdfimages extract (abcd-000.png); that is, it is contained within it. So we know the page contains no vector data, we have used the correct density for IM, and we could use either image for the final output.
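To illustrate what "contained within" means here, this toy sketch does a brute-force sub-image search on small grids of pixel values and returns a (score, x, y) result, where a score of 0 means an exact match. It is not the algorithm behind the srchimg process module, whose internals are not described in this thread. It also assumes the output line "0 @ 617,125" is read as score @ x,y. With stock ImageMagick, `compare` with `-subimage-search` is the usual tool for the real job.

```python
def find_subimage(large, small):
    """Return (score, x, y) for the best placement of `small` inside `large`.

    Both images are lists of rows of integer pixel values. The score is the
    sum of squared differences, so 0 means an exact match at offset (x, y).
    Returns None if `small` does not fit inside `large`.
    """
    large_h, large_w = len(large), len(large[0])
    small_h, small_w = len(small), len(small[0])
    best = None
    for y in range(large_h - small_h + 1):
        for x in range(large_w - small_w + 1):
            score = sum(
                (large[y + j][x + i] - small[j][i]) ** 2
                for j in range(small_h)
                for i in range(small_w)
            )
            # Keep the first placement with the lowest score.
            if best is None or score < best[0]:
                best = (score, x, y)
    return best

# Toy data: 1 = white paper, 0 = black ink.
page = [
    [1, 1, 1, 1, 1],
    [1, 1, 0, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
]
trimmed = [
    [1, 0, 1],
    [0, 0, 0],
]
print(find_subimage(page, trimmed))  # (0, 1, 1): exact match at x=1, y=1
```

This exhaustive loop is far too slow for 1650x2338 scans; it only shows how a zero score at some offset demonstrates that one image is an exact crop of the other.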