PDF to TIFF Multipage, aliasing issue.

Post by **dpedder** » 2017-11-13T12:46:20-07:00

Hi,

When I run the following command on a multi-page PDF, the TIFFs generated appear to be quite aliased:

convert -density 300 ../full.pdf -depth 8 -strip -background white output%d.tiff

But if I extract one of the pages and convert this single-page PDF to a TIFF, the text is smooth:

Code: Select all

convert -density 300 ../single.pdf -depth 8 -strip -background white output.tiff

Why is this happening and how can I ensure anti-aliasing is applied to the multi-page PDF?

Appreciate any help.

Running under Vagrant, Ubuntu 16.04.3, ImageMagick 6.8.9-9 Q16 i686 2017-07-31

Post by **fmw42** » 2017-11-13T12:49:04-07:00

Please post your original PDF file so we can examine it.

Post by **snibgo** » 2017-11-13T12:59:33-07:00

dpedder wrote:But if I extract one of the pages, ...

How did you extract it? That may have changed some metadata.

Post by **dpedder** » 2017-11-14T13:57:42-07:00

snibgo wrote: ↑2017-11-13T12:59:33-07:00
dpedder wrote:But if I extract one of the pages, ...
How did you extract it? That may have changed some metadata.

Good thought. That appears to have been the issue.

The PDFs were sourced from Companies House. I'm using a Mac and the built-in Preview tools for extracting individual pages. If I re-save the PDF using Preview, the aliasing issue goes away.

The PDF is available here (04 Mar 2017) to test. If you run the convert commands above, you get the aliasing, but if you re-save the PDF as a PDF within Preview (unsure whether this works in any other PDF app) and convert again, the aliasing is removed. Very odd.

Any ideas why the PDFs are producing this aliasing effect, and is there a way to avoid the re-saving step from within the convert tool?

Post by **fmw42** » 2017-11-14T14:05:23-07:00

Using IM 6.9.9.23 Q16 Mac OSX Sierra and Ghostscript 9.21, I can confirm that the copy re-saved by Preview shows slightly better antialiasing when both files are run through:

convert -density 300 file.pdf[0] -alpha off -compress none file.tif

convert -density 300 file_preview.pdf[0] -alpha off -compress none file_preview.tif

ImageMagick uses Ghostscript to process PDF files. Perhaps Preview modifies the PDF with slight antialiasing.

Post by **snibgo** » 2017-11-14T14:30:04-07:00

I don't know which document at https://beta.companieshouse.gov.uk/comp ... ng-history you downloaded. I downloaded the "04 May 2017" (5 pages).

This is a scanned document. Every page contains a single raster image. If we want raster images, the obvious tool isn't ImageMagick, but pdfimages. For example:

Code: Select all

pdfimages -png application-pdf.pdf abcd

This creates abcd-000.png etc.

The problem with using IM is that IM will re-sample the pages, and hence the raster images, degrading the quality.

Post by **fmw42** » 2017-11-14T15:50:52-07:00

Good point, snibgo. But for my own benefit, how do you tell it is a raster image embedded in a PDF shell rather than a vector PDF? Is there something in the identify -verbose information that clues one in to that?

Post by **snibgo** » 2017-11-14T22:28:11-07:00

fmw42 wrote:... how do you tell it is a raster image embedded in a PDF shell rather than a vector PDF?

I do it manually. The first thing I do with any PDF is open it in Adobe Reader. This PDF contained text, but it looked grainy. When I clicked on a page, the entire page turned blue -- it was all selected. I couldn't select words of text. So the entire page was a single image.

IM's identify tells us about the images, after GS has rasterised each page. Each image has only two colours, so there is no anti-aliasing, which is a good hint that this isn't a conventional text PDF.

For an automated test, I suggest both identify and pdfimages.

Code: Select all

f:\web\im>%IM%identify application-pdf.pdf
application-pdf.pdf[0] PDF 595x842 595x842+0+0 16-bit sRGB 4.87KB 0.000u 0:00.000
application-pdf.pdf[1] PDF 595x842 595x842+0+0 16-bit sRGB 4.87KB 0.000u 0:00.000
application-pdf.pdf[2] PDF 595x842 595x842+0+0 16-bit sRGB 4.87KB 0.000u 0:00.016
application-pdf.pdf[3] PDF 595x842 595x842+0+0 16-bit sRGB 4.87KB 0.000u 0:00.016
application-pdf.pdf[4] PDF 595x842 595x842+0+0 16-bit sRGB 4.87KB 0.000u 0:00.031

GS has made 5 pages, all the same size, as we would expect from most documents.

Code: Select all

f:\web\im>pdfimages -list application-pdf.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    1650  2338  gray    1   1  ccitt  no         7  0   200   200 1581B 0.3%
   2     1 image    1650  2338  gray    1   1  ccitt  no        12  0   200   200 13.9K 3.0%
   3     2 image    1650  2338  gray    1   1  ccitt  no        17  0   200   200 9867B 2.0%
   4     3 image    1650  2338  gray    1   1  ccitt  no        22  0   200   200 18.5K 3.9%
   5     4 image    1650  2338  gray    1   1  ccitt  no        27  0   200   200 6211B 1.3%

Pdfimages has found five raster images, all the same size (width * height). And there is exactly one raster per page. This could happen by coincidence, as each PDF page could contain a raster image plus vector data. (I've never seen it happen by coincidence, but I don't process thousands of PDFs.)
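The check described above can be sketched in Python as a rough heuristic. This is only an illustration, not a definitive test: the function names `parse_pdfimages_list` and `looks_scanned` are made up for this example, the column positions are assumed from the listing shown above (other pdfimages versions may print different columns), and the page count is assumed to come from elsewhere, such as the identify output.

```python
from collections import Counter

# Listing copied from the pdfimages -list output shown above.
SAMPLE = """\
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    1650  2338  gray    1   1  ccitt  no         7  0   200   200 1581B 0.3%
   2     1 image    1650  2338  gray    1   1  ccitt  no        12  0   200   200 13.9K 3.0%
   3     2 image    1650  2338  gray    1   1  ccitt  no        17  0   200   200 9867B 2.0%
   4     3 image    1650  2338  gray    1   1  ccitt  no        22  0   200   200 18.5K 3.9%
   5     4 image    1650  2338  gray    1   1  ccitt  no        27  0   200   200 6211B 1.3%
"""


def parse_pdfimages_list(text):
    """Parse the table printed by `pdfimages -list` into a list of dicts.

    Assumes the column layout of the listing above.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a page number; the header and dashed rule do not.
        if len(fields) < 14 or not fields[0].isdigit():
            continue
        rows.append({
            "page": int(fields[0]),
            "type": fields[2],
            "width": int(fields[3]),
            "height": int(fields[4]),
            "x_ppi": float(fields[12]),
            "y_ppi": float(fields[13]),
        })
    return rows


def looks_scanned(rows, page_count):
    """Heuristic: exactly one raster per page, all with the same pixel size."""
    images = [r for r in rows if r["type"] == "image"]
    per_page = Counter(r["page"] for r in images)
    one_each = all(per_page.get(p, 0) == 1 for p in range(1, page_count + 1))
    same_size = len({(r["width"], r["height"]) for r in images}) == 1
    return one_each and same_size and len(images) == page_count


rows = parse_pdfimages_list(SAMPLE)
print(looks_scanned(rows, page_count=5))  # True
```

In real use the text would come from running the tool, for example `subprocess.run(["pdfimages", "-list", path], capture_output=True, text=True).stdout`. A `True` result is still only a hint, since a page could hold one raster plus vector data.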

If we wanted to be more sure, we could extract the pages with IM, and compare each to the corresponding image from pdfimages.

EDIT: Incidentally, pdfimages tells us that every raster is 200 dpi, so this is the "-density" setting we should use for IM.

Code: Select all

f:\web\im>%IM%convert -density 200 application-pdf.pdf[0] -background White -layers flatten imabcd.png

f:\web\im>%IMDEV%convert abcd-000.png ( imabcd.png -trim +repage ) -process srchimg NULL:
0 @ 617,125

The trimmed IM version of page 0 exactly matches (is contained within) the pdfimages extract. So we know the page contains no vector data, and we have used the correct density for IM, and we could use either image for the final output.
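The arithmetic behind that density choice can be checked with a few lines of Python. This is a sketch using only the numbers reported above (1650x2338 pixels at 200 ppi, and the 595x842 page size from identify); the helper names are hypothetical and chosen for this example.

```python
POINTS_PER_INCH = 72  # PDF page sizes are measured in points, 72 per inch


def raster_size_in_points(width_px, height_px, ppi):
    """Physical size of a raster, in PDF points, at the given resolution."""
    return (width_px / ppi * POINTS_PER_INCH,
            height_px / ppi * POINTS_PER_INCH)


def resample_factor(density, raster_ppi):
    """Scale applied to the embedded raster when the page is rendered at `density`."""
    return density / raster_ppi


w_pt, h_pt = raster_size_in_points(1650, 2338, 200)
print(round(w_pt), round(h_pt))   # 594 842, within a point of identify's 595x842
print(resample_factor(300, 200))  # 1.5: the scan has to be resampled
print(resample_factor(200, 200))  # 1.0: one scan pixel per output pixel
```

At "-density 300" the scan is rendered at 1.5 times its native resolution, so its pixels must be resampled; at "-density 200" the factor is 1, which is consistent with the exact match reported above.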

Post by **fmw42** » 2017-11-15T10:14:23-07:00

Thanks, snibgo. Great discussion.

snibgo wrote:EDIT: Incidentally, pdfimages tells us that every raster is 200 dpi, so this is the "-density" setting we should use for IM.

This is a very useful piece of information. I have been using trial and error to figure that out.

Post by **dpedder** » 2017-11-15T13:18:55-07:00

snibgo wrote: ↑2017-11-14T14:30:04-07:00 I don't know which document at https://beta.companieshouse.gov.uk/comp ... ng-history you downloaded. I downloaded the "04 May 2017" (5 pages).

This is a scanned document. Every page contains a single raster image. If we want raster images, the obvious tool isn't ImageMagick, but pdfimages. For example:
Code: Select all
pdfimages -png application-pdf.pdf abcd
This creates abcd-000.png etc.

The problem with using IM is that IM will re-sample the pages, and hence the raster images, degrading the quality.

Cheers snibgo, that appears to be the missing piece to the puzzle!

Thanks Chaps.

Legacy ImageMagick Discussions Archive
