PDF to PNG conversion is non-deterministic

Posted: 2018-09-28T10:51:18-07:00
by syl_leroux
I noticed that if I convert the same PDF to PNG several times, the resulting PNG files differ according to the `diff` tool.

Steps to reproduce:

Code: Select all

convert xc:red in.pdf
convert in.pdf out0.png
convert in.pdf out1.png
diff out?.png

-->Binary files out0.png and out1.png differ
In my use case this is an issue: the destination PNGs are part of a Git repository, so each time they are regenerated from the source PDF, even if the latter hasn't changed, Git marks the PNGs as modified because something has changed at the binary level.

Are there options or post-processing steps I could use to ensure that conversion from PDF produces _exactly_ the same output PNG?

EDIT:

Code: Select all

$ convert --version
Version: ImageMagick 6.9.7-4 Q16 x86_64 20170114 http://www.imagemagick.org
Copyright: © 1999-2017 ImageMagick Studio LLC
License: http://www.imagemagick.org/script/license.php
Features: Cipher DPC Modules OpenMP 
Delegates (built-in): bzlib djvu fftw fontconfig freetype jbig jng jp2 jpeg lcms lqr ltdl lzma openexr pangocairo png tiff wmf x xml zlib
$ gs --version
9.20

Re: PDF to PNG conversion is non-deterministic

Posted: 2018-09-28T10:54:40-07:00
by syl_leroux
OK, after further investigation it appears that the embedded metadata differs (the creation time):

Code: Select all

diff <(hexdump -C out0.png) <(hexdump -C out1.png)
10,11c10,11
< 00000090  3e 00 00 00 07 74 49 4d  45 07 e2 09 1c 13 29 27  |>....tIME.....)'|
< 000000a0  7f ab ad d6 00 00 00 0a  49 44 41 54 08 d7 63 60  |........IDAT..c`|
---
> 00000090  3e 00 00 00 07 74 49 4d  45 07 e2 09 1c 13 29 29  |>....tIME.....))|
> 000000a0  98 13 80 d1 00 00 00 0a  49 44 41 54 08 d7 63 60  |........IDAT..c`|
14,15c14,15
< 000000d0  31 38 2d 30 39 2d 32 38  54 31 39 3a 34 31 3a 33  |18-09-28T19:41:3|
< 000000e0  39 2b 30 32 3a 30 30 4e  c9 92 03 00 00 00 25 74  |9+02:00N......%t|
---
> 000000d0  31 38 2d 30 39 2d 32 38  54 31 39 3a 34 31 3a 34  |18-09-28T19:41:4|
> 000000e0  31 2b 30 32 3a 30 30 77  e3 d5 7d 00 00 00 25 74  |1+02:00w..}...%t|
18c18
< 00000110  33 39 2b 30 32 3a 30 30  3f 94 2a bf 00 00 00 1c  |39+02:00?.*.....|
---
> 00000110  34 31 2b 30 32 3a 30 30  06 be 6d c1 00 00 00 1c  |41+02:00..m.....|
Is there a way to avoid embedding those data?
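
As a cross-check on the dump above, the seven bytes that follow `tIME` can be decoded by hand. This is only a sketch: it assumes the standard PNG tIME layout (two-byte big-endian year, then one byte each for month, day, hour, minute and second), and `decode_time_chunk` is a hypothetical helper name, not part of any tool used in this thread.

```shell
# Hypothetical helper: decode the 7-byte payload of a PNG tIME chunk.
# Arguments are the seven payload bytes as two-digit hex strings,
# in the order they appear in the hexdump right after "tIME".
decode_time_chunk() {
  printf '%04d-%02d-%02d %02d:%02d:%02d\n' \
    "$(( (0x$1 << 8) | 0x$2 ))" \
    "$((0x$3))" "$((0x$4))" "$((0x$5))" "$((0x$6))" "$((0x$7))"
}

# Payload bytes copied from the out0.png and out1.png dumps above.
decode_time_chunk 07 e2 09 1c 13 29 27   # prints 2018-09-28 19:41:39
decode_time_chunk 07 e2 09 1c 13 29 29   # prints 2018-09-28 19:41:41
```

The two results match the `...T19:41:39` and `...T19:41:41` strings visible in the text chunks further down the dump. The four bytes after each payload are the chunk CRC, so they differ as well.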

Re: PDF to PNG conversion is non-deterministic

Posted: 2018-09-28T10:59:52-07:00
by snibgo
By default, PNG outputs contain date metadata. To exclude it, use "-define png:exclude-chunk=date".

Re: PDF to PNG conversion is non-deterministic

Posted: 2018-09-28T11:21:01-07:00
by syl_leroux
I found a workaround:

Code: Select all

# Metadata names seem to be case-sensitive!
convert in.pdf -set datecreate "" -set "Modify Date" "" -set "datemodify" "" out1.png


exiftool -G -e -n out1.png 
[ExifTool]      ExifTool Version Number         : 10.40
[...]
[PNG]           Pixel Units                     : 0
[PNG]           Datecreate                      : 
[PNG]           Datemodify                      : 
[PNG]           Modify Date                     : 
[PNG]           Pdf Hi Res Bounding Box         : 1x1+0+0
[PNG]           Pdf Version                     : PDF-1.3  1 0 obj <<
Interestingly, `exiftool` is able to remove the `PNG:Modify Date` tag, but not `PNG:Datecreate` and `PNG:Datemodify`, which seem to be non-standard PNG metadata. Or am I wrong?

Code: Select all

$ exiftool -PNG:"ModifyDate"= out1.png 
    1 image files updated
$ exiftool -PNG:"datecreate"= out1.png 
Warning: Tag 'PNG:datecreate' is not defined
Nothing to do.
$ exiftool -PNG:"datemodify"= out1.png 
Warning: Tag 'PNG:datemodify' is not defined
Nothing to do.

Re: PDF to PNG conversion is non-deterministic

Posted: 2018-09-28T11:26:04-07:00
by syl_leroux
Sorry @snibgo, I missed your answer:
By default, PNG outputs contain date metadata. To exclude it, use "-define png:exclude-chunk=date".
Thank you! That still requires a second pass with ExifTool, since the "Modify Date" metadata remains present in the output:

Code: Select all

$ convert in.pdf -define png:exclude-chunk=date out1.png
$ exiftool -G -e -n out1.png 
[ExifTool]      ExifTool Version Number         : 10.40
[...]
[PNG]           Pixel Units                     : 0
[PNG]           Modify Date                     : 2018:09:28 20:24:26
[PNG]           Pdf Hi Res Bounding Box         : 1x1+0+0
[PNG]           Pdf Version                     : PDF-1.3  1 0 obj <<
$ exiftool -PNG:"ModifyDate"= out1.png 
    1 image files updated

Re: PDF to PNG conversion is non-deterministic

Posted: 2018-09-28T11:36:13-07:00
by syl_leroux
Got it: I need to exclude *both* the `date` and `time` chunks:

Code: Select all

$ convert in.pdf -define png:exclude-chunk=time,date out1.png

$ # No more date/time info in the PNG metadata:
$ exiftool -G -e -n out1.png | grep -i date
[File]          File Modification Date/Time     : 2018:09:28 20:34:12+02:00
[File]          File Access Date/Time           : 2018:09:28 20:25:07+02:00
[File]          File Inode Change Date/Time     : 2018:09:28 20:34:12+02:00
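
To confirm that the fix holds at the byte level, and not only in the ExifTool listing, the original reproduction can be rerun with the new option. The following is a minimal sketch under a few assumptions: `pdf_to_png_stable` and `all_identical` are hypothetical helper names, `in.pdf` is the test file from the first post, and the two-second pause is there only so that any timestamp still being embedded would change between runs.

```shell
# Convert a PDF to PNG without the timestamp chunks, using the option from this thread.
pdf_to_png_stable() {
  convert "$1" -define png:exclude-chunk=time,date "$2"
}

# Hypothetical helper: succeed only if every file given is
# byte-identical to the first one.
all_identical() {
  first=$1
  shift
  for f in "$@"; do
    cmp -s "$first" "$f" || return 1
  done
}

# Example run, skipped when ImageMagick or the input file is missing.
if command -v convert >/dev/null 2>&1 && [ -f in.pdf ]; then
  pdf_to_png_stable in.pdf out0.png
  sleep 2   # let the clock advance between the two conversions
  pdf_to_png_stable in.pdf out1.png
  if all_identical out0.png out1.png; then
    echo "PNG output is reproducible"
  else
    echo "PNG output still differs" >&2
  fi
fi
```

`cmp -s` is used instead of `diff` because it compares raw bytes and reports the result through its exit status alone, which is easier to use in a regeneration script.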