PDF to PNG conversion is non-deterministic
Posted: 2018-09-28T10:51:18-07:00
by syl_leroux
I noticed that if I convert the same PDF to PNG several times, the resulting PNG files differ according to the `diff` tool.
Steps to reproduce:
Code:
convert xc:red in.pdf
convert in.pdf out0.png
convert in.pdf out1.png
diff out?.png
-->Binary files out0.png and out1.png differ
In my use case this is a problem because the destination PNGs are part of a Git repository. Each time they are regenerated from the source PDF, even if the PDF hasn't changed, Git marks the PNGs as modified because something has changed at the binary level.
Are there any options or post-processing steps I could use to ensure that conversion from PDF produces _exactly_ the same output PNG?
EDIT:
Code:
$ convert --version
Version: ImageMagick 6.9.7-4 Q16 x86_64 20170114 http://www.imagemagick.org
Copyright: © 1999-2017 ImageMagick Studio LLC
License: http://www.imagemagick.org/script/license.php
Features: Cipher DPC Modules OpenMP
Delegates (built-in): bzlib djvu fftw fontconfig freetype jbig jng jp2 jpeg lcms lqr ltdl lzma openexr pangocairo png tiff wmf x xml zlib
$ gs --version
9.20
Re: PDF to PNG conversion is non-deterministic
Posted: 2018-09-28T10:54:40-07:00
by syl_leroux
OK, on further investigation it appears that the embedded metadata (creation time) differs:
Code:
diff <(hexdump -C out0.png) <(hexdump -C out1.png)
10,11c10,11
< 00000090 3e 00 00 00 07 74 49 4d 45 07 e2 09 1c 13 29 27 |>....tIME.....)'|
< 000000a0 7f ab ad d6 00 00 00 0a 49 44 41 54 08 d7 63 60 |........IDAT..c`|
---
> 00000090 3e 00 00 00 07 74 49 4d 45 07 e2 09 1c 13 29 29 |>....tIME.....))|
> 000000a0 98 13 80 d1 00 00 00 0a 49 44 41 54 08 d7 63 60 |........IDAT..c`|
14,15c14,15
< 000000d0 31 38 2d 30 39 2d 32 38 54 31 39 3a 34 31 3a 33 |18-09-28T19:41:3|
< 000000e0 39 2b 30 32 3a 30 30 4e c9 92 03 00 00 00 25 74 |9+02:00N......%t|
---
> 000000d0 31 38 2d 30 39 2d 32 38 54 31 39 3a 34 31 3a 34 |18-09-28T19:41:4|
> 000000e0 31 2b 30 32 3a 30 30 77 e3 d5 7d 00 00 00 25 74 |1+02:00w..}...%t|
18c18
< 00000110 33 39 2b 30 32 3a 30 30 3f 94 2a bf 00 00 00 1c |39+02:00?.*.....|
---
> 00000110 34 31 2b 30 32 3a 30 30 06 be 6d c1 00 00 00 1c |41+02:00..m.....|
Is there a way to avoid embedding those data?
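To see which chunks carry those bytes without reading a hex dump, a small chunk lister can help. The sketch below is only an illustration: `png_chunks` is a hypothetical helper name, and it assumes a well-formed PNG plus the POSIX tools `od`, `dd`, `awk`, `tr` and `head`. It walks the file chunk by chunk (4-byte length, 4-byte type, data, 4-byte CRC) and, for `tEXt` chunks, also prints the keyword, which is where the date strings in the dump above appear to live.

```shell
# Hypothetical helper: list the chunks of a PNG file as "TYPE LENGTH [KEYWORD]".
# Assumes a well-formed PNG; CRCs are not verified.
png_chunks() {
    f=$1
    size=$(($(wc -c < "$f")))
    off=8                                   # skip the 8-byte PNG signature
    while [ "$off" -lt "$size" ]; do
        # 4-byte big-endian length of the chunk data
        len=$(od -An -tu1 -j "$off" -N4 "$f" |
              awk '{ print (($1 * 256 + $2) * 256 + $3) * 256 + $4 }')
        [ -n "$len" ] || return 1           # truncated file
        # 4-byte ASCII chunk type, e.g. IHDR, tIME, tEXt, IDAT, IEND
        type=$(dd if="$f" bs=1 skip=$((off + 4)) count=4 2>/dev/null)
        if [ "$type" = "tEXt" ]; then
            # tEXt data is "keyword NUL text"; print only the keyword
            key=$(dd if="$f" bs=1 skip=$((off + 8)) count="$len" 2>/dev/null |
                  tr '\000' '\n' | head -n 1)
            echo "$type $len $key"
        else
            echo "$type $len"
        fi
        off=$((off + 12 + len))             # length + type + data + CRC
    done
}
```

Running `png_chunks out0.png` and `png_chunks out1.png` should list the same chunks for both files; the `tIME` chunk and any date-related `tEXt` chunks are the ones to look at.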
Re: PDF to PNG conversion is non-deterministic
Posted: 2018-09-28T10:59:52-07:00
by snibgo
By default, PNG outputs contain date metadata. To exclude it, use "-define png:exclude-chunk=date".
Re: PDF to PNG conversion is non-deterministic
Posted: 2018-09-28T11:21:01-07:00
by syl_leroux
I found a workaround:
Code:
# Metadata name seems to be case-sensitive!
convert in.pdf -set datecreate "" -set "Modify Date" "" -set "datemodify" "" out1.png
exiftool -G -e -n out1.png
[ExifTool] ExifTool Version Number : 10.40
[...]
[PNG] Pixel Units : 0
[PNG] Datecreate :
[PNG] Datemodify :
[PNG] Modify Date :
[PNG] Pdf Hi Res Bounding Box : 1x1+0+0
[PNG] Pdf Version : PDF-1.3 1 0 obj <<
Interestingly, `exiftool` can remove the `PNG:Modify Date` tag, but not `PNG:Datecreate` and `PNG:Datemodify`, which seem to be non-standard PNG metadata. Or am I wrong?
Code:
$ exiftool -PNG:"ModifyDate"= out1.png
1 image files updated
$ exiftool -PNG:"datecreate"= out1.png
Warning: Tag 'PNG:datecreate' is not defined
Nothing to do.
$ exiftool -PNG:"datemodify"= out1.png
Warning: Tag 'PNG:datemodify' is not defined
Nothing to do.
Re: PDF to PNG conversion is non-deterministic
Posted: 2018-09-28T11:26:04-07:00
by syl_leroux
Sorry @snibgo, I missed your answer:
By default, PNG outputs contain date metadata. To exclude it, use "-define png:exclude-chunk=date".
Thank you! That still requires a second pass with ExifTool, since the "Modify Date" metadata remains present in the output:
Code:
$ convert in.pdf -define png:exclude-chunk=date out1.png
$ exiftool -G -e -n out1.png
[ExifTool] ExifTool Version Number : 10.40
[...]
[PNG] Pixel Units : 0
[PNG] Modify Date : 2018:09:28 20:24:26
[PNG] Pdf Hi Res Bounding Box : 1x1+0+0
[PNG] Pdf Version : PDF-1.3 1 0 obj <<
$ exiftool -PNG:"ModifyDate"= out1.png
1 image files updated
Re: PDF to PNG conversion is non-deterministic
Posted: 2018-09-28T11:36:13-07:00
by syl_leroux
Got it: I need to exclude *both* the `date` and `time` chunks:
Code:
$ convert in.pdf -define png:exclude-chunk=time,date out1.png
$ # No more date/time info in the PNG metadata:
$ exiftool -G -e -n out1.png | grep -i date
[File] File Modification Date/Time : 2018:09:28 20:34:12+02:00
[File] File Access Date/Time : 2018:09:28 20:25:07+02:00
[File] File Inode Change Date/Time : 2018:09:28 20:34:12+02:00
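For anyone who wants to check that a given command line really is reproducible before committing its output to Git, here is a rough sketch. `same_output_twice` is a hypothetical helper, not part of ImageMagick. It runs the command twice, appending a temporary output path as the last argument, and compares the two results byte for byte with `cmp`. It assumes `mktemp -d` is available, and it sleeps one second between runs so that any embedded timestamp has a chance to change.

```shell
# Hypothetical helper: run "CMD ARGS... OUTPUT" twice and compare the outputs.
# Returns 0 if both outputs are byte-identical, 1 if they differ,
# and 2 if the command itself (or mktemp) failed.
same_output_twice() {
    tmp=$(mktemp -d) || return 2
    # The sleep lets second-resolution timestamps differ between the runs.
    if "$@" "$tmp/first.png" && sleep 1 && "$@" "$tmp/second.png"; then
        cmp -s "$tmp/first.png" "$tmp/second.png"
        rc=$?
    else
        rc=2
    fi
    rm -rf "$tmp"
    return "$rc"
}

# Example with the options from this thread (requires ImageMagick and in.pdf):
#   same_output_twice convert in.pdf -define png:exclude-chunk=time,date \
#       && echo "reproducible"
```

Passing the command as arguments keeps the helper independent of the converter, so the same check works for any other tool that takes its output path last.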