identify does not recognize Color Spaces properly

Posted: 2016-11-02T16:15:48-07:00
by pauloney
I have a small PDF file (that one can open with a text editor)

https://www.dropbox.com/s/agjiwei4hga3n7i/im-bug.pdf

and clearly see that it is written in the "CMYK" Color Space, but "identify" reports it as being "sRGB".

$ identify -verbose im-bug.pdf | grep Colorspace
Colorspace: sRGB

Paulo de Souza

Re: identify does not recognize Color Spaces properly

Posted: 2016-11-02T16:33:21-07:00
by snibgo
Here's the file, with non-printable bytes replaced by "{nn}".

%PDF-1.5\n
%{d0}{d4}{c5}{d8}\n
3 0 obj\n
<<\n
/Length 235       \n
>>\n
stream\n
0 0 0 1 k 0 0 0 1 K\n
0 g 0 G\n
0 0 0 1 k 0 0 0 1 K\n
0 0 0 1 k 0 0 0 1 K\n
0 0 0 1 k 0 0 0 1 K\n
1 1 0 0 k 1 1 0 0 K\n
q\n
0 0 28.346 28.346 re f\n
Q\n
0 0 0 1 k 0 0 0 1 K\n
0 0 0 1 k 0 0 0 1 K\n
0 0 0 1 k 0 0 0 1 K\n
0 0 0 1 k 0 0 0 1 K\n
0 0 0 1 k 0 0 0 1 K\n\n
endstream\n
endobj\n
7 0 obj\n
<<\n
/Producer (pdfTeX-1.40.17)\n
/Creator (TeX)\n
/CreationDate (D:20161102145323-07'00')\n
/ModDate (D:20161102145323-07'00')\n
/Trapped /False\n
/PTEX.Fullbanner (This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016) kpathsea version 6.2.2)\n
>>\n
endobj\n
4 0 obj\n
<<\n
/Type /ObjStm\n
/N 4\n
/First 22\n
/Length 257       \n
>>\n
stream\n
2 0 1 105 5 139 6 191\n
% 2 0 obj\n
<<\n
/Type /Page\n
/Contents 3 0 R\n
/Resources 1 0 R\n
/MediaBox [0 0 28.346 28.346]\n
/Parent 5 0 R\n
>>\n
% 1 0 obj\n
<<\n
/ProcSet [ /PDF ]\n
>>\n
% 5 0 obj\n
<<\n
/Type /Pages\n
/Count 1\n
/Kids [2 0 R]\n
>>\n
% 6 0 obj\n
<<\n
/Type /Catalog\n
/Pages 5 0 R\n
>>\n\n
endstream\n
endobj\n
8 0 obj\n
<<\n
/Type /XRef\n
/Index [0 9]\n
/Size 9\n
/W [1 2 1]\n
/Root 6 0 R\n
/Info 7 0 R\n
/ID [<391431FEE2C4F5CB174EA573217A0305> <391431FEE2C4F5CB174EA573217A0305>]\n
/Length 36        \n
>>\n
stream\n
{00}{00}{00}{ff}{02}{00}{04}{01}{02}{00}{04}{00}{01}{00}{0f}{00}{01}{02}7{00}{02}{00}{04}{02}{02}{00}{04}{03}{01}{01}4{00}{01}{03}{8f}{00}\n
endstream\n
endobj\n
startxref\n
911\n
%%EOF\n
I don't know the internals of PDF. How do you clearly see that it is written in the "CMYK" Color Space?

Re: identify does not recognize Color Spaces properly

Posted: 2016-11-02T16:35:23-07:00
by pauloney
Because of the "K" at the end of these lines:

stream\n
0 0 0 1 k 0 0 0 1 K\n
0 g 0 G\n
0 0 0 1 k 0 0 0 1 K\n
0 0 0 1 k 0 0 0 1 K\n
0 0 0 1 k 0 0 0 1 K\n
1 1 0 0 k 1 1 0 0 K\n
q\n
0 0 28.346 28.346 re f\n
Q\n
0 0 0 1 k 0 0 0 1 K\n
0 0 0 1 k 0 0 0 1 K\n
0 0 0 1 k 0 0 0 1 K\n
0 0 0 1 k 0 0 0 1 K\n
0 0 0 1 k 0 0 0 1 K\n\n
endstream\n
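
To make the reasoning concrete, here is a minimal Python sketch of that kind of reading. It assumes the content stream is uncompressed (as in this file) and simply maps the standard PDF colour operators (g/G for gray, rg/RG for RGB, k/K for CMYK) to the device colour space they select. The function name and the naive whitespace tokenizing are my own illustration, not anything taken from ImageMagick or Ghostscript; a real parser would also have to skip strings, inline images and compressed streams.

```python
# Standard PDF content-stream colour operators and the device space each selects.
# Lowercase sets the fill colour, uppercase sets the stroke colour.
COLOR_OPERATORS = {
    "g": "DeviceGray", "G": "DeviceGray",
    "rg": "DeviceRGB", "RG": "DeviceRGB",
    "k": "DeviceCMYK", "K": "DeviceCMYK",
}


def device_colorspaces(stream_text):
    """Return the device colour spaces named by operators in an uncompressed stream.

    Hypothetical helper: tokens are split on whitespace only, which is enough
    for the simple stream in this thread but not for arbitrary PDFs.
    """
    return {COLOR_OPERATORS[tok] for tok in stream_text.split() if tok in COLOR_OPERATORS}


# A shortened copy of the stream quoted above.
sample = """0 0 0 1 k 0 0 0 1 K
0 g 0 G
1 1 0 0 k 1 1 0 0 K
q
0 0 28.346 28.346 re f
Q
"""

print(sorted(device_colorspaces(sample)))  # ['DeviceCMYK', 'DeviceGray']
```

Note that by this reading the stream also selects DeviceGray once (the "0 g 0 G" line), so strictly it uses two device colour spaces, neither of which is RGB.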

Re: identify does not recognize Color Spaces properly

Posted: 2016-11-02T17:00:06-07:00
by snibgo
Ah, thanks.

Perhaps a developer will look at this. I think (but could be wrong) that Ghostscript always returns images as sRGB.

Re: identify does not recognize Color Spaces properly

Posted: 2016-11-02T17:07:10-07:00
by pauloney
If that is the case, then it would be highly inappropriate for "identify" to report on the Color Spaces of PDF files, because a PDF can use any one of some 17 color spaces, I believe.

Re: identify does not recognize Color Spaces properly

Posted: 2016-11-02T17:28:35-07:00
by fmw42
I think it would be helpful if you specify what is your IM version and platform.

Re: identify does not recognize Color Spaces properly

Posted: 2016-11-02T17:37:06-07:00
by pauloney
So far I have seen it on several machines, so I think the error is pervasive. The machine I am using right now, and on which I can test things, is: Version: ImageMagick 6.8.9-9 Q16 x86_64 2016-06-01, on Ubuntu 14.04.

Re: identify does not recognize Color Spaces properly

Posted: 2016-11-02T18:56:16-07:00
by fmw42
How was this file created? Was it a CMYK raster image embedded in an sRGB PDF vector shell?

Re: identify does not recognize Color Spaces properly

Posted: 2016-11-02T22:16:42-07:00
by pauloney
snibgo wrote:Ah, thanks.

Perhaps a developer will look at this. I think (but could be wrong) that Ghostscript always returns images as sRGB.
You are wrong! Here is another (simple) PDF file which is recognized fine by "identify" as being CMYK.

https://www.dropbox.com/s/eb7o1ftlhfpfdgi/red-cmyk.pdf

Re: identify does not recognize Color Spaces properly

Posted: 2016-11-03T00:58:16-07:00
by snibgo
Ah, well, at least I was right about being wrong!


"convert -verbose red-cmyk.pdf x.jpg" tells us that IM is telling Ghostscript the PDF is CMYK. So IM itself recognises this. I see that red-cmyk.pdf contains a line "/ColorSpace/DeviceCMYK" which is probably a good clue. Your first file didn't have that.

Re: identify does not recognize Color Spaces properly

Posted: 2016-11-03T06:14:40-07:00
by pauloney
I am looking at the code and I can't claim I understand it, but it looks to me as if IM is determining the Color Space of the file by heuristics! And you can't really do that: one could have Color Spaces that are defined and not used, and a bunch of other situations.

And it looks like if some tests fail it reports back "RGB" which is again completely wrong.

And I do not see gs involved in it in any way....

Someone who understands this should look a bit closer. This seems really messed up, and "identify" is literally the only way to determine the Color Space of a PDF file in Linux.

Re: identify does not recognize Color Spaces properly

Posted: 2016-11-03T07:41:43-07:00
by magick
PDF might internally include strokes or images that may be in a variety of colorspaces, but the colorspace associated with the rendered results depends on the output device. A single PDF may include multiple pages, each in a different colorspace. PDF permits comments that might alert the user to which colorspace is utilized such as the CMYKProcessColor comment. ImageMagick uses these comments as hints to determine which output device to render the PDF. The colorspace reported by ImageMagick is that of the selected output device.
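
As a rough illustration of the hint-based behaviour described above, the following Python sketch guesses an output colorspace from marker strings in the raw file bytes. It is only an assumption about the general shape of such a heuristic, not ImageMagick's actual code: the function name is hypothetical, and the two markers are simply the ones mentioned in this thread ("DeviceCMYK" and "CMYKProcessColor").

```python
# Marker strings taken from this thread; a real implementation may look for others.
CMYK_HINTS = (b"DeviceCMYK", b"CMYKProcessColor")


def guess_output_colorspace(pdf_bytes):
    """Guess the rendering colorspace from hints in the raw bytes (hypothetical helper)."""
    if any(hint in pdf_bytes for hint in CMYK_HINTS):
        return "CMYK"
    # No hint found: fall back to sRGB.
    return "sRGB"


# Stand-in fragments, not the full files: the first mimics im-bug.pdf (CMYK
# operators but no hint), the second mimics the "/ColorSpace/DeviceCMYK" line
# reported for red-cmyk.pdf.
first_file = b"stream\n0 0 0 1 k 0 0 0 1 K\n1 1 0 0 k 1 1 0 0 K\nendstream\n"
second_file = b"<< /ColorSpace/DeviceCMYK >>"

print(guess_output_colorspace(first_file))   # sRGB
print(guess_output_colorspace(second_file))  # CMYK
```

Under these assumptions the first file falls through to sRGB even though every colour operator in it is CMYK or gray, which matches what was reported at the top of the thread.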

Re: identify does not recognize Color Spaces properly

Posted: 2016-11-03T08:06:49-07:00
by pauloney
magick wrote:PDF might internally include strokes or images that may be in a variety of colorspaces,


Absolutely true! And in that case they are all part of the colorspace of the file and they should all be reported.
magick wrote:but the colorspace associated with the rendered results depends on the output device.
I am not sure where you are going here! A colorspace is a property of a file and not of the rendering on any particular device! The set of colorspaces of a file is a well-defined property that has nothing to do with rendering.
magick wrote:A single PDF may include multiple pages, each in a different colorspace.


Absolutely true! And in that case they are all part of the colorspace of the file and they should all be reported.
magick wrote: PDF permits comments that might alert the user to which colorspace is utilized such as the CMYKProcessColor comment.


And these are only comments and do NOT change or ADD anything to the colorspace of a file. They should NOT be reported. One can go even further -- PDF can have colorspaces defined and NOT used -- these should also not be reported since they are not part of the file.
magick wrote:ImageMagick uses these comments as hints to determine which output device to render the PDF.


No further explanation ... indeed a big bug!
magick wrote: The colorspace reported by ImageMagick is that of the selected output device.
Again, this is NOT the definition of the colorspace of a file. In any case, my second example proves your statement above is false.

Re: identify does not recognize Color Spaces properly

Posted: 2016-11-03T08:51:19-07:00
by magick
Identify reports the colorspace of the image as produced by the Ghostscript output device and as such it is returning proper results. It's the colorspace of the rendered image. We do support properties directly associated with the input image in some cases, e.g. tiff:photometric. What we need is a PDF:Colorspaces property reflecting the colorspaces covered by the PDF itself rather than the output device, but that information is not currently available to us. We don't render the PDF internally; instead we rely on the Ghostscript library to render the PDF and can only access PDF attributes that Ghostscript provides. If you know of a way that we can leverage Ghostscript to return the information you seek, let us know and we will add a patch to return PDF:Colorspaces in a future version of ImageMagick.

Re: identify does not recognize Color Spaces properly

Posted: 2016-11-03T09:33:45-07:00
by pauloney
Three things here then:

1- We have a problem of labeling: if it is "Color Space as rendered by the Ghostscript output device", then it should be labelled as such and not as "Color Space", which has a meaning defined in the PDF Reference Manual and a usage that is standard in dozens of tools.

2- What do you mean by "but that information is not currently available to us"? That information is in the PDF file! Several tools, like Acrobat and other pre-flight tools, report on the Color Spaces directly from the PDF.

3- I don't believe rendering is necessary to determine the ColorSpace of a PDF file -- I can easily cook a PDF that takes 5 minutes to render, but Acrobat shows its ColorSpace in seconds.

Could you show me the type of GS commands being used to get this information from GS? I would like to experiment with my second image where the GS rendering is RGB, but identify is reporting CMYK!