Problems with combining Set and Mogrify
Posted: 2012-05-24T15:51:03-07:00
by gaimrox
Hi,
I'm attempting to detect duplicates in our internal image hosting service. It currently holds about 65K images, a fair number of which I suspect are duplicates.
We have been using ImageMagick to open the images as uploaded from the user, validate them, and then store them into our database for about 5 years.
My algorithm is as follows:
Open each image one at a time
Perform a mogrify->strip to remove all comment/exif data
If the image is a PNG, strip the date:create and date:modify properties that started being added around 2009 (images from before then lack this problem)
Store the image
Determine the MD5 of the image and store that for a later deduping sweep
In the case of lossy images it's interesting that cycled images (those added, downloaded and re-added) will not be detected as duplicates here, but there isn't too much I can do about that at this point.
So here is my problem: I wrote all the above code and deployed it, then found out that the PNG images still had date:create/date:modify even though the API call was made. I believe something is broken, because when I swapped the order of the mogrify and the date stripping, the output changed in an unexpected way.
Here is my code, along with the MD5 printed at various points of the processed image:
Code:
warn md5_hex($im->ImageToBlob()); # 2985ceb411ffc2ca80e845c09f389160
# strip out unique attrs from the image that might mess up the final file
$im->Mogrify('strip');
warn md5_hex($im->ImageToBlob()); # de4c581bde9a6c7d5b30d234ac37167e
$im->Set( 'date:modify' => '');
warn md5_hex($im->ImageToBlob()); # de4c581bde9a6c7d5b30d234ac37167e
$im->Set( 'date:create' => '');
warn md5_hex($im->ImageToBlob()); # de4c581bde9a6c7d5b30d234ac37167e
I then flipped the order of the mogrify and date modify:
Code:
warn md5_hex($im->ImageToBlob()); # 5f6c94c736f6614a17449bdc6710fd96
$im->Set( 'date:modify' => '');
warn md5_hex($im->ImageToBlob()); # b5f0d9a8df86ff58ed1b345acc533b78
$im->Set( 'date:create' => '');
warn md5_hex($im->ImageToBlob()); # 9278c7386812b311593b5143331ced52
# strip out unique attrs from the image that might mess up the final file
$im->Mogrify('strip');
warn md5_hex($im->ImageToBlob()); # de4c581bde9a6c7d5b30d234ac37167e
Note that once mogrify is run, the MD5 never changes. I have determined that Mogrify does not alter date:create/date:modify - so that is not good. I think this is a bug. Thoughts?
As a sidenote I hope others can find this post about how to clear date:modify and date:create as the documentation is VERY confusing about how to do this. I'm still not sure the above is the correct procedure.
Thanks for reading this far, I have been working on this over a week!
Re: Problems with combining Set and Mogrify
Posted: 2012-05-25T04:37:44-07:00
by magick
For duplicates, you can use $im->Signature() or $im->Compare(). Both look at the image pixels themselves. Compare() allows for fuzziness: you can set a threshold so that two images that differ only slightly are still considered duplicates.
Re: Problems with combining Set and Mogrify
Posted: 2012-05-25T08:14:18-07:00
by gaimrox
Hi - thanks for the suggestion. I am aware of the Signature method, but for our use case it makes more sense to normalize the image metadata and generate a signature that way.
I did not know of the Compare method, but we accept a large number of images per day, and comparing each uploaded image to 65,000 stored images would be pretty expensive. This is all live and interactive on a website, so selecting all 65K images and then comparing the uploaded one to each of them would take quite a while. Thanks for the suggestion though.
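To illustrate the cost argument above: with one stored digest per image, a duplicate check is a single keyed lookup rather than a comparison against every stored image. This is only a minimal sketch in core Perl, assuming the blob has already been normalized; `find_or_record` and the in-memory `%index` are hypothetical stand-ins for a database table with a unique index on the digest.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: $index maps digest => id of the first image stored
# with that digest. Returns the id of an existing duplicate, or undef after
# recording the new image.
sub find_or_record {
    my ($index, $id, $blob) = @_;
    my $digest = md5_hex($blob);
    return $index->{$digest} if exists $index->{$digest};
    $index->{$digest} = $id;
    return;    # undef in scalar context: no duplicate found
}

# Placeholder strings stand in for normalized image blobs.
my %index;
for my $upload ([1, "pixels-A"], [2, "pixels-B"], [3, "pixels-A"]) {
    my ($id, $blob) = @$upload;
    my $dup = find_or_record(\%index, $id, $blob);
    print defined($dup) ? "image $id duplicates image $dup\n"
                        : "image $id stored\n";
}
```

The lookup cost does not grow with the number of stored images, which is why the digest has to be stable across metadata differences for this approach to work.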
I am nearly certain there is a bug in ImageMagick when combining these two method calls. I think that a call to mogrify sets a flag that records default "modify/create" values in the image regardless of you having "Set" them.
Are there any other ways of clearing the "modify/create" values for PNGs? I am surprised how hard this is.
thanks
Re: Problems with combining Set and Mogrify
Posted: 2012-05-25T08:55:07-07:00
by glennrp
One simple way is to use "-define png:exclude-chunk=date" which prevents the
png encoder from writing the date-related text chunks. But "-strip" is supposed
to do that operation automatically; I don't see why you are still getting the
date chunks (perhaps it's because you are doing the stripping through an API
instead of through the commandline; if that is the case the API equivalent
of the "-define" should work for you).
Edit: Looking at the code, it seems that mogrify with "strip" does call
StripImage() and StripImage() does define out the PNG date chunks.
Are you running a recent version of ImageMagick on your server? This
feature was added at version 6.6.6 and 6.6.7 according to the ChangeLog.
Re: Problems with combining Set and Mogrify
Posted: 2012-05-25T12:51:37-07:00
by gaimrox
Hi glennrp,
I am running "6.7.4.4_1" on FreeBSD, so this seemingly is an ongoing problem. I am willing to run all sorts of tests to ferret out the problem; I'm just not sure what else to do.
thanks.
Re: Problems with combining Set and Mogrify
Posted: 2012-05-27T23:44:23-07:00
by anthony
My understanding is that IM automatically creates 'date:modify' and 'date:create' image property strings when it reads in an image from a file.
You should be able to remove those properties before writing...
Code:
convert logo: logo.jpg
convert logo.jpg +set date:create +set date:modify logo1.png
sleep 2; touch logo.jpg
convert logo.jpg +set date:create +set date:modify logo2.png
diff -s logo1.png logo2.png
The "diff" should report... Files logo1.png and logo2.png are identical
as the time stamps were removed.
NOTE: On read, those time stamps are overwritten by the timestamp of the file read! Really they are informational timestamps, and perhaps they should not be per-image 'properties' but per-image artifacts (artifacts are per-image 'operational data' which is NOT meant to be written with the image).
Comments and suggestions about this?
Re: Problems with combining Set and Mogrify
Posted: 2012-05-29T13:09:07-07:00
by gaimrox
Hi anthony,
What you are performing in your suggestion is unfortunately not directly available via the API.
If you do a search in this forum for stripping off the "create/modify" values using the perl API you will find a number of posts that contradict each other.
It's not even clear to me right now if strip does remove them, or if I must remove them manually. It's also not clear if my manual removal below actually succeeds in removing the values, as there is no good way to look at those values.
I think it's safe to say that automatically setting those values on PNGs at read time is a recipe for problems. A hidden event occurs that you cannot disable, and you thereafter must guess as to how best to restore the file.
In my case I strip the file, remove the dates, and then save the record. If I run the script again it will discover that the date values are not removed, and will then remove them. At this point the file actually no longer has the dates.
My current course of action is to strip the file, output to blob, input back from blob, strip dates, and then save. There is some evidence to show that this produces a different result than all the other suggestions I've seen, and I think we can agree it is bad that I must jump through such hoops.
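One way to look at the stored values without going through ImageMagick at all is to walk the PNG container directly. This is a minimal sketch in core Perl, assuming the dates are written as PNG text chunks keyed date:create and date:modify (as the "date-related text chunks" described earlier in this thread suggest). The function names are hypothetical, and chunk CRCs are not verified.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

my $PNG_MAGIC = "\x89PNG\r\n\x1a\n";

# Split a PNG blob into chunks. Each chunk on disk is:
# 4-byte big-endian length, 4-byte type, data, 4-byte CRC.
sub png_chunks {
    my ($blob) = @_;
    die "not a PNG\n" unless substr($blob, 0, 8) eq $PNG_MAGIC;
    my @chunks;
    my $pos = 8;
    while ($pos + 12 <= length($blob)) {
        my ($len, $type) = unpack('N a4', substr($blob, $pos, 8));
        die "truncated chunk at offset $pos\n"
            if $pos + 12 + $len > length($blob);
        push @chunks, {
            type => $type,
            data => substr($blob, $pos + 8, $len),
            raw  => substr($blob, $pos, 12 + $len),
        };
        $pos += 12 + $len;
    }
    return @chunks;
}

# Keyword of a text chunk (the bytes before the first NUL), else undef.
sub text_keyword {
    my ($chunk) = @_;
    return undef unless $chunk->{type} =~ /\A(?:tEXt|zTXt|iTXt)\z/;
    my $nul = index($chunk->{data}, "\0");
    return $nul < 0 ? $chunk->{data} : substr($chunk->{data}, 0, $nul);
}

# List the keywords of all text chunks, in file order.
sub png_text_keywords {
    my ($blob) = @_;
    return grep { defined } map { scalar text_keyword($_) } png_chunks($blob);
}

# Rebuild the blob without the date:create / date:modify text chunks.
sub strip_png_dates {
    my ($blob) = @_;
    my $out = $PNG_MAGIC;
    for my $chunk (png_chunks($blob)) {
        my $keyword = text_keyword($chunk);
        next if defined $keyword && $keyword =~ /\Adate:(?:create|modify)\z/;
        $out .= $chunk->{raw};
    }
    return $out;
}

# Optional command-line use: perl script.pl image.png
if (@ARGV) {
    open my $fh, '<:raw', $ARGV[0] or die "$ARGV[0]: $!\n";
    my $blob = do { local $/; <$fh> };
    print "text keywords: ", join(', ', png_text_keywords($blob)), "\n";
    print "md5 without date chunks: ", md5_hex(strip_png_dates($blob)), "\n";
}
```

Running png_text_keywords on a stored blob shows whether the date chunks are really present, and hashing the output of strip_png_dates gives a digest that ignores them. This only addresses PNG; it says nothing about how JPEG metadata is stored.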
Re: Problems with combining Set and Mogrify
Posted: 2012-05-29T13:42:18-07:00
by gaimrox
Yeah... so I wrote the code as I outlined above, and now it works properly. Either there is a bug, or some documentation needs to be written in order to explain what exactly is going on.
Here is my code to accept and normalize an image:
Code:
my $im = Image::Magick->new();
if ($im->BlobToImage($imagedata) == 0) {
    $self->log_error("Image rejected, corrupt binary data.");
    throw RWDE::DataBadException({ info => 'The image appears to be corrupted or of an unrecognized type.' });
}
# make sure the uploaded image is in an accepted format
MM::Image->Check_extension({ extension => lc($im->Get('magick')) });
# strip out unique attrs from the image that might mess up the final file
$im->Mogrify('strip');
# create a second object to work around bug in imagemagick
my $im2 = Image::Magick->new();
$im2->BlobToImage($im->ImageToBlob());
$im = $im2;
# remove the default date stamps that imageMagick adds
$im->Set( 'date:modify' => '');
$im->Set( 'date:create' => '');
It appears to me that calling mogrify locks the image data in some way. You can see in my previous post that setting a null date after a strip actually does not work - but I was able to confirm that the date is still there... hence my conclusion.
In addition, I swapped the order of the strip and the date clearing, and the image was then left with a date in my database. This supports the above conclusion as well.
Re: Problems with combining Set and Mogrify
Posted: 2012-07-18T00:06:47-07:00
by anthony
The date properties will always re-appear any time IM reads a file. These are the date stamps of the file read!
Any date stamp saved in the file itself should be ignored.
Actually these probably should be stored as image artefacts rather than as image properties so they don't get saved with the image.
Re: Problems with combining Set and Mogrify
Posted: 2012-07-28T18:40:05-07:00
by gaimrox
I disagree with your assessment.
I do not have this problem with any images except for JPG. In addition I only have this problem with a small number of JPG, maybe 2% of the total 60K JPG that I have.
If this were something that was supposed to happen, it would happen for all images.
Also, please read my hack-solution above. This actually does fix the problem for 99.99% of the images that we currently accept. I have only had 1 single failure since I updated the code with the attached hack.
Ideally this problem would be totally solved though. I use this code as a rudimentary form of deduping images upon upload to an image hosting service. In the event that people upload the same image I want to block it. Of course JPG is lossy so there is a problem here, but it still cut down a huuuge amount of my duplicates.
I am considering transitioning over to GD because nobody seems to have any idea why or how this code is broken, and that's a little scary to me. I think the same backend lib is used in GD so probably my deduping would not be interrupted by switching over.
Or maybe somebody will read this who knows exactly what's going on, and then I can just fix my app.
Re: Problems with combining Set and Mogrify
Posted: 2012-07-28T18:41:07-07:00
by gaimrox
Also as a clarification, I do not have any files at all here.
I accept these images as BLOBs via perl and then I store them in a DB. I never physically write them to a disk in the standard sense.
Re: Problems with combining Set and Mogrify
Posted: 2013-01-10T19:19:49-07:00
by gaimrox
Reporting back a few months later and with a much newer version of ImageMagick - problem continues.
I'm on "6.7.7.7_1" now, and the exact same issue reported above continues to occur on about 1 of every 300 images I process. The strip functionality definitely appears to have some sort of long standing bug.
Re: Problems with combining Set and Mogrify
Posted: 2017-02-24T19:47:22-07:00
by gaimrox
Reporting back many years later. Issue still persists.
Is there anyone with ideas on a possible workaround?
Re: Problems with combining Set and Mogrify
Posted: 2017-02-24T20:26:43-07:00
by fmw42
Have you tried upgrading to the latest IM 6.9.7.9 or IM 7.0.5.0?