How do you detect duplicates? And how does IM Fingerprint work?
Posted: 2018-11-18T03:19:49-07:00
Hi there,
I tried searching for copy/duplicate/detect duplicate, but I didn't find anything here. If I overlooked something (for example, if this has already been answered somewhere on the board), please let me know. I also read https://www.imagemagick.org/Usage/compare/. I am looking for a way to automate this.
Long story short: about a year ago I lost my photos as well as my backups, and I ended up with a folder containing most of the photos plus modified duplicates of the originals (denoised, gamma-corrected, sharpened, scaled). Now I need to get rid of the duplicates. For now I really just want to detect duplicates - choosing which duplicate to keep isn't that important yet.
So I tried the following:
1. Simple IM fingerprint (storing every photo's fingerprint in an array and, while iterating over all my photos, checking whether one matches) - that seems to work quite well.
2. Downscale to 64x64 (I also tested 32x32), convert to grayscale, create 3 additional versions rotated in 90-degree steps, and take the fingerprints of those to check for duplicates.
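To make the two approaches concrete, here is a minimal sketch in Python rather than my actual PHP code. It assumes each photo has already been reduced to a small grayscale grid (rows of 0-255 integers); `fingerprint`, `rotate90`, `rotation_fingerprints` and `find_duplicates` are hypothetical helper names, and the SHA-256 digest is only a stand-in for the IM signature.

```python
import hashlib


def fingerprint(grid):
    """Exact digest of a grayscale grid (stand-in for the IM signature)."""
    h = hashlib.sha256()
    # Include the dimensions so differently shaped grids never collide.
    h.update(f"{len(grid)}x{len(grid[0])}:".encode())
    for row in grid:
        h.update(bytes(row))  # values must be integers in 0..255
    return h.hexdigest()


def rotate90(grid):
    """Rotate a grid clockwise by 90 degrees."""
    return [list(row) for row in zip(*grid[::-1])]


def rotation_fingerprints(grid):
    """Fingerprints of the grid and its three 90-degree rotations."""
    prints = set()
    current = grid
    for _ in range(4):
        prints.add(fingerprint(current))
        current = rotate90(current)
    return prints


def find_duplicates(images):
    """images: dict of name -> grid. Returns (name, earlier_name) pairs."""
    seen = {}  # fingerprint -> name of the first photo that produced it
    duplicates = []
    for name, grid in images.items():
        prints = rotation_fingerprints(grid)
        match = next((seen[p] for p in prints if p in seen), None)
        if match is not None:
            duplicates.append((name, match))
        else:
            for p in prints:
                seen[p] = name
    return duplicates


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    photos = {"a": a, "b": rotate90(a), "c": [[9, 9], [9, 8]]}
    print(find_duplicates(photos))
```

In this sketch a rotated copy only matches if its pixel values are exactly the same as the original's after rotation.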
I could use a helping hand or an idea for approach 2. To downscale:
- first I used sample. That is pretty fast, but no copies are detected.
- then I used scale. That is a little slower, and still no copies are detected.
- then I used resize with POINT and BOX. That is a little slower again - still no copies.
- then I used resize with GAUSSIAN and HERMITE. GAUSSIAN is the slowest(!), and HERMITE is a bit slower than the variants above. THIS one detects some duplicates (so yes, it does work; it's just a little too slow).
Using sample/scale followed by a Gaussian blur is still faster than using resize with GAUSSIAN - but it does not detect duplicates. So I'm curious: why do resize with GAUSSIAN and resize with HERMITE work, while sample/scale followed by a Gaussian blur does not?
By the way, the fingerprint I am using is the one returned by PHP's \Imagick::getImageSignature(). Is that perhaps the wrong thing to use for what I want to do? I'm not limited to PHP; Bash would be fine as well. How do you do that?
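To illustrate my worry about the signature: as far as I understand it (an assumption on my side), it is an exact digest of the pixel data, so two images only match if every pixel value is identical. The Python sketch below uses hashlib's SHA-256 as a stand-in; `digest` and `quantize` are hypothetical helpers, and the pixel values are made up for the example.

```python
import hashlib


def digest(pixels):
    """Exact digest of a flat list of 0-255 grayscale values."""
    return hashlib.sha256(bytes(pixels)).hexdigest()


def quantize(pixels, step=16):
    """Coarsen values into buckets of width `step` (hypothetical helper)."""
    return [p // step for p in pixels]


if __name__ == "__main__":
    original = [10, 20, 30, 40]
    nearly_same = [10, 20, 30, 41]  # one value differs by 1

    # The exact digests differ although the images are almost identical.
    print(digest(original) == digest(nearly_same))
    # After coarsening, both lists fall into the same buckets.
    print(digest(quantize(original)) == digest(quantize(nearly_same)))
```

Coarsening is not a real fix, since values on either side of a bucket boundary (47 and 48 with a step of 16, for example) still end up different; it only shows how sensitive an exact digest is.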
I noticed that auto-levels does not change the fingerprint. I am looking for a way to still detect color-distorted or gamma-corrected photos as copies. That is why I do the grayscale conversion. I also tried creating an edge mask and using that - however, creating the mask takes far too long.
Thanks in advance,
Jean