Identifying redacted images
I'm trying to figure out a way to determine whether a scanned image of a document has been redacted with a block (or blocks) of solid black. I've been fiddling around with histograms, but I have the feeling that I'm just looking in the wrong place. I'd really appreciate a point in the right direction. Does anybody have any advice?
- anthony
- Posts: 8883
- Joined: 2004-05-31T19:27:03-07:00
- Authentication code: 8675308
- Location: Brisbane, Australia
Re: Identifying redacted images
Perhaps you could post a couple of example images showing what you mean by 'redacted'?
Anthony Thyssen -- Webmaster for ImageMagick Example Pages
https://imagemagick.org/Usage/
Re: Identifying redacted images
I'm afraid I can't provide samples of the actual files I'm working with. These are scans of legal documents and they're very tightly controlled. I'll see if I can fake up some if you think it would help, though. Each one is basically just a scan of a white page with black printed text. After scanning, some of them have been edited with Photoshop or something similar and rectangular black boxes were drawn to mark out certain bits of text.
I'm dealing with thousands of TIFFs, too many to go through by hand, even *if* these were documents that interns or contract workers could legally view. Putting together a script that could recognize a big black box on a page that's otherwise filled with text would make my life a lot easier, so I appreciate any advice you can offer.
- anthony
Re: Identifying redacted images
I see. The first problem is easy: finding black boxes.
These are much darker overall than the surrounding text. So take a page with just one 'black box',
blur the image, -trim it with a -fuzz setting, and identify the result.
If the arguments are right, the trim should reduce the image to the area around the
black box.
See IM Examples Fuzzy Trim...
http://www.imagemagick.org/Usage/crop/#trim_blur
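To make the idea concrete, here is a rough pure-Python sketch of what blurring followed by a fuzzy trim achieves. It is not how ImageMagick implements -blur, -fuzz or -trim. The function names, the mean filter, the radius, the fuzz value and the tiny test page are all assumptions made up for illustration. Thin text strokes wash out to light grey when blurred and get trimmed away as background, while a solid box stays dark and defines the remaining area.

```python
def box_blur(img, radius):
    """Mean-filter a grayscale image given as a list of rows (0 = black, 255 = white).

    Pixels outside the image are taken from the nearest edge pixel.
    """
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            total = 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += img[yy][xx]
            row.append(total / (2 * radius + 1) ** 2)
        out.append(row)
    return out


def dark_bounding_box(img, radius=2, fuzz=0.5):
    """Return (left, top, right, bottom), inclusive, of what survives blur plus fuzzy trim.

    A blurred pixel counts as background when it is within `fuzz` (a fraction
    of the full range) of white. Returns None when nothing survives.
    """
    blurred = box_blur(img, radius)
    limit = 255 * (1 - fuzz)
    kept = [(x, y) for y, row in enumerate(blurred)
            for x, v in enumerate(row) if v < limit]
    if not kept:
        return None
    xs = [x for x, _ in kept]
    ys = [y for _, y in kept]
    return (min(xs), min(ys), max(xs), max(ys))


def demo_page(with_box=True):
    """Hypothetical 40x30 test page: two dotted 'text' rows and an optional solid box."""
    page = [[255] * 40 for _ in range(30)]
    for y in (2, 26):
        for x in range(0, 40, 2):
            page[y][x] = 0
    if with_box:
        for y in range(10, 20):
            for x in range(5, 25):
                page[y][x] = 0
    return page


print(dark_bounding_box(demo_page()))                # -> (5, 10, 24, 19)
print(dark_bounding_box(demo_page(with_box=False)))  # -> None
```

A page whose result is None has nothing box-like on it, so for sorting thousands of scans the interesting pages are the ones that return a rectangle. The radius and fuzz would need tuning against real scans.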
The problem, however, is that there is no simple way (at the moment) of dividing an image
that contains two or more 'black boxes'.
One way would be to threshold the blurred image so that only the areas with black boxes
remain black. You can then try to run this through a script like 'divide_vert'
http://www.imagemagick.org/Usage/scripts/divide_vert
which divides the image into a vertical stack of blank and non-blank areas.
This script is only a rough outline of a proposed addition to IM examples to sub-divide images
based on blank areas. It is only an example but one that does work.
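The general idea of splitting a thresholded image into a vertical stack of blank and non-blank areas can be sketched in a few lines of Python. This is not the divide_vert script itself, only an illustration of the approach. The function names and the threshold of 128 are assumptions, and the input is assumed to be an already blurred grayscale image stored as a list of rows.

```python
def row_bands(img, threshold=128):
    """Group consecutive rows into bands of blank and non-blank rows.

    img is a list of rows of grayscale values (0 = black, 255 = white).
    A row is non-blank when any pixel is darker than `threshold`.
    Returns a list of (first_row, last_row, has_dark) tuples, rows inclusive.
    """
    bands = []
    for y, row in enumerate(img):
        has_dark = any(v < threshold for v in row)
        if bands and bands[-1][2] == has_dark:
            # Same kind of row as the previous one: extend the current band.
            bands[-1] = (bands[-1][0], y, has_dark)
        else:
            bands.append((y, y, has_dark))
    return bands


def box_bands(img, threshold=128):
    """Return only the (first_row, last_row) ranges that contain dark pixels."""
    return [(s, e) for s, e, dark in row_bands(img, threshold) if dark]


# Hypothetical 12-row image with dark areas on rows 2-3 and rows 7-9.
image = [[255] * 6 for _ in range(12)]
for y in (2, 3, 7, 8, 9):
    image[y][1:4] = [0, 0, 0]

print(box_bands(image))  # -> [(2, 3), (7, 9)]
```

Each non-blank band could then be cropped out and trimmed on its own to get the individual box.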
WARNING: think about how you want to handle the case of two boxes that are on separate
but consecutive lines. These may be thought of as one box instead of two boxes.
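One way this merging can come about is that the blur closes the narrow white gap between the two boxes. The small Python sketch below shows this on a single column of pixel values running down through both boxes. The function names, the mean filter and the numbers are assumptions chosen for illustration, not anything ImageMagick does internally.

```python
def blur_1d(values, radius):
    """Mean-filter a 1D list of grayscale values, repeating the edge values."""
    n = len(values)
    out = []
    for i in range(n):
        window = [values[min(max(i + d, 0), n - 1)]
                  for d in range(-radius, radius + 1)]
        out.append(sum(window) / len(window))
    return out


def dark_runs(values, threshold=128):
    """Return inclusive (start, end) index ranges where values fall below threshold."""
    runs, start = [], None
    for i, v in enumerate(values):
        if v < threshold and start is None:
            start = i
        elif v >= threshold and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(values) - 1))
    return runs


# Hypothetical column: boxes on rows 4-9 and 12-17 with a two-row white gap.
column = [255] * 4 + [0] * 6 + [255] * 2 + [0] * 6 + [255] * 4

print(dark_runs(column))              # -> [(4, 9), (12, 17)]
print(dark_runs(blur_1d(column, 1)))  # -> [(4, 9), (12, 17)]
print(dark_runs(blur_1d(column, 2)))  # -> [(4, 17)]
```

So the blur has to be strong enough to wash out the text but weak enough to leave the gap between lines, or you need to decide that merged boxes are acceptable for your purpose.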
Anthony Thyssen -- Webmaster for ImageMagick Example Pages
https://imagemagick.org/Usage/