
Find the white blocks between text and pictures

Posted: 2008-09-02T05:23:59-07:00
by rmagick
An RMagick user asks:
I have a number of black-and-white scanned pages. To prepare them for OCR,
I have to split them in columns and rows. Additionally, somewhere in between, there
are pictures, which also need to be separated.

So, in a page that might look like this:

Text1 Text4 Text6

Text2 Pict1 Text7

Text3 Text5 Pict2

I'd like to find the largest blocks of white which separate the texts and pictures, both horizontally
and vertically.
Is there a way to do this with ImageMagick? I can easily convert the commands and options into RMagick code for him.

Re: Find the white blocks between text and pictures

Posted: 2008-09-02T21:49:05-07:00
by anthony
Not directly in IM, but IM can be used to separate the blocks.

I wrote a script as a proposed horizontal and vertical 'block' segmentation algorithm.

See the script divide_vert, which looks for rows of pixels that are all the same color (exactly the same at this time, no fuzz :-( ) and outputs either just the interesting blocks, or those blocks together with the blank 'spacing' images it found between them.
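For readers who want to see the idea without the shell script, here is a minimal Python sketch of the row scan described above. It is not the divide_vert script itself: the names `split_rows` and `split_columns` are hypothetical, and the image is assumed to be a plain list of pixel rows (0 = black, 255 = white) rather than an ImageMagick image. A row counts as blank only when every pixel matches exactly, mirroring the "no fuzz" limitation.

```python
from typing import List, Tuple

Image = List[List[int]]          # rows of pixel values, e.g. 0 = black, 255 = white
Run = Tuple[str, int, int]       # (kind, start, end) with end exclusive


def split_rows(image: Image) -> List[Run]:
    """Group consecutive rows into 'blank' and 'block' runs.

    A row is 'blank' when all of its pixels are exactly the same value
    (no fuzz tolerance); anything else is part of a 'block'.
    """
    runs: List[Run] = []
    for y, row in enumerate(image):
        kind = "blank" if all(p == row[0] for p in row) else "block"
        if runs and runs[-1][0] == kind:
            # Extend the current run by one row.
            runs[-1] = (kind, runs[-1][1], y + 1)
        else:
            runs.append((kind, y, y + 1))
    return runs


def split_columns(image: Image) -> List[Run]:
    """Same scan applied to columns, by transposing the image first."""
    return split_rows([list(col) for col in zip(*image)])


page = [
    [255, 255, 255],
    [255,   0, 255],
    [  0,   0, 255],
    [255, 255, 255],
    [255, 255, 255],
    [255,   0,   0],
]
row_runs = split_rows(page)
col_runs = split_columns(page)
```

The widest 'blank' runs are the candidate separators; applying the row scan and the column scan alternately to each resulting block would give the horizontal and vertical splits asked about in the first post.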

This script was created in response to someone else's problem, and should be expandable to do what you request.

If you can give me a 'shrunk' example image, I may be able to work on it a bit more to make it do what you want, and make a utility that would be much more useful in general.


NOTE: most good OCR software has options to select the areas of the image that contain the text to be converted.

ASIDE: The DjVu image format is designed to do this horizontal / vertical separation of blocks right down to character level so as to find and delete duplicate images (characters) and thus shrink scanned images to the MAX, regardless of the font and style of the book being scanned!
This naturally lends itself perfectly to OCR conversion of the smaller images.

Re: Find the white blocks between text and pictures

Posted: 2008-09-04T11:34:36-07:00
by rmagick
Thanks for the tip, Anthony! I've passed it along to the RMagick user and volunteered to help convert it to Ruby and RMagick. I have to say though, after looking at your "heroic" script, I'm glad I get to use Ruby for my IM-ing :D

Re: Find the white blocks between text and pictures

Posted: 2008-09-04T16:36:32-07:00
by anthony
You are quite welcome. The script was designed as a proof of concept rather than any specifically useful application, which is why it only does vertical segmentation.

Segmentation of an image into separate parts is an area that IM is sorely lacking, even if those parts are already well defined as in black and white scans of documents.

I would like to see vertical and horizontal segmentation, and the black-and-white mask separation techniques (see the segment_image script), programmed into the IM core for speed, as well as the addition of color and texture area division methods.

Please keep us (especially me) informed as to your RMagick progress in this matter.

Re: Find the white blocks between text and pictures

Posted: 2008-09-04T16:54:16-07:00
by fmw42
Has anyone considered building upon the blob counting method of the following post (of which el_supremo has provided some code):

viewtopic.php?f=1&t=10889

Was any of this built into an IM function, or could it be?

Re: Find the white blocks between text and pictures

Posted: 2008-09-04T17:35:10-07:00
by anthony
el_supremo's method was designed only to locate and output the largest blob, and did so with his own flood-fill method.

My script generates a separate image for each and every blob, regardless of color.
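To make "a separate image for each and every blob" concrete, here is a rough Python sketch of one way it could be done. It is an assumption-laden illustration, not the segment_image script or el_supremo's code: the function name `blob_masks` is hypothetical, the image is a plain list of pixel rows, pixels are joined only when they match exactly, and 4-connectivity is assumed.

```python
from collections import deque
from typing import List

Image = List[List[int]]


def blob_masks(image: Image) -> List[Image]:
    """Return one 0/1 mask per connected region of identical pixel values."""
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    masks: List[Image] = []

    for y in range(height):
        for x in range(width):
            if seen[y][x]:
                continue
            color = image[y][x]
            mask = [[0] * width for _ in range(height)]
            queue = deque([(y, x)])
            seen[y][x] = True
            # Breadth-first flood fill over 4-connected neighbours.
            while queue:
                cy, cx = queue.popleft()
                mask[cy][cx] = 1
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx] and image[ny][nx] == color):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            masks.append(mask)
    return masks


scan = [
    [0, 255, 0],
    [0, 255, 0],
    [255, 255, 255],
]
masks = blob_masks(scan)
```

Every pixel ends up in exactly one mask, so the background itself is also returned as a blob; a caller interested only in ink would discard the masks whose source color is white.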

This type of segmentation can also make good use of image 'morphology' operations, as shown by Fred Weinhaus's scripts, to expand and merge 'near' segments. This also has yet to be added into the IM core.
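As a small illustration of how a morphology step can merge 'near' segments, the sketch below performs a plain binary dilation in Python. It is not taken from Fred Weinhaus's scripts; the name `dilate`, the square neighbourhood, and the `radius` parameter are assumptions chosen for brevity.

```python
from typing import List

Mask = List[List[int]]


def dilate(mask: Mask, radius: int = 1) -> Mask:
    """Grow every set pixel into a (2*radius+1) square, clipped at the edges."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if not mask[y][x]:
                continue
            for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                    out[ny][nx] = 1
    return out


near = dilate([[1, 0, 1]])          # segments one pixel apart
far = dilate([[1, 0, 0, 0, 1]])     # segments three pixels apart
```

With radius 1 the two segments in `near` fuse into one, while those in `far` stay separate, so the radius effectively sets how large a gap is still treated as "the same block" (for example, the spacing between letters versus the spacing between columns).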

Finally, when you have a layered sequence of segment masks, you can use them with my -layers composite operator to extract each masked segment from the original unmodified image.
This is something I have not started a section on in IM Examples, and probably should.
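Until such a section exists, the final extraction step can be pictured with the following Python sketch. It only imitates the idea of cutting a segment out of the original with a mask and does not use or reproduce the -layers composite operator; `extract_segment` and its `background` parameter are hypothetical names, and images and masks are plain lists of rows.

```python
from typing import List

Image = List[List[int]]


def extract_segment(image: Image, mask: Image, background: int = 255) -> Image:
    """Crop the original image to the mask's bounding box.

    Pixels inside the box but outside the mask are replaced by `background`.
    """
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not ys:
        return []                      # empty mask, nothing to extract
    xs = [x for x in range(len(mask[0])) if any(row[x] for row in mask)]
    top, bottom, left, right = ys[0], ys[-1], xs[0], xs[-1]
    return [
        [image[y][x] if mask[y][x] else background
         for x in range(left, right + 1)]
        for y in range(top, bottom + 1)
    ]


original = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
segment_mask = [
    [0, 0, 0],
    [0, 1, 1],
    [0, 0, 1],
]
segment = extract_segment(original, segment_mask)
```

Because the pixels come from the untouched original, any thresholding or dilation used to build the masks does not degrade the segment handed to the OCR step.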