
extracting text area from image

Posted: 2007-06-10T12:25:03-07:00
by aciobanu
Hi, all!

I am trying to use ImageMagick to extract strictly
the text area from a photograph of a book page.

If you look at the image attached, I am interested in the
green area and, if possible, the red area.

The problem is that it has to be automated and work
for books of various sizes.

Do you guys have any idea how I could achieve this?

My idea so far is to
apply a really aggressive filter that would transform the
green area into a big uniform blob, so that I can
then extract its coordinates and use those
on the original image.
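
A minimal sketch of the last two steps of that idea (get the blob's coordinates, then crop the original with them). It assumes the "crazy filter" has already produced a binary mask as a list of rows, with 1 inside the blob and 0 elsewhere; the function names blob_bounding_box and crop are made up for illustration and are not ImageMagick features.

```python
def blob_bounding_box(mask):
    """Return (x_min, y_min, x_max, y_max) covering every truthy cell, or None."""
    if not mask or not mask[0]:
        return None
    # Rows and columns that contain at least one blob pixel.
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None  # no blob found
    cols = [x for x in range(len(mask[0])) if any(row[x] for row in mask)]
    return cols[0], rows[0], cols[-1], rows[-1]


def crop(image, box):
    """Cut the inclusive box out of a row-major image (list of rows)."""
    x_min, y_min, x_max, y_max = box
    return [row[x_min:x_max + 1] for row in image[y_min:y_max + 1]]
```

The box is found on the filtered copy and applied to the untouched original, so the filter can be as destructive as needed.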

Image -> http://picasaweb.google.com/capsunel/Im ... 3571163346

Alex

Re: extracting text area from image

Posted: 2007-06-11T18:42:06-07:00
by aciobanu
If anybody is interested, there are some solutions proposed to this problem
on the mailing list. Here is the thread:
http://studio.imagemagick.org/pipermail ... html#19694

Alex

Re: extracting text area from image

Posted: 2007-06-11T19:38:32-07:00
by anthony
The thing is, you know the image (after the -lat) consists of a border area, followed by a white box area, and then an internal area.

One idea is to create a temporary image that you 'blur', then color reduce it
to 2 colors. This will hopefully reduce the image to three boxes that you can study to find the center of the margin of the page. With that you can then crop the image
to the margins, and trim it to just the text itself.

I call this technique a fuzzy or blurred trim...
http://www.imagemagick.org/Usage/crop/#trim_blur

It is a starting point at least.
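
A rough sketch of the blur-then-two-levels idea on a plain grayscale grid (0 = black, 255 = white). A simple box blur and a fixed threshold stand in for ImageMagick's blur and 2-color reduction here; both are assumptions made for illustration, as are the function names and the radius/threshold values.

```python
def box_blur(gray, radius):
    """Average each pixel with its neighbours inside a square window."""
    height, width = len(gray), len(gray[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Clamp the window at the image edges.
            ys = range(max(0, y - radius), min(height, y + radius + 1))
            xs = range(max(0, x - radius), min(width, x + radius + 1))
            window = [gray[j][i] for j in ys for i in xs]
            out[y][x] = sum(window) / len(window)
    return out


def two_level(gray, threshold):
    """Reduce to two levels: 1 where darker than threshold, else 0."""
    return [[1 if value < threshold else 0 for value in row] for row in gray]
```

Blurring first smears scattered text pixels into their neighbours, so the thresholded result tends to be a solid region rather than individual letters.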

Re: extracting text area from image

Posted: 2007-06-11T19:42:20-07:00
by anthony
Another alternative is to assume the outside areas are dark. So again, in a temporary image, run it through a -median 10x10 filter to remove the text, leaving just a 'blank page'.
That page should be a lot easier to determine the bounds of than a page filled with text!

A smaller -median filter will also help remove any 'junk' pixels left by the -lat operator.
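
To illustrate why a median filter removes the text but keeps the page edge, here is a small pure-Python sketch on a grayscale grid (0 = dark, 255 = white). The function name and the radius parameter are assumptions for this example; the radius is not meant to correspond to the argument of -median.

```python
def median_filter(gray, radius):
    """Replace each pixel with the median of its square neighbourhood."""
    height, width = len(gray), len(gray[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            # Clamp the window at the image edges.
            ys = range(max(0, y - radius), min(height, y + radius + 1))
            xs = range(max(0, x - radius), min(width, x + radius + 1))
            window = sorted(gray[j][i] for j in ys for i in xs)
            # Upper middle element; keeps values as plain ints.
            row.append(window[len(window) // 2])
        out.append(row)
    return out
```

Thin strokes never make up the majority of a window, so they vanish, while a large dark border still dominates the windows near it and survives.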

Re: extracting text area from image

Posted: 2007-06-12T10:48:02-07:00
by aciobanu
I've tried blurring the image (after -lat) and it gives pretty
good results. I get a cloudy uniform blob where the text is.

Still, I should not expect to solve the whole thing from
the command line.

One strategy I am considering is the following:
1. place yourself in the center of the image
2. start "moving" in 4 directions simultaneously (North, South, East, West)
3. count how many times you find black pixels (i.e. text, images, etc.)
4. when you start getting only white pixels, you stop

The places where you have stopped at North, South, East and West
will give you the Ymin, Ymax, Xmax, Xmin of the interesting area.
(0,0 coord is the NW corner)
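
A minimal sketch of that walk, under a few assumptions that are not in the description above: the input is a binary mask (1 = black, 0 = white) as a list of rows, each step checks a whole row or column rather than a single pixel, and "only white pixels" is taken to mean `gap` blank rows or columns in a row, so that ordinary line spacing does not stop the walk early. The names scan and text_bounds are invented for this example.

```python
def scan(has_ink, start, step, limit, gap):
    """Walk from start by step; return the last inked index before `gap` blanks."""
    last_ink = start
    blank = 0
    i = start
    while 0 <= i < limit:
        if has_ink(i):
            last_ink = i
            blank = 0
        else:
            blank += 1
            if blank >= gap:
                break  # only white from here on: stop
        i += step
    return last_ink


def text_bounds(mask, gap=2):
    """Return (x_min, y_min, x_max, y_max), walking outward from the center."""
    height, width = len(mask), len(mask[0])
    center_y, center_x = height // 2, width // 2

    def row_has_ink(y):
        return any(mask[y])

    def col_has_ink(x):
        return any(row[x] for row in mask)

    y_min = scan(row_has_ink, center_y, -1, height, gap)  # North
    y_max = scan(row_has_ink, center_y, +1, height, gap)  # South
    x_min = scan(col_has_ink, center_x, -1, width, gap)   # West
    x_max = scan(col_has_ink, center_x, +1, width, gap)   # East
    return x_min, y_min, x_max, y_max
```

Because the walk stops at the first wide white band, stray specks beyond the margins are never reached. If the center itself is blank, the sketch simply reports the center as the bound.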

I've seen something similar done in the unpaper tool. It should be
easy to do with IM.

Alex

Re: extracting text area from image

Posted: 2007-06-12T19:02:01-07:00
by anthony
Use -median to remove the text, then look for where the paper ends ;)

Re: extracting text area from image

Posted: 2011-01-02T09:13:15-07:00
by jumpjack
anthony wrote: Use -median to remove the text, then look for where the paper ends ;)
I need to do the same thing as the original poster, but I can't understand this reply.
Any help?

Re: extracting text area from image

Posted: 2011-01-03T18:21:01-07:00
by anthony
Please start a new thread with an example of YOUR image problem.

Re: extracting text area from image

Posted: 2011-01-03T18:41:10-07:00
by fmw42
I don't know if this will work on your image, but you can try my script, textcleaner, at the link below.

Re: extracting text area from image

Posted: 2011-01-04T02:17:25-07:00
by jumpjack
fmw42 wrote: I don't know if this will work on your image, but you can try my script, textcleaner, at the link below.
Thanks, but I do not need to clean the background. I want to extract text areas from a scanned page, and I need the script to find the areas by itself.

Re: extracting text area from image

Posted: 2011-01-04T02:28:31-07:00
by jumpjack
anthony wrote: Please start a new thread with an example of YOUR image problem.
Already opened:
viewtopic.php?f=1&t=10377

I was looking for some hints here too.

Re: extracting text area from image

Posted: 2011-01-14T06:47:04-07:00
by jumpjack
So?
No clues about how to determine areas containing words?