crop columns out of dictionary page

Post by **johnbent** » 2014-12-26T15:45:52-07:00

I'm a hobbyist trying to preserve an endangered Pacific Island language by creating an online dictionary. I have a copy of the 1990 print dictionary and am trying to use tesseract to extract the text. I have scanned pages like the one below and am looking for help with command-line arguments that create two images for each page, with one neatly cropped column in each. The commands must find the whitespace separating the columns themselves and cannot use a fixed-width parameter, since each page may have slightly different dimensions. Note that I also want to remove the page word markers. It might be useful to deskew the image as well, because that might help tesseract. Thanks for any help!

Post by **fmw42** » 2014-12-26T17:00:16-07:00

You should always provide your version of IM and your platform, as the syntax differs between them.

This is unix code that seems to work on your one image. (I do not know Windows Bat files, so a Windows expert would have to convert this script into an appropriate equivalent for Windows, if that is your platform)

I use some morphology to remove any small specks of black from the white background (and to fill in white specks in the black letters). Then I deskew the image and trim it to remove the white borders as well as the deskew allows. The trimmed result is saved as tmp.png for the remaining steps. Next I scale the image down to 1 row and convert it to txt format, then filter that output to keep only the pure white pixels. I take the center white pixel and use its x coordinate to crop the image into two halves. At the end I delete the temporary tmp.png file. There are some echo statements that you can remove; I have left them in so you can see the results of some of the steps.

Code:

infile="dict-314.png"
inname=`convert -ping "$infile" -format "%t" info:`
suffix=`convert -ping "$infile" -format "%e" info:`
convert "$infile" -morphology smooth diamond:1 \
-background white -deskew 40% +repage \
-fuzz 10% -trim +repage tmp.png
OIFS=$IFS
IFS=$'\n'
white_arr=(`convert tmp.png -scale x1! txt: |\
tail -n +2 | tr -cs "0-9\n" " " | grep -E '.* .* 255'`)
echo "${white_arr[*]}"
num=${#white_arr[*]}
IFS=$OIFS
middle=`convert xc: -format "%[fx:round($num/2)]" info:`
echo "middle=$middle"
xcrop=`echo "${white_arr[$middle]}" | cut -d\  -f1`
echo "xcrop=$xcrop"
ww=`convert -ping tmp.png -format "%w" info:`
hh=`convert -ping tmp.png -format "%h" info:`
ww1=$((xcrop+1))
dim1="${ww1}x${hh}+0+0"
ww2=`convert xc: -format "%[fx:$ww-$xcrop-1]" info:`
xoff2=$ww1
dim2="${ww2}x${hh}+${xoff2}+0"
echo "dim1=$dim1; dim2=$dim2;"
convert tmp.png \
\( -clone 0 -crop $dim1 +repage -write ${inname}_left.$suffix \) \
\( -clone 0 -crop $dim2 +repage -write ${inname}_right.$suffix \) \
null:
rm -f tmp.png

A unix bash shell script could be made from this if that is what you need. The input image is specified at the top of the code in the very first line.
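
The two pure-arithmetic steps of the script (picking the center white column and deriving the two crop geometries) can be sketched without calling `convert xc:` for the math. This is only an illustration under stated assumptions: the helper names `middle_white_x` and `split_geometries` are hypothetical, and the parser assumes 8-bit `txt:` output with lines of the form `x,y: (r,g,b)  #RRGGBB  name` after a single header line.

```shell
# Print the x coordinate of the middle pure-white pixel found in
# ImageMagick "txt:" output read from stdin.
# Assumption: 8-bit, three-channel lines such as
#   "12,0: (255,255,255)  #FFFFFF  white"
middle_white_x() {
  awk -F'[,:() ]+' '
    NR > 1 && $3 == 255 && $4 == 255 && $5 == 255 { xs[n++] = $1 }
    END { if (n) print xs[int(n / 2)] }
  '
}

# Print the left and right crop geometries for a split after column xcrop.
# Arguments: xcrop, image width, image height.
split_geometries() (
  xcrop=$1 ww=$2 hh=$3
  ww1=$(( xcrop + 1 ))        # left half includes column xcrop
  ww2=$(( ww - xcrop - 1 ))   # right half is whatever remains
  echo "${ww1}x${hh}+0+0 ${ww2}x${hh}+${ww1}+0"
)
```

In the script above, the first helper would read the output of `convert tmp.png -scale x1! txt:`, and the two geometries it leads to correspond to `dim1` and `dim2`. The middle index is taken with integer division, so it can differ by one position from the `round($num/2)` used in the original.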

Post by **johnbent** » 2014-12-26T17:21:51-07:00

That is so awesome and I'm so very appreciative. It works awesome and straight out of the box on my mac using imagemagick (imagemagick-6.8.9-8.mavericks.bottle.tar.gz). If you would be so very kind, would it be possible to just modify it a tiny bit to remove the page number from the bottom as well as the guide word on the top? For example, dict-314_right.png has a 7 at the bottom and ometech at the top.

Post by **fmw42** » 2014-12-26T18:54:50-07:00

That part is much harder, as the code has no way to know which words you want and which you do not. You could take the resulting two parts, scale each of them down to one column, and look at the intensity near the top and bottom for the first gaps past some threshold of gray, then remove anything above or below those gaps. But I do not know how well that would work.

Another approach might be to scale to one column again as above and then just trim with a larger fuzz value to try to remove just the top and bottom at some very light gray. Then compute how much was trimmed and use that to crop the two images. You actually do not have to trim and can get the coordinates from the string format "%@". http://www.imagemagick.org/script/escape.php

Both approaches depend upon the top and bottom being very bright when scaled down, since one word or character would scale down to be brighter than the long rows of text.

Post by **fmw42** » 2014-12-26T19:34:56-07:00

Try this, though I am not sure how well it will work. It depends upon there being text at the top and bottom that you need to remove, and upon there being at least a 60 pixel vertical gap between that text and the text you want to keep.

Code:

infile="dict-314.png"
inname=`convert -ping "$infile" -format "%t" info:`
suffix=`convert -ping "$infile" -format "%e" info:`
convert "$infile" -auto-level -morphology smooth diamond:1 \
-background white -deskew 40% +repage \
-fuzz 10% -trim +repage tmp.png
OIFS=$IFS
IFS=$'\n'
white_arr=(`convert tmp.png -auto-level -scale x1! txt: |\
tail -n +2 | tr -cs "0-9\n" " " | grep -e '.* .* 255'`)
echo "${white_arr[*]}"
num=${#white_arr[*]}
IFS=$OIFS
middle=`convert xc: -format "%[fx:round($num/2)]" info:`
echo "middle=$middle"
xcrop=`echo "${white_arr[$middle]}" | cut -d\  -f1`
echo "xcrop=$xcrop"
ww=`convert -ping tmp.png -format "%w" info:`
hh=`convert -ping tmp.png -format "%h" info:`
ww1=$((xcrop+1))
dim1="${ww1}x${hh}+0+0"
ww2=`convert xc: -format "%[fx:$ww-$xcrop-1]" info:`
xoff2=$ww1
dim2="${ww2}x${hh}+${xoff2}+0"
echo "dim1=$dim1; dim2=$dim2;"
convert tmp.png \
\( -clone 0 -crop $dim1 +repage -write ${inname}_left.$suffix \) \
\( -clone 0 -crop $dim2 +repage -write ${inname}_right.$suffix \) \
null:
rm -f tmp.png


dim1=`convert ${inname}_left.$suffix -scale 1x! -scale 2x! -negate -fuzz 18% -format "%@" info:`
echo $dim1
dim1=`echo "$dim1" | sed -n 's/^2x\(.*\)$/\1/p'`
echo $dim1
ht=`echo "$dim1" | cut -d+ -f1`
yoff=`echo "$dim1" | cut -d+ -f3`
echo "ht=$ht; yoff=$yoff;"
y1=`convert xc: -format "%[fx:($yoff-50)<0?0:($yoff-50)]" info:`
y2=`convert xc: -format "%[fx:($ht+$yoff+50)>$hh?$hh:($ht+$yoff+50)]" info:`
echo "y2=$y2; y1=$y1;"
ht=$((y2-y1))
dim1="${ww1}x${ht}+0+$y1"
echo $dim1
convert ${inname}_left.$suffix -crop $dim1 +repage -trim +repage -bordercolor white -border 20 ${inname}_left.$suffix

dim2=`convert ${inname}_right.$suffix -scale 1x! -scale 2x! -negate -fuzz 18% -format "%@" info:`
echo $dim2
dim2=`echo "$dim2" | sed -n 's/^2x\(.*\)$/\1/p'`
echo $dim2
ht=`echo "$dim2" | cut -d+ -f1`
yoff=`echo "$dim2" | cut -d+ -f3`
echo "ht=$ht; yoff=$yoff;"
y1=`convert xc: -format "%[fx:($yoff-50)<0?0:($yoff-50)]" info:`
y2=`convert xc: -format "%[fx:($ht+$yoff+50)>$hh?$hh:($ht+$yoff+50)]" info:`
echo "y2=$y2; y1=$y1;"
ht=$((y2-y1))
dim2="${ww2}x${ht}+0+$y1"
echo $dim2
convert ${inname}_right.$suffix -crop $dim2 +repage -trim +repage -bordercolor white -border 20 ${inname}_right.$suffix
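
The left and right blocks above repeat the same bounding-box arithmetic, so it could be factored into one helper. The following is a minimal sketch rather than part of the original script: `padded_crop` is a hypothetical name, and it assumes the `%@` string has the form `WxH+X+Y` with non-negative offsets. It pads the detected box by a margin (50 by default, as in the script) and clamps the result to the image height.

```shell
# Turn a "%@" bounding box (WxH+X+Y) into a full-width crop geometry
# padded vertically by `pad` pixels and clamped to the image height.
# Arguments: bbox, column width, column height, optional pad (default 50).
padded_crop() (
  bbox=$1 width=$2 height=$3 pad=${4:-50}
  rest=${bbox#*x}        # drop "W" and "x", leaving "H+X+Y"
  h=${rest%%+*}          # H: height of the detected box
  y=${bbox##*+}          # Y: vertical offset of the detected box
  y1=$(( y - pad ))
  y2=$(( y + h + pad ))
  [ "$y1" -lt 0 ] && y1=0
  [ "$y2" -gt "$height" ] && y2=$height
  echo "${width}x$(( y2 - y1 ))+0+${y1}"
)
```

For example, `padded_crop "2x1800+0+120" 900 2000` prints `900x1900+0+70`, which has the same shape as the `dim1` and `dim2` strings passed to `-crop` above.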

Post by **johnbent** » 2015-01-04T20:53:25-07:00

Wow. You are really really good. I'm very grateful. I've made you a contributor to the project:

http://tekinged.com/about.php (Your name is in the box on the right)

Please let me know if you would prefer I not list you in the contributors.

I'm going to now start seeing if I can train tesseract for this language. If that fails, at least I'll have really nice columns to send to mechanical turk or to do myself with the other volunteers.

If I do need to revert to mechanical turk, it would be potentially really useful to further split the columns into individual word entries. Would that be possible? For example:

By the way, I tried to do the boxes manually and did a poor job. Hopefully a script could do better. I messed up the final two entries I'm noticing now but you probably get the idea. Thanks again very very much!

Post by **fmw42** » 2015-01-04T22:59:31-07:00

Splitting into rows for each dictionary word would be hard. The only reason my method worked was because the columns had good space between them. The rows for each word look the same as the rows of text below them, so I do not see how one would easily find where each word begins. If there is enough space between each row of text and the rows are equally spaced, then it might be possible to separate each row of text by looking for the spaces after averaging down to one column. Then you would have to average each separated row of text to one row of pixels and see where the dictionary word starts compared to the other, indented rows. The issue is whether there is enough space between each row of text that it would be perfectly white when averaged to one column. That also depends upon how well the deskew is done. If there is any rotation left and the space is small, then there will be no visible white spaces when averaged to one column.
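
The row-finding idea described here can be sketched as follows. It is an untested-on-real-pages illustration, not a working splitter: `text_bands` is a hypothetical helper, and it assumes the column profile has already been reduced to one gray value (0 to 255) per line, one line per pixel row, for instance by post-processing the `txt:` output of `-scale 1x!`. It prints the first and last row index of each run of non-white rows.

```shell
# Read one gray value per line (row 0 first) and print "start end"
# for every run of rows darker than the white level.
# Argument: optional white level (default 255, i.e. perfectly white gaps).
text_bands() {
  awk -v white="${1:-255}" '
    $1 < white {
      if (!in_band) { start = NR - 1; in_band = 1 }
      next
    }
    in_band { print start, NR - 2; in_band = 0 }   # a white row closes the band
    END     { if (in_band) print start, NR - 1 }   # band running to the last row
  '
}
```

Passing a lower white level (say 250) would tolerate gaps that are not perfectly white, which is the situation described above when some skew remains. Whether the resulting bands line up with dictionary entries still depends on comparing the indentation of each band.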

Post by **johnbent** » 2015-01-21T09:25:27-07:00

Fred, thank you so very much for all your help. I've now found another similar dictionary on which the OCR works much better even though the text is entirely Palauan without any English! Unfortunately however your awesome columnize script doesn't work for these images and I can't figure out why. If you could see what is wrong, that would be super awesome of you. Here's the image and again I'm hoping to be able to deskew and columnize. Cropping out the top and bottom to get rid of the weird corner graphics, the page number, and the marker words would be great too. However, I think they are always a fixed pixel distance so that is one easy thing I'm able to do myself.

Post by **fmw42** » 2015-01-21T10:49:08-07:00

You have removed the image, so I cannot process it. But when I saw it a few minutes ago, I noticed that there are some symbol graphics at the top right and bottom right corners. My last script relied upon a 60 pixel gap to remove stuff at the top and bottom (note the arguments of 50 in the script). This was customized for that one book. You probably need to deskew as before and then manually crop the stuff at the top and bottom, since there is not much room for error in automatically cropping. Then run the other part of the script, which finds the center column and splits the page into two parts.

Post by **johnbent** » 2015-01-21T10:58:43-07:00

Shoot! Sorry; I'm an idiot! Here's the image:

Post by **fmw42** » 2015-01-21T16:14:10-07:00

The code I provided before was customized to that one style of page/book. This page/book has different headers and footers. In fact, there is no titled header as in the previous book, only a graphic on the top left and bottom left that unfortunately has no space between it and the start of the text below it. It would be hard, if not impossible, to remove this automatically without the chance of cutting off part of the first row of text and the bottom row of text. I could, though, provide code that does it with an argument you set for how much to cut off. Is the graphic only on the left side of every page, or does it switch between the left and right sides for even and odd pages?

This code, which simply deskews the image and cuts it into two halves, is pretty generic. It works fine on this page.

Code:

infile="page-034.png"
inname=`convert -ping "$infile" -format "%t" info:`
suffix=`convert -ping "$infile" -format "%e" info:`
convert "$infile" -auto-level -morphology smooth diamond:1 \
-background white -deskew 40% +repage \
-fuzz 10% -trim +repage tmp.png
OIFS=$IFS
IFS=$'\n'
white_arr=(`convert tmp.png -auto-level -scale x1! txt: |\
tail -n +2 | tr -cs "0-9\n" " " | grep -e '.* .* 255'`)
echo "${white_arr[*]}"
num=${#white_arr[*]}
IFS=$OIFS
middle=`convert xc: -format "%[fx:round($num/2)]" info:`
echo "middle=$middle"
xcrop=`echo "${white_arr[$middle]}" | cut -d\  -f1`
echo "xcrop=$xcrop"
ww=`convert -ping tmp.png -format "%w" info:`
hh=`convert -ping tmp.png -format "%h" info:`
ww1=$((xcrop+1))
dim1="${ww1}x${hh}+0+0"
ww2=`convert xc: -format "%[fx:$ww-$xcrop-1]" info:`
xoff2=$ww1
dim2="${ww2}x${hh}+${xoff2}+0"
echo "dim1=$dim1; dim2=$dim2;"
convert tmp.png -write show: \
\( -clone 0 -crop $dim1 +repage -fuzz 10% -trim +repage -write ${inname}_left.$suffix \) \
\( -clone 0 -crop $dim2 +repage -fuzz 10% -trim +repage -write ${inname}_right.$suffix \) \
null:
rm -f tmp.png

Post by **johnbent** » 2015-01-22T12:21:28-07:00

Thank you; you're amazing and I'm super appreciative! It's awesome and it works!

But only about a third of the time.

http://tekinged.com/books/kerresel/images/columns/

By looking at the file sizes you can see for which pages it works and for which it doesn't.

For example, it works great on page-005 and page-006. But it doesn't work for page-004 or page-008.

The first page for each letter is weird and I wouldn't expect it to work on those. For example, page-001 and page-003. I am planning to just do them manually since there are only 18 of them (no letter f, g, j, p, q, x, y, z in this language).

Post by **fmw42** » 2015-01-22T12:49:55-07:00

The text did not have enough contrast to get a good separation. I modified the code, and page 4 works now. Try the following:

Code:

infile="page-004.png"
inname=`convert -ping "$infile" -format "%t" info:`
suffix=`convert -ping "$infile" -format "%e" info:`
convert "$infile" -auto-level -morphology smooth diamond:1 \
-background white -deskew 40% +repage \
-fuzz 10% -trim +repage tmp.png
OIFS=$IFS
IFS=$'\n'
white_arr=(`convert tmp.png -auto-level -threshold 50% -scale x1! txt: |\
tail -n +2 | tr -cs "0-9\n" " " | grep -e '.* .* 255'`)
echo "${white_arr[*]}"
num=${#white_arr[*]}
IFS=$OIFS
middle=`convert xc: -format "%[fx:round($num/2)]" info:`
echo "middle=$middle"
xcrop=`echo "${white_arr[$middle]}" | cut -d\  -f1`
echo "xcrop=$xcrop"
ww=`convert -ping tmp.png -format "%w" info:`
hh=`convert -ping tmp.png -format "%h" info:`
ww1=$((xcrop+1))
dim1="${ww1}x${hh}+0+0"
ww2=`convert xc: -format "%[fx:$ww-$xcrop-1]" info:`
xoff2=$ww1
dim2="${ww2}x${hh}+${xoff2}+0"
echo "dim1=$dim1; dim2=$dim2;"
convert tmp.png -write show: \
\( -clone 0 -crop $dim1 +repage -fuzz 10% -trim +repage -write ${inname}_left.$suffix \) \
\( -clone 0 -crop $dim2 +repage -fuzz 10% -trim +repage -write ${inname}_right.$suffix \) \
null:
rm -f tmp.png

Post by **fmw42** » 2015-01-22T13:02:48-07:00

It might work better with the -threshold at 75% rather than 50%, so that the text is not thinned too much by the threshold.

Try:

infile="page-004.png"
inname=`convert -ping "$infile" -format "%t" info:`
suffix=`convert -ping "$infile" -format "%e" info:`
convert "$infile" -auto-level -morphology smooth diamond:1 \
-background white -deskew 40% +repage \
-fuzz 10% -trim +repage tmp.png
OIFS=$IFS
IFS=$'\n'
white_arr=(`convert tmp.png -auto-level -threshold 75% -write show: -scale x1! txt: |\
tail -n +2 | tr -cs "0-9\n" " " | grep -e '.* .* 255'`)
echo "${white_arr[*]}"
num=${#white_arr[*]}
IFS=$OIFS
middle=`convert xc: -format "%[fx:round($num/2)]" info:`
echo "middle=$middle"
xcrop=`echo "${white_arr[$middle]}" | cut -d' ' -f1`
echo "xcrop=$xcrop"
ww=`convert -ping tmp.png -format "%w" info:`
hh=`convert -ping tmp.png -format "%h" info:`
ww1=$((xcrop+1))
dim1="${ww1}x${hh}+0+0"
ww2=`convert xc: -format "%[fx:$ww-$xcrop-1]" info:`
xoff2=$ww1
dim2="${ww2}x${hh}+${xoff2}+0"
echo "dim1=$dim1; dim2=$dim2;"
convert tmp.png -write show: \
\( -clone 0 -crop $dim1 +repage -fuzz 10% -trim +repage -write ${inname}_left.$suffix \) \
\( -clone 0 -crop $dim2 +repage -fuzz 10% -trim +repage -write ${inname}_right.$suffix \) \
null:
rm -f tmp.png
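
One variation that may be worth trying on pages where the split still lands in the wrong place (it is not something proposed in this thread) is to take the center of the longest consecutive run of white columns instead of the middle entry of the whole white list. The sketch below assumes the white x coordinates are available one per line in ascending order, and that the gutter is the widest white run on the page; `widest_gap_center` is a hypothetical name.

```shell
# Read ascending white column indices (one per line) and print the
# center of the longest run of consecutive indices.
widest_gap_center() {
  awk '
    NR == 1 || $1 != prev + 1 { start = $1 }        # a new run begins here
    { prev = $1; len = $1 - start + 1 }
    len > best { best = len; center = start + int((len - 1) / 2) }
    END { if (best) print center }
  '
}
```

With the script above, the input could come from the first field of each `white_arr` entry, and the printed value would replace `xcrop`. If a page has a white run wider than the gutter (a large blank area inside a column, for example), this assumption fails and the original middle-of-list choice may do better.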

Post by **johnbent** » 2015-01-22T13:57:25-07:00

Thanks again Fred! The second one works much better! It works about 75% of the time.

http://tekinged.com/books/kerresel/imag ... ge-010.png is one that doesn't work for example.

The third script doesn't work at all . . . I think maybe my imagemagick installation is missing something since I see this in the output:

display: delegate library support not built-in `' (X11) @ error/display.c/DisplayImageCommand/1894.

I'm also seeing Usage output for both which suggests to me that maybe some command line arguments aren't being recognized and maybe some parts of the script aren't running?

http://tekinged.com/books/kerresel/imag ... e2_out.txt
http://tekinged.com/books/kerresel/imag ... e3_out.txt

Legacy ImageMagick Discussions Archive
