Prepare for OCR, find words and rotate.

eleison · Post by **eleison** » 2014-09-26T01:59:41-07:00

Hi everyone!

I have many JPGs with randomly placed words/sentences; the background is white and the text is black. I want to use ImageMagick to find all lines of words, isolate them, and rotate them to a horizontal axis. There is plenty of white space between each item (line). Sometimes there can be an image/graphic among the words too; it would be best if ImageMagick ignored those, but it wouldn't matter much if they got isolated too.

This is what it could look like:

Method 1
After I isolate the words, I was thinking I could rotate them using this model:
1. Somehow find the corners of the text.
2. Make a rectangle from the corner points. All lines are rectangular in shape, and they won't differ in length that much.
3. Make a smaller horizontal rectangle.
4. Rotate the first rectangle until it completely fills the small rectangle.

Update:
Method 2
Maybe it's easier to rotate until ImageMagick finds the minimum possible height, or until the height drops below a set number of pixels.
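A minimal sketch of Method 2, assuming ImageMagick's `convert` is on the PATH and that a brute-force sweep over candidate angles is acceptable. The function names `trimmed_height` and `best_angle`, the 10% fuzz, the angle range, and the file name `word.png` are illustrative assumptions, not something taken from this thread.

```shell
# Print the height in pixels of image $1 after rotating it by $2 degrees
# and trimming the white border.
trimmed_height() {
  convert "$1" -background white -rotate "$2" -fuzz 10% -trim -format '%h' info:
}

# Read "angle height" pairs on stdin; print the angle with the smallest height.
best_angle() {
  awk 'NR == 1 || $2 + 0 < min { min = $2 + 0; best = $1 } END { print best }'
}

# Usage (hypothetical file name word.png):
#   for a in $(seq -45 1 45); do
#     echo "$a $(trimmed_height word.png "$a")"
#   done | best_angle
```

Minimum height cannot tell upright text from upside-down text, so a 180 degree check may still be needed before OCR.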

This was just an idea that might work. I am very open to ANY suggestions on how to accomplish this. If this can be done with only the command line, great; otherwise I'm most comfortable with PHP.

When this process is done I'm going to run OCR with Tesseract on the text.

Thank you all, and thank you for this fantastic forum

eleison · Post by **eleison** » 2014-09-29T00:55:09-07:00

I think I have a solution for making the words horizontal, but I can't really find a good way to segment/separate the words/lines. One solution I've been thinking about is to look for black pixels: when I find one, I roughly mask the area around it (since I have plenty of white space between every object), cut the masked area out, and repeat until there are no black pixels left. This should work because the lines/words have a maximum width.

How do I look for a black pixel, mask it, and then cut it out to a new image?

I am of course open to any advice that could lead me in the right direction!!
The example image is almost identical to how my images will look: the words/lines will always have plenty of white space in between. The placement of the words/lines/objects will vary somewhat from image to image, but there is still lots of white space in between.

Post by **fmw42** » 2014-09-29T09:05:59-07:00

What version of IM and platform? If on Linux/MacOSX or Windows with Cygwin, you could try my script multicrop and unrotate after blurring your image.

eleison · Post by **eleison** » 2014-09-29T09:31:41-07:00

Ubuntu and ImageMagick 6.5.4-7! What would be a good blurring filter? Does multicrop return crop coordinates, so I can crop the original, non-blurred image?

Thanks!

Post by **fmw42** » 2014-09-29T10:06:20-07:00

IM 6.5.4.7 is very, very old. But you can try anyway. I think it should work. Try one of my examples first to test it. Blur your image enough so that there is no white space between and within letters. Use -blur. If you use option -u 2, it will unrotate. It does give the crop coordinates for each area it crops. If the unrotate gives a vertically oriented result, use convert ... -rotate 90 or -90 ... to make it horizontal. Note that your background needs to be nearly constant. Use the -f argument for -fuzz to allow for some degree of variation in the background.

Example from my script page

Code: Select all

multicrop -f 10 -u 2 3ladies1.jpg tmp.jpg

Processing Image 0
Size: 267x268
Page Geometry: 660x588+18+14


Image Is Being Rotated 2.89 degrees

Processing Image 1
Size: 329x416
Page Geometry: 660x588+317+82


Image Is Being Rotated 2.93 degrees

Processing Image 2
Size: 269x269
Page Geometry: 660x588+17+303


Image Is Being Rotated -2.89 degrees

eleison · Post by **eleison** » 2014-09-29T12:03:56-07:00

Thank you! I will try this and see how it works!

I've read many different subjects about the multicrop script, someone said it is slow and makes huge temp files? Maybe it was an old post I read...

Post by **fmw42** » 2014-09-29T12:18:06-07:00

eleison wrote:Thank you! I will try this and see how it works! I've read many different subjects about the multicrop script, someone said it is slow and makes huge temp files? Maybe it was an old post I read...

It may not be very fast and may create large temp files if your input file is large. But if it gets the job done, then that is what counts. If you need it faster, then someone would have to code it formally into proper C code. That is not my strength.

eleison · Post by **eleison** » 2014-09-29T12:51:41-07:00

OK, I'm very thankful for all your hard work on these scripts! I will try multicrop and see if it works for me. The method I came up with in earlier posts may be worth a try though. I might try both and compare...

My steps would be:
1. Take the coordinates of the first black pixel I find.
2. Put a "big enough" crop rectangle around text.
3. Crop out text.
4. Trim crop.
5. Loop until no more black pixels left.
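Step 2 can be sketched as plain arithmetic: given the pixel coordinates, build a crop geometry of a chosen size around it and clamp it to the image bounds. The function name `box_around` and the choice to center the box on the pixel are assumptions for illustration; the box size has to be tuned to the maximum text width.

```shell
# Print a WxH+X+Y crop geometry of size $3 x $4 centered on pixel ($1,$2),
# clamped so it stays inside an image of size $5 x $6.
box_around() (
  x=$1; y=$2; bw=$3; bh=$4; iw=$5; ih=$6
  left=$(( x - bw / 2 ))
  top=$(( y - bh / 2 ))
  # Clamp the top-left corner to the image.
  if [ "$left" -lt 0 ]; then left=0; fi
  if [ "$top" -lt 0 ]; then top=0; fi
  # Shrink the box if it would run past the right or bottom edge.
  if [ $(( left + bw )) -gt "$iw" ]; then bw=$(( iw - left )); fi
  if [ $(( top + bh )) -gt "$ih" ]; then bh=$(( ih - top )); fi
  echo "${bw}x${bh}+${left}+${top}"
)
```

The result can be passed to `convert ... -crop`, followed by `-trim +repage` for step 4.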

If I'm going to try this method, what's the best way of getting the first black pixel?

Thanks again for your help!

Post by **fmw42** » 2014-09-29T13:18:18-07:00

Is it truly black and not just a dark gray?

Code: Select all

data=`compare -metric pae -subimage-search -dissimilarity-threshold 1 -similarity-threshold 0 \
yourimage \( -size 1x1 xc:black \) null: 2>&1`
echo "$data"

see
http://www.imagemagick.org/Usage/compare/
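With `-subimage-search`, `compare` normally reports the best match in a form like `0 (0) @ 12,34`, where the numbers after `@` are the x,y offset. Assuming that output format, a small helper (the name `parse_match_xy` is hypothetical) can pull the coordinates out of `$data`:

```shell
# Extract "x y" from compare output such as "0 (0) @ 12,34".
# Prints nothing if no "@ x,y" part is found.
parse_match_xy() {
  printf '%s\n' "$1" | sed -n 's/.*@ *\([0-9][0-9]*\),\([0-9][0-9]*\).*/\1 \2/p'
}

# Usage, with $data from the compare command above:
#   set -- $(parse_match_xy "$data")
#   echo "first black pixel at x=$1 y=$2"
```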

eleison · Post by **eleison** » 2014-09-29T13:38:10-07:00

fmw42 wrote:Is it truly black and not just a dark gray?

I will apply filters to the image to make it pure black and white and as clear as I can. Maybe -normalize and -level would be enough? Would some grey pixels cause trouble?

I will try your solution! Thanks! I remember reading something about dumping image pixels to a .txt file. Maybe an even better way of doing this is to blur the text, write the pixel data to a .txt file, and use PHP to look for the pixels with the most black around them; that might give the center of each "text object"... Hmm. I'm not able to try all these things right now. When I get time I will try and see what works best!

Thanks again for all the help!

eleison · Post by **eleison** » 2014-10-11T04:47:54-07:00

OK, now I've had time to do some testing! I tried "multicrop", but it seems I'm doing something wrong... See:

[root@localhost ~]# bash multicrop -f 10 multicrop_test.jpg multi_test.jpg
multicrop: line 142: type: multicrop: not found
dirname: missing operand
Try `dirname --help' for more information.
basename: missing operand
Try `basename --help' for more information.

Processing Image 0
Size: 414x291
Page Geometry: 2048x1151+155+122

convert: geometry does not contain image `./multicrop_1_27048.mpc' @ warning/attribute.c/GetImageBoundingBox/247.
Processing Image 1
Size: 308x276
Page Geometry: 2048x1151+772+146

convert: geometry does not contain image `./multicrop_1_27048.mpc' @ warning/attribute.c/GetImageBoundingBox/247.
Processing Image 2
Size: 490x322
Page Geometry: 2048x1151+524+615

This is an image with random black boxes on white background.

Post by **snibgo** » 2014-10-11T04:57:40-07:00

eleison wrote:[root@localhost ~]# bash multicrop -f 10 multicrop_test.jpg multi_test.jpg
multicrop: line 142: type: multicrop: not found

I'm not a Unix expert, but I think the problem is that "type" is a command built-in to the bash shell, but you are running a different shell and calling bash to run the script. If, instead, you start bash (probably by typing "bash") then run the script, you probably won't get the error with "type".

Post by **fmw42** » 2014-10-11T09:55:11-07:00

See my Pointers on my home page. You may need to set the dir to "/tmp" and modify your PATH as explained. type is trying to find the name of the script from bash. If your PATH does not include the location where you run your scripts, then you will get these messages, but usually it does not prevent the script from running.

eleison · Post by **eleison** » 2014-10-12T04:50:02-07:00

Thanks for the quick reply! Fred, I tried to follow your instructions, but I'm not that good at Unix (still learning) and had a hard time understanding every step.

Alternately, edit the script somewhere between the comments and the first use of any IM command, such as just below the defaults section to add the following two lines:
imdir="path2" #(such as imdir="/usr/local/bin" or imdir="/usr/bin")
PATH="${imdir}:${PATH}"

I added this to the script file:

# CAVEAT: No guarantee that this script will work on all platforms,
# nor that trapping of inconsistent parameters is complete and
# foolproof. Use At Your Own Risk.
#
######
#
imdir="/usr/local/bin" #(such as imdir="/usr/local/bin" or imdir="/usr/bin")
PATH="${imdir}:${PATH}"

fmw42 wrote:See my Pointers on my home page. You may need to set the dir to "/tmp" and modify your PATH as explained. type is trying to find the name of the script from bash. If your PATH does not include the location where you run your scripts, then you will get these messages, but usually it does not prevent the script from running.

1. Where do I set the dir to "/tmp"
2. Where do I modify the PATH?
3. I thought the path I added should point to "convert", in my case "/usr/local/bin"?

Sorry for my lack of knowledge and understanding!

Post by **fmw42** » 2014-10-12T11:15:52-07:00

1. Where do I set the dir to "/tmp"
2. Where do I modify the PATH?
3. I thought the path i added should be directed to "convert" in my case "/usr/local/bin"?

1. Line 136, right after the defaults.

2. What you did should work. But you can also edit your .profile file (a hidden file), and that then works for all scripts. You should also put in there the path to wherever you keep my scripts.

3. Yes, it should be fine if your IM is at /usr/local/bin. You can check with

type -a convert

or

which convert
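As a sketch of the .profile approach from point 2, the helper below prepends a directory to a PATH-style string only if it is not already there. The name `path_with` and the `$HOME/bin` directory are assumptions; substitute wherever the scripts are actually kept.

```shell
# Print PATH-style string $2 with directory $1 prepended,
# unless $1 is already one of its entries.
path_with() {
  case ":$2:" in
    *":$1:"*) printf '%s\n' "$2" ;;
    *)        printf '%s\n' "$1:$2" ;;
  esac
}

# In ~/.profile (example directories only):
#   PATH=$(path_with /usr/local/bin "$PATH")
#   PATH=$(path_with "$HOME/bin" "$PATH")
#   export PATH
```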

[root@localhost ~]# bash multicrop -f 10 multicrop_test.jpg multi_test.jpg
multicrop: line 142: type: multicrop: not found
dirname: missing operand
Try `dirname --help' for more information.
basename: missing operand
Try `basename --help' for more information.

Processing Image 0
Size: 414x291
Page Geometry: 2048x1151+155+122

convert: geometry does not contain image `./multicrop_1_27048.mpc' @ warning/attribute.c/GetImageBoundingBox/247.
Processing Image 1
Size: 308x276
Page Geometry: 2048x1151+772+146

convert: geometry does not contain image `./multicrop_1_27048.mpc' @ warning/attribute.c/GetImageBoundingBox/247.
Processing Image 2
Size: 490x322
Page Geometry: 2048x1151+524+615

This indicates that the script worked. As I said before, those messages should not stop it from working. Did you blur the image as I originally suggested? If you do not, it will try to find each letter.

Try this:

Code: Select all

convert OCR_Project_test.jpg -blur 0x3 -negate -threshold 0 -negate tmp1.png

Code: Select all

multicrop -g 5 tmp1.png tmp1_multicrop.png 
Processing Image 0
Size: 283x165
Page Geometry: 1000x1000+127+154

Processing Image 1
Size: 177x151
Page Geometry: 1000x1000+614+161

Processing Image 2
Size: 169x158
Page Geometry: 1000x1000+215+590

Processing Image 3
Size: 46x210
Page Geometry: 1000x1000+737+552

The resulting images will just be the black blurred regions, cropped to their bounding boxes. So now go back to your original image and crop each part out using the Size and the offsets from the Page Geometry. For example, for Processing Image 2:

Code: Select all

convert  OCR_Project_test.jpg -crop 169x158+215+590 -fuzz 1% -trim +repage result2.png
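To avoid copying the numbers by hand, the log can be turned into crop geometries with a small filter. This is only a sketch: it assumes the log looks exactly like the output shown above and that multicrop writes it to standard output; the name `multicrop_geometries` is made up for this example.

```shell
# Read a multicrop log on stdin and print one WxH+X+Y geometry per region,
# combining each "Size:" with the offsets of the following "Page Geometry:".
multicrop_geometries() {
  awk '
    /^Size:/ { size = $2 }
    /^Page Geometry:/ {
      off = $3
      sub(/^[0-9]+x[0-9]+/, "", off)   # drop the page size, keep +X+Y
      print size off
    }
  '
}

# Usage (crops every region from the original, non-blurred image):
#   multicrop -g 5 tmp1.png tmp1_multicrop.png | multicrop_geometries | {
#     i=0
#     while read -r geom; do
#       convert OCR_Project_test.jpg -crop "$geom" -fuzz 1% -trim +repage "result$i.png"
#       i=$((i + 1))
#     done
#   }
```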

Now use my script unrotate to get the images to either horizontal or vertical orientation. You may have to rotate some multiple of 90 degrees afterwards to get the text rotated so it can be read.

Code: Select all

unrotate -f 1 result2.png result2_unr.png

Image Is Being Rotated 42.76 degrees
