Prepare for OCR, find words and rotate.

eleison
Posts: 45
Joined: 2013-12-10T15:14:46-07:00
Authentication code: 6789

Post by eleison »

Hi everyone!

I have many JPGs with randomly placed words and sentences; the background is white and the text is black. I want to use ImageMagick to find all lines of words, isolate them, and rotate them to a horizontal axis. There is plenty of white space between each item (line). Sometimes there is an image or graphic among the words too; ideally ImageMagick would ignore those, but it would not matter much if they get isolated as well.

This is what it could look like:
Image

Method 1
After I isolate the words, I was thinking I could rotate them using this model:
1. Somehow find the corners of the text.
2. Make a rectangle from the corner points. All lines are rectangular in shape, and they won't differ in length that much.
3. Make a smaller horizontal rectangle.
4. Rotate the first rectangle until it completely fills the small rectangle.

Image

Update:
Method 2
Maybe it's easier to rotate until ImageMagick finds the minimum possible height, or until the height drops below a set number of pixels.


This is just an idea that might work. I am very open to ANY suggestions on how to accomplish this. If it can be done with only the command line, great; otherwise I'm most comfortable with PHP.
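A minimal command-line sketch of Method 2, under stated assumptions: the image holds one already-isolated line of text on a white background, a 5-degree step is fine enough, and a 10% fuzz is enough for `-trim` to ignore JPEG noise. The function names `trimmed_heights` and `best_angle` are made up for illustration; they are not part of ImageMagick.

```shell
#!/bin/sh
# Sketch of "Method 2": try a range of angles and keep the one whose trimmed
# bounding box is shortest. Function names are illustrative only.

# Print "angle height" for each candidate angle (needs ImageMagick's convert).
trimmed_heights() {
    img=$1
    step=${2:-5}
    angle=-90
    while [ "$angle" -lt 90 ]; do
        # White fill keeps the corners added by -rotate from counting as text.
        height=$(convert "$img" -background white -rotate "$angle" \
                 -fuzz 10% -trim +repage -format '%h' info:)
        echo "$angle $height"
        angle=$((angle + step))
    done
}

# Read "angle height" pairs on stdin; print the angle with the smallest height.
best_angle() {
    sort -k2,2n -k1,1n | head -n 1 | cut -d' ' -f1
}

# Usage (file names are placeholders):
#   angle=$(trimmed_heights word.png 5 | best_angle)
#   convert word.png -background white -rotate "$angle" -fuzz 10% -trim +repage word_h.png
```

A 180-degree sweep is enough because rotating by a further 180 degrees gives the same height; this also means the result can come out upside down, which the sketch does not detect.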

When this process is done, I'm going to run OCR with Tesseract on the text.

Thank you all, and thank you for this fantastic forum :)

Re: Prepare for OCR, find words and rotate.

Post by eleison »

I think I have a solution for making the words horizontal, but I can't find a good way to segment/separate the words/lines. One approach I've been thinking about is to look for black pixels: when I find one, I roughly mask the area around it (since I have plenty of white space between objects), cut the masked area out, and repeat until there are no black pixels left. This should work because the lines/words have a maximum width.

How do I look for a black pixel, mask it, and then cut it out to a new image?

I am of course open to any advice that could lead me in the right direction!
The example image is almost identical to how my images will look: the words/lines will always have plenty of white space in between. The placement of the words/lines/objects will vary somewhat from image to image, but there will still be plenty of white space in between.
fmw42
Posts: 25562
Joined: 2007-07-02T17:14:51-07:00
Authentication code: 1152
Location: Sunnyvale, California, USA

Re: Prepare for OCR, find words and rotate.

Post by fmw42 »

What version of IM are you using, and on what platform? If you are on Linux/Mac OS X or Windows with Cygwin, you could try my scripts multicrop and unrotate after blurring your image.

Re: Prepare for OCR, find words and rotate.

Post by eleison »

Ubuntu and ImageMagick 6.5.4-7! What would be a good blurring filter? Does multicrop return crop coordinates, so I can crop the original, non-blurred image?

Thanks!

Re: Prepare for OCR, find words and rotate.

Post by fmw42 »

IM 6.5.4-7 is very, very old, but you can try anyway; I think it should work. Try one of my examples first to test it. Blur your image (use -blur) enough that no white space remains between or inside the letters. With the option -u 2, multicrop will unrotate each cropped area. It also reports the crop coordinates for each area it crops. If the unrotate gives a vertically oriented result, use convert ... -rotate 90 or -90 ... to make it horizontal. Note that your background needs to be nearly constant; use the -f argument (for -fuzz) to allow for some variation in the background.

Example from my script page

Code: Select all

multicrop -f 10 -u 2 3ladies1.jpg tmp.jpg

Processing Image 0
Size: 267x268
Page Geometry: 660x588+18+14


Image Is Being Rotated 2.89 degrees

Processing Image 1
Size: 329x416
Page Geometry: 660x588+317+82


Image Is Being Rotated 2.93 degrees

Processing Image 2
Size: 269x269
Page Geometry: 660x588+17+303


Image Is Being Rotated -2.89 degrees

Re: Prepare for OCR, find words and rotate.

Post by eleison »

Thank you! I will try this and see how it works! :) I've read many different subjects about the multicrop script, someone said it is slow and makes huge temp files? Maybe it was an old post I read...

Re: Prepare for OCR, find words and rotate.

Post by fmw42 »

eleison wrote: Thank you! I will try this and see how it works! :) I've read many different subjects about the multicrop script, someone said it is slow and makes huge temp files? Maybe it was an old post I read...
It may not be very fast and may create large temp files if your input file is large. But if it gets the job done, then that is what counts. If you need it faster, then someone would have to code it formally into proper C code. That is not my strength.

Re: Prepare for OCR, find words and rotate.

Post by eleison »

OK, I'm very thankful for all your hard work on these scripts! I will try multicrop and see if it works for me. The method I came up with in my earlier posts may be worth a try as well, so I might try both and compare.

My steps would be:
1. Take the coordinates of the first black pixel I find.
2. Put a "big enough" crop rectangle around the text.
3. Crop out the text.
4. Trim the crop.
5. Loop until there are no black pixels left.

Image

If I'm going to try this method, what's the best way of getting the first black pixel?

Thanks again for your help!

Re: Prepare for OCR, find words and rotate.

Post by fmw42 »

Is it truly black and not just a dark gray?

Code: Select all

data=`compare -metric pae -subimage-search -dissimilarity-threshold 1 -similarity-threshold 0 \
yourimage \( -size 1x1 xc:black \) null: 2>&1`
echo "$data"

See
http://www.imagemagick.org/Usage/compare/

Re: Prepare for OCR, find words and rotate.

Post by eleison »

fmw42 wrote: Is it truly black and not just a dark gray?
I will apply filters to the image to make it pure black and white and as clear as I can. Maybe -normalize and -level would be enough? Would some grey pixels cause trouble?

I will try your solution! Thanks! I remember reading something about dumping image pixels into a .txt file. Maybe an even better way of doing this is to blur the text, write the pixel data to a .txt file, and then use PHP to look for the pixels with the most black around them; that might give the center of each "text-object". I'm not able to try all these things right now. When I get time I will try them and see what works best!
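The pixel-listing idea can also be tried without PHP. The sketch below reads ImageMagick's `txt:` pixel enumeration and reports the first pure-black pixel in scan order. It assumes an opaque image forced to 8 bits per channel, so that black is printed as `#000000`; the function name `first_black_pixel` is illustrative, not an ImageMagick feature.

```shell
#!/bin/sh
# Read an ImageMagick "txt:" pixel listing on stdin and print "x y" for the
# first pure-black pixel in scan order (top row first, left to right).
first_black_pixel() {
    # Pixel lines look like "12,34: (0,0,0)  #000000  black". The first line
    # ("# ImageMagick pixel enumeration: ...") starts with "#" and is skipped.
    awk '!/^#/ && /#0+( |$)/ {
        split($0, part, ":")       # part[1] holds "x,y"
        split(part[1], xy, ",")
        print xy[1], xy[2]
        exit                       # stop at the first match
    }'
}

# Usage (file name and 50% threshold are placeholders):
#   convert page.jpg -threshold 50% -depth 8 txt:- | first_black_pixel
```

Thresholding first turns dark grey into true black, which sidesteps the question of whether the text is exactly black. On large images the listing is long, so this is likely to be slow.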

Thanks again for all the help!

Re: Prepare for OCR, find words and rotate.

Post by eleison »

OK, now I have had time to do some testing! I tried "multicrop", but it seems like I'm doing something wrong. See:


[root@localhost ~]# bash multicrop -f 10 multicrop_test.jpg multi_test.jpg
multicrop: line 142: type: multicrop: not found
dirname: missing operand
Try `dirname --help' for more information.
basename: missing operand
Try `basename --help' for more information.

Processing Image 0
Size: 414x291
Page Geometry: 2048x1151+155+122

convert: geometry does not contain image `./multicrop_1_27048.mpc' @ warning/attribute.c/GetImageBoundingBox/247.
Processing Image 1
Size: 308x276
Page Geometry: 2048x1151+772+146

convert: geometry does not contain image `./multicrop_1_27048.mpc' @ warning/attribute.c/GetImageBoundingBox/247.
Processing Image 2
Size: 490x322
Page Geometry: 2048x1151+524+615

This is an image with random black boxes on a white background.
snibgo
Posts: 12159
Joined: 2010-01-23T23:01:33-07:00
Authentication code: 1151
Location: England, UK

Re: Prepare for OCR, find words and rotate.

Post by snibgo »

eleison wrote:
[root@localhost ~]# bash multicrop -f 10 multicrop_test.jpg multi_test.jpg
multicrop: line 142: type: multicrop: not found
I'm not a Unix expert, but I think the problem is that "type" is a command built-in to the bash shell, but you are running a different shell and calling bash to run the script. If, instead, you start bash (probably by typing "bash") then run the script, you probably won't get the error with "type".
snibgo's IM pages: im.snibgo.com

Re: Prepare for OCR, find words and rotate.

Post by fmw42 »

See my Pointers on my home page. You may need to set the dir to "/tmp" and modify your PATH as explained. type is trying to find the name of the script from bash. If your PATH does not include the location where you run your scripts, then you will get these messages, but usually that does not prevent the script from running.

Re: Prepare for OCR, find words and rotate.

Post by eleison »

Thanks for the quick reply! Fred, I tried to follow your instructions, but I'm not that good at Unix (still learning) and had a hard time understanding every step.
Alternately, edit the script somewhere between the comments and the first use of any IM command, such as just below the defaults section to add the following two lines:
imdir="path2" #(such as imdir="/usr/local/bin" or imdir="/usr/bin")
PATH="${imdir}:${PATH}"
I added this to the script file:

# CAVEAT: No guarantee that this script will work on all platforms,
# nor that trapping of inconsistent parameters is complete and
# foolproof. Use At Your Own Risk.
#
######
#
imdir="/usr/local/bin" #(such as imdir="/usr/local/bin" or imdir="/usr/bin")
PATH="${imdir}:${PATH}"
fmw42 wrote:
See my Pointers on my home page. You may need to set the dir to "/tmp" and modify your PATH as explained. type is trying to find the name of the script from bash. If your PATH does not include the location where you run your scripts, then you will get these messages, but usually that does not prevent the script from running.
1. Where do I set the dir to "/tmp"?
2. Where do I modify the PATH?
3. I thought the path I added should point to "convert", in my case "/usr/local/bin"?

Sorry for my lack of knowledge and understanding!

Re: Prepare for OCR, find words and rotate.

Post by fmw42 »

eleison wrote:
1. Where do I set the dir to "/tmp"
2. Where do I modify the PATH?
3. I thought the path i added should be directed to "convert" in my case "/usr/local/bin"?
1. Line 136, right after the defaults.

2. What you did should work. But you can also edit your .profile file (a hidden file), and that then works for all scripts. You should also add the path to wherever you keep my scripts.

3. Yes, it should be fine if your IM is at /usr/local/bin. You can check with

type -a convert

or

which convert
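For the .profile route, a small sketch of lines that could go in that file. The helper `prepend_path` is a made-up name, and `$HOME/bin` is only an assumed location for the scripts; substitute wherever multicrop and unrotate are actually stored.

```shell
#!/bin/sh
# Prepend a directory to PATH unless it is already listed.
prepend_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, leave PATH alone
        *) PATH="$1:$PATH" ;;
    esac
    export PATH
}

# Example calls (directories are assumptions; adjust to your system):
#   prepend_path /usr/local/bin      # where convert lives
#   prepend_path "$HOME/bin"         # where multicrop and unrotate are kept
```

The duplicate check keeps PATH from growing each time .profile is sourced again.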

[root@localhost ~]# bash multicrop -f 10 multicrop_test.jpg multi_test.jpg
multicrop: line 142: type: multicrop: not found
dirname: missing operand
Try `dirname --help' for more information.
basename: missing operand
Try `basename --help' for more information.

Processing Image 0
Size: 414x291
Page Geometry: 2048x1151+155+122

convert: geometry does not contain image `./multicrop_1_27048.mpc' @ warning/attribute.c/GetImageBoundingBox/247.
Processing Image 1
Size: 308x276
Page Geometry: 2048x1151+772+146

convert: geometry does not contain image `./multicrop_1_27048.mpc' @ warning/attribute.c/GetImageBoundingBox/247.
Processing Image 2
Size: 490x322
Page Geometry: 2048x1151+524+615
This indicates that your script worked. As I said before, the messages should not stop it from working. Did you blur the image as I originally suggested? If you do not, it will try to find each letter.

Try this:

Code: Select all

convert OCR_Project_test.jpg -blur 0x3 -negate -threshold 0 -negate tmp1.png
Image

Code: Select all

multicrop -g 5 tmp1.png tmp1_multicrop.png 
Processing Image 0
Size: 283x165
Page Geometry: 1000x1000+127+154

Processing Image 1
Size: 177x151
Page Geometry: 1000x1000+614+161

Processing Image 2
Size: 169x158
Page Geometry: 1000x1000+215+590

Processing Image 3
Size: 46x210
Page Geometry: 1000x1000+737+552

The resulting images will just be the black blurred portions, cropped to their bounding boxes. So now go back to your original image and crop each part out using the Size and the offsets from the Page Geometry. For example, for Processing Image 2:

Code: Select all

convert  OCR_Project_test.jpg -crop 169x158+215+590 -fuzz 1% -trim +repage result2.png
Image
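To avoid copying the numbers by hand for every area, the report can be turned into crop commands with a little text processing. This is a sketch that assumes the report has exactly the three-line layout shown above, that the offsets are non-negative, and that file names contain no spaces; `crop_commands` is an illustrative name, not part of multicrop.

```shell
#!/bin/sh
# Turn a multicrop report (on stdin) into one convert command per area,
# cropping from the original, unblurred image given as the first argument.
crop_commands() {
    src=$1
    awk -v src="$src" '
        /^Processing Image/ { n = $3 }        # area number
        /^Size:/            { size = $2 }     # e.g. 169x158
        /^Page Geometry:/   {
            # "1000x1000+215+590" -> keep only the "+215+590" offset part
            offset = substr($3, index($3, "+"))
            printf "convert %s -crop %s%s -fuzz 1%% -trim +repage result%s.png\n", src, size, offset, n
        }'
}

# Usage, assuming multicrop writes its report to standard output:
#   multicrop -g 5 tmp1.png tmp1_multicrop.png | crop_commands OCR_Project_test.jpg
# Review the printed commands, then pipe them to sh to run them.
```

Printing the commands rather than running them directly makes it easy to check the geometry before any files are written.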

Now use my script unrotate to get the images to either horizontal or vertical orientation. You may have to rotate some multiple of 90 degrees afterwards to get the text rotated so it can be read.

Code: Select all

unrotate -f 1 result2.png result2_unr.png

Image Is Being Rotated 42.76 degrees
Image