Prepare for OCR, find words and rotate.
Hi everyone!
I have many JPGs with randomly placed words/sentences; the background is white and the text is black. I want to use ImageMagick to find all lines of words, isolate them, and rotate them to a horizontal axis. There is plenty of white space between each item (line). Sometimes there can be an image/graphic among the words too; it would be best if ImageMagick ignored those, but it wouldn't matter much if they got isolated too.
This is what it could look like:
Method 1
After I isolate the words, I was thinking I could rotate them using this model:
1. Somehow find the corners of the text.
2. Make a rectangle from the corner points. All lines are rectangular in shape, and they won't differ in length that much.
3. Make a smaller horizontal rectangle.
4. Rotate the first rectangle until it completely fills the small rectangle.
Update:
Method 2
Maybe it's easier to rotate until ImageMagick finds the minimum possible height, or until the height drops below a set number of pixels.
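A minimal sketch of the Method 2 idea, written in Python on plain coordinates rather than through ImageMagick. It is only an illustration under assumptions: `bbox_height` and `best_rotation` are hypothetical helper names, the ink pixels are assumed to be available as (x, y) pairs, and the 1-degree step is arbitrary.

```python
import math

def bbox_height(points, angle_deg):
    """Height of the bounding box after rotating the points by angle_deg."""
    a = math.radians(angle_deg)
    ys = [x * math.sin(a) + y * math.cos(a) for x, y in points]
    return max(ys) - min(ys)

def best_rotation(points, step=1.0):
    """Try every angle in [-90, 90) and keep the one with the smallest height."""
    angles = [-90 + i * step for i in range(int(180 / step))]
    return min(angles, key=lambda a: bbox_height(points, a))

# Synthetic "line of text": a 100x10 block of ink pixels tilted by 30 degrees.
tilt = math.radians(30)
block = [(u * math.cos(tilt) - v * math.sin(tilt),
          u * math.sin(tilt) + v * math.cos(tilt))
         for u in range(100) for v in range(10)]

print(best_rotation(block))  # -30.0 for this synthetic block
```

The sign of the angle depends on the coordinate convention (image rows usually grow downwards), so the value may need to be negated before it is passed to a rotate operation.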
This is just an idea that might work. I am very open to ANY suggestions on how to accomplish this. If it can be done with only the command line, great; otherwise I'm most comfortable with PHP.
When this process is done, I'm going to run OCR with Tesseract on the text.
Thank you all, and thank you for this fantastic forum.
Re: Prepare for OCR, find words and rotate.
I think I have a solution for making the words horizontal, but I can't find a good way to segment/separate the words/lines. One approach I've been thinking about is to look for black pixels: when I find one, I roughly mask the area around it (since there is plenty of white space between every object), cut the masked area out, and repeat until there are no black pixels left. This should work because the lines/words have a maximum width.
How do I look for a black pixel, mask it, and then cut it out to a new image?
I am of course open to any advice that could lead me in the right direction!
The example image is almost identical to how my images will look; the words/lines will always have plenty of white space in between. The placement of the words/lines/objects will vary somewhat from image to image, but there will still be lots of white space in between.
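One way to exploit the white space between items is to group ink pixels that lie close together and take one bounding box per group. The Python sketch below uses a breadth-first flood fill, which is a swapped-in technique and not something proposed in this thread. `find_regions` and its `gap` parameter are hypothetical names, and the page is assumed to be available as a grid of 0 (white) and 1 (black).

```python
from collections import deque

def find_regions(grid, gap=1):
    """Group ink pixels (1) that lie within `gap` pixels of each other.

    Returns one bounding box (left, top, right, bottom) per group, inclusive.
    """
    h, w = len(grid), len(grid[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y0 in range(h):
        for x0 in range(w):
            if not grid[y0][x0] or seen[y0][x0]:
                continue
            # Breadth-first search from this unvisited ink pixel.
            seen[y0][x0] = True
            queue = deque([(x0, y0)])
            left, top, right, bottom = x0, y0, x0, y0
            while queue:
                x, y = queue.popleft()
                left, right = min(left, x), max(right, x)
                top, bottom = min(top, y), max(bottom, y)
                for ny in range(max(0, y - gap), min(h, y + gap + 1)):
                    for nx in range(max(0, x - gap), min(w, x + gap + 1)):
                        if grid[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return boxes

page = [
    "..........",
    ".#.#......",
    ".#.#......",
    "..........",
    "..........",
    "......##..",
    "..........",
]
grid = [[1 if c == "#" else 0 for c in row] for row in page]
print(find_regions(grid, gap=2))  # [(1, 1, 3, 2), (6, 5, 7, 5)]
```

The `gap` value plays the role of the rough mask: it has to be larger than the spacing between letters of one item and smaller than the white space between items.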
- fmw42
- Posts: 25562
- Joined: 2007-07-02T17:14:51-07:00
- Authentication code: 1152
- Location: Sunnyvale, California, USA
Re: Prepare for OCR, find words and rotate.
What version of IM and platform? If on Linux/MacOSX or Windows with Cygwin, you could try my script multicrop and unrotate after blurring your image.
Re: Prepare for OCR, find words and rotate.
Ubuntu and ImageMagick 6.5.4-7! What would be a good blurring filter? Does multicrop return the crop coordinates, so I can crop the original, non-blurred image?
Thanks!
- fmw42
Re: Prepare for OCR, find words and rotate.
IM 6.5.4-7 is very old, but you can try anyway; I think it should work. Try one of my examples first to test it. Blur your image enough so that there is no white space between and within letters. Use -blur. If you use option -u 2, it will unrotate. It does give the crop coordinates for each area it crops. If the unrotate gives a vertically oriented result, use convert ... -rotate 90 or -90 ... to make it horizontal. Note that your background needs to be nearly constant. Use the -f argument for -fuzz to allow for some degree of variation in the background.
Example from my script page
Code: Select all
multicrop -f 10 -u 2 3ladies1.jpg tmp.jpg
Processing Image 0
Size: 267x268
Page Geometry: 660x588+18+14
Image Is Being Rotated 2.89 degrees
Processing Image 1
Size: 329x416
Page Geometry: 660x588+317+82
Image Is Being Rotated 2.93 degrees
Processing Image 2
Size: 269x269
Page Geometry: 660x588+17+303
Image Is Being Rotated -2.89 degrees
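The printed output above can be turned into structured data with a small parser. This Python sketch assumes the log format is exactly as shown in this post (non-negative offsets, one "Processing Image N" line per region); `parse_multicrop_log` is a hypothetical helper, not part of the multicrop script.

```python
import re

def parse_multicrop_log(text):
    """Collect size, offset and rotation angle for each 'Processing Image N' block."""
    regions = []
    for line in text.splitlines():
        line = line.strip()
        m = re.match(r"Processing Image (\d+)", line)
        if m:
            regions.append({"index": int(m.group(1))})
            continue
        if not regions:
            continue  # ignore anything printed before the first block
        m = re.match(r"Size: (\d+)x(\d+)", line)
        if m:
            regions[-1]["size"] = (int(m.group(1)), int(m.group(2)))
            continue
        # Only the +X+Y part of the page geometry is kept as the offset.
        m = re.match(r"Page Geometry: \d+x\d+\+(\d+)\+(\d+)", line)
        if m:
            regions[-1]["offset"] = (int(m.group(1)), int(m.group(2)))
            continue
        m = re.match(r"Image Is Being Rotated (-?[\d.]+) degrees", line)
        if m:
            regions[-1]["angle"] = float(m.group(1))
    return regions

log = """
Processing Image 0
Size: 267x268
Page Geometry: 660x588+18+14
Image Is Being Rotated 2.89 degrees
Processing Image 1
Size: 329x416
Page Geometry: 660x588+317+82
Image Is Being Rotated 2.93 degrees
Processing Image 2
Size: 269x269
Page Geometry: 660x588+17+303
Image Is Being Rotated -2.89 degrees
"""
for region in parse_multicrop_log(log):
    print(region)
```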
Re: Prepare for OCR, find words and rotate.
Thank you! I will try this and see how it works. I've read several threads about the multicrop script; someone said it is slow and creates huge temp files? Maybe it was an old post I read...
- fmw42
Re: Prepare for OCR, find words and rotate.
eleison wrote: Thank you! I will try this and see how it works! I've read many different subjects about the multicrop script, someone said it is slow and makes huge temp files? Maybe it was an old post I read...
It may not be very fast and may create large temp files if your input file is large. But if it gets the job done, then that is what counts. If you need it faster, then someone would have to code it formally into proper C code. That is not my strength.
Re: Prepare for OCR, find words and rotate.
OK, I'm very thankful for all your hard work on these scripts! I will try multicrop and see if it works for me. The method I came up with in my earlier posts may be worth a try too, so I might compare both approaches.
My steps would be:
1. Take the coordinates of the first black pixel I find.
2. Put a "big enough" crop rectangle around the text.
3. Crop out the text.
4. Trim the crop.
5. Loop until there are no black pixels left.
If I try this method, what's the best way of getting the first black pixel?
Thanks again for your help!
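The five steps above can be sketched in Python on a small grid of 0 (white) and 1 (black). This is an illustration under assumptions, not an ImageMagick recipe: `first_ink`, `cut_out_items` and the `reach` parameter are hypothetical names, and `reach` must be larger than any single item and smaller than the white space between items.

```python
def first_ink(grid):
    """Return (x, y) of the first ink pixel in row-major order, or None."""
    for y, row in enumerate(grid):
        for x, value in enumerate(row):
            if value:
                return x, y
    return None

def cut_out_items(grid, reach):
    """Repeatedly cut a rough window around the first ink pixel, then trim it.

    Returns trimmed boxes as (left, top, right, bottom), inclusive.
    The grid is modified in place: every pixel that was cut out is erased.
    """
    h, w = len(grid), len(grid[0])
    boxes = []
    while (hit := first_ink(grid)) is not None:   # step 1
        x0, y0 = hit
        # Step 2: the first hit is the topmost ink pixel, so the window
        # only needs to extend sideways and downwards.
        left, right = max(0, x0 - reach), min(w - 1, x0 + reach)
        top, bottom = y0, min(h - 1, y0 + reach)
        # Step 3: collect the ink inside the window.
        ink = [(x, y) for y in range(top, bottom + 1)
               for x in range(left, right + 1) if grid[y][x]]
        # Step 4: trim the window to the ink it actually contains.
        xs, ys = [p[0] for p in ink], [p[1] for p in ink]
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
        # Step 5: erase what was cut out and look again.
        for x, y in ink:
            grid[y][x] = 0
    return boxes

page = [
    "..........",
    ".#.#......",
    ".#.#......",
    "..........",
    "..........",
    "......##..",
    "..........",
]
grid = [[1 if c == "#" else 0 for c in row] for row in page]
print(cut_out_items([row[:] for row in grid], reach=3))  # [(1, 1, 3, 2), (6, 5, 7, 5)]
```

If `reach` is too small, a single item is split across several crops; if it is too large, neighbouring items end up in the same crop.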
- fmw42
Re: Prepare for OCR, find words and rotate.
Is it truly black and not just a dark gray?
see
http://www.imagemagick.org/Usage/compare/
Code: Select all
data=`compare -metric pae -subimage-search -dissimilarity-threshold 1 -similarity-threshold 0 \
yourimage \( -size 1x1 xc:black \) null: 2>&1`
echo "$data"
Re: Prepare for OCR, find words and rotate.
fmw42 wrote: Is it truly black and not just a dark gray?
I will filter the image to make it pure black and white and as clear as I can. Maybe -normalize and -level would be enough? Would some gray pixels cause trouble?
I will try your solution, thanks! I remember reading something about dumping image pixels to a .txt file. Maybe an even better way is to blur the text, write the pixel data to a .txt file, and then use PHP to look for the pixels with the most black around them; that might give the center of each "text object". I'm not able to try all these things right now. When I get time, I will try and see what works best!
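A minimal sketch of the "pixels in a .txt" idea, in Python rather than PHP. The sample dump below is hand-written to imitate ImageMagick's text pixel listing and is an assumption; the exact layout differs between versions. `dark_pixels` and its `threshold` are hypothetical, and channel values are assumed to be 8-bit (0 to 255). Using a threshold instead of an exact match is one way to cope with dark gray pixels that are not truly black.

```python
import re

# "x,y: (r,g,b) ..." at the start of a line; spaces inside the parentheses are tolerated.
PIXEL = re.compile(r"^(\d+),(\d+):\s*\(([^)]*)\)")

def dark_pixels(txt_dump, threshold=64):
    """Yield (x, y) for pixels whose first channel value is below threshold."""
    for line in txt_dump.splitlines():
        m = PIXEL.match(line.strip())
        if not m:
            continue  # header or unrecognized line
        channels = re.findall(r"[\d.]+", m.group(3))
        if channels and float(channels[0]) < threshold:
            yield int(m.group(1)), int(m.group(2))

# Assumed sample of a 3x2 grayscale dump (not real tool output).
sample = """\
# ImageMagick pixel enumeration: 3,2,255,gray
0,0: (255,255,255)  #FFFFFF  white
1,0: (250,250,250)  #FAFAFA  gray(250)
2,0: ( 40, 40, 40)  #282828  gray(40)
0,1: (  0,  0,  0)  #000000  black
1,1: (255,255,255)  #FFFFFF  white
2,1: (255,255,255)  #FFFFFF  white
"""
print(next(dark_pixels(sample), None))  # (2, 0)
```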
Thanks again for all the help!
Re: Prepare for OCR, find words and rotate.
OK, now I've had time to do some testing! I tried multicrop, but it seems like I'm doing something wrong. See:
[root@localhost ~]# bash multicrop -f 10 multicrop_test.jpg multi_test.jpg
multicrop: line 142: type: multicrop: not found
dirname: missing operand
Try `dirname --help' for more information.
basename: missing operand
Try `basename --help' for more information.
Processing Image 0
Size: 414x291
Page Geometry: 2048x1151+155+122
convert: geometry does not contain image `./multicrop_1_27048.mpc' @ warning/attribute.c/GetImageBoundingBox/247.
Processing Image 1
Size: 308x276
Page Geometry: 2048x1151+772+146
convert: geometry does not contain image `./multicrop_1_27048.mpc' @ warning/attribute.c/GetImageBoundingBox/247.
Processing Image 2
Size: 490x322
Page Geometry: 2048x1151+524+615
This is an image with random black boxes on a white background.
- snibgo
- Posts: 12159
- Joined: 2010-01-23T23:01:33-07:00
- Authentication code: 1151
- Location: England, UK
Re: Prepare for OCR, find words and rotate.
eleison wrote: [root@localhost ~]# bash multicrop -f 10 multicrop_test.jpg multi_test.jpg
multicrop: line 142: type: multicrop: not found
I'm not a Unix expert, but I think the problem is that "type" is a command built into the bash shell, but you are running a different shell and calling bash to run the script. If, instead, you start bash (probably by typing "bash") and then run the script, you probably won't get the error with "type".
snibgo's IM pages: im.snibgo.com
- fmw42
Re: Prepare for OCR, find words and rotate.
See my Pointers on my home page. You may need to set the dir to "/tmp" and modify your PATH as explained. type is trying to find the name of the script from bash. If your PATH does not include the location where you run your scripts, then you will get these messages, but usually they do not prevent the script from running.
Re: Prepare for OCR, find words and rotate.
Thanks for the quick reply! Fred, I tried to follow your instructions, but I'm not that good at Unix (still learning) and had a hard time understanding every step.
Alternately, edit the script somewhere between the comments and the first use of any IM command, such as just below the defaults section to add the following two lines:
imdir="path2" #(such as imdir="/usr/local/bin" or imdir="/usr/bin")
PATH="${imdir}:${PATH}"
I added this to the script file:
# CAVEAT: No guarantee that this script will work on all platforms,
# nor that trapping of inconsistent parameters is complete and
# foolproof. Use At Your Own Risk.
#
######
#
imdir="/usr/local/bin" #(such as imdir="/usr/local/bin" or imdir="/usr/bin")
PATH="${imdir}:${PATH}"
fmw42 wrote: See my Pointers on my home page. You may need to set the dir to "/tmp" and modify your PATH as explained. type is trying to find the name of the script from bash. If your PATH does not include the location where you run your scripts, then you will get these messages, but usually it does not prevent the script from running.
1. Where do I set the dir to "/tmp"?
2. Where do I modify the PATH?
3. I thought the path I added should point to where "convert" is, in my case "/usr/local/bin"?
Sorry for my lack of knowledge and understanding!
- fmw42
Re: Prepare for OCR, find words and rotate.
eleison wrote:
1. Where do I set the dir to "/tmp"
2. Where do I modify the PATH?
3. I thought the path i added should be directed to "convert" in my case "/usr/local/bin"?
1. Line 136, right after the defaults.
2. What you did should work. But you can also edit your .profile file (a hidden file), and that then works for all scripts. You should also put in there the path to where you keep any of my scripts.
3. Yes, it should be fine if your IM is at /usr/local/bin. You can check with
type -a convert
or
which convert
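The same lookup can be done from Python with the standard library, which may help when the commands are driven from a script instead of an interactive shell. This is only a sketch: `locate` is a hypothetical helper, and "/usr/local/bin" is the directory mentioned in this thread, not a general default.

```python
import os
import shutil

def locate(command, extra_dir=None):
    """Return the full path of `command`, searching extra_dir before PATH."""
    search = os.environ.get("PATH", "")
    if extra_dir:
        # Same idea as PATH="${imdir}:${PATH}" in the script.
        search = extra_dir + os.pathsep + search
    return shutil.which(command, path=search)

print(locate("convert"))                    # None if convert is not on PATH
print(locate("convert", "/usr/local/bin"))  # also looks in /usr/local/bin first
```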
eleison wrote: [root@localhost ~]# bash multicrop -f 10 multicrop_test.jpg multi_test.jpg
The output you posted indicates that your script worked. The messages, as I said before, should not stop it from working. Did you blur the image as I originally suggested? If you do not, it will try to find each letter.
Try this:
Code: Select all
convert OCR_Project_test.jpg -blur 0x3 -negate -threshold 0 -negate tmp1.png
Code: Select all
multicrop -g 5 tmp1.png tmp1_multicrop.png
Processing Image 0
Size: 283x165
Page Geometry: 1000x1000+127+154
Processing Image 1
Size: 177x151
Page Geometry: 1000x1000+614+161
Processing Image 2
Size: 169x158
Page Geometry: 1000x1000+215+590
Processing Image 3
Size: 46x210
Page Geometry: 1000x1000+737+552
Code: Select all
convert OCR_Project_test.jpg -crop 169x158+215+590 -fuzz 1% -trim +repage result2.png
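The crop command above handles one region. A small Python sketch can build the same command for every region reported by multicrop. The region tuples are copied from the output in this post; `crop_command` and the `resultN.png` naming for the other regions are assumptions made for illustration.

```python
import shlex

# (width, height, x, y) copied from the multicrop output in this post.
regions = [
    (283, 165, 127, 154),
    (177, 151, 614, 161),
    (169, 158, 215, 590),
    (46, 210, 737, 552),
]

def crop_command(source, region, index, fuzz="1%"):
    """Build the argument list for cropping one region out of the original."""
    w, h, x, y = region
    return ["convert", source, "-crop", f"{w}x{h}+{x}+{y}",
            "-fuzz", fuzz, "-trim", "+repage", f"result{index}.png"]

for i, region in enumerate(regions):
    # Print the command; pass the list to subprocess.run() to execute it.
    print(shlex.join(crop_command("OCR_Project_test.jpg", region, i)))
```

Cropping from the original image rather than from tmp1.png keeps the text unblurred for the later OCR step.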
Now use my script unrotate to get the images to either horizontal or vertical orientation. You may have to rotate some multiple of 90 degrees afterwards to get the text rotated so it can be read.
Code: Select all
unrotate -f 1 result2.png result2_unr.png
Image Is Being Rotated 42.76 degrees