How to remove MS Word Spellcheck wavy lines
Posted: 2017-01-10T09:38:40-07:00
by gddolph
Hello all,
I am using imagemagick and the textcleaner script to preprocess image files for tesseract OCR, and while I'm having good success so far, I've run into one problem that I need some help to solve. One of my use cases is processing text from screenshots, and I'm finding that tesseract is getting confused by the red or green wavy underlines MS Word adds for spelling and grammar errors. Here's an example:
I am looking for a way to get rid of the red and blue lines, and I have had some success with the following command set:
Code:
$ convert 20170110/deliberate_mistakes1.png -sharpen 0x1.0 -fuzz 30% -fill white -opaque 'rgb(255,0,0)' \
    -opaque 'rgb(0,0,255)' -scale 200% miff:- |\
    ./textcleaner -g -e stretch -f 50 -o 10 -s 1 - png:- |\
    tesseract - stdout
This is some random text with a missspelt word im it and, grammar mistake.
So it works, apart from getting the "n" in "in" wrong. Looking at it incrementally, the convert command gives me this:
And the textcleaner section gives me this:
The problem is that this is a blunt instrument that changes every instance of red or blue to white. If all I wanted to do was read black text on white backgrounds I'd be overjoyed with this solution, but I will have images with text in multiple colors.
One approach I've thought about is trying to detect the wavy line shape as it is very distinctive, and I'm thinking morphology might do it for me, but I have to confess I'm lost with the documentation.
Does anyone have a suggested approach and/or code?
Re: How to remove MS Word Spellcheck wavy lines
Posted: 2017-01-10T11:31:12-07:00
by fmw42
Your images are either too small or the font size is too small, so we cannot even read them. I doubt that you could OCR such images. Can you provide an image that has better resolution?
Re: How to remove MS Word Spellcheck wavy lines
Posted: 2017-01-10T15:04:50-07:00
by snibgo
Clicking on the image, then the download button, improves the size, but not by much. The capital letter height (e.g. of "T") is 10 pixels, which in my experience is unlikely to get good results from Tesseract. It reads your first image as:
Code:
ms I5 some rzndum text with 2 ymsssgen ward m n and grammar mistake.
Sadly, the blue wavy line beneath "grammar" overwrites part of the base of the "g". If you removed the blue line, you would need to "know" that some of it should be replaced with black or gray instead of white.
Ignoring those problems, the approach is fairly simple. Words have wide gaps between them, and letters have small gaps. The wavy lines mostly (always?) span entire words, so they are mostly wider than individual letters. So the task is to isolate graphic objects that are wider than a certain width, and paint them over with white.
Re: How to remove MS Word Spellcheck wavy lines
Posted: 2017-01-11T03:10:51-07:00
by gddolph
Sorry about that, I've used a different image upload site and now the images are scaled properly on the page.
Re: How to remove MS Word Spellcheck wavy lines
Posted: 2017-01-11T04:08:31-07:00
by gddolph
Hi @snibgo, thanks for your response. The images I originally had on the post didn't scale; I've fixed that by using a different site. I've had no problems using tesseract on the properly scaled image. I've edited my main post so that the images are scaled correctly, and included my complete command line, including tesseract, and the results.
I like your idea of isolating objects that are wider, but I'm not sure how to do that. I've tried using morphology, but everything I do ends up being a mess, or deleting everything! Do you have any suggestions on how to do it?
Re: How to remove MS Word Spellcheck wavy lines
Posted: 2017-01-11T04:57:25-07:00
by snibgo
The following removes the worst of the lines, where they are darker in any channel than 77% of maximum, and 20 or more pixels wide. (Letters are about 10 pixels wide.) Textcleaner may take care of the remaining bits.
Windows BAT syntax. For Bash, change ^ to \, and %% to %, escape the parentheses as \( and \), and change the syntax of the environment variables.
Code:
set SRC=doXrJR.png
set LEN=20
%IM%convert ^
%SRC% -write mpr:ORG ^
-channel RGB ^
-threshold 77%% ^
+channel ^
-write mpr:MSK ^
-colorspace Gray -threshold 50%% ^
( +clone ^
-negate ^
-morphology Erode rectangle:%LEN%x1 ^
-mask mpr:MSK -morphology Dilate rectangle:%LEN%x1 ^
+mask ^
-threshold 0 ^
) ^
-delete 0 ^
mpr:ORG ^
+swap ^
-compose Lighten -composite ^
out.png
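For Bash users, the BAT command above might translate roughly as follows. This is only a sketch of the same steps, not a tested replacement: the function name remove_wide_lines and the IM_CONVERT override (for pointing at a different ImageMagick binary) are made up for illustration, and the file names in the example are placeholders.

```shell
# Bash/sh sketch of the BAT command above.
# IM_CONVERT is a hypothetical override for the ImageMagick binary; it defaults to "convert".
# Arguments: source image, output image, minimum line width in pixels (default 20).
remove_wide_lines() {
  rwl_src=$1
  rwl_out=$2
  rwl_len=${3:-20}
  "${IM_CONVERT:-convert}" \
    "$rwl_src" -write mpr:ORG \
    -channel RGB \
    -threshold 77% \
    +channel \
    -write mpr:MSK \
    -colorspace Gray -threshold 50% \
    \( +clone \
       -negate \
       -morphology Erode "rectangle:${rwl_len}x1" \
       -mask mpr:MSK -morphology Dilate "rectangle:${rwl_len}x1" \
       +mask \
       -threshold 0 \
    \) \
    -delete 0 \
    mpr:ORG \
    +swap \
    -compose Lighten -composite \
    "$rwl_out"
}

# Example (placeholder file names):
#   remove_wide_lines doXrJR.png out.png 20
```

Keeping the width as an argument makes it easy to try different values, since letter width depends on the font size and on any scaling applied beforehand.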
Re: How to remove MS Word Spellcheck wavy lines
Posted: 2017-01-11T09:17:47-07:00
by gddolph
Thanks @snibgo, I've tried that code. It does remove some of the line, but it leaves some traces. Reducing the LEN value to 12 made it work better, but still leaves some. I'll play around with this and see if I can get it to work.
Re: How to remove MS Word Spellcheck wavy lines
Posted: 2017-01-11T18:25:06-07:00
by anthony
Nice use of morphology... Exactly what it is meant for:
Use Erode to locate image parts that match the given kernel,
set a conditional dilation mask, and dilate the result back to the matching lines,
remove the found lines.
It might be improved by replacing the kernel with a DIY kernel of the 'wavyline', so that it better matches the lines MS Word adds.
If the kernel matches the lines more closely, you will get a tighter match, and perhaps avoid the use of conditional (masked) dilation. That means the erode-dilate steps will become the simpler 'Open' morphology equivalent.
You can generate a DIY kernel from an image using the "image2kernel" script I wrote for another problem (see Drawing Symbols).
I have updated the Morphology DIY user kernels section to demonstrate using that script to generate morphological kernels.
Re: How to remove MS Word Spellcheck wavy lines
Posted: 2017-01-12T07:13:05-07:00
by gddolph
Thanks Anthony, that image2kernel script is exactly what I needed. I've created a grayscale kernel file using that script and called it ms_wavy_kernel.dat. I've plugged it into my command line, similar to the syntax in the Alternatives to Symbols section, as follows, but I get an error and I'm not sure what I'm doing wrong:
Code:
$ convert 20170110/word_fonts_1.png -write mpr:ORG -channel RGB -threshold 77% \
> +channel -write mpr:MSK -colorspace Gray -threshold 50% +clone -negate \
> -morphology Erode @ms_wavy_kernel.dat \
> -mask mpr:MSK -morphology Dilate @ms_wavy_kernel.dat \
> +mask -threshold 0 -delete 0 mpr:ORG +swap -compose Lighten -composite \
> word_fonts_1_conv1.png
Failed to parse kernel number #0
convert: invalid argument for option `-morphology': @ms_wavy_kernel.dat @ error/convert.c/ConvertImageCommand/2045.
The kernel file is in the same directory as the script, although I have given it the full path as well. Taking the @ symbols off doesn't help either. It's probably something simple, what am I missing?
Re: How to remove MS Word Spellcheck wavy lines
Posted: 2017-01-12T07:31:58-07:00
by snibgo
The error is "Failed to parse kernel number #0".
That would be the first kernel in ms_wavy_kernel.dat. Which you might show us.
Re: How to remove MS Word Spellcheck wavy lines
Posted: 2017-01-12T07:40:30-07:00
by gddolph
Hi snibgo, can do. It's pretty long. I did a grayscale kernel, so it's one file:
Code:
24x8:
0.59375
0.59375
0.5
0.5
0.99609375
0.99609375
0.99609375
0.99609375
0.5
0.5
0.5
0.5
0.99609375
0.99609375
0.99609375
0.99609375
0.5
0.5
0.5
0.5
0.99609375
0.99609375
0.99609375
0.99609375
0.59375
0.59375
0.5
0.5
0.99609375
0.99609375
0.99609375
0.99609375
0.5
0.5
0.5
0.5
0.99609375
0.99609375
0.99609375
0.99609375
0.5
0.5
0.5
0.5
0.99609375
0.99609375
0.99609375
0.99609375
0.5
0.5
0.07421875
0.07421875
0.421875
0.421875
0.421875
0.421875
0.07421875
0.07421875
0.07421875
0.07421875
0.421875
0.421875
0.421875
0.421875
0.07421875
0.07421875
0.07421875
0.07421875
0.421875
0.421875
0.421875
0.421875
0.5
0.5
0.07421875
0.07421875
0.421875
0.421875
0.421875
0.421875
0.07421875
0.07421875
0.07421875
0.07421875
0.421875
0.421875
0.421875
0.421875
0.07421875
0.07421875
0.07421875
0.07421875
0.421875
0.421875
0.421875
0.421875
0.99609375
0.99609375
0.421875
0.421875
0.07421875
0.07421875
0.07421875
0.07421875
0.421875
0.421875
0.421875
0.421875
0.07421875
0.07421875
0.07421875
0.07421875
0.421875
0.421875
0.421875
0.421875
0.07421875
0.07421875
0.07421875
0.07421875
0.99609375
0.99609375
0.421875
0.421875
0.07421875
0.07421875
0.07421875
0.07421875
0.421875
0.421875
0.421875
0.421875
0.07421875
0.07421875
0.07421875
0.07421875
0.421875
0.421875
0.421875
0.421875
0.07421875
0.07421875
0.07421875
0.07421875
0.99609375
0.99609375
0.99609375
0.99609375
0.5
0.5
0.5
0.5
0.99609375
0.99609375
0.99609375
0.99609375
0.5
0.5
0.5
0.5
0.99609375
0.99609375
0.99609375
0.99609375
0.5
0.5
0.5
0.5
0.99609375
0.99609375
0.99609375
0.99609375
0.5
0.5
0.5
0.5
0.99609375
0.99609375
0.99609375
0.99609375
0.5
0.5
0.5
0.5
0.99609375
0.99609375
0.99609375
0.99609375
0.5
0.5
0.5
0.5
I used
Code:
convert -scale 200% -type grayscale
on one of my images, then I copied a 24x8 pixel example of the wavy line to a separate png which I used image2kernel -g on to create the kernel above. Here's the image I used.
Re: How to remove MS Word Spellcheck wavy lines
Posted: 2017-01-12T08:25:38-07:00
by snibgo
That parses without error for me, v6.9.5-3 and v7.0.3-5.
Re: How to remove MS Word Spellcheck wavy lines
Posted: 2017-01-12T09:44:47-07:00
by gddolph
Embarrassingly, the problem was my version. The system had 6.7.8-9 installed; once I upgraded to 7.0.4-3 the command worked. I say worked, as in it didn't fail, but it didn't remove anything. The reason is that it wasn't matching anything, because the morphology was being run on a negated image. I negated the image, re-ran image2kernel and then used that kernel, which worked, giving me this image:
The only thing is that it's slow. On a larger screenshot it took several seconds to run, which could be a problem given that I need to automate this to process 100k images per day. Still, it's a great start!
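Since the parse failure came down to the installed version, an automated pipeline could check the version up front. The sketch below is one way to do that; the helper names im_version and version_at_least are invented here, and 6.9.5-3 is used only because it is the oldest version reported in this thread to parse the kernel, not a documented minimum. It assumes a sort that supports -V (GNU coreutils and recent BSD do).

```shell
# Extract a version such as "6.7.8-9" from `convert -version` output read on stdin.
im_version() {
  sed -n 's/^Version: ImageMagick \([0-9][0-9.]*-[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Succeed when version $1 is at least version $2.
# Hyphens are turned into dots so that `sort -V` compares all four fields numerically.
version_at_least() {
  val_have=$(printf '%s' "$1" | sed 's/-/./g')
  val_need=$(printf '%s' "$2" | sed 's/-/./g')
  val_lowest=$(printf '%s\n%s\n' "$val_need" "$val_have" | sort -V | head -n 1)
  [ "$val_lowest" = "$val_need" ]
}

# Example:
#   v=$(convert -version | im_version)
#   version_at_least "$v" 6.9.5-3 || echo "ImageMagick $v is older than 6.9.5-3" >&2
```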
Re: How to remove MS Word Spellcheck wavy lines
Posted: 2017-01-12T10:14:10-07:00
by fmw42
You could try -connected-components and throw out any regions that are not a shade of gray. See
http://magick.imagemagick.org/script/co ... onents.php
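A rough sketch of that idea: ask ImageMagick for its verbose component listing, then pick out the components whose mean colour is not close to gray. The filter below only parses the listing; the function name non_gray_components and the tolerance argument are assumptions for illustration, and it expects lines of the form "id: WxH+X+Y centroid area srgb(r,g,b)" with 0-255 values, which may differ between ImageMagick versions.

```shell
# Print the bounding box of every component whose mean colour is not (nearly) gray.
# Reads verbose -connected-components output on stdin.
# $1 is an assumed tolerance on the 0-255 scale (default 10).
non_gray_components() {
  awk -v tol="${1:-10}" '
    {
      # Skip the header line and any entries not written as rgb(...)/rgba(...)
      if (!match($0, /rgba?\([0-9.,]+\)/)) next
      rgb = substr($0, RSTART, RLENGTH)
      sub(/^rgba?\(/, "", rgb); sub(/\)$/, "", rgb)
      if (split(rgb, c, ",") < 3) next
      max = c[1] + 0; min = c[1] + 0
      for (i = 2; i <= 3; i++) {
        if (c[i] + 0 > max) max = c[i] + 0
        if (c[i] + 0 < min) min = c[i] + 0
      }
      # A gray pixel has (almost) equal channels; field 2 is the bounding box WxH+X+Y
      if (max - min > tol) print $2
    }'
}

# Example (placeholder file name):
#   convert in.png -define connected-components:verbose=true \
#           -connected-components 8 null: | non_gray_components 10
```

The printed boxes could then be painted white in a second pass. Note that this shares the drawback raised in the first post: coloured text would be flagged along with the wavy lines, so a width or height test on the bounding box would probably be needed as well.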
Re: How to remove MS Word Spellcheck wavy lines
Posted: 2017-01-12T16:16:57-07:00
by anthony
Hmmm ... The kernel generated is for a convolution, not a dilation. I am also not sure if it is inverted.
After making the conversions, and reformatting so as to make it more 'human readable', the resulting kernel did not look much like a wavy line, but more like a hash pattern!
Code:
24x8:
1 1 1 1 - - - - 1 1 1 1 - - - - 1 1 1 1 - - - -
1 1 1 1 - - - - 1 1 1 1 - - - - 1 1 1 1 - - - -
1 1 - - 1 1 1 1 - - - - 1 1 1 1 - - - - 1 1 1 1
1 1 - - 1 1 1 1 - - - - 1 1 1 1 - - - - 1 1 1 1
- - 1 1 - - - - 1 1 1 1 - - - - 1 1 1 1 - - - -
- - 1 1 - - - - 1 1 1 1 - - - - 1 1 1 1 - - - -
- - - - 1 1 1 1 - - - - 1 1 1 1 - - - - 1 1 1 1
- - - - 1 1 1 1 - - - - 1 1 1 1 - - - - 1 1 1 1
This is obviously not a wavy line.
Grab the newer version of the image2kernel script and use the flags -gm on a white-on-black copy of the image so as to generate the right type of kernel.
The 'm' flag has the script convert the image into a morphological kernel (thresholded values of '1' and '-', the latter meaning not part of the neighbourhood). It should work better in matching edges of the wavy lines.
See the new 'flag' examples at the bottom of...
http://www.imagemagick.org/Usage/morphology/#user
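To eyeball an existing one-value-per-line kernel file in the row layout shown above, something like the following could help. It is not the image2kernel script and does not reproduce its conversions: it is only a sketch that reshapes the values into rows and thresholds them, with the function name kernel_to_rows and the default threshold of 0.5 chosen here as assumptions. Values at or above the threshold print as '1', the rest as '-'.

```shell
# Reshape a flat kernel file (header "WxH:" on line 1, then whitespace-separated
# values) into H rows of W entries, thresholded to '1' and '-'.
# $1 is an assumed threshold (default 0.5); the kernel is read on stdin.
kernel_to_rows() {
  awk -v t="${1:-0.5}" '
    NR == 1 {
      # Take the width from a header such as "24x8:"
      split($1, dim, /[x:]/)
      w = dim[1] + 0
      print $1
      next
    }
    {
      for (i = 1; i <= NF; i++) {
        n++
        # End the row after every w values, otherwise separate with a space
        printf("%s%s", ($i + 0 >= t) ? "1" : "-", (n % w == 0) ? "\n" : " ")
      }
    }'
}

# Example:
#   kernel_to_rows 0.5 < ms_wavy_kernel.dat
```

If the printed pattern does not look like the wavy line, that is a hint the source image was the wrong polarity or the crop was off, before any time is spent running the morphology.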