I am using ImageMagick and the textcleaner script to preprocess image files for tesseract OCR, and while I'm having good success so far, I've run into one problem that I need some help to solve. One of my use cases is processing text from screenshots, and I'm finding that tesseract is getting confused by the red or blue wavy underlines MS Word adds for spelling and grammar errors. Here's an example:
I am looking for a way to get rid of the red and blue lines, and I have had some success with the following command set:
[devbox@fraitcf1vd1998 images]$ convert 20170110/deliberate_mistakes1.png -sharpen 0x1.0 -fuzz 30% -fill white -opaque 'rgb(255,0,0)' \
> -opaque 'rgb(0,0,255)' -scale 200% miff:- |\
> ./textcleaner -g -e stretch -f 50 -o 10 -s 1 - png:- |\
> tesseract - stdout
This is some random text with a missspelt word im it and, grammar mistake.
And the textcleaner section gives me this:
The problem I have is that this is a blunt instrument: it changes every instance of red or blue (or anything within the 30% fuzz of them) into white. If all I wanted to do was read black text on white backgrounds then I'd be overjoyed with this solution; however, I will have images with text in multiple colors.
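To show how wide the net is with -fuzz 30%, here is a rough sketch that estimates which colors would be caught. It assumes the fuzz is a Euclidean distance in 8-bit RGB space with 100% equal to the black-to-white distance, and it ignores alpha. The helper name within_fuzz is made up for this illustration and is not part of ImageMagick.

```shell
#!/bin/sh
# Illustration only, not an ImageMagick command.
# ASSUMPTION: "-fuzz PCT%" behaves like a Euclidean distance threshold in
# 8-bit RGB space, where 100% is the black-to-white distance (sqrt(3) * 255).
# Alpha is ignored. within_fuzz is a hypothetical helper for this post.
within_fuzz() {
    # usage: within_fuzz R G B TARGET_R TARGET_G TARGET_B PERCENT
    awk -v r="$1" -v g="$2" -v b="$3" \
        -v tr="$4" -v tg="$5" -v tb="$6" -v pct="$7" 'BEGIN {
        dist  = sqrt((r - tr)^2 + (g - tg)^2 + (b - tb)^2)
        limit = pct / 100 * sqrt(3) * 255
        if (dist <= limit) print "match"; else print "keep"
    }'
}

# Compare a few candidate colors against pure red at 30% fuzz.
within_fuzz 255   0  0  255 0 0 30   # the red underline itself
within_fuzz 200  30 30  255 0 0 30   # darker red body text
within_fuzz 255 128  0  255 0 0 30   # orange text
within_fuzz   0   0  0  255 0 0 30   # black text
```

Under that assumption the first three colors fall inside the threshold and only black is kept, so red-ish or orange text would be whited out along with the underlines.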
One approach I've thought about is trying to detect the wavy line shape as it is very distinctive, and I'm thinking morphology might do it for me, but I have to confess I'm lost with the documentation.
Does anyone have a suggested approach and/or code?