Pre-processing for OCR (outlined font)
Posted: 2013-01-18T02:38:57-07:00
by Wolfgang Woehl
I'm trying to improve tesseract's OCR accuracy on text rendered in an outline font. The letter spacing is tight enough that many letter pairs touch:
http://minus.com/lyPtV8gTWD0SP. Feeding this original through tesseract produces nothing useful (the output is completely garbled).
My best shot at it so far is trying to pick out the glyphs' meat by merging the black outline pixels with the background via floodfill:
Code:
convert original.tif -fill black -draw 'color 5,5 floodfill' -negate output.tif
which results in
http://minus.com/lbup7GGmpnwy6I. This improves tesseract's output dramatically, but the text is still garbled wherever the floodfill cannot reach (where outlines touch), because those spots are left behind as "insets". E.g. "wollte" turns into "wtalltue" because of the artefacts in "o" and between "t" and "e".
Improvements or, rather, a better idea much appreciated. Thanks in advance.
Version: ImageMagick 6.7.8-10 2012-10-07 Q16
http://www.imagemagick.org (Linux)
Re: Pre-processing for OCR (outlined font)
Posted: 2013-01-18T10:57:09-07:00
by snibgo
Morphology might be useful here. For example:
Code:
"%IMG%convert" wollte.jpg ^
-fuzz 50%% ^
-fill Black ^
-floodfill 0x0 White ^
w1.png
"%IMG%convert" w1.png ^
-morphology Hit-and-Miss "1x8:1,0,1,1,0,0,0,0" ^
w2.png
This identifies areas where the pixels are, reading downwards: white, black, white, white, and then 4 blacks. These are most of the areas that are incorrectly left white.
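To make the kernel match concrete, here is a minimal pure-Python sketch of a hit-and-miss test with a 1-wide, 8-tall kernel. It is not ImageMagick's implementation. The function name, the list-of-lists image and the origin row (index 3, the centre row rounded down) are assumptions for illustration, and windows that would overhang the image edge are simply skipped.

```python
# Pixel values: 1 = white, 0 = black.
KERNEL = [1, 0, 1, 1, 0, 0, 0, 0]  # read top to bottom, as in "1x8:1,0,1,1,0,0,0,0"
ORIGIN = 3                          # assumed origin row: (8 - 1) // 2


def hit_and_miss_column(image, kernel=KERNEL, origin=ORIGIN):
    """Return a same-sized mask with 1 wherever the vertical kernel matches exactly."""
    height = len(image)
    width = len(image[0])
    mask = [[0] * width for _ in range(height)]
    for x in range(width):
        # Slide the kernel down the column; skip windows that do not fit.
        for top in range(height - len(kernel) + 1):
            window = [image[top + i][x] for i in range(len(kernel))]
            if window == list(kernel):
                mask[top + origin][x] = 1
    return mask


# A 3-pixel-wide test image whose middle column contains the pattern once.
column = [0, 1, 0, 1, 1, 0, 0, 0, 0, 0]
image = [[0, value, 0] for value in column]
mask = hit_and_miss_column(image)
hits = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
print(hits)  # [(1, 4)]
```

The hit is reported at the origin row, three rows below the stray white pixel at the top of the pattern.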
Re: Pre-processing for OCR (outlined font)
Posted: 2013-01-18T16:54:00-07:00
by Wolfgang Woehl
snibgo, very interesting idea indeed. Thanks for the suggestion. Going to experiment with it tomorrow. Hardcoded pixel matches though, yes? That's probably going to be a problem for variable-sized input.
Re: Pre-processing for OCR (outlined font)
Posted: 2013-01-18T17:36:14-07:00
by snibgo
I should have said: the script is Windows Bat; adjust as required for other languages. It works for any size of input.
Re: Pre-processing for OCR (outlined font)
Posted: 2013-01-18T17:38:31-07:00
by snibgo
(Well, any size at least 1x8 pixels.)
Re: Pre-processing for OCR (outlined font)
Posted: 2013-01-19T06:42:47-07:00
by Wolfgang Woehl
Right, but the kernels are fixed-size. Thus whatever it will match is fixed-size, from what I understand?
I found a related topic on area opening and closing: "Morphology, area open and close". This is about selecting contiguous areas bigger (or smaller) than a specific amount of pixels. In conjunction with some neighbourhood checking this might be feasible (if it were a feature in the first place), right?
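As a rough illustration of the area-opening idea, the sketch below blanks every connected white area smaller than a given pixel count. It is plain Python rather than an ImageMagick feature. The function name, the 4-connectivity and the list-of-lists image are assumptions. A size threshold alone cannot tell an inset from glyph "meat", so the neighbourhood check mentioned above would still be needed.

```python
def remove_small_areas(image, min_area):
    """Return a copy of image in which every 4-connected white area (1s)
    with fewer than min_area pixels is set to black (0)."""
    height, width = len(image), len(image[0])
    result = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]
    for y0 in range(height):
        for x0 in range(width):
            if image[y0][x0] != 1 or seen[y0][x0]:
                continue
            # Collect one connected area with an explicit stack.
            area = []
            stack = [(x0, y0)]
            seen[y0][x0] = True
            while stack:
                x, y = stack.pop()
                area.append((x, y))
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and image[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((nx, ny))
            # Blank the area if it is below the size threshold.
            if len(area) < min_area:
                for x, y in area:
                    result[y][x] = 0
    return result


image = [
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
cleaned = remove_small_areas(image, 2)  # drops the lone pixel, keeps the 2x2 block
```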
Re: Pre-processing for OCR (outlined font)
Posted: 2013-01-19T11:27:21-07:00
by snibgo
Ah, I see what you mean. Yes, different font sizes would need different kernels.
My fragment above is a building block to remove the unfilled pixels near the bottom of characters. The kernel can be inverted for unfilled pixels near the top of characters. That leaves a few isolated pixels, which a third pass can remove.
A complete Windows Bat script that gives perfect results for your sample file is below. It isn't fast, because of the repeated sub-image search for white pixels. Performance would be greatly improved by dumping w2.tiff to a text file and looping through it, floodfilling w1.tiff for each white-ish pixel in w2.tiff.
If your files have different font sizes, other morphology methods may be better. See
http://www.imagemagick.org/Usage/morphology/ .
Code:
"%IMG%convert" wollte.jpg ^
-fuzz 50%% ^
-fill Black ^
-floodfill 0x0 White ^
-alpha off ^
-threshold 50%% ^
-depth 8 ^
w1.tiff
rem Find unfilled pixels near the bottom of characters.
"%IMG%convert" w1.tiff ^
-morphology Hit-and-Miss "1x8:1,0,1,1,0,0,0,0" ^
w2.tiff
:Loop1
rem Find a white pixel
"%IMG%compare" ^
-metric pae -dissimilarity-threshold 1 ^
w2.tiff ^
-size 1x1 xc:white ^
-subimage-search ^
null: 2>wollteWhite.lis
type wollteWhite.lis
for /f "tokens=2,3,4 delims=()@, " %%a ^
in (wollteWhite.lis) ^
do (
set score=%%a
set foundX=%%b
set foundY=%%c
)
if /I "%score%" gtr "0.1" goto noMore1
set /A imgY=%foundY%-3
"%IMG%convert" w1.tiff -fuzz 25%% -fill Black -draw ^"color %foundX%,%imgY% floodfill^" w1.tiff
"%IMG%convert" w2.tiff -fuzz 25%% -fill Black -draw ^"color %foundX%,%foundY% floodfill^" w2.tiff
goto Loop1
:noMore1
rem Find unfilled pixels near the top of characters.
"%IMG%convert" w1.tiff ^
-threshold 50%% ^
-morphology Hit-and-Miss "1x8:0,0,0,0,1,1,0,1" ^
-threshold 50%% ^
-depth 8 ^
w2.tiff
:Loop2
rem Find a white pixel
"%IMG%compare" ^
-metric pae -dissimilarity-threshold 1 ^
w2.tiff ^
-size 1x1 xc:white ^
-subimage-search ^
null: 2>wollteWhite.lis
type wollteWhite.lis
for /f "tokens=2,3,4 delims=()@, " %%a ^
in (wollteWhite.lis) ^
do (
set score=%%a
set foundX=%%b
set foundY=%%c
)
if /I "%score%" gtr "0.1" goto noMore2
set /A imgY=%foundY%+4
"%IMG%convert" w1.tiff ^
-fuzz 50%% -fill Black -draw ^"color %foundX%,%imgY% floodfill^" ^
-threshold 50%% ^
-depth 8 ^
w1.tiff
"%IMG%convert" w2.tiff ^
-fuzz 50%% -fill Black -draw ^"color %foundX%,%foundY% floodfill^" ^
-threshold 50%% ^
-depth 8 ^
w2.tiff
goto Loop2
:noMore2
rem Eliminate single white pixels
"%IMG%convert" w1.tiff ^
-threshold 50%% ^
-morphology Hit-and-Miss "3x3:-,0,-,0,1,0,-,0,-" ^
-threshold 50%% ^
-depth 8 ^
w2.tiff
:Loop3
rem Find a white pixel
"%IMG%compare" ^
-metric pae -dissimilarity-threshold 1 ^
w2.tiff ^
-size 1x1 xc:white ^
-subimage-search ^
null: 2>wollteWhite.lis
type wollteWhite.lis
for /f "tokens=2,3,4 delims=()@, " %%a ^
in (wollteWhite.lis) ^
do (
set score=%%a
set foundX=%%b
set foundY=%%c
)
if /I "%score%" gtr "0.1" goto noMore3
set /A imgY=%foundY%
"%IMG%convert" w1.tiff ^
-fuzz 50%% -fill Black -draw ^"color %foundX%,%imgY% floodfill^" ^
-threshold 50%% ^
-depth 8 ^
w1.tiff
"%IMG%convert" w2.tiff ^
-fuzz 50%% -fill Black -draw ^"color %foundX%,%foundY% floodfill^" ^
-threshold 50%% ^
-depth 8 ^
w2.tiff
goto Loop3
:noMore3
rem Finished. w1.tiff contains the result.
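The speed-up suggested before the script (loop once over the white pixels of w2.tiff and floodfill w1.tiff for each) can be sketched in Python as follows. This is only an outline under assumptions: both images are already in memory as lists of 0/1 rows, the floodfill is 4-connected, and the names `flood_fill` and `fill_marked_areas` are made up for this example. The `dy` offset plays the role of the script's `imgY` adjustment (-3, +4 or 0 depending on the pass).

```python
def flood_fill(image, x, y, new_value):
    """4-connected flood fill, in place, starting at (x, y)."""
    height, width = len(image), len(image[0])
    old_value = image[y][x]
    if old_value == new_value:
        return
    image[y][x] = new_value
    stack = [(x, y)]
    while stack:
        cx, cy = stack.pop()
        for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
            if 0 <= nx < width and 0 <= ny < height and image[ny][nx] == old_value:
                image[ny][nx] = new_value
                stack.append((nx, ny))


def fill_marked_areas(image, mask, dy):
    """For each white mask pixel (x, y), floodfill image with black at (x, y + dy)."""
    height = len(image)
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            target_y = y + dy
            # Only fill where the target exists and is still white.
            if value == 1 and 0 <= target_y < height and image[target_y][x] == 1:
                flood_fill(image, x, target_y, 0)
    return image


# One stray white pixel at (1, 1); the mask marks the match three rows below it.
image = [[0, v, 0] for v in [0, 1, 0, 1, 1, 0, 0, 0, 0, 0]]
mask = [[0, 0, 0] for _ in range(10)]
mask[4][1] = 1
fill_marked_areas(image, mask, -3)  # blanks (1, 1), leaves rows 3 and 4 white
```

Because every marked pixel is visited exactly once, no repeated sub-image search is needed, and the mask does not have to be floodfilled to avoid finding the same spot again.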
Re: Pre-processing for OCR (outlined font)
Posted: 2013-01-20T14:21:56-07:00
by Wolfgang Woehl
Ok, there's one way to do it -- if somewhat cumbersome and, indeed, horrifyingly slow.
Thanks for the effort, snibgo! The problem with this approach, I think, is that it deals with the artefacts of an initial operation (background floodfill with the outline color) which is not suitable in the first place. That idea was really only my first baby step towards a better understanding of the problem.
I'm experimenting with another observation: Insets in outlined fonts are surrounded, at least partially, by glyph "meat". Assuming a specific search direction (left-to-right or top-to-bottom), once you encounter an outline pixel (black in this case) with neighbouring background pixels (white in this case) the following pixel should be inside glyph "meat". Floodfill with a marker color there. The next outline pixel will either lead to background or to an "inset". The check for neighbouring background pixels would fail there because the surrounding outline is filled already with marker. Mixed results so far. With the densely packed outlines I have here some locations will fail with left-to-right search.
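A minimal Python sketch of this left-to-right scan, under several assumptions: the image is a list of strings with "." for white and "#" for outline, the top-left pixel is page background, only the left neighbour is checked for background, and all names are invented for the example. It has the limitation described above: where outlines touch so that no "meat" lies between background and inset, the scan marks the wrong area.

```python
WHITE, OUTLINE, BACKGROUND, MARKER = ".", "#", "b", "m"


def flood_fill(grid, x, y, new_value):
    """4-connected flood fill, in place, starting at (x, y)."""
    height, width = len(grid), len(grid[0])
    old_value = grid[y][x]
    if old_value == new_value:
        return
    grid[y][x] = new_value
    stack = [(x, y)]
    while stack:
        cx, cy = stack.pop()
        for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
            if 0 <= nx < width and 0 <= ny < height and grid[ny][nx] == old_value:
                grid[ny][nx] = new_value
                stack.append((nx, ny))


def mark_glyph_meat(rows):
    """Label page background as 'b' and glyph interiors as 'm'; insets stay '.'."""
    grid = [list(row) for row in rows]
    width = len(grid[0])
    flood_fill(grid, 0, 0, BACKGROUND)  # assumption: (0, 0) is page background
    for y, row in enumerate(grid):
        x = 1
        while x < width:
            if row[x] == OUTLINE and row[x - 1] == BACKGROUND:
                # Step over this run of outline pixels.
                while x < width and row[x] == OUTLINE:
                    x += 1
                # The first pixel after the run should be glyph "meat".
                if x < width and row[x] == WHITE:
                    flood_fill(grid, x, y, MARKER)
            else:
                x += 1
    return ["".join(row) for row in grid]


letter_o = [
    ".........",
    ".#######.",
    ".#.....#.",
    ".#.###.#.",
    ".#.#.#.#.",
    ".#.###.#.",
    ".#.....#.",
    ".#######.",
    ".........",
]
labelled = mark_glyph_meat(letter_o)
print(labelled[4])  # b#m#.#m#b
```

In the toy "o", the ring between the two outlines is marked while the centre stays white, because the inner outline is only ever reached from marked pixels, never from background.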
Re: Pre-processing for OCR (outlined font)
Posted: 2013-01-20T17:50:50-07:00
by snibgo
The speed of my script can be increased by a couple of orders of magnitude, so that needn't be a major concern (depending on the volume of work, of course).
I looked at a few ways of getting to the characters without also getting the centre of the "o", the gap between "t" and "e", between "w" and "a", and so on. I couldn't quickly find a method that was simpler than my script. (Which doesn't mean that no simpler solution exists, of course.)
The real problems start if you need a general solution for different font sizes or even different fonts. For example, if the font is constant but the size isn't, you might search for every "a", then every "b", and so on.
Re: Pre-processing for OCR (outlined font)
Posted: 2013-01-20T18:00:31-07:00
by snibgo
Wolfgang Woehl wrote:This is about selecting contiguous areas bigger (or smaller) than a specific amount of pixels. In conjunction with some neighbourhood checking this might be feasible (if it were a feature in the first place), right?
Morphology can currently find areas bigger or smaller than various dimensions. (Check out "distance".) While this might be useful, it doesn't offer an immediate solution, as the hole in "o" is larger than the dot in "i", for example.
Re: Pre-processing for OCR (outlined font)
Posted: 2013-01-21T12:17:15-07:00
by Wolfgang Woehl
Yes, it's an attractively hard problem for any kind of non-OCR approach. From an OCR-centric point of view, though, it might be close to trivial: shape recognition plus some intelligence about the concept of outlines.