Legacy ImageMagick Discussions Archive

This is my first post here, so hi everyone.

I have been using ImageMagick to pre-process images in preparation for OCR. The input is typewriter copy, which is a problem because some characters were typed with less force and so are considerably lighter than others. The other problem is that the white background varies in brightness across the image (I have no control over the scanning process). The OCR output was crummy.

Then I explored a bit, and found this useful line:
$ convert 1345.jpg -colorspace gray \( +clone -blur 0x20 \) -compose Divide_Src -composite 1345photocopy1.jpg
# ref http://www.imagemagick.org/Usage/compose/#divide
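The blur sigma (the 20 in 0x20) sets how coarse the background estimate is, so it may be worth trying a few values and comparing the OCR output. Below is a minimal sketch of such a sweep. The function name run_divide, the DRY_RUN switch, the sigma values and the output file names are all made up for illustration, and the parentheses around +clone follow the example on the ImageMagick Usage page referenced above.

```shell
#!/bin/sh
# Sketch: try the divide-by-blurred-copy step with several blur sigmas.
# DRY_RUN=1 (the default here) only prints the commands; DRY_RUN=0 runs them.
# run_divide, DRY_RUN and the output names are illustrative, not part of ImageMagick.
DRY_RUN="${DRY_RUN:-1}"

# run_divide SIGMA INPUT OUTPUT
run_divide() {
    if [ "$DRY_RUN" = 1 ]; then
        # Printed form assumes file names without spaces.
        printf 'convert %s -colorspace gray \\( +clone -blur 0x%s \\) -compose Divide_Src -composite %s\n' "$2" "$1" "$3"
    else
        convert "$2" -colorspace gray \( +clone -blur "0x$1" \) \
            -compose Divide_Src -composite "$3"
    fi
}

# Sigma values are guesses; larger values give a smoother background estimate.
for sigma in 10 20 40; do
    run_divide "$sigma" 1345.jpg "1345_divide_s${sigma}.jpg"
done
```

Run it once as is to check the printed commands, then with DRY_RUN=0 to write the files and feed each one to the OCR step.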

After this, the OCR output is quite good, but not perfect. Are there other things I could do in ImageMagick to improve this?

If you are on Linux, Mac OS X, or Windows with Cygwin, you can try my script textcleaner at the link below.

textcleaner -g -e normalize -f 15 -o 10 -t 35 snippetFrom1345.png snippetFrom1345_g_norm_f15_o10_35.png

Otherwise, use the ImageMagick -lat option (local adaptive threshold):

convert snippetFrom1345.png -negate -lat 15x15+10% -negate snippetFrom1345_lat_15x15_10.png

Adjust the 15x15 window size as desired.
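One way to find a good setting is to generate a small grid of -lat results and run the OCR on each. The sketch below does that. The function name run_lat, the DRY_RUN switch, the window and offset values and the output file names are assumptions for illustration only. The -negate before and after -lat is kept from the command above.

```shell
#!/bin/sh
# Sketch: sweep the -lat window size and offset for the snippet above.
# DRY_RUN=1 (the default here) only prints the commands; DRY_RUN=0 runs them.
# run_lat, DRY_RUN and the output names are illustrative, not part of ImageMagick.
DRY_RUN="${DRY_RUN:-1}"

# run_lat WINDOW OFFSET_PERCENT INPUT OUTPUT
run_lat() {
    if [ "$DRY_RUN" = 1 ]; then
        # Printed form assumes file names without spaces.
        printf 'convert %s -negate -lat %sx%s+%s%% -negate %s\n' "$3" "$1" "$1" "$2" "$4"
    else
        convert "$3" -negate -lat "${1}x${1}+${2}%" -negate "$4"
    fi
}

# Window sizes and offsets are guesses to start from, not recommendations.
for w in 10 15 25; do
    for o in 5 10; do
        run_lat "$w" "$o" snippetFrom1345.png "snippetFrom1345_lat_${w}x${w}_${o}.png"
    done
done
```

A window roughly the size of a character or larger is a reasonable place to start; comparing the OCR text from each output shows which combination suits a given scan.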

Thanks, the -lat command line gives better results than the -divide line. Both work better than the old way, where I just brightened the input by 50%. I have not tried the textcleaner script yet but would like to.

But now I am using more CPU time for 'convert' than for 'tesseract'. Yikes!

Oops, sorry, hold that last post. The -lat method gives more words in the OCR output, but many of them are incorrect or broken up. The -divide method gives almost correct text. I should try other values for 15x15.

OCR of typewriter copy