This is my first post here, so hi everyone.
I have been using ImageMagick to pre-process images for OCR. The input is typewritten copy, which is a problem because some characters were struck with less force and so are considerably lighter than others. The other problem is that the brightness of the white background varies across the image (I have no control over the scanning process). The OCR output was poor.
Then I explored a bit and found this useful line:
$ convert 1345.jpg -colorspace gray \( +clone -blur 0x20 \) -compose Divide_Src -composite 1345photocopy1.jpg
# ref http://www.imagemagick.org/Usage/compose/#divide
After this, the OCR output is quite good, thanks! But it is not perfect. Are there other things I could do in ImageMagick to improve it?
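The line above divides the grayscale image by a heavily blurred copy of itself, which evens out the uneven background. One knob that can be tuned is the blur sigma (20 above). A minimal sketch of a wrapper for experimenting with it follows; the function name `flatten_background` and the `DRY_RUN` switch are made up for illustration and are not part of ImageMagick.

```shell
# Sketch: wrap the divide-by-blur step so the blur sigma can be varied.
# flatten_background and DRY_RUN are hypothetical names, not ImageMagick features.
# The body runs in a subshell ( ... ) so its variables stay local.
flatten_background() (
    in=$1
    out=$2
    sigma=${3:-20}   # defaults to the 0x20 blur used in the command above
    # With DRY_RUN set to a non-empty value, print the command instead of running it.
    ${DRY_RUN:+echo} convert "$in" -colorspace gray \
        '(' +clone -blur "0x${sigma}" ')' \
        -compose Divide_Src -composite "$out"
)
```

For example, `flatten_background 1345.jpg 1345_s40.jpg 40` would repeat the step with a wider blur. Which sigma works best presumably depends on the scan resolution and on how quickly the background brightness changes, so it has to be found by trial.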
OCR of typewriter copy
- fmw42
Re: OCR of typewriter copy
If you are on Linux, Mac OS X, or Windows with Cygwin, you can try my script textcleaner at the link below.
textcleaner -g -e normalize -f 15 -o 10 -t 35 snippetFrom1345.png snippetFrom1345_g_norm_f15_o10_35.png
Otherwise, use the IM function -lat (local adaptive threshold):
convert snippetFrom1345.png -negate -lat 15x15+10% -negate snippetFrom1345_lat_15x15_10.png
Adjust the 15x15 window size as desired.
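One way to explore the window size is to loop over a few candidates and run each result through tesseract. The following is only a sketch: `sweep_lat`, the `DRY_RUN` switch, the output file naming and the list of sizes are assumptions chosen for illustration.

```shell
# Sketch: try several -lat window sizes and count the words tesseract finds.
# sweep_lat, DRY_RUN and the candidate sizes are illustrative assumptions.
# The body runs in a subshell ( ... ) so its variables stay local.
sweep_lat() (
    in=$1
    offset=${2:-10}   # percent offset, 10 as in the command above
    for size in 9 15 25 35; do
        out="${in%.*}_lat_${size}x${size}_${offset}.png"
        if [ -n "${DRY_RUN:-}" ]; then
            # Only show what would be run.
            echo "convert $in -negate -lat ${size}x${size}+${offset}% -negate $out"
            continue
        fi
        convert "$in" -negate -lat "${size}x${size}+${offset}%" -negate "$out"
        # Passing "stdout" as the output base makes tesseract print the text.
        words=$(tesseract "$out" stdout 2>/dev/null | wc -w)
        echo "$size $words"
    done
)
```

Calling `sweep_lat snippetFrom1345.png` prints one line per window size with the number of words recognized. The word count is only a rough indicator, since it says nothing about whether the words are correct, so the OCR text still needs to be checked by eye.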
Re: OCR of typewriter copy
Thanks, the -lat command line gives better results than the Divide_Src line. Both work better than my old approach, where I just brightened the input by 50%. I have not tried the textcleaner line yet but would like to.
But now I am using more CPU time for 'convert' than for 'tesseract'. Yikes!
Re: OCR of typewriter copy
Oops, sorry, disregard that last post. The -lat method gives more words in the OCR output, but they include incorrect and broken-up words. The Divide_Src method gives almost correct text. I should try other values for 15x15.