Pre-processing for OCR with Tesseract. Digits only.
Posted: 2016-01-28T05:47:33-07:00
Hi,
I have to process screenshots of Excel-style tables that contain mostly numbers. Since the screenshot always has the same size and all columns are at static pixel positions, I have implemented a script that crops the individual cells out into small images. They have a very low resolution, just 15-19 pixels in height. The digits have different foreground and background colors, and sometimes the background color is produced by dithering (PNG), which makes OCR much more difficult. I have tried several techniques to improve the image quality, and OCR already works quite well, but I still get some mistakes, so I would like to ask whether you have any other suggestions for improving the images. Below are some examples of originals and my improved results.
[original image] -> [processed image]
options used: -scale 1000% -blur 0x02 -colorspace gray -threshold 60% -type bilevel
---
[original image] -> [processed image]
options used: -scale 1000% -posterize 2 -blur 0x02 -colorspace gray -threshold 65% -type bilevel
options used: -scale 1000% -blur 0x02 -colorspace gray -threshold 17% -type bilevel
---
I detect the foreground and background colors of each cell and, based on the result, use different arguments for convert.
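As an illustration of that per-cell selection, here is a minimal sketch in Python (the actual script is Perl, so this only shows the idea). The function names `luminance` and `convert_args` are hypothetical, and so are two choices that are not in the settings above: placing the threshold halfway between the foreground and background brightness, and appending `-negate` for light digits on a dark background so the result is always dark on white. The grey levels that convert computes may differ somewhat from the Rec. 601 estimate used here, so the value would still need checking against real cells.

```python
def luminance(rgb):
    """Approximate brightness of an (R, G, B) colour in 0.0-1.0 (Rec. 601 weights)."""
    r, g, b = rgb
    return (0.299 * r + 0.587 * g + 0.114 * b) / 255.0


def convert_args(fg, bg, scale_percent=1000, blur="0x02"):
    """Build the convert option list for one cropped cell.

    fg and bg are the detected foreground and background colours as
    (R, G, B) tuples. Assumption: a threshold midway between the two
    brightness values separates digits from background.
    """
    fg_l = luminance(fg)
    bg_l = luminance(bg)
    threshold = round(100 * (fg_l + bg_l) / 2)

    args = [
        "-scale", f"{scale_percent}%",
        "-blur", blur,
        "-colorspace", "gray",
        "-threshold", f"{threshold}%",
    ]
    # Light digits on a dark background: invert after thresholding so the
    # OCR step always sees dark digits on white (an assumption, not one of
    # the original settings).
    if fg_l > bg_l:
        args.append("-negate")
    args += ["-type", "bilevel"]
    return args
```

The returned list could then be passed to convert between the input and output file names, for example with `subprocess.run(["convert", "cell.png", *convert_args(fg, bg), "out.png"])`, where `cell.png` and `out.png` are placeholder names.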
I am still a beginner with ImageMagick and with image processing in general, so I am wondering whether there are other ways to improve the OCR readability of those images. I should add that I have not yet trained Tesseract on those images. Training seems to be quite a complicated process. I am willing to do it, but I would like to get the best out of image processing first. In particular, I would like to avoid neighboring characters overlapping, which is caused by the blur I am applying.
Thanks for your feedback.
I am using Ubuntu and the command-line version of convert, called from a Perl script, since I did not find all of the options in PerlMagick.
Version: ImageMagick 6.7.7-10 2014-03-06 Q16 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2012 ImageMagick Studio LLC
Features: OpenMP