
Grayscale background removal for OCR

Posted: 2019-03-19T12:34:02-07:00
by SteelMassimo
Hello everyone!

I'm having the following issue:

I need to prepare a large number of images for OCR. The problem is that the information is written in a part of the document that has a grayscale background. Also, the scanner that produced this image did so in color, so when the OCR tries to read it, it makes a mess of it.

I tried the following command, but the result has a lot of noise where the grayscale background used to be.

Code:

convert 0068_example.jpg -type Grayscale -brightness-contrast +15x100 0068_result.jpg
The original image: https://drive.google.com/open?id=1ugHSQ ... zOYsahLVE9

The result: https://drive.google.com/open?id=1P1j9B ... k3284sFBVk

I've also tried converting to B&W and then blurring a bit to avoid overly pixelated images, but the results were inconsistent.

Any suggestions for making it cleaner for OCR?

Thanks!

IM version: 7.0.8-27-Q16-x64
OS: Windows 7 Pro 64-bit.
Version date: 2019-01-27.

Re: Grayscale background removal for OCR

Posted: 2019-03-19T16:54:41-07:00
by fmw42
Perhaps something like this will help. https://stackoverflow.com/questions/530 ... 9#53016979

Re: Grayscale background removal for OCR

Posted: 2019-03-20T08:52:47-07:00
by SteelMassimo
Hello again fmw42!

Here's what I did, since my problem differs a little:

I removed the lines that deal with trimming, thinning and border color. It ended up something like this:

Code:

convert 0068_example.jpg -type Grayscale ^
-fuzz 22% ^
-define connected-components:remove=0 ^
-define connected-components:mean-color=true ^
-connected-components 4 ^
-background white -flatten ^
result_example.jpg
And here's the output: https://drive.google.com/file/d/1S7TdmN ... sp=sharing

The list of objects reported by -connected-components is far too long to be of any use. Still, here's the link for it:

https://drive.google.com/file/d/1Jhp8ps ... sp=sharing

The image still has a lot of noise from the gray background strip. What should I change so that it better recognizes the objects I want to remove?

Re: Grayscale background removal for OCR

Posted: 2019-03-20T11:54:52-07:00
by fmw42
Try something like this:

Code:

convert 0068_example.jpg -type Grayscale -threshold 25% ^
-define connected-components:area-threshold=5 ^
-define connected-components:mean-color=true ^
-connected-components 4 ^
result.png
You may have to remove the long horizontal lines and the table lines. See https://stackoverflow.com/questions/540 ... 6#54044746

I strongly suggest that you not save to JPG. JPG compression is lossy, so it will just degrade your image further and make OCR harder. Also, if possible, do not scan to JPG (or PDF). Use TIFF (LZW compressed if available) or PNG instead.
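For readers who want to see what the threshold and area-threshold steps in the command above are doing, here is a minimal pure-Python sketch on a tiny made-up pixel grid. It is an illustration only, not ImageMagick's implementation: the function names, the sample values, and the "smaller than min_area" cutoff are all assumptions. On a two-color image, merging a small region into its neighbor amounts to flipping it to the opposite color, which is what the sketch does.

```python
from collections import deque


def threshold(gray, percent, max_value=255):
    """Binarize: values above percent of max_value become 1 (white), the rest 0 (black)."""
    cutoff = max_value * percent / 100.0
    return [[1 if v > cutoff else 0 for v in row] for row in gray]


def remove_small_components(binary, min_area, connectivity=4):
    """Flip every connected region with fewer than min_area pixels to the opposite color."""
    if connectivity == 4:
        steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    else:  # 8-connectivity also joins diagonal neighbors
        steps = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dy, dx) != (0, 0)]
    h, w = len(binary), len(binary[0])
    out = [row[:] for row in binary]
    seen = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if seen[y][x]:
                continue
            color = binary[y][x]
            region = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:  # breadth-first flood fill of one region
                cy, cx = queue.popleft()
                region.append((cy, cx))
                for dy, dx in steps:
                    ny, nx = cy + dy, cx + dx
                    if 0 <= ny < h and 0 <= nx < w and not seen[ny][nx] and binary[ny][nx] == color:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(region) < min_area:
                for ry, rx in region:
                    out[ry][rx] = 1 - color
    return out


# Made-up sample: gray strip (120), a dark 3-pixel stroke (20), one dark speck (40).
gray = [
    [200, 200, 200, 200, 200, 200],
    [120, 20, 120, 120, 40, 120],
    [120, 20, 120, 120, 120, 120],
    [120, 20, 120, 120, 120, 120],
    [200, 200, 200, 200, 200, 200],
]
binary = threshold(gray, 25)
cleaned = remove_small_components(binary, min_area=3, connectivity=4)
for row in cleaned:
    print("".join("." if v else "#" for v in row))
```

In this sample the gray strip goes white at the threshold step, the isolated speck is dropped by the area filter, and the 3-pixel stroke survives. Raising min_area removes more noise but risks eating small glyph parts such as dots and commas.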

Re: Grayscale background removal for OCR

Posted: 2019-03-22T11:48:32-07:00
by SteelMassimo
Hey fmw42!

Worked like a charm. After some tweaks to the threshold and the connected-components area threshold, the output is good enough for OCR.

There's no need to remove the black line; it does not interfere. Thanks for the help!

Final code:

Code:

convert 0001.jpg -type Grayscale -threshold 35% -define connected-components:area-threshold=3 -define connected-components:mean-color=true -connected-components 8 0001.jpg
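Since the original goal was a large batch of scans, one possible way to apply the final command to a whole folder is sketched below in Python. Treat it as a rough sketch under assumptions: the helper names and the `scans` / `cleaned` folder names are hypothetical, it calls `magick` (the ImageMagick 7 entry point) rather than `convert`, and it writes PNG output to a separate folder, following the earlier advice about JPG, instead of overwriting the input files.

```python
import subprocess
from pathlib import Path


def build_command(src, dst, threshold_pct=35, min_area=3, connectivity=8):
    """Return the argument list for one image, mirroring the final command in this thread."""
    return [
        "magick", str(src),
        "-type", "Grayscale",
        "-threshold", f"{threshold_pct}%",
        "-define", f"connected-components:area-threshold={min_area}",
        "-define", "connected-components:mean-color=true",
        "-connected-components", str(connectivity),
        str(dst),
    ]


def clean_folder(folder, out_folder):
    """Process every .jpg in folder and write a .png with the same base name to out_folder."""
    out_folder = Path(out_folder)
    out_folder.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(folder).glob("*.jpg")):
        dst = out_folder / (src.stem + ".png")
        # check=True stops the batch on the first file ImageMagick fails on
        subprocess.run(build_command(src, dst), check=True)


# Show the command that would be run for a single file (nothing is executed here).
example = build_command("0001.jpg", "0001.png")
print(" ".join(example))

# Hypothetical usage, requires ImageMagick 7 on the PATH:
# clean_folder("scans", "cleaned")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces, and keeping the originals untouched makes it easy to rerun with a different threshold or area value.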