Hello all,
I've been trying to perform OCR on a number of images like the one below but am running into issues because of the noise present throughout the image. The two main problems are the slice of yellow around the number in the upper-left corner and the film reel in the background. Both greatly reduce the accuracy of my OCR.
I tried removing the background but had no luck. More recently I've been converting the images to grayscale, darkening them, and then turning everything that isn't black to white. I've experimented with a number of different fuzz levels, but none produced particularly good results.
Unfortunately my experience with ImageMagick is pretty minimal, so I figured I'd post here to see if anyone had any other ideas as to how I could better process these images.
Any help provided is greatly appreciated.
Remove Background Noise For OCR
- fmw42
- Posts: 25562
- Joined: 2007-07-02T17:14:51-07:00
- Authentication code: 1152
- Location: Sunnyvale, California, USA
Re: Remove Background Noise For OCR
You might experiment with looking at each channel of different colorspaces to see whether any one of them makes it easier to get the text.
convert image -colorspace ??? -separate image_%d.png
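As a rough sketch of that experiment, the loop below splits one image into its channels for a handful of colorspaces so the results can be compared by eye. The function name split_channels, the particular colorspace list, and the output naming are my own assumptions, not part of the suggestion above.

```shell
#!/bin/sh
# Split an image into per-channel grayscale files for several colorspaces.
# The colorspace list and the output file names are illustrative assumptions.
split_channels() {
    in="$1"
    base="${in%.*}"
    for cs in sRGB HSL HSB Lab YCbCr CMYK; do
        # %d is replaced by the channel index (0, 1, 2, ...)
        convert "$in" -colorspace "$cs" -separate "${base}_${cs}_%d.png"
    done
}

# Example (requires ImageMagick's convert on the PATH):
# split_channels Cb3R5.png
```

Whichever channel shows the text with the least background is the one worth feeding into the later cleanup steps.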
- anthony
- Posts: 8883
- Joined: 2004-05-31T19:27:03-07:00
- Authentication code: 8675308
- Location: Brisbane, Australia
Re: Remove Background Noise For OCR
The problem here is low-level pixel noise and a lack of contrast.
For the former you can try any of the noise-reducing methods, such as -median or even -morphology smooth.
For the latter, the fix is basically color adjustment. Most OCR software seems to like contrast pushed all the way to thresholding, so that each pixel is either black or white.
In either case you will probably first need to crop out the individual areas of text you are interested in. As it stands, OCR software would probably have a lot of trouble with such a disordered collection of text.
Basically... Simplify Simplify Simplify
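A minimal sketch of those three steps (crop, denoise, threshold) chained together might look like the following. The function name ocr_prep, the crop geometry, the 3x3 median window, and the 50% threshold are all placeholder assumptions that would need tuning per image. I use -statistic Median here as a closely related operator to the -median mentioned above.

```shell
#!/bin/sh
# Crop one text region, reduce pixel noise, then force pure black/white.
# Geometry, window size and threshold are placeholder assumptions.
ocr_prep() {
    in="$1"; out="$2"; region="${3:-200x40+0+0}"
    convert "$in" \
        -crop "$region" +repage \
        -colorspace Gray \
        -statistic Median 3x3 \
        -normalize \
        -threshold 50% \
        "$out"
}

# Example (hypothetical region of interest):
# ocr_prep Cb3R5.png title.png 300x30+120+45
```

Running it once per text area keeps each OCR input small and uniform, which is the point of the "simplify" advice.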
Anthony Thyssen -- Webmaster for ImageMagick Example Pages
https://imagemagick.org/Usage/
- anthony
Re: Remove Background Noise For OCR
Another alternative is to avoid the images entirely.
If this is a 'live' feed that you are working with, then the "Movies on Demand" website may have the information in plain text or perhaps HTML that needs only minimal text processing to extract the desired information!
OCR is hard. Text from web sites is easy!
Re: Remove Background Noise For OCR
anthony wrote: Another alternative is to avoid the images entirely. If this is a 'live' feed that you are working with, then the "Movies on Demand" website may have the information in plain text or perhaps HTML that needs only minimal text processing to extract the desired information! OCR is hard. Text from web sites is easy!
This was the first route I tried (and desperately hoped) to take, but it unfortunately didn't produce the same results.
Thank you both for your suggestions. I'll play around with them and circle back with my results.
Re: Remove Background Noise For OCR
You could generate two images, one specifically to recognize white-on-dark text and one to recognize dark-on-white text:
convert Cb3R5.png -normalize ( -clone 0 -blur 5 ) -compose minus -composite -normalize cbblack.png
convert Cb3R5.png -normalize ( -clone 0 -blur 5 ) +swap -compose minus -composite -normalize cbwhite.png
It still looks very hard to recognize without errors, though.
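On a Unix shell the parentheses in those two commands need a backslash, or the shell will try to interpret them itself. The sketch below wraps both pipelines in one function with that escaping applied. The function name split_polarity and the blur parameter default are my assumptions, and the operators are otherwise copied unchanged from the commands above.

```shell
#!/bin/sh
# The same two pipelines with the parentheses escaped for a Unix shell.
# Function name and the blur default are assumptions for illustration.
split_polarity() {
    in="$1"; blur="${2:-5}"
    # difference between the image and a blurred copy of itself
    convert "$in" -normalize \( -clone 0 -blur "$blur" \) \
        -compose minus -composite -normalize cbblack.png
    # same, with the two operands swapped before compositing
    convert "$in" -normalize \( -clone 0 -blur "$blur" \) +swap \
        -compose minus -composite -normalize cbwhite.png
}

# Example:
# split_polarity Cb3R5.png
```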
Alternatively:
If the images you want to process all have the same background, without moving parts, then you can extract that background and remove it fairly easily: http://www.imagemagick.org/Usage/masking/#known_bgnd
This would also work if there are a few possible backgrounds, or if the background is an animation with several frames; you just have to repeat the process for each possible background.
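One way to sketch the known-background idea is to difference each frame against a clean copy of the background, threshold the result into a mask, and keep only the masked pixels on white. This is not necessarily the exact recipe on the linked page. The function name remove_known_bg, the intermediate file bg_mask.png, and the 15% threshold are assumptions, and the sketch presumes you already have the background saved as its own image.

```shell
#!/bin/sh
# Keep only the pixels of a frame that differ from a known background.
# File names and the 15% threshold are assumptions for illustration.
remove_known_bg() {
    frame="$1"; bgnd="$2"; out="$3"
    # mask: white where frame and background differ noticeably, black elsewhere
    convert "$frame" "$bgnd" -compose Difference -composite \
        -colorspace Gray -threshold 15% bg_mask.png
    # paint the frame over a plain white canvas, but only where the mask is white
    convert \( "$frame" -fill white -colorize 100 \) "$frame" bg_mask.png \
        -composite "$out"
}

# Example (background.png is a hypothetical clean background capture):
# remove_known_bg Cb3R5.png background.png text_only.png
```

With several possible backgrounds, the same function can be called once per candidate and the cleanest output kept.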
- anthony
Re: Remove Background Noise For OCR
At the bottom of the same page on masking is a more advanced form of background removal that recovers anti-aliasing too.