Hello...
Would anyone have any suggestions for bringing an image up to OCR level?
The original image is a scan of an extremely faded microfiche.
Here is a snippet of the image. It's a Dropbox link so just X the pop-up asking you to join.
http://tinyurl.com/zdy5p4r
I had to do a colour capture as the microfiche was so faded.
My thinking was that I could use -black-threshold and maybe a bit of blur. Although the result is good enough for the human eye, the OCR wasn't happy: barely a 30% strike rate. In total there are about 500 pages to OCR, so I'd like a better strike rate than that.
I've also tried some of Fred's noise removal scripts, but I couldn't confine them to the noise: they ate into the letters as well.
I also looked at Snibgo's monochrome scripts but could not make head nor tail of them.
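For reference, the blur plus -black-threshold idea might look roughly like the sketch below. The function name threshold_fiche, the run helper, the order of the two operations and the default values (0x1 blur, 50% threshold) are all assumptions for illustration, not the settings actually used. Setting DRY_RUN=1 prints the command instead of running it.

```shell
# run: execute a command, or only print it when DRY_RUN=1 (hypothetical helper)
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# threshold_fiche SRC DST [BLUR] [THRESHOLD]
# Blur slightly, then force every pixel darker than THRESHOLD to black.
# The defaults are guesses; they would need tuning per scan.
threshold_fiche() {
    src=$1
    dst=$2
    blur=${3:-0x1}
    thresh=${4:-50%}
    run convert "$src" -blur "$blur" -black-threshold "$thresh" "$dst"
}
```

Calling threshold_fiche scan.png scan_bt.png 0x1 40% would try a lower threshold on a single page before committing to a whole batch.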
Any suggestions appreciated.
Many thanks
Miguel
OCR a scan of a faded microfiche
Last edited by miguellint on 2016-04-03T00:37:03-07:00, edited 1 time in total.
Re: OCR a scan of a faded microfiche
I suggest you manually clean the image, using Gimp or similar, until your OCR software can read it. Then show us that image, and we might be able to suggest how to do the cleaning automatically.
snibgo's IM pages: im.snibgo.com
Re: OCR a scan of a faded microfiche
Hello Snibgo...
Here's one I made earlier
This is what a scanned fiche with a 100% OCR strike rate looks like...
http://tinyurl.com/zc6culq
The original scanned image is of such a reasonable quality that all I need do is deskew and crop the image then make it a bit more "solid" using the following command supplied by Fred...
Code: Select all
convert infile -negate -lat 10x10+2% -negate outfile.png
If needed I can do a despeckle with the following...
Code: Select all
convert infile -morphology close diamond:1 outfile.png
I have the deskew/crop/negate commands in a bash script which will quite happily work away in the background and tidy up 500 images in an hour or two. The OCR strike rate is usually 90%-ish.
The annoying thing is that the fiche I'm currently scanning are really quite faded. Using the scanner's "colour capture" option is the only way to bring out any definition.
Any advice appreciated.
Many thanks
Miguel
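The deskew/crop/negate batch script mentioned in this post is not shown, so the following is only a guess at its general shape. The -deskew 40% and -trim +repage steps are assumptions standing in for the real deskew and crop commands; the -negate -lat ... -negate step is the one quoted in the post. The names run, clean_good_fiche and clean_all are hypothetical.

```shell
# run: execute a command, or only print it when DRY_RUN=1 (hypothetical helper)
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# clean_good_fiche SRC DST
# Deskew and trim the border (assumed settings), then apply the
# "make it more solid" command from the post.
clean_good_fiche() {
    src=$1
    dst=$2
    run convert "$src" -deskew 40% -trim +repage \
        -negate -lat 10x10+2% -negate "$dst"
}

# clean_all SRC_DIR DST_DIR
# Process every PNG in SRC_DIR, writing results to DST_DIR so that
# cleaned files are never picked up again as input.
clean_all() {
    src_dir=$1
    dst_dir=$2
    run mkdir -p "$dst_dir"
    for f in "$src_dir"/*.png; do
        [ -e "$f" ] || continue    # skip if the glob matched nothing
        clean_good_fiche "$f" "$dst_dir/$(basename "$f")"
    done
}
```

Running DRY_RUN=1 clean_all scans cleaned first lists the commands without touching any files, which is a cheap check before letting it loose on 500 images.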
Re: OCR a scan of a faded microfiche
Here's a bash step-by-step that improves the OCR strike rate considerably.
(And here's a Dropbox link to a Before/After image so just X any popups asking you to register.)
Before and After
http://tinyurl.com/hl4qx5k
Code: Select all
# Each step reads the previous step's output and appends a suffix to the name.
# Despeckle
for f in *.png
do
  convert "$f" -morphology close diamond:1 "${f%.*}_dspk.png"
done
# Floodfill with white, starting from the top-left corner (0,0)
for f in *dspk.png
do
  convert "$f" -fuzz 20% -fill white -draw "color 0,0 floodfill" "${f%.*}_ff.png"
done
# Slight blur
for f in *ff.png
do
  convert "$f" -blur 0x1 "${f%.*}_blur.png"
done
# Convert to grayscale
for f in *blur.png
do
  convert "$f" -type Grayscale "${f%.*}_gray.png"
done
# Fred's ImageMagick Textcleaner script - BEST SCRIPT EVER :-)
# Use everywhere even when not needed
for f in *gray.png
do
  textcleaner "$f" "${f%.*}_tc.png"
done
# Darken/even out text
for f in *tc.png
do
  convert "$f" -negate -lat 10x10+2% -negate "${f%.*}_dark.png"
done
# Slight blur again
for f in *dark.png
do
  convert "$f" -blur 0x1 "${f%.*}_cleaned.png"
done
# Shorten the final names (this is the Perl rename syntax), then delete the intermediates
rename 's/_dspk_ff_blur_gray_tc_dark_cleaned/_cleaned/' *
rm *dspk*.png
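The same steps could probably be chained so that each page needs only two convert calls around textcleaner instead of seven intermediate files. This is an untested sketch: the result should be close to the step-by-step version but is not guaranteed to be pixel-identical, and the names run and clean_faded_fiche are hypothetical.

```shell
# run: execute a command, or only print it when DRY_RUN=1 (hypothetical helper)
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# clean_faded_fiche SRC DST
# Same operations and settings as the step-by-step script, in the same order.
clean_faded_fiche() {
    src=$1
    dst=$2
    pre="${dst%.*}_pre.png"    # temporary input for textcleaner
    tc="${dst%.*}_tc.png"      # temporary output from textcleaner
    # Despeckle, floodfill from the top-left corner, blur, grayscale
    run convert "$src" -morphology close diamond:1 \
        -fuzz 20% -fill white -draw "color 0,0 floodfill" \
        -blur 0x1 -type Grayscale "$pre"
    # Fred's textcleaner with its default settings, as in the script
    run textcleaner "$pre" "$tc"
    # Darken/even out the text, then blur slightly again
    run convert "$tc" -negate -lat 10x10+2% -negate -blur 0x1 "$dst"
    run rm -f "$pre" "$tc"
}
```

A loop such as for f in *.png; do clean_faded_fiche "$f" "cleaned/$f"; done would then cover a whole directory, assuming the cleaned directory already exists.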
Re: OCR a scan of a faded microfiche
Please don't use tinyurl. I cannot follow any of your links because tinyurl blocks Tor.
Anyway, without being able to see your links, I'll blindly suggest a tool called "unpaper". When I have to OCR a document, I use imagemagick in combination with unpaper.
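The post does not say how the two tools are combined, so the following is only one plausible arrangement: ImageMagick converts the scan to a PNM file that unpaper can read, unpaper runs with its default settings, and ImageMagick converts the result back. The names run and unpaper_clean are hypothetical.

```shell
# run: execute a command, or only print it when DRY_RUN=1 (hypothetical helper)
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# unpaper_clean SRC DST
# Round-trip through grayscale PNM files around a default unpaper run.
unpaper_clean() {
    src=$1
    dst=$2
    pnm_in="${dst%.*}_in.pgm"
    pnm_out="${dst%.*}_out.pgm"
    run convert "$src" -colorspace Gray "$pnm_in"
    run unpaper "$pnm_in" "$pnm_out"
    run convert "$pnm_out" "$dst"
    run rm -f "$pnm_in" "$pnm_out"
}
```

With DRY_RUN=1 set, unpaper_clean page.png page_clean.png only prints the four commands, so the file naming can be checked before anything is written.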