Image database of scanned letters.

Post by **miguellint** » 2016-01-25T19:59:09-07:00

Hello...

I'm not sure who to ask (and I'm not even sure this is a thing), but I'd appreciate any thoughts.

---

I have several dozen scanned images of old and faded microfiche. The microfiche hold lists of names and places.

Here is an example of the letter E. As can be seen there are five distinct shapes even though they are all for the same letter.

(Dropbox link so just X any pop-ups asking you to register)

http://tinyurl.com/jzpy5sl

If I collect all of these shapes, standardise their size (e.g. 30x40), and then put them in a database under "Uppercase letter E", could I then go through each of my original scanned images and compare each 30x40 pixel area to the database? When a comparison matches, I could then paste a "good" 30x40px copy of the letter E over the original "bad" 30x40px letter E.

Repeat for each letter of the alphabet.
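
A minimal sketch of the idea above, written in Python rather than IM so the logic is easy to follow. It assumes the scans have already been binarised into 2D lists (1 = ink, 0 = background) and that every letter sits on a fixed grid, which real microfiche almost certainly will not. All names (`match_letter`, `clean_page`, `max_diff`) are hypothetical, and the "database" is just a dict.

```python
# Sketch only: images are binarised 2D lists (1 = ink, 0 = background).
# Every name below is hypothetical and not part of ImageMagick.

def tile_at(page, x, y, w, h):
    """Cut a w x h region whose top-left corner is (x, y)."""
    return [row[x:x + w] for row in page[y:y + h]]

def difference(a, b):
    """Fraction of pixels that differ between two same-sized tiles."""
    total = len(a) * len(a[0])
    differing = sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return differing / total

def match_letter(tile, database, max_diff=0.15):
    """Return the label of the closest stored variant, or None if none is close enough."""
    best_label, best_score = None, max_diff
    for label, variants in database.items():
        for variant in variants:
            score = difference(tile, variant)
            if score <= best_score:
                best_label, best_score = label, score
    return best_label

def clean_page(page, database, good, w, h, max_diff=0.15):
    """Walk the page on a fixed w x h grid and paste a good glyph over each matched tile."""
    out = [row[:] for row in page]  # copy, so the original scan is untouched
    for y in range(0, len(page) - h + 1, h):
        for x in range(0, len(page[0]) - w + 1, w):
            label = match_letter(tile_at(page, x, y, w, h), database, max_diff)
            if label is not None:
                for dy in range(h):
                    out[y + dy][x:x + w] = good[label][dy]
    return out

# Toy 3x3 "glyphs" standing in for 30x40 ones.
good_e = [[1, 1, 1], [1, 1, 0], [1, 1, 1]]
faded_e = [[1, 1, 1], [1, 0, 0], [1, 1, 1]]
database = {"E": [good_e, faded_e]}
good = {"E": good_e}

page = [[1, 1, 1, 0, 0, 0],
        [1, 0, 0, 0, 0, 0],
        [1, 1, 1, 0, 0, 0]]
cleaned = clean_page(page, database, good, 3, 3)
```

The threshold `max_diff` is the part that would need tuning: too strict and faded letters are skipped, too loose and an F gets overwritten with an E.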

I know that when you are training OCR software you build up a database containing various images of each letter. I'm wondering if I can do something similar using IM.

Hope that all makes sense

Any thoughts appreciated
Miguel

IM 6.8.9-9
Kubuntu 15.10

Post by **snibgo** » 2016-01-25T20:11:07-07:00

I suppose you could do this, but why? People have written special-purpose software to read text. IM is a general-purpose image processor, and doesn't contain code to recognise, characterise and match letter-forms. Using a brute-force method (chop the document into individual letters, and compare all with every image in the database) will be massively slow.
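
To put a rough number on "massively slow", here is a back-of-the-envelope count in Python. The page size (3000x4000 px) and the figure of five variants for every letter are assumptions for illustration only; the thread gives just the 30x40 tile size and five shapes for the letter E.

```python
def pixel_comparisons(page_w, page_h, tile_w, tile_h, templates, step_x, step_y):
    """Count pixel comparisons needed to test every template at every tile position."""
    positions_x = (page_w - tile_w) // step_x + 1
    positions_y = (page_h - tile_h) // step_y + 1
    return positions_x * positions_y * templates * tile_w * tile_h

# Hypothetical numbers: a 3000x4000 px scan, 30x40 px tiles,
# 26 letters x 5 variants each = 130 stored templates.
grid = pixel_comparisons(3000, 4000, 30, 40, 130, step_x=30, step_y=40)
every_offset = pixel_comparisons(3000, 4000, 30, 40, 130, step_x=1, step_y=1)

print(f"grid-aligned tiles: {grid:,} pixel comparisons per page")
print(f"every pixel offset: {every_offset:,} pixel comparisons per page")
```

Under these assumptions the grid-aligned case already needs 1,560,000,000 pixel comparisons per page, and letters that do not sit on a clean grid push the search toward the far larger every-offset figure.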

Post by **miguellint** » 2016-01-25T20:37:32-07:00

Hello snibgo...

Thanks for replying

It's not so much about OCR'ing the images. It's more about making the images look "nice" (for want of a better word).

And learning more about IM - specifically looping a copy/paste region by region, IM and databases, and comparing images pixel by pixel. I'm still a total noob.

The microfiche are not that legible so, as it stands, if I OCR a scanned image the software will come back with dozens of suspect letters which it will want me to correct. Multiply that by several dozen original scans and I'll soon have to make thousands of corrections.

When I eventually finish the OCR I'll have a text version of the microfiche (which admittedly is pretty cool) but the original scans will still look pretty shabby and still be fairly illegible.

---

Totally appreciate that it seems pointless but I'm just hoping for some pointers, not full blown code.

Thanks
Miguel

Post by **fmw42** » 2016-01-26T00:43:58-07:00

compare is not rotation or scale invariant. It can only find offsets, and it is slow. But see the compare documentation:

http://www.imagemagick.org/script/compare.php
http://www.imagemagick.org/Usage/compare/
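
To illustrate what "can only find offsets" means, here is a small Python sketch of an offset-only search. It is conceptually similar to a sub-image search but is not IM's actual implementation; `best_offset` is a hypothetical name, and images are again assumed to be binarised 2D lists.

```python
def best_offset(page, template):
    """Slide template over page; return (x, y, differing_pixels) of the closest placement."""
    th, tw = len(template), len(template[0])
    best = None
    for y in range(len(page) - th + 1):
        for x in range(len(page[0]) - tw + 1):
            # Count mismatching pixels for this placement only; the template
            # is never rotated or resized, so only translation is handled.
            diff = sum(
                page[y + dy][x + dx] != template[dy][dx]
                for dy in range(th)
                for dx in range(tw)
            )
            if best is None or diff < best[2]:
                best = (x, y, diff)
    return best

template = [[1, 1],
            [1, 0]]
page = [[0, 0, 0, 0, 0],
        [0, 0, 1, 1, 0],
        [0, 0, 1, 0, 0],
        [0, 0, 0, 0, 0]]
found = best_offset(page, template)
```

Because every placement is tested pixel by pixel, the cost grows with page area times template area, and a letter that is slightly tilted or a different size from the stored shape will only ever score as a poor match.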

Legacy ImageMagick Discussions Archive
