I'm not quite sure what to call this - hence the strange subject - but this is what it is:
I have a large number of images - strictly black on white - that I want to analyse like this:
1) I want to separate them out into components: if I have, say, a text written in black on white paper, I want to cut it up into all the letters, and in fact, I want to cut out the dots over the 'i's and put them in a separate bag too. A little bit like OCR, but without trying to identify what each bit is.
2) After having generated my component images, I want to automatically categorise them, so that all dots are bundled together, all vertical strokes are the same (but perhaps split into 'short', like the one in the letter 'i' and 'long', like 'l') etc. These categories are now my "standard components".
3) Having done this, I want to analyse the original images to see which of my standard components are part of them, and then use this to build an index.
Is there any open source tool or toolset that can do one or more of these steps? Or even almost? Or perhaps just a hint about where I might start looking?
Batch analysis of images (?)
Re: Batch analysis of images (?)
With ImageMagick, (1) is easy: "-connected-components" does that.
(2) is somewhat vague. You can get data about each component: the size, various moments that give the shape, and so on.
(3) sounds like a simple search for each component within the original images. IM can easily do exact searches, with "compare".
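To make (1) and (2) a bit more concrete, here is a minimal pure-Python sketch of what connected-component labelling does, followed by the kind of per-component data mentioned above (area, bounding box, centroid). It is not ImageMagick's implementation, only an illustration of the idea. The names label_components, describe, IMAGE, GRID and COMPONENTS are made up for this example, and the 5x5 letter 'i' is toy data.

```python
from collections import deque


def label_components(grid, connectivity=8):
    """Return a list of components; each is a list of (row, col) ink pixels.

    grid is a list of rows holding 1 for black (ink) and 0 for white.
    """
    rows, cols = len(grid), len(grid[0])
    if connectivity == 8:
        steps = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                 if (dr, dc) != (0, 0)]
    else:
        steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]

    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited ink pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy, dx in steps:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(pixels)
    return components


def describe(pixels):
    """Basic per-component data: area, bounding box and centroid."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    area = len(pixels)
    return {
        "area": area,
        "bbox": (min(ys), min(xs), max(ys), max(xs)),  # top, left, bottom, right
        "centroid": (sum(ys) / area, sum(xs) / area),
    }


# Toy data: a letter 'i', i.e. a dot above a short vertical stroke.
IMAGE = [
    "..#..",
    ".....",
    "..#..",
    "..#..",
    "..#..",
]
GRID = [[1 if ch == "#" else 0 for ch in row] for row in IMAGE]
COMPONENTS = label_components(GRID)

for comp in COMPONENTS:
    print(describe(comp))
```

On this toy image the dot and the stroke come out as two separate components, which is the "separate bag" behaviour asked for in step 1. For real scans a dedicated tool will be far faster than pure Python.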
snibgo's IM pages: im.snibgo.com
Re: Batch analysis of images (?)
snibgo: that's great - thank you for helping.
About (2): One of the things I want to do is analyse Chinese seal script characters - they do have a standard form, but their exact shape still depends on the handwriting style of whoever wrote them, and each component can have its overall shape altered to fit within a complex character. I would like to automatically identify all the variants with each other, because that is how your eye would identify them. There isn't really a good analogy in European writing, but imagine I wanted to put the letter 't' in the same box as '+', even though the 't' has a little hook at the bottom; but I don't want '#' in the same box, if that makes sense.
And in (3), I don't want to search for the exact shape, for the same reason as stated in (2).
Still, this is something to start with; I'll see if I can do some experiments this weekend.
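One possible way to approach this kind of tolerant grouping, sketched in Python under strong assumptions: describe each component by two moment-based features that ignore position and size (the first two Hu invariants, built from normalised second-order central moments), then greedily put shapes whose features are close into the same box. The function names, the toy 5x5 shapes and the 0.02 threshold are all invented for illustration, and two features are almost certainly too crude for real seal script.

```python
import math


def pixels_from_art(rows):
    """Turn ASCII art ('#' = ink) into a list of (row, col) pixels."""
    return [(y, x) for y, row in enumerate(rows)
            for x, ch in enumerate(row) if ch == "#"]


def shape_features(pixels):
    """First two Hu invariants from second-order central moments."""
    n = len(pixels)
    cy = sum(y for y, _ in pixels) / n
    cx = sum(x for _, x in pixels) / n
    mu20 = sum((y - cy) ** 2 for y, _ in pixels)
    mu02 = sum((x - cx) ** 2 for _, x in pixels)
    mu11 = sum((y - cy) * (x - cx) for y, x in pixels)
    # Dividing by area squared makes second-order moments scale-independent.
    eta20, eta02, eta11 = mu20 / n ** 2, mu02 / n ** 2, mu11 / n ** 2
    phi1 = eta20 + eta02
    phi2 = (eta20 - eta02) ** 2 + 4 * eta11 ** 2
    return (phi1, phi2)


def group_by_similarity(named_shapes, threshold):
    """Greedy grouping: join the first group whose representative is close."""
    groups = []  # list of (representative_features, [names])
    for name, pixels in named_shapes:
        feats = shape_features(pixels)
        for rep, members in groups:
            if math.dist(rep, feats) <= threshold:
                members.append(name)
                break
        else:
            groups.append((feats, [name]))
    return [members for _, members in groups]


# Toy shapes (assumptions for illustration only).
PLUS = pixels_from_art(["..#..", "..#..", "#####", "..#..", "..#.."])
T_WITH_HOOK = pixels_from_art(["..#..", "..#..", "#####", "..#..", "..##."])
HASH = pixels_from_art([".#.#.", "#####", ".#.#.", "#####", ".#.#."])

# The threshold is hand-picked for these toy shapes, not a recommended value.
GROUPS = group_by_similarity(
    [("plus", PLUS), ("t", T_WITH_HOOK), ("hash", HASH)], threshold=0.02
)
print(GROUPS)
```

With these toy shapes the '+' and the hooked 't' fall into one box and the '#' into another. Note that these invariants also ignore rotation, so a horizontal and a vertical stroke would land in the same box; if orientation matters, the features would need to keep it (for example by comparing the normalised moments directly). The result of greedy grouping also depends on the order in which shapes are presented.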
Re: Batch analysis of images (?)
Interesting. I suggest you look at OCR (Optical Character Recognition) software, e.g. Tesseract. Tesseract probably contains the analysis code you need, and also recognises Chinese text (with a training module).