Removing caption background for use in OCR (Tesseract)

Posted: 2017-03-07T05:54:05-07:00
by gilberto_san
I have some movies with embedded subtitles.
I'm writing a paper on extracting the text from these captions.

Examples:
Image
The font colors may vary.

I would like a result like this:
Image

I'm having a lot of trouble when the font color is gray
(as in the last 2 pictures)
Image

Code used:

Code:

convert Screenshot_2.bmp -channel rgba -alpha set -fuzz 10%% -fill none -opaque "#786965" -opaque "#B9B1B9" -opaque "#D6D5D7" -opaque "#E8F4F7" -opaque "#19191C" -opaque "#39383A" -opaque "#8A8A8B" -opaque "#524C4E" -opaque "#3C4B4C" -opaque "#425353" gil.bmp
I'm new to ImageMagick.
Sorry for my bad English.
Thank you.

Re: Removing caption background for use in OCR (Tesseract)

Posted: 2017-03-07T06:17:49-07:00
by gilberto_san
Note: It is possible to obtain 2 or more images with the same text but with different backgrounds.

Image

Re: Removing caption background for use in OCR (Tesseract)

Posted: 2017-03-07T10:09:42-07:00
by snibgo
gilberto_san wrote: Note: It is possible to obtain 2 or more images with the same text but with different background.
That simplifies the problem. One possible method:

1. Take two images with the same text but with different backgrounds.

2. Find the best alignment between those images. Now the text appears in the same position on the two images.

3. Where corresponding pixels are different, this is background. Where they are the same, this is probably subtitle.
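
Step 3 can be sketched per pixel in plain Python. This is only an illustration of the idea, not ImageMagick code: the images are assumed to be already aligned, of equal size, and represented as hypothetical lists of rows of (r, g, b) tuples.

```python
# Hypothetical representation: an image is a list of rows of (r, g, b) tuples.
WHITE = (255, 255, 255)
BLACK = (0, 0, 0)

def same_pixel_mask(img_a, img_b):
    """Return a mask that is black where the two aligned images agree exactly
    (probably subtitle) and white where they differ (background)."""
    mask = []
    for row_a, row_b in zip(img_a, img_b):
        mask.append([BLACK if pa == pb else WHITE
                     for pa, pb in zip(row_a, row_b)])
    return mask

# Two tiny 1x3 frames: the middle pixel is subtitle text, the rest is background.
frame1 = [[(10, 10, 10), (250, 250, 0), (90, 90, 90)]]
frame2 = [[(40, 40, 40), (250, 250, 0), (20, 20, 20)]]
mask = same_pixel_mask(frame1, frame2)
```

An exact comparison like this only works when the subtitle pixels are bit-for-bit identical in both frames; with compressed video frames a tolerance is usually needed.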

I show many methods for aligning images on my pages, linked below.

Re: Removing caption background for use in OCR (Tesseract)

Posted: 2017-03-07T20:23:44-07:00
by gilberto_san
Thank you for answering. Congratulations on the site. It helped me a lot.

I have a question.
1. ok. :D
2. ok. :D
3. Sorry, I do not know how to do this. Could you help me with an ImageMagick command? :?

Re: Removing caption background for use in OCR (Tesseract)

Posted: 2017-03-07T21:15:34-07:00
by snibgo
For example, this finds the difference between two images. Where the difference is not zero (black), it changes the pixels to white.

Code:

convert in1.png in2.png -compose Difference -composite -fill White +opaque Black out.png

Re: Removing caption background for use in OCR (Tesseract)

Posted: 2017-03-08T04:43:26-07:00
by gilberto_san
snibgo wrote: 2017-03-07T21:15:34-07:00 For example, this finds the difference between two images. Where the difference is not zero (black), it changes the pixels to white.

Code:

convert in1.png in2.png -compose Difference -composite -fill White +opaque Black out.png
Look at these images:
Image
Image

I used:

Code:

convert in1.png in2.png -compose Difference -composite -fill White +opaque Black out.png
Result:
Image

Note that only the letter W was used.
The images are aligned, but the result was bad.

Re: Removing caption background for use in OCR (Tesseract)

Posted: 2017-03-08T08:24:51-07:00
by snibgo
The pixels in the text in the two images are not identical. Instead of "-fill White +opaque Black" you can use "-threshold 1%" or similar.

When the background pixels are also within 1% of each other, you may need to combine multiple images in this way.
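
As a rough sketch of the tolerance idea, and of combining more than two frames, here is the logic in plain Python. It is not a reimplementation of "-threshold": the function name, the list-of-rows image representation, and the per-channel comparison are assumptions made for illustration. A pixel counts as text only if every channel stays within the tolerance across all frames.

```python
def text_mask(frames, tolerance=0.01):
    """frames: aligned, equally sized images, each a list of rows of
    (r, g, b) tuples with values 0..255 (hypothetical representation).
    Returns rows of booleans: True where the pixel is probably text,
    i.e. every channel varies by at most tolerance * 255 over all frames."""
    limit = tolerance * 255
    height, width = len(frames[0]), len(frames[0][0])
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            pixels = [frame[y][x] for frame in frames]
            # zip(*pixels) groups the values channel by channel.
            spread = max(max(ch) - min(ch) for ch in zip(*pixels))
            row.append(spread <= limit)
        mask.append(row)
    return mask

# 1x3 frames: pixel 0 is background, pixel 1 is text with slight noise,
# pixel 2 is background that happens to match in frames a and b.
a = [[(10, 10, 10), (200, 200, 200), (50, 50, 50)]]
b = [[(80, 80, 80), (201, 199, 200), (50, 50, 50)]]
c = [[(30, 30, 30), (200, 201, 202), (120, 120, 120)]]

two_frames = text_mask([a, b])
three_frames = text_mask([a, b, c])
```

With two frames, the coincidentally matching background pixel is wrongly kept as text; adding the third frame removes it, which is why more frames help.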

When the height of the capital letters is less than 20 pixels, OCR with Tesseract may be unreliable.

Re: Removing caption background for use in OCR (Tesseract)

Posted: 2017-03-08T11:18:55-07:00
by gilberto_san
Thank you so much for your cooperation :D :D :D . I ran the test with 4 images that have identical text and different backgrounds.
The result has improved a bit, but it is not yet readable.

I was thinking of another solution.
Would it be possible to normalize the colors of an image?
Example:
An image has pixels in dark green, medium green, light green and green.
Would it be possible to turn them all into a single green?

Likewise, could light gray, medium gray and dark gray be transformed into a single gray?

If so, please tell me how.

See the figure:
Image

Remember that green was just an example; the fonts can be any color.
The background is always in shades of gray and black because the films are old.

Re: Removing caption background for use in OCR (Tesseract)

Posted: 2017-03-08T11:33:51-07:00
by snibgo
If the text is always coloured and the background is always gray, pixel saturation will separate them, eg:

Code:

convert subtGrn.png -colorspace HSL -channel G -separate +channel -auto-level out.png
After "-auto-level", you could "-level 40%,60%" or "-sigmoidal-contrast 10,50%" to emphasise the difference.
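
To show what the saturation approach does to each pixel, here is a small Python sketch using the standard colorsys module. It is an approximation for illustration only: the function names and the list-of-rows image representation are assumptions, and ImageMagick's own HSL conversion and "-auto-level" may differ in detail.

```python
import colorsys

def saturation(rgb):
    """HSL saturation of an (r, g, b) pixel with values 0..255."""
    r, g, b = (v / 255.0 for v in rgb)
    _hue, _lightness, sat = colorsys.rgb_to_hls(r, g, b)
    return sat

def level(value, black=0.40, white=0.60):
    """Linear stretch in the spirit of "-level 40%,60%":
    values <= black become 0, values >= white become 1."""
    return min(1.0, max(0.0, (value - black) / (white - black)))

def saturation_map(image):
    """Saturation of every pixel, stretched to the full 0..1 range
    (a simple stand-in for "-auto-level"), then passed through level()."""
    sat = [[saturation(p) for p in row] for row in image]
    lo = min(min(row) for row in sat)
    hi = max(max(row) for row in sat)
    span = (hi - lo) or 1.0  # avoid division by zero on a flat image
    return [[level((s - lo) / span) for s in row] for row in sat]

# Gray background, green text, dark gray background.
sample = [[(120, 120, 120), (30, 200, 30), (20, 20, 20)]]
result = saturation_map(sample)
```

In the result, gray pixels come out as 0 (black) and the coloured text pixel as 1 (white), regardless of how light or dark the gray is.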

Re: Removing caption background for use in OCR (Tesseract)

Posted: 2017-03-09T04:46:51-07:00
by gilberto_san
I did some tests; these commands solve 60% of the problems.

To solve all of them, I need something that does the following:
1) transform all shades of any color (not just green) into the actual color
Example: light red and dark red are transformed into red
2) transform all shades of gray into white (or another desired color)
Is there any command that does such a transformation?

See the example I did manually:
Image

Basically, I would like similar colors to become a single color.

Thank you one more time :D :D :D :D

Re: Removing caption background for use in OCR (Tesseract)

Posted: 2017-03-09T05:04:57-07:00
by snibgo
gilberto_san wrote: 1) transform all shades of any color (not just green) in the actual color
Example: light red, dark red are transformed into red
This is vague. What are the possible "actual colors"? Just red, green and blue? How about cyan, yellow and magenta? Orange, brown, pink?

Re: Removing caption background for use in OCR (Tesseract)

Posted: 2017-03-09T09:42:14-07:00
by gilberto_san
Note the colors of the letter S. I have a variety of colors.
The fact is that the font seems to be the same and the background is in variations of gray and black

Image
Image
Image

Re: Removing caption background for use in OCR (Tesseract)

Posted: 2017-03-09T09:54:42-07:00
by fmw42
Threshold on saturation might separate the colors from the grayish background.

Re: Removing caption background for use in OCR (Tesseract)

Posted: 2017-03-09T11:08:59-07:00
by fmw42
Try this Unix syntax:

Code:

convert 9a6c443dfa76445fa95158948cc4e9bd.png \
\( -clone 0 -colorspace HCL -channel g -separate +channel -threshold 40% \) \
-alpha off -compose copy_opacity -composite result.png
Please always provide your IM version and platform, since syntax may differ.
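
For readers who want to see the logic of that command spelled out, here is a hedged Python sketch: compute a colourfulness value per pixel, threshold it at 40%, and use the result as the alpha channel. The chroma measure used here (max minus min of the RGB values) is an assumption chosen for simplicity and may not match ImageMagick's HCL chroma exactly; the function names and image representation are also hypothetical.

```python
def chroma(rgb):
    """Simple chroma estimate in 0..1: spread between the largest and
    smallest channel. Grays give 0, strongly coloured pixels approach 1."""
    return (max(rgb) - min(rgb)) / 255.0

def mask_by_chroma(image, threshold=0.40):
    """Return an RGBA image (rows of (r, g, b, a) tuples) in which pixels
    with chroma above the threshold stay opaque and all others become
    fully transparent."""
    out = []
    for row in image:
        out.append([(r, g, b, 255 if chroma((r, g, b)) > threshold else 0)
                    for r, g, b in row])
    return out

# Gray background, red text, slightly tinted light background.
sample = [[(90, 90, 90), (230, 40, 40), (200, 190, 180)]]
result = mask_by_chroma(sample)
```

Note that gray subtitle text has chroma close to zero, so this kind of mask would make it transparent along with the gray background.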

Re: Removing caption background for use in OCR (Tesseract)

Posted: 2017-03-09T11:19:04-07:00
by gilberto_san
Could you help me by posting the code?

I got moderately good results using this code:

Code:

convert out.bmp +dither -posterize 2 out1.bmp
Image
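
A simplified model of what posterizing to 2 levels does per channel may help explain why it works in some cases and not others. The sketch below is an assumption about the general technique (rounding each channel to the nearest of N evenly spaced levels), not ImageMagick's exact implementation, and the function names are hypothetical.

```python
def posterize_channel(value, levels=2):
    """Round a 0..255 channel value to the nearest of `levels` evenly
    spaced values. With levels=2 the result is either 0 or 255."""
    steps = levels - 1
    index = int(value / 255.0 * steps + 0.5)  # nearest level index
    return int(index * 255 / steps + 0.5)

def posterize(rgb, levels=2):
    """Posterize each channel of an (r, g, b) pixel independently."""
    return tuple(posterize_channel(v, levels) for v in rgb)

medium_red = posterize((200, 30, 30))    # collapses to pure red
dark_red = posterize((120, 20, 20))      # red channel falls below half: black
light_red = posterize((255, 150, 150))   # green/blue rise above half: white
dark_gray = posterize((100, 100, 100))   # black
light_gray = posterize((140, 140, 140))  # white
```

Because each channel is cut at the halfway point independently, dark or pale shades of the same colour can land on different results, and grays split into black and white depending on brightness.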

However, in some cases, the following problem occurs:
Image
:shock: :shock: :shock:

I thank you for your attention.