Textcleaner sensitive to IM Version?

Posted: 2016-05-04T06:39:34-07:00
by mstone
This is not *strictly* a textcleaner question, but textcleaner-adjacent, IMO.

I had the following command line (inspired by the TextCleaner script) that used to do a great job cleaning up text prior to running OCR (under IM version 6.7.7-10):

convert (infile.png -colorspace gray -type grayscale -contrast-stretch 0) (-clone 0--1 -colorspace gray -negate -lat 15x15+5% -contrast-stretch 0) -compose copy_opacity -composite -opaque none -alpha off -deskew 40% -sharpen 0x1 outfile.png

But after I upgraded to IM 6.8.9-9, the resulting images have a little less contrast and give noticeably poorer OCR results. I couldn't find anything in the changelogs that would explain this.

My question is, does anybody have any insight into what could account for the difference? I can supply image samples if that helps with the diagnosis.

Thanks much,
- Matt

Re: Textcleaner sensitive to IM Version?

Posted: 2016-05-04T09:32:34-07:00
by fmw42
convert (infile.png -colorspace gray -type grayscale -contrast-stretch 0) (-clone 0--1 -colorspace gray -negate -lat 15x15+5% -contrast-stretch 0) -compose copy_opacity -composite -opaque none -alpha off -deskew 40% -sharpen 0x1 outfile.png
This is not my exact code. You have added and removed things.

Best guess is that you may have different libpng delegates. Also there is code in my script to deal with changes in colorspace over time in IM. See viewtopic.php?f=4&t=21269

-opaque typically needs to have a -fill somecolor setting before using -opaque. But why are you setting it to none and then turning alpha off?

Put a -alpha off before -compose copy_opacity

Looks like I was a bit sloppy in my code and need to clean it up a little.

Perhaps you should post your input image, so others can test with it.

Try upgrading to the latest IM 6 or IM 7 version and see what happens.

Re: Textcleaner sensitive to IM Version?

Posted: 2016-05-04T10:32:10-07:00
by mstone
Thanks for the reply.

Yes, I did modify the code somewhat, but I *was* trying to preserve the original intent. The "alpha off" is to replace the +matte, which the docs say is obsolete but equivalent to "alpha off" (assuming I got that right). I put the alpha off in the same position that +matte was in the original.

Attaching the image I've been playing around with.

Image

Thanks Again,
- Matt

Re: Textcleaner sensitive to IM Version?

Posted: 2016-05-04T10:46:22-07:00
by fmw42
You probably took your code from an older version of my script. I changed all the matte to alpha a while ago.

Can you post your outputs from your two IM versions?

Did you check your versions of libpng?

Code: Select all

convert -list format
should tell you the version numbers.
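
For anyone who wants to script that check from Node, here is a rough sketch. It assumes convert is on the PATH and that the format listing mentions libpng on the PNG lines; the helper name libpngLines is made up for illustration.

```javascript
const { spawnSync } = require('child_process');

// Keep only the lines of a "convert -list format" listing that mention libpng.
// (libpngLines is a hypothetical helper, not part of ImageMagick or Node.)
function libpngLines(listing) {
  return listing
    .split(/\r?\n/)
    .filter((line) => /libpng/i.test(line))
    .map((line) => line.trim());
}

// Run the real command if convert is installed; otherwise say so.
const res = spawnSync('convert', ['-list', 'format'], { encoding: 'utf8' });
if (res.error || typeof res.stdout !== 'string') {
  console.log('convert is not available on this machine');
} else {
  console.log(libpngLines(res.stdout).join('\n'));
}
```

Running this under each IM install and comparing the printed lines is one way to rule the PNG delegate in or out.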

Re: Textcleaner sensitive to IM Version?

Posted: 2016-05-04T11:25:00-07:00
by mstone
(Edit for future readers--had the IM versions swapped; corrected that. -Matt)

The version that gives better OCR results is below:

Image

This is off IM 6.7.7-10, with libpng ver 1.2.49

The version that yields poorer OCR results is:

Image

And that is produced by IM 6.8.9-9, and it looks like that has libpng ver 1.2.50.

The convert command line is the same between them, but the results are noticeably different.

By the way, I didn't grab the command line from the script itself, but from the command line snippet at the bottom of http://www.fmwconcepts.com/imagemagick/ ... /index.php, which still appears to have the +matte in it. Probably the script is updated, as you said.

Thanks again,
- RBW

Re: Textcleaner sensitive to IM Version?

Posted: 2016-05-04T11:47:06-07:00
by fmw42
I will look at this further later today. But try with jpg or tiff output. Are they any different?

Code: Select all

convert (infile.png -colorspace gray -type grayscale -contrast-stretch 0) (-clone 0--1 -colorspace gray -negate -lat 15x15+5% -contrast-stretch 0) -compose copy_opacity -composite -opaque none -alpha off -deskew 40% -sharpen 0x1 outfile.png
Also note that parentheses must have spaces on both sides. This could be just a typo in your post.

Re: Textcleaner sensitive to IM Version?

Posted: 2016-05-04T15:01:09-07:00
by fmw42
Your problem occurs because version 6.7.7.10 was released while IM was changing how it handled colorspaces and linear vs. non-linear gray. At 6.7.7.10, -colorspace gray produced linear gray. In some earlier releases, and again from about 6.8.5 onward, it produced non-linear gray. See the link I posted above about this issue.

The problem is with -colorspace gray. In your later version of IM, you can get the old behavior back by replacing it with a linear grayscale conversion: -grayscale rec601luminance.

I tested this with the following:

Code: Select all

convert page_Image_0.png -colorspace gray tmp6a.png
im67710 convert page_Image_0.png -colorspace gray tmp6b.png
convert page_Image_0.png -grayscale rec601luminance tmp6a2.png

tmp6a and tmp6b are different. But tmp6a2 and tmp6b are similar.

Choices of gray can be found by

Code: Select all

convert -list intensity
Luma is non-linear (equivalent of gray sRGB).

Luminance is linear (equivalent of gray RGB).
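
To make the luma vs. luminance distinction concrete, here is a minimal sketch. It assumes the standard sRGB transfer function and the usual Rec. 601 weights (0.299, 0.587, 0.114); it illustrates the idea only and is not ImageMagick's actual code, whose exact constants and rounding may differ.

```javascript
// Illustration only: standard sRGB decoding and Rec. 601 weights,
// not ImageMagick's implementation.
const REC601 = { r: 0.299, g: 0.587, b: 0.114 };

// Decode one 8-bit sRGB channel value to linear light in [0, 1].
function srgbToLinear(v8) {
  const v = v8 / 255;
  return v <= 0.04045 ? v / 12.92 : Math.pow((v + 0.055) / 1.055, 2.4);
}

// Luma: weight the gamma-encoded values directly (non-linear gray).
function rec601Luma(r, g, b) {
  return REC601.r * r + REC601.g * g + REC601.b * b;
}

// Luminance: decode to linear light first, then weight (linear gray).
// The result is scaled back to 0..255 without re-applying gamma.
function rec601Luminance(r, g, b) {
  const y =
    REC601.r * srgbToLinear(r) +
    REC601.g * srgbToLinear(g) +
    REC601.b * srgbToLinear(b);
  return y * 255;
}

const sample = [128, 128, 128];
console.log('luma     :', rec601Luma(...sample).toFixed(1));
console.log('luminance:', rec601Luminance(...sample).toFixed(1));
```

Under these assumptions a mid-gray pixel comes out much darker as linear gray than as non-linear gray, so anything downstream that depends on tone distribution (here -lat and -contrast-stretch) sees a different image.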

Re: Textcleaner sensitive to IM Version?

Posted: 2016-05-04T15:07:07-07:00
by mstone
The missing spaces near the parens are just me reformatting the command line for the post. I'm fairly sure they are there in the real world, as the pieces of the command are assembled from an array, à la:

Code: Select all

childOutput = child_process.spawnSync('convert', [
  '(',
    operationObj.docpath,
    '-colorspace', 'gray',
    '-type', 'grayscale',
    '-contrast-stretch', '0',
  ')',
  '(',
    '-clone', '0--1',
    '-colorspace', 'gray',
    '-negate',
    '-lat', '15x15+5%',
    '-contrast-stretch', '0',
  ')',
  '-compose', 'copy_opacity',
  '-composite',
  '-opaque', 'none',
  // '+matte',  (obsolete; replaced by '-alpha', 'off' below)
  '-alpha', 'off',
  '-deskew', '40%',
  '-sharpen', '0x1',
  newName
], {timeout: operation_timeout});
  
child_process.spawnSync doesn't go through a shell by default; it hands each array element to convert as a separate argument. So the parentheses reach IM as standalone tokens, which is exactly what the surrounding spaces accomplish on a typed command line.
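
A quick way to see what the child process actually receives is to spawn something that echoes its arguments back. The sketch below uses a second Node process for that (via process.execPath) instead of convert, so it runs without ImageMagick; the file name is a placeholder.

```javascript
const { spawnSync } = require('child_process');

// Child script: print the arguments it received as a JSON array.
const echoArgs = 'console.log(JSON.stringify(process.argv.slice(1)))';

// process.execPath is the Node binary running this script.
// The file name deliberately contains a space to show it stays one argument.
const result = spawnSync(
  process.execPath,
  ['-e', echoArgs, '(', 'my file.png', ')'],
  { encoding: 'utf8' }
);

const received = JSON.parse(result.stdout);
console.log(received);
```

Each array element shows up as its own argument, parentheses included, with no shell quoting or splitting involved.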

I did try jpg and tiff output in both environments. In 6.7 it produced pretty much the same image, and pretty much identical (good) OCR results. In 6.8 it was weird: converting to tiff yielded a negative image (white on black), which produced terrible OCR results, while converting to jpg produced output pretty similar to the png, which is to say slightly too cleaned up, and therefore bad OCR results again.

Thanks,
- Matt

Re: Textcleaner sensitive to IM Version?

Posted: 2016-05-04T15:51:58-07:00
by fmw42
Did you see my post above yours about using -grayscale rec601luminance rather than -colorspace gray?

Re: Textcleaner sensitive to IM Version?

Posted: 2016-05-04T17:01:52-07:00
by mstone
Yes, I did. That's the change that made the difference. JPG / TIFF output didn't change much, but -grayscale rec601luminance brought the results much closer to what 6.7 had been producing.

I'm also going to try IM 7 when I can afford to scrub my VM and reinstall everything, but from what you say I'll probably need to use -grayscale to get results under that version as well.

Thank you again. Very much appreciated.

Best,
- Matt

Re: Textcleaner sensitive to IM Version?

Posted: 2016-05-04T17:28:00-07:00
by fmw42
Yes, you will need to use -grayscale rec601luminance to get results similar to IM 6.7.7.10
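
For reference, here is a sketch of the Node argument list from earlier in the thread with that fix applied. The helper name buildCleanerArgs and the file names are made up for illustration. Only the first -colorspace gray is swapped; whether the one inside the clone group should change too was not settled in this thread, so it is left as posted.

```javascript
// Sketch: the thread's convert arguments with the suggested fix applied.
// buildCleanerArgs is a hypothetical helper name, not part of any library.
function buildCleanerArgs(inputPath, outputPath) {
  return [
    '(',
      inputPath,
      // was: '-colorspace', 'gray'
      '-grayscale', 'rec601luminance',
      '-type', 'grayscale',
      '-contrast-stretch', '0',
    ')',
    '(',
      '-clone', '0--1',
      // left as in the original command; changing it too is untested here
      '-colorspace', 'gray',
      '-negate',
      '-lat', '15x15+5%',
      '-contrast-stretch', '0',
    ')',
    '-compose', 'copy_opacity',
    '-composite',
    '-opaque', 'none',
    '-alpha', 'off',
    '-deskew', '40%',
    '-sharpen', '0x1',
    outputPath,
  ];
}

const args = buildCleanerArgs('infile.png', 'outfile.png');
console.log(['convert', ...args].join(' '));
```

The returned array can be passed straight to child_process.spawnSync('convert', args, ...) in place of the inline array.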