Converting jpg to tiff for OCR with tesseract

Posted: 2012-03-20T10:27:34-07:00
by gerasalus
Hi all,

What I'm trying to do is convert a JPG file to TIFF so that it can be parsed with tesseract.

The problem: when I convert it with the convert command-line utility, the tesseract output contains a lot of garbage. But if I process the same image on a Mac with Preview (simply auto-levels and save as TIFF), the tesseract output is pretty good.

Looking at the images, I can see that the convert output is a bit darker. Any suggestions, not just for this particular case but for general use?

The command I'm using to convert:

convert -alpha set -auto-level -auto-gamma -compress none sample.jpg output.tiff

I also tried -normalize; it does not help.

Here are the files:

Original : http://dl.dropbox.com/u/12535857/tesseract/sample.jpg
Mac : http://dl.dropbox.com/u/12535857/tesseract/mac.tiff
Imagemagick: http://dl.dropbox.com/u/12535857/tesseract/ubuntu.tiff

Re: Converting jpg to tiff for OCR with tesseract

Posted: 2012-03-20T11:32:41-07:00
by fmw42
What version of IM are you using?

This works fine for me on IM 6.7.6.1 Q16 Mac OSX Snow Leopard.

convert 1sample.jpg -auto-level -compress none 1sample.tiff

The culprit that is causing it to darken is -auto-gamma. Leave it off. Also the proper IM 6 syntax is in general to read the input image first.

see
http://www.imagemagick.org/Usage/basics/#cmdline
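
For illustration, the read-the-input-first ordering and the OCR step that follows it can be sketched as a small dry-run helper. The function name `ocr_commands` is hypothetical, and the sketch assumes the `tesseract <image> <outputbase>` calling form and file names without spaces. It only prints the commands so they can be inspected before anything is run.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): print the convert and tesseract
# commands for one image without running them.
# Order for IM 6: input image first, then operators, then the output file.
ocr_commands() {
    in=$1
    base=${in%.*}    # strip the extension: sample.jpg -> sample
    # -auto-gamma is deliberately left out, as advised in the post above.
    printf 'convert %s -auto-level -compress none %s.tiff\n' "$in" "$base"
    # Assumed tesseract calling form: tesseract <image> <outputbase>
    printf 'tesseract %s.tiff %s\n' "$base" "$base"
}
```

To run the printed commands rather than just view them, the output can be piped to a shell, for example `ocr_commands sample.jpg | sh`.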

Re: Converting jpg to tiff for OCR with tesseract

Posted: 2012-03-20T13:06:21-07:00
by gerasalus
Thanks for the reply,

Version: ImageMagick 6.6.2-6 2011-03-16 Q16 http://www.imagemagick.org
It's on Ubuntu 11.04

Unfortunately, if I omit -auto-gamma, tesseract produces only garbage :(

Re: Converting jpg to tiff for OCR with tesseract

Posted: 2012-03-20T14:24:06-07:00
by fmw42
But your Mac file does not cause problems and it is not darkened by the auto-gamma.

Why do you have -alpha set enabled? It only turns on a perfectly opaque alpha channel, which I would think would not be good for your tesseract.

Did you try my command?

Try upgrading your IM as you are about 100 versions old.

Re: Converting jpg to tiff for OCR with tesseract

Posted: 2012-03-20T16:08:09-07:00
by gerasalus
fmw42 wrote: Why do you have -alpha set enabled? It only turns on a perfectly opaque alpha channel, which I would think would not be good for your tesseract.
I wanted to make it 100% like the one on my Mac.
fmw42 wrote: Did you try my command?

Try upgrading your IM as you are about 100 versions old.
I tried your command before upgrading: garbage output. So I upgraded ImageMagick to "Version: ImageMagick 6.7.6-1 2012-03-20 Q16".
After that, even my old command failed: tesseract simply produced an empty file from the TIFF.
After downloading the file created on my Mac and comparing it with the one created on Ubuntu with your command, I noticed that the Mac one is somewhat sharper. So I started playing with the -contrast option, and it seems that applying -contrast three times gets me where I want.

convert sample.jpg -auto-level -contrast -contrast -contrast -compress none sample.tiff

The file from the new ImageMagick without the contrast option - http://dl.dropbox.com/u/12535857/tesser ... agick.tiff
And with the contrast option - http://dl.dropbox.com/u/12535857/tesser ... trast.tiff

And the output:

Mac:

‘or the short pastry: 
a|1.purp0se flour, 
1% 613:5 more as needed 
stick) unsalted butter 
Z/6 cup sugar 
5 egg yolks 
Salt 
l4 lb. (1 
For the filling: 
6 oz. blanched almonds 
6 large eggs, separated 
1% cup sugar 
1 pinch ground cinnamon 
Grated zest of 1 lemon 
‘A cup pearjelly, 
warmed to liquld 
For the glaze: 
1% cups sugar 
107- (1 square) unsweetened 
l chocolate 
Ubuntu with high contrast:

/awe short pastry:
l all-purpose flour,
13‘ (“£35 more as needed
$4 "1 (1 stick) unsalted butter
1/6 cup sugar
5 egg yolks
Salt
For the filling:
602. blanched almonds
3 large eggs, separated
% cup sugar
1 pinch ground cinnamon
Grated zest of 1 lemon
‘A cup pearjelly,
warmed to liquld
For the glaze:
1% cups sugar
'91-(1 square) unsweetened
~ » chocolate
Now the question is: can I somehow automate this process by examining image levels or other properties, and decide how much contrast (or which other adjustments) to apply?

Re: Converting jpg to tiff for OCR with tesseract

Posted: 2012-03-20T16:44:30-07:00
by fmw42
If you are trying to clean up the background so that you get black text on a white background, you can try my bash Unix/IM script, textcleaner. Or you can try -lat (local adaptive threshold), which is what I use in the script. See the link below. Do not use -auto-gamma. Just run your JPG input image through my script and then save as TIFF. You may have to play with the parameters some.

P.S. If you still need to do contrast adjustment, I suggest using -brightness-contrast XX,YY. You can set XX=0 for no brightness change and just adjust the contrast by YY (as a percent change). You might also try some noise cleaning (-blur, -median) and thresholding. But all of these adjustments are included in my textcleaner script.
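
As a rough sketch of the manual route described in the P.S., the helper below assembles noise cleaning, a contrast-only adjustment, and a global threshold into one operator list. The function name `contrast_args` and the default values (median radius 1, contrast 20, threshold 50%) are assumptions for illustration, not settings from the thread. The contrast option is written in the `brightness x contrast` form used in the ImageMagick option documentation, with brightness fixed at 0.

```shell
#!/bin/sh
# Hypothetical helper: contrast_args [CONTRAST] [RADIUS] [THRESHOLD]
# Prints ImageMagick operators for median noise cleaning, a contrast-only
# change (brightness fixed at 0), and a global threshold.
# All defaults are assumed starting points; tune them per image.
contrast_args() {
    contrast=${1:-20}     # percent contrast change (YY in the post)
    radius=${2:-1}        # median filter radius in pixels
    threshold=${3:-50}    # global threshold as a percentage
    printf '%s' "-colorspace Gray -median ${radius} -brightness-contrast 0x${contrast} -threshold ${threshold}%"
}

# Usage (requires ImageMagick 6; the unquoted substitution is intentional so
# that the shell splits the operators into separate arguments):
#   convert sample.jpg $(contrast_args 30) -compress none sample.tiff
```

Keeping the operators in one function makes it easy to try several contrast values against the same image and compare the tesseract output for each.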

Re: Converting jpg to tiff for OCR with tesseract

Posted: 2012-03-21T12:39:40-07:00
by gerasalus
Cool! Thanks. textcleaner was 100% superb without any fine-tuning. Best results so far. I also tried your autolevel script, but that did not help much.
Now I just need to find hosting where that script runs in under 20 seconds :)

Thanks for help!

Re: Converting jpg to tiff for OCR with tesseract

Posted: 2012-03-21T15:06:56-07:00
by fmw42
My autolevel script is not much different from IM's -auto-level combined with -auto-gamma. My script was a prototype for the IM function. Neither is what you really need. You can simplify my script by using -lat.
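
A minimal sketch of the -lat approach mentioned above, for dark text on a light background. This is not the textcleaner script itself: the function name `lat_args`, the 25-pixel window, and the 10% offset are assumed starting values for illustration. The image is negated first so that the text is bright, thresholded against its local neighborhood, then negated back.

```shell
#!/bin/sh
# Hypothetical helper: lat_args [SIZE] [OFFSET]
# Prints ImageMagick operators for a local adaptive threshold.
# SIZE and OFFSET defaults are assumptions; adjust them per image.
lat_args() {
    size=${1:-25}      # width and height of the local window, in pixels
    offset=${2:-10}    # percent offset added to the local mean
    # Negate so text is bright, threshold locally, then negate back.
    printf '%s' "-colorspace Gray -negate -lat ${size}x${size}+${offset}% -negate"
}

# Usage (requires ImageMagick 6; the unquoted substitution is intentional so
# that the shell splits the operators into separate arguments):
#   convert sample.jpg $(lat_args 25 10) -compress none sample.tiff
```

A larger window generally copes better with slow changes in background brightness, while the offset controls how much darker than its surroundings a pixel must be to count as text.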