Converting jpg to tiff for OCR with tesseract

gerasalus
Posts: 4
Joined: 2012-03-20T10:10:04-07:00

Converting jpg to tiff for OCR with tesseract

Post by gerasalus »

Hi all,

What I'm trying to do is convert a JPG file to TIFF so that it can be parsed by tesseract.

The problem: when I convert it using the convert command-line utility, the tesseract output contains a lot of garbage. But if I process the same image on a Mac with Preview (simply auto-levels and save as TIFF), the tesseract output is pretty good.

Looking at the images, I can see that the convert output is a bit darker. Any suggestions, not just for this particular case but for general use?

The command I'm using to convert:

convert -alpha set -auto-level -auto-gamma -compress none sample.jpg output.tiff

I also tried -normalize; it does not help.

Here are the files:

Original : http://dl.dropbox.com/u/12535857/tesseract/sample.jpg
Mac : http://dl.dropbox.com/u/12535857/tesseract/mac.tiff
Imagemagick: http://dl.dropbox.com/u/12535857/tesseract/ubuntu.tiff
fmw42
Posts: 25562
Joined: 2007-07-02T17:14:51-07:00
Location: Sunnyvale, California, USA

Re: Converting jpg to tiff for OCR with tesseract

Post by fmw42 »

What version of IM are you using?

This works fine for me on IM 6.7.6.1 Q16 Mac OSX Snow Leopard.

convert 1sample.jpg -auto-level -compress none 1sample.tiff

The culprit that is causing it to darken is -auto-gamma. Leave it off. Also, the proper IM 6 syntax is, in general, to read the input image first.

See
http://www.imagemagick.org/Usage/basics/#cmdline
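
As a rough illustration of why -auto-gamma darkens a scanned page: as I understand the documented behaviour, it picks a gamma that moves the image mean to 50%, i.e. gamma = log(mean)/log(0.5). The helper name auto_gamma_for_mean below is hypothetical, and the formula is an assumption about the option, not something stated in this thread.

```shell
#!/bin/sh
# Hypothetical helper: estimate the gamma that -auto-gamma would pick
# for a normalized image mean in (0,1).
# Assumption: gamma = log(mean) / log(0.5).
auto_gamma_for_mean() {
    awk -v m="$1" 'BEGIN { printf "%.3f\n", log(m) / log(0.5) }'
}
```

A mostly white page has a high mean (say 0.85), which under this assumption gives a gamma well below 1, and -gamma values below 1 darken the image. A mean of 0.5 gives 1.000, i.e. no change.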

Re: Converting jpg to tiff for OCR with tesseract

Post by gerasalus »

Thanks for the reply,

Version: ImageMagick 6.6.2-6 2011-03-16 Q16 http://www.imagemagick.org
It's on Ubuntu 11.04.

Unfortunately, if I omit -auto-gamma, tesseract produces only garbage :(

Re: Converting jpg to tiff for OCR with tesseract

Post by fmw42 »

But your Mac file does not cause problems and it is not darkened by the auto-gamma.

Why do you have -alpha set enabled? It only turns on a perfectly opaque alpha channel, which I would think would not be good for your tesseract.

Did you try my command?

Try upgrading your IM as you are about 100 versions old.

Re: Converting jpg to tiff for OCR with tesseract

Post by gerasalus »

fmw42 wrote: Why do you have -alpha set enabled? It only turns on a perfectly opaque alpha channel, which I would think would not be good for your tesseract.
I wanted to make it 100% like the one on my Mac.
fmw42 wrote: Did you try my command?

Try upgrading your IM as you are about 100 versions old.
I tried your command before upgrading: garbage output. So I upgraded my ImageMagick to "Version: ImageMagick 6.7.6-1 2012-03-20 Q16".
After that, even my old command failed: tesseract would simply output an empty file for the tiff.
After downloading the file created on my Mac and comparing it with the one created on Ubuntu with your command, I noticed that the Mac one is somewhat sharper. So I started playing with the -contrast option, and it seems that applying -contrast three times gets me where I want.

convert sample.jpg -auto-level -contrast -contrast -contrast -compress none sample.tiff

The file from the new ImageMagick without the contrast option - http://dl.dropbox.com/u/12535857/tesser ... agick.tiff
And with the contrast option - http://dl.dropbox.com/u/12535857/tesser ... trast.tiff

And the output:

Mac:


‘or the short pastry: 
a|1.purp0se flour, 
1% 613:5 more as needed 
stick) unsalted butter 
Z/6 cup sugar 
5 egg yolks 
Salt 
l4 lb. (1 
For the filling: 
6 oz. blanched almonds 
6 large eggs, separated 
1% cup sugar 
1 pinch ground cinnamon 
Grated zest of 1 lemon 
‘A cup pearjelly, 
warmed to liquld 
For the glaze: 
1% cups sugar 
107- (1 square) unsweetened 
l chocolate 
Ubuntu with high contrast:


/awe short pastry:
l all-purpose flour,
13‘ (“£35 more as needed
$4 "1 (1 stick) unsalted butter
1/6 cup sugar
5 egg yolks
Salt
For the filling:
602. blanched almonds
3 large eggs, separated
% cup sugar
1 pinch ground cinnamon
Grated zest of 1 lemon
‘A cup pearjelly,
warmed to liquld
For the glaze:
1% cups sugar
'91-(1 square) unsweetened
~ » chocolate
Now the question is: can I somehow automate this process by examining the image levels or some other properties, and decide how much contrast (or which other adjustments) to apply?

Re: Converting jpg to tiff for OCR with tesseract

Post by fmw42 »

If you are trying to clean up the background so that you get black text on a white background, then you can try my bash unix/IM script, textcleaner. Or you can try using -lat (local area threshold), which is what I use in the script. See the link below. Do not use -auto-gamma. Just use your jpg input image with my script and then save as tiff. You may have to play with the parameters some.

P.S. If you still need to do contrast adjustment, I suggest using -brightness-contrast XX,YY. You can set XX=0 for no brightness change and just adjust the contrast by YY (as a percent change). You might also try some noise cleaning (-blur, -median) and thresholding. But all of these adjustments are included in my textcleaner script.
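
The two suggestions above (-lat and a contrast-only -brightness-contrast) might look roughly like the sketch below. The function names, the 25x25 window, the +10% offset and the contrast value of 30 are illustrative guesses, not values taken from textcleaner, and will need tuning per image.

```shell
#!/bin/sh
# Sketch only: window size, offset and contrast amount are assumptions.

# Local adaptive threshold. -lat turns a pixel white when it is lighter
# than its neighbourhood average plus the offset, so the image is negated
# first (text becomes light) and negated back afterwards, leaving black
# text on a white background.
lat_clean() {
    convert "$1" -colorspace gray -negate -lat 25x25+10% -negate -compress none "$2"
}

# Contrast-only adjustment: brightness 0 (unchanged), contrast +30.
contrast_only() {
    convert "$1" -auto-level -brightness-contrast 0x30 -compress none "$2"
}
```

Usage would be something like `lat_clean sample.jpg sample.tiff`. Note that the input image is read first and -auto-gamma is left out, as recommended earlier in the thread.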

Re: Converting jpg to tiff for OCR with tesseract

Post by gerasalus »

Cool, thanks! textcleaner was superb without any fine-tuning: the best results so far. I also tried your autolevel script, but that did not help much.
Now I just need to find hosting where that script runs in under 20 seconds :)

Thanks for help!

Re: Converting jpg to tiff for OCR with tesseract

Post by fmw42 »

My autolevel script is not much different from the IM -auto-level combined with -auto-gamma. My script was a prototype for the IM function. Neither is what you really need. You can simplify my script by using -lat.
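
For reference, a simplified -lat based pipeline feeding tesseract could be sketched as follows. The function name ocr_one and the -lat geometry are hypothetical; this is not a reduction of the actual textcleaner script, only a minimal stand-in under those assumptions.

```shell
#!/bin/sh
# Sketch only: ocr_one and the 25x25+10% geometry are illustrative.
ocr_one() {
    src=$1
    base=${src%.*}   # strip the extension: scans/sample.jpg -> scans/sample
    # Read the input first, then apply the operators (IM 6 ordering).
    convert "$src" -colorspace gray -negate -lat 25x25+10% -negate -compress none "$base.tiff" &&
    # tesseract takes an output base name and appends .txt itself.
    tesseract "$base.tiff" "$base"
}
```

Calling `ocr_one sample.jpg` would write sample.tiff and then sample.txt. Since a plain convert call does far less work than the full script, it may also help with the running time mentioned above, though that is untested here.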