Hi all,
What I'm trying to achieve is to convert a jpg file to tiff so that it can be parsed with tesseract.
The problem: when I convert it with the convert command line utility, the tesseract output contains a lot of garbage. But if I process the same image on a Mac with Preview (simply auto-levels and save as tiff), the tesseract output is pretty good.
Looking at the images, I can see that the convert output is a bit darker. Any suggestions, not just for this particular case but for general use?
The command I'm using to convert:
convert -alpha set -auto-level -auto-gamma -compress none sample.jpg output.tiff
I also tried -normalize; it does not help.
Here are the files:
Original : http://dl.dropbox.com/u/12535857/tesseract/sample.jpg
Mac : http://dl.dropbox.com/u/12535857/tesseract/mac.tiff
Imagemagick: http://dl.dropbox.com/u/12535857/tesseract/ubuntu.tiff
Converting jpg to tiff for OCR with tesseract
- fmw42
- Posts: 25562
- Joined: 2007-07-02T17:14:51-07:00
- Authentication code: 1152
- Location: Sunnyvale, California, USA
Re: Converting jpg to tiff for OCR with tesseract
What version of IM are you using?
This works fine for me on IM 6.7.6.1 Q16 Mac OSX Snow Leopard.
convert 1sample.jpg -auto-level -compress none 1sample.tiff
The culprit that is causing it to darken is -auto-gamma. Leave it off. Also the proper IM 6 syntax is in general to read the input image first.
see
http://www.imagemagick.org/Usage/basics/#cmdline
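To make the ordering advice concrete, here is a minimal sketch of the command above wrapped in a shell function. The function name `to_ocr_tiff` and the `CONVERT` override are my own additions for illustration, not part of ImageMagick; the file names are placeholders. The input image is read first, the operators follow, and the output file comes last, with `-alpha set` and `-auto-gamma` left out.

```shell
# Hypothetical helper (name is an assumption): jpg -> uncompressed tiff for OCR.
# CONVERT can be overridden, e.g. CONVERT=echo, to print the arguments
# instead of running ImageMagick.
to_ocr_tiff() {
  # IM 6 order: input image first, then operators and settings, then output.
  "${CONVERT:-convert}" "$1" -auto-level -compress none "$2"
}

# Only run when ImageMagick and the sample image are actually present.
if command -v convert >/dev/null 2>&1 && [ -f sample.jpg ]; then
  to_ocr_tiff sample.jpg sample.tiff
  # tesseract takes the image and an output base name (writes sample.txt).
  if command -v tesseract >/dev/null 2>&1; then
    tesseract sample.tiff sample
  fi
fi
```

Running `CONVERT=echo to_ocr_tiff sample.jpg sample.tiff` is a cheap way to check the argument order before touching any files.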
Re: Converting jpg to tiff for OCR with tesseract
Thanks for the reply,
Version: ImageMagick 6.6.2-6 2011-03-16 Q16 http://www.imagemagick.org
It's on Ubuntu 11.04
Unfortunately, if I omit -auto-gamma, tesseract produces only garbage
- fmw42
Re: Converting jpg to tiff for OCR with tesseract
But your Mac file does not cause problems and it is not darkened by the auto-gamma.
Why do you have -alpha set enabled? It only turns on a perfectly opaque alpha channel, which I would think would not be good for your tesseract.
Did you try my command?
Try upgrading your IM as you are about 100 versions old.
Re: Converting jpg to tiff for OCR with tesseract
fmw42 wrote: Why do you have -alpha set enabled? It only turns on a perfectly opaque alpha channel, which I would think would not be good for your tesseract.
I wanted to make it 100% like the one on my Mac.
fmw42 wrote: Did you try my command?
Try upgrading your IM as you are about 100 versions old.
I tried your command before upgrading: garbage output. So I upgraded my ImageMagick to "Version: ImageMagick 6.7.6-1 2012-03-20 Q16".
After that even my old command failed: tesseract would simply output an empty file for the tiff.
After downloading the one created on my Mac and comparing it with the one created on Ubuntu with your command, I noticed that the Mac one is somewhat sharper. So I started playing with the -contrast option, and it seems that applying -contrast three times gets me where I want.
convert sample.jpg -auto-level -contrast -contrast -contrast -compress none sample.tiff
The file from the new ImageMagick without the contrast option - http://dl.dropbox.com/u/12535857/tesser ... agick.tiff
And with the contrast option - http://dl.dropbox.com/u/12535857/tesser ... trast.tiff
And the output:
Mac:
‘or the short pastry:
a|1.purp0se flour,
1% 613:5 more as needed
stick) unsalted butter
Z/6 cup sugar
5 egg yolks
Salt
l4 lb. (1
For the filling:
6 oz. blanched almonds
6 large eggs, separated
1% cup sugar
1 pinch ground cinnamon
Grated zest of 1 lemon
‘A cup pearjelly,
warmed to liquld
For the glaze:
1% cups sugar
107- (1 square) unsweetened
l chocolate
Code: Select all
/awe short pastry:
l all-purpose flour,
13‘ (“£35 more as needed
$4 "1 (1 stick) unsalted butter
1/6 cup sugar
5 egg yolks
Salt
For the filling:
602. blanched almonds
3 large eggs, separated
% cup sugar
1 pinch ground cinnamon
Grated zest of 1 lemon
‘A cup pearjelly,
warmed to liquld
For the glaze:
1% cups sugar
'91-(1 square) unsweetened
~ » chocolate
- fmw42
Re: Converting jpg to tiff for OCR with tesseract
If you are trying to clean up the background so that you get black text on a white background, then you can try my bash unix/IM script, textcleaner. Or you can try using -lat (local area threshold), which is what I use in the script. See the link below. Do not use -auto-gamma. Just use your jpg input image with my script and then save as tiff. You may have to play with the parameters some.
P.S. If you still need to do contrast adjustment, I suggest using -brightness-contrast XX,YY. You can set XX=0 for no brightness change and just adjust the contrast by YY (as a percent change). You might also try some noise cleaning (-blur, -median) and thresholding. But all of these adjustments are included in my textcleaner script.
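As a rough sketch of the contrast suggestion, the function below keeps brightness at 0 and changes only the contrast. The helper name `boost_contrast`, the `CONVERT` override, and the default of 20 percent are assumptions for illustration, not values from this thread. The ImageMagick option reference writes the argument as brightness x contrast, so the sketch uses `0x20` rather than a comma.

```shell
# Hypothetical helper (name and default are assumptions, not from the thread).
# $1 = input image, $2 = output tiff, $3 = contrast change in percent.
# CONVERT can be overridden, e.g. CONVERT=echo, for a dry run.
boost_contrast() {
  # Brightness stays at 0; only the contrast is changed.
  "${CONVERT:-convert}" "$1" -auto-level \
    -brightness-contrast "0x${3:-20}" -compress none "$2"
}

# Only run when ImageMagick and the sample image are actually present.
if command -v convert >/dev/null 2>&1 && [ -f sample.jpg ]; then
  boost_contrast sample.jpg sample.tiff 30
fi
```

The right percentage depends on the scan, so expect to try a few values and compare the tesseract output each time.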
Re: Converting jpg to tiff for OCR with tesseract
Cool! Thanks. The textcleaner was 100% superb without any fine-tuning. Best results so far. I've tried your autolevel script, but that did not help much.
Now I just need to find hosting where that script runs quicker than 20 seconds
Thanks for help!
- fmw42
Re: Converting jpg to tiff for OCR with tesseract
My autolevel script is not much different from the IM -auto-level combined with -auto-gamma. My script was a prototype for the IM function. Neither is what you really need. You can simplify my script by using -lat.
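For anyone who wants to try -lat directly, here is a minimal sketch. It is not the textcleaner script, only a guess at a simplified stand-in. The helper name `lat_threshold`, the `CONVERT` override, the grayscale and negate steps, and the 25 pixel window with a 10% offset are all assumptions that will likely need tuning per image.

```shell
# Hypothetical helper (not textcleaner; name and defaults are assumptions).
# $1 = input image, $2 = output tiff,
# $3 = window size in pixels, $4 = offset in percent.
# CONVERT can be overridden, e.g. CONVERT=echo, for a dry run.
lat_threshold() {
  size="${3:-25}"
  offset="${4:-10}"
  # Negate so the text is bright, threshold each pixel against its local
  # window average plus the offset, then negate back to dark text on white.
  "${CONVERT:-convert}" "$1" -colorspace gray -negate \
    -lat "${size}x${size}+${offset}%" -negate -compress none "$2"
}

# Only run when ImageMagick and the sample image are actually present.
if command -v convert >/dev/null 2>&1 && [ -f sample.jpg ]; then
  lat_threshold sample.jpg sample.tiff
fi
```

A larger window tolerates uneven lighting better but costs more time, which matters if the 20 second run time is already a problem.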