Legacy ImageMagick Discussions Archive

Folks,

We have a huge number of PDFs of all different shapes and sizes, and we are trying to OCR the text from them using Tesseract.

The quality of the scan is the problem. We have been experimenting with colors and colorspace using "Quantize", and we have tried "Threshold", "Posterize", "WhiteThreshold", etc. There are two things I know for a fact. One is that how we convert the PDF into a tiff makes a HUGE difference in the scan. The other is that we just don't know enough about images to know what the best thing to do is.

So far about the best results have come from using

Quantize with colorspace=>'Gray'
and
Threshold=>'20',channel=>'All'

But the scan is still not where we want it to be.

Can anyone suggest anything else they think we should be doing to the IM Object before writing it out as a tiff to OCR?

TIA!!!
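To illustrate why the Threshold value mentioned above matters so much, here is a minimal pure-Python sketch of what a threshold does to grayscale pixels. It is not ImageMagick's implementation; the function name `threshold`, the 0-255 value range, and the sample row are all assumptions chosen for illustration.

```python
def threshold(gray_pixels, cutoff):
    """Map each grayscale value to white (255) if it exceeds cutoff, else black (0)."""
    return [255 if p > cutoff else 0 for p in gray_pixels]

# Hypothetical row of 8-bit grayscale values, from dark text to light background.
row = [12, 18, 25, 90, 200, 240]

print(threshold(row, 20))   # low cutoff: only the very darkest pixels stay black
print(threshold(row, 128))  # mid cutoff: mid-gray pixels stay black as well
```

With a low cutoff, the lighter, anti-aliased edges of characters turn white, which thins the strokes; a higher cutoff keeps more of them black. Which value works best depends on the document, so it is worth trying several.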

When converting a PDF, IM uses a default resolution of 72dpi. If you set this to 200dpi for example, the OCR program may work better. However, it will also take longer to convert the file and the output file will be bigger.

Pete
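As a rough sketch of the trade-off Pete describes, the arithmetic below shows how the density setting changes the rasterized page. It assumes a US Letter page (612 x 792 pt) and 10 pt body text, neither of which comes from the thread; the helper names are hypothetical. The only fixed fact used is that a PDF point is 1/72 inch.

```python
POINTS_PER_INCH = 72  # PDF user-space unit: 1 pt = 1/72 inch


def rendered_size(width_pt, height_pt, dpi):
    """Return (width_px, height_px) of a page rasterized at the given density."""
    scale = dpi / POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)


def text_height_px(font_size_pt, dpi):
    """Approximate height in pixels of text set at font_size_pt."""
    return font_size_pt * dpi / POINTS_PER_INCH


# Assumed example page: US Letter, 612 x 792 pt, with 10 pt text.
for dpi in (72, 200):
    w, h = rendered_size(612, 792, dpi)
    print(f"{dpi} dpi: {w} x {h} px, 10 pt text is about {text_height_px(10, dpi):.1f} px tall")
```

Going from 72 to 200 dpi gives each character nearly three times as many pixels in each direction for the OCR engine to work with, at the cost of roughly 7.7 times as many pixels per page to process and store.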

Thanks Pete, that helps a lot.

Still not quite where we want to be. Any other suggestions to improve the image?

TIA!

I have *very* limited experience scanning images for OCR. It might help if you could show the IM command you are using and provide a link to an example PDF.

Pete

Hi Pete,

I appreciate anything you could tell me. I'm doing this with PerlMagick, as we hope to create a script that will process a ton of different PDFs.

Here is an example PDF; it's just a generic ad:

http://208.79.234.36/test/752586.pdf

I'm just using a test script to do it (like I said, I want to be able to use PerlMagick).
This is it. 200 is the best density for sure; 150 or 250 resulted in a worse scan.

        use Image::Magick;
        my $image = Image::Magick->new;
        # Density must be set before Read() so the PDF is rasterized at 200 dpi.
        $image->Set(density=>'200x200');
        my $im_error = $image->Read($pdf_file);
        $image->Quantize(colorspace=>'Gray');
        $image->Set(Threshold=>'20',channel=>'All');
        $im_error = $image->Write("tif:$tiff_file");

Then I'm using tesseract on the tiff image.

Everything I do affects the scan: a higher or lower density makes it worse, no threshold is worse, and not converting to grayscale makes it worse.
That's about as far as I could get, I have no idea what I could do to make it better.

Thanks for taking the time to take a look!!!

Suggestions on PDF to tiff conversion for OCR scan
