Basic settings for Tesseract

Post by **Vico** » 2017-07-18T01:22:15-07:00

Hello guys,

I'm trying to get the text of a scanned PDF. I can't really give you the full version of the PDF since it's a legal document but here is a sample.

My goal is to improve Tesseract's results, because currently I'm getting the following text:

PRIVACY NOTE: Section 31B of the Rea! PropertyAct 1900 (RF Act} authorises the Registrar General to coilectthe informétion required
by this form for the estabiishment and maintenanée of the Real Property Act Register. Section 968 RP Act requires that

the Register is made available to any person for search upon payment of a fee, if any.

I've been trying to improve the PDF quality using ImageMagick. I'm doing it manually for now, but I'm trying to find general settings that can be applied to all PDFs. Since this is part of a software product, I won't be able to play with the settings each time I upload a PDF.

One thing I've tried is using convert with the -lat option to remove small imperfections, like this:

Code:

 convert -density 300 -monochrome -lat 15x15+10% in.pdf out.tif

The imperfections are removed, but Tesseract doesn't detect anything now. I thought the cleaned image would be easier for it, but no.

I've seen the great textcleaner tool, which has a lot of options, but as I said before, I can't really afford to change the settings for each PDF.

We can assume all PDFs will have the same issues, so is there any "automatic" tool that will try to fix a PDF without being told exactly what to do?

Thanks in advance

Edit:
As requested, here is my IM version:
ImageMagick 7.0.5-4 Q16 x86_64 2017-03-25

The platform I use for my tests is macOS Sierra 10.12.4

Post by **fmw42** » 2017-07-18T09:20:00-07:00

Please always provide your IM version and platform when asking questions, since syntax may vary.

Try it without -monochrome, and read the input right after setting the density. You might also try larger densities; if you do, increase the two 15 values in the -lat argument as well.

Code:

convert -density 300 in.pdf -negate -lat 15x15+10% -negate out.tif
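Since the goal is one fixed recipe for every upload, here is a minimal sketch of how the command above might be wrapped for a batch of PDFs and handed to Tesseract. The function names lat_window and ocr_pdf and the DENSITY variable are made up for illustration. Scaling the window in proportion to the density (15 at 300) is only an assumption based on the advice to "increase the 15 arguments", not a tested rule. It also assumes that convert and tesseract are both on the PATH.

```shell
#!/bin/sh
# Sketch: apply one fixed ImageMagick recipe to each PDF, then OCR the result.
# Assumes ImageMagick's `convert` and the `tesseract` CLI are installed.

# Print the -lat argument for a given density, keeping 15x15 at 300 dpi.
# Proportional scaling is an assumption; tune it against your own scans.
lat_window() {
    density=$1
    size=$(( density * 15 / 300 ))
    printf '%sx%s+10%%\n' "$size" "$size"
}

# Convert one PDF to a thresholded TIFF and run Tesseract on it.
# Writes <name>.tif and <name>.txt next to the input file.
ocr_pdf() {
    pdf=$1
    density=${2:-300}
    base=${pdf%.pdf}
    # -density is set before the input is read, as suggested above.
    convert -density "$density" "$pdf" \
        -negate -lat "$(lat_window "$density")" -negate "$base.tif"
    tesseract "$base.tif" "$base"
}

# Process every PDF given on the command line (does nothing with no arguments).
for pdf in "$@"; do
    ocr_pdf "$pdf" "${DENSITY:-300}"
done
```

Saved as, say, ocr.sh, it could be run as `sh ocr.sh in.pdf` or, for a higher density, `DENSITY=600 sh ocr.sh in.pdf`. Comparing the resulting text at a couple of densities on a few representative scans is probably the quickest way to settle on values that hold for all of them.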
