I'm trying to get the text of a scanned PDF. I can't really give you the full version of the PDF since it's a legal document but here is a sample.
My goal here is to improve Tesseract's results, because currently I'm getting the following text:
PRIVACY NOTE: Section 31B of the Rea! PropertyAct 1900 (RF Act} authorises the Registrar General to coilectthe informétion required
by this form for the estabiishment and maintenanée of the Real Property Act Register. Section 968 RP Act requires that
the Register is made available to any person for search upon payment of a fee, if any.
I've been trying to improve the PDF quality using ImageMagick. I'm doing it manually for now, but I'm trying to find general settings that can be applied to all PDFs. Since this is part of a piece of software, I won't be able to play with the settings each time a PDF is uploaded.
One thing I've tried is using convert with the -lat option to remove small imperfections, like this:
convert -density 300 -monochrome -lat 15x15+10% in.pdf out.tif
The imperfections are removed, but Tesseract doesn't detect anything now. I thought the cleaned image would be easier for it to read, but it isn't.
I've seen the great textcleaner tool, with a lot of options, but as said before I can't really afford to change the settings for each pdf.
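To make the constraint concrete, here is a rough sketch of how I picture the conversion running inside the software: one fixed recipe, applied to every PDF with no per-file tuning. The function name ocr_pdf, the DRY_RUN switch and the uploads/ directory are made up for illustration, and the settings are simply the ones from my attempt above, not values I know to be good.

```shell
#!/bin/sh
# Sketch only: one fixed preprocessing recipe for every uploaded PDF.
# DENSITY and LAT_ARGS are copied from my manual attempt, not tuned values.
DENSITY=300            # rasterization resolution in DPI
LAT_ARGS="15x15+10%"   # local adaptive threshold: window size and offset

# ocr_pdf is a hypothetical helper name.
# Set DRY_RUN to any non-empty value to print the commands instead of running them.
ocr_pdf() {
    in_pdf=$1
    base=${in_pdf%.pdf}   # strip the .pdf suffix to build the output names

    # Same convert invocation as in my manual test, with the settings factored out.
    ${DRY_RUN:+echo} convert -density "$DENSITY" -monochrome -lat "$LAT_ARGS" "$in_pdf" "$base.tif"

    # Tesseract writes the recognized text to "$base.txt".
    ${DRY_RUN:+echo} tesseract "$base.tif" "$base"
}

# Hypothetical usage over an upload directory:
#   for f in uploads/*.pdf; do ocr_pdf "$f"; done
```

Whatever tool or settings get suggested would have to fit in place of that single convert line.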
We can assume all the PDFs will have the same issues, so is there any "automatic" tool that will try to clean up a PDF without being told exactly what to do?
Thanks in advance
Edit:
As requested, here is my IM version:
ImageMagick 7.0.5-4 Q16 x86_64 2017-03-25
The platform I use for my tests is macOS Sierra 10.12.4.