My company is moving away from an application called SimpleIndex that could OCR scanned images. I'm testing ImageMagick with Tesseract OCR as a replacement (hopefully driven from PHP to get the job done). We start with a PDF that combines several scanned images. I then use this ImageMagick command line to convert the PDF file to a TIF:
magick.exe convert -strip -alpha off -density 300 100492.PDF -depth 2 -quality 100 -compress zip 100492.TIF
- The original PDF size is at 2,573 KB.
- After ImageMagick, the TIF goes up to 4,219 KB.
Is there anything else I can do to reduce the TIF file size without lowering the preferred density of 300 or otherwise reducing the resolution that Tesseract works from?
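To make the depth and quality settings easier to vary while testing, the conversion step can be wrapped in a small script. This is only a minimal Python sketch, assuming `magick.exe` is on the PATH; the helper names `build_convert_cmd`, `size_kb`, and `convert_and_report` are made up for illustration, and the options are exactly the ones from the command above.

```python
import os
import subprocess


def build_convert_cmd(pdf_path, tif_path, depth=2, quality=100, density=300):
    """Return the ImageMagick argument list for the PDF -> TIF step.

    The option order mirrors the command in the post: -density comes
    before the input PDF, -depth/-quality/-compress come after it.
    """
    return [
        "magick.exe", "convert",
        "-strip", "-alpha", "off",
        "-density", str(density),
        pdf_path,
        "-depth", str(depth),
        "-quality", str(quality),
        "-compress", "zip",
        tif_path,
    ]


def size_kb(path):
    """File size in KB, taking 1 KB = 1024 bytes."""
    return os.path.getsize(path) / 1024


def convert_and_report(pdf_path, tif_path, **options):
    """Run the conversion and return (input KB, output KB)."""
    subprocess.run(build_convert_cmd(pdf_path, tif_path, **options), check=True)
    return size_kb(pdf_path), size_kb(tif_path)
```

Calling `convert_and_report("100492.PDF", "100492.TIF")` would run the original command; passing `depth=8, quality=92` gives the second variant described in the note at the end of this post.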
For more context: next, I use Tesseract to OCR the TIF file and output the result as a PDF.
- The end result is a 7,208 KB PDF.
- This is more than double the size of the SimpleIndex file, which is 3,589 KB.
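The post does not show the exact Tesseract command, so the following is a sketch under the assumption that the standard command-line form `tesseract <input> <output base> pdf` is used and that `tesseract` is on the PATH. The helper names are hypothetical; `size_ratio` just redoes the size comparison from the figures above.

```python
import subprocess


def build_tesseract_cmd(tif_path, output_base):
    """Return the Tesseract argument list for the TIF -> searchable PDF step.

    The trailing "pdf" selects Tesseract's PDF output; Tesseract appends
    the ".pdf" extension to output_base itself.
    """
    return ["tesseract", tif_path, output_base, "pdf"]


def run_tesseract(tif_path, output_base):
    """Run the OCR step and return the path of the PDF it should produce."""
    subprocess.run(build_tesseract_cmd(tif_path, output_base), check=True)
    return output_base + ".pdf"


def size_ratio(size_a_kb, size_b_kb):
    """How many times larger file A is than file B."""
    return size_a_kb / size_b_kb
```

With the numbers quoted here, `size_ratio(7208, 3589)` comes out at roughly 2.01 (Tesseract PDF versus SimpleIndex PDF), and `size_ratio(4219, 2573)` at roughly 1.64 (TIF versus original PDF).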
NOTE: Oddly enough, I tested another TIF file made from the same original PDF, changing only the ImageMagick depth from 2 to 8 and the quality from 100 to the default 92, which produced a 6,466 KB TIF file. Running Tesseract on it produced a PDF of exactly the same size, 7,208 KB.
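One way to tell whether the two PDFs merely share a size or are actually byte-for-byte identical is to hash them. The sketch below is an assumption-laden diagnostic, not part of the original workflow; the function names and any file names passed to them are hypothetical.

```python
import hashlib
import os


def file_fingerprint(path):
    """Return (size in bytes, SHA-256 hex digest) for a file."""
    sha = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in 1 MiB chunks so large PDFs are not loaded whole.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            sha.update(chunk)
    return os.path.getsize(path), sha.hexdigest()


def compare_files(path_a, path_b):
    """Report whether two files have the same size and the same content."""
    size_a, hash_a = file_fingerprint(path_a)
    size_b, hash_b = file_fingerprint(path_b)
    return {"same_size": size_a == size_b, "identical": hash_a == hash_b}
```

If `compare_files` reports `identical` as true for the two Tesseract outputs, the depth and quality changes made no difference to the final PDF at all; if only `same_size` is true, the contents differ even though the sizes match.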