Office Document Imaging compatibility
Posted: 2009-11-05T12:40:03-07:00
Hello,
After some messing around, I've figured out how to use ImageMagick to convert PDF files into TIFF files that are compatible with Microsoft Office Document Imaging (MODI), because I'm cheap and I don't want to figure out how to use Tesseract. Since it took me entirely too long, I'm writing this post in the hope that future Internet-searchers have an easier time.
Weirdly enough, when you convert directly from PDF to TIFF, you get a file that's not compatible with MODI. However, if you go through a JPEG in between, the resulting file is compatible. Hopefully someone who knows image formats better than I do can look at the two different outputs and figure out which options you need to toggle to do it in one step.
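One way to hunt for the difference is to compare what ImageMagick reports about the two files. The sketch below is only a suggestion, not something I've verified against MODI: it assumes ImageMagick's identify tool is on the PATH, and the file names and the helper names describe and diff_reports are made up for illustration.

```python
import difflib
import subprocess
import sys


def describe(path):
    """Return the text printed by `identify -verbose` for one image file."""
    result = subprocess.run(
        ["identify", "-verbose", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def diff_reports(report_a, report_b, name_a="direct.tiff", name_b="via-jpeg.tiff"):
    """Return unified-diff lines covering only the lines that differ."""
    return list(difflib.unified_diff(
        report_a.splitlines(), report_b.splitlines(),
        fromfile=name_a, tofile=name_b, lineterm="", n=0,
    ))


# Usage (hypothetical file names): python compare.py direct.tiff via-jpeg.tiff
if __name__ == "__main__" and len(sys.argv) == 3:
    first, second = sys.argv[1], sys.argv[2]
    for line in diff_reports(describe(first), describe(second), first, second):
        print(line)
```

Lines that show up in the diff (compression, type, depth, and so on) are candidates for the option that has to change in a one-step conversion.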
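Here's what I did, as Windows commands typed at the command prompt (if you put them in a .bat file, every % below has to be doubled to %%):
Code (render each PDF page to its own JPEG):
convert -quality 100 -density 400 -resize 25% in.pdf out%d.jpg
Code (turn each JPEG into an uncompressed TrueColor TIFF):
FOR /F %a IN ('dir /b out*.jpg') DO convert -colorspace RGB +compress -type TrueColor -resize 300% %a "%a"-new.tiff
Code (join the single-page TIFFs into one multi-page TIFF):
convert -adjoin out*.jpg-new.tiff tiff:everything.tiff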
And then tadah! everything.tiff has your entire PDF, in a format that MODI can read and OCR pretty well. Of course, I've run into problems converting the OCR'd tiff back into a PDF - maybe I'll reply to this once I find out how to fix that.
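For anyone who would rather script the whole thing, here is a rough Python sketch of the same three steps. It is not what I actually ran: the function names are invented for illustration, it assumes that "convert" resolves to ImageMagick's convert (not the Windows disk utility of the same name), and it sorts the pages by number before joining them, since a wildcard may list out10 before out2 on longer PDFs.

```python
import glob
import re
import subprocess


def pdf_to_jpeg_cmd(pdf, pattern="out%d.jpg"):
    # Step 1: render at 400 dpi, shrink to 25%, write one JPEG per page.
    return ["convert", "-quality", "100", "-density", "400",
            "-resize", "25%", pdf, pattern]


def jpeg_to_tiff_cmd(jpg):
    # Step 2: re-save one JPEG as an uncompressed TrueColor TIFF at 300%.
    return ["convert", "-colorspace", "RGB", "+compress", "-type", "TrueColor",
            "-resize", "300%", jpg, jpg + "-new.tiff"]


def page_number(name):
    # Pull the first run of digits out of a name like "out10.jpg-new.tiff".
    match = re.search(r"\d+", name)
    return int(match.group()) if match else -1


def merge_cmd(tiffs, output="everything.tiff"):
    # Step 3: join the pages, in numeric order, into one multi-page TIFF.
    ordered = sorted(tiffs, key=page_number)
    return ["convert", "-adjoin", *ordered, "tiff:" + output]


def run_pipeline(pdf):
    # Runs the three steps in the current directory.
    subprocess.run(pdf_to_jpeg_cmd(pdf), check=True)
    jpgs = glob.glob("out*.jpg")
    for jpg in jpgs:
        subprocess.run(jpeg_to_tiff_cmd(jpg), check=True)
    subprocess.run(merge_cmd([jpg + "-new.tiff" for jpg in jpgs]), check=True)
```

Calling run_pipeline("in.pdf") from the folder that holds in.pdf should leave everything.tiff next to it. Passing the arguments as a list also sidesteps the percent-sign quoting that cmd needs.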
Hopefully this helps someone else.