Multi-page pdf to single tiff or jpg file
Posted: 2007-10-18T08:04:39-07:00
I have to turn a multi-page PDF file that contains images of text into a JPG or TIFF file so the OCR software in HP Document Viewer can read it. The OCR software can't batch-process, so I want to feed it a single file containing all the pages from the PDF file.
After running the commands below, I changed the resolution of PCIE-Report-Image.tif to 300 dpi using XnView before running the OCR, because otherwise the OCR software rejects the file for being too low-resolution. In the first command I used a resolution of 150 because 300 pretty much crashed my computer.
I'm not sure how to pick the proper resolution for the conversion, and I don't know how to find the resolution of the original PDF file. I also can't tell whether the PDF-to-TIFF conversion I did with Ghostscript was lossy, or whether appending the pages into a single file with ImageMagick was lossy. I want lossless. The OCR may actually have been more accurate on the JPG than on the TIFF: a spell check found 509 misspellings in the OCR output from the JPG ( http://www.polisource.com/misc/ocr-test.txt ) and 540 in the OCR output from the TIFF ( http://www.polisource.com/misc/ocr-test-2.txt ). So I was wondering whether I need to add something to the commands to make them lossless.
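On the resolution question, here is a minimal sketch of one way it could be estimated, assuming each PDF page is a single scanned image that fills the whole page. The idea is that the scan's effective resolution is its pixel width divided by the page width in inches, with the page width given in PDF points (72 per inch by default). The function name effective_dpi and the sample numbers are hypothetical; I don't know the actual pixel dimensions of the images in this PDF.

```python
POINTS_PER_INCH = 72  # default PDF user-space units per inch


def effective_dpi(image_px: int, page_pt: float) -> float:
    """Pixels per inch of an image that spans page_pt points on the page."""
    inches = page_pt / POINTS_PER_INCH
    return image_px / inches


if __name__ == "__main__":
    # Hypothetical example: a scan 1700 pixels wide placed across a
    # 612-point-wide page (612 pt = 8.5 in). Not measured from the real file.
    print(effective_dpi(1700, 612))
```

If the result for the real file came out near 200, rendering at -r200 would roughly match the scan; rendering higher than the scan's own resolution would only interpolate pixels, not add detail.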
The source pdf document is a local copy of http://democrats.science.house.gov/Medi ... A%20IG.pdf
Here are the commands I used (first I used Ghostscript to convert, then I discovered that I needed ImageMagick to join the pages):
Code:
gswin32c -sOutputFile=PCIE-Report-%03d.tif -r150 -sDEVICE=tiff32nc -dBATCH -dNOPAUSE "document.pdf"
convert PCIE-Report-001.tif PCIE-Report-002.tif PCIE-Report-003.tif PCIE-Report-004.tif PCIE-Report-005.tif PCIE-Report-006.tif PCIE-Report-007.tif PCIE-Report-008.tif PCIE-Report-009.tif PCIE-Report-010.tif PCIE-Report-011.tif PCIE-Report-012.tif PCIE-Report-013.tif -append PCIE-Report-Image.tif
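For a rough sense of why -r300 was so much heavier than -r150, here is a back-of-the-envelope size estimate for the commands above. It assumes US Letter pages (8.5 x 11 in), which I have not verified for this document, and it counts only raw uncompressed pixel data at 4 bytes per pixel (tiff32nc is 32-bit CMYK) for the 13 pages stacked by -append. The function name raster_size is just for illustration, and the memory ImageMagick actually needs may well be higher than the raw figure.

```python
def raster_size(width_in, height_in, dpi, bytes_per_pixel, pages):
    """Estimate raw (uncompressed) pixel data for pages stacked vertically."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    page_bytes = width_px * height_px * bytes_per_pixel
    return {
        "page_px": (width_px, height_px),
        "stacked_px": (width_px, height_px * pages),  # what -append produces
        "page_bytes": page_bytes,
        "total_bytes": page_bytes * pages,
    }


if __name__ == "__main__":
    # Assumed page size: US Letter. 13 pages and 4 bytes/pixel match the
    # commands above (PCIE-Report-001..013, tiff32nc).
    for dpi in (150, 300):
        est = raster_size(8.5, 11, dpi, 4, 13)
        print(dpi, est["stacked_px"], round(est["total_bytes"] / 1e6), "MB")
```

Under these assumptions the appended image is about 109 MB of raw pixels at 150 dpi and about 438 MB at 300 dpi, since doubling the resolution quadruples the pixel count.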
To make the JPG for comparison with the TIFF, I just changed the extension of PCIE-Report-Image at the end of the second command to .jpg.
Sorry this turned out to be so long... basically I'm asking how to convert a pdf document into a single tiff file for use by OCR software.