I can use IrfanView to convert a pdf into a multipage tiff file.
This file can be imported into OmniPage (then OCRed).
If I use ImageMagick to convert the same pdf into a multipage tiff file, then the file cannot be imported into OmniPage.
OmniPage complains: "OmniPage SE cannot load or scan this image."
I use this command which results in a tiff with a similar filesize to the one produced by IrfanView.
convert -density 96x96 myPdfDoc.pdf -resize 1056x816 myTifDoc.tif
The file can be opened by other applications, e.g. Microsoft Office Document Imaging, Microsoft Picture and Fax Viewer, and IrfanView.
If I open it in IrfanView and then resave it, then OmniPage is happy with it.
What is going on?
Multipage Tiff not accepted by OmniPage
Re: Multipage Tiff not accepted by OmniPage
ImageMagick creates a valid TIFF image, but one that OmniPage cannot read. To fix this, you need to identify which TIFF types OmniPage supports and add options to the ImageMagick command line to produce one of them. Post the output of the following command:
- identify -verbose
Re: Multipage Tiff not accepted by OmniPage
Thanks for the prompt reply!
Here are the first image section snippets from the output of
- $identify -verbose pdf2TiffWithIM.tif
Image: pdf2TiffWithIM.tif
Format: TIFF (Tagged Image File Format)
Class: DirectClass
Geometry: 631x816+0+0
Resolution: 96x96
Print size: 6.57292x8.5
Units: Undefined
Type: TrueColor
Base type: TrueColor
Endianess: MSB
Colorspace: RGB
Depth: 16-bit
Channel depth:
red: 16-bit
green: 16-bit
blue: 16-bit
Channel statistics:
red:
min: 0 (0)
max: 65535 (1)
mean: 44559.9 (0.67994)
standard deviation: 18432.9 (0.281268)
kurtosis: -1.12467
skewness: -0.0634544
green:
min: 0 (0)
max: 65535 (1)
mean: 46875.9 (0.715281)
standard deviation: 16544.7 (0.252457)
kurtosis: -0.533182
skewness: -0.234325
blue:
min: 0 (0)
max: 65535 (1)
mean: 61374.2 (0.93651)
standard deviation: 12525.3 (0.191123)
kurtosis: 14.7549
skewness: -3.89405
Image statistics:
Overall:
min: 0 (0)
max: 65535 (1)
mean: 38202.5 (0.582933)
standard deviation: 26843.9 (0.409612)
kurtosis: -1.45017
skewness: -0.349018
Rendering intent: Undefined
Interlace: None
Background color: white
Border color: rgb(223,223,223)
Matte color: grey74
Transparent color: black
Compose: Over
Page geometry: 631x816+0+0
Dispose: Undefined
Iterations: 0
Scene: 0 of 133
Compression: None
Orientation: TopLeft
Properties:
date:create: 2009-12-08T11:58:05+00:00
date:modify: 2009-12-08T11:58:52+00:00
signature: 1b9949721ad8d7975a8be9dd8c797ad0c0350bbb514d92fd6e82eb11d57a3075
tiff:document: pdf2TiffWithIM.tif
tiff:photometric: RGB
tiff:rows-per-strip: 2
tiff:software: ImageMagick 6.5.8-3 2009-11-28 Q16 http://www.imagemagick.org
Artifacts:
verbose: true
Tainted: False
Filesize: 365.8MiB
Number pixels: 503KiB
Pixels per second: 264KiB
User time: 1.906u
Elapsed time: 0:02.906
Version: ImageMagick 6.5.8-3 2009-11-28 Q16 http://www.imagemagick.org
- $identify -verbose pdf2TiffWithIrfanView.tif
Image: pdf2TiffWithIrfanView.tif
Format: TIFF (Tagged Image File Format)
Class: DirectClass
Geometry: 816x1056+0+0
Resolution: 96x96
Print size: 8.5x11
Units: PixelsPerInch
Type: TrueColor
Base type: TrueColor
Endianess: MSB
Colorspace: RGB
Depth: 8-bit
Channel depth:
red: 8-bit
green: 8-bit
blue: 8-bit
Channel statistics:
red:
min: 0 (0)
max: 255 (1)
mean: 173.426 (0.6801)
standard deviation: 73.0236 (0.286367)
kurtosis: -1.03044
skewness: -0.125362
green:
min: 0 (0)
max: 255 (1)
mean: 182.436 (0.715437)
standard deviation: 65.7418 (0.257811)
kurtosis: -0.354149
skewness: -0.321318
blue:
min: 0 (0)
max: 255 (1)
mean: 238.832 (0.936594)
standard deviation: 50.467 (0.19791)
kurtosis: 15.0705
skewness: -3.96344
Image statistics:
Overall:
min: 0 (0)
max: 255 (1)
mean: 148.673 (0.583033)
standard deviation: 105.103 (0.412167)
kurtosis: -1.46045
skewness: -0.350269
Rendering intent: Undefined
Interlace: None
Background color: white
Border color: rgb(223,223,223)
Matte color: grey74
Transparent color: black
Compose: Over
Page geometry: 816x1056+0+0
Dispose: Undefined
Iterations: 0
Scene: 0 of 133
Compression: None
Orientation: TopLeft
Properties:
date:create: 2009-12-08T11:48:14+00:00
date:modify: 2009-12-08T11:49:43+00:00
signature: 0748cf4c66b909da4b45afe14169d71560d81a5d8ec90e0182970010c220881c
tiff:photometric: RGB
tiff:rows-per-strip: 3
tiff:software: IrfanView
Artifacts:
verbose: true
Tainted: False
Filesize: 328.3MiB
Number pixels: 842KiB
Pixels per second: 15.7KiB
User time: 5.438u
Elapsed time: 0:54.631
Version: ImageMagick 6.5.8-3 2009-11-28 Q16 http://www.imagemagick.org
The output from the verbose identify command on the IrfanView generated tif is 2344KB whereas the output from the ImageMagick generated tif is only 258KB.
Do you want me to upload both files so you can 'diff' them?
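Rather than diffing the full outputs, it may be enough to compare the few fields a strict TIFF reader is likely to care about. The sketch below is only an illustration: key_fields is a hypothetical helper name, the list of fields is a guess at what matters, and the sample input is copied from the ImageMagick output above.

```shell
#!/bin/sh
# key_fields (hypothetical helper): keep only the identify -verbose lines
# that describe geometry, resolution, units, depth, compression and TIFF tags.
key_fields() {
  grep -E '^[[:space:]]*(Geometry|Resolution|Print size|Units|Depth|Compression|tiff:[a-z-]+):'
}

# Demonstration on a few lines taken from the output posted in this thread.
key_fields <<'EOF'
Format: TIFF (Tagged Image File Format)
Geometry: 631x816+0+0
Resolution: 96x96
Units: Undefined
Type: TrueColor
Depth: 16-bit
Compression: None
tiff:rows-per-strip: 2
EOF

# With ImageMagick installed, the two files could be compared like this
# ("[0]" selects only the first page, which is much faster on 133 pages):
#   identify -verbose 'pdf2TiffWithIM.tif[0]'        | key_fields > im.txt
#   identify -verbose 'pdf2TiffWithIrfanView.tif[0]' | key_fields > iv.txt
#   diff im.txt iv.txt
```

On the two outputs above, this narrows the differences down to Geometry, Print size, Units, Depth and rows-per-strip.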
Re: Multipage Tiff not accepted by OmniPage
Your ImageMagick file has 16 bits per channel, whereas the IrfanView file has 8. Let's start with that. Add the following option to your command line:
- -depth 8
Re: Multipage Tiff not accepted by OmniPage
Yes, I saw that and have already tried it.
It does make it work but the resolution is awful! Worse than with IrfanView.
I thought I'd try ImageMagick to see if I could get better resolution than with IrfanView.
I use OmniPage for OCRing so I'd like optimum resolution (ideally 300x300 according to the OmniPage manual).
I don't really understand how to achieve that from the original pdf.
I did an identify on the pdf and it says -
Image: myPDFDoc.pdf
Format: PDF (Portable Document Format)
Class: DirectClass
Geometry: 612x792+0+0
Resolution: 72x72
Print size: 8.5x11
Units: Undefined
Type: TrueColor
Endianess: Undefined
Colorspace: RGB
Depth: 16/8-bit
Channel depth:
red: 8-bit
green: 8-bit
blue: 8-bit
Channel statistics:
red:
min: 0 (0)
max: 65535 (1)
mean: 44555 (0.679866)
standard deviation: 18776.5 (0.286511)
kurtosis: -1.02682
skewness: -0.126939
green:
min: 0 (0)
max: 65535 (1)
mean: 46879.1 (0.71533)
standard deviation: 16890.2 (0.257728)
kurtosis: -0.352392
skewness: -0.321098
blue:
min: 0 (0)
max: 65535 (1)
mean: 61340.1 (0.93599)
standard deviation: 13007.6 (0.198484)
kurtosis: 14.8609
skewness: -3.93556
Image statistics:
Overall:
min: 0 (0)
max: 65535 (1)
mean: 38193.6 (0.582796)
standard deviation: 27006.7 (0.412096)
kurtosis: -1.46048
skewness: -0.349525
Rendering intent: Undefined
Interlace: None
Background color: white
Border color: rgb(223,223,223)
Matte color: grey74
Transparent color: black
Compose: Over
Page geometry: 612x792+0+0
Dispose: Undefined
Iterations: 0
Scene: 0 of 133
Compression: Undefined
Orientation: Undefined
Properties:
date:create: 2009-12-08T14:27:20+00:00
date:modify: 2009-12-08T14:28:19+00:00
pdf:HiResBoundingBox: 612x792+0+0
pdf:Version: PDF-1.4
signature: 53b63bf32cc565833b88617295723381714ffd823a3c95a179de302998c6b6cc
Artifacts:
verbose: true
Tainted: False
Filesize: 183.1MiB
Number pixels: 473KiB
Pixels per second: 522KiB
User time: 0.891u
Elapsed time: 0:01.905
Version: ImageMagick 6.5.8-3 2009-11-28 Q16 http://www.imagemagick.org
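For what it's worth, the PDF geometry above is in points (72 per inch), so 612x792 is an 8.5x11 inch page, and -density controls how many pixels per inch the page is rendered at. A rough sketch of the arithmetic follows; pixels_at is a made-up helper and myTifDoc300.tif is a hypothetical output name. The final line only prints a candidate command rather than running it, and whether OmniPage SE accepts the result is untested here.

```shell
#!/bin/sh
# pixels_at (hypothetical helper): pixels along one side of a page that is
# POINTS long (1 point = 1/72 inch) when rendered at DPI, rounded to nearest.
pixels_at() {
  points=$1
  dpi=$2
  echo $(( (points * dpi + 36) / 72 ))
}

page_w=612   # from "Geometry: 612x792+0+0" reported for the PDF
page_h=792

for dpi in 72 96 300; do
  echo "${dpi} dpi: $(pixels_at "$page_w" "$dpi")x$(pixels_at "$page_h" "$dpi")"
done

# Dry run: print the command instead of executing it.
# No -resize here, so the 300 dpi rendering is not scaled back down.
echo convert -density 300 -units PixelsPerInch myPdfDoc.pdf -depth 8 myTifDoc300.tif
```

The 96 dpi row works out to 816x1056, which matches the geometry of the IrfanView file. At 300 dpi each page is roughly ten times as many pixels, so expect a much larger file and a slower conversion for 133 pages.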
Re: Multipage Tiff not accepted by OmniPage
Don't know if this will help, but your IM print size is different from your IrfanView one, and your IM units are undefined. Try adding -units PixelsPerInch after -density.
see http://www.imagemagick.org/script/comma ... .php#units
Re: Multipage Tiff not accepted by OmniPage
I'm now trying GSview to achieve the same goal.
I can open the pdf and convert using a device such as tiff24nc (24-bit RGB, 8-bits per component) and select the max available resolution of 204x196.
When the resultant Tiff is opened in OmniPage SE it says
- Image Resolution is 98,102 (originally 196, 204)
- Image Size is 1146x843
I then use OmniPage SE to save each page as a jpeg (approx 100-200 KB each).
Canon MP Navigator 2.0 then converts all these jpegs into a pdf of 16MB.
Unfortunately it doesn't seem to be searchable! Arghh! So close!
Re: Multipage Tiff not accepted by OmniPage
I've had some success with ImageMagick now.
My latest command is :-
convert -depth 8 -density 96x96 -units PixelsPerInch file.pdf -resize 816x1056 myTifDoc.tif
I swapped from 1056x816 to 816x1056 and the OCR recognition suddenly began working again!
I've now tried IrfanView, GSview and ImageMagick, and so far the last of these, with the command above, seems to give the best OCR results.
Thanks everyone.
If anyone can suggest how I might improve the OCR recognition further please shout!
I tried
convert -depth 8 -density 150x150 -units PixelsPerInch file.pdf -resize 816x1056 myTifDoc.tif
but it seemed to be slightly worse than with the lower density of 96x96.
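Both observations are consistent with how -resize reads its argument: the geometry is width x height, and by default the image is scaled to fit inside that box with its aspect ratio kept. The sketch below approximates that rule in integer arithmetic. fit_box is a hypothetical helper and the rounding is an assumption, not ImageMagick's exact code. The source sizes are the 8.5x11 inch page rendered at 96 and 150 dpi.

```shell
#!/bin/sh
# fit_box (hypothetical helper): scale SRC_W x SRC_H to fit inside
# BOX_W x BOX_H while keeping the aspect ratio, rounding to nearest pixel.
fit_box() {
  src_w=$1; src_h=$2; box_w=$3; box_h=$4
  # Compare box_w/src_w with box_h/src_h by cross-multiplying.
  if [ $(( box_w * src_h )) -le $(( box_h * src_w )) ]; then
    # Width is the limiting side.
    out_w=$box_w
    out_h=$(( (src_h * box_w + src_w / 2) / src_w ))
  else
    # Height is the limiting side.
    out_h=$box_h
    out_w=$(( (src_w * box_h + src_h / 2) / src_h ))
  fi
  echo "${out_w}x${out_h}"
}

# 96 dpi rendering (816x1056) with the original, landscape-shaped box.
fit_box 816 1056 1056 816
# 96 dpi rendering with the corrected, portrait-shaped box.
fit_box 816 1056 816 1056
# 150 dpi rendering (1275x1650) with the same portrait box.
fit_box 1275 1650 816 1056
```

The first call gives 631x816, which is the geometry identify reported for the original ImageMagick file, so each page had been shrunk to fit a landscape box. The last two calls both give 816x1056: rendering at 150 dpi and then resizing to 816x1056 throws the extra pixels away again, which may be why the higher density did not help.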