Multipage Tiff not accepted by OmniPage

Post by shanz »

I can use IrfanView to convert a pdf into a multipage tiff file.
This file can be imported into OmniPage (then OCRed).

If I use ImageMagick to convert the same pdf into a multipage tiff file, then the file cannot be imported into OmniPage.
OmniPage complains - "OmniPage SE cannot load or scan this image.".

I use this command which results in a tiff with a similar filesize to the one produced by IrfanView.

convert -density 96x96 myPdfDoc.pdf -resize 1056x816 myTifDoc.tif

The file can be opened by other applications, e.g. Microsoft Office Document Imaging, Microsoft Picture and Fax Viewer, and IrfanView.
If I open it in IrfanView and then resave it, then OmniPage is happy with it.

What is going on?

Re: Multipage Tiff not accepted by OmniPage

Post by magick »

ImageMagick creates a valid TIFF image, but one that OmniPage cannot read. To fix this, you need to identify which TIFF types OmniPage supports and add options to the ImageMagick command line to produce that type. Post the output of the identify -verbose command for the image ImageMagick creates and the one IrfanView creates. Perhaps we can figure out the requirements from the differences in image attributes.
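
One way to narrow that comparison, as a minimal sketch: filter each identify -verbose dump down to the attributes a strict TIFF reader is most likely to care about. The helper name tiff_summary and the choice of attributes are assumptions made for illustration; nothing here comes from OmniPage's documentation.

```shell
# tiff_summary is a hypothetical helper: it reads `identify -verbose`
# output on stdin and keeps only a hand-picked set of attributes.
tiff_summary() {
  grep -E '^ *(Geometry|Resolution|Units|Type|Depth|Endianess|Compression|tiff:[a-z-]+):'
}

# Tiny demonstration on canned input: the "Matte color" line is dropped.
printf '  Depth: 16-bit\n  Matte color: grey74\n  Units: Undefined\n' | tiff_summary
```

With ImageMagick installed, piping identify -verbose 'pdf2TiffWithIM.tif[0]' through tiff_summary, and doing the same for the IrfanView file, gives two short listings that can be compared by eye or with diff. The [0] suffix restricts identify to the first page.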

Re: Multipage Tiff not accepted by OmniPage

Post by shanz »

Thanks for the prompt reply!
Here are the first image sections from the output of each command:

$ identify -verbose pdf2TiffWithIM.tif

    Image: pdf2TiffWithIM.tif
    Format: TIFF (Tagged Image File Format)
    Class: DirectClass
    Geometry: 631x816+0+0
    Resolution: 96x96
    Print size: 6.57292x8.5
    Units: Undefined
    Type: TrueColor
    Base type: TrueColor
    Endianess: MSB
    Colorspace: RGB
    Depth: 16-bit
    Channel depth:
      red: 16-bit
      green: 16-bit
      blue: 16-bit
    Channel statistics:
      red:
        min: 0 (0)
        max: 65535 (1)
        mean: 44559.9 (0.67994)
        standard deviation: 18432.9 (0.281268)
        kurtosis: -1.12467
        skewness: -0.0634544
      green:
        min: 0 (0)
        max: 65535 (1)
        mean: 46875.9 (0.715281)
        standard deviation: 16544.7 (0.252457)
        kurtosis: -0.533182
        skewness: -0.234325
      blue:
        min: 0 (0)
        max: 65535 (1)
        mean: 61374.2 (0.93651)
        standard deviation: 12525.3 (0.191123)
        kurtosis: 14.7549
        skewness: -3.89405
    Image statistics:
      Overall:
        min: 0 (0)
        max: 65535 (1)
        mean: 38202.5 (0.582933)
        standard deviation: 26843.9 (0.409612)
        kurtosis: -1.45017
        skewness: -0.349018
    Rendering intent: Undefined
    Interlace: None
    Background color: white
    Border color: rgb(223,223,223)
    Matte color: grey74
    Transparent color: black
    Compose: Over
    Page geometry: 631x816+0+0
    Dispose: Undefined
    Iterations: 0
    Scene: 0 of 133
    Compression: None
    Orientation: TopLeft
    Properties:
      date:create: 2009-12-08T11:58:05+00:00
      date:modify: 2009-12-08T11:58:52+00:00
      signature: 1b9949721ad8d7975a8be9dd8c797ad0c0350bbb514d92fd6e82eb11d57a3075
      tiff:document: pdf2TiffWithIM.tif
      tiff:photometric: RGB
      tiff:rows-per-strip: 2
      tiff:software: ImageMagick 6.5.8-3 2009-11-28 Q16 http://www.imagemagick.org
    Artifacts:
      verbose: true
    Tainted: False
    Filesize: 365.8MiB
    Number pixels: 503KiB
    Pixels per second: 264KiB
    User time: 1.906u
    Elapsed time: 0:02.906
    Version: ImageMagick 6.5.8-3 2009-11-28 Q16 http://www.imagemagick.org

$ identify -verbose pdf2TiffWithIrfanView.tif

    Image: pdf2TiffWithIrfanView.tif
    Format: TIFF (Tagged Image File Format)
    Class: DirectClass
    Geometry: 816x1056+0+0
    Resolution: 96x96
    Print size: 8.5x11
    Units: PixelsPerInch
    Type: TrueColor
    Base type: TrueColor
    Endianess: MSB
    Colorspace: RGB
    Depth: 8-bit
    Channel depth:
      red: 8-bit
      green: 8-bit
      blue: 8-bit
    Channel statistics:
      red:
        min: 0 (0)
        max: 255 (1)
        mean: 173.426 (0.6801)
        standard deviation: 73.0236 (0.286367)
        kurtosis: -1.03044
        skewness: -0.125362
      green:
        min: 0 (0)
        max: 255 (1)
        mean: 182.436 (0.715437)
        standard deviation: 65.7418 (0.257811)
        kurtosis: -0.354149
        skewness: -0.321318
      blue:
        min: 0 (0)
        max: 255 (1)
        mean: 238.832 (0.936594)
        standard deviation: 50.467 (0.19791)
        kurtosis: 15.0705
        skewness: -3.96344
    Image statistics:
      Overall:
        min: 0 (0)
        max: 255 (1)
        mean: 148.673 (0.583033)
        standard deviation: 105.103 (0.412167)
        kurtosis: -1.46045
        skewness: -0.350269
    Rendering intent: Undefined
    Interlace: None
    Background color: white
    Border color: rgb(223,223,223)
    Matte color: grey74
    Transparent color: black
    Compose: Over
    Page geometry: 816x1056+0+0
    Dispose: Undefined
    Iterations: 0
    Scene: 0 of 133
    Compression: None
    Orientation: TopLeft
    Properties:
      date:create: 2009-12-08T11:48:14+00:00
      date:modify: 2009-12-08T11:49:43+00:00
      signature: 0748cf4c66b909da4b45afe14169d71560d81a5d8ec90e0182970010c220881c
      tiff:photometric: RGB
      tiff:rows-per-strip: 3
      tiff:software: IrfanView
    Artifacts:
      verbose: true
    Tainted: False
    Filesize: 328.3MiB
    Number pixels: 842KiB
    Pixels per second: 15.7KiB
    User time: 5.438u
    Elapsed time: 0:54.631
    Version: ImageMagick 6.5.8-3 2009-11-28 Q16 http://www.imagemagick.org

The output from the verbose identify command on the IrfanView generated tif is 2344KB whereas the output from the ImageMagick generated tif is only 258KB.
Do you want me to upload both files so you can 'diff' them?

Re: Multipage Tiff not accepted by OmniPage

Post by magick »

ImageMagick produces 16-bit-per-channel TIFF images. Let's start with that. Add -depth 8 to your command line.

Re: Multipage Tiff not accepted by OmniPage

Post by shanz »

Yes, I saw that and have already tried it.
It does make it work but the resolution is awful! Worse than with IrfanView.
I thought I'd try ImageMagick to see if I could get better resolution than with IrfanView.

I use OmniPage for OCRing so I'd like optimum resolution (ideally 300x300 according to the OmniPage manual).
I don't really understand how to achieve that from the original pdf.

I did an identify on the pdf and it says -

Image: myPDFDoc.pdf
Format: PDF (Portable Document Format)
Class: DirectClass
Geometry: 612x792+0+0
Resolution: 72x72
Print size: 8.5x11
Units: Undefined
Type: TrueColor
Endianess: Undefined
Colorspace: RGB
Depth: 16/8-bit
Channel depth:
  red: 8-bit
  green: 8-bit
  blue: 8-bit
Channel statistics:
  red:
    min: 0 (0)
    max: 65535 (1)
    mean: 44555 (0.679866)
    standard deviation: 18776.5 (0.286511)
    kurtosis: -1.02682
    skewness: -0.126939
  green:
    min: 0 (0)
    max: 65535 (1)
    mean: 46879.1 (0.71533)
    standard deviation: 16890.2 (0.257728)
    kurtosis: -0.352392
    skewness: -0.321098
  blue:
    min: 0 (0)
    max: 65535 (1)
    mean: 61340.1 (0.93599)
    standard deviation: 13007.6 (0.198484)
    kurtosis: 14.8609
    skewness: -3.93556
Image statistics:
  Overall:
    min: 0 (0)
    max: 65535 (1)
    mean: 38193.6 (0.582796)
    standard deviation: 27006.7 (0.412096)
    kurtosis: -1.46048
    skewness: -0.349525
Rendering intent: Undefined
Interlace: None
Background color: white
Border color: rgb(223,223,223)
Matte color: grey74
Transparent color: black
Compose: Over
Page geometry: 612x792+0+0
Dispose: Undefined
Iterations: 0
Scene: 0 of 133
Compression: Undefined
Orientation: Undefined
Properties:
  date:create: 2009-12-08T14:27:20+00:00
  date:modify: 2009-12-08T14:28:19+00:00
  pdf:HiResBoundingBox: 612x792+0+0
  pdf:Version: PDF-1.4
  signature: 53b63bf32cc565833b88617295723381714ffd823a3c95a179de302998c6b6cc
Artifacts:
  verbose: true
Tainted: False
Filesize: 183.1MiB
Number pixels: 473KiB
Pixels per second: 522KiB
User time: 0.891u
Elapsed time: 0:01.905
Version: ImageMagick 6.5.8-3 2009-11-28 Q16 http://www.imagemagick.org
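
For a PDF, the pixel size of the rendered page follows from the page size in points (1/72 inch) and the -density used at read time. The sketch below works that out for the 612x792 page reported above at the 300 dpi the OmniPage manual is said to recommend. The helper name pixels_at_dpi and the round-to-nearest rule are assumptions for illustration.

```shell
# pixels_at_dpi is a hypothetical helper.
# $1 = page size in points (1/72 inch), $2 = target dpi.
# Result is rounded to the nearest whole pixel.
pixels_at_dpi() {
  echo $(( ($1 * $2 + 36) / 72 ))
}

dpi=300
width=$(pixels_at_dpi 612 "$dpi")
height=$(pixels_at_dpi 792 "$dpi")
echo "${width}x${height}"   # prints 2550x3300

# With ImageMagick installed, a render at that density, with no -resize,
# would be along these lines (untested against OmniPage):
#   convert -density 300 -units PixelsPerInch myPDFDoc.pdf -depth 8 myTifDoc.tif
```

At 96 dpi the same arithmetic gives 816x1056, which matches the geometry of the IrfanView file. Note that an uncompressed 8-bit RGB page at 2550x3300 is roughly 25 MB, so 133 pages would exceed 3 GB; an option such as -compress LZW may help, though whether OmniPage accepts LZW-compressed TIFF is not established in this thread.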

Re: Multipage Tiff not accepted by OmniPage

Post by fmw42 »

I don't know if this will help, but the print size of your ImageMagick file differs from that of your IrfanView file, and the units in your ImageMagick file are undefined. Try adding -units PixelsPerInch after -density.

see http://www.imagemagick.org/script/comma ... .php#units

Re: Multipage Tiff not accepted by OmniPage

Post by shanz »

I'm now trying GSview to achieve the same goal.
I can open the pdf and convert using a device such as tiff24nc (24-bit RGB, 8-bits per component) and select the max available resolution of 204x196.
When the resultant Tiff is opened in OmniPage SE it says
  • Image Resolution is 98,102 (originally 196, 204)
  • Image Size is 1146x843
The Tiff's filesize is rather large (over 1GB!).
I then use OmniPage SE to save each page as a jpeg (approx 100-200 KB each).
Canon MP Navigator 2.0 then converts all these jpegs into a pdf of 16MB.
Unfortunately it doesn't seem to be searchable! Arghh! So close!

Re: Multipage Tiff not accepted by OmniPage

Post by shanz »

I've had some success with ImageMagick now.

My latest command is:

convert -depth 8 -density 96x96 -units PixelsPerInch file.pdf -resize 816x1056 myTifDoc.tif

I swapped from 1056x816 to 816x1056 and the OCR recognition suddenly began working again!
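
A likely reason the swap mattered: -resize WxH fits the image inside a W-by-H box while keeping the aspect ratio, so a landscape box shrinks a portrait page. The sketch below reproduces that arithmetic for an 816x1056 page (the 612x792 pt PDF page at 96 dpi). The helper name fit_in_box and its round-to-nearest rule are assumptions for illustration, not ImageMagick's actual code.

```shell
# fit_in_box is a hypothetical helper.
# $1 $2 = image width and height, $3 $4 = box width and height.
# Prints the largest size that fits in the box with the aspect ratio kept.
fit_in_box() {
  w=$1 h=$2 bw=$3 bh=$4
  if [ $(( bh * w )) -lt $(( bw * h )) ]; then
    # Height is the limiting side: scale the width by bh/h.
    echo "$(( (w * bh + h / 2) / h ))x${bh}"
  else
    # Width is the limiting side: scale the height by bw/w.
    echo "${bw}x$(( (h * bw + w / 2) / w ))"
  fi
}

fit_in_box 816 1056 1056 816   # landscape box: prints 631x816
fit_in_box 816 1056 816 1056   # portrait box:  prints 816x1056
```

The first result matches the 631x816 geometry in the identify output for the earlier ImageMagick file, so that file carried noticeably fewer pixels per page than the IrfanView one.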

I've now tried IrfanView, GSview and ImageMagick, and so far the latter, with the command above, seems to give the best OCR results.

Thanks everyone.

If anyone can suggest how I might improve the OCR recognition further please shout!
I tried
convert -depth 8 -density 150x150 -units PixelsPerInch file.pdf -resize 816x1056 myTifDoc.tif
but it seemed to be slightly worse than with the lower density of 96x96.
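
One possible explanation, offered as a sketch and not a confirmed diagnosis: raising -density renders more pixels, but the -resize 816x1056 that follows scales the page back down to the same pixel count, so the effective resolution is still 96 dpi. The helper name effective_dpi is an assumption for illustration.

```shell
# effective_dpi is a hypothetical helper.
# $1 = pixel width of the final image, $2 = page width in points (1/72 inch).
# Prints the resolution those pixels represent across the page, rounded.
effective_dpi() {
  echo $(( ($1 * 72 + $2 / 2) / $2 ))
}

effective_dpi 1275 612   # rendered at -density 150: prints 150
effective_dpi 816 612    # after -resize 816x1056:  prints 96
```

As far as I know, -resize also leaves the stored density untouched (-resample is the option that adjusts it), so the resized file would still be tagged 150 dpi while holding 96 dpi worth of pixels. Dropping -resize entirely, for example convert -density 300 -units PixelsPerInch file.pdf -depth 8 myTifDoc.tif, keeps every rendered pixel; whether that improves OmniPage's recognition would need testing.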