# convert -version
Version: ImageMagick 6.5.2-4 2009-05-09 Q16 OpenMP http://www.imagemagick.org
Copyright: Copyright (C) 1999-2009 ImageMagick Studio LLC
WinXP SP3
I'm trying to create an application that will OCR PDF documents. To do this, I'm using IM to convert the files to TIFF, then using the MS Office Document Imaging library (MODI) to OCR the docs. All is well until MODI: it throws an exception that says the file is empty or corrupt. The TIFF looks fine in a viewer.
Passing a TIFF file created by our fax server to the GetOcrText() routine works fine. But so far, none of the TIFFs created by the IM PDF to TIFF conversion have worked.
Any idea what could cause this? Is there another flag I can specify in the conversion? [...know of a better way to approach this easily and low cost?]
TIA,
Steve
Relevant Code - C#.Net:
ConvertPdfToTif(vInFile, vOutFile);
// get OCR text
string vOcr = GetOcrText(vOutFile);
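// Call ImageMagick to convert the file and save it in pToPath
// (requires: using System; using System.Diagnostics;)
private void ConvertPdfToTif(string pFromPath, string pToPath)
{
    using (Process p = new Process())
    {
        p.StartInfo.UseShellExecute = false;
        // Only stderr is redirected; a redirected stream that is never read can fill up and block the child process
        p.StartInfo.RedirectStandardError = true;
        p.StartInfo.CreateNoWindow = true;
        p.StartInfo.FileName = "MagickCMD.exe";
        // create new file in zzTempFolder with .tif appended
        p.StartInfo.Arguments = "convert -colorspace rgb -density 400 "
            + "\"" + pFromPath + "\" -resize 25% \"" + pToPath + "\"";
        p.Start();
        // Drain stderr before waiting so the conversion cannot stall on a full pipe
        string vErrors = p.StandardError.ReadToEnd();
        p.WaitForExit();
        // Surface conversion failures instead of handing a bad file to the OCR step
        if (p.ExitCode != 0)
            throw new ApplicationException("ImageMagick conversion failed: " + vErrors);
    }
}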
private string GetOcrText(string pFilename)
{
    string vOcrText = "";
    MODI.Document vDoc = new MODI.Document();
    try
    {
        vDoc.Create(pFilename); // <----- Throws exception: File is empty or corrupt
    }
    catch (Exception ex)
    {
        // Show the actual error and stop here; running OCR on a document that failed to load cannot succeed
        MessageBox.Show("found exception: " + ex.Message);
        return vOcrText;
    }
    vDoc.OCR(MODI.MiLANGUAGES.miLANG_ENGLISH, true, true);
    [snip]
}
PDF to TIFF - possible corrupt TIFF
Re: PDF to TIFF - possible corrupt TIFF
It's possible that MS Office Document Imaging only accepts a subset of the TIFF image types. Try adding +compress to the command line to create an uncompressed TIFF image.
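One way to see which TIFF "type" each file actually is would be to print a few tags from the fax-server TIFF (which works) and from the ImageMagick TIFF (which fails) and compare them. Below is a minimal sketch, not a full TIFF reader: it only looks at the first image directory, and the class name TiffTagDump is made up for illustration. The tag numbers come from the TIFF 6.0 specification (258 BitsPerSample, 259 Compression, 262 PhotometricInterpretation, 277 SamplesPerPixel). Whether any difference it shows is what MODI objects to is an assumption that would still need testing.

```csharp
using System;
using System.IO;

// Hypothetical helper: prints a few baseline tags from the first IFD of a TIFF file.
static class TiffTagDump
{
    // Reads an unsigned 16-bit value in the byte order declared by the file.
    static ushort ReadU16(BinaryReader r, bool bigEndian)
    {
        byte[] b = r.ReadBytes(2);
        if (b.Length < 2) throw new EndOfStreamException();
        return bigEndian ? (ushort)((b[0] << 8) | b[1]) : (ushort)(b[0] | (b[1] << 8));
    }

    // Reads an unsigned 32-bit value in the byte order declared by the file.
    static uint ReadU32(BinaryReader r, bool bigEndian)
    {
        byte[] b = r.ReadBytes(4);
        if (b.Length < 4) throw new EndOfStreamException();
        return bigEndian
            ? ((uint)b[0] << 24) | ((uint)b[1] << 16) | ((uint)b[2] << 8) | b[3]
            : b[0] | ((uint)b[1] << 8) | ((uint)b[2] << 16) | ((uint)b[3] << 24);
    }

    static string TagName(ushort tag)
    {
        switch (tag)
        {
            case 258: return "BitsPerSample";
            case 259: return "Compression"; // 1 = none, 3/4 = CCITT Group 3/4 (fax), 5 = LZW, 32773 = PackBits
            case 262: return "PhotometricInterpretation"; // 0/1 = bilevel or gray, 2 = RGB
            case 277: return "SamplesPerPixel";
            default: return null;
        }
    }

    public static void Dump(string path)
    {
        using (FileStream fs = File.OpenRead(path))
        using (BinaryReader r = new BinaryReader(fs))
        {
            byte[] order = r.ReadBytes(2);
            if (order.Length < 2) throw new InvalidDataException("File too short to be a TIFF");
            bool bigEndian = order[0] == 'M' && order[1] == 'M';
            bool littleEndian = order[0] == 'I' && order[1] == 'I';
            if (!bigEndian && !littleEndian) throw new InvalidDataException("Missing TIFF byte-order mark");
            if (ReadU16(r, bigEndian) != 42) throw new InvalidDataException("Missing TIFF magic number 42");

            fs.Position = ReadU32(r, bigEndian);   // offset of the first IFD
            int entries = ReadU16(r, bigEndian);   // number of 12-byte directory entries
            for (int i = 0; i < entries; i++)
            {
                ushort tag = ReadU16(r, bigEndian);
                ushort type = ReadU16(r, bigEndian);
                uint count = ReadU32(r, bigEndian);
                long valuePos = fs.Position;
                ushort first = ReadU16(r, bigEndian);
                fs.Position = valuePos + 4;        // skip the rest of the 4-byte value field

                string name = TagName(tag);
                if (name == null) continue;
                // A SHORT (type 3) value is stored inline only when at most two of them fit in the field.
                string value = (type == 3 && count <= 2) ? first.ToString() : "(stored outside the entry)";
                Console.WriteLine(string.Format("{0,-26} count={1} value={2}", name, count, value));
            }
        }
    }

    static void Main(string[] args)
    {
        // Pass the paths of the TIFFs to compare on the command line.
        foreach (string path in args)
        {
            Console.WriteLine(path);
            Dump(path);
        }
    }
}
```

Run it once with the fax-server file and once with the converted file; any tag that differs between the two is a candidate to change on the convert command line.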
Re: PDF to TIFF - possible corrupt TIFF
Can you post a URL and the exact command you are using? We need to reproduce the problem. If we prove ImageMagick produces a valid TIFF image, the bug report would go to your application rather than here. However, let's confirm whether ImageMagick is producing faulty TIFF images first.
Re: PDF to TIFF - possible corrupt TIFF
TIFF 1: Created via the application by calling an external process as described above (giving filename and arguments):
Code:
p.StartInfo.FileName = "MagickCMD.exe";
p.StartInfo.Arguments = "convert -colorspace rgb -density 400 +compress "
+ "\"" + pFromPath + "\" -resize 25% \"" + pToPath + "\"";
TIFF 2: Created from the command line via MagickCMD.exe (to mimic the process above):
Code:
>> "c:\Program Files\ImageMagick-6.5.2-Q16\MagickCMD.exe" convert -colorspace rgb -density 400 +compress c:\TestFolder\sample.pdf -resize 25% c:\TestFolder\DocImaging\zzTempFolder\sample_MagicCmd.tif
TIFF 3: Created from the command line via convert.exe:
Code:
>> convert -colorspace rgb -density 400 +compress c:\TestFolder\sample.pdf -resize 25% C:\TestFolder\DocImaging\zzTempFolder\sample.pdf_cmdLineDirect.tif
All of these result in an exception when processed by the MS Document Imaging component.
Source of the PDF: http://samplepdf.com/sample.pdf
I have a feeling that there isn't much that can be done - the exception thrown by the MS component doesn't give much info as to what the issue is, and the TIFF can be viewed just fine...
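Since the exception message alone says so little, it may help to log the underlying error code as well. The sketch below assumes the MODI failure arrives as a System.Runtime.InteropServices.COMException, which is the usual case for COM interop but has not been confirmed for this library. The class name ComDiagnostics is made up for illustration, and the HRESULT in Main is a generic placeholder, not the value MODI actually returns.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper: formats an exception so that a COM HRESULT, if present, is visible.
static class ComDiagnostics
{
    public static string Describe(Exception ex)
    {
        COMException com = ex as COMException;
        if (com != null)
        {
            // ErrorCode holds the HRESULT returned by the COM component.
            return string.Format("COM error 0x{0:X8}: {1}", com.ErrorCode, com.Message);
        }
        return ex.GetType().Name + ": " + ex.Message;
    }

    static void Main()
    {
        // Placeholder exception for demonstration only; the real one would come from vDoc.Create().
        Exception sample = new COMException("File is empty or corrupt", unchecked((int)0x80004005));
        Console.WriteLine(Describe(sample));
    }
}
```

In GetOcrText(), the catch block could pass ex to ComDiagnostics.Describe() and show or log the result, which gives a code that can be searched for or quoted in a bug report.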