PDF to TIFF - possible corrupt TIFF
Posted: 2009-05-15T14:04:24-07:00
# convert -version
Version: ImageMagick 6.5.2-4 2009-05-09 Q16 OpenMP http://www.imagemagick.org
Copyright: Copyright (C) 1999-2009 ImageMagick Studio LLC
WinXP SP3
I'm trying to create an application that will OCR PDF documents. To do this, I'm using IM to convert the files to TIFF, then using MS Office Document Imaging library (MODI) to OCR the docs. All is well until MODI - it throws an exception that says the file is empty or corrupt. The TIFF looks fine in a viewer.
Passing a TIFF file created by our fax server to the GetOcrText() routine works fine. But so far, none of the TIFFs created by the IM PDF to TIFF conversion have worked.
Any idea what could cause this? Is there another flag I can specify in the conversion? [...know of a better way to approach this easily and low cost?]
TIA,
Steve
Relevant Code - C#.Net:
ConvertPdfToTif(vInFile, vOutFile);
// get OCR text
string vOcr = GetOcrText(vOutFile);
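// Call ImageMagick to convert the file and save in pToPath
// (needs: using System; using System.Diagnostics;)
private void ConvertPdfToTif(string pFromPath, string pToPath)
{
    Process p = new Process();
    p.StartInfo.UseShellExecute = false;
    p.StartInfo.RedirectStandardError = true;
    p.StartInfo.FileName = "MagickCMD.exe";
    // create new file in zzTempFolder with .tif appended
    p.StartInfo.Arguments = "convert -colorspace rgb -density 400 "
        + "\"" + pFromPath + "\" -resize 25% \"" + pToPath + "\"";
    p.StartInfo.CreateNoWindow = true;
    p.Start();
    // Read stderr before waiting: a redirected stream that is never read can block the child process
    string vErrors = p.StandardError.ReadToEnd();
    p.WaitForExit();
    // Surface conversion failures instead of silently continuing
    if (p.ExitCode != 0)
        throw new InvalidOperationException("convert failed: " + vErrors);
}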
private string GetOcrText(string pFilename)
{
    string vOcrText = "";
    MODI.Document vDoc = new MODI.Document();
    try
    {
        vDoc.Create(pFilename); // <----- Throws exception: File is empty or corrupt
    }
    catch (Exception ex)
    {
        // Show the actual message and stop; OCR cannot run on a document that failed to open
        MessageBox.Show("found exception: " + ex.Message);
        return vOcrText;
    }
    vDoc.OCR(MODI.MiLANGUAGES.miLANG_ENGLISH, true, true);
    [snip]
}
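Since the fax server's TIFF is accepted, one variation that might be worth trying is to make the IM output look more like a fax TIFF (no alpha channel, black and white, CCITT Group 4 compression). The following is only a sketch under that assumption, not a confirmed fix: whether MODI is rejecting the file because of alpha, bit depth or compression is a guess, and ConvertArgs and BuildFaxStyleArguments are hypothetical names introduced for illustration. The options themselves (-alpha, -monochrome, -compress) are standard convert options.

```csharp
using System;

public static class ConvertArgs
{
    // Builds the argument string passed to MagickCMD.exe.
    // The original options are kept; the three options after -resize are
    // untested guesses aimed at producing a fax-style TIFF:
    //   -alpha off        drop any transparency channel
    //   -monochrome       reduce the image to black and white
    //   -compress Group4  write CCITT Group 4 compressed data
    public static string BuildFaxStyleArguments(string pFromPath, string pToPath)
    {
        return "convert -colorspace rgb -density 400 "
            + "\"" + pFromPath + "\" -resize 25% "
            + "-alpha off -monochrome -compress Group4 "
            + "\"" + pToPath + "\"";
    }
}
```

In ConvertPdfToTif this would replace the Arguments assignment, e.g. p.StartInfo.Arguments = ConvertArgs.BuildFaxStyleArguments(pFromPath, pToPath); if the result still fails in MODI, removing the added options one at a time should show which of them, if any, matters.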