Suggestions on PDF to TIFF conversion for OCR scan
Posted: 2008-07-03T10:43:48-07:00
Folks,
We have a huge number of PDFs of all different shapes and sizes, and we are trying to OCR the text from them using tesseract.
The quality of the OCR scan is the problem. We have been experimenting with colors and colorspaces using "Quantize", and we have also tried "Threshold", "Posterize", "WhiteThreshold", etc. There are two things I know for a fact. One is that how we convert the PDF into a TIFF makes a HUGE difference in the scan. The other is that we just don't know enough about images to know what the best approach is.
So far, the best results have come from using
Quantize with colorspace=>'Gray'
and
Threshold=>'20',channel=>'All'
But the scan is still not where we want it to be.
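To make it clearer what those two steps do to the pixels, here is a minimal Python sketch of "convert to gray, then threshold". It is not ImageMagick's implementation: the function names (to_gray, threshold, binarize) are made up for illustration, the Rec. 601 luma weights are an assumption, and the cutoff is treated as a plain level on a 0-255 scale, which may not be how ImageMagick interprets '20'.

```python
def to_gray(rgb):
    """Collapse one (r, g, b) pixel, 0-255 per channel, to one gray level."""
    r, g, b = rgb
    # Rec. 601 luma weights; an assumption, not necessarily what ImageMagick uses.
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def threshold(gray, cutoff):
    """Return 255 (white) if the gray level is above the cutoff, else 0 (black)."""
    return 255 if gray > cutoff else 0


def binarize(pixels, cutoff=20):
    """Apply gray conversion and thresholding to a 2-D list of RGB pixels."""
    return [[threshold(to_gray(p), cutoff) for p in row] for row in pixels]


if __name__ == "__main__":
    # A tiny hypothetical "page": white paper, dark gray text, near-black text.
    page = [
        [(255, 255, 255), (30, 30, 30), (10, 10, 10)],
        [(200, 180, 160), (0, 0, 0), (90, 90, 90)],
    ]
    for row in binarize(page):
        print(row)
```

In this sketch, a pixel at gray level 30 (fairly dark text) ends up white when the cutoff is 20, so only near-black pixels survive as black. If the real threshold behaves in a similar way, faint or anti-aliased text could be getting erased before tesseract ever sees it.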
Can anyone suggest anything else we should be doing to the IM object before writing it out as a TIFF for OCR?
TIA!!!