scanned paper pdfs and original image extraction.
Posted: 2013-04-08T09:25:27-07:00
by cyclondude
Pixel wizards of open source. I summon thee!
Although my mortal quarrels are surely nothing to your dances with the four elements, I seek you to lift my troubles.
I have used IM to convert scanned paper document PDFs to PNGs with good results, but I am curious whether there is a method to extract each image at its original resolution and pixel values. I'm including an example below where I believe the document has one image embedded for each page. My big concern is to maintain the original quality of the images embedded in the PDF as much as possible, without resizing or otherwise changing each page. Thanks. The ultimate goal of this project is to preprocess the original images for OCR.
[pdf]
http://www.muni.org/Departments/finance ... ection.pdf
What color of sorcery is protecting these legendary manuscripts?
Re: scanned paper pdfs and original image extraction.
Posted: 2013-04-08T10:00:52-07:00
by snibgo
The required incantation for this file is "-density 300".
convert -density 300 "2011 CAFR Introductory Section.pdf" ma.png
Re: scanned paper pdfs and original image extraction.
Posted: 2013-04-08T10:47:48-07:00
by whugemann
There is special software that extracts the single page images from a PDF scan. Under Windows, this would be xpdf, more specifically pdfimages, which extracts the embedded page images from your document as .ppm or .pbm files. The DOS command would be:
pdfimages your.pdf trunc
and would spit out
trunc-001.pbm
trunc-002.pbm
...
trunc-013.pbm
These could in turn be bulk-converted into Group 4-compressed TIFFs via IM.
However, the x- and y-resolutions of your scans are not the same, so snibgo's suggestion might be better.
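The bulk conversion mentioned above could be sketched as a small shell loop. This is only a sketch under assumptions: the `trunc` prefix matches the pdfimages call above, ImageMagick's `convert` is on the PATH, and the helper name `tiff_name` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: map an extracted image name to its TIFF name,
# e.g. trunc-001.pbm -> trunc-001.tif
tiff_name() {
    printf '%s\n' "${1%.*}.tif"
}

# Group 4 is a bilevel (1-bit) compression scheme, so this loop only
# touches the .pbm files that pdfimages wrote.
for f in trunc-*.pbm; do
    [ -e "$f" ] || continue          # skip if the glob matched nothing
    convert "$f" -compress Group4 "$(tiff_name "$f")"
done
```

Any .ppm files in the same directory are left alone; they would need a different compression setting.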
Re: scanned paper pdfs and original image extraction.
Posted: 2013-04-08T16:30:24-07:00
by anthony
IM itself uses 'ghostscript' to generate the pages from a PDF.
Re: scanned paper pdfs and original image extraction.
Posted: 2013-04-09T15:47:04-07:00
by cyclondude
Thank you humble warlocks. I pay homage to your wisdom.
I used pdfimages to convert from PDF to .ppm and .pbm files, and then converted these files to .tif. They look very clean and are working for what I was hoping for with OCR, but if you don't mind I have a couple more questions to complete my understanding.
1. The images I extracted are coming out in a compressed kind of aspect ratio that I would not expect from a scanned paper document. Why is this? Also, I notice that they are rotated 90 degrees. Are these just likely artifacts of the scanner software?
2. Why does
> pdfimages myfile.pdf output
make .ppm files for the first four pages and the rest are .pbm for the file I linked to above? They both are working fine for my use but why is this happening?
3. In your opinion, is Ghostscript a practical thing to learn, or are there equivalent wrapper functions in ImageMagick?
Thanks again seriously.
Re: scanned paper pdfs and original image extraction.
Posted: 2013-04-09T16:58:02-07:00
by snibgo
I know nothing about pdfimages.
In my limited use of PDF or PS documents, using IM as a wrapper to gs is sufficient for my needs. If it wasn't, I would learn gs.
Re: scanned paper pdfs and original image extraction.
Posted: 2013-04-23T10:31:07-07:00
by cyclondude
What can I do to this image to improve tesseract-ocr transcription quality?
http://s21.postimg.org/4wwardkk7/title.png
There are many images like it that have 1-5 words on it with the same pixel dimensions. Do they need to be larger? It is only 43 pixels tall.
Thanks.
Re: scanned paper pdfs and original image extraction.
Posted: 2013-04-23T10:35:40-07:00
by fmw42
cyclondude wrote: What can I do to this image to improve tesseract-ocr transcription quality?
http://s21.postimg.org/4wwardkk7/title.png
There are many images like it that have 1-5 words on it with the same pixel dimensions. Do they need to be larger? It is only 43 pixels tall.
Thanks.
What command did you use to get this image? Did it come from converting a pdf? If so, then you need to use a higher density to read in the pdf.
convert -density 288 image.pdf -resize 25% image.png
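This is supersampling: the PDF is rendered at 288 dpi, four times IM's default density of 72 dpi, and then reduced to 25%, so the output quality is better but the size remains the same as at the default density. If you want a larger output image, then leave off the -resize 25% and just set -density to some value that looks better for you.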
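The arithmetic behind the density and resize pair can be sketched in a few lines of shell. The helper name `resize_percent` is made up for illustration, and the script only prints the command rather than running it.

```shell
#!/bin/sh
# Hypothetical helper: percentage to pass to -resize so that a page rendered
# at render_dpi ends up the size it would have at target_dpi.
resize_percent() {
    target_dpi=$1
    render_dpi=$2
    echo $(( 100 * target_dpi / render_dpi ))   # integer division
}

# Rendering at 288 dpi and shrinking back to the 72 dpi size:
pct=$(resize_percent 72 288)
echo "convert -density 288 image.pdf -resize ${pct}% image.png"
```

The same helper gives 50 for a 600 dpi render shrunk to the 300 dpi size.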
Re: scanned paper pdfs and original image extraction.
Posted: 2013-04-23T11:49:36-07:00
by cyclondude
I used:
convert -density 300 mypdf.pdf output.png
After trying -density 600 mypdf.pdf -resize 50%, it seems to work better. These shouldn't give exactly the same image, right? The -density 600 and -resize 50% are resampling at a greater quality. Is that correct? Thanks.
Re: scanned paper pdfs and original image extraction.
Posted: 2013-04-23T13:05:42-07:00
by snibgo
If the PDF came from a scanner, I reckon the optimum "-density" setting is the same as the scanner resolution, which is often a multiple of 150 dpi. Then there is no need to resize, because that discards data that might be useful.
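If the scanner resolution is not known, it can be estimated from the pixel size of an extracted image and the page size in PDF points (1 point = 1/72 inch). The sketch below assumes the image fills the whole page and is not rotated; the helper name `scan_dpi` and the example numbers are made up for illustration. The pixel size could come from something like `identify -format "%w %h"` on an extracted file, and the page size from a tool such as `pdfinfo`.

```shell
#!/bin/sh
# Hypothetical helper: resolution in dpi of an image whose side has
# `pixels` samples and spans `page_pts` PDF points (72 points per inch).
scan_dpi() {
    pixels=$1
    page_pts=$2
    echo $(( pixels * 72 / page_pts ))   # integer division, rounds down
}

# Made-up example: a 2550 x 3300 pixel image on a 612 x 792 pt (US Letter) page.
x_dpi=$(scan_dpi 2550 612)
y_dpi=$(scan_dpi 3300 792)
echo "x: ${x_dpi} dpi, y: ${y_dpi} dpi"
```

If the two values differ, as they reportedly do for the file in this thread, -density also accepts an XxY pair. For an image stored rotated by 90 degrees, the pixel width has to be paired with the page height instead.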
Re: scanned paper pdfs and original image extraction.
Posted: 2013-04-29T04:05:33-07:00
by DominiqueMichel
cyclondude wrote: 1. The images I extracted are coming out in a compressed kind of aspect ratio that I would not expect from a scanned paper document. Why is this? Also, I notice that they are rotated 90 degrees. Are these just likely artifacts of the scanner software?
More likely from the PDF format. You can even have PDF files where the images are broken into multiple parts, and to get the images you have to extract the whole pages and process them later with GIMP or similar software.