I think this question has already been answered, but I would like to cover some specifics of my task.
I have 660 JPEGs at 1920x2573 (the first and last are 1920x2575) and want to turn them into a single-file PDF ebook.
The current size is 774 MB. I would also like to run OCR to make it searchable.
With the help of someone on the IRC channel, I plan to proceed like this:
1. Convert the 660 JPEGs to individual PDFs, keeping the original aspect ratio, resolution and ppi.
2. Use an OCR tool such as tesseract to make them searchable.
3. Compact the PDFs so the result is practical, portable and readable on screen. With ImageMagick? What are the possibilities?
4. Merge the PDFs into a single file. What are the tools? ImageMagick?
What are the commands?
I have reached this, but it has multiple answers and I didn't understand some of the commands.
Thanks for your help
JPEG to PDF
-
- Posts: 12159
- Joined: 2010-01-23T23:01:33-07:00
- Authentication code: 1151
- Location: England, UK
Re: JPEG to PDF
You have about 5 million pixels per page, so 660 pages is about 3.3 G pixels. ImageMagick isn't good at creating PDFs that have more pixels than fit into memory, and yours probably won't fit. A program called "pdfunite" can take multiple PDFs, one per page, and create a multiple-page PDF from them.
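To make the pixel estimate above concrete, here is a small arithmetic sketch in plain shell. The page counts and dimensions come from the first post. The figure of 8 bytes per pixel is an assumption (4 channels of 16 bits each, as a Q16 non-HDRI build of ImageMagick 6 typically stores them), so treat the memory number as a rough guide only.

```shell
#!/bin/sh
# Page counts and dimensions are taken from the first post.
pages_regular=658      # pages at 1920x2573
pages_odd=2            # first and last page, at 1920x2575
pixels_regular=$((1920 * 2573))
pixels_odd=$((1920 * 2575))
total_pixels=$((pages_regular * pixels_regular + pages_odd * pixels_odd))

# Assumption: 8 bytes per pixel (4 channels x 16 bits), which is what a
# Q16 non-HDRI build of ImageMagick 6 typically uses in its pixel cache.
bytes_per_pixel=8
total_gib=$((total_pixels * bytes_per_pixel / 1024 / 1024 / 1024))

echo "pixels per page: $pixels_regular"
echo "total pixels:    $total_pixels"
echo "approx. memory:  ${total_gib} GiB"
```

Under that assumption, holding all 660 pages at once would need far more RAM than a typical desktop has, which is why converting one page at a time and merging afterwards is attractive.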
So you could use IM to create one PDF per JPG, eg:
Code:
convert in_0001.jpg page_0001.pdf
... looping through all the pages. Or use mogrify. Then:
Code:
pdfunite page_*.pdf output.pdf
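The "looping through all the pages" step could be sketched as a small shell function. This is only an illustration: the function name `jpegs_to_pdf`, its argument order and the directory names in the usage line are hypothetical, and it assumes the scans end in `.jpeg` as described later in the thread.

```shell
#!/bin/sh
# jpegs_to_pdf SRC_DIR OUT_DIR OUTPUT_PDF
# Converts every *.jpeg in SRC_DIR to a one-page PDF in OUT_DIR, then
# merges the pages with pdfunite. Name and argument order are hypothetical.
jpegs_to_pdf() (
    src=$1 out=$2 merged=$3
    mkdir -p "$out" || exit 1
    for jpg in "$src"/*.jpeg; do
        [ -e "$jpg" ] || continue            # the glob matched nothing
        page=$out/$(basename "$jpg" .jpeg).pdf
        convert "$jpg" "$page" || exit 1     # only one image in memory at a time
    done
    # File names that start with a date and time sort chronologically,
    # so the shell glob hands the pages to pdfunite in scan order.
    pdfunite "$out"/*.pdf "$merged"
)
```

A call might look like `jpegs_to_pdf scans pages output.pdf`. Because the body runs in a subshell, a failed conversion stops the function without closing the calling shell.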
snibgo's IM pages: im.snibgo.com
Re: JPEG to PDF
So you are saying:
1. Convert: does convert accept a regex? How do I loop through all the files?
The file name format is 1703.*_hhmmss.jpeg (the date and time the scan was taken).
2. Merge with pdfunite (do you know dysprosium? I have used it for basic, low-demand work, but didn't know about these memory details).
3. Run an OCR tool (tesseract).
4. Compact: how do I run this and make the PDF look like a nicely finished job, like a real ebook, and as small as the one I found on the internet (same language, only 22 MB)? I know it's double work, but I would like to do this task and learn it, so I can apply it to other JPEGs for which I can't find a finished file on the internet. The one I started with is more popular, so it is easy to find, even in my language. Rescaling? Downsampling? Resizing? Resolution?
Thanks!
Re: JPEG to PDF
For looping, either loop in your shell with a "for", or within ImageMagick using "mogrify". Yes, mogrify can use "1703.*_hhmmss.jpeg".
But if you want to use tesseract, it may be better to run that for each image, rather than the entire PDF. So it would make sense to have a shell loop; within the loop, run convert and tesseract. At the end, run pdfunite.
In the convert, you can do whatever image processing you want: resizing, resolution, noise removal, whatever.
I don't know dysprosium.
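The loop described above (convert and tesseract per image, pdfunite at the end) might look roughly like this. It is a sketch under assumptions, not a tested recipe: the helper name `ocr_pages`, the 75% resize, the JPEG quality of 60 and the default language `eng` are all placeholders to adjust. It relies on tesseract's `pdf` output mode, which writes a searchable PDF next to the given output base name.

```shell
#!/bin/sh
# ocr_pages SRC_DIR WORK_DIR OUTPUT_PDF [LANG]
# Hypothetical helper: for each scan, write a processed copy with convert,
# OCR that copy into a one-page searchable PDF with tesseract, and finally
# merge all pages with pdfunite.
ocr_pages() (
    src=$1 work=$2 merged=$3 lang=${4:-eng}
    mkdir -p "$work" || exit 1
    for jpg in "$src"/*.jpeg; do
        [ -e "$jpg" ] || continue            # the glob matched nothing
        base=$(basename "$jpg" .jpeg)
        # Assumed settings: scale to 75% of the original size, JPEG quality 60.
        convert "$jpg" -resize 75% -quality 60 "$work/$base.jpg" || exit 1
        # "pdf" selects tesseract's PDF output; it writes $work/$base.pdf.
        tesseract "$work/$base.jpg" "$work/$base" -l "$lang" pdf || exit 1
    done
    pdfunite "$work"/*.pdf "$merged"
)
```

For example, `ocr_pages scans work book.pdf eng`. Shrinking the image before OCR trades recognition accuracy for file size, so it is worth comparing the text output of a few pages at different settings before processing all 660.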
snibgo's IM pages: im.snibgo.com
Re: JPEG to PDF
Hello,
Now I have 660 searchable PDFs. Tesseract's input actually has to be JPEG, since it doesn't read PDF. Now, how do I get those nice viewing results?
There are scale, sample, resize, resolution, quality and noise reduction. Which operation best applies to this task?
My question is how to reduce the final size without losing on-screen readability.
From `identify -verbose finishedebook.pdf` (22 MiB) I get:
Code:
libgomp: Thread creation failed: Resource temporarily unavailable
Image: ~/path/to/finishedebook.pdf
Format: PAM (Common 2-dimensional bitmap format)
Mime type: image/x-portable-pixmap
Class: DirectClass
Geometry: 1298x836+0+0
Resolution: 72x72
Print size: 18.0278x11.6111
Units: Undefined
Type: ColorSeparation
Endianess: Undefined
Colorspace: CMYK
From `identify -verbose page7of660.pdf` (a single page of 864 KiB; times 660 that is more than 500 MiB) I get:
Code:
Image: 170322_132121.pdf
Format: PNG (Portable Network Graphics)
Mime type: image/png
Class: DirectClass
Geometry: 1440x1930+0+0
Resolution: 72x72
Print size: 20x26.8056
Units: Undefined
Type: TrueColorAlpha
Endianess: Undefined
Colorspace: sRGB
Depth: 16/8-bit
Channel depth:
red: 8-bit
green: 8-bit
blue: 8-bit
alpha: 1-bit
Channel statistics:
Pixels: 2779200
Red:
min: 1285 (0.0196078)
max: 65535 (1)
mean: 62736.6 (0.9573)
standard deviation: 9744.14 (0.148686)
kurtosis: 12.6874
skewness: -3.66389
Green:
min: 5911 (0.0901961)
max: 65535 (1)
mean: 62833.7 (0.958781)
standard deviation: 9396.37 (0.143379)
kurtosis: 12.1911
skewness: -3.62061
Blue:
min: 6168 (0.0941176)
max: 65535 (1)
mean: 62804.4 (0.958333)
standard deviation: 9487.73 (0.144773)
kurtosis: 12.1189
skewness: -3.61544
Alpha:
min: 65535 (1)
max: 65535 (1)
mean: 65535 (1)
standard deviation: 0 (0)
kurtosis: 0
skewness: 0
Image statistics:
Overall:
min: 0 (0)
max: 65535 (1)
mean: 47093.7 (0.718604)
standard deviation: 8265.24 (0.12612)
kurtosis: 280.486
skewness: -39.6068
Rendering intent: Perceptual
Gamma: 0.454545
Chromaticity:
red primary: (0.64,0.33)
green primary: (0.3,0.6)
blue primary: (0.15,0.06)
white point: (0.3127,0.329)
Background color: white
Border color: srgba(223,223,223,1)
Matte color: grey74
Transparent color: none
Interlace: None
Intensity: Undefined
Compose: Over
Page geometry: 1440x1930+0+0
Dispose: Undefined
Iterations: 0
Compression: Undefined
Orientation: Undefined
Properties:
date:create: 2017-03-25T10:38:12-03:00
date:modify: 2017-03-25T10:38:12-03:00
icc:copyright: Copyright Artifex Software 2011
icc:description: Artifex Software sRGB ICC Profile
icc:manufacturer: Artifex Software sRGB ICC Profile
icc:model: Artifex Software sRGB ICC Profile
pdf:HiResBoundingBox: 1440x1929.75+0+0
pdf:Version: PDF-1.5
signature: 987d41ed05928ec39dae75bada74d2b4f58d5d9fa9617dae242f31bf291b6d03
Profiles:
Profile-icc: 2576 bytes
Artifacts:
filename: 170322_132121.pdf
verbose: true
Tainted: False
Filesize: 932KB
Number pixels: 2.779M
Pixels per second: 16.35MB
User time: 0.150u
Elapsed time: 0:01.170
Version: ImageMagick 6.8.9-9 Q16 i686 2017-03-14 http://www.imagemagick.org
Thanks
Re: JPEG to PDF
Hello snibgo, and everyone who joins this conversation,
Could you suggest which image processing operation would be ideal for reducing the final size of the single PDF without losing its original on-screen readability and aspect?
I now have the multiple PDFs, OCRed with tesseract, which takes JPEG as input and writes PDF as output. I ran the OCR on the original JPEGs, before reducing their byte size, because I think they give the best OCR results.
Appreciate your help!
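To put the size question in numbers, here is a rough budget calculation in plain shell using the figures quoted in this thread (660 pages, 864 KiB per OCRed page, a 22 MiB finished book). It assumes the finished book has about the same number of pages, which the thread does not confirm.

```shell
#!/bin/sh
# Figures quoted in the thread.
pages=660
page_kib=864            # size of one OCRed single-page PDF
target_mib=22           # size of the finished e-book found online

# Integer arithmetic, so all results are rounded down.
current_mib=$((pages * page_kib / 1024))
budget_kib=$((target_mib * 1024 / pages))
ratio=$((page_kib / budget_kib))

echo "merged size at current settings: about ${current_mib} MiB"
echo "per-page budget for ${target_mib} MiB: about ${budget_kib} KiB"
echo "required reduction: roughly ${ratio}x"
```

A per-page budget that small probably cannot be met by lowering JPEG quality alone, so a combination of fewer pixels and fewer colours (for example grayscale pages) is likely needed. That is a guess to verify on a few sample pages rather than a firm rule.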
Re: JPEG to PDF
Sorry, I don't use PDF as an output file format.
I suppose it depends on the purpose of the PDFs. Perhaps they serve as catalogues, eg for customers. Then page layout might be important, with each page the same size, and each image captioned, and so on.
Or perhaps the PDF is simply a mechanism for grouping multiple images together, exactly like a multi-image TIFF file. If so, perhaps TIFF would be a better choice than PDF.
snibgo's IM pages: im.snibgo.com