JPEG to PDF

Post by **Capum140** » 2017-03-24T08:31:01-07:00

I think this question has already been answered, but I would like to cover some specifics of my task.
I have 660 JPEGs at 1920x2573 (the first and last are 1920x2575) and want to turn them into a single-file PDF ebook.
The current total size is 774 MB. I also want OCR to make it searchable.
With the help of someone on the IRC channel, the plan is:
Convert the 660 JPEGs to individual PDFs, keeping the original aspect ratio, resolution and ppi.
Use an OCR tool such as tesseract to make them searchable.
Compact the PDFs so the result is practical, portable and readable on screen. With ImageMagick? What are the possibilities?
Merge the PDFs into a single file. What are the tools? ImageMagick?

What are the commands?
I have reached this, but it has multiple answers and I didn't understand some of the commands.

Thanks for your help

Post by **snibgo** » 2017-03-24T10:59:24-07:00

You have about 5 million pixels per page, so 660 pages is about 3.3 G pixels. ImageMagick isn't good at creating PDFs that have more pixels than fit into memory, and yours probably won't fit. A program called "pdfunite" can take multiple PDFs, one per page, and create a multiple-page PDF from them.
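
The estimate can be checked with shell arithmetic. This is only a rough sketch: the figure of 8 bytes per pixel is an assumption (a Q16 build holding four 16-bit channels per pixel), not a number given in this thread, and the variable names are illustrative.

```shell
#!/bin/sh
# Page geometry and page count from the first post.
width=1920
height=2573
pages=660

pixels_per_page=$((width * height))          # about 5 million
total_pixels=$((pixels_per_page * pages))    # about 3.3 G pixels

# Assumption: 8 bytes per pixel (four 16-bit channels in a Q16 build).
bytes_per_pixel=8
total_gib=$((total_pixels * bytes_per_pixel / 1024 / 1024 / 1024))

echo "pixels per page: $pixels_per_page"
echo "total pixels:    $total_pixels"
echo "approx. memory:  $total_gib GiB"
```

Under that assumption, holding every page in memory at once would need tens of GiB, which is why one PDF per page is the safer route.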

So you could use IM to create one PDF per JPG, eg:

convert in_0001.jpg page_0001.pdf

... looping through all the pages. Or use mogrify. Then:

pdfunite page_*.pdf output.pdf
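
The "looping through all the pages" step could be sketched as a pair of small shell functions. The names `pdf_name` and `images_to_pdfs` are hypothetical helpers, and the sketch assumes each PDF should sit next to its source image under the same base name.

```shell
#!/bin/sh
# Map an image name to its per-page PDF name,
# e.g. in_0001.jpg -> in_0001.pdf (hypothetical naming scheme).
pdf_name() {
    printf '%s.pdf\n' "${1%.*}"
}

# Convert every image given as an argument to a one-page PDF.
# Stops at the first failure so a broken page is noticed early.
images_to_pdfs() {
    for img in "$@"; do
        convert "$img" "$(pdf_name "$img")" || return 1
    done
}
```

A possible invocation is `images_to_pdfs in_*.jpg && pdfunite in_*.pdf output.pdf`. The shell expands the glob in sorted order, so zero-padded or timestamped names keep the pages in sequence.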

Post by **Capum140** » 2017-03-24T12:22:14-07:00

So you are saying:
1. Convert. Does convert accept a regex? How do I loop through all the files?
The file name format is 1703.*_hhmmss.jpeg (the date and time the scan was taken).

2. Merge with pdfunite. (Do you know dysprosium? I have used it for basic, low-demand work but didn't know about these memory details.)

3. Run an OCR tool (tesseract).

4. Compact. How do I run this so the PDF looks like a finished job, like a real ebook, at a small size like the one I found on the internet (same language, only 22 MB)? I know it's double work, but I would like to do this task and learn it, so I can apply it to other JPEGs for which I can't find a finished file on the internet. The one I started with is more popular, so it is easy to find, even in my language. Rescaling? Downsampling? Resizing? Resolution?

Thanks!

Post by **snibgo** » 2017-03-24T12:33:11-07:00

For looping, either loop in your shell with a "for", or within ImageMagick using "mogrify". Yes, mogrify can use "1703.*_hhmmss.jpeg".

But if you want to use tesseract, it may be better to run that for each image, rather than the entire PDF. So it would make sense to have a shell loop; within the loop, run convert and tesseract. At the end, run pdfunite.
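
A minimal sketch of the per-image OCR loop, assuming a tesseract version that can write PDF output directly (the trailing `pdf` argument). `ocr_base`, `ocr_pages` and the `OCR_LANG` variable are hypothetical names, and `eng` is only a placeholder for the book's actual language.

```shell
#!/bin/sh
# Strip the extension to get tesseract's output base name,
# e.g. 170322_132121.jpeg -> 170322_132121 (tesseract appends .pdf).
ocr_base() {
    printf '%s\n' "${1%.*}"
}

# OCR each image into a searchable one-page PDF.
# OCR_LANG is a hypothetical variable; "eng" is a placeholder default.
ocr_pages() {
    for img in "$@"; do
        tesseract "$img" "$(ocr_base "$img")" -l "${OCR_LANG:-eng}" pdf || return 1
    done
}
```

It might be run as `ocr_pages 1703*.jpeg && pdfunite 1703*.pdf book.pdf`, where the glob `1703*.jpeg` and the name `book.pdf` are guesses based on the file names described in this thread.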

In the convert, you can do whatever image processing you want: resizing, resolution, noise removal, whatever.
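
As one illustration of such processing, a size-reducing convert call might look like the sketch below. The function name `shrink_page`, the `SCALE` and `QUALITY` variables, and the values 50% and 60 are all assumptions to be tuned by eye, not settings recommended in this thread.

```shell
#!/bin/sh
# Write a smaller copy of one page: shrink_page in.jpeg out.jpeg
#   -resize 50%       halves width and height (a quarter of the pixels)
#   -colorspace Gray  drops colour, which plain text pages rarely need
#   -quality 60       JPEG quality; lower values give smaller files
# SCALE and QUALITY are hypothetical overrides for the defaults.
shrink_page() {
    convert "$1" -resize "${SCALE:-50%}" -colorspace Gray \
        -quality "${QUALITY:-60}" "$2"
}
```

This is meant for the JPEGs, before any OCR. Passing an already searchable PDF through convert rasterizes it, so the text layer would be lost.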

I don't know dysprosium.

Post by **Capum140** » 2017-03-25T06:42:52-07:00

Hello,
Now I have 660 searchable PDFs. Tesseract's input actually has to be the JPEG, since it doesn't read PDF. Now, how do I get those nice viewing results?
There are scale, sample, resize, resolution, quality and noise reduction. Which operation best applies to this task?
My question is how to reduce the final size without losing its on-screen readability.
`identify -verbose finishedebook.pdf` on the 22 MiB file gives:

libgomp: Thread creation failed: Resource temporarily unavailable
Image: ~/path/to/finishedebook.pdf
  Format: PAM (Common 2-dimensional bitmap format)
  Mime type: image/x-portable-pixmap
  Class: DirectClass
  Geometry: 1298x836+0+0
  Resolution: 72x72
  Print size: 18.0278x11.6111
  Units: Undefined
  Type: ColorSeparation
  Endianess: Undefined
  Colorspace: CMYK

`identify -verbose page7of660.pdf` on a single page of 864 KiB (x 660 > 500 MiB) gives:

Image: 170322_132121.pdf
  Format: PNG (Portable Network Graphics)
  Mime type: image/png
  Class: DirectClass
  Geometry: 1440x1930+0+0
  Resolution: 72x72
  Print size: 20x26.8056
  Units: Undefined
  Type: TrueColorAlpha
  Endianess: Undefined
  Colorspace: sRGB
  Depth: 16/8-bit
  Channel depth:
    red: 8-bit
    green: 8-bit
    blue: 8-bit
    alpha: 1-bit
  Channel statistics:
    Pixels: 2779200
    Red:
      min: 1285 (0.0196078)
      max: 65535 (1)
      mean: 62736.6 (0.9573)
      standard deviation: 9744.14 (0.148686)
      kurtosis: 12.6874
      skewness: -3.66389
    Green:
      min: 5911 (0.0901961)
      max: 65535 (1)
      mean: 62833.7 (0.958781)
      standard deviation: 9396.37 (0.143379)
      kurtosis: 12.1911
      skewness: -3.62061
    Blue:
      min: 6168 (0.0941176)
      max: 65535 (1)
      mean: 62804.4 (0.958333)
      standard deviation: 9487.73 (0.144773)
      kurtosis: 12.1189
      skewness: -3.61544
    Alpha:
      min: 65535 (1)
      max: 65535 (1)
      mean: 65535 (1)
      standard deviation: 0 (0)
      kurtosis: 0
      skewness: 0
  Image statistics:
    Overall:
      min: 0 (0)
      max: 65535 (1)
      mean: 47093.7 (0.718604)
      standard deviation: 8265.24 (0.12612)
      kurtosis: 280.486
      skewness: -39.6068
  Rendering intent: Perceptual
  Gamma: 0.454545
  Chromaticity:
    red primary: (0.64,0.33)
    green primary: (0.3,0.6)
    blue primary: (0.15,0.06)
    white point: (0.3127,0.329)
  Background color: white
  Border color: srgba(223,223,223,1)
  Matte color: grey74
  Transparent color: none
  Interlace: None
  Intensity: Undefined
  Compose: Over
  Page geometry: 1440x1930+0+0
  Dispose: Undefined
  Iterations: 0
  Compression: Undefined
  Orientation: Undefined
  Properties:
    date:create: 2017-03-25T10:38:12-03:00
    date:modify: 2017-03-25T10:38:12-03:00
    icc:copyright: Copyright Artifex Software 2011
    icc:description: Artifex Software sRGB ICC Profile
    icc:manufacturer: Artifex Software sRGB ICC Profile
    icc:model: Artifex Software sRGB ICC Profile
    pdf:HiResBoundingBox: 1440x1929.75+0+0
    pdf:Version: PDF-1.5 
    signature: 987d41ed05928ec39dae75bada74d2b4f58d5d9fa9617dae242f31bf291b6d03
  Profiles:
    Profile-icc: 2576 bytes
  Artifacts:
    filename: 170322_132121.pdf
    verbose: true
  Tainted: False
  Filesize: 932KB
  Number pixels: 2.779M
  Pixels per second: 16.35MB
  User time: 0.150u
  Elapsed time: 0:01.170
  Version: ImageMagick 6.8.9-9 Q16 i686 2017-03-14 http://www.imagemagick.org
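
The sizes quoted in this post imply a rough per-page budget, which can be worked out with shell arithmetic. This is a back-of-the-envelope sketch using integer division; the variable names are illustrative.

```shell
#!/bin/sh
pages=660
page_kib=864      # size of one OCRed page, as quoted in this post
target_mib=22     # size of the finished ebook found online

current_mib=$((page_kib * pages / 1024))     # whole book at the current page size
budget_kib=$((target_mib * 1024 / pages))    # per-page budget to reach the target
ratio=$((page_kib / budget_kib))             # rough shrink factor needed

echo "current total:   $current_mib MiB"
echo "per-page budget: $budget_kib KiB"
echo "shrink factor:   about $ratio x"
```

So each page would have to shrink by well over an order of magnitude to match the 22 MiB file.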

Thanks

Post by **Capum140** » 2017-04-11T18:49:15-07:00

snibgo wrote (2017-03-24T12:33:11-07:00): "In the convert, you can do whatever image processing you want: resizing, resolution, noise removal, whatever."

Hello snibgo, and everyone who joins this conversation.
Could you suggest which image processing operation would be ideal for reducing the final size of the single PDF without losing its original on-screen readability and appearance?
I have the multiple PDFs now, OCRed with tesseract straight from the JPEGs, since I think the original JPEGs give the best possible OCR before their byte size is reduced. Tesseract's input and output are JPEG and PDF respectively.
Appreciate your help!
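
One option that works on the already OCRed PDFs is to let Ghostscript rewrite them with downsampled images. Ghostscript is not suggested anywhere in this thread, so treat this as an assumption rather than the recommended route. `compact_pdf` and `GS_PRESET` are hypothetical names, and `/ebook` is one of Ghostscript's built-in presets.

```shell
#!/bin/sh
# Rewrite a PDF with Ghostscript, downsampling its embedded images.
# Usage: compact_pdf input.pdf output.pdf
# GS_PRESET is a hypothetical variable; /ebook is a built-in preset
# (others include /screen and /printer).
compact_pdf() {
    gs -sDEVICE=pdfwrite -dPDFSETTINGS="${GS_PRESET:-/ebook}" \
       -dNOPAUSE -dBATCH -dQUIET \
       -sOutputFile="$2" "$1"
}
```

The pdfwrite device normally carries text over to the new file, but it is worth checking that search still works on the output before deleting the larger originals.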

Post by **snibgo** » 2017-04-11T20:11:22-07:00

Sorry, I don't use PDF as an output file format.

I suppose it depends on the purpose of the PDFs. Perhaps they are catalogues, eg for customers. Then page layout might be important, with each page the same size, and each image captioned, and so on.

Or perhaps the PDF is simply a mechanism for grouping multiple images together, exactly like a multi-image TIFF file. If so, perhaps TIFF would be a better choice than PDF.

Legacy ImageMagick Discussions Archive
