How do I extract the text layer and background layer from a PDF?

eg1

Post by eg1 »

Hi everyone,

I have the following question:
How do I extract the text layer and the background layer from a PDF file?

Example:

http://n-cdn.dashdigital.com/tcprojects ... 041_fg.png
That's a 32-bit PNG file containing the text layer of the PDF.

http://n-cdn.dashdigital.com/tcprojects ... 041_bg.jpg
That's the background layer of the same PDF, as a JPEG.

Together, they can show the PDF page in high quality and at a large resolution.
Combined, the two images (PNG text layer and JPEG background layer) do not exceed 350 KB at 1500x2000 pixels in high quality, but a single 32-bit or 24-bit PNG of the same page exceeds 3 MB.

The first solution is very good.

How can I achieve this with convert, gs, or another PDF tool?

Thank you, and sorry for my bad English.
Cheers,
anthony
Posts: 8883
Joined: 2004-05-31T19:27:03-07:00
Authentication code: 8675308
Location: Brisbane, Australia

Post by anthony »

Where is the PDF file these two images were meant to come from?

Note, however, that IM itself does not handle PDF; it has the Ghostscript program render the PDF as a raster image.

See A word about Vector Image formats
http://www.imagemagick.org/Usage/formats/#vector

As such, unless the two 'layers' are on separate pages in the PDF, IM will not see those separate images, only the PDF composite!
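
For reference, the rasterising step that IM delegates to Ghostscript can also be run directly. This is only a sketch: the function name render_pdf_pages, the file names, and the 150 dpi default are assumptions for illustration. Each page comes out as one flattened PNG, so anything layered on a page is already merged in the output.

```shell
# Sketch: rasterise every page of a PDF with Ghostscript (gs must be installed).
# Usage: render_pdf_pages input.pdf out_prefix [dpi]
render_pdf_pages() {
    # -o sets the output file and implies batch mode; %03d is the page number.
    # png16m is Ghostscript's 24-bit RGB PNG device; -r sets the resolution in dpi.
    gs -q -o "$2-%03d.png" -sDEVICE=png16m -r"${3:-150}" "$1"
}
```

For example, `render_pdf_pages newspaper.pdf page 200` would write page-001.png, page-002.png, and so on at 200 dpi.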
Anthony Thyssen -- Webmaster for ImageMagick Example Pages
https://imagemagick.org/Usage/
eg1

Post by eg1 »

Thank you for your answer, Mr. Thyssen.

Please see the following examples.

Some online newspaper systems work with this solution:

Text layer as PNG:
http://demo.olivesoftware.com/Olive/ODE ... ageExt=png (238 KB)

Picture layer as JPEG:
http://demo.olivesoftware.com/Olive/ODE ... ageExt=jpg (108 KB)

Combined size: 346 KB, versus 3 MB for a single PNG or a JPEG at 100% quality.

Another example:

Text layer as PNG:
http://demo.olivesoftware.com/Olive/ODE ... ageExt=png (110 KB)

Picture layer as JPEG:
http://demo.olivesoftware.com/Olive/ODE ... ageExt=jpg (189 KB)

Combined size: 299 KB (quite amazing: no pixelated images, high quality, and small size).


You can see the final user-facing page at:
http://demo.olivesoftware.com/Olive/ODE/DenverPost/

These images were extracted from a PDF file using a command-line tool.
http://www.pressdisplay.com/pressdisplay/es/viewer.aspx (click on a newspaper) also uses this method, as do many others.

The two images (PNG and JPEG) together do not exceed 350 KB at high resolution and high quality, which is ideal for the web. This is not possible with a single JPEG or PNG image: either the quality suffers (JPEG images look pixelated at any quality below 100%) or the size grows (3 MB or more for a PNG or a JPEG at 100%), which is not ideal for the user experience.


I have seen that these systems generally use ASP.NET, but I think this should also be possible on a Linux x86_64 platform with some command-line tool for PDFs.

Do you believe it is possible to get this result with ImageMagick?
Or do you know of any command-line tools that can do it?

Thank you

Emmanuel,
Kind regards
eg1

Post by eg1 »

How can I achieve this? What do you recommend?

:(
anthony

Post by anthony »

As both images have white backgrounds, you can just multiply the images together to get the final product. There is no need to muck around with transparency or anything else. Even the order does not matter, as long as they are the same size.

I have done this successfully with PostScript images, adding multiple components together using multiply.

Code: Select all

   convert  page.png page.jpg -compose multiply -composite  show:
Just replace show: with the output file you want, in the format you want.
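
To see why multiplying works when both backgrounds are white, here is a small sketch of the arithmetic on single 8-bit pixel values. The helper name multiply_px is made up for illustration, and the integer division only approximates what IM does internally: white (255) leaves the other pixel unchanged, and black (0) always wins.

```shell
# Sketch: multiply blending on 8-bit values (0 = black, 255 = white).
# Integer division truncates, so mid-tone results are approximate.
multiply_px() {
    echo $(( $1 * $2 / 255 ))
}

multiply_px 255 40    # white background times dark text pixel -> 40
multiply_px 200 255   # picture pixel times white text-layer background -> 200
multiply_px 0 200     # black times anything -> 0
```

Because multiplication is commutative, swapping the two arguments gives the same result, which is why the order of the two images does not matter.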

ASIDE:

I tried saving to PNG and got an image that was huge compared to the two separate images, which I thought was interesting. Even using "optipng" to find the best lossless compression for the output PNG still produces a combined image larger than the two separate images put together.

Using the second example (sizes in bytes):
page.jpg 193959
page.png 113395 (the two added together: 307354)
page_output.png 1436844
page_output_optimized.png 1197465
page_out.jpg 570571

Using two separate images really is a huge saving.
Is there anything special about the source PNG and JPG that I should know about?
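
The byte counts above can be compared with a little shell arithmetic. This sketch only reuses the numbers quoted in this post; the helper name percent_of_pair is made up for illustration.

```shell
# Byte counts quoted above for the second example.
jpg=193959
png=113395
pair=$(( jpg + png ))
echo "two layers: $pair bytes"

# Size of a single-file alternative as a percentage of the two-layer pair
# (integer arithmetic, so the result is truncated).
percent_of_pair() {
    echo $(( $1 * 100 / pair ))
}

percent_of_pair 1436844   # page_output.png           -> 467
percent_of_pair 1197465   # page_output_optimized.png -> 389
percent_of_pair 570571    # page_out.jpg              -> 185
```

So even the optimized combined PNG is nearly four times the size of the two layers, and the combined JPEG is nearly twice the size.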


ASIDE #2: The source PNG image can be made smaller without data loss by using "optipng" to find a better internal compression method.
page.png 113395
optimized => 97882

Not much, but some reduction.
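
A minimal sketch of such an optipng run, assuming optipng is installed. The function name optimize_png and the file names are made up for illustration.

```shell
# Sketch: losslessly recompress a PNG with optipng.
# Usage: optimize_png input.png output.png
optimize_png() {
    # -o7 runs the most exhaustive (and slowest) set of compression trials;
    # -out writes the result to a new file and leaves the input untouched.
    optipng -o7 -out "$2" "$1"
}
```

For example, `optimize_png page.png page_optimized.png` keeps the original so the two sizes can be compared afterwards.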


The final question is then what is a good way to separate a previously combined image?

As for making a multi-layered PDF from these, I doubt IM can do this, though perhaps other PDF handlers can. Maybe the Perl PDF modules or something like that.
eg1

Post by eg1 »

Something like:

Code: Select all

$ command  -only vectors  -use pngalpha   input_pdf.pdf    vectors_layer.png 
$ command  -only graphics  -use jpg  -quality 75  input_pdf.pdf    graphics_layer.jpg

Do you know how I can achieve that?
eg1

Post by eg1 »

There must be a command-line tool that can do it. I tried pstoedit with -f gs:pdfwrite and -f gs:pngalpha, but it doesn't work.
anthony

Post by anthony »

A tool like that would be extremely useful, if only to separate resolution-independent vectors from fixed-resolution raster graphics.

But it would also be perfect for this type of layer separation.

Note, however, that it will fail if the PDF contains the text as raster data, such as in a scanned image. In that case some other technique to mask or separate the two aspects will be needed. I would be interested in that type of thing too.
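
One possible route for the vector-versus-raster case is sketched below. It assumes a Ghostscript release that supports the -dFILTERIMAGE, -dFILTERTEXT and -dFILTERVECTOR switches (older releases do not have them). The function name split_pdf_layers, the output file names, the 150 dpi default, and the JPEG quality of 75 are all assumptions for illustration. For a scanned PDF everything is an image, so the text layer would come out empty.

```shell
# Sketch: render page 1 of a PDF twice, once without raster images and once
# with only raster images. Requires gs with the -dFILTER* switches.
# Usage: split_pdf_layers input.pdf [dpi]
split_pdf_layers() {
    # Text and vector drawing only, on a transparent background (pngalpha).
    gs -q -o text_layer.png -sDEVICE=pngalpha -r"${2:-150}" \
       -dFirstPage=1 -dLastPage=1 -dFILTERIMAGE "$1"
    # Raster images only, written as a JPEG at quality 75.
    gs -q -o picture_layer.jpg -sDEVICE=jpeg -dJPEGQ=75 -r"${2:-150}" \
       -dFirstPage=1 -dLastPage=1 -dFILTERTEXT -dFILTERVECTOR "$1"
}
```

Both renders use the same resolution so that the two outputs have the same pixel dimensions and can be overlaid or multiplied afterwards.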

PPS: You may like to look at DjVu, a special file format for scanned images, which boxes all the sub-components of multiple scanned pages. It could be used to decide that sub-images which remain large (or unique) are pictures, and that small repeating sub-images are text. It may be able to help with that separation in a more automated way.

If anyone has other ideas, or has tried other methods, please post here.
anthony

Post by anthony »

ImageMagick is not specifically devoted to handling PDF files.

Sure, it can get an image of a PDF page, but it does so by running it through a third-party product, Ghostscript, to generate a raster image.
It has no understanding of text versus graphics, or any other aspect of PDF, beyond this.

If Ghostscript or other third-party tools provide this feature, yes, that would be great. But it is not IM's responsibility.