For investigating a PDF file, ImageMagick is entirely the wrong tool. All ImageMagick commands rely on 'delegates' to handle PDF input: they first call Ghostscript to convert every PDF page to a raster image, and only after that step do they take their first look at the resulting raster images and do something with them. Not even `identify` will look at the PDF directly.
To investigate a problematic PDF, use tools that are designed for the job. My first choices are some utilities from 'Poppler', a fork of XPDF. When these do not lead me to conclusive results, I employ others. But let's start with these three:
- pdfinfo
- pdfimages
- pdffonts
'pdfinfo' tells you everything about the file's metadata, and does so very quickly:
pdfinfo -meta -box -js form_advanced2.pdf
Producer: iPhone OS 8.0 Quartz PDFContext
CreationDate: Wed Jan 28 17:28:28 2015
ModDate: Fri Feb 6 14:38:00 2015
Tagged: no
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 1
Encrypted: no
Page size: 768 x 1526 pts
Page rot: 0
MediaBox: 0.00 0.00 768.00 1526.00
CropBox: 0.00 0.00 768.00 1526.00
BleedBox: 0.00 0.00 768.00 1526.00
TrimBox: 0.00 0.00 768.00 1526.00
ArtBox: 0.00 0.00 768.00 1526.00
File size: 142299 bytes
Optimized: yes
PDF version: 1.6
Metadata:
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.4-c005 78.147326, 2012/08/23-13:03:03 ">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
xmlns:xmp="http://ns.adobe.com/xap/1.0/"
xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<xmp:CreateDate>2015-01-28T17:28:28Z</xmp:CreateDate>
<xmp:ModifyDate>2015-02-06T14:38+02:00</xmp:ModifyDate>
<xmp:MetadataDate>2015-02-06T14:38+02:00</xmp:MetadataDate>
<pdf:Producer>iPhone OS 8.0 Quartz PDFContext</pdf:Producer>
<xmpMM:DocumentID>uuid:d192522a-d623-4c39-b6ec-6728337325bd</xmpMM:DocumentID>
<xmpMM:InstanceID>uuid:a0dc7e34-37e3-4541-a2c5-383a4c5e6602</xmpMM:InstanceID>
<dc:format>application/pdf</dc:format>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
This already has me wondering a bit:
How can iOS 8.0 Quartz PDFContext create a PDF-1.6 file, when even Mac OS X Mavericks 10.9.5 does not produce PDF-1.4 and still writes PDF-1.3? (It may be legit; I'm neither a developer nor an iOS expert, but it still leaves me wondering...)
Since the result line 'JavaScript: no' indicates that there is no JavaScript in the PDF (unless it is a malicious file in which the JavaScript is hidden and obfuscated!), we can rule this out as a cause of the crashes you observed.
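If several files need the same check, a single field can be pulled out of the pdfinfo output in a script. The following is only a minimal sketch: `pdfinfo_field` is a helper name made up for this illustration (it is not part of Poppler), and it assumes the usual "Key: value" layout shown in the output above.

```shell
# Print the value of one field from `pdfinfo` output read on stdin.
# pdfinfo_field is a hypothetical helper name, not a Poppler tool.
pdfinfo_field() {
  awk -v key="$1" '
    index($0, key ":") == 1 {       # line starts with the key plus a colon
      sub(/^[^:]*:[ \t]*/, "")      # drop the key, the colon and the padding
      print
      exit
    }'
}

# Usage (assumes pdfinfo is installed and the file exists):
#   pdfinfo form_advanced2.pdf | pdfinfo_field JavaScript
```

Matching on the whole key up to the colon avoids confusing 'Pages' with 'Page size' or 'Page rot'.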
'pdffonts' gives us some hints about the fonts (if any) used inside the PDF:
pdffonts form_advanced2.pdf
name type encoding emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
This PDF uses no fonts, which means its page content can only consist of vector shapes or pixel graphics (if it is not empty).
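For a scripted version of the same check, the font rows can simply be counted. This is a rough sketch that assumes the two-line header (column names plus dashed rule) seen above; `count_fonts` is a name invented here.

```shell
# Count the fonts reported in `pdffonts` output read on stdin.
# Assumption: the first two lines are the column header and the dashed rule.
count_fonts() {
  awk 'NR > 2 && NF { n++ } END { print n + 0 }'
}

# Usage (assumes pdffonts is installed and the file exists):
#   pdffonts form_advanced2.pdf | count_fonts
```

A result of 0 corresponds to the empty table shown above.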
'pdfimages -list' will report details about all (raster) images contained in the PDF file:
pdfimages -list form_advanced2.pdf
page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
1 0 image 1440 124 rgb 3 8 image yes 50 0 144 144 2663B 0.5%
1 1 image 602 62 rgb 3 8 image yes 52 0 144 144 510B 0.5%
1 2 smask 602 62 gray 1 8 image yes 52 0 144 144 186B 0.5%
1 3 image 32 32 rgb 3 8 image yes 19 0 144 144 401B 13%
1 4 smask 32 32 gray 1 8 image yes 19 0 144 144 222B 22%
1 5 image 32 32 rgb 3 8 image yes 19 0 0.505 144 401B 13%
1 6 smask 32 32 gray 1 8 image yes 19 0 0.505 144 222B 22%
1 7 image 32 32 rgb 3 8 image yes 19 0 144 144 401B 13%
1 8 smask 32 32 gray 1 8 image yes 19 0 144 144 222B 22%
1 9 image 32 32 rgb 3 8 image yes 19 0 144 0.596 401B 13%
1 10 smask 32 32 gray 1 8 image yes 19 0 144 0.596 222B 22%
1 11 image 32 32 rgb 3 8 image yes 19 0 0.505 0.596 401B 13%
1 12 smask 32 32 gray 1 8 image yes 19 0 0.505 0.596 222B 22%
1 13 image 32 32 rgb 3 8 image yes 19 0 144 0.596 401B 13%
1 14 smask 32 32 gray 1 8 image yes 19 0 144 0.596 222B 22%
1 15 image 32 32 rgb 3 8 image yes 19 0 144 144 401B 13%
1 16 smask 32 32 gray 1 8 image yes 19 0 144 144 222B 22%
1 17 image 32 32 rgb 3 8 image yes 19 0 0.505 144 401B 13%
1 18 smask 32 32 gray 1 8 image yes 19 0 0.505 144 222B 22%
1 19 image 32 32 rgb 3 8 image yes 19 0 144 144 401B 13%
1 20 smask 32 32 gray 1 8 image yes 19 0 144 144 222B 22%
1 21 image 242 42 rgb 3 8 image yes 72 0 144 144 157B 0.5%
1 22 smask 242 42 gray 1 8 image yes 72 0 144 144 1871B 18%
1 23 image 226 62 rgb 3 8 image yes 88 0 144 144 205B 0.5%
1 24 smask 226 62 gray 1 8 image yes 88 0 144 144 85B 0.6%
1 25 image 32 32 rgb 3 8 image yes 19 0 144 144 401B 13%
1 26 smask 32 32 gray 1 8 image yes 19 0 144 144 222B 22%
[....]
1 685 smask 32 32 gray 1 8 image yes 19 0 145 144 222B 22%
1 686 image 420 42 rgb 3 8 image yes 148 0 144 144 254B 0.5%
1 687 smask 420 42 gray 1 8 image yes 148 0 144 144 2229B 13%
1 688 image 172 42 rgb 3 8 image yes 124 0 144 144 117B 0.5%
1 689 smask 172 42 gray 1 8 image yes 124 0 144 144 1499B 21%
1 690 image 1536 88 rgb 3 8 image yes 150 0 144 144 1813B 0.4%
1 691 smask 1536 88 gray 1 8 image yes 150 0 144 144 636B 0.5%
1 692 image 1440 120 rgb 3 8 image yes 152 0 144 144 2284B 0.4%
1 693 smask 1440 120 gray 1 8 image yes 152 0 144 144 3041B 1.8%
1 694 image 1536 88 rgb 3 8 image yes 150 0 144 144 1813B 0.4%
1 695 smask 1536 88 gray 1 8 image yes 150 0 144 144 636B 0.5%
1 696 image 1440 120 rgb 3 8 image yes 154 0 144 144 2284B 0.4%
1 697 smask 1440 120 gray 1 8 image yes 154 0 144 144 9.78K 5.8%
1 698 image 1536 48 rgb 3 8 image yes 156 0 144 144 988B 0.4%
1 699 smask 1536 48 gray 1 8 image yes 156 0 144 144 344B 0.5%
1 700 image 1536 88 rgb 3 8 image yes 150 0 144 144 1813B 0.4%
1 701 smask 1536 88 gray 1 8 image yes 150 0 144 144 636B 0.5%
1 702 image 1440 120 rgb 3 8 image yes 158 0 144 144 2284B 0.4%
1 703 smask 1440 120 gray 1 8 image yes 158 0 144 144 2957B 1.7%
1 704 image 1536 48 rgb 3 8 image yes 156 0 144 144 988B 0.4%
1 705 smask 1536 48 gray 1 8 image yes 156 0 144 144 344B 0.5%
1 706 image 1536 88 rgb 3 8 image yes 150 0 144 144 1813B 0.4%
1 707 smask 1536 88 gray 1 8 image yes 150 0 144 144 636B 0.5%
1 708 image 1448 134 rgb 3 8 image yes 160 0 144 144 2562B 0.4%
1 709 smask 1448 134 gray 1 8 image yes 160 0 144 144 4645B 2.4%
OK now! This one-page PDF seems to contain the insane number of 710 different images (some used as soft masks, some actually visible)!
This first impression is a bit misleading, though, because it rests on the first 3 columns only. We also have to take into account the column headed `object ID`. If it shows a different PDF object ID for each instance of an image, then the internal construction of the PDF is indeed 'stupid' (or indicates that the developer of the PDF-generating application was at the very beginning of the part of his professional career that deals with PDF-related tasks...)
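As a side note, the split between visible images and soft masks in such a listing can be tallied from the `type` column. The snippet below is only a sketch: `tally_image_types` is a made-up helper name, and it counts just the two types that occur in this listing, ignoring any other type pdfimages might report.

```shell
# Tally the "type" column (field 3) of `pdfimages -list` output read on stdin.
# Only "image" and "smask" rows are counted; other types are ignored here.
tally_image_types() {
  awk 'NR > 2 && NF >= 12 { count[$3]++ }
       END { printf "image=%d smask=%d\n", count["image"], count["smask"] }'
}

# Usage (assumes pdfimages is installed and the file exists):
#   pdfimages -list form_advanced2.pdf | tally_image_types
```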
So let's count how many different object IDs there are, and how often each one occurs:
pdfimages -list form_advanced2.pdf | grep -vE '(object ID|---)' | awk '{print $11, $12}' | sort | uniq -c | sort -g
2 100 0
2 102 0
2 104 0
2 106 0
2 108 0
2 110 0
2 116 0
2 122 0
2 126 0
2 128 0
2 130 0
2 132 0
2 134 0
2 136 0
2 138 0
2 140 0
2 142 0
2 144 0
2 146 0
2 148 0
2 152 0
2 154 0
2 158 0
2 160 0
2 52 0
2 54 0
2 56 0
2 58 0
2 62 0
2 64 0
2 66 0
2 68 0
2 70 0
2 72 0
2 74 0
2 76 0
2 78 0
2 84 0
2 86 0
2 90 0
2 92 0
2 94 0
2 98 0
4 112 0
4 114 0
4 118 0
4 120 0
4 124 0
4 156 0
4 96 0
6 60 0
6 82 0
6 88 0
8 150 0
14 50 0
18 80 0
538 19 0
The last line indicates that PDF object number 19 is embedded only once in the file, but listed 538 times (a count that includes the soft mask rows, which appear under the same object ID as their image). At least it's not embedded 538 times!
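The same comparison can be condensed into one line: how many rows are listed versus how many distinct objects they refer to. This is a sketch under the same assumption as the pipeline above, namely that the object number and generation are fields 11 and 12; `summarise_image_list` is a helper name invented for the example.

```shell
# Compare listed rows with distinct "object ID" pairs in `pdfimages -list`
# output read on stdin. Fields 11 and 12 hold object number and generation.
summarise_image_list() {
  awk 'NR > 2 && NF >= 12 {
         rows++
         if (!seen[$11 " " $12]++) objects++   # first time this object is seen
       }
       END { printf "%d rows, %d distinct objects\n", rows, objects }'
}

# Usage (assumes pdfimages is installed and the file exists):
#   pdfimages -list form_advanced2.pdf | summarise_image_list
```

A large gap between the two numbers points to heavy reuse of a few embedded objects rather than to many separately embedded images.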
Anyway, I'll not continue to analyse this PDF file. Just a few more hints:
mkdir form_adv2
pdfimages -j form_advanced2.pdf form_adv2/form_adv2
These commands create a subdirectory named 'form_adv2' and extract all instances of images found in the PDF into it.
Attention: the command will extract multiple copies when an image is reused multiple times inside the PDF! The filenames will be 'form_adv2-000.*', 'form_adv2-001.*', 'form_adv2-002.*', ... (matching the image numbers from the previously printed list). The '-j' parameter requests extraction as JPEG files, which is not always possible. Where JPEG extraction is not possible, the file name suffixes will not be *.jpg but *.ppm or *.pbm, and the files will be uncompressed rasters. In this case you can still use ImageMagick's `convert` to get JPEGs for further analysis, if you want.
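Because reused images are extracted once per use, it can help to see how many of the extracted files actually differ. The sketch below counts distinct files by checksum. `count_unique_files` is a name made up here, and treating two files with the same CRC and size as identical is an assumption (collisions are ignored).

```shell
# Count files with distinct content in a directory, using POSIX cksum.
# cksum prints "<crc> <bytes> <name>"; files sharing CRC and size are
# treated as duplicates (assumption: CRC collisions are ignored).
count_unique_files() {
  cksum "$1"/* | awk '!seen[$1 " " $2]++ { n++ } END { print n + 0 }'
}

# Usage (after the extraction step has filled the directory):
#   count_unique_files form_adv2
```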
----
I guess that your version of Ghostscript is just not able to handle that PDF correctly...