
Converting a million+ files in bash script

Posted: 2011-07-31T16:24:53-07:00
by imnoob
I have a folder whose subfolders contain over a million small tif files that I need to convert to pdfs with the same base file name (e.g. file1.tif > file1.pdf). I have attempted a number of different versions of a bash script to accomplish this, but am not having any success. The script starts out fine, converts 10 or so files, then starts giving an error on each conversion: "convert: memory allocation failed".

I'm fairly sure this is because of the approach I'm using - possibly not feeding the files to the convert command in a way that lets it free the memory in use before starting the next one. I'm totally willing to abandon the bash script method if someone has a better idea, but I'm not skilled enough in any scripting language to know where I'm failing. Here's my bash script at the moment. The latest version I've tried reads a list of folder paths where the files exist from a text file, loops through the files in that folder, then goes on to the next folder (by reading the next line in the text file). There are no duplications in the file names, so it's no problem to write the pdf files to a single output directory, but that's not an essential requirement.

Environment: Machine running the conversion is CentOS 5.6 with ImageMagick 6.2.8, 64-bit, 2 x quad-core processors and 8 GB RAM.
Source tif files are on a mounted NTFS file system from an external USB drive.

Code:

#!/bin/bash
# Read the list of source directories (one folder path per line).
mydirs=`cat /tmp/dirlist.txt`
for d in $mydirs ; do
        for f in `find "$d" -type f -name "*.tif"` ; do
                filename=$(basename "$f")
                echo "Now Processing File: ${filename}"
                # Strip the .tif extension to get the base name.
                filenamenoext=${filename%.tif}
                # Skip files that have already been converted.
                if [ ! -f "/home/myhomedir/${filenamenoext}.pdf" ]
                then
                        convert "$f" "/home/myhomedir/${filenamenoext}.pdf"
                fi
        done
done

Re: Converting a million+ files in bash script

Posted: 2011-07-31T18:03:43-07:00
by fmw42
convert requires memory to hold every image in that command line (loop)

Consider using mogrify. That is its purpose: to keep the same file name and process many images without worrying about holding too many images in memory. See http://www.imagemagick.org/Usage/basics/#mogrify. The files need to be in the same directory, however, or be moved there, or you can run it once per directory. I would also suggest you use the -path option to place the results in a different directory so you don't overwrite a good image with an error result, or at least test it that way until you are sure you are doing things correctly.

mkdir newdirectory
cd olddirectory
mogrify -path fullpathto/newdirectory -format pdf *.tif

or just use * at the end to process all files in the olddirectory and put the results in newdirectory with the same filename but .pdf rather than .tif
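
Since your tifs are spread across subfolders, a loop along these lines should let mogrify handle one directory at a time (an untested sketch; it borrows the /tmp/dirlist.txt list and the /home/myhomedir output directory from your script, so adjust to taste):

Code:

#!/bin/bash
# Run mogrify once per source directory and drop all the PDFs into one output folder.
outdir=/home/myhomedir
while read -r d ; do
        # Subshell so the cd does not affect the rest of the loop.
        ( cd "$d" && mogrify -path "$outdir" -format pdf *.tif )
done < /tmp/dirlist.txt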

IM 6.2.8 is very, very old. You should consider upgrading if you can.


If mogrify does not work for you, then see http://www.imagemagick.org/Usage/basics/#mogrify_not

Re: Converting a million+ files in bash script

Posted: 2011-07-31T21:34:29-07:00
by imnoob
Thanks very much for replying and I'll take a look at mogrify. While I was waiting to see if anyone would reply, I uninstalled the old version from the repositories and compiled the most up-to-date version from source. And lo and behold, it's working!
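
In case it helps anyone else, the build was roughly the usual configure/make routine (the tarball URL here is just the generic ImageMagick download location; the exact version and paths will vary):

Code:

# Rough sketch of the source build.
wget http://www.imagemagick.org/download/ImageMagick.tar.gz
tar xzf ImageMagick.tar.gz
cd ImageMagick-*/
./configure
make
sudo make install
sudo ldconfig /usr/local/lib   # refresh the linker cache so the new libraries are found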

So thanks again for your reply.
Best regards.

IM is *very* cool :-)

Re: Converting a million+ files in bash script

Posted: 2011-07-31T22:58:25-07:00
by anthony
fmw42 wrote: convert requires memory to hold every image in that command line (loop)
Something I want to change in IMv7 with co-processing, allowing the shell script to run just one convert command and feed it image processing options. When that becomes available, purging one image and loading the next one to work with will be an option!
fmw42 wrote: Consider using mogrify. That is its purpose: to keep the same file name and process many images without worrying about holding too many images in memory. See http://www.imagemagick.org/Usage/basics/#mogrify
Correct. Mogrify completely separates all image files from all the other options on the command line.
It processes the settings, then reads one file name, runs it through all the options, and saves it before reading the next filename, as a completely separate process.

However, a million-plus files! I doubt you can list them all on the command line, or even ask ImageMagick to expand '*' meta-characters instead of the shell, without filling memory with filenames!

A file processing loop will be a better option... See the non-mogrify techniques for a starting point:
http://www.imagemagick.org/Usage/basics/#mogrify_not
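
For example, something along these lines (untested; it reuses the /tmp/dirlist.txt and /home/myhomedir paths from your script) streams one filename at a time instead of ever holding the whole file list in memory:

Code:

#!/bin/bash
# Pipe find's output into a while-read loop so only one filename is in memory at a time.
while read -r d ; do
        find "$d" -type f -name '*.tif' | while read -r f ; do
                base=$(basename "$f" .tif)
                # Skip files that already have a PDF.
                [ -f "/home/myhomedir/$base.pdf" ] && continue
                convert "$f" "/home/myhomedir/$base.pdf"
        done
done < /tmp/dirlist.txt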

You may also like to study the GNU 'parallel' command, which can launch and keep N commands running, each on a different file. You can even arrange for those commands to run on different machines so as to use a 'computer processor farm' for even faster processing.
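
A rough sketch of that idea, again borrowing the paths from the original script (the -j job count is just a guess for a dual quad-core box):

Code:

# Feed filenames to GNU parallel, which keeps up to 8 convert jobs running at once.
# {} is the input file; {/.} is its basename with the extension removed.
while read -r d ; do
        find "$d" -type f -name '*.tif' -print0 |
                parallel -0 -j 8 convert {} /home/myhomedir/{/.}.pdf
done < /tmp/dirlist.txt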
fmw42 wrote: IM 6.2.8 is very, very old. You should consider upgrading if you can.
Definitely!