Batch process page images for OCR

Post by **m00tpoint** » 2009-08-20T23:06:15-07:00

I have thousands of pages of stuff I need to run through an OCR system. Each page has a horizontal line at the top with a line of text above it (a title and a couple of page numbers). A vertical line runs down the middle of the image. The images are not so exactly scanned that "middle" is 100% precise. On either side of the vertical line is one column of text, two columns in all.

My OCR engine is great at recognizing the old script, but cannot understand there are 2 columns of text. How can I split each page image into two columns using imagemagick in a batch process? Ideally the image processing would somehow "find" the vertical line and the horizontal line, and split the page into 3 images ... the top slice, right half, and left half. Less ideal would be to go by pixel count ... the pages are not always perfectly centered in the image file, and the text comes up tight to the vertical line. However, each image is identical in size: 4326 x 5670 pixels.

I'm hoping to run this in a Linux environment.

Help?

m00tpoint

Post by **fmw42** » 2009-08-20T23:10:18-07:00

can you post a link to an example?

Post by **m00tpoint** » 2009-08-21T04:41:59-07:00

Hopefully this will work:

http://fastfreefilehosting.com/file/208 ... 1-tif.html

m00t

Post by **anthony** » 2009-08-21T05:47:48-07:00

The scan seems straightforward enough. It even seems to be de-skewed rather well, so I am going to assume that is not a problem.

So your problem is you need to find the row of the horizontal line, and the column of the vertical one.

Let's take the vertical one first.

The trick is to scale the image down to one row and look at the resulting values by writing them to a txt format...


   convert page.tif -scale WIDTHx1\!  -depth 16 columns.txt

For example...

# ImageMagick pixel enumeration: 4326,1,65535,rgb
0,0: (65535,65535,65535) #FFFFFFFFFFFF white
1,0: (65535,65535,65535) #FFFFFFFFFFFF white
2,0: (65535,65535,65535) #FFFFFFFFFFFF white
3,0: (65535,65535,65535) #FFFFFFFFFFFF white
4,0: (65535,65535,65535) #FFFFFFFFFFFF white
...
2005,0: (65015,65015,65015) #FDF7FDF7FDF7 rgb(99.2065%,99.2065%,99.2065%)
2006,0: (65015,65015,65015) #FDF7FDF7FDF7 rgb(99.2065%,99.2065%,99.2065%)
2007,0: (64922,64922,64922) #FD9AFD9AFD9A rgb(99.0646%,99.0646%,99.0646%)
2008,0: (60900,60900,60900) #EDE4EDE4EDE4 rgb(92.9274%,92.9274%,92.9274%)
2009,0: (53376,53376,53376) #D080D080D080 rgb(81.4466%,81.4466%,81.4466%)
2010,0: (50405,50405,50405) #C4E5C4E5C4E5 rgb(76.9131%,76.9131%,76.9131%)
2011,0: (48625,48625,48625) #BDF1BDF1BDF1 rgb(74.197%,74.197%,74.197%)
2012,0: (46360,46360,46360) #B518B518B518 rgb(70.7408%,70.7408%,70.7408%)
2013,0: (43759,43759,43759) #AAEFAAEFAAEF rgb(66.772%,66.772%,66.772%)
2014,0: (40800,40800,40800) #9F609F609F60 rgb(62.2568%,62.2568%,62.2568%)
2015,0: (41032,41032,41032) #A048A048A048 rgb(62.6108%,62.6108%,62.6108%)
2016,0: (41552,41552,41552) #A250A250A250 rgb(63.4043%,63.4043%,63.4043%)
2017,0: (38812,38812,38812) #979C979C979C rgb(59.2233%,59.2233%,59.2233%)
2018,0: (38651,38651,38651) #96FB96FB96FB rgb(58.9776%,58.9776%,58.9776%)
2019,0: (36593,36593,36593) #8EF18EF18EF1 rgb(55.8373%,55.8373%,55.8373%)
2020,0: (36050,36050,36050) #8CD28CD28CD2 rgb(55.0088%,55.0088%,55.0088%)
2021,0: (36466,36466,36466) #8E728E728E72 rgb(55.6435%,55.6435%,55.6435%)
2022,0: (37044,37044,37044) #90B490B490B4 rgb(56.5255%,56.5255%,56.5255%)
2023,0: (38893,38893,38893) #97ED97ED97ED rgb(59.3469%,59.3469%,59.3469%)
2024,0: (41274,41274,41274) #A13AA13AA13A rgb(62.9801%,62.9801%,62.9801%)
2025,0: (44823,44823,44823) #AF17AF17AF17 rgb(68.3955%,68.3955%,68.3955%)
2026,0: (49238,49238,49238) #C056C056C056 rgb(75.1324%,75.1324%,75.1324%)
2027,0: (57479,57479,57479) #E087E087E087 rgb(87.7073%,87.7073%,87.7073%)
2028,0: (61848,61848,61848) #F198F198F198 rgb(94.374%,94.374%,94.374%)
2029,0: (63478,63478,63478) #F7F6F7F6F7F6 rgb(96.8612%,96.8612%,96.8612%)
2030,0: (64321,64321,64321) #FB41FB41FB41 rgb(98.1476%,98.1476%,98.1476%)
2031,0: (65200,65200,65200) #FEB0FEB0FEB0 rgb(99.4888%,99.4888%,99.4888%)
2032,0: (65200,65200,65200) #FEB0FEB0FEB0 rgb(99.4888%,99.4888%,99.4888%)
2033,0: (65234,65234,65234) #FED2FED2FED2 rgb(99.5407%,99.5407%,99.5407%)
...

What you want is the smallest (blackest) numbers... a search found column 2020 has the smallest number. In fact, if you look you will see a very dark patch around that figure, surrounded by a large white space on either side! Column 2020 is the center of your vertical line!

You can make this faster by trimming the black space around the image (noting its size) and limiting your 'scale' search to just the center region of the image.

The same method can be used to find the horizontal line: scale the image down to a single column (1xHEIGHT) instead of a single row.
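
The "search for the smallest number" step can be scripted. The following is only a sketch, assuming the `txt:` enumeration layout shown above (an `x,y:` prefix followed by the channel values in parentheses); the function name `darkest_index` and its optional range arguments are made up for illustration and are not part of ImageMagick.

```shell
# darkest_index FILE [LO HI]
# Print the index of the darkest pixel in an ImageMagick txt: enumeration.
# For a WIDTHx1 profile the index is the column; for a 1xHEIGHT profile it
# is the row. LO and HI optionally restrict the search to that index range.
darkest_index() {
    awk -v lo="${2:-0}" -v hi="${3:-999999999}" '
        /^#/ { next }                        # skip the enumeration header
        {
            split($1, xy, /[,:]/)            # "2020,0:" -> xy[1]=2020, xy[2]=0
            idx = xy[1] + xy[2]              # one of the two is always 0
            if (idx < lo + 0 || idx > hi + 0) next
            s = $0
            sub(/^[^(]*\(/, "", s)           # drop everything through the first "("
            sub(/[,)].*$/, "", s)            # keep only the first channel value
            v = s + 0
            if (!found || v < min) { min = v; best = idx; found = 1 }
        }
        END { if (found) print best }
    ' "$1"
}
```

With `columns.txt` produced as above, `darkest_index columns.txt 1500 2800` would look for the vertical line in the middle of the page only. For the horizontal line, something like `convert page.tif -scale 1x5670\! -depth 16 rows.txt` followed by `darkest_index rows.txt 0 1000` should work the same way. Both search ranges are guesses and need adjusting to the actual scans.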

Post by **HugoRune** » 2009-08-21T07:01:27-07:00

Another option: If the pages all have the same format and size, and the only difference is their alignment in the scanner, then you could trim the white border.
Then all your pages should be aligned the same.

You can use the same method of shrinking one dimension to one pixel to trim the border, thereby ignoring black speckles in the border.
See also here viewtopic.php?f=1&t=14247

Post by **m00tpoint** » 2009-08-21T11:15:10-07:00

Thanks for the tips folks! I'll try them out and let you know how I'm doing.

Post by **anthony** » 2009-08-21T17:39:49-07:00

HugoRune wrote: Another option: If the pages all have the same format and size, and the only difference is their alignment in the scanner, then you could trim the white border.
Then all your pages should be aligned the same.

You can use the same method of shrinking one dimension to one pixel to trim the border, thereby ignoring black speckles in the border.
See also here viewtopic.php?f=1&t=14247

That would work well. But watch out for those last chapter pages, which are half blank!!!!

As with most things, it is the exceptions to the norm that will kill an application!

Post by **m00tpoint** » 2009-08-21T17:47:40-07:00

Forgive me for being n00bish at CLI image manipulation, but I believe you're guiding me toward something like this?

# convert test1.tif -scale 4326x1\! -depth 16 a.txt
# grep rgb\(5 a.txt
2017,0: (38812,38812,38812) #979C979C979C rgb(59.2233%,59.2233%,59.2233%)
2018,0: (38651,38651,38651) #96FB96FB96FB rgb(58.9776%,58.9776%,58.9776%)
2019,0: (36593,36593,36593) #8EF18EF18EF1 rgb(55.8373%,55.8373%,55.8373%)
2020,0: (36050,36050,36050) #8CD28CD28CD2 rgb(55.0088%,55.0088%,55.0088%)
2021,0: (36466,36466,36466) #8E728E728E72 rgb(55.6435%,55.6435%,55.6435%)
2022,0: (37044,37044,37044) #90B490B490B4 rgb(56.5255%,56.5255%,56.5255%)
2023,0: (38893,38893,38893) #97ED97ED97ED rgb(59.3469%,59.3469%,59.3469%)

Therefore, the vertical centerline is around pixels 2017-2023. Allow some room because it's not going to be perfectly vertical:

# convert test1.tif -crop 2030x5670+0+0 +repage testcrop1.tif

Resulting file testcrop1.tif is the first column.
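
One thing to keep in mind: a `-crop` geometry with explicit `+X+Y` offsets selects a single region, whereas a geometry without offsets cuts the image into tiles of that size. Once the column of the vertical line and the row of the horizontal line are known, the three crop geometries (top strip, left column, right column) can be worked out with plain shell arithmetic. This is a rough sketch only: the function name `split_geometries`, the row value 300 in the usage example and the default 10-pixel margin are assumptions, not values taken from the scans.

```shell
# split_geometries WIDTH HEIGHT LINE_X LINE_Y [MARGIN]
# Print one "-crop" geometry each for the top strip, the left column and
# the right column. MARGIN pixels are left out on either side of both lines
# so the ruled lines themselves do not end up in the output.
split_geometries() (
    w=$1 h=$2 x=$3 y=$4 m=${5:-10}
    body_y=$((y + m))                 # first row below the horizontal line
    body_h=$((h - body_y))            # height of the two text columns
    echo "top ${w}x$((y - m))+0+0"
    echo "left $((x - m))x${body_h}+0+${body_y}"
    echo "right $((w - x - m))x${body_h}+$((x + m))+${body_y}"
)
```

For example, `split_geometries 4326 5670 2020 300` prints `left 2010x5360+0+310` as its second line, which could then be used as `convert test1.tif -crop 2010x5360+0+310 +repage left.tif`.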

Woo hoo. Looking pretty good! I'm sure I'll have a few more ?'s, but thanks for the quick and helpful replies!

m00t

Post by **HugoRune** » 2009-08-21T18:17:00-07:00

m00tpoint wrote:
# convert test1.tif -scale 4326x1\! -depth 16 a.txt
# grep rgb\(5 a.txt

I am not sure how reliable searching for "rgb(5" is.
But if you throw a "-contrast-stretch 0" in there:

convert test1.tif -depth 16 -scale 4326x1\! -contrast-stretch 0 a.txt

Then it should work with a grep "#000000000000"
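
A less format-dependent alternative to matching on the printed text is to compare the numeric value against a threshold. The sketch below assumes the `txt:` layout shown earlier in the thread and assumes that only the ruled line falls below the threshold; `dark_run` and its threshold argument are invented for this example.

```shell
# dark_run FILE THRESHOLD
# Print "FIRST LAST CENTER" for the indices in an ImageMagick txt:
# enumeration whose first channel value is below THRESHOLD.
dark_run() {
    awk -v t="$2" '
        /^#/ { next }                        # skip the enumeration header
        {
            split($1, xy, /[,:]/)            # "2017,0:" -> xy[1]=2017, xy[2]=0
            idx = xy[1] + xy[2]
            s = $0
            sub(/^[^(]*\(/, "", s)           # drop everything through the first "("
            sub(/[,)].*$/, "", s)            # keep only the first channel value
            if (s + 0 < t + 0) {
                if (!n) first = idx          # remember the first dark index
                last = idx
                n++
            }
        }
        END { if (n) print first, last, int((first + last) / 2) }
    ' "$1"
}
```

On the 16-bit data posted above, `dark_run a.txt 39321` (about 60% of 65535) would report the same 2017 to 2023 range that the `grep` found, plus its midpoint. If dense text columns also drop below the threshold, the reported range will be too wide, so the threshold has to be tuned per batch.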

Post by **anthony** » 2009-08-21T18:31:52-07:00

The method not only finds the location of the centerline, but also the width of the white space on either side of that centerline!!!!

If you look at the larger report I produced, you will see the values on either side of the center 2020 slowly increase back to near pure white (99.???%). That marks the start of the white space.

If you look further (beyond what I extracted) you will see the white gap start to get darker again due to the text. Set your crop boundary in the middle of that white space!

As you used a grep, you must be on a UNIX-type system. If so, use this to filter the txt output. NOTE that I negated the image, so low numbers are blank white areas, larger numbers are the darker text, and a really large number is the vertical bar.

convert test1.tif -scale 4326x1\! -negate -depth 16 txt:- |\
tail -n+2 | tr -cs '0-9\012' ' ' | cut -d' ' -f1,3 > a.txt

The first number on each line is the column, the second is a measure of content: low = blank, high = text, very high = vertical bar.

That should make it easier to process. Look for the highest value for the vertical line, then look for the low numbers around it for the white gap on either side. You can even graph this data (using a program like "gnuplot") to get a better idea of what the data shows.
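
As a sketch of that last step, the peak and the start of the white gap on either side can be pulled out of the two-column `a.txt` with awk. The function name `gap_edges` and the `LIMIT` argument (the value at or below which a column counts as blank) are assumptions made for this example.

```shell
# gap_edges FILE LIMIT
# FILE holds "index value" pairs from a negated profile (high = dark).
# Print "LEFT PEAK RIGHT": the index of the highest value, and the nearest
# index on each side of it whose value is at or below LIMIT.
gap_edges() {
    awk -v limit="$2" '
        {
            idx[NR] = $1
            val[NR] = $2 + 0
            if (NR == 1 || val[NR] > max) { max = val[NR]; p = NR }
        }
        END {
            if (NR == 0) exit
            l = p; while (l > 1  && val[l] > limit + 0) l--   # walk left off the bar
            r = p; while (r < NR && val[r] > limit + 0) r++   # walk right off the bar
            print idx[l], idx[p], idx[r]
        }
    ' "$1"
}
```

Run on the figures from the earlier report (negated), `gap_edges a.txt 1000` would report roughly 2007, 2020 and 2031. The search simply picks the highest value in the file, so it assumes nothing on the page is darker than the vertical bar; if that does not hold, cut the file down to the center columns first.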

Post by **m00tpoint** » 2009-08-21T18:38:06-07:00

HugoRune wrote: convert test1.tif -depth 16 -scale 4326x1\! -contrast-stretch 0 a.txt
Then it should work with a grep "#000000000000"

I'll experiment with that, thanks again.

I also have a slightly more challenging batch of files from the same set of books. An example:
http://FastFreeFileHosting.com/file/20896/0061-tif.html

This batch was scanned with a flatbed scanner. Therefore the skew issues are worse, especially on the left side. The middle gutter will be quite easy with the "find the darkest pixel range and crop accordingly" method.

But since skew was mentioned, how would one deskew the sample file after "cutting it in half"?

m00t

Post by **m00tpoint** » 2009-08-21T22:46:39-07:00

OK, I'm having issues trying to do some unrotating. Input file is [URL=http://m00tpoint.0catch.com/test1.tif[/URL]

I've tried as follows:

root@mybox# ./unrotate 61crop.tif unout.tif
convert: unable to open image `white': No such file or directory @ magick/blob.c/OpenBlob/2418.
convert: unable to open image `./unrotate_0_1664.png': No such file or directory @ magick/blob.c/OpenBlob/2418.
convert: unable to open file `./unrotate_0_1664.png' @ coders/png.c/ReadPNGImage/2833.
convert: missing an image filename `./unrotate_1_1664.png' @ wand/convert.c/ConvertImageCommand/2710.
convert: unable to open image `./unrotate_1_1664.png': No such file or directory @ magick/blob.c/OpenBlob/2418.
convert: unable to open file `./unrotate_1_1664.png' @ coders/png.c/ReadPNGImage/2833.
convert: missing an image filename `./unrotate_2_1664.png' @ wand/convert.c/ConvertImageCommand/2710.
convert: unable to open image `./unrotate_1_1664.png': No such file or directory @ magick/blob.c/OpenBlob/2418.
convert: unable to open file `./unrotate_1_1664.png' @ coders/png.c/ReadPNGImage/2833.
convert: missing an image filename `./unrotate_3_1664.png' @ wand/convert.c/ConvertImageCommand/2710.
convert: unable to open image `./unrotate_2_1664.png': No such file or directory @ magick/blob.c/OpenBlob/2418.
convert: unable to open file `./unrotate_2_1664.png' @ coders/png.c/ReadPNGImage/2833.
convert: missing an image filename `txt:-' @ wand/convert.c/ConvertImageCommand/2710.

So that script isn't doing well for me. For the record, here are the files it created:

root@myboxt# ls -l | grep 1664
-rw-r--r-- 1 root root 334160 2009-08-21 22:47 unrotate_0_1664-0.png
-rw-r--r-- 1 root root 354315 2009-08-21 22:47 unrotate_0_1664-1.png
-rw-r--r-- 1 root root 20728 2009-08-21 22:47 unrotate_0_1664-2.png

Is this as simple as filenaming that isn't consistent in the current rev of the script?

Thanks,
m00t

Post by **fmw42** » 2009-08-21T22:54:22-07:00

if you are using my unrotate script, then please correct your last post, as your images are not easily available. you are not using the URL button correctly.

And if I type this into my browser it cannot find it: http://m00tpoint.0catch.com/test1.tif

several other issues with using my script.

1) you must change permissions to make it executable
2) you may need to remove the .sh at the end after downloading
3) you may need to provide the full path to the script as well as to the input and output images
4) you may need to install any relevant delegate library for tif, png, jpg, gif that you use and then recompile IM

Post by **m00tpoint** » 2009-08-22T00:08:26-07:00

Fred,

1) Here's a better URL for the file: http://FastFreeFileHosting.com/file/209 ... 1-tif.html
2) Script is executable, run as root. Running it isn't the problem.
3) Input and output images are in the same directory as the script.
4) Imagemagick's convert utility is handling my tif's fine; I used it to crop this image from a larger one.

Thanks so much for your help, and for all the time and effort you've put into software for the rest of us to use and enjoy!

m00t

Post by **fmw42** » 2009-08-22T02:46:25-07:00

So I assume it is working fine. But there is a better script for processing text. See textcleaner. It uses the -deskew function in IM that works well for text that needs to be unrotated by about 5 deg or less. See my example page
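
For a whole batch, the built-in `-deskew` operator can be driven from a small loop. The following is a sketch rather than a tested recipe for these scans: the 40% threshold is the value the ImageMagick documentation suggests as a general starting point, and the function name `deskew_batch` and the `_deskew` output suffix are made up here. The function only prints the commands, so they can be inspected before anything is run.

```shell
# deskew_batch DIR
# Print one deskew command for every .tif file in DIR. Nothing is executed;
# pipe the output to "sh" once the commands look right. Assumes file names
# without quotes, dollar signs or other shell-special characters.
deskew_batch() {
    for f in "$1"/*.tif; do
        [ -e "$f" ] || continue              # no matches: the glob stays literal
        out="${f%.tif}_deskew.tif"
        printf 'convert "%s" -deskew 40%% +repage "%s"\n' "$f" "$out"
    done
}
```

Typical use would be `deskew_batch ./columns` to review the list, then `deskew_batch ./columns | sh` to run it, where `./columns` is a hypothetical directory holding the already split column images.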
