Batch process page images for OCR
Posted: 2009-08-20T23:06:15-07:00
I have thousands of page images I need to run through an OCR system. Each page has a horizontal line at the top with a line of text above it (a title and a couple of page numbers). A vertical line runs down the middle of the image, though the scans are not precise enough for "middle" to be exact. There is one column of text on each side of the vertical line.
My OCR engine is great at recognizing the old script, but cannot understand that there are 2 columns of text. How can I split each page image into two columns using ImageMagick in a batch process? Ideally the image processing would somehow "find" the vertical line and the horizontal line, and split the page into 3 images: the top slice, the left half, and the right half. Less ideal would be to split at fixed pixel offsets, because the pages are not always perfectly centered in the image file and the text comes up tight to the vertical line. However, every image is the same size: 4326 x 5670 pixels.
I'm hoping to run this in a Linux environment.
Help?
m00tpoint
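One possible way to "find" the two rules is a projection profile: count the dark pixels in every row and every column, then take the darkest row near the top as the horizontal line and the darkest column near the center as the vertical line. The sketch below is only an illustration under assumptions not stated in the post. It assumes the page has already been thresholded into a grid of 0/1 values (1 = dark), which would have to be done with an imaging library of your choice, and the names `find_line`, `split_geometry`, `search_frac`, `top_frac` and `gap` are hypothetical. It returns crop regions in ImageMagick's `WxH+X+Y` geometry form.

```python
def find_line(profile, lo, hi):
    """Return the index in [lo, hi) whose dark-pixel count is largest."""
    return max(range(lo, hi), key=lambda i: profile[i])


def split_geometry(dark, search_frac=0.1, top_frac=0.25, gap=0):
    """Locate the two rules in a 0/1 pixel grid and return crop geometries.

    dark        -- list of rows, each a list of 0/1 values (1 = dark pixel)
    search_frac -- assumed half-width of the search window around the
                   horizontal center, as a fraction of the image width
    top_frac    -- assumed fraction of the image height, measured from the
                   top, in which the horizontal rule is expected
    gap         -- extra pixels to drop on each side of a detected rule,
                   for rules thicker than one pixel
    """
    height = len(dark)
    width = len(dark[0])

    # Projection profiles: dark pixels per column and per row.
    col_profile = [sum(row[x] for row in dark) for x in range(width)]
    row_profile = [sum(row) for row in dark]

    # Vertical rule: darkest column within a window around the center.
    center = width // 2
    margin = int(width * search_frac)
    vx = find_line(col_profile, max(0, center - margin),
                   min(width, center + margin + 1))

    # Horizontal rule: darkest row within the top part of the page.
    hy = find_line(row_profile, 0, max(1, int(height * top_frac)))

    body_top = hy + 1 + gap          # first row below the horizontal rule
    body_height = height - body_top
    right_x = vx + 1 + gap           # first column right of the vertical rule

    return {
        "top": f"{width}x{hy - gap}+0+0",
        "left": f"{vx - gap}x{body_height}+0+{body_top}",
        "right": f"{width - right_x}x{body_height}+{right_x}+{body_top}",
    }
```

Each returned string could then be handed to ImageMagick in a shell loop, along the lines of `convert page.png -crop GEOMETRY +repage left.png`. The window fractions are guesses and would need tuning against real scans. Skewed pages would smear the rules across several rows or columns, so deskewing first may be necessary.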