I am trying to OCR a number of very old documents (150+ years old), which seem to have been typeset in two columns separated by a vertical rule, with a horizontal rule at the top of the page. The OCR programs I have tried are 'foxed' by the vertical line between the columns but seem to be OK with the horizontal one.
The right edge of column 1 is very close to the vertical line, as is the left edge of column 2, and on each attempt the OCR fails at a hyphenation at the right edge of column 1.
The result is quite good OCR, but with nonsense sentences after each hyphenated word, and hyphenations are quite frequent in multi-column text.
I have some ideas for algorithms to fix the page column issue:
1) Mask the vertical delimiter line by covering it with a slightly wider (0.2") blank column.
2) Create a new page by copying the bounding box of column 1 to it, then the bounding box of column 2, ensuring there is a sufficient gap between them and no vertical line on the new page.
3) Alternately mask column 2 while OCRing column 1, then mask column 1 while OCRing column 2. Programmatically this is probably too difficult for me.
The trouble is with the execution of this. Can someone help with a proposed implementation?
I attach an example of the page in question.
https://www.dropbox.com/s/32ye6befoyzjx ... 5.png?dl=0
Thanks in advance
Brian
Pre-processing a victorian typeset page
Re: Pre-processing a victorian typeset page
Here is a working link:
https://www.dropbox.com/s/qgld3398kj4zz ... 5.png?dl=0
Re: Pre-processing a victorian typeset page
Fred may have a script for this.
Your first link worked for me. But the image is horribly small for successful OCR.
The text is pretty much horizontal but, sadly, that vertical central black line isn't vertical. This can be corrected by "-shear 0.5x0". With luck, all the pages need the same shear.
Then we can scale down to a single row. I expand this to 10 rows so we can see it.
(Windows CMD syntax.)
Code:
convert dnb25.png -shear 0.5x0 -trim +repage +write x0.png -scale x1! -scale x10! -auto-level dnb25_flat.png
We can see a dark line in the position of the vertical line, with a lighter gutter on each side. I would chop the image somewhere in each gutter, so I have "two pieces of paper", and the central strip is discarded.
Code:
convert dnb25_flat.png -gravity center -crop 50%x+0+0 +repage -crop x1+0+0 +repage -channel RGB -auto-level +channel -threshold 90% +transparent White sparse-color:
Output:
136,0,white 137,0,white 138,0,white 143,0,white 144,0,white 145,0,white
So the first gutter (in the half-width centre crop) is from x=136 to x=138, and the second is from x=143 to x=145. The centred 50% crop starts a quarter of the way across x0.png, so we should chop x0.png at x=137 and x=144, each plus a quarter of its width.
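For anyone who wants to script the gutter arithmetic rather than read it off by eye, here is a minimal sketch in plain Python. It is not part of the ImageMagick commands above. It assumes you already have the one-row brightness profile as a list of numbers (one per column, higher meaning lighter), and the names `white_runs` and `cut_positions` are made up for illustration. It mirrors the same steps: take the centre half, stretch the levels, threshold at 90%, and take the middle of each white run, shifted back by the crop offset.

```python
def white_runs(row, threshold=0.9):
    """Return (start, end) index pairs for runs of samples that are
    brighter than `threshold` after stretching the row to the 0..1 range."""
    lo, hi = min(row), max(row)
    span = (hi - lo) or 1  # avoid dividing by zero on a flat row
    runs, start = [], None
    for x, value in enumerate(row):
        bright = (value - lo) / span > threshold
        if bright and start is None:
            start = x
        elif not bright and start is not None:
            runs.append((start, x - 1))
            start = None
    if start is not None:
        runs.append((start, len(row) - 1))
    return runs


def cut_positions(profile):
    """Given per-column brightness for the full page width, return the
    x positions (in full-page coordinates) at the middle of each gutter
    found in the centre half of the page."""
    width = len(profile)
    half = width // 2
    offset = (width - half) // 2  # where the centred half-width crop starts
    centre = profile[offset:offset + half]
    return [offset + (a + b) // 2 for a, b in white_runs(centre)]
```

With the runs listed above, 136 to 138 and 143 to 145, the midpoints come out as 137 and 144, and the offset is the quarter-width term. The sketch assumes the centre half contains only the two gutters as near-white runs; a page with wide white gaps inside the text would need extra filtering.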
When we have the two images, they could be OCR'd separately, or appended vertically to do them together.
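To make the "two pieces of paper" step concrete, here is a small sketch of the split-and-append logic on a grid of pixel values (a list of rows). It is only an illustration under assumptions: the function names are hypothetical, pixels are plain integers with 255 as white, and on real files the equivalent work would be done by ImageMagick operations such as -crop and -append rather than by this code.

```python
WHITE = 255  # assumed fill value for padding


def split_columns(rows, left_cut, right_cut):
    """Split an image (list of pixel rows) into left and right columns,
    discarding everything from x=left_cut to x=right_cut inclusive."""
    left = [row[:left_cut] for row in rows]
    right = [row[right_cut + 1:] for row in rows]
    return left, right


def stack_vertically(top, bottom, fill=WHITE):
    """Append `bottom` under `top`, padding the narrower piece on the
    right with `fill` so every row has the same width."""
    width = max(len(top[0]), len(bottom[0]))

    def pad(row):
        return row + [fill] * (width - len(row))

    return [pad(row) for row in top] + [pad(row) for row in bottom]
```

Stacking column 2 under column 1 gives the OCR program a single-column page with no vertical rule, so the reading order follows the original text.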
snibgo's IM pages: im.snibgo.com