Pre-processing a Victorian typeset page
Posted: 2016-05-29T18:45:56-07:00
I am trying to OCR a number of very old documents (150+ years) which seem to have been typeset in two columns, with a vertical line as a delimiter between the columns and a horizontal delimiter line at the top of the page. The OCR programs I have tried are 'foxed' by the vertical line between the columns but seem to be OK with the horizontal one.
The right edge of column 1 is very close to the vertical line, as is the left edge of column 2, and on each attempt the OCR fails wherever a hyphenated word ends a line in column 1.
The result is quite good OCR, but with nonsense sentences after each hyphenated word, and these are quite frequent in multi-column text.
I have some ideas for algorithms to fix the page column issue:
1) Cover the vertical delimiter line with a blank 'mask' column that is slightly wider than the line (0.2").
2) Create a new page by copying the bounding box of column 1 to it and then the bounding box of column 2 to the same new page, ensuring there is a sufficient gap, without a vertical line, between them.
3) Alternately mask column 2 while OCRing column 1, and then mask column 1 while OCRing column 2. Programmatically this is probably too difficult for me.
The trouble is with the execution of this. Can someone help with a proposed implementation?
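To make the ideas concrete, here is a rough sketch of ideas 1 and 2 in plain Python. It is not a tested solution for my scans. It assumes the page has already been loaded as a grid of grayscale values (0 = black, 255 = white), which on a real scan would have to come from an image library, and that the page is not skewed, so the vertical line stays within the same pixel columns all the way down. The function names, the darkness threshold and the 80% height fraction are all made up for illustration.

```python
def find_vertical_rule(pixels, dark_threshold=128, min_fraction=0.8):
    """Return (first_x, last_x) of the widest run of pixel columns that are
    dark over at least min_fraction of the page height, or None."""
    height = len(pixels)
    width = len(pixels[0])
    is_rule = []
    for x in range(width):
        dark = sum(1 for y in range(height) if pixels[y][x] < dark_threshold)
        is_rule.append(dark >= min_fraction * height)

    best = None
    start = None
    # A trailing False closes a run that touches the right edge.
    for x, flag in enumerate(is_rule + [False]):
        if flag and start is None:
            start = x
        elif not flag and start is not None:
            run = (start, x - 1)
            if best is None or run[1] - run[0] > best[1] - best[0]:
                best = run
            start = None
    return best


def mask_rule(pixels, rule, margin=0, white=255):
    """Idea 1: paint the rule (plus an optional margin) white.
    A margin that is too wide would also erase end-of-line hyphens."""
    first, last = rule
    width = len(pixels[0])
    lo = max(0, first - margin)
    hi = min(width - 1, last + margin)
    return [row[:lo] + [white] * (hi - lo + 1) + row[hi + 1:] for row in pixels]


def split_columns(pixels, rule):
    """Idea 2: return the left and right text columns as separate grids,
    leaving the rule itself out of both."""
    first, last = rule
    left = [row[:first] for row in pixels]
    right = [row[last + 1:] for row in pixels]
    return left, right


# Tiny synthetic page: 11 pixels wide, 10 tall, all white to begin with.
page = [[255] * 11 for _ in range(10)]
page[0] = [0] * 11            # horizontal line at the top of the page
for y in range(1, 10):
    page[y][5] = 0            # vertical line between the two columns
page[3][4] = 0                # a 'hyphen' pixel right next to the line

rule = find_vertical_rule(page)
masked = mask_rule(page, rule)
left, right = split_columns(page, rule)
```

Each of the two grids from split_columns could then be written out as its own image and OCRed separately, which is effectively idea 3 without any masking. Whether a simple column-darkness test like this would survive the skew and speckle on a real 150-year-old page is something I cannot say.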
I attach an example of the page in question.
https://www.dropbox.com/s/32ye6befoyzjx ... 5.png?dl=0
Thanks in advance
Brian