Removing table borders
Posted: 2012-11-19T09:33:15-07:00
by phrosty
I'm trying to OCR some papers which have data in a table format, but the table has borders between rows/columns and it's messing up the OCR. I think there must be some way to have ImageMagick remove the borders for me -- anyone know?
Re: Removing table borders
Posted: 2012-11-19T10:28:57-07:00
by Bonzo
For a start, nobody will be able to help without an example. Also, are all the tables the same on every page?
Re: Removing table borders
Posted: 2012-11-19T10:41:58-07:00
by fmw42
If all the tables have the same structure, can you scan an empty table? You could then use compare to locate the offset (assuming no rotation or scale differences) and use the empty table to mask out the lines.
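To make the empty-table idea concrete, here is a minimal sketch in plain Python. It is not ImageMagick's compare; it only illustrates the same two steps (find the offset, then mask) on a tiny binarized image. The names `best_offset` and `mask_out`, the list-of-lists image format (1 = dark, 0 = light), and the brute-force search over a small shift window are all assumptions made for illustration, and like the suggestion above it assumes no rotation or scale differences.

```python
from typing import List, Tuple

Image = List[List[int]]  # hypothetical format: 1 = dark (ink), 0 = light (paper)


def best_offset(scan: Image, template: Image, max_shift: int) -> Tuple[int, int]:
    """Return the (dy, dx) shift at which the template's dark pixels
    overlap the most dark pixels in the scan (brute-force search)."""
    h, w = len(scan), len(scan[0])
    dark = [(y, x) for y, row in enumerate(template)
            for x, v in enumerate(row) if v]
    best, best_score = (0, 0), -1
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            score = sum(
                1 for y, x in dark
                if 0 <= y + dy < h and 0 <= x + dx < w and scan[y + dy][x + dx]
            )
            if score > best_score:
                best, best_score = (dy, dx), score
    return best


def mask_out(scan: Image, template: Image, offset: Tuple[int, int]) -> Image:
    """Return a copy of the scan with every pixel cleared that is dark
    in the shifted template, i.e. the table lines."""
    dy, dx = offset
    h, w = len(scan), len(scan[0])
    out = [row[:] for row in scan]
    for y, row in enumerate(template):
        for x, v in enumerate(row):
            if v and 0 <= y + dy < h and 0 <= x + dx < w:
                out[y + dy][x + dx] = 0
    return out
```

Calling `mask_out(scan, template, best_offset(scan, template, 2))` would clear the table lines and leave the cell contents, provided the real offset lies within the search window. On real scans, text that touches a border would lose the overlapping pixels.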
Re: Removing table borders
Posted: 2012-11-19T16:48:52-07:00
by phrosty
I've got thousands of scanned documents from various sources and they're all slightly different -- I can't count on fonts, sizes, positions (global or local), border thickness, etc., so it's not a one-mask-fits-all problem (I wish!).
I've thought of detecting straight lines and removing any that are beyond a size which could be text -- sounds easy in theory, but the algorithms I've seen (e.g. Hough transform) which might be used are beyond me. Maybe I'm over-complicating it.
I don't have the documents on me at the moment. I'll post some later tonight when I do, but really, just imagine HTML tables with borders, messed up by scanning.
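The idea of removing any straight line too long to be text can be sketched without a Hough transform. The version below swaps in a simpler run-length check: it scans each row and column of a binarized image and clears any unbroken run of dark pixels at least `min_len` long. The function names, the list-of-lists image format (1 = dark, 0 = light), and the `min_len` threshold are assumptions for illustration. It also assumes the scan has already been deskewed, since a tilted border breaks into short runs that this check would miss.

```python
from typing import List, Sequence, Tuple

Image = List[List[int]]  # hypothetical format: 1 = dark (ink), 0 = light (paper)


def find_long_runs(line: Sequence[int], min_len: int) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, end exclusive, for every unbroken run
    of dark pixels in the line that is at least min_len long."""
    runs = []
    start = None
    for i, v in enumerate(line):
        if v and start is None:
            start = i                      # a run begins here
        elif not v and start is not None:
            if i - start >= min_len:
                runs.append((start, i))    # run ended and was long enough
            start = None
    if start is not None and len(line) - start >= min_len:
        runs.append((start, len(line)))    # run reaches the edge of the line
    return runs


def remove_long_lines(img: Image, min_len: int) -> Image:
    """Return a copy of img with long horizontal and vertical runs cleared."""
    h = len(img)
    w = len(img[0]) if h else 0
    to_clear = set()
    for y in range(h):                     # horizontal runs, row by row
        for s, e in find_long_runs(img[y], min_len):
            to_clear.update((y, x) for x in range(s, e))
    for x in range(w):                     # vertical runs, column by column
        column = [img[y][x] for y in range(h)]
        for s, e in find_long_runs(column, min_len):
            to_clear.update((yy, x) for yy in range(s, e))
    out = [row[:] for row in img]
    for y, x in to_clear:
        out[y][x] = 0
    return out
```

Picking `min_len` is the judgment call: it should be longer than the longest stroke in any glyph (for example, wider than the widest character at the scan resolution) but shorter than the shortest cell border. Both passes read the original image, so a pixel where a horizontal and a vertical border cross is cleared by either one.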
Re: Removing table borders
Posted: 2012-11-20T13:55:54-07:00
by fmw42
Having one or two examples may help us understand better. Also it gives us something with which to test ideas.