Remove horizontal summation lines but keep a minus
Posted: 2018-09-21T03:30:35-07:00
by isfando
Hi,
I need to OCR PDFs of financial statements that have horizontal lines before the summation. The lines reduce the accuracy of the digits I am parsing, so I need to remove the lines but keep any minus sign in front of a number. The parsed digits are used in calculations, so accuracy is crucial. For example, a PDF might contain:
220
-30
________
190
________
I want the resulting image from the PDF to look like this:
220
-30
190
Re: Remove horizontal summation lines but keep a minus
Posted: 2018-09-21T04:07:53-07:00
by snibgo
See
viewtopic.php?f=1&t=22338&p=129166#p129154 where I show a command that turns white all black lines that are at least 50 pixels wide.
Re: Remove horizontal summation lines but keep a minus
Posted: 2018-09-21T05:05:19-07:00
by isfando
I applied your command, but it doesn't turn the lines white. It makes them grey.
Before Applying your command
https://drive.google.com/open?id=1EN3Zj ... 4lPwESQT8K
After Applying your command
https://drive.google.com/open?id=1vXmjC ... hPmfveDuTG
I am using convert on Windows, with the following version information:
Version: ImageMagick 7.0.7-4 Q16 x64 2017-09-23
http://www.imagemagick.org
Copyright: Copyright (C) 1999-2015 ImageMagick Studio LLC
License:
http://www.imagemagick.org/script/license.php
Visual C++: 180040629
Features: Cipher DPC Modules OpenMP
Delegates (built-in): bzlib cairo flif freetype jng jp2 jpeg lcms lqr openexr pangocairo png ps rsvg tiff webp xml zlib
Re: Remove horizontal summation lines but keep a minus
Posted: 2018-09-21T05:37:13-07:00
by snibgo
My command turns black lines into white, but your image doesn't have black lines. You could process it to make the lines black, then remove them, and use the pixels that have changed to paint white over your input image.
However, that image is low quality with small characters. I doubt that you will get reliable OCR from it.
Re: Remove horizontal summation lines but keep a minus
Posted: 2018-09-21T06:19:31-07:00
by isfando
Sorry, the earlier pictures were taken at a poor zoom level, and they were part of a bigger picture which I can't post for confidentiality reasons.
I changed my lines to black and your command works for the most part, but it leaves jagged edges behind.
I am using the following command to convert from PDF to PNG:
convert -density 300 ./sam.pdf -depth 8 -strip -background white -alpha off -threshold 70% sam.png
I get this png as a result after running the above command
https://drive.google.com/open?id=1Fv3RI ... tZiIDDCVqc
After Applying your command
https://drive.google.com/open?id=1Q9xJX ... IBq0lrxU-L
The only problem is that some noise from the removed lines still remains, and Tesseract could parse it as a minus sign.
Re: Remove horizontal summation lines but keep a minus
Posted: 2018-09-21T06:57:08-07:00
by snibgo
A bit more pre-processing solves the problem:
Code:
convert ^
Before1.png ^
-strip ^
( +clone ^
-threshold 50%% ^
-write mpr:ORG ^
+delete ^
) ^
( mpr:ORG ^
-negate ^
-morphology Erode rectangle:50x1 ^
-mask mpr:ORG -morphology Dilate rectangle:50x1 ^
+mask ^
-morphology Dilate Disk:2 ^
) ^
-compose Lighten -composite ^
( +clone ^
-morphology HMT "1x4:1,0,0,1" ^
) ^
-compose Lighten -composite ^
( +clone ^
-morphology HMT "1x3:1,0,1" ^
) ^
-compose Lighten -composite ^
( +clone ^
-morphology HMT "3x1:1,0,1" ^
) ^
-compose Lighten -composite ^
out.png
Re: Remove horizontal summation lines but keep a minus
Posted: 2018-09-24T03:41:55-07:00
by isfando
Snibgo, it worked. Thanks a lot. If you could explain the script, it would help me think independently and make changes in future. Currently I am not at a level to understand the pipeline of events happening in your code.
Re: Remove horizontal summation lines but keep a minus
Posted: 2018-09-24T05:12:59-07:00
by snibgo
When trying to understand a long command, it's helpful to sprinkle "+write x0.png", "+write x1.png" etc after every step. We can then see the effects.
The goal is to remove the long black horizontal lines, i.e. to make them white. To do that, we make an image that is white where the lines are and black everywhere else. As the input isn't just black and white, we slightly enlarge the lines in that image.
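To make that idea concrete, here is a toy sketch in plain Python rather than ImageMagick. It only illustrates "find horizontal runs of ink that are at least N pixels long and paint them white". The names long_run_mask, remove_long_lines and min_run are made up for this example, and it skips the thresholding and the small Disk dilation that the real command uses.

```python
def long_run_mask(rows, min_run):
    """Mark every pixel that belongs to a horizontal run of ink ('#')
    at least min_run pixels long. Returns a grid of booleans."""
    mask = []
    for row in rows:
        marked = [False] * len(row)
        start = None
        # The trailing "." is a sentinel that closes a run touching the right edge.
        for x, ch in enumerate(row + "."):
            if ch == "#" and start is None:
                start = x                      # a run of ink begins here
            elif ch != "#" and start is not None:
                if x - start >= min_run:       # long enough to count as a rule
                    for i in range(start, x):
                        marked[i] = True
                start = None
        mask.append(marked)
    return mask


def remove_long_lines(rows, min_run):
    """Paint white ('.') over every pixel flagged by long_run_mask."""
    mask = long_run_mask(rows, min_run)
    return [
        "".join("." if hit else ch for ch, hit in zip(row, marks))
        for row, marks in zip(rows, mask)
    ]


page = [
    "..###..##.",   # short runs: digit strokes or a minus sign
    "##########",   # long run: a summation line
    ".####.....",
]
cleaned = remove_long_lines(page, min_run=8)
```

With min_run=8 the 10-pixel rule is wiped while the 3- and 4-pixel strokes survive, which is why a minus sign is left alone as long as it is shorter than the chosen width.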
convert ^ Run v6 convert
Before1.png ^ Read the image
-strip ^ Remove any superfluous metadata.
( +clone ^ Start a new image list; add to it a clone of the last image in the outer list.
-threshold 50%% ^ Threshold all the images in the current list (there is only one) so they are black and white only.
-write mpr:ORG ^ Write it to a memory location.
+delete ^ Remove it from the current list.
) ^ Close the current list. This would add any images from the nested list to the outer list, but there aren't any as we deleted it.
( mpr:ORG ^ Start a new list, reading the image we saved.
-negate ^ Invert black and white. Now we have white numbers and lines on a black background.
-morphology Erode rectangle:50x1 ^ Erode (remove) small horizontal lines. Now we have just the long horizontal lines, but slightly trimmed.
-mask mpr:ORG -morphology Dilate rectangle:50x1 ^ Dilate (make larger) the horizontal lines, using ORG as a mask so we get the full width of the lines.
+mask ^ Stop using the mask.
-morphology Dilate Disk:2 ^ Make the lines slightly taller (and wider).
) ^ Close the current list, copying the result from the inner list to the outer list. Now the list has two images: Before1.png, and thick white lines on a black background.
-compose Lighten -composite ^ Make each pixel the lighter of the two images. This paints white over the long horizontal lines in Before1.png.
( +clone ^
-morphology HMT "1x4:1,0,0,1" ^
) ^
-compose Lighten -composite ^
( +clone ^
-morphology HMT "1x3:1,0,1" ^
) ^
-compose Lighten -composite ^
( +clone ^
-morphology HMT "3x1:1,0,1" ^
) ^
-compose Lighten -composite ^
out.png
The final steps simply clean any noise from the image. These aren't needed for your example.
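For the un-annotated steps, a rough picture: HMT is ImageMagick's hit-and-miss morphology. As I understand the kernel syntax, in "3x1:1,0,1" a 1 must match a white pixel and a 0 must match a black pixel, so that kernel hits a single black pixel with white neighbours to its left and right, and the Lighten composite then paints the hits white. The toy Python below mimics just that one pattern on a grid of '#' (black) and '.' (white). The function name is made up, and unlike ImageMagick it simply ignores the first and last column.

```python
def clean_isolated_columns(rows):
    """Whiten every dark pixel whose left and right neighbours are both
    light: the white, black, white pattern, applied along each row."""
    cleaned = []
    for row in rows:
        cells = list(row)
        for x in range(1, len(row) - 1):
            # Always test against the original row, so one fix cannot
            # trigger another within the same pass.
            if row[x - 1] == "." and row[x] == "#" and row[x + 1] == ".":
                cells[x] = "."
        cleaned.append("".join(cells))
    return cleaned


noisy = [
    "..#..##..",   # a lone speck, then a 2-pixel stroke
    "#.#.###.#",   # the edge pixels are not examined in this sketch
]
result = clean_isolated_columns(noisy)
```

The "1x3" and "1x4" kernels do the same kind of thing vertically, for gaps one and two pixels tall.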
Re: Remove horizontal summation lines but keep a minus
Posted: 2018-09-24T05:59:14-07:00
by isfando
Thanks a lot for the explanation; now I have an idea of the pipeline. I noticed that the code still leaves a fragment of a line at the bottom right corner of the image. Also, any addition that smooths the numbers would be very helpful.
Before
https://drive.google.com/open?id=135BfW ... LvCm7B0476
After
https://drive.google.com/open?id=1SXEzI ... PVIzMnf34I
Re: Remove horizontal summation lines but keep a minus
Posted: 2018-09-24T07:03:47-07:00
by snibgo
To improve the bottom right, in "-morphology Dilate Disk:2 ^", change 2 to 3.
My command doesn't change the numbers. You can add a slight blur if you want, eg "-blur 0x0.5" just before the output filename.
Re: Remove horizontal summation lines but keep a minus
Posted: 2018-09-25T04:23:53-07:00
by isfando
I made the suggested changes and the results are pretty good. Thanks a lot. One last question in this regard: how can I feed a multi-page PDF file to your code so that it is applied to each page, giving me one PNG image per page of the PDF?
Re: Remove horizontal summation lines but keep a minus
Posted: 2018-09-25T05:12:32-07:00
by snibgo
PDF documents are often too large for raster images of all the pages to be in memory simultaneously. Besides, adapting a complex convert command for multiple input images isn't trivial.
The easy solution is to use a shell loop to process one page at a time. For example, put the above command in a BAT file I'll call DoOnePage.bat that takes %1 as the input and %2 as the output. Then create another BAT file I'll call DoManyPages.bat like this (untested):
Code:
set INPDF=mypdf.pdf
for /F "usebackq" %%L in (`exiftool -args -PageCount %INPDF%`) do set %%L
set /A LASTPAGE=%-PageCount%-1
for /L %%I in (0,1,%LASTPAGE%) do call DoOnePage %INPDF%[%%I] out_%%I.png
This uses exiftool to quickly count the pages.
However, that creates files like out_9.png and out_10.png, so they don't sort cleanly. I add leading zeros like this:
Code:
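rem !LZ! is only expanded inside the loop when delayed expansion is enabled.
setlocal EnableDelayedExpansion
for /L %%I in (0,1,%LASTPAGE%) do (
set LZ=000000%%I
set LZ=!LZ:~-6!
call DoOnePage %INPDF%[%%I] out_!LZ!.png
)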
This gives filenames like out_000009.png and out_000010.png.
[I haven't tested the above. Beware of my faulty memory.]
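If you would rather drive the loop from a script than from BAT, the same bookkeeping is short in Python. This sketch only builds the (input, output) pairs, assuming you already know the page count. page_jobs and its width parameter are names invented here, and nothing in it calls ImageMagick.

```python
def page_jobs(pdf, page_count, width=6):
    """Build one (input, output) pair per page.

    The input uses the 'file.pdf[N]' page selector with N starting at 0,
    and the output name is zero-padded so the files sort in page order.
    """
    return [
        (f"{pdf}[{i}]", f"out_{i:0{width}d}.png")
        for i in range(page_count)
    ]


jobs = page_jobs("mypdf.pdf", 11)
names = [out for _, out in jobs]
```

Here jobs[9] is ("mypdf.pdf[9]", "out_000009.png") and jobs[10] is ("mypdf.pdf[10]", "out_000010.png"), so a plain alphabetical sort of the names keeps the pages in order.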
Re: Remove horizontal summation lines but keep a minus
Posted: 2018-09-25T06:37:59-07:00
by isfando
Thanks for your suggestion. I will try to make use of it. But in my case I will run the script on a server where memory is not a problem: it has 128 GB of RAM. If the original convert script could be changed to handle a PDF file as a whole, it would make my work much easier. I want the algorithm applied to each page as well, so feeding a PDF to the original convert script would suit my needs well.
Re: Remove horizontal summation lines but keep a minus
Posted: 2018-09-25T06:52:39-07:00
by snibgo
The command could be adapted for multiple inputs, but I don't know how the masked morphology would be done. Each image needs its own mask, but IM's syntax needs an explicit name for the mask.
I've shown you a loop in a BAT script, which calls another BAT script that runs the command. Of course, you could do it in a single BAT script instead.
If you have loads of processors as well as memory, you could split the job into one page per processor, so it should be quick for the overall PDF document. That's what I do for video frames (although I have only 8 logical cores, and 12 GB memory).
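As a rough illustration of "one page per processor", here is a Python sketch rather than BAT. It assumes Windows and a DoOnePage.bat in the current directory that takes the input and output as its two arguments. run_page and process_pages are names made up for this example, and I have not run it against a real PDF. Threads are enough here because each worker just waits for an external process.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_page(job):
    """Process one page by calling the (assumed) DoOnePage.bat."""
    src, dst = job
    # Assumption: DoOnePage.bat exists and accepts <input> <output>.
    subprocess.run(["cmd", "/c", "DoOnePage.bat", src, dst], check=True)
    return dst


def process_pages(jobs, worker=run_page, max_workers=None):
    """Run worker over all jobs, at most max_workers at a time.

    Results come back in the same order as jobs. By default there is
    one worker per logical core.
    """
    max_workers = max_workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, jobs))
```

Calling process_pages with a list of pairs such as ("mypdf.pdf[0]", "out_000000.png") starts one convert per page, limited to the number of cores. If memory becomes the bottleneck instead, pass a smaller max_workers.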
Re: Remove horizontal summation lines but keep a minus
Posted: 2018-09-25T07:59:21-07:00
by isfando
OK, I got the point. Your guidance is indeed very helpful. I was able to set up your script on my machine; it goes through the PDF and runs your convert command page by page. But now the resulting PNG images are not sharp. Below are the steps for two approaches: the result from the earlier approach is pretty crisp, while the result from the current approach is dull and shows marks left by the removed lines. As an example I am using a PDF named sam.pdf containing only one page. My main question is: how can I get results from the current approach that are as crisp as those from the earlier approach?
(I am also presenting a tweaked approach at the end. I don't think it's very efficient, but its result is crisp.)
********************EARLIER APPROACH******************************
1)
Code:
convert -density 300 ./sam.pdf -depth 8 -strip -background white -alpha off -threshold 70% sam.png
The output image sam.png from this step is pretty crisp, so the result in step 3 is also crisp.
https://drive.google.com/open?id=1fBFFo ... HG-8w-6zGI
2)
Code:
convert ^
sam.png ^
-strip ^
( +clone ^
-threshold 50%% ^
-write mpr:ORG ^
+delete ^
) ^
( mpr:ORG ^
-negate ^
-morphology Erode rectangle:200x1 ^
-mask mpr:ORG -morphology Dilate rectangle:200x1 ^
+mask ^
-morphology Dilate Disk:3 ^
) ^
-compose Lighten -composite ^
( +clone ^
-morphology HMT "1x4:1,0,0,1" ^
) ^
-compose Lighten -composite ^
( +clone ^
-morphology HMT "1x3:1,0,1" ^
) ^
-compose Lighten -composite ^
( +clone ^
-morphology HMT "3x1:1,0,1" ^
) ^
-compose Lighten -composite ^
-blur 0x0.5 out.png
3) Result image
https://drive.google.com/open?id=1obtnH ... hHFV0VsOCL
*********************CURRENT APPROACH*********************************
1) doonepage.bat
Code:
convert ^
-density 300 ^
%1 ^
-depth 8 ^
-strip ^
( +clone ^
-threshold 50%% ^
-write mpr:ORG ^
+delete ^
) ^
( mpr:ORG ^
-negate ^
-morphology Erode rectangle:200x1 ^
-mask mpr:ORG -morphology Dilate rectangle:200x1 ^
+mask ^
-morphology Dilate Disk:3 ^
) ^
-compose Lighten -composite ^
( +clone ^
-morphology HMT "1x4:1,0,0,1" ^
) ^
-compose Lighten -composite ^
( +clone ^
-morphology HMT "1x3:1,0,1" ^
) ^
-compose Lighten -composite ^
( +clone ^
-morphology HMT "3x1:1,0,1" ^
) ^
-compose Lighten -composite ^
-blur 0x0.5 %2
2) domanypages.bat
Code:
set INPDF=sam.pdf
for /F "usebackq" %%L in (`exiftool -args -PageCount %INPDF%`) do set %%L
set /A LASTPAGE=%-PageCount%-1
for /L %%I in (0,1,%LASTPAGE%) do call DoOnePage %INPDF%[%%I] out_%%I.png
3)Result
https://drive.google.com/open?id=1toqjB ... 5pItrdMeRi
*********************TWEAKED APPROACH*********************************
1) doonepagepre.bat
Code:
convert -density 300 %1 -depth 8 -strip -background white -alpha off -threshold 70%% %2
2)doonepage.bat
Code:
convert ^
%1 ^
-strip ^
( +clone ^
-threshold 50%% ^
-write mpr:ORG ^
+delete ^
) ^
( mpr:ORG ^
-negate ^
-morphology Erode rectangle:200x1 ^
-mask mpr:ORG -morphology Dilate rectangle:200x1 ^
+mask ^
-morphology Dilate Disk:3 ^
) ^
-compose Lighten -composite ^
( +clone ^
-morphology HMT "1x4:1,0,0,1" ^
) ^
-compose Lighten -composite ^
( +clone ^
-morphology HMT "1x3:1,0,1" ^
) ^
-compose Lighten -composite ^
( +clone ^
-morphology HMT "3x1:1,0,1" ^
) ^
-compose Lighten -composite ^
-blur 0x0.5 %2
3)domanypages.bat
Code:
set INPDF=sam.pdf
for /F "usebackq" %%L in (`exiftool -args -PageCount %INPDF%`) do set %%L
set /A LASTPAGE=%-PageCount%-1
for /L %%I in (0,1,%LASTPAGE%) do (
call DoOnePagePre %INPDF%[%%I] out_%%I.png
call DoOnePage out_%%I.png out_%%I.png
)
4) Result
The results are as crisp as with the earlier approach.