I'm trying to extract the text from some old books and put it online. I photograph the pages with a mobile phone camera, clean them up with textcleaner and run tesseract, but I'm getting very poor results. A lot of it seems to depend on the device I use to take the photo, but I can't see why.
First of all, I took a photo of a page with my Lumia 1020 (40+ MP camera) and got fairly good results: about 90% of the text was retrieved. The only problem was at the top and bottom of the page, where the text curves a bit.
I then tried my wife's basic Lumia and an iPad and can't get either of them to retrieve any text at all. I read in the tesseract documentation that the text should be at least 10 point at 300x300 ppi. I looked at all the images in GIMP (including the Lumia 1020 image) and they are all 72x72 ppi, so I can't understand why one works and the rest don't. I tried to increase the ppi using the command
convert -units PixelsPerInch IMG_0016.JPG -resample 300 pic2.jpg
but it just seems to have choked my little notebook, which struggled to process the huge files.
I've posted a sample image taken on the iPad on Dropbox.
https://www.dropbox.com/s/87hxua7yx096c ... 6.JPG?dl=0
Here is the link to my Lumia 1020 image, which gives good results but is a lot smaller than the one above.
https://www.dropbox.com/s/mt8460lvcf2y5 ... o.jpg?dl=0
Am I expecting too much in trying to extract the text from images like this? I am a newbie at this, so any advice would be appreciated.
Thanks
Shaun
Scanning old books with textcleaner and tesseract
Re: Scanning old books with textcleaner and tesseract
Point size and pixels per inch are meaningless in this context. The important measurement is the height of the letters, in pixels. In my limited experience, Tesseract needs a minimum of 10 pixels in the height of tall characters such as "T" and "h", and 20 pixels is better. Your first image has this, so that is good.
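To put rough numbers on this, the sketch below works out what `-resample 300` does to an image tagged as 72 ppi, and checks a measured letter height against the 10 and 20 pixel figures above. The function names are hypothetical and the 2592x1936 frame size is an assumed example, not a measurement of the images in this thread.

```python
# Illustrative arithmetic only. Function names are hypothetical, and the
# 2592x1936 frame is an assumed example size, not taken from the thread.

def resampled_size(width_px, height_px, src_ppi, dst_ppi):
    """Pixel dimensions after resampling to dst_ppi at the same physical size."""
    return (round(width_px * dst_ppi / src_ppi),
            round(height_px * dst_ppi / src_ppi))


def pixel_growth(src_ppi, dst_ppi):
    """Factor by which the total pixel count grows when resampling."""
    return (dst_ppi / src_ppi) ** 2


def letter_height_verdict(height_px, minimum=10, preferred=20):
    """Classify a measured letter height using the thresholds quoted above."""
    if height_px < minimum:
        return "too small"
    if height_px < preferred:
        return "usable"
    return "good"


if __name__ == "__main__":
    print(resampled_size(2592, 1936, 72, 300))
    print(pixel_growth(72, 300))
    print(letter_height_verdict(25))
```

Going from 72 to 300 ppi multiplies each dimension by about 4.17, so the pixel count grows roughly 17-fold. That would explain why a small notebook struggles with the result. Measuring the letter height in an image editor and enlarging only if it falls short of the threshold is likely to be cheaper.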
The camera wasn't perpendicular to the page, so you need a perspective transformation. Windows BAT syntax:
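set SRC=IMG_0016.jpg
rem Each group of four numbers is srcX,srcY,dstX,dstY, one group per page corner.
rem The coordinates refer to the image after the -90 degree rotation below.
set PERSP=^
496,664,400,500,^
1604,500,1748,500,^
400,2260,400,2268,^
1748,2268,1748,2268
rem %IM% is assumed to hold the ImageMagick directory (with trailing backslash), or to be unset if convert is on the PATH.
%IM%convert ^
%SRC% ^
-rotate -90 ^
-distort perspective "%PERSP%" ^
t.png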
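For anyone curious what the four control-point pairs mean, here is a minimal Python sketch that fits a projective transform to the same numbers and checks that each source corner lands on its destination. It is an illustration of the general technique, not ImageMagick's own code; the function names are hypothetical and only the coordinates come from the command above.

```python
# Sketch of a four-point perspective fit. Not ImageMagick's implementation;
# function names are hypothetical, the coordinates are from the BAT script.

def solve_linear(a, b):
    """Solve a*x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("control points are degenerate")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def perspective_coefficients(pairs):
    """Fit x' = (ax+by+c)/(gx+hy+1), y' = (dx+ey+f)/(gx+hy+1) to four pairs."""
    rows, rhs = [], []
    for sx, sy, dx, dy in pairs:
        rows.append([sx, sy, 1, 0, 0, 0, -sx * dx, -sy * dx])
        rhs.append(dx)
        rows.append([0, 0, 0, sx, sy, 1, -sx * dy, -sy * dy])
        rhs.append(dy)
    return solve_linear(rows, rhs)


def apply_perspective(coeffs, x, y):
    """Map a source point through the fitted transform."""
    a, b, c, d, e, f, g, h = coeffs
    w = g * x + h * y + 1
    return (a * x + b * y + c) / w, (d * x + e * y + f) / w


# srcX, srcY, dstX, dstY for each corner, as in the PERSP variable.
PAIRS = [
    (496, 664, 400, 500),
    (1604, 500, 1748, 500),
    (400, 2260, 400, 2268),
    (1748, 2268, 1748, 2268),
]

if __name__ == "__main__":
    coeffs = perspective_coefficients(PAIRS)
    for sx, sy, dx, dy in PAIRS:
        print((sx, sy), "->", apply_perspective(coeffs, sx, sy), "target", (dx, dy))
```

The source points are the corners of the text block as they appear in the rotated photo, and the destination points form the rectangle they should occupy. For a different photo, only the four source corners need to be measured again.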
snibgo's IM pages: im.snibgo.com