Scanning old books with textcleaner and tesseract
Posted: 2015-09-23T08:11:38-07:00
I'm trying to extract the text from some old books and put it online. My approach is to photograph each page with a mobile phone camera, clean the image with textcleaner, and run tesseract on it, but I'm getting very poor results. Much of the problem seems to depend on which device I take the photo with, but I can't see why.
First of all, I took a photo of a page with my Lumia 1020 (40+ MP camera) and got fairly good results: about 90% of the text was retrieved. The only problem was at the top and bottom of the page, where the text curves a bit.
I then tried my wife's basic Lumia and an iPad, and I can't retrieve any text at all from either of them. I read in the tesseract documentation that the text in the images should be at least 10 point at 300x300 ppi. I looked at all the images in GIMP (including the Lumia 1020 image) and they are all 72x72 ppi, so I can't understand why one works and the rest don't. I've tried to increase the ppi using the command
convert -units PixelsPerInch IMG_0016.JPG -resample 300 pic2.jpg
but all it seems to have done is choke my little notebook, which struggled to process the huge files it produced.
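To make sense of the file sizes, here is a rough sketch of the arithmetic as I understand it. -resample scales the pixel dimensions by the ratio of the new ppi to the tagged ppi, so going from 72 to 300 makes each side about 4.17 times larger. The two helper functions and the 2400x1800 image on a 6 inch wide page are made-up examples, not measurements from my photos.

```shell
#!/bin/sh
# Hypothetical helper: pixel size after `-resample NEW` on an image tagged OLD ppi.
# The pixel dimensions are scaled by NEW/OLD, rounded to the nearest pixel.
resampled_size() {
  # $1 = width in px, $2 = height in px, $3 = tagged ppi, $4 = target ppi
  awk -v w="$1" -v h="$2" -v old="$3" -v new="$4" \
    'BEGIN { printf "%dx%d\n", w * new / old + 0.5, h * new / old + 0.5 }'
}

# Hypothetical helper: pixels actually covering each inch of the printed page,
# which is independent of whatever ppi tag the camera wrote into the file.
effective_ppi() {
  # $1 = pixels across the page, $2 = physical page width in inches
  awk -v px="$1" -v inches="$2" 'BEGIN { printf "%d\n", px / inches + 0.5 }'
}

# Made-up numbers for illustration only.
resampled_size 2400 1800 72 300   # prints 10000x7500
effective_ppi 2400 6              # prints 400
```

If that is right, the second figure (pixels across the page divided by the page width in inches) may be a more useful thing to compare between the three devices than the 72 ppi tag that GIMP reports.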
I've posted a sample image taken on the iPad on Dropbox.
https://www.dropbox.com/s/87hxua7yx096c ... 6.JPG?dl=0
Here is the link to my Lumia 1020 image that gives good results but is a lot smaller than the one above.
https://www.dropbox.com/s/mt8460lvcf2y5 ... o.jpg?dl=0
Am I expecting too much in trying to extract the text from images like this? I am a newbie at this, so any advice would be appreciated.
Thanks
Shaun