Scanning old books with textcleaner and tesseract

shaunc
Posts: 1
Joined: 2015-09-23T02:52:15-07:00

Post by shaunc »

I'm looking to extract the text from some old books and put it online. I'm photographing the pages with a mobile phone camera and then running textcleaner and tesseract, but I'm getting very poor results. A lot of it seems to depend on the device I take the photo with, but I can't see why.

First of all, I took a photo of a page with my Lumia 1020 (40+ MP camera) and got fairly good results: about 90% of the text was retrieved. The only problem was at the top and bottom of the page, where the text curves a bit.

I then tried my wife's basic Lumia and an iPad and can't get either of them to retrieve any text at all. I read in the tesseract documentation that the text should be at least 10 point at 300x300 ppi. I looked at all the images in GIMP (including the Lumia 1020 image) and they are all 72x72 ppi, so I can't understand why one works and the rest don't. I tried to increase the ppi using the command

convert -units PixelsPerInch IMG_0016.JPG -resample 300 pic2.jpg

but it just seems to have choked my little notebook, which struggled to process the huge files.
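
A rough sketch of why that command produces such huge files: -resample keeps the physical size and changes the pixel count, so going from 72 to 300 ppi multiplies each side by 300/72. The 2592x1936 starting size below is only an assumed example, not the measured size of the photos posted here, and resample_size is a made-up helper name.

```python
def resample_size(width, height, old_ppi, new_ppi):
    """Pixel dimensions after resampling from old_ppi to new_ppi
    while keeping the same physical (inch) size."""
    scale = new_ppi / old_ppi
    return round(width * scale), round(height * scale)


# Assumed example: a 5-megapixel photo tagged as 72 ppi.
w, h = 2592, 1936
new_w, new_h = resample_size(w, h, 72, 300)

# Ratio of pixel counts, i.e. how much more data there is to process.
growth = (new_w * new_h) / (w * h)
print(new_w, new_h, round(growth, 1))
```

Each side grows by a bit more than four times, so the pixel count grows by roughly seventeen times, which would explain a small notebook struggling.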

I've posted a sample image taken on the iPad to Dropbox.

https://www.dropbox.com/s/87hxua7yx096c ... 6.JPG?dl=0

Here is the link to my Lumia 1020 image that gives good results but is a lot smaller than the one above.

https://www.dropbox.com/s/mt8460lvcf2y5 ... o.jpg?dl=0

Am I expecting too much to extract the text from images like this? I am a newbie at this so any advice would be appreciated.

Thanks
Shaun
snibgo
Posts: 12159
Joined: 2010-01-23T23:01:33-07:00
Location: England, UK

Re: Scanning old books with textcleaner and tesseract

Post by snibgo »

Point size and pixels per inch are meaningless in this context. The important measurement is the height of the letters, in pixels. In my limited experience, Tesseract needs a minimum of 10 pixels in the height of characters such as "Th", and 20 pixels is better. Your first image has this, so that is good.
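
As a small worked example of that rule of thumb: measure the height of a tall letter in pixels, and if it is under the target, work out the percentage enlargement needed to reach it. The 20-pixel target comes from the figures above; the helper name scale_percent_for_ocr and the sample heights are assumptions for illustration, and whether enlarging actually improves recognition depends on the image.

```python
import math


def scale_percent_for_ocr(letter_height_px, target_px=20):
    """Percentage to enlarge an image so letters reach target_px high.
    Returns 100 (no change) when the letters are already tall enough."""
    if letter_height_px <= 0:
        raise ValueError("letter height must be positive")
    if letter_height_px >= target_px:
        return 100
    return math.ceil(100 * target_px / letter_height_px)


# Sample measured letter heights in pixels (made-up values).
for height in (8, 12, 25):
    print(height, scale_percent_for_ocr(height))
```

The result could be passed to ImageMagick's -resize as a percentage.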

The camera wasn't perpendicular to the page, so you need a perspective transformation. Windows BAT syntax:

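rem Input photo.
set SRC=IMG_0016.jpg

rem Four control-point pairs, each written as
rem   source x,y, destination x,y
rem in the coordinates of the image after the -90 rotation.
set PERSP=^
496,664,400,500,^
1604,500,1748,500,^
400,2260,400,2268,^
1748,2268,1748,2268

rem %IM% is assumed to be a variable holding the path to the
rem ImageMagick directory, ending in a backslash.
%IM%convert ^
  %SRC% ^
  -rotate -90 ^
  -distort perspective "%PERSP%" ^
  t.png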
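
For anyone wondering what the sixteen numbers in PERSP do: -distort perspective reads them as four pairs of points, each a source x,y followed by the destination x,y where that point should land. The sketch below is not ImageMagick's code, just a plain-Python illustration of the standard four-point projective mapping using the same numbers. The function names are made up for the example.

```python
def solve_linear(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def perspective_coeffs(pairs):
    """Eight coefficients a..h of the mapping
    u = (a*x + b*y + c) / (g*x + h*y + 1)
    v = (d*x + e*y + f) / (g*x + h*y + 1)
    fitted to four (source, destination) point pairs."""
    rows, rhs = [], []
    for (x, y), (u, v) in pairs:
        rows.append([x, y, 1, 0, 0, 0, -x * u, -y * u])
        rhs.append(u)
        rows.append([0, 0, 0, x, y, 1, -x * v, -y * v])
        rhs.append(v)
    return solve_linear(rows, rhs)


def apply_perspective(k, x, y):
    """Map one source point through the fitted coefficients."""
    a, b, c, d, e, f, g, h = k
    w = g * x + h * y + 1
    return (a * x + b * y + c) / w, (d * x + e * y + f) / w


# The same numbers as PERSP, grouped as (source), (destination).
PAIRS = [
    ((496, 664), (400, 500)),
    ((1604, 500), (1748, 500)),
    ((400, 2260), (400, 2268)),
    ((1748, 2268), (1748, 2268)),
]

COEFFS = perspective_coeffs(PAIRS)
for src, dst in PAIRS:
    print(src, "->", apply_perspective(COEFFS, *src), "wanted", dst)
```

The source points are the skewed corners of the text block in the rotated photo and the destination points form an upright rectangle, which is how the distortion squares up the page.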
snibgo's IM pages: im.snibgo.com