WatchOCR - A Linux Bootable OCR System
Sometimes serendipity steps in just as you think you're going to have to re-invent the wheel or (shudder) pay for some proprietary software and run it on Windoze.
This happened late last week as I looked around for ways of converting the publishing PDF of a huge reference book (made using Quark) into something I could put up online as a reference work with good search engine coverage and potential for crowd-sourced editing and extension.
The project is about to be announced as an adjunct to the work being done by Hancock Wildlife Foundation, an organization I've been part of in both technical and managerial roles since its founding in 2006.
The book Raptor Research and Management Techniques is a compendium of papers that deal with all manner of the life management of raptors: eagles, hawks, falcons, ospreys, owls, vultures, etc. - birds of prey.
The problem I ran into is that once the manuscript was in Quark (where all the final editing had been done), the only way to get it out in any "portable" fashion is via conversion to PDF - just one of those things you find when you deal with proprietary software it seems; nothing reads Quark files because they are not documented and the format is protected jealously by the company.
OK - so I have a PDF. I also have tools that can take one apart and do other "interesting things" like convert to Postscript (pdf2ps) which can then be converted to ASCII (ps2ascii) and all the stuff in the PDF Toolkit (pdftk) but... none of them work properly with the format of this book with its two columns and lots of diagrams, etc.
What I needed was some method of running OCR on the file and getting things out that way. I've seen some impressive facilities that accompany scanners and such - and of course only run on their output it seems.
I and many others have been looking for some open source facility that would do the trick. One that appeared to be "interesting" was Cuneiform, but the information on this Russian software is sparse in English. I had downloaded the source and had a short session trying to get it to run on my Fedora Core 11 box but was missing some libraries, and the docs are for Debian systems with "apt-get" instead of "yum" and obviously different package names, so I had put it aside for the time being.
The alternative was a Windows binary already configured - low on my list for now but a fall-back.
And lo and behold, along comes a note about a bootable Linux disk with all the things necessary:
Its web page contains a link to Cuneiform so maybe the problem is resolved.
Read on to find out how the system performed, where the docs differ from reality, and how to make the system at least a bit better and do some other interesting things such as creating HTML output from your PDFs.
This bootable disk image is based on Knoppix, a distro I've played with in the past but not one I use; I'm a Red Hat/Fedora user. That being said, the actual interface is not that much to worry about since the disk is single-function oriented. Only when I had to get "under the hood" to fix some things did it become obvious that things were in different places - but I muddled on.
I downloaded and burned the CD image onto a DVD (no CDs in this house it seems - everything was packed to move) and tried to boot it on a PC sitting on my floor here. I have a 4-way KVM switch hooked to my system here because I have 2 main systems and sometimes have other machines in place as I configure them for custom uses and customers. I've had a Compaq Presario sitting on my floor and hooked in for a couple of months now, since it was proposed as a machine for an application and then the application was dropped.
This model SR1734x has an AMD Athlon 3200+ processor and a couple of Gigs of RAM so should have done the trick - but no GUI came up after a reasonable time. These things happen. This system uses an ATI video chip.
I tried on another machine, this time an HP Pavilion A1430n - with an Athlon 3800+ processor and again a couple of Gigs of RAM. It booted and ran fine. This machine uses an Nvidia chipset for video - maybe this is the difference that helped.
The SETUP section of the minimal documentation on the web site says the next thing to do is to open up the web browser on the screen and put "localhost" in the address bar. The "Web Browser" icon on the bottom bar led me to bring up "Iceweasel" which is a re-branded Mozilla browser with modifications by the Debian Project.
With the browser up I looked for Google.com and discovered I'd forgotten to plug in the network. Plugged it in and the network icon in the lower right of the screen told me it had configured and connected fine.
Putting "localhost" into the address bar brought up a minimal screen pre-filled with the defaults for the "scanin" and "scanout" directories and buttons to start/stop the program.
At this point the web site says you can install the software to the computer's hard disk but for now I'll skip that step so the next step was getting it talking to my network.
This involves the Linux command line. The "terminal emulator" icon is to the left of the browser icon and brings up a command line prompt as the "knoppix" user. It turns out that this user can't mount the file share, so you have to issue "su -" to get root status.
The "recommended" (only documented) way to connect your system to this one is via a SMB share. You set up a share on your network and then mount it using the command:
mount -t smbfs -o ip=XXX.XXX.XXX.XXX, workgroup=MSHOME, username=user, password=pass, uid=www-data, gid=www-data, dir_mode=0775, file_mode=0775, rw //ServerName/ShareName /home/knoppix/watchocr/
As it turned out, this didn't work for me for a couple of reasons. The first is that I had a Windows 7 Professional box on my network to run some real-time video mixing software I was testing. It had decided that it was the master of the SMB network and was being anal about talking to, or even admitting the existence of, anything else on the network, including the several shares on my main Linux box that I sometimes connect to with my (admittedly Windoze) laptop.
A quick trip downstairs to where the Windows 7 box was - and telling it to shut down ("don't turn off the power until I've finished installing these 14 updates") and patiently waiting for it to get off the network; time for a coffee.
Once the nasty master was off, it was time to try the above incantation again - nope, got a dump of the syntax of the "mount" command and nothing much that told me why it didn't work. I know the IP, the username and password are correct so something else is screwy.
Tried the bare mount "mount -t smbfs sharename /home/knoppix/watchocr" and got a different message so something in the options was not working out.
The "xsmbrowser" program offered as a means to discover the share names didn't seem to want to find mine from my Linux box (didn't even find the Windows 7 box either but I don't think that machine has any). Maybe it works fine with a "real" Windows share.
The final outcome was that the system wanted a "-o" for each option and no commas between them so ended up:
mount -t smbfs -o username=xxx -o password=yyy -o workgroup=WORKGROUP -o uid=www-data -o gid=www-data -o dir_mode=0775 //192.168.100.1/scan /home/knoppix/watchocr/
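For what it's worth, mount conventionally also takes all its options after a single "-o" as one comma-separated list, provided there are no spaces after the commas. The spaces in the documented command are a plausible cause of the usage dump, though that is an assumption and was not verified on this image. A minimal sketch that builds such a list and prints the command rather than running it (join_mount_opts is a made-up helper name; credentials, address and share name are placeholders):

```shell
#!/bin/sh
# Join mount options into one comma-separated string with no spaces,
# the form mount conventionally expects after a single -o.
join_mount_opts() {
    joined=""
    for opt in "$@"; do
        if [ -z "$joined" ]; then
            joined="$opt"
        else
            joined="$joined,$opt"
        fi
    done
    printf '%s\n' "$joined"
}

# Print (do not run) the command so it can be checked by eye first.
# username, password, IP and share name are placeholders.
opts=$(join_mount_opts username=xxx password=yyy workgroup=WORKGROUP \
    uid=www-data gid=www-data dir_mode=0775 file_mode=0775 rw)
echo "mount -t smbfs -o $opts //192.168.100.1/scan /home/knoppix/watchocr/"
```

If the printed command looks right, it can be pasted into the root shell as-is.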
There is another way to mount a file system - nfs (the "right" way IMHO):
mount -t nfs -o nolock 192.168.100.1:mountpoint /home/knoppix/watchocr
This will work "out of the box" if you have a file system exported already on your machine. If you want to use the nfslockd then you'll have to start the "statd" program locally before the mount - and then remove the "-o nolock".
By the way - I'm using the IP addresses in all these commands but in fact on my system I could use the names since when the watchocr system picked up a DHCP address it also configured the DNS correctly and my local names resolve correctly.
I used the web interface to start the server's processing.
This got me to the point where I could try out the actual conversion.
The first thing that happened when I copied (I already have the pdfs, so in this case I'm not scanning to the source directory, just copying) was that the files simply disappeared and nothing else seemed to happen. Since there is really no documentation on what is supposed to happen, this was a bit mystifying.
It turned out that I had restarted my SMB share on the Linux box after I'd mounted it on the watchocr system, and that had caused problems. This in turn showed up a major (IMHO) inadequacy of the system: at start-up (and on every cycle) it removes all the files in the TMP directory, including any that have not yet been processed. I commented out the "rm -rf $PREFIX" line near the bottom of the script, just above the "sleep 5". The correct running of the script should cycle through all the files in this TMP directory and move the results to the scanout directory with no problems, regardless of whether it has been stopped and restarted in the middle of a run as I had to do. I admit that the intended purpose of just doing scans as they show up should not normally suffer from this problem, but "fail safe" is the way to do things and this would have lost data if the machine had restarted in the middle of a job.
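A gentler alternative to deleting the line outright would be to clean up only when there is nothing left to lose. The following is a minimal sketch, not the script's actual code: cleanup_workdir is a hypothetical function name, and the argument stands in for whatever $PREFIX points at in the real script.

```shell
#!/bin/sh
# Remove the working directory only when nothing is left in it.
# rmdir refuses to delete a non-empty directory, so unprocessed
# files survive a stop/restart in the middle of a run.
cleanup_workdir() {
    dir=$1
    [ -d "$dir" ] || return 0          # nothing to clean up
    if rmdir "$dir" 2>/dev/null; then
        echo "removed empty $dir"
    else
        echo "kept $dir (still has files in it)"
    fi
}
```

Calling cleanup_workdir "$PREFIX" where the "rm -rf $PREFIX" line used to be would keep the tidy-up behaviour for finished jobs while leaving half-done ones in place.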
After digging through the system I found the actual program that sits and loops, watching for the incoming files (5 second loop): /usr/bin/watchocr - started/stopped fairly brutally by programs in /var/www
The first thing the program does in each loop is make a "tmp" directory on the share and move any PDFs into it. It then cycles through each of them in turn, running pdfinfo to get the page count and then, for each page, splitting out the individual page as a TIFF with ghostscript (gs -dNOPAUSE -r300 -dBATCH -dFirstPage=$page -dLastPage=$page -sDEVICE=tiff24nc -sOutputFile=$letter.$page $letter) and rewriting it as a PDF with img2pdf. Finally it uses ghostscript again to package up the "searchable" PDF.
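The page-splitting step described above can be sketched roughly as follows. This is a reconstruction from the description, not a copy of /usr/bin/watchocr: the function names page_count and split_pages are invented here, and the loop structure is an assumption. Only the gs arguments and the $letter/$page naming come from the script.

```shell
#!/bin/sh
# Extract the page count from pdfinfo output read on stdin.
# pdfinfo prints a line of the form "Pages:          12".
page_count() {
    awk '/^Pages:/ { print $2; exit }'
}

# Split one PDF into per-page TIFFs using the gs call quoted above.
# $1 is the PDF; outputs are named <pdf>.<page> as in the script.
split_pages() {
    letter=$1
    pages=$(pdfinfo "$letter" | page_count)
    [ -n "$pages" ] || return 1        # pdfinfo gave no page count
    page=1
    while [ "$page" -le "$pages" ]; do
        gs -dNOPAUSE -r300 -dBATCH \
            -dFirstPage="$page" -dLastPage="$page" \
            -sDEVICE=tiff24nc -sOutputFile="$letter.$page" "$letter"
        page=$((page + 1))
    done
}
```

Running split_pages book.pdf would leave book.pdf.1, book.pdf.2 and so on next to the original, each a 300 dpi TIFF of one page.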
It then checks to ensure it won't clobber an existing file in the scanout directory and moves the resulting file there, either with its original name or with a timestamp prepended.
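That "don't clobber" check is easy to picture as a small function. A minimal sketch, with the caveat that move_no_clobber is a hypothetical name and the timestamp format is an assumption (the real script's format was not recorded here):

```shell
#!/bin/sh
# Move a finished file into the output directory without overwriting.
# If the name is already taken, prepend a timestamp to the new copy.
move_no_clobber() {
    src=$1
    destdir=$2
    name=$(basename "$src")
    if [ -e "$destdir/$name" ]; then
        # Assumed format; two collisions within the same second
        # would still overwrite each other.
        name="$(date +%Y%m%d-%H%M%S)-$name"
    fi
    mv "$src" "$destdir/$name"
}
```

The existing file always keeps its plain name; it is the newcomer that gets the timestamp.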
OK - but...
All in all this work seems to be kind of useful if you only have need of a searchable PDF - but the real reason I got this disk was that it purports to have the Cuneiform OCR software on it too, and what I need for this project is something I can put into a HTML file.
It turns out that Cuneiform is on the system in /usr/local/bin. The script above does not appear to use it, but with a bit of modification I think we can do something even more useful - real OCR output, maybe in a number of formats too.
The biggest problem with Cuneiform is that it is written and supported in Russia and I have yet to find any definitive documents on how to make it work. Ah well, "strings" to the rescue.
While I was at it, I finally decided that switching to the console with the KVM switch was just plain time-consuming, so I went looking for either telnet or ssh to let me bring up remote terminal sessions on my normal desktop. Long-time readers will know that I have 4 monitors on my system so lots of screen real estate, but the KVM only switches the left-most monitor to the other systems.
Starting sshd involved creating server key-pairs in /etc/sshd:
ssh-keygen -q -t dsa -f /etc/sshd/ssh_host_dsa_key -C '' -N ''
ssh-keygen -q -t rsa -f /etc/sshd/ssh_host_rsa_key -C '' -N ''
then, by copying it to the share, transferring my public key to /root/.ssh/authorized_keys2
and then simply running "/usr/sbin/sshd &" - now I can log on via ssh and open a terminal session to the command line. IMHO this is something that should be done by default these days on such bootable systems - set up the server keys and run a sshd daemon - especially if there is little or no real security anyway as with this "single-function server" image. Yes, this can be done after the image has been pushed to a real hard drive, but it can be done by default too.
Back to the show... so running "strings" on the cuneiform binary reveals:
Supported languages:
Supported formats:
Could not open file
is not a BMP file.
BMP is not of type "Windows V3", which is the only supported format.
Please convert your BMP to uncompressed V3 format and try again.
is a compressed BMP. Only uncompressed BMP files are supported.
Please convert your BMP to uncompressed V3 format and try again.
cuneiform-out.
Cuneiform for Linux
0.8.0
Unknown language
Unknown format
Missing output file name.
--dotmatrix
--fax
--singlecolumn
html
buginprogram
Usage:
[-l languagename -f format --dotmatrix --fax -o result_file] imagefile
PUMA_Init failed.
none.txt
PUMA_Xopen failed.
PUMA_XFinalrecognition failed.
PUMA_XSave failed.
PUMA_XClose failed.
PUMA_Done failed.
ruseng
HTML format
hocr
hOCR HTML format
native
Cuneiform 2000 format
RTF format
smarttext
plain text with TeX paragraphs
text
plain text
A little bit of thought says that most of the last few lines pair each "format" name with its description, giving something like:
-f hocr (or maybe html) -> hOCR HTML format
-f native -> Cuneiform 2000 format
-f smarttext -> plain text with TeX paragraphs
-f ??? (rtf???) -> RTF (rich text) format
-f text -> plain text
It also appears that the program only works on bitmap files (.bmp).
OK - so let's try just a basic format. The first thing is to modify the watchocr script to both preserve the TIFF file and to convert it to bitmap. So right after the first "gs -dNOPAUSE..." command we'll insert the following:
# now we have a TIFF with the PDF in it
# let's now create a BMP file for cuneiform to work on
mogrify -format bmp $letter.$page
mv $letter.bmp $letter.$page.bmp
This takes advantage of the fact that this bootable disk has ImageMagick on it - mogrify is one of the commands it provides.
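Should cuneiform reject a bitmap with the 'not of type "Windows V3"' message seen in the strings output, ImageMagick's convert can be asked for that variant explicitly with a "bmp3:" prefix on the output name. Naming the output directly also does away with the follow-up mv. This is a sketch under the assumption that the ImageMagick on the disk supports the BMP3 output format; tiff_to_bmp is a made-up name.

```shell
#!/bin/sh
# Convert one page image to a Windows V3 bitmap.
# The "bmp3:" prefix selects ImageMagick's BMP3 writer regardless of
# the file extension; the input type is detected from its contents.
tiff_to_bmp() {
    src=$1
    convert "$src" "bmp3:$src.bmp"
}
```

In the script this would be called as tiff_to_bmp "$letter.$page", producing $letter.$page.bmp in one step.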
Now we have the bitmap, let's run cuneiform on it with no format flag and see what we get:
cuneiform $letter.$page.bmp -o $letter.$page.txt
The script doesn't know what to do with the bitmap and text files, so we have to add a line to put these into the scanout directory (I put it after the img2pdf... command so we get all the temporary files):
cp $letter.* $2/
The rest of the script can run happily on its way as far as I'm concerned - at least for now. There is one other mod that I favor, that of capturing the original scan file instead of just overwriting it with the newly created PDF - with today's disk sizes there is no excuse to throw away good (and potentially better) data, just in case you find something wrong with your new PDF processor.
Now comes the test
OK - so I've got the script creating not only the "searchable" PDF but also both a TIFF and BMP version of the image of it. I've dumped 464 pages into the scanin directory and the system is munching happily on them. Took about 4.5 hours to do them all.
At the end, I created a copy of the BMP files in a separate directory and a HTML directory under it - and proceeded to craft a script that would hopefully turn these into HTML pages using cuneiform:
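#!/bin/sh
# convert every BMP in the current directory to HTML,
# skipping any page that has already been done
mkdir -p HTML
for file in *bmp
do
echo "$file"
if [ ! -e "HTML/$file.html" ]
then
cuneiform -f html -o "HTML/$file.html" "$file"
fi
done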
Straightforward, simple and must be run from the directory containing the BMP files - but that's ok, it illustrates the use of cuneiform to get HTML files out. It runs about 7-8 pages/minute on this system which is not too bad. Graphics are dumped into a subdirectory with the base document's name, and linked into the HTML file to show at the top. If I decide this system is worth having around I'll probably add this (and maybe the RTF) line to the main process and get all manner of outputs out at the same time the system is processing the original scans.
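Getting several outputs in one pass might look something like the sketch below. It is an illustration only: ocr_all is an invented name, the per-format directory layout is an arbitrary choice, and "text" as a format name is inferred from the strings output rather than tested. RTF is left out because its flag name is still unknown.

```shell
#!/bin/sh
# Run cuneiform once per output format for every BMP in the current
# directory, skipping outputs that already exist so a stopped job
# can be resumed. "html" is known to work; "text" is a guess.
ocr_all() {
    for file in *.bmp; do
        [ -e "$file" ] || continue     # no BMPs: the glob stays literal
        for fmt in html text; do
            case $fmt in
                text) ext=txt ;;
                *)    ext=$fmt ;;
            esac
            mkdir -p "$fmt"
            out="$fmt/$file.$ext"
            [ -e "$out" ] || cuneiform -f "$fmt" -o "$out" "$file"
        done
    done
}
```

Like the script above, this has to be run from the directory holding the BMP files; it leaves the results in html/ and text/ subdirectories.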
It also appears that the format (-f) flags "html" and "hocr" are equivalent. The output from using either is identical.
There are definitely some OCR misbehaviors in the files - but for this particular purpose the output is more than good enough, and over time these output files may be edited by website members to include fixes, extensions and formatting changes. Many hands make light work so to speak.
Next, I have to take all these files and stuff them down the throat of our content management system (CMS) - GLfusion. At this point I'm not sure where in the system I'll put the files, maybe even a couple of places (static pages and wiki maybe) but that's for another article.
richard
