Sometimes serendipity steps in just as you think you're going to have to re-invent the wheel or (shudder) pay for some proprietary software and run it on Windoze.
This happened late last week as I looked around for ways of converting the publishing PDF of a huge reference book (made using Quark) into something I could put up online as a reference work with good search engine coverage and potential for crowd-sourced editing and extension.
The project is about to be announced as an adjunct to the work being done by Hancock Wildlife Foundation, an organization I've been part of in both technical and managerial roles since its founding in 2006.
The book Raptor Research and Management Techniques is a compendium of papers that deal with all manner of the life management of raptors: eagles, hawks, falcons, ospreys, owls, vultures, etc. - in short, birds of prey.
The problem I ran into is that once the manuscript was in Quark (where all the final editing had been done), the only way to get it out in any "portable" fashion was via conversion to PDF. It's just one of those things you find when you deal with proprietary software, it seems: nothing else reads Quark files because the format is undocumented and jealously protected by the company.
OK - so I have a PDF. I also have tools that can take one apart and do other "interesting things": pdf2ps converts it to PostScript, which ps2ascii can then convert to ASCII, and there's all the stuff in the PDF Toolkit (pdftk). But none of them work properly with the format of this book, with its two columns and lots of diagrams.
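For anyone who wants to try that route on a simpler, single-column PDF, here is a rough sketch of the pdf2ps-then-ps2ascii chain wrapped in Python. The function names and the file name book.pdf are made up for illustration, and it assumes the Ghostscript utilities pdf2ps and ps2ascii are installed and on your PATH. This is the approach that did not cope with this book's layout, so treat it as a baseline rather than a fix.

```python
import shutil
import subprocess
from pathlib import Path


def build_text_extraction_commands(pdf_path):
    """Return the two commands for the pdf2ps -> ps2ascii route.

    The output names are derived from the input name (an assumption
    made for this sketch): book.pdf -> book.ps -> book.txt.
    """
    pdf = Path(pdf_path)
    ps = pdf.with_suffix(".ps")
    txt = pdf.with_suffix(".txt")
    return [
        ["pdf2ps", str(pdf), str(ps)],    # PDF to PostScript
        ["ps2ascii", str(ps), str(txt)],  # PostScript to plain text
    ]


def extract_text(pdf_path):
    """Run both steps and return the path of the resulting text file."""
    commands = build_text_extraction_commands(pdf_path)
    # Fail early with a readable message if a tool is not installed.
    missing = [cmd[0] for cmd in commands if shutil.which(cmd[0]) is None]
    if missing:
        raise RuntimeError("missing tools: " + ", ".join(missing))
    for cmd in commands:
        subprocess.run(cmd, check=True)
    return Path(commands[-1][-1])


if __name__ == "__main__":
    # "book.pdf" is a hypothetical file name; only print the plan here.
    for cmd in build_text_extraction_commands("book.pdf"):
        print(" ".join(cmd))
```

Calling extract_text("book.pdf") would run both steps for real. On a two-column page, expect the text from the two columns to come out interleaved, which is exactly the problem described here.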
What I needed was some method of running OCR on the file and getting things out that way. I've seen some impressive facilities that accompany scanners and such, though of course they only seem to run on their own scanner's output.
I and many others have been looking for some open source facility that would do the trick. One that appeared to be "interesting" was Cuneiform, but the information on this Russian software is sparse in English. I had downloaded the source and had a short session trying to get it to run on my Fedora Core 11 box, but I was missing some libraries, and the docs are written for Debian systems that use "apt-get" instead of "yum" and obviously have different package names. So I had put it aside for the time being.
The alternative was a Windows binary already configured - low on my list for now but a fall-back.
And lo and behold, along comes a note about a bootable Linux disk with all the things necessary:
Its web page contains a link to Cuneiform, so maybe the problem is resolved.
Read on to find out how the system performed, where the docs differ from reality, and how to make the system at least a bit better and do some other interesting things such as create HTML output from your PDFs.



