who knows about OCR?

nomos

Administrator
Has anyone worked much with optical character recognition (OCR) software? I'm tempted to scan some of the books that I go to most often, especially the ones without indexes, with the idea of converting the images to text using OCR so that they can be turned into searchable PDF files.

But how good is the technology at this point? My understanding is that it can still be a bit shaky.

I noticed that there's at least one line of scanners on the market right now that's made for scanning books perfectly flat and without shadows. Would be good but not worthwhile if the text conversion i$ all g4rbl3d.
 

john eden

male pale and stale
I did this a few years back for some of the articles on uncarved.org

I recall the process as being:

1) scan that sucker in as big as it can go
2) maximise the contrast
3) feed it into yer ocr prog
4) output as word file
5) remove formatting (i.e. unbold, un-underline etc)
6) spell check
7) proof read
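
A rough sketch of the spell check / proof read end of that list, assuming the OCR output has already been saved as plain text. The helper name flag_suspect_words and the rule it applies (a letter mixed with a digit or stray symbol) are made up for illustration and don't come from any OCR package. It only points at words worth a second look and won't catch misreads that happen to be real words.

```python
import re

# Hypothetical helper: the name and the rule are illustrative assumptions,
# not part of any OCR package.
# A token is "suspect" if it contains a letter plus a digit or stray symbol,
# which is a common shape for a misread word.
TOKEN = re.compile(r"\S+")
EDGE_PUNCT = ".,;:!?\"'()[]"


def flag_suspect_words(text):
    """Return (line_number, token) pairs worth checking by eye."""
    suspects = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in TOKEN.finditer(line):
            # Drop ordinary punctuation at the edges of the token.
            token = match.group().strip(EDGE_PUNCT)
            has_letter = any(ch.isalpha() for ch in token)
            # Hyphens and apostrophes inside a word are treated as normal.
            has_noise = any(not ch.isalpha() and ch not in "-'" for ch in token)
            if has_letter and has_noise:
                suspects.append((line_no, token))
    return suspects


sample = "Would be good but not worthwhile\nif the text conversion i$ all g4rbl3d."
for line_no, token in flag_suspect_words(sample):
    print(f"line {line_no}: {token}")
```

Run on the sample it flags "i$" and "g4rbl3d" on line 2, so the proof read can start with those rather than the whole page.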

it's actually not that bad if you have a bit of time and the file isn't too big.

What really knackered it up was overprinted text and weird fonts...
 

michael

Bring out the vacuum
Haven't done anything recently but used to do some OCR-related work for a law firm, where we'd get a bureau to scan evidence so that we could search it. That was up to about 2004 and the assumption at that time was that it would only be about 80% accurate.
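
For a sense of what a figure like 80% could mean in practice, here is a minimal sketch of one way to score a sample page against a hand-corrected copy. It assumes accuracy is counted per character using edit distance. That is a guess at the metric, since I don't know how the bureau measured it, and the function names are made up for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def char_accuracy(reference, ocr_text):
    """Share of reference characters the OCR got right, floored at 0."""
    if not reference:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(reference, ocr_text)
    return max(0, len(reference) - errors) / len(reference)


# Two wrong characters out of ten.
print(char_accuracy("searchable", "s3archab1e"))  # prints 0.8
```

If the errors were spread evenly and independently (a simplification), 80% per character would leave only about one five-letter word in three coming through clean, which is why the clean-up pass matters so much.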

The kind of things we were scanning involved hand-written annotations and so on, which of course were a complete write-off. I think if you were scanning a book and were willing to tweak the file things would be better, but I'd still expect to follow John's process and end up at least browsing the OCR file.

A friend who uses computers a fair amount but is not an IT head is having a go at using OCR on a bunch of old books for something related to his PhD and finding it a mix of painful and accidentally hilarious. He is using university equipment, so it wouldn't surprise me if it's less than cutting edge.

So I guess you need to weigh up the potential pain in the arse of getting the OCR file up to scratch vs. the potential benefits of having the texts searchable. If you're just idly musing about it, or if the texts are really big and involve any kind of weird formatting, I'd suggest against it.
 

nomos

Administrator
thanks gents. i may give it a trial run if i decide it's worth the time and effort.

until then, amazon's "search inside this book" and books.google are amazingly handy. i'm always using them to find things in books that i have sitting right beside me.
 

Numbers

Well-known member
I use the OCR tool in Adobe Acrobat Pro 8 quite often and I have to say that it amazes me every time. I've never had any weird words or nonsense characters turn up. It works just perfectly.

Although I have to admit that until now I've only used it on PDF files from JSTOR, which are scans merged into a single PDF file (as far as I understand).
 