Mon, 21 Jul 08

Open Source Optical Character Recognition (OCR) on the Mac

I recently received a letter in reply to an enquiry I made to South Eastern Trains. I wanted to make the text available (I don’t like reading text in images) but didn’t want to type it all out. Having scanned it in, I looked around for some OCR software for the Mac that would extract the text for me. After wading through a fair amount of search result noise, I found a reference to tesseract. Installing it from MacPorts was easy, but actually getting useful output was slightly trickier. I was trying to extract the text from this letter but kept getting garbled output. I tried a few things before stumbling across this page, which suggests making the image greyscale and ensuring that the file extension is .tif (not .tiff). I duly complied, using Seashore to convert the image to greyscale, and tried again. This time the text was extracted perfectly (even the addresses, which appear in two columns, came out correctly).
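For anyone who would rather script the two steps than click through an image editor, here is a rough sketch in Python. It is not what I actually did: I used Seashore for the greyscale conversion, whereas this assumes ImageMagick's convert command is installed alongside tesseract. The function names (tif_path, greyscale_command, tesseract_command, ocr) are made up for illustration, and it relies on tesseract being called as "tesseract image outputbase", which writes the text to outputbase.txt.

```python
import shutil
import subprocess
from pathlib import Path


def tif_path(image: str) -> Path:
    """Return the same path with a .tif extension (not .tiff)."""
    return Path(image).with_suffix(".tif")


def greyscale_command(source: str) -> list[str]:
    # Assumption: ImageMagick's `convert` does the greyscale step here;
    # the post itself used Seashore by hand.
    return ["convert", source, "-colorspace", "Gray", str(tif_path(source))]


def tesseract_command(source: str) -> list[str]:
    # tesseract takes the image and an output base name,
    # and writes the recognised text to <base>.txt.
    tif = tif_path(source)
    return ["tesseract", str(tif), str(tif.with_suffix(""))]


def ocr(source: str) -> str:
    """Convert the scan to a greyscale .tif, run tesseract, return the text."""
    for command in (greyscale_command(source), tesseract_command(source)):
        if shutil.which(command[0]) is None:
            raise RuntimeError(f"{command[0]} is not installed or not on PATH")
        subprocess.run(command, check=True)
    return tif_path(source).with_suffix(".txt").read_text()


if __name__ == "__main__":
    # "letter.png" is a placeholder name for the scanned image.
    print(ocr("letter.png"))
```

Building the commands separately from running them makes it easy to print them first and check they look right before letting them loose on a scan.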