People are acquiring more and more digital images of the world and of documents, and these images often contain Roman letters. Web pages that rely on Flash, or that have other accessibility problems, increasingly present us with "pictures of letters". Such text cannot be copied and pasted into other documents, resized, or offered to Google Language Tools for translation. The free OCR software currently available offers poor recognition rates for realistic images.
RecognizeThis! will identify regions of text, deskew them, extract font metrics, isolate lines and words, recognize the text, and emit UTF-8. The initial goal is a character recognition rate above 90% for uppercase and lowercase letters. After that we can work on other characters, and later on the diacritics used by western European languages. The recognizer will be font independent and will prefer grayscale input of at least 200 dpi. Letters will be normalized to adjust for font size. Handling underlined and italicized text is not a current goal. The project will not tackle script, Hindi, other Asian languages, etc.; even recognizing printed handwriting is beyond the scope. It is within scope to achieve rates that are competitive with or better than those of gocr and ocrad. Some healthy competition is always a good thing.
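One way to make the "above 90%" goal measurable is to score recognizer output against known text. The sketch below is only an illustration, not the project's chosen metric: it assumes the rate is defined as one minus the Levenshtein edit distance divided by the length of the expected text, and the names edit_distance and recognition_rate are hypothetical.

```ruby
# Hypothetical helper: Levenshtein edit distance between two strings,
# counted in characters, using a single rolling row of the DP table.
def edit_distance(expected, actual)
  a = expected.chars
  b = actual.chars
  row = (0..b.length).to_a          # distance from "" to each prefix of b
  a.each_with_index do |ca, i|
    prev_diag = row[0]              # value diagonally up-left of the cell
    row[0] = i + 1
    b.each_with_index do |cb, j|
      cost = ca == cb ? 0 : 1
      current = [row[j + 1] + 1,    # drop a character of the expected text
                 row[j] + 1,        # extra character in the output
                 prev_diag + cost].min
      prev_diag = row[j + 1]
      row[j + 1] = current
    end
  end
  row[b.length]
end

# Assumed definition of the rate: 1 - errors / expected length,
# clamped so that very noisy output scores 0.0 rather than going negative.
def recognition_rate(expected, actual)
  return 1.0 if expected.empty? && actual.empty?
  return 0.0 if expected.empty?
  errors = edit_distance(expected, actual)
  [1.0 - errors.to_f / expected.length, 0.0].max
end
```

The same function could score the output of gocr or ocrad on identical images, which would give a like-for-like comparison. Whether insertions and deletions should be weighted the same as substitutions is a design choice left open here.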
Unit tests will round-trip known ASCII input through a Ghostscript renderer followed by the recognizer. Automated unit tests will ship with releases and should pass for all users unless specially marked. The standard Ruby tools rdoc and rcov will be used for internal documentation and for code coverage measurements. Please let us know if you would like to participate in this development effort!
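The rendering half of the round-trip tests described above might start from something like the following sketch. It is an assumption, not the project's actual test harness: the helper names (ps_escape, postscript_page, ghostscript_command), the Courier font, the 12-point size, and the page coordinates are all hypothetical. The Ghostscript options shown (-q, -dNOPAUSE, -dBATCH, -sDEVICE=pnggray, -r, -sOutputFile) are standard ones, and 200 dpi matches the preferred input resolution.

```ruby
# Hypothetical helper: escape the characters that are special inside
# a PostScript string literal (backslash and both parentheses).
def ps_escape(text)
  text.gsub(/[\\()]/) { |ch| "\\#{ch}" }
end

# Hypothetical helper: build a one-page PostScript program that prints
# the given ASCII lines, starting one inch from the left edge and
# one inch below the top of a US Letter page.
def postscript_page(lines, font: "Courier", size: 12)
  leading = size * 3 / 2            # line spacing in points
  body = lines.each_with_index.map do |line, i|
    "72 #{720 - i * leading} moveto (#{ps_escape(line)}) show"
  end
  ["%!PS",
   "/#{font} findfont #{size} scalefont setfont",
   *body,
   "showpage"].join("\n") + "\n"
end

# Hypothetical helper: argument list for rendering the page to a
# grayscale PNG; pass it to Kernel#system to actually run Ghostscript.
def ghostscript_command(ps_path, png_path, dpi: 200)
  ["gs", "-q", "-dNOPAUSE", "-dBATCH",
   "-sDEVICE=pnggray", "-r#{dpi}",
   "-sOutputFile=#{png_path}", ps_path]
end
```

A test could write the output of postscript_page to a file, run system(*ghostscript_command("in.ps", "out.png")), feed out.png to the recognizer, and compare the result with the original lines. Building the command as an array avoids shell quoting problems with file names.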