Optical Character Recognition, better known as OCR, is the process of converting images of handwritten or printed text into machine-editable text. After conversion, the text can be edited in any word processing software.
Pros
OCR is the fastest way of converting images of text into an editable form, and the output becomes accurate once it is cleaned up properly using the features offered by almost all professional tools on the market today. OCR is quite popular among academics, who often need part or all of their reference material converted to editable text so that they can reuse it in its entirety or with minor editing, instead of going through the pain of keying in the pages.
Cons
An OCR transcription generally contains some unwanted characters or symbols, and these multiply if the source document is not of excellent print clarity. Some of the common problems are:
- every page may have a different layout
- the document is broken into numerous sections, with each page becoming an individual section
- unwanted line breaks at the ends of lines within paragraphs
- unwanted spaces between paragraphs
- problems in character recognition: ‘i’ becomes ‘I’, ‘e’ becomes ‘c’, ‘rn’ becomes ‘m’, etc. This depends on the print quality, readability and typeface of the source document.
A word of caution
There is a chance of a line being dropped altogether here and there. This doesn’t happen every time, or even most of the time, but it does happen, rarely. I have personally experienced this a few times, so it is wise to be aware of the issue beforehand.
Cleaning up
A clean, distinctly readable printed document produces good results when put through the OCR process. This is because the software can read and recognize the characters reliably, so the reproduction is good and doesn’t need much cleaning up beyond removing the section breaks and occasional line breaks. But for an average or poor source document, one that is difficult to read or poorly printed, the transcription will be filled with a large number of stray marks and misread characters, which you need to clear using the OCR application. The more attention is paid to the clean-up process, the more accurate the result.
OCR applications generally ask you to open or import image files or PDFs containing printed text, set up the recognition areas with the text, table or image tools, and then run the process by pressing a Read button or link. The application then scans the image and returns the recognized text to the editing window. Next, it asks you to run the spell check. After the spell check, it allows the output to be exported to a word processing application. But truthfully, the task is not as easy as it looks.
OCR output must be cleaned of everything listed under the Cons section while it is still in the editing mode of the OCR application. Leaving these issues to be dealt with after exporting the text to the word processing software can cause serious trouble.
I will try to explain more with examples in my next post.