Appropriate Use of OCR for Translators and Project Managers

Optical Character Recognition (OCR) is a technology that converts non-editable images of text (e.g., a scanned document) into editable or “live” text (e.g., a Word file). This is useful in the translation process because it allows a variety of productivity-enhancing tools such as translation memories, term bases, and machine translation to be used.

Unfortunately, the formatting and segmentation of text created by OCR programs are often illogical. They can cause so many problems during translation, and further downstream during editing or desktop publishing, that many outsourcers have chosen to simply prohibit their translators from using OCR to convert source files for translation.

From a freelancer’s perspective, I have on several occasions been provided source files for translation by project managers that had been improperly processed and formatted with OCR. Such files create problems when processed through a CAT tool since the text is not segmented correctly (at natural line breaks, full stops, etc.), causing the tool to present a line or even a word at a time for translation rather than a sentence or coherent segment at a time as it is designed to do. In these cases, the CAT tool can actually become more of a burden than a help to the translator while reducing rates through repetition discounts. To add insult to injury, more often than not improperly “OCRed” files entail tremendous headaches during the proofreading stage (right at delivery time) because the illogical formatting elements created by the OCR program (section breaks, margins, tables, text boxes, fonts, etc.) behave in such an unpredictable way that the final translation product becomes difficult or even impossible to work with. I am sure I am not the only translator out there who is prepared to decline jobs, require a rate adjustment, or insist on starting from scratch when provided source files for translation that have been poorly processed with OCR!

Despite the fact that so many translators and project managers out there are using OCR inappropriately, there is such a thing as appropriate use of OCR. The problem is that the notion of waving the magic wand of technology over a pile of hard copy or scanned documents in order to create formatted “live text” files with just a click is so tempting that it is easy to lose sight of the fact that OCR requires a certain degree of human input in order to create natively formatted and properly segmented text. In my experience, the only way to leverage the benefits of OCR without compromising the quality of the final product is to take the text output from the OCR program while tossing out the formatting, and then recreate the formatting manually in Word or your text editor of choice.

The following is an example of a very simple procedure for doing this:

  1.  Extract plain text from the source file

This unformatted text will be your base for creating the editable source file. All OCR programs that I know of have a “plain text” or .txt option, but another way to do this is to start with any file created by the OCR program, “select all” (Ctrl + A), “copy” (Ctrl + C), and then “paste” (Ctrl + V) the text into a Notepad file. Then repeat this process to paste that text from the Notepad file into a blank Word file (or a file in the appropriate program for the deliverable) with default settings.

  2. Edit the text and recreate formatting manually

This is the point where there is a danger of the process breaking down simply because it is time consuming, but I have found that time spent preparing my source files for translation is well spent because it allows me to familiarize myself with the content and context of the file before I dive into the translation process. As you recreate the formatting, be sure to check the text for recognition errors (“0”s instead of “O”s, etc.) or missing bits. The standard Word proofreading tools should be helpful here.

  3. Proceed with normal translation workflow

Since the files have been formatted manually, the translation process can proceed as if the client had submitted a nice editable Word file from the beginning. The irony is that when OCR output is properly formatted (by hand), no one is likely to notice that OCR was ever used in the first place, and the productivity gains through the use of translation memories, term bases, and machine translation justify the time spent recreating formatting by hand. Most importantly, you can now rest assured that your final translation will be indistinguishable from a file natively authored in Word (assuming that you remembered to change all of those “0”s to “O”s) without the last-minute formatting nightmares!

Here are a few helpful tips for using OCR properly:

Don’t forget to select the proper language!

Since your OCR program needs to know what language it is “seeing”, be sure to select the proper language before the program reads the files. Otherwise you will end up doing a lot more editing than you need to, especially when it comes to accents and special characters!
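For readers who run OCR from a script rather than a desktop program, here is a minimal sketch of what selecting the language can look like. It assumes the open-source Tesseract command-line tool, which this post does not otherwise cover; the helper name `build_tesseract_command` and the file names are hypothetical. Tesseract takes its language through the `-l` option using codes such as `spa` or `eng`, joined with `+` when a document mixes languages.

```python
def build_tesseract_command(image_path, output_base, languages):
    """Build the argument list for a Tesseract run with explicit languages.

    Assumption: the Tesseract CLI is the OCR engine and the matching
    language data files are installed. `languages` is a list of
    Tesseract language codes, e.g. ["spa"] or ["spa", "eng"].
    """
    if not languages:
        # Refuse to fall back silently on the engine's default language
        raise ValueError("at least one language code is required")
    # Tesseract writes the recognized text to <output_base>.txt
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only
    print(build_tesseract_command("scan.png", "scan", ["spa"]))
```

The resulting list can be handed to `subprocess.run(..., check=True)` on a machine where Tesseract is installed. Forcing the caller to name the language is the point of the sketch: it makes forgetting this step an error rather than a pile of mangled accents.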

Use Notepad to see if your text is segmented properly

While it may not immediately be apparent, proper segmentation of text is vital for a CAT tool to work efficiently. Improper segmentation can be easier to see if you paste your text into Notepad and under “view”, deselect “word wrap”. In this view, proper segmentation is apparent because each sentence or segment appears as a single unbroken line, while an improperly segmented sentence appears as several short, separate lines.

In the example below, I “OCRed” a .pdf file of the U.S. Bill of Rights:

[Image: Blog.2014.06.25 OCR]

 The example on the left shows improper segmentation. In this example, the text was output by the OCR program exactly as it “saw” the source file, with each sentence containing a line break whenever the source text began a new line. In the example on the right, I have removed the line breaks manually, and each segment now appears as a single line.
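The Notepad check and the manual clean-up described above can also be roughed out in a few lines of Python. This is only a sketch under simple assumptions: paragraphs are separated by blank lines, and a line that does not end in sentence-ending punctuation but is followed by more text is treated as broken. The function names `find_broken_lines` and `unwrap_lines` are my own, not part of any CAT or OCR tool.

```python
import re

# A line "ends properly" if it finishes with sentence-ending punctuation,
# optionally followed by closing quotes or brackets.
SEGMENT_END = re.compile(r"[.!?:][\"'”’)\]]*$")


def find_broken_lines(text):
    """Return (line_number, line) pairs for lines that look cut off mid-sentence."""
    lines = text.splitlines()
    broken = []
    for number, line in enumerate(lines, start=1):
        stripped = line.strip()
        if not stripped:
            continue
        # lines[number] is the NEXT line, because `number` is 1-based
        has_next = number < len(lines) and lines[number].strip() != ""
        if has_next and not SEGMENT_END.search(stripped):
            broken.append((number, stripped))
    return broken


def unwrap_lines(text):
    """Join hard-wrapped lines so that each paragraph becomes a single line."""
    # Normalize Windows and old Mac line endings first
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # split() with no arguments collapses every run of whitespace,
    # including the unwanted line breaks inside a paragraph
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)


if __name__ == "__main__":
    sample = (
        "Congress shall make no law respecting\n"
        "an establishment of religion.\n"
        "\n"
        "A well regulated Militia."
    )
    print(find_broken_lines(sample))
    print(unwrap_lines(sample))
```

This is a heuristic, not a replacement for reading the text: headings followed directly by body text will be flagged as broken, and words hyphenated across a line break are joined with a space rather than repaired, so the result still needs a human pass before it goes into a CAT tool.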

Be sure to edit OCRed text properly!

Imperfect scans or interference such as signatures or stamps over the text will force an OCR program to guess whether the character that it sees is a “0” or an “O”, an “l” or a “1”, or a “,” or a “.”, so be sure to check your source text carefully to catch these errors, paying special attention to accents and special characters.
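A quick script can point you at some of the likeliest trouble spots before you start reading. The sketch below, with a function name (`flag_suspicious_tokens`) of my own choosing, assumes that a word mixing letters and digits, such as “C0ngress” or “1aw”, deserves a second look. It will not catch commas misread as periods or wrong accents, so it supplements careful proofreading rather than replacing it.

```python
import re

# Runs of letters and digits (Unicode-aware); punctuation and
# underscores act as separators.
TOKEN = re.compile(r"[^\W_]+")


def flag_suspicious_tokens(text):
    """Return (line_number, token) pairs for tokens that mix letters and digits."""
    flagged = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in TOKEN.finditer(line):
            token = match.group()
            has_letter = any(ch.isalpha() for ch in token)
            has_digit = any(ch.isdigit() for ch in token)
            if has_letter and has_digit:
                flagged.append((line_number, token))
    return flagged


if __name__ == "__main__":
    sample = "C0ngress shall make no 1aw.\nRatified in 1791."
    for line_number, token in flag_suspicious_tokens(sample):
        print(f"line {line_number}: {token}")
```

Expect some false positives from legitimate tokens such as “1st” or “A4”; the list is meant to be reviewed by eye, not corrected automatically.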

Don’t have the time to do things the old-fashioned way? Outsource it!

Editing and formatting a source file manually is a tedious, time-consuming task, and while I do think that it is worthwhile since it allows the translator or PM to preview the text for translation, it can easily be outsourced since it does not require as demanding a skillset as translation or project management does. Here in Ecuador there are plenty of opportunities to outsource secretarial tasks at hourly rates beneficial to both parties, and I have seen entrepreneurs based in the Philippines specializing in exactly this service with very reasonable per-page rates and overnight turnarounds.

Final Thoughts

OCR has become commonplace in the translation and localization industry and plays a role in increasing the efficiency of the translation process by allowing productivity-enhancing tools such as translation memories, term bases, and machine translation to be used. However, PMs and translators are working under more and more pressure to produce greater volumes and higher quality with faster turnaround times. This creates a temptation to let technology replace human input when preparing source files and to use improperly formatted OCR files for translation, which has given OCR a bad rap in the industry. Still, as long as it is used properly, and the formatting and segmentation are done manually with a procedure similar to the one above, there is no reason why we should turn our backs on this productivity-enhancing tool.


Jason Hall
