Experienced translators are well aware of the horrors of bad OCR: documents that look like the original but undergo disconcerting font changes when translated with the classic Trados macros, text blocks that disappear because they are embedded in wrong-sized boxes, section breaks that disrupt the text, and words that display in CAT tools with tags embedded in the middle, screwing up terminology lookups, TM matches and more. I hope there is a special place in Hell for those who think that usual rates should apply to such time-wasting messes.
There are many remedies for these problems, such as Dave Turner's CodeZapper macros or the memoQ import option to ignore irrelevant tags in Word documents, but there really is no good substitute for doing the conversion right the first time. This is a skill I have taught to colleagues and clients on a number of occasions, because it saves everyone time and money.
After I ended my chat and tutorial with C, I went to work on a new project due tomorrow. I had been putting it off while working on some tutorials for next week, but I still had time for dealing with it at a relaxed pace. Then I opened the document and realized that while I was distracted earlier this week, the project manager had sent me the horror of all horrible OCR jobs, an automatic conversion that violates every principle of good OCR practice. And it's Sunday. I'm screwed. No PDF to re-do the OCR my way.
Then I realized there are two possible solutions to address this problem which do not involve medium-range missiles.
I could, if the document had a lot of text boxes in bad sequence, print the file to PDF and do another OCR of that. But the text flow in this case is mostly in one block, so that's really not an issue. I followed another procedure which gave me a raw file much like I create when doing a complex OCR, which I then quickly reformatted to give me a usable source file that would not be polluted with tag trash and inconvenient text breaks. The steps were as follows:
- Save the bad OCR DOC file as plain text with the desired encoding.
- Open the plain text document in Microsoft Word or another full-featured text editor.
- Check the sequence of the text flow to be sure that it is correct and complete.
- Correct any "broken" sentences caused by line breaks in the wrong place in the OCR document. A bit of clever search and replace can usually be used to protect desired paragraph breaks before converting the unwanted ones into spaces to restore the messed-up sentences.
- Do any other formatting you want for page numbering, bold text, subtitle styles or whatever.
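For those who would rather script the line-break repair than do it with search and replace in Word, here is a minimal sketch in Python of the same idea: protect the wanted paragraph breaks with a placeholder, turn the remaining line breaks into spaces, then restore the paragraphs. It assumes that real paragraph breaks show up as blank lines in the plain text export, which will not be true of every OCR file. The function name and the placeholder string are made up for illustration.

```python
import re

# Hypothetical marker; any string that cannot occur in the text will do.
PLACEHOLDER = "\x00PARA\x00"


def repair_line_breaks(text: str) -> str:
    """Join lines broken by bad OCR while keeping paragraph breaks.

    Assumption: wanted paragraph breaks appear as one or more blank lines.
    """
    # Normalize Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Drop trailing spaces and tabs so blank lines are truly empty.
    text = re.sub(r"[ \t]+\n", "\n", text)
    # Step 1: protect the wanted paragraph breaks.
    text = re.sub(r"\n{2,}", PLACEHOLDER, text)
    # Step 2: turn every remaining (unwanted) line break into a space.
    text = text.replace("\n", " ")
    # Step 3: collapse any runs of spaces this created.
    text = re.sub(r" {2,}", " ", text)
    # Step 4: restore the protected paragraph breaks.
    text = text.replace(PLACEHOLDER, "\n\n")
    return text.strip()
```

To use it, read the plain text file with the encoding chosen in the first step (for example `open(path, encoding="utf-8")`), pass the contents to `repair_line_breaks`, and write the result to a new file. The sequence check and the OCR error corrections still have to be done by eye.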
Cleaning up my 18 pages of garbage from the PM took a bit over half an hour altogether, including some corrections of OCR errors, and if I want to get a clean source document after my translation, I simply correct any source errors I find in memoQ as I work and export a fixed source document later.
Excellent advice, also very applicable to those documents that never went near OCR, but whose authors use a word processor as a mechanical typewriter (you know, a mix of tabs, spaces and hard returns instead of tables, a series of hard returns to force a page break, and similar).
Hi Kevin, your advice is sound, but labor- and time-consuming. I always prefer to do the OCR anew - it is faster and, unlike whoever did the first one, I KNOW what I'm doing and how to avoid various glitches (or how to correct them). Of course, you also need to have the original, but my clients usually provide it together with the OCR file without being asked.
Uldis
Kevin, have you by any chance seen a tool called "Infix Editor" by Iceni, the same company that makes "Gemini", one of the several viable OCR options on the market? Infix is here: http://www.iceni.com/infix.htm. Ignoring the marketingspeak, Infix can be a cool addition to your zoo of PDF tools **if** the PDFs you get are relatively "clean", meaning that they are not password-protected, they use commonly available fonts like Arial, and the paragraphs and pages are not full and leave room for text expansion. If everything works as planned, roundtripping a PDF file through a translation process is actually easy, and you can use any XML-educated CAT tool on it.