
Oct 15, 2015

The Invisible Hand of memoQ LiveDocs - making "broken" corpora work

Last month I published a post describing the "rules" for document visibility in the list of documents for a memoQ LiveDocs corpus. Further study has revealed that this is only part of the real story and is somewhat misleading.

I (wrongly) assumed that, in a LiveDocs corpus, if a document was visible in the list its content was available in concordance searches or the Translation Results pane, and if it was not shown in the list of documents for the corpus in the project, its content would not be available in the concordance or Translation Results pane. Both assumptions proved wrong in particular cases.

In the most recent versions of memoQ, for corpora created and indexed in those versions, all documents in a corpus shown in the list will be available in the concordance search and the Translation Results pane as expected. And the rules for what is currently shown in the list are described accurately in my previous post on this topic. However,

  • if there are documents in the corpus which share the same main language (as EN-US and EN-UK both share the main language, English) but are not shown in the list, these will still be used for matching in the memoQ Concordance and Translation Results and
  • if the corpus was created in an older version of memoQ (such as memoQ 2013R2), documents shown in the list of a corpus may in fact not show up in a Concordance search or in the Translation Results. 
This second behavior - documents shown in the list but their content not appearing in searches - had been described to me recently by several people. At first I could not reproduce it, so I thought they must be mistaken, and statements that "sometimes it works and sometimes it doesn't" made these reports seem even more suspect. Except that they happen to be true, and I now (sort of) understand why.

Prior to publishing my post describing the rules governing the display of documents for a LiveDocs corpus in a project, I had been part of a somewhat confusing discussion with one of my favorite Kilgray experts, who mentioned monolingual "stub" documents a number of times as a possible solution to content availability in a corpus. When I tried to test his suggestion and saw that the list of documents displayed in the corpus had not expanded to include content I knew was there, I thought he was wrong. But actually he was right; we were simply talking about two different things - the visibility of a document versus the availability of its content.

For purposes of this discussion, a stub document is a small file with content of no importance, added only to create the desired behavior in memoQ LiveDocs. It might be a little text file - "stubby.txt" - with any nonsense in it.

I went back to the test projects and corpora used to prepare the last article and found that all the content for the main languages in a project was in fact available from the corpora, regardless of whether the relevant documents were displayed in the list. In the case of a corpus not offered in the list for a project because of sublanguage mismatches in the source and target, adding a stub document - with either a generic setting (DE, EN, PT, etc.) or a sublanguage-specific setting for the source language, or the correct sublanguage setting for the target (DE-CH, EN-US, etc.) - made all the corpus content for the main languages available instantly. (Documents added within a project take on the project's language settings; use the Resource Console to add a stub with any other language settings you want.)

Content of a test corpus before a stub document was added. Viewed in the Resource Console.
The test corpus with the document list shown in my project; only the stub document is displayed, but
all the indexed content shown above is also available in the Concordance and Translation Results.
It is unfortunate that in the current versions of memoQ the document list for a corpus in a project may not correspond to its actual content for the main languages. Not only does this preclude accessing a document's content without a match or a search, it also means that binary documents (such as one of the PDF files shown in the list) cannot be opened from within the project. I hope this bug will be fixed soon.

Since a few of my friends, colleagues and clients were concerned about odd behavior involving older corpora, I decided to have a look at those as well. Kilgray Support had made a general recommendation of rebuilding these corpora or had at least suggested that problems might occur, so I was expecting something.

And I found it. Test corpora created in the older version of memoQ (2013 R2) behaved in a way similar to my tests with memoQ 2015 - although the "display rules" for documents in the list differed as I described in my previous blog post, the content of "hidden" documents was available in Concordance searches and in the Translation Results pane. But....

When I accessed these corpora created in memoQ 2013 R2 using memoQ 2015, even if I could see documents (for example, a monolingual source document with a generic setting), the content was available in neither the Concordance nor the Translation Results until I added an appropriate stub document under memoQ 2015. Then suddenly the index worked under memoQ 2015 and I could access all the content, regardless of whether the documents were displayed in the list. If I deleted the stub document, the content became inaccessible again.

So what should we do to make sure that all the content of our memoQ corpora is available for searches in the Concordance or matches in the Translation Results?

If you always work out of the same main source language (which in my case would be German or "DE", regardless of whether the variant is from Germany, Austria or Switzerland), then add a generic language stub document for your source language to all corpora - old and new - under memoQ 2015 and all will be well.

If your corpora will be used bidirectionally, then add a generic stub for both the source and target to those corpora or add a "bilingual stub" with generic settings for both languages. This will ensure that the content remains available if you want to use the corpora later in projects with the source and target reversed.
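For those who like to script such chores, a minimal Python sketch of the stub idea follows (the file names and content are arbitrary placeholders of my own; what matters is importing the resulting files into the corpus with the appropriate generic language settings, for example via the Resource Console):

from pathlib import Path

def make_stub(folder: str, name: str) -> Path:
    """Write a tiny throwaway file; its content is irrelevant - only its
    presence in the corpus with the right language settings matters."""
    stub = Path(folder) / name
    stub.write_text("stub content - safe to ignore\n", encoding="utf-8")
    return stub

# One stub per direction, so the corpus content stays available
# even if the source and target languages are later reversed.
print(make_stub(".", "stub_source.txt"))
print(make_stub(".", "stub_target.txt"))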

Although it's hard to understand the principles governing what is displayed, when and why, following the advice above on adding stub documents will at least eliminate the problem of content not being available for pretranslation, concordance searches and translation grid matches. And the mystery of inconsistent behavior for older corpora appears to be solved: the cases where these older corpora have "worked" - i.e. their content has been accessible in the Concordance, etc. - are cases where new documents were added to them under recent versions of memoQ. If you keep adding to your corpora, particularly from a project with generic language settings, you won't have to bother with stub documents and your content will remain accessible.

And if Kilgray deals with that list bug so we actually see all the documents in a corpus which share the main languages of a project, including the binary ones, then I think the confusion among users will be reduced considerably.

Dec 11, 2013

General settings for memoQ TMs

memoQ TM settings are found in the Resource Console, the Options and a project's Settings.
This is a very useful "light resource" which is well worth nearly every user's time.
To define the TM settings to be used in new projects, select a settings configuration under Tools > Options... >  Default resources > TM settings (in the row of icons) by marking its checkbox.

To define the default TM settings to be used in the project you have opened, go to Project home > Settings > TM settings (in the row of icons) and mark the checkbox for the desired project default.

Different settings for individual TMs in a project (for example to set higher or lower match criteria) may be applied by going to Project home > Translation memories, selecting the TM of interest, clicking the Settings command at the right of the window and choosing the settings to apply instead of the project's standard TM settings.

The General settings tab is the same for all currently supported versions of memoQ. Role options are included on another tab in memoQ 2013 R2, and the Project Manager editions of memoQ offer additional possibilities for filtering and/or applying penalties to content on a Filters tab.


Match thresholds
The first value here (minimum) controls the fuzzy percentage below which a match will not be displayed in the translation results pane at the upper right of the working translation window.

The "good match" threshold is relevant to pretranslation (though this is unfortunately not made obvious in the dialog). The default value of 95% is really too high and would only apply to matches with small differences in tags or numbers; since any small difference in words is penalized significantly in memoQ (something I find very helpful, as I can understand more quickly what differences to look for compared to working in Trados). I usually set my "good matches" to 80%.

Not a "good match" according to the memoQ TM default setting
Penalties
In my work, an alignment penalty, which is a deduction from the match rate of a translation unit created by feeding an alignment to a translation memory, does not make a lot of sense. This is because
  • I almost never send alignments to a TM. Why bother? LiveDocs may be slower in pretranslation, but it provides context matching just like a TM, and you can actually read what you find in a concordance search in its original document context. TMs suck because you do not get the full context for your matching segment and are thus at greater risk for missing information which may be important for a translation. This is especially the case with short match segments.
  • if I happen to be aligning a dodgy translation and want to send it to a TM, I'll put it in a "quarantine TM" which already has its own penalty.
  • on those rare occasions when I might feed an alignment to a TM, it's because the content is going to a user of another CAT tool, and if that person uses Trados or another tool that can read XLIFF files or other available bilingual formats, I'll send the data in one of those formats instead, so it can be reviewed and modified more easily before being fed to a TM. This also gives the other person a bilingual reference with document context.
  • alignment for TMs is soooooo 1990s!
User penalties: If you have the misfortune to share a TM with someone whose work you do not trust completely and you want to avoid letting that person's 100% and context match segments slip past you unnoticed, apply a suitable penalty for the level of "risk" that person represents. If you want to be sure that user's content never gets used in a pretranslation and never appears in the translation results pane, apply a whopping big penalty like 80%. Those segments will not be shown or inserted but will still be there in a concordance search if you want them.

TM penalties: Sometimes a client provides you with a TM you do not trust completely, or you may have a "quarantine TM" with content of dubious quality. Or I might have a TM with good content in British English but need to deliver a translation in American English. Applying penalties to such TMs will reduce the priority of their matches and prevent 100% matches with inappropriate language from slipping past without more careful inspection. As in the case of user penalties, you can also apply a very large penalty to ensure that matches will never be displayed in the translation results pane or used in a pretranslation but still have the TM content available for concordance searches.
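As a rough sketch of the arithmetic involved (again my own simplification, not Kilgray's implementation): a penalty is simply deducted from the raw match rate before the thresholds are applied, which is why a very large penalty keeps a hit out of pretranslation and the translation results pane while leaving the segment findable in the Concordance.

def effective_rate(raw_rate: int, penalty: int) -> int:
    """Deduct a user or TM penalty from the raw match rate (never below 0)."""
    return max(raw_rate - penalty, 0)

MINIMUM_THRESHOLD = 60  # display threshold from the General settings tab

for penalty in (10, 80):
    rate = effective_rate(100, penalty)  # a 100% match from a dubious TM or user
    shown = rate >= MINIMUM_THRESHOLD
    print(f"penalty {penalty}% -> effective rate {rate}%, shown in results: {shown}")
# penalty 10% -> effective rate 90%, shown in results: True
# penalty 80% -> effective rate 20%, shown in results: False (Concordance still finds it)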

Adjustments
It seems to be a good idea generally to enable the adjustment of fuzzy hits and inline tags. In many (but not all) cases, this will correct small differences in numbers, punctuation, letter case and inline tags.

The only significant effect I was able to determine in adjusting the inline tag strictness in my tests was that more permissive settings might count a match with different tags as a full match. While this might meet the requirements of some clients hoping to impose discount schemes, from a quality assurance perspective, this does not seem like a good idea, and I believe it is better to have a strict setting here to draw attention to differences and reduce the chance that errors might be overlooked.

Nov 5, 2013

Proofreading LiveDocs bilinguals, recycling versions in memoQ

(These tests were performed with memoQ 2013 R2 but should, in principle, work the same in any version of memoQ 6.0 or later.)

I really like memoQ's versioning features, and I use the X-Translate function fairly often when a document I'm translating has been updated or a new version comes sometime later. However, I don't keep documents in my projects forever. I use "container projects" for particular clients or subject domains so that I don't have to keep reattaching the same translation memories, terminologies and LiveDocs corpora and various light resources (non-translatables, autocorrect lists, segmentation rules, etc.) all the time. These can get rather full, so I send my old translations off to a LiveDocs corpus after a while. And then if a new version of a document shows up, well... I'm sort of out of luck if I want to use the X-Translate function with the previous version.

Or so I thought. And then a friend rang me and asked how she could export a LiveDocs alignment she had done to a bilingual RTF file to make it more convenient to proofread in Microsoft Word and pass on to one of her partners with tracked changes. With that, the answer to both problems was clear.

Select the corpus and the file in it to export:


Click Export and choose a location in which to save the MQXLZ file:


In the Translations window of any memoQ project with the correct source and target language settings, select Import and choose your MQXLZ file:


After the file exported from LiveDocs has been imported as a translation file, it can serve as "version 1" for a new file version to be translated using the Reimport document and X-translate features. It does not matter that the file types are different. A bilingual RTF file can also be exported for external correction and commentary.


Here is an example of an exported bilingual RTF file with changes tracked. The changes do not have to be accepted before the bilingual file is re-imported to update the translation using the Import command.


Here is the updated translation:


Changes are marked with blue arrows. Only text changes were implemented in this case, no format changes such as italic text, because the XLIFF file does not support WYSIWYG text formatting. (MQXLZ is a Kilgray-renamed ZIP package with XLIFF and a chocolate surprise inside.)
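Since the MQXLZ container really is just a renamed ZIP, the curious can peek inside it with a few lines of Python. The sketch below assumes nothing beyond the ZIP wrapper mentioned above; it simply lists and extracts whatever the package happens to contain.

import sys
import zipfile

def peek_mqxlz(path: str) -> None:
    """List the contents of an .mqxlz package and unpack it for inspection."""
    with zipfile.ZipFile(path) as zf:
        for info in zf.infolist():
            print(f"{info.filename}  ({info.file_size} bytes)")
        zf.extractall(path + "_unpacked")  # the XLIFF inside is plain XML

if __name__ == "__main__":
    peek_mqxlz(sys.argv[1])  # e.g. python peek_mqxlz.py exported_document.mqxlz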

Now I've got a new version of my text to translate in a DOCX file. I use the Reimport document function, answer No to the dialog so I can select the new version at a different location:


I'm curious what the differences from the original text are, so I use the History/reports command in the Translations window to find that out:




Then using Operations > X-Translate in the working window, followed by pretranslation to get the changed "exact matches" (like "Aussehen und Gewicht" above) and the fuzzy matches, I end up with this:


If you make it a point to store your important versions in a LiveDocs corpus, this procedure will allow you to recover your archived texts and re-use them for more controlled, reference-based translation. It would be nice, of course, if some day Kilgray would enable specific LiveDocs files to be used as the basis of a reference translation, perhaps even scanning a corpus or a set of corpora to identify the best-matching document or documents. It would also be nice if bilinguals stored in LiveDocs could be exported to other formats and perhaps even be updated with something like an exported bilingual RTF. However, those bilinguals can simply be imported directly to a LiveDocs corpus as new documents, and any corrections made to a document in the Translations list can be sent back to LiveDocs using the relevant command in the Translations window.

Aug 2, 2012

memoQuickie: pretranslation in memoQ


Pretranslating is the process of automatically translating a text or portions of it by comparison with a translation memory, the application of machine translation or the use of a previous version as a reference document. This description covers the first two cases.


Select Operations > Pre-Translate... from the menus.

 
Select the appropriate pretranslation settings for your file(s). In some cases, using TM-driven segmentation can greatly improve matching.

 
Here is an example of results with fuzzy matches from a translation memory and some segments (2 and 4) machine-translated (using the pseudotranslation engine as an example for contrast). Without machine translation or fragment assembly those segments would remain empty.

This method of pretranslation is often a good follow-up step when translating with a reference document (i.e. a previous version, using the X-Translate function).

Dec 21, 2011

Presegmented "classic" Trados files

Given that many outsourcing translators, agencies and companies still use older versions of Trados but often want to work with qualified translators without tripping over tool issues, this is still a current topic despite the new SDL Trados tools having been on the market for several years. And my old published procedures on these matters are either no longer publicly available or are somewhat in need of updating.

Before I began blogging in 2008, I wrote a number of guides to help my partner, colleagues and clients understand the best procedures for handling "Trados jobs" with other translation environment tools. When translating a TTX file with Déjà Vu, memoQ and many other applications, it is often best practice to "presegment" the file using a demo or licensed version of Trados 2007 or earlier. In fact, if this is done on the client's system, many of the little incompatibility quirks that can arise when the translator uses a different build of Trados (for example) are avoided.

What does "presegment" actually mean? It is a particular method of pretranslation in which for segments where the translation memory offers no match, the source text is copied to the target segment. If performed with an empty TM, the target segments are initially identical to the source segments. If this procedure is followed, full, reliable compatibility is achieved between applications such as Déjà Vu and memoQ for clients using Trados versions predating Trados Studio 2009. For newer versions of Trados, the best procedure involves working with the SDLXLIFF files from Studio. If a freelance translator does not own a copy of SDL Trados 2007 or an earlier version used by an agency or direct client, this is the procedure to share with a request for presegmentation. While some clients might expect the translator to do such work using his or her own copy of Trados, I have experienced enough trouble with complex files over the years when different builds of the same version of Trados are used that I consider this to be the safest procedure to follow - safer even than having the translator do the work in Trados in many cases.

Step 1: Prepare the source files
Before creating a TTX file and presegmenting it for translation in DVX or creating a presegmented RTF, DOC or DOCX file compatible with the Trados Workbench or Wordfast Classic macros in Microsoft Word, it is a very good idea to take a look at the file and clean up any "garbage" such as optional hyphens, unwanted carriage returns or breaks, inappropriate tabbing in the middle of sentences, etc. Also, if the file has been produced by incompetent OCR processes, there may be a host of subtle font changes or spacing between letters, etc. that will create a horrible mess of tags when you try to work with most translation environment tools. Dave Turner's CodeZapper macros are a big help in such cases, and other techniques may include copying and pasting to and from WordPad or even converting to naked text in Notepad and reapplying any desired formatting. This will ensure that your work will not be burdened by superfluous tags and that the uncleaned file after the translation will have good quality segmentation.

Step 2: Segment the source files
If the source files are of types which Trados handles only via the TagEditor interface, then they may be pretranslated directly by Trados Workbench to produce presegmented TTX files. If they are RTF or Microsoft Word files, on the other hand, and a TTX file is desired, you must first launch TagEditor, open the files in that environment and then save them to create the TTX files, which are then pretranslated using Trados Workbench. If a presegmented RTF or Microsoft Word file is desired (for subsequent review using the word processor, for example), then the files can be processed directly with Trados Workbench.

Important Trados settings:
  • In Trados Workbench, select the menu option Options > Translation Memory Options… and make sure that the checkbox option Copy source on no match is marked. 

  • In the dialog for the menu option Tools > Translate, mark the options to Segment unknown sentences and Update document.

After the settings for Trados Workbench are configured correctly, select the files you wish to translate in the dialog for the Workbench menu option Tools > Translate and pretranslate them by clicking the Translate button. This will create the "presegmented" files for import into DVX, memoQ, etc. If the job involves a lot of terminology in a MultiTerm database, which cannot be made available for the translation in the other environment (perhaps due to password protection or no suitable MultiTerm installation on the other computer), you might want to consider selecting the Workbench option to insert the terms.

Note: to get a full source-to-target copy, use an empty Trados Workbench TM. However, if an original customer TM is used for this step, you will often get better "leverage" (higher match rates) than if you work only with a TMX export of the TM in the other environment. If I am supplied with a TWB TM, I usually presegment with it first, then export it to TMX and bring it into memoQ or DVX for concordancing purposes. In some cases, though, such as when memoQ's "TM-driven segmentation" is used, you might get better matches in the other environment (not Trados).

Whoever performs the presegmentation might want to inspect the segmented files in TagEditor or MS Word to ensure that the segmentation does not require adjustment. Segments can typically be joined in other environments such as memoQ in order to have sensible TM entries there or to deal with structural issues in the language, but this will not avoid useless segments in the content for Trados. The best way to deal with those is to fix the segments there. Otherwise, I often provide a TMX export from memoQ to improve the quality of the Trados TM.

Step 3: Import the segmented source files into the other environment
The procedure for this varies depending on your translation environment tool. Usually the file type will be recognized and the appropriate filter offered. In some cases, the correct filter type must be specified (such as in memoQ, where a presegmented bilingual RTF/DOC must be imported using the "Add document as..." function and specifying "Bilingual DOC/RTF filter" instead of the default "Microsoft Word filter").

Some tools, like memoQ, offer the possibility of importing content which Trados ignores, such as numbers and dates. This is extremely useful when number and date formats differ between the languages involved. It saves tedious post-editing in Word or TagEditor and also enables a correct word count to be made.

A few words about output from the other (non-Trados) environment
If you import a TTX to Déjà Vu, memoQ, etc., what you will get when you export the result is a translated TTX file, which must then be cleaned using Trados under the usual conditions. Exporting a presegmented RTF or Microsoft Word file from DVX gives you the translated, presegmented file. The ordinary export from memoQ will clean that file and give you a deliverable target file. To get the bilingual format for review, etc. you will have to use the option to export a bilingual file.

Other environments such as memoQ or Déjà Vu may also offer useful features like the export of bilingual, commented tables for feedback. This saves time in communicating issues such as source file problems, terminology questions, etc. and is infinitely superior to the awful Excel feedback sheets that some translation agencies try to impose on their partners.

Editing translations performed with Trados
A translation performed using the Trados Workbench macros in Microsoft Word or using TagEditor can easily be reviewed in many other environments such as Déjà Vu or memoQ. In fact, I find that the QA tools and general working environment with this approach are far superior to working in TagEditor or Word, for example. Tag checks can be performed easily, compliance with standard terminology can be verified, content can be filtered for more efficient updates and more.

Editing translations performed with more recent versions of Trados (SDL Trados Studio 2009 and 2011) is also straightforward, as these SDLXLIFF files are XLIFF files which can be reviewed in any XLIFF-compatible tool.