Translation Tribulations: glossary


Nov 25, 2023

Book review: "Terminology Extraction for Translation and Interpretation Made Easy"

A few months ago, I received a pre-release copy of this book as a courtesy from the author, terminologist Uwe Muegge, with a request to give a quick language check to the English used by its native German author. As I expected, there wasn't much to complain about, because he has lived in the US for a long time, taught at university there and held important corporate roles. I was particularly pleased by his disciplined style of writing, the plain, consistent English of the text and the overall clarity of the presentation. Anyone with good basic English skills should have no difficulty understanding and applying the material.

At the time I read the draft, I was completely focused on language use and style, but I found his approach and suggestions interesting, so I looked forward to "field testing" and direct comparisons with my usual approach to terminology mining with that feature in memoQ. About a day's worth of tests showed very interesting potential for applying the ChatGPT section of the book and also made the context and relevance of the other two sections clearer. I will discuss those sections first before getting to the part that interests me the most.

Uwe presents three approaches:

ChatGPT (https://chat.openai.com/)
OneClick Terms (https://terms.sketchengine.eu/)
Wordlist (https://www.webcorp.org.uk/live/wdlist.jsp)

I wouldn't really call these three approaches alternatives as the book does, because all three operate in very different ways and are fit for different purposes. That didn't register fully in my mind when I was in "editor mode", although the first part of the book made the differences, advantages and disadvantages clear enough, but as soon as I began using each of the sites, the differences were quite apparent as were the similarities to more familiar tools like memoQ's term extraction module.

Wordlist from Webcorp is functionally similar to Laurence Anthony's AntConc or memoQ's term extraction. It's essentially useful for getting frequency lists of words, but the inability to use my own stopword lists for filtering out uninteresting common vocabulary makes me prefer my accustomed desktop tools. However, the barriers to first acquaintance or use are lower than for AntConc or memoQ, so this would probably be a better classroom tool for introducing concepts of word frequencies and identifying possible useful terminology on that basis.
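For readers who want to see what a frequency list with stopword filtering amounts to, here is a minimal Python sketch. It is not how Wordlist, AntConc or memoQ work internally; the function name, the tokenizing pattern and the sample sentence are my own assumptions for illustration.

```python
import re
from collections import Counter

def frequency_list(text, stopwords=frozenset(), min_count=1):
    """Return (word, count) pairs, most frequent first, skipping stopwords.

    Hypothetical helper for illustration only: words are runs of letters,
    optionally joined by hyphens, compared in lower case.
    """
    words = re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", text.lower())
    counts = Counter(w for w in words if w not in stopwords)
    return [(w, n) for w, n in counts.most_common() if n >= min_count]

sample = "The pump housing and the pump seal protect the pump."
result = frequency_list(sample, stopwords={"the", "and"})
print(result)
```

Swapping in your own stopword set is the whole point: the candidates that survive are only as interesting as the list used to filter them.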

OneClick Terms was interesting to me mostly because friends and acquaintances in academia talk about Sketch Engine a lot. The results were similar to what I get with memoQ, including similar multiword "trash terms". I found the feature for term extraction from bilingual texts particularly interesting, and the fact that it can work well on the TMX files distributed by the Directorate-General for Translation (DGT) of the European Commission suggests that it could be an efficient tool for building glossaries to support translation referencing EU legislation, for example, though I expect only slight advantages over my usual routine with memoQ. These advantages are not worth the monthly subscription fee to me. However, for purposes of teaching and comparison, the inclusion of this platform in the book is helpful. I see more value for academic institutions and those rare large volume translation companies that do a lot of work with EU resources.

ChatGPT was an interesting surprise. I have a very low opinion of its use as a writing tool (mediocre on its best day, clumsy and boring in nearly all its output) or for regex composition (hopelessly incompetent for what I need, and anything it does right for regex is newbie stuff for which I need no support). However, as a terminology research tool I have found excellent potential, though formatting the results can be problematic.

My testing was done with ChatGPT 3.5, not a Professional subscription with access to version 4.0. However, I am sorely tempted to try the subscription version to see if it is able to handle some formatting instructions (avoiding unnecessary capitalization) more efficiently. No matter how carefully I try to stipulate no default capitalization of the first letter of every expression, I inevitably have to repeat the instruction after a list of improperly capitalized candidate terms is created.

I keep an e-book copy of Uwe's book in the Kindle app on my laptop, so I can simply copy and paste his suggested prompts, then add whatever additional instructions I want.

The prompt

Please examine the text below carefully and list words or expressions which may be difficult to translate, but when writing the list, do not capitalize any words or expressions which don't require capitalization.

is too long, and only the part marked red is executed correctly, but this follow-up prompt will fix the capitalization in the list:

Please re-examine that text and this time when writing the list, do not capitalize any words or expressions which do not require capitalization.

Further tests involved suggesting translations for the expressions, with or without a translated text, and building tables with example sentences.

Other prompt variations, for example to write terms bold in the example sentences, worked without complications.

What about the quality of the selections? Well, I used memoQ's term extraction module on the same text I submitted to ChatGPT for term extraction in order to compare a process with which I am quite familiar against this new one.

memoQ identified a few terms based on frequency, which ChatGPT ignored, but these were arguably terms that a qualified specialist would have known anyway. And ChatGPT did a superior job of selecting multi-word expressions with no "noise". It also selected some very relevant single-occurrence phrases which might be expected to arise more in later, similar texts.

Split screen review of memoQ extraction vs. ChatGPT results

The split-screenshot is an intermediate result from one of my many tests. The overlaid red box was intended to show a conversation partner the limits of ChatGPT's "alphabetizing skill", and the capitalization of the German is not correct after a prompt to correct the capitalization of adjectives misfired. It is not always trivial to get formatting exactly as I want it. However, looking at the results of each program side-by-side like this showed me that ChatGPT had in fact identified nearly all the most relevant single words and phrases in my text. And for other texts with dates or citation formats, these were also collected by ChatGPT as "relevant terms", giving me an indication of what legislation I might want to use as reference documents and what auto-translation rules might also be helpful.

I also found that the split view as above helped me to work my way through the noise in the memoQ term candidate list much faster and make decisions about which terms to accept. The terms of interest found in memoQ but not selected by ChatGPT were few enough that I am not at all tempted to suggest people follow my traditional approach with the memoQ term extraction module and skip the work with ChatGPT.

My preferred approach would be to do a quick screening in ChatGPT, import the results into a provisional (?) term base and then, as time permits, use that resource in a memoQ term extraction to populate the target fields in the extraction grid. With those populated terms in place, I think the review of the remaining candidates would proceed much more efficiently.

All in all, I found Uwe's book to be a useful reference for teaching and for my personal work; it is one of the few texts I have seen on LLM use which is sober and modest enough in its claims that I was inspired to test them. The sale price is also well within anyone's means: about $10 for the e-book and $16 for the paperback on Amazon. For the "term curious" without access to professional grade tools, it's a great place to get started building better glossaries and for more seasoned wordworkers it offers interesting, probably useful suggestions.

The book is available HERE from Amazon.

Dec 29, 2018

memoQ Terminology Extraction and Management

Recent versions of memoQ (8.4+) have seen quite a few significant improvements in recording and managing significant terminology in translation and review projects. These include:

Easier inclusion of context examples for use (though this means that term information like source should be placed in the definition field so it is not accidentally lost)
Microsoft Excel import/export capabilities which include forbidden terminology marking with red text - very handy for term review workflows with colleagues and clients!
Improved stopword list management generally, and the inclusion of new basic stopword lists for Spanish, Hungarian, Portuguese and Russian
Prefix merging and hiding for extracted terms
Improved features for graphics in term entries - more formats and better portability

Since the introduction of direct keyboard shortcuts for writing to the first nine ranked term bases in a memoQ project (as part of the keyboard shortcuts overhaul in version 7.8), memoQ has offered perhaps the most powerful and flexible integrated term management capabilities of any translation environment despite some persistent shortcomings in its somewhat dated and rigid term model. But although I appreciate the ability of some other tools to create customized data structures that may better reflect sophisticated needs, nothing I have seen beats the ease of use and simple power of memoQ-managed terminology in practical, everyday project use.

An important part of that use throughout my nearly two decades of activity as a commercial translator has been the ability to examine collections of documents - including but not limited to those I am supposed to translate - to identify significant subject matter terminology in order to clarify these expressions with clients or coordinate their consistent translations with members of a project team. The introduction of the terminology extraction features in memoQ version 5 long ago was a significant boost to my personal productivity, but that prototype module remained unimproved for quite a long time, posing significant usability barriers for the average user.

Within the past year, those barriers have largely fallen, though sometimes in ways that may not be immediately obvious. And now practical examples to make the exploration of terminology more accessible to everyone have good ground in which to take root. So in two recent webinars, I shared my approach - in German and in English - to how I apply terminology extraction in various client projects or to assist colleagues. The German talk included some of the general advice on term management in memoQ which I shared in my talk last spring, Getting on Better Terms with memoQ. That talk included a discussion of term extraction (aka "term mining"), but more details are available here:

Due to unforeseen circumstances, I didn't make it to the office (where my notes were) to deliver the talk, so I forgot to show the convenience of access to the memoQ concordance search of translation memories and LiveDocs corpora during term extraction, which often greatly facilitates the identification of possible translations for a term candidate in an extraction session. This was covered in the German talk.

All my recent webinar recordings - and shorter videos, like playing multiple term bases in memoQ to best advantage - are best viewed directly on YouTube rather than in the embedded frames on my blog pages. This is because all of them since earlier in 2018 include time indexes that make it easier to navigate the content and review specific points rather than listen to long stretches of video and search for a long time to find some little thing. This is really quite a simple thing to do as I pointed out in a blog post earlier this year, and it's really a shame that more of the often useful video content produced by individuals, associations and commercial companies to help translators is not indexed this way to make it more useful for learning.

There is still work to be done to improve term management and extraction in memoQ, of course. Some low-hanging fruit here might be expanded access to the memoQ web search feature in the term extraction as well as in other modules; this need can, of course, be covered very well by excellent third-party tools such as Michael Farrell's IntelliWebSearch. And the memoQ Concordance search is long overdue for an overhaul to allow proper filtering of concordance hits (by source, metadata, etc.), more targeted exploration of collocation proximities and more. But my observations of the progress made by the memoQ planning and development team in the past year give me confidence that many good things are ahead, and perhaps not so far away.

Mar 14, 2018

Come to Terms in Amsterdam, June 30th

At the end of June this year I'll be doing an expanded, in-person reboot of my occasional terminology workshop with new material and workflows for those who want to do more to control quality and improve communicative vocabularies in interpreting, translation and review projects.

Space is limited at the All-Round Translator event, but I hope you can join us to learn about

Better teamwork through timely terminology sharing
Faster, more effective discovery of frequently occurring specialist terminology
Better access to critical terminology in many environments
More efficient and accurate QA for terminology
More accurate, efficient and fault-tolerant term use when translating with memoQ
Greater flexibility to meet client terminology needs

The Early Bird rate for the workshop is €99 + VAT until the end of April, €120 + VAT thereafter.

The content is applicable to work with many translation environments, but some segments will share particular tips for maximum productivity using the unbeatably practical memoQ environments.

Jul 26, 2017

Shortcuts to managing bitext corpora and terminologies in free Google Sheets

When I presented various options for using spreadsheets available in the free Google Office tools suite on one's Google Drive, I was asked if there wasn't a "simpler" way to do all this.

What's simple? The answer to that depends a lot on the individual. Yes, great simplicity is possible using the application programming interface for parameterized URL searches described in my earlier articles on this topic:

The answer is yes. However, there will be some restrictions to accept regarding your data formats and what you can do with them. If that is acceptable, keep reading and you'll find some useful "cookie cutter" options.

When I wrote the aforementioned articles, I assumed that readers unable to cope with creating their own queries would simply ask a nerdy friend for five minutes of help. But another option would be to use canned queries which match defined structures of the spreadsheet.

Let's consider the simplest cases. For anything more complicated, post questions in the comments. One can build very complex queries for a very complex glossary spreadsheet, but if that's where you're at, this and other guns are for hire, no checks accepted.

You have bilingual data in Language A and Language B. These can be any two languages, even the same "language" with some twist (like a glossary of a modern standard English with 19th century thieves' cant from London). The data can be a glossary of terms, a translation memory or other bitext corpus, or even a monolingual lexicon (of special terms and their definitions or other relevant information). The fundamental requirement is that these data are placed in an online spreadsheet, which can be created online or uploaded from your local computer, and that Language A be found in Column A of the spreadsheet and Language B (or the definition in a monolingual lexicon) in Column B of the spreadsheet. And to make things a little more interesting we'll designate Column C as the place for additional information.

Now let's make a list of basic queries:

Search for the text you want in Column A, return matches for A as well as information in Column B and possibly C too in a table in that order
Search for the text you want in Column B, return matches for B as well as information in Column A and possibly C too in a table in that order
Search for the text you want in Column A or Column B, return matches for A/B and possibly C too in a table in that order

Query 1: searching in Column A

The basic query could be: SELECT A, B WHERE A CONTAINS '<some text>'

Of course <some text> is substituted by the actual text to look for enclosed in the single straight quote marks. If you are configuring a web search program like IntelliWebSearch or the memoQ Web Search tool or equivalents in SDL Trados Studio, OmegaT or other tools, the placeholder goes here.

If you want the information in the supplemental (Comment) Column C, add it to the SELECT statement: SELECT A, B, C WHERE A CONTAINS '<some text>'

The results table is returned in the order that the columns are named in the SELECT statement; to change the display order, change the sequence of the column labels A, B and C in the SELECT, for example: SELECT B, A, C WHERE A CONTAINS '<some text>'

Query 2: searching in Column B

Yes, you guessed it: just change the column named after WHERE. So

SELECT B, A, C WHERE B CONTAINS '<some text>'

for example.

Query 3: searching in Column A or Column B (bidirectional search)

For this, each comparison after the WHERE should be grouped in parentheses:

SELECT A, B, C WHERE (A CONTAINS '<some text>') OR (B CONTAINS '<some text>')

The statement above will return results where the expression is found in either Column A or Column B. Other logic is possible: substituting AND for the logical OR in the WHERE clause returns a results table in which the expression must be present in both columns of a given record.

And yes, in memoQ Web Search or a similar tool you would use the placeholder for the expression twice. Really.
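The three canned queries above can be generated from one small helper. This is only a sketch in Python: the function name and its parameters are hypothetical, and switching to double quotes when the search text contains an apostrophe is my assumption about how the query language treats string literals, so test it on your own sheet.

```python
def contains_query(text, search_in=("A",), show=("A", "B", "C"), match_all=False):
    """Build a SELECT ... WHERE ... CONTAINS query for the A/B/C sheet layout.

    search_in: columns to search; show: columns returned, in display order;
    match_all: join the comparisons with AND instead of OR.
    """
    # Assumption: a literal may be wrapped in double quotes instead of single
    # quotes, which avoids breaking the query on text such as "thieves' cant".
    quote = '"' if "'" in text else "'"
    tests = [f"({col} CONTAINS {quote}{text}{quote})" for col in search_in]
    joiner = " AND " if match_all else " OR "
    return f"SELECT {', '.join(show)} WHERE {joiner.join(tests)}"

print(contains_query("muni"))                                   # Query 1
print(contains_query("muni", ("B",), show=("B", "A", "C")))     # Query 2
print(contains_query("muni", ("A", "B")))                       # Query 3
```

The helper always wraps each comparison in parentheses, even when there is only one; that keeps the single-column and bidirectional cases in the same form.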

Putting it all together

To make the search URL for your Google spreadsheet three parts are needed:

The base URL of the spreadsheet (look in your browser's address bar; in the address https://docs.google.com/spreadsheets/d/1Bm_ssaeF2zkUJR-mG1SaaodNSatGdvYernsE7IJcEDA/edit#gid=1106428424 for example, the base URL is everything before /edit#gid=1106428424).
The string /gviz/tq?tqx=out:html&tq= and
Your query statement created as described above

Just concatenate all three elements:

{base URL of the spreadsheet} + /gviz/tq?tqx=out:html&tq= + {query}

An example of this in a memoQ Web Search configuration might be:

https://docs.google.com/spreadsheets/d/1Bm_ssaeF2zkUJR-mG1SaaodNSatGdvYernsE7IJcEDA/gviz/tq?tqx=out:html&tq=SELECT B, A WHERE (A CONTAINS '{}') OR (B CONTAINS '{}')

and here you can see a search with that configuration and the characters 'muni' : https://goo.gl/D5cQmh
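If you would rather have a script do the cutting and pasting, the concatenation can be sketched in a few lines of Python. The function and constant names are my own inventions; the spreadsheet address and the /gviz string are the ones given in this post.

```python
GVIZ_PATH = "/gviz/tq?tqx=out:html&tq="

def search_url(sheet_url, query):
    """Join base URL + /gviz string + query, as in the three-part recipe."""
    # Everything before "/edit" in the browser address is the base URL.
    base = sheet_url.split("/edit", 1)[0].rstrip("/")
    return base + GVIZ_PATH + query

address = ("https://docs.google.com/spreadsheets/d/"
           "1Bm_ssaeF2zkUJR-mG1SaaodNSatGdvYernsE7IJcEDA/edit#gid=1106428424")
# "{}" stands for the search-term placeholder used in the configuration above.
template = search_url(address, "SELECT B, A WHERE (A CONTAINS '{}') OR (B CONTAINS '{}')")
print(template)
print(template.replace("{}", "muni"))
```

The first line printed is the kind of string you would paste into a web search configuration; the second shows what is actually requested once the placeholder has been filled in.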

Adding custom labels to the results table

If you clicked the short URL given as an example above, you'll notice that the columns are unlabeled. Try this short URL to see the same search with labels: https://goo.gl/3zJQqK

This is accomplished simply by adding LABEL A 'Portuguese', B 'English' to the end of the query string.

If you look at the URL in the address bar for any of the live web examples you'll notice that space characters, quote marks and other stuff are substituted by codes. No matter. You can type in clear text and use what you type; modern browsers can deal with stuff that is ungeeked too.
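For the curious, the codes in the address bar are ordinary percent-encoding. The Python sketch below reproduces it with the standard library; a browser may leave a few more characters untouched than this strict setting does, so treat the exact output as illustrative rather than as what your browser will show.

```python
from urllib.parse import quote, unquote

query = "SELECT B, A WHERE (A CONTAINS 'muni') OR (B CONTAINS 'muni')"

# safe="" asks for every reserved character to be encoded, spaces included.
encoded = quote(query, safe="")
print(encoded)

# Decoding gives back exactly what was typed in clear text.
print(unquote(encoded) == query)
```

If a look-up tool ever chokes on the clear-text version, pasting the encoded form of the query after tq= is a reasonable thing to try.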

To do more formatting tricks, RTFM! It's here.

Jun 23, 2017

Terminology output management with SDL MultiTerm

I have always liked SDL MultiTerm Desktop - since long before it was an SDL product, back when it came as part of the package with my Trados Workbench version 3 license.

Then, as now, Trados sucked as a working tool, so I soon switched to Atril's Déjà Vu for my translation work, and after 8 or 9 years to memoQ, but MultiTerm has continued to be an important working tool for my language service business. I extract and manage my terminology with memoQ for the most part, but when I want a high-quality format for sharing terminology with my clients' various departments, there is currently no reasonable alternative to MultiTerm for producing good dictionary-style output.

Terminology can be exported from whatever working environment you maintain it in, and then transferred to a MultiTerm termbase using MultiTerm Convert or other tools. In the case of memoQ, there is an option to output terms directly to "MultiTerm XML" format:

Fairly simple; there are no options to configure. Just select the radio button for the MultiTerm export format at the top of any memoQ term export dialog. And what do you get?

Three files: the XML file with the actual term data and the XDT file with the termbase specifications are the important ones. The latter is used to create the termbase in SDL MultiTerm. If you have an existing termbase to use in MultiTerm, you won't need the XDT file, though if that termbase is not based on Kilgray's XDT file there might be some mapping complications for the term import from the XML file.

Now let's create a termbase in SDL MultiTerm 2017 Desktop:

Give it a name:

When the termbase wizard starts, choose the option to load an existing termbase definition and select the XDT file created by memoQ:

At the end of the process you will have an empty MultiTerm termbase into which the data in the XML file are imported:

Now you'll have an SDL MultiTerm termbase with the glossary content exported from memoQ. This is a process which can be carried out when sharing terminology with a colleague who uses SDL Trados Studio for translation, for example. If they don't know how to use the import functions of SDL MultiTerm or you want to save them the bother of doing so, just share the SDLTB file.

Now that the glossary is in MultiTerm it can be exported in various formats which can be helpful to people who prefer the data in a more generally accessible format. Please note that this is not done using the export functions under the File menu! SDL MultiTerm is a program originally developed by German programmers, who have their own Konzept of Benutzerfreundlichkeit. Even in the hands of Romanian developers, it's still kinda weird. The desired functions are found in the Termbase Management area of course:

In keeping with the German Benutzerfreundlichkeitskonzept, the command to generate the desired output is Process, of course.

There are a number of pre-defined output templates included with MultiTerm. I usually use a version of the "Word Dictionary" export definition, which produces a two-column RTF file, which by default will give output like this:

I prefer something a little different, so I have prepared various improved versions of this output definition, and I usually edit the text, adjust the column breaks as needed and clean up any garbage (like redundant initial letters caused by inflected vowels in a language like Portuguese), then I slap a cover page on the file and make a PDF out of it or create a nice printed copy, possibly with other page size formatting. Here is an example:

Example PDF dictionary output

Other possible output formats include HTML, which can be useful for term access on an intranet, for example. Custom definitions can be created by cloning and editing an existing definition; these are specific to a given termbase. If you want to apply a custom export definition to another termbase, export it as an XDX file and then load it for the other termbase. The definition file used to generate the example above is available here.

One essential weakness of the SDL export definition which has always annoyed me is the failure to include the last word on the page in the header as most proper dictionaries do. I addressed this in the definition with my limited knowledge of RTF coding, but the change can be made manually in Microsoft Word too, for example, by copying and pasting the SortTerm field and editing it to add the \l argument:

There are, of course, other, possibly better ways to get some nice output formats from memoQ glossaries or termbases in other tools. One approach with memoQ is to create XSL scripts to process the MultiTerm XML output from memoQ. For years I have been hoping that Kilgray would create a simple extension to the term export dialog in memoQ, which would allow XSL scripts to be chosen and a transformation applied when the data are exported. It really is a shame that after more than a decade the best translation environment tool available - memoQ - still cannot match the excellent formatted output that my clients and I have enjoyed with MultiTerm since I first started using that program 17 years ago!

Jun 5, 2017

Technology for Legal Translation

Last April I was a guest at the Buenos Aires University Facultad de Derecho, where I had an opportunity to meet students and staff from the law school's integrated degree program for certified public translators and to speak about my use of various technologies to assist my work in legal translation. This post is based loosely on that presentation and a subsequent workshop at the Universidade de Évora.

Useful ideas seldom develop in isolation, and to the extent that I can claim good practice in the use of assistive technologies for my translation work in legal and other domains it is largely the product of my interactions with many colleagues over the past seventeen years of commercial translation activity. These fine people have served as mentors, giving me my first exposure to the concepts of platform interoperability for translation tools, and as inspirations by sharing the many challenges they face in their work and clearly articulating the desired outcomes they hoped to achieve as professionals. They have also generously and frequently shared with me the solutions that they have found and have often unselfishly shared their ideas on how and why we should do better in our daily practice. And I am grateful that I can continue to learn with them, work better, and help others to do so as well.

A variety of tools for information management and transformation can benefit the work of a legal translator in areas which include but are not limited to:

corpus utilization,
text conversion,
terminology management,
diverse information retrieval,
assisted drafting,
dictated speech to text,
quality assurance,
version control and comparison, and
source and target text review.

Though not exhaustive, the list above can provide a fairly comprehensive basis for education of future colleagues and continued professional development for those already active as legal translators. But with any of the technologies discussed below, it is important to remember that the driving force is not the hardware and software we use in technical devices but rather the human mind and its understanding of subject matter and the needs of the particular task or work process in the legal domain. No matter how great our experience, there is always something more and useful to be learned, and often the best way to do this is to discuss the challenges of technology and workflow with others and keep an open mind for new approaches with promise.

Reference texts of many kinds are important in legal translation work (and in other types of translation too, of course). These may be monolingual or multilingual texts, and they provide a wealth of information on subject matter, terminology and typical usage in particular contexts. These collections of text – or corpora – are most useful when the information found in them can be read in context rather than isolation. Translation memories – used by many in our work – are also corpora of a kind, but they are seriously flawed in their usual implementations, because only short segments of text are displayed in a bilingual format, and the meaning and context of these retrieved snippets are too often obscure.

An excerpt from a parallel corpus showing a treaty text in English, Portuguese and Spanish

The best corpus tools for translation work allow concordance searches in multiple selected corpora and provide access to the full context of the information found. Currently, the best example of integrated document context with information searches in a translation environment tool is found in the LiveDocs module of Kilgray's memoQ.

A memoQ concordance search with a link to an "aligned" translation

A past translation and its preview stored in a memoQ LiveDocs corpus, accessed via concordance search

A memoQ LiveDocs corpus has all the advantages of the familiar "translation memory" but can include other information, such as previews of the translated work as well. It is always clear in which document the information "hit" was found, and corpora can also include any number of monolingual documents in source and target languages, something which is not possible with a traditional translation memory.

In many cases, however, much context can be restored to a traditional translation memory by transforming it into a "document" in a LiveDocs corpus. This is because in most cases, substantial portions of the translation memory will have its individual segment records stored in document order; if the content is exported as a TMX file or tab-delimited text file and then imported as a bilingual document in a LiveDocs corpus, the result will be almost as if the original translations had been aligned and saved, and from a concordance hit one can open the bilingual content directly and read the parts before and after the text found in the concordance search.
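Where only a TMX export is at hand and a tab-delimited file is wanted, the conversion can be sketched in Python while keeping the translation units in the order they are stored. This is a rough illustration, not a memoQ feature: the function names are hypothetical, the language codes must match those in the file exactly, and any text stored inside inline tag elements is kept as plain text.

```python
import csv
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this expanded name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def tmx_to_rows(tmx_text, source_lang, target_lang):
    """Yield (source, target) pairs in the order the units appear in the TMX."""
    root = ET.fromstring(tmx_text)
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() flattens inline elements into plain text.
                segs[lang] = "".join(seg.itertext()).strip()
        src, tgt = segs.get(source_lang.lower()), segs.get(target_lang.lower())
        if src and tgt:
            yield src, tgt

def rows_to_tsv(rows):
    """Write the pairs as tab-delimited text, one translation unit per line."""
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)
    return out.getvalue()

sample = """<tmx version="1.4"><header/><body>
<tu><tuv xml:lang="de"><seg>Erster Satz.</seg></tuv><tuv xml:lang="en"><seg>First sentence.</seg></tuv></tu>
<tu><tuv xml:lang="de"><seg>Zweiter <bpt i="1"/>Satz<ept i="1"/>.</seg></tuv><tuv xml:lang="en"><seg>Second sentence.</seg></tuv></tu>
</body></tmx>"""

rows = list(tmx_to_rows(sample, "de", "en"))
print(rows_to_tsv(rows))
```

Nothing is sorted or deduplicated on purpose: the value of the exercise lies in preserving whatever document order the translation memory still has.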

Legal translation can involve text conversion in a broad sense in many ways. Legal translators must often deal with hardcopy or faxed material or scanned files created from these. Often documents to translate and reference documents are provided in portable document format (PDF), in which finding and editing information can be difficult. Using special software, these texts can be converted into documents which can be edited, and portions can be copied, pasted and overwritten easily, or they can be imported in translation assistance platforms such as SDL Trados Studio, Wordfast or memoQ. (Some of these environments include integrated facilities for converting PDF texts, but the results are seldom as suitable for work as PDF or scanned files converted with optical character recognition software such as ABBYY FineReader or OmniPage.)

Software tools like ABBYY FineReader can also convert "dead" scanned text images into searchable documents. This will even work with bad contrast or color images in the background, making it easier, for example, to look for information in mountains of scanned documents used in legal discovery. Text-on-image files like the example shown above completely preserve the layout and image context of the text to be read in the best way. I first discovered and used this option while writing a report for a client in which I had to reference sections of a very long, scanned policy document from the European Parliament. It was driving me crazy to page through the scanned document to find information I wanted to cite but where I had failed to make notes during my first reading. Converting that scanned policy to a searchable PDF made it easy to find what I needed in seconds and accurately cite its page number, etc. Where there is text on pictures, difficult contrast and other features this is often far better for reference purposes than converting to an MS Word document, for example, where the layouts are likely to become garbled.

Software tools for translation can also make text in many other original formats accessible to translators in an ergonomically simpler form, also ensuring, where necessary, that no text is overlooked because of a complicated layout or because it is in an easily overlooked footnote or margin note. Text import filters in translation environments make it easy to read and translate the words in a uniform working environment, with many reference tools and other help available, and then render the translated text back into its original format or some more useful bilingual format.

An excerpt of translated patent claims exported as a bilingual table for review

Technology also offers many possibilities for identifying, recording and controlling relevant terminology in legal translation work.

Large quantities of text can be analyzed quickly to find the most frequent special vocabulary likely to be relevant to the translation work and save these in project glossaries, often enabling that work to be organized better with much of the clarification of terms taking place prior to translation. This is particularly valuable in large projects where it may be advisable to ensure that a team of translators all use the same terms in the target language to avoid possible confusion and misunderstanding.
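To make the idea of frequency-based term mining concrete, here is a minimal Python sketch. It is not how memoQ or any other tool performs extraction; the function name `term_candidates`, the tiny stopword list and the thresholds are assumptions made purely for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption); real tools use much longer ones.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "or", "is", "are",
             "for", "by", "on", "with", "that", "this", "be", "as", "when"}

def term_candidates(text, max_words=3, min_count=2):
    """Return (phrase, count) pairs for word sequences that recur in the text."""
    # Words are runs of letters, optionally joined by hyphens.
    words = re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", text.lower())
    counts = Counter()
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            # Skip phrases that start or end with a stopword.
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            counts[" ".join(gram)] += 1
    return [(p, c) for p, c in counts.most_common() if c >= min_count]

sample = ("The security interest is perfected by filing. "
          "A security interest in the collateral attaches when value is given. "
          "The security interest survives.")
for phrase, count in term_candidates(sample):
    print(count, phrase)
```

The candidate list still needs a human eye: frequency only suggests what might be a term, and less frequent but important expressions have to be picked out by hand.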

Glossaries created in translation assistance tools can provide terminology hints during work and even save keystrokes when linked to predictive, "intelligent" writing features.

Integrated quality checking features in translation environments enable possible deviations of terminology or other issues to be identified and corrected quickly.

Technical features in working software for translation allow not only desirable terms to be identified and elaborated; they also enable undesired terms to be recorded and avoided. Barred terms can be marked as such while translating or automatically identified in a quality check.
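A check for barred terms can be sketched in a few lines of Python. This is only an illustration of the principle, not the QA module of any particular tool; the function `find_barred_terms` and the sample term pairs are hypothetical.

```python
import re

def find_barred_terms(segments, barred):
    """Return (segment_number, barred_term, preferred_term) for each hit.

    `barred` maps a forbidden term to the preferred replacement.
    """
    hits = []
    for number, segment in enumerate(segments, start=1):
        for term, preferred in barred.items():
            # Whole-word, case-insensitive match.
            if re.search(rf"\b{re.escape(term)}\b", segment, flags=re.IGNORECASE):
                hits.append((number, term, preferred))
    return hits

translation = [
    "The apparatus comprises a housing.",
    "Said device includes a lid.",
]
barred = {"device": "apparatus", "includes": "comprises"}
for number, term, preferred in find_barred_terms(translation, barred):
    print(f"Segment {number}: '{term}' is barred, use '{preferred}'")
```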

A patent glossary exported from memoQ and then made into a PDF dictionary via SDL Trados MultiTerm

Technical tools enable terminology to be shared in many different ways. Glossaries in appropriate formats can be moved easily between different environments to share them with others on a team which uses diverse technologies; they can also be output as spreadsheets, web pages or even formatted dictionaries (as shown in the example above). This can help to ensure consistency over time in the terms used by translators and attorneys involved in a particular case.

There are also many different ways that terminology can be shared dynamically in a team. Various terminology servers available usually suffer from being restricted to particular platforms, but freely available tools like Google Sheets coupled with web look-up interfaces and linked spreadsheets customized for importing into particular environments can be set up quickly and easily, with access restricted to a selected team.

The links in the screenshot above show a simple example using some data from SAP. There is a master spreadsheet where the data is maintained and several "slavesheets" designed for simple importing into particular translation environment tools. Forms can also be used for simplified data entry and maintenance.

If Google Sheets do not meet the confidentiality requirements of a particular situation, similar solutions can be designed using intranets, extranets, VPNs, etc.

Technical tools for translators can help to locate information in a great variety of environments and media in ways that usually integrate smoothly with their workflow. Some available tools enable glossaries and bilingual corpora to be accessed in any application, including word processors, presentation software and web pages.

Corpus information in translation memories, memoQ LiveDocs or external sources can be looked up automatically or in concordance searches based on whole or partial content matches or specified search terms, and then useful parts can be inserted into the target text to assist translation. In some cases, differences between a current source text and archived information are highlighted to assist in identifying and incorporating changes.
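For readers who have never seen a concordance search, the following Python sketch shows the basic principle on a list of source/target pairs. The `concordance` function and the two sample pairs are assumptions for illustration; real tools add fuzzy matching, indexing and much more.

```python
import re

def concordance(tm, term, width=30):
    """Return (keyword-in-context line, target text) for each hit in the source."""
    pattern = re.compile(re.escape(term), flags=re.IGNORECASE)
    hits = []
    for source, target in tm:
        for match in pattern.finditer(source):
            left = source[max(0, match.start() - width):match.start()]
            right = source[match.end():match.end() + width]
            # Right-align the left context so the keywords line up in a column.
            hits.append((f"{left:>{width}}[{match.group()}]{right}", target))
    return hits

tm = [
    ("Der Mieter zahlt die Miete monatlich.", "The lessee shall pay rent monthly."),
    ("Die Miete ist am Ersten fällig.", "Rent is due on the first."),
]
for context, target in concordance(tm, "Miete", width=20):
    print(context, "=>", target)
```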

Structured information such as dates, currency expressions, legal citations and bibliographical references can also be prepared for simple keystroke insertion in the translated text or automated quality checking. This can save many frustrating hours of typing and copy revision. In this regard, memoQ currently offers the best options for translation with its "auto-translation" rulesets, but many tools offer rules-based QA facilities for checking structured information.
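The principle behind such rules is pattern matching with regular expressions. The Python sketch below converts numeric German-style dates to an English format; it is not memoQ's ruleset syntax, just the same idea in a general-purpose language, and the names `DATE` and `convert_dates` are hypothetical.

```python
import re

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Matches numeric dates such as 21.10.2016 or 3.1.2017 (day.month.year).
DATE = re.compile(r"\b(\d{1,2})\.(\d{1,2})\.(\d{4})\b")

def convert_dates(text):
    """Rewrite day.month.year dates as 'day Month year'."""
    def replace(match):
        day, month, year = int(match.group(1)), int(match.group(2)), match.group(3)
        if not (1 <= month <= 12 and 1 <= day <= 31):
            return match.group(0)  # leave implausible matches untouched
        return f"{day} {MONTHS[month - 1]} {year}"
    return DATE.sub(replace, text)

print(convert_dates("Urteil vom 21.10.2016 und Beschluss vom 3.1.2017"))
```

The same approach extends to currency expressions or legal citations, though each pattern needs careful testing against real texts before it can be trusted.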

Voice recognition technologies offer ergonomically superior options for transcription in many languages and can often enable heavy translation workloads with short deadlines to be handled with greater ease, maintaining or even improving text quality. Experienced translators with good subject matter knowledge and voice recognition software skills can typically produce more finished text in a day than the best post-editing operations for machine pseudo-translation, with the exception that the text produced by human voice transcription is actually usable in most situations, while the "gloss" added to machine "translations" is at best lipstick on a pig.

Reviewing a text for errors is hard work, and a pressing deadline to file a brief doesn't make the job easier. Technical tools for translation enable tens of thousands of words of text to be scanned for particular errors in seconds or minutes, ensuring that dates and references are correct and consistent, that correct terminology has been used, et cetera.
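One of the simplest automated checks compares the numbers in each source segment with those in its translation. The Python sketch below is a crude illustration under stated assumptions: the function `number_mismatches` is hypothetical, and it compares digits only, so that differing decimal and thousands separators do not raise false alarms.

```python
import re

NUMBER = re.compile(r"\d+(?:[.,]\d+)*")

def number_mismatches(pairs):
    """Return the 1-based numbers of segment pairs whose numbers differ."""
    flagged = []
    for number, (source, target) in enumerate(pairs, start=1):
        # Keep only the digits so that 1.000,50 and 1,000.50 count as equal.
        src = sorted(re.sub(r"\D", "", n) for n in NUMBER.findall(source))
        tgt = sorted(re.sub(r"\D", "", n) for n in NUMBER.findall(target))
        if src != tgt:
            flagged.append(number)
    return flagged

pairs = [
    ("Zahlung von 1.000,50 EUR gemäß § 5", "Payment of EUR 1,000.50 pursuant to Section 5"),
    ("Frist von 14 Tagen", "Period of 41 days"),
]
print(number_mismatches(pairs))
```

A check this simple will also flag legitimate changes, such as a numeric date written out with the month as a word, so its results are prompts for a human reviewer and nothing more.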

The best tools even offer sophisticated features for tracking changes, differences in source and target text versions, even historical revisions to a translation at the sentence level. And tools like SDL Trados Studio or memoQ enable a translation and its reference corpora to be updated quickly and easily by importing a modified (monolingual) target text.

When time is short and new versions of a source text may follow in quick succession, technology offers possibilities to identify differences quickly, automatically process the parts which remain unchanged and keep everything on track and on schedule.
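The underlying comparison can be illustrated with Python's standard `difflib` module. This is a minimal sketch, assuming both versions have already been split into segments; the function `changed_segments` is hypothetical, and translation environments do this with their own, more refined matching.

```python
import difflib

def changed_segments(old, new):
    """Split the segments of a new version into unchanged and changed ones."""
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    unchanged, changed = [], []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # 'equal' blocks can reuse the existing translation; everything else needs work.
        bucket = unchanged if tag == "equal" else changed
        bucket.extend(new[j1:j2])
    return unchanged, changed

version_1 = ["The seller delivers the goods.", "Payment is due in 30 days.", "German law applies."]
version_2 = ["The seller delivers the goods.", "Payment is due in 14 days.", "German law applies.",
             "The place of jurisdiction is Berlin."]
unchanged, changed = changed_segments(version_1, version_2)
print("Reuse:", unchanged)
print("Translate or revise:", changed)
```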

For all its myriad features, good translation technology cannot replace human knowledge of language and subject matter. Those claiming the contrary are either ignorant or often have a Trumpian disregard for the truth and common sense and are all too eager to relieve their victims of the burdens of excess cash without giving the expected value in exchange.

Technologies which do not assist translation experts to work more efficiently or with less stress in the wide range of challenges found in legal translation work are largely useless. This really does include machine pseudo-translation (MpT). The best “parts” of that swindle are essentially the corpus matching for translation memory archives and corpora found in CAT tools like memoQ or SDL Trados Studio, and what is added is often incorrect and dangerously liable to lead to errors and misinterpretations. There are also documented, damaging effects on one’s use of language when exposed to machine pseudo-translation for extended periods.

Legal translation professionals today can benefit in many ways from technology to work better and faster, but the basis for this remains what it was ten, twenty, forty or a hundred years ago: language skill and an understanding of the law and legal procedure. And a good, sound, well-rested mind.

*******

Further references

Speech recognition

Dragon NaturallySpeaking: https://www.nuance.com/dragon.html
Tiago Neto on applications: https://tiagoneto.com/tag/speech-recognition
Translation Tribulations – free mobile for many languages: http://www.translationtribulations.com/2015/04/free-good-quality-speech-recognition.html
Circuit Magazine - The Speech Recognition Revolution: http://www.circuitmagazine.org/chroniques-128/des-techniques
The Chronicle - Speech Recognition to Go: http://www.atanet.org/chronicle-online/highlights/speech-recognition-to-go/
The Chronicle - Speech Recognition Is in Your Back Pocket (or Wherever You Keep Your Mobile Phone): http://www.atanet.org/chronicle-online/none/speech-recognition-is-in-your-back-pocket-or-wherever-you-keep-your-mobile-phone/

Document indexing, search tools and techniques

Archivarius 3000: http://www.likasoft.com/document-search/
Copernic Desktop Search: https://www.copernic.com/en/products/desktop-search/
AntConc concordance: http://www.laurenceanthony.net/software/antconc/
Multiple, separate concordances with memoQ: http://www.translationtribulations.com/2014/01/multiple-separate-concordances-with.html
memoQ TM Search Tool: http://www.translationtribulations.com/2014/01/the-memoq-tm-search-tool.html
memoQ web search for images: http://www.translationtribulations.com/2016/12/getting-picture-with-automated-web.html
Upgrading translation memories for document context: http://www.translationtribulations.com/2015/08/upgrading-translation-memories-for.html
Free shareable, searchable glossaries with Google Sheets: http://www.translationtribulations.com/2016/12/free-shareable-searchable-glossaries.html

Auto-translation rules for formatted text (dates, citations, etc.)

Translation Tribulations, various articles on specifications, dealing with abbreviations & more: http://www.translationtribulations.com/search/label/autotranslatables
Marek Pawelec, regular expressions in memoQ: http://wasaty.pl/blog/2012/05/17/regular-expressions-in-memoq/

Authoring original texts in CAT tools

Translation Tribulations: http://www.translationtribulations.com/2015/02/cat-tools-re-imagined-approach-to.html

Autocorrection for typing in memoQ

Translation Tribulations: http://www.translationtribulations.com/2014/01/memoq-autocorrect-update-ms-word-export.html

May 27, 2017

CAT tools for weapons license study

More than a decade ago I found a very useful book on practical corpus linguistics, which has had perhaps the greatest impact of any single thing on the way I approach terminology. Among other things, it discusses how to create special text collections for particular subjects and then mine these for frequently used expressions in those domains. It has become a standard recommendation in my talks at professional conferences and universities as well as in private consultations for terminology.

Slide from my recent talk at the Buenos Aires University Facultad de Derecho

In the last two weeks I had an opportunity to test my recommendations in a somewhat different way than I usually apply them. Typically I use subject-specific corpora in English (my native language) to study the "authentic" voice of the expert in a domain that may be related to my own technical specialties but which differs in its use of language in significant ways. This time I used it and other techniques to study subject matter I master reasonably well (the features, use and safety aspects of firearms for hunting) with the aim of acquiring vocabulary and an idea of what to expect for a weapons qualification test in Portugal, where I have lived for several years but have not yet achieved satisfactory competence in the language for my daily routine.

It all started two weeks ago when I attended an all-day course on Portugal's firearm and other weapon laws in Portalegre. Seven and a half solid hours of lecture left me utterly fatigued at the end of the day, but it was an interesting one in which I had a lot of aha! moments as I saw a lot of concepts presented in Portuguese which I knew well in German and English. Most of the time I looked up words I saw in the slides or in the course textbook prepared by the PSP and made pencil notes on vocabulary in my book.

Twelve days afterward I was scheduled to take a written test, and in the unlikely event that I passed it, I was supposed to be subject to a practical examination on the safe use of hunting firearms and related matters.

Years ago when I studied for a hunting license in Germany I had hundreds of hours of theoretical and practical instruction in a nine-month course concurrent with a one-year understudy with an experienced hunter. Participants in a German hunting course typically read dozens of supplemental books and study thousands of sample questions for the exam.

The pickings are a little slimmer in Portugal.

As far as I am aware, there are no study guides in Portuguese or any other language to help prepare for the weapons tests, except the slim book prepared by the police.

There are, however, a number of online forums where people talk about their experiences in the required courses and on the tests. Sometimes there are sample questions reproduced with varying degrees of accuracy, and there is a lot of talk about things which people found particularly challenging.

So I copied and pasted these discussions into text files and loaded them into a memoQ project for Portuguese to English translation. The corpus was not particularly large (about 4000 words altogether), so the number of candidates found in a statistical survey was limited, but still useful to someone with my limited vocabulary. I then proceeded to translate about half of the corpus into English, manually selecting less frequent but quite important terms and making notes on perplexing bits of grammar or tricks hidden in the question examples.

A glossary in progress as I study for my Portuguese weapons license

The glossary also contained some common vocabulary that one might legitimately argue does not belong in a specialist glossary, but since these were common words likely to occur in the exam and I did not know them, it was entirely appropriate to include them.

Other resources on the subject are scarce; I did find a World War II vintage military dictionary for Portuguese and English which can easily be made into a searchable PDF using ABBYY FineReader or other tools but not much else.

Any CAT tool would have worked equally well for my learning objectives - the free tools AntConc and OmegaT are in no way inferior to what memoQ offered me.

On the day of the test, I was allowed to bring a Portuguese-to-English dictionary and a printout of my personal glossary. However, the translation work that I did in the course of building the glossary had imprinted the relevant vocabulary rather well on my mind, so I hardly consulted either. I was tired (having hardly slept the night before) and nervous (so that I mixed up the renewal intervals for driver's licenses and hunting licenses), and I just didn't have the stamina to pick apart some particularly long, obtuse sentences, but in the end I passed with a score of 90% correct. That wouldn't win me any kudos with a translation customer, but it allowed me to go on to the next phase.

Practical shooting test at the police firing range

In the day of lectures, I dared to ask only one question, and I garbled it so badly that the instructor really didn't understand, so I was not looking forward to the oral part of the exam. But much to my surprise, I understood all the instructions on exam day, and I was even able to joke with the policeman conducting the shooting test. In the oral examination in which I had to identify various weapons and ammunition types and explain their use and legal status, and in the final part where I went on a "hunt" with a police commissioner to demonstrate that I could handle a shotgun correctly under field conditions and respond appropriately to a police check, I had no difficulties at all except remembering the Portuguese word for "trigger lock". All the terms I had drilled for passive identification in the written exam had unexpectedly become active vocabulary, and I was able to hold my own in all the spoken interactions - not a usual experience in my daily routine.

The same professional tools and techniques that I rely on for my daily work proved far better than expected as learning aids for my examination, and their benefits extended much further than I had anticipated. I am confident that a similar application could be helpful in other areas where I am not very competent in my understanding and active use of Portuguese.

If it works for me, it is reasonable to assume that others who must cope with challenges of a test or interactions of some kind in a foreign language might also benefit from learning with a translator's working tools.

Oct 21, 2016

A day in the life....

One of the things I enjoy most about professional translation is the range of activities and subject matters that one can encounter, even as a specialist in a few domains. I can't say the work is never boring, but when it does drift that way, very suddenly it isn't any more. Quite unpredictably.

Yesterday I typed translations. A bit more than expected after two sets of PowerPoint slides - a small one to translate from German and another to edit the rather acceptable English - turned out to have about 8,000 words of highly specialized slide notes about military command and control structures and the technology of fighting forest fires. (Note to self: no matter how busy you are, always import those presentations into memoQ with the options set to extract every kind of text as well as the bitmap graphics if you have to translate those too. Then do a word count! Appearances can be deceiving.)

Yesterday I dictated translations. The job started out as a bunch of text fragments from slides, where context über alles was the rule, lots of terminology required research, and voice recognition offered no particular advantages, then suddenly it became the translation of a rather long lecture using all that new terminology, and the deadline was tighter than thumbscrews operated by an angry ex-girlfriend. Dragon NaturallySpeaking to the rescue. Not only was this necessary to finish the text in a long workday rather than most of a week, but the more natural style of translation by dictation suited the purpose of the translated presentation particularly well. I could imagine myself in the room with equipment vendors, military commanders, firefighting specialists and freight forwarders, talking about the challenges faced and the technology required to avoid the tragedies of an out-of-control firestorm. And the words came out, transcribed from my voice directly into the target text fields of memoQ, exactly as they should be spoken to that audience. And at the end of that long day my hands still had feeling in them, which would not have been the case if I had typed even a third of the text.

Yesterday I made a specialized glossary to share with a presenter who will travel halfway around the world to lecture with the slides I translated for his talk. Long ago I discovered that the way I produce translations has the potential to provide additional benefits for those who will use my work. Sales representatives might need to write letters to their prospects, discussing their products in a language not mastered as a native, and the vocabulary from my work may help them to improve communication and avoid confusion that might result from using incorrect or simply different words to describe the same stuff. Or an attorney might need a quick overview of the language I used to translate the pleading she intends to file, to ensure that it is consistent with previous efforts and will not complicate discussions with her client. The terminology I research and record for each translation can be exported and reformatted quickly to produce glossaries or more complex dictionaries in a variety of formats suited for purpose. Little time and often a lot of benefits for my clients.

Yesterday I translated bitmap graphics and not only had to deal with the editing tools for that but also had to consider the best strategy for transforming the original German graphics into English ones. Would those charts be translated again into other languages? Would the graphics be re-used in other types of documents, so that I should consider ease of portability in my approach to the translation? And how the Hell do I actually use that new bitmap graphics transcription and substitution for Microsoft Office files which was added to memoQ some time ago and sort out the five charts to translate from the fifty to ignore? (Maybe I should blog the solutions some day.)

And yesterday I was asked to write summaries of large, badly scanned articles so that the equipment manufacturer would understand how its latest technology was discussed by German reviewers. As a kid I had a silly fantasy about getting paid to read, and this is just one of the many ways it unexpectedly came true. But before I got that far, these scanned files needed to be reworked so that they could be read and searched on the screen, so as I described in a guest post on another blog some years ago, I converted them to searchable PDF/A with ABBYY FineReader, which in this case also reduced their size by about 75%. The video below also shows how this works. Strangely, when I describe this procedure to other translators, many of them don't get it, and they go on about converting PDF files into editable MS Word files or plain text, or, God help them, something really stupid like importing PDF files directly into a CAT tool for translation, though none of this really relates to my purpose. Conversions often contain errors, and many texts are harder to interpret when the context of an accurate layout is lost. So "text-on-image" PDF files for translation reference to the original source files are often critical, and for files to summarize or consult sporadically for reference (with many pages to look at and essentially nothing to translate), a searchable PDF is the gold standard for efficient work.

In the course of that day I had to work with two computers linked by remote access using four networks at various times, working in German, English and Portuguese (the latter mostly involving questions to the housekeeper on how to do an online pizza delivery order so I could stay in the office and keep working). I used well over a dozen software applications for necessary tasks. These, and the environments in which they operate, must be balanced carefully for efficient work. And even after some months in my new office, the balance isn't quite as good as I've had it before, and more attention to ergonomics is required.

Some colleagues are nostalgic for the "good old days" when they received a stack of paper to translate and sent off another stack of paper when the work was done, and they had a filing cabinet or a shelf of notebooks full of old work to use as reference material, and boxes of index cards stuffed full of scribbled notes on terminology next to seldom-dusty specialist dictionaries prepared by presumed experts, often full of marginalia commenting on errors or omissions and stuffed with papers bearing other scribbled notes. Not me. Since the day 30 years ago when I laboriously typed a text file full of file folder numbers and content descriptions for my research work and personal papers I have been a big believer in electronic retrieval of information wherever possible, and I miss retyping botched pages just as little as I miss the lines in the post office or the stress of dealing with delivery services.

I suspect that some feel a loss of control with the advent of new technologies in an old profession, and certainly the changes in the business environment for translation since the days of the typewriter often require a very different mentality to survive and thrive. What that mentality is, exactly, is a matter of healthy debate and often misunderstanding - again, because of the great diversity of the profession and the professionals and unprofessionals in it.

The greatest challenges of new technologies that I find are the same as those faced in many other kinds of work and in modern life in general. Filtering the overabundance of input for the few things that are truly of use or interest and maintaining focus and calm amidst omnipresent distractions. Not relying too much on technologies that are far more fallible than most people, even experts, realize or acknowledge. And remembering that a fool with a tool, however many features and failsafes it may offer, remains a fool.
