
Nov 25, 2023

Book review: "Terminology Extraction for Translation and Interpretation Made Easy"


A few months ago, I received a pre-release copy of this book as a courtesy from the author, terminologist Uwe Muegge, with a request to give a quick language check to the English used by its native German author. As I expected, there wasn't much to complain about, because he has lived in the US for a long time, taught at university there and held important corporate roles. I was particularly pleased by his disciplined style of writing, the plain, consistent English of the text and the overall clarity of the presentation. Anyone with good basic English skills should have no difficulty understanding and applying the material.

At the time I read the draft, I was completely focused on language use and style, but I found his approach and suggestions interesting, so I looked forward to "field testing" and direct comparisons with my usual approach to terminology mining with that feature in memoQ. About a day's worth of tests showed very interesting potential for applying the ChatGPT section of the book and also made the context and relevance of the other two sections clearer. I will discuss those sections first before getting to the part that interests me the most.

Uwe presents three approaches: word lists with WebCorp's Wordlist tool, term extraction with OneClick Terms (based on Sketch Engine), and prompting ChatGPT.
I wouldn't really call these three approaches alternatives, as the book does, because all three operate in very different ways and are fit for different purposes. That didn't register fully in my mind when I was in "editor mode", although the first part of the book made the differences, advantages and disadvantages clear enough. But as soon as I began using each of the sites, the differences were quite apparent, as were the similarities to more familiar tools like memoQ's term extraction module.

Wordlist from Webcorp is functionally similar to Laurence Anthony's AntConc or memoQ's term extraction. It's essentially useful for getting frequency lists of words, but the inability to use my own stopword lists for filtering out uninteresting common vocabulary makes me prefer my accustomed desktop tools. However, the barriers to first acquaintance or use are lower than for AntConc or memoQ, so this would probably be a better classroom tool for introducing concepts of word frequencies and identifying possible useful terminology on that basis.
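That kind of stopword filtering is simple enough to reproduce locally; here is a minimal Python sketch of the idea (the sample text and stopword set are invented for illustration):

```python
from collections import Counter
import re

def frequency_list(text, stopwords, min_count=2):
    """Count word frequencies, ignoring words on a custom stopword list."""
    words = re.findall(r"[a-zA-ZäöüÄÖÜß]+", text.lower())
    counts = Counter(w for w in words if w not in stopwords)
    return [(w, n) for w, n in counts.most_common() if n >= min_count]

stop = {"the", "a", "of", "and", "to", "in", "is"}
sample = "The extraction of terminology and the review of terminology candidates."
print(frequency_list(sample, stop, min_count=1))
```

A real run would read a whole document collection and a full stopword file, of course; the point is only that the filter is your own list, not one imposed by the tool.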

OneClick Terms was interesting to me mostly because friends and acquaintances in academia talk about Sketch Engine a lot. The results were similar to what I get with memoQ, including similar multiword "trash terms". I found the feature for term extraction from bilingual texts particularly interesting, and the fact that it can work well on the TMX files distributed by the Directorate-General for Translation (DGT) of the European Commission suggests that it could be an efficient tool for building glossaries to support translation referencing EU legislation, for example, though I expect only slight advantages over my usual routine with memoQ. These advantages are not worth the monthly subscription fee to me. However, for purposes of teaching and comparison, the inclusion of this platform in the book is helpful. I see more value for academic institutions and those rare large volume translation companies that do a lot of work with EU resources. 

ChatGPT was an interesting surprise. I have a very low opinion of its use as a writing tool (mediocre on its best day, clumsy and boring in nearly all its output) or for regex composition (hopelessly incompetent for what I need, and anything it does right for regex is newbie stuff for which I need no support). However, as a terminology research tool it has shown excellent potential, though formatting the results can be problematic.

My testing was done with the free ChatGPT 3.5, not a Professional subscription with access to version 4.0. However, I am sorely tempted to try the subscription version to see whether it handles some formatting instructions (avoiding unnecessary capitalization) more efficiently. No matter how carefully I try to stipulate no default capitalization of the first letter of every expression, I inevitably have to repeat the instruction after a list of improperly capitalized candidate terms is created.

I keep an e-book copy of Uwe's book in the Kindle app on my laptop, so I can simply copy and paste his suggested prompts, then add whatever additional instructions I want.

The prompt
Please examine the text below carefully and list words or expressions which may be difficult to translate, but when writing the list, do not capitalize any words or expressions which don't require capitalization.

is too long; only the first instruction (the extraction request) is executed reliably, and the capitalization request is ignored, but this follow-up prompt will fix the capitalization in the list:

Please re-examine that text and this time when writing the list, do not capitalize any words or expressions which do not require capitalization.

Further tests involved suggesting translations for the expressions, with or without a translated text, and building tables with example sentences.

Other prompt variations, for example to write terms bold in the example sentences, worked without complications.
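For reuse outside the chat window, the candidate lists ChatGPT returns (typically numbered or bulleted lines) can be cleaned up with a few lines of script. This is my own sketch, assuming that output shape; it is not from the book:

```python
import re

def parse_term_list(reply):
    """Extract term candidates from a numbered or bulleted ChatGPT reply."""
    terms = []
    for line in reply.splitlines():
        # strip leading bullets ("-", "*", "•") or numbering ("1.", "2)")
        cleaned = re.sub(r"^\s*(?:[-*•]|\d+[.)])\s*", "", line).strip()
        if cleaned:
            terms.append(cleaned)
    return terms

reply = "1. Abnahmeprüfung\n2. Werkstoffkennwert\n- trigger lock"
print(parse_term_list(reply))  # a clean Python list of the three terms
```

The cleaned list can then be pasted into a spreadsheet or saved for import into terminology tools.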

What about the quality of the selections? Well, I used memoQ's term extraction module on the same text I submitted to ChatGPT for term extraction, in order to compare this new process with something I know quite well.

memoQ identified a few terms based on frequency, which ChatGPT ignored, but these were arguably terms that a qualified specialist would have known anyway. And ChatGPT did a superior job of selecting multi-word expressions with no "noise". It also selected some very relevant single-occurrence phrases which might be expected to arise more in later, similar texts.

Split screen review of memoQ extraction vs. ChatGPT results

The split screenshot is an intermediate result from one of my many tests. The overlaid red box was intended to show a conversation partner the limits of ChatGPT's "alphabetizing skill", and the capitalization of the German is not correct because a prompt to correct the capitalization of adjectives misfired. It is not always trivial to get formatting exactly as I want it. However, looking at the results of each program side by side like this showed me that ChatGPT had in fact identified nearly all of the most relevant single words and phrases in my text. And for other texts with dates or citation formats, these were also collected by ChatGPT as "relevant terms", giving me an indication of what legislation I might want to use as reference documents and what auto-translation rules might also be helpful.

I also found that the split view as above helped me to work my way through the noise in the memoQ term candidate list much faster and make decisions about which terms to accept. The terms of interest found in memoQ but not selected by ChatGPT were few enough that I am not at all tempted to suggest people follow my traditional approach with the memoQ term extraction module and skip the work with ChatGPT.

My preferred approach would be to do a quick screening in ChatGPT, import the results into a provisional (?) term base and then, as time permits, use that resource in a memoQ term extraction to populate the target fields in the extraction grid. With those populated terms in place, I think the review of the remaining candidates would proceed much more efficiently.
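Since memoQ can import term bases from tab-delimited files, moving a screened ChatGPT list into a provisional term base could be scripted along these lines (the column headers, file name and term pairs here are invented for illustration):

```python
import csv

def write_termbase_tsv(pairs, path):
    """Write (source, target) term pairs as a tab-delimited file
    suitable for import into a provisional term base."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["English", "German"])  # header row; adjust to your language pair
        writer.writerows(pairs)

pairs = [("trigger lock", "Abzugsschloss"),
         ("term extraction", "Terminologieextraktion")]
write_termbase_tsv(pairs, "provisional_terms.txt")
```

The resulting file maps straightforwardly onto term base fields during import, after which the term base can serve as the seed resource for a memoQ extraction session.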

All in all, I found Uwe's book to be a useful reference for teaching and for my personal work; it is one of the few texts I have seen on LLM use which is sober and modest enough in its claims that I was inspired to test them. The sale price is also well within anyone's means: about $10 for the e-book and $16 for the paperback on Amazon. For the "term curious" without access to professional grade tools, it's a great place to get started building better glossaries and for more seasoned wordworkers it offers interesting, probably useful suggestions.

The book is available from Amazon.

Dec 29, 2018

memoQ Terminology Extraction and Management

Recent versions of memoQ (8.4+) have seen quite a few significant improvements in recording and managing terminology in translation and review projects. These include:
  • Easier inclusion of context examples for use (though this means that term information like source should be placed in the definition field so it is not accidentally lost)
  • Microsoft Excel import/export capabilities which include forbidden terminology marking with red text - very handy for term review workflows with colleagues and clients!
  • Improved stopword list management generally, and the inclusion of new basic stopword lists for Spanish, Hungarian, Portuguese and Russian
  • Prefix merging and hiding for extracted terms
  • Improved features for graphics in term entries - more formats and better portability
Since the introduction of direct keyboard shortcuts for writing to the first nine ranked term bases in a memoQ project (as part of the keyboard shortcuts overhaul in version 7.8), memoQ has offered perhaps the most powerful and flexible integrated term management capabilities of any translation environment despite some persistent shortcomings in its somewhat dated and rigid term model. But although I appreciate the ability of some other tools to create customized data structures that may better reflect sophisticated needs, nothing I have seen beats the ease of use and simple power of memoQ-managed terminology in practical, everyday project use.

An important part of that use throughout my nearly two decades of activity as a commercial translator has been the ability to examine collections of documents - including but not limited to those I am supposed to translate - to identify significant subject matter terminology in order to clarify these expressions with clients or coordinate their consistent translations with members of a project team. The introduction of the terminology extraction features in memoQ version 5 long ago was a significant boost to my personal productivity, but that prototype module remained unimproved for quite a long time, posing significant usability barriers for the average user.

Within the past year, those barriers have largely fallen, though sometimes in ways that may not be immediately obvious. And now practical examples to make the exploration of terminology more accessible to everyone have good ground in which to take root. So in two recent webinars, I shared my approach - in German and in English - to how I apply terminology extraction in various client projects or to assist colleagues. The German talk included some of the general advice on term management in memoQ which I shared in my talk last spring, Getting on Better Terms with memoQ. That talk included a discussion of term extraction (aka "term mining"), but more details are available in the webinar recordings:


Due to unforeseen circumstances, I didn't make it to the office (where my notes were) to deliver the talk, so I forgot to show the convenience of access to the memoQ concordance search of translation memories and LiveDocs corpora during term extraction, which often greatly facilitates the identification of possible translations for a term candidate in an extraction session. This was covered in the German talk.

All my recent webinar recordings - and shorter videos, like playing multiple term bases in memoQ to best advantage - are best viewed directly on YouTube rather than in the embedded frames on my blog pages. This is because all of them since earlier in 2018 include time indexes that make it easier to navigate the content and review specific points rather than listen to long stretches of video and search for a long time to find some little thing. This is really quite a simple thing to do, as I pointed out in a blog post earlier this year, and it's really a shame that more of the often useful video content produced by individuals, associations and commercial companies to help translators is not indexed this way to make it more useful for learning.

There is still work to be done to improve term management and extraction in memoQ, of course. Some low-hanging fruit here might be expanded access to the memoQ web search feature in the term extraction as well as in other modules; this need can, of course, be covered very well by excellent third-party tools such as Michael Farrell's IntelliWebSearch. And the memoQ Concordance search is long overdue for an overhaul to allow proper filtering of concordance hits (by source, metadata, etc.), more targeted exploration of collocation proximities and more. But my observations of the progress made by the memoQ planning and development team in the past year give me confidence that many good things are ahead, and perhaps not so far away.

Dec 4, 2018

Optimizing memoQ terminology extraction

On December 28, 2018 from 2:00 to 3:30 pm Lisbon time (3:00 to 4:30 pm CET, 9:00 to 10:30 am EST), I'll be giving a talk on terminology extraction in the latest version of memoQ. Recent versions of this tool have included many improvements to its terminology features, and it's time for an update on how to get the most out of the term extraction features of memoQ among other things.

Topics to be covered include the creation of new stopword lists or the extension of existing ones, customer-, project- or topic-specific stopword lists, criteria for corpora, term mining strategies and the subsequent maintenance and use of term bases in projects. Participants will be equipped with all the information needed to use this memoQ feature confidently, reliably and profitably in their professional work.

The webinar is free, but registration is required. To register, go to:
https://zoom.us/meeting/register/cfd1a47cd5c54114d746f627e8486654

The same presentation (more or less) will be held in German on December 21 at the same time for those who prefer to hear and discuss the topic in that language.

Dec 3, 2018

Terminology extraction with memoQ: the latest possibilities


On December 21 from 3:00 to 4:30 pm CET, another German-language memoQ training session will take place online. Topic: optimizing terminology extraction. The talk offers an overview of the options for working efficiently with memoQ's terminology extraction module. From creating new stopword lists or extending existing ones, customer-, project- or topic-specific stopword lists, corpus criteria and extraction strategies to the subsequent maintenance of term bases and their use in projects, you will be equipped with the information needed to use this feature confidently, reliably and profitably in your professional work. Participation is free, but registration is required: https://zoom.us/meeting/register/d68e024c63ad506f7c24e00bf0acd2b8 A talk with the same content will be held in English one week later (on December 28, 2018): https://zoom.us/meeting/register/cfd1a47cd5c54114d746f627e8486654

Apr 4, 2018

New in memoQ 8.4: easy stopword list creation!

This wasn't really on Kilgray's plan, but hey - it's now possible, and that makes my life easier. An accidental "feature".

Four years ago, frustrated by the inability to import stopword lists from other sources into memoQ, I published a somewhat complex workaround, which I have used in workshops and classes when I teach terminology mining techniques. For years I had suggested that adding and merging such lists be facilitated in some way, because the memoQ stopword list editor really sucks (and still does). Alas, the suggestion was not taken up, so translators of most source languages were left high and dry if they wanted to do term extraction in memoQ and avoid the noise of common, uninteresting words.

Enter memoQ version 8.4... with a lot of very nice improvements in terminology management features, which will be the subject of other posts in the future. I've had a lot of very interesting discussions with the Kilgray team since last autumn, and the directions they've indicated for terminology in memoQ have been very encouraging. The most recent versions (8.3 and 8.4) have delivered on quite a number of those promises.

I have used memoQ's term extraction module since it was first introduced in version 5, but it was really a prototype, not a properly finished tool, despite its superiority over many others in a lot of ways. One of its biggest weaknesses was the handling of stopwords (used to filter out unwanted "word noise"). It was difficult to build lists that did not already exist, and it was also difficult to add words to a list, because both the editor and the term extraction module allowed only one word to be added at a time. Quite a nuisance.

In memoQ 8.4, however, we can now add any number of selected words in an extraction session to the stopword list. This eliminates my main gripe with the term extraction module. And this afternoon, while I was chatting with Kilgray's Peter Reynolds about what I like about terminology in memoQ 8.4, a remark from him inspired the realization that it is now very easy to create a memoQ stopword list from any old stopword lists for any language.

How? Let me show you with a couple of Dutch stopword lists I pulled off the Internet :-)


I've been collecting stopword lists for friends and colleagues for years; I probably have 40 or 50 languages covered by now. I use these when I teach about AntConc for term extraction, but the manual process of converting these to use in memoQ has simply been too intimidating for most people.

But now we can import and combine these lists easily with a bogus term extraction session!

First I create a project in memoQ, setting the source language to the one for which I want to build or expand a stopword list. The target language does not matter. Then I import the stopword lists into that project as "translation documents".


On the Preparation ribbon in the open project, I then choose Extract Terms and tell the program to use the stopword lists I imported as "translation documents". Some special settings are required for this extraction:


The two areas marked with red boxes are critical. Change all the values there to "1" to ensure that every word is included. Ordinarily, these values are higher, because the term extraction module in memoQ is designed to pick words based on their frequencies, and a typical minimum frequency used is 3 or 4 occurrences. Some stopword lists I have seen include multiple word expressions, but memoQ stopword lists work with single words, so the maximum length in words needs to be one.


Select all the words in the list (by selecting the first entry, scrolling to the bottom and then clicking on the last entry while holding down the Shift key to get everything), and then select the command from the ribbon to add the selected candidates to the stopword list.

But we don't have a Dutch stopword list! No matter:


Just create a new one when the dialog appears!


After the OK button is clicked to create the list, the new list appears with all the selected candidates included. When you close that dialog, be sure to click Yes to save the changes or the words will not be added!


Now my Dutch stopword list is available for term extraction in Dutch documents in the future and will appear in the dropdown menu of the term extraction session's settings dialog when a session is created or restarted. And with the new features in memoQ 8.4, it's a very simple matter to select and add more words to the list in the future, including all "dropped" terms if you want to do that.

More sophisticated use of your new list would include changing the 3-digit codes which are used with stopwords in memoQ to allow certain words to appear at the beginning, in the middle, or at the end of phrases. If anyone is interested in that, they can read about it in my blog post from six years ago. But even without all that, the new stopword lists should be a great help for more efficient term extractions for your source languages in the future.

And, of course, like all memoQ light resources, these lists can be exported and shared with other memoQ users who work with the same source language.

Aug 3, 2017

"Coming to Terms" workshop materials for terminology mining



I recently put together a two-hour online workshop to teach some practical aspects of terminology mining and the creation and management of stopword lists to filter out unwanted word "noise" and get to interesting specialist terminology faster.

A recording of the talk as well as the slides and a folder of diverse resources usable with a variety of tools are available at this short URL: https://goo.gl/qvwJbf. The TVS recording file can be opened and played by the free TeamViewer application.

The discussion focuses primarily on Laurence Anthony's AntConc and the terminology extraction module of Kilgray's memoQ.

May 27, 2017

CAT tools for weapons license study

More than a decade ago I found a very useful book on practical corpus linguistics, which has had perhaps the greatest impact of any single thing on the way I approach terminology. Among other things, it discusses how to create special text collections for particular subjects and then mine these for frequently used expressions in those domains. It has become a standard recommendation in my talks at professional conferences and universities as well as in private consultations for terminology.

Slide from my recent talk at the Buenos Aires University Facultad de Derecho
In the last two weeks I had an opportunity to test my recommendations in a little different way than the one in which I usually apply them. Typically I use subject-specific corpora in English (my native language) to study the "authentic" voice of the expert in a domain that may be related to my own technical specialties but which differs in its use of language in significant ways. This time I used it and other techniques to study subject matter I master reasonably well (the features, use and safety aspects of firearms for hunting) with the aim of acquiring vocabulary and an idea of what to expect for a weapons qualification test in Portugal, where I have lived for several years but have not yet achieved satisfactory competence in the language for my daily routine.

It all started two weeks ago when I attended an all-day course on Portugal's firearm and other weapon laws in Portalegre. Seven and a half solid hours of lecture left me utterly fatigued at the end of the day, but it was an interesting one in which I had a lot of aha! moments as I saw a lot of concepts presented in Portuguese which I knew well in German and English. Most of the time I looked up words I saw in the slides or in the course textbook prepared by the PSP and made pencil notes on vocabulary in my book.

Twelve days afterward I was scheduled to take a written test, and in the unlikely event that I passed it, I was to be subject to a practical examination on the safe use of hunting firearms and related matters.

Years ago when I studied for a hunting license in Germany I had hundreds of hours of theoretical and practical instruction in a nine-month course concurrent with a one-year understudy with an experienced hunter. Participants in a German hunting course typically read dozens of supplemental books and study thousands of sample questions for the exam.

The pickings are a little slimmer in Portugal.

There are no study guides that I am aware of, in Portuguese or any other language, to help prepare for the weapons tests, except the slim book prepared by the police.

There are, however, a number of online forums where people talk about their experiences in the required courses and on the tests. Sometimes there are sample questions reproduced with varying degrees of accuracy, and there is a lot of talk about things which people found particularly challenging.

So I copied and pasted these discussions into text files and loaded them into a memoQ project for Portuguese to English translation. The corpus was not particularly large (about 4000 words altogether), so the number of candidates found in a statistical survey was limited, but still useful to someone with my limited vocabulary. I then proceeded to translate about half of the corpus into English, manually selecting less frequent but quite important terms and making notes on perplexing bits of grammar or tricks hidden in the question examples.

A glossary in progress as I study for my Portuguese weapons license
The glossary also contained some common vocabulary that one might legitimately argue does not belong in a specialist glossary, but since these were common words likely to occur in the exam and I did not know them, it was entirely appropriate to include them.

Other resources on the subject are scarce; I did find a World War II vintage military dictionary for Portuguese and English, which can easily be made into a searchable PDF using ABBYY FineReader or other tools, but not much else.

Any CAT tool would have worked equally well for my learning objectives - the free tools AntConc and OmegaT are in no way inferior to what memoQ offered me.

On the day of the test, I was allowed to bring a Portuguese-to-English dictionary and a printout of my personal glossary. However, the translation work that I did in the course of building the glossary had imprinted the relevant vocabulary rather well on my mind, so I hardly consulted either. I was tired (having hardly slept the night before) and nervous (so that I mixed up the renewal intervals for driver's licenses and hunting licenses), and I just didn't have the stamina to pick apart some particularly long, obtuse sentences, but in the end I passed with a score of 90% correct. That wouldn't win me any kudos with a translation customer, but it allowed me to go on to the next phase.

Practical shooting test at the police firing range
On the day of lectures, I dared to ask only one question, and I garbled it so badly that the instructor really didn't understand, so I was not looking forward to the oral part of the exam. But much to my surprise, I understood all the instructions on exam day, and I was even able to joke with the policeman conducting the shooting test. In the oral examination in which I had to identify various weapons and ammunition types and explain their use and legal status, and in the final part where I went on a "hunt" with a police commissioner to demonstrate that I could handle a shotgun correctly under field conditions and respond appropriately to a police check, I had no difficulties at all except remembering the Portuguese word for "trigger lock". All the terms I had drilled for passive identification in the written exam had unexpectedly become active vocabulary, and I was able to hold my own in all the spoken interactions - not a usual experience in my daily routine.

The same professional tools and techniques that I rely on for my daily work proved far more effective as learning aids for my examination, and in a much broader scope, than I had expected. I am confident that a similar application could be helpful in other areas where I am not very competent in my understanding and active use of Portuguese.

If it works for me, it is reasonable to assume that others who must cope with challenges of a test or interactions of some kind in a foreign language might also benefit from learning with a translator's working tools.

Aug 1, 2016

Corpus Linguistics and AntConc in the 2016 US Presidential Contest

Professor Laurence Anthony's AntConc concordancing software remains my favorite tool for analyzing the word content of text collections for my professional translation purposes. Although a free tool, it offers some important functionality beyond what I can get from the integrated term extraction and concordancing means in my translation environment tools, particularly SDL MultiTerm Extract and memoQ. AntConc is my first recommendation to my friends who teach at university and want to introduce their students to practical corpus linguistics and to my clients in industry who need to produce useful glossaries which cover the most frequently discussed things in their range of products and services.

That is not to say that its features are the most wide-ranging, but in addition to dead-simple incorporation of stopword lists (still a problem for most memoQ users), AntConc (like many other academic concordancers) offers excellent facilities for studying collocations, those words which occur together in important contexts. For years I have begged that this useful feature be added to the tools for professional translators, because it is a great aid in studying the proper language of a particular field or subject matter, and although the memoQ concordance can in fact search for multiple terms at once so that one forms a visual impression of their co-occurrence in text, it lacks the simple precision of AntConc for specifying the proximity range of the words found together in a sentence.
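The proximity search amounts to counting co-occurrences within a word window around a node word. A rough sketch of that idea (my own illustration, not AntConc's actual statistics or scoring):

```python
from collections import Counter

def collocates(tokens, node, window=3):
    """Count words occurring within `window` tokens of each
    occurrence of `node`, on either side."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # don't count the node word itself
                    counts[tokens[j]] += 1
    return counts

text = "the term extraction module extracts term candidates for term review".split()
print(collocates(text, "term", window=2).most_common(3))
```

Real concordancers add frequency-based association measures on top of such raw counts, but even the raw window counts reveal which words habitually keep company with a term.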

In one form or another, tools for analyzing the frequency of words and the contexts in which they occur have been a part of my life for a very long time. And yet it did not occur to me to use them as a means of studying the many words that are part of the many political and social debates taking place in the countries that concern me. One can get a quick impression with fun word cloud pictures (such as those in this post, created from the convention speeches of The Orange One and The Infamous HRC using a free online tool). But AntConc lets you go deeper and achieve a greater understanding of how language is used to influence our thoughts and discussions.

Katelyn Guichelaar and Kristin Du Mez have done just that in an interesting article titled "Donald Trump and Hillary Clinton, By Their Words", which offers some interesting insights into the psychology and public postures of the two candidates. No spoilers here – go read the article and enjoy. Then think about the professionally and personally relevant ways in which you might use the practical tools of corpus linguistics.


Apr 11, 2014

memoQ: stopwords for term extraction

The recent Kilgray blog post about the "terminology as a service" (TaaS) project reminded me of the considerable unfinished business with the term extraction module introduced three years ago with memoQ 5.0. It's a very useful feature that I apply frequently to my projects, and my prediction years ago that it would not replace SDL's MultiTerm Extract in my workflows was wrong. Overall it proved to be more convenient, and after the shock of discovering that the defective logic of MultiTerm Extract created new German "words" that existed neither in the language nor in my text sources, I dumped that dodgy tool and stuck with memoQ's extractor. But sometimes its rough edges are irritating, and I wish Kilgray would finally pick up the ball that was dropped after a great start in the game.

One of the major weaknesses (aside from never remembering my changes to the options or my preferred settings for extractions) is the management of stopword lists.

Overall, Kilgray's approach to stopwords is reasonably sophisticated; some time ago I published a rather incomprehensible post in which I demonstrated how the "stopword codes" - those three-digit binary numbers which appear next to stopwords - control whether a word can appear at the start, the end or in the middle of a phrase even if it is excluded as a single word. These codes are quite useful in some cases. However, they also complicate the use of stopword data for most users.

memoQ includes only a few stopword lists for a few languages in its shipping configuration - German, English, French and Italian I think. Not even all the user interface languages are included. That's rather sad, because there are quite a few public domain stopword lists available on the Internet. However, most memoQ users have absolutely no idea how to incorporate these in memoQ, and Kilgray offers no features or information that I am aware of to facilitate this process.

I was reminded of this problem when I was asked to discuss terminology mining with master's students at my local university in Portugal. I thought it might be nice for the students to be able to make use of the stopword lists they could find on the Internet for their target languages (Spanish and Portuguese for that group). These lists are typically just text files of single words. When I built my own master stopword list for German a few years ago, I gathered half a dozen or more large, mostly redundant lists for a start. Then I carried out the following steps:
  1. Combine all the stopword lists for a language from various sources into one big text file.
  2. Open that text file in a spreadsheet program such as Microsoft Excel.
  3. Sort the list and use the integrated function to eliminate duplicates.
  4. Fill the number "111" in the second column of the spreadsheet. (This will completely exclude the term from phrases as well; if you want to make individual exceptions according to the scheme I described in an earlier blog post, you can do so now or at any time later after the list has been imported to memoQ.)
  5. Save the data as tab-delimited Unicode text.
  6. Open the text file and paste in this XML header, adapting the GUID, file name, name, description and language values to your particular list:

    <MemoQResource ResourceType="Stopwords" Version="1.0">
      <Resource>
        <Guid>dc7006ad-8db8-4724-b22d-7acfd600fd9f</Guid>
        <FileName>ger#KSL_stopwords-DE.mqres</FileName>
        <Name>KSL_stopwords-DE</Name>
        <Description>Combined lists from many sources</Description>
        <Language>ger</Language>
      </Resource>
    </MemoQResource>


  7. Save the file, change the file extension to MQRES, and import the file as a stopword list in the memoQ Resource Console.
Afterward, the stopword list you imported will be available when you start a term mining session using Operations > Extract Terms...
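For those comfortable with a little scripting, steps 1 through 6 above can be automated. The sketch below is merely one way to do it, assuming a folder of plain-text stopword lists (one word per line); the function and file names are my own invention, and you should of course adapt the name, description and language code to your list, just as with the manual procedure.

```python
# Sketch: combine several plain-text stopword lists into one memoQ
# stopword resource (steps 1-6 above). Names and paths are placeholders.
import glob
import uuid

HEADER = """<MemoQResource ResourceType="Stopwords" Version="1.0">
  <Resource>
    <Guid>{guid}</Guid>
    <FileName>{name}.mqres</FileName>
    <Name>{name}</Name>
    <Description>Combined lists from many sources</Description>
    <Language>{lang}</Language>
  </Resource>
</MemoQResource>"""

def build_stopword_list(source_glob, name, lang, code="111"):
    """Merge all matching text files, deduplicate, and return the
    resource file content with the given stopword code on each entry."""
    words = set()
    for path in glob.glob(source_glob):
        with open(path, encoding="utf-8") as f:
            for line in f:
                word = line.strip()
                if word:
                    words.add(word)
    lines = [HEADER.format(guid=uuid.uuid4(), name=name, lang=lang)]
    lines += [f"{w}\t{code}" for w in sorted(words)]
    return "\n".join(lines)

# Example: write the result to a file ready for import via the
# Resource Console (file name is hypothetical).
# content = build_stopword_list("stopwords/*.txt", "KSL_stopwords-PT", "por")
# with open("KSL_stopwords-PT.mqres", "w", encoding="utf-8") as f:
#     f.write(content)
```

The default code of "111" excludes each word from phrases entirely, matching step 4 of the manual procedure; individual exceptions can still be edited afterward.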

During an extraction session, words can be added to the chosen stopword list (one at a time, unfortunately - I've been asking for multiple additions as a productivity measure for three years now) by selecting a term and clicking the Add as stopword command or pressing Ctrl+W.

You might be a little confused if you look for words you've added to a stopword list from the term extraction interface. They are not inserted in alphabetical order, but instead at the end of the block of words starting with the same letter. Thus, for example, the red box in the screen clipping below shows all the words I've added to my previously alphabetized list since I created it:


Apr 26, 2012

Twitterview: SDL Trados Studio, memoQ, DVX2 and PDF extraction

When I began using Twitter somewhat hesitantly three years ago, I never expected that it would eventually prove to be one of the most useful social media tools for gathering information of professional value. Much of this is serendipitous; I really never know what will come floating down the twitstream or where some of the conversations in it will go. Like the direct chat I had with a colleague in New Zealand about the features she liked best in the two main CAT tools she uses, SDL Trados Studio and memoQ.

We both really appreciate the TM-driven segmentation in memoQ and the superior leverage this offers. But to my surprise, she expressed a preference for SDL Trados Studio, particularly for the quality of its PDF text extractions from electronically generated files. This is not a feature I make heavy use of in either tool, though I have used it more often lately in memoQ for alignments in the LiveDocs module and found it generally satisfactory. Most of my work involving PDF files is with scanned documents - there one has no choice but to use a good OCR tool like OmniPage or ABBYY FineReader.

So I was quite intrigued by the claim that the quality of PDF extraction in a CAT tool was "better" than from standalone tools. Especially because my experience is quite different. Further discussion (not shown in the graphic) revealed that what she actually meant was that the quality of the text extraction with the CAT tool usually beat the quality of text received from translation agencies who performed conversions. That is easy to explain, really. In my experience, most agencies are clueless about how to use conversion tools and too often use automated settings and save the results "with layout". This is very often utterly unsuited for work with translation environment tools or requires a lot of cleanup and code zapping.

For years I have recommended to agencies and colleagues that they spare themselves a lot of headaches by saving PDF conversions as plain text and adding any desired formatting later. Most people ignore that advice and suffer accordingly. So in a way, a CAT tool that does so encourages "best practice" for PDF translation for those files they are actually able to handle.

Encouraged by the Twitter exchange, I decided to do a few tests with files from recent projects. I took a PDF I had with various IFRS-related texts from EU publications. It appeared to extract quickly and cleanly in memoQ, giving me a translation grid full of nicely segmented text. SDL Trados Studio 2009 choked badly on it and extracted nothing. Her extraction in SDL Trados Studio 2011 caused a timeout with the project, I was told, but the text itself was completely extracted and converted to DOCX format. This is useful, because unlike the extraction to plain text in memoQ, this offers the possibility to add or change some text formatting in the translation grid. Other extraction examples from SDL Trados Studio 2011 showed that text formatting was preserved.

A closer examination of the extracted texts revealed some problems with both the memoQ and Trados Studio extractions. The memoQ 5 PDF text extraction engine proved incapable of handling text in multiple columns properly. The paragraph order was all fouled up. The extraction with SDL Trados Studio had a great number of superfluous spaces. Whether it is possible to optimize this in the settings somehow I do not know. The results of all the extraction tests are downloadable here in a 6 MB ZIP file. I've included the SDL Trados Studio extraction saved to plain text as well for a better comparison of the text order and surplus spaces problems.

Overall, I am personally not very pleased with the results of the text extractions from PDF in either tool. The results from SDL Trados Studio are clearly better, and other examples that were shared made it clear that this tool works better than many an untrained PM does with better PDF conversion software. This is certainly much better than the solutions I see many translators using. But really, nothing beats good OCR software, an understanding of how to use it well and a proper workflow to get a good TM and a target file better fit for most purposes.

*****

Update 2012-05-22: I met colleague Victor Dewsbery at a recent gathering in Berlin, and he told me about his tests with the recently introduced PDF import feature of Atril's Déjà Vu X2 translation environment. He kindly offered to share his results (available for download here) and wrote:

Here is the result of the PDF>DVX2>RTF>ZIP process for your monster EU PDF file. Comments on the process and the result:
  • The steps involved were: 1. import the file into DVX2 as a PDF file; 2. mark all segments and copy source to target; 3. export the file as if it were a translated file (it comes out as an RTF file). The RTF file is 20 MB in size and zips to 3 MB.
  • Steps 1 and 3 took a long time, and DVX2 claimed to be not responding. For step 1 I just left it and it eventually came up with the goods. Step 3 exported the RTF file perfectly, even though DVX2 claimed that the export had not finished. I was able to open the RTF file (it was locked, but I simply renamed it), and this is the version which I enclose. Half an hour later DVX2 had still not ended the export process (and had to be closed via the Task Manager), although the exported file was in fact perfectly OK. The procedure worked more smoothly with a couple of smaller PDF files. Atril is working on streamlining the process and ironing out the glitches in the process, especially the “not responding” messages.
  • The result actually looks very good to me. There are hardly any codes in the DVX2 project file (the import routine also integrates CodeZapper). I didn’t spot any mistakes in the sequence of the text. Indented sections with numbering seem to be formatted properly - i.e. with tabs and without any multiple spaces.
  • The top and bottom page boundaries in the exported file are too wide, so most pages run over and the document has over 900 pages instead of just under 500. Marking the whole document and dragging the header/footer spaces in Word seems to fix this fairly quickly.
  • I note that some headlines are made up of individual letters with spaces between them. This may be related to the German habit of using letter spacing (“Sperrschrift”) for emphasis as an alternative to bold type.
  • I found one instance where text was chopped up into a table on page 857 of the file.
  • There are occasional arbitrary jumps in type size and right/left page boundaries between sections.
On the strength of this sample, it would usually be OK to simply import the PDF file into DVX2, translate in the normal way, and then fix any formatting problems in the exported file.

Jan 7, 2012

Understanding memoQ's term extraction stopword codes

Recently I shared a link to a small stopword list for a minor language, which I had set up as a memoQ resource for a friend, and another translator questioned why I had coded the stopwords as I did. My answer was truthful: no good reason. I had simply copied the practice in Kilgray's default files for other languages. As I looked further into discussions of term extraction and stopwords on the memoQ Yahoogroups list, I realized that I was not the only one who had a hard time getting a clear picture of how things actually work. So I decided to learn by experiment.
First I created a stopword list with nonsense words having every possible coding combination. A memoQ stopword list is a text file with an XML header and an *.mqres extension, with a structure that looks like this:
<MemoQResource ResourceType="Stopwords" Version="1.0">
  <Resource>
    <Guid>2b077cde-8c10-4ee1-86db-14eb42f010cc</Guid>
    <FileName>KSL_test-stopwords_EN.mqres</FileName>
    <Name>KSL_test-stopwords-EN</Name>
    <Description>For testing only</Description>
    <Language>eng</Language>
  </Resource>
</MemoQResource>
gak    111
unga   101
munga  011
kunga  110
fra    000
blu    100
bly    001
bla    010
The entries in the stopword list (here the nonsense words gak through bla) are each followed by a tab and a three-digit binary code. The first digit of this code controls whether a phrase is excluded from the list of candidates if it begins with this entry (Kilgray calls this "blocks as first"). The second digit controls whether a phrase is excluded if the entry occurs within it, neither at the beginning nor at the end (Kilgray calls this "blocks inside"). The third digit controls whether a phrase is excluded if the entry occurs at its end ("blocks as last").

A "1" means yes, "0" means no. So "011" means
  • allowed at the start of the phrase,
  • not allowed inside the phrase
  • not allowed at the end of a phrase
Thus kunga will cause a phrase to be excluded if it occurs at the start of or inside the phrase, but not at the end. Phrases ending with kunga might appear in the list of candidates.
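The blocking rules can be modeled in a few lines of code. This is only my sketch of the logic described above, not memoQ's actual implementation; the function name and data structure are my own.

```python
# Model of the three-digit stopword codes: digit 1 = "blocks as first",
# digit 2 = "blocks inside", digit 3 = "blocks as last".
def is_blocked(phrase, stopwords):
    """phrase: list of words; stopwords: dict mapping word -> code string.
    Returns True if the phrase would be excluded from the candidates."""
    words = [w.lower() for w in phrase]
    if len(words) == 1:
        # A single stopword is always excluded, whatever its code.
        return words[0] in stopwords
    for i, w in enumerate(words):
        code = stopwords.get(w)
        if code is None:
            continue
        if i == 0 and code[0] == "1":                   # blocks as first
            return True
        if 0 < i < len(words) - 1 and code[1] == "1":   # blocks inside
            return True
        if i == len(words) - 1 and code[2] == "1":      # blocks as last
            return True
    return False

stops = {"kunga": "110"}
is_blocked(["kunga", "the", "lazy", "dog"], stops)  # True: blocks as first
is_blocked(["over", "the", "lazy", "kunga"], stops)  # False: allowed at end
```

Run against the test data below, this model reproduces the extraction results: phrases ending in kunga survive, while those starting with it or containing it do not.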

My test file contained the sentence
The quick brown fox jumped over the lazy dog 
repeated four times in three blocks for each test stopword, with the stopword substituted at the beginning, inside and at the end of "over the lazy dog":
The quick brown fox jumped unga the lazy dog. The quick brown fox jumped unga the lazy dog. The quick brown fox jumped unga the lazy dog. The quick brown fox jumped unga the lazy dog.
The quick brown fox jumped over unga lazy dog. The quick brown fox jumped over unga lazy dog. The quick brown fox jumped over unga lazy dog. The quick brown fox jumped over unga lazy dog.
The quick brown fox jumped over the lazy unga. The quick brown fox jumped over the lazy unga. The quick brown fox jumped over the lazy unga. The quick brown fox jumped over the lazy unga.
After the term extraction, the following four-word phrases from the text chunk of interest were found with the stopwords:
fra The lazy dog
bly The lazy dog
bla The lazy dog

munga
The lazy dog

over unga lazy dog

over fra lazy dog
over blu lazy dog
over bly lazy dog

over The lazy kunga
over The lazy fra
over The lazy blu
over The lazy bla
All these occurrences follow the defined rules as you can see from the stopword list above. None of the stopwords occurred singly in the extraction candidates, of course. So entering "000" as the code for a stopword will exclude that stopword alone but not in any phrase.

How is this relevant in practice? In English, for example, words like in, the and first are uninteresting by themselves and belong in a stopword list. But a phrase containing them, like "in the first instance", might indeed be of interest. In cases like that, codes such as "001" or "101" (allowing the word inside a phrase in both cases, and at the beginning as well in the first case) might be appropriate. These are matters of judgment that will differ for each language. One user commented that he finds it more useful to be very restrictive in the extraction ("111") and add phrases during the actual translation, and I am inclined to follow this practice as well. Where one discovers exceptions, the stopword rules can always be edited in various places in memoQ.