First I created a stopword list of nonsense words covering every possible code combination. A memoQ stopword list is a text file with an XML header and an *.mqres extension, with a structure that looks like this:
<memoqresource resourcetype="Stopwords" version="1.0">
<resource>
<guid>2b077cde-8c10-4ee1-86db-14eb42f010cc</guid>
<filename>KSL_test-stopwords_EN.mqres</filename>
<name>KSL_test-stopwords-EN</name>
<description>For testing only</description>
<language>eng</language>
</resource>
</memoqresource>
gak 111
unga 101
munga 011
kunga 110
fra 000
blu 100
bly 001
bla 010
The entries in the stopword list (here the nonsense words gak through bla) are each followed by a tab and a three-digit binary code. The first digit of this code controls whether a phrase is excluded from the list of candidates if it begins with this entry. (Kilgray calls this "blocks as first".) The second digit of the code controls whether a phrase is excluded if the entry occurs within it (neither at the beginning nor at the end; Kilgray calls this "blocks inside"). The third digit controls whether a phrase is excluded if the entry occurs at its end ("blocks as last").
A "1" means the entry blocks the phrase in that position; a "0" means it does not. So "011" means the entry is
- allowed at the start of a phrase,
- not allowed inside a phrase,
- not allowed at the end of a phrase.
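To make the coding concrete, here is a minimal Python sketch that decodes one entry line into its three flags. The names StopwordRule and parse_entry are my own for illustration, not anything from memoQ, and the sketch assumes single-word entries separated from the code by whitespace.

```python
from typing import NamedTuple


class StopwordRule(NamedTuple):
    """Decoded form of one stopword entry (hypothetical helper type)."""
    word: str
    blocks_first: bool   # first digit: "blocks as first"
    blocks_inside: bool  # second digit: "blocks inside"
    blocks_last: bool    # third digit: "blocks as last"


def parse_entry(line: str) -> StopwordRule:
    """Split 'word<tab>code' and decode the three-digit binary code."""
    word, code = line.split()
    if len(code) != 3 or set(code) - {"0", "1"}:
        raise ValueError(f"expected a three-digit binary code, got {code!r}")
    first, inside, last = (digit == "1" for digit in code)
    return StopwordRule(word, first, inside, last)


rule = parse_entry("munga\t011")
print(rule)
```

For "munga 011" this gives blocks_first as False and the other two flags as True, which matches the reading of "011" given above.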
My test file contained the sentence
The quick brown fox jumped over the lazy dog
repeated four times in three blocks for each test stopword, with the stopword substituted at the beginning, inside and at the end of "over the lazy dog":
The quick brown fox jumped unga the lazy dog. The quick brown fox jumped unga the lazy dog. The quick brown fox jumped unga the lazy dog. The quick brown fox jumped unga the lazy dog.
The quick brown fox jumped over unga lazy dog. The quick brown fox jumped over unga lazy dog. The quick brown fox jumped over unga lazy dog. The quick brown fox jumped over unga lazy dog.
The quick brown fox jumped over the lazy unga. The quick brown fox jumped over the lazy unga. The quick brown fox jumped over the lazy unga. The quick brown fox jumped over the lazy unga.
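A test text of this kind is tedious to type by hand. The following Python sketch shows one way it could be generated; build_test_blocks is a hypothetical helper of my own, and the result still has to be saved as a document and imported into memoQ in the usual way.

```python
SENTENCE = "The quick brown fox jumped over the lazy dog."


def build_test_blocks(stopword: str, repeats: int = 4) -> list[str]:
    """Return three blocks with the stopword replacing 'over', 'the' and 'dog'."""
    words = SENTENCE.rstrip(".").split()
    blocks = []
    # Positions counted from the end: -4 is "over", -3 is "the", -1 is "dog".
    for position in (-4, -3, -1):
        variant = words.copy()
        variant[position] = stopword
        sentence = " ".join(variant) + "."
        blocks.append(" ".join([sentence] * repeats))
    return blocks


for block in build_test_blocks("unga"):
    print(block)
```

Calling the function once per entry in the stopword list and joining the output gives the complete test file.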
After the term extraction, the following four-word phrases from the text chunk of interest were found with the stopwords:
fra The lazy dog
bly The lazy dog
bla The lazy dog
munga The lazy dog
over unga lazy dog
over fra lazy dog
over blu lazy dog
over bly lazy dog
over The lazy kunga
over The lazy fra
over The lazy blu
over The lazy bla
All these occurrences follow the defined rules, as you can see from the stopword list above. None of the stopwords occurred singly in the extraction candidates, of course. So entering "000" as the code for a stopword excludes that stopword on its own but not any phrase that contains it.
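The results can be cross-checked with a small model of the rules as I have described them. This is only a sketch of my reading of the three digits, not memoQ's actual extraction logic; is_blocked and survivors are hypothetical names, and the treatment of a single-word candidate is an assumption based on the observation that no stopword turned up alone.

```python
# Codes copied from the test stopword list in this post.
CODES = {
    "gak": "111", "unga": "101", "munga": "011", "kunga": "110",
    "fra": "000", "blu": "100", "bly": "001", "bla": "010",
}


def is_blocked(phrase, codes):
    """Return True if any stopword blocks the phrase at the position it occupies."""
    if len(phrase) == 1:
        # Assumption from the test: a stopword alone is never a candidate.
        return phrase[0].lower() in codes
    last = len(phrase) - 1
    for index, word in enumerate(phrase):
        code = codes.get(word.lower())
        if code is None:
            continue  # not a stopword
        # Digit 0 = blocks as first, 1 = blocks inside, 2 = blocks as last.
        position = 0 if index == 0 else 2 if index == last else 1
        if code[position] == "1":
            return True
    return False


def survivors(slot, base=("over", "the", "lazy", "dog")):
    """List the stopwords that do not block the phrase when placed in the given slot."""
    kept = []
    for word in CODES:
        phrase = list(base)
        phrase[slot] = word
        if not is_blocked(phrase, CODES):
            kept.append(word)
    return kept


for slot, label in ((0, "first"), (1, "inside"), (3, "last")):
    print(label, survivors(slot))
```

The three lists printed are the same stopwords that appear in the first, second and last positions of the extracted phrases listed above.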
How is this relevant in practice? In English, for example, words like "in", "the" and "first" are uninteresting by themselves and belong in a stopword list. But a phrase containing them, like "in the first instance", might indeed be of interest. In cases like that, the appropriate code for these stopwords might be "001" or "101" (allowing inside in both cases, at the beginning as well in the first case). These are matters of judgment that will differ for each language. One user commented that he finds it more useful to be very restrictive in the extraction ("111") and add phrases during the actual translation, and I am inclined to follow this practice as well. Where one discovers exceptions, the stopword rules can always be edited in various places in memoQ.