The issue of Roman numerals in my translation work has been at the back of my mind for a few years now, but the pain level was never high enough for me to get around to dealing with it. It comes up time and again in legal translation work: references to the "X. Senat" or the like, which mess up segmentation (and require a bit of regex to write a new segmentation rule); references to "Art. VII" of some law (I need to catch the typos like "VIII"); source text errors like "VIIII"; and of course dates like MCMXXIV, as well as century references.
For simple matters I used regex which would capture and reproduce "Roman numerals", but erroneous data using the right letters would also be accepted:
[MDCLXVI]+
That is, of course, rather useless for QA which checks the correctness of the expression in the source text. So with a bit of thought I came up with:
Without the word border syntax ("\b"), non-standard expressions like "VIIII" might appear to be validated in the interface of memoQ, for example, because the whole expression would be marked green in the source text, and one might not notice that it was resolved into "VIII" and "I".
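To make the difference concrete, here is a minimal Python sketch. The strict pattern in it is an assumption: it is the widely circulated thousands/hundreds/tens/units form, not necessarily the exact expression from this post, and the names `LOOSE`, `CORE`, `UNBOUNDED`, `BOUNDED` and `numerals` are made up for illustration.

```python
import re

# Loose pattern from the text: any run of Roman-numeral letters.
LOOSE = re.compile(r"[MDCLXVI]+")

# Assumed strict core (thousands, hundreds, tens, units), covering 1 to 3999.
# Every part is optional, so on its own it can match the empty string.
CORE = r"M{0,3}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})"

# Strict but without word borders; the lookahead rules out empty matches.
UNBOUNDED = re.compile(r"(?=[MDCLXVI])" + CORE)

# Strict with word borders. The lookbehind before the closing \b stops the
# pattern from "matching" nothing at the start of an invalid token.
BOUNDED = re.compile(r"\b(?=[MDCLXVI])" + CORE + r"(?<=[MDCLXVI])\b")


def numerals(pattern, text):
    """Return every substring of text that the pattern matches."""
    return [m.group() for m in pattern.finditer(text)]


text = "Art. VIIII, Art. VIII, X. Senat, MCMXXIV"
print(numerals(LOOSE, text))      # ['VIIII', 'VIII', 'X', 'MCMXXIV']
print(numerals(UNBOUNDED, text))  # ['VIII', 'I', 'VIII', 'X', 'MCMXXIV']
print(numerals(BOUNDED, text))    # ['VIII', 'X', 'MCMXXIV']
```

The second result shows the trap described above: "VIIII" is silently split into "VIII" and "I", so every letter still ends up matched. Only the bounded version leaves it unmatched. Note that a pattern like this will still accept the English pronoun "I" or a word such as "MIX", so the context decides how much manual checking remains.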
These expressions can be used in various ways in any CAT tool that supports regular expressions, such as SDL Trados Studio or memoQ.
If you want this typing aid and QA tool as a memoQ autotranslatable (along with a little demo data file), you can get it here.
Nice one Kevin. I've used similar before, straight out of the regex cookbook, but I like the cleanliness of this variation a lot. Saved for future use :-)
Kevin, a suggestion from a DVX3 user: I added the Roman numerals from I to L to a "check termbase", along with lots of other terms that will be correct perhaps 99.5% of the time, and then I use it in the terminology check that I run as one of the last steps before export.