Some years ago, I published a description of how data from these multilingual EUR-LEX displays can be transferred to translation memories or other corpora for reference purposes, and more recently I produced a video showing the same procedure. But some people don't like the paragraph-level alignment format of the EUR-LEX displays, and these displays can also occasionally fall seriously out of sync, as in this example (or worse):
Now, I don't find that much of a nuisance when I use memoQ LiveDocs, because I can simply view the full bilingual document context and see where the corresponding information really is (much like leaving alignments in memoQ uncorrected until you actually find a use for the data and decide the correction effort is worthwhile). But if you plan to feed that aligned data to a translation memory, it's a bit of a disaster. And many people prefer data aligned at the sentence level anyway.
Well, there is a simple way to get the EU legislation texts you want, aligned at the sentence level, with the individual bitexts ready to import into a translation memory, LiveDocs corpus or other reference tool. See that document number above with the large red arrow pointing to it? That's where you start....
Did you know that much of the information in EUR-LEX is also available in the publicly released DGT translation memories? These are sentence-level alignments. But most people go about using this data in a rather klutzy and unhelpful way. The "big data" craze some years ago had a lot of people trying to load all of this information into translation memories and other places, usually with miserable results. These include:
- the inability to load such enormous data quantities in a CAT tool's TM without having far more computer RAM than most translators ever think they'll need;
- very slow imports, some apparently proceeding on a geological time scale;
- data overload - so many concordance hits that users simply can't find the focused information they need; and
- system performance degradation, with extremely sluggish responses in a wide variety of tasks.
Bulk data is for monkeys and those who haven't evolved professionally much beyond that stage. Precision data selection makes more sense, and enables better use of the resources available. But how can you achieve that precision? If I want the full bilingual text of EU Regulation No. 575/2013 in some language pair, for example, with sentence-level alignment, how can I find that quickly in the vast swamp of DGT data?
Years ago, I published an article explaining why it is better to load the individual TMX files found in the DGT's downloadable ZIP archives into LiveDocs, so that the full document context can be seen from concordance searches. What I didn't mention in that article is that the names of those individual TMX files correspond to the document numbers in EUR-LEX.
Armed with that knowledge, you can be very selective in what data and how much you load from the DGT collection. For example, if you organize the data releases in folders by year...
... and simply unpack the ZIP files in each year's folder...
... each folder will contain TMX files...
... the names of which correspond to the document numbers found in EUR-LEX. So a quick search in Windows Explorer or by other means can locate the exact document you want as a TMX file ready to import into your CAT tool:
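If you'd rather script that step than click through Explorer, here is a minimal Python sketch of the same idea: unpack the yearly ZIP archives, then look up a TMX file by its EUR-LEX (CELEX) document number. The folder layout under C:\Data\DGT is my own illustrative assumption, as is the example of Regulation No. 575/2013, whose CELEX number is 32013R0575; adjust both to your setup:

```python
import zipfile
from pathlib import Path

# Root folder with one subfolder per DGT release year (hypothetical layout,
# e.g. C:\Data\DGT\2013\... -- adjust to wherever you keep the downloads).
DGT_ROOT = Path(r"C:\Data\DGT")

# Step 1: unpack every ZIP archive into the year folder that contains it.
for archive in DGT_ROOT.rglob("*.zip"):
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(archive.parent)

# Step 2: locate the TMX file for a given CELEX document number.
# Regulation (EU) No 575/2013 has the CELEX number 32013R0575:
# sector 3 (legislation) + year 2013 + R (regulation) + the document number.
doc_number = "32013R0575"
for tmx in DGT_ROOT.rglob(f"{doc_number}*.tmx"):
    print(tmx)  # ready to import into your CAT tool or LiveDocs corpus
```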
These TMX files typically contain 24 EU languages now, but most CAT tools can filter out just the language pair you want. So the same file can usually give you Polish+French, German+English, Portuguese+Greek or whatever combination you need among the languages present.
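Incidentally, a TMX file is just XML, so if you ever need to pull one language pair out of a multilingual file without a CAT tool, a few lines of Python will do it. This is only a rough sketch under my own assumptions: the file name and the German/English pair are examples, and the language attribute is read defensively, since different TMX versions mark it differently:

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def tuv_lang(tuv):
    # Depending on the TMX version, the language sits in xml:lang or in a
    # plain lang attribute, so check both and keep only the main code.
    code = tuv.get(XML_LANG) or tuv.get("lang") or ""
    return code.upper()[:2]          # 'EN-GB' -> 'EN'

def extract_pair(tmx_path, src="DE", trg="EN"):
    """Yield (source, target) sentence pairs for one language combination."""
    for tu in ET.parse(tmx_path).iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                segs[tuv_lang(tuv)] = "".join(seg.itertext())
        if src in segs and trg in segs:
            yield segs[src], segs[trg]

# Example: pull the German/English pairs out of one document's TMX file.
for de, en in extract_pair("32013R0575.tmx", "DE", "EN"):
    print(de, "|", en)
```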
I still prefer to import my TMX data into a LiveDocs corpus in memoQ, where I can use the option to import a folder structure. In the import dialog, I simply enter the name of the file I want, and all the other files (thousands of them) are promptly excluded:
After I enter the file name in the Include files field, I click the Update button to refresh the view and confirm that only the file I want has been selected. Depending on where in memoQ you do the import, you may have to specify the languages to extract (in the Resource Console) or not (in a project, where the languages are already set). Of course, the data can also be imported into a translation memory in memoQ, but that is an inferior option: it is then not possible to read the reference document in a bilingual view as you can in a LiveDocs corpus, and only isolated segments can be viewed in the Concordance or Translation results pane.
How you work with these data and with what tools is up to you, but this procedure will provide you with a number of options for better data selection and improved access to the reference data you may need for EU legislation without getting stuck in the morass of millions of translation units in a performance-killing megabomb TM.
The nice thing about software is that you can find many ways to obtain the same result...
;-)
One example is just aligning EU laws.
Until about last year, I used to download the source and target texts from EUR-LEX, align them with various tools (the most recent being NOVA Text Aligner) and then index the bitexts with Archivarius.
But a few months ago I found that downloading the EUR-LEX laws as two-column bilingual HTML documents is enough, as Archivarius indexes them perfectly with zero effort.
My 2 cents