July 25, 2024 / by Adam Jones / Estimated read time: 9 minutes
Don’t Let Machine Translation Contaminate Your Translation Memory
SimulTrans often receives translation memory files from new clients. We love getting these resources that increase consistency, reduce costs, and speed schedules. Unfortunately, we increasingly find the translation memories customers provide are contaminated by raw machine translation.
When translating new content, we see many segments populated with 100% matches from the translation memory. Some are even tagged as higher quality "101% matches" that maintain the context of the previous translations stored in the database. Historically, these high match levels indicated high-quality translations that were previously reviewed and should require little or no editing. Not anymore!
Now, many of these matched segments yield abysmal raw machine translations that nobody bothered to edit or even read. Unlike new machine translations that we are asked to post-edit, these segments provide no insight into their questionable pedigrees. No little "MT" tag serves as a warning that they need extra attention. On the contrary, some segments are even automatically locked to prevent our translators from tampering with the previously published text despite its fundamental problems.
In the example below, a 101% match segment is actually a raw machine translation retrieved from translation memory. Nothing identifies it as such, leading a translator to believe it likely does not require review:
Usually, a machine translation match is designated by "MT" so that the translator knows it needs careful review and editing. That is not the case with machine translation units previously stored in translation memory.
Over time, the increasing volume of machine translation degrades translation memories further. With each new project, AI's delusions and inaccuracies are destroying these valuable repositories built carefully over years with hand-picked terminology and sentences painstakingly crafted by experts.
The problem multiplies if you use your contaminated translation memory to train or update your machine translation engine.
Protect your translation memories
SimulTrans uses effective methods to inoculate translation memories against the rapidly spreading machine translation infection. You can do the same.
Avoid storing machine translation
The most obvious and straightforward solution is to not store machine translated text in translation memory at all. If you do not dedicate human effort to perfecting it, there is no need to retain and reuse the translated text.
Even the most ardent proponents of machine translation should not allow it to populate translation memories. AI translation tools are evolving quickly. Why commit to today's output three years in the future? By then, a new machine translation will surely be better than anything stored from the past. Translation management systems usually prioritize translation memory over machine translation, which would result in your memorized old machine translation taking priority over new and improved AI output. You should prefer new neural content, rendering a translation memory of unedited machine translations unnecessary.
Create a separate translation memory
We recognize that some scenarios dictate the use of machine translation, ideally with post-editing. When this is the case, you can relegate its mediocrity and potential errors to a separate translation memory. Translation tools commonly allow translators to access multiple translation memories, reading from all while writing to one.
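The "read from all, write to one" arrangement can be sketched in a few lines. This is only an illustration, not how any particular translation tool implements it: the TranslationMemory and Project classes, their field names, and the exact-match-only lookup are all hypothetical simplifications.

```python
from dataclasses import dataclass, field


@dataclass
class TranslationMemory:
    """Hypothetical, simplified TM: exact source text -> target text."""
    name: str
    units: dict = field(default_factory=dict)


@dataclass
class Project:
    """Reads from several TMs in priority order, writes to exactly one."""
    read_tms: list
    write_tm: TranslationMemory

    def lookup(self, source):
        # Consult each readable TM in order; the first exact hit wins.
        for tm in self.read_tms:
            if source in tm.units:
                return tm.name, tm.units[source]
        return None

    def confirm(self, source, target):
        # Confirmed segments only ever land in the designated write TM.
        self.write_tm.units[source] = target
```

Listing the human-written memory first in read_tms and setting write_tm to the separate memory means newly confirmed machine translated segments never reach the legacy database, while translators still see matches from both.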
You can benefit from your high-quality, human-written translation memory without infecting it. Why not make a separate translation memory that you know will be contaminated with machine translation? Protect your legacy investment in human translation by preserving it in a dedicated memory database and not allowing machine translated segments to seep in.
When applying the separate translation memory you have created, you can assign a penalty percentage that moves its matches out of the 100% and 101% categories, forcing the segments to be reviewed, as also shown in the example above. Taking this step prevents translators from assuming these questionable matches are perfect.
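The arithmetic behind a penalty is simple. The sketch below assumes the penalty is subtracted from the match score in percentage points and that anything scoring 100% or higher is treated as not needing review. Both are assumptions for illustration, as are the function names; check how your own tool applies penalties.

```python
def apply_penalty(match_percent, penalty_percent):
    """Subtract a TM-level penalty from a match score, never going below zero."""
    return max(0, match_percent - penalty_percent)


def needs_review(match_percent, penalty_percent=0, auto_accept_threshold=100):
    """A segment needs review when its penalized score falls below the threshold."""
    return apply_penalty(match_percent, penalty_percent) < auto_accept_threshold
```

Under these assumptions, a 101% match with a 5% penalty becomes a 96% match and is sent for review. A 1% penalty would not be enough, because it only demotes a 101% match to 100%.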
Catch parallel confirmers
In most translation tools, such as Phrase, XTM, Trados, and memoQ, segments are stored in the translation memory when the translator confirms them.
Lazy or busy translators may not confirm segments individually. When confronted with a full screen of pre-populated machine translations, some use a dangerous combination of keystrokes to confirm an entire file simultaneously. For example, in Phrase TMS, selecting all (Ctrl+Shift+A) followed by confirm (Ctrl+Enter) is a recipe for disaster! Careless linguists may make some obvious edits, assume the rest of the machine translated text is acceptable, and commit it to memory without proper scrutiny.
While it is difficult to prevent such malfeasance, or the conscious choice to use unedited machine translation in certain circumstances, it is easy to spot. A review of translation memory metadata quickly reveals this short-sighted tactic. If 394 translation units (TUs) were all created on July 18, 2024 at 19:09:04, you know they were not individually confirmed. You have exposed a parallel confirmer! Throw out the translation memory and tell the translator to start over, editing the segments more carefully, one by one.
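A metadata review like this can be scripted against a TMX export. The sketch below is one possible approach, not SimulTrans's own method: it relies on the standard TMX creationdate and creationid attributes, counts how many translation units share the same author and timestamp, and reports any group at or above a threshold. The function names and the default threshold of 50 are arbitrary choices, and the fallback from the tu element to its tuv children is an assumption, since tools differ in where they write these attributes.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def creation_stamp(tu):
    """Return (creationid, creationdate) for a <tu>, or None if no date is recorded.

    Checks the <tu> itself first, then its <tuv> children (last one first,
    assuming the target language variant comes last).
    """
    for element in [tu, *reversed(tu.findall("tuv"))]:
        stamp = element.get("creationdate")
        if stamp:
            return element.get("creationid", "unknown"), stamp
    return None


def find_bulk_confirmations(root, threshold=50):
    """Map each (creationid, creationdate) shared by >= threshold TUs to its count."""
    counts = Counter()
    for tu in root.iter("tu"):
        key = creation_stamp(tu)
        if key is not None:
            counts[key] += 1
    return {key: n for key, n in counts.items() if n >= threshold}
```

To run it on a file, pass the parsed root element, for example find_bulk_confirmations(ET.parse("memory.tmx").getroot()), where memory.tmx is a placeholder file name. TMX timestamps use the form 20240718T190904Z, so 394 units sharing that exact value would appear as a single entry with a count of 394.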
Huh?
For people who don't work with translation management systems every day, the points above might seem a little confusing. Watch our video to see what all this means:
Is your translation memory contaminated?
SimulTrans can analyze your translation memory to diagnose the severity of its machine translation infection. Upload your TMX file and we will use our proprietary algorithm to assign a contamination score.
Written by Adam Jones
As President and COO of SimulTrans, Adam manages and supports the company worldwide. He has spent over 30 years helping customers launch products and content internationally. Adam graduated from Stanford University, where he studied Public Policy with an emphasis on Education.