Custom Neural Machine Translation Development Process

To launch a new Custom Neural Machine Translation Engine (CNMTE), Trusted Translations requires an initial training and set-up period. The following is a typical implementation process for building a new NMT engine.

Selecting a baseline engine

Much is being written about the democratization of algorithms, but the concept applies more broadly. The democratization of technology solutions now makes robust baseline engines a good foundation on which to build a customized solution. Services from Google, Microsoft or Amazon let you feed your own clean data into engines that are already well trained.

Data selection and corpus preparation

There are various approaches to gathering training data for building a customized engine.

  • Existing translated content:

    The ideal starting point for any Custom Neural Machine Translation Engine is previously translated material whose content is as similar as possible to what is to be translated. The more previously translated material is available, the faster and more economical the process will be. If source and target are not yet paired as translation memory units, an alignment can be performed to obtain the bilingual content needed to boost the engine’s performance.

  • Existing monolingual data:

    If sufficient target-language reference content exists, its style and terminology can be leveraged by adding it to the training mix. This content has likely been written from scratch by local subject-matter experts (SMEs), and its value is second to none. Domain-specific or even client-specific terminology is an excellent asset when customizing NMT engines, since terminology has been identified as a main weakness of NMT technology.

  • Creating a specialized corpus from other sources:

    In addition to using monolingual data, we will search the web for materials that match as closely as possible the content that will run through the engine. Again, investing time in searching for the best-quality corpora always pays off. The same applies to bilingual data that can be obtained from data marketplaces. This external parallel data will need to be cleaned (spell-checked, alignments checked, duplicates deleted, etc.) before it can be used as training data for an MT system. Much more manual involvement is required in this scenario than when the client can deliver sufficient high-quality aligned data from the outset. It will take 4 to 6 weeks to build the new engine.
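
The cleaning steps mentioned for external parallel data can be illustrated with a minimal Python sketch. It is not the tooling actually used in the process; the function name `clean_parallel_corpus` and the `max_length_ratio` threshold are assumptions made for illustration. It covers whitespace normalization, removal of empty segments, a crude alignment check based on length ratio, and duplicate removal. Spell-checking is left out because it depends on language-specific resources.

```python
import unicodedata


def clean_parallel_corpus(pairs, max_length_ratio=3.0):
    """Return cleaned (source, target) pairs, preserving input order.

    `max_length_ratio` is an illustrative threshold, not a recommended value.
    """
    seen = set()
    cleaned = []
    for source, target in pairs:
        # Normalize Unicode and collapse runs of whitespace.
        source = " ".join(unicodedata.normalize("NFC", source).split())
        target = " ".join(unicodedata.normalize("NFC", target).split())

        # Drop segments where either side is empty.
        if not source or not target:
            continue

        # Crude alignment check: very different lengths suggest a misalignment.
        ratio = max(len(source), len(target)) / min(len(source), len(target))
        if ratio > max_length_ratio:
            continue

        # Delete duplicates, ignoring case differences.
        key = (source.casefold(), target.casefold())
        if key in seen:
            continue
        seen.add(key)
        cleaned.append((source, target))
    return cleaned


raw_pairs = [
    ("Hello  world", "Hola mundo"),
    ("Hello world", "Hola mundo"),  # duplicate after normalization
    ("", "Texto sin origen"),       # empty source
    ("Hi", "Esta frase es demasiado larga para ser la traducción"),
    ("Good morning", "Buenos días"),
]
cleaned_pairs = clean_parallel_corpus(raw_pairs)
```

A length-ratio filter is only a rough proxy for alignment quality; a human check of suspect segments is still needed, as described above.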

As more and more output is post-edited, the edited translations can be converted into high-quality retraining data. With this adaptive model, the quality of the system's output improves quickly over time.
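
As a rough illustration of how post-edited output might be turned into retraining data, the Python sketch below pairs each source segment with its post-edited translation and writes the result as tab-separated text. The record fields (`source`, `mt_output`, `post_edited`) and the function names are hypothetical, and the file format a given engine expects for retraining will differ by provider.

```python
import csv
import io
from difflib import SequenceMatcher


def build_retraining_pairs(records):
    """Pair each source segment with its post-edited translation.

    Each record is a dict with the hypothetical keys
    'source', 'mt_output' and 'post_edited'.
    """
    pairs = []
    for record in records:
        source = record["source"].strip()
        post_edited = record["post_edited"].strip()
        # Skip segments that have not been post-edited yet.
        if not source or not post_edited:
            continue
        # Similarity between raw MT and the post-edit (1.0 means unchanged);
        # useful for tracking how much editing the engine still needs.
        similarity = SequenceMatcher(None, record["mt_output"], post_edited).ratio()
        pairs.append({"source": source, "target": post_edited, "similarity": similarity})
    return pairs


def write_tsv(pairs, handle):
    """Write source/target pairs as tab-separated lines."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for pair in pairs:
        writer.writerow([pair["source"], pair["target"]])


records = [
    {"source": "Hello world", "mt_output": "Hola mundo", "post_edited": "Hola mundo"},
    {"source": "Good morning", "mt_output": "Buena mañana", "post_edited": "Buenos días"},
    {"source": "Not reviewed yet", "mt_output": "No revisado", "post_edited": ""},
]
pairs = build_retraining_pairs(records)
buffer = io.StringIO()
write_tsv(pairs, buffer)
```

Tracking the similarity score over time gives a simple, if approximate, view of whether the engine's raw output is moving closer to the post-edited result.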

Re-training: New CNMTEs Improve with Human Post-Editing

There are various workflows involving Custom Neural Machine Translation Engines. One common configuration integrates a human post-editing process. Under this workflow, the output from the Custom Neural Machine Translation Engine is edited by our expert linguists, both to improve the quality of the current output and to re-train the engine for future translations. As the reviewer modifies the output, the engine learns from the corrections through a dynamic adaptive model. Moreover, as more translations flow through the engine, it continues to improve over time. In other words, the quality gap between a full human translation and this solution narrows dramatically, while turnaround time and costs are significantly reduced. These engines, in our opinion, will become an asset and a market differentiator for any client with such a need.