Machine Translation engines evaluation framework
- Liffey Hall 1
- 10:55 on 15 July 2022
- 30 minutes
As engineers in the ML R&D department of a large healthcare enterprise, we were tasked with evaluating several Machine Translation (MT) engines and choosing the one best suited to our corporate needs. To do that, we created an extensible Python-based framework that let us easily plug in different MT engines and compare them across a large variety of test datasets with a unified set of quality metrics. Our goal from the start was to create a universal MT evaluation framework that would be useful not only for the healthcare domain but for the wider community as well.
In this talk we will present our evaluation framework and walk through its capabilities. We will also cover how it can be extended to new MT engines, new test datasets and new language pairs, and present our evaluation results for several state-of-the-art machine translation engines, both open-source and cloud-based.
All the source code of our framework will be published as open source by the time of the talk.
- Talk
- PyData: Deep Learning, NLP, CV
Evaluating Machine Translation engines can be very challenging. Translation quality varies greatly depending on domain and language pair. Different MT engines may have different interfaces or APIs and different runtime requirements. On top of that, even the definition of a good translation is debatable, and any automatic MT quality metric provides only an approximation of actual translation quality. That is why a universal evaluation framework for this task is so important. In our work we tried to create such a framework.
1) We defined a base translation class that unifies all file handling, batch creation and result processing. As a result, the only work needed to support a new MT engine is to create a small child class that implements a couple of simple functions. That allows us to easily extend our framework to new MT engines and new language pairs.
2) We defined a set of test datasets and provided a way to add new datasets to this set. For our evaluation we aimed to create test data that covers both the general and healthcare domains: the EMEA dataset (https://opus.nlpl.eu/EMEA.php), OPUS-100 (https://opus.nlpl.eu/opus-100.php), ParaCrawl (https://paracrawl.eu/) and several others. Our data preparation scripts can also be easily extended to other domains and datasets.
3) We defined a set of quality metrics to evaluate the results of MT engines. The metrics we used include BLEU (https://github.com/mjpost/sacrebleu), BERTScore (https://github.com/Tiiiger/bert_score), ROUGE (https://github.com/pltrdy/rouge), TER and chrF (both also from the sacrebleu implementation).
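The three building blocks above (a base translation class, test data and a metric set) could be sketched roughly as follows. This is only an illustration under our own assumptions, not the framework's actual code: the names `BaseTranslator`, `translate_batch`, `UppercaseEngine`, `exact_match`, `METRICS` and `evaluate` are all hypothetical, the toy engine merely uppercases its input, and the sacrebleu scores are registered only if that package happens to be installed.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Iterator, List, Sequence, Tuple


class BaseTranslator(ABC):
    """Hypothetical base class: owns batching, subclasses own the engine call."""

    def __init__(self, src_lang: str, tgt_lang: str, batch_size: int = 16) -> None:
        self.src_lang = src_lang
        self.tgt_lang = tgt_lang
        self.batch_size = batch_size

    @abstractmethod
    def translate_batch(self, sentences: List[str]) -> List[str]:
        """Translate one batch; the only method a new engine has to implement."""

    def _batches(self, sentences: Sequence[str]) -> Iterator[List[str]]:
        # Split the input into consecutive chunks of at most batch_size items.
        for start in range(0, len(sentences), self.batch_size):
            yield list(sentences[start:start + self.batch_size])

    def translate(self, sentences: Sequence[str]) -> List[str]:
        outputs: List[str] = []
        for batch in self._batches(sentences):
            outputs.extend(self.translate_batch(batch))
        return outputs


class UppercaseEngine(BaseTranslator):
    """Toy stand-in for a real engine wrapper (cloud API or local model)."""

    def translate_batch(self, sentences: List[str]) -> List[str]:
        return [s.upper() for s in sentences]


def exact_match(hypotheses: Sequence[str], references: Sequence[str]) -> float:
    """Toy metric: share of hypotheses identical to their reference."""
    if not references:
        return 0.0
    hits = sum(h == r for h, r in zip(hypotheses, references))
    return hits / len(references)


# Metric registry: name -> callable(hypotheses, references) -> float.
METRICS: Dict[str, Callable[[Sequence[str], Sequence[str]], float]] = {
    "exact_match": exact_match,
}

try:  # Optional: corpus-level sacrebleu scores, only if the package is available.
    import sacrebleu

    for _name in ("bleu", "chrf", "ter"):
        _fn = getattr(sacrebleu, f"corpus_{_name}")
        # sacrebleu expects a list of reference streams, hence [list(ref)].
        METRICS[_name] = lambda hyp, ref, fn=_fn: fn(list(hyp), [list(ref)]).score
except (ImportError, AttributeError):
    pass


def evaluate(engine: BaseTranslator, pairs: Sequence[Tuple[str, str]]) -> Dict[str, float]:
    """Run one engine over (source, reference) pairs and apply every metric."""
    sources = [src for src, _ in pairs]
    references = [ref for _, ref in pairs]
    hypotheses = engine.translate(sources)
    return {name: metric(hypotheses, references) for name, metric in METRICS.items()}
```

In a design like this, supporting another engine would mean writing one more subclass with its own `translate_batch`, and supporting another dataset would mean producing another sequence of (source, reference) pairs to pass to `evaluate`.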
Besides the MT evaluation framework, we will present our own evaluation results. For our evaluation we used cloud-based engines - Azure Translator (https://azure.microsoft.com/en-us/services/cognitive-services/translator/), Google Translate (https://cloud.google.com/translate/) - as well as open-source engines - Marian MT (https://huggingface.co/transformers/model_doc/marian.html), NVIDIA's NeMo (https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/machine_translation.html), Facebook's mBART-50 (https://huggingface.co/facebook/mbart-large-50-one-to-many-mmt) and Facebook's M2M100 (https://huggingface.co/facebook/m2m100_418M). For open-source engines we tried to use Hugging Face's Transformers implementation whenever possible. But, as mentioned, our framework was designed to be easily extensible to other MT engines and underlying frameworks.
We will also present evaluation results for NeMo and MarianMT engines that we fine-tuned specifically for the healthcare domain. While these particular results may be rather specific to our use case, they help to highlight how our framework can be extended to custom MT engines as well.