Preserving endangered languages with LUMI: machine translation for rare Finno-Ugric languages

Using resources from the LUMI supercomputer, researchers at the University of Tartu Institute of Computer Science in Estonia have added Livonian, Komi, Veps and 14 other low-resource Finno-Ugric languages to the university’s machine translation engine, Neurotõlge. The translation engine helps preserve endangered Finno-Ugric languages and supports their speakers.

In total, the university’s machine translation engine supports 23 Finno-Ugric languages: in addition to the more commonly supported Estonian, Finnish and Hungarian, it now includes Livonian, Võro, Proper Karelian, Livvi Karelian, Ludian, Veps, Northern Sami, Southern Sami, Inari Sami, Skolt Sami, Lule Sami, Komi, Komi-Permyak, Udmurt, Hill Mari and Meadow Mari, Erzya, Moksha, Mansi and Khanty.

For most of these languages, this is the first time they have been added to a public translation engine, as they are not part of Google Translate or similar services.

– We started working with Finno-Ugric languages in 2021, with the first system supporting Võro, Northern Sami and Southern Sami, said Maali Tars, Scientific Programmer at the Institute of Computer Science at the University of Tartu.

According to her, the team added Livonian to the machine translation engine the same year. Livonian is an extremely endangered language with only about 20 near-native speakers.

In the future, the researchers will continue to improve the quality of the current machine translation system and intend to include more Finno-Ugric languages and dialects.

– There are several reasons for developing machine translation for low-resource languages. For example, philologists and other interested parties need translation from these languages to understand texts, folklore and other material without learning the language. Translating into these languages is a way of preserving endangered languages and supporting their speakers, said Lisa Yankovskaya, Research Fellow in Natural Language Processing at the University of Tartu.

She added that this is why the translation system is unrestricted and open to all users, and the software and created models are open-source.

Caption: The University of Tartu NLP researchers, pictured from left to right: Research Fellow Lisa Yankovskaya, Professor Mark Fišel and Scientific Programmer Maali Tars. Image: Henry Narits

Improving translation quality

The research group invites speakers and researchers of these languages to contribute corrected translations to improve translation quality. This can be done by editing translations at translate.ut.ee. Poems, articles, books and other texts in these languages are also of great help and can be sent to ping@tartunlp.ai.

Yankovskaya explained that feedback is needed to improve the translation quality because many of these languages have extremely scarce resources for creating such translation systems.

– This means two things: first, the translation quality can vary a lot, and it can be especially low when translating into low-resourced languages. Second, we need the help of speakers of those languages by having them contribute correct translations on our platform, noted Yankovskaya.

The work was carried out in collaboration with the Livonian Institute at the University of Latvia, Võro Institute, the University of Eastern Finland, the Karelian language revitalization programme of the University of Eastern Finland and the Arctic University of Norway.

The work is funded by the National Programme of Estonian Language Technology.

You can find the machine translation engine Neurotõlge at translate.ut.ee.

Author: Henry Narits, the University of Tartu, Estonia