
Big steps towards a Norwegian answer to ChatGPT with LUMI


– There are many problems associated with the tech giants’ language models. They appear as black boxes to the outside world. We need Norwegian alternatives, says Professor Erik Velldal, at the Department of Informatics at the University of Oslo, Norway (UiO).

Before last Christmas, something crucial happened to speed up the Norwegian counterpart to ChatGPT being developed by the Language Technology Group (LTG) at UiO. They were granted computing time by Sigma2 on Europe’s most powerful computer, LUMI in Finland.

This allowed them to start training large language models on Norwegian data. Within a couple of weeks, enough data was processed for the researchers to launch three Norwegian language models.

– Training large language models requires a lot of computing capacity from GPUs, graphics processing units. This training scales well, which means that if you double the number of GPUs involved, the training completes roughly twice as fast. The advantage of LUMI is that there are a lot of GPUs available, over 10,000, says special advisor Hans A. Eide at Sigma2.
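
The scaling Eide describes can be put into a small back-of-the-envelope calculation. The sketch below assumes ideal linear scaling, which real training runs only approximate, and the baseline figures (1,000 hours on 64 GPUs) are hypothetical numbers chosen for illustration, not figures from the LUMI runs.

```python
def ideal_training_hours(baseline_hours: float, baseline_gpus: int, gpus: int) -> float:
    """Estimate training time assuming perfectly linear scaling with GPU count.

    This is an idealisation: communication overhead usually makes
    real-world scaling somewhat worse than linear.
    """
    if baseline_gpus <= 0 or gpus <= 0:
        raise ValueError("GPU counts must be positive")
    # Total work (GPU-hours) stays constant; it is simply spread over more GPUs.
    return baseline_hours * baseline_gpus / gpus


# Hypothetical baseline: 1,000 hours on 64 GPUs (not a figure from the article).
for gpus in (64, 128, 256):
    print(gpus, ideal_training_hours(1000.0, 64, gpus))
```

Each doubling of the GPU count halves the estimated time, which is the "roughly twice as fast" behaviour described above.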

– The National Library and the University of Oslo have previously released several Norwegian language models, but these are the largest we have made so far, and they are trained on over 30 billion words, explains Velldal.

All models have around 7 billion parameters, which the researchers consider optimal in relation to the amount of Norwegian training data available.

– A language model becomes poor if it is trained on too little material in relation to its size. It’s about finding the right balance, says Velldal.

A trick that has proven effective, provided you have enough processing power, is to train on the same data in several rounds. The models UiO has trained on LUMI have been fed the same training data six times, explains Velldal.
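
The balance between model size and training data can be illustrated with the figures quoted in this article. The sketch below is a rough calculation only: it counts words rather than the tokens a model actually processes, it uses 30 billion as a lower bound for "over 30 billion words", and the helper name `words_per_parameter` is made up for this example.

```python
WORDS_IN_CORPUS = 30e9  # "over 30 billion words", used here as a lower bound
PARAMETERS = 7e9        # "around 7 billion parameters"
EPOCHS = 6              # the same data was fed to the models six times


def words_per_parameter(words: float, parameters: float, epochs: int = 1) -> float:
    """Training words seen per model parameter, counting repeated passes."""
    return words * epochs / parameters


single_pass = words_per_parameter(WORDS_IN_CORPUS, PARAMETERS)
six_passes = words_per_parameter(WORDS_IN_CORPUS, PARAMETERS, EPOCHS)

print(f"one pass:   {single_pass:.1f} words per parameter")
print(f"six passes: {six_passes:.1f} words seen per parameter")
```

With these numbers, one pass gives about 4.3 words per parameter, while six passes raise the amount of text seen to about 25.7 words per parameter, even though the repeated text is not new material.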

Postdoc Vladislav Mikhailov, Associate Professor Andrey Kutuzov, PhD fellow David Samuel and Professor Erik Velldal, all from the Language Technology research group at the Department of Informatics at UiO. Image: Gina Aakre


Norway must have technological independence

The language technology group at UiO believes it is important to have Norwegian counterparts to OpenAI’s ChatGPT and Google’s LaMDA. Norwegian makes up only 0.1% of the text ChatGPT is trained on, and the training data is not fully disclosed. Velldal believes this is problematic in several ways.

– Microsoft and OpenAI allow Norwegian users to access the model through a web interface. The model behind it is closed. In many contexts, it can also be problematic to send data to a commercial third party, as when using ChatGPT. If you work with sensitive health data, for example, it is important to be able to control where and how the data is processed. Then it is essential to have access to open and free models that developers can run on their own machines.


Professor Erik Velldal, the Department of Informatics at the University of Oslo, Norway. Image: Gina Aakre.

Several large Norwegian public actors have nevertheless jumped on board and bought access to OpenAI’s ChatGPT.

– It is important to ensure that open Norwegian-developed models become available as an alternative, and perhaps especially for the public sector, Velldal points out.

The Oslo School is among the latest to announce that it will use the American service. This is despite several unresolved questions about the models' use of copyrighted material and the associated rights. UiO’s researchers are also working with the National Library of Norway on a project to compare language models developed on freely available and copyright-protected material. In the long term, this may provide guidelines for a future compensation scheme for the use of copyrighted material in language models.

Language models highlight stereotypes

There are several important reasons why we need Norwegian language models, according to Associate Professor Andrey Kutuzov at UiO. ChatGPT is poorly adapted to the knowledge and value base in Norway, he points out.

– The tech giants’ language models are essentially trained on UK and US English. They thus also reflect an American value set and culture. One example is that the American language models tend to reflect a more stereotypical gender distribution across professions than is the case in Norway, says Kutuzov.

In addition, English expressions often rub off on the Norwegian wording.

– A Norwegian language model will to a much greater extent reflect society as we know it in Norway, says Kutuzov.

Must be trained to solve tasks

The Norwegian language models have been launched and have already been downloaded by several thousand users. The models are initially aimed at researchers and developers. Kutuzov explains that the Norwegian versions have not been launched in web interfaces that are easy to use for everyone. He admits that they are still far from being able to offer the possibilities that the commercial language models provide. The models are trained to be general base models.

A language model is trained in several steps. These Norwegian models have received basic training, which means they can predict the next word in a text.
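
To make "predicting the next word" concrete, here is a deliberately tiny sketch. It is not how the Norwegian models work internally: they are large neural networks, whereas this toy simply counts which word most often follows another in a short, made-up example text. The function names and the example sentence are invented for this illustration.

```python
from collections import Counter, defaultdict


def train_bigram(text: str) -> dict:
    """Count, for every word, which words follow it in the text."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def predict_next(counts: dict, word: str):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Made-up example text, standing in for the training data.
corpus = "the model reads the text and the model predicts the next word"
model = train_bigram(corpus)

print(predict_next(model, "the"))  # "model" follows "the" most often in this text
```

A base model does the same kind of thing at a vastly larger scale and with far richer context, which is why it can continue a text but does not yet reliably follow instructions.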

For the Norwegian models to reach the same level as ChatGPT or similar models, they need more so-called instruction training. This will enable them to solve a wider range of tasks. This work is already underway at UiO, and updated versions of the language models will be released on an ongoing basis.

Even though the race with the American models seems tough, the researchers point out that Norwegian language models must be further developed.

– It’s an important principle that we create models that are free of restrictions. We must have such models that are based on openly available resources and that are transparent for the research community and industry. Large language models will increasingly serve as basic infrastructure for solving various tasks in research, industry, administration, and society in general, says Velldal.

Facts

Three new Norwegian language models have been launched, based on the GPT-like architectures BLOOM and Mistral, all with an “open source” license.

  • The models have been developed by the research environment at UiO in collaboration with Sigma2 and the National Library. Together with other actors in the national AI network NORA, the partners are planning a national infrastructure for the development and use of large Norwegian language models.
  • Two of the models have been trained from scratch in Norwegian.
  • The third is based on a model pre-trained for English by the French company Mistral AI, which has then been further trained for Norwegian.
  • The models are available at https://huggingface.co/norallm

This article was originally published on University of Oslo’s website and in English on the Sigma2 website.

Author: Gina Aakre