Root Signals: The quest for leadership in LLM reliability evaluation and quality control with LUMI

The Finnish company Root Signals, based in Helsinki and in Palo Alto, USA, is a leader in automated quality control of AI applications, chatbots and agents that are powered by large language models (LLMs). It provides companies with versatile tools to measure and improve the reliability and efficiency of LLM applications. Its newest work, the Root Judge LLM, was developed on the LUMI supercomputer.

Root Signals was founded in 2023 and has received pre-seed funding, followed by later investments from Business Finland. Since the Root Judge launch, the company has gained new paying customers from various industries, and it now has seven employees. Oguzhan Gencoglu has been the main user and supervisor of the team deploying LUMI in product development.

LLM evaluation and comparison have been extremely hard problems since the start of the LLM boom. Root Signals recently announced the groundbreaking open-source model Root Judge, which is specifically fine-tuned to judge the reliability of another LLM, i.e. to detect hallucinations and to provide transparent justifications for its scoring. This helps end users and developers evaluate and optimize their LLMs, ultimately building trust in AI-driven evaluation.

LUMI and BF computing grant as keys to a smooth start

Business Finland's (BF) LUMI computing funding made the use of LUMI effortless for Root Signals. Gencoglu had already used LUMI and had previous experience with CSC's other supercomputers.

– We knew that we needed huge computing capacity to train our LLM, so LUMI was the perfect choice for this job. Our initial tests went well. We benefitted from the expertise of CSC's LUMI user support and used the improved documentation, which enabled a smooth start to the project, he explains.

It takes time for a company to figure out what to do, and how, in a supercomputing setting. A Business Finland-funded RDI project starts on a given date, but the computing may not start right away. The LLM training effort can also take longer than expected. This was also Root Signals' experience.

– There was a slight mismatch between the computing time and the funding period for us. It would have been beneficial if there had been more flexibility in the project time cycle. We could have started our runs a little later and used a couple of "extra" months to conclude our project development runs of Root Judge, instead of running out of computing time at the end of the project, Gencoglu explains.

– Improving the visibility of project progress to BF and exchanging information with BF regularly during the project could be a working solution, he suggests.

Outcome of the project – Root Judge

LLM behaviour is sometimes hard to predict, so evaluating an LLM's reliability requires a separate LLM built for that purpose. Root Judge is designed for this and trained with millions of evaluation tasks. Massive amounts of data, both open-source and synthetic evaluation data, were needed to train an evaluation-specific LLM, that is, a judge LLM.

– We used almost 400 GPUs to develop our Root Judge evaluation LLM. We released it as open source, as a fully open-weight model that is also available for commercial use. We benchmarked it against the leading LLMs from providers such as OpenAI and Anthropic, as well as other open-source LLMs, and it outperformed them in the evaluation tasks, Gencoglu reveals.

Not many have access to a huge number of GPUs, so Root Signals made the commercially significant decision to quantise the Judge model, so that it uses fewer GPUs and still performs well on GPUs that can be bought off the shelf. Root Judge is now fast, accessible, and affordable. This has reinforced the market position of Root Signals' LLM evaluation tools and accelerated the company's growth and the development of a large service portfolio around Root Judge, with rich features for measuring, detecting and intervening on AI reliability issues.

New customers and brand building

Business customers understand the value of Root Judge and appreciate its pioneering efforts. Root Judge is hosted by Root Signals, which saves costs compared to tools from OpenAI and other big players.

– We have noticed growing interest since the public launch, also thanks to our own marketing efforts. It has been important for us to expand our reputation and cement our public perception as an AI powerhouse. We needed to show and convince our customers that they can trust us with outsourcing their evaluation. I can say that providing value also as open source is a cornerstone of our commercial strategy, to show developers what we can do, Gencoglu explains.

New horizons

Root Signals' LUMI computing project is now finished, but the company is not settling for the current RDI results. It hopes to continue the development work on LUMI or on other EuroHPC supercomputers in order to develop and train a reasoning Judge model. This is a new, trending direction in LLM product development for Gencoglu and his team.

– We want to develop our services and extend our LLMs not only to predict but also to reason about the evaluation in decision making. Normal Judge LLMs cannot simply be turned into reasoning models. We need to teach them to reason, much as humans do, Gencoglu describes.

Nobody has trained a reasoning judge model yet.

– Reasoning models are slow to use, but also the use cases are different, Gencoglu concludes.

So, let's wait and see when this kind of development happens in the Finnish AI scene.