Observations from the LUMI User Survey: what lies ahead for 2024?

The LUMI User Survey is conducted annually to measure overall satisfaction with the LUMI services, to hear about needs for improvement, and to address any issues users may face. An overview of the 2023 results was given by Pekka Manninen, Director of Science and Technology at CSC, at the LUMI User Support Team’s (LUST) virtual coffee break in December 2023.

Slightly more than a hundred replies were received, many of them from LUMI project team leaders. The scientific domains represented in the survey correlated well with the actual use of the system: the top three domains among respondents, physics, climate/earth sciences and AI, are also the top domains in the actual use of LUMI.

Image: Pekka Manninen giving an overview of the results of the LUMI User Survey 2023.

Challenges and solutions

A lot of valuable feedback was given in the survey. Most of the survey respondents found it relatively easy to log into the system.

Survey respondents pointed out challenges with the length of the queues for the system, especially for the LUMI-C partition. They also indicated a need for help with installing software and optimizing code.

– Most of the survey respondents were satisfied with the job scheduling system Slurm. Current issues with queuing time mainly affect projects running on a large number of nodes. There are simply so many projects on the system, hence the queues for LUMI-C. In 2024, we will analyse and revisit the batch job policies, and the queue situation should improve slightly. We could also transfer users from LUMI-C to LUMI-G, where the queues are shorter. However, we do not touch job priorities manually; they follow Slurm practices. We have been asked about this many times and want to make it clear, said Pekka Manninen.

– We are also putting more effort into application support and into making software installation smoother, he continued.

Some respondents faced stability issues such as file system performance problems or crashing jobs. Half of the respondents did not see this as a critical issue, but 25% said there were issues.

– We are aware of these issues, but I hope that we are finally starting to be on a steady flight with the system. Node failures and node availability have been improving lately, and we are already in a better state than we were a few weeks ago. Furthermore, a full interconnect audit will take place in 2024, which should help with the instability of jobs, nodes and the file system, Manninen pointed out.

One valid observation from the survey was that the available ROCm versions were inadequate.

– ROCm is developing very rapidly, and we are trying to improve the availability of newer ROCm versions to all LUMI users via a shared directory on LUMI, Manninen said.

Respondents also reported frequent and long service breaks in 2023.

– System installations were still ongoing in 2023, which caused service breaks. Fewer service breaks are planned for 2024: at the moment, the only one planned is a major software stack upgrade in the spring, after Easter. We are also planning to increase the notice time for system breaks from one to two weeks in order to make the plans public even earlier, Manninen said.

Things to be developed

Respondents named the LUMI main website, user documentation and training materials as their preferred channels of communication, and they also made heavy use of the LUST support forms. Even though the current LUMI documentation is generally valued, it needs more detail on the use of Slurm, Python, and computing environments and containers in general. The documentation will be further developed in 2024 to meet the needs of the users.

Regarding LUMI user training, users were quite satisfied with the frequency and topics of the courses offered. However, respondents pointed out that the introductory courses were too advanced. The training wish list also included other subjects, namely courses on the LUMI computing environment, the use of containers, and Python programming.

– We have analysed the feedback and it will affect the course planning for 2024: we want to provide a LUMI 101-type course and courses about the computing environment, and to arrange porting workshops for AMD GPUs, Manninen added.

The survey results also revealed that there is room for improvement in the response times of the LUST team.

– This is a resourcing question, and unfortunately there are limits to how much the LUST team can do. They are very active in filing issues with the system vendor and our sysadmin team, but not all problems can be solved on the spot. Everybody is doing their best every day, and the human-level interaction indicators were very good in the responses. In 2024, we will focus strongly on streamlining the multi-layer user support and clarifying the roles between LUST, the national support teams from the LUMI consortium countries and support from the system vendor. Also, the EPICURE project will play a role in making the user support even better, Manninen continued.

Further ideas raised

Other suggestions for improvement raised in the survey included a public knowledge base where users could look up problems before contacting the LUST team, and improvements to the help desk ticketing system. These will also be investigated further in 2024.

– We thank everyone who took part in the survey. Your feedback is very valuable to us, and we will do our best to make the use of LUMI even better for the users, Manninen concluded.

Author: Anni Jakobsson, CSC – IT Center for Science