Exploring AI training on the LUMI supercomputer

AI specialists from academia and industry took a deep dive into LUMI’s capabilities for advanced AI training during a two-day workshop organized by the LUMI user support team and EuroCC2.

The LUMI consortium and the EuroCC2 project aim to provide users access to world-class HPC facilities and equip them with practical skills to harness their potential. On May 29–30, they joined forces to organize a specialized workshop on AI training using the LUMI supercomputer titled “Moving your AI training jobs to LUMI: A Hands-On Workshop.”

– With this workshop, we aimed to give existing AI researchers an introduction to LUMI for AI, providing them with the knowledge and hands-on experience they need to move their AI training workflows to LUMI. I was impressed with the engagement of the participants and their progress towards running their AI training jobs on LUMI, says Christian Schou Oxvig, LUMI user support team member and senior specialist in HPC & AI/ML at DeiC.

The workshop drew AI specialists from several European countries keen to explore the state-of-the-art in high-performance computing for AI. The two-day event provided a deep dive into LUMI’s unique architecture and capabilities, aiming to equip attendees with practical skills for AI model training on LUMI. The workshop started with an introduction to LUMI and its architecture, highlighting differences from other clusters, such as using AMD GPUs and the Slingshot Interconnect. Following this, participants were guided through the LUMI web interface and hands-on sessions with PyTorch in JupyterLab, focusing on the practical limitations and advantages of the interactive interface.

Next, the LUMI batch job system was introduced, and the attendees learned to submit their PyTorch AI training jobs from the command line. They learned to monitor GPU activity and were introduced to using Singularity containers to manage their AI software environments on LUMI, including converting conda or pip environments into containers. For this task, the participants used cotainr, a container build tool developed by DeiC.
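The container-based batch workflow described above can be sketched roughly as follows. This is a minimal illustration, not an official recipe: the environment file, script names, project account, and partition are placeholders you would replace with your own (on LUMI, GPU jobs run on partitions such as small-g):

```shell
# 1) Build a Singularity image from a conda environment file with cotainr
#    (--system=lumi-g selects a base image suited to LUMI's AMD GPU nodes):
cotainr build pytorch.sif --system=lumi-g --conda-env=my_env.yml

# 2) Submit the training job with a Slurm batch script (job.sh):
#
#   #!/bin/bash
#   #SBATCH --account=project_<id>   # your LUMI project allocation
#   #SBATCH --partition=small-g      # a LUMI GPU partition
#   #SBATCH --gpus-per-node=1
#   #SBATCH --time=01:00:00
#   srun singularity exec pytorch.sif python train.py
#
sbatch job.sh
```

Keeping the full software environment inside the container sidesteps version conflicts with the host system and makes the job reproducible on other nodes.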

Day two focused on scaling AI training to multiple GPUs, addressing common challenges and solutions for distributed computing. Practical sessions included converting single GPU training jobs to use all GPUs in a node and performing hyper-parameter tuning with Ray using multiple GPUs. The workshop also covered extreme-scale AI with model parallelism and optimizing network performance across multiple nodes.
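The core idea behind converting a single-GPU training job to multiple GPUs, as covered on day two, is that each worker computes gradients on its own shard of the batch and the per-worker gradients are then averaged, so the update matches what a single worker would compute on the full batch. A framework-free sketch of this principle (plain Python, not LUMI-specific code and not the workshop's actual material) for a one-parameter linear model:

```python
# Data-parallel gradient averaging, illustrated for a 1-D linear model
# y_hat = w * x with squared loss L = mean((w*x - y)**2).
# The gradient dL/dw on the full batch equals the average of dL/dw
# computed independently on equal-sized shards of that batch.

def grad(w, xs, ys):
    """Gradient of the mean squared loss with respect to w."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Single "GPU": gradient on the whole batch.
g_full = grad(w, xs, ys)

# Two "GPUs": each computes a gradient on half the batch;
# the all-reduce step then averages the two results.
g0 = grad(w, xs[:2], ys[:2])
g1 = grad(w, xs[2:], ys[2:])
g_avg = (g0 + g1) / 2

print(g_full, g_avg)  # identical values: -22.5 and -22.5
```

In PyTorch, DistributedDataParallel performs this averaging with an all-reduce across GPUs after the backward pass, which is why a correctly converted multi-GPU job reproduces the single-GPU gradients for the same effective batch size.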

Participants were also introduced to the challenges related to handling AI training data on LUMI, with sessions on loading training data from Lustre and LUMI-O. A short introduction to coupling machine learning with “classical” HPC simulations using SmartSim was also given. The event culminated with participants applying their new skills to their projects, supported by the workshop’s instructors.

Among the participants was Eleni Briola, a Machine Learning Scientist at the Danish Meteorological Institute, who signed up for the workshop to familiarize herself with LUMI and to optimize the performance of her machine learning algorithms by running them on this high-performance computing system:

– The workshop was extremely beneficial. The hands-on exercises were particularly valuable, providing a deeper understanding of the concepts and practical experience with LUMI, she said.

With her was her DMI colleague, Research Scientist Irene Livia Kruse, who hopes to use the LUMI supercomputer in the near future for AI-based weather research and forecasting and therefore jumped at the chance to get guidance on the system before starting. Like her colleague, she was very satisfied with the outcome of the workshop:

– The training workshop has been clear, and the hands-on tutorials throughout are a brilliant way to figure out in real-time if I’ve understood the lectures (and if there’s something I haven’t fully understood, I’ve enjoyed having the opportunity to ask the technical support for help on the spot). Besides learning about the LUMI system and its GPUs, I also got acquainted with coding techniques I can apply in my day-to-day work, Kruse adds.

Image: Some of the participants of the AI workshop. Photo credit: DeiC

Want to know more?

All slides, tasks, Q&A, and recorded presentations are available in the LUMI training archive: https://lumi-supercomputer.github.io/LUMI-training-materials/ai-20240529/

Author: Anne Rahbek-Damm, DeiC