
Scaling the pre-training of large language models of >100B parameters to thousands of AMD MI250X GPUs on LUMI

Large language models (LLMs) are a hot topic right now and an increasingly important workload of HPC clusters, such as LUMI, which currently is the fastest supercomputer in Europe. Multiple initiatives are in progress for pre-training LLMs on LUMI, such as the OLMo project from the Allen Institute for AI and the Poro model series collaboration project of Silo AI, TurkuNLP, and HPLT. In addition to these ongoing projects, multiple new initiatives are brewing all around Europe, targeting various EuroHPC supercomputers.

LLM training is well known for requiring massive amounts of computing resources. Training the largest models from scratch requires trillions of tokens, and training on the data requires millions of GPU hours. The LUMI supercomputer offers most of its computing capacity in the form of AMD Instinct MI250X accelerators, and it is a powerful platform for training the largest models.

This technical blog post describes the considerations related to the configuration for efficient model training and evaluates the ability of LUMI to scale the training of a model with 140 billion parameters to thousands of GPUs. The results indicate that there are no fundamental bottlenecks in the system and that the software allows training and scaling to thousands of GPUs efficiently.

LUMI GPU partition hardware from the AI point of view

The GPU partition, LUMI-G, has 2978 nodes, each with four AMD MI250X GPUs with 128 GB of memory and a 64-core AMD EPYC CPU. The MI250X is a multi-chip module (MCM) with a dual-GCD (graphics compute die) design, which in practice means every node exposes eight logical devices, each with direct access to 64 GB of HBM. This differs from having four 128 GB GPUs per node and affects the optimal way to partition AI models across multiple GPUs.

Each node has four 200 Gbps Slingshot-11 network interface cards (NICs). The GCDs, the CPU, and the NICs are connected with varying counts of Infinity Fabric links, which determine the effective bandwidth between GCDs, providing up to 400 GB/s of bidirectional bandwidth. The nodes themselves are connected in a dragonfly topology. When benchmarking the collective operations used, this network topology was not observed to limit the communication bandwidth. The total bandwidth proved to be ample, and the pure communication overhead was minimal, approximated at no more than single-digit percentages based on the weak scaling results shown in the results section.


Software stack

PyTorch is the most common machine learning framework for training large language models. AMD's software stack, ROCm, has been officially supported in PyTorch since version 1.8, with a quickly increasing number of use cases working out of the box. With standard PyTorch workloads, everything should work just like with other vendors, usually with no code changes. Some machine learning software is, however, written with custom kernels and only CUDA in mind. To address these use cases, AMD provides tools like HIPIFY to convert CUDA code into ROCm-compatible HIP C++. For the collective communications library NCCL, the AMD equivalent is RCCL. On LUMI, note that a plugin called aws-ofi-rccl is required for RCCL to use the fast interconnects. It comes pre-built in many of the AI software images on LUMI. This is further described in the LUMI documentation.

The benchmarks in this blog post were performed with Megatron-DeepSpeed on a standard PyTorch container image provided on LUMI, with the addition of a few relevant Python libraries and AMD-ported Flash Attention 2, which can be installed on LUMI by the user.

Note: This post only discusses PyTorch; for other machine learning frameworks, please refer to the LUMI documentation.

Pre-training LLMs with Megatron-DeepSpeed

The large parameter count of the largest models means that, with GPU accelerators, these models must be split across multiple devices, as they do not fit into the memory of a single device. Thus, the traditional scaling technique, data parallelism (DP), is not viable on its own for the largest models. With a naïve implementation that splits the model only layer- or tensor-wise, without an efficient schedule, scaling will be poor beyond even tens of GPUs. Thus, more advanced solutions are required to utilize thousands of GPUs efficiently. This blog post benchmarks the 3D-parallelization strategy implemented in Megatron-DeepSpeed. Recently, PyTorch has also introduced a new technique, Fully Sharded Data Parallel (FSDP), which will be discussed separately in a later blog post.
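As a sketch of how 3D parallelism carves up the devices, the snippet below maps a global rank to (data, pipeline, tensor) coordinates. The axis ordering is an assumption that mirrors the common Megatron-style layout (tensor-parallel peers adjacent, data-parallel replicas furthest apart); real implementations may order the axes differently.

```python
# Illustrative decomposition of a global rank into 3D-parallel coordinates.
# The grouping order (TP fastest, then PP, then DP) is assumed here;
# actual frameworks may differ.
def rank_to_coords(rank: int, tp: int, pp: int, dp: int) -> tuple[int, int, int]:
    assert 0 <= rank < tp * pp * dp
    tp_rank = rank % tp            # tensor-parallel peers are adjacent ranks
    pp_rank = (rank // tp) % pp    # pipeline stage changes next
    dp_rank = rank // (tp * pp)    # data-parallel replica changes slowest
    return dp_rank, pp_rank, tp_rank

# Example: 6144 ranks (768 LUMI nodes x 8 GCDs) with TP=8 and PP=8.
world_size, tp, pp = 6144, 8, 8
dp = world_size // (tp * pp)
print(dp)                                # 96 data-parallel model replicas
print(rank_to_coords(6143, tp, pp, dp))  # last rank: (95, 7, 7)
```

The same decomposition (in whatever axis order) underlies the communication groups: all ranks sharing a (dp, pp) pair form a tensor-parallel group, and so on.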

Finding the optimal configuration for 3D parallelism

This post is not enough to fully describe all the details of optimizing a given configuration to run on LUMI. However, to highlight the most relevant parts, this section describes some of the fundamental design considerations when optimizing the configuration of model parallelism with Megatron-DeepSpeed. We assume the reader has prior knowledge of the different forms of parallelism.

Tensor parallelism allows multiple GPUs to compute parts of the same model layers simultaneously. However, there is an overhead in combining the results. Also, the smaller the parts get, the less efficiently each GPU can operate, especially on micro-batches as small as one. The small micro-batch is essential for allowing a high node count and a manageable pipeline schedule, and the benchmarks below were run with a micro-batch size of one. We have found that the MI250X strongly favors multiples of 1024 for the model's hidden size. With tensor parallelism, the hidden size divided by the tensor-parallel degree should be at least 1280 for good performance on MI250X, as dividing the model into even smaller pieces started to cause a considerable performance penalty, due to the loss of GPU efficiency and the overhead of combining the partial results.
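The per-shard rule of thumb above can be written as a small check. Note that the ~1280 threshold is an empirical observation from these benchmarks, not a hard limit, and the hidden size of 12288 used in the example is purely an assumed, illustrative value for a large model.

```python
# Heuristic from the text: hidden size per tensor-parallel shard should stay
# at or above ~1280 for good MI250X efficiency. Empirical rule of thumb only.
def tp_shard_ok(hidden_size: int, tp_degree: int, min_shard: int = 1280) -> bool:
    assert hidden_size % tp_degree == 0, "hidden size must divide evenly across TP ranks"
    return hidden_size // tp_degree >= min_shard

# Assumed hidden size of 12288 for illustration: 8-way TP gives 1536-wide
# shards (fine), 16-way TP gives 768-wide shards (too small by this rule).
print(tp_shard_ok(12288, 8))   # True  (12288 / 8  = 1536 >= 1280)
print(tp_shard_ok(12288, 16))  # False (12288 / 16 =  768 <  1280)
```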

Pipeline parallelism, even with a good schedule, adds a sequential component to the processing, which causes GPUs to idle while waiting for other GPUs in the pipeline to finish a step. This sequential component causes the majority of the lost performance in the results below, as there is a certain amount of time during which no useful computation can be done.
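This idle time is often estimated with the textbook "pipeline bubble" fraction for a GPipe/1F1B-style schedule. This is an idealized model, and the actual schedule and overheads in Megatron-DeepSpeed can deviate from it, but it shows why the number of micro-batches per step matters so much.

```python
# Idealized pipeline bubble estimate for a GPipe/1F1B-style schedule:
# with p pipeline stages and m micro-batches per optimizer step, the
# fraction of time lost to pipeline idling is (p - 1) / (m + p - 1).
def bubble_fraction(pp_stages: int, micro_batches: int) -> float:
    return (pp_stages - 1) / (micro_batches + pp_stages - 1)

# With 8 pipeline stages: many micro-batches hide the bubble well,
# few micro-batches leave the pipeline idle much of the time.
print(f"{bubble_fraction(8, 128):.1%}")  # 5.2% idle at 128 micro-batches
print(f"{bubble_fraction(8, 8):.1%}")    # 46.7% idle at 8 micro-batches
```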

Data parallelism refers to different copies of the model operating on different sets of data. In this experiment, data parallelism, together with the global batch size, sets the upper limit on how many nodes can be used, as at some point there is only one sample per model copy for a gradient update. However, a small batch size is essential for keeping sample efficiency good. While a bigger batch size can improve FLOPS utilization, the resulting model is less useful due to worsened sample efficiency. With a global batch size of 768 samples on 768 nodes, 8-way tensor and 8-way pipeline parallelism, and a micro-batch size of one (1), there are only eight (8) gradient accumulation steps, which is not a lot to schedule across eight (8) pipeline stages. For further details on all the features of Megatron, please see the Megatron-DeepSpeed documentation and the excellent Megatron-LM articles by Nvidia that go into deeper detail on the software. Short courses on the topic are also being planned, but there are no further details at this point.
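The arithmetic behind the eight gradient accumulation steps mentioned above can be reproduced directly from the run parameters:

```python
# With 768 nodes (8 GCDs each), 8-way tensor x 8-way pipeline parallelism,
# a global batch of 768 samples, and micro-batch size 1, only 8 gradient
# accumulation steps remain to fill the 8 pipeline stages.
nodes, gcds_per_node = 768, 8
tp, pp = 8, 8
global_batch, micro_batch = 768, 1

world_size = nodes * gcds_per_node              # 6144 ranks
dp = world_size // (tp * pp)                    # 96 data-parallel replicas
grad_accum_steps = global_batch // (dp * micro_batch)
print(grad_accum_steps)                         # 8
```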

Note: In this post, sequence parallelism was enabled to aid scaling, even though it is not currently used in actual model training work on LUMI: some as-yet-unidentified features in the Megatron-DeepSpeed configuration conflict with sequence parallelism. The evidence suggests this is an issue with Megatron-DeepSpeed itself rather than with any AMD implementation, as others have reported it on other platforms as well. As of writing this post, work on this is still continuing, so we advise against using sequence parallelism for now. If you plan to do anything similar, please reach out to us for an update.


Benchmark setup

To evaluate the scaling properties of LLM training, we chose a realistic model size of 140B parameters, as it is close to the upper end of models actually being trained on LUMI. The goal was to simulate a real training scenario, so the software stack and parameters were only slightly modified from those of the follow-up models to Poro currently being trained on LUMI. Much of the effort in collecting the performance data and lessons learned for this post was done in collaboration with SiloGen and TurkuNLP, in an effort to achieve the best Megatron-DeepSpeed performance on LUMI.

The results below are intended to serve as pointers to what is achievable in a realistic training scenario, not just a theoretical one. In practice, this means, for example, that the batch size is selected to be in the range of actual state-of-the-art models, even though growing the batch size and the model size with the node count would make (super)linear scaling trivial.

The configuration for the runs below is as follows: sequence length of 5120, batch size up to 768, 8-way tensor parallelism, 8-way pipeline parallelism, data-parallel degree varying with node count from 6 to 96, sequence parallelism enabled, micro-batch size of one (1), and gradient accumulation steps varying with batch size from 8 to 128.
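The stated parallelism degrees fit together as sketched below. The intermediate node counts are an assumption (doubling from 48 up to 768, consistent with the 16-fold range in the strong scaling results); only the endpoints are given in the text.

```python
# Data-parallel degree and gradient accumulation steps implied by the
# configuration above (TP=8, PP=8, micro-batch 1, global batch 768,
# 8 GCDs per node). Intermediate node counts assumed to double.
tp, pp, micro_batch, global_batch, gcds = 8, 8, 1, 768, 8

for nodes in (48, 96, 192, 384, 768):
    dp = nodes * gcds // (tp * pp)
    gas = global_batch // (dp * micro_batch)
    print(f"{nodes:4d} nodes -> DP={dp:3d}, grad-accum steps={gas:3d}")
```

Note that DP multiplied by the gradient accumulation steps always equals the global batch size of 768, which is why the two ranges (6 to 96 and 128 to 8) mirror each other.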

The maximal batch size of 768 with a sequence length of 5120 results in a total batch size of 3.9M tokens, a realistic number close to what is used in state-of-the-art research. As the MI250X is a dual-chip module, 768 nodes correspond to 6144 logical GPUs, and thus 6144 communication "ranks" visible to the software, making this level of distributed training of a single model a real test of the performance and stability of the whole system. Initially, the plan was to run on the entire LUMI, and this was attempted, but currently a PyTorch bug makes initializing a distributed environment of over 4k ranks quite unreliable, if not impossible. A projection estimated that a 500B model could be trained on the whole of LUMI with over one exaflop of total throughput, but sadly that experiment is postponed until the initialization issue is resolved. At this point, a 768-node job was the largest that could reliably be launched, and the smaller 140B model size was more realistic for this node count.
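The two headline numbers in the paragraph above follow directly from the run parameters:

```python
# Batch size in tokens: 768 sequences of 5120 tokens each,
# and the rank count: 768 nodes x 8 GCDs (two per MI250X module).
batch_size, seq_len = 768, 5120
tokens_per_batch = batch_size * seq_len
ranks = 768 * 8

print(f"{tokens_per_batch:,} tokens (~{tokens_per_batch / 1e6:.1f}M)")  # 3,932,160 (~3.9M)
print(ranks)                                                            # 6144
```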

Below are benchmark results for weak and strong scaling. Weak scaling refers to increasing the workload alongside the processor count. Strong scaling refers to keeping the workload constant while increasing the number of processing units; in the case of LLM training, this means keeping the model and batch sizes fixed for all runs.
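For reading the figures below, the textbook efficiency definitions are sketched here: weak-scaling efficiency compares per-GPU throughput to a baseline run, while strong-scaling efficiency compares the total speedup to the ideal linear speedup.

```python
# Standard scaling-efficiency definitions, for interpreting the figures.
def weak_scaling_efficiency(per_gpu_tflops: float, baseline_per_gpu_tflops: float) -> float:
    # Ideal weak scaling keeps per-GPU throughput constant (efficiency 1.0).
    return per_gpu_tflops / baseline_per_gpu_tflops

def strong_scaling_efficiency(total_tflops: float, baseline_total_tflops: float,
                              gpu_ratio: float) -> float:
    # Ideal strong scaling multiplies total throughput by the GPU ratio.
    return (total_tflops / baseline_total_tflops) / gpu_ratio

# Example: total throughput grows 12.8x while the GPU count grows 16x,
# i.e. per-GPU throughput drops to 80% of the baseline.
print(strong_scaling_efficiency(12.8, 1.0, 16))  # 0.8
```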

However, keeping the problem identical throughout the increase in processor count is challenging, as the largest settings can lack sufficient data samples to schedule across all the devices. In any case, the time to result for exactly the same problem is a very important metric for the ability to do research on a system, and this time to result is what the strong scaling results aim to convey.

Weak scaling

The weak scaling experiments were done holding everything constant except the batch size, which was scaled with the number of nodes. This way, the gradient accumulation steps, and thus the ratio of computation to communication and the various overheads, stayed constant. However, this also means the results are not equivalent across the different training runs.
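Scaling the batch size one-for-one with the node count keeps the gradient accumulation steps constant, as the sketch below shows (node counts assumed to double from 48 to 768; one batch sample per node, matching the 768-sample batch at 768 nodes):

```python
# In the weak-scaling runs, global batch size grows with node count,
# so the gradient accumulation steps (and hence the computation-to-
# communication ratio) stay constant. TP=8, PP=8, micro-batch 1 assumed.
tp, pp, gcds = 8, 8, 8

for nodes in (48, 96, 192, 384, 768):
    global_batch = nodes                 # one batch sample per node
    dp = nodes * gcds // (tp * pp)       # data-parallel degree grows with nodes
    gas = global_batch // dp             # ... so grad-accum steps stay fixed
    print(nodes, gas)                    # gas is 8 at every node count
```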

Figure 1: Weak scaling, per-GPU FLOPS (x-axis: number of MI250X accelerators)

Figure 2: Weak scaling, total petaFLOPS with BF16

Figures 1 and 2 show only a minimal drop in performance in the weak scaling experiments. This is expected, as there is no fundamental reason for the performance to drop. The small drop is likely caused by the increased communication cost (a larger number of ranks in the gradient all-reduce) and normal variation between individual GPUs, as the throughput is limited by the slowest device. This result shows that the Slingshot-11 network is very capable of handling big LLM training runs on LUMI.

Strong scaling

In the strong scaling experiments, everything was kept constant, including the batch size of 768. This means that smaller node counts can compute for much longer before they need to exchange results; they are therefore expected to perform better in terms of FLOPS utilization, as their ratio of computation to communication is much higher.

Figure 3: Strong scaling, complete training cycle TFLOPS per MI250X

Figure 4: Strong scaling, total petaFLOPS with BF16

Figures 3 and 4 show the performance numbers for the strong scaling experiment. The results align with the theoretical predictions calculated from the inherent idle time caused by the pipeline bubble: in practice, some time is lost while a GPU waits for the results from the previous pipeline stage or for the next stage to finish. The overall scaling is still very good, with the largest run maintaining 80% of the per-GPU throughput while increasing the GPU count 16-fold.

As the peak BF16 performance of MI250X is 383 TFLOPS, the 161.2 TFLOPS of the 48-node setup corresponds to an impressive FLOPS utilization of 42% over the whole training loop. This is a good result considering the already high node count and degree of model parallelism. It also hints that when optimizing purely for FLOPS utilization, even better results should be achievable with a larger micro-batch size, less tensor parallelism, and methods that fully overlap computation and communication, such as FSDP. However, this experiment aimed for the best performance at thousands of ranks.
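The utilization figure quoted above follows directly from the achieved and peak throughput numbers:

```python
# FLOPS utilization of the 48-node setup: 161.2 achieved TFLOPS per MI250X
# against the 383 TFLOPS peak BF16 throughput of the accelerator.
achieved_tflops, peak_tflops = 161.2, 383.0
utilization = achieved_tflops / peak_tflops
print(f"{utilization:.1%}")  # 42.1%
```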

Many assumptions commonly made when optimizing FLOPS utilization at small scale would be unrealistic or incompatible with a large run on thousands of GPUs using a realistic and efficient model and batch size. For example, if the batch size grows to tens of millions of tokens at large node counts, sample efficiency suffers considerably with current optimization methods. Likewise, pure FSDP would not be a viable option for running a batch size of 768 with 6144 ranks; combining it with a tensor-parallel degree of at least 16 would be needed to allow some computation-communication overlap, largely eliminating the performance benefit. Thus, 3D parallelism was chosen for this benchmark.


Conclusions

Optimizing and scaling LLM training is becoming increasingly important. However, the current parallelization algorithms all have fundamental drawbacks that need to be carefully considered and tuned for when training on thousands of devices in parallel. In this blog post, we analyzed, discussed, and evaluated the ability of LUMI to train the largest language models. The results indicate that there are no fundamental bottlenecks in the system and that the software allows training and scaling to thousands of GPUs efficiently.

Author: Väinö Hatanpää, Machine learning specialist, CSC – IT Center for Science.

The author supports large-scale training of AI models and advancing international AI collaboration and communication at CSC.