[Updated] Jobs fail with “Error configuring interconnect”

[Update – 22.5.23 13:00] A workaround has been implemented that should fix the issue in most cases, but the error can still occur. It is not yet known when the final fix for the root cause can be applied.

[Update – 17.4.23 16:30] We have identified the root cause of the issue: the Slurm Slingshot switch plugin fails to release resources properly when a job ends. We are working on a workaround that should improve the situation considerably, if not completely. The root cause should be fixed in the Slurm version that comes with CPE 23.05.

Currently, many Slurm jobs fail or hang with `srun: error: task xxx launch failed: Error configuring interconnect`.

We started observing these errors after last week’s system upgrade. The system admins are currently investigating them. We will update this page once we know more.

The only workaround at the moment is to identify the affected nodes and exclude them explicitly at job submission with the `--exclude` flag.
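As a rough illustration of the exclusion step, the sketch below joins a list of node names into the comma-separated form that `--exclude` accepts and prints the resulting `sbatch` command. The helper name `build_exclude_list`, the node names `nid001234` and `nid001567`, and the script name `job.sh` are placeholders chosen for this example, not values from the affected system; substitute the nodes you have actually seen failing.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): read node names from stdin,
# one per line, and print them as a comma-separated list for --exclude.
build_exclude_list() {
    # Drop blank lines, remove duplicates, then join the rest with commas.
    grep -v '^[[:space:]]*$' | sort -u | paste -sd, -
}

# Placeholder node names; replace with the nodes that failed for you.
bad_nodes=$(printf '%s\n' nid001234 nid001567 nid001234 | build_exclude_list)

# Print the submission command instead of running it, so the sketch
# also works on a machine without Slurm installed.
echo "sbatch --exclude=$bad_nodes job.sh"
# prints: sbatch --exclude=nid001234,nid001567 job.sh
```

The same `--exclude` option can also be given to `srun`, or set inside a batch script with an `#SBATCH --exclude=...` line.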