[Resolved] Slurm losing connection with a large number of nodes

[22.10.24 16:30 CEST (17:30 EEST)]

The issue was traced to a single switch in one of the LUMI-C cabinets. Resetting the switch seems to have resolved it.
Please let the LUMI User Support Team know if you still experience any issues. Thank you.

[22.10.24 16:00 CEST (17:00 EEST)]

Slurm was losing connection with a large number of nodes earlier today until it was moved to a healthier worker node.

Some jobs that rely on constant communication may have failed. Slingshot is still not working correctly at this time. System administrators are currently investigating the root cause.

Please contact the LUMI User Support Team if you experience any issues. Thank you.