Jobs fail with more than 1 per node

caldodge · July 27, 2021, 11:11pm

I’m encountering a problem I haven’t seen before.
Running PBS Pro CE on Ubuntu 18.04.
If I submit large array jobs, the jobs fail to start (or fail to finish) if more than one job per node is running at a time. If I configure the jobs so that each node runs only one at a time, the jobs run fine.
Typically, there are two results:

- multiple pbs_mom processes are spawned, but fail to su to my user account, or
- the pbs_mom processes su to my account, then become defunct

Occasionally these failures are accompanied by mom_log messages stating that stdout/stderr files couldn’t be opened in the working directory.
Has anyone else experienced this error? Got any clues on how to diagnose this?
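
One way to check for the second symptom (pbs_mom processes going defunct) is to scan the process table for zombies by name. The following is only a minimal sketch, assuming a procps-style ps that accepts "-eo pid,stat,user,comm"; the function names find_defunct and current_ps_output are made up for illustration and are not part of PBS.

```python
import subprocess


def find_defunct(ps_output, name="pbs_mom"):
    """Return (pid, user) pairs for zombie processes whose command starts with name."""
    hits = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 3)  # pid, stat, user, rest of the command column
        if len(fields) < 4:
            continue
        pid, stat, user, comm = fields
        # A process state beginning with "Z" marks a zombie (defunct) process.
        if stat.startswith("Z") and comm.startswith(name):
            hits.append((int(pid), user))
    return hits


def current_ps_output():
    """Capture the live process table (assumes procps ps is installed)."""
    return subprocess.run(
        ["ps", "-eo", "pid,stat,user,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
```

Running find_defunct(current_ps_output()) on a compute node while an array job is starting would list any defunct pbs_mom children along with the account they run as, which helps tell the two failure modes apart.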

caldodge · July 29, 2021, 9:59pm

I found the solution. The nodes authenticate to an LDAP server which is part of the cluster. For an unknown reason, non-root logins to the nodes would hang for 30 seconds after logging in. I say “unknown” because, as near as we could tell, the compute nodes’ LDAP configuration was identical to the head node’s. I stumbled across the solution: install libnss-ldapd, and then reboot the server. Now logins complete in the expected amount of time, and array jobs are running as expected on all nodes.
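
For anyone who wants to check a node for the same kind of delay, timing a single account lookup is a cheap first test. This is a hedged sketch, not a confirmed diagnosis: it assumes the hang shows up in name-service (NSS) lookups, which the thread does not establish, and the 1-second threshold and the function names are arbitrary choices for illustration.

```python
import pwd
import sys
import time


def time_user_lookup(username):
    """Return (elapsed_seconds, found) for one passwd lookup via the system resolver."""
    start = time.monotonic()
    try:
        pwd.getpwnam(username)  # goes through NSS, so LDAP-backed users are included
        found = True
    except KeyError:
        found = False
    return time.monotonic() - start, found


def is_slow(elapsed, threshold=1.0):
    """Flag a lookup that took longer than the assumed threshold (seconds)."""
    return elapsed > threshold


if __name__ == "__main__":
    # Pass a non-root LDAP account name as the first argument; defaults to root.
    user = sys.argv[1] if len(sys.argv) > 1 else "root"
    elapsed, found = time_user_lookup(user)
    status = "SLOW" if is_slow(elapsed) else "ok"
    print(f"{user}: found={found} elapsed={elapsed:.3f}s [{status}]")
```

Comparing the result for a non-root LDAP account on a compute node with the same lookup on the head node would show whether the two really behave alike, even when their configuration files look identical.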
