After reducing ncpus to four or lower, the job was assigned to a node, and it was always the same node (node0108) regardless of whether ncpus was 1, 2, or 3.
When I increased the ncpus count to 5 or higher, I got the behavior above.
Keep in mind there are 90 fully free nodes in the cluster, so there are more than enough cores and memory to satisfy this request. It seems to me PBS Pro is stuck on feeding node0108 rather than any of the other free nodes. This behavior persisted even after restarting the PBS Pro server, the scheduler, and all the other daemons (except the moms on the vnodes). We even modified smp_cluster_dist in $PBS_HOME/sched_priv/sched_config and restarted the PBS Pro server, to no avail.
Next, I took node0108 offline, and to my amazement, the interactive jobs with ncpus=4 or lower, which previously had been assigned to node0108 successfully, began starting and exiting immediately, just as observed above for ncpus=5 or higher. So it seems PBS Pro is stuck on node0108 regardless of its status.
Can someone please explain this behavior? I would expect the PBS Pro scheduler to move on to another available node if a previously selected node becomes hosed or unavailable for whatever reason. And if a job (interactive or not) cannot be satisfied at all because resources are unavailable, it should sit in the queue until resources become available rather than start and quit immediately.
Also, how do I fix this? Strangely, newly submitted jobs are being held and spamming users' email with at least 20 failure messages each, which I suspect is related to this persistent behavior of feeding just that one node…
Hey @sijisaula
The scheduler is deterministic. If node0108 is the first free node it finds, it will schedule a job on that node. If a job starts and immediately ends, the node is free again for the next job. The issue isn't the scheduler; it's the fact that your jobs immediately end. I'd look in the mom logs and see whether the mom recorded a reason for the job's problems.
If you want PBS to stop using a node it thinks is up and perfectly fine, you will need to tell PBS that. This is usually done in a mom hook. The execjob_begin and execjob_prologue hook events run before the job starts. You can do some node health checks in the hook and, if needed, put the node in the offline state. If the hook rejects the event, the job is requeued, and since the node is now offline, the scheduler will ignore it in future scheduling decisions. If Python is not your language of choice, you can use our older prologue mechanism, a shell script that runs before the job starts.
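Here is a minimal sketch of what such an execjob_begin hook could look like. The health check itself (verifying /scratch is mounted) and the hook name are placeholders I made up; swap in whatever test makes sense for your site. It uses the standard hook interfaces (pbs.event(), pbs.get_local_nodename(), pbs.ND_OFFLINE), but treat it as a starting point rather than a drop-in:

# nodehealth.py (placeholder name) - rough execjob_begin hook sketch
import os
import pbs

e = pbs.event()
try:
    # Placeholder health check: require /scratch to be mounted.
    # Replace with whatever "is this node sane?" test your site needs.
    if not os.path.ismount("/scratch"):
        # Offline this vnode so the scheduler skips it from now on.
        me = pbs.get_local_nodename()
        vnode = e.vnode_list[me]
        vnode.state = pbs.ND_OFFLINE
        vnode.comment = "offlined by execjob_begin health check"
        # Rejecting the event requeues the job instead of killing it.
        e.reject("node %s failed health check; job requeued" % me)
except SystemExit:
    # e.reject()/e.accept() stop the script via SystemExit; let it through.
    pass
except Exception as err:
    e.reject("health check hook error: %s" % err)

You would install it with qmgr, roughly: qmgr -c "create hook nodehealth", then qmgr -c "set hook nodehealth event = execjob_begin", then qmgr -c "import hook nodehealth application/x-python default nodehealth.py".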
Other interesting things:
When you said you offlined the node, did you mean you put it in the offline state (pbsnodes -o node0108)? If so, the scheduler should ignore it from that point forward. There is a short race condition: if the scheduler is in the middle of a cycle when you offline the node, it won't notice until the next cycle.
I don't think smp_cluster_dist will do what you want. I'd avoid it since it has been deprecated for many years now. If set to round_robin, it tries to cycle around the nodes within the same scheduling cycle, and from the sounds of it you are running one job per cycle, so it won't help here. In any case, the more modern way to achieve round robin is to set node_sort_key: "ncpus HIGH unused" (see the line below). This sorts your nodes by the number of unused cpus, so once some of the cpus on a node are in use, that node drops in priority.
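For reference, that would be a one-line change in $PBS_HOME/sched_priv/sched_config, something like the line below (ncpus is the resource I'm assuming you want to sort on; adjust if not), followed by sending the scheduler a HUP so it rereads its config:

node_sort_key: "ncpus HIGH unused"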
The reason your users get 20 emails is that PBS will try to run a job 20 times before holding it. It figures that if a job has already been run 20 times, something is wrong and continuing to retry it will not help.
So once again, the problem is on the mom side. Please take a look at the mom log around the time the job starts. It might give you more information on why the job is ending immediately.
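In case it helps, mom normally writes one log file per day under $PBS_HOME/mom_logs (commonly /var/spool/pbs/mom_logs), named by date, so on the execution host something like this should pull out the relevant entries (the job id here is a placeholder):

grep 1234 /var/spool/pbs/mom_logs/$(date +%Y%m%d)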
I've checked the mom_logs directory, but it has no logs at all, and my mom_priv/config looks like this:
# cat mom_priv/config
$clienthost bright01-thx
Additionally, I did some more testing and noticed that when mem=8gb is reduced to 1gb, the job does not exit immediately. However, for mem values greater than 1gb it does:
Is there some setting that might explain this behavior? node0108 is out of the picture now, and node0115, which the job is being assigned to, surely has enough memory: