Formerly occupied node not released

Hello everyone,

I am getting an error while trying to run a job on PBS Pro 14.1, which leaves the job held without any way to release it.
The problem seems to be that a cpuset is not released (there is still a "jobid.server" folder in the PBS folder) even though its job has already ended; PBS thinks the cpuset is free and keeps trying to place new jobs on it.
When a new job is submitted to the same cpuset, I get the error "cpus_this_vnode != hv_ncpus" and the new job stays held.
I've tried to clear the cpuset with the "cpuset x" command, which deleted the "jobid.server" folder, but the cpuset is still not cleared.
The only resolution I could find is to restart the PBS server.
Is there another solution that does not require me to stop all currently running jobs?

Thanks in advance,
L

Would it be possible to give more information about your environment and setup?
Could you also share the server and MoM logs for the job that is stuck?
The tracejob output would be helpful as well.

I have a shared-memory server; the PBS Pro version is 14.1.0.
The MoM log contains the following lines for the stuck job:

Job:19842.server;Resource_List.place = scatter
Job:19842.server;make_cpuset, vnode server[1]: cpus_this_vnode (0) != hv_ncpus (16)
Job:19842.server;kill_job
Job:19842.server;no active tasks
Job:19842.server;0bit sent
Job:19842.server;delete job request received
Job:19842.server;kill_job

This repeats again and again until I stop and start the PBS server.
The problem is that vnode server[1] is still part of the cpuset of another, already finished job that was never cleared (the job itself no longer appears in qstat), so the cpuset for the new job is never created.
For the uncleared cpuset I see the following in the MoM log:

Job:19666.server;delete job request received
Job:19666.server;remove_cpuset_procs, failed to kill pid 123456 in set /PBSPro/19666.server
Job:19666.server;Success (0) in remove_cpuset_procs, PID 123456 moved to set /
Job:19666.server;Device or resource busy (16) in try_remove_set, cpuset_delete cpuset /PBSPro/19666.server failed
Job:19666.server;Device or resource busy (16) in logprocinfo, PID 123456: comm (prog.exe), state R, PPID 1…
Device or resource busy (16) in try_remove_set, 1 tasks in set /PBSPro/19666.server
Job:19666.server;Inappropriate ioctl for device (25) in del_cpusetfile, removal of cpuset /PBSPro/19666.server failed
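
For what it's worth, I can see the leftover process by looking at the kernel cpuset filesystem directly. A rough sketch of what I check (the /dev/cpuset mount point is my assumption; the set path and PID are the ones from the log above):

  # List the PIDs still attached to the stale set (on some systems the hierarchy
  # is mounted under /sys/fs/cgroup/cpuset instead of /dev/cpuset)
  cat /dev/cpuset/PBSPro/19666.server/tasks

  # Inspect the leftover process reported in the log
  ps -fp 123456

  # Once no tasks remain in the set, the empty directory can be removed
  rmdir /dev/cpuset/PBSPro/19666.server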

Unfortunately, I don’t have the tracejob info…

Thank you, Lior, for the above information.

Please try the following:

  • setting node_fail_requeue to 0 (qmgr: set server node_fail_requeue = 0)
  • setting cpuset_destroy_delay in mom_priv/config to see whether this resolves your issue

cpuset_destroy_delay
MoM waits up to delay seconds before destroying a cpuset of a just-completed job, but not longer than necessary. This gives the operating system more time to clean up leftover processes after they have been killed.
Default: 0.
Format: Integer.
Example:
cpuset_destroy_delay 10

Please check section 12.13, Cpuset-specific Configuration Parameters, in this document: https://www.altair.com/pdfs/pbsworks/PBS19.2.3_BigBook.pdf

  • restart the PBS services after the above configuration and try; a rough sketch of the commands follows below.
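
A sketch of those steps, assuming a default PBS_HOME of /var/spool/pbs (adjust the config path and the restart method to your installation):

  # Tell the server not to requeue jobs when an execution vnode becomes unavailable
  qmgr -c "set server node_fail_requeue = 0"

  # Give the OS up to 10 seconds to reap leftover processes before MoM destroys a cpuset
  echo "cpuset_destroy_delay 10" >> /var/spool/pbs/mom_priv/config

  # Restart the PBS services so MoM rereads its config (init script location may differ)
  /etc/init.d/pbs restart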

Hi Adarsh,

Thank you for the quick response; I’ll give it a try as soon as possible and report back.

Just to make sure I understood the logic behind your suggestions -

  1. node_fail_requeue is meant to prevent jobs from being requeued over and over if I run into the same situation of unavailable nodes
  2. cpuset_destroy_delay is increased to give leftover processes time to terminate completely before MoM destroys the cpuset, so the destruction does not fail and leave an uncleared cpuset behind (hopefully). I’ll double-check both settings after applying them; see the sketch below.
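
Something like the following is what I plan to use for that check (the mom_priv/config path is my assumption):

  # Confirm the server attribute was set
  qmgr -c "print server" | grep node_fail_requeue

  # Confirm the MoM config line is in place
  grep cpuset_destroy_delay /var/spool/pbs/mom_priv/config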

Thanks,
L

Sure

If node_fail_requeue is set to 0, that means:

Jobs are not requeued; they are left in the Running state until the execution vnode is recovered, whether or not the server has contact with their Mother Superior.

That’s correct.

Thank you.