Formerly occupied node not released

Hello everyone,

I am getting an error while trying to run a job on PBS Pro 14.1, which leaves the job held without any way to release it.
The problem seems to be that a cpuset is not released (there is still a "jobid.server" folder in the PBS folder) even though its job has already ended; PBS thinks the cpuset is free and keeps trying to place new jobs on it.
When a new job is submitted to the same cpuset, I get the error "cpus_this_vnode != hv_ncpus" and the new job stays held.
I've tried to clear the cpuset with the "cpuset x" command, which deleted the "jobid.server" folder, but the cpuset is still not cleared.
The only resolution I could find is to restart the PBS server.
Is there another solution that does not require me to stop all currently running jobs?

Thanks in advance,
L

Would it be possible to give more information about your environment and setup?
Could you also share the server and MoM logs for the job that is stuck?
The tracejob output would be helpful as well.

I have a shared-memory server; the PBS Pro version is 14.1.0.
The MoM log contains the following lines for the stuck job:

Job:19842.server;Resource_List.place = scatter
Job:19842.server;make_cpuset, vnode server[1]: cpus_this_vnode (0) != hv_ncpus (16)
Job:19842.server;kill_job
Job:19842.server;no active tasks
Job:19842.server;0bit sent
Job:19842.server;delete job request received
Job:19842.server;kill_job

This repeats again and again until I stop and start the PBS server.
The problem is that vnode server[1] is still part of the cpuset of another, already finished job that was never cleared (the job itself no longer appears in qstat), so the cpuset for the new job is never created.
For the uncleared cpuset I see the following in the MoM log:

Job:19666.server;delete job request received
Job:19666.server;remove_cpuset_procs, failed to kill pid 123456 in set /PBSPro/19666.server
Job:19666.server;Success (0) in remove_cpuset_procs, PID 123456 moved to set /
Job:19666.server;Device or resource busy (16) in try_remove_set, cpuset_delete cpuset /PBSPro/19666.server failed
Job:19666.server;Device or resource busy (16) in logprocinfo, PID 123456: comm (prog.exe), state R, PPID 1…
Device or resource busy (16) in try_remove_set, 1 tasks in set /PBSPro/19666.server
Job:19666.server;Inappropriate ioctl for device (25) in del_cpusetfile, removal of cpuset /PBSPro/19666.server failed
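
For what it's worth, I can see the leftover process by looking at the kernel cpuset filesystem directly. A rough sketch of what I check (the /dev/cpuset mount point is my assumption; the set path and PID are the ones from the log above):

  # List the PIDs still attached to the stale set (on some systems the hierarchy
  # is mounted under /sys/fs/cgroup/cpuset instead of /dev/cpuset)
  cat /dev/cpuset/PBSPro/19666.server/tasks

  # Inspect the leftover process reported in the log
  ps -fp 123456

  # Once no tasks remain in the set, the empty directory can be removed
  rmdir /dev/cpuset/PBSPro/19666.server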

Unfortunately, I don’t have the tracejob info…

Thank you, Lior, for the above information.

Please try the following:

  • setting node_fail_requeue to 0 (qmgr: set server node_fail_requeue = 0)
  • setting cpuset_destroy_delay in mom_priv/config to see whether this resolves your issue

cpuset_destroy_delay
MoM waits up to delay seconds before destroying a cpuset of a just-completed job, but not longer than necessary. This gives the operating system more time to clean up leftover processes after they have been killed.
Default: 0.
Format: Integer.
Example:
cpuset_destroy_delay 10

Please check section 12.13, Cpuset-specific Configuration Parameters, in this document: https://www.altair.com/pdfs/pbsworks/PBS19.2.3_BigBook.pdf

  • restart the PBS services after the above configuration and try; a rough sketch of the commands follows below.
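
A sketch of those steps, assuming a default PBS_HOME of /var/spool/pbs (adjust the config path and the restart method to your installation):

  # Tell the server not to requeue jobs when an execution vnode becomes unavailable
  qmgr -c "set server node_fail_requeue = 0"

  # Give the OS up to 10 seconds to reap leftover processes before MoM destroys a cpuset
  echo "cpuset_destroy_delay 10" >> /var/spool/pbs/mom_priv/config

  # Restart the PBS services so MoM rereads its config (init script location may differ)
  /etc/init.d/pbs restart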

Hi Adarsh,

Thank you for the quick response; I’ll give it a try as soon as possible and report back.

Just to make sure I understood the logic behind your suggestions -

  1. node_fail_requeue is meant to prevent jobs from being requeued over and over if I run into the same situation of unavailable nodes
  2. cpuset_destroy_delay is increased to give leftover processes time to terminate completely before MoM destroys the cpuset, so the destruction does not fail and leave an uncleared cpuset behind (hopefully). I’ll double-check both settings after applying them; see the sketch below.
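
Something like the following is what I plan to use for that check (the mom_priv/config path is my assumption):

  # Confirm the server attribute was set
  qmgr -c "print server" | grep node_fail_requeue

  # Confirm the MoM config line is in place
  grep cpuset_destroy_delay /var/spool/pbs/mom_priv/config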

Thanks,
L

Sure

If node_fail_requeue is set to 0, that means:

Jobs are not requeued; they are left in the Running state until the execution vnode is recovered, whether or not the server has contact with their Mother Superior.

That’s correct.

Thank you.