Hi all,
apologies for so many questions lately, but here I am with another thing that I do not fully understand (even if I more or less solved it). Yesterday I noticed that some nodes had negative, i.e. less than zero, resources:
[davide@gmtest01 ~]$ pbsnodes -ajS |grep gnode27
gnode27 <various> 8 8 0 376gb/376gb -4/20 0/0 -4/4 433454,429112,433454,429116,433454,428571,428574,433454
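A quick way to spot other nodes in the same situation is something like this (just a rough one-liner, assuming that, as in the line above, the free/total ncpus and ngpus are fields 7 and 9 of the pbsnodes -aSj output):
pbsnodes -aSj | awk '$7 ~ /^-[0-9]/ || $9 ~ /^-[0-9]/ {print $1, $7, $9}'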
With the exception of job ID 433454, all the other jobs had already terminated, and they all failed very soon after starting:
[root@gmtest01 ~]# for jobid in $(pbsnodes -ajS |grep gnode27 | awk '{gsub(/,/," ",$NF);print $NF}'); do echo $jobid; qstat -fx $jobid|grep -e qtime -e stime -e mtime; done
429112
mtime = Thu Jul 18 11:22:14 2024
qtime = Wed Jul 17 14:53:02 2024
stime = Thu Jul 18 11:22:04 2024
433454
mtime = Mon Jul 29 19:57:48 2024
qtime = Mon Jul 29 13:50:42 2024
stime = Mon Jul 29 13:50:43 2024
429116
mtime = Thu Jul 18 11:22:22 2024
qtime = Wed Jul 17 14:53:03 2024
stime = Thu Jul 18 11:22:16 2024
428571
mtime = Wed Jul 17 09:28:40 2024
qtime = Mon Jul 15 23:41:35 2024
stime = Wed Jul 17 09:28:32 2024
428574
mtime = Wed Jul 17 09:28:48 2024
qtime = Mon Jul 15 23:41:35 2024
stime = Wed Jul 17 09:28:41 2024
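As a further cross-check, the exit codes can be printed in the same way (a sketch; it assumes the finished jobs are still in the server's job history, so that qstat -fx also shows Exit_status):
for jobid in 429112 429116 428571 428574; do echo $jobid; qstat -fx $jobid | grep -e Exit_status -e stime -e mtime; done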
I checked the server logs for hints and, for example, job 429112 seemed to be OK, if I am interpreting the substates correctly:
648904 07/18/2024 11:22:13;0400;Server@gmtest01;Job;429112.gmtest01;Updated job state to 69 and substate to 52
648905 07/18/2024 11:22:13;0400;Server@gmtest01;Job;429112.gmtest01;Updated job state to 69 and substate to 53
since substate 53 should be “MoM releasing resources”. In the scheduler logs I see:
07/18/2024 11:22:03;0800;pbs_sched;Job;429112.gmtest01;user davide max_*user_run (-2, 6), used 3
07/18/2024 11:22:03;0800;pbs_sched;Job;429112.gmtest01;user davide max_*user_run_soft (-2, 2), used 4
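Those negative values in parentheses look like the server's own usage counters have gone negative, so another thing worth looking at is what the server currently thinks is assigned on the node, roughly like this (node name taken from this example):
pbsnodes gnode27 | grep -E 'resources_assigned|jobs ='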
I also checked /sys to see whether it was a cgroup problem, but on that front everything seemed to be OK; the only PBS-related cgroup was the one for the job ID that was currently active, and that job was running without problems.
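For reference, the kind of check I mean is roughly this, run on the MoM host (just a sketch; the exact hierarchy depends on the pbs_cgroups hook configuration and on cgroup v1 vs v2, and it relies on the job cgroup directories being named after the full job ID, which ends in .gmtest01 here):
find /sys/fs/cgroup -maxdepth 4 -type d -name '*gmtest01*'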
To remove these phantom jobs I had to:
- force-delete them with qdel -x -W force <job ID> (the -x because the jobs were already finished)
- delete the vnodes associated with the physical node with qmgr
- restart the PBS service on the MoM to bring back the vnodes (the whole sequence is sketched below)
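Condensed into a single rough sketch (job and node names are just the ones from this example, the vnode name is a placeholder, and the systemd unit may be named differently on other installations):
for jobid in 429112 429116 428571 428574; do qdel -x -W force $jobid; done   # drop the phantom finished jobs
qmgr -c "delete node <vnode>"       # repeat for each stale vnode of gnode27
ssh gnode27 systemctl restart pbs   # MoM comes back up and re-registers its vnodes with the server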
I have two big problems with this procedure:
- why did this happen, and where could I find more hints? I did not find anything suspicious in the Zabbix graphs either. The only odd thing is that all these "phantom jobs" failed very quickly
- is there a way to fix this with fewer steps and in a less error-prone way?
Thank you in advance