Node/MoM has negative resources

Hi all,

Apologies for so many questions lately, but here I am with another thing that I do not fully understand (even though I more or less solved it). Yesterday I noticed that some nodes had negative (i.e. less than zero) resources:

[davide@gmtest01 ~]$ pbsnodes -ajS |grep gnode27
gnode27         <various>            8     8      0  376gb/376gb   -4/20     0/0    -4/4 433454,429112,433454,429116,433454,428571,428574,433454
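
In case it is useful to someone else, a quick way to spot every node in this state should be something along these lines (assuming the pbsnodes -ajS column layout above, i.e. free/total ncpus in column 7 and free/total ngpus in column 9; the column indices may differ on other versions):

pbsnodes -ajS | awk '$7 ~ /^-[0-9]/ || $9 ~ /^-[0-9]/ {print $1, $7, $9}'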

With the exception of job ID 433454, all the other jobs have already terminated, and they all failed very soon after starting:

[root@gmtest01 ~]# for jobid in $(pbsnodes -ajS |grep gnode27 | awk '{gsub(/,/," ",$NF);print $NF}'); do echo $jobid; qstat -fx $jobid|grep -e qtime -e stime -e mtime; done
429112
    mtime = Thu Jul 18 11:22:14 2024
    qtime = Wed Jul 17 14:53:02 2024
    stime = Thu Jul 18 11:22:04 2024
433454
    mtime = Mon Jul 29 19:57:48 2024
    qtime = Mon Jul 29 13:50:42 2024
    stime = Mon Jul 29 13:50:43 2024
429116
    mtime = Thu Jul 18 11:22:22 2024
    qtime = Wed Jul 17 14:53:03 2024
    stime = Thu Jul 18 11:22:16 2024
428571
    mtime = Wed Jul 17 09:28:40 2024
    qtime = Mon Jul 15 23:41:35 2024
    stime = Wed Jul 17 09:28:32 2024
428574
    mtime = Wed Jul 17 09:28:48 2024
    qtime = Mon Jul 15 23:41:35 2024
    stime = Wed Jul 17 09:28:41 2024
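
For completeness, the mom_logs on gnode27 should also show whether the MoM ever reported those jobs as finished; assuming the default PBS_HOME of /var/spool/pbs (and that the logs from those days are still around), something like:

ssh gnode27 "grep 429112 /var/spool/pbs/mom_logs/20240718"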

I checked the server logs for hints and, for example, job 429112 seemed to be fine if I am interpreting the substates correctly:

648904 07/18/2024 11:22:13;0400;Server@gmtest01;Job;429112.gmtest01;Updated job state to 69 and substate to 52
648905 07/18/2024 11:22:13;0400;Server@gmtest01;Job;429112.gmtest01;Updated job state to 69 and substate to 53

since substate 53 should be “MoM releasing resources”. In the scheduler logs I see:

07/18/2024 11:22:03;0800;pbs_sched;Job;429112.gmtest01;user davide max_*user_run (-2, 6), used 3
07/18/2024 11:22:03;0800;pbs_sched;Job;429112.gmtest01;user davide max_*user_run_soft (-2, 2), used 4
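
The (-2, 6) makes it look like the server's per-user run count is off by the same phantom amount. A rough way to compare the configured run limits with what is actually running for the user (assuming server-level limits; the grep would need adjusting for queue-level ones) is:

qmgr -c "print server" | grep -iE 'max.*run'
qselect -u davide -s R | wc -l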

I checked /sys to see whether it was a cgroup problem, but on that front everything seemed fine: the only PBS-related cgroup belonged to the job that was still active, which was running without problems.
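
In case anyone wants to reproduce that check, something like this on the node lists any per-job cgroup directory that is still around (the exact hierarchy depends on the cgroups hook configuration and on cgroup v1 vs v2, so take the path as an assumption):

# only the cgroup of the still-running job (433454) should show up
for jobid in 433454 429112 429116 428571 428574; do
    find /sys/fs/cgroup -type d -name "*${jobid}*" 2>/dev/null
done
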
To remove these phantom jobs I had to (rough command sketch below):

  1. force-delete the finished jobs with qdel -x -W force <job ID>
  2. delete the vnodes associated with the physical node with qmgr
  3. restart the PBS service on the MoM to bring the vnodes back
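
Roughly, the whole sequence was something along these lines (job IDs and node name as above; the systemd unit name assumes a standard systemd-managed PBS install):

# on the server: force-delete the finished phantom jobs
for jobid in 429112 429116 428571 428574; do
    qdel -x -W force "$jobid"
done

# on the server: drop the vnode(s) of the physical node
# (repeat for every vnode if the node exposes more than one)
qmgr -c "delete node gnode27"

# on the node: restart PBS so the MoM re-registers its vnodes
ssh gnode27 "systemctl restart pbs"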

I have two big problems with this procedure:

  • why did this happen, and where could I find more hints? I did not find anything useful in the Zabbix graphs either. The only suspicious thing is that all these “phantom jobs” failed very quickly
  • is there a less error-prone way of fixing it, with fewer steps?

Thank you in advance