Hi all,
apologies for so many questions lately, but here I am with another thing that I do not fully understand (even if I more or less solved it). Yesterday I noticed that some nodes had negative, i.e. less than zero, resources:
[davide@gmtest01 ~]$ pbsnodes -ajS |grep gnode27
gnode27 <various> 8 8 0 376gb/376gb -4/20 0/0 -4/4 433454,429112,433454,429116,433454,428571,428574,433454
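A quick way to spot other nodes in the same situation is something like this (just a rough one-liner, assuming that, as in the line above, the free/total ncpus and ngpus are fields 7 and 9 of the pbsnodes -aSj output):
pbsnodes -aSj | awk '$7 ~ /^-[0-9]/ || $9 ~ /^-[0-9]/ {print $1, $7, $9}'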
With the exception of job ID 433454, all the other jobs had already terminated, and they all failed very soon after starting:
[root@gmtest01 ~]# for jobid in $(pbsnodes -ajS |grep gnode27 | awk '{gsub(/,/," ",$NF);print $NF}'); do echo $jobid; qstat -fx $jobid|grep -e qtime -e stime -e mtime; done
429112
mtime = Thu Jul 18 11:22:14 2024
qtime = Wed Jul 17 14:53:02 2024
stime = Thu Jul 18 11:22:04 2024
433454
mtime = Mon Jul 29 19:57:48 2024
qtime = Mon Jul 29 13:50:42 2024
stime = Mon Jul 29 13:50:43 2024
429116
mtime = Thu Jul 18 11:22:22 2024
qtime = Wed Jul 17 14:53:03 2024
stime = Thu Jul 18 11:22:16 2024
428571
mtime = Wed Jul 17 09:28:40 2024
qtime = Mon Jul 15 23:41:35 2024
stime = Wed Jul 17 09:28:32 2024
428574
mtime = Wed Jul 17 09:28:48 2024
qtime = Mon Jul 15 23:41:35 2024
stime = Wed Jul 17 09:28:41 2024
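As a further cross-check, the exit codes can be printed in the same way (a sketch; it assumes the finished jobs are still in the server's job history, so that qstat -fx also shows Exit_status):
for jobid in 429112 429116 428571 428574; do echo $jobid; qstat -fx $jobid | grep -e Exit_status -e stime -e mtime; done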
I checked the server logs for hints and, for example, job 429112 seemed to be OK, if I am interpreting the substates correctly:
648904 07/18/2024 11:22:13;0400;Server@gmtest01;Job;429112.gmtest01;Updated job state to 69 and substate to 52
648905 07/18/2024 11:22:13;0400;Server@gmtest01;Job;429112.gmtest01;Updated job state to 69 and substate to 53
since substate 53 should be “MoM releasing resources”. In the scheduler logs I see:
07/18/2024 11:22:03;0800;pbs_sched;Job;429112.gmtest01;user davide max_*user_run (-2, 6), used 3
07/18/2024 11:22:03;0800;pbs_sched;Job;429112.gmtest01;user davide max_*user_run_soft (-2, 2), used 4
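Those negative values in parentheses look like the server's own usage counters have gone negative, so another thing worth looking at is what the server currently thinks is assigned on the node, roughly like this (node name taken from this example):
pbsnodes gnode27 | grep -E 'resources_assigned|jobs ='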
I also checked /sys to see whether it was a cgroup problem, but on that front everything seemed to be OK; the only PBS-related cgroup was the one for the job ID that was currently active, and that job was running without problems.
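For reference, the kind of check I mean is roughly this, run on the MoM host (just a sketch; the exact hierarchy depends on the pbs_cgroups hook configuration and on cgroup v1 vs v2, and it relies on the job cgroup directories being named after the full job ID, which ends in .gmtest01 here):
find /sys/fs/cgroup -maxdepth 4 -type d -name '*gmtest01*'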
To remove these phantom jobs I had to:
- force-delete them with qdel -x -W force <job ID> (the -x because the jobs were already finished)
- delete the vnodes associated with the physical node with qmgr
- restart the PBS service on the MoM to bring back the vnodes (the whole sequence is sketched below)
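Condensed into a single rough sketch (job and node names are just the ones from this example, the vnode name is a placeholder, and the systemd unit may be named differently on other installations):
for jobid in 429112 429116 428571 428574; do qdel -x -W force $jobid; done   # drop the phantom finished jobs
qmgr -c "delete node <vnode>"       # repeat for each stale vnode of gnode27
ssh gnode27 systemctl restart pbs   # MoM comes back up and re-registers its vnodes with the server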
I have two big problems with this procedure:
- why did this happen, and where could I find more hints? I did not find anything suspicious in the Zabbix graphs either. The only odd thing is that all these "phantom jobs" failed very quickly
- is there a way to fix this with fewer steps and in a less error-prone way?
Thank you in advance