I am part of a team operating a medium-size cluster (~16k cores, a mixture of Skylake and Cascade Lake nodes) managed by Bright Cluster Manager software. We are running the OpenPBS suite ver. 20.0.1 (openpbs20-20.0.1-100350_cm9.0_617a908b87.x86_64).
Recently we have started to face the following issue: some of the nodes throw this error message in the mom_logs:
pbs_mom;Svr;pbs_mom;Resource temporarily unavailable (11) in record_finish_exec, read of pipe for pid job 2919694[38].pbsha.ib.sockeye got 0 not 8
pbs_mom;Job;2919694[38].pbsha.ib.sockeye;start failed, improper sid
which renders the node unusable (all subsequent jobs directed to that MoM are rejected), and the only way to alleviate the issue is to take the node offline and reboot it or restart pbs_mom. Your insight and help on how to avoid this error would be very much appreciated.
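For reference, the workaround we apply looks roughly like this (the node name is just an example):
pbsnodes -o se189        # mark the node offline so no new jobs are sent to it
systemctl restart pbs    # restart pbs_mom on the node (or reboot the node)
pbsnodes -r se189        # clear the offline flag once the MoM is healthy again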
The error can pop up on an idle node:
09/15/2021 02:10:15;0008;pbs_mom;Job;2800852.pbsha.ib.sockeye;no active tasks
09/15/2021 08:15:29;0008;pbs_python;Job;2893275.pbsha.ib.sockeye;update_job_usage: CPU percent: 0
09/15/2021 08:15:29;0008;pbs_python;Job;2893275.pbsha.ib.sockeye;update_job_usage: CPU usage: 0.000 secs
09/15/2021 08:15:29;0008;pbs_python;Job;2893275.pbsha.ib.sockeye;update_job_usage: Memory usage: mem=0b
09/15/2021 08:15:29;0008;pbs_mom;Job;2893275.pbsha.ib.sockeye;no active tasks
09/15/2021 08:15:29;0008;pbs_mom;Job;2893275.pbsha.ib.sockeye;nprocs: 630, cantstat: 0, nomem: 0, skipped: 558, cached: 0
09/15/2021 08:15:29;0008;pbs_mom;Job;2893275.pbsha.ib.sockeye;no active tasks
09/15/2021 08:15:29;0001;pbs_mom;Svr;pbs_mom;No child processes (10) in record_finish_exec, read of pipe for pid job 2893275.pbsha.ib.sockeye got 0 not 8
09/15/2021 08:15:29;0001;pbs_mom;Job;2893275.pbsha.ib.sockeye;start failed, improper sid
09/15/2021 08:15:30;0008;pbs_mom;Job;2893275.pbsha.ib.sockeye;kill_job
09/15/2021 08:15:30;0008;pbs_mom;Job;2893275.pbsha.ib.sockeye;no active tasks
09/15/2021 08:15:30;0008;pbs_mom;Job;2893275.pbsha.ib.sockeye;no active tasks
09/15/2021 08:15:30;0080;pbs_mom;Job;2893275.pbsha.ib.sockeye;delete job request received
09/15/2021 08:15:30;0008;pbs_mom;Job;2893275.pbsha.ib.sockeye;no active tasks
09/15/2021 08:15:30;0080;pbs_mom;Req;req_reject;Reject reply code=15059, aux=0, type=6, from root@192.168.15.249:15001
09/15/2021 08:15:30;0008;pbs_mom;Job;2893275.pbsha.ib.sockeye;kill_job
Or on a node which is processing other jobs:
systemctl status pbs
● pbs.service - Portable Batch System
Loaded: loaded (/cm/shared/apps/openpbs/current/libexec/pbs_init.d; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/pbs.service.d
└─99-cmd.conf
Active: active (running) since Tue 2021-08-17 13:01:37 PDT; 1 months 0 days ago
Tasks: 2
Memory: 19.5M
CGroup: /system.slice/pbs.service
└─5668 /cm/shared/apps/openpbs/20.0.1/sbin/pbs_mom
09/17/2021 00:28:18;0080;pbs_mom;Job;2916665.pbsha.ib.sockeye;task 10000001 terminated
09/17/2021 00:28:18;0008;pbs_mom;Job;2916665.pbsha.ib.sockeye;KILL_JOB received
09/17/2021 00:28:18;0008;pbs_mom;Job;2916665.pbsha.ib.sockeye;kill_job
09/17/2021 00:28:18;0008;pbs_python;Job;2916665.pbsha.ib.sockeye;update_job_usage: CPU percent: 0
09/17/2021 00:28:18;0008;pbs_python;Job;2916665.pbsha.ib.sockeye;update_job_usage: CPU usage: 171261.366 secs
09/17/2021 00:28:18;0008;pbs_python;Job;2916665.pbsha.ib.sockeye;update_job_usage: Memory usage: mem=405492kb
09/17/2021 00:28:18;0008;pbs_mom;Job;2916665.pbsha.ib.sockeye;no active tasks
09/17/2021 00:28:19;0008;pbs_mom;Job;2916665.pbsha.ib.sockeye;DELETE_JOB received
09/17/2021 00:28:19;0008;pbs_mom;Job;2916665.pbsha.ib.sockeye;kill_job
09/17/2021 00:28:19;0008;pbs_mom;Job;2916665.pbsha.ib.sockeye;no active tasks
09/17/2021 00:28:20;0008;pbs_mom;Job;2916665.pbsha.ib.sockeye;no active tasks
09/17/2021 00:28:26;0008;pbs_python;Job;2919694[38].pbsha.ib.sockeye;update_job_usage: CPU percent: 0
09/17/2021 00:28:26;0008;pbs_python;Job;2919694[38].pbsha.ib.sockeye;update_job_usage: CPU usage: 0.000 secs
09/17/2021 00:28:26;0008;pbs_python;Job;2919694[38].pbsha.ib.sockeye;update_job_usage: Memory usage: mem=0b
09/17/2021 00:28:26;0008;pbs_mom;Job;2919694[38].pbsha.ib.sockeye;no active tasks
09/17/2021 00:28:26;0008;pbs_python;Job;2919694[39].pbsha.ib.sockeye;update_job_usage: CPU percent: 0
09/17/2021 00:28:26;0008;pbs_python;Job;2919694[39].pbsha.ib.sockeye;update_job_usage: CPU usage: 0.000 secs
09/17/2021 00:28:26;0008;pbs_python;Job;2919694[39].pbsha.ib.sockeye;update_job_usage: Memory usage: mem=0b
09/17/2021 00:28:26;0008;pbs_mom;Job;2919694[38].pbsha.ib.sockeye;no active tasks
09/17/2021 00:28:26;0008;pbs_mom;Job;2919694[39].pbsha.ib.sockeye;no active tasks
09/17/2021 00:28:26;0008;pbs_mom;Job;2919694[38].pbsha.ib.sockeye;no active tasks
09/17/2021 00:28:26;0008;pbs_mom;Job;2919694[39].pbsha.ib.sockeye;no active tasks
09/17/2021 00:28:26;0001;pbs_mom;Svr;pbs_mom;Resource temporarily unavailable (11) in record_finish_exec, read of pipe for pid job 2919694[38].pbsha.ib.sockeye got 0 not 8
09/17/2021 00:28:26;0001;pbs_mom;Job;2919694[38].pbsha.ib.sockeye;start failed, improper sid
Job 2916665
comment = Job run at Thu Sep 16 at 22:58 on (se189:ncpus=32:mem=134217728kb
:ngpus=0)+(se218:ncpus=32:mem=134217728kb:ngpus=0) and finished
Again, your help would be very much appreciated.
All the best
Roman
Have you looked in the system logs with journalctl to see if there are other error messages from around the same time?
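For example, something along these lines, adjusting the time window to bracket one of the failures (the syslog identifier may differ on your install):
journalctl -t pbs_mom --since "2021-09-17 00:25" --until "2021-09-17 00:35"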
Here is my guess. It looks to me as if you have one or more runjob hooks, and one of them is causing the pbs_mom job starter process to exit, either directly by killing it, or indirectly by causing the system to kill it, perhaps due to excess memory use. Perhaps there is some file that has grown over time, so that when the hook reads it all, the hook uses too much memory, causing the kernel to kill the parent process?
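You can see which hooks are configured, and which events they fire on, with something like this on the server host (output layout varies a bit between versions):
qmgr -c "list hook"
Any hook with an event such as execjob_begin or execjob_launch runs on the MoM during job start-up and would be a prime suspect.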
FWIW, I think the “Resource temporarily unavailable” part of the message is spurious. What actually happened is that the main MoM was waiting for the job starter to return the job session ID across their communication pipe, but the job starter went away. So, when the main MoM read the pipe it got EOF (zero bytes), as mentioned by the rest of the message.
In addition to that, the only other hook we are running is pbs_cgroups.PY, but from the logs and journalctl it looks like it has never run. Note: all computational nodes in the cluster are stateless.
Again, thanks for all the input, help, and advice.
All the best
Roman
Try the same journalctl command, but leave off the “-t pbs_mom” argument. That is, look for messages not directly associated with pbs_mom. Perhaps a message from oom_killer or the kernel reporting a segfault, etc.
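Something along these lines, for example (just a sketch; adjust the window and the patterns as needed):
journalctl --since "2021-09-17 00:25" --until "2021-09-17 00:35"
dmesg -T | grep -iE 'oom|segfault|killed process'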
In the messages in the initial post, weren’t the “update_job_usage” messages from the pbs_cgroups hook? So, the cgroups hook appears to run sometimes.
Are the hooks necessary for your site, or could you disable both hooks for a while to see if the problem goes away?
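If you do try that, it would be something like the following on the server host (pbs_cgroups is the hook you mentioned; substitute the real name of the other hook):
qmgr -c "set hook pbs_cgroups enabled = false"
and "enabled = true" to switch it back on afterwards.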
I have run the 'journalctl' and 'dmesg' commands on that node and there were definitely no issues related to OOM or segfaults; that is why I grabbed only the relevant 'pbs_mom' output.
Since the cluster is shared by many users and different workflows, we definitely need the hooks in our running environment. However, as soon as the 'improper sid' error is encountered it affects only new jobs; all jobs which were already running on the affected node keep going, so the 'update_job_usage' messages are perfectly normal.
Yesterday we had a system maintenance window for the entire cluster (new image, but pbs_mom is still the same), and as a result all compute nodes have been rebooted. So far I have not seen any 'improper sid' error messages, but at the same time I have lost access to all the old 'dmesg' and journalctl output.
I will keep an eye on it and update this post as soon as I have something new.
On the other hand, what would you suggest to narrow down the issue?
strace -p pbs_mom_ID -f -o /tmp/output_of_strace (any other options?)
or attaching a debugger (gdb) to the running MoM?
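For example, something like this (options and paths are only my guess):
strace -f -tt -s 256 -o /tmp/pbs_mom.strace -p pbs_mom_ID
gdb -p pbs_mom_ID   # then "set follow-fork-mode child" and "continue" to follow the job starter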
Again thanks…