I am part of a team operating a medium-size cluster (~16k cores, a mixture of Skylake and Cascade Lake nodes) managed by Bright Cluster Manager software. We are running the OpenPBS suite ver. 20.0.1 (openpbs20-20.0.1-100350_cm9.0_617a908b87.x86_64).
Recently we have started to face the following issue: some of the nodes throw this error message in the mom_logs:
pbs_mom;Svr;pbs_mom;Resource temporarily unavailable (11) in record_finish_exec, read of pipe for pid job 2919694[38].pbsha.ib.sockeye got 0 not 8
pbs_mom;Job;2919694[38].pbsha.ib.sockeye;start failed, improper sid
which renders the node unusable (all subsequent jobs directed to that MoM are rejected), and the only way to alleviate the issue is to take the node offline and reboot it or restart pbs_mom. Your insight and help on how to avoid this error would be very much appreciated.
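For reference, the workaround we apply looks roughly like this (the node name is just an example):
pbsnodes -o se189        # mark the node offline so no new jobs are sent to it
systemctl restart pbs    # restart pbs_mom on the node (or reboot the node)
pbsnodes -r se189        # clear the offline flag once the MoM is healthy again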
The error can pop up on an idle node:
09/15/2021 02:10:15;0008;pbs_mom;Job;2800852.pbsha.ib.sockeye;no active tasks
09/15/2021 08:15:29;0008;pbs_python;Job;2893275.pbsha.ib.sockeye;update_job_usage: CPU percent: 0
09/15/2021 08:15:29;0008;pbs_python;Job;2893275.pbsha.ib.sockeye;update_job_usage: CPU usage: 0.000 secs
09/15/2021 08:15:29;0008;pbs_python;Job;2893275.pbsha.ib.sockeye;update_job_usage: Memory usage: mem=0b
09/15/2021 08:15:29;0008;pbs_mom;Job;2893275.pbsha.ib.sockeye;no active tasks
09/15/2021 08:15:29;0008;pbs_mom;Job;2893275.pbsha.ib.sockeye;nprocs: 630, cantstat: 0, nomem: 0, skipped: 558, cached: 0
09/15/2021 08:15:29;0008;pbs_mom;Job;2893275.pbsha.ib.sockeye;no active tasks
09/15/2021 08:15:29;0001;pbs_mom;Svr;pbs_mom;No child processes (10) in record_finish_exec, read of pipe for pid job 2893275.pbsha.ib.sockeye got 0 not 8
09/15/2021 08:15:29;0001;pbs_mom;Job;2893275.pbsha.ib.sockeye;start failed, improper sid
09/15/2021 08:15:30;0008;pbs_mom;Job;2893275.pbsha.ib.sockeye;kill_job
09/15/2021 08:15:30;0008;pbs_mom;Job;2893275.pbsha.ib.sockeye;no active tasks
09/15/2021 08:15:30;0008;pbs_mom;Job;2893275.pbsha.ib.sockeye;no active tasks
09/15/2021 08:15:30;0080;pbs_mom;Job;2893275.pbsha.ib.sockeye;delete job request received
09/15/2021 08:15:30;0008;pbs_mom;Job;2893275.pbsha.ib.sockeye;no active tasks
09/15/2021 08:15:30;0080;pbs_mom;Req;req_reject;Reject reply code=15059, aux=0, type=6, from root@192.168.15.249:15001
09/15/2021 08:15:30;0008;pbs_mom;Job;2893275.pbsha.ib.sockeye;kill_job
Or on a node which is processing other jobs:
systemctl status pbs
● pbs.service - Portable Batch System
Loaded: loaded (/cm/shared/apps/openpbs/current/libexec/pbs_init.d; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/pbs.service.d
└─99-cmd.conf
Active: active (running) since Tue 2021-08-17 13:01:37 PDT; 1 months 0 days ago
Tasks: 2
Memory: 19.5M
CGroup: /system.slice/pbs.service
└─5668 /cm/shared/apps/openpbs/20.0.1/sbin/pbs_mom
09/17/2021 00:28:18;0080;pbs_mom;Job;2916665.pbsha.ib.sockeye;task 10000001 terminated
09/17/2021 00:28:18;0008;pbs_mom;Job;2916665.pbsha.ib.sockeye;KILL_JOB received
09/17/2021 00:28:18;0008;pbs_mom;Job;2916665.pbsha.ib.sockeye;kill_job
09/17/2021 00:28:18;0008;pbs_python;Job;2916665.pbsha.ib.sockeye;update_job_usage: CPU percent: 0
09/17/2021 00:28:18;0008;pbs_python;Job;2916665.pbsha.ib.sockeye;update_job_usage: CPU usage: 171261.366 secs
09/17/2021 00:28:18;0008;pbs_python;Job;2916665.pbsha.ib.sockeye;update_job_usage: Memory usage: mem=405492kb
09/17/2021 00:28:18;0008;pbs_mom;Job;2916665.pbsha.ib.sockeye;no active tasks
09/17/2021 00:28:19;0008;pbs_mom;Job;2916665.pbsha.ib.sockeye;DELETE_JOB received
09/17/2021 00:28:19;0008;pbs_mom;Job;2916665.pbsha.ib.sockeye;kill_job
09/17/2021 00:28:19;0008;pbs_mom;Job;2916665.pbsha.ib.sockeye;no active tasks
09/17/2021 00:28:20;0008;pbs_mom;Job;2916665.pbsha.ib.sockeye;no active tasks
09/17/2021 00:28:26;0008;pbs_python;Job;2919694[38].pbsha.ib.sockeye;update_job_usage: CPU percent: 0
09/17/2021 00:28:26;0008;pbs_python;Job;2919694[38].pbsha.ib.sockeye;update_job_usage: CPU usage: 0.000 secs
09/17/2021 00:28:26;0008;pbs_python;Job;2919694[38].pbsha.ib.sockeye;update_job_usage: Memory usage: mem=0b
09/17/2021 00:28:26;0008;pbs_mom;Job;2919694[38].pbsha.ib.sockeye;no active tasks
09/17/2021 00:28:26;0008;pbs_python;Job;2919694[39].pbsha.ib.sockeye;update_job_usage: CPU percent: 0
09/17/2021 00:28:26;0008;pbs_python;Job;2919694[39].pbsha.ib.sockeye;update_job_usage: CPU usage: 0.000 secs
09/17/2021 00:28:26;0008;pbs_python;Job;2919694[39].pbsha.ib.sockeye;update_job_usage: Memory usage: mem=0b
09/17/2021 00:28:26;0008;pbs_mom;Job;2919694[38].pbsha.ib.sockeye;no active tasks
09/17/2021 00:28:26;0008;pbs_mom;Job;2919694[39].pbsha.ib.sockeye;no active tasks
09/17/2021 00:28:26;0008;pbs_mom;Job;2919694[38].pbsha.ib.sockeye;no active tasks
09/17/2021 00:28:26;0008;pbs_mom;Job;2919694[39].pbsha.ib.sockeye;no active tasks
09/17/2021 00:28:26;0001;pbs_mom;Svr;pbs_mom;Resource temporarily unavailable (11) in record_finish_exec, read of pipe for pid job 2919694[38].pbsha.ib.sockeye got 0 not 8
09/17/2021 00:28:26;0001;pbs_mom;Job;2919694[38].pbsha.ib.sockeye;start failed, improper sid
Job 2916665
comment = Job run at Thu Sep 16 at 22:58 on (se189:ncpus=32:mem=134217728kb
:ngpus=0)+(se218:ncpus=32:mem=134217728kb:ngpus=0) and finished
Again, your help would be very much appreciated.
All the best
Roman
Have you looked in the system logs with journalctl to see if there are other error messages from around the same time?
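For example, something along these lines, adjusting the time window to bracket one of the failures (the syslog identifier may differ on your install):
journalctl -t pbs_mom --since "2021-09-17 00:25" --until "2021-09-17 00:35"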
Here is my guess. It looks to me as if you have one or more runjob hooks, and one of them is causing the pbs_mom job starter process to exit, either directly by killing it, or indirectly by causing the system to kill it, perhaps due to excess memory use. Perhaps there is some file that has grown over time, so that when the hook reads it all, the hook uses too much memory, causing the kernel to kill the parent process?
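You can see which hooks are configured, and which events they fire on, with something like this on the server host (output layout varies a bit between versions):
qmgr -c "list hook"
Any hook with an event such as execjob_begin or execjob_launch runs on the MoM during job start-up and would be a prime suspect.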
FWIW, I think the “Resource temporarily unavailable” part of the message is spurious. What actually happened is that the main MoM was waiting for the job starter to return the job session ID across their communication pipe, but the job starter went away. So, when the main MoM read the pipe it got EOF (zero bytes), as mentioned by the rest of the message.
In addition to that, the only other hook we are running is pbs_cgroups.PY, but from the logs and journalctl it looks like it has never run. Note: all computational nodes in the cluster are stateless.
Again, thanks for all the input, help, and advice.
All the best
Roman
Try the same journalctl command, but leave off the “-t pbs_mom” argument. That is, look for messages not directly associated with pbs_mom. Perhaps a message from oom_killer or the kernel reporting a segfault, etc.
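Something along these lines, for example (just a sketch; adjust the window and the patterns as needed):
journalctl --since "2021-09-17 00:25" --until "2021-09-17 00:35"
dmesg -T | grep -iE 'oom|segfault|killed process'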
In the messages in the initial post, weren’t the “update_job_usage” messages from the pbs_cgroups hook? So, the cgroups hook appears to run sometimes.
Are the hooks necessary for your site, or could you disable both hooks for a while to see if the problem goes away?
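If you do try that, it would be something like the following on the server host (pbs_cgroups is the hook you mentioned; substitute the real name of the other hook):
qmgr -c "set hook pbs_cgroups enabled = false"
and "enabled = true" to switch it back on afterwards.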
I have run the 'journalctl' and 'dmesg' commands on that node and there were definitely no issues related to OOM or segfaults; that is why I grabbed only the relevant 'pbs_mom' output.
Since the cluster is shared by many users and different workflows, we definitely need the hooks in our running environment. However, as soon as the 'improper sid' error is encountered it affects only new jobs; all jobs which were already running on the affected node keep going, so the 'update_job_usage' messages are perfectly normal.
Yesterday we had a system maintenance window for the entire cluster (new image, but pbs_mom is still the same), and as a result all compute nodes have been rebooted. So far I have not seen any 'improper sid' error messages, but at the same time I have lost access to all the old 'dmesg' and journalctl output.
I will keep an eye on it and update this post as soon as I have something new.
On the other hand, what would you suggest to narrow down the issue?
strace -p pbs_mom_ID -f -o /tmp/output_of_strace (any other options?)
or attaching a debugger (gdb) to the running MoM?
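For example, something like this (options and paths are only my guess):
strace -f -tt -s 256 -o /tmp/pbs_mom.strace -p pbs_mom_ID
gdb -p pbs_mom_ID   # then "set follow-fork-mode child" and "continue" to follow the job starter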
Again thanks…