Offlined by hook 'pbs_cgroups' due to hook error

Hello,

From time to time, a node is offlined by PBS with the note “offlined by hook ‘pbs_cgroups’ due to hook error” shown by pbsnodes -l. Running pbsnodes -c typically brings it back; sometimes I also need to restart the PBS service on the affected node. The MoM log on that node has entries like this:

12/02/2021 17:11:19;0001;pbs_mom;Svr;pbs_mom;run_hook, execv of /opt/pbs/bin/pbs_python resulted in nonzero exit status=-4

How do I properly debug this?
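
For reference, the recovery steps I currently run (the node name is just an example, and the service name may differ depending on how PBS was installed):

# pbsnodes -l                    # list offline nodes and the offline reason (run on the server host)
# pbsnodes -c node042            # clear the offline state of the affected node
# systemctl restart pbs          # on the affected node itself, if clearing alone is not enough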

It seems each such event corresponds to one or more general protection traps from the MoM in the kernel log:

[Wed Mar 2 15:55:01 2022] traps: pbs_mom[27885] general protection ip:4762d7 sp:7fff7ae075e0 error:0 in pbs_mom[400000+cb000]
[Wed Mar 2 15:55:01 2022] traps: pbs_mom[27887] general protection ip:4762d7 sp:7fff7ae075e0 error:0 in pbs_mom[400000+cb000]
[Wed Mar 2 15:55:01 2022] traps: pbs_mom[27889] general protection ip:4762d7 sp:7fff7ae075e0 error:0 in pbs_mom[400000+cb000]

Looking at the sources, something went wrong in the MoM while it was getting ready to run a hook. What I would do next is figure out where the MoM was when it faulted, using the instruction pointer (the ip value) from the trap message. Something like:

$ gdb -q /opt/pbs/sbin/pbs_mom
Reading symbols from /opt/pbs/sbin/pbs_mom...done.
(gdb)  x/10i 0x4762d7
   0x4762d7 <add_conn+199>:	callq  0x481730 <append_link>
   0x4762dc <add_conn+204>:	mov    0x27793d(%rip),%rdi        # 0x6edc20 <poll_context>
   0x4762e3 <add_conn+211>:	mov    $0x19,%edx
   0x4762e8 <add_conn+216>:	mov    %ebx,%esi
   0x4762ea <add_conn+218>:	callq  0x48bbf0 <tpp_em_add_fd>
   0x4762ef <add_conn+223>:	test   %eax,%eax
   0x4762f1 <add_conn+225>:	js     0x476387 <add_conn+375>
   0x4762f7 <add_conn+231>:	mov    0x27793a(%rip),%rax        # 0x6edc38 <svr_conn>
   0x4762fe <add_conn+238>:	mov    (%rax,%rbp,8),%rax
   0x476302 <add_conn+242>:	add    $0x18,%rsp
(gdb) quit

The 0x4762d7 value comes from your trap messages. For my build of the MoM, that address lands in the add_conn routine; yours will likely be different.
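
If gdb is not installed on the compute nodes, addr2line from binutils can resolve the same address. This assumes a non-PIE binary loaded at 0x400000, which is what the [400000+cb000] part of your trap lines suggests, so the trap's ip value can be used as-is:

$ addr2line -f -e /opt/pbs/sbin/pbs_mom 0x4762d7    # prints the function name (and file:line if debug info is present)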

Even better would be if you are getting core files somewhere under /var/spool/pbs/mom_priv. If so, get a backtrace with gdb:

# gdb -q -c /path/to/core/file /opt/pbs/sbin/pbs_mom
(gdb) bt
[...Lots of good stuff...]
(gdb) quit
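
If no core files show up at all, the MoM may simply be running with a core size limit of 0. A quick sketch for checking and raising that limit; it assumes a systemd-managed "pbs" service, so the unit name and paths may differ on your installation:

# grep -i core /proc/$(pgrep -xo pbs_mom)/limits     # check the running daemon's current limit
# mkdir -p /etc/systemd/system/pbs.service.d
# printf '[Service]\nLimitCORE=infinity\n' > /etc/systemd/system/pbs.service.d/core.conf
# systemctl daemon-reload && systemctl restart pbs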

Thank you very much. Here is the output in my case:

(gdb)  x/10i 0x4762d7
   0x4762d7 <_destroy_connection+23>:   mov    0x8(%rax),%rdi
   0x4762db <_destroy_connection+27>:   test   %rdi,%rdi
   0x4762de <_destroy_connection+30>:   je     0x4762e8 <_destroy_connection+40>
   0x4762e0 <_destroy_connection+32>:   callq  0x41bc40 <free@plt>
   0x4762e5 <_destroy_connection+37>:   mov    (%rbx),%rax
   0x4762e8 <_destroy_connection+40>:   lea    0x10(%rax),%rdi
   0x4762ec <_destroy_connection+44>:   callq  0x41bc10 <pthread_mutex_destroy@plt>
   0x4762f1 <_destroy_connection+49>:   mov    (%rbx),%rdi
   0x4762f4 <_destroy_connection+52>:   callq  0x41bc40 <free@plt>
   0x4762f9 <_destroy_connection+57>:   subl   $0x1,0x26d1a0(%rip)        # 0x6e34a0 <allocated_connection>
(gdb)

There are no core files under /var/spool/pbs/mom_priv. Do I understand correctly that “error:0” in the syslog means a divide-by-zero trap?

How up to date are you? This issue might be the one fixed by openpbs/openpbs pull request #2155 on GitHub, “Fix for crash while destroying entry in connection table” by hirenvadalia.

Also, error:0 does not indicate a divide-by-zero: in these messages “general protection” is the fault type, and error:0 is just the hardware error code reported with it.

Very interesting. I run an (almost) stock v20.0.1, so it does not include this fix.

Notably, this happens only on nodes with hyperthreading disabled in the BIOS. Does that make sense?

I don’t see any obvious link between the fix and hyperthreading.

Sorry for the delay in replying. Do you know what kind of job it takes to trigger this bug? It happens very rarely; often hundreds of jobs pass through a node before it strikes. Thanks.

The crash is caused by reuse of freed data, so there is probably no direct relationship between a particular job and the crash. Since the patch is so straightforward, I would rebuild pbs_mom with it and see if the problem goes away.
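
In case it helps, a rough sketch of that rebuild, assuming an autotools build from source; the commit hash is a placeholder to be taken from PR #2155, and the configure options should match whatever your original packages were built with:

$ git clone https://github.com/openpbs/openpbs.git && cd openpbs
$ git checkout v20.0.1
$ git cherry-pick <fix-commit-from-PR-2155>    # placeholder; resolve any conflicts by hand
$ ./autogen.sh && ./configure --prefix=/opt/pbs
$ make && sudo make install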

Of course, but rolling out any update in a large production cluster is a hassle. Ideally, I would update pbs_mom only on a small subset of nodes for testing, if only I knew what kind of job could reliably trigger the bug, so that I could verify the patched version is immune.
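
If nothing better turns up, I might simply hammer one patched and one unpatched test node with many short jobs and compare. A rough sketch; the queue, node name and job count are placeholders, and I have no evidence yet that this actually reproduces the race:

$ for i in $(seq 1 500); do echo 'sleep 1' | qsub -q workq -l select=1:host=testnode01; done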

I do suspect it is sensitive to something in a specific type of job. Sometimes a week passes without a single node being offlined; then, within a few days, a dozen nodes go offline with the same error message.

Of course, what I observe may be an issue unrelated to the one fixed by #2155.

I’ll keep you updated, thanks again.