pbs_comm core dump

Running OpenPBS 23.06.06 with patch b1249644 installed on Oracle Linux 8.9.

Had pbs_comm die with a segfault.

Site has about 2100 nodes.

Info from core dump:

warning: Section `.reg-xstate/3383770' in core file too small.
#0 0x000000000040f062 in handle_disconnect (conn=conn@entry=0x7f8e0c016be0) at tpp_transport.c:1736
1736 conns_array[tfd].slot_state = TPP_SLOT_FREE;
[Current thread is 1 (Thread 0x7f8e0bfff700 (LWP 3383770))]
Missing separate debuginfos, use: yum debuginfo-install glibc-2.28-236.0.1.el8_9.12.x86_64 keyutils-libs-1.5.10-9.el8.x86_64 krb5-libs-1.18.2-26.0.1.el8_9.x86_64 libblkid-2.32.1-44.0.1.el8_9.1.x86_64 libcom_err-1.45.6-5.el8.x86_64 libgcc-8.5.0-20.0.3.el8.x86_64 libmount-2.32.1-44.0.1.el8_9.1.x86_64 libnsl2-1.2.0-2.20180605git4a062cf.el8.x86_64 libselinux-2.9-8.el8.x86_64 libtirpc-1.1.4-8.el8.x86_64 libuuid-2.32.1-44.0.1.el8_9.1.x86_64 libxcrypt-4.1.1-6.el8.x86_64 nss_nis-3.0-8.el8.x86_64 openssl-libs-1.1.1k-12.el8_9.x86_64 pcre2-10.32-3.el8_6.x86_64 systemd-libs-239-78.0.4.el8.x86_64 zlib-1.2.11-25.el8.x86_64
(gdb) bt
#0 0x000000000040f062 in handle_disconnect (conn=conn@entry=0x7f8e0c016be0) at tpp_transport.c:1736
#1 0x000000000040f277 in handle_incoming_data (conn=conn@entry=0x7f8e0c016be0) at tpp_transport.c:1824
#2 0x000000000040ffe5 in work (v=0x801f40) at tpp_transport.c:1560
#3 work (v=0x801f40) at tpp_transport.c:1436
#4 0x00007f8e1655c1da in start_thread () from /lib64/libpthread.so.0
#5 0x00007f8e15b83e73 in clone () from /lib64/libc.so.6

(gdb) down
#0 0x000000000040f062 in handle_disconnect (conn=conn@entry=0x7f8e0c016be0) at tpp_transport.c:1736
1736 conns_array[tfd].slot_state = TPP_SLOT_FREE;
(gdb) up
#1 0x000000000040f277 in handle_incoming_data (conn=conn@entry=0x7f8e0c016be0) at tpp_transport.c:1824
1824 handle_disconnect(conn);
(gdb) up
#2 0x000000000040ffe5 in work (v=0x801f40) at tpp_transport.c:1560
1560 handle_incoming_data(conn);
(gdb) up
#3 work (v=0x801f40) at tpp_transport.c:1436
1436 work(void *v)
(gdb) up
#4 0x00007f8e1655c1da in start_thread () from /lib64/libpthread.so.0
(gdb) up
#5 0x00007f8e15b83e73 in clone () from /lib64/libc.so.6
(gdb) up
Initial frame selected; you cannot go up.

Hi @jwmatthe! This looks like the same bug we ran into. It was addressed by openpbs/openpbs pull request #2641, "tpp connections: change round robin to fixed assignment of threads" (by vchlum), and is fixed on the master branch.
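
For anyone reading along, here is a rough sketch of the difference the PR title describes. It is not the code from the PR: the function names `pick_round_robin` and `pick_fixed` are hypothetical, and deriving the thread from the descriptor with a modulo is only my assumption about what "fixed assignment" could look like.

```c
#include <assert.h>

/* Hypothetical helpers for illustration only; not the OpenPBS functions. */

/* Round robin: the chosen thread depends on how many assignments came
 * before, so the same descriptor can land on a different thread each
 * time it is (re)assigned. */
int pick_round_robin(int *last_thrd, int num_threads)
{
	assert(num_threads > 0);
	*last_thrd = (*last_thrd + 1) % num_threads;
	return *last_thrd;
}

/* Fixed assignment (assumed form): the thread is a pure function of the
 * descriptor, so a given tfd is always handled by the same thread. */
int pick_fixed(int tfd, int num_threads)
{
	assert(num_threads > 0 && tfd >= 0);
	return tfd % num_threads;
}
```

With 4 threads, two consecutive round robin picks return different threads regardless of the descriptor, while `pick_fixed(10, 4)` returns the same thread on every call. Keeping a descriptor on one thread would mean its slot in `conns_array` is only ever touched by that thread, which is presumably what the change relies on.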

Thanks for the reply. We have system maintenance scheduled next week and plan to put the patch in place then.