When changing a batch of node attributes (setting state from offline to online and clearing the 'comment' attribute), the server crashed with a segfault at `pnode = psvrmom->msr_children[momidx];` (req_manager.c:2145) once it reached the one node we have that currently has child vnodes.
OS: Oracle Linux 8.8 running RHEL kernel 4.18.0-477.27.1.el8_8.x86_64
PBS Server: 23.06.06 with patch b1249644 installed
Info from core dump:
Core was generated by `/usr/local/pkgs/openpbs/sbin/pbs_server.bin'.
Program terminated with signal SIGSEGV, Segmentation fault.
warning: Section `.reg-xstate/1211784' in core file too small.
#0 0x0000000000463e2d in mgr_node_set (preq=0x7276ab0) at req_manager.c:2145
2145 pnode = psvrmom->msr_children[momidx];
[Current thread is 1 (Thread 0x7f45e9323840 (LWP 1211784))]
Missing separate debuginfos, use: yum debuginfo-install cyrus-sasl-lib-2.1.27-6.el8_5.x86_64 expat-2.2.5-11.0.1.el8.x86_64 glibc-2.28-225.0.4.el8_8.6.x86_64 gssproxy-0.8.0-21.el8.x86_64 keyutils-libs-1.5.10-9.el8.x86_64 krb5-libs-1.18.2-25.0.1.el8_8.x86_64 libblkid-2.32.1-42.el8_8.x86_64 libcom_err-1.45.6-5.el8.x86_64 libgcc-8.5.0-18.0.6.el8.x86_64 libical-3.0.3-3.el8.x86_64 libicu-60.3-2.el8_1.x86_64 libmount-2.32.1-42.el8_8.x86_64 libnsl2-1.2.0-2.20180605git4a062cf.el8.x86_64 libpq-13.5-1.el8.x86_64 libselinux-2.9-8.el8.x86_64 libstdc++-8.5.0-18.0.6.el8.x86_64 libtirpc-1.1.4-8.el8.x86_64 libxcrypt-4.1.1-6.el8.x86_64 nss_nis-3.0-8.el8.x86_64 openldap-2.4.46-18.el8.x86_64 openssl-libs-1.1.1k-9.el8_7.x86_64 pcre2-10.32-3.el8_6.x86_64 python3-libs-3.6.8-51.0.1.el8_8.2.x86_64 systemd-libs-239-74.0.6.el8_8.5.x86_64 zlib-1.2.11-21.el8_7.x86_64
(gdb) up
#1 req_manager (preq=0x7276ab0) at req_manager.c:4483
4483 mgr_node_set(preq);
(gdb) up
#2 0x0000000000455844 in process_request (sfds=18) at process_request.c:720
720 dispatch_request(sfds, request);
(gdb) up
#3 0x00000000004c1eae in process_socket (sock=sock@entry=18) at net_server.c:510
510 svr_conn[idx]->cn_func(svr_conn[idx]->cn_sock);
(gdb) up
#4 0x00000000004c208a in wait_request (waittime=&lt;optimized out&gt;, priority_context=&lt;optimized out&gt;) at net_server.c:623
623 if (process_socket(em_fd) == -1) {
(gdb) up
#5 0x000000000042749e in main (argc=&lt;optimized out&gt;, argv=0x7fff20121cf8) at pbsd_main.c:1398
1398 if (wait_request(waittime, priority_context) != 0) {
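For what it's worth, the faulting statement indexes into the parent MoM's child-vnode array, so the deref presumably hit either a NULL msr_children pointer or an out-of-range momidx. A minimal sketch of the kind of defensive lookup that would turn this into a handled error rather than a segfault (struct layout, field count name, and function are my own stand-ins inferred from the backtrace, not actual OpenPBS source):

```c
#include <stddef.h>

/* Illustrative stand-ins; only the msr_children name comes from the
 * backtrace, the rest is hypothetical. */
struct pbsnode {
	const char *nd_name;
};

struct mominfo {
	struct pbsnode **msr_children; /* child vnodes; may be NULL     */
	int msr_numchildren;           /* entries in msr_children array */
};

/* Return the child vnode at momidx, or NULL if the array is missing
 * or the index is out of range, instead of dereferencing blindly as
 * the crashing line does. */
static struct pbsnode *
safe_child_lookup(struct mominfo *psvrmom, int momidx)
{
	if (psvrmom == NULL || psvrmom->msr_children == NULL)
		return NULL;
	if (momidx < 0 || momidx >= psvrmom->msr_numchildren)
		return NULL;
	return psvrmom->msr_children[momidx];
}
```

Since the crash happened only on the node that has vnodes, my guess is the batch update left momidx stale (or the child array freed) between the state change and the comment change, and a guard like the one above would at least localize the fault.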
Server log at the time of the crash. Curiously, it crashed on the one node that currently contains vnodes:
04/22/2024 13:14:09;0004;Server@pbssrv1;Node;k4r0n6;attributes set: at request of root@pbssrv1
04/22/2024 13:14:09;0004;Server@pbssrv1;Node;k4r0n6;attributes set: state - offline
04/22/2024 13:14:09;0004;Server@pbssrv1;Node;k4r0n6;attributes set: state - down
04/22/2024 13:14:09;0004;Server@pbssrv1;Node;k4r0n6;attributes set: state - offline
04/22/2024 13:14:09;0004;Server@pbssrv1;Node;k4r0n6;attributes set: state - down
04/22/2024 13:14:09;0004;Server@pbssrv1;Node;k4r0n6[0];attributes set: state - offline
04/22/2024 13:14:09;0004;Server@pbssrv1;Node;k4r0n6[0];attributes set: state - down
04/22/2024 13:14:09;0004;Server@pbssrv1;Node;k4r0n6[0];attributes set: state - offline
04/22/2024 13:14:09;0004;Server@pbssrv1;Node;k4r0n6[0];attributes set: state - down
04/22/2024 13:14:09;0004;Server@pbssrv1;Node;k4r0n6[1];attributes set: state - offline
04/22/2024 13:14:09;0004;Server@pbssrv1;Node;k4r0n6[1];attributes set: state - down
04/22/2024 13:14:09;0004;Server@pbssrv1;Node;k4r0n6[1];attributes set: state - offline
04/22/2024 13:14:09;0004;Server@pbssrv1;Node;k4r0n6[1];attributes set: state - down
04/22/2024 13:14:09;0004;Server@pbssrv1;Node;k4r0n6;attributes set: at request of root@pbssrv1
04/22/2024 13:14:09;0004;Server@pbssrv1;Node;k4r0n6;attributes set: comment =
04/22/2024 13:14:09;0004;Server@pbssrv1;Node;k4r0n6;attributes set: comment =