Segfault at snprintf from req_delete.c on job delete

jwmatthe · April 24, 2024, 4:01am

I initially appended this to my other issue “PBS server core dump on apparent job delete”, but now that I have a full core dump it is clear the issue is a little different, so I am creating a new topic.

OS: Oracle Linux 8.8 running RHEL kernel 4.18.0-477.27.1.el8_8.x86_64
PBS Server: 23.06.06 with patch b1249644 installed

The segfault occurred when a user was trying to delete a job, but it is not consistently reproducible. I have not been able to reproduce it so far, though it has happened twice, a few weeks apart, and curiously both times it involved the same user.

Core dump info:

Core was generated by `/usr/local/pkgs/openpbs/sbin/pbs_server.bin'.
Program terminated with signal SIGSEGV, Segmentation fault.

warning: Section `.reg-xstate/1261205' in core file too small.
#0 0x00007f67a2e4af17 in __strlen_avx2 () from /lib64/libc.so.6
[Current thread is 1 (Thread 0x7f67a5b88840 (LWP 1261205))]
Missing separate debuginfos, use: yum debuginfo-install cyrus-sasl-lib-2.1.27-6.el8_5.x86_64 expat-2.2.5-11.0.1.el8.x86_64 glibc-2.28-225.0.4.el8_8.6.x86_64 gssproxy-0.8.0-21.el8.x86_64 keyutils-libs-1.5.10-9.el8.x86_64 krb5-libs-1.18.2-25.0.1.el8_8.x86_64 libblkid-2.32.1-42.el8_8.x86_64 libcom_err-1.45.6-5.el8.x86_64 libgcc-8.5.0-18.0.6.el8.x86_64 libical-3.0.3-3.el8.x86_64 libicu-60.3-2.el8_1.x86_64 libmount-2.32.1-42.el8_8.x86_64 libnsl2-1.2.0-2.20180605git4a062cf.el8.x86_64 libpq-13.5-1.el8.x86_64 libselinux-2.9-8.el8.x86_64 libstdc++-8.5.0-18.0.6.el8.x86_64 libtirpc-1.1.4-8.el8.x86_64 libxcrypt-4.1.1-6.el8.x86_64 nss_nis-3.0-8.el8.x86_64 openldap-2.4.46-18.el8.x86_64 openssl-libs-1.1.1k-9.el8_7.x86_64 pcre2-10.32-3.el8_6.x86_64 python3-libs-3.6.8-51.0.1.el8_8.2.x86_64 systemd-libs-239-74.0.6.el8_8.5.x86_64 zlib-1.2.11-21.el8_7.x86_64
(gdb) up
#1 0x00007f67a2de790d in vfprintf () from /lib64/libc.so.6
(gdb) up
#2 0x00007f67a2e0e044 in vsnprintf () from /lib64/libc.so.6
(gdb) up
#3 0x00007f67a2dee083 in snprintf () from /lib64/libc.so.6
(gdb) up
#4 0x000000000045986f in req_deletejob (preq=0x16fe32a0) at req_delete.c:617
617 snprintf(jid, sizeof(jid), "%s", jobids[j]);
(gdb) up
#5 0x0000000000455844 in process_request (sfds=19) at process_request.c:720
720 dispatch_request(sfds, request);
(gdb) up
#6 0x00000000004c1eae in process_socket (sock=sock@entry=19) at net_server.c:510
510 svr_conn[idx]->cn_func(svr_conn[idx]->cn_sock);
(gdb) up
#7 0x00000000004c208a in wait_request (waittime=<optimized out>, priority_context=<optimized out>) at net_server.c:623
623 if (process_socket(em_fd) == -1) {
(gdb) up
#8 0x000000000042749e in main (argc=<optimized out>, argv=0x7ffdd1da8988) at pbsd_main.c:1398
1398 if (wait_request(waittime, priority_context) != 0) {

Reported in the server log at time of crash:

04/17/2024 19:10:54;0080;Server@pbssrv1;Job;3963228.pbssrv1;delete job request received
04/17/2024 19:10:54;0008;Server@pbssrv1;Job;3963228.pbssrv1;Job to be deleted at request of user4@login3

jwmatthe · June 3, 2024, 8:51pm

I have a little more info on this. It appears to happen when a user makes a mistake and pastes a whole bunch of job IDs after qdel, or perhaps also includes some other characters. This has happened several times now, so we are still hoping for a resolution.

vchlum · June 18, 2024, 5:50am

Hi @jwmatthe! I cannot reproduce this problem. Could you please dig more details out of the core dump? For example:

select-frame 4
print preq->rq_type
print *jobids@count
print start_jobid
print j
print count
print jobids[j]

jwmatthe · July 18, 2024, 7:41am

Sorry for the late reply; for some reason I missed your message.

Here is what you asked for, but unfortunately most of the data seems to be optimized out:

(gdb) select-frame 4
(gdb) print preq->rq_type
$1 = 100
(gdb) print *jobids@count
value has been optimized out
(gdb) print start_jobid
$2 = <optimized out>
(gdb) print j
$3 = <optimized out>
(gdb) print count
$4 = <optimized out>
(gdb) print jobids[j]
value has been optimized out

The user described pasting tons of random characters by mistake, which he managed to do 3 times over a couple of months. I wish I had more info on exactly what that looked like.

vchlum · August 10, 2024, 4:29pm

I am quite sure this segfault is caused by running qdel in the form qdel 123 123.svr.name. I have created a PR with a suggested solution.

jwmatthe · August 18, 2024, 6:48am

vchlum,

You are correct. I was able to crash the server using the following syntax:

qdel 123 123.servername
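
For readers who want to see what the trigger looks like in code: in qdel 123 123.servername, both arguments appear to name the same job, once in short form and once fully qualified. The sketch below is not the fix from the PR (its contents are not shown in this thread) and is not OpenPBS code. It only illustrates, under that assumption, one way a list of job IDs could be normalized and de-duplicated before being processed. The names canonical_jobid, dedupe_jobids, and JID_MAX are hypothetical.

```c
#include <stdio.h>
#include <string.h>

#define JID_MAX 256 /* hypothetical buffer size, not taken from OpenPBS */

/*
 * Write a comparable form of a job ID into out.
 * Assumption: an ID without a '.' is a short form that belongs to
 * the given server, so "123" becomes "123.<server>". IDs that
 * already contain a '.' are copied unchanged.
 */
void canonical_jobid(const char *id, const char *server,
                     char *out, size_t outsz)
{
    if (strchr(id, '.') != NULL)
        snprintf(out, outsz, "%s", id);
    else
        snprintf(out, outsz, "%s.%s", id, server);
}

/*
 * Compact ids[] in place so that each job appears only once,
 * keeping the first spelling seen. NULL entries are skipped.
 * Returns the number of entries kept.
 */
int dedupe_jobids(const char **ids, int count, const char *server)
{
    char a[JID_MAX];
    char b[JID_MAX];
    int kept = 0;

    for (int i = 0; i < count; i++) {
        int dup = 0;

        if (ids[i] == NULL)
            continue;
        canonical_jobid(ids[i], server, a, sizeof(a));

        /* Compare against the entries already kept. */
        for (int k = 0; k < kept && !dup; k++) {
            canonical_jobid(ids[k], server, b, sizeof(b));
            dup = (strcmp(a, b) == 0);
        }
        if (!dup)
            ids[kept++] = ids[i];
    }
    return kept;
}
```

With the arguments from the crashing command, calling dedupe_jobids on {"123", "123.servername"} with server "servername" keeps a single entry, "123". Whether the real server code should de-duplicate, or instead tolerate repeated IDs further down in req_deletejob, is a design decision for the PR.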
